Advancing the Human Reference Assembly
Valerie Schneider
NCBI
25 February 2015
The Human Reference Genome: Today, Tomorrow and Next ?
http://genomereference.org
Outline
• The assembly model
• Basics
• Value added
• Challenges
• Future relevance of the reference
• Multiple genomes
• Haploid genomes
• Assembly updates
• Mechanisms
• Requirements/Challenges
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
Current Assembly model: represent both haplotypes
GRC Assembly Model
many
GRC Assembly Model
Alt loci alignments are an integral part of the assembly model
alignment to chr + scaffold sequence = Alt
GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb
GRCh38
GRC Assembly Model
The human reference assembly represents population
genomic diversity in the context of linear sequences
Assembly (e.g. GRCh38.p1)
[Diagram: Primary Assembly Unit; Non-nuclear assembly unit (e.g. MT); alt loci assembly units ALT 1 to ALT 7; Patches; PAR; Genomic Regions (MHC, UGT2B17, MAPT, ABO, FOXO6, FCGBP)]
Assembly Updates
Patches
• FIX: scaffold status at next major assembly release: -- (integrated); treat as: Preferred
• NOVEL: scaffold status at next major assembly release: alt loci; treat as: Allelic
Assembly Updates
GRC
Criteria for Reference Assembly Component Sequences
• Finished Quality
• INSDC Accessioned
• Representative of an actual DNA molecule
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
GRC Credits
Workshop sponsor:
http://genomereference.org
Editor's Notes
I’d like to open this workshop by reminding everyone of the difference between a genome and an assembly. A human genome is a physical object. An assembly is our representation of that object. It is a model. And as shown here, genome models can take many forms.
And as these atomic models illustrate, scientific models evolve over time to reflect our growing knowledge base. And so it is with the human assembly model, the reference genome.
Today’s workshop addresses the advancement of the human reference genome assembly in the context of new data and technologies.
In my talk, I’ll discuss the current reference assembly, highlighting the following topics (read outline). I’ll be followed by Karyn, Tina and Deanna, who will each be talking in more detail about some of these items that I introduce.
When assembling the genome of a single diploid individual, there may be divergent haplotypes that confound genome assembly. In the original reference assembly model, which was essentially a stick model of linear chromosomes, there really wasn’t a good way to represent highly variant or complex genomic regions. Different haplotypes were simply compressed into a consensus. The insertion of different haplotypes, however, often led to non-existent allele combinations and artificial gaps, as illustrated here.
This issue led the GRC to develop a new assembly model several years ago that has a mechanism to cleanly represent multiple haplotypes: alternate loci. They allow the reference assembly to contain alternate representations for regions where haplotype compression isn’t appropriate or a single sequence path is considered insufficient. At the same time, the model retains the linear chromosomes with which most users are comfortable.
As a result of the adoption of this model, it’s important to understand that the reference assembly isn’t a haploid or even a diploid genome representation. For any locus, it can represent many haplotypes.
This slide explains how the assembly model accomplishes this. The first thing to know is that the “assembly” is composed of multiple assembly units.
The primary assembly unit is the collection of chromosomes and unlocalized and unplaced scaffolds. This is essentially the original haploid assembly model.
Non-nuclear genomes are assigned to their own assembly unit.
Regions are defined for those areas of the genome for which alternate sequence representation is desired.
Alternate sequence representations for those regions go into alternate loci assembly units. The first alternate sequence representation for each region goes into one assembly unit. Each additional sequence representation for a region goes into its own assembly unit.
We also define the PARs, to account for sequence shared by the sex chromosomes.
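The unit layout described above can be sketched in a few lines of Python. This is only an illustration: the `Assembly` class, the unit names (`PRIMARY`, `NON_NUCLEAR`, `ALT_1`, ...) and the scaffold names are hypothetical and are not GRC identifiers. The one rule it encodes comes from the notes: the first alternate for each region goes into one unit, and each additional alternate for a region goes into a further unit.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Assembly:
    """Toy model of an assembly made of named assembly units."""
    name: str
    # unit name -> list of sequences placed in that unit
    units: dict = field(default_factory=lambda: defaultdict(list))


def place_alternates(assembly, alts_by_region):
    """Place the i-th alternate of every region into alt unit i.

    alts_by_region maps a region name to its alternate scaffolds, in order.
    """
    for region, scaffolds in alts_by_region.items():
        for i, scaffold in enumerate(scaffolds, start=1):
            assembly.units[f"ALT_{i}"].append((region, scaffold))
    return assembly


# Hypothetical example: one region with three alternates, one with a single alternate.
asm = Assembly("toy_assembly")
asm.units["PRIMARY"].extend(["chr6", "chr17"])
asm.units["NON_NUCLEAR"].append("MT")
place_alternates(asm, {
    "MHC": ["mhc_alt_a", "mhc_alt_b", "mhc_alt_c"],
    "MAPT": ["mapt_alt_a"],
})
```

In this toy case `ALT_1` holds the first alternate of both regions, while `ALT_2` and `ALT_3` hold only the extra MHC alternates.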
The alternate loci are stand-alone accessioned scaffold sequences that are given chromosome context via their alignment to the primary assembly unit. This image shows a portion of GRCh38 chr. 17, with its regions and alt loci alignments. As you can see, the relationships of the alts to the primary assembly can be complex, with indels and inversions. For this reason, the GRC curates these alignments.
One point I want to make is that the alignments of the alt loci to the chromosomes are an integral part of the assembly model. The alignment, in conjunction with the sequence, is what defines the alt. The alignments are available for download with the assembly from GenBank.
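To make concrete why the alignment is part of what defines the alt, here is a minimal sketch of projecting an alt scaffold position onto the chromosome. It assumes the curated alignment has already been reduced to gapless blocks; the `Block` tuple, its fields, and the coordinates are hypothetical and do not reflect the format of the alignment files distributed with the assembly.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """One gapless aligned block; coordinates are 0-based, half-open."""
    alt_start: int
    chr_start: int
    length: int
    strand: int  # +1 same orientation, -1 inverted relative to the chromosome


def alt_to_chr(pos: int, blocks: list) -> Optional[int]:
    """Project an alt scaffold position onto the chromosome.

    Returns None when the position lies in sequence with no alignment,
    for example an insertion that is unique to the alt.
    """
    for b in blocks:
        offset = pos - b.alt_start
        if 0 <= offset < b.length:
            if b.strand == 1:
                return b.chr_start + offset
            # Inverted block: the alt runs backwards along the chromosome.
            return b.chr_start + b.length - 1 - offset
    return None


# Hypothetical alignment: a forward block, an unaligned insertion
# (alt positions 100-149), then an inverted block.
blocks = [
    Block(alt_start=0, chr_start=1000, length=100, strand=1),
    Block(alt_start=150, chr_start=1100, length=50, strand=-1),
]
```

Without the blocks, a position on the alt scaffold has no chromosome context at all, which is the point the notes make about indels and inversions.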
The ideogram image in this slide shows the genome-wide locations of alternate loci in GRCh38, along with some basic alt loci stats.
What all this means is that you don’t have to wait for the development of a graph-based genome representation and corresponding tool suites to do genomic analyses that benefit from variant sequence representations. The current assembly model allows the reference to represent population genomic diversity in the context of linear sequences, which are the currency for most existing analysis pipelines. The next couple of slides show you some of the value added to analyses by use of the full assembly model.
Gene content is one way in which alt loci add value to the assembly. In this slide, you can see several genes annotated in the regions of this alternate representation of the chr. 19 KIR region that have no alignment to the chromosome. Deanna will tell you more about genes unique to alt loci in her talk.
[Thus, if you’re not using the entire assembly in your analyses, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.]
Alternate loci also have implications for read mapping and data interpretation. This image from the NCBI 1000G browser shows a region of GRCh37 chr.7 encompassing UPK3B, a gene expressed in primary mesothelial cells. The chromosomal representation of UPK3B has not changed in GRCh38, and is believed to represent a relatively rare insertion allele. An alternate locus for this region is included in GRCh38, and represents the deletion allele, as illustrated by its alignment to GRCh37.
As illustrated by the 3 samples shown here, alignment profiles in this region vary depending on the alignment method used, in this case bwa or mosaik. As a result, it’s difficult to ascertain the genotypes of these samples or the distribution of these alleles in the human population.
However, with the inclusion of the alternate scaffold, we can better interpret the data. This slide shows the alignment of previously unmapped reads from one of these 1000G samples to the GRCh38 alt across the indel boundary, indicating that the sample contains the deletion allele. From analyses such as these, we can see how the inclusion of alternate loci in alignment target sets may improve alignments and data interpretation.
Alternate loci also have a broader impact on read alignments. Since we first developed this model, we’ve been interested in the effect of alt loci on read mapping. This slide describes a study we did a few years ago with the GRCh37.p9 assembly. We looked at the alignment behavior of simulated reads sourced from sequence unique to alt loci. We asked what happened to them when aligned to the primary assembly unit without the alt loci, where their true target is missing. We aligned the reads either as singletons or pairs, using two different aligners (BWA and srprism).
As shown in this graph, regardless of read pairing or the aligner, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters had an off-target alignment on the primary assembly unit (in blue). These off-target alignments are likely to result in errors in variation analyses. This analysis demonstrates the broader value of including alternate loci in alignment target sets.
While it’s clear that alternate loci add value to the reference assembly, you need the right set of tools to take advantage of them. Unfortunately, using many common analysis suites and file formats with the current assembly model is kind of like eating yogurt with chopsticks. They give you a taste of the richness of the data, but leave a lot behind. This is a point that Deanna will address in greater detail later today, so I’ll only outline the challenges researchers face in using the full assembly. But because the assembly model is still based on linear sequences, it should be possible to modify our current tools and file formats to take full advantage of the reference, rather than starting from scratch.
The first issue is allelic duplication. Most current aligners cannot distinguish the allelic duplication introduced by alternate loci from segmental duplication. As a result, reads aligning to sequence common to the chromosomes and alternate loci tend to be down-weighted and excluded from further analysis. This slide shows a graphical view of an alt locus scaffold, with the alignments of the chromosome and reads from a 1000G sample. The top set of alignments represents reads that aligned to both the alt and the chr. The bottom set are the reads that aligned only to the alt. Zooming in, we see these are reads aligning to an insertion in the alt sequence.
Unless the aligner can distinguish chromosomal regions associated with alt loci and not down-weight alignments in those regions, the gains in picking up new read alignments are likely to be offset by the discarding of other alignments.
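The distinction an alt-aware tool needs to draw can be sketched as follows. This is a coarse illustration, not how any particular aligner works: it treats two hits of the same read as allelic duplicates when one hit is on an alt scaffold and the other falls inside the chromosome interval to which that alt is placed. The function names, the hit tuples, and the placement coordinates are all hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals on the same sequence overlap."""
    return a_start < b_end and b_start < a_end


def is_allelic_pair(hit_a, hit_b, alt_placements):
    """Decide whether two hits of one read are allelic duplicates.

    Each hit is (sequence_name, start, end). alt_placements maps an alt
    scaffold name to (chromosome, start, end), the interval it aligns to.
    """
    for alt_hit, other in ((hit_a, hit_b), (hit_b, hit_a)):
        placement = alt_placements.get(alt_hit[0])
        if placement is None:
            continue  # this hit is not on a known alt scaffold
        chrom, start, end = placement
        if other[0] == chrom and overlaps(other[1], other[2], start, end):
            return True
    return False


# Hypothetical placement of one alt scaffold on chr17.
alt_placements = {"alt_17_1": ("chr17", 45_000_000, 46_500_000)}

allelic = is_allelic_pair(
    ("chr17", 45_100_000, 45_100_100),
    ("alt_17_1", 100_000, 100_100),
    alt_placements,
)
segmental = is_allelic_pair(
    ("chr17", 10_000, 10_100),
    ("alt_17_1", 100_000, 100_100),
    alt_placements,
)
```

A pair flagged as allelic would be kept at full weight rather than penalized as a multi-mapping read. A finer version would check the curated alignment itself rather than only the placement interval.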
Another challenge to using alternate loci comes in reporting features associated with more than one location. As shown in this image that illustrates the TNXB locus on the reference chromosome and 3 alts, genes may have different structures in different locations. Modifications to file formats such as GFF will make it easier to recognize sequence relationships across the assembly when reporting gene and exon locations.
Variant analysis and reporting is another area where changes are needed. As illustrated here, GRCh38 includes representations for the two major haplotypes at the MAPT locus. Depending on sample genotype, it may be desirable to report on more than one representation. However, the VCF format requires modification to support this.
A GRC workshop held last fall led to a publication that helped raise awareness of these issues, and some proposals, such as this one by Aaron Quinlan to make VCF alt-friendly, were discussed. There’s a git issue available for those who are interested. Additionally, bwa-mem recently became alt-aware, joining SRPRISM as an alt-aware short-read aligner. These changes show that use of the full assembly model is possible and the necessary tools are starting to become available.
I now want to shift gears and discuss the place of the reference as we enter a new era in which this assembly may no longer stand apart in terms of its quality or completeness. It’s important to remember that the human reference assembly is a special kind of genome model. In today’s era of personal genome sequencing, most assemblies only model a haploid or diploid genome.
But the reference assembly is a model of many diploid genomes, meant to represent the “human” genome. This slide shows the assembly composition of the GRCh38 primary assembly. While 70% of the genome comes from one donor, sequence from >70 individuals is represented.
Because the reference is derived from many individuals and includes alternate sequence representations, it is likely to remain our best resource for putting sequences identified in any individual into a genomic context. Likewise, because a common coordinate system remains critical for communication and reporting purposes, we’re likely to see the reference retain this role as well.
The table shows the latest versions of human genome assemblies in GenBank. Those in red are population-specific, and more population-specific genomes are under construction today. When analyzing samples from known populations, population-specific references or collections of population-specific genomes may be particularly valuable for variant or haplotype analysis. Even with a reference that is a graph of population variation, certain analyses may benefit from using only sub-paths in the graph. However, it’s important to realize that the utility of population-specific references may be limited for admixed samples. Given that much of the US population is admixed, this is an important consideration for resource development. Today you’ll hear from Tina about gold genomes, a set of genomes from diverse populations that are being sequenced to provide new representations for some of the genome’s most variable regions. These data will be incorporated into the reference.
Karyn and Tina will also be talking today about platinum assemblies that are derived from hydatidiform moles, which have haploid genomes. Without allelic duplication complicating their assembly, these resources facilitate the resolution of some of the most complex segmentally duplicated genome regions. These platinum genomes will be assembled to reference quality. However, it’s important to realize that there are no plans to replace the reference with either of these platinum mole assemblies. Like other individual genomes, they are limited in their representation of diversity. As you’ll hear, the GRC does intend to use these genomes to improve or augment the reference. As we enter a new era of multiple high quality genomes, we still envision the reference playing important roles.
In the last few minutes of this talk, I’ll discuss ongoing efforts to improve the reference. The “patches” feature of the model allows the GRC to make assembly updates available in a timely fashion without disrupting the chromosome coordinates upon which other users rely.
Regions are defined for the genomic locations to be updated, and the sequences representing those updates are put into the “Patches” assembly unit. Like the alt loci, the patches are stand-alone scaffold sequences with alignments.
It’s important to distinguish the two types of patches and the ways in which they should be used for analysis:
(1) FIX patches correct problems in the assembly; they are deprecated in the next assembly release.
(2) NOVEL patches add new alternate sequence representations to the assembly; they become alternate loci in the next assembly release.
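One way an analysis pipeline might act on the two patch types is sketched below. It follows the "treat as" guidance on the Assembly Updates slide (fix patches as preferred, novel patches as allelic), but the function, its arguments, and the sequence names are hypothetical, and it assumes at most one fix patch per region.

```python
def sequences_for_region(chromosome_seq, patches):
    """Return (preferred, alternates) for one genomic region.

    patches is a list of (patch_type, scaffold_name) tuples.
    A FIX patch corrects the chromosome, so it becomes the preferred
    representation. A NOVEL patch is kept as an additional allele.
    """
    preferred = chromosome_seq
    alternates = []
    for patch_type, name in patches:
        if patch_type == "FIX":
            preferred = name
        elif patch_type == "NOVEL":
            alternates.append(name)
        else:
            raise ValueError(f"unknown patch type: {patch_type}")
    return preferred, alternates


# Hypothetical region carrying one patch of each type.
result = sequences_for_region(
    "chr1_region",
    [("FIX", "fix_patch_1"), ("NOVEL", "novel_patch_1")],
)
```

Here the fix patch replaces the chromosome sequence as the preferred representation, and the novel patch is carried alongside it as an alternate.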
An example of a GRCh38 fix patch is shown on top in this issue summary from the GRC website, where sequence from a fosmid was used to patch a deleted BAC disrupting representation of the FOXO6 gene. An example of a GRCh38 novel patch is shown on the bottom, where the GRC (in collaboration with the Pharmacogenomics Research Network) added representation for another structural variant of the CYP2D6 locus. The GRC releases patches on a quarterly cycle with the next release planned for the end of March.
With all of the NGS and new genome data, you might think that the GRC is awash in sequences with which to update the assembly. But the reality sometimes feels more like this. Although there is a lot of sequence data available, sequence meeting all 3 of these reference criteria is still limited. Quality is less of an issue today than a couple of years ago. However, more groups doing sequencing and assembly are putting their data on “public” FTP sites without submitting it to an INSDC database. We encourage groups to submit their data so that it can contribute to this valuable public resource. Lastly, the reference assembly is clone based, and all component sequences are representative of a DNA molecule found in an actual individual. As long as the community feels it is important for the reference to represent actual sequences, the ability to phase or resolve haplotypes in newly sequenced genomes or generate finished quality sequence from single molecules will be critical to incorporating new sequence into the assembly.