The Genome Reference Consortium released the latest human reference assembly, GRCh38, on Dec. 24. While this updated assembly has many improvements, and some groups have been eagerly awaiting its release, the GRC is well aware that many users may feel the same way about GRCh38 as we all feel about the gift of new socks. Today I’m going to tell you about some of the new features in the assembly and how these updates make GRCh38 a better substrate for analyses. In the end, I’d like to convince you that whether GRCh38 was on your wish list or not, like a new pair of socks, it’s in better shape than what’s sitting in your wardrobe and ultimately, you’ll be able to put it to good use.
GRCh37 was released in 2009 and used a new assembly model in which alternate loci scaffolds were included to provide additional sequence representations for variant genomic regions. GRCh37 had 3 such regions and 9 alternate loci scaffolds. Since then, the GRC has continued to update the assembly. Many of these updates were released as non-coordinate-changing patch scaffolds. The patches came in two flavors: FIX patches corrected problems in existing assembly sequence, and NOVEL patches added new alternate sequence representations. As shown in the box, nearly 200 regions of GRCh37 were associated with a patch, and these updates added almost 8 Mb of novel sequence to the reference assembly. Furthermore, not every assembly update was released as a patch. As this pie chart shows, the GRC resolved just over 1000 issues for GRCh38. As a result, the GRC and members of its SAB agreed that it was time for a major assembly release. So today we have GRCh38, which now has 178 regions associated with 261 alternate loci scaffolds. There is more than 3 Mb of sequence whose only representation in the assembly occurs in the alternate loci.
I’d now like to introduce GRCh38 with some basic assembly statistics. These and additional stats for the GRCh38 assembly are available on the GRC website. One measure of an assembly’s continuity is scaffold N50. You can see here that scaffold N50s increased for almost every chromosome in GRCh38, indicating the reference assembly is more contiguous than ever.
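For anyone who wants to compute the statistic themselves, here is a minimal Python sketch of N50: sort the scaffold lengths from longest to shortest and report the length at which the running total first reaches half of the summed length. The function name and the example lengths are made up for illustration; they are not GRCh38 values.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units), total = 300.
scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffolds))  # 70: the two longest scaffolds cover 150 of 300
```

Closing a gap merges two scaffolds into one longer scaffold, which is why gap closure tends to push N50 upward.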
We can also compare GRCh38 to GRCh37, using a common annotation input set. There was a 5% increase in the number of aligned genes and a 3% increase in the number of aligned protein coding transcripts. There was also a decrease in the numbers of both annotated partial CDS and split genes (genes that span gaps). An example of one such improvement is shown here. In blue, you can see the tiling path in GRCh37, where there is a gap. The TWIST2 gene spans this gap. In GRCh38, the gap has been closed by the addition of new sequence and there is complete representation for the gene. In this example, the added sequence was RP11 WGS, provided by Jim Knight, who has been working with Stephan Schuster and others on an RP11 WGS assembly (poster). The GRC used WGS sequence from this and several other WGS assemblies, including HuRef, CHM1_1.1 and the NA12878 ALLPATHS, to extend into or span gaps when clone-based sequence could not be found.
One of the updates made in the assembly was the correction of erroneous bases. The human genome is approximately 2.85 billion bases and the finished human reference assembly is accurate to an error rate of 1 per 100,000 bases. While this represents the highest quality mammalian genome assembly in existence today, it still means that approximately 28,000 bases are incorrect. The GRC made the correction of erroneous bases a priority for GRCh38. This slide shows the bases whose updates were considered by the GRC. The largest set was ~15K SNVs with MAF=0 in the 1000G phase 1 analysis. 1000G also identified ~2.5K indels with MAF=0. These two sets represented bases that were asserted to be incorrect in the reference assembly, as they were never seen in 1000G. An additional 1413 bases with MAF<5% (but >0%) that overlap pseudogenes, processed transcripts or polymorphic pseudogenes were also considered, as were ~200 base update requests from annotators and clinical labs.
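The arithmetic behind those numbers is easy to reproduce. The short Python sketch below works out the expected number of erroneous bases from the quoted genome size and error rate, and tallies the candidate sets listed on the slide. The rounded counts (15,000, 2,500 and 200) stand in for the approximate figures quoted above, so the total is only approximate.

```python
GENOME_BASES = 2_850_000_000   # approximate genome size quoted in the talk
BASES_PER_ERROR = 100_000      # accuracy of one error per 100,000 bases

expected_errors = GENOME_BASES // BASES_PER_ERROR
print(expected_errors)  # 28500, i.e. roughly 28,000 incorrect bases

# Candidate sets considered for update; rounded values are approximations.
candidate_sets = {
    "SNVs with MAF=0": 15_000,
    "indels with MAF=0": 2_500,
    "bases with 0% < MAF < 5% in pseudogene-like features": 1_413,
    "requests from annotators and clinical labs": 200,
}
total_candidates = sum(candidate_sets.values())
print(total_candidates)  # 19113
```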
Before attempting any of the updates, the GRC did some analysis to determine whether the bases with MAF=0 were sequencing errors or unrecognized variants. To do this, we performed a read pile-up analysis for a subset of these bases for which we had WGS data from the same genome as the reference assembly sequence. These were bases in RP11 BAC clones, which make up 70% of the reference assembly. The RP11 WGS sequence used in this analysis was generated at WashU. The first graph shows the results of the pile-up analysis for the SNVs (the X axis is chromosomes). Purple: proportion of “never seen” bases that are heterozygous in RP11 (hetalt: not errors). Red: proportion of “never seen” bases that are not seen in RP11 (hmalt: genuine errors). Across all chromosomes, 79% of “never seen” SNVs are heterozygous in RP11 WGS, indicative of unrecognized variation rather than sequencing error. The GRC did not update the heterozygous RP11 bases.
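To make the hetalt/hmalt distinction concrete, here is a rough Python sketch of how a single site could be classified from the bases piled up over it. This is an illustration only, not the GRC’s actual pipeline: the function name, the depth and error thresholds, and the extra `homref` and `low_coverage` labels are all assumptions made for the example.

```python
from collections import Counter


def classify_site(ref_base, pileup_bases, min_depth=10, max_error_fraction=0.05):
    """Classify one reference position from the read bases covering it.

    hmalt: the reference base is (almost) never seen in the reads,
           suggesting a genuine error in the reference.
    hetalt: the reference base and another base are both well supported,
            suggesting the reference base is a real, unrecognized allele.
    Thresholds are illustrative assumptions.
    """
    counts = Counter(base.upper() for base in pileup_bases)
    depth = sum(counts.values())
    if depth < min_depth:
        return "low_coverage"
    ref_fraction = counts[ref_base.upper()] / depth
    if ref_fraction <= max_error_fraction:
        return "hmalt"
    if ref_fraction >= 1 - max_error_fraction:
        return "homref"
    return "hetalt"


# Made-up pile-ups at two sites where the reference base is A.
print(classify_site("A", "AAAAAAAAAAGGGGGGGGGG"))  # hetalt
print(classify_site("A", "GGGGGGGGGGGGGGGGGGGG"))  # hmalt
```

A real analysis would also need to account for base and mapping quality, which this sketch ignores.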
Ultimately, the GRC attempted to update 9359 bases. Of these, we succeeded in updating 8128 sites (86.8%) with mini-contigs we built from WGS reads from 1000G samples or the RP11 genome. The reads were assembled into the mini-contigs with cortex_con and differ from the reference only at the selected base. These were all submitted to GenBank. The Ensembl VEP found 8188 variants associated with the sites updated by mini-contigs. Most updates are not in coding sequence. Among those variants with coding consequences, most are missense or synonymous, consistent with most of the updates being SNVs. Consequences of note include: 15 genes that had an internal stop codon in GRCh37 are now coding; 78 genes had a frameshift relative to GRCh37 that restored gene function; and 2 genes that were coding in GRCh37 are now non-coding, but do represent the more common allele (CASP12/PRM3).
The first new feature of GRCh38 I want to mention is the centromeres. Until now, centromeres have been represented in the reference assembly by very large gaps. This is unfortunate, because centromeres play important roles in biology. Contrary to popular belief, centromeres aren’t difficult to sequence. In fact, there are large datasets of centromere sequence out there that are just waiting for a reference so that they can be analyzed. The challenge has been their assembly, which is complicated by their highly repetitive nature. As illustrated here, centromeres are composed largely of tandemly repeated alpha-satellite sequences that exhibit a wide range of variation. These short repeats are organized into longer higher order arrays that are highly identical. Because the centromeres are so long, they are difficult to assemble with even the longest read technology.
Centromeric sequence assembly is further complicated by the fact that these higher order arrays can vary between individuals and vary between homologous chromosomes in the same individual.
The GRC was fortunate to be contacted by Karen Miga, a postdoc in Jim Kent’s lab, who was developing an approach for generating modeled centromere sequences. All of the work I’m going to talk about was done by Karen and will soon be published in Genome Research. In short, Karen created a database of centromeric WGS reads from the HuRef genome. She determined the chromosome-specific higher order array structures and then built statistical linear models that could be used in the reference assembly, where they will serve as targets for read mapping. This next slide just shows a schematized version of graph-based representations for each of the chromosome-specific higher order arrays.
In these graphs, the nodes represent identical monomers and the edges are the likelihood of their adjacency in the array. Karen used a hidden Markov-based tool called LinearSat to build statistically based linear models from these graphs. It’s important to understand that each model represents the variants and monomer ordering in a proportional manner to that observed in the initial read database, but the long-range ordering of the repeats represents only an inferred sequence. Karen further used mate pair mapping to identify euchromatic WGS sequences from the HuRef assembly that are associated with the arrays. Like the repeats, the long range ordering of these euchromatic contigs in the models is also an inference. Users can find the coordinates of the centromere sequences in a table on the GRC website.
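To give a feel for how a linear model can be generated from such a graph, here is a toy Python sketch. It is not LinearSat and not a hidden Markov model: it swaps in a plain first-order Markov chain, where adjacency counts from reads become edge probabilities and a random walk over those edges emits one linear ordering. The monomer labels, read contents, function names and seed are all hypothetical.

```python
import random


def build_transitions(reads):
    """Turn observed monomer adjacencies into edge probabilities."""
    counts = {}
    for read in reads:
        for current, following in zip(read, read[1:]):
            nxt = counts.setdefault(current, {})
            nxt[following] = nxt.get(following, 0) + 1
    return {
        current: {m: n / sum(nxt.values()) for m, n in nxt.items()}
        for current, nxt in counts.items()
    }


def generate_array(transitions, start, length, seed=0):
    """Random walk over the graph; emits one inferred linear ordering."""
    rng = random.Random(seed)
    path = [start]
    while len(path) < length:
        nxt = transitions.get(path[-1])
        if not nxt:
            break  # no observed successor for this monomer
        monomers = list(nxt)
        weights = [nxt[m] for m in monomers]
        path.append(rng.choices(monomers, weights=weights, k=1)[0])
    return path


# Hypothetical reads, each a list of monomer labels.
reads = [
    ["m1", "m2", "m3", "m1", "m2", "m3"],
    ["m2", "m3", "m1"],
    ["m1", "m2", "m4", "m1"],
]
transitions = build_transitions(reads)
print(transitions["m2"])  # {'m3': 0.75, 'm4': 0.25}
model = generate_array(transitions, start="m1", length=12, seed=1)
print(model)
```

As with the real models, local adjacencies in the output reflect their frequency in the reads, while the long-range order is only one of many orderings the graph allows.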
In addition to adding centromere sequence, the GRC has focused on adding human-specific sequences to the reference assembly. An example of this is the SRGAP2 gene family, which is involved in cortical development. The ancestral 1q32 gene has been duplicated in humans to 1p21 and 1q21. Work from Evan Eichler’s lab found that not only were the 3 SRGAP2 human paralogs incompletely sequenced in GRCh37, but that allelic and paralogous sequences had been mixed in the assembly. 1q21 was the worst of these misassemblies, containing multiple haplotypes due to the highly duplicated nature of the region. Only by use of a single haplotype hydatidiform mole resource was it possible to disambiguate the correct paths at each locus. These updated paths were originally released as fix patches to GRCh37 and are now incorporated in the GRCh38 chromosomes. This panel shows the GRCh37.p13-GRCh38 assembly-assembly alignments in the 1q21 region. The alignment of the GRCh37 chromosome sequence is highly fragmented, indicative of the large changes that were made. Also aligning to this region of GRCh38 is a GRCh37 chr. 1 unlocalized scaffold. This scaffold contained the HYDIN2 gene.
HYDIN2 represents another human-specific gene duplication, also involved in neuronal phenotypes. The human genome contains two HYDIN loci: HYDIN on chr. 16, and HYDIN2 on chr. 1. The HYDIN2 locus was absent from earlier assembly versions, was on an unlocalized scaffold in GRCh37, and is placed in GRCh38. This slide shows the alignment of the HYDIN2 and HYDIN genes from the CHM1 genome assembly (TINA POSTER) to the chr. 16 HYDIN locus in the GRCh37 assembly. The HYDIN2 alignment reflects paralogous sequence differences, while the HYDIN alignment reflects allelic differences. The alignments show that the 2 loci are highly similar, explaining why it was so difficult to disambiguate the two genes. In fact, the sequences are so similar that, in NCBI34, sequences from the two genes were mixed at the same locus. The high degree of similarity has complicated variation analysis of these two paralogous genes. The absence of the chr. 1 paralog in previous assembly versions has likely led to erroneous variant calling at the chr. 16 locus. Zooming in, we see a paralogous sequence variant in HYDIN2 that occurs at the position of an annotated SNP in HYDIN. Now that HYDIN2 is present in GRCh38, we can begin to address issues such as this.
Another set of sequences that the GRC was interested in capturing for GRCh38 was the 1000G decoy sequence. This was a 35 Mb collection of sequences that were not represented in the GRCh37 primary assembly. They were included in the 1000G phase 2 alignment target set as a read trap, as analyses showed they improved variation calling. The decoy sequences had an average repeat content of ~80%. In order to assess decoy capture in GRCh38, we looked at reads from two 1000G samples that previously aligned only to the decoy. Depending on the sample, we find that 70-75% of such reads now align to the GRCh38 primary assembly. An additional 1% of reads are captured when the full assembly is used as a substrate and the alt loci are present. Thus, while not fully representing the decoy, GRCh38 does include a significant portion of this important sequence and is therefore a better alignment target than GRCh37. We continue to pursue the capture of the remaining decoy, much of which is highly repetitive, in a meaningful way in the reference assembly.
This brings me to the alternate loci, which are now present in greater number and locations than ever. In the original reference assembly model, there was no good way to handle variant genomic regions. Frequently, sequences from multiple haplotypes were inserted and confounded assembly, leading to artificial gaps. In the assembly model we’re using now, there’s a mechanism to cleanly represent multiple haplotypes: these are the alternate loci. They allow the reference assembly to contain alternate representations for regions where a single sequence path is considered insufficient, while retaining the linear chromosome models that most users are comfortable with. The corollary of this statement is that the reference assembly may represent >1 allele at a locus.
So, why is it important to use the alternate loci? One simple reason is gene content. In GRCh38, there are 64 protein coding and 112 non-protein coding genes that are found only on the alternate loci. An example is shown in this slide. This image shows an alternate locus scaffold from chromosome 22. The grey bar is the assembly component, the green bars are genes, and the alignment is below. You can see several genes annotated in the region of the alt that has no alignment to the chromosome. Thus, if you’re not using the entire assembly in your analyses, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.
Alternate loci also have implications for genome interpretation. In this example, we’re looking at structural variation in the APOBEC locus on chr. 22. There is a deletion variant that results in the fusion of the APOBEC3A and 3B genes. The deletion allele is prevalent in Asian and South American populations. GRCh38 contains the deletion allele on an alternate locus scaffold. This is a common polymorphism for which the alt contains the predominant allele for certain populations. This image shows reads from two Asian 1000G samples that align in the APOBEC intergenic region in GRCh37, displayed in the NCBI 1000G browser. Because the samples are heterozygous but are aligned to the primary assembly, which has only the insertion variant, the alignments are complicated. You can see that different methods give different results. Use of the full assembly, an alignment substrate that includes both variants, would likely improve the interpretation of the data.
We’ve been doing some analyses to investigate the severity of mapping errors that can occur when alternate loci aren’t used in alignment target sets. Since our analyses of GRCh38 are ongoing, I’ll talk today about a study we did with the GRCh37.p9 assembly. In that study, we looked at the behavior of simulated reads sourced from sequence unique to GRCh37.p9 patches or alternate loci. We asked what happened to them when aligned to GRCh37 primary assembly+MT, where their true target is missing. We aligned the reads either as singletons or pairs, using two different aligners (BWA and srprism). As shown in this graph, regardless of read pairing or the aligner, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters have an off-target alignment on the GRCh37 primary assembly (in blue). These off-target alignments are likely to result in errors in variation analyses. This analysis demonstrates the value of including alternate loci in alignment target sets.
That being said, most commonly used short read aligners can’t currently handle the allelic duplication introduced into the assembly by non-unique sequences in alt loci. Mapping scores for reads aligning to both the alt and the corresponding chromosome region are depressed and excluded from analysis. As a result, new alternate-aware tools that understand the relationship of the alt to the chromosome and don’t depress scores are needed in order for users to take advantage of the full reference assembly. Some aligners, such as iBWA and srprism, can now do this, but other aspects of variant calling tool chains still need to be updated to address this issue of allelic duplication. In the interim, the GRC has been looking at approaches that may help users make use of existing tool chains. For example, we’ve tested use of a mask that hides the duplication in the alts. In this slide, you can see the mask we’ve generated for this NOVEL patch, which has an insertion relative to the chromosome, but is identical for much of the remaining length.
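As a rough illustration of the masking idea, the Python sketch below hides the parts of a scaffold that duplicate the chromosome and leaves only the novel insertion visible to an aligner. Replacing the duplicated bases with N is an assumption made for this example, as are the function name, the toy sequence and the block coordinates; the talk does not specify how the GRC masks were encoded.

```python
def mask_scaffold(sequence, duplicated_blocks, mask_char="N"):
    """Hide scaffold intervals that duplicate chromosome sequence.

    duplicated_blocks holds 0-based, half-open (start, end) intervals on
    the scaffold, e.g. taken from its alignment to the chromosome.
    """
    masked = list(sequence)
    for start, end in duplicated_blocks:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"block ({start}, {end}) is outside the scaffold")
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


# Toy NOVEL patch: identical flanks around an 8-base insertion.
patch = "ACGTACGT" + "TTGGCCAA" + "GGCATCGA"
print(mask_scaffold(patch, [(0, 8), (16, 24)]))
# NNNNNNNNTTGGCCAANNNNNNNN
```

With the flanks hidden, reads from the shared sequence have only one place to align (the chromosome), while reads from the insertion can still find the patch.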
We have looked at the effect of masking on BWA alignments and compared results to those obtained with use of the alternate-aware aligner, srprism. In this analysis, simulated reads were aligned to GRCh37.p9 primary or full assembly. For BWA, we tested masking of the alts/patches only, or masking a combination of sequences on the alts/patches and the chromosome. We then looked at the incidence of reads with ambiguous alignments. As shown in the first two columns of the figure, there is an expected increase in multiple alignments when reads are aligned to the full assembly with BWA and no mask (expanded red). In the next two columns, you can see how use of either masking approach suppresses the increase in multiple alignments. The last two columns show that srprism, the alt-aware aligner, does not need a mask to prevent ambiguous mappings. We’ll be following up this analysis on GRCh38, but I hope that even this preliminary data makes the point that it is possible to develop tools that can handle the alternate loci and may allow users to reap the benefits of using the full assembly in analyses.
On that note, I’d like to wrap things up. I’d like to think I’ve convinced you that it was time for an update, that the reference has improved, and that the updates and new features will make the reference a better substrate for analysis. For those of you ready to make the switch, I’d like to plug the NCBI remapping service, which uses assembly-assembly alignments to remap features from one assembly to another. This tool can be used for mapping between GRCh37 and GRCh38. It is available as a web interface, as well as a Perl script API. While you may not be as excited by the new assembly as these folks are with their socks, it’s a far cry from a lump of coal.
Taking Advantage of GRCh38
12 February 2014
GRCh38 Model Centromeres
Until now, centromeres have been defined as multi-megabase gaps in the assembly
Karen Miga (Kent Lab, UCSC)
GRCh38 Sequence Addition
Dennis et al., 2012
GRCh38 Path Updates
HYDIN: chr16 (16q22.2)
Doggett et al., 2006
HYDIN2: chr1 (1q21.1)
Missing in NCBI35/NCBI36
Unlocalized in GRCh37
Placed in GRCh38
Alignment of HYDIN2 Genomic, 300 Kb, 99.4% ID
Alignment of HYDIN CHM1_1.0, >99.9% ID
GRCh38: Alt Loci
Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly
Mask1: mask chr for fix patches, scaffold for novel/alts.
Mask2: mask only on scaffolds
NCBI RefSeq and gpipe annotation team