Assembly (e.g. GRCh38.p1)
SCAFFOLD STATUS AT NEXT
MAJOR ASSEMBLY RELEASE
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
I’d like to welcome everyone on behalf of the Genome Reference Consortium to the workshop. My name is Valerie Schneider, and I’m the team lead for the GRC at NCBI.
The GRC was established after the conclusion of the HGP to manage improvements to the human genome, and subsequently became responsible for the management of the mouse, zebrafish and chicken assemblies. It comprises the 5 institutions shown here. The GRC’s job is to do more than just fix problems in reference assemblies. Its aim is to update them to reflect the new knowledge that we gain from using them, so that we can continue to use them to advance our understanding of biology.
One goal of today’s workshop is to provide you with information that will enable you to take better advantage of the human reference genome assembly in your research and practical applications. In my talk, I’m going to give you an in-depth look at the reference, covering the following topics. I’ll start by reviewing some assembly basics and features that distinguish the reference assembly from other human genome assemblies. If anything is unclear, please let me know. There will also be a general Q&A following the presentations.
Left: karyotype of a human male genome. A genome is a physical object comprising the complete collection of DNA encoding an organism’s hereditary information. Most healthy humans have a diploid genome, with one maternally and one paternally inherited copy of each chromosome.
While we can represent the information encoded on a haploid chromosome as a linear sequence of bases, current sequencing technologies don’t allow us to sequence a chromosome in its entirety. Instead, we need to fragment the genome, sequence the fragments, and then re-assemble the sequences back into a representation of the DNA molecules. That’s why we refer to collections of genomic sequences as assemblies. No assembly process is perfect: an assembly is a model of the genome.
This plot shows the growth in the number of human assemblies in GenBank in the last decade. Today there are more than 40, with the largest increase in the last couple of years, reflecting drops in sequencing and assembly costs. With so many options today, which do you use for analysis? It’s important to understand how the reference and other assemblies are distinguished from one another.
How can you compare them and determine whether they are suitable for use in your analyses? Some distinctions are basic, such as: the sample/population represented, the assembly level or genome completeness. But assemblies can also be distinguished by sequencing and assembly methods. And it’s these features that result in differences in assembly metrics such as ploidy, contiguity, coverage, and repeat and gene content.
You might be hoping for a simple metric that allows you to assign a “thumbs up or down” to an assembly, but the truth is, it’s not that simple. The relevance of the distinctions on the previous slide tends to be dictated by how you plan to use the genome assembly; some possible uses are shown here. We’ll now take an in-depth look at the reference assembly and its features in the context of other assemblies. You’ll hear more about the details of other genome assemblies in the talks by Tina, Deanna and Ali.
The human reference assembly is a special kind of genome model. In today’s era of personal genome sequencing, most assemblies model the diploid genome of a single individual. But the reference assembly is a model of many diploid genomes, meant to represent the “human” genome. This slide shows the assembly composition of the current reference. While 70% of the reference comes from one donor, which analyses have identified as an admixed African-European male, sequence from >70 individuals is represented.
Why were so many donors used?
One goal of the HGP was to produce a representation of the human pan-genome. At that time, we understood significantly less about human variation than we do today. We believed that a reference assembly consisting of linear chromosome models produced from multiple individuals, with differences described as sequence edits, would accomplish that goal. We’ll look at how well the reference assembly model accomplishes that goal later in the workshop. And you’ll be hearing from Tina about efforts to further expand the population diversity of the reference.
The reference is also distinguished by sequencing method. It was produced with Sanger sequencing technology, which has largely fallen out of favor due to cost. The Sanger-sequenced bases of the reference distinguish it, as they are considered very high quality, even in comparison to newer sequencing technologies. You may have heard the reference assembly referred to as a “finished” assembly. Finished has a specific meaning when it comes to genomes. It doesn’t mean the genome is “done” in terms of being gap-free or free from assembly errors. It is a metric of sequence quality, meaning that the error rate for each base is < 1 in 10,000. The error rate for the human reference is actually <1 in 100,000, which is still better than any other assembly reported to date.
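To put those error rates on the scale commonly used for base quality, here is a minimal sketch that converts a per-base error rate to a Phred-scaled score (Q = -10 log10 of the error probability). The function name is my own, chosen for illustration; it is not part of any GRC tool.

```python
import math


def phred_quality(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality score."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# The "finished" standard quoted above: fewer than 1 error in 10,000 bases.
finished_threshold = phred_quality(1 / 10_000)
# The figure quoted above for the human reference: fewer than 1 error in 100,000 bases.
reference_estimate = phred_quality(1 / 100_000)

print(round(finished_threshold), round(reference_estimate))
```

On this scale, the finished standard corresponds to roughly Q40 and the figure quoted for the reference to roughly Q50.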
The reference is also distinguished by its assembly method. Assemblies derived from today’s next gen sequencing technologies are whole genome shotgun assemblies. In the WGS approach, there is no upfront genome mapping. The entire genome is fragmented, sequenced and then assembled de novo. The alternative to a WGS assembly is a clone-based assembly.
To start, the genome is fragmented into relatively large pieces by shearing or restriction digest and cloned into a vector (like a BAC).
Clones are isolated and their relative order is determined using some sort of mapping technology (typically fingerprint mapping in conjunction with markers on linkage or RH maps).
As a minimal path is developed, individual clones are shotgun sequenced and assembled.
It’s been shown that gaps remaining after the shotgun sequencing and assembly of a clone are rarely resolved by deeper sequence coverage. Instead, finishers go in manually to fill the gaps, often by PCR.
Overlap between individual clones is determined and a consensus sequence is generated based on the overlap of adjacent clones.
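The idea of a minimal path through mapped clones can be sketched as a greedy interval cover. This is only an illustration under simplifying assumptions: clone positions are treated as exact half-open intervals on a hypothetical map, and the clone names and coordinates are invented. It is not the procedure the sequencing centers actually used.

```python
from typing import List, Tuple

# (name, start, end) on a hypothetical physical map; intervals are half-open.
Clone = Tuple[str, int, int]


def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedy interval cover: the fewest clones spanning [region_start, region_end)."""
    path: List[Clone] = []
    covered_to = region_start
    remaining = sorted(clones, key=lambda c: c[1])
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones that start at or before the covered point,
        # keep the one that reaches furthest to the right.
        while i < len(remaining) and remaining[i][1] <= covered_to:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path


# Invented clones, for illustration only.
clones = [("A", 0, 150), ("B", 100, 300), ("C", 120, 250), ("D", 280, 400)]
print([name for name, _, _ in minimal_tiling_path(clones, 0, 400)])
```

Clone C is skipped because B starts within the covered sequence and extends further, so only A, B and D would need to be sequenced.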
NOTE: This approach reduces the de novo assembly problem from a global problem to a local problem. A second consequence is that the reference is a haploid mosaic, in which valid haplotypes may transition at clone boundaries, rather than the haploid consensus produced in WGS assemblies of diploid samples.
What does this mean? This figure from a recent review illustrates the effect that sequencing and assembly method have on the statistics of coverage and contiguity. The reference, which is the only clone-based assembly in the chart, is at the upper left with the greatest contiguity, as measured by the statistic of contig N50. Coverage, the average number of times each base is sequenced, determines the likelihood that any given base is sequenced. To balance cost and quality, the HGP sequenced the reference to an average coverage of ~10x. Despite being an order of magnitude lower than many other assemblies here, at this coverage, >99.99% of bases in the genome were sequenced. Short-read sequencing methods like Illumina and Roche give the highest coverage, but the assemblies are highly fragmented. The Sanger-based HuRef WGS assembly has a higher N50, and in green are 3 PacBio assemblies. You can see how longer reads are enabling WGS assemblies to approach the contiguity of the clone-based reference. However, you’ll hear from Deanna today about new methods that are also allowing short-read based WGS assemblies to rival these contiguities.
So what’s the take-home? The reference is still our best all-around representation of the human genome, suitable for a wide range of analyses. It doesn’t mean the other assemblies in this chart or GenBank are bad: it just means you need to take a look at their features, so you can be sure they meet your analysis needs.
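For anyone who wants to compute these two statistics, here is a minimal sketch. The N50 function follows the usual definition (the length at which sequences that long or longer contain at least half of all assembled bases). The coverage function uses the idealized Poisson (Lander-Waterman) expectation, which is my assumption for illustration and not necessarily how the figure quoted above was derived.

```python
import math
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]  # unreachable for positive lengths; kept for completeness


def expected_fraction_sequenced(coverage: float) -> float:
    """Idealized Poisson expectation that a base is covered at least once: 1 - e^(-c)."""
    return 1 - math.exp(-coverage)


# Invented contig lengths, for illustration only.
print(n50([80, 70, 50, 40, 30, 20, 10]))
print(expected_fraction_sequenced(10) > 0.9999)
```

Under that idealized model, 10x coverage leaves fewer than 1 base in 10,000 unsampled, which is consistent with the >99.99% figure; real data deviate from the model because of cloning and sequencing biases.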
I now want to move on to talk about the assembly model.
As I mentioned earlier, the reference assembly is a modeled genome. The original reference assembly model was essentially a stick model of linear chromosomes. However, even when assembling the genome of a single diploid individual, not to mention the many donors in the reference, there may be divergent haplotypes that confound genome assembly. The original stick model of linear chromosomes didn’t really have a good way to represent highly variant or complex genomic regions. Different haplotypes were simply compressed into a consensus. The compression of divergent haplotypes, however, often led to non-existent allele combinations and artificial gaps, as illustrated here.
This issue led the GRC to develop an assembly model that has a mechanism to cleanly represent multiple haplotypes: alternate loci. They allow the reference assembly to contain alternate representations for regions where haplotype compression isn’t appropriate or a single sequence path is considered insufficient. At the same time, the model retains the linear chromosomes with which most users are comfortable.
As a result of the adoption of this model, it’s important to understand that the reference assembly isn’t a haploid or even a diploid genome representation. For any locus, it can represent many haplotypes. You’ll be hearing more about diploid assemblies of individual genomes in other talks today.
This slide explains how the assembly model accomplishes this. The first thing to know is that the “assembly” is comprised of multiple assembly units.
The primary assembly unit is the non-redundant collection of chromosomes and unlocalized and unplaced scaffolds. This is essentially the original haploid assembly model.
Non-nuclear genomes are assigned to their own assembly unit.
Regions are defined for those areas of the genome for which alternate sequence representation is desired.
Alternate sequence representations for those regions go into alternate loci assembly units. The first alternate sequence representation for each region goes into one assembly unit. Each additional sequence representation for a region goes into its own assembly unit.
We also define the PAR regions, to account for sequence shared by the sex chromosomes.
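The pieces of the model just described can be sketched as a small data structure. This is an illustration, not GRC code: the class names, the unit-naming pattern and the rule that an additional representation goes into the first alt unit not yet holding a scaffold for that region are my own reading of the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssemblyUnit:
    name: str
    sequences: List[str] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    primary: AssemblyUnit       # chromosomes plus unlocalized and unplaced scaffolds
    non_nuclear: AssemblyUnit   # e.g. the mitochondrial genome
    alt_units: List[AssemblyUnit] = field(default_factory=list)
    region_of: Dict[str, str] = field(default_factory=dict)  # alt scaffold -> region

    def add_alt(self, region: str, scaffold: str) -> str:
        """Place an alt scaffold in the first alt unit with no scaffold for its region."""
        for unit in self.alt_units:
            if all(self.region_of[s] != region for s in unit.sequences):
                break
        else:
            # Every existing unit already represents this region: open a new unit.
            unit = AssemblyUnit(f"ALT_REF_LOCI_{len(self.alt_units) + 1}")
            self.alt_units.append(unit)
        unit.sequences.append(scaffold)
        self.region_of[scaffold] = region
        return unit.name


# Hypothetical region and scaffold names, for illustration only.
demo = Assembly("demo", AssemblyUnit("Primary Assembly"), AssemblyUnit("non-nuclear"))
placements = [
    demo.add_alt("REGION_A", "alt_scaffold_1"),
    demo.add_alt("REGION_B", "alt_scaffold_2"),
    demo.add_alt("REGION_A", "alt_scaffold_3"),
]
print(placements)
```

The point of the design is that no single alt unit ever holds two representations of the same region, so each unit plus the primary assembly gives one unambiguous sequence path per region.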
The alternate loci are stand-alone accessioned scaffold sequences that are given chromosome context via their alignment to the primary assembly unit. This image shows a portion of GRCh38 chr. 17, with its regions and alt loci alignments. The relationships of the alts to the primary assembly can be complex, with indels and inversions. For this reason, the GRC curates these alignments.
One point I want to make is that the alignments of the alt loci to the chromosomes are an integral part of the assembly model. The alignment, in conjunction with the sequence, is what defines the alt. The alignments are available for download with the assembly from GenBank.
The ideogram image in this slide shows the genome-wide locations of alternate loci in GRCh38, along with some basic alt loci stats. As you can see, they are numerous, widely distributed and add considerable novel sequence to the assembly.
So now let’s take a look at GRCh38, the current version of the reference assembly, in the context of this model. First, let’s look at why these alt loci are important.
Gene content is one way in which alt loci add value to the assembly. In this slide, you can see several genes annotated in the regions of this alternate representation of the chr. 19 KIR region that have no alignment to the chromosome. There are more than 150 genes whose only representation in the reference assembly is on the alternate loci.
Thus, if you’re not using the alt loci, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.
Alternate loci also have a broader impact on read alignments. Since we first developed this model, we’ve been interested in the effect of alt loci on read mapping. This slide describes a published study we did several years ago. We looked at the alignment behavior of simulated reads sourced from sequence unique to alt loci. We asked what happened to them when aligned to the primary assembly unit without the alt loci, where their true target is missing. In short, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters had an off-target alignment on the primary assembly unit (in blue). These off-target alignments are likely to result in errors in variation analyses. This analysis demonstrates the broader value of including alternate loci in alignment target sets.
Whether or not you’re ready or able to take advantage of the alternate loci and added diversity in GRCh38, I also want to emphasize that the GRCh38 primary assembly is an improvement over GRCh37. For example, it’s not just the alt loci in GRCh38 that affect read mapping in the updated assembly. This slide shows how the correction of the CDK11B gene in GRCh38 resulted in the movement of reads that were imperfectly mapped to the CDK11A paralog in GRCh37 to the corrected CDK11B gene in GRCh38. In fact, our read mapping analyses show that ~4% of reads that map with imperfect alignments to a GRCh37 region that is unchanged in GRCh38 map to a new location in GRCh38, demonstrating the impact that assembly updates have even in unchanged regions. Those results are available in our manuscript that is on bioRxiv.
Looking at other improvements: I introduced the concept of N50 earlier. This slide plots the difference in scaffold N50 between GRCh37 and GRCh38. You can see here that scaffold N50s increased for almost every chromosome, indicating the reference assembly is more contiguous than ever.
We can also compare GRCh38 to GRCh37, using a common annotation input set.
There was a 5% increase in the number of aligned genes and a 3% increase in the number of aligned protein coding transcripts. There was also a decrease in both the numbers of annotated partial CDS and split genes (genes that span gaps).
An example of one such improvement is shown here. At top, you can see the tiling path in GRCh37, where there is a gap. The INPP5D gene spans this gap. In GRCh38, the gap has been closed by the addition of new sequence, adding a missing exon and providing complete representation for the gene.
The GRC also made several types of updates to the reference assembly that should make it a more robust substrate for informatics analyses, including clinical use. One of these was an update of ~8000 individual bases, as shown by the track highlighted in red of this graphical view of chr. 20.
This cartoon illustrates the types of bases that were targeted for individual updates. The 3 bottom circles represent bases asserted to be incorrect in the GRCh37 reference assembly. These 2 other sets represent bases that, while correct in GRCh37, were asserted to have significant negative impacts on variant calling, gene annotation and/or clinical diagnostics and were also updated in GRCh38. The updated bases should improve read mapping and variant calling, as well as annotation. Bases were not updated to represent the most common, longest or ancestral alleles.
The GRC also added novel sequence to the primary assembly, particularly that which includes genes. One of those sequence additions was a paralog of KCNE1 that was missing from GRCh37. The top shows the alignment of that paralog to KCNE1. Zooming in, you can see that there is a SNP that has been called at the position of a paralogous sequence difference. That SNP was previously defined as a pathogenic missense variant and is catalogued in ClinVar, a resource you’ll hear more about from Melissa. This SNP may need reinterpretation in light of GRCh38 and the addition of the paralog. This is just one of several regions where variation analyses will benefit from use of the latest reference assembly. You can learn more about the impacts of assembly updates on variation analysis, particularly for clinical variants, in our manuscript on bioRxiv.
Another update in GRCh38 I want to mention is the addition of modeled centromere sequences to all chromosomes, illustrated on this slide showing chr. 8 in GRCh37 and GRCh38. The centromeres replace the standard 3 Mb gap in all GRCh37 chromosomes. A Genome Research publication by Karen Miga and colleagues describes the methodology for generating these modeled sequences.
Preliminary GRC analyses indicate that the modeled centromeres act as targets for read mapping. This panel shows the alignment of previously unmapped reads from a 1000G sample to the modeled centromere of chr. 7. The models appear to serve as an effective read sink, but variant analyses in these regions must account for the modeled nature of the sequence and should likely be done separately from the other genome regions.
I want to talk now about ongoing curation work for the human reference assembly. An assembly model feature known as “patches” allows the GRC to make assembly updates available in a timely fashion to researchers needing corrected or new sequence without disrupting the chromosome coordinates upon which users rely.
Regions are defined for the genomic locations to be updated, and the sequences representing those updates are put into the “Patches” assembly unit. Like the alt loci, the patches are stand-alone scaffold sequences with alignments.
It’s important to distinguish the two types of patches and the ways in which they should be used for analysis:
(1) FIX patches correct problems in the assembly: deprecated in next assembly release.
(2) NOVEL patches add new alternate sequence representations to the assembly: become alternate loci in the next assembly release.
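The distinction between the two patch types can be written down as a simple lookup. This is a minimal sketch: the scaffold names are hypothetical, and the status strings simply restate the two rules above.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, Tuple


class PatchType(Enum):
    FIX = "FIX"
    NOVEL = "NOVEL"


# Status of each patch type at the next major assembly release, as stated above.
STATUS_AT_NEXT_MAJOR_RELEASE = {
    PatchType.FIX: "deprecated",
    PatchType.NOVEL: "alternate locus",
}


def summarize(patches: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count patch scaffolds by their expected status at the next major release."""
    statuses = (STATUS_AT_NEXT_MAJOR_RELEASE[PatchType(kind)] for _, kind in patches)
    return dict(Counter(statuses))


# Hypothetical scaffold names, for illustration only.
example = [("PATCH_A", "FIX"), ("PATCH_B", "NOVEL"), ("PATCH_C", "FIX")]
print(summarize(example))
```

In practice this means a pipeline that stores results on patch scaffolds should record the patch type alongside them, since only the NOVEL scaffolds are expected to persist as alternate loci.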
This slide shows an overview of all patches released in GRCh38.p1-p9 (along with alt loci). The NOVEL patches represent variation detected in other assemblies, generally insertions >5 kb. You’ll hear more from Tina in her talk about the assemblies the GRC is using and about GRC efforts to add population diversity to the reference. The FIX patches include everything from single base updates to resolved inversions to large sequence additions.
Now that I’ve given you a behind the scenes look at GRCh38, let’s look at how you can take advantage of the data, both at the GRC website and in some genome browsers.
You can find information about GRC assemblies from the tab menu at the top of the home page. To find information about the human reference, click on the “Human” tab.
The GRC human overview page contains links for downloading the assembly and related data.
At the bottom is a section with information about the latest assembly release.
Scrolling down the page, the lower section includes a table that lists the current regions defined in the assembly. You can use the filters on the left to narrow the display to regions in your area of interest. Clicking on any region name in the table will take you to the corresponding region page.
The region page has 3 parts:
A summary of the region location, including an ideogram
A table of GRC issues and patch and alternate loci sequences associated with the region
A graphical view of the region
Tracks in the display have been chosen for their utility in assembly assessment, and I’ll describe in more detail shortly.
You can toggle to see the graphic from the perspective of the chromosome or the patch/alternate loci scaffold.
The GRC website also has a display of assembly locations that are under review due to suspected error or the need for alt loci representation. You can find information about the assembly issues on which the GRC is working via the organism-specific “Issues Under Review” tabs. On top, an ideogram shows the genomic locations of issues, which are listed in the table below. On the left of the page, you can search or filter for issues.
Clicking on a particular ideogram updates the page and table to a more specific view, where the annotations can be categorized by issue type or status. Hovering over the icons opens a pop-up with more detail. Within the table you will also find links to pages describing each of the individual issues and links to view the corresponding assembly region in some of the more commonly used browsers.
An Issue detail page has 3 parts, similar to the region detail pages:
At top, brief description of the issue, plus an ideogram showing its genomic location
In middle, lists of patch and alternate loci sequences associated with the issue (if they exist)
At bottom, graphical view of the issue region
If there’s a patch or alt loci scaffold associated with the issue, you can toggle the graphic to see it from the perspective of the chromosome (gap) or the patch/alternate loci scaffold (green closure), along with the sequence alignment.
The organism-specific assembly data tab takes you to a page that provides stats for the assembly. These include lengths, gap counts, N50s and global stats. The statistics are available for the current and several previous versions of the reference assembly.
If you spot a potential problem with the genome, you can report this to us and we will record the information in our tracking system. On our report page you must:
1- select the organism and build
2- tell us the location of the problem. We internally track using flanking component accessions, but you can provide the genome coordinates- we can use that and the build number to determine the flanking accessions.
3- some information about yourself so we can contact you with additional information.
4- a detailed description of the issue. You can even attach a file (and screen shots are good) to assist in describing the problem.
Let’s now take a look at how you can access assembly-related data in commonly used browsers. It’s important to understand that the human assembly is the same at each of the browsers. It’s all GRCh38. What can differ is the annotation data, which is provided by the browsers. At NCBI, you access a new browser known as the Genome Data Viewer (GDV) via the corresponding pages in the Assembly database.
The first time you visit GDV, it will display a default set of tracks. The display is managed through the “Tracks” menu button. From there, you can access a feature, known as “Track Sets” which allows one-click configuration of the display. The “Assembly Support” track set includes the tracks most valuable for assessing assembly quality. This is essentially the same set of tracks you’ll find on the GRC pages or in the GRC track hubs at the Ensembl and UCSC browsers. You can also do a custom configuration of GDV with the “Configure Tracks” option.
Within the Assembly Support track set, the Assembly Components track shows the underlying sequences and gaps in the assembly, while the “Issues” track shows you where the GRC is curating the assembly. There’s also a track showing component sequencing problems. The “Clone Placement” track can be used to identify mis-assemblies or find clones of interest. On the lower left side of the browser is a section called “Region details”. If the chromosome region you’re looking at has alts/patches associated with it, you can click here to update the display to show those sequences instead. It also includes a link to the relevant issues at the GRC website. The “Your Data” section lets you upload your own data into the browser for viewing alongside the NCBI-provided tracks. You can use this combination of features to assess whether the genome is okay in your region of interest. For more about viewing GRCh38 at NCBI, including other browsers, see poster 1926F.
Here we see the entry point to GRCh38 in the Ensembl browser.
In the configuration section for the Ensembl browser, GRC and assembly related tracks can be found in the “Sequence and Assembly” menu. GRC-specific tracks come from a GRC track hub that includes optical map data, regions under review, alt-primary alignments and annotated clone assembly problems.
This slide shows an example of GRC data in the UCSC browser. Assembly-related tracks are highlighted. The GRC track hub is also supported in the UCSC browser via this URL.
In conclusion, we’ve covered the following topics with respect to the human reference assembly. Hopefully you’ve gained some appreciation for what distinguishes the reference, the assembly model and GRCh38, as well as how you can take advantage of the data both from the GRC and at various browsers.
I would like to thank all of the GRC members, our collaborators who contributed data to GRCh38, often sharing with us ahead of publication and our current and past SAB members.
This is an example of a FIX patch resolving the assembly of the MUC2 gene, which contains a central exon with a large number of tandem repeats. The repeat number is under-represented in GRCh38. The GRC performed PacBio sequencing of a BAC clone containing MUC2 to provide a complete representation for the exon structure in this gene, and released this update as a fix patch scaffold.
As many of you may be thinking about transitioning between assembly versions, there are various resources to help users remap data at UCSC, Ensembl and NCBI.
The NCBI Remapping tool uses assembly-assembly alignments to project features from one assembly to the other. Users select the assemblies they want to map between, the remapping options, and the input and output file formats. For those who need more than the web interface can offer, there is also a Perl API available.
Clinical Remap allows the remapping of features from assembly sequences to RefSeqGene sequences or LRGs or from RefSeqGene and LRG sequences to an assembly.
Alt loci remap allows for the mapping of features between the Primary assembly and the alternate loci and Patches available for GRC assemblies.
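To make the idea of projecting a feature through an alignment concrete, here is a minimal sketch. It assumes the alignment has already been reduced to sorted, same-strand, ungapped blocks; the `Block` type and `project` function are hypothetical names for illustration and are not the NCBI Remap implementation or its API. Real alignments between assemblies, or between alt loci and the primary assembly, also involve indels and inversions that this sketch does not handle.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    src_start: int  # 0-based start on the source sequence
    dst_start: int  # 0-based start on the target sequence
    length: int     # ungapped aligned length


def project(pos: int, blocks: List[Block]) -> Optional[int]:
    """Project a source position through blocks sorted by src_start.

    Returns None when the position falls outside every aligned block.
    """
    starts = [b.src_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    offset = pos - block.src_start
    return block.dst_start + offset if offset < block.length else None


# Invented alignment blocks, for illustration only.
blocks = [Block(100, 1100, 50), Block(200, 1250, 100)]
print(project(120, blocks), project(160, blocks), project(250, blocks))
```

Positions that fall between blocks return nothing, which mirrors why some features cannot be remapped when the underlying sequence changed between assembly versions.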