ABGT 2016 Workshop Schneider

  1. Relating New Assemblies to the Human Genome Reference Valerie Schneider, Ph.D. NCBI 10 February 2016
  2. Twitter: @GenomeRef
  3. Overview • Changes in reference assembly sequence sources • Diversity • Properties • Evaluating new sequences for use (Assemblathon) • Future of assembly curation and the reference assembly
  4. GRCh38 • 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb) • 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes • Average alt length = 400 kb, max = ~5 Mb GRCh38
  5. Assembly Composition
  6. WGS Assemblies Contributing to GRCh38 (Assembly Composition)
      Assembly Name | Assembly Accession | Seq Method | Usage | Length
      RP11_1.0_unmatched_regions | GCA_000442295.1 | 454 | Gaps, Correction | 754717 (0.02%)
      CHM1_1.1 | GCF_000306695.2 | Illumina | Gaps, Correction | 133662 (0.004%)
      HsapALLPATHS1 | GCA_000185165.1 | Illumina | Gaps, Correction | 364303 (0.01%)
      HuRef | GCF_000002125.1 | Sanger | Gaps, Correction, Alt Loci, CEN | 4800690 (0.16%)
      LinearCen1.1 (normalized) | GCA_000442335.2 | Sanger | CEN | 59546786 (2.02%)
  7. WGS Gap Closure
  8. Human assemblies available in the NCBI Assembly database. Assemblies in GenBank: Oct. 2014: 13 assemblies; Nov. 2015: 28 assemblies; Feb. 2016: 39 assemblies. Figure labels: YRI, CEU, CEU, CHB
  9. Reference Assembly Basics. Figure labels: Reads: Sanger, Sanger, Illumina, Illumina, PacBio (older), PacBio (newer); Method: clone, WGS, WGS, WGS, WGS, WGS. N50: Measure of continuity. Contigs of this length or greater contain half of the assembled sequence.
  10. Overview • Changes in reference assembly sequence sources • Diversity • Properties • Evaluating new sequences for use (Assemblathon) • Future of assembly curation and the reference assembly
  11. Assemblathon Analysis Overview CHM1/CHM13 Assemblathon Goals • Assess aspects of data generation (coverage, length) • Assess assembler algorithms & parameters • Platinum genome selection (MGI) • More robust reference curation (GRC) • Set expectations for these new resources • Understand quality and limitations • Plan for regions needing other resources • Develop new pipelines and SOPs
  12. Assemblathon RefSeq Alignment Stats: CHM13
      Metric | GRCh38 | CHM13_FC | CHM13_CA1 | CHM13_CA2 | CHM13_CA3 | CHM13_CA4
      Accession | GCF_000001405.26 | GCA_000983455.2 | GCA_000983465.1 | GCA_001015355.1 | GCA_000983475.1 | GCA_001015385.1
      Total sequences | 50,304 | 50,304 | 50,304 | 50,304 | 50,304 | 50,304
      No Alignment | 21 (0.04%) | 88 (0.17%) | 50 (0.10%) | 49 (0.10%) | 46 (0.09%) | 50 (0.10%)
      Multiple best alignments (split transcripts) | 10 (0.02%) | 40 (0.08%) | 340 (0.68%) | 316 (0.63%) | 611 (1.22%) | 395 (0.79%)
      CDS coverage < 95% | 17 (0.04%) | 256 (0.66%) | 378 (0.97%) | 326 (0.84%) | 622 (1.60%) | 392 (1.01%)
      Dropped at consolidation (coding) | 0 | 167 | 259 | 278 | 240 | 250
      Dropped at consolidation (non-coding) | 0 | 138 | 212 | 209 | 185 | 191
  13. Assemblathon RefSeq Alignment Stats: CHM13
      Frameshifts | GRCh38 | CHM13_FC | CHM13_CA1 | CHM13_CA2 | CHM13_CA3 | CHM13_CA4
      Accession | GCF_000001405.26 | GCA_000983455.2 | GCA_000983465.1 | GCA_001015355.1 | GCA_000983475.1 | GCA_001015385.1
      proteins | 19 | 218 | 346 | 503 | 627 | 439
      genes | 12 | 161 | 232 | 317 | 366 | 281

      Number proteins | Assemblies in which frameshifted
      953 | 1
      106 | 2
      50 | 3
      113 | 4
      115 | 5
      41 | 6
      2 (PKD1L2) | 7
  14. Assemblathon: Assembly-Assembly Alignments. Figure (graphic: Deanna Church): sequences A, B and B’ in assembly 1 and assembly 2. First Pass (FP) alignments: unique well aligned region in both assemblies. Second Pass (SP) alignments. Figure labels: SP only, SP + FP, Expansion, Collapse, Assembly 1, Assembly 2
  15. Assembly | Average
      CHM13_FC | 2.36%
      CHM13_CA1 | 2.38%
      CHM13_CA2 | 2.41%
      CHM13_CA3 | 2.03%
      CHM13_CA4 | 2.13%
      GRCh37 | 1.06%
      Figure label: ungapped
  16. Overview • Changes in reference assembly sequence sources • Diversity • Properties • Evaluating new sequences for use (Assemblathon) • Future of assembly curation and the reference assembly
  17. Future Curation • Platinum and gold genomes expected to contribute to reference corrections and alternate loci • Set standards for use of other WGS assemblies • Gold and platinum assembly curation • Tools for local re-assembly • Assessing and communicating local assembly quality. Figure labels: GRCh38, CHM1, CHM13, NA19240, NA12878, NA19434, HG00733, HG00514
  18. Future Curation • Multiple human references • Reference graphs • Long-term curation. Figure labels: CHM1, CHM13, NA19240, NA12878, HG00733, HG00514, NA19434, GRCh38
  19. Overview • Changes in reference assembly sequence sources • Diversity • Properties • Evaluating new sequences for use (Assemblathon) • Future of assembly curation and the reference assembly
  20. GRCh38 Credits GRCh38 Collaborators • NCBI RefSeq and gpipe annotation team • Havana annotators • Karen Miga • David Schwartz • Steve Goldstein • Mario Caceres • Giulio Genovese • Jeff Kidd • Peter Lansdorp • Mark Hills • David Page • Jim Knight • Stephan Schuster • 1000 Genomes GRC SAB • Rick Myers • Granger Sutton • Evan Eichler • Jim Kent • Roderic Guigo • Carol Bult • Derek Stemple • Jan Korbel • Liz Worthey • Matthew Hurles • Richard Gibbs Assemblathon Collaborators • Jason Chin • Adam Phillippy • Sergey Koren • Heng Li GRC Tina Graves-Lindsay Karyn Meltz Steinberg Kerstin Howe Richard Durbin Paul Flicek Laura Clarke Deanna Church Curators! Developers!

Editor's Notes

  1. Today I’ll be talking about new assembly data in the context of the Genome Reference Consortium (GRC). The GRC was established after the conclusion of the HGP to manage improvements to this important resource. It subsequently became responsible for the management of the mouse and zebrafish reference genome assemblies. It is composed of the 4 institutions shown here, including MGI, who together perform the wet lab and bioinformatics work of the consortium.
  2. The talks in today’s workshop have highlighted the need for continued reference assembly improvement, especially with regard to capturing diversity, and described new approaches and ongoing efforts to produce a suite of assemblies with near-reference quality. With so much new genomic data becoming available, it’s tempting to simply accept it for use in the reference, or to consider choosing a new assembly as the reference. But whenever these topics come up, I always think about this old commercial. It’s a reminder that we need to be mindful of the impact that change can have, and the very large community to which the GRC must respond when change occurs. Thus, we have to really understand the properties of the new data.
  3. In my talk today, I want to take a step back and put these new assemblies in the broader context of reference curation. First, I’ll talk about how sequence sources for the reference assembly have already been changing, not just in terms of diversity, but also their basic properties, such as type, length and quality. Next, I’ll come back to the Assemblathon that Tina already introduced, and discuss some analyses that are of particular interest with respect to curation. Lastly, I’ll spend a few minutes discussing the future of assembly curation, particularly how these gold and platinum genomes fit into the curation activities of the GRC and the future of reference assemblies in general.
  4. The ideogram image in this slide shows the current major release of the human reference assembly, GRCh38. The red marks here represent the alternate loci, which Deanna already introduced. If you recall, the alternate loci represent alternate sequence representations for complex or more highly variant genomic regions. This slide illustrates how regions for which alternate representations are already available are widespread throughout the genome. While diversity in many regions can be well-represented by just a single alternate, there are estimated to be nearly 100 regions that exhibit diversity at an MHC-like scale.
  5. For example, as this slide illustrates, the LRC_KIR region of GRCh38 on chr. 19 has over 30 different representations of the two major haplotypes at this locus. Historically, capturing this type of diversity was challenging. Generating sufficient genomic libraries and sequencing identified clones to capture the desired diversity easily become cost-prohibitive. Nonetheless, there have been efforts to do just this, such as the NHGRI Structural Variation project in the first half of the aughts, which involved the creation and sequencing of libraries from more than 20 individuals from 4 populations.
  6. This project contributed significant sequence to GRC reference curation in the update from GRCh37 to GRCh38. Deanna already showed this pie chart illustrating reference composition. If we look at a corresponding chart for the alternate loci, you can see how they are really enabling the reference to expand its representation of diversity. In the alts there is a more evenly distributed use of different sources. At bottom is a graph showing the relative changes in library contribution from GRCh37 to GRCh38. This illustrates both the increased diversity, with an increased use of new libraries, like the SV fosmids and CH17, but also the change in resources the GRC was using to update the reference. In particular, we see an increase in the category “No Lib”, which represents WGS sequence. I’d now like to discuss the GRC’s use of WGS in its recent curation activities, as context for the future use of the platinum and gold assemblies.
  7. This slide shows WGS contributions to GRCh38, which together make up about 2% of the ungapped assembly sequence. The GRC blessed use of 5 different WGS assemblies, derived from 3 different sequencing methods, as temporary fixes to the clone-based reference. Contigs from WGS assemblies were generally used to reduce assembly gaps or correct errors in clone components. For GRCh38, WGS data was also used as the basis for the modeled centromeres.
  8. An example of WGS addition at an assembly gap is shown on this slide. In blue, you can see the tiling path in GRCh37, where there is a gap. The TWIST2 gene spans this gap. In GRCh38, the gap has been closed by the addition of new sequence from a WGS assembly, providing complete representation for the gene. The current GRC SOPs call for use of WGS only when clone sequence is not available, and also for its eventual replacement by finished clones. This is because the WGS sequence from these older assemblies is recognized as only draft quality and, due to short N50s, tends to introduce many gaps. However, with the production of the more contiguous gold and platinum genomes, as well as diminishing availability of genomic clone resources, the GRC will need to re-evaluate its rules for WGS usage.
  9. While the GRC is eagerly awaiting the platinum and gold genomes as potential curation resources, I want to emphasize that there are also a growing number of other WGS assemblies in the public databases that are potential sources for curation. At the end of 2014, there were 13 human genome assemblies in GenBank, and the number doubled by the end of 2015. Strikingly, we’ve seen 11 new ones in the last 3 months alone. With all these new assemblies available, the GRC is regularly asked to use them in its curation efforts. However, these genomes were sequenced and assembled with different technologies and algorithms. Because the reference assembly has such a high quality (error rate of 10^-5), the GRC must evaluate the assemblies before doing so. There are some basic features the GRC considers when looking at the assemblies, such as population (some represent 1000G samples), genome representation (full vs. partial), and assembly level (chromosome vs. scaffold vs. contig).
  10. For a deeper understanding, we rely on a couple of summary metrics, like contig N50 and contig count. This slide takes a look at contig N50 for a few of the WGS assemblies in GenBank compared to the clone-based reference. Their differences in sequencing technology are shown. While WGS assemblies have traditionally been much more fragmented than the reference, newer technologies are now closing this gap (no pun intended). Given that many of the remaining reference problems are in complex regions, we really need these new assembly resources with greater continuity. Simply having lots of assemblies available isn’t enough. We need high-quality assemblies.
  11. So this takes me to the next part of my talk, about evaluating new sequence for use. While the GRC is eager to use the platinum and gold assembly sequences in the reference, in order to do so, it’s critical to have a solid understanding of their characteristics.
  12. Tina already introduced the Assemblathon for the CHM1 and CHM13 genomes, so I’m not going to reiterate the details she’s provided. However, I would like to take a minute to talk about the goals of the Assemblathon, both in general and for the GRC. Clearly, for data generators like Vince, this will be useful in gaining a better appreciation of what factors are critical to having good input for assembly. For folks like Jason Chin and Adam Phillippy, this is feedback for algorithm development. And at a basic level for MGI, this will define the assemblies that will ultimately become the platinum genomes. But for the GRC, the main benefit from the Assemblathon is that it will help ensure we continue to provide robust curation in this era of new resources. From the Assemblathon, we will set expectations for this next generation of WGS assemblies and gain an understanding of their qualities and limitations. We’ll also see what genomic regions may require other resources for resolution and use our experience with these assemblies to develop new curation pipelines and SOPs.
  13. I’ll be presenting more of the Assemblathon results at my talk on Friday, but I’d like to talk about a couple of results from the analyses of CHM13 that are particularly pertinent to potential use of these assemblies for reference curation. This slide shows the results of the RefSeq analysis in which a suite of transcripts was aligned to each of the assemblies, and also to GRCh38, and the following metrics were assessed. You can see that there is some variability among the assemblies, with no single one doing best in all categories. You can also see they have more annotation issues than the reference. What I want to draw attention to is the “dropped at consolidation” metric, and compare it to GRCh38. What this measures is how many loci have 2 or more distinct transcripts with overlapping alignments. It’s a measure of collapse, and shows that the CHM13 assemblies have likely collapsed sequences from paralogous gene copies.
  14. This slide shows an analysis looking at frameshifts in aligned proteins. You can see that the CHM13 assemblies all look notably worse than the reference. Importantly, there’s also a substantial lack of overlap in the frameshifted transcripts, suggesting that these differences are really assembly-specific, and not representative of the genome itself. The exceptions may be those in these latter two groups. It’s interesting to note that both the reference genome and CHM13 appear to carry the non-functional allele of PKD1L2 (polycystic kidney disease 1-like 2), a known polymorphic pseudogene. These are the types of metrics that we must keep in mind when considering the impact of using these sequences for reference curation, or for de novo analyses. It will be very interesting to see how the integration of the BAC clones into the assembly affects these and other metrics.
  15. Lastly, I want to look at some results from assembly-assembly alignments. NCBI alignments are generated in two phases. The first phase produces 'First Pass' alignments, which are reciprocal best alignments, meaning any locus on the query assembly will have 0 or 1 alignments to the target assembly. In order to capture duplicated sequences, we do a 'Second Pass' to capture large regions (>1 kb) within an assembly that have no alignment or a conflicting alignment in the First Pass. In the 'Second Pass' alignments, a given region in the query assembly can align to >1 region in the target assembly. As shown here, SP alignments can be used to identify regions of assembly collapse or expansion.
  16. This slide shows the amount of per-chromosome expansion in GRCh38 relative to the various CHM13 assemblies and GRCh37, the previous reference assembly. Chr. Y has been excluded, since it’s not present in CHM13. You can see that the GRCh38 expansion is relatively consistent with respect to all the CHM13 assemblies, but on average double that relative to GRCh37. These results suggest that there is more collapse in the CHM13 assemblies than the GRCh38 reference, which is known to be expanded relative to GRCh37. This is consistent with the consolidation analysis from RefSeq and the propensity of WGS assemblies to exhibit collapse. None of this is to say that the new assemblies are bad, and it emphasizes the importance of the clone additions that Tina already mentioned. These analyses simply underscore the concept that there are features that must be considered when using these data for assembly curation, or even for de novo analyses.
  17. Having now taken a brief look at the ongoing evaluation of these new high quality references, I want to finish by considering the future of assembly curation and the reference assembly.
  18. As you’ve heard today, these new assemblies are expected to contribute to the ongoing curation of the reference assembly. One likely outcome of their production will be the establishment of new standards for WGS sequence contributions to the reference. Just as there are standards for clone-based sequences, there should be standards for WGS. Ultimately, this will help ensure the integrity of the reference. Another likely outcome is the development of tools for WGS assembly curation. Until now, WGS assemblies were essentially one-and-done deals. If these are to be curated, it will require development of new tools so that local edits can be made. And if the assembly is to be curated effectively, it means we also need tools to assess and communicate local assembly quality. This is something both Jason and Adam are interested in and the GRC hopes to pursue further this year.
  19. In contrast to the model displayed on the previous slide, it’s also possible or even likely that as more reference-quality genomes are produced, we may come to a future in which there are multiple reference assemblies. For example, a suite of population- or condition-specific references. The existing reference may play a central, tangential or non-existent role in analysis on other genomes in the suite. To support such a future we will need more robust resources to translate data among the references and display multiple assemblies and alignments. We must also consider what role these genomes might play in a graph of human population variation and whether graphs will support use of preferred paths for analysis on population-specific references. Lastly, we must consider the curation of reference suites or graphs themselves. As references, the quality and integrity of the resources are critical and require long-term maintenance. Thus, mechanisms must be implemented to define and support a set of references or graph, as opposed to a single genome. We don’t have all the answers to these questions today, but we must keep attention on them as these new data enter the public sphere.
  20. With that, I hope I’ve left you with an appreciation for the challenges we face in this new genomic era, but that you’ve also seen from the other talks in this workshop that newly available data will enable us to address some of the most challenging assembly issues and give users a more comprehensive and improved suite of genomes for analyses.
  21. With that in mind, this slide illustrating the current suite of unresolved assembly issues is here to emphasize that we still have a lot to do in the reference, making it likely that WGS will play an increasing and central role in assembly improvement.
  22. GRC doing Assemblathon with Jason, Adam as a way to gain a better understanding of these new assemblies and assess them for use in the reference.
  23. How users can find the new data in the reference: A feature of the assembly model known as “patches” allows the GRC to make assembly updates that provide corrected or additional sequences in a timely fashion to users who need them, without disrupting the chromosome coordinates upon which users rely. Like the alt loci, regions are defined for the genomic locations to be updated, and sequences representing those updates are put into a “Patches” assembly unit. And like the alts, these are scaffold sequences with alignments to the chromosomes. This figure distinguishes the two types of patches and the ways in which they should be used for analysis: (1) FIX patches correct problems in the assembly: deprecated in next assembly release. (2) NOVEL patches add new alternate sequence representations to the assembly: become alternate loci in the next assembly release.
  24. Example of region where single-haplotype clones were needed. There were some large-scale path rearrangements in GRCh38. This slide illustrates updates associated with the SRGAP gene family, which is involved in cortical development. The ancestral 1q32 gene was duplicated in humans to 1p21 and 1q21. Work from Evan Eichler’s lab found that not only were the 3 SRGAP2 human paralogs incompletely sequenced in GRCh37, but that allelic and paralogous sequences had been mixed in the assembly. Genomic clones derived from a hydatidiform mole, which represents a single haplotype, were used to disambiguate the correct paths at each locus. These updated paths were originally released as fix patches to GRCh37 and are now incorporated in the GRCh38 chromosomes. The bottom panel shows the GRCh37.p13-GRCh38 assembly-assembly alignments in the 1q21 region. When you see fragmentation like this, the take-home is: don’t trust the earlier version of the assembly.
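
The contig N50 that the notes above use to compare WGS assemblies with the clone-based reference can be illustrated with a short sketch. This is not the GRC's or NCBI's code; the function name `contig_n50` and the contig lengths are hypothetical, chosen only to show how a more fragmented assembly of the same total size gets a lower N50.

```python
def contig_n50(lengths):
    """Return the contig N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases.
    Returns 0 when no contigs are given."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths (arbitrary units), not taken from any assembly
# discussed in the talk. Both sets sum to the same total.
fragmented = [100, 80, 60, 40, 20, 10, 10, 10]
contiguous = [250, 80]

print("fragmented N50:", contig_n50(fragmented))
print("contiguous N50:", contig_n50(contiguous))
```

Because N50 is weighted by bases rather than by contig count, it summarizes continuity only; it says nothing about whether the sequence is correct, which is why the talk pairs it with the alignment-based collapse and frameshift metrics.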