TAGC2016 schneider

  1. Reference genome assemblies: Resources & updates from the GRC Valerie Schneider, Ph.D. NCBI 16 July 2016
  2. Outline • GRC Introduction • Assembly updates • Assembly resources
  3. Twitter: @GenomeRef
  4. Outline • GRC Introduction • Assembly updates • Assembly resources
  5. Assembly Model • Assembly (e.g. GRCm38) • Primary Assembly Unit (C57BL/6J) • Non-nuclear assembly unit (e.g. MT) • 129S6/SvEvTac • 129S1/SvImJ • 129X1/SvJ • NOD/ShiLtJ • NOD/MrkTac • PAR • Genomic Region (MHC) • Genomic Region (DiGeorge) • Genomic Region (Ren2) • GRCm38 Alternate Loci Strains: A/J, AKR/J, BALB/c, CAST/Ei, 129S6/SvEvTac, 129P2/OlaHsd, 129S2/SvPas, 129S1/SvImJ, 129X1/SvJ, 129S7/SvEvBrd-Hprt-b-m2, NOD/MrkTac, NOD/ShiLtJ, RIII
  6. Assembly Updates • Assembly (e.g. GRCm38.p5) • Primary Assembly Unit (C57BL/6J) • Non-nuclear assembly unit (e.g. MT) • 129S6/SvEvTac • 129S1/SvImJ • 129X1/SvJ • NOD/ShiLtJ • NOD/MrkTac • PAR • Genomic Region (MHC) • Genomic Region (DiGeorge) • Genomic Region (Ren2) • Patches • Genomic Region (Sftpb) • Genomic Region (Nlrp4g) • Genomic Region (Meg3) • Scaffold status at next major assembly release: FIX patches: -- (integrated); NOVEL patches: ALT LOCI
  7. Assembly Updates: Mouse
  8. Assembly Updates: Mouse • GRCm38.p5 Fix Patches (GCA_000001635.7): Rims1, Traf5, Ptpmt1, Spata5l1, Auts2, Jakmip3, Muc2, Rab3a, Ifi30, Nadk2, Ide • GRCm39?
  9. INSDC Submitted Assemblies Mouse strains poster: M5101B Will Chow
  10. Assembly Updates: Zebrafish GRCz11: Planned for the end of 2016 • Finish remaining clones & integrate into assembly • Find “missing genes” • Resolve path issues • Integrate WGS into assembly gaps • Create alternate loci for haplotypic duplications & indels affecting gene models • Poster: Z6085A (K. Howe)
  11. Assembly Updates: Zebrafish GRCz11: Planned for the end of 2016 • WTSI -> ZFIN transition • Curation: Active -> Passive • Patch releases • Annotation: Manual -> Automated • Ensembl • RefSeq
  12. Outline • GRC Introduction • Assembly updates • Assembly resources
  13. Assembly Resources
  14. /assembly/grc/mouse/issues/?id=MG-117
  15. Assembly name, assembly accession, organism name
  16. (latest RefSeq)
  18. Acknowledgements GRC SAB • Rick Myers • Granger Sutton • Evan Eichler • Jim Kent • Roderic Guigo • Carol Bult • Derek Stemple • Jan Korbel • Liz Worthey • Matthew Hurles • Richard Gibbs GRC • Tina Graves-Lindsay • Kerstin Howe • Richard Durbin • Paul Flicek • Laura Clarke • Monte Westerfield • Deanna Church • Curators! • Developers! GRC Mouse/Zfish Collaborators • NCBI RefSeq/Gene • HAVANA annotators • Peter Lansdorp • Mark Hills • Derek Stemple • David Page • WTSI NOD Idd team NCBI Support • Genome Browser team • Assembly DB • Gpipe annotation team • Clone DB • Remapping Service For more info: poster M5055A
  19. Utilizing NCBI Databases for Model Organism Research News: Contact us: • 8:00 – 8:25 The 3 W’s of Sequence Data Submission: What, Where, and When (Ilene Mizrachi) • 8:25 – 8:45 Reference genome assemblies: resources and updates from the GRC (Valerie Schneider; poster M5055/A) • 8:45 – 9:10 How to annotate for 300 species: the awesome power of NCBI’s eukaryotic genome annotation pipeline (Terence Murphy; poster D1524/B) • 9:10 – 9:35 An introduction to NCBI’s RefSeq and Gene resources (Tripti Gupta; poster M5104/B) • 9:35 – 9:55 Optimizing use of NCBI databases to analyze your favorite gene (Nuala O’Leary; poster Z6088/A)
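
The assembly model on slides 5 and 6 can be pictured as a small data structure. The sketch below is only an illustration in Python: the class and function names are hypothetical and are not part of any GRC or NCBI software, and the NOVEL patch region is a made-up placeholder. Jakmip3 is used as the FIX example because it appears among the GRCm38.p5 fix patches on slide 8.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PatchType(Enum):
    FIX = "FIX"      # corrects a problem in the assembly
    NOVEL = "NOVEL"  # adds a new alternate sequence representation


@dataclass
class Scaffold:
    region: str                             # genomic region the scaffold belongs to
    patch_type: Optional[PatchType] = None  # None for scaffolds that are not patches


@dataclass
class AssemblyUnit:
    name: str
    scaffolds: List[Scaffold] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    units: List[AssemblyUnit] = field(default_factory=list)


def status_at_next_major_release(scaffold: Scaffold) -> str:
    """Return what slide 6 says happens to a patch at the next major release."""
    if scaffold.patch_type is PatchType.FIX:
        return "integrated"
    if scaffold.patch_type is PatchType.NOVEL:
        return "alternate locus"
    return "not a patch"


# Illustrative layout only; scaffold lists for most units are left empty.
grcm38_p5 = Assembly(
    name="GRCm38.p5",
    units=[
        AssemblyUnit("Primary Assembly (C57BL/6J)"),
        AssemblyUnit("Non-nuclear (MT)"),
        AssemblyUnit("Alternate loci (NOD/ShiLtJ)"),
        AssemblyUnit("Patches", [
            Scaffold("Jakmip3", PatchType.FIX),
            Scaffold("EXAMPLE_REGION", PatchType.NOVEL),  # hypothetical region
        ]),
    ],
)

for scaffold in grcm38_p5.units[-1].scaffolds:
    print(scaffold.region, "->", status_at_next_major_release(scaffold))
```

The point of the layout is that patches and alternate loci live in their own assembly units, so adding them never changes the chromosome coordinates held in the primary assembly unit.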

Editor's Notes

  1. I’d like to thank you all for getting up so bright and early this morning. My name is Valerie Schneider and I’m the team lead for the Genome Reference Consortium at NCBI.
  2. Today I’ll be introducing you to the Genome Reference Consortium, telling you about our ongoing curation efforts and assembly update plans for the mouse and zebrafish reference genome assemblies and showing you some GRC and NCBI resources you can use in your research.
  3. The GRC was established after the conclusion of the Human Genome Project (HGP) to manage improvements to the human genome, and subsequently became responsible for the management of the mouse and zebrafish assemblies. It was initially composed of the first 4 institutions shown here, who together perform the wet lab and bioinformatics work. This year we were pleased to have ZFIN join the GRC, who, as I’ll describe later, will be contributing to the zebrafish assembly curation effort.
  4. The GRC has noticed that it’s not uncommon for many researchers to take the following view when it comes to genome assemblies. But we’d submit that the reality looks more like this. A publication does not equal a perfect genome assembly. Assemblies are kind of like cell phones. There’s no denying that the first cell phones or the initial genome representation for an organism have a transformative effect. But think about how much more we can do with today’s smart phones. And that’s the role of the GRC: it’s not just about fixing problems in reference assemblies, but updating them as we gain new knowledge so that we can continue to use them to advance our understanding of biology.
  5. With that goal in mind, I’d like to talk today about updates and timelines for the mouse and zebrafish genome assemblies. To make sure we’re all on the same page with the terminology I’ll be using, I’ll start by briefly explaining the assembly model used by the GRC, which is depicted on the next slide.
  6. The first thing to know is that the assembly is composed of assembly units. The primary assembly unit is the collection of chromosomes and unlocalized and unplaced scaffolds. For mouse, this is the C57BL/6J strain and for zfish, it’s TU. Non-nuclear genomes, like the MT, are assigned to their own assembly unit. Regions (yellow) are defined for those areas of the genome for which alternate strain or haplotype representation is desired. Those alternate sequence representations go into alternate loci assembly units. In mouse, the alt loci units are strain-specific and at right is the list of the strains represented as alt loci in GRCm38. There aren’t alt loci in GRCz10, but they may be added for the next assembly release. In contrast to mouse, the alts included with the zfish assembly will represent various haplotypes within the TU strain, rather than different strains. Alt loci sequences are therefore separate from the chromosome sequences, but their relationship to the chromosomes is known and they are part of the reference assembly.
  7. While folks in the mouse and zebrafish community have historically been relatively tolerant of assembly updates that involve coordinate changes, our work on the human assembly has shown us that as the amount of data mapped on a given assembly increases, tolerance declines. The “patches” feature of the assembly model allows the GRC to make updates available in a timely fashion to researchers needing corrected or new sequence without disrupting the chromosome coordinates. Regions are defined for the genomic locations to be updated, and the sequences representing those updates are put into the “Patches” assembly unit. Like the alts, the patches aren’t part of the chromosomes, but they and their alignments to the chromosomes are part of the reference assembly. There are two types of patches: (1) FIX patches correct problems in the assembly; they are deprecated in the next major assembly release. (2) NOVEL patches add new alternate sequence representations to the assembly; they become alternate loci in the next major assembly release.
  8. This slide shows an example of how we patched the reference representation of Jakmip3 (janus kinase and microtubule interacting protein 3) on GRCm38 chr 7. This image shows the alignment of GRCm38 to C57BL/6NJ, a closely related strain whose assembly is part of the Mouse Genomes Project. The B6NJ assembly spans the gap and includes several exons missing from the reference. Since we can’t put the B6NJ sequence into the B6 chromosomes, we used it to query public databases for additional B6 assembly sequences that could fill the gap. As shown below, in this image of the recently released fix patch, the new sequences span the gap, and if you zoom in, you can see that they provide the representation for those missing exons.
  9. For the mouse assembly GRCm38, we are working to resolve problems and we review data for patch releases 1-2x/year. The ideogram on the left shows the locations of all alternate loci and patch scaffolds in GRCm38, and on the right I’ve listed the genes whose representation was corrected in our most recent patch release, GRCm38.p5, just last month. For those of you wondering when the next major release will occur, the answer is that it is not currently scheduled. Because coordinate-changing updates can be disruptive, we want to be sure they are worth it. In fact, as part of our decision-making process for release, we’d like to hear from you about whether the current assembly is meeting your needs or not and what sort of improvements you want to see in the next release. You can contact us through our website.
  10. There have been several great talks at this meeting about the Sanger Mouse Genomes project, and you may be wondering about its relationship to the GRC and the reference assembly. It is not a part of the GRC, but due to Sanger’s involvement in both projects, we are tied into it. We are already using MGP data in our curation efforts and comparative analyses should help us identify regions needing further curation. I direct you to this poster by Will Chow that talks about tools the GRC is using to make these comparisons.
  11. Unlike mouse, the next zebrafish genome assembly update is planned, and it’s coming soon. From now through the end of the year, the GRC is focused on addressing the issues listed here. Note that for the first time, the zfish assembly is planned to contain alternate loci. For more information on these efforts, as well as new strain sequencing and assembly efforts that will be happening at Sanger, make sure you check out Kerstin Howe’s poster.
  12. With the release of GRCz11, there will also be some changes in the GRC’s management of the reference assembly. ZFIN will be assuming the primary curatorial role from Sanger, and curation will move to a passive mode. That means that we’ll be responding to user reports, but that we’ll no longer be reviewing the assembly for proactive updates. We’ll be releasing improvements we make as patches, but there are no plans for a GRCz12 at this time. And while the GRC does not do assembly annotation, I also want to point out that the manual annotation effort that’s been ongoing at Sanger will be winding down. Moving forward, those data will be merged into the Ensembl automated gene builds, and RefSeq will continue to provide annotation for the assembly and patches as well. You’ll hear more about the NCBI pipeline from Terence in the next talk.
  13. Now that you know about the changes that are coming, I’d like to take the rest of this talk to tell you about resources from the GRC and NCBI for accessing assembly data, keeping abreast of the curation effort and assessing genome quality in your region of interest.
  14. Let’s start at the GRC homepage. From there, you can get to organism-specific overview pages, through the toolbar at the top of the page. The overview pages contain links for downloading the current assembly and at the bottom of the page is a table with information about assembly regions that have alternate loci and patches. You can download the table and the panel on the left lets you search and filter the table by various criteria. Clicking on any region name in the table will take you to a separate page that provides you with more details about the region.
  15. You can find information about the assembly issues on which the GRC is working via the organism-specific “Issues Under Review” tabs. On top, an ideogram shows the genomic locations of issues, which are listed in the table below. On the left of the page, you can search or filter for issues. Clicking on a particular ideogram updates the page and table to a more specific view, where the annotations can be categorized by issue type or status. Hovering over the icons opens a pop-up with more detail. Within the table you will also find links to pages describing each of the individual issues.
  16. An Issue detail page has 3 parts, highlighted here: (1) at top, a brief description of the issue, plus an ideogram showing its genomic location; (2) in the middle, lists of patch and alternate loci sequences associated with the issue (if they exist); (3) at bottom, a graphical view of the issue region. Tracks in the display have been chosen for their utility in assembly assessment, and I’ll describe them in more detail shortly. If there’s a patch or alt loci scaffold associated with the issue, you can toggle the graphic to see it from the perspective of the chromosome (gap) or the patch/alternate loci scaffold (green closure), along with the sequence alignment.
  17. If you’re looking for assembly stats or want quick access to the chromosome sequences, the data tab takes you to a page that includes lengths, gap counts, N50s and global stats. Using the drop-down menu at the top, you can also find this data for previous assembly versions. Although I’m using mouse as an example here, I want to emphasize that all these resources are also available for zebrafish.
  18. Maybe most importantly, if you spot a potential problem with the genome, you can report this to us! In fact, we want you to. We prioritize work on user-reported issues. But if we don’t know about it, we can’t fix it!
  19. But the GRC website isn’t the only place you can access GRC assemblies. I’d like to shift gears a little now and talk about accessing the data at NCBI. If you’re starting at the NCBI homepage, select the “Assembly” database from the drop-down menu and search. You can search by assembly name, accession or by organism.
  20. For each assembly in the database, you’ll find a summary website that includes metadata, links to download the assembly from GenBank and RefSeq FTP, and the assembly statistics. And last but not least, if you’re looking at the latest assembly version annotated by RefSeq, you’ll find a link to view the assembly in the NCBI genome browser known as the Genome Data Viewer (GDV).
  21. The first time you visit GDV, it will display a default set of tracks. The display is managed through the “Tracks” menu button. From there, you can access a feature known as “Track Sets”, which allows one-click configuration of the display. The “Assembly Support” track set includes the tracks most valuable for assessing assembly quality. This is essentially the same set of tracks you’ll find on the GRC pages or in the GRC track hubs at the Ensembl and UCSC browsers. You can also do a custom configuration of GDV with the “Configure Tracks” option.
  22. Within the Assembly Support track set, the Assembly Components track shows the underlying sequences and gaps in the assembly, while the “Issues” track shows you where the GRC is curating the assembly. There’s also a track showing component sequencing problems. The “Clone Placement” track can be used to identify mis-assemblies or find clones of interest. On the lower left side of the browser is a section called “Region details”. If the chromosome region you’re looking at has alts/patches associated with it, you can click here to update the display to show those sequences instead. It also includes a link to the relevant issues at the GRC website. The “Your Data” section lets you upload your own data into the browser for viewing alongside the NCBI-provided tracks. You can use this combination of features to assess whether the genome is okay in your region of interest.
  23. If you work on zfish and you’re already thinking ahead to GRCz11, there are resources to help you remap data. The NCBI Remapping tool uses assembly-assembly alignments to project the features from one assembly to the other. You select the assemblies you want to map between, your remapping options and your input and output file formats. For those who need more than the web interface can offer, there is also a Perl API available.
  24. With that, I’d like to wrap things up and acknowledge the many contributors to this work. I hope I’ve left you with some idea of what’s happening with your reference assemblies, and how to find tools to help you assess them. We look forward to hearing from you!
  25. The region page provides: (1) the location of the region, including an ideogram; (2) lists of GRC issues and patch and alternate loci sequences associated with the region; (3) a graphical view of the region (in this view, the blue bars are the assembly components). You can toggle to see the graphic from the perspective of the chromosome or the patch/alternate loci scaffold. Tracks in the display include the alignment of the alt/patch to the chromosome, so you can see how they differ, plus other tracks useful for assembly assessment, which I’ll describe in more detail shortly.
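
Note 23 describes remapping as projecting features from one assembly to another through assembly-assembly alignments. The Python sketch below illustrates that general idea only; it is not the NCBI Remapping tool or its API. It assumes a deliberately simplified alignment made of ungapped, same-strand blocks, and every name and coordinate in it is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AlignedBlock:
    """One ungapped, same-strand aligned block (0-based, half-open coordinates)."""
    src_start: int  # start on the source assembly
    src_end: int    # end on the source assembly (exclusive)
    dst_start: int  # start of the matching sequence on the target assembly


def remap_feature(start: int, end: int,
                  blocks: List[AlignedBlock]) -> Optional[Tuple[int, int]]:
    """Project the feature [start, end) onto the target assembly.

    Returns None when the feature is not fully contained in a single block,
    for example when it overlaps sequence that changed between assemblies.
    """
    for block in blocks:
        if block.src_start <= start and end <= block.src_end:
            offset = block.dst_start - block.src_start
            return (start + offset, end + offset)
    return None


# Hypothetical alignment: 500 bp of new sequence was added in the target
# assembly after source position 1000, so the second block is shifted by 500.
blocks = [
    AlignedBlock(src_start=0, src_end=1000, dst_start=0),
    AlignedBlock(src_start=1000, src_end=2000, dst_start=1500),
]

print(remap_feature(1200, 1300, blocks))  # feature inside the second block
print(remap_feature(900, 1100, blocks))   # feature spanning the block boundary
```

In this sketch the first call returns shifted coordinates, while the second returns None because the feature straddles the point where the two assemblies differ. A real remapping service has to handle cases this sketch ignores, such as gapped alignments, strand changes and features split across blocks.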