Previewing GRCm39: Assembly Updates from the GRC

Presentation at IMGC 2019 workshop describing the latest improvements to the mouse reference genome assembly and analyses performed in preparation for the next release of the mouse genome assembly (GRCm39).

Published in: Science

  1. Previewing GRCm39: assembly updates from the GRC. Tayebeh Rezaie, Ph.D., NCBI. 25 September 2019.
  2. Contributors: Valerie Schneider, Kerstin Howe, Tina Graves, Paul Flicek, Tayebeh Rezaie, Nathan Bouk, Hsiu-Chuan Chen, Jo Wood, Joanna Collins, Sarah Pelan, Will Chow, James Torrance, Derek Albracht, Milinn Kremitzki, Laura Clarke, Jane Loveland, NCBI RefSeq and GenColl. This work was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.
  3. GRC assembly model
     • Primary assembly unit:
       o C57BL/6J chromosomes
       o Unlocalized and unplaced scaffolds (a scaffold is an ordered and oriented (O/O) set of contigs)
     • Strain-specific assembly units:
       o Sequence from clones representing other strains, for regions needing additional representation
     • Patches assembly unit
  4. What is the mouse genome assembly? Figure: mouse chromosome 1, GRCm38. The reference is derived from C57BL/6J. http://genomereference.org
  5. How do we curate the assembly?
     • Technology: sequencing, FISH, optical mapping, alignment of clone ends, assembly of Illumina reads
       o Sequencing: clones
       o FISH: localization of unlocalized sequences
       o Optical mapping: gap sizing, path problems
     • Resources: clones, WGS, PCR products
     • Examples of assembly curation in GRCm38C:
       o Gap closure
       o Correction of clone assembly problems
       o Correction of path problems
       o Representation of strain variation
  6. An overview of GRCm39 from analyses of GRCm38C, the second intermediate build
     • The release of GRCm39 is planned for early 2020
     • Minor (patch) release: a non-coordinate-changing assembly version
     • Major release: a coordinate-changing assembly version
  7. Improving the reference assembly
     • Six minor (patch) releases since the GRCm38 release in 2012
     • Patch releases are non-coordinate-changing assembly versions:
       o FIX patches (chromosome path changes)
       o NOVEL patches (alternate representations of chromosome sequences, derived from other strains)
     • GRCm38 released updates as of GRCm38.p6: 65 FIX patches, 9 NOVEL patches
     • Figure: genome issues resolved post-GRCm38 (bar chart, total = 473). Category labels: Gap, Clone, Variation, Missing, GRC, Path, Unknown, Localization. Percentage labels: 39%, 21%, 7%, 12%, 15%, 2.7%, 2%, 1.3%.
     • Gap + Clone = 60% of all issues resolved post-GRCm38
  8. GRCm38/GRCm38C assembly statistics

     | Metric | GRCm38 (GCF_000001635.20) | GRCm38C (GCF_008087425.1) |
     |---|---|---|
     | Total length (primary) | 2,730,855,475 | 2,733,095,204 |
     | Total length (all) | 2,793,712,140 | 2,798,405,461 |
     | Total assembly gap length (all) | 79,291,755 | 78,606,933 |
     | Number of gaps between scaffolds | 191 | 151 |
     | Number of gaps within scaffolds | 443 | 213 |
     | Scaffold N50 | 54,517,951 | 100,923,795 (85% increase) |
     | Contig N50 | 32,273,079 | 57,461,838 (78% increase) |

     • GRCm38C has fewer gaps and is more contiguous than GRCm38
     • In GRCm38C: 5 chromosomes are single scaffolds (11, 12, 15, 16, 18), 11 are built from 2 scaffolds, and 5 are built from more than 2 scaffolds
  9. Assembly component updates between GRCm38 and GRCm38C
     • Number of components with a change: 666 (~3.2%)
       o Added: 315 (6,640,992 bp)
         - Clones + PCR: 77
         - Assembled Illumina reads: 95
         - WGS from the 'MmusSOAP1' and 'MmusALLPATHS2' assemblies: 81
         - WGS from MGSCv3 (original mouse genome project): 17
         - WGS from the Eve assembly: 45
       o Dropped: 330 (3,555,784 bp)
         - WGS from MGSCv3 replaced with higher-accuracy sequence: 310 (94%)
       o Version bumped: 15
       o Strand flipped: 4
       o Version bumped + strand flipped: 2
     • Our evaluation of scaffold/component changes in GRCm38C found no unexpected changes.
  10. RefSeq transcript analysis

      | Metric | GRCm38 primary unit | GRCm38C primary unit |
      |---|---|---|
      | Number of sequences retrieved from Entrez | 42721 | 42721 |
      | Number of sequences not aligning* | 6 | 2 |
      | Number of sequences with multiple best alignments (split transcripts)† | 1 | 2 |
      | Number of sequences with CDS coverage <95% | 41 | 19 |

      * The 2 transcripts that align to neither the GRCm38C nor the GRCm38 primary unit:
        o Olfr100 (annotated on an alternate sequence from 129X1/SvJ)
        o Rs5-8s1
      * The other 4 transcripts not aligning to the GRCm38 primary unit:
        o Ahsp (clone problem) and Copg2os2 (gap), both corrected
        o Sts (PAR)
        o Rn45s
      † Transcripts whose GRCm38C alignments are split by a gap:
        o Sts (PAR); no alignment to GRCm38
        o Rn45s
  11. Genes with improved representation in GRCm38C: 4933416I08Rik, Ahnak2, Anxa13, Atg4a, Auts2, Baalc, Cct6a, Cylc1, Dgkk, Dnah12, Efcab7, Ide, Ifi30, Intu, Jakmip3, Kazn, Kndc1, Krt85, Mia3, Muc2, Muc3, Muc4, Muc6, Nadk2, Nhej1, Nkain1, Nlrp4g, Pik3c2g, Ppp2r3d, Pstpip2, Ptpmt1, Rab3a, Ranbp3l, Rasgrf2, Rhox5, Rims1, Sgms2, Slc26a6, Spata5l1, Spry3, Taf1a, Tmem134, Traf5, Trerf1, Vezf1
  12. Assembly gap closure and complete representation of Efcab7
  13. Correction of a false GRCm38 assembly gap caused by haplotype incompatibility
  14. View the curation status of mouse genome issues: http://genomereference.org
  15. Unresolved genome issues: current curation status
      • Resolution likelihoods were determined by GRC review
      • Optical mapping was used to size the remaining gaps and FISH to localize unlocalized sequences
      • A major obstacle: the repetitive nature of some genomic regions, including segmental duplications
  16. Base report
      • Sources:
        o Sanger Mouse Genomes Project (n=4,148)
        o Eve assembly publication (n=267)
        o An additional 236 bases reported in Eve are included in the Sanger set
      • Goal: update erroneous or very rare GRCm38 bases
      • Analysis: evaluate support for these bases
        o Align Illumina reads derived from another C57BL/6J sample to GRCm38 (Gnerre et al.; PMID: 21187386)
        o Generate pile-up results from the alignments
        o Categorize results as homozygous REF, homozygous ALT, or heterozygous
      • Results:
        o All bases in common with the Eve set were homozygous ALT
        o Bases reported only from Eve (n=184): 25% homozygous REF, 75% homozygous ALT
  17. Evaluation of erroneous or very rare GRCm38 bases
      • Base report sources: Sanger Mouse Genomes Project (n=4,148); Eve assembly publication (n=267)
      • Evaluation of consequences with VEP: Mouse Genomes Project bases (PMID: 21187386)
      • Sites in CDS/genes: 45
        o 34 homozygous REF
        o 11 homozygous ALT
  18. Conclusion and future
      • The GRC is currently preparing for the release of GRCm39, planned for early 2020
      • Upon the release of GRCm39, the GRC's curation of the mouse genome reference assembly will be limited to the resolution of community-reported problems
        o Contact us with a question, to report an assembly issue, or to request information about a genomic region of interest: https://www.ncbi.nlm.nih.gov/grc/contact-us
        o See the GRC blog posts: http://genomeref.blogspot.com/
        o For FAQs and other assembly help: https://www.ncbi.nlm.nih.gov/grc/help/
        o For more information, see my poster P43 on Thursday
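
The contiguity figures on slide 8 can be checked with a short calculation. The sketch below is not the GRC's own tooling. It assumes the usual N50 definition (the largest length L such that sequences of length L or longer contain at least half of the total bases), runs it on a made-up list of toy lengths, and then recomputes the percent increases from the N50 values quoted on the slide. The function names `n50` and `percent_increase` are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: walk the lengths from longest to shortest and
    return the length at which the running total first reaches half of
    the summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


def percent_increase(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100


# Hypothetical toy scaffold lengths, not taken from any real assembly.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)

# N50 values as quoted on slide 8 (GRCm38 vs. GRCm38C).
scaffold_gain = percent_increase(54_517_951, 100_923_795)
contig_gain = percent_increase(32_273_079, 57_461_838)

print(f"toy N50: {toy_n50}")
print(f"scaffold N50 increase: {scaffold_gain:.0f}%")
print(f"contig N50 increase: {contig_gain:.0f}%")
```

Because N50 depends only on the length distribution, a higher N50 with a nearly unchanged total length, as reported on the slide, reflects pieces being joined rather than sequence being added.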
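
The pile-up categorization described on slide 16 can be sketched as follows. The slides do not state the read-depth or allele-fraction cutoffs that were used, so `min_depth` and `hom_fraction` below are assumed placeholder values, the extra "uncalled" category is an addition for low-coverage sites, and the read counts are invented for illustration.

```python
from collections import Counter


def classify_site(ref_count, alt_count, min_depth=10, hom_fraction=0.9):
    """Categorize one reported base from its pile-up read counts.

    min_depth and hom_fraction are assumed thresholds, not values
    taken from the presentation.
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return "uncalled"  # too few reads to support any call
    if alt_count / depth >= hom_fraction:
        return "homozygous ALT"
    if ref_count / depth >= hom_fraction:
        return "homozygous REF"
    return "heterozygous"


# Hypothetical (ref_count, alt_count) pairs for four reported sites.
example_sites = [(30, 0), (0, 25), (14, 16), (2, 1)]
calls = [classify_site(ref, alt) for ref, alt in example_sites]
summary = Counter(calls)

for label, count in summary.items():
    print(f"{label}: {count}")
```

Under this scheme, a site called homozygous ALT in an independent C57BL/6J sample is a candidate for a reference base update, whereas a homozygous REF call supports the existing GRCm38 base.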
