Previewing GRCm39: assembly updates from the GRC
Tayebeh Rezaie, Ph.D.
NCBI
25 September 2019
Contributors:
• Valerie Schneider
• Kerstin Howe
• Tina Graves
• Paul Flicek
• Tayebeh Rezaie
• Nathan Bouk
• Hsiu-Chuan Chen
• Jo Wood
• Joanna Collins
• Sarah Pelan
• Will Chow
• James Torrance
• Derek Albracht
• Milinn Kremitzki
• Laura Clarke
• Jane Loveland
• NCBI RefSeq and GenColl
This work was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.
• Primary assembly unit:
• C57BL/6J chromosomes
• Unlocalized and unplaced scaffolds
(scaffold: ordered and oriented (O/O) set of contigs)
• Strain-specific assembly units:
• Sequence from clones representing other strains, for regions needing additional representation
• Patches assembly unit
GRC Assembly model
Mouse Chr. 1, GRCm38
What is mouse genome assembly?
Reference is from C57BL/6J
http://genomereference.org
How do we do assembly curation?
• Technology: sequencing, FISH, optical mapping, alignments of clone ends, assembly of Illumina reads
• Sequencing: clones
• FISH: localization of unlocalized sequences
• Optical Mapping: gap sizing, path problem
• Resources: clones, WGS, PCR products
• Gap closure
• Correction of clone assembly problem
• Path problem correction
• Represent strain-variation
Examples of assembly curation in GRCm38C
Release of GRCm39 is planned for early 2020. This talk gives an overview of
GRCm39 based on analyses of GRCm38C, the second intermediate build.
Minor or patch release: non-coordinate changing assembly versions
Major release: coordinate changing assembly versions
Genome issues resolved post-GRCm38
Updates as of GRCm38.p6
• 65 FIX patches
• 9 NOVEL patches
GRCm38 Released updates
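[Bar chart: genome issues resolved post-GRCm38, by issue category; axis from 0 to 200; total = 473]
Categories, in slide order: Gap, Clone, Variation, Missing, GRC, Path, Unknown, Localization
Shares, in slide order: 39%, 21%, 7%, 12%, 15%, 2.7%, 2%, 1.3%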
Gap + Clone = 60% of all resolved post-GRCm38
Improving the reference assembly
Six minor/patch releases since the GRCm38 release in 2012
• Patch releases: non-coordinate changing assembly versions
• Fix patches (chromosome path changes)
• Novel patches (alternate representations of chromosome sequences, derived from other strains)
GRCm38/GRCm38C assembly statistics
Statistic                     GRCm38 (GCF_000001635.20)    GRCm38C (GCF_008087425.1)
Total length (primary)        2,730,855,475                2,733,095,204
Total length (all)            2,793,712,140                2,798,405,461
Total assembly gap length     79,291,755 (all)             78,606,933 (all)
# gaps between scaffolds      191                          151
# gaps within scaffolds       443                          213
Scaffold N50                  54,517,951                   100,923,795 (85% increase)
Contig N50                    32,273,079                   57,461,838 (78% increase)
• GRCm38C has fewer gaps and is more contiguous than GRCm38
• In GRCm38C: 5 single-scaffold chromosomes (11, 12, 15, 16, 18), 11 built from 2 scaffolds, 5 built from >2 scaffolds
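For readers less familiar with the N50 statistic reported here, the following is a minimal sketch of how an N50 and a percent increase can be computed. The function names and the toy length list are illustrative assumptions; only the four N50 values are taken from the slide, and this is not the GRC's actual tooling.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def percent_increase(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# Toy scaffold lengths (hypothetical, for illustration only).
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)

# N50 values reported on the slide for GRCm38 and GRCm38C.
scaffold_gain = percent_increase(54_517_951, 100_923_795)
contig_gain = percent_increase(32_273_079, 57_461_838)

print(toy_n50, round(scaffold_gain), round(contig_gain))
```

Applied to the slide's values, the rounded gains agree with the 85% and 78% increases quoted in the table.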
Assembly component updates between GRCm38/GRCm38C
Number of components with change = 666 (~3.2%)
o Added: 315 (6,640,992 bp)
    • Clones + PCR: 77
    • Assembled Illumina reads: 95
    • WGS from 'MmusSOAP1' & 'MmusALLPATHS2' assemblies: 81
    • WGS from MGSCv3 (original mouse genome project): 17
    • WGS from Eve assembly: 45
o Dropped: 330 (3,555,784 bp)
    • WGS from MGSCv3 replaced with higher-accuracy sequence: 310 (94%)
o Version bumped: 15
o Strand flipped: 4
o Version bumped + strand flipped: 2
• Our evaluation of scaffold/component changes in GRCm38C found no unexpected changes.
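As a quick consistency check on the component tallies above, the arithmetic can be sketched as follows. The dictionary keys are hypothetical labels chosen for illustration; all counts are copied from the slide.

```python
# Sources of added components (counts from the slide; keys are illustrative).
added_by_source = {
    "clones_and_pcr": 77,
    "assembled_illumina_reads": 95,
    "wgs_soap1_allpaths2": 81,
    "wgs_mgscv3": 17,
    "wgs_eve": 45,
}

# Types of component change between GRCm38 and GRCm38C.
changes_by_type = {
    "added": 315,
    "dropped": 330,
    "version_bumped": 15,
    "strand_flipped": 4,
    "version_bumped_and_strand_flipped": 2,
}

added_total = sum(added_by_source.values())
changed_total = sum(changes_by_type.values())

# Share of dropped components that were MGSCv3 WGS replaced by better sequence.
mgscv3_share_of_dropped = 310 / changes_by_type["dropped"] * 100

print(added_total, changed_total, round(mgscv3_share_of_dropped))
```

The per-source counts sum to the 315 added components, and the five change types sum to the 666 changed components reported.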
RefSeq Transcript Analysis
Metric                                                   GRCm38 Primary Unit    GRCm38C Primary Unit
Number of sequences retrieved from Entrez                42,721                 42,721
Number of sequences not aligning*                        6                      2
Number of sequences with multiple best alignments
(split transcripts)†                                     1                      2
Number of sequences with CDS coverage <95%               41                     19
*The 2 transcripts not aligning to either the GRCm38C or the GRCm38 primary unit:
• Olfr100 (annotated on alt from 129X1/SvJ)
• Rs5-8s1
*The other 4 not aligning to the GRCm38 primary unit:
• Ahsp (clone problem) and Copg2os2 (gap), corrected
• Sts (PAR)
• Rn45s
†Transcripts whose GRCm38C alignments are split by a gap:
• Sts (PAR), no alignment to GRCm38
• Rn45s
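The "CDS coverage <95%" metric can be illustrated with a minimal sketch, assuming a transcript's alignment is summarized as half-open blocks in transcript coordinates. The function name, the block representation, and the example coordinates are assumptions for illustration and do not describe the actual RefSeq alignment pipeline.

```python
def cds_coverage(cds_start, cds_end, aligned_blocks):
    """Fraction of the CDS [cds_start, cds_end) covered by aligned blocks.

    aligned_blocks is a list of (start, end) half-open intervals in
    transcript coordinates; overlapping blocks are counted only once.
    """
    covered = 0
    last_end = cds_start
    for start, end in sorted(aligned_blocks):
        start = max(start, last_end)  # clip to the CDS and to prior coverage
        end = min(end, cds_end)
        if end > start:
            covered += end - start
            last_end = end
    return covered / (cds_end - cds_start)


# Hypothetical transcript with a 1,000-base CDS at positions 100..1100.
small_gap = cds_coverage(100, 1100, [(0, 600), (650, 1200)])
large_gap = cds_coverage(100, 1100, [(0, 600), (700, 1200)])

# A transcript would be counted in the "<95%" row when coverage < 0.95.
flagged = [cov < 0.95 for cov in (small_gap, large_gap)]
print(small_gap, large_gap, flagged)
```

In this toy case, a 50-base unaligned stretch leaves coverage at exactly 95% (not flagged), while a 100-base stretch drops it to 90% (flagged).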
Genes with improved representation in GRCm38C
4933416I08Rik Dnah12 Mia3 Pik3c2g Sgms2
Ahnak2 Efcab7 Muc2 Ppp2r3d Slc26a6
Anxa13 Ide Muc3 Pstpip2 Spata5l1
Atg4a Ifi30 Muc4 Ptpmt1 Spry3
Auts2 Intu Muc6 Rab3a Taf1a
Baalc Jakmip3 Nadk2 Ranbp3l Tmem134
Cct6a Kazn Nhej1 Rasgrf2 Traf5
Cylc1 Kndc1 Nkain1 Rhox5 Trerf1
Dgkk Krt85 Nlrp4g Rims1 Vezf1
Assembly gap closure and complete representation of Efcab7
Correction of a false assembly gap in GRCm38 caused by haplotype incompatibility
View curation status of Mouse Genome Issues
http://genomereference.org
Unresolved genome issues: current curation status
Resolution likelihoods were determined by GRC review;
optical mapping was used to size remaining gaps and FISH
to localize unlocalized sequences.
A major obstacle: the repetitive nature of some genomic
regions, including segmental duplications
Base Report Sources:
• Sanger mouse genomes project (n=4,148)
• Eve assembly publication (n=267)
• An additional 236 bases reported in Eve are included in the Sanger set
Analysis: Evaluate support for these bases
• Align Illumina reads derived from another C57BL/6J
sample to GRCm38 (Gnerre et al.; PMID: 21187386)
• Generate pile-up results from alignments
• Categorize results as: homozygous REF, homozygous
ALT and heterozygous
Goal: Update erroneous or very rare GRCm38 bases
*All bases common with the Eve set were homozygous ALT
*Bases reported only from Eve (n=184): 25% hom REF, 75% hom ALT
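The categorization step can be sketched as follows, assuming each site's pile-up is reduced to counts of reads supporting the REF and ALT alleles. The function name, the minimum depth, and the 90% homozygosity threshold are illustrative assumptions; the slides do not state the thresholds actually used.

```python
def classify_site(ref_count, alt_count, min_depth=10, hom_fraction=0.9):
    """Classify a site from pile-up allele counts.

    min_depth and hom_fraction are assumed values for illustration only.
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return "uncallable"  # too few reads to support a call
    if alt_count / depth >= hom_fraction:
        return "homozygous ALT"
    if ref_count / depth >= hom_fraction:
        return "homozygous REF"
    return "heterozygous"


# Hypothetical (REF reads, ALT reads) counts at four reported sites.
example_sites = [(30, 0), (1, 29), (15, 15), (2, 1)]
calls = [classify_site(ref, alt) for ref, alt in example_sites]
print(calls)
```

Under this scheme, a homozygous ALT call in an independent C57BL/6J sample suggests that the GRCm38 base is erroneous or very rare, while a homozygous REF call supports the existing reference base.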
Evaluation of consequences with VEP
Mouse Genomes Project Bases
Sites in CDS/genes: 45
• 34 homozygous REF
• 11 homozygous ALT
Evaluation of erroneous or very rare GRCm38 bases
Base Report Sources:
• Sanger mouse genomes project (n=4,148)
• Eve assembly publication (n=267)
Conclusion and future:
• The GRC is currently preparing for the release of GRCm39
• Upon the release of GRCm39, the GRC's curation of the mouse genome reference
assembly will be limited to resolving community-reported problems
o Contact us to ask a question, report an assembly issue, or request information about
a genomic region of interest: https://www.ncbi.nlm.nih.gov/grc/contact-us
o See GRC blog posts: http://genomeref.blogspot.com/
o For FAQs and other assembly help: https://www.ncbi.nlm.nih.gov/grc/help/
o For more information see my poster P43 on Thursday
Release of GRCm39 is planned in early 2020
http://genomereference.org
