Presentation at IMGC 2019 workshop describing the latest improvements to the mouse reference genome assembly and analyses performed in preparation for the next release of the mouse genome assembly (GRCm39).
2. Contributed:
• Valerie Schneider
• Kerstin Howe
• Tina Graves
• Paul Flicek
• Tayebeh Rezaie
• Nathan Bouk
• Hsiu-Chuan Chen
• Jo Wood
• Joanna Collins
• Sarah Pelan
• Will Chow
• James Torrance
• Derek Albracht
• Milinn Kremitzki
• Laura Clarke
• Jane Loveland
• NCBI RefSeq and GenColl
This work was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.
3. • Primary assembly unit:
• C57BL/6J chromosomes
• Unlocalized and unplaced scaffolds
(scaffold: ordered and oriented set of contigs)
• Strain-specific assembly units:
• Sequence from clones representing
other strains, for regions needing
additional representation
• Patches assembly unit
GRC Assembly model
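The assembly model on this slide can be sketched as a small data structure. This is only an illustration: the class names, field names, and the toy scaffold names below are hypothetical and are not taken from any GRC or NCBI software. It assumes the usual reading of the terms, where an unlocalized scaffold has a known chromosome but no known position, and an unplaced scaffold has no known chromosome.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Scaffold:
    """An ordered and oriented set of contigs (hypothetical model)."""
    name: str
    contigs: List[str]
    chromosome: Optional[str] = None  # None: chromosome unknown
    localized: bool = False           # True: position on the chromosome is known

    @property
    def placement(self) -> str:
        if self.chromosome is None:
            return "unplaced"
        return "placed" if self.localized else "unlocalized"


@dataclass
class AssemblyUnit:
    """One unit of the assembly: primary, strain-specific, or patches."""
    name: str
    kind: str
    scaffolds: List[Scaffold] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    units: List[AssemblyUnit] = field(default_factory=list)

    def units_of_kind(self, kind: str) -> List[AssemblyUnit]:
        return [u for u in self.units if u.kind == kind]


# Toy example with made-up scaffold and contig names.
primary = AssemblyUnit(
    name="C57BL/6J",
    kind="primary",
    scaffolds=[
        Scaffold("scaf_a", ["ctg1", "ctg2"], chromosome="1", localized=True),
        Scaffold("scaf_b", ["ctg3"], chromosome="1"),
        Scaffold("scaf_c", ["ctg4"]),
    ],
)
toy = Assembly("toy_assembly", [primary, AssemblyUnit("PATCHES", "patches")])
placements = [s.placement for s in primary.scaffolds]
```

Keeping patches in their own unit mirrors the slide: the primary unit can stay fixed while fix and novel patch scaffolds accumulate alongside it.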
4. Mouse Chr. 1, GRCm38
What is the mouse genome assembly?
Reference is from C57BL/6J
http://genomereference.org
5. How do we curate the assembly?
• Technology: sequencing, FISH, optical mapping,
clone end alignments, assembly of Illumina reads
• Sequencing: clones
• FISH: localization of unlocalized sequences
• Optical Mapping: gap sizing, path problem
• Resources: clones, WGS, PCR products
• Gap closure
• Correction of clone assembly problems
• Path problem correction
• Representation of strain variation
Examples of assembly curation in GRCm38C
6. Release of GRCm39 is planned for early 2020. This talk gives an overview of
GRCm39 based on analyses of GRCm38C, the second intermediate build.
Minor or patch release: non-coordinate changing assembly versions
Major release: coordinate changing assembly versions
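The distinction between minor and major releases can be illustrated with a short sketch that reads a public release name such as GRCm38.p6 and checks whether two releases share a coordinate system. The function names and the regular expression are assumptions made for this example, not part of any GRC tool. Intermediate builds such as GRCm38C are not public release names and are deliberately rejected.

```python
import re
from typing import NamedTuple


class GrcVersion(NamedTuple):
    prefix: str  # e.g. "GRCm" for mouse
    major: int   # coordinate-changing release number
    patch: int   # 0 when the name has no ".pN" suffix


# Assumed naming pattern: prefix, major number, optional ".p<patch>".
_NAME = re.compile(r"^(GRC[a-z])(\d+)(?:\.p(\d+))?$")


def parse_grc_name(name: str) -> GrcVersion:
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a public GRC release name: {name!r}")
    return GrcVersion(match.group(1), int(match.group(2)), int(match.group(3) or 0))


def same_coordinates(a: str, b: str) -> bool:
    """True when two releases differ only by patch level."""
    va, vb = parse_grc_name(a), parse_grc_name(b)
    return va.prefix == vb.prefix and va.major == vb.major
```

For example, `same_coordinates("GRCm38", "GRCm38.p6")` is true, while `same_coordinates("GRCm38.p6", "GRCm39")` is false because GRCm39 is a major release.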
7. Genome issues resolved post-GRCm38
Updates as of GRCm38.p6
• 65 FIX patches
• 9 NOVEL patches
GRCm38 Released updates
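[Bar chart: genome issues resolved post-GRCm38, by issue type; axis range 0 to 200; total = 473]
Issue types shown: Gap, Clone, Variation, Missing, GRC, Path, Unknown, Localization
Percentage labels shown: 39%, 21%, 7%, 12%, 15%, 2.7%, 2%, 1.3%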
Gap + Clone = 60% of all resolved post-GRCm38
Improving the reference assembly
Six minor (patch) releases since the GRCm38 release in 2012
• Patch releases: non-coordinate changing assembly versions
• Fix patches (chromosome path changes)
• Novel patches (alternate representations of chromosome
sequences, derived from other strains)
8.
Statistic                       | GRCm38 (GCF_000001635.20) | GRCm38C (GCF_008087425.1)
Total length (Primary)          | 2,730,855,475             | 2,733,095,204
Total length (all)              | 2,793,712,140             | 2,798,405,461
Total assembly gap length (all) | 79,291,755                | 78,606,933
# gaps between scaffolds        | 191                       | 151
# gaps within scaffolds         | 443                       | 213
Scaffold N50                    | 54,517,951                | 100,923,795 (85% increase)
Contig N50                      | 32,273,079                | 57,461,838 (78% increase)
GRCm38C has fewer gaps and is more contiguous than GRCm38
In GRCm38C: 5 single-scaffold chromosomes (11, 12, 15, 16, 18), 11 built from 2 scaffolds, 5 built from >2 scaffolds
GRCm38/GRCm38C assemblies stats
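For readers unfamiliar with the N50 statistic, here is a minimal sketch of how it is commonly computed, together with a check of the percentage increases quoted on this slide. It uses the usual definition (the length of the shortest sequence in the smallest set of longest sequences that covers at least half of the total length). The function names are chosen for this example, and this is not the pipeline used to produce the slide's numbers.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


# N50 values as reported on the slide (GRCm38 -> GRCm38C).
scaffold_gain = percent_increase(54_517_951, 100_923_795)
contig_gain = percent_increase(32_273_079, 57_461_838)
print(f"Scaffold N50 increase: {scaffold_gain:.1f}%")
print(f"Contig N50 increase:   {contig_gain:.1f}%")
```

Rounded to whole percentages, the two results agree with the 85% and 78% figures reported above.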
9. Assembly component updates between GRCm38/GRCm38C
Number of components with change = 666 (~3.2%)
o Added: 315 (6,640,992 bp)
Clones + PCR: 77
Assembled Illumina reads: 95
WGS from 'MmusSOAP1' & 'MmusALLPATHS2' assemblies: 81
WGS from MGSCv3 (original mouse genome project): 17
WGS from Eve assembly: 45
o Dropped: 330 (3,555,784 bp)
WGS from MGSCv3 replaced with higher-accuracy sequence: 310 (94%)
o Version bumped: 15
o Strand flipped: 4
o Version bumped + Strand flipped: 2
Our evaluation of scaffold/component changes in GRCm38C found no unexpected changes.
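The component counts on this slide can be cross-checked with simple arithmetic. The sketch below only re-adds the numbers reported above; the variable names are chosen for this example. The net base-pair figure is the difference between added and dropped component sequence and is not expected to equal the change in total assembly length, which also depends on gaps and overlaps.

```python
# Counts copied from the slide.
added_by_source = {
    "Clones + PCR": 77,
    "Assembled Illumina reads": 95,
    "WGS from MmusSOAP1 and MmusALLPATHS2 assemblies": 81,
    "WGS from MGSCv3": 17,
    "WGS from Eve assembly": 45,
}
changes = {
    "added": 315,
    "dropped": 330,
    "version bumped": 15,
    "strand flipped": 4,
    "version bumped + strand flipped": 2,
}

added_total = sum(added_by_source.values())
total_changes = sum(changes.values())
dropped_mgscv3_pct = 100.0 * 310 / changes["dropped"]
net_component_bp = 6_640_992 - 3_555_784  # added bp minus dropped bp

print(f"Added components by source: {added_total}")
print(f"Components with change:     {total_changes}")
print(f"Dropped MGSCv3 WGS share:   {dropped_mgscv3_pct:.0f}%")
print(f"Net component sequence:     {net_component_bp:,} bp")
```

The per-source counts sum to the 315 added components, and the five change categories sum to the 666 components reported as changed.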
10. RefSeq Transcript Analysis
Metric                                                                  | GRCm38 Primary Unit | GRCm38C Primary Unit
Number of sequences retrieved from Entrez                               | 42721               | 42721
Number of sequences not aligning*                                       | 6                   | 2
Number of sequences with multiple best alignments (split transcripts)†  | 1                   | 2
Number of sequences with CDS coverage <95%                              | 41                  | 19
*The 2 transcripts not aligning to either the GRCm38C or the GRCm38 primary unit:
• Olfr100 (annotated on alt from 129X1/SvJ)
• Rs5-8s1
†Transcripts whose GRCm38C alignments are split by a gap:
• Sts (PAR); no alignment to GRCm38
• Rn45s
*The other 4 transcripts not aligning to the GRCm38 primary unit:
• Ahsp (clone problem) and Copg2os2 (gap), corrected
• Sts (PAR)
• Rn45s
15. Unresolved genome issues
Current curation status
Resolution likelihoods as determined by GRC review;
used optical mapping to size remaining gaps and FISH
to localize unlocalized sequences.
A major obstacle: the repetitive nature of some genomic
regions, including segmental duplications
16. Base Report Sources:
• Sanger mouse genomes project (n=4,148)
• Eve assembly publication (n=267)
• An additional 236 bases reported in Eve are included in the Sanger set
Analysis: Evaluate support for these bases
• Align Illumina reads derived from another C57BL/6J
sample to GRCm38 (Gnerre et al.; PMID: 21187386)
• Generate pile-up results from alignments
• Categorize results as: homozygous REF, homozygous
ALT and heterozygous
Goal: Update erroneous or very rare GRCm38 bases
*All bases common with the Eve set were homozygous ALT
*Bases reported only from Eve (n=184): 25% hom REF, 75% hom ALT
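The categorization step in the analysis above can be sketched as follows. This is an illustration only: the slide does not state the thresholds or software used, so the minimum depth, the homozygosity fraction, the "no call" category, and the function name are all assumptions made for this example.

```python
from collections import Counter


def classify_site(ref, alt, pileup_bases, min_depth=10, hom_fraction=0.9):
    """Classify one site from the read bases piled up over it.

    ref, alt: the GRCm38 base and the reported alternative base.
    pileup_bases: iterable of read bases covering the site.
    min_depth, hom_fraction: assumed thresholds, not taken from the slide.
    """
    counts = Counter(base.upper() for base in pileup_bases)
    ref_n = counts[ref.upper()]
    alt_n = counts[alt.upper()]
    depth = ref_n + alt_n  # reads supporting any other base are ignored
    if depth < min_depth:
        return "no call"
    if ref_n / depth >= hom_fraction:
        return "homozygous REF"
    if alt_n / depth >= hom_fraction:
        return "homozygous ALT"
    return "heterozygous"


print(classify_site("A", "G", "GGGGGGGGGGGG"))
print(classify_site("A", "G", "AAAAAAGGGGGG"))
```

Under this scheme, a site where the independent C57BL/6J reads are homozygous ALT is a candidate for a base update, whereas a homozygous REF result supports the existing GRCm38 base.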
17. Evaluation of consequences with VEP
Mouse Genomes Project Bases
Sites in CDS/genes: 45
• 34 homozygous REF
• 11 homozygous ALT
Evaluation of erroneous or very rare GRCm38 bases
18. Conclusion and future:
• The GRC is currently preparing for the release of GRCm39
• Upon the release of GRCm39, the GRC's curation of the mouse genome reference
assembly will be limited to the resolution of community reported problems
o Contact us with a question, to report an assembly issue, or to request information about a
genomic region of interest: https://www.ncbi.nlm.nih.gov/grc/contact-us
o See GRC blog posts: http://genomeref.blogspot.com/
o For FAQs and other assembly help: https://www.ncbi.nlm.nih.gov/grc/help/
o For more information see my poster P43 on Thursday
Release of GRCm39 is planned in early 2020