This document discusses multiple efforts related to developing reference genomes and gene annotations for laboratory mouse strains:
1) Genome assemblies for several strains have been improved using Illumina sequencing and Dovetail scaffolding, and validated with PacBio long-read alignments.
2) Gene predictions are being developed using a combination of annotation lifting from C57BL/6J, local refinement with strain-specific RNA-seq data, and de novo prediction.
3) Resources have been created for viewing and accessing these new reference genomes and annotations.
Multiple mouse reference genomes and strain specific gene annotations
1. Multiple mouse reference genomes and strain specific gene annotations
Thomas Keane,
Wellcome Trust Sanger Institute
@drtkeane @mousegenomes
tk2@sanger.ac.uk
2. Sequence variation
➢ 36 inbred strains with whole-genome Illumina sequencing
➢ SNPs, indels, and structural variants
➢ Are there more inbred strains with deep whole-genome Illumina sequencing?
➢ LG/J, SM/J, and JF1/MsJ pending
Anthony Doran, WTSI
3. Genome assemblies
➢ REL-1412: Illumina mate pair based de novo scaffolds
➢ REL-1504: Pseudo-chromosomes
○ Alignment synteny with GRCm38
○ Evaluation with PacBio WGS/cDNA showed excessive reference bias
➢ REL-1509: Pseudo-chromosomes based on breakpoint graphs
○ Dovetail Genomics scaffolds for CAST/EiJ, PWK/PhJ, and SPRET/EiJ
[Figure: assembly stages. 1. Contigs (paired-end Illumina); 2. Scaffolds (large fragment ends: 3, 6, 10 kb, Dovetail, BAC ends); 3. Pseudo-chromosomes, e.g. Chr1 (whole-genome alignments)]
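The synteny-based pseudo-chromosome stage can be illustrated with a minimal sketch. It assumes each scaffold has already been aligned to GRCm38 and summarised as hypothetical tuples of (scaffold, reference chromosome, reference start, strand, aligned bases); none of these names come from the slides, and the real pipeline is not described in enough detail here to reproduce. Each scaffold is placed on the chromosome where it aligns best and scaffolds are ordered by reference position, which is also why this kind of layout can inherit reference bias.

```python
from collections import defaultdict

def build_pseudo_chromosomes(alignments):
    """Order scaffolds along reference chromosomes (illustrative only).

    alignments: iterable of (scaffold, ref_chrom, ref_start, strand, aligned_bases).
    Returns {ref_chrom: [(scaffold, strand), ...]} sorted by reference start.
    """
    # Keep, for each scaffold, the placement with the most aligned bases.
    best = {}
    for scaffold, chrom, start, strand, bases in alignments:
        if scaffold not in best or bases > best[scaffold][3]:
            best[scaffold] = (chrom, start, strand, bases)

    # Group scaffolds by chromosome, then sort by reference coordinate.
    layout = defaultdict(list)
    for scaffold, (chrom, start, strand, _) in best.items():
        layout[chrom].append((start, scaffold, strand))
    return {
        chrom: [(scaffold, strand) for _, scaffold, strand in sorted(items)]
        for chrom, items in layout.items()
    }

# Hypothetical input: scf2 aligns to two chromosomes, mostly to chr1.
example = [
    ("scf2", "chr1", 5000, "-", 900),
    ("scf1", "chr1", 100, "+", 1200),
    ("scf2", "chr2", 10, "+", 50),
    ("scf3", "chr2", 700, "+", 400),
]
layout = build_pseudo_chromosomes(example)
```

Because the order is taken entirely from the reference, true strain-specific rearrangements would be silently "corrected" back to the GRCm38 arrangement in a layout like this.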
4. PacBio alignments
➢ Use PacBio long reads alignment contiguity to validate the chromosome sequence
➢ Compare the number of inconsistently mapped reads
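One way to picture the comparison is the sketch below. It assumes each PacBio read has been reduced to a list of alignment segments, written as hypothetical tuples of (chromosome, strand, query start, reference start), and it uses a deliberately simple definition of "inconsistent": segments on more than one chromosome or strand, or reference positions that do not move in one direction along the read. The slides do not state the criteria actually used.

```python
def is_consistent(segments):
    """Return True if a read's segments look collinear on one chromosome.

    segments: list of (chrom, strand, query_start, ref_start) for one read.
    """
    if len(segments) <= 1:
        return True
    if len({s[0] for s in segments}) > 1:   # split across chromosomes
        return False
    if len({s[1] for s in segments}) > 1:   # mixed orientations
        return False
    # Walk along the read and require reference positions to be monotonic.
    ref = [s[3] for s in sorted(segments, key=lambda s: s[2])]
    increasing = all(a <= b for a, b in zip(ref, ref[1:]))
    decreasing = all(a >= b for a, b in zip(ref, ref[1:]))
    return increasing or decreasing

def count_inconsistent(reads):
    """reads: {read_id: [segment, ...]}. Returns the number of inconsistent reads."""
    return sum(1 for segs in reads.values() if not is_consistent(segs))

# Hypothetical reads aligned to one candidate assembly.
reads = {
    "r1": [("chr1", "+", 0, 1000), ("chr1", "+", 5000, 6000)],
    "r2": [("chr1", "+", 0, 1000), ("chr2", "+", 4000, 50)],
    "r3": [("chr1", "+", 0, 9000), ("chr1", "+", 3000, 2000), ("chr1", "+", 6000, 5000)],
}
n_bad = count_inconsistent(reads)
```

Running the same count against two assembly versions of a strain gives a simple number to compare: the version with fewer inconsistent reads agrees better with the long-read data.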
6. PWK/PhJ
Dovetail Genomics: CAST/EiJ, PWK/PhJ, SPRET/EiJ
A) High molecular weight (50+ kbp) input DNA
B) Reconstitute chromatin from the input DNA
C) Addition of a fixative agent (e.g., formaldehyde) produces crosslinks
D) Crosslinked chromatin is digested with a restriction endonuclease to generate sticky-ended fragments
E+F) DNA ligase is added to perform blunt-end ligation of the many ends within a given chromatin aggregate
G) Chromatin is removed, and the DNA is purified and processed to remove biotin
Finally, enrich for biotin-containing fragments and prepare the sequencing library
http://dovetailgenomics.com/
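Read pairs from a library like this link sequences that were close together in the same chromatin aggregate, so the number of pairs connecting two scaffolds is evidence that they are neighbours. The sketch below is a generic greedy joiner under that assumption, not Dovetail's scaffolding software; the input format, the `min_links` threshold, and the function name are all hypothetical.

```python
def greedy_joins(link_counts, min_links=5):
    """Greedily join scaffolds into linear chains by read-pair link count.

    link_counts: {(scaffold_a, scaffold_b): number_of_linking_read_pairs}.
    Returns the accepted joins, strongest first.
    """
    parent = {}   # union-find forest, used to avoid closing a chain into a cycle
    degree = {}   # a scaffold inside a linear chain has at most two neighbours

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    joins = []
    for (a, b), n in sorted(link_counts.items(), key=lambda kv: -kv[1]):
        if n < min_links:
            break   # remaining pairs are weaker still
        if degree.get(a, 0) >= 2 or degree.get(b, 0) >= 2:
            continue
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue   # a and b are already in the same chain
        parent[root_a] = root_b
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        joins.append((a, b))
    return joins

# Hypothetical link counts between four scaffolds.
links = {("s1", "s2"): 40, ("s2", "s3"): 30, ("s1", "s3"): 20, ("s3", "s4"): 3}
joins = greedy_joins(links)
```

A real scaffolder would also use insert-size distributions to estimate orientation and gap length; this sketch only decides adjacency.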
14. Gene prediction approach
➢ TransMap - utilise as much of the Gencode C57BL/6J genome annotation as possible
○ Local Augustus - refine the lift over to allow small adjustments based on strain-specific RNA-Seq
[Figure: evidence (Gencode M7, C57BL/6J; RNA-Seq, strain specific) and methods (TransMap; TransMap+local Augustus)]
Ian Fiddes, UCSC
Stefanie König, U. Greifswald
Mario Stanke, U. Greifswald
15. How many genes have at least one fully correct transcript?
Ian Fiddes, UCSC
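The question on this slide reduces to a per-gene tally. A minimal sketch follows, assuming each lifted-over transcript has already been classified as fully correct or not; how "fully correct" is decided is not given in the slides, so the boolean flag and the record layout here are hypothetical.

```python
from collections import defaultdict

def genes_with_correct_transcript(transcripts):
    """Count genes that have at least one fully correct transcript.

    transcripts: iterable of (gene_id, fully_correct) pairs.
    Returns (genes_with_a_correct_transcript, total_genes).
    """
    per_gene = defaultdict(bool)
    for gene_id, fully_correct in transcripts:
        # A gene counts as soon as any one of its transcripts is correct.
        per_gene[gene_id] = per_gene[gene_id] or bool(fully_correct)
    return sum(per_gene.values()), len(per_gene)

# Hypothetical classification of four transcripts from three genes.
records = [("g1", True), ("g1", False), ("g2", False), ("g3", True)]
n_ok, n_genes = genes_with_correct_transcript(records)
```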
16. Gene prediction approach
➢ TransMap - lift over as much of the Gencode C57BL/6J genome annotation as possible
○ Local Augustus - refine the lift over to allow small adjustments based on strain-specific RNA-Seq
➢ Comparative gene prediction: Augustus CGP
○ Generate gene predictions based primarily on RNA-Seq evidence
○ Allows for predictions of new transcripts and exons absent in C57BL/6J
[Figure: evidence (Gencode M7, C57BL/6J; RNA-Seq, strain specific) and methods (TransMap; TransMap+local Augustus; Augustus CGP)]
Ian Fiddes, UCSC
Stefanie König, U. Greifswald
Mario Stanke, U. Greifswald
17. Gene prediction approach
➢ TransMap - utilise as much of the Gencode C57BL/6J genome annotation as possible
○ Local Augustus - refine the lift over to allow small adjustments based on strain-specific RNA-Seq
➢ Comparative gene prediction: Augustus CGP
○ Generate gene predictions based primarily on RNA-Seq evidence
○ Allows for predictions of new transcripts and exons absent in C57BL/6J
[Figure: evidence (Gencode M7, C57BL/6J; RNA-Seq, strain specific) and methods (TransMap; TransMap+local Augustus; Augustus CGP) combined into a consensus gene set]
Ian Fiddes, UCSC
Stefanie König, U. Greifswald
Mario Stanke, U. Greifswald
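The slides do not say how the three sets of predictions are merged into the consensus gene set, so the following is only a sketch of one plausible rule. It assumes every candidate carries a hypothetical RNA-Seq `support` score between 0 and 1: for each transcript the best-supported candidate wins, ties go to the least modified source, and novel Augustus CGP predictions are kept only above an assumed threshold.

```python
# Lower number = closer to the original Gencode annotation; used to break ties.
SOURCE_PRIORITY = {"TransMap": 0, "TransMap+Augustus": 1, "AugustusCGP": 2}

def consensus_gene_set(candidates, min_novel_support=0.8):
    """Pick one prediction per transcript (illustrative rule, not the authors').

    candidates: list of dicts with keys
        transcript_id, source, support (0..1), novel (bool).
    Returns {transcript_id: chosen_source}.
    """
    best = {}
    for c in candidates:
        # Novel CGP predictions have no reference annotation to back them,
        # so require strong RNA-Seq support before admitting them.
        if c["source"] == "AugustusCGP" and c["novel"] and c["support"] < min_novel_support:
            continue
        key = (c["support"], -SOURCE_PRIORITY[c["source"]])
        tid = c["transcript_id"]
        if tid not in best or key > best[tid][0]:
            best[tid] = (key, c["source"])
    return {tid: source for tid, (_, source) in best.items()}

# Hypothetical candidates for two annotated transcripts and two novel ones.
candidates = [
    {"transcript_id": "T1", "source": "TransMap", "support": 0.6, "novel": False},
    {"transcript_id": "T1", "source": "TransMap+Augustus", "support": 0.9, "novel": False},
    {"transcript_id": "T2", "source": "TransMap", "support": 0.7, "novel": False},
    {"transcript_id": "T2", "source": "TransMap+Augustus", "support": 0.7, "novel": False},
    {"transcript_id": "N1", "source": "AugustusCGP", "support": 0.95, "novel": True},
    {"transcript_id": "N2", "source": "AugustusCGP", "support": 0.3, "novel": True},
]
chosen = consensus_gene_set(candidates)
```

Preferring the plain TransMap result on ties keeps the strain annotation as close as possible to Gencode unless the strain's own RNA-Seq argues for a change.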
24. How can I look at the genomes?
http://hgwdev-mus-strain.sdsc.edu
Mark Diekhans, UCSC
Ian Fiddes, UCSC
26. Change co-ordinate system to strain of interest
http://hgwdev-mus-strain.sdsc.edu
Mark Diekhans, UCSC
Ian Fiddes, UCSC
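Switching the coordinate system amounts to mapping a position through a whole-genome alignment. The sketch below shows the idea with ungapped, same-strand alignment blocks written as hypothetical (reference start, reference end, strain start) tuples with half-open intervals; it is not the browser's implementation, and real alignments also contain inversions and gaps that this ignores.

```python
import bisect

def lift_position(blocks, pos):
    """Map a reference position into strain coordinates.

    blocks: list of (ref_start, ref_end, strain_start), sorted by ref_start,
            non-overlapping, half-open [ref_start, ref_end).
    Returns the strain position, or None if pos is not inside any block.
    """
    starts = [b[0] for b in blocks]
    i = bisect.bisect_right(starts, pos) - 1   # last block starting at or before pos
    if i < 0:
        return None
    ref_start, ref_end, strain_start = blocks[i]
    if pos >= ref_end:
        return None   # pos falls in an unaligned gap
    return strain_start + (pos - ref_start)

# Hypothetical blocks: the strain lacks 50 reference bases between the two blocks.
blocks = [(100, 200, 1000), (250, 400, 1100)]
mapped = lift_position(blocks, 300)
```

Positions that return `None` are the interesting ones when comparing strains, since they mark sequence present in the reference but unaligned in the strain.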
27. How can I look at the genomes?
Developed and maintained by the Genome Reference Informatics Team
http://mice-geval.sanger.ac.uk
Kerstin Howe, WTSI
28. Acknowledgements
➢ Wellcome Trust Sanger Institute
○ Anthony Doran, Kim Wong, Dirk-Dominik Dolle, Jingtao Lilue, Monica Abrudan
○ David Adams, Richard Durbin, Kerstin Howe, Jennifer Harrow, Charles Steward, Mark Thomas, Ruth Bennet, Jo Wood, James Torrance, Will Chow, Mike Quail, Matt Dunn, Marcela Sjoberg, James Gilbert, Ed Griffiths, Anne Ferguson-Smith
➢ UCSC
○ Benedict Paten, Joel Armstrong, Mark Diekhans, Dent Earl, Ian Fiddes
➢ EBI
○ David Thybert, Duncan Odom, Paul Flicek
➢ University of Greifswald
○ Mario Stanke, Stefanie König
➢ Salk Institute
○ Son Pham, Mikhail Kolmogorov
➢ Yale
○ Fabio Navarro, Cristina Sisu, Mark Gerstein
➢ Wellcome Trust Centre for Human Genetics
○ Jonathan Flint, Richard Mott, Leo Goodstadt
➢ Jackson Laboratory
○ Laura Reinholdt, Anne Czechanski
➢ URLs
○ http://www.sanger.ac.uk/science/data/mouse-genomes-project
○ http://hgwdev-mus-strain.sdsc.edu
○ http://mice-geval.sanger.ac.uk/index.html
2014-2017 2015-2018
Sequence Variation Infrastructure Group, WTSI