2. Whole genome slims
● Provide a summary of an organism’s biology
● As a resource to plan curation (unannotated genes, intersections); to identify
“unknown/uncharacterised genes”
■ Need to be biologically relevant, reduced redundancy is better
■ Need as complete coverage as possible
Single gene overview (Alliance of Genome Resources ribbon)
○ Shows database users which branches of GO apply to a single gene product (filter)
■ Usually higher level general grouping terms (redundancy is less critical)
● To interpret analysis (slimming prior to enrichment helps to interpret results, orientation)
● Summarize/display experimental results: smallest possible number of terms, but specific enough to
convey results
● Taxon specific slim
Slimming results sets (subsets of genes)
● There is no “one size fits all”; different use cases need different slims.
● With a ‘generic slim’ we should provide instructions on how to refine it
Common Uses of GO Slims
3. Coverage 1: Only slim one aspect at a time!
All 3 aspects “unknown” = 103
Biological Process unknown = (103+195+23+429) = 750
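The difference between the two counts above can be illustrated with a minimal sketch. It assumes a hypothetical mapping from each gene to the set of aspects in which it has at least one non-root annotation; the gene names and the function `unknown_counts` are invented for illustration, and only the final sum uses numbers from the slide.

```python
# Aspect codes: "P" = biological process, "F" = molecular function,
# "C" = cellular component.
ASPECTS = {"P", "F", "C"}

def unknown_counts(known_aspects):
    """Return (genes unknown in all three aspects, genes unknown in BP)."""
    all_unknown = sum(1 for a in known_aspects.values() if not a & ASPECTS)
    bp_unknown = sum(1 for a in known_aspects.values() if "P" not in a)
    return all_unknown, bp_unknown

# Hypothetical toy data: gene -> aspects with a non-root annotation.
toy = {
    "geneA": {"P", "F", "C"},
    "geneB": {"F", "C"},   # process unknown, but function and component known
    "geneC": {"C"},        # process unknown, component known
    "geneD": set(),        # unknown in all three aspects
    "geneE": {"P"},
}

counts = unknown_counts(toy)
print(counts)  # (1, 3)

# Arithmetic from the slide: the BP-unknown total is the sum of four regions.
slide_bp_unknown = 103 + 195 + 23 + 429
print(slide_bp_unknown)  # 750
```

Counting per aspect gives the larger number, which is why slimming one aspect at a time matters when reporting "unknown" genes.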
4. Pombe using PomBase slim
Unslimmed
Unknown
Coverage 2: Distinguish unannotated/unknown/unslimmed
Unannotated = IDs not recognised by the slim tool (i.e. not in the GO database)
http://go.princeton.edu/cgi-bin/GOTermMapper
will provide all 3 numbers (and can use your own slim)
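The three categories can be made concrete with a small sketch. This is not how GO Term Mapper is implemented; it is an illustration under stated assumptions: annotations for one aspect are held in a plain dictionary, each term's ancestors are precomputed, and all names (`classify`, the toy terms and gene IDs) are hypothetical.

```python
def classify(gene_ids, annotations, ancestors, slim_terms, roots):
    """Split gene_ids into slimmed / unslimmed / unknown / unannotated.

    annotations: gene -> set of directly annotated terms (one aspect)
    ancestors:   term -> set of its ancestor terms (excluding itself)
    roots:       set of root terms, which carry no information
    """
    result = {"slimmed": [], "unslimmed": [], "unknown": [], "unannotated": []}
    for gene in gene_ids:
        if gene not in annotations:
            # ID not recognised at all
            result["unannotated"].append(gene)
            continue
        non_root = annotations[gene] - roots
        if not non_root:
            # Only root annotations: biologically "unknown"
            result["unknown"].append(gene)
            continue
        # A gene maps to the slim if an annotated term, or one of its
        # ancestors, is a slim term.
        closure = set(non_root)
        for term in non_root:
            closure |= ancestors.get(term, set())
        key = "slimmed" if closure & slim_terms else "unslimmed"
        result[key].append(gene)
    return result

# Hypothetical toy ontology and annotations.
ancestors = {
    "transmembrane transport": {"transport", "root"},
    "transport": {"root"},
    "detoxification": {"root"},
}
annotations = {
    "g1": {"transmembrane transport"},
    "g2": {"detoxification"},
    "g3": {"root"},
}
result = classify(["g1", "g2", "g3", "g4"], annotations, ancestors,
                  slim_terms={"transport"}, roots={"root"})
print(result)
```

Here `g2` is the interesting case: it has an informative annotation, but nothing in the slim covers it, so it counts as unslimmed rather than unknown.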
5. These 365 identifiers were not annotated in the slim, but they had non-root annotations that were not in
the slim:
These 734 identifiers had no non-root annotations:
Total 1099 un-slimmed
Pombe using AGR slim
This number should be small in
a slim with good coverage
Coverage 3: Minimise “unslimmed”
Pombe using PomBase slim
It is difficult to define a slim that covers all annotated gene
products without including terms that:
■ have very small numbers of annotations, or
■ are high level or biologically uninformative
6. PomBase AGR slim summary in matrix http://amigo.geneontology.org/matrix (Terms are on both axes, totals on the diagonal)
Some terms are not biologically informative for a generic slim
because they can apply to *any* biological process
Indicated by intersections with every process
Information content low
Non-specific
Exact subset
OK
Relevance 1: A balance between coverage and content
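A matrix of this kind (terms on both axes, totals on the diagonal) can be approximated from gene sets. The sketch below is only an illustration, not the AmiGO implementation: it assumes a hypothetical dictionary from slim term to the set of gene products mapped to it, and the helper names `cooccurrence` and `exact_subsets` are invented.

```python
from itertools import combinations_with_replacement

def cooccurrence(term_genes):
    """term_genes: slim term -> set of gene products mapped to it.

    Returns {(a, b): shared count} for a <= b in sorted order;
    the (a, a) entries are the per-term totals (the diagonal).
    """
    terms = sorted(term_genes)
    return {(a, b): len(term_genes[a] & term_genes[b])
            for a, b in combinations_with_replacement(terms, 2)}

def exact_subsets(term_genes):
    """Pairs (a, b) where every gene product of a is also mapped to b."""
    return [(a, b) for a in term_genes for b in term_genes
            if a != b and term_genes[a] and term_genes[a] <= term_genes[b]]

# Hypothetical toy data.
term_genes = {
    "transport": {"g1", "g2", "g3"},
    "transmembrane transport": {"g1", "g2"},
    "cell cycle": {"g4"},
}
matrix = cooccurrence(term_genes)
print(matrix[("transport", "transport")])                 # 3
print(matrix[("transmembrane transport", "transport")])   # 2
print(exact_subsets(term_genes))  # [('transmembrane transport', 'transport')]
```

A term whose row intersects heavily with every other term is a candidate for the "non-specific" label, and any pair reported by `exact_subsets` is a candidate "exact subset".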
7. Relevance 1: Avoid going “too high”
Broad groupings
Good for ribbon diagrams
(display)
Not good for summarizing
biology.
“Response to stimulus” is not very informative about biology
(but covers >8000 (33%) mouse gene products)
Regulation of biological process: 50%
Mouse AGR slim in matrix http://amigo.geneontology.org/matrix (Terms are on both axes, totals on the diagonal)
8. Mouse using AGR slim and GO term slim mapper
Your input list contains 22928 genes.
These 2037 identifiers were found to be unannotated:
These 420 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim:
These 4037 identifiers had no non-root annotations:
Deleting “cell differentiation” loses 0 (descendant of development).
Deleting “cell proliferation” loses only 5; most are covered by development and cell cycle.
Deleting “regulation of biological process” loses only 51, even though over half of the proteins (11658) are
annotated to it.
These 476 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim:
Relevance 2: Minimising overlaps (redundancy)
Because many gene products are annotated to multiple terms, it is not possible to
create a slim with no overlaps.
If removing a term doesn’t change the number of gene products slimmed, it might not be useful
in a slim.
Complete subset terms should be avoided
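The effect of removing a term can be estimated with a leave-one-out count. The sketch below assumes a hypothetical mapping from each gene product to the slim terms it maps to; the function name `lost_on_removal` and the toy data are illustrative. The mouse figures on this slide are consistent with the arithmetic 420 + 0 + 5 + 51 = 476.

```python
def lost_on_removal(gene_terms, slim_terms):
    """gene_terms: gene -> set of slim terms it maps to.

    Returns term -> number of genes that would become unslimmed if that
    single term were removed (genes for which it is the only slim term).
    """
    lost = {term: 0 for term in slim_terms}
    for terms in gene_terms.values():
        hits = terms & slim_terms
        if len(hits) == 1:
            lost[next(iter(hits))] += 1
    return lost

# Hypothetical toy data.
slim = {"development", "cell differentiation", "regulation", "cell cycle"}
gene_terms = {
    "g1": {"development", "cell differentiation"},
    "g2": {"development"},
    "g3": {"regulation"},
    "g4": {"regulation", "cell cycle"},
}
lost = lost_on_removal(gene_terms, slim)
print(lost["cell differentiation"])  # 0: every gene is also under development
print(lost["development"])           # 1

# Figures from the slide: unslimmed count before and after three deletions.
print(420 + 0 + 5 + 51)  # 476
```

Note that this counts each term on its own; removing several terms at once can lose more than the sum of the individual losses if some gene products map only to the removed terms.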
9. Relevance 3: Lumping vs. splitting with common parent
Very little intersection between/within modules
Largely unconnected, but have a common parent in the GO:
nuclear and mitochondrial gene expression
transmembrane, vesicle-mediated and nucleocytoplasmic transport
10. Relevance 3: Value of splitting, example with real data
Gene expression
From Hayles et al., A genome-wide resource of cell cycle and cell shape genes of fission yeast.
11. Current PomBase slim in matrix: overlaps low, information content high
Zero overlaps between vesicle-mediated transport, nucleocytoplasmic transport
and transmembrane transport; they are not biologically connected.
Relevance 3: Lumping vs. splitting with common parent
12. Relevance 4: Avoid single step processes
● GO:0016570 histone modification
● GO:0006468 protein phosphorylation
● GO:0006470 protein dephosphorylation
● GO:0043543 protein acylation
● GO:0016310 phosphorylation
● GO:0016311 dephosphorylation
● GO:0055114 oxidation-reduction process
● GO:0006464 cellular protein modification process
● GO:0043086 negative regulation of catalytic activity
All are examples of molecular function grouping terms in the BP ontology.
Not informative about physiological role, only biochemical role.
For this reason “protein metabolism”, the ancestor of the protein modification terms, should
also be avoided in the generic slim.
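A candidate slim can be screened against the single-step terms listed above. The sketch below is a minimal illustration: the GO IDs and labels in the dictionary are copied from this slide, while the function name `flag_single_step` and the candidate slim are hypothetical.

```python
# Terms listed on the slide as single-step (biochemical) processes.
SINGLE_STEP_TERMS = {
    "GO:0016570": "histone modification",
    "GO:0006468": "protein phosphorylation",
    "GO:0006470": "protein dephosphorylation",
    "GO:0043543": "protein acylation",
    "GO:0016310": "phosphorylation",
    "GO:0016311": "dephosphorylation",
    "GO:0055114": "oxidation-reduction process",
    "GO:0006464": "cellular protein modification process",
    "GO:0043086": "negative regulation of catalytic activity",
}

def flag_single_step(slim_ids):
    """Return the slim term IDs that appear in the single-step list, sorted."""
    return sorted(t for t in slim_ids if t in SINGLE_STEP_TERMS)

# Hypothetical candidate slim: one flagged term plus GO:0007049 (cell cycle).
candidate_slim = {"GO:0006468", "GO:0007049"}
flagged = flag_single_step(candidate_slim)
for term_id in flagged:
    print(term_id, SINGLE_STEP_TERMS[term_id])
```

The list is only a starting point; other grouping terms of the same kind would need to be added by a curator.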
13. Proposed Iterative procedure
Evaluate individual species coverage of existing generic slim (BP)
What is missing? Add terms to cover
Evaluate species coverage
Which terms could be removed without affecting coverage? Remove
Test (evaluate species coverage changes)
What is missing? Add terms to cover
Evaluate species coverage
Which terms should be split to improve biological relevance? Split
Check coverage was not affected (or recommend improved annotation
specificity)
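The "What is missing? Add terms to cover" step could be partly automated. The sketch below uses a greedy set-cover heuristic, which is an assumption made for illustration and not a method stated in these slides; the function `greedy_add`, the `max_new` limit and the toy data are hypothetical. Biological relevance of each suggested term would still need manual review.

```python
def greedy_add(candidate_genes, unslimmed, max_new=5):
    """Suggest terms to add to a slim.

    candidate_genes: candidate term -> set of genes it would cover
    unslimmed:       genes not covered by the current slim
    Repeatedly picks the candidate covering the most remaining genes.
    Returns (chosen terms in order, genes still uncovered).
    """
    remaining = set(unslimmed)
    chosen = []
    while remaining and len(chosen) < max_new:
        best = max(candidate_genes,
                   key=lambda t: len(candidate_genes[t] & remaining),
                   default=None)
        if best is None or not candidate_genes[best] & remaining:
            break  # no candidate improves coverage any further
        chosen.append(best)
        remaining -= candidate_genes[best]
    return chosen, remaining

# Hypothetical toy data.
candidates = {
    "detoxification": {"g1", "g2"},
    "vitamin metabolic process": {"g3"},
    "cofactor metabolic process": {"g3", "g4"},
}
chosen, remaining = greedy_add(candidates, {"g1", "g2", "g3", "g4", "g5"})
print(chosen)     # ['detoxification', 'cofactor metabolic process']
print(remaining)  # {'g5'}
```

Genes left in `remaining` after the loop are candidates for the "recommend improved annotation specificity" branch of the procedure.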
15. Possible changes to evaluate
Remove
cell proliferation
cell differentiation
cellular component organization and biogenesis
RNA processing (see Gene expression)
regulation of biological process
Add
cytoskeleton organization
chromatin organization
ribosome biogenesis (>1000 annot)
tRNA metabolic process (1157 annot)
gene expression (includes translation)
Not covered currently
detoxification
amino acid metabolic process
vitamin metabolic process
cofactor metabolic process
(or small molecule?)