PomBase conventions for improving annotation depth, breadth, consistency and accuracy


Annotation numbers are important
…but numbers aren’t everything…
• Use of annotation for data-mining and data-analysis is limited
by errors, inconsistencies and omissions.
• PomBase combines annotation conventions, which improve
information content (annotation coverage, specificity and
redundancy), with QC mechanisms that identify possible
annotation inconsistencies and errors.
• In combination these mechanisms address many recurring
annotation issues.

1. The definition is critical
All ontology terms have a “fixed” definition
• If a definition is misleading or incorrect, its meaning cannot
be changed. To fix it, the term is obsoleted and its annotations
are migrated.
• This makes annotations very robust to ontology changes. If
a term needs to be repositioned, the annotations remain
correct.
• We annotate to the definition, not the term name. Always
check the definition.

2. Improving annotation specificity
• i) Consider descendant terms
• ii) Veto use of uninformative terms

2i. Consider descendants
Annotate as specifically as the experiment allows and be
unambiguous about the biology
• regulation: positive or negative?
• translation: cytoplasmic or mitochondrial?
• transport: of what? to where? how?
• chromosome segregation: mitotic or meiotic?
If the available terms are insufficient, request a more specific
term

2i. Consider descendants e.g.
• For a carboxylic acid carrier, “carboxylic acid transport”
looks initially OK
• However, “transmembrane transport” is not explicit here…
Carboxylic acid might be transported in other ways…

2i. Consider descendants e.g.
More specific annotation can provide additional detail, e.g.
• substrate,
• type (transmembrane),
• sometimes directionality
Additional parents increase the information content, because
the gene product is annotated indirectly to more terms.

2ii. Veto use of non-specific terms
Identify the set of ontology terms where more specific
annotation should be possible (more biological detail)
Examples:
• cellular process (which one?)
• translation (cytoplasmic? mitochondrial?)
• transport (of what? to where?)
Some GO terms are already flagged as not for manual
annotation. Review and improve annotations to vetoed terms.
PomBase blocks 1298 upper-level GO terms for direct
annotation (<200 violations)
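
The veto check described above can be sketched as a simple lookup. This is a minimal illustration rather than PomBase's actual pipeline: the `VETOED_TERMS` set, the `find_veto_violations` helper and the gene names are hypothetical, and only three of the blocked terms are shown.

```python
# Hypothetical sketch: flag direct annotations to vetoed (too general) GO terms.
# The term set and the annotations are illustrative, not PomBase data.

VETOED_TERMS = {
    "GO:0009987",  # cellular process
    "GO:0006412",  # translation
    "GO:0006810",  # transport
}

def find_veto_violations(annotations, vetoed=VETOED_TERMS):
    """Return (gene, term) pairs annotated directly to a vetoed term."""
    return [(gene, term) for gene, term in annotations if term in vetoed]

annotations = [
    ("geneA", "GO:0006810"),  # transport: of what? to where?
    ("geneB", "GO:0002181"),  # cytoplasmic translation: specific enough
]

violations = find_veto_violations(annotations)
print(violations)
```

Each reported pair is a candidate for review: either a more specific descendant term can be used, or a new term needs to be requested.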

3. Improve the ontologies

3i. Missing parents
Original arrangement: these process annotations were originally in
different branches of the ontology, so all annotations were required.

New arrangement: collapsed 6 processes to 2. Exactly the same
information content, less redundancy, easier for users to interpret
annotation.

3.ii Report incorrect parents
AKA “True Path Violations” or “TPVs”
For example
protein maturation
--protein processing (part_of)
----proteolysis (part_of)
(not all proteolysis is processing or
maturation)

4. The power of Annotation Extensions
Provide additional specificity for a GO annotation e.g.
• Target gene (kinase substrate, TF regulation target)
• Location of a function
• Localization dependencies (protein A localizes protein B)
• Spatial and temporal aspects of processes, functions, locations (cell cycle stage
of occurrence)
• ADD an example of a gene product specific AE
See: Huntley et al. A method for increasing expressivity of Gene Ontology
annotations using a compositional approach. PMID:24885854

4. Annotation Extension e.g. cdc2
cyclin-dependent protein serine/threonine kinase
• has substrate fkh2 involved in negative regulation of conjugation with cellular fusion
• directly inhibits srw1 involved in positive regulation of G1/S transition
• has substrate drc1 involved in positive regulation of mitotic cell cycle DNA replication
• has substrate cdc18, orc2 involved in negative regulation of DNA replication during mitotic G2 phase
• has substrate xlf1 involved in negative regulation of double-strand break repair via nonhomologous end joining
during mitotic G2 phase
• has substrate rap1 involved in negative regulation of mitotic telomere tethering at nuclear periphery
during mitotic M phase
• has substrate hcn1 during mitotic M phase
• has substrate cut3 involved in positive regulation of mitotic chromosome condensation during mitotic metaphase
• has substrate mde4 involved in correction of merotelic attachment, mitotic during mitotic metaphase
• has substrate nsk1 involved in negative regulation of attachment of mitotic spindle microtubules during mitotic
metaphase
• has substrate mde4, cut7 involved in negative regulation of mitotic spindle elongation during mitotic metaphase
• has substrate klp9 involved in negative regulation of mitotic spindle elongation during mitotic anaphase A
• directly inhibits clp1 involved in negative regulation of exit from mitosis
• has substrate byr4 involved in positive regulation of septation initiation signaling
• directly inhibits dis2
• has substrate rum1, crb2, sds23
Link function (cyclin-dependent kinase) to target genes, processes,
and temporal information

Alternative (human CDK1):
Not scalable or maintainable

4. Using AE for effectors
• Reciprocal of the extension (automated) called “target of”
• Collects known “upstream effectors” on cdc2 page

4. Using Annotation Extensions to generate networks/pathways
• We can use effector-substrate connections to generate
networks (interaction, metabolic, regulatory)
• Provide directional links to support pathway reconstruction
[Figure: example network; node labels: sty1, cmk2, srk1, rum1, atf1, gsa1, gpx1, ntp1, sro1, ish1]

4. Automated AE networks e.g.
44/59 connected in automated network based on annotated
connections within “regulation of G2/M transition” (fission yeast)
(Network for each GO slim category from the slim page)
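
A connectivity figure of this kind could be computed along the following lines. This is a sketch under stated assumptions: "connected" is taken to mean membership of the largest group of genes linked by annotated connections, edges are treated as undirected for this purpose, and the gene identifiers and edges are placeholders rather than real annotations.

```python
from collections import defaultdict

def largest_connected_set(genes, edges):
    """Return the largest set of genes linked by annotated connections.

    Edges are treated as undirected here, because the question is only
    whether genes are connected at all, not in which direction.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a in genes and b in genes:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, best = set(), set()
    for start in genes:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Placeholder gene identifiers and edges, not real annotations.
genes = {"g1", "g2", "g3", "g4", "g5"}
edges = [("g1", "g2"), ("g2", "g3")]

connected = largest_connected_set(genes, edges)
print(f"{len(connected)}/{len(genes)} connected")
```

Genes left outside the connected set are useful curation targets: they are annotated to the process but have no annotated link to any other member.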

5. Suppress redundant IEA annotation
• PomBase pipelines filter redundant IEA
(Inferred from Electronic Annotation)
evidence
• Removes >90% of IEA (because a manual
annotation already exists)
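
The redundancy filter can be sketched as follows. The rule used here is an assumption made for illustration: an IEA is treated as redundant when the same gene has a manual annotation to the same term or to one of its descendants. The `ANCESTORS` map, the term names and the `filter_redundant_iea` helper are hypothetical.

```python
# Hypothetical sketch of a redundancy filter for IEA annotations.
# ANCESTORS maps each term to all of its ancestors (assumed to be
# precomputed from the ontology); term names are placeholders.
ANCESTORS = {
    "child": {"parent", "root"},
    "parent": {"root"},
    "other": {"root"},
    "root": set(),
}

def filter_redundant_iea(manual, iea, ancestors=ANCESTORS):
    """Keep only IEA annotations not already covered by a manual annotation.

    An IEA (gene, term) counts as redundant when the same gene has a manual
    annotation to that term or to one of its descendants.
    """
    covered = set()
    for gene, term in manual:
        covered.add((gene, term))
        covered.update((gene, anc) for anc in ancestors[term])
    return [pair for pair in iea if pair not in covered]

manual = [("geneA", "child")]
iea = [("geneA", "parent"), ("geneA", "other"), ("geneB", "parent")]

kept = filter_redundant_iea(manual, iea)
print(kept)
```

The IEAs that survive the filter are the ones worth inspecting: each points to an incorrect mapping, a missing ontology parent, or a missing manual annotation.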

13 annotations are reduced to 4
Same information, fewer terms

5. Suppress redundant IEA, QC of mappings
Incorrect annotations are more easily spotted
Mis16 is not involved in ‘chromatin modification’ -> fix mapping

5. Suppress redundant IEA, QC of ontology (SPBC543.05c)
Missing parents in the ontology become more obvious
“inorganic anion exchanger” should be an ‘ancestor’ of
GO:0005452, to suppress the IEA as redundant

Reducing IEAs over time
• >40,000 fission yeast IEAs available.
• PomBase filters 36,000 as redundant and retains 4,000 (IEAs are at
least 90% accurate if the manual annotations are correct).
• It is easier to evaluate the remaining IEAs to identify/fix
anomalies

5. Suppress redundant IEA
• More concise view with zero loss of information
• IEA mappings derived from a single experiment/publication
can be interpreted as proof by repetition and make weak EXP
data appear multiply supported/acceptable
• Fewer annotations, easier QC of remaining IEAs
Q: “Why isn’t an IEA covered by manual annotation?” Either:
1. Incorrect mapping
2. Missing parent in ontology
3. Missing annotation -> find supporting evidence and
annotate manually (EXP or ISO)
(PomBase also filters NAS/TAS/IC)

6. Annotate by process (pathway)
• Annotating by process rather than “ad hoc”
improves consistency and allows ‘annotation
gaps’ to be targeted
• Process papers more quickly: become familiar with
an area of biology and the techniques used, so you
don’t need to read the background every time and
can recognise phenotypes.

6. Annotate by process or pathway
From PMID:22898774: Regulation of the metaphase/anaphase
transition by the MCC, the APC and upstream signalling
Identify obvious missing annotation, for example between
complex members

6. Annotate by process or pathway
[Figure labels: cdc20, proteasome, APC, separase, cohesin subunit, securin, post transition, SAC/MCC]
Can perform QC on processes or components,
e.g. use STRING to evaluate outliers (potential annotation
errors). Input list: “regulation of mitotic metaphase/anaphase
transition”
Can also ask “are any complex members missing?”
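
The "are any complex members missing?" question can be asked programmatically. The sketch below is illustrative only: the complex names, subunit names and the `missing_complex_members` helper are placeholders, and complex membership is assumed to be available as a simple mapping.

```python
def missing_complex_members(complexes, annotated_genes):
    """List unannotated members of each complex that has at least one annotated member."""
    gaps = {}
    for name, members in complexes.items():
        annotated = members & annotated_genes
        missing = members - annotated_genes
        if annotated and missing:
            gaps[name] = sorted(missing)
    return gaps

# Placeholder complexes and subunit names, not real data.
complexes = {
    "complexX": {"sub1", "sub2", "sub3"},
    "complexY": {"sub4", "sub5"},
}
annotated_to_process = {"sub1", "sub2", "sub4", "sub5"}

gaps = missing_complex_members(complexes, annotated_to_process)
print(gaps)
```

A reported gap is not automatically an error; it is a prompt to check whether evidence exists for annotating the remaining subunit to the same process.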

7. Assess annotation at the organismal level
• We are annotating whole organisms… use a
holistic, whole-organism annotation approach
• Evaluate annotation breadth (coverage) using
slims
• Evaluate intersections between slim processes

7. Evaluate organismal annotation
coverage using “slims”
• EXP supported BP
• ISO/IEA inferred BP
‘unknowns’
• Species specific, no
inference possible
• Conserved, but
unannotated in any
species

7. Browsable Slim:
http://preview.pombase.org/browse-curation/fission-yeast-go-slim-terms

7. Sensible assignments?
Periodically check that slim class contents look sensible
(e.g. DNA recombination)

7. Monitor unslimmed gene products
Note: Exclude biologically uninformative terms like “phosphorylation” or
“response to chemical” as these could apply to any real biological role.
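
Monitoring unslimmed gene products can be sketched as a set comparison. Everything named here is an assumption made for illustration: the gene names, the term sets and the `unslimmed_genes` helper are hypothetical, and each gene's term set is assumed to already include ancestor terms.

```python
# Terms excluded from the slim because they could apply to almost any role.
UNINFORMATIVE = {"phosphorylation", "response to chemical"}

def unslimmed_genes(gene_terms, slim_terms, uninformative=UNINFORMATIVE):
    """Return annotated genes that are not covered by any informative slim term.

    gene_terms maps gene -> set of process terms (ancestors included).
    Genes with no annotation at all are skipped; they are 'unknown',
    not 'unslimmed'.
    """
    effective_slim = slim_terms - uninformative
    return sorted(
        gene
        for gene, terms in gene_terms.items()
        if terms and not (terms & effective_slim)
    )

# Placeholder data, not real annotations.
gene_terms = {
    "geneA": {"cytokinesis", "phosphorylation"},
    "geneB": {"phosphorylation"},
    "geneC": {"cell adhesion"},
    "geneD": set(),
}
slim_terms = {"cytokinesis", "ribosome biogenesis", "phosphorylation"}

unslimmed = unslimmed_genes(gene_terms, slim_terms)
print(unslimmed)
```

Here geneB is reported even though "phosphorylation" was listed in the slim, because uninformative terms are removed before coverage is assessed.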

7. Visual slim, all pombe proteins
[Figure: visual GO slim of all fission yeast proteins, grouped into modules, with the number of gene products per category and intersection counts between categories. TOTAL 5054; Unknown 830. Counts are omitted below where they could not be recovered from the figure.]
• CELL DIVISION 751: cytoskeleton org 206; nuclear DNA replication, recombination, repair 305; mitotic chromosome segregation 184; regulation of mitotic cell cycle 232; cytokinesis 110
• MITOCHONDRIAL ORG/EXP 280
• MEMBRANES, TRAFFICKING, CELL SURFACE 787: cell wall org; lipid met 222; vesicle mediated transport 324; glycosylation, polysacc met 140; membrane org 199
• SMALL MOLECULE TM TRANSPORT 288; detox
• CENTRAL MET, ENERGY AND BUILDING BLOCKS 549: AA & sulfur met 220; vitamin cofactor met; nucleobase/-side/-tide met 219; small sugar met 77; nitrogen; other energy generation
• signalling 404
• sexual reproductive process 262 (many intersections)
• EXPRESSION 1294; EXPRESSION submod 863: ribosome biogenesis 317; RNA metabolism 772; cytoplasmic translation 249; nucleocyto transport 110; transcription 479
• PROTEIN ASSEMBLY/STABILITY 765: protein catabolism & autophagy 251; ubiquitination 192; folding 102; complex assembly 325
• Other 290: no intersections. Includes adhesion, many proteases, peroxins

7. Evaluate intersections between slim categories
Many GO processes are rarely co-annotated because they are
functionally, spatially or temporally distant. For example, we would
not expect “ribosome biogenesis” to intersect with “vitamin
metabolism”.
We can use this observation to identify potential conflicts using
the GO term matrix
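
A pairwise intersection table of this kind can be sketched in a few lines. This is an illustrative approximation rather than the GO term matrix tool itself: the gene sets, the `expected_disjoint` list and both helper functions are hypothetical.

```python
from itertools import combinations

def slim_intersections(slim_genes):
    """Count the genes shared by each pair of slim categories."""
    counts = {}
    for a, b in combinations(sorted(slim_genes), 2):
        counts[(a, b)] = len(slim_genes[a] & slim_genes[b])
    return counts

def unexpected(counts, expected_disjoint):
    """Return non-empty intersections between pairs expected not to overlap."""
    return {
        pair: n
        for pair, n in counts.items()
        if n > 0 and frozenset(pair) in expected_disjoint
    }

# Placeholder gene sets, not real annotations.
slim_genes = {
    "ribosome biogenesis": {"g1", "g2", "g3"},
    "vitamin metabolism": {"g3", "g4"},
    "cytoplasmic translation": {"g2", "g5"},
}
expected_disjoint = {frozenset({"ribosome biogenesis", "vitamin metabolism"})}

counts = slim_intersections(slim_genes)
flags = unexpected(counts, expected_disjoint)
print(flags)
```

Each flagged pair identifies genes (here g3) whose annotations should be reviewed for an annotation, mapping or ontology error.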

Fission yeast intersections Jan 2012

Fission yeast intersections March 2017

7. Identifies ontology errors (e.g.)
DNA metabolism and chromosome segregation do not usually intersect
Regulation of chromosome condensation should not be a DNA metabolic process

7. Ontology error (e.g.)
Genes annotated to folic acid metabolism were also incorrectly annotated to amino acid
metabolism. Folic acid was classified as an amino acid by ChEBI -> fix ChEBI, which fixes GO

7. Finds incorrect mappings (e.g.)
Intersection between tRNA metabolism and transcription.
Elongator is no longer thought to have a direct role in transcription; mapping removed

8. Consider Author intent
Think about the biology the author intended
e.g. rubidium ion transmembrane transporter/transport
Rubidium ion is used to assay K+ transport, not rubidium transport
(rubidium is a non-physiological substrate)
e.g. Apoptosis (RPS19)
Rps19 mutant displayed condensed DNA, a fragmented nucleus
and caspase activation, indicative of apoptosis.
Since RPS19 has an essential role in ribosome biogenesis,
apoptosis is likely to be an indirect effect of disrupting an
upstream process, translation (i.e. an experimental readout)

9. Communication with the author
and community curation
• Most authors are happy to discuss their publications. If unsure
about an annotation, ask them. PomBase routinely uses
authors as a QC step to refine annotation.

9. Community Curation
• Most authors are happy to curate their own papers
• Co-curation by author and curator improves annotation quality
(especially PhD/postdoc/recent papers).
• 9619 annotations (FYPO/GO/MOD) created by the community
from 510 publications (excludes HTP spreadsheet submissions)

Some example sessions
• http://tinyurl.com/q2bgyqv
• http://tinyurl.com/p7d979b
• http://tinyurl.com/o72bzul

Very specific annotation is possible because Canto guides the user
step by step to construct genotypes and ontology-based annotations.
“Drill down” to more specific terms is assisted.
Prompts are provided for AE of specified types for certain terms.

10. Prioritise error fixing
• Fixing known errors takes precedence over new annotation,
like critical bugs in code
• Even small errors often uncover larger issues, and fixing them
can resolve many problems simultaneously across multiple species.
• Prevents propagation of annotation errors

11. GO process vs. phenotype
• GO annotation should reflect a gene's direct involvement
in, or role in regulating, processes or functions.
• Phenotypes may indicate that a mutation *affects* a
process, but may reflect downstream or indirect effects.
e.g. ER membrane defect -> nuclear envelope defect -> chromosome
decondensation defect -> defects in next round of DNA replication.
• A “DNA replication phenotype” alone is not enough to
make a “DNA replication” GO annotation.
• A single phenotype is often NOT SPECIFIC FOR A PROCESS.

Phenotype annotation rules
• To make GO annotations based on phenotypes, ask the question:
“Is this phenotype or collection of phenotypes
specific to this process?” (usually need detailed
phenotypes)
Additional data can support GO inference from
phenotype (location, orthology), as can author intent.
(Intersections between processes are useful for identifying
annotation errors caused by indirect annotation)
