Curate locally, think globally

(Insights from the “big-picture” view of curation)
Valerie Wood, PomBase,
Department of Biochemistry, University of Cambridge, UK
ISB 2019

The PomBase team
Midori Harris (curator, ontology developer)
Antonia Lock (curator)
Kim Rutherford (developer)

What can we learn from a “big picture”
view of curated data (especially to
improve our resources for end users)?
How can we effectively engage users in
the curation process?

● QC: identify annotation errors
and/or outliers
● Identify annotation gaps
● Identify knowledge gaps (and
improve annotation breadth)
● Improve data access and
presentation
Ultimately curation helps us
to join the dots and
synthesize new knowledge
from data integration.
Insights from the “big-picture” view of curation
We often overlook the value of
the emergent knowledge from
the ‘sum of the parts’

Making lists using ontologies and vocabularies
Gene1: RNA recognition motif, mRNA export, protein kinase activity, nucleus, transcription
Gene2: protein kinase activity, RNA binding domain, mRNA export, nucleus, transcription, mitotic cell cycle
mRNA export: Gene 1, Gene 2, Gene 4, Gene 6, Gene 8, Gene 9
transcription: Gene 1, Gene 2, Gene 3, Gene 7, Gene 10
Essentially creating 1000s of lists of
‘objects’ with similar features
We curate detail, annotating
genes to ‘terms’
These lists are often
related to each other
through ontologies
We can use sets of lists
to create “Annotation
subsets”
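
As a minimal sketch of how gene-centric annotation becomes term-centric lists, the gene and term names below are taken from the example slide; the `invert` helper is hypothetical and only illustrates the idea, not how PomBase builds its lists.

```python
from collections import defaultdict

# Gene -> annotated terms, as in the Gene1/Gene2 example above.
annotations = {
    "Gene1": ["RNA recognition motif", "mRNA export",
              "protein kinase activity", "nucleus", "transcription"],
    "Gene2": ["protein kinase activity", "RNA binding domain",
              "mRNA export", "nucleus", "transcription"],
}


def invert(gene_to_terms):
    """Turn gene -> terms into term -> set of genes (one 'list' per term)."""
    term_to_genes = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            term_to_genes[term].add(gene)
    return dict(term_to_genes)


lists = invert(annotations)
for term in sorted(lists):
    print(term, sorted(lists[term]))
```

Each key of `lists` is one of the "1000s of lists"; a chosen group of keys is an annotation subset.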
So why are lists useful?

GO slim = Ontology subset
of “high level” GO terms
“GO slim annotation subset”
= set of lists
GO slim
https://www.ebi.ac.uk/QuickGO/
Biological process slim (for analysis)
should represent known biology well
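
One way to read "GO slim annotation subset = set of lists" is sketched below: each annotation is followed up the ontology until it reaches a slim term, and the gene is added to that term's list. The `PARENTS` graph is a tiny hypothetical stand-in for the real GO structure, and the gene names are invented.

```python
# Hypothetical child -> parents edges (NOT the real GO graph).
PARENTS = {
    "cytoplasmic translational initiation": ["cytoplasmic translation"],
    "cytoplasmic translation": ["translation"],
    "mitochondrial translation": ["translation"],
    "translation": ["metabolic process"],
    "DNA replication": ["DNA metabolic process"],
    "DNA metabolic process": ["metabolic process"],
    "metabolic process": [],
}

# Hypothetical slim: specific terms only, the very general root is left out.
SLIM = {"cytoplasmic translation", "mitochondrial translation",
        "DNA metabolic process"}


def ancestors_and_self(term):
    """Return the term plus everything reachable by following parent edges."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def slim_annotation_subset(gene_to_terms):
    """Build one gene list per slim term from direct annotations."""
    subset = {slim_term: set() for slim_term in SLIM}
    for gene, terms in gene_to_terms.items():
        for term in terms:
            for hit in ancestors_and_self(term) & SLIM:
                subset[hit].add(gene)
    return subset


subset = slim_annotation_subset({
    "geneA": ["cytoplasmic translational initiation"],
    "geneB": ["DNA replication"],
    "geneC": ["metabolic process"],  # too general: maps to no slim term
})
print(subset)
```

In this toy data `geneC` does not "slim", which is the kind of case the slim documentation should flag explicitly.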

Cofactor metabolic process, DNA metabolic process, cytoplasmic translation, mitochondrial translation, metabolic process

‘High level’ terms are often uninformative for physiological role
Fission yeast: 4369 proteins with biological process annotation
metabolic process: 3237 (75% of BP annotated proteins)
cellular process: 4112
Intersection: metabolic process ∩ cellular process = 3167
Other process terms excluded: response to (chemical), phosphorylation
(can also apply to any module)
Terms which apply across annotation space are
often too general to be informative about
physiological role (for a biologist).
Slims with specificity are more useful.
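
The counts on this slide can be turned into fractions with a few lines of arithmetic. This is only a worked check of the slide's numbers; the variable names are illustrative.

```python
# Counts quoted on the slide (fission yeast, biological process annotation).
bp_annotated = 4369   # proteins with any biological process annotation
metabolic = 3237      # annotated to "metabolic process"
cellular = 4112       # annotated to "cellular process"
both = 3167           # annotated to both terms


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


coverage_metabolic = pct(metabolic, bp_annotated)
coverage_cellular = pct(cellular, bp_annotated)
metabolic_also_cellular = pct(both, metabolic)

print(f"metabolic process covers {coverage_metabolic:.1f}% of BP-annotated proteins")
print(f"cellular process covers {coverage_cellular:.1f}% of BP-annotated proteins")
print(f"{metabolic_also_cellular:.1f}% of metabolic process proteins are also cellular process")
```

"Metabolic process" covers roughly three quarters of the annotated proteins, and nearly all of those are also in "cellular process", so neither term separates proteins by physiological role.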

Fission yeast GO slim, 53 terms
● Good coverage of process
(99% of gene products with
BP). Important to clearly
indicate what does not “slim”
(and why)
● Some gene products belong
to more than one slim
category. Overlaps are
unavoidable but are minimised
where possible
● Align with biologically
meaningful ‘modules’
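
The three properties above (coverage, what does not slim, and overlaps) can be measured directly from the slim lists. The sketch below uses invented gene identifiers and a hypothetical `slim_report` helper; it is not the PomBase implementation.

```python
from itertools import combinations


def slim_report(slim_lists, annotated_genes):
    """Return (coverage fraction, genes that do not slim, overlapping pairs)."""
    covered = set().union(*slim_lists.values()) & annotated_genes
    not_slimmed = annotated_genes - covered
    coverage = len(covered) / len(annotated_genes)
    overlaps = {}
    for a, b in combinations(sorted(slim_lists), 2):
        shared = slim_lists[a] & slim_lists[b]
        if shared:
            overlaps[(a, b)] = shared
    return coverage, not_slimmed, overlaps


# Toy data: slim term -> genes, plus all genes with a process annotation.
slim_lists = {
    "cytoplasmic translation": {"g1", "g2"},
    "ribosome biogenesis": {"g2", "g3"},
    "lipid metabolic process": {"g4"},
}
annotated = {"g1", "g2", "g3", "g4", "g5"}

coverage, not_slimmed, overlaps = slim_report(slim_lists, annotated)
print(f"coverage: {coverage:.0%}")
print("does not slim:", sorted(not_slimmed))
print("overlaps:", overlaps)
```

Listing `not_slimmed` explicitly is what makes it possible to say what does not "slim", and inspecting each entry then supplies the "why".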

Slim terms and intersections (biological modules)
5069 proteins
All cardiolipin biosynthesis
Unknown 700

Example intersection with no co-annotated genes
Using co-annotation and biological knowledge as a QC procedure for annotation
tRNA metabolism: fission yeast 161, all GO annotation 78000
transmembrane transport: fission yeast 339, all GO annotation 14000
Transmembrane transport ∩ tRNA metabolism = 0
Current intersection: 10 (possible annotation errors?)
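
A pairwise co-annotation count of this kind can be sketched as follows. The gene sets are invented and `co_annotation_matrix` is a hypothetical helper; it illustrates the idea of an intersection matrix rather than any particular tool.

```python
from itertools import combinations


def co_annotation_matrix(term_to_genes):
    """Count shared genes for every unordered pair of terms."""
    return {
        frozenset((a, b)): len(term_to_genes[a] & term_to_genes[b])
        for a, b in combinations(term_to_genes, 2)
    }


# Toy term -> genes lists (gene names are invented).
term_to_genes = {
    "tRNA metabolism": {"g1", "g2"},
    "transmembrane transport": {"g3", "g4"},
    "translation": {"g2", "g5"},
}

matrix = co_annotation_matrix(term_to_genes)

# Pairs with no shared genes are candidates for a "zero intersect" rule,
# provided the biology supports it.
zero_pairs = [sorted(pair) for pair, count in matrix.items() if count == 0]
print(zero_pairs)
```

A non-zero count for a pair that biology says should be empty is the signal worth checking for annotation errors.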

The Matrix Tool
http://amigo.geneontology.org/matrix
(Seth Carbon & Chris Mungall, Berkeley Lab)

Fission yeast intersections 01/2012

Fission yeast intersections 03/2019

Multispecies rule building results
Pilot project, tested mouse, worm, yeast
107 rules created, each stating that a particular annotation intersection = 0
Annotation errors (experimental) identified (and corrected): 147
74
73
Acknowledgements: David Hill (MGI), Kimberley Van Auken (WormBase), Stacia Engel and Rama Balakrishnan (SGD)

Multispecies rule building results
Only 0.001% of annotation corpus (600 million). Lots of scope...
Preliminary rules are now incorporated into the GO rule base
Plan to publish soon...
Electronic annotations are based on manual annotation using experimental data.
Therefore a small number of corrections to manual annotation can fix a large number of
automatic annotations applied across non-model species

Unknowns: the elephant in the room
Unknown 700
?

Slow progress characterizing
unknowns
Hidden in plain sight: what remains to be discovered
in the eukaryotic proteome? PMID:30938578
20% pombe and cerevisiae
still “unknown process”

20% human also unknown
117 terms
53 terms
Extended PomBase slim to
cover multicellular process
annotation
We confirmed that unannotated
human proteins are unknown,
even when not explicitly
annotated as such

Why are unknowns unstudied?
27
Based on recent gene characterizations in
fission yeast
Most recently characterised proteins are
involved in non-core functions:
● environment responsive or aging related
processes: detoxification, proteostasis,
lipidostasis, damage accumulation.
● Processes that are only required over
longer timescales
● Less than 25% are housekeeping
processes

How can we help users to cut through the complexity?
https://www.pombase.org/browse-curation/fission-yeast-go-slim-terms
See P174 for recent updates to the PomBase website

New interactive view (Quilt Tool), cut across data
types

Community Curation, making small-scale data
FAIR
See P133 (Antonia Lock) P168 (Alayne Cuzick)
Easy to use curation tool (Canto), step-by-step workflow

Please, add also delta crs1: normal onset of premeiotic DNA replication. Data in Fig S4.
I am wondering a normal proportion germinates and go on to form viable colonies) - this is not what the definition is suggesting, but would be a more useful term
I like this better. Is there also a ….“reduced viability of spore population” or something like this?
….in addition to “delayed onset of premeiotic DNA replication” Is it possible to use two different Term names?
Yes of course. The peak looks a bit broader - would this be the equivalent to 'prolonged premeiotic DNA replication’?
Yes, the kinetics of the disappearance of the G1 population is much slower; prolonged premeiotic DNA replication is fine (or extended).

Community curation, increasing participation
Literature triage identifies 6K ‘gene specific’ papers
among the 12.5K that mention fission yeast
Quality is EXCELLENT, coverage not so good, but improves
with subsequent sessions.
Once ‘initiated’ drop out rate is low.
Nobody does it until asked, most need reminders
Annotations per low-throughput study
9 18 41

Understand curation
improved reuse, visibility
and dissemination
Canto is easy to use
BUT we can
improve
242 respondents who had used Canto
out of 632 total

What are the barriers?
The dog ate my homework (7)
● Many apologies for not having done
yet...
● I know I should have done.
● I keep meaning to and will!
● It is next on the 'to do' list!
● I have no excuse. I should and will
curate my paper
● feel guilty for not doing so!
● ...I'm sure it's not that difficult, just hard
to find time. I do think it's worthwhile
and that I should prioritize my curation
contribution
● Curation of papers is extremely
important and this survey definitely
motivates me to take the time to use
Canto and curate my papers.

Incentives and Nudges
https://www.freepik.com/
Applying small behavioural
‘nudges’ to increase
participation
Easy
Attractive
Social
Timely

Incentives and Nudges
Reciprocal links between PubMed
and PomBase publication page
Curator Attribution

Testimonials - Making new connections
It is back and forth: think about the ..results for a while, then compare with the body of data in PomBase, then think/work a bit more. Rinse and repeat. Martin Převorovský, Principal Investigator, Charles University, Czech Republic
I don't think we could have done anything without pombase...we build our research around its knowledge base. Mikel Zaratiegui, Principal Investigator, Rutgers University, US
…...frequently use the ...gene annotations to make connections between pathways and to design experiments. Amanda Bird, Principal Investigator, The Ohio State University, US
Recently we performed a screen and by using PomBase we quickly realized that all the hits were clustered in the same pathway. Finding this out without pombase would have required extensive review of papers that are not within our field of expertise. In this example a few minutes of work on PomBase gave us confidence that we were onto something and saved us many weeks of work. Anonymous Principal Investigator
PomBase...has saved me countless hours of fruitless experiments and helped open up many new, unexpected avenues of investigation. Gautam Dey, Postdoc, MRC LMCB UCL, UK
“Over 300 testimonials have been received from across the research
community..Quite simply, without it, many significant discoveries
would simply not have been made….”
“Ultimately, this integrated data is driving science forward in novel
ways by enabling the community to make connections between new
and existing data…” Paul Nurse, Director, Crick Institute

Acknowledgements
The PomBase team
Midori Harris (curator, ontology developer)
Antonia Lock (curator)
Kim Rutherford (developer)
Collaborators
Gene Ontology editorial team
Pascale Gaudet
David Hill
Kimberley Van Auken
Harold Drabkin
Chris Mungall
Seth Carbon

Intersections in a simple eukaryote
● cytoplasmic translation, RNA metabolism, ribosome biogenesis, nucleocytoplasmic transport: TOTAL 1359
● cell wall organization, glycosylation, lipid metabolic process, membrane organization, vesicle-mediated transport: TOTAL 722
● Intermodule: only 9 shared genes
Using co-annotation and biological knowledge as a QC procedure for annotation
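
A module here is the union of several term lists, and the intermodule count is the intersection of two such unions. The sketch below uses the term names from this slide with invented gene sets; `module_genes` is a hypothetical helper.

```python
def module_genes(term_to_genes, module_terms):
    """Union of the gene lists for every term in a module."""
    return set().union(*(term_to_genes.get(t, set()) for t in module_terms))


# Invented gene sets for a few of the slide's terms.
term_to_genes = {
    "cytoplasmic translation": {"g1", "g2"},
    "ribosome biogenesis": {"g2", "g3"},
    "nucleocytoplasmic transport": {"g4"},
    "glycosylation": {"g5"},
    "vesicle-mediated transport": {"g4", "g6"},
}

module_a = module_genes(term_to_genes, [
    "cytoplasmic translation", "ribosome biogenesis",
    "nucleocytoplasmic transport",
])
module_b = module_genes(term_to_genes, [
    "glycosylation", "vesicle-mediated transport",
])

shared = module_a & module_b
print(len(module_a), len(module_b), sorted(shared))
```

A short `shared` list, like the 9 genes on the slide, is small enough to inspect by hand for genuine dual roles versus annotation errors.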

Step 1
Annotations shared between sets of GO terms are explored and annotation intersections (number of genes annotated) are noted.

Step 2
Rules are created for “zero intersects” based on known biology:
• (“cellular amino acid metabolic process” ∩ “DNA recombination”) = 0
• (“lipid metabolic process” ∩ “carbohydrate metabolic process”) = 0

Step 3
Identify new annotations violating existing rules. Report to contributing database(s) for validation.

Step 4
Annotations critically inspected, leading to one of two outcomes:
A: Violation identified: contributing database corrects annotation
B: Annotation confirmed: rules are extended to allow specific exceptions

Steps 1-4 form an iterative process: explore co-annotation, define biological “rules”, identify and report, correct or modify.
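
The rule-and-exception cycle described in these steps can be sketched as a small data structure. The `ZeroIntersectRule` class, the gene name, and the annotation sets are all hypothetical; this is an illustration of the workflow, not the GO rule base implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ZeroIntersectRule:
    """States that two terms should share no genes, apart from listed exceptions."""
    term_a: str
    term_b: str
    exceptions: set = field(default_factory=set)  # confirmed co-annotations

    def violations(self, term_to_genes):
        """Genes annotated to both terms that are not accepted exceptions."""
        shared = (term_to_genes.get(self.term_a, set())
                  & term_to_genes.get(self.term_b, set()))
        return shared - self.exceptions


# Rule created from known biology (rule-creation step).
rule = ZeroIntersectRule("lipid metabolic process",
                         "carbohydrate metabolic process")

# Toy annotation lists; geneX is co-annotated to both terms.
term_to_genes = {
    "lipid metabolic process": {"geneX", "geneY"},
    "carbohydrate metabolic process": {"geneX", "geneZ"},
}

# Flag annotations that violate the rule (identify-and-report step).
flagged = rule.violations(term_to_genes)
print("to review:", sorted(flagged))

# Outcome B of the inspection step: the annotation is confirmed,
# so the rule gains an exception and no longer reports it.
rule.exceptions.add("geneX")
remaining = rule.violations(term_to_genes)
print("after exception:", sorted(remaining))
```

For outcome A the contributing database would instead remove the erroneous annotation, which clears the violation without touching the rule.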

29 annotation errors corrected
Multispecies exercise: cohesin complex vs. processes
