Translating research data into Gene Ontology annotations
Pascale Gaudet
SIB – Swiss Institute of Bioinformatics
GO Consortium
Gene Ontology Consortium: What we provide
Ontology + Annotations = Model of biology

Ontology
A structured representation of biology, composed of:
• Classes
• Relations
• Definitions

Annotations
Statements about the functions of specific gene products.
3 aspects:
• Molecular function
• Biological process
• Cellular component
Examples:
IGHA1 (Immunoglobulin heavy constant alpha 1)
- Antigen binding
- Adaptive immune response
- Extracellular
QARS (Gln tRNA synthetase)
- Glutamine-tRNA ligase activity
- Translation
- Cytoplasm

Model of biology
Representation of current knowledge in a manner that is:
• Human understandable
• Machine computable
GO “annotations”
§ An annotation is a statement linking a gene to
some aspect of its function (a GO ontology term)
§ Each annotation is based on some evidence,
recorded as part of the annotation
§ Evidence code (type of evidence)
§ Reference (published journal article)
Examples:
Annotation 1: INSR + ‘receptor activity’
Annotation 2: INSR + ‘plasma membrane’
Annotation 3: INSR + ‘insulin receptor signaling pathway’
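The structure described on this slide (gene, GO term, evidence code, reference) can be sketched as a small record type. This is a minimal illustration, not the GO Consortium's actual data format: the class name `Annotation`, its field names, and the helper `terms_by_aspect` are assumptions, and the evidence codes and references are placeholders because the slide does not give them for INSR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One GO annotation: a gene linked to a GO term, with its evidence."""
    gene: str           # gene or gene product symbol
    go_term: str        # label of the GO term
    aspect: str         # one of the three GO aspects
    evidence_code: str  # type of evidence (placeholder values below)
    reference: str      # supporting publication (placeholder values below)


# The three INSR examples from the slide. Evidence codes and references
# are placeholders, not real curated values.
annotations = [
    Annotation("INSR", "receptor activity", "molecular function",
               "UNKNOWN", "reference not given"),
    Annotation("INSR", "plasma membrane", "cellular component",
               "UNKNOWN", "reference not given"),
    Annotation("INSR", "insulin receptor signaling pathway",
               "biological process", "UNKNOWN", "reference not given"),
]


def terms_by_aspect(records, gene):
    """Group the GO term labels annotated to one gene by GO aspect."""
    grouped = {}
    for record in records:
        if record.gene == gene:
            grouped.setdefault(record.aspect, []).append(record.go_term)
    return grouped


insr_terms = terms_by_aspect(annotations, "INSR")
```

Grouping by aspect makes it easy to see that a single gene typically carries annotations in all three aspects, as INSR does here.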
Semantics of a GO annotation
The association of a GO class with a gene product
is a statement that means:
§ molecular function: molecular activities of gene
products
§ cellular component: where gene products are active
§ biological process: pathways and larger processes
made up of the activities of multiple gene products.
§ In other words, annotations represent the
normal, in vivo biological role of gene products
How are annotations generated?

Manual - Literature-based (> 500,000 annotations)
An expert reviews the literature and assigns functions, processes and
cellular components to gene products.

Manual - Sequence-based (> 3M annotations)
An expert analyses a sequence and makes a prediction concerning the
gene function based on known functions of related sequences.
The predictions can be based on the known function of evolutionarily
related sequences (phylogenetic relationships).

Algorithmic (unreviewed) (> 65M annotations)
A computer program analyses a sequence and makes a prediction based
on some decision criteria, for example:
- protein domain (InterPro2GO)
- sequence similarity (BLAST2GO)
Evidence types

Manual - Literature-based
EXP: experimental evidence
IDA: inferred from direct assay
IPI: inferred from physical interaction
IMP: inferred from mutant phenotype

Manual - Sequence-based
ISS: inferred from sequence similarity
ISO: inferred from sequence orthology
IBA: inferred from biological aspect of ancestor

Algorithmic (unreviewed)
IEA: inferred from electronic annotation

Chibucos MC, Siegele DA, Hu JC, Giglio M. (2017) Evidence and conclusion ontology. PMID: 27812948
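The evidence codes above can be turned into a lookup table for filtering annotations by how they were produced. This is a minimal sketch: it assumes the codes group under the slide's three headings in the order shown (four literature-based, three sequence-based, one algorithmic), and the names `EVIDENCE_CODES`, `annotation_method` and `is_manual` are hypothetical.

```python
# Evidence code -> (meaning, how the annotation was produced).
# The grouping follows the three headings on the slide and is an
# assumption about how the slide's columns line up.
EVIDENCE_CODES = {
    "EXP": ("experimental evidence", "manual, literature-based"),
    "IDA": ("inferred from direct assay", "manual, literature-based"),
    "IPI": ("inferred from physical interaction", "manual, literature-based"),
    "IMP": ("inferred from mutant phenotype", "manual, literature-based"),
    "ISS": ("inferred from sequence similarity", "manual, sequence-based"),
    "ISO": ("inferred from sequence orthology", "manual, sequence-based"),
    "IBA": ("inferred from biological aspect of ancestor",
            "manual, sequence-based"),
    "IEA": ("inferred from electronic annotation", "algorithmic (unreviewed)"),
}


def annotation_method(code):
    """Return how annotations with this evidence code were produced."""
    return EVIDENCE_CODES[code][1]


def is_manual(code):
    """True if an expert reviewed the annotation, False if unreviewed."""
    return annotation_method(code).startswith("manual")


manual_codes = sorted(code for code in EVIDENCE_CODES if is_manual(code))
```

A filter like `is_manual` is a common first step when an analysis should exclude unreviewed predictions.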
Who produces GO annotations?
• Model organism databases (SGD, FlyBase,
WormBase, MGI, etc.)
• Generalist databases, e.g. UniProtKB, IntAct
• Domain-specific projects: Cardiovascular project
(UCL), synapse project (VU), etc.
• Anyone who wishes to contribute their expertise
and data to the project
Best practices for generating
literature-based GO annotations
§ Ensure consistency of usage across a
broad consortium of contributors
§ Improve inferencing capabilities
Focus on the research hypothesis
§ Use prior knowledge to understand the hypothesis
being tested and its relation to the experimental
observation
Protein: DDFB (O76075)
- Known roles: DNase
- Hypothesis: The nuclease activity of DDFB is required for nuclear DNA
  fragmentation during apoptosis
- Assay result: Apoptotic DNA fragmentation increased in the presence of DDFB
- Conclusion for GO: DDFB mediates nuclear DNA fragmentation during apoptosis
  = apoptotic DNA fragmentation (GO:0006309)

Protein: FOXL2 (P58012)
- Known roles: Transcription factor
- Hypothesis: Mutations in FOXL2 are known to cause premature ovarian
  failure, which may be due to increased apoptosis
- Assay result: Apoptotic DNA fragmentation increased in the presence of FOXL2
- Conclusion for GO: FOXL2 increases the rate of apoptosis
  = positive regulation of apoptotic process (GO:0043065)
Annotate the conclusion, not the assay
1) Rubidium is often used to assay potassium transport,
because the radioactive form is more readily available;
- the physiologically relevant substrate is potassium
2) Protein kinases are often tested with non-physiologically
relevant substrates, such as histone
- if the authors do not discuss the physiological relevance,
one cannot annotate the substrate
On the in vivo relevance of phenotypes
• Phenotypes can help understand the function of proteins
• Phenotypes can provide insights into mechanisms leading to disease
• The scope of the GO, though, is to capture the normal function of
proteins
Indirect effects of a mutation:
- RNA polymerase affects essentially all cellular processes (cell
proliferation, development, etc.) but does not mediate these
processes
Lack of hypothesis for a role of a protein in a process:
- Knockdown of Tmem234 in zebrafish results in defects in pronephric
glomerulus formation. Annotation by IMP to glomerulus formation is
not supported by any cellular/molecular data
Get the wider perspective
• Favor a gene-by-gene or pathway-by-pathway
approach for curation rather than paper-by-paper
• Read recent publications
• Remove incorrect annotations based on invalidated
hypotheses
Guidelines for high quality
annotations
• Annotate the conclusion of the experiment
• Use the biological context to interpret the
experiments
• Carefully select publications. Read recent
publications
• Ensure consistency with existing annotations
• Keep annotations up to date: remove obsolete
annotations
Other approaches for quality
control
• Annotation consistency exercises
• Taxonomic constraints
• Co-occurrence of annotations
• Phylogenetic annotations
• User feedback
- from GO website
- from PubMed
- from databases
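As an illustration of the "taxonomic constraints" item, the sketch below flags an annotation when the GO term is restricted to a taxon that the annotated species does not belong to. It is a toy example under stated assumptions, not the GO Consortium's implementation: the tables `ONLY_IN_TAXON` and `LINEAGE` and the function `violates_taxon_constraint` are hypothetical, and the single constraint shown (lactation occurs only in mammals) is chosen for illustration.

```python
# Toy constraint table: GO term label -> taxon in which the process may occur.
# Illustrative only; real constraints would come from the ontology itself.
ONLY_IN_TAXON = {
    "lactation": "Mammalia",
}

# Toy lineage table: species -> taxa it belongs to (hypothetical, abridged).
LINEAGE = {
    "Homo sapiens": {"Eukaryota", "Metazoa", "Mammalia"},
    "Saccharomyces cerevisiae": {"Eukaryota", "Fungi"},
}


def violates_taxon_constraint(go_term, species):
    """Return True if the term is restricted to a taxon the species is not in.

    Terms without a constraint never violate. A species missing from
    LINEAGE is treated as belonging to no taxon, so it is flagged for
    any constrained term.
    """
    required_taxon = ONLY_IN_TAXON.get(go_term)
    if required_taxon is None:
        return False
    return required_taxon not in LINEAGE.get(species, set())


# Candidate annotations to check: (GO term label, species).
candidates = [
    ("lactation", "Homo sapiens"),
    ("lactation", "Saccharomyces cerevisiae"),
    ("translation", "Saccharomyces cerevisiae"),
]
flagged = [pair for pair in candidates if violates_taxon_constraint(*pair)]
```

A check of this kind catches annotations that are biologically implausible for the species, which are often the result of propagating a term too far by sequence similarity.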
GO annotations in PubMed
Annotations for a paper
This talk was based upon
Acknowledgments
• GO PIs
- Judy Blake
- Mike Cherry
- Suzanna Lewis
- Paul Sternberg
- Paul Thomas
• GO Handbook contributors
- Christophe Dessimoz
- Jim Hu
- Nives Skunca
- Sylvain Poux
• Funding
- NIH HG002273 (GO)
