Ontologies for life sciences: examples
from the Gene Ontology
Melanie Courtot
GO/GOA project lead
mcourtot@ebi.ac.uk
@mcourtot
Ontologies for life sciences
Cross domain resources
Data resources at EMBL-EBI
Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive
Gene, protein & metabolite expression: RNA Central, ArrayExpress, Expression Atlas, MetaboLights, PRIDE
Protein sequences, families & motifs: InterPro, Pfam, UniProt
Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
Chemical biology: ChEMBL, SureChEMBL, ChEBI
Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
Systems: BioModels, Enzyme Portal, BioSamples
Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology
Different words same concept: example of
Dyschromatopsia
Search PubMed for “color blindness”
Search PubMed for “Dyschromatopsia”
Search PubMed for "abnormality of the eye"
Thousands of sample attributes…
Data integration in times of ‘omics’
genomics, transcriptomics, proteomics, metabolomics
Individual experiments conducted at different times by different researchers using
different equipment/approaches, reporting the same type of results differently
Data growth is fast
12 month doubling
18 month doubling
4 month doubling
3 month doubling
[Plot: bytes stored (log scale, 1E+08 to 1E+16) against date (2002 to 2016) for EGA, ENA, PRIDE, MetaboLights and ArrayExpress]
Slide credit: Paul Flicek
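To make the doubling times concrete, here is a minimal sketch that converts a doubling time into growth per year. The function name is illustrative and the slide does not say which archive has which doubling time, so the loop simply walks through the four values quoted.

```python
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Multiplicative growth over `months_elapsed` for a given doubling time."""
    return 2 ** (months_elapsed / doubling_months)


# Doubling times quoted on the slide, in months.
for doubling in (18, 12, 4, 3):
    factor = growth_factor(12, doubling)
    print(f"{doubling:>2}-month doubling: x{factor:.2f} per year")
```

A 3-month doubling time therefore means a sixteen-fold increase per year, against a two-fold increase for a 12-month doubling time.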
Vast amount of data generated
means
vast amount of data submitted to repositories
Curation - Dirty data and the long tail
sex:female
gender:female
disease:breast cancer
frequency=2285 frequency=1288
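A minimal sketch of why this matters for curation: two spellings of the same attribute split the counts until they are normalised. The synonym table is hypothetical, and pairing the two frequencies with sex:female and gender:female is an assumption made for illustration only.

```python
from collections import Counter

# Hypothetical synonym table: variant attribute name -> preferred name.
KEY_SYNONYMS = {"gender": "sex"}


def normalise(attribute: str) -> str:
    """Lower-case a 'key:value' attribute and map the key to its preferred name."""
    key, _, value = attribute.partition(":")
    key = key.strip().lower()
    value = value.strip().lower()
    return f"{KEY_SYNONYMS.get(key, key)}:{value}"


# Assumed pairing of the two frequencies shown on the slide.
raw_counts = {"sex:female": 2285, "gender:female": 1288}

merged = Counter()
for attribute, frequency in raw_counts.items():
    merged[normalise(attribute)] += frequency

print(dict(merged))
```

After normalisation both variants count towards a single attribute, which is what a query for female samples needs.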
data integration [ˈdeɪtə ˌɪntəˈgreɪʃən]:
(computational) means to access, retrieve
and analyse data sets from different
sources in order to exploit them, i.e., gain
new knowledge, and share that new
knowledge
Standards
What do they offer?
•  uniformity and consistency in reporting data
•  effective reuse, integration and mining of data
•  creation of SOPs, benchmarks, quality assessment
•  community cohesion
What constitutes a standard?
1.  Establish your community
2.  Define community needs
3.  Define minimal information which needs to be gathered
and exchanged by that community
4.  Design* an interchange format
5.  Design* domain-specific controlled vocabularies
*Design = review, reuse and fill the gaps
https://xkcd.com/927/
http://www.biosharing.org
•  Many “Minimum information about a…..” papers now
published.
Standards – XML interchange formats
http://www.sbml.org
Adding semantics to the data formats
•  Same name for different concepts
•  Different names for the same concept
Inconsistency in naming of biological concepts
?
An example …
Tactition Tactile sense
Taction
perception of touch ; GO:0050975
Sample description with semantic markup
CL:CL_0000071 (blood vessel endothelial cell)
obo:CHEBI_39867 (valproic acid)
NCBITaxon:NCBITaxon_9606 (Homo sapiens)
Curation
Ontologies
•  Representation of important things in a specific domain
•  Describes types of entities (e.g. cells) and relations between them
•  An active, formal computational artifact
•  A mathematical model based on a subset of first order logic
•  Tools can automatically process ontologies
•  A communication tool
•  Provides a dictionary for collaborators, a shared understanding
•  Allows data sharing
Reasoning is critical
•  Prokaryotic cell and Eukaryotic cell are declared disjoint
•  Fungal cell is a Eukaryotic cell
•  Spore is a Fungal cell and a Prokaryotic cell
⇒ Unsatisfiability
⇒ Solution: clarify spore (sensu Mycetozoa) AND actinomycete-type spore
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
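The spore example can be sketched in a few lines. This is only a toy check over asserted is-a links and one disjointness axiom, not what an OWL reasoner such as ELK actually does; the function names are illustrative.

```python
# Asserted is-a links from the slide (child -> parents).
IS_A = {
    "fungal cell": {"eukaryotic cell"},
    "spore": {"fungal cell", "prokaryotic cell"},
}
# Pairs of classes declared disjoint.
DISJOINT = [("prokaryotic cell", "eukaryotic cell")]


def ancestors(term, is_a):
    """All classes reachable from `term` by following is-a links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable_classes(is_a, disjoint):
    """Classes that sit under both members of a disjoint pair."""
    problems = []
    for term in is_a:
        supers = ancestors(term, is_a) | {term}
        for left, right in disjoint:
            if left in supers and right in supers:
                problems.append((term, left, right))
    return problems


print(unsatisfiable_classes(IS_A, DISJOINT))
```

Only spore is reported, because it inherits eukaryotic cell through fungal cell while also being asserted as a prokaryotic cell.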
Different words same concept: example of
Dyschromatopsia
We searched earlier for :
-  Dyschromatopsia
-  Colorblindness
-  Abnormality of the eye
The ontology of color blindness
HP:0011518 (Dichromacy)
HP:0011518 (Eye)
HP:0000551 (Abnormality of color vision )
HP:0007641 (Dyschromatopsia)
Is-a
Is-a
Disease-location
“Colorblindness”
“A form of colorblindness in
which only two of the three
fundamental colors can be
distinguished due to a lack of
one of the retinal cone
pigments.”
synonym
definition
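A minimal sketch of how synonyms and is-a links turn one search word into several. The slide text does not preserve which arrow joins which term or which term carries the synonym, so the chain Dichromacy → Dyschromatopsia → Abnormality of color vision and the placement of "Colorblindness" are assumptions for illustration.

```python
# Toy ontology; the is_a chain and synonym placement are assumptions.
TERMS = {
    "HP:0000551": {"label": "Abnormality of color vision",
                   "synonyms": ["Colorblindness"], "is_a": []},
    "HP:0007641": {"label": "Dyschromatopsia",
                   "synonyms": [], "is_a": ["HP:0000551"]},
    "HP:0011518": {"label": "Dichromacy",
                   "synonyms": [], "is_a": ["HP:0007641"]},
}


def descendants(term_id):
    """All terms below `term_id` in the is-a hierarchy."""
    found = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for child_id, term in TERMS.items():
            if current in term["is_a"] and child_id not in found:
                found.add(child_id)
                stack.append(child_id)
    return found


def expand(query):
    """Labels and synonyms of every term matching `query`, plus its descendants."""
    q = query.strip().lower()
    matches = {tid for tid, term in TERMS.items()
               if q in [name.lower() for name in [term["label"], *term["synonyms"]]]}
    expanded = set()
    for tid in matches:
        for hit in {tid} | descendants(tid):
            expanded.add(TERMS[hit]["label"])
            expanded.update(TERMS[hit]["synonyms"])
    return expanded


print(sorted(expand("colorblindness")))
```

A search for the synonym then also retrieves records annotated with the more specific terms, which a plain text search would miss.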
Building ontologies
•  Put things into categories
•  Helps organise the data
•  Allows us to generalise over data
•  Capture the relations between things
•  Anatomical parts
Biopolymer
  Nucleic Acid
    DNA
    RNA
      tRNA
      mRNA
      smRNA
  Polypeptide
    Enzyme
Ontologies add value
Smarter searching
Data visualisation
Data analysis
Data integration
CMPO term:
graped micronucleus
CMPO_0000156
Integrate file formats
Integrate metadata
Apply phenotype ontology
Predict disease gene/biomarkers
Human
Disease
Cell
Gene knockdown
Genotype Phenotype
Sequence
Proteins
Gene products Transcript
Pathways
Cell type
BRENDA tissue /
enzyme source
Development
Anatomy
Phenotype
Plasmodium
life cycle
- Sequence types
and features
- Genetic Context
- Molecule role
- Molecular Function
- Biological process
- Cellular component
- Protein covalent bond
- Protein domain
- UniProt taxonomy
-Pathway ontology
-Event (INOH pathway
ontology)
-Systems Biology
-Protein-protein
interaction
-Arabidopsis development
-Cereal plant development
-Plant growth and developmental stage
-C. elegans development
-Drosophila development
-Human developmental anatomy, abstract
version
-Human developmental anatomy, timed version
-Mosquito gross anatomy
-Mouse adult gross anatomy
-Mouse gross anatomy and development
-C. elegans gross anatomy
-Arabidopsis gross anatomy
-Cereal plant gross anatomy
-Drosophila gross anatomy
-Dictyostelium discoideum anatomy
-Fungal gross anatomy FAO
-Plant structure
-Maize gross anatomy
-Medaka fish anatomy and development
-Zebrafish anatomy and development
-NCI Thesaurus
-Mouse pathology
-Human disease
-Cereal plant trait
-PATO (attribute and value)
-Mammalian phenotype
- Human phenotype
-Habronattus courtship
-Loggerhead nesting
-Animal natural history and life history
eVOC (Expressed
Sequence Annotation
for Humans)
Ontologies for life sciences
Open Biological and Biomedical Ontologies
(OBO)
A subset of biological and biomedical ontologies whose developers have
agreed in advance to accept a common set of principles reflecting best
practice in ontology development designed to ensure …
•  tight connection to the biomedical basic sciences
•  compatibility
•  interoperability, common relations
•  formal robustness
•  support for logic-based reasoning
http://www.obofoundry.org
OBO Foundry
Building metadata (& ontology) rich resources
•  We build tools for semantic
enrichment and alignment
•  Interoperability toolkit
•  Microservices based architecture
•  Technology-agnostic
•  Pushing boundaries of ontology
“embedding”
Raw Data to Explicit Knowledge
Data
Exploration
and
Cleanup
Data
structuring
Ontology
Annotation
Data cleaning
and mapping
Ontology
building
Webulous
OxO mapping service
Searching for ontology terms: the EBI
Ontology Lookup Service
•  for searching and visualizing >140 ontologies from the biomedical
domain
•  includes (among others):
•  Gene Ontology
•  OBO Relations ontology
•  Evidence ontology
•  Pathogen Transmission Ontology
•  Symptom Ontology
•  Basic Formal Ontology
Ontology Lookup Service
•  Ontology search engine
•  Ontology visualisation
•  Powerful RESTful API
•  Open source project
•  Generic infrastructure (can load any ontology represented in OWL)
https://github.com/EBISPOT/OLS
Repository of over 150 biomedical ontologies (4.5 million terms, 11 million relations)
http://www.ebi.ac.uk/ols
Choosing the right term
•  Sample attributes and variables are mapped to EFO ontology
Sample attribute
Mapping data to ontology terms
• Zooma automatically annotates sample attributes and variables with
ontology classes
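The idea of automatic annotation can be sketched as a lookup from free-text values to ontology terms. This is not Zooma's actual algorithm, which the slides do not describe; it is an exact-match toy using the three terms from the earlier "semantic markup" slide, and the sample record is hypothetical.

```python
# Label -> ontology term, taken from the "semantic markup" slide.
KNOWN_MAPPINGS = {
    "blood vessel endothelial cell": "CL_0000071",
    "valproic acid": "CHEBI_39867",
    "homo sapiens": "NCBITaxon_9606",
}


def annotate(sample):
    """Split sample attributes into mapped terms and attributes left for curators."""
    mapped, unmapped = {}, []
    for attribute, value in sample.items():
        key = " ".join(value.lower().split())  # ignore case and extra spaces
        if key in KNOWN_MAPPINGS:
            mapped[attribute] = KNOWN_MAPPINGS[key]
        else:
            unmapped.append(attribute)
    return mapped, unmapped


# Hypothetical sample record with inconsistent case, spacing and naming.
sample = {
    "organism": "Homo Sapiens",
    "cell type": "blood vessel  endothelial cell",
    "compound": "valproate",
}
mapped, unmapped = annotate(sample)
print(mapped)
print(unmapped)
```

Values with no match are the ones a curator would review and, where appropriate, add as new mappings.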
Mapping data to ontology terms
Information supplied as
part of a search
The source of this
mapping
ZOOMA contains a linked data repository of
annotation knowledge and highly annotated data
Expression Atlas: source of mappings
•  Atlas automated pipeline runs against Zooma, then curators:
•  Check that the automatic mappings are all correct
•  Create a list of new mappings that should be added to Zooma
•  Webulous Google Add-On
•  Connect to the Webulous server from Google Spreadsheets
•  Load templates from the Webulous server
•  Submit populated templates back to the server for processing 
Expression Atlas: curation
What happens when
we need a term that
is not in EFO?
Adding diseases to EFO using Webulous
•  Design pattern templates can be loaded into Google Sheets
•  A Webulous template specifies a series of fields (columns) for the input data
•  Some fields only allow values from a list of ontology terms
•  This data validation provides the user with convenient term autocomplete when entering data into a cell
BioSolr
“BioSolr aims to significantly advance the state of the art with
regards to indexing and querying biomedical data with
freely available open source software”
flaxsearch/BioSolr
Solr documents with
ontology annotation
Enriched Solr with ontology content
(synonyms, structure, relations)
Solr/Elastic plugin Query expansion and
hierarchical faceting
Which other diseases are associated with
PDE4D?
View diseases grouped in
therapeutic areas or
organised in a tree
View more information about PDE4D
Filter by
therapeutic area
http://www.ebi.ac.uk/rdf
Publishing biological data as
Linked Open Data
•  The EBI RDF platform
•  Released Nov 2013
•  Currently over 16 billion RDF triples
•  Datasets updated ~ quarterly
LOD diagram August
2014
Jupp et al (2013). The EBI RDF Platform: Linked Open Data
for the Life Sciences. Bioinformatics.
RDF Platform Integration points
[Diagram: integration points between datasets on the RDF platform]
Datasets and classes shown: Gene Expression Atlas; Ensembl (Gene, RNA transcript, Protein, each via identifiers.org/ensembl); UniProt (uniprot:Protein); Gene Ontology (GO BP, GO MF, GO CC); Organisms (Organism/taxon); ChEMBL (Assay (?), Target (?)); BioModels (SBMLModel, Reaction, Species, Compartment); ChEBI; Reactome (Pathway); SBO; PDB (Structure)
Link predicates shown: rdfs:seeAlso, bq:is, bq:isVersionOf, bq:isHomologTo, bq:hasPart, bq:occursIn, uniprot:classifiedWith, uniprot:organism, uniprot:transcribedFrom, uniprot:translatedTo, chembl:hasTarget, sio:'is attribute of' (sio:SIO_000011)
Gene Expression Atlas: discretized differential gene expression ratio (sio:SIO_001078)
uniprot:Protein rdfs:seeAlso (not currently linking to identifiers.org but soon)
Relationships within BioModels can be found at https://github.com/sarala/ricordo-rdfconverter/wiki/SBML-RDF-Schema
RDF Platform – lessons learned
Successes
•  Novel queries possible over
EBI datasets
•  Production quality RDF
releases
•  Community of users
•  Highly available public
SPARQL endpoints
•  500+ users (10-50 million
hits per month)
•  Lots of interest
•  Catalyst for new RDF efforts
Lessons
●  Public SPARQL endpoints
problematic
●  Query federation not
performant
●  Inference support limited
●  Not scalable for all EBI data
e.g. Variation, ENA
●  Lack of expertise in service
teams
●  Too much overhead to get
started quickly in this space
An example: The Gene Ontology and
Gene Ontology Annotation
Model
Organism
Databases
The Gene Ontology
• A way to capture biological knowledge for individual gene products in a written and computable form
• A set of concepts and their relationships to each other arranged as a hierarchy
Less specific concepts
More specific concepts
www.ebi.ac.uk/QuickGO
The Gene Ontology
http://geneontology.org/
•  Collaborative effort to address the need for consistent
descriptions of genes/gene products across databases
•  Use of GO terms by collaborating databases facilitates
uniform queries across all of them
Aims of the GO project
•  compile the ontologies
•  >40000 terms
•  constantly increasing and improving
•  annotate gene products using the terms
•  provide public resource of data and tools
•  regular releases of annotations
•  tools for browsing/querying annotations and editing the GO
The GO editorial office at EMBL-EBI
•  Part of the Samples, Phenotypes and Ontologies team (SPOT)
•  Contributes to development of the Gene Ontology
•  Specific areas of interest: autophagy, synapse…
•  Answers user requests
•  New terms, modifications, updates
•  Help support
•  Curator requests
GO editorial office at the EBI:
Paola
Roncaglia
David
Osumi-Sutherland
Develop the ontology
•  An OWL ontology of >41,000 classes
•  biological process, cellular component, molecular function
•  > 14,000 imported classes (CL, Uberon, ChEBI, NCBI_tax)
•  >136,000 logical axioms, including:
•  ~72,000 subClassOf axioms between named GO classes
•  ~41,000 simple existential restrictions (subClassOf R some C)
•  EL expressivity => fast, scalable reasoning (with ELK)
https://www.cs.ox.ac.uk/isg/tools/ELK/
Ontology structure
• Hierarchical
Terms can have more than one parent
Terms can have more than one child
• Terms are linked by relationships
is_a
part_of
regulates (and +/- regulates)
occurs_in
has_part
These relationships allow for complex analysis of large datasets
www.ebi.ac.uk/QuickGO
Biological Process
what does a gene product do?
cell division
transcription
A commonly recognised series of events
Molecular Function
how does a gene product act?
•  insulin binding
•  insulin receptor activity
•  glucose-6-phosphate isomerase activity
Cellular Component
where is a gene product located?
•  plasma membrane
•  mitochondrion
•  mitochondrial membrane
•  mitochondrial matrix
•  mitochondrial lumen
•  ribosome
•  large ribosomal subunit
•  small ribosomal subunit
Example GO annotation – cytochrome c
molecular functions: Electron carrier activity GO:0009055
biological processes: oxidation-reduction process GO:0055114
cellular components: Mitochondrion GO:0005739
https://www.ebi.ac.uk/QuickGO/GProtein?ac=P99999
Anatomy of a GO term
Unique identifier
Term name
Definition
Synonyms
Cross-references
Hands-on
Finding GO term
information
https://www.ebi.ac.uk/QuickGO/
What is the GO ID for the term mitochondrial chromosome?
GO:0000262
What are the four direct parents of the term nucleosome?
Chromatin
Chromosomal part
DNA packaging complex
Protein-DNA complex
What types of relationships are there between the term nucleosome and its direct parents?
Part of chromatin
Is a for the others
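The nucleosome answers can be written down as typed edges, which is roughly how a browser such as QuickGO presents them. A minimal sketch; the data structure and function name are illustrative, not a QuickGO API.

```python
# (child, relation, parent) edges as given in the hands-on answers.
EDGES = [
    ("nucleosome", "part_of", "chromatin"),
    ("nucleosome", "is_a", "chromosomal part"),
    ("nucleosome", "is_a", "DNA packaging complex"),
    ("nucleosome", "is_a", "protein-DNA complex"),
]


def direct_parents(term, edges=EDGES):
    """Group the direct parents of `term` by relationship type."""
    grouped = {}
    for child, relation, parent in edges:
        if child == term:
            grouped.setdefault(relation, []).append(parent)
    return grouped


for relation, parents in direct_parents("nucleosome").items():
    print(relation, "->", ", ".join(parents))
```

Keeping the relation type on each edge matters: a nucleosome is part of chromatin, but it is not a kind of chromatin.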
Building the GO
•  The GO editorial team
•  Submission via GitHub, https://github.com/geneontology/
•  Submissions via TermGenie, http://go.termgenie.org
•  ~80% terms are now created this way
Annotate gene products
GOA
Database
external annotation groups
(25)
manual annotation by
curators (125)
electronic prediction methods
(11)
Making annotations available
GOA
Database
GOA & GOC ftp sites
QuickGO
Manual annotations
•  Time-consuming process producing lower numbers of annotations (~2,800 taxa covered)
•  More specific GO terms
•  Manual annotation is essential for
creating predictions
• Part of the Protein Function Content team
• Largest open-source contributor of annotations to GO
•  Focuses on human, but provides annotations for more than 441,000 species
• Employs human curators, and collates manual and electronic annotations from across the community
UniProt-Gene Ontology Annotation (UniProt-
GOA) project at the EMBL-EBI
http://www.ebi.ac.uk/GOA
Aleksandra
Shypitsyna
Elena
Speretta
Penelope
Garmiri
Tony
Sawford
UniProt-GOA project at the EBI:
A GO annotation is …
…a statement that a gene product:
1. has a particular molecular function
or is involved in a particular biological process
or is located within a certain cellular component
2. as described in a particular reference
3. as determined by a particular method

Accession  Name  GO ID       GO term name                     Reference     Evidence code
P00505     GOT2  GO:0004069  aspartate transaminase activity  PMID:2731362  IDA
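The three parts of an annotation (what, according to which reference, determined how) map naturally onto a small record type. A minimal sketch using the GOT2 example; the class is illustrative and is not the GOA file format, and the set of experimental codes is limited to the three listed in the evidence code excerpt later in the deck.

```python
from dataclasses import dataclass

# Experimental evidence codes from the excerpt in this deck (not the full list).
EXPERIMENTAL_CODES = {"IDA", "IMP", "IPI"}


@dataclass(frozen=True)
class GOAnnotation:
    accession: str      # gene product identifier
    name: str           # gene product name
    go_id: str          # what is asserted
    go_term: str
    reference: str      # where it was described
    evidence_code: str  # how it was determined

    def is_experimental(self) -> bool:
        return self.evidence_code in EXPERIMENTAL_CODES


example = GOAnnotation(
    accession="P00505",
    name="GOT2",
    go_id="GO:0004069",
    go_term="aspartate transaminase activity",
    reference="PMID:2731362",
    evidence_code="IDA",
)
print(example.is_experimental())
```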
Experimental
data
Computational
analysis
Author statements/
curator inference
(+ Inferred from electronic annotations)
http://www.evidenceontology.org/
Tracking provenance
Evidence codes
http://geneontology.org/page/evidence-code-decision-tree
Hands-on
Manual annotation
example
PMID:18573874
FIG. 2. Human Nbp35 is a cytosolic
protein. (A) EGFP fluorescence of a HeLa
cell transiently transfected with a vector
encoding a huNbp35-EGFP fusion protein
(right) in comparison to the endogenous
autofluorescence (AFL) of control cells
(left).
(C) Sub-cellular localization of huNbp35 by cell fractionation. […]HuNbp35
exclusively colocalizes with tubulin in the cytosolic fraction, but not with
mitochondrial aconitase (mtAconitase) present in the membrane fraction.
Human Nbp35 is a cytosolic protein.
•  Find the correct UniProt entry
http://www.uniprot.org
Protein GO term Supporting evidence
NUBP1
•  Find the right GO term
https://www.ebi.ac.uk/QuickGO/
Protein GO term Supporting evidence
NUBP1 GO:0005829
•  Evidence:
•  Fig 2A Immunofluorescence and/or
•  Fig 2C subcellular fractionation
GO evidence codes [small excerpt]
Abstract & Introduction:
TAS, Traceable author statement
NAS, Non-traceable author statement
Experimental evidence, Methods & Results:
IDA, Inferred from Direct Assay
IMP, Inferred from Mutant Phenotype
IPI, Inferred from Physical Interaction
Human Nbp35 is a cytosolic protein.
Protein GO term Supporting evidence
NUBP1 GO:0005829 IDA
Electronic Annotations
•  Quick way of producing large numbers of annotations
•  Annotations use less-specific GO terms
•  Only source of annotation for ~438,000 non-model organism species
orthology
taxon
constraints
Broad taxonomic coverage
We provide annotation files for well-studied species…
…as well as less well-studied species that have:
• Complete proteome
• >25% GO annotation coverage
We have annotations for species that may not have a dedicated curation effort,
e.g. for 1,400 Solanaceae species we have
~360,000 annotations for ~64,000 proteins
1. Mapping of external concepts to GO terms
e.g. InterPro2GO, UniProt Keyword2GO, Enzyme Commission2GO
Electronic annotation methods
GO:0004715 ; non-membrane spanning protein tyrosine kinase activity
Annotations are high-quality and have an explanation of the method (GO_REF)
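Mapping external concepts to GO terms amounts to a lookup table applied to each protein's cross-references. A minimal sketch in the spirit of Enzyme Commission2GO; the EC number paired with GO:0004715, the protein names and the tuple layout are all assumptions for illustration, and IEA is the evidence code for annotations inferred electronically.

```python
# Illustrative mapping entry (an assumption, not copied from a mapping file).
EC2GO = {"EC:2.7.10.2": ["GO:0004715"]}


def electronic_annotations(protein_xrefs, mapping=EC2GO):
    """Emit (protein, GO ID, evidence code, source xref) for every mapped xref."""
    annotations = []
    for accession, xrefs in protein_xrefs.items():
        for xref in xrefs:
            for go_id in mapping.get(xref, []):
                annotations.append((accession, go_id, "IEA", xref))
    return annotations


# Hypothetical proteins: one with a mapped EC number, one without.
proteins = {"PROTEIN_A": ["EC:2.7.10.2"], "PROTEIN_B": ["UNMAPPED_XREF"]}
print(electronic_annotations(proteins))
```

Keeping the source cross-reference with each annotation is one way to preserve the explanation of the method alongside the result.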
2. Automatic transfer of manual annotations to orthologs
e.g. Human
Macaque, Mouse, Dog, Cow, Guinea Pig, Chimpanzee, Rat, Chicken
...and more (Ensembl Compara)
Arabidopsis
Rice, Brachypodium, Maize, Poplar, Grape
…and more (Ensembl Compara)
Electronic annotation methods
http://www.geneontology.org/cgi-bin/references.cgi
An example
Annotations from the source…
ACCESSION  GO ID       GO ASPECT  GO TERM
P04637     GO:0047485  F          protein N-terminus binding
P04637     GO:0051087  F          chaperone binding
P04637     GO:0051721  F          protein phosphatase 2A binding
P04637     GO:0000733  P          DNA strand renaturation
P04637     GO:0006289  P          nucleotide-excision repair
P04637     GO:0006355  P          regulation of transcription, DNA-templated
P04637     GO:0006461  P          protein complex assembly
…are projected on to the target
ACCESSION  GO ID       GO ASPECT  GO TERM
Q549C9     GO:0047485  F          protein N-terminus binding
Q549C9     GO:0051087  F          chaperone binding
Q549C9     GO:0051721  F          protein phosphatase 2A binding
Q549C9     GO:0000733  P          DNA strand renaturation
Q549C9     GO:0006289  P          nucleotide-excision repair
Q549C9     GO:0006355  P          regulation of transcription, DNA-templated
Q549C9     GO:0006461  P          protein complex assembly
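The projection step can be sketched as copying each source annotation to every ortholog of the source protein. The ortholog table here is an assumption built from the two accessions in the example; the real pipeline has its own rules for which orthologs and which annotations qualify, which are not reproduced here.

```python
# Source annotations from the example: (GO ID, aspect, term name).
SOURCE_ANNOTATIONS = {
    "P04637": [
        ("GO:0047485", "F", "protein N-terminus binding"),
        ("GO:0051087", "F", "chaperone binding"),
        ("GO:0051721", "F", "protein phosphatase 2A binding"),
        ("GO:0000733", "P", "DNA strand renaturation"),
        ("GO:0006289", "P", "nucleotide-excision repair"),
        ("GO:0006355", "P", "regulation of transcription, DNA-templated"),
        ("GO:0006461", "P", "protein complex assembly"),
    ]
}
# Assumed ortholog table (source accession -> target accessions).
ORTHOLOGS = {"P04637": ["Q549C9"]}


def project(source_annotations, orthologs):
    """Copy every annotation of a source protein to each of its orthologs."""
    projected = {}
    for source, rows in source_annotations.items():
        for target in orthologs.get(source, []):
            projected.setdefault(target, []).extend(rows)
    return projected


projected = project(SOURCE_ANNOTATIONS, ORTHOLOGS)
for go_id, aspect, term in projected["Q549C9"]:
    print("Q549C9", go_id, aspect, term)
```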
InterPro
Source of ~93 million GO mappings for ~30 million distinct
UniProtKB sequences (Oct 30 2015 release)
3. Propagation of GO annotations to protein groups
Considerations for mapping GO terms
GO mapping to domains:
Function of domain may not be function of protein
Family members can be experimentally characterised as lacking function:
P14210 - a serine protease homologue with no proteolytic activity
(proteins are reported to GOA to be blacklisted)
Broad families that are functionally diverse:
The GHMP kinase superfamily includes
- Galactokinases (EC 2.7.1.6)
- Homoserine kinases (EC 2.7.1.39)
- Mevalonate kinases (EC 2.7.1.36)
- Diphosphomevalonate decarboxylases (EC 4.1.1.33)
Number of annotations in UniProt-GOA database (June 2016)
Manual annotations*: 2,811,622
Electronic annotations: 280,313,749
* Includes manual annotations integrated from external model organism and specialist groups
Many ways to access GO
annotation data
http://www.ebi.ac.uk/QuickGO
Map-up annotations
with GO slims
Search GO terms
or proteins
Find sets of
GO annotations
Questions on how to use QuickGO?
Contact goa@ebi.ac.uk
One example: the QuickGO browser
http://www.ebi.ac.uk/QuickGO-Beta/
GO term enrichment analysis
•  What is it?
•  What can you use it for?
•  How does it actually work?
•  How can I actually do it?
•  When is it NOT a good idea to do it?
Enrichment analysis – basic principle
Sample
40%
20%
Reference
20%
20%
=> The sample is over-enriched for
GO term enrichment analysis
•  What is it?
•  Most popular type of GO analysis
•  Determines which GO terms are more often associated with a
specified list of genes/proteins compared with a control list or
rest of genome
GO term enrichment analysis
•  What can you use it for?
GO term enrichment analysis
“Our gene list contains targets for GATA1 (orange balls) and SP1
(blue balls) transcription factors (TFs). For each TF, we extract the
proportion of targets in the gene list and in the genome to
construct the contingency table. Fisher's exact test is used to
determine if there is a nonrandom association between the gene
list and the specific regulation of a TF.”
•  http://bioinfo.cipf.es/docs/renato/simple_enrichment_analysis
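The test behind this can be sketched with the standard library alone. The one-sided Fisher's exact test on a 2x2 table is the upper tail of a hypergeometric distribution, computed below. The gene counts are hypothetical, chosen so that 40% of the sample and 20% of the reference carry the term, as on the basic principle slide.

```python
from math import comb


def enrichment_p_value(population_size, population_hits, sample_size, sample_hits):
    """Probability of seeing at least `sample_hits` annotated genes in a random
    sample of `sample_size` genes drawn from the population (one-sided test)."""
    total = comb(population_size, sample_size)
    upper = min(sample_size, population_hits)
    tail = sum(
        comb(population_hits, k)
        * comb(population_size - population_hits, sample_size - k)
        for k in range(sample_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 of 100 reference genes carry the term (20%),
# and 4 of the 10 genes in the sample do (40%).
p = enrichment_p_value(population_size=100, population_hits=20,
                       sample_size=10, sample_hits=4)
print(f"p = {p:.4f}")
```

With so few genes, a doubling of the proportion is not on its own strong evidence, which is one reason the caveats later in the deck warn against running the analysis on very short lists. Real tools also correct for testing many GO terms at once, which this sketch omits.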
GO term enrichment analysis
•  How does it actually work?
•  http://geneontology.org/page/go-enrichment-analysis
•  Also useful for GO analysis in general:
•  http://geneontology.org/faq/what-minimum-information-include-functional-analysis-paper
GO term enrichment analysis
•  How can I actually do it?
•  Many tools available to do this analysis
•  User must decide which is best for their analysis
•  We’ll focus on the tool provided by the GO Consortium
•  Be aware that there are numerous third-party tools and that
they do not all use up-to-date GO data
GO term enrichment analysis
•  How do you get to the GO TE tool?
•  From front page of GO website
•  From AmiGO
http://geneontology.org
http://amigo.geneontology.org/amigo
Spinocerebellar ataxia type 28
Paola
Roncaglia
Novel biomarkers of rectal radiotherapy
Biomarker for diagnosis and prognosis
Gene expression changes in diabetes
Improved network analysis
Hands on - Dataset
•  Download http://tinyurl.com/IDs-for-enrichment
•  Go to http://geneontology.org
•  Run the enrichment analysis
Caveats
•  When can you NOT do an enrichment analysis?
•  Too few target genes/proteins
•  Genes/proteins of interest are not present in your background
set (e.g. array)
•  Genes/proteins of interest are not expressed/translated in your
sample(s)
Many gene products are associated with a
large number of descriptive, leaf GO nodes:
GO slims
…however annotations can be mapped up
to a smaller set of parent GO terms:
GO slims
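Mapping up to a slim can be sketched as walking from each annotated term to its ancestors and keeping those that belong to the slim. The tiny hierarchy reuses terms from the Cellular Component slide, with is_a and part_of collapsed into a single parent link for simplicity; the slim itself is hypothetical.

```python
# Child -> parents, relation types collapsed for simplicity.
PARENTS = {
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrial membrane": ["mitochondrion"],
    "large ribosomal subunit": ["ribosome"],
    "small ribosomal subunit": ["ribosome"],
}
# Hypothetical slim: the small set of broad terms to report.
SLIM = {"mitochondrion", "ribosome"}


def map_to_slim(term, slim=SLIM, parents=PARENTS):
    """Return the slim terms reachable from `term` (including itself)."""
    hits, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            hits.add(current)
        stack.extend(parents.get(current, []))
    return hits


for term in ("mitochondrial matrix", "small ribosomal subunit", "plasma membrane"):
    print(term, "->", sorted(map_to_slim(term)))
```

Terms with no ancestor in the slim, such as plasma membrane here, map to nothing, which is why the choice of slim terms determines how much of a dataset is summarised.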
Slim generation for industry
•  Collaboration funded by Roche
•  Need a custom GO slim for analysis of genesets of interest
•  Need to be descriptive enough
•  Without redundancy
•  Internal proprietary vocabulary – hard to maintain
•  Desire to automatically map to GO
http://www.swat4ls.org/wp-content/uploads/2015/10/SWAT4LS_2015_paper_44.pdf
ROCHE CV
GSEA with full GO GSEA with Roche CV
Courtesy Laura Badi
•  Mapping query: participant_OR_reg_participant some
cannabinoid
•  Description: “A process in which a cannabinoid
participates, or that regulates a process in which a
cannabinoid participates.”
Results
•  We have successfully mapped 84% of terms from RCV
(308/365) to OWL queries that can be used to replicate
some proportion of the original manual mapping.
•  In addition, these queries find 1000s of terms that were
missed in the original mapping.
David
Osumi-Sutherland
GO SLIM (generic)
ROCHE CV – MANUAL ONLY
ROCHE CV MANUAL + AUTO
GO slims for metagenomics
functional analysis
https://www.ebi.ac.uk/metagenomics/projects/SRP033553/samples/SRS512695/runs/SRR1045093/results/versions/3.0
Samples
comparison
BP CC MF
Samples
comparison
(detail)
BP
CC
MF
http://www.ebi.ac.uk/about/news/service-news/metagenomics-go-slim-2016
Acknowledgements
•  GO editors and developers
•  GO annotators
•  The Gene Ontology (GO) Consortium
•  Samples, Phenotypes and Ontologies team (Helen Parkinson)
•  Protein Function Content team (Claire O’Donovan)
•  Funding: EMBL-EBI, National Human Genome Research Institute
(NHGRI)
Thank you for your attention!
Contact Gene Ontology Annotation:
goa@ebi.ac.uk
Contact Gene Ontology:
http://geneontology.org/form/contact-go

Ontologies for life sciences: examples from the gene ontology

  • 1.
    Ontologies for lifesciences: examples from the Gene Ontology Melanie Courtot GO/GOA project lead mcourtot@ebi.ac.uk @mcourtot
  • 2.
  • 3.
    Cross dom ain resources . C ro ss d o m a in re s o u rc e s d g P b s y Data resourcesat EMBL-EBI Genes, genomes & variation RNA Central ArrayExpress Expression Atlas Metabolights PRIDE InterPro Pfam UniProt ChEMBL SureChEMBL ChEBI Molecular structures Protein Data Bank in Europe Electron Microscopy Data Bank European Nucleotide Archive European Variation Archive European Genome-phenome Archive Gene, protein & metabolite expression Protein sequences, families & motifs Chemical biology Reactions, interactions & pathways IntAct Reactome MetaboLights Systems BioModels Enzyme Portal BioSamples Ensembl Ensembl Genomes GWAS Catalog Metagenomics portal Europe PubMed Central BioStudies Gene Ontology Experimental Factor Ontology Literature & ontologies
  • 4.
    Different words sameconcept: example of Dyschromatopsia
  • 5.
    Search PubMed for“color blindness”
  • 6.
    Search PubMed for“Dyschromatopsia”
  • 7.
    Search PubMed for"abnormality of the eye"
  • 8.
    Thousands of sampleattributes…
  • 9.
    genomics transcriptomics proteomicsmetabolomics transcriptomics metabolomics individual experiments genomics transcriptomics proteomics metabolomics transcriptomics metabolomics individual experiments genomics transcriptomics proteomics metabolomics transcriptomics metabolomics individual experiments Data integration in times of ‘omics’ genomics transcriptomics proteomics metabolomics transcriptomics metabolomics individual experiments conducted at different times by different researchers using different equipment/approaches reporting same type of results differently
  • 10.
    Data growth isfast 12 month doubling 18 month doubling 4 month doubling 3 month doubling 100000000 1E+09 1E+10 1E+11 1E+12 1E+13 1E+14 1E+15 1E+16 2002   2004   2006   2008   2010   2012   2014   2016   bytes date EGA ENA PRIDE MetaboLights ArrayExpress Slide credit: Paul Flicek
  • 11.
    Data growth isfast 12 month doubling 18 month doubling 4 month doubling 3 month doubling 100000000 1E+09 1E+10 1E+11 1E+12 1E+13 1E+14 1E+15 1E+16 2002   2004   2006   2008   2010   2012   2014   2016   bytes date EGA ENA PRIDE MetaboLights ArrayExpress Slide credit: Paul Flicek Vast amount of data generated means vast amount of data submitted to repositories
  • 12.
    Curation - Dirtydata and the long tail 200100 sex:female gender:female disease:breast cancer frequency=2285 frequency=1288
  • 13.
    data integration [ˈdeɪtəˌɪntəˈgreɪʃən]: (computational) means to access, retrieve and analyse data sets from different sources in order to exploit them, i.e., gain new knowledge, and share that new knowledge
  • 14.
    data integration [ˈdeɪtəˌɪntəˈgreɪʃən]: (computational) means to access, retrieve and analyse data sets from different sources in order to exploit them, i.e., gain new knowledge, and share that new knowledge
  • 15.
    Standards What do theyoffer? •  uniformity and consistency in reporting data •  effective reuse, integration and mining of data •  creation of SOPs, benchmarks, quality assessment •  community cohesion
  • 16.
    What constitutes astandard? 1.  Establish your community 2.  Define community needs 3.  Define minimal information which needs to be gathered and exchanged by that community 4.  Design* an interchange format 5.  Design* domain-specific controlled vocabularies *Design = review, reuse and fill the gaps
  • 17.
  • 18.
    http://www.biosharing.org •  Many “Minimum information about a…” papers now published.
  • 19.
    Standards – XML interchange formats http://www.sbml.org
  • 20.
    Adding semantics to the data formats
  • 21.
    Inconsistency in naming of biological concepts •  Same name for different concepts •  Different names for the same concept An example: Tactition, Tactile sense and Taction all mean perception of touch; GO:0050975
  • 22.
    Sample description with semantic markup CL:CL_0000071 (blood vessel endothelial cell) obo:CHEBI_39867 (valproic acid) NCBITaxon:NCBITaxon_9606 (Homo sapiens) Curation
  • 23.
    Ontologies •  Representation of important things in a specific domain •  Describes types of entities (e.g. cells) and relations between them •  An active, formal computational artifact •  A mathematical model based on a subset of first order logic •  Tools can automatically process ontologies •  A communication tool •  Provides a dictionary for collaborators, a shared understanding •  Allows data sharing
  • 25.
    Reasoning is critical •  Prokaryotic cell and Eukaryotic cell are declared disjoint •  Fungal cell is a Eukaryotic cell •  Spore is a Fungal cell and a Prokaryotic cell ⇒ Unsatisfiability ⇒ Solution: clarify spore (sensu Mycetozoa) AND actinomycete-type spore http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
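The reasoning step on this slide can be sketched in a few lines of Python. This is a toy illustration only, not how an OWL reasoner such as ELK works internally: the class names come from the slide, while the dictionary layout and the helper names `superclasses` and `unsatisfiable_classes` are assumptions made for the example.

```python
# Toy ontology taken from the slide: class -> set of direct superclasses.
SUBCLASS_OF = {
    "Fungal cell": {"Eukaryotic cell"},
    "Spore": {"Fungal cell", "Prokaryotic cell"},
}
# Pairs of classes declared disjoint (nothing can belong to both).
DISJOINT = [("Prokaryotic cell", "Eukaryotic cell")]


def superclasses(cls, subclass_of):
    """Return cls plus every class reachable by following subclass links."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def unsatisfiable_classes(subclass_of, disjoint):
    """List classes that inherit from both members of a disjoint pair."""
    classes = set(subclass_of) | {p for ps in subclass_of.values() for p in ps}
    result = []
    for cls in sorted(classes):
        supers = superclasses(cls, subclass_of)
        if any(a in supers and b in supers for a, b in disjoint):
            result.append(cls)
    return result


print(unsatisfiable_classes(SUBCLASS_OF, DISJOINT))  # ['Spore']
```

Splitting "Spore" into two classes, each with a single lineage, makes the list empty, which mirrors the fix described on the slide.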
  • 26.
    Different words same concept: example of Dyschromatopsia We searched earlier for: -  Dyschromatopsia -  Color blindness -  Abnormality of the eye
  • 27.
    The ontology of color blindness HP:0011518 (Dichromacy) HP:0011518 (Eye) HP:0000551 (Abnormality of color vision) HP:0007641 (Dyschromatopsia) Is-a Is-a Disease-location
  • 28.
    The ontology of color blindness HP:0011518 (Dichromacy) HP:0011518 (Eye) HP:0000551 (Abnormality of color vision) HP:0007641 (Dyschromatopsia) Is-a Is-a Disease-location synonym: “Colorblindness” definition: “A form of colorblindness in which only two of the three fundamental colors can be distinguished due to a lack of one of the retinal cone pigments.”
  • 29.
    Building ontologies •  Put things into categories •  Helps organise the data •  Allows us to generalise over data •  Capture the relations between things •  Anatomical parts Biopolymer Nucleic Acid Polypeptide Enzyme DNA RNA tRNA mRNA smRNA
  • 30.
    Ontologies add value •  Smarter searching •  Data visualisation •  Data analysis •  Data integration
  • 31.
    CMPO term: graped micronucleus CMPO_0000156 Integrate file formats Integrate metadata Apply phenotype ontology Predict disease gene/biomarkers Human Disease Cell Gene knockdown
  • 32.
    Ontologies for life sciences Genotype Phenotype Sequence Proteins Gene products Transcript Pathways Cell type BRENDA tissue / enzyme source Development Anatomy Phenotype Plasmodium life cycle - Sequence types and features - Genetic Context - Molecule role - Molecular Function - Biological process - Cellular component - Protein covalent bond - Protein domain - UniProt taxonomy - Pathway ontology - Event (INOH pathway ontology) - Systems Biology - Protein-protein interaction - Arabidopsis development - Cereal plant development - Plant growth and developmental stage - C. elegans development - Drosophila development (FBdv) - Human developmental anatomy, abstract version - Human developmental anatomy, timed version - Mosquito gross anatomy - Mouse adult gross anatomy - Mouse gross anatomy and development - C. elegans gross anatomy - Arabidopsis gross anatomy - Cereal plant gross anatomy - Drosophila gross anatomy - Dictyostelium discoideum anatomy - Fungal gross anatomy (FAO) - Plant structure - Maize gross anatomy - Medaka fish anatomy and development - Zebrafish anatomy and development - NCI Thesaurus - Mouse pathology - Human disease - Cereal plant trait - PATO (attribute and value) - Mammalian phenotype - Human phenotype - Habronattus courtship - Loggerhead nesting - Animal natural history and life history - eVOC (Expressed Sequence Annotation for Humans)
  • 34.
    Open Biological and Biomedical Ontologies (OBO) A subset of biological and biomedical ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure … •  tight connection to the biomedical basic sciences •  compatibility •  interoperability, common relations •  formal robustness •  support for logic-based reasoning
  • 35.
  • 36.
    Building metadata (& ontology) rich resources •  We build tools for semantic enrichment and alignment •  Interoperability toolkit •  Microservices based architecture •  Technology-agnostic •  Pushing boundaries of ontology “embedding”
  • 37.
    Raw Data to Explicit Knowledge Data Exploration and Cleanup Data structuring Ontology Annotation Data cleaning and mapping Ontology building Webulous OxO mapping service
  • 38.
    Searching for ontology terms: the EBI Ontology Lookup Service •  for searching and visualizing >140 ontologies from the biomedical domain •  includes (among others): •  Gene Ontology •  OBO Relations ontology •  Evidence ontology •  Pathogen Transmission Ontology •  Symptom Ontology •  Basic Formal Ontology
  • 39.
    Ontology Lookup Service • Ontology search engine •  Ontology visualisation •  Powerful RESTful API •  Open source project •  Generic infrastructure (can load any ontology represented in OWL) https://github.com/EBISPOT/OLS Repository of over 150 biomedical ontologies (4.5 million terms, 11 million relations) http://www.ebi.ac.uk/ols
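As a rough sketch of how the RESTful API might be used from Python, the snippet below builds a search URL and pulls identifiers and labels out of a response. The `/api/search` path, the `q`, `ontology` and `rows` parameters and the `response.docs` layout with `obo_id` and `label` fields are assumptions based on the public OLS service and should be checked against its current documentation; the canned payload stands in for a real network call.

```python
import json
from urllib.parse import urlencode

# Assumed search endpoint under the OLS base URL given on the slide.
OLS_SEARCH = "http://www.ebi.ac.uk/ols/api/search"


def build_search_url(query, ontology=None, rows=5):
    """Build a search URL; parameter names are assumptions, see lead-in."""
    params = {"q": query, "rows": rows}
    if ontology:
        params["ontology"] = ontology
    return f"{OLS_SEARCH}?{urlencode(params)}"


def extract_terms(payload):
    """Return (identifier, label) pairs from a Solr-style search response."""
    docs = payload.get("response", {}).get("docs", [])
    return [(doc.get("obo_id"), doc.get("label")) for doc in docs]


# Canned response used instead of a live request (hypothetical payload).
SAMPLE_RESPONSE = json.loads(
    '{"response": {"docs": [{"obo_id": "GO:0005739", "label": "mitochondrion"}]}}'
)

print(build_search_url("mitochondrion", ontology="go"))
print(extract_terms(SAMPLE_RESPONSE))
```

To query the live service, the URL could be fetched with `urllib.request.urlopen` and the body passed through `json.load` before calling `extract_terms`.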
  • 40.
  • 41.
    Mapping data to ontology terms •  Sample attributes and variables are mapped to the EFO ontology
  • 42.
    Mapping data to ontology terms •  Zooma automatically annotates sample attributes and variables with ontology classes
  • 43.
    Mapping data to ontology terms Information supplied as part of a search The source of this mapping ZOOMA contains a linked data repository of annotation knowledge and highly annotated data
  • 44.
    Expression Atlas: source of mappings •  Atlas automated pipeline runs against Zooma, then curators: •  Check that the automatic mappings are all correct •  Create a list of new mappings that should be added to Zooma
  • 45.
    Expression Atlas: curation What happens when we need a term that is not in EFO? •  Webulous Google Add-On •  Connect to the Webulous server from Google Spreadsheets •  Load templates from the Webulous server •  Submit populated templates back to the server for processing
  • 46.
    Adding diseases to EFO using Webulous •  Design pattern templates can be loaded into Google Sheets
  • 47.
    Adding diseases to EFO using Webulous •  A Webulous template specifies a series of fields (columns) for the input data •  Some fields only allow values from a list of ontology terms •  This data validation provides the user with convenient term autocomplete when entering data into a cell
  • 48.
  • 49.
    Raw Data to Explicit Knowledge Data Exploration and Cleanup Data structuring Ontology Annotation Data cleaning and mapping Ontology building Webulous OxO mapping service
  • 50.
    BioSolr “BioSolr aims to significantly advance the state of the art with regards to indexing and querying biomedical data with freely available open source software” flaxsearch/BioSolr Solr documents with ontology annotation Enriched Solr with ontology content (synonyms, structure, relations) Solr/Elastic plugin Query expansion and hierarchical faceting
  • 51.
    Which other diseases are associated with PDE4D? View diseases grouped in therapeutic areas or organised in a tree View more information about PDE4D Filter by therapeutic area
  • 52.
  • 53.
    Publishing biological data as Linked Open Data •  The EBI RDF platform •  Released Nov 2013 •  Currently over 16 billion RDF triples •  Datasets updated ~ quarterly LOD diagram August 2014 Jupp et al (2013). The EBI RDF Platform: Linked Open Data for the Life Sciences. Bioinformatics.
  • 54.
    RDF Platform Integration points [Diagram of integration points between datasets: Gene Expression Atlas, Ensembl (Gene, RNA transcript, Protein; via identifiers.org/ensembl), UniProt (uniprot:Protein), Gene Ontology (GO BP, GO MF, GO CC), Organisms (Organism/taxon), ChEMBL (Assay (?), Target (?)), BioModels (SBMLModel, Reaction, Species, Compartment), ChEBI, Reactome (Pathway), SBO and PDB (Structure). Linking predicates shown: rdfs:seeAlso (not currently linking to identifiers.org but soon), sio:'is attribute of' (sio:SIO_000011), uniprot:classifiedWith, uniprot:organism, uniprot:transcribedFrom, uniprot:translatedTo, chembl:hasTarget, bq:is, bq:isVersionOf, bq:isHomologTo, bq:hasPart, bq:occursIn; discretized differential gene expression ratio (sio:SIO_001078)] Relationships within BioModels can be found at https://github.com/sarala/ricordo-rdfconverter/wiki/SBML-RDF-Schema
  • 55.
    RDF Platform – lessons learned Successes •  Novel queries possible over EBI datasets •  Production quality RDF releases •  Community of users •  Highly available public SPARQL endpoints •  500+ users (10-50 million hits per month) •  Lots of interest •  Catalyst for new RDF efforts Lessons •  Public SPARQL endpoints problematic •  Query federation not performant •  Inference support limited •  Not scalable for all EBI data e.g. Variation, ENA •  Lack of expertise in service teams •  Too much overhead to get started quickly in this space
  • 56.
    An example: The Gene Ontology and Gene Ontology Annotation
  • 57.
  • 58.
    The Gene Ontology •  A way to capture biological knowledge for individual gene products in a written and computable form •  A set of concepts and their relationships to each other arranged as a hierarchy, from less specific concepts to more specific concepts www.ebi.ac.uk/QuickGO
  • 59.
    The Gene Ontology http://geneontology.org/ •  Collaborative effort to address the need for consistent descriptions of genes/gene products across databases •  Use of GO terms by collaborating databases facilitates uniform queries across all of them
  • 60.
    Aims of the GO project •  compile the ontologies •  >40000 terms •  constantly increasing and improving •  annotate gene products using the terms •  provide public resource of data and tools •  regular releases of annotations •  tools for browsing/querying annotations and editing the GO
  • 61.
    The GO editorial office at EMBL-EBI •  Part of the Samples, Phenotypes and Ontologies team (SPOT) •  Contributes to development of the Gene Ontology •  Specific areas of interest: autophagy, synapse… •  Answers user requests •  New terms, modifications, updates •  Help support •  Curator requests GO editorial office at the EBI: Paola Roncaglia, David Osumi-Sutherland
  • 62.
    Develop the ontology • An OWL ontology of >41,000 classes •  biological process, cellular component, molecular function •  > 14,000 imported classes (CL, Uberon, ChEBI, NCBI_tax) •  >136,000 logical axioms, including: •  ~72,000 subClassOf axioms between named GO classes •  ~41,000 simple existential restrictions (subClassOf R some C) •  EL expressivity => fast, scalable reasoning (with ELK) https://www.cs.ox.ac.uk/isg/tools/ELK/
  • 63.
    Ontology structure •  Hierarchical: terms can have more than one parent and more than one child •  Terms are linked by relationships: is_a, part_of, regulates (and +/- regulates), occurs_in, has_part •  These relationships allow for complex analysis of large datasets www.ebi.ac.uk/QuickGO
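A small Python sketch may help make the "more than one parent" point concrete. The four nucleosome edges follow the quiz answers later in this deck; the extra "chromatin part_of chromosome" edge and the function names are added here purely for illustration and are not taken from the slides.

```python
# (child, relation, parent) edges of a tiny GO-like graph.
# The nucleosome edges follow the deck; the last edge is illustrative only.
EDGES = [
    ("nucleosome", "part_of", "chromatin"),
    ("nucleosome", "is_a", "chromosomal part"),
    ("nucleosome", "is_a", "DNA packaging complex"),
    ("nucleosome", "is_a", "protein-DNA complex"),
    ("chromatin", "part_of", "chromosome"),
]


def direct_parents(term, edges=EDGES):
    """Return sorted (relation, parent) pairs for the direct parents of term."""
    return sorted((rel, parent) for child, rel, parent in edges if child == term)


def ancestors(term, edges=EDGES):
    """Return every term reachable from term by following any relation upwards."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        for child, _rel, parent in edges:
            if child == current and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


for relation, parent in direct_parents("nucleosome"):
    print(f"nucleosome {relation} {parent}")
print(sorted(ancestors("nucleosome")))
```

Because a term can be reached along several paths, the traversal keeps a set of visited terms rather than assuming a tree.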
  • 64.
    Biological Process: what does a gene product do? •  cell division •  transcription A commonly recognised series of events
  • 65.
    Molecular Function: how does a gene product act? •  insulin binding •  insulin receptor activity •  glucose-6-phosphate isomerase activity
  • 66.
    Cellular Component: where is a gene product located? •  plasma membrane •  mitochondrion •  mitochondrial membrane •  mitochondrial matrix •  mitochondrial lumen •  ribosome •  large ribosomal subunit •  small ribosomal subunit
  • 67.
    Example GO annotation – cytochrome c •  molecular function: electron carrier activity GO:0009055 •  biological process: oxidation-reduction process GO:0055114 •  cellular component: mitochondrion GO:0005739 https://www.ebi.ac.uk/QuickGO/GProtein?ac=P99999
  • 68.
    Anatomy of a GO term •  Unique identifier •  Term name •  Definition •  Synonyms •  Cross-references
  • 69.
  • 70.
    What is the GO ID for the term mitochondrial chromosome?
  • 71.
    What is the GO ID for the term mitochondrial chromosome? GO:0000262
  • 72.
    What are the four direct parents of the term nucleosome?
  • 73.
    What are the four direct parents of the term nucleosome? •  Chromatin •  Chromosomal part •  DNA packaging complex •  Protein-DNA complex
  • 74.
    What types of relationships are there between the term nucleosome and its direct parents?
  • 75.
    What types of relationships are there between the term nucleosome and its direct parents? •  part_of chromatin •  is_a for the others
  • 76.
    Building the GO •  The GO editorial team •  Submission via GitHub, https://github.com/geneontology/ •  Submissions via TermGenie, http://go.termgenie.org •  ~80% of terms are now created this way
  • 77.
    Annotate gene products GOA Database: external annotation groups (25), manual annotation by curators (125), electronic prediction methods (11)
  • 79.
  • 80.
    Manual annotations •  Time-consuming process producing lower numbers of annotations (~2,800 taxa covered) •  More specific GO terms •  Manual annotation is essential for creating predictions
  • 81.
    UniProt-Gene Ontology Annotation (UniProt-GOA) project at the EMBL-EBI http://www.ebi.ac.uk/GOA •  Part of the Protein Function content team •  Largest open-source contributor of annotations to GO •  Focuses on human, but provides annotations for more than 441,000 species •  Curators annotate manually, and the project collates manual and electronic annotations from across the community UniProt-GOA project at the EBI: Aleksandra Shypitsyna, Elena Speretta, Penelope Garmiri, Tony Sawford
  • 82.
    A GO annotation is … a statement that a gene product … Example: Accession: P00505; Name: GOT2; GO ID: GO:0004069; GO term name: aspartate transaminase activity; Reference: PMID:2731362; Evidence code: IDA
  • 83.
    A GO annotation is … a statement that a gene product 1. has a particular molecular function or is involved in a particular biological process or is located within a certain cellular component Example: Accession: P00505; Name: GOT2; GO ID: GO:0004069; GO term name: aspartate transaminase activity; Reference: PMID:2731362; Evidence code: IDA
  • 84.
    A GO annotation is … a statement that a gene product 1. has a particular molecular function or is involved in a particular biological process or is located within a certain cellular component 2. as described in a particular reference Example: Accession: P00505; Name: GOT2; GO ID: GO:0004069; GO term name: aspartate transaminase activity; Reference: PMID:2731362; Evidence code: IDA
  • 85.
    A GO annotation is … a statement that a gene product 1. has a particular molecular function or is involved in a particular biological process or is located within a certain cellular component 2. as described in a particular reference 3. as determined by a particular method Example: Accession: P00505; Name: GOT2; GO ID: GO:0004069; GO term name: aspartate transaminase activity; Reference: PMID:2731362; Evidence code: IDA
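The three parts of an annotation (what, according to which reference, determined how) map naturally onto a small record type. The sketch below is one possible in-memory representation, not the GOA file format; the class name `GOAnnotation`, its field names and the `describe` helper are choices made for this example, while the values come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOAnnotation:
    """One GO annotation: gene product + GO term + reference + evidence."""
    accession: str       # gene product identifier, e.g. a UniProt accession
    name: str            # gene product name
    go_id: str           # 1. the function, process or component
    go_term: str
    reference: str       # 2. where the statement is described
    evidence_code: str   # 3. how it was determined

    def describe(self):
        """Render the annotation as a readable sentence."""
        return (
            f"{self.name} ({self.accession}) is annotated to "
            f"{self.go_term} [{self.go_id}], as described in "
            f"{self.reference}, with evidence code {self.evidence_code}"
        )


# The example from the slide.
got2 = GOAnnotation(
    accession="P00505",
    name="GOT2",
    go_id="GO:0004069",
    go_term="aspartate transaminase activity",
    reference="PMID:2731362",
    evidence_code="IDA",
)
print(got2.describe())
```

Making the record frozen means annotations can be placed in sets, which is convenient when merging manual and electronic sources and removing duplicates.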
  • 86.
    Tracking provenance: Experimental data, Computational analysis, Author statements / curator inference (+ Inferred from electronic annotations) http://www.evidenceontology.org/
  • 87.
  • 88.
  • 89.
  • 90.
    FIG. 2. Human Nbp35 is a cytosolic protein. (A) EGFP fluorescence of a HeLa cell transiently transfected with a vector encoding a huNbp35-EGFP fusion protein (right) in comparison to the endogenous autofluorescence (AFL) of control cells (left). (C) Sub-cellular localization of huNbp35 by cell fractionation. […] HuNbp35 exclusively colocalizes with tubulin in the cytosolic fraction, but not with mitochondrial aconitase (mtAconitase) present in the membrane fraction.
  • 91.
    Human Nbp35 is a cytosolic protein. Protein: ?; GO term: ?; Supporting evidence: ?
  • 92.
    Human Nbp35 is a cytosolic protein. •  Find the correct UniProt entry http://www.uniprot.org
  • 93.
    Human Nbp35 is a cytosolic protein.
  • 94.
    Human Nbp35 is a cytosolic protein. Protein: NUBP1; GO term: ?; Supporting evidence: ?
  • 95.
    Human Nbp35 is a cytosolic protein. •  Find the right GO term https://www.ebi.ac.uk/QuickGO/
  • 96.
    Human Nbp35 is a cytosolic protein.
  • 97.
    Human Nbp35 is a cytosolic protein. Protein: NUBP1; GO term: GO:0005829; Supporting evidence: ?
  • 98.
    Human Nbp35 is a cytosolic protein. •  Evidence: •  Fig 2A Immunofluorescence and/or •  Fig 2C subcellular fractionation
  • 99.
    GO evidence codes [small excerpt] •  Author statements (Abstract & Introduction): TAS, Traceable author statement; NAS, Non-traceable author statement •  Experimental evidence (Methods & Results): IDA, Inferred from Direct Assay; IMP, Inferred from Mutant Phenotype; IPI, Inferred from Physical Interaction
  • 100.
    Human Nbp35 is a cytosolic protein. Protein: NUBP1; GO term: GO:0005829; Supporting evidence: IDA
  • 102.
    Electronic Annotations •  Quick way of producing large numbers of annotations •  Annotations use less-specific GO terms •  Only source of annotation for ~438,000 non-model organism species
  • 103.
    Electronic Annotations •  Quick way of producing large numbers of annotations •  Annotations use less-specific GO terms •  Only source of annotation for ~438,000 non-model organism species orthology taxon constraints
  • 104.
    Broad taxonomic coverage We provide annotation files for well-studied species… …as well as less well-studied species that have: •  Complete proteome •  >25% GO annotation coverage We have annotations for species that may not have a dedicated curation effort; e.g. for 1,400 Solanaceae species we have ~360,000 annotations for ~64,000 proteins
  • 105.
    Electronic annotation methods 1. Mapping of external concepts to GO terms, e.g. InterPro2GO, UniProt Keyword2GO, Enzyme Commission2GO Example target term: GO:0004715 ; non-membrane spanning protein tyrosine kinase activity
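Mapping-based electronic annotation amounts to a table lookup, which the following Python sketch illustrates. The GO term is the one shown on the slide; pairing it with `EC:2.7.10.2` is an assumption made for the example and should be checked against the current EC2GO file, and the protein accession `PROT_A`, the table name `CONCEPT2GO` and the function name are hypothetical. The IEA code stands for "Inferred from Electronic Annotation".

```python
# External concept -> list of (GO ID, GO term name).
# The EC key is an assumed example; verify against the real mapping file.
CONCEPT2GO = {
    "EC:2.7.10.2": [
        ("GO:0004715", "non-membrane spanning protein tyrosine kinase activity"),
    ],
}


def electronic_annotations(protein_concepts, concept2go=CONCEPT2GO):
    """Turn external concepts attached to proteins into IEA annotation rows.

    protein_concepts maps a protein accession to the external concept IDs
    (EC numbers, keywords, domain IDs...) already assigned to it.
    """
    rows = []
    for accession, concepts in protein_concepts.items():
        for concept in concepts:
            for go_id, go_term in concept2go.get(concept, []):
                # Keep the concept as provenance for the inferred annotation.
                rows.append((accession, go_id, go_term, "IEA", concept))
    return rows


# PROT_A is a hypothetical accession; UNMAPPED:1 has no GO mapping.
for row in electronic_annotations({"PROT_A": ["EC:2.7.10.2", "UNMAPPED:1"]}):
    print(row)
```

Concepts without a mapping are skipped silently here; a production pipeline would more likely log them so the mapping table can be extended.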
  • 106.
    Electronic annotation methods 2. Automatic transfer of manual annotations to orthologs, using Ensembl Compara •  e.g. Human to Macaque, Mouse, Dog, Cow, Guinea Pig, Chimpanzee, Rat, Chicken ...and more •  e.g. Arabidopsis to Rice, Brachypodium, Maize, Poplar, Grape …and more Annotations are high-quality and have an explanation of the method (GO_REF) http://www.geneontology.org/cgi-bin/references.cgi
  • 107.
    An example: annotations from the source… …are projected on to the target
    Source:
    ACCESSION   GO ID        GO ASPECT   GO TERM
    P04637      GO:0047485   F           protein N-terminus binding
    P04637      GO:0051087   F           chaperone binding
    P04637      GO:0051721   F           protein phosphatase 2A binding
    P04637      GO:0000733   P           DNA strand renaturation
    P04637      GO:0006289   P           nucleotide-excision repair
    P04637      GO:0006355   P           regulation of transcription, DNA-templated
    P04637      GO:0006461   P           protein complex assembly
    Target:
    ACCESSION   GO ID        GO ASPECT   GO TERM
    Q549C9      GO:0047485   F           protein N-terminus binding
    Q549C9      GO:0051087   F           chaperone binding
    Q549C9      GO:0051721   F           protein phosphatase 2A binding
    Q549C9      GO:0000733   P           DNA strand renaturation
    Q549C9      GO:0006289   P           nucleotide-excision repair
    Q549C9      GO:0006355   P           regulation of transcription, DNA-templated
    Q549C9      GO:0006461   P           protein complex assembly
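The projection shown in this example can be sketched as a copy of each source annotation onto every listed ortholog. This is a simplified illustration: it assumes P04637 is the source and Q549C9 the target, as the slide order suggests, uses only two of the seven rows, and ignores the filtering a real pipeline applies (for instance which evidence codes may be transferred). The names `ORTHOLOGS` and `project_annotations` are made up for the example.

```python
# A subset of the source annotations from the slide:
# (accession, GO ID, aspect, GO term name).
SOURCE_ANNOTATIONS = [
    ("P04637", "GO:0047485", "F", "protein N-terminus binding"),
    ("P04637", "GO:0000733", "P", "DNA strand renaturation"),
]

# Source accession -> ortholog accessions (pairing assumed from the slide).
ORTHOLOGS = {"P04637": ["Q549C9"]}


def project_annotations(annotations, orthologs):
    """Copy each annotation from a source protein to each of its orthologs."""
    projected = []
    for accession, go_id, aspect, term in annotations:
        for target in orthologs.get(accession, []):
            projected.append((target, go_id, aspect, term))
    return projected


for row in project_annotations(SOURCE_ANNOTATIONS, ORTHOLOGS):
    print(*row, sep="\t")
```

Proteins with no entry in the ortholog table simply produce no projected rows.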
  • 108.
    Electronic annotation methods 3. Propagation of GO annotations to protein groups InterPro: source of ~93 million GO mappings for ~30 million distinct UniProtKB sequences (Oct 30 2015 release)
  • 109.
    Considerations for mapping GO terms •  GO mapping to domains: function of domain may not be function of protein •  Family members can be experimentally characterised as lacking function: P14210 - a serine protease homologue with no proteolytic activity (proteins are reported to GOA to be blacklisted) •  Broad families that are functionally diverse: the GHMP kinase superfamily includes - Galactokinases (EC 2.7.1.6) - Homoserine kinases (EC 2.7.1.39) - Mevalonate kinases (EC 2.7.1.36) - Diphosphomevalonate decarboxylases (EC 4.1.1.33)
  • 110.
    Number of annotations in UniProt-GOA database (June 2016) •  Manual annotations*: 2,811,622 •  Electronic annotations: 280,313,749 * Includes manual annotations integrated from external model organism and specialist groups
  • 111.
    Many ways to access GO annotation data
  • 112.
    One example: the QuickGO browser http://www.ebi.ac.uk/QuickGO •  Search GO terms or proteins •  Find sets of GO annotations •  Map-up annotations with GO slims Questions on how to use QuickGO? Contact goa@ebi.ac.uk
  • 113.
  • 115.
    GO term enrichment analysis •  What is it? •  What can you use it for? •  How does it actually work? •  How can I actually do it? •  When is it NOT a good idea to do it?
  • 116.
    Enrichment analysis – basic principle Sample 40% 20%
  • 117.
  • 118.
    Enrichment analysis Sample 40% 20% Reference 20% 20% => The sample is over-enriched for the category found at 40% in the sample versus 20% in the reference
  • 119.
    GO term enrichment analysis •  What is it? •  Most popular type of GO analysis •  Determines which GO terms are more often associated with a specified list of genes/proteins compared with a control list or rest of genome
  • 120.
    GO term enrichment analysis •  What can you use it for?
  • 121.
    GO term enrichment analysis “Our gene list contains targets for GATA1 (orange balls) and SP1 (blue balls) transcription factors (TFs). For each TF, we extract the proportion of targets in the gene list and in the genome to construct the contingency table. Fisher's exact test is used to determine if there is a nonrandom association between the gene list and the specific regulation of a TF.” •  http://bioinfo.cipf.es/docs/renato/simple_enrichment_analysis
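The test mentioned in the quote can be written out directly. The sketch below computes a one-sided Fisher's exact test through the hypergeometric tail, using only the standard library; the function name and argument names are choices made for this example, the input numbers are invented to echo the 40% versus 20% picture from the earlier slides, and real analyses also need multiple-testing correction across GO terms, which is not shown.

```python
from math import comb


def enrichment_p_value(hits_in_list, list_size, hits_in_genome, genome_size):
    """One-sided Fisher's exact test via the hypergeometric tail.

    Returns the probability of seeing at least hits_in_list annotated genes
    when drawing list_size genes at random from a genome of genome_size
    genes, of which hits_in_genome carry the annotation.
    """
    total = comb(genome_size, list_size)
    upper = min(list_size, hits_in_genome)
    tail = sum(
        comb(hits_in_genome, i) * comb(genome_size - hits_in_genome, list_size - i)
        for i in range(hits_in_list, upper + 1)
    )
    return tail / total


# Invented numbers: 4 of 10 listed genes (40%) carry the term,
# against 20 of 100 genes (20%) in the reference.
p = enrichment_p_value(hits_in_list=4, list_size=10, hits_in_genome=20, genome_size=100)
print(f"p = {p:.4f}")
```

With a list this small the difference between 40% and 20% need not be statistically convincing, which connects to the "too few target genes" caveat later in the deck.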
  • 122.
  • 123.
    GO term enrichment analysis •  How does it actually work? •  http://geneontology.org/page/go-enrichment-analysis •  Also useful for GO analysis in general: http://geneontology.org/faq/what-minimum-information-include-functional-analysis-paper
  • 124.
    GO term enrichment analysis •  How can I actually do it? •  Many tools available to do this analysis •  User must decide which is best for their analysis •  We’ll focus on the tool provided by the GO Consortium •  Be aware that there are numerous third-party tools and that they do not all use up-to-date GO data
  • 126.
    GO term enrichment analysis •  How do you get to the GO TE tool? •  From front page of GO website •  From AmiGO
  • 127.
  • 128.
  • 129.
  • 130.
  • 131.
    Spinocerebellar ataxia type 28 Paola Roncaglia
  • 132.
    Novel biomarkers of rectal radiotherapy
  • 133.
  • 134.
  • 135.
  • 136.
    Hands on - Dataset •  Download http://tinyurl.com/IDs-for-enrichment •  Go to http://geneontology.org •  Run the enrichment analysis
  • 137.
    Caveats •  When can you NOT do an enrichment analysis? •  Too few target genes/proteins •  Genes/proteins of interest are not present in your background set (e.g. array) •  Genes/proteins of interest are not expressed/translated in your sample(s)
  • 138.
  • 139.
    GO slims Many gene products are associated with a large number of descriptive, leaf GO nodes
  • 140.
    GO slims …however annotations can be mapped up to a smaller set of parent GO terms
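Mapping up to a slim can be sketched as walking from each annotated term towards the root and recording the slim terms met on the way. The term names below come from the cellular component slide earlier in the deck, but the parent links, the choice of slim terms and the protein names `protA` and `protB` are a toy setup assumed for illustration; relation types (is_a, part_of) are deliberately collapsed into a single parent link.

```python
from collections import Counter

# Toy graph: term -> set of parent terms (relation types collapsed).
PARENTS = {
    "mitochondrial matrix": {"mitochondrion"},
    "mitochondrial membrane": {"mitochondrion"},
    "large ribosomal subunit": {"ribosome"},
    "mitochondrion": {"cellular component"},
    "ribosome": {"cellular component"},
}
SLIM = {"mitochondrion", "ribosome"}  # assumed slim, for illustration


def map_to_slim(term, parents=PARENTS, slim=SLIM):
    """Return the slim terms equal to, or ancestors of, the given term."""
    hits, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            hits.add(current)
        stack.extend(parents.get(current, ()))
    return hits


def slim_counts(annotations, parents=PARENTS, slim=SLIM):
    """Count proteins per slim term; each protein counts once per slim term."""
    counts = Counter()
    for terms in annotations.values():
        slim_terms = set()
        for term in terms:
            slim_terms |= map_to_slim(term, parents, slim)
        counts.update(slim_terms)
    return counts


# Hypothetical proteins annotated to specific leaf terms.
ANNOTATIONS = {
    "protA": ["mitochondrial matrix", "mitochondrial membrane"],
    "protB": ["large ribosomal subunit"],
}
print(dict(slim_counts(ANNOTATIONS)))
```

Counting each protein once per slim term avoids inflating a category when a protein has several specific annotations under the same parent.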
  • 141.
    Slim generation for industry •  Collaboration funded by Roche •  Need a custom GO slim for analysis of genesets of interest •  Need to be descriptive enough •  Without redundancy •  Internal proprietary vocabulary – hard to maintain •  Desire to automatically map to GO http://www.swat4ls.org/wp-content/uploads/2015/10/SWAT4LS_2015_paper_44.pdf
  • 142.
    ROCHE CV GSEA with full GO GSEA with Roche CV Courtesy Laura Badi
  • 143.
    •  Mapping query: participant_OR_reg_participant some cannabinoid •  Description: “A process in which a cannabinoid participates, or that regulates a process in which a cannabinoid participates.”
  • 144.
    Results •  We have successfully mapped 84% of terms from RCV (308/365) to OWL queries that can be used to replicate some proportion of the original manual mapping. •  In addition, these queries find 1000s of terms that were missed in the original mapping. David Osumi-Sutherland
  • 145.
  • 146.
    ROCHE CV – MANUAL ONLY
  • 147.
  • 148.
    GO slims for metagenomics functional analysis
  • 149.
  • 150.
  • 151.
  • 152.
  • 153.
    Acknowledgements •  GO editors and developers •  GO annotators •  The Gene Ontology (GO) Consortium •  Samples, Phenotypes and Ontologies team (Helen Parkinson) •  Protein Function Content team (Claire O’Donovan) •  Funding: EMBL-EBI, National Human Genome Research Institute (NHGRI)
  • 154.
    Thank you for your attention! Contact Gene Ontology Annotation: goa@ebi.ac.uk Contact Gene Ontology: http://geneontology.org/form/contact-go