Integrating Pathway
Databases with Gene
Ontology Causal Activity
Models
Benjamin Good
Gene Ontology Consortium
Lawrence Berkeley National Labs
@bgood
Rocky Bioinformatics Conference
Aspen, CO, USA, Dec. 8, 2018
Gene
Ontology
GO Causal
Activity Models
Pathways to
GO-CAMs
Causal
Inference
Trail Talk
Map
GO Mission
“The mission of the GO Consortium is
to develop a comprehensive,
computational model of biological
systems, ranging from the molecular to
the organism level, across the
multiplicity of species in the tree of life.”
http://geneontology.org
The GO was created to meet
the needs of the genome era
1998 birth
of the GO
Lathe, W., Williams, J., Mangan, M. & Karolchik, D. (2008) Genomic
Data Resources: Challenges and Promises. Nature Education 1(3):2
[Figure: base pairs in GenBank and databases in the Nucleic Acids Research database issue, 1982 to 2008]
The Tower of Babel by Pieter Bruegel the Elder
1563 (image from Wikipedia)
Founders recognized
that gene function is
largely conserved
across species
Can use knowledge of
one genome to help
understand the next
genome
But you must speak a
common language
Genes have the
common language of
DNA
Fly
Human
Gene Ontology:
common language of gene
function
Allows
knowledge of
gene function to
be organized
and re-used
across
databases
The Gene Ontology is
represented as a directed graph
Image courtesy of Chris Mungall
Ontology
- 45k terms
- 106k edges
Biological
Process
Cellular
Component
Molecular
Function
Annotations shared and
managed in a flatfile
NEDD4 Ubiquitin-protein
ligase activity
pmid:770007 ECO:IDA
GO Knowledge Base
2018
More than 140,000 published papers used to create …
More than 750,000 experimentally supported
annotations ...
Used to infer more than 7,000,000 functional annotations
for more than 3000 organisms.
22K citations of original paper.
GO Consortium. (2018) The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. gky1055
Use of the GO
Gene Set
analysis
dominates
usage.
See next talk by
Judy Blake!
Experiment
Ranked genes
GO analysis
Insight
Gene
Ontology
GO Causal
Activity Models
Pathways to
GO-CAMs
Causal
Inference
Trail Talk
Map
45,000 ontology
terms...
• Are hard to manage and hard to use.
• One reason the GO has grown so large is the
flatfile annotation model
• In order to indicate a new function, you need a new
term
• No way to say “Gene X has function Y in context
C”
• So you get a new term for each Y:C combination
(and C itself is complex)
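A rough sketch of why pre-composed terms multiply, using made-up function and context lists (the names below are illustrative assumptions, not real GO terms). With a flat model every function:context pair needs its own term; with a contextual annotation the two vocabularies stay separate.

```python
from itertools import product

# Hypothetical vocabularies, for illustration only.
functions = ["kinase activity", "ligase activity", "channel activity"]
contexts = ["in nucleus", "in response to UV", "in cardiac muscle contraction"]


def precomposed_terms(functions, contexts):
    """Flat model: one new ontology term per function:context combination."""
    return [f"{f} {c}" for f, c in product(functions, contexts)]


def contextual_annotation(gene, function, context):
    """Contextual model: reuse existing terms and state the context separately."""
    return {"gene": gene, "function": function, "context": context}


flat_terms = precomposed_terms(functions, contexts)
print(len(flat_terms))                    # terms needed by the flat model
print(len(functions) + len(contexts))     # terms needed when context is separate
print(contextual_annotation("Gene X", functions[1], contexts[0]))
```

The flat count grows as the product of the two lists, while the contextual count grows as their sum.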
Building
monsters
GO:0086094
positive regulation of ryanodine-sensitive calcium-release channel
activity by adrenergic receptor signaling pathway involved in positive
regulation of cardiac muscle contraction
GO:0010768
negative regulation of transcription from RNA polymerase II promoter in
response to UV-induced DNA damage
By Philippe Semeria
Relationship between gene and
term unspecified
NEDD4 GO:0006298
?
GO Causal Activity
Model (GO-CAM)
• Activity Flow model of gene function
• Meant to capture how gene products
work together in specific contexts to
enact biological processes
• Annotations are expressed as OWL
models
• Look a lot like pathway
representations
Ontology of a GO-CAM
Classes from GO (and other OBO ontologies) are used to
describe the types of nodes in the model.
Relations from the Relation Ontology are used to
describe the relations linking the nodes.
Evidence classes from the Evidence Code Ontology
justify and provide provenance for the asserted
relationships
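One way to picture these three ingredients is as typed nodes plus evidenced edges. The sketch below is an assumption about how such a structure could be held in memory, not the actual OWL serialization; the class and field names are hypothetical, and the evidence and reference values are placeholders borrowed from the annotation example earlier in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str      # local identifier of the instance (hypothetical)
    type_curie: str   # class from GO or another source
    label: str


@dataclass(frozen=True)
class Edge:
    subject: str          # node_id of the source node
    relation_curie: str   # relation from the Relation Ontology
    relation_label: str
    target: str           # node_id of the target node
    evidence: str         # class from the Evidence Code Ontology
    reference: str        # provenance, e.g. a publication


nodes = {
    "mf1": Node("mf1", "GO:0004842", "ubiquitin-protein transferase activity"),
    "gp1": Node("gp1", "HGNC:7727", "NEDD4"),
    "cc1": Node("cc1", "GO:0005634", "nucleus"),
}

# Evidence and reference values are illustrative placeholders.
edges = [
    Edge("mf1", "RO:0002333", "enabled by", "gp1", "ECO:0000314", "PMID:770007"),
    Edge("mf1", "BFO:0000066", "occurs in", "cc1", "ECO:0000314", "PMID:770007"),
]


def describe(edge, nodes):
    """Render one evidenced edge as a readable line."""
    subject = nodes[edge.subject].label
    target = nodes[edge.target].label
    return f"{subject} --{edge.relation_label}--> {target} [{edge.evidence}]"


for edge in edges:
    print(describe(edge, nodes))
```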
OWL
• W3C standard Web Ontology Language
• https://www.w3.org/OWL/
• A Description Logic
• General-purpose tools exist for editing and reasoning with
it.
• Reasoning = automatic inference of unstated,
entailed relationships
Dr. McGuinness: involved in creating OWL
From unconnected
gene annotations to
models of function
NEDD4 Ubiquitin-protein ligase activity
NEDD4 Nucleus
GO-CAM
Molecular Function
Node (Folded
View)
NEDD4 Ubiquitin-protein ligase activity
NEDD4 Nucleus
From unconnected
gene annotations to
models of function
NEDD4 Ubiquitin-protein ligase activity
NEDD4 Nucleus
NEDD4
Ubiquitin-dependent protein
catabolic process
GO-CAM
NEDD4
NEDD4 Ubiquitin-protein ligase activity
NEDD4 Nucleus
NEDD4 Ubiquitin-dependent protein
catabolic process
NEDD4
Cellular response to DNA
damage stimulus
NEDD4 Negative regulation of
transcription from RNA ...
OWL Inference
example
?Y
Protein Complex
DNA-Directed
Polymerase Activity
RO: Enabled by
?X
DNA
polymerase
Complex
...
OWL
axioms
GO-CAM model
(OWL instances)
RDF:type
rdfs:subClassOf
RDF:type
RO: Capable of
DNA Polymerase
Activity
RDF:type
Rule:
?X type protein-containing complex
?X capable of ?Y
?Y type DNA polymerase activity
→ ?X type DNA polymerase complex
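The rule above can be sketched as a single forward-chaining step over plain triples. This is a toy illustration only: it does not use Arachne or any OWL library, and the instance names and string labels are assumptions made for the example.

```python
def infer_polymerase_complexes(triples):
    """Apply one rule: a protein-containing complex that is capable of a
    DNA polymerase activity is classified as a DNA polymerase complex."""
    types = {(s, o) for s, p, o in triples if p == "rdf:type"}
    inferred = set()
    for s, p, o in triples:
        if p != "capable of":
            continue
        if (s, "protein-containing complex") in types and \
           (o, "DNA polymerase activity") in types:
            inferred.add((s, "rdf:type", "DNA polymerase complex"))
    # Report only statements that were not already asserted.
    return inferred - set(triples)


# Hypothetical model instances.
model = [
    ("complex1", "rdf:type", "protein-containing complex"),
    ("activity1", "rdf:type", "DNA polymerase activity"),
    ("complex1", "capable of", "activity1"),
    ("complex2", "rdf:type", "protein-containing complex"),
]

print(infer_polymerase_complexes(model))
```

Only `complex1` is reclassified, because `complex2` has no stated capability.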
Arachne Reasoner
https://github.com/balhoff/arachne
Classifier in Noctua editor
Translesion synthesis by POLI
GO-CAM current status
2331 GO-CAMs
Curators actively building more
Goal is to transition all GOC curators onto the use of
GO-CAMs and the Noctua editor. (WIP)
http://geneontology.org/go-cam
http://rdf.geneontology.org/
Browse
Query
Gene
Ontology
GO Causal
Activity Models
Pathways to
GO-CAMs
Causal
Inference
Trail Talk
Map
… comprehensive,
computational model
of biological
systems...
I think I’m
going to
need a
little help...
Pathway Databases
• People have been charting
biochemical pathways for more
than 60 years, mostly as images
• A variety of computable pathway
databases now exist
• KEGG, Reactome, BioCyc and
more.
• These resources complement
the efforts of the GO
Roche Metabolic Pathways.
GO-CAM annotations related to gdf3
(by Sabrina Toro)
http://geneontology.org/go-cam
No other models
found, dead end?
Signaling by BMP,
Reactome
Signaling by BMP,
WikiPathways
Signaling by BMP,
HumanCyc
Signaling by BMP,
etc.
Goal: an integrated
biological
knowledge
base (‘graph’)
Pathway Resources
GO-CAMs
Approach
GO-CAMs as the unifying
structure
Manual curation
Automatic import
Automatic part
1. Rule-based transformation from BioPAX
(BioPAX is a standard exchange format for biochemical
pathways)
2. OWL inference
Pathway to GO-CAM
Reactome → BioPAX → series of rule-based transformations → GO-CAM (OWL)
Reaction = Activity (molecular
function)
BMP2 binds to the receptor complex (Reactome)
GO:0005160
Mapping rules
Pathway/Biological Process to
Causal Activity Model
GO-CAM
BioPAX
Mapping rule:
‘Next Step’
RO:Provides direct input for
Reaction Reaction
biopax:nextStep
Molecular
Function
Molecular
Function
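A minimal sketch of this mapping rule, assuming reactions arrive as ordered (upstream, downstream) pairs taken from biopax:nextStep. The reaction identifiers and the `MF:` prefix are hypothetical; the real conversion works on BioPAX and emits OWL.

```python
def map_next_steps(next_steps):
    """Turn each biopax:nextStep pair of reactions into a
    'provides direct input for' edge between molecular function nodes."""
    edges = []
    for upstream, downstream in next_steps:
        # Rule "Reaction = Activity": each reaction becomes a function node.
        source = f"MF:{upstream}"
        target = f"MF:{downstream}"
        edges.append((source, "provides direct input for", target))
    return edges


# Hypothetical reaction identifiers.
next_steps = [("reaction_1", "reaction_2"), ("reaction_2", "reaction_3")]

for edge in map_next_steps(next_steps):
    print(edge)
```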
Mapping rule:
‘negative regulation by
sequestration’
Signaling by BMP
(by Reactome as a GO-CAM)
Reactome
conversion
• All 2244 human Reactome pathways
automatically converted into GO-CAMs
• Doubling the number of GO-CAMs
• Accessible for testing at
http://noctua-dev.berkeleybop.org
Once GO-CAM generated,
can apply OWL Reasoning
Use axioms encoded in GO, RO, (more OBO) to:
a. Find errors
b. Infer new relationships
Function Classification
Connecting to the
Gene Ontology
automatically
Biological Process classification
Relation Inference
Positively
regulates
Relation
Ontology
Gene
Ontology
GO Causal
Activity Models
Pathways to
GO-CAMs
Causal
Inference
Trail Talk
Map
Answer the Why
question
?
Causes
Causes
Causes
Causes
?
?
Causal Relation Query
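A causal relation query can be pictured as walking "causes" edges backwards from an observed effect. The sketch below is a plain breadth-first traversal over hypothetical (cause, effect) pairs; it is not the query actually run against the GO-CAM store.

```python
from collections import deque


def upstream_causes(edges, target):
    """Return everything that directly or indirectly causes `target`,
    nearest causes first."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, []).append(cause)

    seen, order = set(), []
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for cause in parents.get(node, []):
            if cause not in seen:
                seen.add(cause)
                order.append(cause)
                queue.append(cause)
    return order


# Hypothetical causal edges: A causes B, B causes C, D causes C.
causal_edges = [("A", "B"), ("B", "C"), ("D", "C")]
print(upstream_causes(causal_edges, "C"))
```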
Ongoing challenges
● Infer mappings to GO
when they are not
provided.
● Merge pathways from
multiple sources
● Incorporate existing
knowledge from GO
annotations
● Develop new applications
● Develop effective
workflows for using
pathway knowledge
○ Adapt external
pathways into new
GO-CAMs?
○ Dynamically add new
edges into extant GO-CAMs?
The manual part:
supporting GO curators
Direct KB creation
Open Issues in GitHub
https://github.com/geneontology/
(62 repos..)
Solve an issue,
get a job!
cjmungall@lbl.gov
/noctua (GO-CAM editor): 138
/go-site (website, graph store): 271
/go-ontology : 486
/amigo (Web User UI): 134
/pathways2go: 13
Acknowledgements
A very long list of people that created and contributed to
the Gene Ontology Consortium over the past 20 years
Specifically:
Chris Mungall, Paul Thomas, David Hill, Huaiyu Mi,
Peter D’Eustachio, Kimberly Van Auken, Seth
Carbon, Jim Balhoff, Laurent-Philippe Albou
Any questions?
https://github.com/geneontology/pathways2GO
http://geneontology.org/go-cam
@bgood
bgood@lbl.gov
How to convert existing
GO annotations?
GOC has not decided yet
Ideal outcome (IMO) is one GO-CAM per biological
process term
Question centers on how much can be automatically
done versus how much requires curators
Why Not just BioPAX?
• Does not support automated inference
• Does not enforce the use of ontologies to describe
e.g. pathways, reactions
• Does not have any concept of causal relationships
• Actual implementations vary widely, forcing manual
work to integrate
• Does not align with GO model (OBO, OWL, etc.)
• Happy to talk about a new BioPAX level 4...
What about my db?
Conversion code should in principle work on any
BioPAX level 3 input.
In practice, different groups produce different BioPAX.
Successful tests on Pathway Commons for example
pathways from KEGG, BioCyc, and WikiPathways
Causal edges provide a scaffold for explanation that
can be integrated and further explained using other
relations, potentially from other knowledge bases
that use the same ontologies
Function
Causes
Function Function
Causes
Gene
Product
enables
Gene
Product
enables
Biological
Process
drug
activates
gene
disease
encodes
Genetic
association
Has part
Automatic type:of
Inference
• Reactome complexes classified under GO Cellular Component
terms: from 0 to 2192 classified complexes (infinite
improvement!)
• When complete: 2192/9476
• 23% of complexes have some assignment to a GO
term
GO-CAM element | Ontology or identifier source(s) | Example
Molecular activity | GO molecular function | ubiquitin-protein transferase activity (GO:0004842)
Biological process | GO biological process | cellular response to UV (GO:0034644)
Location | GO cellular component | nucleus (GO:0005634)
Location | Cell Type Ontology (CL) (8) | retinal cell (CL:0009004)
Location | anatomy ontologies, e.g. UBERON (9), C. elegans gross anatomy (10), EMAPA (11) | eye (UBERON:0000970)
Active entity | Gene, protein, RNA or complex identifier from a standard source, e.g. HGNC for a human gene | NEDD4 (HGNC:7727)
Biological phase | GO biological phase (GO:0044848) | mitotic G1 phase (GO:0000080)
Biological phase | Developmental phase ontology, e.g. Mouse Developmental Stage | Theiler stage 02 (MmusDv:0000005)
Relations | Relations Ontology | occurs in (BFO:0000066)
