Collaboratively Creating the Knowledge
Graph of Life
Chris Mungall
cjmungall@lbl.gov @chrismungall
JPM Graph Gang April 2021
Who am I and why am I here?
Education: University of Edinburgh
(BSc & PhD, AI + Bioinformatics)
Current: Staff Scientist, Berkeley Lab,
Environmental Genomics and Systems
Biology
In between: Lots of hacking and
occasional research papers
What I’m known for:
Biological databases and ontologies
What I actually do:
Write proposals and wrangle grants
and let others do the work
My Interests
genes
environment
effects
My Interests
genes
environment
biocuration
machine inference effects
mechanisms
experiments
Drugs
Protein
structure
Clinical
data
Omics
Literature
Protein
function
Assembling the
data is a huge
challenge!
Biological data management is hard.
We have many named things.
Drugs: 10k
Chemicals: 4tn?
Species: ~9 million
Diseases and Phenotypes: 10-50k/species
Cells: 10,000s+ types (per species)
Experiments
Raw data: ?? exabytes
Genes: 20k per species
Genetic variants: 3m in human alone
The things are interconnected
Cirrhosis
MONDO:0005155
Liver
UBERON:0002107
Hepatocyte
CL:0000182
Ethanol
CHEBI:16236
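Each identifier above is a CURIE (prefix plus local ID). A minimal sketch, assuming the standard OBO PURL pattern, of how such CURIEs expand into resolvable IRIs; the helper name `curie_to_iri` is made up for illustration:

```python
# Sketch only: expand the CURIEs on this slide into OBO-style IRIs.
# Assumes the usual OBO PURL layout: <base>/<PREFIX>_<LOCAL_ID>.
OBO_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_iri(curie: str) -> str:
    """Turn 'PREFIX:LOCAL' into an OBO PURL ending in 'PREFIX_LOCAL'."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_BASE}{prefix}_{local_id}"


# Names and identifiers exactly as listed on the slide.
things = {
    "Cirrhosis": "MONDO:0005155",
    "Liver": "UBERON:0002107",
    "Hepatocyte": "CL:0000182",
    "Ethanol": "CHEBI:16236",
}

iris = {name: curie_to_iri(curie) for name, curie in things.items()}
for name, iri in iris.items():
    print(f"{name}: {iri}")
```

Because every ontology in the same namespace follows one pattern, a single function covers diseases, anatomy, cell types and chemicals alike.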
It’s hard to find and integrate the things
I guess I
have a lot of
reading to
do!
Ontologies and Knowledge Graphs to the rescue!
I can organize it all
for you!
Ontolowhat?
genes diseases cell types
What is an ontology anyway?
??? how does this
help me?
It is the branch of
metaphysics dealing with
the nature of being.
No, it’s a formal, explicit
specification of a shared
conceptualization
What is an ontology anyway?
...better,
I guess
A graph (network)
connecting all the things
you care about
[Figure: example graph. Nodes: me, pizza, cheese, food, Victor, cat, human, mammal, fromage@fr. Edge labels: type, is a, has pet, depicted by, likes, has part, has role.]
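The graph on this slide can be written down as subject-predicate-object triples. The sketch below is illustrative only: the exact wiring of the figure is not recoverable from the slide text, so the edges chosen here are assumptions, and the helper names are hypothetical.

```python
# Edges are (subject, predicate, object) triples. Which node connects to
# which is an assumption; only the node and edge labels come from the slide.
TRIPLES = [
    ("me", "type", "human"),
    ("human", "is a", "mammal"),
    ("me", "has pet", "Victor"),
    ("Victor", "type", "cat"),
    ("cat", "is a", "mammal"),
    ("me", "likes", "pizza"),
    ("pizza", "has part", "cheese"),
    ("pizza", "has role", "food"),
]


def objects(subject, predicate):
    """Objects of all triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def ancestors(cls):
    """All superclasses reachable by following 'is a' edges."""
    found = []
    stack = objects(cls, "is a")
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.append(parent)
            stack.extend(objects(parent, "is a"))
    return found


def kinds_of(instance):
    """Direct types of an instance plus everything those types are a kind of."""
    kinds = []
    for cls in objects(instance, "type"):
        kinds.append(cls)
        kinds.extend(ancestors(cls))
    return kinds


print(kinds_of("Victor"))
```

Following "is a" edges upward is what lets a query about mammals find both the owner and the pet, even though neither is directly labelled as a mammal.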
Ontologies enable discovery
This is
fun
actually...
Do owners of different
kinds of pets like different
kinds of food? What do
those foods have in
common?
[Figure: example graph. Nodes: me, pizza, cheese, food, Victor, cat, human, mammal, fromage@fr. Edge labels: type, is a, has pet, depicted by, likes, has part, has role.]
Lizard owners
like spicy food
[p=0.04]
Some of the things you can do with ontologies
● Standardize, organize, & communicate data
● Filter & search for data
● Connect & harmonize data
● Infer knowledge & make suggestions
The Gene Ontology: An Ontology for Genes
Genes 20k/species
Gene Ontology (GO)
45k ontology classes
Each gene can be categorized with multiple
GO terms describing the role of each gene
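A rough sketch of what "categorized with multiple GO terms" means as a data structure. The gene names and their annotations are hypothetical; the single is_a edge is the one shown later in the deck (glucan biosynthesis is_a polysaccharide biosynthesis).

```python
# Toy slice of the GO hierarchy: child term -> parent terms.
IS_A = {
    "GO:0009250": ["GO:0000271"],  # glucan biosynthesis is_a polysaccharide biosynthesis
}

# Hypothetical genes, each annotated with one or more GO terms.
ANNOTATIONS = {
    "geneA": {"GO:0009250", "GO:0006915"},
    "geneB": {"GO:0000271"},
}


def closure(term):
    """The term itself plus all of its is_a ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_for(term):
    """Genes annotated to the term directly or via a more specific term."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(term in closure(t) for t in terms)
    )


print(genes_for("GO:0000271"))
```

A search for the general term also returns genes annotated only to the more specific one, which is the practical payoff of organizing terms in a hierarchy.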
The Gene Ontology is the work of many people
● Manual curation forms the
backbone of the GO
● AI can help but not replace!
id: GO:0043570
name: maintenance of DNA repeat elements
id: GO:0006915
name: apoptosis
id: GO:0016446
name: somatic hypermutation of immunoglobulin genes
Inferring GO classification based on evolutionary history of genes
Effects of space radiation on CSF molecular profiles
• Innate immune system overactivated
• Decreased nervous system development
• axonal fasciculation, astrocyte & oligodendrocyte differentiation, synaptic plasticity, axonogenesis, …
• Many negative regulation processes impaired:
• Neuron proliferation, differentiation and projection
• Leukocyte proliferation and differentiation
• Extrinsic and possibly intrinsic apoptotic signaling pathways
Goal: predict individual risks for behavior deficits and brain pathologies in astronauts
proteomic data → GO analysis → predict
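The slide does not say which statistic the "GO analysis" step used. A common choice for this kind of analysis is an over-representation test, sketched here with a hypergeometric tail probability; the function name and the toy counts are assumptions for illustration, not values from the study.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 20 genes measured, 5 annotated to the term,
# 4 genes changed in the experiment, 3 of those carry the term.
p = hypergeom_pvalue(population=20, annotated=5, sample=4, hits=3)
print(f"p = {p:.4f}")
```

A small p-value suggests the term shows up among the changed genes more often than chance alone would explain; in practice the test is repeated per term and corrected for multiple testing.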
GO is used by researchers… and in the clinic!
doi:10.1038/nature24487
Transgenic replacement skin was tested for off-target mutations using GO
GO describes just one aspect of biology
Drugs: 10k
Chemicals: 4tn?
Species: ~9 million
Diseases and Phenotypes: 10-50k/species
Cells: 1000s+ core types (per species)
Experiments
Raw data: ?? exabytes
Genes: 20k per species
Gene Functions
Genetic variants: 3m in human alone
There are many ontologies
to categorize the other
things
many biological ontologies!
Problems:
● Duplication
● Silos
● Lack of interoperability
We can build
the universal
ontological
map of life...
...but how do
we put the
pieces
together?
Step 1:
Agreeing to
work together
Open Biological Ontologies (OBO)
http://obofoundry.org
1. Well-integrated modular ontologies (subset of BioPortal)
2. Provide technical and sociotechnological framework for cooperation
3. Provide tools, best practices and infrastructure for forging new ontologies
4. Allow us to describe all of the things
@obofoundry
• 160 active ontologies
○ Developed by different teams
• Millions of classes
• Wide variety
○ Specific
■ E.g. cephalopod
○ General
■ E.g. chemicals
http://obofoundry.org
The OBO Foundry
The OBO Dashboard
Step 2: Connecting
the pieces
The original bio-ontologies were silos
GO: Biological Processes
glucan biosynthesis (GO:0009250) is_a polysaccharide biosynthesis (GO:0000271)
CHEBI: Chemical Entities
glucan (CHEBI:37163) is_a polysaccharide (CHEBI:18154)
No reuse or connection
OWL to the rescue!
MODULARITY
TOOLS +
REASONING
Ontology
Development
Environment
http://robot.obolibrary.org
Command
line tool for
operating on
ontologies
ODK: ONTOLOGY DEVELOPMENT KIT
ODK container components:
● ROBOT: ontology operations (command line)
● Make: workflows that chain operations together
● odk.py: seeds an ontology project by creating a GitHub repository with workflows in place
● dosdp-tools: builds ontologies rapidly from Design Pattern templates
● Reasoners: includes Elk, HermiT, Konclude
● fastobo: validation of OBO format files (Rust)
Complements ODEs (Protégé)
https://github.com/INCATools/ontology-development-kit
ROBOT is an OBO tool
http://robot.obolibrary.org/
Standard
Command
Line
Operations
OWL Axiomatization
glucan biosynthesis (GO:0009250) ⊑ polysaccharide biosynthesis (GO:0000271)
glucan biosynthesis (GO:0009250) ≡ biosynthesis (GO:0009058) ⊓ ∃has_output.glucan (CHEBI:37163)
polysaccharide biosynthesis (GO:0000271) ≡ biosynthesis (GO:0009058) ⊓ ∃has_output.polysaccharide (CHEBI:18154)
Hill et al, Dovetailing biology and chemistry: integrating the Gene Ontology with the ChEBI chemical ontology. BMC genomics, 14(1):513
OWL Reasoning leverages modularity
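glucan biosynthesis (GO:0009250) ≡ biosynthesis (GO:0009058) ⊓ ∃has_output.glucan (CHEBI:37163)
polysaccharide biosynthesis (GO:0000271) ≡ biosynthesis (GO:0009058) ⊓ ∃has_output.polysaccharide (CHEBI:18154)
glucan (CHEBI:37163) ⊑ polysaccharide (CHEBI:18154)
glucan biosynthesis (GO:0009250) ⊑ polysaccharide biosynthesis (GO:0000271) [inferred by OWL reasoner]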
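The inference on this slide can be imitated in a few lines. This is a toy sketch of one subsumption rule, not how ELK or any real OWL reasoner is implemented: if two classes are each defined as "genus and has_output some filler", and the genus and filler of the first are subclasses of those of the second, then the first class is a subclass of the second. All function names are hypothetical.

```python
# Asserted subclass axioms (the CHEBI side): glucan ⊑ polysaccharide.
SUBCLASS_OF = {"CHEBI:37163": {"CHEBI:18154"}}

# Equivalence axioms (the GO side): class ≡ genus ⊓ ∃relation.filler
DEFINITIONS = {
    "GO:0009250": ("GO:0009058", "has_output", "CHEBI:37163"),  # glucan biosynthesis
    "GO:0000271": ("GO:0009058", "has_output", "CHEBI:18154"),  # polysaccharide biosynthesis
}


def is_subclass(sub, sup):
    """Reflexive, transitive check over asserted axioms (assumes no cycles)."""
    if sub == sup:
        return True
    return any(is_subclass(parent, sup) for parent in SUBCLASS_OF.get(sub, ()))


def infer_subsumptions():
    """Compare every pair of defined classes and collect inferred (sub, super) pairs."""
    inferred = set()
    for a, (genus_a, rel_a, filler_a) in DEFINITIONS.items():
        for b, (genus_b, rel_b, filler_b) in DEFINITIONS.items():
            if (
                a != b
                and rel_a == rel_b
                and is_subclass(genus_a, genus_b)
                and is_subclass(filler_a, filler_b)
            ):
                inferred.add((a, b))
    return inferred


print(infer_subsumptions())
```

The GO hierarchy link never has to be asserted by hand: it falls out of the chemical hierarchy maintained in ChEBI, which is the point of modularity.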
OBO Relation Ontology: glue
within and between ontologies
http://obofoundry.org/ontology/ro
Connected Knowledge Graphs using Ontologies
So far, so good...
Challenges with OWL
Under-axiomatization
Over-axiomatization
OWL experts
● Author OWL templates
● Create Design Patterns
Ontology Developers
● Implement OWL templates
● Test against Design Patterns
Ontology Users
● Consume pre-reasoned hierarchies
Leverage the Expertise Pyramid
https://github.com/INCATools/dead_simple_owl_design_patterns
Can we make semantic tools easier?
RDF
OWL
SPARQL
SHACL
Semantic
engineer /
ontologist
Developer
Data Scientist
Scientists, Clinicians, ..
Python
SQL
Mongo
JSON
Pandas
BigTable
SPARK
Scikit-learn
Excel
Web Portals
???
id: https://example.org/linkml/hello-world
title: Really basic LinkML model
name: hello-world
version: 0.0.1
prefixes:
  linkml: https://w3id.org/linkml/
  sdo: https://schema.org/
  ex: https://example.org/linkml/hello-world/
default_prefix: ex
default_curi_maps:
  - semweb_context
imports:
  - linkml:types
classes:
  Person:
    description: Minimal information about a person
    class_uri: sdo:Person
    attributes:
      id:
        identifier: true
        slot_uri: sdo:taxID
      first_name:
        required: true
        slot_uri: sdo:givenName
        multivalued: true
      last_name:
        required: true
        slot_uri: sdo:familyName
      knows:
        range: Person
        multivalued: true
        slot_uri: foaf:knows
LinkML: Linked Data Modeling Language
MyModel
Documentation
OWL
JSON Schema
ShEx Schema
Schema.py
GraphQL Schema
JSONLD Context
. . .
LinkML
schema
http://linkml.github.io
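To make the Person schema concrete, here is a hand-rolled sketch of the kind of checks such a schema implies for plain JSON-style records. It is not the LinkML API or its generated code; the rule table and `validate_person` are assumptions written for illustration, including the assumption that an identifier slot must always be present.

```python
# Slot rules copied by hand from the Person class in the schema above.
PERSON_SLOTS = {
    "id": {"identifier": True},
    "first_name": {"required": True, "multivalued": True},
    "last_name": {"required": True},
    "knows": {"multivalued": True},
}


def validate_person(record):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for slot, rules in PERSON_SLOTS.items():
        value = record.get(slot)
        if value is None:
            if rules.get("required") or rules.get("identifier"):
                problems.append(f"missing required slot: {slot}")
            continue
        if rules.get("multivalued") and not isinstance(value, list):
            problems.append(f"slot {slot} must be a list")
        if not rules.get("multivalued") and isinstance(value, list):
            problems.append(f"slot {slot} must be a single value")
    for slot in record:
        if slot not in PERSON_SLOTS:
            problems.append(f"unknown slot: {slot}")
    return problems


good = {"id": "P1", "first_name": ["Ada"], "last_name": "Lovelace", "knows": ["P2"]}
bad = {"id": "P2", "first_name": "Ada"}
print(validate_person(good))
print(validate_person(bad))
```

In a real project these checks would come from the artifacts generated from the schema (JSON Schema, Python classes and so on) rather than from a hand-maintained table.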
Biolink
Model
Biolink: Goals
The charge from NCATS:
● Create a Knowledge Graph Schema
● Encompass all biology from molecules through to clinical entities
● Get 20 different sites using the same data model
○ (oh: Only a handful of which use RDF/OWL)
● Do it quickly and break new ground in Translational Science
Biolink
Model
Where we are (year 2 of 5)
● All participants using common KG datamodel
● Early demonstrations of powerful federated queries
● LinkML advantages:
○ Edges are first-class citizens
○ Ontologies/OWL leveraged, but in background
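One way to read "edges are first-class citizens": each edge is its own record with a subject, predicate and object, and can carry extra fields such as provenance. The sketch below is a simplified illustration, not the Biolink Model classes themselves; the predicate, the source name and the edge itself are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """An edge that is a record in its own right, not just a pair of nodes."""
    subject: str
    predicate: str
    object: str
    sources: tuple = ()  # provenance attached to the edge itself


# Placeholder edge: the predicate and the source are made up for illustration.
edges = [
    Edge(
        subject="CHEBI:16236",
        predicate="example:associated_with",
        object="MONDO:0005155",
        sources=("hypothetical-source",),
    ),
]


def edges_about(node, all_edges):
    """Every edge that mentions the node as subject or object."""
    return [e for e in all_edges if node in (e.subject, e.object)]


for edge in edges_about("MONDO:0005155", edges):
    print(edge.subject, edge.predicate, edge.object, edge.sources)
```

Because provenance sits on the edge, two sites can assert the same relationship from different sources without either record overwriting the other.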
National Microbiome Data Collaborative
Goal
● Make multi-omics microbiome data FAIR
○ Environments
○ Metagenomes
○ Metatranscriptomes
○ Metabolomics
○ Metaproteomics
● Leverage existing ontologies and standards
● Enable discovery in microbiome science
● Collaboration between multiple National Labs
National Microbiome Data Collaborative
Approach
● Formalize existing “checklist” standards
● Create modular schema
● Leverage MIxS, ENVO, PROV
Why LinkML
● Developers like JSON + JSON-Schema
● Biologists like spreadsheets
● “Semantic enums” work well
● Needed something that worked with traditional technology (Mongo, Postgres)
● “Stealth semantics”
○ Everything has URI
○ All JSON is transparently JSON-LD
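A small sketch of the "stealth semantics" idea: developers exchange ordinary JSON, and a JSON-LD context, typically generated from the schema, maps each key to a URI. The context, keys and values below are invented for illustration and are not the actual NMDC schema.

```python
import json

# Hypothetical context: in practice this would be generated from the schema.
CONTEXT = {
    "ex": "https://example.org/",
    "id": "@id",
    "name": "ex:name",
    "env_medium": {"@id": "ex:env_medium", "@type": "@id"},
}

# An ordinary JSON record, as a developer would store it in Mongo or Postgres.
record = {"id": "ex:sample-001", "name": "Example sample", "env_medium": "ex:soil"}


def as_jsonld(plain):
    """Attach the context; the plain keys and values stay untouched."""
    return {"@context": CONTEXT, **plain}


print(json.dumps(as_jsonld(record), indent=2))
```

The application code never has to mention RDF, yet any JSON-LD aware tool can interpret the same document as linked data.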
National Microbiome Data Collaborative
Where we are (year 2)
● Unified modular schema
● Easy for developers
○ System based mainly on JSON exchange
○ RDF can be leveraged
○ Currently Mongo + Postgres + TerminusDB
● Easy for biologists
○ Spreadsheets and validators created from the schema
● Everything has semantics
○ “On-the-fly” JSON-LD
○ Satisfies FAIR mandate
Take Homes
1. Building the graph of life requires collaboration, social engineering, and lots of curation
2. OWL is a powerful framework but it can be challenging to deploy effectively in an information system
3. Integrating data into cohesive ontologies/KGs is hard but the return on investment is high
4. LinkML provides a unifying layer over tooling… but more hands on deck required!
Some Links
● Open Bio Ontologies: http://obofoundry.org/resources
● ODK: https://github.com/INCATools/ontology-development-kit
● LinkML: https://linkml.github.io
● KG Hub: https://knowledge-graph-hub.github.io/
● GO: http://geneontology.org
● http://douroucouli.wordpress.com: My blog on all things OWL and Knowledge Graphs
