Standardized biological knowledge graphs: The BioLink Model
Chris Mungall
2018-04-13
Challenge: making representations of biological knowledge interoperable
(Logo cloud of resources to integrate: OMIM, MGI, HGNC, FlyBase, ClinVar, CTD, DrugBank, UniProt, BGeeDb, GO, SGD, RGD, PomBase, Monarch, WormBase, PharmGKB, Reactome, GWAS Catalog, CHEMBL, ENSEMBL, BioGrid, KEGG, Panther, ZFIN, Xenbase, Animal QTLdb)
What do we mean by knowledge here?
● Data, sensu lato : collection of values in some organized form
○ Data, sensu stricto: Output of a data collection process
■ Instrumentation or observation; raw or processed; not altered by curation
■ Serves role as evidence
■ E.g. read count in RNAseq experiment OR examination of KO mouse
○ Metadata: Data about data (or, more typically, about datasets)
■ May be curated at source, post-hoc, manually or automatically
■ E.g. details about an RNAseq experiment (factors, instrumentation, sample prep)
○ Knowledge: Propositional assertions inferred from data
■ Something you need evidence for
■ E.g.
● gene G is expressed in tissue T under condition C
● Knocking out G gives rise to phenotype P with high penetrance
● Many bio-“databases” are actually “knowledge bases” (by this definition)
● Usual caveats:
○ Other definitions available, divisions can be murky, this is a guide rather than dogma, etc
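The data/knowledge distinction above can be illustrated with a small sketch. The class and field names (`Observation`, `Assertion`, `evidence`) and all identifiers are hypothetical and chosen only for this example; they are not taken from the BioLink Model itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Observation:
    """Data sensu stricto: the output of a data collection process."""
    sample: str
    measure: str
    value: float


@dataclass
class Assertion:
    """Knowledge: a proposition that needs evidence to back it."""
    subject: str
    predicate: str
    obj: str
    evidence: List[Observation] = field(default_factory=list)

    def is_supported(self) -> bool:
        # An assertion without evidence is only a claim.
        return len(self.evidence) > 0


# Hypothetical read count from an RNAseq experiment (data).
read_count = Observation(sample="liver_rep1", measure="read_count:G", value=1532.0)

# "Gene G is expressed in tissue T" (knowledge), pointing back to its evidence.
expressed = Assertion("gene:G", "expressed_in", "tissue:T", evidence=[read_count])

# A proposition that has not yet been tied to any data.
unsupported = Assertion("gene:G", "gives_rise_to", "phenotype:P")
```

In this reading, data plays the role of evidence and knowledge is the proposition that cites it.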
Solution: Standard schema/datamodel for all of biology?
Haven’t we been here before?
http://www.mged.org/Meetings/presentations/OMG/sld019.htm
Complexity and fluidity of biological knowledge vs schema rigidity
// hypothetical strawman schema
class Gene {
    String: name
    String: function
    String: phenotype
    Protein: product
    Int: start
    Int: end
    String: chromosome
}
Bad assumption (function, phenotype):
- Genes actually have multiple functions
- String representation rather than vocab
Bad assumption (start, end, chromosome):
- Different builds?
- Should be inherited from generic seq feature
Bad assumption (product):
- Genes can have multiple products
- Products not necessarily genes
- What about transcript, exon, ...
The backwards evolution of schema languages
● 80s: ER, SQL DDL
○ Basis in FOL, formal algebra/calculus
● 90s: OO, UML, Description Logics
○ Rich polymorphism
● 00s: XML, SOAP
○ Can’t even...
● 10s: JSON and JSON-Schema
○ No polymorphism
○ Limited typing
○ Tree-based
○ Geared towards web-apps, not rich modeling
What works: Open-ended knowledge representation using RDF Graphs plus OWL
● RDF: minimal representation model for representing simple facts as edges
● OWL: encodes semantics about RDF graphs
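As a rough illustration of "facts as edges" plus a semantic layer, the sketch below stores triples as plain Python tuples and derives entailed types from a subclass hierarchy. It deliberately avoids any RDF library; the `ex:` identifiers are made up, and the subclass reasoning shown is only the simplest kind of inference that a language such as OWL supports.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Each fact is one edge: (subject, predicate, object).
graph: Set[Triple] = {
    ("ex:gene1", "rdf:type", "ex:ProteinCodingGene"),
    ("ex:gene1", "ex:expressed_in", "ex:liver"),
    ("ex:ProteinCodingGene", "rdfs:subClassOf", "ex:Gene"),
    ("ex:Gene", "rdfs:subClassOf", "ex:GenomicEntity"),
}


def superclasses(cls: str, triples: Set[Triple]) -> Set[str]:
    """All ancestors of cls reachable via rdfs:subClassOf edges."""
    found: Set[str] = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def inferred_types(node: str, triples: Set[Triple]) -> Set[str]:
    """Asserted types plus those entailed by the class hierarchy."""
    asserted = {o for s, p, o in triples if s == node and p == "rdf:type"}
    inferred = set(asserted)
    for cls in asserted:
        inferred |= superclasses(cls, triples)
    return inferred
```

Only one type is asserted for `ex:gene1`; the other two follow from the schema-level edges, which is the sense in which the semantic layer adds meaning to the raw graph.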
Success of OWL: Bio-Ontologies
● One datamodel (OWL), covers rich variety of interconnected biology
● APIs, SPARQL, ...
http://obofoundry.org/ontology/uberon.html
Analogous approach in biological databases
● GMOD Chado
● Graph-like database layered over RDBMS
● Allowed flexibility and extensibility
● Large uptake by small MODs
Mungall, C. J., Emmert, D. B., et al. (2007) A. Bioinformatics, 23(13), i337-346. http://doi.org/10.1093/bioinformatics/btm189
https://github.com/GMOD/Chado
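The idea of a graph-like model layered over a relational database can be sketched with two generic tables: one for nodes and one for typed edges. The table and column names below are loosely inspired by Chado but heavily simplified; this is not the actual Chado DDL, and the feature names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    type       TEXT NOT NULL   -- simplified: a real schema would reference a vocabulary term
);
CREATE TABLE feature_relationship (
    subject_id INTEGER NOT NULL REFERENCES feature(feature_id),
    type       TEXT NOT NULL,
    object_id  INTEGER NOT NULL REFERENCES feature(feature_id)
);
""")

# Nodes: new kinds of feature need new rows, not new tables.
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?)",
    [(1, "geneA", "gene"), (2, "geneA-RA", "mRNA"), (3, "geneA-RA:exon1", "exon")],
)
# Edges: typed relationships between features.
conn.executemany(
    "INSERT INTO feature_relationship VALUES (?, ?, ?)",
    [(2, "part_of", 1), (3, "part_of", 2)],
)


def parts_of(connection: sqlite3.Connection, name: str) -> list:
    """Names of features that are directly part_of the named feature."""
    rows = connection.execute(
        """
        SELECT s.name
        FROM feature_relationship r
        JOIN feature s ON s.feature_id = r.subject_id
        JOIN feature o ON o.feature_id = r.object_id
        WHERE r.type = 'part_of' AND o.name = ?
        ORDER BY s.name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]
```

The flexibility comes from keeping the relational schema generic and pushing the biology into the `type` values.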
Knowledge Graphs, the most pluripotent representation of data, are no longer as exotic or
experimental as they were 10 years ago. Goofaceamazonlink etc are all using them to some degree.
Challenge: too much flexibility
● With flexible schema-free graph-based representations, multiple ways of modeling things
● OWL provides semantic open-world biological constraints
○ All genes are located_on exactly 1 chromosome
● Software often needs more rigid closed-world information model constraints
○ Information System A: gene can be located on multiple contigs/scaffolds
○ Information System B: locational info not relevant
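The closed-world side of this contrast can be sketched as a per-system cardinality check. Under an open-world reading, a gene record with no `located_on` value is simply incomplete; a closed-world validator instead treats the record as everything that is known and reports a violation. The function name, slot names, and limits below are assumptions made for this example.

```python
from typing import Dict, List, Optional


def check_cardinality(
    record: Dict[str, List[str]],
    slot: str,
    min_count: int,
    max_count: Optional[int],
) -> List[str]:
    """Closed-world check: the record is assumed to be complete."""
    values = record.get(slot, [])
    errors: List[str] = []
    if len(values) < min_count:
        errors.append(f"{slot}: expected at least {min_count}, found {len(values)}")
    if max_count is not None and len(values) > max_count:
        errors.append(f"{slot}: expected at most {max_count}, found {len(values)}")
    return errors


gene = {"id": ["gene:G"], "located_on": ["contig:1", "contig:2"]}
gene_without_location = {"id": ["gene:G"]}

# Information System A: one or more locations are allowed.
errors_a = check_cardinality(gene, "located_on", min_count=1, max_count=None)

# A stricter system that insists on exactly one location.
errors_strict = check_cardinality(gene, "located_on", min_count=1, max_count=1)

# Information System B: location is irrelevant, so nothing is required.
errors_b = check_cardinality(gene_without_location, "located_on", min_count=0, max_count=None)
```

The same record passes in one system and fails in another, which is why the information-model constraints need to be stated per application rather than once for all of biology.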
BioLink Model Approach
● Define a powerful underlying metamodel
○ Mix aspects of closed-world UML and open-world OWL
○ Build for extensibility
○ Define exports: UML, SQL DDL, GraphQL, JSON-Schema, Java, ...
● Define core biological types (E)
○ Gene, disease, anatomical entity, ...
○ Cede detailed typology to ontologies
● Define core properties (R)
○ Id, name, synonym
○ Part-of, interacts-with, gives-rise-to
● Define taxonomy of relationships (extension of R)
○ Gene-gene-interaction, gene-tissue-expression
● Extensibility through use-case specific profiles
https://biolink.github.io/biolink-model
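The layering described above (core entity types, core properties, and a taxonomy of relationships that specializes a generic one) might look roughly like the following. The class names echo the slide but are assumptions for illustration and should not be read as the actual classes of the biolink-model; the identifiers are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NamedThing:
    """Core properties shared by every entity: id, name, synonyms."""
    id: str
    name: str
    synonyms: List[str] = field(default_factory=list)


@dataclass
class Gene(NamedThing):
    pass


@dataclass
class AnatomicalEntity(NamedThing):
    pass


@dataclass
class Association:
    """Generic relationship: any entity related to any other entity."""
    subject: NamedThing
    predicate: str
    obj: NamedThing


@dataclass
class GeneGeneInteraction(Association):
    """Specialization: both ends must be genes."""

    def __post_init__(self) -> None:
        if not (isinstance(self.subject, Gene) and isinstance(self.obj, Gene)):
            raise TypeError("GeneGeneInteraction requires Gene subject and object")


@dataclass
class GeneTissueExpression(Association):
    """Specialization: a gene expressed in an anatomical entity."""

    def __post_init__(self) -> None:
        if not isinstance(self.subject, Gene):
            raise TypeError("subject must be a Gene")
        if not isinstance(self.obj, AnatomicalEntity):
            raise TypeError("object must be an AnatomicalEntity")


g1 = Gene("ex:G1", "G1")
g2 = Gene("ex:G2", "G2", synonyms=["g-two"])
liver = AnatomicalEntity("ex:T1", "liver")

interaction = GeneGeneInteraction(g1, "interacts_with", g2)
expression = GeneTissueExpression(g1, "expressed_in", liver)
```

Detailed typing (which kind of gene, which tissue) is left to the identifiers, in line with ceding typology to ontologies.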
Browsing the model
● YAML source
● Autogenerated website docs: https://biolink.github.io/biolink-model
● OWL export
○ Protege
○ Bioportal
● JSON-Schema (lossy unless working in JSON-LD)
● GraphQL (lossy)
● UML Diagrams (lossy)
https://github.com/biolink/biolink-model/blob/master/biolink-model.yaml
https://github.com/biolink/biolink-model/tree/master/ontology
https://bioportal.bioontology.org/ontologies/BLM/
Entities
Relationships
Aka assertions, facts, propositions, reified triples, edges ...
Profiles
● Different projects require different views of the data
○ E.g. omission/inclusion of different fields
○ Denormalizations
○ Inlining vs referencing
● Metamodel supports remixing and mixins
● One core conceptual model
● Different serializations for different profiles
● Well-defined transforms
● Caveat: this part is not well documented yet
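Since the slide notes that profiles are not yet well documented, the following is only a guess at the general idea: one core store, and two transforms that produce different views of it (referencing vs inlining, with field omission). The data layout and function names are hypothetical.

```python
from typing import Any, Dict, List, Sequence

# One core conceptual model: entities keyed by id, associations by reference.
CORE: Dict[str, Any] = {
    "entities": {
        "ex:G1": {"id": "ex:G1", "name": "G1", "synonyms": ["g-one"]},
        "ex:T1": {"id": "ex:T1", "name": "liver", "synonyms": []},
    },
    "associations": [
        {"subject": "ex:G1", "predicate": "expressed_in", "object": "ex:T1"},
    ],
}


def referencing_profile(core: Dict[str, Any]) -> List[Dict[str, str]]:
    """Normalized view: associations point to entities by id only."""
    return [dict(assoc) for assoc in core["associations"]]


def inlining_profile(core: Dict[str, Any], keep: Sequence[str]) -> List[Dict[str, Any]]:
    """Denormalized view: entities are embedded, restricted to chosen fields."""

    def view(entity_id: str) -> Dict[str, Any]:
        entity = core["entities"][entity_id]
        return {key: entity[key] for key in keep if key in entity}

    return [
        {
            "subject": view(assoc["subject"]),
            "predicate": assoc["predicate"],
            "object": view(assoc["object"]),
        }
        for assoc in core["associations"]
    ]
```

A web client might prefer the inlined view with only `id` and `name`, while a loader for a graph store might prefer the referencing view.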
How do I use it? How do I get data?
● Data model is serialization neutral
○ Plus: Flexible
○ Negative: Additional layer of abstraction
● RDF/Turtle serialization
○ http://data.monarchinitiative.org/ttl/
○ Turtle conforms to association patterns
● Property graphs
○ http://neo4j.monarchinitiative.org/
● JSON
○ Challenge: lack of polymorphism
○ Available via generic model or specific models
○ API http://api.monarchinitiative.org/api/
○ Preview: https://data.monarchinitiative.org/json/
○ BDBags of JSON coming soon
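One common workaround for the lack of polymorphism in JSON is a generic record carrying an explicit type discriminator, which the consumer maps back onto specific classes. The `type` key, the registry, and the sample payloads below are assumptions for illustration; they do not describe the fields returned by the Monarch API.

```python
import json
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class Entity:
    """Generic model: enough for clients that do not care about subtype."""
    id: str
    label: str


@dataclass
class Gene(Entity):
    pass


@dataclass
class Disease(Entity):
    pass


# Hypothetical mapping from the discriminator value to a specific class.
TYPE_REGISTRY: Dict[str, Type[Entity]] = {"gene": Gene, "disease": Disease}


def from_json(text: str) -> Entity:
    """Rebuild a typed object from a generic JSON record."""
    raw = json.loads(text)
    # Unknown or missing types fall back to the generic model.
    cls = TYPE_REGISTRY.get(raw.get("type", ""), Entity)
    return cls(id=raw["id"], label=raw["label"])


gene = from_json('{"id": "ex:G1", "label": "G1", "type": "gene"}')
unknown = from_json('{"id": "ex:X1", "label": "X1", "type": "variant"}')
```

The trade-off is that JSON-Schema alone cannot express the subtype relationship, so the mapping has to live in client code or in a JSON-LD context.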
What NOT to use the biolink-model for
● Raw data
● Metadata about a dataset
● ..
● However..
○ Underlying metamodel may be useful in providing flexible representations of these
○ Currently aligning with FHIR metamodel
How does this relate to KC7?
● One view: DC is about data sensu stricto, and metadata
○ Search = lightweight ontology (syns + subsumption) + metadata datamodels
○ “Knowledge bases” have their own specialized search interfaces developed by specialists
○ No role for a standard KM in DC
● Counterview
○ We’re not trying to compete with bio-KBs
○ We want to leverage knowledge to enhance data search
■ Analogous to how the Google KG enhances Google search
○ Example:
■ Find TopMed studies relevant to my disease
● Exploit KG linkages between disease-phenotype, phenotype-variable, phenotype-gene
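The search example above amounts to a bounded graph traversal from a disease to the studies reachable through phenotype and variable links. The sketch below uses a toy edge list; every identifier and predicate name is invented, and a real system would query a graph store rather than an in-memory list.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]

# Toy knowledge graph (all identifiers hypothetical).
EDGES: List[Edge] = [
    ("disease:D1", "has_phenotype", "phenotype:P1"),
    ("phenotype:P1", "measured_by", "variable:V1"),
    ("variable:V1", "collected_in", "study:S1"),
    ("phenotype:P1", "associated_with", "gene:G1"),
    ("disease:D2", "has_phenotype", "phenotype:P2"),
    ("phenotype:P2", "measured_by", "variable:V2"),
    ("variable:V2", "collected_in", "study:S2"),
]


def build_adjacency(edges: List[Edge]) -> Dict[str, List[str]]:
    """Index outgoing neighbours by source node."""
    adjacency: Dict[str, List[str]] = {}
    for subject, _predicate, obj in edges:
        adjacency.setdefault(subject, []).append(obj)
    return adjacency


def relevant_studies(disease: str, edges: List[Edge], max_hops: int = 3) -> Set[str]:
    """Breadth-first search for study nodes within max_hops of the disease."""
    adjacency = build_adjacency(edges)
    seen: Set[str] = {disease}
    queue = deque([(disease, 0)])
    studies: Set[str] = set()
    while queue:
        node, depth = queue.popleft()
        if node.startswith("study:"):
            studies.add(node)
        if depth == max_hops:
            continue
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return studies
```

A plain keyword search on the disease name would miss `study:S1` unless the study metadata happened to mention it; the graph links are what connect the two.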

Introduction to the BioLink datamodel
