1. LinkML: A brief guide for Monarch team members
Chris Mungall
Monarch Initiative
cjmungall@lbl.gov
Monarch Data Call Sep 2021
2. Data modeling is ubiquitous in Monarch and Phen1
Ontology developer
How do infectious diseases (MONDO) relate to infectious agents (NCBITaxon)? Or to treatments (MAXO) or exposures (ECTO)?
Phenopacket Team
KG/Ingest Engineer
How do diseases relate to genes?
How do patients relate to diseases? Or to samples?
UI developer
How do disease pages relate to gene pages?
3. We love our stacks
Ontology developer
I ❤️ OWL
I ❤️ DOSDPs
I ❤️ ROBOT templates
Phenopacket Team
KG/Ingest Engineer / Graph ML
I ❤️ KGX
I ❤️ TSVs
I ❤️ Neo4J
I ❤️ Protobuf
4. Monarch-adjacent and beyond
Translator
How do genes relate to chemicals?
CRDC-H; NMDC
How do I relate samples to ontological descriptors?
CD2H / N3C
How do patients relate to drug treatments?
Allen / HCA
How do cell types relate to genes?
OBO / RO / COB
How do processes relate to inputs?
5. Monarch-adjacent and beyond
Semweb developer
I ❤️ RDF + triplestores
I ❤️ SHACL/ShEx
Clinical informatics
I ❤️ FHIR
Biologists
I ❤️ spreadsheets
CD2H / N3C
I ❤️ OMOP
GA4GH, HCA, many devs
I ❤️ JSON-Schema
I ❤️ SQLite
6. LinkML: One ring to bind them…
https://linkml.io/ * https://github.com/linkml/linkml/
8. LinkML philosophy: Parasitize rather than compete
Strategy
● Be expressive enough to cover all our use cases
● Allow compilation to people’s favored stack (e.g. JSON-Schema)
● Simple to do simple things, but add-ons where necessary
● Stealth Semantics
○ Everything can be RDF/Linked Data - if you want it to be
9. LinkML parasitizes other toolchains
[Diagram: Your LinkML Schema (YAML) is read by the LinkML parser, which produces artifacts for YourModel: Documentation, OWL, JSON Schema, ShEx Schema, Schema.py (object model), GraphQL Schema, JSONLD Context, ...]
Philosophy:
● Be expressive
● Parasitize
● Be developer-friendly
● Stealth semantics
● 80% rule
10. LinkML != Biolink Model
Biolink Model
● Expressed using LinkML
● An uber-data model for biology
○ Main types: gene, chemical, disease, …
○ Translator
○ Monarch KG
○ KG-COVID-19
○ KG-Microbe
○ KG-OBO
● Not appropriate for everything
○ Highly patient specific (CCDH, Pfx, FHIR)
○ Sample and omics data (CCDH, NMDC)
○ Single-cell data
LinkML
● A modeling language
● Can express multiple datamodels
11. A sample LinkML Schema
# Metadata
id: https://example.org/linkml/hello-world
title: Really basic LinkML model
name: hello-world
license: https://creativecommons.org/publicdomain/zero/1.0/
version: 0.0.1

# Namespaces
prefixes:
  linkml: https://w3id.org/linkml/
  sdo: https://schema.org/
  ex: https://example.org/linkml/hello-world/
default_prefix: ex
default_curi_maps:
  - semweb_context

# Dependencies
imports:
  - linkml:types

# Actual Model
classes:
  Person:
    description: Minimal information about a person
    class_uri: sdo:Person
    attributes:
      id:
        identifier: true
        slot_uri: sdo:taxID
      first_name:
        required: true
        slot_uri: sdo:givenName
        multivalued: true
      last_name:
        required: true
        slot_uri: sdo:familyName
      knows:
        range: Person
        multivalued: true
        slot_uri: foaf:knows
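As a rough illustration of what "compiling" the Person class to another stack involves, the sketch below turns a hand-transcribed copy of the class into a JSON-Schema-style dictionary. This is not LinkML's own generator: the `to_json_schema` helper and the dict layout are assumptions made for the example, and every slot is treated as a string (a `Person` range is represented by the target's identifier).

```python
import json

# Hand-transcribed subset of the Person class from the sample schema,
# held as a plain dict rather than parsed from YAML.
PERSON = {
    "description": "Minimal information about a person",
    "attributes": {
        "id": {"identifier": True},
        "first_name": {"required": True, "multivalued": True},
        "last_name": {"required": True},
        "knows": {"range": "Person", "multivalued": True},
    },
}


def to_json_schema(class_name, class_def):
    """Hypothetical helper: map one class definition to a JSON-Schema-like dict."""
    properties = {}
    required = []
    for slot_name, slot in class_def["attributes"].items():
        # Assumption: every slot value is a string; references to other
        # Persons are stored as identifier strings.
        item = {"type": "string"}
        if slot.get("multivalued"):
            properties[slot_name] = {"type": "array", "items": item}
        else:
            properties[slot_name] = item
        # Identifiers are treated as required, like explicitly required slots.
        if slot.get("required") or slot.get("identifier"):
            required.append(slot_name)
    return {
        "title": class_name,
        "description": class_def["description"],
        "type": "object",
        "properties": properties,
        "required": required,
    }


person_schema = to_json_schema("Person", PERSON)
print(json.dumps(person_schema, indent=2))
```

The point of the sketch is only that one declarative source can drive several targets; a real generator also has to handle ranges, inheritance and enums.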
12. LinkML RDF is hidden in plain sight
The schema is the same as on slide 11. Its class_uri and slot_uri values (sdo:Person, sdo:taxID, sdo:givenName, sdo:familyName, foaf:knows) are where the RDF lives.
● Reuse schema elements from core vocabularies
● FAIR (specifically: allows diverse data to be combined)
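The class_uri and slot_uri values in the Person schema are CURIEs, so a plain record can be turned into RDF-style triples just by expanding prefixes. The sketch below shows that idea under several assumptions: the `expand` and `to_triples` helpers are hypothetical, the sample record is made up, the foaf namespace is written out by hand (the schema relies on the semweb_context map for it), and the identifier is used as the subject rather than emitted as its own triple.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Prefix map copied from the schema; the foaf entry is an assumption, since
# the schema leaves it to the semweb_context CURIE map.
PREFIXES = {
    "linkml": "https://w3id.org/linkml/",
    "sdo": "https://schema.org/",
    "ex": "https://example.org/linkml/hello-world/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

# Slot name -> slot_uri, as declared on the Person class.
SLOT_URIS = {
    "first_name": "sdo:givenName",
    "last_name": "sdo:familyName",
    "knows": "foaf:knows",
}

# Slots whose values are references to other objects rather than literals.
OBJECT_SLOTS = {"knows"}


def expand(curie):
    """Expand a prefix:local CURIE into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def to_triples(person):
    """Hypothetical helper: turn one Person record into (s, p, o) tuples."""
    subject = expand(person["id"])
    triples = [(subject, RDF_TYPE, expand("sdo:Person"))]
    for slot, value in person.items():
        if slot == "id":
            continue  # simplification: the identifier only names the subject
        values = value if isinstance(value, list) else [value]
        for v in values:
            obj = expand(v) if slot in OBJECT_SLOTS else v
            triples.append((subject, expand(SLOT_URIS[slot]), obj))
    return triples


# Made-up instance data for illustration.
record = {"id": "ex:P1", "first_name": ["Jane"], "last_name": "Doe", "knows": ["ex:P2"]}
triples = to_triples(record)
for triple in triples:
    print(triple)
```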
23. Currently in use for: key projects
National Microbiome Data Collaborative
● Samples
● Omics datatypes
Genomics Standards Consortium
Translator (Biolink)
All our KGs (Biolink + Source modeling)
Alliance of Genome Resources
Center for Cancer Data Harmonization
Knowledge Graph Change Language (KGCL)
Chemical Ontology Schema
SSSOM
GHGPA
Critical Path Institute
...and more
25. External tooling: DataHarmonizer driven by LinkML
https://genepio.org/DataHarmonizer/main.html -- spiritual successor to Phenote
26. Caveats / expectation management
The core is stable
● i.e. schemas won’t break in the LinkML 1.x.x series
● Used in production in multiple projects
Some things are currently incomplete
● Mapping to JSON-Schema
● Mapping to JSON-LD Contexts
● Documentation
New language features being added
● e.g. constraint language, mapping language
● These are extensions and won’t break existing schemas
The tool stack is constantly evolving
● Other frameworks -> LinkML
● Automated schema mapping
● Generators for other languages
○ new: javagen
● Binding to databases
○ SQL, SPARQL, Solr, MongoDB, ...
Web-based model documentation can be emitted “out of the box”, and several LinkML users have added fancier tool-specific(?) documentation packages. Note that the above model is not UML: YUML is a graphics tool that draws UML diagrams but does not produce XMI.
Generating the schemas in various target languages enables the use of tooling and other resources developed for that particular language. LinkML emits JSON-Schema, ShEx and GraphQL today. SQL ORM work is underway, and future plans include UML, SHACL and FHIR.
LinkML OWL can generate the necessary “glue” to allow model instances (e.g. Person) to be used in reasoners.
Model instances can be constructed using Python and emitted as JSON, YAML or RDF. Columnar output (CSV, TSV, Excel, …) is on the to-do list. Other formats can be added as needed.
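A minimal sketch of building an instance in Python and emitting it as JSON follows. The `Person` dataclass here is written by hand as a stand-in for a class generated from the schema, not the code LinkML actually generates, and the instance values are made up.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Person:
    """Hand-written stand-in for a class generated from the Person schema."""
    id: str
    first_name: List[str]   # multivalued in the sample schema
    last_name: str
    knows: List[str] = field(default_factory=list)  # identifiers of other Persons


# Made-up instance data for illustration.
p1 = Person(id="ex:P1", first_name=["Jane"], last_name="Doe", knows=["ex:P2"])

# Emit the instance as JSON using only the standard library.
as_json = json.dumps(asdict(p1), indent=2)
print(as_json)
```

YAML and RDF output need libraries beyond the standard library, so they are left out of the sketch.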
One can also use YAML, JSON or RDF loaders to import information. Columnar input is on the horizon. Note that the ability to import RDF, potentially from a large graph, a SPARQL query or a ShEx “slurp”, allows us to work with an RDF data store (e.g. Wikidata) or another source (e.g. schema.org-annotated web pages).
Note: as of 4/14/2021, we are still working through issues in the rdf_loader.
Three permissible values, with no connection to a meaning. This works for basic models, but lacks the semantic (RDF) bridge necessary for transformation.
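To make the difference concrete, the sketch below contrasts an enum whose permissible values are bare strings with one whose values each carry a `meaning` CURIE. The enum name, its three values and the `ex:` CURIEs are all made up for illustration, and `to_uri` is a hypothetical helper rather than a LinkML function.

```python
PREFIXES = {"ex": "https://example.org/linkml/hello-world/"}

# Hypothetical enum: three permissible values with no meaning attached.
PLAIN_STATUS = {"LIVING": {}, "DEAD": {}, "UNKNOWN": {}}

# The same enum with each value tied to a (made-up) CURIE via `meaning`.
MAPPED_STATUS = {
    "LIVING": {"meaning": "ex:Living"},
    "DEAD": {"meaning": "ex:Dead"},
    "UNKNOWN": {"meaning": "ex:Unknown"},
}


def to_uri(enum_def, value, prefixes):
    """Return the URI for a permissible value, or None if it has no meaning."""
    if value not in enum_def:
        raise ValueError(f"{value!r} is not a permissible value")
    meaning = enum_def[value].get("meaning")
    if meaning is None:
        return None  # valid value, but nothing to bridge it to RDF
    prefix, local = meaning.split(":", 1)
    return prefixes[prefix] + local


print(to_uri(PLAIN_STATUS, "DEAD", PREFIXES))
print(to_uri(MAPPED_STATUS, "DEAD", PREFIXES))
```

Both enums validate the same strings; only the mapped one gives a transformation something to resolve each value to.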