Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data

Michel Dumontier, Ph.D.
Associate Professor of Medicine (Biomedical Informatics)
Stanford University
@micheldumontier::ACS:23-08-16
An invited talk in support of the 2016 Herman Skolnik Awardees

My research aims to
develop computational
methods for biomedical
knowledge discovery
We develop tools and
methods to represent,
store, publish, integrate,
query, and reuse
biomedical data,
software, and ontologies

reuse needs to be
considered firmly in the
context of discovery and
reproducibility

Most published research findings are false
- John Ioannidis, Stanford University

Reproducible discovery
1. Data Science Tools and Methods
– Infrastructure: To identify, annotate, link, integrate,
search for and query data and services
– Tools: To identify and uncover support for known or
novel associations
2. Community Standards
to contribute to and interrogate a massive, decentralized
network of interconnected data and software

FAIR: Findable, Accessible, Interoperable, Re-usable

Findable
– Globally unique identifiers for datasets and the data they contain
– Rich set of descriptors to search and filter with
– Indexed and searchable
Accessible
– Identifiers can be used to retrieve representations using standard protocols
(e.g. HTTP)
– Metadata is always available.
Interoperable
– Data represented with formal knowledge representations
– Include links to other datasets/vocabularies
Reusable
– Licensing, Provenance, Community standards
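
One way to make the four principles concrete is a simple metadata check. The sketch below is illustrative only: the field names (`identifier`, `access_url`, `links`, and so on) and the mapping of fields to principles are assumptions made for this example, not part of any FAIR specification.

```python
# Hypothetical mapping from FAIR principle to the metadata fields we expect.
FAIR_FIELDS = {
    "Findable": ["identifier", "description"],
    "Accessible": ["access_url"],
    "Interoperable": ["format", "links"],
    "Reusable": ["license", "provenance"],
}


def missing_fair_fields(metadata: dict) -> dict:
    """Return, per principle, the expected fields that are absent or empty."""
    gaps = {}
    for principle, fields in FAIR_FIELDS.items():
        missing = [field for field in fields if not metadata.get(field)]
        if missing:
            gaps[principle] = missing
    return gaps


# A made-up dataset description with some deliberate gaps.
record = {
    "identifier": "http://example.org/dataset/1",
    "description": "Example dataset",
    "access_url": "http://example.org/dataset/1/download",
    "format": "RDF",
    "links": [],      # no links to other datasets yet
    "license": "",    # license not stated
}
gaps = missing_fair_fields(record)
```

Here `gaps` reports that the record lacks links (Interoperable) and a license and provenance (Reusable).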

The Semantic Web
is the new global web of knowledge
standards for publishing, sharing and querying
facts, expert knowledge and services
scalable approach for the discovery
of independently formulated
and distributed knowledge

Linked Data
offers a solid foundation for FAIR data
• Entities (people, proteins, pathways, etc.) are
identified using globally unique identifiers (URIs)
• Entity descriptions are represented with a
standardized language (RDF)
• Data can be retrieved using a universal protocol
(HTTP)
• Entities (concepts, data, resources) can be linked
together to increase interoperability
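
As a minimal sketch of these four points, the snippet below identifies entities with URIs, describes them as RDF-style triples, and serializes them as N-Triples using only the Python standard library. The `example.org` URIs and the `hasTarget` predicate are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str             # URI of the entity being described
    predicate: str           # URI of the relation
    obj: str                 # URI, or literal text when is_literal is True
    is_literal: bool = False


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    lines = []
    for t in triples:
        if t.is_literal:
            escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        else:
            obj = f"<{t.obj}>"
        lines.append(f"<{t.subject}> <{t.predicate}> {obj} .")
    return "\n".join(lines)


RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DRUG = "http://example.org/drug/imatinib"            # hypothetical URI
TARGET = "http://example.org/protein/ABL1"           # hypothetical URI
HAS_TARGET = "http://example.org/vocab#hasTarget"    # hypothetical predicate

triples = [
    Triple(DRUG, RDFS_LABEL, "imatinib", is_literal=True),
    Triple(DRUG, HAS_TARGET, TARGET),  # a link between two entities
]
ntriples = to_ntriples(triples)
```

Because both the drug and its target are URIs, a second dataset that describes the same target can be joined to this one without any identifier mapping step.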

Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the
representation and interlinking of biological data using RDF.
chemicals/drugs/formulations,
genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
• 11B+ interlinked statements from 35 biomedical
datasets and 400+ ontologies
• dataset description, provenance & statistics
• A growing interoperable ecosystem with the EBI,
NCBI, DBCLS, NCBO, OpenPHACTS, and
commercial tool providers

Bio2RDF
normalizes identifiers, formats, links, and access
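
Identifier normalization can be sketched as a small function. The `http://bio2rdf.org/namespace:identifier` URI pattern matches the Bio2RDF URIs used in the federated query in this talk; the accepted separators and the lowercasing rule below are assumptions made for this example, not a description of the Bio2RDF scripts.

```python
import re

BIO2RDF_BASE = "http://bio2rdf.org/"


def normalize_identifier(raw: str) -> str:
    """Map 'Namespace:ID' or 'namespace | ID' to a single URI form."""
    match = re.fullmatch(r"\s*([A-Za-z][\w.-]*)\s*[:|]\s*(\S+)\s*", raw)
    if match is None:
        raise ValueError(f"cannot parse identifier: {raw!r}")
    namespace, local_id = match.groups()
    # Assumption: namespaces are lowercased, local identifiers kept as-is.
    return f"{BIO2RDF_BASE}{namespace.lower()}:{local_id}"


uris = [normalize_identifier(s) for s in ("DrugBank:DB00001", "go | 0008150")]
```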

Bio2RDF shows how datasets are
connected together

Queries can be federated across
private and public SPARQL databases
Get all protein catabolic processes (and more specific GO terms) in biomodels
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
WHERE {
  # GO terms that are subclasses of "protein catabolic process"
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  # BioModels reactions annotated with those GO terms
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label
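
A query like this is sent to an endpoint over plain HTTP. The sketch below uses only the Python standard library to build a SPARQL protocol GET request and to flatten a response in the SPARQL JSON results format. The short query and the sample response are made up for illustration, and the snippet does not contact any server.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def bindings_to_rows(results_json: str):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    data = json.loads(results_json)
    variables = data["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in data["results"]["bindings"]
    ]


request = build_sparql_request(
    "http://bioportal.bio2rdf.org/sparql",
    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1",  # placeholder query
)

# Made-up response, shaped like the SPARQL JSON results format.
sample = json.dumps({
    "head": {"vars": ["go", "count"]},
    "results": {"bindings": [
        {"go": {"type": "uri", "value": "http://example.org/go/term1"},
         "count": {"type": "literal", "value": "3"}},
    ]},
})
rows = bindings_to_rows(sample)
```

Passing `request` to `urllib.request.urlopen` would execute the query; whether a given endpoint is still online is not something this sketch assumes.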

Graph-like representation amenable
to finding mismatches and discovering new links
W Hu, H Qiu, M Dumontier. Link Analysis of Life Science Linked Data.
International Semantic Web Conference (2) 2015: 446-462.

EbolaKB
Using Linked Data and Software
Kamdar, Dumontier. An Ebola virus-centered knowledge base. Database. 2015 Jun 8;2015. doi: 10.1093/database/bav049.

Network analysis and discovery
McCusker, McGuinness, Dumontier. In prep.

Can we implement
an open version of
PREDICT using
Linked Data?
AUC 0.91 across all therapeutic indications
Drug-drug similarity
A. Chemical structure similarity
B. Side effect similarity
C. Target sequence similarity
D. Target functional similarity
E. Network distance
Disease-disease similarity
A. Phenotype based
B. Text-extracted concepts
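
A minimal sketch of how drug-drug and disease-disease similarities could be combined to score a candidate drug-indication pair. It assumes Jaccard similarity over feature sets and scores a candidate by its best match to a known association, using the geometric mean of the two similarities. This is an illustrative assumption in the spirit of similarity-based prediction, not the published PREDICT implementation, and all drugs, diseases, and features below are hypothetical.

```python
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def association_score(drug, disease, known, drug_features, disease_features):
    """Best geometric-mean similarity to any known (drug, disease) pair."""
    best = 0.0
    for known_drug, known_disease in known:
        if (known_drug, known_disease) == (drug, disease):
            continue  # do not score a pair against itself
        s_drug = jaccard(drug_features[drug], drug_features[known_drug])
        s_disease = jaccard(disease_features[disease],
                            disease_features[known_disease])
        best = max(best, sqrt(s_drug * s_disease))
    return best


# Hypothetical side-effect and phenotype feature sets.
drug_features = {"drugA": {"nausea", "rash"}, "drugB": {"nausea"}}
disease_features = {"diseaseX": {"fever", "cough"}, "diseaseY": {"fever"}}
known = [("drugB", "diseaseY")]

score = association_score("drugA", "diseaseX", known,
                          drug_features, disease_features)
```

In practice each similarity measure listed above would contribute its own feature, and a classifier trained on known indications would combine them.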

HyQue: Hypothesis Validation
• A platform for knowledge discovery that
uses data retrieval coupled with
automated reasoning to validate
scientific hypotheses
• Leverages semantic technologies to
provide access to linked data,
ontologies, and semantic web services
• Uses positive and negative findings,
captures provenance
• Weighs evidence according to context
• Used to find aging genes in worm,
assess cardiotoxicity of tyrosine kinase
inhibitors
HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.
Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012.
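
The idea of weighing positive and negative findings while keeping their provenance can be sketched as follows. The scoring rule (a signed, weight-normalized sum) and the weights are assumptions made for this example; they are not HyQue's actual rules, and the findings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str      # where the finding came from (provenance)
    supports: bool   # True for a positive finding, False for a negative one
    weight: float    # context-dependent importance, chosen by the analyst


def evaluate(evidence):
    """Return a score in [-1, 1] and the list of sources that were used."""
    total = sum(e.weight for e in evidence)
    if total == 0:
        return 0.0, []
    signed = sum(e.weight if e.supports else -e.weight for e in evidence)
    return signed / total, [e.source for e in evidence]


# Hypothetical findings for "drug D is cardiotoxic".
findings = [
    Evidence("sider", supports=True, weight=2.0),
    Evidence("chembl", supports=True, weight=1.0),
    Evidence("clinicaltrials", supports=False, weight=1.0),
]
score, sources = evaluate(findings)
```

A score near 1 means the weighted evidence supports the hypothesis, a score near -1 means it contradicts it, and `sources` records which datasets contributed.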

What evidence might we gather?
• clinical: Are there cardiotoxic effects associated with the drug?
– Literature (studies) [curated db]
– Product labels (studies) [r3:sider]
– Clinical trials (studies) [r3:clinicaltrials]
– Adverse event reports [r2:pharmgkb/onesides]
– Electronic health records (observations)
• pre-clinical associations:
– genotype-phenotype (null/disease models) [r2:mgi, r2:sgd; r3:wormbase]
– in vitro assays (IC50) [r3:chembl]
– drug targets [r2:drugbank; r2:ctd; r3:stitch]
– drug-gene expression [r3:gxa]
– pathways [r2:kegg; r3:reactome]
– Drug-pathway, disease-pathway enrichments [aberrant pathways]
– Chemical properties [r2:pubchem; r2:drugbank]
– Toxicology [r1:toxkb/cebs]

HyQue

Beyond Bio2RDF

Network of Linked Data (~2007)

Expansion across domains
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

A rapidly growing network of Linked Data
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/

but the lack of coordination makes
Linked Open Data chaotic and unwieldy

There is no shortage
of vocabularies, ontologies and community-based
standards

metadatacenter.org
NIH COMMONS
Making it Easier, Possibly Even Pleasant, to Author
Interoperable Experimental Metadata

PubChem engaged the community to
reuse and extend existing vocabularies

Semanticscience Ontology (SIO)
An effective upper level ontology.
1500+ classes
207 object properties (inc. inverses)
1 datatype property

Chemical Information Ontology
(CHEMINF)
• Collaborative ontology
• Distinguishes algorithmic,
or procedural information
from declarative, or factual
information, and renders of
particular importance the
annotation of provenance
to calculated data.

Where are we going?
• Large-scale publishing across biomedical
datatypes is possible on the web
• Hubs such as NCBI and EBI now integrate data,
but there is a need for global coordination on all
datatypes
• Standard vocabularies must be open, freely
accessible, and demonstrably reused
• Use of worldwide data integration formats (RDF)
and improved linking of data
• Easier-to-deploy toolkits for providing
standards-compliant linked data

Linked Data Platform
Docker
• Data conversion scripts
• Query Editor
• Faceted Browser
• Relation Exploration
• API
• Data and data store
Model Organism Linked Data
MO-LD.org

In Summary
• We use semantic technologies such as ontologies
and linked data to make sense of and facilitate
access to biomedical data (FAIR)
• The intimate development and use of standards
by PubChem and others brings us closer to an
interoperability ideal
• Much more work is needed to support
(computational) discovery in a reproducible
manner.

Acknowledgements
Dumontier Lab
• Amrapali Zaveri
• Mary Panahiazar
• Shima Dastgheib
• Sandeep Ayyar
• Remzi Celebi
• David Odgers
• Wei Hu
• Ruben Verborgh
• Leo Chepelev
• Alison Callahan
• Jose Miguel Toledo Cruz
• Tanya Hiebert
• Beatriz Lujan
+ many more
Collaborators
• Mark Musen
• Nigam Shah
• Robert Hoehndorf
• Janna Hastings
• Christoph Steinbeck
• Egon Willighagen
• Nico Adams
• Colin Batchelor
• David Wild
• Evan Bolton
• Gang Fu
+ many more

dumontierlab.com
michel.dumontier@stanford.edu
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier
