Building a network of interoperable and
independently produced linked and open
biomedical data
Michel Dumontier, Ph.D.
Associate Professor of Medicine (Biomedical Informatics)
Stanford University
@micheldumontier::ACS:23-08-16
An invited talk in support of the 2016 Herman Skolnik Awardees
My research aims to
develop computational
methods for biomedical
knowledge discovery
We develop tools and
methods to represent,
store, publish, integrate,
query, and reuse
biomedical data,
software, and ontologies
reuse needs to be
considered firmly in the
context of discovery and
reproducibility
Most published research findings are false
- John Ioannidis, Stanford University
Reproducible discovery
1. Data Science Tools and Methods
– Infrastructure: To identify, annotate, link, integrate,
search for and query data and services
– Tools: To identify and uncover support for known or
novel associations
2. Community Standards
to contribute to and interrogate a massive, decentralized
network of interconnected data and software
FAIR: Findable, Accessible, Interoperable, Re-usable
Findable
– Globally unique identifiers for datasets and the data they contain
– Rich set of descriptors to search and filter with
– Indexed and searchable
Accessible
– Identifiers can be used to retrieve representations using standard protocols
(e.g. HTTP)
– Metadata is always available.
Interoperable
– Data represented with formal knowledge representations
– Include links to other datasets/vocabularies
Reusable
– Licensing, Provenance, Community standards
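A minimal sketch of how the FAIR checklist above might be applied to a dataset description. The field names (`identifier`, `access_url`, `vocabularies`, and so on) and the example record are assumptions made for illustration; they are not part of the FAIR principles or of any standard metadata schema.

```python
# Hypothetical mapping from each FAIR principle to the metadata fields
# that this sketch expects a dataset description to carry.
REQUIRED_FIELDS = {
    "findable": ["identifier", "title", "keywords"],
    "accessible": ["access_url"],
    "interoperable": ["format", "vocabularies"],
    "reusable": ["license", "provenance"],
}


def missing_fields(record: dict) -> dict:
    """Return, per principle, the expected fields that are absent or empty."""
    report = {}
    for principle, fields in REQUIRED_FIELDS.items():
        missing = [name for name in fields if not record.get(name)]
        if missing:
            report[principle] = missing
    return report


# Hypothetical dataset description with no license or provenance.
example_record = {
    "identifier": "http://example.org/dataset/1",
    "title": "Example dataset",
    "keywords": ["proteins"],
    "access_url": "http://example.org/dataset/1.nt",
    "format": "application/n-triples",
    "vocabularies": ["http://semanticscience.org/resource/"],
}

print(missing_fields(example_record))
```

A check like this only confirms that descriptors are present; whether they are rich or accurate enough still needs human or community review.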
The Semantic Web
is the new global web of knowledge
standards for publishing, sharing and querying
facts, expert knowledge and services
scalable approach for the discovery
of independently formulated
and distributed knowledge
Linked Data
offers a solid foundation for FAIR data
• Entities (people, proteins, pathways, etc.) are
identified using globally unique identifiers (URIs)
• Entity descriptions are represented with a
standardized language (RDF)
• Data can be retrieved using a universal protocol
(HTTP)
• Entities (concepts, data, resources) can be linked
together to increase interoperability
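The four points above can be sketched in a few lines: entities are named with URIs, described as RDF triples, and linked to one another. This is a minimal illustration that writes N-Triples by hand; the `example.org` URIs and the label are invented for the example, and a real project would more likely use an RDF library.

```python
# Minimal sketch: entities as URIs, descriptions as RDF triples,
# serialized as N-Triples. Example URIs and labels are hypothetical.

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def uri(value: str) -> str:
    """Format a URI as an N-Triples IRI term."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Format a plain string as a literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


protein = "http://example.org/protein/P12345"  # hypothetical entity URI
pathway = "http://example.org/pathway/PW001"   # hypothetical entity URI

triples = [
    (uri(protein), uri(RDFS_LABEL), literal("example protein")),
    # A link to another resource is what makes the data "linked".
    (uri(protein), uri(RDFS_SEE_ALSO), uri(pathway)),
]

print(to_ntriples(triples))
```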
Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the
representation and interlinking of biological data using RDF.
chemicals/drugs/formulations,
genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
• 11B+ interlinked statements from 35 biomedical
datasets and 400+ ontologies
• dataset description, provenance & statistics
• A growing interoperable ecosystem with the EBI,
NCBI, DBCLS, NCBO, OpenPHACTS, and
commercial tool providers
Bio2RDF
normalizes identifiers, formats, links, and access
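As a rough illustration of identifier normalization, the sketch below rewrites differently formatted `prefix:id` strings into a single URI shape, `http://bio2rdf.org/namespace:identifier`, which matches the Bio2RDF URIs that appear in this deck. The parsing rules (lowercasing the namespace, accepting `:` or `/` as separator) are assumptions for the example, not the project's actual conversion code.

```python
import re

BIO2RDF_BASE = "http://bio2rdf.org/"

# Assumed input shape: a namespace, then ':' or '/', then a local identifier.
_IDENTIFIER = re.compile(r"\s*([A-Za-z][\w.-]*)\s*[:/]\s*(\S+)\s*")


def normalize(identifier: str) -> str:
    """Rewrite 'Namespace:ID' or 'namespace/ID' as one canonical URI form."""
    match = _IDENTIFIER.fullmatch(identifier)
    if match is None:
        raise ValueError(f"cannot parse identifier: {identifier!r}")
    namespace, local_id = match.groups()
    return f"{BIO2RDF_BASE}{namespace.lower()}:{local_id}"


for raw in ["DrugBank:DB00001", "drugbank/DB00001", " GO:0030163 "]:
    print(raw.strip(), "->", normalize(raw))
```

Once every source uses the same URI for the same record, links between datasets become simple string matches.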
Bio2RDF shows how datasets are
connected together
Queries can be federated across
private and public SPARQL databases
Get all protein catabolic processes (and more specific GO terms) in biomodels
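PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
WHERE {
  # GO terms below any term whose label starts with "protein catabolic process"
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  # BioModels reactions annotated with those GO terms
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label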
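To make the federation step concrete, the sketch below imitates locally what the query asks for: each endpoint returns its own bindings, the two result sets are joined on the shared `?go` variable, and distinct `?x` values are counted per term. The binding rows are invented placeholders, not real query results, and no endpoint is contacted.

```python
from collections import defaultdict


def federated_count(go_bindings, reaction_bindings):
    """Join two result sets on 'go' and count distinct 'x' per (go, label)."""
    reactions_by_go = defaultdict(set)
    for row in reaction_bindings:
        reactions_by_go[row["go"]].add(row["x"])

    counts = {}
    for row in go_bindings:
        reactions = reactions_by_go.get(row["go"])
        if reactions:  # inner join: keep only terms that annotate a reaction
            counts[(row["go"], row["label"])] = len(reactions)
    return counts


# Hypothetical bindings standing in for the first endpoint's answer.
go_bindings = [
    {"go": "go:A", "label": "term A"},
    {"go": "go:B", "label": "term B"},
]
# Hypothetical bindings standing in for the second endpoint's answer.
reaction_bindings = [
    {"x": "reaction:1", "go": "go:A"},
    {"x": "reaction:2", "go": "go:A"},
    {"x": "reaction:1", "go": "go:A"},  # duplicate row, counted once
    {"x": "reaction:3", "go": "go:C"},  # no matching term, dropped by the join
]

print(federated_count(go_bindings, reaction_bindings))
```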
Graph-like representation amenable
to finding mismatches and discovering new links
W Hu, H Qiu, M Dumontier. Link Analysis of Life Science Linked Data.
International Semantic Web Conference (2) 2015: 446-462.
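One simple way a graph view can surface both kinds of finding is sketched below. It is an illustration only and not the method of the cited paper: equivalence links are grouped into connected components, unasserted pairs inside a component become candidate new links, and a component holding two identifiers from the same namespace is flagged as a possible mismatch. The identifiers are made up.

```python
from itertools import combinations


def components(links):
    """Group nodes joined by undirected links, using a small union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def analyze(links):
    """Return (candidate new links, possible mismatches) for equivalence links."""
    asserted = {frozenset(pair) for pair in links}
    new_links, mismatches = set(), []
    for group in components(links):
        for a, b in combinations(sorted(group), 2):
            if frozenset((a, b)) not in asserted:
                new_links.add((a, b))
        by_namespace = {}
        for node in sorted(group):
            by_namespace.setdefault(node.split(":", 1)[0], []).append(node)
        mismatches += [ids for ids in by_namespace.values() if len(ids) > 1]
    return new_links, mismatches


# Hypothetical equivalence links between records in three datasets.
links = [
    ("drugbank:DB1", "pubchem:C1"),
    ("pubchem:C1", "chebi:X1"),
    ("chebi:X1", "drugbank:DB2"),
]
new_links, mismatches = analyze(links)
print(sorted(new_links))
print(mismatches)
```

Treating every link as a strict equivalence is a strong assumption; in practice candidate links and mismatches would need review before being published.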
EbolaKB
Using Linked Data and Software
Kamdar, Dumontier. An Ebola virus-centered knowledge base. Database. 2015 Jun 8;2015. doi: 10.1093/database/bav049.
Network analysis and discovery
McCusker, McGuinness, Dumontier. In prep.
Can we implement
an open version of
PREDICT using
Linked Data?
AUC 0.91 across all therapeutic indications
Drug-drug similarity
A. Chemical structure Similarity
B. Side Effect Similarity
C. Target Sequence Similarity
D. Target Functional Similarity
E. Network Distance
Disease-disease similarity
A. Phenotype Based
B. Text Extracted Concepts
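A minimal sketch of how drug-drug and disease-disease similarities could be combined to score a candidate drug-indication pair. The scoring rule used here (the best geometric mean of the two similarities over known associations) and the use of Tanimoto similarity over feature sets are assumptions chosen for illustration; the slide does not specify them, and all feature sets are invented.

```python
from math import sqrt


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def association_score(drug, disease, known, drug_sim, disease_sim) -> float:
    """Best sqrt(drug similarity * disease similarity) over known pairs."""
    best = 0.0
    for known_drug, known_disease in known:
        if (known_drug, known_disease) == (drug, disease):
            continue  # do not let a pair vouch for itself
        combined = sqrt(drug_sim(drug, known_drug) * disease_sim(disease, known_disease))
        best = max(best, combined)
    return best


# Hypothetical feature sets (e.g. substructure keys, phenotype terms).
drug_features = {"d1": {1, 2, 3}, "d2": {2, 3, 4}, "d3": {9}}
disease_features = {"i1": {"a", "b"}, "i2": {"a", "b"}, "i3": {"z"}}
known_associations = [("d2", "i2")]


def drug_sim(x, y):
    return tanimoto(drug_features[x], drug_features[y])


def disease_sim(x, y):
    return tanimoto(disease_features[x], disease_features[y])


print(association_score("d1", "i1", known_associations, drug_sim, disease_sim))
```

With several similarity measures per side, each drug-measure and disease-measure combination would yield one such score, and the scores could then feed a classifier.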
HyQue: Hypothesis Validation
• A platform for knowledge discovery that
uses data retrieval coupled with
automated reasoning to validate
scientific hypotheses
• Leverages semantic technologies to
provide access to linked data,
ontologies, and semantic web services
• Uses positive and negative findings,
captures provenance
• Weighs evidence according to context
• Used to find aging genes in worm,
assess cardiotoxicity of tyrosine kinase
inhibitors
HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.
Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete.
May 27-31, 2012.
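The idea of weighing positive and negative findings by context, while keeping provenance, can be sketched as follows. This is not HyQue's implementation: the `Evidence` record, the source names, the weights, and the scoring formula are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str      # kind of evidence, e.g. "clinical_trial" (hypothetical label)
    supports: bool   # True for a positive finding, False for a negative one
    provenance: str  # where the finding came from


def hypothesis_score(evidence, weights, default_weight=1.0) -> float:
    """Weighted support minus contradiction, scaled to [-1, 1]; 0 if no evidence."""
    total = 0.0
    signed = 0.0
    for item in evidence:
        weight = weights.get(item.source, default_weight)
        total += weight
        signed += weight if item.supports else -weight
    return signed / total if total else 0.0


# Hypothetical context: clinical findings count three times as much as assays.
weights = {"clinical_trial": 3.0, "in_vitro_assay": 1.0}
evidence = [
    Evidence("clinical_trial", True, "http://example.org/trial/1"),
    Evidence("in_vitro_assay", False, "http://example.org/assay/7"),
]

print(hypothesis_score(evidence, weights))
```

Because each `Evidence` item carries its provenance, the score can always be traced back to the findings that produced it.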
What evidence might we gather?
• clinical: Are there cardiotoxic effects associated with the drug?
– Literature (studies) [curated db]
– Product labels (studies) [r3:sider]
– Clinical trials (studies) [r3:clinicaltrials]
– Adverse event reports [r2:pharmgkb/onesides]
– Electronic health records (observations)
• pre-clinical associations:
– genotype-phenotype (null/disease models) [r2:mgi, r2:sgd; r3:wormbase]
– in vitro assays (IC50) [r3:chembl]
– drug targets [r2:drugbank; r2:ctd; r3:stitch]
– drug-gene expression [r3:gxa]
– pathways [r2:kegg; r3:reactome]
– Drug-pathway, disease-pathway enrichments [aberrant pathways]
– Chemical properties [r2:pubchem; r2:drugbank]
– Toxicology [r1:toxkb/cebs]
HyQue
Beyond Bio2RDF
Network of Linked Data (~2007)
Expansion across domains
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
A rapidly growing network of Linked Data
"Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/"
but the lack of coordination makes
Linked Open Data chaotic and unwieldy
There is no shortage
of vocabularies, ontologies and community-based
standards
metadatacenter.org
NIH COMMONS
Making it Easier, Possibly Even Pleasant, to Author
Interoperable Experimental Metadata
PubChem engaged the community to
reuse and extend existing vocabularies
Semanticscience Ontology (SIO)
An effective upper level ontology.
1500+ classes
207 object properties (inc. inverses)
1 datatype property
Chemical Information Ontology
(CHEMINF)
• Collaborative ontology
• Distinguishes algorithmic,
or procedural, information
from declarative, or factual,
information, and gives
particular importance to
annotating calculated data
with its provenance.
Where are we going?
• Large-scale publishing across biomedical
datatypes is possible on the web
• Hubs such as NCBI and EBI now integrate data,
but there is a need for global coordination on all
datatypes
• Standard vocabularies must be open, freely
accessible, and demonstrably reused
• Use of worldwide data integration formats (RDF)
and improved linking of data
• Easier-to-deploy toolkits for providing
standards-compliant linked data
Linked Data Platform
Docker
• Data conversion scripts
• Query Editor
• Faceted Browser
• Relation Exploration
• API
• Data and data store
Model Organism Linked Data
MO-LD.org
In Summary
• We use semantic technologies such as ontologies
and linked data to make sense of and facilitate
access to biomedical data (FAIR)
• The intimate development and use of standards
by PubChem and others brings us closer to an
interoperability ideal
• Much more work is needed to support
(computational) discovery in a reproducible
manner.
Acknowledgements
Dumontier Lab
• Amrapali Zaveri
• Mary Panahiazar
• Shima Dastgheib
• Sandeep Ayyar
• Remzi Celebi
• David Odgers
• Wei Hu
• Ruben Verborgh
• Leo Chepelev
• Alison Callahan
• Jose Miguel Toledo Cruz
• Tanya Hiebert
• Beatriz Lujan
+ many more
Collaborators
• Mark Musen
• Nigam Shah
• Robert Hoehndorf
• Janna Hastings
• Christoph Steinbeck
• Egon Willighagen
• Nico Adams
• Colin Batchelor
• David Wild
• Evan Bolton
• Gang Fu
+ many more
dumontierlab.com
michel.dumontier@stanford.edu
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier


Editor's Notes

  • #11 The Bio2RDF project transforms silos of life science data into a globally distributed network of linked data for biological knowledge discovery.
  • #33 CEDAR Architecture. Note the connection to NCBO services, and the use of the NCBO ontology repository for CEDAR resources.