Advancing Biomedical
Knowledge Reuse with FAIR
Michel Dumontier
Distinguished Professor of Data Science
1 @micheldumontier::BH17:2017-09-17
So what do we need to achieve this?
1. Data Science
Infrastructure to identify, represent, store, retrieve,
aggregate, query, and analyze data using software and
services in a reliable and reproducible manner.
Methods to discover and report plausible, supported,
prioritized, and experimentally verifiable associations.
2. Community
to build a massive, decentralized network of
interconnected and interoperable data and services
A set of principles that apply to all digital resources
(software, images, data, repositories, web services, scholarly
publications)
and their metadata
(identifiers, licensing, provenance, access protocols…).
Developed and endorsed by researchers, publishers, funding
agencies, industry partners. Work started at the Lorentz Center in
the Netherlands in 2014, principles refined at BH15.
Rapid Adoption of Principles
As of Sept 2017,
200+ citations since
2016 publication
Included in G20
communique, EOSC,
H2020, NIH, and more…
nature.com/articles/sdata201618
The Semantic Web
is a portal to the web of knowledge
standards for publishing, sharing and querying
facts, expert knowledge and services
scalable approach for the discovery
of independently constructed,
collaboratively described,
distributed knowledge
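As a rough illustration of what "querying facts" means here, the sketch below stores facts as subject-predicate-object triples and answers a pattern query with wildcards. It is plain Python rather than RDF or SPARQL, and every identifier in it (the `ex:` names, `FACTS`, `match`) is a made-up placeholder, not taken from any real dataset or library.

```python
from typing import Optional

# A fact is a (subject, predicate, object) triple.
# All identifiers below are hypothetical placeholders for illustration.
Triple = tuple[str, str, str]

FACTS: list[Triple] = [
    ("ex:drugA", "ex:targets", "ex:proteinX"),
    ("ex:drugB", "ex:targets", "ex:proteinX"),
    ("ex:proteinX", "ex:involvedIn", "ex:pathwayP"),
    ("ex:drugA", "ex:treats", "ex:diseaseD"),
]


def match(
    facts: list[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> list[Triple]:
    """Return the facts that fit a triple pattern; None acts as a wildcard."""
    return [
        t
        for t in facts
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Which subjects target proteinX?" expressed as the pattern (?, targets, proteinX).
drugs_hitting_x = [subj for subj, _, _ in match(FACTS, p="ex:targets", o="ex:proteinX")]
```

Because every source publishes facts in the same triple shape, independently built datasets can be concatenated and queried with the same pattern, which is the property the slide attributes to the Semantic Web standards.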
Together, we are building a massive
decentralized knowledge graph
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net
• 10B+ interlinked statements from 30+
conventional and high value datasets
• Partnerships with EBI, SIB, NCBI, DBCLS, NCBO,
OpenPHACTS, and many others
chemicals/drugs/formulations,
genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
Alison Callahan, Jose Cruz-Toledo, Peter Ansell, Michel Dumontier:
Bio2RDF Release 2: Improved Coverage, Interoperability and
Provenance of Life Science Linked Data. ESWC 2013: 200-212
Linked Data for the Life Sciences
Bio2RDF is an open source project that uses semantic web
technologies to make it easier to reuse biomedical data
Efficiently find and explore data
Examine the facts and their provenance
Find new uses for existing drugs
Finding melanoma drugs through a probabilistic knowledge
graph. PeerJ Computer Science. 2017. 3:e106
https://doi.org/10.7717/peerj-cs.106
by exploring a probabilistic
semantic knowledge graph
And validate them against
pipelines for drug discovery
6256 tested drug combinations measuring cell viability over 118 drugs and 85 cancer
cell lines + monotherapy drug response data for each drug and cell line. Chemical
formulas for half of the drugs.
Used: Elastic Net Regression, Random Forest, Gradient Boosting Trees, SVM
Spearman/Pearson correlation: ~0.48 - Very difficult problem!
Gene expression + chemical structure + targets were the most informative features.
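For readers unfamiliar with the two scores quoted above, the sketch below shows how Pearson and Spearman correlation between observed and predicted responses can be computed. It uses only the standard library; the function names and the sample numbers are illustrative assumptions, not the data or code behind the reported ~0.48.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical observed vs. predicted responses, for illustration only.
observed = [0.10, 0.35, 0.40, 0.80, 0.95]
predicted = [0.20, 0.30, 0.55, 0.60, 0.90]
r_pearson = pearson(observed, predicted)
r_spearman = spearman(observed, predicted)
```

Pearson rewards predictions that are linearly related to the observations, while Spearman only asks that the ordering agrees, which is why the two are often reported together.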
High Quality Metadata are
Essential
for Large-Scale Reuse
and Biomedical Discovery
http://www.w3.org/TR/hcls-dataset/
Core Metadata
• Identifiers
• Title
• Description
• Homepage
• License
• Language
• Keywords
• Concepts and vocabularies
• Standard compliance
• Publication
Extended Metadata
• Provenance Metadata
• Versioning Metadata
• Content Metadata
A guideline to produce detailed RDF metadata
Started in BH13
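One simple way to act on the core metadata list above is to check a dataset description for missing elements. The sketch below does that with a plain dictionary. The key names are hypothetical stand-ins that mirror the slide's bullet points; the guideline itself describes RDF metadata, and this sketch does not reproduce its actual properties.

```python
# Hypothetical keys mirroring the "Core Metadata" bullets on the slide.
# They are NOT the RDF properties defined by the guideline.
CORE_FIELDS = [
    "identifiers",
    "title",
    "description",
    "homepage",
    "license",
    "language",
    "keywords",
    "concepts_and_vocabularies",
    "standard_compliance",
    "publication",
]


def missing_core_fields(record: dict) -> list[str]:
    """List core fields that are absent or empty in a dataset description."""
    return [field for field in CORE_FIELDS if not record.get(field)]


# A made-up, deliberately incomplete dataset description.
example_record = {
    "identifiers": ["ex:dataset-1"],
    "title": "Example dataset",
    "description": "A placeholder description for illustration.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": [],  # present but empty, so it counts as missing
}

gaps = missing_core_fields(example_record)
```

A check like this could run before a dataset is registered, so that gaps in the description are caught while the data provider can still fill them in.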
smartAPI
Builds on
Builds on OWL2
What’s missing?
• More “success” stories of using linked data for
discovery
• Making it easier to survey, locate, retrieve, and reuse
FAIR digital resources
– Registries
– Standardized APIs: What's the right balance between REST
APIs and query APIs?
– Docker
• Metrics to assess the FAIRness of digital resources
(FAIR Metrics)
• Making it easier to retrieve distributed knowledge
expressed using different ontologies or schemas
Influenced a number of BH projects
@BH17
• Provide support and expertise on FAIR,
ontologies, data and service descriptions
• Update and deploy Bio2RDF with Docker on new
infrastructure (w/Alex Malic)
• Work on FAIR metrics (fairmetrics.org), apply to
Bio2RDF data and services (w/Mark Wilkinson)
• Revise the smartAPI UI for v3 (w/Chunlei Wu)
• Explore the use of deep learning frameworks over
linked data (w/Robert Hoehndorf)
michel.dumontier@maastrichtuniversity.nl
Website: http://www.maastrichtuniversity.nl/ids
Presentations: http://slideshare.com/micheldumontier
The mission of the Institute of Data Science is
to accelerate scientific discovery, improve
health and wellbeing, and empower
communities by fostering a collaborative
environment for multidisciplinary team
science, interdisciplinary training, and youth
innovation using effective computation on
FAIR digital resources.

Editor's Notes

  • #6 G20: http://europa.eu/rapid/press-release_STATEMENT-16-2967_en.htm EOSC: https://ec.europa.eu/research/openscience/pdf/realising_the_european_open_science_cloud_2016.pdf H2020: https://goo.gl/Strjua
  • #9 The Bio2RDF project transforms silos of life science data into a globally distributed network of linked data for biological knowledge discovery.
  • #14 Biomedical researchers will remain stymied in their ability to take full advantage of the Big Data revolution if they can never find the datasets that they need to analyze, if there is lack of clarity about what particular datasets contain, and if data are insufficiently described.