With its focus on investigating the basis for the sustained existence of living systems, modern biology has always been a fertile, if not challenging, domain for formal knowledge representation and automated reasoning. With thousands of databases and hundreds of ontologies now available, there is a salient opportunity to integrate these for discovery. In this talk, I will discuss our efforts to build a rich foundational network of ontology-annotated linked data, develop methods to intelligently retrieve content of interest, uncover significant biological associations, and pursue new avenues for drug discovery. As the portfolio of Semantic Web technologies continues to mature in terms of functionality, scalability, and an understanding of how to maximize their value, researchers will be strategically poised to pursue increasingly sophisticated KR projects aimed at improving our overall understanding of human health and disease.
bio: Dr. Michel Dumontier is an Associate Professor of Medicine (Biomedical Informatics) at Stanford University. His research aims to find new treatments for rare and complex diseases. His research interests lie in the publication, integration, and discovery of scientific knowledge. Dr. Dumontier serves as a co-chair for the World Wide Web Consortium Semantic Web in Health Care and Life Sciences Interest Group (W3C HCLSIG) and is the Scientific Director for Bio2RDF, a widely used open-source project to create and provide linked data for the life sciences.
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata - Michel Dumontier
Biomedical researchers will remain stymied in their ability to take full advantage of the Big Data revolution if they can never find the datasets that they need to analyze, if there is a lack of clarity about what particular datasets contain, and if data are insufficiently described.
CEDAR, an NIH BD2K Center of Excellence, aims to develop methods and tools to vastly ease the burden of authoring good experimental metadata, and to maximally use this information to zero in on datasets of interest.
Semantic Web technologies offer a potential mechanism for the representation and integration of thousands of biomedical databases. Many of these databases offer cross-references to other data sources, but these are generally incomplete and prone to error. In this paper, we conduct an empirical analysis of the link structure of life science Linked Data, obtained from the Bio2RDF project. Three different link graphs, for datasets, entities, and terms, are characterized by degree, connectivity, and clustering metrics, and their correlation is measured as well. Furthermore, we use the symmetry and transitivity of entity links to build a benchmark and evaluate several popular entity-matching approaches. Our findings indicate that the life science data network can help find hidden links, can be used to validate links, and may offer a mechanism to integrate a wider set of resources to support biomedical knowledge discovery.
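As a rough illustration of the link-graph analyses described above, the sketch below builds a toy cross-reference graph with networkx, computes degree and clustering metrics, and flags asymmetric and missing-transitive links of the kind the benchmark exploits. The prefixes and identifiers are illustrative placeholders, not actual Bio2RDF content.

```python
# A toy version of the link-structure analysis; not the paper's actual pipeline.
import networkx as nx

# Toy entity-link graph: directed edges are cross-references between records.
links = [
    ("drugbank:DB00316", "pubchem:CID1983"),
    ("pubchem:CID1983", "drugbank:DB00316"),  # a symmetric pair
    ("pubchem:CID1983", "chebi:46195"),       # reciprocal link is missing
]
G = nx.DiGraph(links)

# Degree and clustering metrics, as used to characterize the link graphs.
print("degree:", dict(G.degree()))
print("clustering:", nx.clustering(G.to_undirected()))

# Symmetry: a cross-reference x -> y should be matched by y -> x.
asymmetric = [(u, v) for u, v in G.edges() if not G.has_edge(v, u)]
print("missing reciprocal links:", asymmetric)

# Transitivity: if x -> y and y -> z, then x -> z should also exist;
# absent cases are candidate hidden links.
hidden = [(u, w) for u, v in G.edges() for w in G.successors(v)
          if u != w and not G.has_edge(u, w)]
print("candidate hidden links:", hidden)
```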
Bio2RDF is an open-source project that offers a large and connected knowledge graph of Life Science Linked Data. Each dataset is expressed using its own vocabulary, thereby hindering the ability to integrate, search, query, and browse data across similar or identical types of data. With growth and content changes in source data, a manual approach to maintaining mappings has proven untenable. The aim of this work is to develop a (semi-)automated procedure to generate high-quality mappings between Bio2RDF and SIO using BioPortal ontologies. Our preliminary results demonstrate that our approach is promising in that it can find new mappings using a transitive closure over ontology mappings. Further development of the methodology, coupled with improvements in the ontology, will offer a better-integrated view of the Life Science Linked Data.
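A minimal sketch of the transitive-closure idea behind this procedure: if a Bio2RDF term maps to an intermediate BioPortal ontology term that in turn maps to SIO, the composed mapping becomes a candidate. The mapping pairs below are invented placeholders, not actual BioPortal output.

```python
# Toy transitive closure over term mappings; not the project's actual code.
from collections import defaultdict

mappings = [
    ("bio2rdf:Gene", "obo:SO_0000704"),    # Bio2RDF term -> intermediate term
    ("obo:SO_0000704", "sio:SIO_010035"),  # intermediate term -> SIO term
    ("bio2rdf:Protein", "obo:PR_000000001"),
]

adj = defaultdict(set)
for s, t in mappings:
    adj[s].add(t)

def closure(term):
    """All terms reachable from `term` by following mapping edges."""
    seen, stack = set(), [term]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Propose Bio2RDF -> SIO mappings that are only reachable transitively.
for term in ("bio2rdf:Gene", "bio2rdf:Protein"):
    inferred = {t for t in closure(term) if t.startswith("sio:")}
    print(term, "->", inferred or "no SIO mapping found")
```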
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. This document describes a consensus among participating stakeholders in the Health Care and the Life Sciences domain on the description of datasets using the Resource Description Framework (RDF). This specification meets key functional requirements, reuses existing vocabularies to the extent possible, and addresses elements of data description, versioning, provenance, discovery, exchange, query, and retrieval.
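To make the flavor of such descriptions concrete, here is a minimal sketch that emits an RDF dataset description with rdflib, using DCAT and Dublin Core terms; the dataset URI and values are placeholders, and a description conforming to the full specification would carry more elements (provenance, distributions, versioning links) than shown.

```python
# A minimal, hedged example of an RDF dataset description; placeholder values.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

ds = URIRef("http://example.org/dataset/mydata")  # placeholder dataset URI
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Example life-science dataset", lang="en")))
g.add((ds, DCTERMS.description,
       Literal("Illustrates versioned, licensed dataset metadata.", lang="en")))
g.add((ds, DCTERMS.license,
       URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.add((ds, DCTERMS.hasVersion, Literal("1.0")))

print(g.serialize(format="turtle"))
```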
Model organisms such as budding yeast provide a common platform to interrogate and understand cellular and physiological processes. Knowledge about model organisms, whether generated during the course of scientific investigation or extracted from published articles, is made available by model organism databases (MODs) such as the Saccharomyces Genome Database (SGD) for powerful, data-driven bioinformatic analyses. Integrative platforms such as InterMine offer a standard platform for MOD data exploration and data mining. Yet today’s bioinformatic analyses also require access to a significantly broader set of structured biomedical data, such as what can be found in the emerging network of Linked Open Data (LOD). If MOD data could be provisioned as FAIR (Findable, Accessible, Interoperable, and Reusable), then scientists could leverage a greater amount of interoperable data in knowledge discovery.
The goal of this proposal is to increase the utility of MOD data by implementing standards-compliant data access interfaces that interoperate with Linked Data. We will focus our efforts on developing interfaces for data access, data retrieval, and query answering for SGD. Our software will publish InterMine data as LOD that are semantically annotated with ontologies and can be retrieved using standardized formats (e.g., JSON-LD, Turtle). We will facilitate the exploration of MOD data for hypothesis testing by implementing efficient query answering using Linked Data Fragments, and by developing a set of graphical user interfaces to search for data of interest, explore connections, and answer questions that leverage the wider LOD network. Finally, we will develop a locally and cloud-deployable image to enable the rapid deployment of the proposed infrastructure. Our efforts to increase interoperability and ease of deployment for biomedical data repositories will increase research productivity and reduce costs associated with data integration and warehouse maintenance.
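As a sketch of the Linked Data Fragments access pattern mentioned above: a client resolves a query one triple pattern at a time against a lightweight HTTP interface. The endpoint URL and gene identifier below are hypothetical stand-ins for whatever the proposed SGD interface would publish.

```python
# A minimal Triple Pattern Fragments client sketch; hypothetical endpoint.
import requests

ENDPOINT = "http://example.org/ldf/sgd"  # hypothetical TPF endpoint

def fetch_fragment(subject=None, predicate=None, obj=None):
    """Request one page of triples matching a single triple pattern."""
    params = {}
    if subject:
        params["subject"] = subject
    if predicate:
        params["predicate"] = predicate
    if obj:
        params["object"] = obj
    resp = requests.get(ENDPOINT, params=params,
                        headers={"Accept": "text/turtle"})
    resp.raise_for_status()
    return resp.text  # Turtle: matching triples plus paging/count metadata

# Example: all triples describing a (hypothetical) yeast gene record.
print(fetch_fragment(subject="http://example.org/sgd/S000002429"))
```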
Powering Scientific Discovery with the Semantic Web (VanBUG 2014) - Michel Dumontier
In the quest to translate the results of biomedical research into effective clinical applications, many are now trying to make sense of the large and rapidly growing amount of public biomedical data. However, substantial challenges exist in traversing the currently fragmented data landscape. In this talk, I will discuss our efforts to use Semantic Web technologies to facilitate biomedical research through the formulation, publication, integration, and exploration of facts, expert knowledge, and web services.
Building a Network of Interoperable and Independently Produced Linked and Ope... - Michel Dumontier
Over 15 years ago, Sir Tim Berners-Lee proclaimed the founding of an exciting new future involving intelligent agents operating over smarter data in order to perform complex tasks at the behest of their human controllers. At the heart of this vision lies an uneasy alliance between tedious formal knowledge representations and powerful analytics over big, but often messy, data. Bio2RDF, our decade-old open-source project to create Linked Data for the life sciences, has woven emergent Semantic Web technologies such as ontologies and Linked Data to generate FAIR (Findable, Accessible, Interoperable, and Reusable) data in the form of billions of machine-accessible statements for use in downstream biomedical discovery.
This revolution in data publication has been strengthened by action from global bioinformatics institutions such as the NCBI, NCBO, EBI, and DBCLS. Notably, NCBI's PubChem has successfully coupled large-scale data integration with community-based standards to offer a remarkable biochemical knowledge resource amenable to data-hungry discovery tools. Yet, in the face of increasing pressure from researchers, funders, and publishers, will these approaches be sufficient for growing and maintaining a comprehensive knowledge graph that is inclusive of all biomedical research?
A presentation to the New Year's Event for Maastricht University's Knowledge Engineering @ Work Program. https://www.maastrichtuniversity.nl/news/kework-first-10-students-academic-workstudy-track-graduate
Generating Biomedical Hypotheses Using Semantic Web Technologies - Michel Dumontier
With its focus on investigating the nature and basis for the sustained existence of living systems, modern biology has always been a fertile, if not challenging, domain for formal knowledge representation and automated reasoning. Over the past 15 years, hundreds of projects have developed or leveraged ontologies for entity recognition and relation extraction, semantic annotation, data integration, query answering, consistency checking, association mining, and other forms of knowledge discovery. In this talk, I will discuss our efforts to build a rich foundational network of ontology-annotated linked data, discover significant biological associations across these data using a set of partially overlapping ontologies, and identify new avenues for drug discovery by applying measures of semantic similarity over phenotypic descriptions. As the portfolio of Semantic Web technologies continues to mature in terms of functionality, scalability, and an understanding of how to maximize their value, increasing numbers of biomedical researchers will be strategically poised to pursue increasingly sophisticated KR projects aimed at improving our overall understanding of the capability and behavior of biological systems.
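To ground the semantic-similarity step mentioned here, below is a toy sketch of one common family of measures: Jaccard similarity over the ontology-ancestor closures of two phenotype-term sets. The miniature is-a hierarchy and term labels are invented; real analyses use curated ontologies and often information-content-weighted measures.

```python
# A toy ancestor-closure Jaccard similarity; invented mini phenotype ontology.
is_a = {  # child -> parents
    "HP:decreased_heart_rate": {"HP:heart_rate_abnormality"},
    "HP:increased_heart_rate": {"HP:heart_rate_abnormality"},
    "HP:heart_rate_abnormality": {"HP:cardiac_abnormality"},
    "HP:cardiac_abnormality": set(),
}

def ancestors(term):
    """The term itself plus all of its transitive is-a ancestors."""
    out, stack = {term}, [term]
    while stack:
        for p in is_a.get(stack.pop(), ()):
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

def closure(terms):
    return set().union(*(ancestors(t) for t in terms))

def jaccard(a, b):
    ca, cb = closure(a), closure(b)
    return len(ca & cb) / len(ca | cb)

# Two phenotype profiles share ancestors, so they score non-zero similarity.
drug_profile = {"HP:decreased_heart_rate"}
disease_profile = {"HP:increased_heart_rate"}
print(jaccard(drug_profile, disease_profile))  # 0.5 via shared ancestors
```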
Citing data in research articles: principles, implementation, challenges - an... - FAIRDOM
Prepared and presented by Jo McEntyre (EMBL-EBI) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany, September 14-16, 2015.
dkNET Webinar: "The Microphysiology Systems Database (MPS-Db): A Platform For... - dkNET
Abstract
The Microphysiology Systems Database Center (MPS-DbC) developed and implemented the Microphysiology Systems Database (MPS-Db, https://mps.csb.pitt.edu/) for the management, analysis, sharing, and integration of preclinical and clinical information, and for computational modeling of data, in one platform, enhancing the value of in vitro models and the user workflow. The MPS-Db supports data from a wide range of in vitro models, including static and microfluidic 2D and 3D microplates, and microfluidic MPS for single- and multiple-organ models. Aggregation of metadata, experimental data, and references provides for robust and relevant interpretation of the results, and having a central repository facilitates data sharing among user-specified collaborators and groups. Ready access to experimental data and metadata from any in vitro platform, along with reference data in a mineable format, provides a convenient platform for statistical analysis of performance, and for building computational models to predict PK, identify compound mechanisms of action, and infer pathways of disease progression. The MPS-DbC assists users in capturing and managing MPS data, and the MPS-Db is the central repository for the Tissue Chip Testing Centers, as well as the NCATS Tissue Chips programs. We continue to build the research and commercial value of the MPS-Db by: 1) supporting MPS users to build content; 2) implementing on-line preclinical/clinical concordance analysis capabilities; 3) enhancing the suite of data mining and computational modeling tools; and 4) augmenting methods for ensuring data quality and the secure, controlled release of data to user-specified groups.
The top 3 key questions that the Microphysiology Systems Database (MPS-Db) can answer:
1. What models are available, what are their characteristics, how reproducible are they, and how can they be used?
2. How does organ model A compare with organ model B? For example, where model A and model B are constructed in different laboratories, on different days, or with different cells, such as iPSCs vs. primary cells.
3. Which readouts from an organ model are predictive of a specific clinical outcome and how reliable is the prediction?
Presenter: Bert Gough, PhD, Associate Professor of Computational and Systems Biology, Group Leader Informatics, University of Pittsburgh Drug Discovery Institute
Upcoming webinars schedule: https://dknet.org/about/webinar
My talk at the last-ever Open PHACTS project meeting in Vienna in 2016, where I was asked to talk about the challenges we addressed in Open PHACTS with Semantic Web technology and what still needed to be done.
Marco Brandizi and Keywan Hassani-Pak, Rothamsted Research, Invited Presentation at SWAT4HCLS 2022.
The FAIR data principles are a driving force in the life sciences and other scientific domains, helping researchers to share their data and unlock its full potential for integrating information and making novel discoveries. Knowledge graphs are an ever more popular paradigm to model data according to such principles, and technologies such as graph databases are emerging as complementary to approaches like linked data. All of this includes the agronomy, farming, and food domains. How advanced is the adoption of sound data management policies in these domains? How does it compare to other life sciences? In this presentation, we will talk about our practical experience, focusing on KnetMiner, a gene and molecular biology discovery platform, which is based on building and publishing knowledge graphs according to the FAIR principles, as well as using a mix of linked data standards for life sciences and recent graph database and API technologies. We will welcome questions and discussion from the audience about similar experiences.
In recent years there has been a dramatic increase in the number of freely accessible online databases serving the chemistry community. The internet provides chemistry data that can be used for data mining, for computer models, and for integration into systems to aid drug discovery. There is, however, a responsibility to ensure that the data are of high quality, so that time is not wasted on erroneous searches, that models are underpinned by accurate data, and that the improved discoverability of online resources is not marred by incorrect data. In this article we provide an overview of some of the authors' experiences using online chemical compound databases, critique the approaches taken to assemble data, and suggest approaches to deliver definitive reference data sources.
Opportunities and challenges presented by Wikidata in the context of biocuration - Benjamin Good
Abstract: Wikidata is a knowledge base, readable and writable by anyone, maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open-access knowledge graph spanning biology, medicine, and all other domains of knowledge. To meet this potential, social and technical challenges must be overcome, many of which are familiar to the biocuration community. These include community ontology building, high-precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data.
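For a concrete feel of programmatic access to Wikidata, the sketch below runs a small SPARQL query against the public query service; P31 ("instance of") and Q7187 ("gene") are real Wikidata identifiers, and the query is a generic example rather than a tool from the abstract.

```python
# A minimal query against Wikidata's public SPARQL endpoint.
import requests

SPARQL = """
SELECT ?gene ?geneLabel WHERE {
  ?gene wdt:P31 wd:Q7187 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "biocuration-demo/0.1 (example)"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["gene"]["value"], row["geneLabel"]["value"])
```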
Laboratories around the world continue to generate immense amounts of data that are non-proprietary and of value to the community. If available, these data could dramatically reduce costs by minimizing rework and ultimately facilitating faster research. High-quality reference data collections of chemical compound dictionaries, properties, and spectra have been generated over many decades. With the advent of social networking tools and platforms such as Wikipedia, the community has an opportunity to contribute. The ChemSpider platform hosted by the Royal Society of Chemistry is a compound-centric database with associated data. Already populated with almost 25 million unique compounds, it lets the community deposit and host their own data, and curate and annotate existing data, including those generated in Open Notebook Science efforts. This presentation will provide an overview of progress to date and outline the vision of this community platform for chemistry and for ensuring the longevity of chemistry reference data.
Exploring Chemical and Biological Knowledge Spaces with PubChem - Paul Thiessen
My presentation for the Drug Repurposing workshop at the upcoming Bio-IT World Expo.
http://www.bio-itworldexpo.com/Bio-It_Expo_Content.aspx?id=124256
Presentation abstract:
PubChem has a wealth of chemical structure and biological activity information. In conjunction with NCBI’s other resources such as PubMed and GenBank, PubChem is a vast source of information relevant to repurposing not only established drugs but any compounds with in vivo pharmacology and/or clinical results. The challenge is how to take advantage of this knowledge. The ability to explore not only chemical similarity but also relationships between diseases and disease targets has crucial value in repurposing. While focused investigations are already possible within the existing Entrez system, navigation across these linked information spaces can be difficult to do on a large scale with current tools. We are actively developing new infrastructure to support such analyses, and pursuing new methods of exploring inter- and intra-database relationships between chemicals, targets, diseases, and patents. Progress and some future directions in these areas will be presented.
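As one illustration of navigating these linked information spaces with existing public machinery, the snippet below uses NCBI's E-utilities ELink call to fetch PubMed records linked to a PubChem compound; the compound (CID 2244, aspirin) is an arbitrary example, and this is not the new infrastructure the abstract describes.

```python
# A minimal ELink example over NCBI E-utilities; parameters per the E-utilities docs.
import requests

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

resp = requests.get(ELINK, params={
    "dbfrom": "pccompound",  # source: PubChem Compound
    "db": "pubmed",          # target: PubMed
    "id": "2244",            # CID 2244 (aspirin), as an arbitrary example
    "retmode": "json",
})
resp.raise_for_status()
for lsdb in resp.json()["linksets"][0].get("linksetdbs", []):
    # Each linksetdb names a link type and lists linked target record IDs.
    print(lsdb["linkname"], "->", lsdb["links"][:5], "...")
```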
With the explosion of interest in both enhanced knowledge management and open science, the past few years have seen considerable discussion about making scientific data “FAIR”: findable, accessible, interoperable, and reusable. The problem is that most scientific datasets are not FAIR. When left to their own devices, scientists do an absolutely terrible job creating the metadata that describe the experimental datasets that make their way into online repositories. The lack of standardization makes it extremely difficult for other investigators to locate relevant datasets, to re-analyse them, and to integrate those datasets with other data. The Center for Expanded Data Annotation and Retrieval (CEDAR) has the goal of enhancing the authoring of experimental metadata to make online datasets more useful to the scientific community. The CEDAR workbench for metadata management will be presented in this webinar. CEDAR illustrates the importance of semantic technology to driving open science. It also demonstrates a means for simplifying access to scientific datasets and enhancing the reuse of the data to drive new discoveries.
Can machines understand the scientific literature? - Peter Murray-Rust
With over 5000 scientific articles published per day, we need machines to help us understand the content. This material is to be used at an interactive session for the Science Society at Trinity College, Cambridge, UK.
From Coordination to Semantic Self-Organisation: A Perspective on the Enginee... - Andrea Omicini
After briefly recapitulating the classical lines of the literature on coordination models, we discuss the new lines of research that aim at addressing the coordination of complex systems, then focus on mechanisms and patterns of coordination for self-organising systems. The notions of semantic coordination and self-organising coordination are defined and shortly discussed, then a vision of SOSC (self-organising semantic coordination) is presented, along with some insights over available technologies and possible scenarios for SOSC.
Lecture at the PhD Mini-school, 11th National Workshop "From Objects to Agents" (WOA 2010), 05/09/2010, Bologna, Italy.
Introduction to Natural Language Processing - Rohit Nayak
Natural Language Processing has matured a lot recently. With the availability of great open-source tools complementing the needs of the Semantic Web, we believe this field should be on the radar of all software engineering professionals.
Introduction to Natural Language Processing - Pranav Gupta
The presentation gives a gist of the major tasks and challenges involved in natural language processing. In the second part, it covers one technique each for part-of-speech tagging and automatic text summarization.
Using Healthcare Data for Research @ The Hyve - Campus Party 2016 - Kees van Bochove
In this presentation, Kees van Bochove, founder & CEO of The Hyve, a services company in biomedical open source software, presents a number of different types of healthcare data. As an example, he also provides details of a project in which The Hyve participates and which uses that kind of data. Covered are: translational medicine data using tranSMART and cBioPortal, population health data using OMOP and OHDSI, and personal health data processing using open mHealth Shimmer and Apache Kafka.
Pine Biotech conducts monthly informational workshops on topics related to high-throughput data analysis, interpretation, and integration. The workshops highlight our research tools and educational resources developed with collaborators in the US and across the world.
GenomeTrakr: Perspectives on linking internationally - Canada and IRIDA.ca - Fiona Brinkman
Talk at the GenomeTrakr network meeting, Sept 23, 2015, in Washington, DC, on Canada's open-source Integrated Rapid Infectious Disease Analysis (IRIDA) bioinformatics platform, which aids genomic epidemiology analysis for public health agencies, with planned open data release and linkage to GenomeTrakr. Discusses perspectives, challenges, and solutions for increasing international GenomeTrakr participation.
FAIRness Assessment of the Library of Integrated Network-based Cellular Signa... - Kathleen Jagodnik
The FAIR Guiding Principles facilitate the Findability, Accessibility, Interoperability, and Reusability of digital resources. The Library of Integrated Network-based Cellular Signatures (LINCS) Project has sought to implement the FAIR principles in the provision of its resources in order to optimize usability. We have surveyed the FAIR principles and are implementing specific facets within the LINCS resources. Subsequently, with reference to the literature and other efforts to measure FAIRness, we are developing quantitative metrics to assess the FAIRness of each dataset and resource in order to provide users with objective measures of the characteristics of the LINCS project. Assessing and improving the FAIRness of LINCS is an ongoing effort by our team that will benefit from community input to ensure that all LINCS users are optimally engaged with this resource.
Slides contain information about why bioinformatics appeared, who bioinformaticians are, what they do, and what kinds of cool applications and challenges exist in bioinformatics.
Slides were prepared for the Bioinformatics seminar 2016, Institute of Computer Science, University of Tartu.
Notes on "Artificial Intelligence in Bioscience Symposium 2017"PetteriTeikariPhD
Including talks on drug discovery, drug target selection, scientific reproducibility, machine learning in omics and GWAS, network biology, functional connectome, endotype discovery, Bayesian causal networks, systems biology, brain decoding, place cells, personalized medicine, sepsis warning systems, knowledge engineering, CRISPR genome editing, data science stacks, feline gene sequencing, generative models for chemical compounds via variational autoencoders, and ethics in AI medicine.
https://www.bioscience.ai/ | #bioai2017 | Sept 14, 2017 | The British Library, London
Alternative download for slides if Slideshare download is acting up: https://www.dropbox.com/s/2wdfuqzifns7475/bioai2017.pdf?dl=0
20 years of evolution in data production in health and life sciences - slecrom
I share feedback as the scientific head of a genomics core facility and as a facilities manager over the last 20 years. I go through the evolution of the high-throughput sequencing field and discuss data storage and sharing.
dkNET Webinar: The Collaborative Microbial Metabolite Center – Democratizing ... - dkNET
Presenter: Pieter Dorrestein, PhD, Professor, Skaggs School of Pharmacy and Pharmaceutical Sciences, Department of Pharmacology and Pediatrics, University of California San Diego
Abstract
In the analysis of organs, the volatilome, or biofluids, the microbiome influences 15-70% of detectable mass spectrometry molecules. Typically, only 10% of human untargeted metabolomics data can be assigned a molecular structure, with merely 1-2% traceable to microbial origins. Human microbiomes contribute metabolites through the microbial metabolism of host-derived substances, digestion of food and beverage molecules, and de novo assembly using proteins encoded by genetic elements. Despite the significance of microbiome-derived metabolites to human health, there is no centralized knowledge base for community access. To address this, the "Collaborative Microbial Metabolite Center" (CMMC) leverages expertise in mass spectrometry, microbiome innovation, and the GNPS ecosystem to build a knowledge base. It aims to create a user-accessible microbiome resource, enrich bioactivity knowledge, and facilitate data deposition. The CMMC includes the construction of a knowledge base, the MicrobeMASST tool, and health phenotype enrichment workflows; their construction and use will be discussed in this presentation. The use of this ecosystem will be exemplified by the discovery of 20,000 bile acids, many of which were shown to be of microbial origin and linked to diet and IBD.
The top 3 key questions that this resource can answer:
1. How can we leverage the thousands of public metabolomics studies to discover microbial metabolites and their organ distributions, as well as their phenotypic (including health) associations?
2. If one has an unknown molecule, how can one assess which microbes make it, even without a known structure?
3. How can one contribute to the expansion of the knowledgebase on microbial metabolites?
Upcoming webinars schedule: https://dknet.org/about/webinar
Democratising biodiversity and genomics research: open and citizen science to... - GigaScience, BGI Hong Kong
Scott Edmunds at the China National GeneBank Youth Biodiversity MegaData Forum: Democratising biodiversity and genomics research: open and citizen science to build trust and fill the data gaps. 18th December 2018
Environmental Cheminformatics for Unknown ID, UC Davis, Nov 2018 - Emma Schymanski
Environmental Cheminformatics to Identify Unknown Chemicals and their Effects
Assoc. Prof. Dr. Emma L. Schymanski
FNR ATTRACT Fellow and PI: Environmental Cheminformatics, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 avenue du Swing, L-4367 Belvaux, Luxembourg.
The Environmental Cheminformatics group at the Luxembourg Centre for Systems Biomedicine focuses on the comprehensive identification of known and unknown chemicals in our environment to investigate their effects on health and disease. The environment and the chemicals to which we are exposed are incredibly complex, with over 125 million chemicals registered in the largest chemical registry and over 70,000 in household use alone. Detectable molecules in complex samples can now be captured using high resolution mass spectrometry (HRMS), which provides a “snapshot” of all chemicals present in a sample and allows for retrospective data analysis through digital archiving. However, scientists cannot yet identify the vast majority of the tens of thousands of features in each sample, leading to critical bottlenecks in identification and data interpretation. For instance, recent studies indicate a strong connection between the gut microbiome and Parkinson’s disease, yet over 60 % of significant metabolites in microbiome experiments are unknown. Unknown identification remains extremely time consuming and, in many cases, a matter of luck. Prioritizing efforts to find significant metabolites or potentially toxic substances responsible for observed effects is key; this involves reconciling highly complex samples with expert knowledge and careful validation. This talk will cover European, US, and worldwide community initiatives to help connect knowledge on chemistry and toxicity with environmental observations - from compound databases to spectral libraries and retrospective screening. It will touch on the challenges of standardized structure representations, data curation, deposition, and communication between resources. Finally, it will show how interdisciplinary efforts and data sharing can facilitate research in metabolomics, exposomics, and beyond.
NOTE: some slides causing errors have been removed but can be accessed through the tinyurl on the front page.
Active hyperlinks can be retrieved using the tinyurl on the front page. Please cite this work if you use any of the contents.
Open Source Collaboration in Drug Discovery in Pharma - Kees van Bochove
How pre-competitive collaboration in the pharmaceutical sector through open source platforms enables joint innovation of academics, pharma, SMEs and non-profits.
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink of how data can be made machine- and AI-ready - the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple but effective semantic and latent representations, and to make these available through standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and that of others in the field, creates a baseline for building trustworthy and easy-to-deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
Knowledge graphs are an emerging paradigm to represent information, yet their discovery and reuse are hampered by insufficient or inadequate metadata. Here, the COST Action on Distributed Knowledge Graphs held a first workshop to develop a KG metadata schema. In this presentation, the progress and plans are discussed with the W3C Community Group on Knowledge Graph Construction.
Data-Driven Discovery Science with FAIR Knowledge Graphs - Michel Dumontier
Despite the existence of vast amounts of biomedical data, these remain difficult to find and to productively reuse in machine learning and other Artificial Intelligence technologies. In this talk, I will discuss the role of the FAIR Guiding Principles in making biomedical data AI-ready, and how their representation as knowledge graphs not only enables powerful ontology-backed semantic queries, but can also be used to predict missing information and to check the quality of the knowledge collected.
The main idea of the talk is to introduce the FAIR principles (what they are and what they are not) and how their application with Semantic Web technologies (ontologies/linked data) creates improved possibilities for large-scale data integration, answering sophisticated questions using automated reasoners, and predicting new relations and validating data using graph embeddings. The audience will gain insight into the state of the art in a carefully presented manner that introduces principles, approaches, and outcomes relevant to Health AI.
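To make the graph-embedding point concrete, here is a toy TransE-style sketch (a stand-in for the family of methods implied, not the speaker's actual pipeline): a triple (h, r, t) is scored by how closely the head embedding plus the relation embedding lands on the tail embedding, and a margin-based update separates a known triple from a corrupted one. All entities, relations, and triples below are invented.

```python
# A minimal TransE-style sketch on invented toy data; production systems
# train over full knowledge graphs with batched sampling of negatives.
import numpy as np

rng = np.random.default_rng(0)
entities = ["drug:aspirin", "disease:pain", "gene:PTGS2"]
relations = ["treats"]
dim = 16

E = {e: rng.normal(size=dim) for e in entities}   # entity embeddings
R = {r: rng.normal(size=dim) for r in relations}  # relation embeddings

def score(h, r, t):
    """TransE plausibility: higher (less negative) = more plausible."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

def step(pos, neg, lr=0.01, margin=1.0):
    """One margin-based update: pull the true triple together and push
    the corrupted triple apart, as in the usual TransE training loop."""
    if score(*pos) >= score(*neg) + margin:
        return  # margin already satisfied, nothing to do
    h, r, t = pos
    g_pos = E[h] + R[r] - E[t]
    E[h] -= lr * g_pos; E[t] += lr * g_pos; R[r] -= lr * g_pos
    h2, r2, t2 = neg
    g_neg = E[h2] + R[r2] - E[t2]
    E[h2] += lr * g_neg; E[t2] -= lr * g_neg; R[r2] += lr * g_neg

for _ in range(500):
    step(("drug:aspirin", "treats", "disease:pain"),   # known link
         ("drug:aspirin", "treats", "gene:PTGS2"))     # corrupted link

# The known link should now score higher than the corrupted one.
print(score("drug:aspirin", "treats", "disease:pain"),
      score("drug:aspirin", "treats", "gene:PTGS2"))
```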
The FAIR (Findable, Accessible, Interoperable, Reusable) Guiding Principles light a path towards improving the discovery and reuse of digital objects (data, documents, software, web services, etc.) by machines. Machine reusability is a crucial strategic component in building robust digital infrastructure that strengthens scholarship and opens new pathways for innovation on a truly global scale. However, as the FAIR principles do not specify any particular implementation, communities are left with the homework of devising, standardizing, and implementing technical specifications to improve the ‘FAIRness’ of digital assets. In this seminar, I will focus on the history and state of the art in FAIRness assessment, including manual, semi-automated, and fully automated approaches, and how these can be used by developers and consumers alike. This seminar will serve as a springboard for community discussion and adoption of these services to incrementally and realistically improve the FAIRness of their resources.
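As a hedged illustration of what a fully automated FAIRness test can look like mechanically, the sketch below probes whether an identifier resolves and whether content negotiation returns machine-readable metadata; real assessment frameworks run much richer, community-defined tests, and the probed URI here is a placeholder.

```python
# A toy FAIRness probe, illustrating mechanics only; not a real framework.
import requests

def fair_probe(uri):
    results = {}
    # Findable/Accessible: does the identifier resolve over HTTP?
    try:
        head = requests.head(uri, allow_redirects=True, timeout=10)
        results["resolvable"] = head.status_code < 400
    except requests.RequestException:
        results["resolvable"] = False
        return results
    # Interoperable: does content negotiation yield structured metadata?
    for mime in ("application/ld+json", "text/turtle"):
        r = requests.get(uri, headers={"Accept": mime},
                         allow_redirects=True, timeout=10)
        results[mime] = r.ok and mime in r.headers.get("Content-Type", "")
    return results

print(fair_probe("https://example.org/dataset/123"))  # placeholder URI
```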
The Role of the FAIR Guiding Principles for an Effective Learning Health System - Michel Dumontier
The learning health system (LHS) is an integrated social and technological system that embeds continuous improvement and innovation for the effective delivery of healthcare. A crucial part of the LHS lies in how the underlying information system will secure and take advantage of relevant knowledge assets towards supporting complex and unusual clinical decision making, facilitating public health surveillance, and aiding comparative effectiveness research. However, key knowledge assets remain difficult to obtain and reuse, particularly in a decentralized context. In this talk, I will discuss the role of the Findable, Accessible, Interoperable, and Reusable (FAIR) Guiding Principles towards the realization of the LHS, along with emerging technologies to publish and refine clinical research and knowledge derived therein.
Keynote given at the 2021 Knowledge Representation for Health Care workshop: http://banzai-deim.urv.net/events/KR4HC-2021/
CIKM2020 Keynote: Accelerating discovery science with an Internet of FAIR data and services - Michel Dumontier
Biomedicine has always been a fertile and challenging domain for computational discovery science. Indeed, the existence of millions of scientific articles, thousands of databases, and hundreds of ontologies, offer exciting opportunities to reuse our collective knowledge, were we not stymied by incompatible formats, overlapping and incomplete vocabularies, unclear licensing, and heterogeneous access points. In this talk, I will discuss our work to create computational standards, platforms, and methods to wrangle knowledge into simple, but effective representations based on semantic web technologies that are maximally FAIR - Findable, Accessible, Interoperable, and Reusable - and to further use these for biomedical knowledge discovery. But only with additional crucial developments will this emerging Internet of FAIR data and services enable automated scientific discovery on a global scale.
bio:
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research focuses on the development of computational methods for scalable and responsible discovery science. Dr. Dumontier obtained his BSc (Biochemistry) in 1998 from the University of Manitoba, and his PhD (Bioinformatics) in 2005 from the University of Toronto. Previously a faculty member at Carleton University in Ottawa and Stanford University in Palo Alto, Dr. Dumontier founded and directs the interfaculty Institute of Data Science at Maastricht University to develop sociotechnological systems for responsible data science by design. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon 2020, the European Open Science Cloud, the US National Institutes of Health and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
This presentation was given on October 21, 2020 at CIKM2020.
The role of the FAIR Guiding Principles in a Learning Health System - Michel Dumontier
The learning health system (LHS) is a concept for a socio-technological system that continuously improves the delivery of health care by coupling biomedical research with practice- and evidence-based medicine. Key aspects of the LHS are collecting, integrating, and analyzing data from different sources. While the increased digitalisation of healthcare is creating new data sources, these remain hard to find and use, let alone make use of as part of intelligent systems for the benefit of patients, healthcare providers, and researchers. This talk will examine recent developments towards making key parts of the LHS, such as clinical practice guidelines, Findable, Accessible, Interoperable, and Reusable (FAIR).
Accelerating biomedical discovery with an internet of FAIR data and services... - Michel Dumontier
With its focus on improving the health and well being of people, biomedicine has always been a fertile, if not challenging domain for computational discovery science. Indeed, the existence of millions of scientific articles, thousands of databases, and hundreds of ontologies, offer exciting opportunities to reuse our collective knowledge, were we not stymied by incompatible formats, overlapping and incomplete vocabularies, unclear licensing, and heterogeneous access points. In this talk, I will discuss our work to create computational standards, platforms, and methods to wrangle knowledge into simple, but effective representations based on semantic web technologies that are maximally FAIR - Findable, Accessible, Interoperable, and Reusable - and to further use these for biomedical knowledge discovery. But only with additional crucial developments will this emerging Internet of FAIR data and services, which is built on Semantic Web technologies, be well positioned to support automated scientific discovery on a global scale.
Accelerating Biomedical Research with the Emerging Internet of FAIR Data and Services - Michel Dumontier
With its focus on improving the health and well being of people, biomedicine has always been a fertile, if not challenging domain for computational discovery science. Indeed, the existence of millions of scientific articles, thousands of databases, and hundreds of ontologies, offer exciting opportunities to reuse our collective knowledge, were we not stymied by incompatible formats, overlapping and incomplete vocabularies, unclear licensing, and heterogeneous access points. In this talk, I will discuss our work to create computational standards, platforms, and methods to wrangle knowledge into simple, but effective representations based on semantic web technologies that are maximally FAIR - Findable, Accessible, Interoperable, and Reusable - and to further use these for biomedical knowledge discovery. But only with additional crucial developments will this emerging Internet of FAIR data and services enable automated scientific discovery on a global scale.
Are we FAIR yet? And will it be worth it?
The FAIR Principles propose essential characteristics that all digital resources (e.g. datasets, repositories, web services) should possess to be Findable, Accessible, Interoperable, and Reusable by both humans and machines. The Principles act as a guide to what researchers and data stewards should expect from contemporary digital resources and, in turn, to the requirements placed on them when publishing their own scholarly products. As interest in, and support for, the Principles has spread, the diversity of interpretations has also broadened, with some resources claiming to already “be FAIR”.
This talk will elaborate on what FAIR is, what it entails, and how we should evaluate FAIRness. I will describe new social and technological infrastructure to support the creation and evaluation of FAIR resources, and how FAIR fits into institutional, national and international efforts. Finally, I will discuss the merits of the FAIR principles (and what we ask of people) in the context of strengthening data-driven scientific inquiry.
Keynote given at NETTAB2018 - http://www.igst.it/nettab/2018/
The future of science and business - a UM Star Lecture - Michel Dumontier
I discuss how data science is affecting our way of life and how we at Maastricht University are preparing the next generation of leaders to address opportunities and challenges in a responsible manner.
The FAIR Principles propose key characteristics that all digital resources (e.g. datasets, repositories, web services) should possess to be Findable, Accessible, Interoperable, and Reusable by people and machines. The Principles act as a guide to what researchers should expect from contemporary digital resources and, in turn, to the requirements placed on them when publishing their own scholarly products. As interest in, and support for, the Principles has spread, the diversity of interpretations has also broadened, with some resources claiming to already “be FAIR”. This talk will elaborate on what FAIR is, why we need it, what it entails, and how we should evaluate FAIRness. I will describe new social and technological infrastructure to support the creation and evaluation of FAIR resources, and how FAIR fits into institutional, national and international efforts. Finally, I will discuss the merits of the FAIR principles (and what we ask of people) in the context of strengthening data-driven scientific inquiry.
A talk prepared for the workshop "Working on data stewardship? Meet your peers!"
Date: 3 October 2017
https://www.surf.nl/agenda/2017/10/workshop-working-on-data-stewardship-meet-your-peers/index.html
Towards metrics to assess and encourage FAIRness - Michel Dumontier
With increased interest in FAIR metrics, there is a need to develop tools and approaches that can assess the FAIRness of a digital resource. This talk begins to explore some ideas in this space and invites people to participate in a working group focused on the development, application, and evaluation of FAIR metrics.
Ontology has its roots as a field of philosophical study that is focused on the nature of existence. However, today's ontology (aka knowledge graph) can incorporate computable descriptions that bring insight to a wide set of compelling applications, including more precise knowledge capture, semantic data integration, sophisticated query answering, and powerful association mining, thereby delivering key value for health care and the life sciences. In this webinar, I will introduce the idea of computable ontologies and describe how they can be used with automated reasoners to perform classification, to reveal inconsistencies, and to precisely answer questions. Participants will learn about the tools of the trade to design, find, and reuse ontologies. Finally, I will discuss applications of ontologies in the fields of diagnosis and drug discovery.
Bio:
Dr. Michel Dumontier is an Associate Professor of Medicine (Biomedical Informatics) at Stanford University. His research focuses on the development of methods to integrate, mine, and make sense of large, complex, and heterogeneous biological and biomedical data. His current research interests include (1) using genetic, proteomic, and phenotypic data to find new uses for existing drugs, (2) elucidating the mechanism of single and multi-drug side effects, and (3) finding and optimizing combination drug therapies. Dr. Dumontier is the Stanford University Advisory Committee Representative for the World Wide Web Consortium, the co-Chair for the W3C Semantic Web for Health Care and the Life Sciences Interest Group, scientific advisor for the EBI-EMBL Chemistry Services Division, and the Scientific Director for Bio2RDF, an open source project to create Linked Data for the Life Sciences. He is also the founder and Editor-in-Chief of Data Science, a new IOS Press journal featuring open access, open review, and semantic publishing.
What are greenhouse gases and how many gases affect the Earth? - moosaasad1975
What are greenhouse gases, how do they affect the Earth and its environment, and what does the future hold for the environment, the Earth, the weather, and the climate?
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adaptive Optics at Visible Wavelengths - Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... - Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods applied to large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects of interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to elicit complex behavior composed of discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Introduction:
RNA interference (RNAi), or Post-Transcriptional Gene Silencing (PTGS), is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals, and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993, Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22 nt and 61 nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that cause the silencing, through RNA-RNA interactions.
Types of RNAi (non-coding RNA):
miRNA
Length: 23-25 nt
Trans-acting
Binds the target mRNA with mismatches
Causes translational inhibition
siRNA
Length: 21 nt
Cis-acting
Binds the target mRNA with a perfectly complementary sequence
piRNA (Piwi-interacting RNA)
Length: 25-36 nt
Expressed in germ cells
Regulates transposon activity
MECHANISM OF RNAi:
First, the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex that triggers degradation of the target mRNA.
The double-stranded siRNA is unwound by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease that cleaves the target mRNA.
DICER: an endonuclease of the RNase III family.
Argonaute: the central component of the RNA-Induced Silencing Complex (RISC).
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute.
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille) domain: recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis) domain: breaks the phosphodiester bond of the mRNA (RNase H activity).
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Richard's adventures in two entangled wonderlands - Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Multi-source connectivity as the driver of solar wind variability in the heliosphere - Sérgio Sacani
The ambient solar wind that fills the heliosphere originates from multiple sources in the solar corona and is highly structured. It is often described as high-speed, relatively homogeneous plasma streams from coronal holes and slow-speed, highly variable streams whose source regions are under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify solar wind sources and understand what drives the complexity seen in the heliosphere. By combining magnetic field modelling and spectroscopic techniques with high-resolution observations and measurements, we show that the solar wind variability detected in situ by Solar Orbiter in March 2022 is driven by spatio-temporal changes in the magnetic connectivity to multiple sources in the solar atmosphere. The magnetic field footpoints connected to the spacecraft moved from the boundaries of a coronal hole to one active region (12961) and then across to another region (12957). This is reflected in the in situ measurements, which show the transition from fast to highly Alfvénic then to slow solar wind that is disrupted by the arrival of a coronal mass ejection. Our results describe solar wind variability at 0.5 au but are applicable to near-Earth observatories.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... - University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ... - Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4−0.9 µm) and novel JWST images with 14 filters spanning 0.8−5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at >2.3 µm to construct an ultradeep image, reaching as deep as ≈31.4 AB mag in the stack and 30.3−31.0 AB mag (5σ, r = 0.1" circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5−15. These objects show compact half-light radii of R1/2 ∼ 50−200 pc, stellar masses of M⋆ ∼ 10^7−10^8 M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr^-1. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward-modeling approach to infer the properties of the evolving luminosity function, without binning in redshift or luminosity, that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for the evolution of the dark matter halo mass function.
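For orientation on the quantity being constrained: the UV luminosity function is commonly parameterized with a Schechter form. The sketch below is the standard magnitude-space parameterization, given for context only and not necessarily the exact model adopted in this work:

\phi(M)\,\mathrm{d}M = 0.4\ln(10)\,\phi^{*}\,10^{-0.4(M-M^{*})(\alpha+1)}\,\exp\!\left[-10^{-0.4(M-M^{*})}\right]\mathrm{d}M

Here \phi^{*} sets the normalization (the quantity reported to decline by a factor of ∼2.5 from z = 12 to z = 14), M^{*} is the characteristic magnitude, and \alpha is the faint-end slope.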
2016 ACS Semantic Approaches for Biochemical Knowledge Discovery
1. Semantic Approaches for Biochemical Knowledge Discovery
Michel Dumontier, Ph.D., Associate Professor of Medicine (Biomedical Informatics), Stanford University
6. Reusing raw and curated data in thousands of databases is challenging: identifiers, formats, access methods, links
7. Various software tools are needed to analyze data (problems: OS, versioning, input/output formats)
8. Ultimately, scientists develop fairly sophisticated programs/workflows to test hypotheses
9. The absence of intelligent systems requires vast amounts of experience and technical expertise
10. How can we automatically find the evidence that supports or disputes a scientific hypothesis using the latest data, tools, and scientific knowledge?
11. So what do we need to achieve this?
1. Data Science Tools and Methods
– To identify, represent, interlink, integrate, and query data and services
– To identify and uncover support for known or novel associations
2. Community Standards to share and interrogate a massive, decentralized network of interconnected data and software
12. First, we need FAIR data (a metadata sketch follows this list)
Findable
– Globally unique identifiers for datasets and the data they contain
– Rich set of descriptors to search and filter with
– Indexed and searchable
Accessible
– Metadata is eternally available
– Identifiers are used to retrieve representations using standard protocols (e.g. HTTP)
Interoperable
– Data represented with formal knowledge representations
– Includes links to other datasets/vocabularies
Reusable
– Licensing, provenance, community standards
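One way to act on the Findable and Reusable bullets above is to publish machine-readable dataset metadata. A minimal sketch using SPARQL Update with the standard W3C DCAT and Dublin Core vocabularies; all IRIs below are hypothetical placeholders, not real resources.

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dcat:    <http://www.w3.org/ns/dcat#>

INSERT DATA {
  # hypothetical dataset IRI serving as a globally unique identifier
  <http://example.org/dataset/drug-targets> a dcat:Dataset ;
      dcterms:title "Example drug-target associations" ;
      dcterms:license <http://creativecommons.org/licenses/by/4.0/> ;
      dcterms:creator <https://orcid.org/0000-0000-0000-0000> ;   # placeholder ORCID
      dcat:downloadURL <http://example.org/dataset/drug-targets.nt> .
}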
“Numbers have no way of speaking for themselves. We need to imbue them with meaning.” - Nate Silver, The Signal and the Noise
14. The Semantic Web is the new global web of knowledge
Standards for publishing, sharing, and querying facts, expert knowledge, and services: a scalable approach for the discovery of independently formulated and distributed knowledge.
15. Linked Data is FAIR data
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
16. Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF: chemicals/drugs/formulations; genomes/genes/proteins, domains; interactions, complexes & pathways; animal models and phenotypes; diseases, genetic markers, treatments; terminologies & publications.
• 11B+ interlinked statements from 35 biomedical datasets
• Dataset description, provenance & statistics (see the query sketch below)
• A growing interoperable ecosystem with the EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers
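Assuming an endpoint that exposes its dataset descriptions using the W3C VoID vocabulary, a sketch like the following could list datasets by size; the exact predicates available may differ per Bio2RDF release, so treat this as illustrative rather than a guaranteed query.

PREFIX void:    <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?dataset ?title ?triples
WHERE {
  ?dataset a void:Dataset ;
           dcterms:title ?title ;
           void:triples ?triples .   # declared number of RDF statements
}
ORDER BY DESC(?triples)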
18. Bio2RDF shows how datasets are connected together
19. Graph methods for data quality: finding mismatches and discovering new links
W Hu, H Qiu, M Dumontier. Link Analysis of Life Science Linked Data. International Semantic Web Conference (2) 2015: 446-462.
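A simple instance of such a quality check is testing link symmetry: equivalence links should normally be asserted in both directions. The sketch below uses owl:sameAs purely for illustration; the actual cross-reference predicates used in Bio2RDF differ.

PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?a ?b
WHERE {
  ?a owl:sameAs ?b .
  # report pairs whose reciprocal link is missing
  FILTER NOT EXISTS { ?b owl:sameAs ?a }
}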
20. Federated Queries over public SPARQL endpoints
Get all protein catabolic processes (and more specific GO terms) in BioModels:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?reactions)
WHERE {
  # find 'protein catabolic process' and all of its more specific GO terms
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  # count the BioModels biochemical reactions mapped to each GO term
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label
21. EbolaKB: Using Linked Data and Software
Kamdar, Dumontier. An Ebola virus-centered knowledge base. Database. 2015 Jun 8;2015. doi: 10.1093/database/bav049.
28. smartAPI
The goal is to reduce the barrier for the discovery and reuse of web APIs through richer semantic metadata:
i) a coordinated facility for the intelligent annotation of smart APIs;
ii) a web application to discover smart APIs and how they connect to each other;
iii) the augmentation of existing APIs to provide FAIR data.
30. Evan’s Questions
• What should we be doing now?
– Encouraging researchers to publish FAIR data and services
• How should we be doing it?
– As Linked Data
– Institutional repositories, and available in Wikidata and other aggregators
• Where are things going in the future?
– Reproducible analyses over indexed, archived, and massively connected knowledge graphs