2010 CASCON - Towards an integrated network of data and services for the life sciences

Towards an integrated network of data
and services for the life sciences
Michel Dumontier, Ph.D.
Associate Professor of Bioinformatics
Carleton University
Department of Biology
School of Computer Science
Institute of Biochemistry
Ottawa Institute of Systems Biology
Ottawa-Carleton Institute of Biomedical Engineering

Finding the right information to answer a question is hard
and sometimes requires a sophisticated workflow

What if we could answer a question
by automatically building a knowledge base
using both data and services?

The Semantic Web is a web of knowledge.
It is about standards for publishing, sharing and querying
knowledge drawn from diverse sources
It enables the answering of
sophisticated questions

Is caffeine a drug-like molecule?

To answer this question we need to:
• know what a ‘drug-like molecule’ really
means
• know caffeine’s molecular structure
• use the structural information to
compute the attributes
• determine whether caffeine
satisfies the requirements of being
‘drug-like’

Lipinski Rule of Five
• Rule of thumb for druglikeness (orally active in humans)
(4 rules with multiples of 5)
– mass of less than 500 Daltons
– fewer than 5 hydrogen bond donors
– fewer than 10 hydrogen bond acceptors
– a partition coefficient (log P) value between -5 and 5
We need a more formal (machine understandable) description
of a ‘drug-like molecule’ which specifies values for chemical
descriptors
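
As a minimal sketch of what a machine-checkable version of these rules could look like (in Python rather than OWL), the function below applies the four thresholds exactly as worded on this slide. The class and function names are hypothetical. The caffeine values are illustrative: the mass and donor count are standard figures, the acceptor count assumes every nitrogen and oxygen atom is counted, and the log P is the ALogP value returned by the service later in this deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Chemical descriptors needed by the rule of five (hypothetical names)."""
    mass_daltons: float
    h_bond_donors: int
    h_bond_acceptors: int
    log_p: float


def is_drug_like(d: Descriptors) -> bool:
    """Apply the four thresholds as worded on the slide."""
    return (
        d.mass_daltons < 500
        and d.h_bond_donors < 5
        and d.h_bond_acceptors < 10
        and -5 < d.log_p < 5
    )


# Illustrative descriptor values for caffeine (assumptions, see the text above).
caffeine = Descriptors(mass_daltons=194.19, h_bond_donors=0,
                       h_bond_acceptors=6, log_p=-0.4311)

print(is_drug_like(caffeine))
```

In the approach described in this deck the same test is expressed as OWL class restrictions, so that a reasoner rather than hand-written code performs the classification.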

Ontology as a strategy to formally represent knowledge

The Web Ontology Language (OWL) has explicit semantics
It can therefore be used to capture knowledge in a
machine-understandable way

The Chemical Information Ontology
(CHEMINF)
• 100+ chemical descriptors
• 50+ chemical qualities
• Relates descriptors to their
specifications, the software that
generated them (along with the running
parameters), and the algorithms that they
implement
• Contributors: Nico Adams, Leonid Chepelev,
Michel Dumontier, Janna Hastings, Egon
Willighagen, Peter Murray-Rust, Christoph
Steinbeck
http://semanticchemistry.googlecode.com

Molecular structure can be represented using a
SMILES string, which is a common representation
of the chemical graph
[Figure: ball-and-stick model of caffeine]
SMILES string for caffeine:
Cn1cnc2n(C)c(=O)n(C)c(=O)c12
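
To illustrate how much of the chemical graph can be read straight off the string, here is a small sketch that counts heavy atoms and rings in the caffeine SMILES. It is not a SMILES parser: it assumes only unbracketed atoms from the organic subset and single-digit ring closures, which happens to be enough for this example. A real service would use a toolkit such as the CDK.

```python
import re
from collections import Counter

# Atoms of the SMILES "organic subset" written without brackets.
# Two-letter symbols come first so "Cl" is not read as "C" followed by "l".
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

CAFFEINE = "Cn1cnc2n(C)c(=O)n(C)c(=O)c12"


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms; aromatic (lowercase) symbols are capitalized."""
    return Counter(symbol.capitalize() for symbol in ATOM.findall(smiles))


def ring_count(smiles: str) -> int:
    """Each ring is opened and closed by one digit, so rings = digits / 2."""
    return sum(ch.isdigit() for ch in smiles) // 2


print(dict(heavy_atom_counts(CAFFEINE)))
print(ring_count(CAFFEINE))
```

The counts agree with caffeine's molecular formula, C8H10N4O2, and with its two fused rings.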

Lipinski Rule of Five
• Empirically derived ruleset for druglikeness
(4 rules with multiples of 5)
– mass of less than 500 Daltons
– fewer than 5 hydrogen bond donors
– fewer than 10 hydrogen bond acceptors
– a partition coefficient (log P) value between -5 and 5
• A formal description using OWL (figure not reproduced)

What we then need are services that will consume SMILES
strings and annotate the molecule with the required chemical
descriptors.
Then we can reason about whether it satisfies the
drug-likeness definition.

Semantic Automated
Discovery and Integration
http://sadiframework.org
Mark Wilkinson, UBC
Michel Dumontier, Carleton University
Christopher Baker, UNB
SADI is a framework to create Semantic Web services using OWL
classes as service inputs and outputs

SADI
• OWL classes in SADI are local to individual
services
– They should uniquely specify the service inputs and
outputs (with exactly the right restrictions)
– one service’s world-view can conflict with another’s,
but a client can use any or all
• maximize interoperability by reusing types
and relations

Semanticscience Integrated Ontology
(SIO)
• OWL2 ontology
• 800 classes covering basic types (physical, processual,
informational) with an emphasis on biological entities
• 129 basic relations (mereological, participatory,
attribute/quality, spatial, temporal and representational)
• axioms can be used by reasoners to generate inferences
for consistency checking, classification and answering
questions about life science knowledge
• embodies emerging ontology design patterns
• dereferenceable URIs
• searchable in the NCBO BioPortal
http://semanticscience.org/ontology/sio.owl
CASCON: Nov 3, 2010

Create code stubs using the ontology
• Publish the ontology to a web-accessible location
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl
• Make sure that the class names are resolvable
(easy when using the hash notation)
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#smiles-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#logp-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#hbdc-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#hdba-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#lipinksi-druglike-molecule
• Download/checkout the code
http://sadiframework.org
• Run the code generator
– specify the URIs that correspond to input and output types
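
A short sketch of why hash notation makes class names easy to resolve: an HTTP client drops the fragment before making the request, so every class URI listed above resolves to the single published ontology file. The example uses only the standard library and one of the URIs from this slide.

```python
from urllib.parse import urldefrag

class_uri = ("http://semanticscience.org/sadi/ontology/"
             "lipinskiserviceontology.owl#smiles-molecule")

# The fragment is never sent to the server, so the client fetches the
# ontology document and then looks up the class inside it.
document_url, local_name = urldefrag(class_uri)

print(document_url)
print(local_name)
```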

Implement the functionality
• Java version
– Uses Jena to manipulate the RDF graph
– Uses Maven to build from the command line or Eclipse; invokes Jetty for
service testing
• Chemistry
– We used the Chemistry Development Kit (CDK) to implement 4
services

Responds to a GET operation by providing
the service description in RDF, which
conforms to Feta (BioMoby, myGrid)
curl http://cbrass.biordf.net/logpdc/logpc
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:j.0="http://www.mygrid.org.uk/mygrid-moby-service#" >
<rdf:Description rdf:about="">
<j.0:hasServiceDescriptionText>no description</j.0:hasServiceDescriptionText>
<j.0:hasServiceNameText rdf:datatype="http://www.w3.org/2001/XMLSchema#string">logpc</j.0:hasServiceNameText>
<j.0:hasOperation rdf:resource="#operation"/>
<rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#serviceDescription"/>
</rdf:Description>
<rdf:Description rdf:about="#input">
<j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#smilesmolecule"/>
<rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#parameter"/>
</rdf:Description>
<rdf:Description rdf:about="#operation">
<j.0:outputParameter rdf:resource="#output"/>
<j.0:inputParameter rdf:resource="#input"/>
<rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#operation"/>
</rdf:Description>
<rdf:Description rdf:about="#output">
<j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#alogpsmilesmolecule"/>
<rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#parameter"/>
</rdf:Description>
</rdf:RDF>
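
The sketch below shows how a client might read the input and output classes out of such a description. It embeds a trimmed copy of the document above and uses plain XML parsing from the standard library, which is an assumption that only holds for this particular RDF/XML layout. A real client would use an RDF library such as Jena. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MYGRID = "http://www.mygrid.org.uk/mygrid-moby-service#"
SO = "http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#"

# Trimmed copy of the service description (only the two parameters).
DESCRIPTION = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:j.0="{MYGRID}">
  <rdf:Description rdf:about="#input">
    <j.0:objectType rdf:resource="{SO}smilesmolecule"/>
  </rdf:Description>
  <rdf:Description rdf:about="#output">
    <j.0:objectType rdf:resource="{SO}alogpsmilesmolecule"/>
  </rdf:Description>
</rdf:RDF>"""


def parameter_types(rdf_xml: str) -> dict:
    """Map each described node (e.g. '#input') to its objectType URI."""
    root = ET.fromstring(rdf_xml)
    types = {}
    for node in root.findall(f"{{{RDF}}}Description"):
        object_type = node.find(f"{{{MYGRID}}}objectType")
        if object_type is not None:
            about = node.get(f"{{{RDF}}}about")
            types[about] = object_type.get(f"{{{RDF}}}resource")
    return types


for parameter, class_uri in parameter_types(DESCRIPTION).items():
    print(parameter, class_uri)
```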

Responds to a POST containing service
input with a service output in RDF
Input (caffeine.rdf):
<rdf:RDF xmlns="http://semanticscience.org/sadi/ontology/caffeine.rdf#"
xmlns:so="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:sio="http://semanticscience.org/resource/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
<so:smilesmolecule rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#m">
<sio:SIO_000008 rdf:resource="http://semanticscience.org/sadi/ontology/caffeine.rdf#msmiles"/>
</so:smilesmolecule>
<sio:CHEMINF_000018 rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#msmiles">
<sio:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Cn1cnc2n(C)c(=O)n(C)c(=O)c12</sio:SIO_000300>
</sio:CHEMINF_000018>
</rdf:RDF>
Request:
curl --data @caffeine.rdf http://cbrass.biordf.net/logpdc/logpc
Output (excerpt):
<rdf:Description rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#mdalogp">
<rdf:type rdf:resource="http://semanticscience.org/resource/CHEMINF_000251"/>
<j.0:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#double">-0.4311000000000006</j.0:SIO_000300>
</rdf:Description>
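
The same POST can be prepared from Python with the standard library. This is a sketch only: it builds the request without sending it, the `Content-Type` value is an assumption about what the service accepts, and the helper name and placeholder body are hypothetical.

```python
import urllib.request

SERVICE_URL = "http://cbrass.biordf.net/logpdc/logpc"  # endpoint from the slide


def build_request(rdf_xml: bytes) -> urllib.request.Request:
    """Prepare (but do not send) a POST carrying the RDF/XML input."""
    return urllib.request.Request(
        SERVICE_URL,
        data=rdf_xml,  # supplying a body makes this a POST
        # Assumption: the service accepts this media type.
        headers={"Content-Type": "application/rdf+xml"},
    )


# Placeholder body; a real call would pass the contents of caffeine.rdf.
request = build_request(b"<rdf:RDF/>")
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`, keeping in mind that the endpoint shown in this 2010 deck may no longer be online.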

Semantic Health and Research Environment
SHARE is an application that executes (SPARQL) queries as workflows
over SADI services

“Reckoning”
dynamic discovery of instances of OWL classes
through synthesis and invocation of a Web Service
workflow capable of generating data described by
the OWL class restrictions, followed by reasoning
to classify the data into that ontology
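
The workflow-synthesis step can be pictured with a toy planner. This sketch is not how SHARE is implemented: the registry is hypothetical (only the `logpc` service name appears in this deck), and a simple lookup stands in for the OWL reasoning that SHARE uses to match class restrictions against service descriptions.

```python
# Hypothetical registry: service name -> (input class, descriptor it attaches).
REGISTRY = {
    "logpc": ("smilesmolecule", "logP"),
    "hbdc": ("smilesmolecule", "h_bond_donors"),
    "hbac": ("smilesmolecule", "h_bond_acceptors"),
    "massc": ("smilesmolecule", "mass"),
}


def plan_workflow(input_class: str, required: set) -> list:
    """Pick services that accept the input class and supply a needed descriptor."""
    plan = []
    missing = set(required)
    for name, (accepts, provides) in sorted(REGISTRY.items()):
        if accepts == input_class and provides in missing:
            plan.append(name)
            missing.discard(provides)
    if missing:
        raise LookupError(f"no service provides: {sorted(missing)}")
    return plan


# Descriptors that a drug-likeness class restriction would ask for.
needed = {"mass", "h_bond_donors", "h_bond_acceptors", "logP"}
print(plan_workflow("smilesmolecule", needed))
```

Once the planned services have been invoked and their outputs merged into one graph, a reasoner can classify the molecule against the target class.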

Bio2RDF provides ChEBI in RDF 

Bio2RDF is now serving over
40 billion triples of linked biological data

Bio2RDF covers the major biological
databases

Bio2RDF is part of a growing web of linked data
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

something you can look up or
search for with rich descriptions

SPARQL is the new cool kid on the query block
SQL vs. SPARQL

Query:

Benefits
• Data remains distributed – as the internet was
meant to be!
• Data is not “exposed” as a SPARQL endpoint
– greater provider-control over computational
resources
• Service invocation is straightforward, and
matchmaking is done by reasoning about ontology-based
input/output descriptions

Summary
• Semantic Web technologies offer tantalizing
new opportunities to publish, share and query
data and services
• Bio2RDF provides linked life science data
• SADI provides a framework for creating
Semantic Web services
• SHARE allows us to simultaneously query and
reason about data and services represented
using RDF/OWL

Acknowledgements
This research is supported by The Heart + Stroke Foundation of BC and Yukon, Microsoft Research,
The Canadian Institutes of Health Research, The Natural Sciences and Engineering Research Council of Canada and CANARIE.
Marc-Alexandre Nolin & Francois Belleau (Bio2RDF)
Leo Chepelev (implementing the services)
Luke McCarthy (SADI technical support)
Mark Wilkinson (vision and leadership)
Chris Baker (lipidomics)
CHEMINF Group
Leo Chepelev
Janna Hastings
Egon Willighagen
Nico Adams

dumontierlab.com
michel_dumontier@carleton.ca
