Evaluating Hypotheses using SPARQL-DL as an abstract workflow language to choreograph SADI Semantic Web Services

Carole Goble

“Shopping for data
should be as easy
as shopping for
shoes!!”

Mark Wilkinson, Isaac Peral Senior Researcher in Biological Informatics
Centro de Biotecnología y Genómica de Plantas, UPM.

We wanted to duplicate
a real, peer-reviewed, bioinformatics analysis

simply by providing a model describing
what the answer
(if one exists)
would look like...

...the machine had to make
every other decision
on its own

This is the study we chose:

Gordon, P.M.K., Soliman, M.A., Bose, P.,
Trinh, Q., Sensen, C.W., Riabowol, K.

Interspecies data mining to predict novel
ING-protein interactions in human.

BMC Genomics 9, 426 (2008).

Original Study - simplified and abstracted:

Using what is known about interactions in other species,
predict new interactions in your species of interest

Given a protein P in Species X

Find proteins similar to P in Species Y
Retrieve interactors in Species Y
Sequence-compare Y-interactors with Species X genome
(1)  Keep only those with homologue in X

Find proteins similar to P in Species Z
Retrieve interactors in Species Z
Sequence-compare Z-interactors with (1)

 Putative interactors in Species X

For our prototype study, we simplified this further to:

Run the workflow twice (once for each model organism),
then intersect the two result sets
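
The simplified procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the homology and interaction tables are invented toy data, the protein names are placeholders, and the helper `predict_via` is hypothetical (it is not part of the original study or of SADI).

```python
# Toy data: every identifier below is an invented placeholder.
HOMOLOGS = {
    # (protein, target species) -> similar proteins in that species
    ("P", "Y"): {"y1"},
    ("P", "Z"): {"z1"},
    ("y2", "X"): {"A"},
    ("y3", "X"): {"B"},
    ("z2", "X"): {"A"},
    ("z3", "X"): {"C"},
}
INTERACTORS = {
    "y1": {"y2", "y3"},
    "z1": {"z2", "z3"},
}


def predict_via(protein, model_species, species_of_interest):
    """One pass of the workflow through a single model organism."""
    candidates = set()
    for similar in HOMOLOGS.get((protein, model_species), set()):
        for partner in INTERACTORS.get(similar, set()):
            # Keep only interactors that have a homologue in the species of interest.
            candidates |= HOMOLOGS.get((partner, species_of_interest), set())
    return candidates


via_y = predict_via("P", "Y", "X")
via_z = predict_via("P", "Z", "X")
putative = via_y & via_z  # the intersection step
print(sorted(putative))
```

The two calls are identical in the abstract; only the species argument differs, which is exactly where the real workflows diverge.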

The tricky part is...

In the abstract, the two workflows
are identical

but in reality they will be different
because they call for information
from different species

Modeling the result – Step 1

OWL

Web Ontology Language (OWL) is the
language approved by the W3C
for representing knowledge on the Web

Modeling the result – Step 2

ProbableInteractor:
is homologous to (
(protein from ModelOrganism1…) # Potential Interactor in previous slide
and
(protein from ModelOrganism2…) # Potential Interactor in previous slide
)

Probable Interactor is defined in OWL as a subclass of Potential Interactor
that requires homologous pairs of interacting proteins to exist in both
comparator model organisms.

(Effectively, an intersection)

We then publish our OWL model of a Probable Interactor on the Web

In a local data-file we provide the protein we are interested in,
and the two species we wish to use in our comparison

taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

These four lines represent all of the data provided to the
query I am about to show you...

This is the question we ask:

PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {

?protein a i:ProbableInteractor .

}

The reference to our OWL model of the answer

Our system then derives (and executes) the following workflow automatically

These are different
Web services!

...selected at run-time
based on the same model
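
A minimal sketch of run-time selection, assuming a registry keyed by the property a service provides and the taxon it covers. The registry layout, the property name `hasInteractor`, and both service names are hypothetical; only the taxon identifiers come from the input file shown earlier.

```python
# Hypothetical registry: (property provided, taxon of the input) -> service name.
REGISTRY = {
    ("hasInteractor", "taxon:4932"): "yeast-interaction-service",
    ("hasInteractor", "taxon:7227"): "fly-interaction-service",
}

# The abstract plan is the same for both model organisms.
ABSTRACT_PLAN = ["hasInteractor"]


def concretize(plan, taxon):
    """Resolve each abstract step to a concrete service for the given taxon."""
    return [REGISTRY[(prop, taxon)] for prop in plan]


yeast_workflow = concretize(ABSTRACT_PLAN, "taxon:4932")
fly_workflow = concretize(ABSTRACT_PLAN, "taxon:7227")
print(yeast_workflow, fly_workflow)
```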

There are two very cool things about what you just saw...


The system was able to
create a workflow based on
an OWL model


The workflow it created
(i.e. the services chosen)
differed depending on
context

We got the answer

“simply” by designing a model of the answer!

Semantic Automated Discovery and Integration

A Semantic Web-focused Web Services specification
Microsoft Research
http://sadiframework.org

Web Services

vs.

Semantic Web

Web Services
XML + XML Schema

Semantic Web
RDF + OWL

Web Services
POST of SOAP

Semantic Web
GET of RDF

Web Services
No (rigorous) semantics

Semantic Web
Rich, flexible semantics

Web Services
&
Semantic Web

Fundamentally different
Web technologies

A design-practice for
Web Service provision on the
Semantic Web

100% standards-compliant
with no “invented” standards

Lightweight
(only 2 inter-related “rules”)

Rules come from
observations:

SADI Observation #1:

Web Services in Bioinformatics create
implicit biological relationships
between their input and output


[Figure: the SeqRet service, which takes a gene identifier as input and returns its DNA sequence]

SADI Design Practice #1

Make the implicit explicit…

A Web Service should create “triples” linking
the input data to the output data thus
explicitly describing the semantic relationship
between them

SADI Best Practice #1

This is what bioinformatics Web Services
implicitly do anyway!

Easy to implement this as a
best-practice

...and makes SADI Services a source of Linked Data

HTTP GET and POST

GET guarantees
the response relates to the request URI
in a very precise and predictable way

POST does not…

HTTP GET and POST

That’s why Web Services have a fundamentally
different behaviour than the Semantic Web

GET and POST

We can fix that!

(without breaking any existing rules or standards!)


SUBJECT URI of the output graph (triples)

is the same as

SUBJECT URI of the input graph (triples)

(the output is “about” the input... Now explicitly!)
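
A minimal sketch of this rule, assuming triples are modelled as plain Python tuples. The function name `seqret_like_service` and the lookup table are hypothetical stand-ins and do not use the real SADI libraries.

```python
def seqret_like_service(input_triples, sequences):
    """Return new triples whose subject is the same URI as the input subject."""
    output = set()
    for subject, predicate, obj in input_triples:
        if (predicate, obj) == ("rdf:type", "GeneID") and subject in sequences:
            # The new fact is attached to the input node, not to a new node.
            output.add((subject, "hasDNASequence", sequences[subject]))
    return output


TOY_SEQUENCES = {"BRCA1": "AGCTTA..."}  # truncated exactly as on the slide
input_graph = {("BRCA1", "rdf:type", "GeneID")}
output_graph = seqret_like_service(input_graph, TOY_SEQUENCES)
print(output_graph)
```

Because both graphs share the subject `BRCA1`, a client can tell what the output is "about" without any out-of-band knowledge of the service.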


Input graph: BRCA1 rdf:type GeneID

Service: SeqRet

Output graph: BRCA1 rdf:type GeneID
BRCA1 hasDNASequence AGCTTA...

Consequence

Web Services now exhibit a very similar
behavior to the Web itself

POST “behaves like” GET

SADI Interfaces

Service Interfaces can be described by
two OWL classes

(this is 100% compatible with the SAWSDL standard)

SADI Interfaces

OWL Class #1: My Input Class

SADI Interfaces

OWL Class #2: My Output Class

SADI Service Functionality

Consumes OWL Individuals of Class #1

Returns OWL Individuals of Class #2

but the URI of those two individuals is the same; they are
the same individual, just now a member of a new class.

[Figure: the BRCA1 / SeqRet example again]

In practice, of course, you don’t return the input data

Strip it and add the new data provided by the Service

But since the output is still “rooted” in the input node,

Input and output are easily merged client-side

(just concatenate the output with the input)

Service Description
INPUT OWL Class
NamedIndividual: things with
a “name” property
from “foaf” ontology

OUTPUT OWL Class
GreetedIndividual: things with
a “greeting” property
from “hello” ontology

POST http://example.org/myservice

Input:
person:1 foaf:name “Guy Incognito”
person:1 rdf:type hello:NamedIndividual

Output:
person:1 hello:greeting “Hello, Guy Incognito!”
person:1 rdf:type hello:GreetedIndividual

How do we discover services?

Input and output are about the same “thing”

Therefore, to describe what a service does
simply compare (“diff”) the
Input and Output OWL classes

This is not prescriptive! Just how we use it
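
The "diff" can be sketched as a set difference, assuming each OWL class is reduced to the set of properties it requires. That reduction is a deliberate simplification for illustration; real OWL restrictions are richer, and `describe_service` is a hypothetical helper.

```python
# Each class is reduced to the properties its members must have (a simplification).
INPUT_CLASS = {"name": "NamedIndividual", "properties": {"foaf:name"}}
OUTPUT_CLASS = {"name": "GreetedIndividual", "properties": {"hello:greeting"}}


def describe_service(input_class, output_class):
    """Describe a service by what it consumes and what it adds."""
    consumes = set(input_class["properties"])
    provides = set(output_class["properties"]) - consumes
    return {"consumes": consumes, "provides": provides}


description = describe_service(INPUT_CLASS, OUTPUT_CLASS)
print(description)
```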

Service Description

The service provides a “greeting” to a Named Individual
based on its “name”

Service Discovery

Index all of the properties
added by all of the services
under all circumstances

Real-world Example

Input Data: BRCA1 rdf:type Gene ID

Output Data: BRCA1 hasDNASequence AGCTTAGCCA…

Registry Index: Service provides “hasDNASequence” property to Gene IDs

Service Discovery

Simply search for the property of interest
based on the data in-hand

e.g. The question:

“what is the DNA sequence of BRCA1?”

Discover a SADI Web Service that generates the
DNA Sequence property for gene identifiers
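
A minimal sketch of such an index, assuming the registry stores one entry per (input type, provided property) pair. The functions `register` and `discover` are hypothetical and are not the SADI registry API.

```python
from collections import defaultdict

# (input data type, property added) -> names of services that provide it
index = defaultdict(list)


def register(service, input_type, provided_property):
    """Record that a service adds a property to inputs of a given type."""
    index[(input_type, provided_property)].append(service)


def discover(data_type, wanted_property):
    """Find services that can add the wanted property to the data in hand."""
    return list(index.get((data_type, wanted_property), []))


register("SeqRet", "GeneID", "hasDNASequence")

found = discover("GeneID", "hasDNASequence")
not_found = discover("GeneID", "hasGOTerm")
print(found, not_found)
```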

DEMO: Knowledge Explorer Plug-in
For more information about the Knowledge Explorer, see:
http://io-informatics.com

SADI has just filled-in “Encodes” property for the three genes
from the output of discovered Web Service(s)

Discover services that provide the hasGOTerm
property for Protein Sequence datatype

This kind of “Web Service Surfing” is very
intuitive for the Biologist!

No need to describe the algorithm or the database
just describe the properties that will be added

Semantic Health And Research Environment

SHARE answers arbitrary SPARQL queries
by finding and executing SADI Services

Example #1

What is the phenotype of every allele of the
Antirrhinum majus DEFICIENS gene

SELECT ?allele ?image ?desc

WHERE {
locus:DEF genetics:hasVariant ?allele .
?allele info:visualizedByImage ?image .
?image info:hasDescription ?desc
}

Note that there is no “FROM” clause!
We don’t tell it where to get the information;
the machine has to figure that out by itself...

Enter that query into
SHARE

...and in a few seconds you get your answer.

Based on predicates in your query, SHARE utilized SADI
to automatically discover the resources required to answer your question.
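
The idea can be sketched as follows, assuming each predicate in the query is served by one discoverable service. Everything here is a toy: the allele, image, and description values are invented, `call_service` is a stand-in for real discovery and invocation, and the resolver only handles patterns whose object is an unbound variable. It is not how SHARE is actually implemented.

```python
# Invented results, keyed by predicate and then by subject.
FAKE_SERVICES = {
    "genetics:hasVariant": {"locus:DEF": ["allele:a", "allele:b"]},
    "info:visualizedByImage": {"allele:a": ["image:a"], "allele:b": ["image:b"]},
    "info:hasDescription": {
        "image:a": ["description of a"],
        "image:b": ["description of b"],
    },
}


def call_service(predicate, subject):
    """Stand-in for discovering and invoking a service that provides `predicate`."""
    return FAKE_SERVICES.get(predicate, {}).get(subject, [])


def resolve(patterns, bindings=None):
    """Yield variable bindings that satisfy the patterns, evaluated left to right."""
    bindings = bindings or {}
    if not patterns:
        yield dict(bindings)
        return
    (subject, predicate, obj), rest = patterns[0], patterns[1:]
    subject = bindings.get(subject, subject)  # substitute an already-bound variable
    for value in call_service(predicate, subject):
        yield from resolve(rest, {**bindings, obj: value})


QUERY = [
    ("locus:DEF", "genetics:hasVariant", "?allele"),
    ("?allele", "info:visualizedByImage", "?image"),
    ("?image", "info:hasDescription", "?desc"),
]
rows = list(resolve(QUERY))
for row in rows:
    print(row)
```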

Because it is the Semantic Web
The query results are live hyperlinks
to the respective Database or images

Importantly

We posed and answered a
complex database query

WITHOUT A DATABASE

(in fact, the data didn’t even have to exist... as I’ll now show you)

Example #2

Show me the latest Blood Urea Nitrogen and Creatinine levels
of patients who appear to be rejecting their transplants

SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
?patient rdf:type patient:LikelyRejecter .
?patient l:latestBUN ?bun .
?patient l:latestCreatinine ?creat .
}

Likely Rejecter:

A patient who has creatinine levels
that are increasing over time
- Wilkinson “MD”

Likely Rejecter:
Our database contains various
blood chemistry measurements
at various time-points

Likely Rejecter:

…but there is no “likely rejecter”
column or table in our database…

SHARE determines

by itself

the need to do a
Linear Regression analysis over
Creatinine blood chemistry measurements

SHARE determines

by itself

how and where that analysis
can be done

and does it

The SHARE system utilizes Semantics (via SADI) to discover and access
analytical services on the Web that do linear regression analysis

Ontology Spectrum

Catalog/ID
→ Terms/glossary
→ Thesauri (“narrower term” relation)
→ Informal is-a
→ Formal is-a
→ Formal instance
→ Frames (Properties)
→ Value Restrs.
→ Selected Logical Constraints (disjointness, inverse, …)
→ General Logical constraints

Originally from AAAI 1999- Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty;
– updated by McGuinness.
Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html

Ontology Spectrum
[Figure: the Ontology Spectrum, annotated “BETTER!!”]

Basic SADI/SHARE functionality

Ontology Spectrum
[Figure: the Ontology Spectrum, annotated “WHY?” with two answers: “Because I say so!” and “Because it fulfils XYZ”]

Why is this data a member of this OWL Class?
(and therefore valid input to the service)

Discovery systems - flexible

[Figure: the Ontology Spectrum, from Catalog/ID to General Logical constraints]

Categorization Systems – like library shelves, inflexible

In the upper end of the Ontology Spectrum,

if the data has the right properties

It can be discovered to be a valid input to a Service

regardless of how it was originally classified or which

ontology was used for that classification

...and in the context of a SHARE query

those individual properties may have been aggregated

from many different places;

The data becomes a valid input as properties aggregate

In exactly the same way that the OWL property
restrictions of a SADI Input Class tell SHARE what
properties a service requires as input

The property restrictions of an OWL Class in the
SPARQL query tell SHARE what properties it
needs to retrieve to create members of that class
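
A minimal sketch of that parallel, assuming a class is reduced to the set of properties its members must carry. The class name `GeneWithSequence` and both helpers are hypothetical, and real OWL reasoning involves far more than set containment.

```python
# Hypothetical class definition: required properties only (a simplification).
REQUIRED = {"GeneWithSequence": {"rdf:type", "hasDNASequence"}}


def missing_properties(individual_properties, class_name):
    """Properties that still have to be retrieved before membership holds."""
    return REQUIRED[class_name] - set(individual_properties)


def is_member(individual_properties, class_name):
    """An individual is a member once nothing required is missing."""
    return not missing_properties(individual_properties, class_name)


# Properties aggregate from different places over the course of a query.
from_local_file = {"rdf:type"}
from_a_service = {"hasDNASequence"}

before = is_member(from_local_file, "GeneWithSequence")
to_fetch = missing_properties(from_local_file, "GeneWithSequence")
after = is_member(from_local_file | from_a_service, "GeneWithSequence")
print(before, to_fetch, after)
```

The `to_fetch` set plays the role of the shopping list: it names the properties for which services must be discovered.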

Show me the latest Blood Urea Nitrogen and Creatinine levels
of patients who appear to be rejecting their transplants

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
?patient rdf:type patient:LikelyRejecter .
?patient l:latestBUN ?bun .
?patient l:latestCreatinine ?creat .
}

The definition of a Likely Rejecter category is encoded in
a machine-readable document written in the OWL Ontology language

Basically:

“the regression line over creatinine measurements should have an increasing slope”
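
The test itself is ordinary least squares. A minimal sketch, assuming measurements arrive as (time, value) pairs with at least two distinct time points; the numbers are invented and the function names are hypothetical, not those of any service SHARE actually discovers.

```python
def regression_slope(points):
    """Ordinary least-squares slope for (time, value) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in points)
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    return numerator / denominator


def is_likely_rejecter(creatinine_series):
    """Increasing creatinine over time means a positive regression slope."""
    return regression_slope(creatinine_series) > 0


def latest_value(series):
    """The measurement taken at the most recent time point."""
    return max(series, key=lambda point: point[0])[1]


rising = [(0, 1.0), (1, 1.2), (2, 1.5), (3, 1.9)]   # invented values
falling = [(0, 1.9), (1, 1.5), (2, 1.2), (3, 1.0)]  # invented values
print(is_likely_rejecter(rising), is_likely_rejecter(falling), latest_value(rising))
```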

SHARE burrows down through the ontological definition
to learn about what the properties of “regression models” are

SHARE utilizes SADI to discover analytical services on the Web
that do linear regression analysis then other services that can
determine the “latest” measurements from a time-series

Let’s go through that again from a different perspective

OWL Classes that include property restrictions can be
“executed” as if they were workflows

QUERY:

SELECT images of
mutations from genes in
organism XXX that
share homology to this
gene in organism YYY

Concept:

“Homologous Mutant
Image”

As OWL Axioms
HomologousMutantImage
is owl:equivalentTo {

Gene Q hasImage image P

Gene Q hasSequence Sequence Q

Gene R hasSequence Sequence R

Sequence Q similarTo Sequence R

Gene R = “my gene of interest” }

Those axioms
combine to
create an
OWL Class:
Homologous
Mutant Image

QUERY:
Retrieve
owl:Homologous
Mutant Image for
gene XXX

SHARE:
Decomposes the
owl:Homologous Mutant Image
Class, discovers SADI services
relevant to that class, and
pipelines them together into a
workflow
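
One way to picture the pipelining step is greedy forward chaining over the properties each service needs and adds. This is a sketch under assumptions, not SHARE's actual planner: the service names and the starting property `geneId` are hypothetical, and a greedy pass may include services that a smarter planner would skip.

```python
SERVICES = [
    {"name": "sequence-lookup", "needs": {"geneId"}, "adds": {"hasSequence"}},
    {"name": "similarity-search", "needs": {"hasSequence"}, "adds": {"similarTo"}},
    {"name": "image-lookup", "needs": {"similarTo"}, "adds": {"hasImage"}},
]


def plan(known, goal, services):
    """Chain services until every goal property is available, or give up."""
    known, workflow = set(known), []
    progress = True
    while not goal <= known and progress:
        progress = False
        for service in services:
            if goal <= known:
                break
            usable = service["needs"] <= known
            useful = not service["adds"] <= known
            if usable and useful:
                workflow.append(service["name"])
                known |= service["adds"]
                progress = True
    return workflow if goal <= known else None


workflow = plan({"geneId"}, {"hasImage"}, SERVICES)
shuffled = plan({"geneId"}, {"hasImage"}, list(reversed(SERVICES)))
impossible = plan({"unrelated"}, {"hasImage"}, SERVICES)
print(workflow, impossible)
```

The order in which services are registered does not matter; the dependencies between properties determine the order of the pipeline.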

“The user experiments showed that
workflow re-use… is difficult for bioinformaticians.”

- Goderis, A. (2008) Ph.D. Thesis

Under slightly different circumstances
(e.g. studying the same phenomenon in a
different organism)

this workflow

WILL NOT SOLVE THE PROBLEM

It must be edited on a case-by-case basis

This editing turns out to be extremely
difficult, and can essentially only be done manually

This great idea would be even better
if workflows were a little bit easier to re-purpose!

With SHARE, a workflow is generated
dynamically, based on all of the information
presented in the query

e.g. to create a new workflow simply specify
a different organism ID in the query!

Moreover, the workflow plan CHANGES
dynamically based on service outputs!

With SHARE, the ontology is the workflow
(not the same as “an ontology that describes workflows”)

The ontology acts as an abstraction of a
workflow, which is concretized at run-time
based on circumstances of the query

Works (best) with ontologies in the “Frames+” part of
the spectrum, if there are SADI services available
[Figure: the Ontology Spectrum, from Catalog/ID to General Logical constraints]

As far as we are aware, SHARE is the only system that
exhibits this particular behaviour

...and IMO this is a pretty big deal...

That experiment can now be represented ONCE
as an OWL Class

It becomes concretized automatically

for each individual researcher

as a distinct workflow given any starting protein
and any combination of comparator species

This is far beyond simply changing the parameters entered into a workflow...

Moreover, the idea of automated workflow “individuality” is quite interesting to us...

This is an early prototype of a

Patient-driven Personalized Medicine

Web interface

Matching based on official
name, compound name,
brand name, trade name, or
“common name” 

Still needs some work...

??!?!?

Why the alert?

Link out to PubMed

The SADI+SHARE workflow and reasoning was
personalized to YOUR medical data

In future iterations, we will enable the workflow
to be further customized through “personalized”
OWL Classes (e.g. provided by your Clinician!!)

These OWL Classes might include information about the
current trajectory of your treatment for a chronic disease,
for example, such that what you read on the Web is
placed in the context of your expert Clinical care...

Frankly, I think it’s quite cool that “people”
are creating and running personalized
workflows at the touch of a button...

...as many of you know, that has been my
dream since I started studying this
problem a decade ago!

An experiment... based on a hypothesis

now modeled in OWL

Does this OWL Class represent a Hypothesis?

I think it does!

I believe that we will soon show

using SADI + SHARE

that we can model a non-trivial
hypothetical biological scenario

then evaluate if that hypothesis is supported or not
based on whether the automatically-synthesized workflow
returns any individuals
that conform to the model.
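
Stated as code, the evaluation criterion is very small. A minimal sketch, assuming the synthesized workflow is exposed as a callable that returns the individuals conforming to the model; `evaluate_hypothesis` and the example identifier are hypothetical.

```python
def evaluate_hypothesis(run_workflow):
    """A hypothesis counts as supported if the workflow returns any individuals."""
    individuals = list(run_workflow())
    return {"supported": len(individuals) > 0, "individuals": individuals}


supported = evaluate_hypothesis(lambda: ["protein:example"])  # invented identifier
unsupported = evaluate_hypothesis(lambda: [])
print(supported["supported"], unsupported["supported"])
```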

Ontology = Hypothesis = Query = Workflow [= Materials and Methods]

Most of your publication is done!

All you need to do now is interpret the results!

These can be automatically derived through
provenance information during workflow execution

Please join us!

SADI and SHARE are Open-Source projects

http://sadiframework.org

University of British Columbia

Luke McCarthy – Lead Dev. Everything...
Edward Kawas – SADI Service auto-generator

Benjamin VanderValk Ian Wood
SHARE & SADI & Experimental modeling & Experimental modeling project
myHeath Button

Soroush Samadian
Cardiovascular data modeling and queries

C-BRASS Collaborators at other sites

U of New Brunswick Carleton University

Dr. Chris Baker Dr. Michel Dumontier
Alexandre Riazanov Marc-Alexandre Nolin
Leonid Chepelev
Steve Etlinger
Nichaella Kieth
Jose Cruz
