This document discusses using graph analytics and linked open data to study pharmacology. It describes challenges in querying across heterogeneous life sciences linked open data sources and proposes a pattern-based approach to rewrite queries. An application called PhLeGrA is presented that generates a k-partite network by retrieving entities and relations from multiple sources to study drug mechanisms. Preliminary results using network-based algorithms to rank mechanisms are also discussed.
Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data
1. Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data
26th World Wide Web Conference (WWW)
Perth, 4th – 8th April 2017
MAULIK R. KAMDAR AND MARK A. MUSEN
Stanford Center for Biomedical Informatics Research
maulikrk@stanford.edu
5. Semantic Web: Publishing Data as a Graph
Gleevec (Mol. Wt.: 589.25 g/mol, Half-Life: 18 hours) inhibits PDGFR, involved in signal transduction.
[Figure: the statement above drawn as a Resource Description Framework (RDF) graph. Nodes: Gleevec (KEGG: D01441), Gleevec (DrugBank: DB00619), PDGFR, Inhibits, GO:0007165 (Signal Transduction), 589.25, “18 hours”. Edge labels: mol_weight, half-life, x-ref, target, name, type, process.]
Uniform Resource Identifiers: http://bio2rdf.org/kegg:D01441 (KEGG: D01441), http://bio2rdf.org/drugbank:DB00619 (DrugBank: DB00619)
6. Semantic Web: Querying the Graph
What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?
[Figure: the question drawn as a SPARQL Query Language graph pattern. Variables are marked with ?. Labels shown: < 1000, mol_weight, ?half-life, x-ref, Inhibits, ?target, name, type, process, GO:0007165 (Signal Transduction).]
7. Life Sciences Linked Open Data Cloud – query federation
What this talk is about …
• Challenges associated with retrieving information from LSLOD sources
• Pattern-based method to rewrite queries across LSLOD sources
• An application in mechanism-based pharmacovigilance - PhLeGrA
9. Query Federation: Rewriting and executing queries across different sources
What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?
[Figure: QUERY FEDERATION. The source query (Drug: molecular-weight < 1000, target, process = “GO:0007165”, half-life) is split into two sub-queries: one with Drug: molecular-weight < 1000, target, half-life, and one with Drug: molecular-weight < 1000, target, process = “GO:0007165”.]
Schwarte, et al. ISWC 2012
10. Heterogeneity in the LSLOD Cloud
Label Mismatch: Different labels for classes, relations and attributes
[Figure: Gleevec molecular-weight 493.61 in one source versus Gleevec mol_weight 589.25 in the other. The two sources are captioned (clinical features) and (biological features).]
12. Heterogeneity in the LSLOD Cloud
Label Mismatch: Different labels for classes, relations and attributes
Model Mismatch: Different graph patterns to capture granularity
[Figure: Gleevec drug-target PDGFR in one source versus, in the other, an interaction between Gleevec and PDGFR drawn with the labels target, name, type (Inhibits) and source (PubMed: 21152856).]
13. Heterogeneity in the LSLOD Cloud
• Inconsistent Meanings
• Inconsistent URI labels for classes, relations and attributes
• Inconsistent Attribute values for entities
• Inconsistent Graph patterns for SPARQL queries
• Incomplete Relations between entities
14. Query Rewriting fails over the LSLOD Cloud
What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?
Query:
?s a <Drug>
?s <molecular-weight> ?mw
?s <target> ?protein
?s <half-life> ?hl
?mw < 1000 g/mol
?protein <hasGO> <GO:0007165>
Query Rewriting, first sub-query:
?s a <Drug>
{?s <molecular-weight> ?mw}
{?s <half-life> ?hl}
?mw < 1000 g/mol
Query Rewriting, second sub-query:
?s a <Drug>
{?s <target> ?protein}
?protein <hasGO> <GO:0007165>
16. Using Graph Patterns for Query Rewriting
What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?
Mapping Rules for ?Drug hasTarget ?Protein:
?Drug DrugBank:drug-target ?Protein
?Drug KEGG:target ?blank KEGG:link ?Protein
Query:
?s a <Drug>
?s <hasMolWt> ?mw
?s <hasTarget> ?protein
?s <hasHalfLife> ?hl
?mw < 1000 g/mol
?protein <hasGO> <GO:0007165>
Query Rewriting, first sub-query:
?s a <Drug>
{?s <molecular-weight> ?mw}
?s <drug-target> ?protein
{?s <half-life> ?hl}
?mw < 1000 g/mol
Query Rewriting, second sub-query:
?s a <Drug>
?s <mol_wt> ?mw
{?s <target> ?protein_blank
?protein_blank <link> ?protein}
?protein <hasGO> <GO:0007165>
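The mapping rules on this slide can be read as templates that expand one source-independent pattern into the pattern each source actually uses. The sketch below is only an illustration of that idea, not the authors' implementation: the `MAPPING_RULES` table, the `$s`/`$o`/`$blank` placeholders and the `rewrite` function are hypothetical names, and the relation labels are copied from the slide.

```python
# Hypothetical rule table: canonical relation -> per-source triple templates.
# "$s" and "$o" stand for the subject and object of the canonical pattern;
# "$blank" is an intermediate node that only some sources need.
MAPPING_RULES = {
    "hasMolWt": {
        "DrugBank": [("$s", "molecular-weight", "$o")],
        "KEGG": [("$s", "mol_wt", "$o")],
    },
    "hasTarget": {
        "DrugBank": [("$s", "drug-target", "$o")],
        "KEGG": [("$s", "target", "$blank"), ("$blank", "link", "$o")],
    },
    "hasHalfLife": {"DrugBank": [("$s", "half-life", "$o")]},
    "hasGO": {"KEGG": [("$s", "hasGO", "$o")]},
}

QUERY = [
    ("?s", "hasMolWt", "?mw"),
    ("?s", "hasTarget", "?protein"),
    ("?s", "hasHalfLife", "?hl"),
    ("?protein", "hasGO", "GO:0007165"),
]


def rewrite(query, source):
    """Expand each canonical pattern into the patterns used by one source.

    Patterns that the source has no rule for are skipped, so the result is
    the sub-query that this source is able to answer.
    """
    rewritten = []
    for index, (subject, relation, obj) in enumerate(query):
        templates = MAPPING_RULES.get(relation, {}).get(source)
        if templates is None:
            continue
        substitution = {"$s": subject, "$o": obj, "$blank": f"?blank{index}"}
        for template in templates:
            rewritten.append(tuple(substitution.get(t, t) for t in template))
    return rewritten


drugbank_subquery = rewrite(QUERY, "DrugBank")
kegg_subquery = rewrite(QUERY, "KEGG")
```

In this sketch the KEGG sub-query gains an extra variable for the intermediate node, which mirrors the `?protein_blank` step in the second sub-query on the slide.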
18. PhLeGrA – Linked Graph Analytics in Pharmacology
Phlegra is a spider genus of the Salticidae family, commonly termed jumping spiders.
20. Entities and Relations from 4 different sources are retrieved to create the k-partite Network
This k-partite network is generated in < 1 day
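A k-partite network is one whose nodes fall into k groups, with edges only between nodes of different groups. The sketch below shows one way such a structure could be held in memory. It is an assumption-laden toy, not PhLeGrA itself: the three layers (Drug, Protein, Process), the two edges and the `build_network` function are hypothetical, chosen only to reuse the Gleevec example from the earlier slides.

```python
from collections import defaultdict

# Hypothetical layers: every node belongs to exactly one entity type.
LAYER = {
    "Gleevec": "Drug",
    "PDGFR": "Protein",
    "GO:0007165": "Process",
}

# Hypothetical edges, each labelled with the relation that produced it.
EDGES = [
    ("Gleevec", "hasTarget", "PDGFR"),
    ("PDGFR", "process", "GO:0007165"),
]


def build_network(layer, edges):
    """Build an undirected adjacency map, rejecting edges inside one layer."""
    adjacency = defaultdict(set)
    for source, relation, target in edges:
        if layer[source] == layer[target]:
            raise ValueError(f"{relation} links two {layer[source]} nodes")
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)


network = build_network(LAYER, EDGES)
k = len(set(LAYER.values()))  # number of layers in this toy network
```

Rejecting same-layer edges at construction time is a design choice of this sketch; it keeps the k-partite property true by construction rather than checking it afterwards.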
21. Query Federation overcomes heterogeneous Distribution of Entities and Relations
E1: Drug
R1: Drug hasTarget Protein
• Similar and completely unique entities and relations exist between data sources
• Necessary to get the complete picture, but also determine sources of noise
25. The story so far …
Pattern-based federation methods can retrieve data from multiple sources in the Life Sciences Linked Open Data Cloud, and can enable development of advanced methods for mechanism-based pharmacovigilance.
…
Using Semantic Web technologies, data publishers and other researchers can represent data as a graph to create a machine-readable platform called the Linked Open Data (LOD) cloud. This solves integrated data analysis and storage problems, and enables users to query these datasets without being concerned about the underlying formats or representations. The Semantic Web is the idea of a decentralized, distributed and heterogeneous data space that extends the traditional Web.
We will be focusing on the Life Sciences region of the linked open data cloud … consists of DrugBank …
----- Meeting Notes (3/17/17 13:11) -----
tell them what each source means
----- Meeting Notes (3/24/17 13:53) -----
this representation gives the impression everything is linked up ....
To solve the integrative bioinformatics challenges, data publishers have started using Semantic Web Technologies to create the biomedical Semantic Web.
The Semantic Web is the vision of representing and linking data and knowledge on the Web for Web-scale reasoning and inference. Using the Resource Description Framework, we can publish data as a graph. Attributes and relations that are typically stored in fields and tables in relational databases are converted to nodes and edges with explicit semantics.
For example, Gleevec, a drug, has a molecular weight of 589.25 g/mol. Complex relations, such as Gleevec inhibiting PDGFR, which is involved in signal transduction, can be represented using “blank nodes”.
One can link similar entities in different sources using explicitly labeled cross-reference edges. Here we link Gleevec from KEGG, an interaction database, with Gleevec in DrugBank, a drug database.
Hence, using RDF, we can link multiple sources without worrying about the underlying formats and entity notations.
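To make the graph view concrete, the Gleevec example can be written as a list of (subject, predicate, object) triples. This is a minimal sketch in plain Python rather than an RDF library; the identifiers and values come from the slide, while the exact arrangement of edges around the blank node is an assumed reading of the diagram, and the `objects` helper is a hypothetical name.

```python
# Identifiers as shown on the slide.
KEGG_GLEEVEC = "http://bio2rdf.org/kegg:D01441"
DRUGBANK_GLEEVEC = "http://bio2rdf.org/drugbank:DB00619"

# A blank node has no global identifier; it only groups related facts.
INTERACTION = "_:interaction1"

# Each fact is one (subject, predicate, object) triple.
# Assumption: the edge layout below is one plausible reading of the figure.
TRIPLES = [
    (KEGG_GLEEVEC, "mol_weight", 589.25),
    (KEGG_GLEEVEC, "target", INTERACTION),
    (INTERACTION, "type", "Inhibits"),
    (INTERACTION, "name", "PDGFR"),
    ("PDGFR", "process", "GO:0007165"),
    (KEGG_GLEEVEC, "x-ref", DRUGBANK_GLEEVEC),
    (DRUGBANK_GLEEVEC, "half-life", "18 hours"),
]


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


# Follow the cross-reference edge from KEGG to DrugBank, then read the half-life.
drugbank_id = objects(KEGG_GLEEVEC, "x-ref")[0]
half_life = objects(drugbank_id, "half-life")
```

The cross-reference triple is what lets a fact stored under the DrugBank identifier be reached from the KEGG identifier.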
----- Meeting Notes (3/24/17 13:53) -----
use logo here
Once these graphs are published, they can be queried using SPARQL, the graph query language that is another Semantic Web technology.
By using appropriate variables, we can retrieve the set of drugs from KEGG that have a molecular weight < 1000 g/mol and inhibit proteins involved in signal transduction.
We can also query multiple sources on the Semantic Web. For example, we can use the explicit cross-reference edges to navigate to DrugBank and get the half-lives of these drugs.
Hence SPARQL facilitates integrated querying of these sources and reconciles similar entities.
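The way variables turn a graph into a query can be illustrated with a very small pattern matcher. This is not SPARQL and not how a real engine works; it is a sketch of the matching idea under stated assumptions. The `unify` and `match` functions are hypothetical names, the edge layout is an assumed reading of the slide, and the `ex:` drug is invented purely to show the weight filter excluding something.

```python
TRIPLES = [
    ("kegg:D01441", "mol_weight", 589.25),
    ("kegg:D01441", "target", "_:b1"),
    ("_:b1", "type", "Inhibits"),
    ("_:b1", "name", "PDGFR"),
    ("PDGFR", "process", "GO:0007165"),
    ("kegg:D01441", "x-ref", "drugbank:DB00619"),
    ("drugbank:DB00619", "half-life", "18 hours"),
    # Hypothetical heavy drug, invented only to exercise the filter.
    ("ex:HeavyDrug", "mol_weight", 1500.0),
    ("ex:HeavyDrug", "target", "_:b2"),
    ("_:b2", "type", "Inhibits"),
    ("_:b2", "name", "PDGFR"),
    ("ex:HeavyDrug", "x-ref", "ex:HeavyDrugRecord"),
    ("ex:HeavyDrugRecord", "half-life", "2 hours"),
]


def unify(pattern, triple, binding):
    """Return an extended copy of binding if pattern matches triple, else None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for triple in triples:
        extended = unify(patterns[0], triple, binding)
        if extended is not None:
            yield from match(patterns[1:], triples, extended)


QUERY = [
    ("?drug", "mol_weight", "?mw"),
    ("?drug", "target", "?interaction"),
    ("?interaction", "type", "Inhibits"),
    ("?interaction", "name", "?protein"),
    ("?protein", "process", "GO:0007165"),
    ("?drug", "x-ref", "?record"),
    ("?record", "half-life", "?half_life"),
]

half_lives = [b["?half_life"] for b in match(QUERY, TRIPLES) if b["?mw"] < 1000]
```

The final list comprehension plays the role of the `< 1000` filter in the question; everything else is pure pattern matching.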
----- Meeting Notes (3/24/17 13:53) -----
half-lives ....
symbol for kegg and drugbank
----- Meeting Notes (3/24/17 13:53) -----
larger font size for bullets
Federated querying is a method that decomposes a source query and rewrites it into multiple sub-queries that can be executed separately over the different graphs.
Federated querying is required because some relations or attributes may be unique to a given data source: e.g., the half-life of a drug can only be obtained from DrugBank, whereas the processes in which a protein target is involved can only be obtained from KEGG.
Moreover, some relations can be obtained from multiple sources, e.g., molecular weights and protein targets.
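The decomposition can be sketched with two in-memory sources, each answering only the part of the question it holds data for. This is an illustrative toy, not a federation engine: the function names are hypothetical, the target relation is simplified to a direct edge without a blank node, and the short `kegg:`/`drugbank:` identifiers stand in for the full URIs.

```python
# Simplified KEGG source: weights, targets, processes and cross-references.
KEGG = [
    ("kegg:D01441", "mol_weight", 589.25),
    ("kegg:D01441", "target", "PDGFR"),
    ("PDGFR", "process", "GO:0007165"),
    ("kegg:D01441", "x-ref", "drugbank:DB00619"),
]

# Simplified DrugBank source: the only place half-lives are found.
DRUGBANK = [
    ("drugbank:DB00619", "half-life", "18 hours"),
]


def kegg_subquery(max_weight, process):
    """Sub-query 1: cross-references of light drugs that target the process."""
    light = {s for s, p, o in KEGG if p == "mol_weight" and o < max_weight}
    proteins = {s for s, p, o in KEGG if p == "process" and o == process}
    hits = {s for s, p, o in KEGG if p == "target" and s in light and o in proteins}
    return {o for s, p, o in KEGG if p == "x-ref" and s in hits}


def drugbank_subquery(drug_ids):
    """Sub-query 2: half-lives for the drugs found by sub-query 1."""
    return {s: o for s, p, o in DRUGBANK if p == "half-life" and s in drug_ids}


def federated_half_lives(max_weight=1000, process="GO:0007165"):
    """Run both sub-queries and join them on the cross-reference identifiers."""
    return drugbank_subquery(kegg_subquery(max_weight, process))


result = federated_half_lives()
```

Neither source alone can answer the question here: KEGG has no half-lives and DrugBank has no process annotations, so the join on the cross-reference is what produces the answer.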
----- Meeting Notes (3/24/17 13:53) -----
use logos here ...
half-lives
read the query and make it clear no one source has everything
There is a lot of heterogeneity in the biomedical Semantic Web, which stems from the heterogeneity in the schemas used to structure the underlying data and knowledge sources.
For example, the labels for classes, relations and attributes may differ between sources even when they denote the same thing, e.g. the label for molecular weight differs slightly between DrugBank and KEGG.
Moreover, the Semantic Web aims to capture the entire granularity of the underlying source, so some relations may be described in greater detail in one source than in another. For example, in KEGG, you get details on the type of interaction between Gleevec and PDGFR, as well as the source of the interaction in the literature.
This matters because SPARQL querying (or, for that matter, any database querying) requires the labels and patterns in a query to exactly match those in the RDF graphs for the query to retrieve results.
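Both kinds of mismatch can be seen by running one lookup against two toy sources. The sketch below uses the labels and values shown on the slides; the `lookup` helper is a hypothetical name, the choice of which labels belong to which source follows the slides' sub-queries, and the blank-node layout for KEGG is an assumed reading of the figure.

```python
DRUGBANK = [
    ("Gleevec", "molecular-weight", 493.61),
    ("Gleevec", "drug-target", "PDGFR"),
]

KEGG = [
    ("Gleevec", "mol_weight", 589.25),
    ("Gleevec", "target", "_:b1"),
    ("_:b1", "type", "Inhibits"),
    ("_:b1", "name", "PDGFR"),
    ("_:b1", "source", "PubMed:21152856"),
]


def lookup(source, subject, label):
    """Return the objects attached to subject by an edge with exactly this label."""
    return [o for s, p, o in source if s == subject and p == label]


# Label mismatch: the DrugBank label finds nothing in KEGG.
weight_in_drugbank = lookup(DRUGBANK, "Gleevec", "molecular-weight")
weight_in_kegg = lookup(KEGG, "Gleevec", "molecular-weight")

# Model mismatch: one hop reaches the protein in DrugBank, but only an
# intermediate node in KEGG, so a second hop is needed there.
target_in_drugbank = lookup(DRUGBANK, "Gleevec", "drug-target")
first_hop_in_kegg = lookup(KEGG, "Gleevec", "target")
target_in_kegg = [name for node in first_hop_in_kegg for name in lookup(KEGG, node, "name")]
```

The empty result for `weight_in_kegg` is the failure mode described above: the data is present, but the query's label does not match it.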
----- Meeting Notes (3/24/17 13:53) -----
animation does not work ...
----- Meeting Notes (3/24/17 13:53) -----
make the font bigger
two terms that are said to owl:sameAs might not be equal
Cloud should be capitalized
----- Meeting Notes (3/24/17 13:53) ---
there should not be animation
half lives
make the entire figure bigger
----- Meeting Notes (3/24/17 14:06) -----
cloud should be capitalized
----- Meeting Notes (3/24/17 14:06) -----
mapping rules bigger
spend more time talking on it
walk them through it ...
use two arrows ... and only one label in the left ...
----- Meeting Notes (3/24/17 14:06) -----
say that phlegra helps you jump around all sources