The Open PHACTS Discovery Platform aims to provide an integrated information space to advance pharmacological research in the area of drug discovery. Effective drug discovery requires comprehensive data coverage, i.e. integrating all available sources of pharmacology data. While many relevant data sources are available on the linked open data cloud, their content needs to be combined with that of commercial datasets and the licensing of these commercial datasets respected when providing access to the data. Additionally, pharmaceutical companies have built up their own extensive private data collections that they require to be included in their pharmacological dataspace. In this paper we discuss the challenges of incorporating private and commercial data into a linked dataspace: focusing on the modelling of these datasets and their interlinking. We also present the graph-based access control mechanism that ensures commercial and private datasets are only available to authorized users.
http://link.springer.com/chapter/10.1007/978-3-642-41338-4_5
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
Integrating Commercial and Private Data into an Open Linked Data Platform for Drug Discovery
1. Incorporating Commercial and Private
Data into an Open Linked Data Platform
for Drug Discovery
Carole Goble, Alasdair J G Gray, Lee Harland,
Karen Karapetyan, Antonis Loizou, Ivan Mikhailov,
Yrjänä Rankka, Stefan Senger, Valery Tkachenko,
Antony J Williams, and Egon L Willighagen
www.openphacts.org
@open_phacts
A.J.G.Gray@hw.ac.uk
@gray_alasdair
2. Pre-competitive Informatics
Pharmaceutical companies are all accessing, processing,
storing & re-processing external research data
Literature Genbank
Patents PubChem
Data Integration
Databases
Data Analysis
Downloads
x
Repeat @
each
company
Firewalled Databases
Lowering industry firewalls: pre-competitive informatics in drug discovery
25/10/2013 Reviews Drug Discovery (2009) 2013
ISWC 8, 701-708 doi:10.1038/nrd2944
Nature
1
6. Real Business Questions
“Let me compare
Proteins logP and PSA
MW,
for known
oxidoreductase
Genes
inhibitors”
Pathways
Interactions
“What is the
Pharmacological
selectivity profile of
Activities
known p38 inhibitors?”
Transcripts
Clinical Drug
Applications
Biological
Processes
Pathological
Processes
25/10/2013
“Find me compounds
that inhibit targets in
NFkB pathway assayed
in only functional assays
with a potency <1 μM”
Drugs
Indications
Diseases
Compounds
ISWC 2013
5
7. OPS Discovery Platform
Core Platform
Apps
Identity
Resolution
Service
Identifier
Management
Service
“Adenosine
receptor 2a”
Linked Data API (RDF/XML, TTL, JSON)
P12374
EC2.43.4
CS4532
Domain
Specific
Services
Semantic Workflow Engine
Chemistry
Registration
Normalisatio
n & Q/C
Data Cache
(Virtuoso Triple Store)
Indexing
VoID
VoID
VoID
Nanopub
Public
Ontologies
Db
Db
25/10/2013
VoID
Nanopub
Db
Nanopub
Db
ISWC 2013
Public Content
VoID
Commercial
User
Annotations
6
9. Semantic Integration Methodology
1. Define use cases
2. Identify Data
–
–
Create RDF
VoID dataset descriptions
3. Create mappings
–
–
25/10/2013
between data set and known data sets
(instance level)
index for text to URL conversion
ISWC 2013
8
10. Semantic Integration Methodology
4. Ingest RDF into data cache
(i.e. triple store)
5. Define access paths to core concepts in data
6. Extend or create SPARQL queries for API calls
7. Publish API calls
25/10/2013
ISWC 2013
9
11. Commercial Data Use Case
“What is the
selectivity profile of
known p38 inhibitors?”
• Comprehensive data
coverage
– Commercial data
collections
– Extensive private
collections
“There is relevant
data in various
commercial datasets.”
• Control data responses
– Only authorised data
“My company X has its
own private dataset on
this topic.”
25/10/2013
ISWC 2013
10
12. Commercial Data Sets Pilot
Pathways
Proteins
Interactions
Genes
Pharmacological
Activities
Transcripts
Clinical Drug
Applications
Biological
Processes
Pathological
Processes
25/10/2013
Drugs
Indications
Diseases
Compounds
ISWC 2013
11
13. Linked Open Data
★
make your stuff available on the Web
(whatever format) under an open license
★★
make it available as structured data
(e.g. Excel instead of image scan of a table)
★★★
use non-proprietary formats
(e.g. CSV instead of Excel)
★★★★ use URIs to denote things,
so that people can point at your stuff
★★★★★ link your data to other data
to provide context
http://5stardata.info/
25/10/2013
ISWC 2013
12
14. Commercial Linked Data
• Same conversion challenges as Open Data!
– Goal to have 5★ linked data
– www.openphacts.org/specs/rdfguide/
• Pilot (sample) data provided as data dumps
– XML
– CSV
– RDF
• Structurally similar to ChEMBL
• Converted to interoperable RDF
25/10/2013
ISWC 2013
13
15. Data Modelling Challenges
• Contain private terminologies
– Mapped to public equivalents
– On going work
• Units represented as strings
– Not always consistent, e.g. IC50, IC_50, IC-50
– QUDT extended, e.g. IC50
– www.openphacts.org/specs/units/
25/10/2013
ISWC 2013
14
16. Dataset Descriptions
www.openphacts.org/specs/datadesc/
Enable
• Discovery
– Name
– Description
– Coverage
• Access control
– License
– File locations
• Answer Provenance
– Returned data links to
description
25/10/2013
Commercial Data
Description
• Publicly discoverable
– Advertisement for data
– Bring in more customers
• Restricted access by
license
Private Data Description
• Hidden to all but
authorised
• Restricted access
ISWC 2013
15
17. Chemical mappings
• Data is messy!
• Identify common
problems:
– Charge imbalance
– Stereochemistry
• Link based on structure
25/10/2013
ISWC 2013
16
18. Chemistry Registration
ChemSpider Service
Chemical Registration Service
• Validates and
standardizes chemical
representations
• Manual curation by
RSC staff
• Data loaded in
ChemSpider
• Open data: unsuitable for
• Utilizes ChemSpider
Validation and
Standardization platform
• Utilizes FDA rule set as
basis for standardization
• Generates OPSID for
chemicals
• Computes properties
– Commercial data
– Private data
25/10/2013
ISWC 2013
17
20. Data Access
• Each data set loaded into separate graph in
cache
• Pilot data same form as open ChEMBL data
– Extend queries with sub-queries for each set
• Restricted access
– Virtuoso offers graph-based access restriction
– Commercial data sets turned on/off
25/10/2013
ISWC 2013
19
21. Conclusions
• Drug discovery requires full data coverage
– Public/open data
• Open description
• Open data
– Commercial data
• Open description
• Restricted data
– Private data
• Restricted description
• Restricted data
• Pilot study with three commercial datasets
25/10/2013
ISWC 2013
20
22. Conclusions
• Data Modelling
– Similar challenges as public data
• Access restriction
– Provided by standard mechanisms
– Graph-based access
• Open PHACTS Discovery Platform
– Releasing version 1.3 (late 2013)
– Version 1.4 will contain commercial data (2014)
25/10/2013
ISWC 2013
21
A platform for integratedpharmacology data Reliedupon by pharma companiesPublic domain, commercial, and private data sourcesBasedon open standardsProvidesdomainspecific APIMakingiteasyto build multiple drugdiscoveryapplications:examplesdeveloped in the project
Comprehensive picture requiredDrug discovery requires data from lots of domains:Chemistry: compound propertiesGenes:Proteins:Drug interactionsDiseasesBiological processes/interactionsPharmacology activities
Already exist many public open datasets
Driven by real business questions and use cases
Cache copies of data: (1) Performance (2) Social (1000s hits a day)Chemistry data normalisation/alignment through ChemSpiderDomain specific APIAPI calls populate SPARQL queries
Statistics to be added1,030,727,289 triplesHosted on beefy hardware; data in memory (aim)
More data!Driven by users
Sample of three commercial datasetsInformation on handful of targets only
Commercial data does not have open license
Same challenges as open data: data about the same things
Statistics may affect the perception of the commercial dataset
Data is messy; even chemicalsRequire linking between datasetsCommon backbone to integrate data
FDA: US Food and Drug Administration Substance Registration SystemOPSID: Open PHACTS identifier for chemicals