Incorporating Commercial and Private
Data into an Open Linked Data Platform
for Drug Discovery
Carole Goble, Alasdair J G ...
Pre-competitive Informatics
Pharmaceutical companies are all accessing, processing,
storing & re-processing external resea...
Open PHACTS objective
Apps

Interactive
responses

Open
Standards
Domain API

Provenance of
data

Drug Discovery Platform
...
Drug Discovery Data
Pathways
Proteins
Interactions
Genes

Pharmacological
Activities

Transcripts
Clinical Drug
Applicatio...
Public Data
Pathways
Proteins
Interactions
Genes

Pharmacological
Activities

Transcripts
Clinical Drug
Applications
Biolo...
Real Business Questions
“Let me compare
Proteins logP and PSA
MW,
for known
oxidoreductase
Genes
inhibitors”

Pathways
Int...
OPS Discovery Platform

Core Platform

Apps
Identity
Resolution
Service
Identifier
Management
Service

“Adenosine
receptor...
Present Content: Public Data
Source

Initial Records

Triples

Properties

ChEMBL

1,247,403

305,419,649

77

19,628

517...
Semantic Integration Methodology
1. Define use cases
2. Identify Data
–
–

Create RDF
VoID dataset descriptions

3. Create...
Semantic Integration Methodology
4. Ingest RDF into data cache
(i.e. triple store)
5. Define access paths to core concepts...
Commercial Data Use Case
“What is the
selectivity profile of
known p38 inhibitors?”

• Comprehensive data
coverage
– Comme...
Commercial Data Sets Pilot
Pathways
Proteins
Interactions
Genes

Pharmacological
Activities

Transcripts
Clinical Drug
App...
Linked Open Data
★

make your stuff available on the Web
(whatever format) under an open license
★★
make it available as s...
Commercial Linked Data
• Same conversion challenges as Open Data!
– Goal to have 5★ linked data
– www.openphacts.org/specs...
Data Modelling Challenges
• Contain private terminologies
– Mapped to public equivalents
– On going work

• Units represen...
Dataset Descriptions
www.openphacts.org/specs/datadesc/

Enable
• Discovery
– Name
– Description
– Coverage

• Access cont...
Chemical mappings
• Data is messy!
• Identify common
problems:
– Charge imbalance
– Stereochemistry

• Link based on struc...
Chemistry Registration
ChemSpider Service

Chemical Registration Service

• Validates and
standardizes chemical
representa...
Access Requirements

25/10/2013

ISWC 2013

18
Data Access
• Each data set loaded into separate graph in
cache
• Pilot data same form as open ChEMBL data
– Extend querie...
Conclusions
• Drug discovery requires full data coverage
– Public/open data
• Open description
• Open data

– Commercial d...
Conclusions
• Data Modelling
– Similar challenges as public data

• Access restriction
– Provided by standard mechanisms
–...
Acknowledgements
• GVK Bio
GOSTAR
gostardb.com
• Thomson Reuters
Integrity
integrity.thomson-pharma.com
• Aureus Sciences ...
Questions
A.J.G.Gray@hw.ac.uk
www.macs.hw.ac.uk/~ajg33
@gray_alasdair

Open PHACTS Project

pmu@openphacts.org
www.openpha...
Upcoming SlideShare
Loading in …5
×

Incorporating Commercial and Private Data into an Open Linked Data Platform for Drug Discovery

1,073 views

Published on

The Open PHACTS Discovery Platform aims to provide an integrated information space to advance pharmacological research in the area of drug discovery. Effective drug discovery requires comprehensive data coverage, i.e. integrating all available sources of pharmacology data. While many relevant data sources are available on the linked open data cloud, their content needs to be combined with that of commercial datasets and the licensing of these commercial datasets respected when providing access to the data. Additionally, pharmaceutical companies have built up their own extensive private data collections that they require to be included in their pharmacological dataspace. In this paper we discuss the challenges of incorporating private and commercial data into a linked dataspace: focusing on the modelling of these datasets and their interlinking. We also present the graph-based access control mechanism that ensures commercial and private datasets are only available to authorized users.

http://link.springer.com/chapter/10.1007/978-3-642-41338-4_5

Published in: Technology, Business
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,073
On SlideShare
0
From Embeds
0
Number of Embeds
49
Actions
Shares
0
Downloads
11
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide
  • Big waste of resourcesNo competitive advantage
  • A platform for integratedpharmacology data Reliedupon by pharma companiesPublic domain, commercial, and private data sourcesBasedon open standardsProvidesdomainspecific APIMakingiteasyto build multiple drugdiscoveryapplications:examplesdeveloped in the project
  • Comprehensive picture requiredDrug discovery requires data from lots of domains:Chemistry: compound propertiesGenes:Proteins:Drug interactionsDiseasesBiological processes/interactionsPharmacology activities
  • Already exist many public open datasets
  • Driven by real business questions and use cases
  • Cache copies of data: (1) Performance (2) Social (1000s hits a day)Chemistry data normalisation/alignment through ChemSpiderDomain specific APIAPI calls populate SPARQL queries
  • Statistics to be added1,030,727,289 triplesHosted on beefy hardware; data in memory (aim)
  • More data!Driven by users
  • Sample of three commercial datasetsInformation on handful of targets only
  • Commercial data does not have open license
  • Same challenges as open data: data about the same things
  • Statistics may affect the perception of the commercial dataset
  • Data is messy; even chemicalsRequire linking between datasetsCommon backbone to integrate data
  • FDA: US Food and Drug Administration Substance Registration SystemOPSID: Open PHACTS identifier for chemicals
  • Data sets with license requirements
  • Incorporating Commercial and Private Data into an Open Linked Data Platform for Drug Discovery

    1. 1. Incorporating Commercial and Private Data into an Open Linked Data Platform for Drug Discovery Carole Goble, Alasdair J G Gray, Lee Harland, Karen Karapetyan, Antonis Loizou, Ivan Mikhailov, Yrjänä Rankka, Stefan Senger, Valery Tkachenko, Antony J Williams, and Egon L Willighagen www.openphacts.org @open_phacts A.J.G.Gray@hw.ac.uk @gray_alasdair
    2. 2. Pre-competitive Informatics Pharmaceutical companies are all accessing, processing, storing & re-processing external research data Literature Genbank Patents PubChem Data Integration Databases Data Analysis Downloads x Repeat @ each company Firewalled Databases Lowering industry firewalls: pre-competitive informatics in drug discovery 25/10/2013 Reviews Drug Discovery (2009) 2013 ISWC 8, 701-708 doi:10.1038/nrd2944 Nature 1
    3. 3. Open PHACTS objective Apps Interactive responses Open Standards Domain API Provenance of data Drug Discovery Platform Production quality 25/10/2013 ISWC 2013 2
    4. 4. Drug Discovery Data Pathways Proteins Interactions Genes Pharmacological Activities Transcripts Clinical Drug Applications Biological Processes Pathological Processes 25/10/2013 Drugs Indications Diseases Compounds ISWC 2013 3
    5. 5. Public Data Pathways Proteins Interactions Genes Pharmacological Activities Transcripts Clinical Drug Applications Biological Processes Pathological Processes 25/10/2013 Drugs Indications Diseases Compounds ISWC 2013 4
    6. 6. Real Business Questions “Let me compare Proteins logP and PSA MW, for known oxidoreductase Genes inhibitors” Pathways Interactions “What is the Pharmacological selectivity profile of Activities known p38 inhibitors?” Transcripts Clinical Drug Applications Biological Processes Pathological Processes 25/10/2013 “Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM” Drugs Indications Diseases Compounds ISWC 2013 5
    7. 7. OPS Discovery Platform Core Platform Apps Identity Resolution Service Identifier Management Service “Adenosine receptor 2a” Linked Data API (RDF/XML, TTL, JSON) P12374 EC2.43.4 CS4532 Domain Specific Services Semantic Workflow Engine Chemistry Registration Normalisatio n & Q/C Data Cache (Virtuoso Triple Store) Indexing VoID VoID VoID Nanopub Public Ontologies Db Db 25/10/2013 VoID Nanopub Db Nanopub Db ISWC 2013 Public Content VoID Commercial User Annotations 6
    8. 8. Present Content: Public Data Source Initial Records Triples Properties ChEMBL 1,247,403 305,419,649 77 19,628 517,584 74 ? 533,394,147 82 6,187 73,838 2 ChEBI 40,575 40,575 2 GeneOntology 38,137 1,265,273 26 ? 23,489,501 15 ChemSpider 1,194,437 161,336,857 26 ConceptWiki 2,828,966 3,739,884 1 946 1,449,981 34 DrugBank UniProt ENZYME GOA WikiPathways 25/10/2013 ISWC 2013 7
    9. 9. Semantic Integration Methodology 1. Define use cases 2. Identify Data – – Create RDF VoID dataset descriptions 3. Create mappings – – 25/10/2013 between data set and known data sets (instance level) index for text to URL conversion ISWC 2013 8
    10. 10. Semantic Integration Methodology 4. Ingest RDF into data cache (i.e. triple store) 5. Define access paths to core concepts in data 6. Extend or create SPARQL queries for API calls 7. Publish API calls 25/10/2013 ISWC 2013 9
    11. 11. Commercial Data Use Case “What is the selectivity profile of known p38 inhibitors?” • Comprehensive data coverage – Commercial data collections – Extensive private collections “There is relevant data in various commercial datasets.” • Control data responses – Only authorised data “My company X has its own private dataset on this topic.” 25/10/2013 ISWC 2013 10
    12. 12. Commercial Data Sets Pilot Pathways Proteins Interactions Genes Pharmacological Activities Transcripts Clinical Drug Applications Biological Processes Pathological Processes 25/10/2013 Drugs Indications Diseases Compounds ISWC 2013 11
    13. 13. Linked Open Data ★ make your stuff available on the Web (whatever format) under an open license ★★ make it available as structured data (e.g. Excel instead of image scan of a table) ★★★ use non-proprietary formats (e.g. CSV instead of Excel) ★★★★ use URIs to denote things, so that people can point at your stuff ★★★★★ link your data to other data to provide context http://5stardata.info/ 25/10/2013 ISWC 2013 12
    14. 14. Commercial Linked Data • Same conversion challenges as Open Data! – Goal to have 5★ linked data – www.openphacts.org/specs/rdfguide/ • Pilot (sample) data provided as data dumps – XML – CSV – RDF • Structurally similar to ChEMBL • Converted to interoperable RDF 25/10/2013 ISWC 2013 13
    15. 15. Data Modelling Challenges • Contain private terminologies – Mapped to public equivalents – On going work • Units represented as strings – Not always consistent, e.g. IC50, IC_50, IC-50 – QUDT extended, e.g. IC50 – www.openphacts.org/specs/units/ 25/10/2013 ISWC 2013 14
    16. 16. Dataset Descriptions www.openphacts.org/specs/datadesc/ Enable • Discovery – Name – Description – Coverage • Access control – License – File locations • Answer Provenance – Returned data links to description 25/10/2013 Commercial Data Description • Publicly discoverable – Advertisement for data – Bring in more customers • Restricted access by license Private Data Description • Hidden to all but authorised • Restricted access ISWC 2013 15
    17. 17. Chemical mappings • Data is messy! • Identify common problems: – Charge imbalance – Stereochemistry • Link based on structure 25/10/2013 ISWC 2013 16
    18. 18. Chemistry Registration ChemSpider Service Chemical Registration Service • Validates and standardizes chemical representations • Manual curation by RSC staff • Data loaded in ChemSpider • Open data: unsuitable for • Utilizes ChemSpider Validation and Standardization platform • Utilizes FDA rule set as basis for standardization • Generates OPSID for chemicals • Computes properties – Commercial data – Private data 25/10/2013 ISWC 2013 17
    19. 19. Access Requirements 25/10/2013 ISWC 2013 18
    20. 20. Data Access • Each data set loaded into separate graph in cache • Pilot data same form as open ChEMBL data – Extend queries with sub-queries for each set • Restricted access – Virtuoso offers graph-based access restriction – Commercial data sets turned on/off 25/10/2013 ISWC 2013 19
    21. 21. Conclusions • Drug discovery requires full data coverage – Public/open data • Open description • Open data – Commercial data • Open description • Restricted data – Private data • Restricted description • Restricted data • Pilot study with three commercial datasets 25/10/2013 ISWC 2013 20
    22. 22. Conclusions • Data Modelling – Similar challenges as public data • Access restriction – Provided by standard mechanisms – Graph-based access • Open PHACTS Discovery Platform – Releasing version 1.3 (late 2013) – Version 1.4 will contain commercial data (2014) 25/10/2013 ISWC 2013 21
    23. 23. Acknowledgements • GVK Bio GOSTAR gostardb.com • Thomson Reuters Integrity integrity.thomson-pharma.com • Aureus Sciences Elsevier AurSCOPE www.aureus-sciences.com 25/10/2013 ISWC 2013 22
    24. 24. Questions A.J.G.Gray@hw.ac.uk www.macs.hw.ac.uk/~ajg33 @gray_alasdair Open PHACTS Project pmu@openphacts.org www.openphacts.org @open_phacts

    ×