
Practical semantics in the pharmaceutical industry - the Open PHACTS project



The information revolution has transformed many business sectors over the last decade and the pharmaceutical industry is no exception. Developments in scientific and information technologies have unleashed an avalanche of content on research scientists, who are struggling to access and filter it efficiently. Furthermore, this domain has traditionally suffered from a lack of standards in how entities, processes and experimental results are described, making it difficult to determine whether results from two different sources can be reliably compared. The need to transform the way the life-science industry uses information has led to new thinking about how companies should work beyond their firewalls. In this talk we will provide an overview of the traditional approaches major pharmaceutical companies have taken to knowledge management and describe the business reasons why pre-competitive, cross-industry and public-private partnerships have gained much traction in recent years. We will consider the scientific challenges of integrating biomedical knowledge, highlighting the complexities of representing everyday scientific objects in computerised form. This leads us to discuss how the semantic web might offer a long-overdue solution. The talk will be illustrated by focusing on the EU Open PHACTS initiative, established to provide a unique public-private infrastructure for pharmaceutical discovery. We will describe the aims of this work and how technologies such as just-in-time identity resolution, nanopublication and interactive visualisations are helping to build a powerful software platform designed to appeal directly to scientific users across the public and private sectors.

Published in: Technology, Education, Business

  1. 1. Practical semantics in the pharmaceutical industry - the Open PHACTS project Antony Williams On behalf of the Open PHACTS Team (and with a focus on Chemistry!)
  2. 2. Fundamental issue: There is a LOT of science online! Chaotic, varying quality and very valuable! Scientists want to find information quickly and easily. Often they just "can't get there" (or don't even know where "there" is). And you have to manage it all (or not)
  3. 3. Pre-competitive Informatics: Pharma are all accessing, processing, storing & re-processing external research data Literature PubChem Genbank Patents Databases Downloads Data Integration Data Analysis Firewalled Databases Repeat @ each company x Lowering industry firewalls: pre-competitive informatics in drug discovery Nature Reviews Drug Discovery (2009) 8, 701-708 doi:10.1038/nrd2944
  4. 4. The Project Innovative Medicines Initiative • EC funded public-private partnership for pharmaceutical research • Focus on key problems – Efficacy, Safety, Education & Training, Knowledge Management The Open PHACTS Project • Create a semantic integration hub (“Open Pharmacological Space”)… • Delivering services to support on-going drug discovery programs in pharma and public domain • Not just another project; Leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements • 23 academic partners, 8 pharmaceutical companies, 3 biotechs INITIALLY • Work split into clusters: • Technical Build • Scientific Drive • Community & Sustainability
  5. 5. Major Work Streams Build: OPS service layer and resource integration Drive: Development of exemplar work packages & Applications Sustain: Community engagement and long-term sustainability Assertion & Meta Data Mgmt Transform / Translate Integrator OPS Service Layer Corpus 1 'Consumer' Firewall Supplier Firewall Db 2 Db 3 Db 4 Corpus 5 Std Public Vocabularies Target Dossier Compound Dossier Pharmacological Networks Business Rules Work Stream 1: Open Pharmacological Space (OPS) Service Layer Standardised software layer to allow public DD resource integration − Define standards and construct OPS service layer − Develop interface (API) for data access, integration and analysis − Develop secure access models Existing Drug Discovery (DD) Resource Integration Work Stream 2: Exemplar Drug Discovery Informatics tools Develop exemplar services to test OPS Service Layer Target Dossier (Data Integration) Pharmacological Network Navigator (Data Visualisation) Compound Dossier (Data Analysis)
  6. 6. ChEMBL DrugBank Gene Ontology Wikipathways UniProt ChemSpider UMLS ConceptWiki ChEBI TrialTrove GVKBio GeneGo TR Integrity “Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM” “What is the selectivity profile of known p38 inhibitors?” “Let me compare MW, logP and PSA for known oxidoreductase inhibitors”
  7. 7. Business Question Driven Approach
     Number | sum | Nr of 1 | Question
     15 | 12 | 9 | All oxidoreductase inhibitors active <100nM in both human and mouse
     18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
     24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
     32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.
     37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
     38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
     41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
     44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data
     46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease)
     59 | 14 | 8 | Identify all known protein-protein interaction inhibitors
  8. 8. Open PHACTS Scientific Services Platform Explorer Standards Apps API “Provenance Everywhere”
  9. 9. Nanopub Db VoID Data Cache (Virtuoso Triple Store) Semantic Workflow Engine Linked Data API (RDF/XML, TTL, JSON) Domain Specific Services Identity Resolution Service Chemistry Registration Normalisation & Q/C Identifier Management Service Indexing Core Platform P12374 EC2.43.4 CS4532 “Adenosine receptor 2a” VoID Db Nanopub Db VoID Db VoID Nanopub VoID Public Content Commercial Public Ontologies User Annotations Apps
  10. 10. RDF/VoID RDF (Resource Description Framework) VoID (Vocabulary of Interlinked Datasets) – Metadata describing the RDF – Describes how Datasets are linked using Linksets • skos:exactMatch (Simple Knowledge Organisation System) E.g. To link compounds in OPS with compounds in ChEBI. • skos:closeMatch E.g. To link Stereo Insensitive Parents to their Children within OPS. • skos:relatedMatch E.g. To link Parent compounds that contain others as Fragments. • dul:expresses (DOLCE+DnS Ultralite) – describes what links the Datasets. We use Cheminf to express the links E.g. represents an InChIKey. – Recommendations on how to create the VoID have been specified by Manchester here:
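The linkset idea on this slide can be illustrated with a minimal sketch in plain Python. It builds a few link triples and prints them in a Turtle-like form. Only the three SKOS predicate names and their stated uses come from the slide; the `ops:` and `chebi:` prefixes, every identifier, and the helper names `make_link` and `to_turtle` are hypothetical placeholders, not real Open PHACTS or ChEBI records or APIs.

```python
from typing import Iterable, NamedTuple


class Link(NamedTuple):
    subject: str
    predicate: str
    obj: str


# SKOS mapping predicates named on the slide, with the use the slide gives for each.
PREDICATE_USE = {
    "skos:exactMatch": "same compound in another dataset",
    "skos:closeMatch": "stereo-insensitive parent linked to its children",
    "skos:relatedMatch": "parent compound that contains another as a fragment",
}


def make_link(subject: str, predicate: str, obj: str) -> Link:
    """Create a link, rejecting predicates the slide does not list."""
    if predicate not in PREDICATE_USE:
        raise ValueError(f"unsupported link predicate: {predicate}")
    return Link(subject, predicate, obj)


def to_turtle(links: Iterable[Link]) -> str:
    """Serialise links as one 'subject predicate object .' statement per line."""
    return "\n".join(f"{l.subject} {l.predicate} {l.obj} ." for l in links)


# Hypothetical identifiers, for illustration only.
linkset = [
    make_link("ops:OPS_A", "skos:exactMatch", "chebi:EXAMPLE_1"),
    make_link("ops:OPS_B", "skos:closeMatch", "ops:OPS_A"),
    make_link("ops:OPS_C", "skos:relatedMatch", "ops:OPS_A"),
]

print(to_turtle(linkset))
```

Keeping the predicate on each link, rather than merging everything into one "same as" relation, is what lets a consumer later decide how strictly to interpret equivalence.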
  11. 11. Chemistry Registration Normalisation & Q/C Chemistry Registration • Old chemistry registration system uses standard ChemSpider deposition system: includes low-level structure validation and manual curation service by RSC staff. • New Registration System • Utilizes ChemSpider Validation and Standardization platform including collapsing tautomers • Utilizes FDA rule set as basis for standardizations • Generate Open PHACTS identifier (OPS ID)
  12. 12. Quantitative Data Challenges
      STANDARD_TYPE | UNIT_COUNT
      AC50 | 7
      Activity | 421
      EC50 | 39
      IC50 | 46
      ID50 | 42
      Ki | 23
      Log IC50 | 4
      Log Ki | 7
      Potency | 11
      log IC50 | 0

      STANDARD_TYPE | STANDARD_UNITS | COUNT(*)
      IC50 | nM | 829448
      IC50 | ug.mL-1 | 41000
      IC50 | (none given) | 38521
      IC50 | ug/ml | 2038
      IC50 | ug ml-1 | 509
      IC50 | mg kg-1 | 295
      IC50 | molar ratio | 178
      IC50 | ug | 117
      IC50 | % | 113
      IC50 | uM well-1 | 52

      ~100 units, >5000 types
      Implemented using the Quantities, Dimension, Units, Types Ontology
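To make the unit problem on this slide concrete, here is a minimal Python sketch that collapses three spellings of micrograms per millilitre from the IC50 table into one label and converts a value to nM. It is not the ontology-based implementation the slide refers to; the function names, the lookup set and the requirement to pass a molecular weight are all assumptions made for illustration.

```python
import re
from typing import Optional

# Spelling variants of the same unit, taken from the slide's IC50 table.
_MASS_PER_VOLUME = {"ug.ml-1", "ug/ml", "ug ml-1"}


def canonical_unit(raw: str) -> str:
    """Collapse known spelling variants of a unit string into one label."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    if key in _MASS_PER_VOLUME:
        return "ug/mL"
    if key == "nm":  # in this bioactivity context "nM" means nanomolar
        return "nM"
    return raw.strip()  # unknown units pass through unchanged


def to_nanomolar(value: float, raw_unit: str,
                 mol_weight: Optional[float] = None) -> float:
    """Convert a concentration to nM; mass-based units need a molecular weight (g/mol)."""
    unit = canonical_unit(raw_unit)
    if unit == "nM":
        return float(value)
    if unit == "ug/mL":
        if mol_weight is None:
            raise ValueError("molecular weight is required to convert ug/mL to nM")
        # 1 ug/mL = 1e-3 g/L; dividing by g/mol gives mol/L; multiplying by 1e9 gives nM.
        return value * 1e6 / mol_weight
    raise ValueError(f"cannot convert {raw_unit!r} to nM")


# A compound with an assumed molecular weight of 500 g/mol at 1 ug/mL.
print(to_nanomolar(1.0, "ug ml-1", mol_weight=500.0))
```

Units such as `mg kg-1` or `%` are deliberately rejected rather than guessed at, since they are not concentrations that can be converted to nM.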
  13. 13. Content Changes Regularly! POINT IN TIME
      Source | Initial Records | Triples | Properties
      ChEMBL | 1,149,792 (~1,091,462 cmpds, ~8845 targets) | 146,079,194 | 17 cmpds, 13 targets
      DrugBank | 19,628 (~14,000 drugs, ~5000 targets) | 517,584 | 74
      UniProt | 536,789 | 156,569,764 | 78
      ENZYME | 6,187 | 73,838 | 2
      ChEBI | 35,584 | 905,189 | 2
      GO/GOA | 38,137 | 24,574,774 | 42
      ChemSpider/ACD | 1,194,437 | 161,336,857 | 22 ACD, 4 CS
      ConceptWiki | 2,828,966 | 3,739,884 | 1
  14. 14. Data Cache (Virtuoso Triple Store) Semantic Workflow Engine Infrastructure
      Hardware (development)
      - 2 x Intel Xeon E5-2640
      - 384 GB DDR3 1333MHz RAM
      - 1.5 TB SSD
      - 3TB 7200rpm
      Triple Store
      - Virtuoso 7 column store
      - Shown to scale to > 100 billion triples
      Network
      - AMX-IS
      - Extensive memcache
  15. 15. Antony Williams vs Identifiers Passport ID Dad, Tony, others SSN Green Card License 5 email addresses ChemConnector (blog, Twitter account, Facebook, Friendfeed) OpenID, ORCID ….
  16. 16. P12047 X31045 GB:29384 Let a Mapping Service take the strain….
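One simple way to picture such a mapping service is a union-find structure that groups identifiers into equivalence classes. The Python sketch below assumes, as the slide appears to suggest, that the three identifiers shown refer to the same entity; the class `IdentifierMapper` and its methods are hypothetical and do not describe the actual Open PHACTS identity mapping service.

```python
class IdentifierMapper:
    """Groups identifiers that have been declared equivalent (union-find)."""

    def __init__(self) -> None:
        self._parent = {}

    def _find(self, ident: str) -> str:
        """Return the representative of ident's group, registering ident if new."""
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the walked path at the root.
        while self._parent[ident] != root:
            next_ident = self._parent[ident]
            self._parent[ident] = root
            ident = next_ident
        return root

    def add_mapping(self, a: str, b: str) -> None:
        """Declare that identifiers a and b denote the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_entity(self, a: str, b: str) -> bool:
        return self._find(a) == self._find(b)

    def aliases(self, ident: str) -> list:
        """All known identifiers in the same group as ident, sorted."""
        root = self._find(ident)
        return sorted(i for i in list(self._parent) if self._find(i) == root)


mapper = IdentifierMapper()
# Identifiers from the slide, assumed here to describe one entity.
mapper.add_mapping("P12047", "X31045")
mapper.add_mapping("X31045", "GB:29384")

print(mapper.aliases("GB:29384"))
```

The caller only ever asks "are these the same?" or "what else is this called?", which is the strain the slide proposes to hand over to a service.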
  17. 17. PubChem DrugBank ChemSpider Imatinib Mesylate What Is Gleevec?
  18. 18. Strict Relaxed Analysing Browsing Dynamic Equality LinkSet#1 { chemspider:gleevec hasParent imatinib ... drugbank:gleevec exactMatch imatinib ... } chemspider:gleevec drugbank:gleevec
  19. 19. ChemSpider Validation & Standardization Platform Quality Assurance
  20. 20. Chemistry Validation and Standardization Platform (CVSP) at • Validation • Standardization • Parent generation RDF Export Data
  21. 21. CTAB REGID1 DataSource Synonym1 Synonym2 XRef1 etc Deposited SDF record Standardized entity OPS_ID1 Parents Charge Parent (OPS_ID7) Isotope Parent (OPS_ID5) Stereo Parent (OPS_ID4) Tautomer Parent (OPS_ID6) Super Parent (OPS_ID8) Fragment (OPS_ID3) Fragment (OPS_ID2)
  22. 22. For each Compound (CSID) parent generation is attempted ("Tautomerism in large databases", Sitzmann and others, J. Comput. Aided Mol. Des. (2010)):
      Parent | Description
      Charge-Insensitive | An attempt is made to neutralize ionized acids and bases. Envisioned to be an ongoing improvement as new cases appear.
      Isotope-Insensitive | Isotopes are replaced by the common weight.
      Stereo-Insensitive | Stereo is stripped.
      Tautomer-Insensitive | Tautomer canonicalization attempts to generate a "reasonable" tautomer.
      Super-Insensitive | This parent is all of the above.
  23. 23. OPS1 DrugBank ID DB07241 OPS5 OPS4 OPS3 OPS2 OPS6
      ops:OPS1 skos:exactMatch <http://www4.wiwiss.fu-> .
      ops:OPS2 skos:relatedMatch ops:OPS1 .
      ops:OPS3 skos:relatedMatch ops:OPS1 .
      ops:OPS3 skos:closeMatch ops:OPS4 .
      ops:OPS3 skos:closeMatch ops:OPS5 .
      ops:OPS4 skos:closeMatch ops:OPS6 .
      ops:OPS5 skos:closeMatch ops:OPS6 .
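The links on this slide, combined with the strict and relaxed views from the "Dynamic Equality" slide, can be sketched as a graph walk that follows only the predicates a chosen lens allows. In the Python sketch below the triples are those shown on the slide, with `drugbank:DB07241` standing in for the truncated DrugBank URI. The lens definitions, the choice to treat links as undirected, and the function name `equivalents` are assumptions for illustration, not the platform's actual rules.

```python
from collections import defaultdict, deque

# Triples from the slide; "drugbank:DB07241" is a stand-in for the truncated URI.
TRIPLES = [
    ("ops:OPS1", "skos:exactMatch", "drugbank:DB07241"),
    ("ops:OPS2", "skos:relatedMatch", "ops:OPS1"),
    ("ops:OPS3", "skos:relatedMatch", "ops:OPS1"),
    ("ops:OPS3", "skos:closeMatch", "ops:OPS4"),
    ("ops:OPS3", "skos:closeMatch", "ops:OPS5"),
    ("ops:OPS4", "skos:closeMatch", "ops:OPS6"),
    ("ops:OPS5", "skos:closeMatch", "ops:OPS6"),
]

# Hypothetical lenses: which link predicates count as "the same thing".
LENSES = {
    "strict": {"skos:exactMatch"},
    "relaxed": {"skos:exactMatch", "skos:closeMatch", "skos:relatedMatch"},
}


def equivalents(start: str, lens: str) -> list:
    """Identifiers reachable from start using only links the lens allows."""
    allowed = LENSES[lens]
    neighbours = defaultdict(set)
    for subject, predicate, obj in TRIPLES:
        if predicate in allowed:
            # Assumption: links are followed in both directions.
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen - {start})


print(equivalents("ops:OPS1", "strict"))
print(equivalents("ops:OPS1", "relaxed"))
```

Under the strict lens OPS1 is equated only with the DrugBank record, while the relaxed lens also pulls in the parents and fragments, which is the difference between analysing and browsing that the earlier slide draws.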
  24. 24. A Precompetitive Knowledge Framework Integration Pharma Needs Inputs Sustainability Stability Security Management / Governance Data Mining Services/Algorithms Mapping & Populating Architecture Interfaces & Services Content Structured & Unstructured Vocabularies & Identifiers (URIs) Community KD Innovation
  25. 25. The Ecosystem is …. API Approach Community Industry Academia Data Provider Software Provider
  26. 26. Kick-Starting Sustainability Collaboration Grants Industry Open PHACTS API Users Apps API
  28. 28. Example applications Advanced analytics ChemBioNavigator Navigating at the interface of chemical and biological data with sorting and plotting options TargetDossier Interconnecting Open PHACTS with multiple target centric services. Exploring target similarity using diverse criteria PharmaTrek Interactive Polypharmacology space of experimental annotations UTOPIA Semantic enrichment of scientific PDFs Predictions GARFIELD Prediction of target pharmacology based on the Similar Ensemble Approach eTOX connector Automatic extraction of data for building predictive toxicology models in eTOX project
  29. 29. Front-end framework to visualize biological data Target dossier (CNIO)
  30. 30. The Open PHACTS community ecosystem
  31. 31. Becoming part of the Open PHACTS Foundation. Members: membership offers early access to platform updates and releases; the opportunity to steer research and development directions; technical support; work with the ecosystem of developers and semantic data integrators around Open PHACTS; tiered membership; a familiar business and governance model. A UK-based not-for-profit member-owned company
  32. 32. What are the problems with licensing we had to address? – To make data and software generated by the project usable/reusable – Multiplicity of unclear or non-standard licenses on original data sources • 'Public' can mean use but not redistribute, use in commercial environment • Legal position on use and reuse extremely unclear • Different issues than just linking to data – Legal status of integrated collections of the above, and of derived knowledge? – Appropriate software license selection – Legal clarity for EFPIA and end users – Approaches for commercial data integration, EFPIA in-house data AIM: enable maximum possible dissemination and usability of integrated data and architecture with approaches that will be applicable in other data integration projects Licensing Challenges
  33. 33. Chose John Wilbanks as consultant A framework built around STANDARD well-understood Creative Commons licences – and how they interoperate Deal with the problems by: Interoperable licences Appropriate terms Declare expectations to users and data publishers One size won't fit all requirements Data Licensing Solution
  34. 34. Open PHACTS Project Partners: Pfizer Limited – Coordinator; Universität Wien – Managing entity; Technical University of Denmark; University of Hamburg, Center for Bioinformatics; BioSolveIT GmbH; Consorci Mar Parc de Salut de Barcelona; Leiden University Medical Centre; Royal Society of Chemistry; Vrije Universiteit Amsterdam; Spanish National Cancer Research Centre; University of Manchester; Maastricht University; Aqnowledge; University of Santiago de Compostela; Rheinische Friedrich-Wilhelms-Universität Bonn; AstraZeneca; GlaxoSmithKline; Esteve; Novartis; Merck Serono; H. Lundbeck A/S; Eli Lilly; Netherlands Bioinformatics Centre; Swiss Institute of Bioinformatics; ConnectedDiscovery; EMBL-European Bioinformatics Institute; Janssen; OpenLink