The information revolution has transformed many business sectors over the last decade and the pharmaceutical industry is no exception. Developments in scientific and information technologies have unleashed an avalanche of content on research scientists who are struggling to access and filter this in an efficient manner. Furthermore, this domain has traditionally suffered from a lack of standards in how entities, processes and experimental results are described, leading to difficulties in determining whether results from two different sources can be reliably compared. The need to transform the way the life-science industry uses information has led to new thinking about how companies should work beyond their firewalls. In this talk we will provide an overview of the traditional approaches major pharmaceutical companies have taken to knowledge management and describe the business reasons why pre-competitive, cross-industry and public-private partnerships have gained much traction in recent years. We will consider the scientific challenges concerning the integration of biomedical knowledge, highlighting the complexities in representing everyday scientific objects in computerised form. This leads us to discuss how the semantic web might lead us to a long-overdue solution. The talk will be illustrated by focusing on the EU-Open PHACTS initiative (openphacts.org), established to provide a unique public-private infrastructure for pharmaceutical discovery. The aims of this work will be described and how technologies such as just-in-time identity resolution, nanopublication and interactive visualisations are helping to build a powerful software platform designed to appeal to directly to scientific users across the public and private sectors.
Practical semantics in the pharmaceutical industry - the Open PHACTS project
1. Practical semantics in the pharmaceutical
industry - the Open PHACTS project
Antony Williams
On behalf of the Open PHACTS Team
(and with a focus on Chemistry!)
2. Fundamental issue:
There is a LOT of science online!
Chaotic, varying quality and very valuable!
Scientists want to find information quickly and
easily
Often they just “can‟t get there” (or don‟t even
know where “there” is)
And you have to manage it all (or not)
3. Pre-competitive Informatics:
Pharma are all accessing, processing, storing & re-processing external research data
Literature
PubChem
Genbank
Patents
Databases
Downloads
Data Integration Data Analysis
Firewalled Databases
Repeat @
each
company
x
Lowering industry firewalls: pre-competitive informatics in drug discovery
Nature Reviews Drug Discovery (2009) 8, 701-708 doi:10.1038/nrd2944
4. The Project
Innovative Medicines Initiative
• EC funded public-private
partnership for
pharmaceutical research
• Focus on key problems
– Efficacy, Safety, Educati
on &
Training, Knowledge
Management
The Open PHACTS Project
• Create a semantic integration hub (“Open
Pharmacological Space”)…
• Delivering services to support on-going drug
discovery programs in pharma and public domain
• Not just another project; Leading academics in
semantics, pharmacology and informatics, driven
by solid industry business requirements
• 23 academic partners, 8 pharmaceutical
companies, 3 biotechs INITIALLY
• Work split into clusters:
• Technical Build
• Scientific Drive
• Community & Sustainability
5. Major Work Streams
Build: OPS service layer and resource integration
Drive: Development of exemplar work packages & Applications
Sustain: Community engagement and long-term sustainability
Assertion & Meta Data Mgmt
Transform / Translate
Integrator
OPS Service Layer
Corpus 1
„Consumer‟
Firewall
Supplier
Firewall
Db 2
Db 3
Db 4
Corpus 5
Std Public
Vocabularies
Target
Dossier
Compound
Dossier
Pharmacological
Networks
Business
Rules
Work Stream 1: Open Pharmacological
Space (OPS)Service Layer
Standardised softwarelayerto allow public
DD resource integration
− Define standards and construct OPS service layer
− Develop interface (API) for data access, integration
and analysis
− Develop secure access models
Existing Drug Discovery (DD)Resource
Integration
Work Stream 2: Exemplar Drug Discovery Informatics tools
Develop exemplar services to test OPS Service Layer
Target Dossier (Data Integration)
Pharmacological Network Navigator (Data Visualisation)
Compound Dossier (Data Analysis)
7. Number sum Nr of 1 Question
15 12 9 All oxidoreductase inhibitors active <100nM in both human and mouse
18 14 8
Given compound X, what is its predicted secondary pharmacology? What are the on and
off,target safety concerns for a compound? What is the evidence and how reliable is that
evidence (journal impact factor, KOL) for findings associated with a compound?
24 13 8
Given a target find me all actives against that target. Find/predict polypharmacology of actives.
Determine ADMET profile of actives.
32 13 8 For a given interaction profile, give me compounds similar to it.
37 13 8
The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data
in serine protease assays for molecules that contain substructure X.
38 13 8
Retrieve all experimental and clinical data for a given list of compounds defined by their chemical
structure (with options to match stereochemistry or not).
41 13 8
A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the
compounds known to modulate the target directly? What are the compounds that may modulate
the target directly? i.e. return all cmpds active in assays where the resolution is at least at the
level of the target family (i.e. PKC) both from structured assay databases and the literature.
44 13 8 Give me all active compounds on a given target with the relevant assay data
46 13 8
Give me the compound(s) which hit most specifically the multiple targets in a given pathway
(disease)
59 14 8 Identify all known protein-protein interaction inhibitors
Business Question Driven Approach
8.
9. Open PHACTS Scientific Services
Platform Explorer
Standards
Apps
API
“Provenance Everywhere”
10. Nanopub
Db
VoID
Data Cache
(Virtuoso Triple Store)
Semantic Workflow Engine
Linked Data API (RDF/XML, TTL, JSON)
Domain
Specific
Services
Identity
Resolution
Service
Chemistry
Registration
Normalisation
& Q/C
Identifier
Management
Service
Indexing
CorePlatform
P12374
EC2.43.4
CS4532
“Adenosine
receptor 2a”
VoID
Db
Nanopub
Db
VoID
Db
VoID
Nanopub
VoID
Public Content Commercial
Public
Ontologies
User
Annotations
Apps
11.
12. RDF/VoID
RDF (Resource Description Framework)
VoID (Vocabulary of Interlinked Datasets)
– Metadata describing the RDF
– Describes how Datasets are linked using Linksets
• skos:exactMatch (Simple Knowledge Organisation System)
E.g. To link compounds in OPS with compounds in ChEBI.
• skos:closeMatch
E.g. To link Stereo Insensitive Parents to their Children within OPS.
• skos:relatedMatch
E.g. To link Parent compounds that contain others as Fragments.
• dul:expresses (DOLCE+DnS Ultralite) – describes what links the Datasets. We
use Cheminf to express the links E.g.
http://semanticscience.org/resource/CHEMINF_000059 represents an
InChIKey.
– Recommendations on how to create the VoID have been specified by Manchester
here: http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/
13. Chemistry
Registration
Normalisation
& Q/C
Chemistry Registration
• Old chemistry registration system uses standard
ChemSpider deposition system: includes low-
level structure validation and manual curation
service by RSC staff.
• New Registration System
• Utilizes ChemSpider Validation and
Standardization platform including collapsing
tautomers
• Utilizes FDA rule set as basis for
standardizations
• Generate Open PHACTS identifier (OPS ID)
14. STANDARD_TYPE UNIT_COUNT
---------------- -------
AC50 7
Activity 421
EC50 39
IC50 46
ID50 42
Ki 23
Log IC50 4
Log Ki 7
Potency 11
log IC50 0
STANDARD_TYPE STANDARD_UNITS COUNT(*)
------------------ ------------------ --------
IC50 nM 829448
IC50 ug.mL-1 41000
IC50 38521
IC50 ug/ml 2038
IC50 ug ml-1 509
IC50 mg kg-1 295
IC50 molar ratio 178
IC50 ug 117
IC50 % 113
IC50 uM well-1 52
~ 100 units
>5000 types
Implemented using the Quantities, Dimension, Units, Types
Ontology (http://www.qudt.org/)
Quantitative Data Challenges
28. For each Compound (CSID) parent generation is
attempted: “Tautomerism in large databases”, Sitzmann
and others, J.Comput Aided Mol Des (2010)
Parent Description
Charge-
Unsensitive
An attempt is made to neutralize ionized
acids and bases. Envisioned to be an
ongoing improvement while new cases
appear.
Isotope-
Unsensitive
Isotopes replaced by common weight
Stereo-Unsensitive Stereo is stripped
Tautomer-
Unsensitive
Tautomer canonicalization is attempting to
generate a “reasonable” tautomer
Super-Unsensitive This parent is all of the above
35. Example applications
Advanced analytics
ChemBioNavigator Navigating at the interface of chemical and
biological data with sorting and plotting options
TargetDossier Interconnecting Open PHACTS with multiple
target centric services. Exploring target
similarity using diverse criteria
PharmaTrek Interactive Polypharmacology space of
experimental annotations
UTOPIA Semantic enrichment of scientific PDFs
Predictions
GARFIELD Prediction of target pharmacology based on the
Similar Ensemble Approach
eTOX connector Automatic extraction of data for building
predictive toxicology models in eTOX project
43. Becoming part of the Open PHACTS Foundation
Members
membership offers early access to platform updates and releases
the opportunity to steer research and development directions
receive technical support
work with the ecosystem of developers and semantic data
integrators around Open PHACTS
tiered membership
familiar business and governance model
A UK-based not-for-profit
member owned company
44. What are the problems with licensing we had to address?
– To make data and software generated by the project usable/ reusable
– Multiplicity of unclear or non-standard licenses on original data sources
• „Public‟ can mean use but not redistribute, use in commercial environment,
• Legal position on use and reuse extremely unclear
• Different issues than just linking to data
– Legal status of integrated collections of the above, and of derived
knowledge?
– Appropriate software license selection
– Legal clarity for EFPIA and end users
– Approaches for commercial data integration, EFPIA in-house data
AIM: enable maximum possible dissemination and usability of integrated
data and architecture with approaches that will be applicable in other data
integration projects
Licensing Challenges
45. Chose John Wilbanks as consultant
A framework built around STANDARD well-understood
Creative Commons licences – and how they interoperate
Deal with the problems by:
Interoperable licences
Appropriate terms
Declare expectations to users and data publishers
One size won„t fit all requirements
Data Licensing Solution
46. Open PHACTS Project Partners
Pfizer Limited – Coordinator
Universität Wien – Managing entity
Technical University of Denmark
University of Hamburg, Center for Bioinformatics
BioSolveIT GmBH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität Bonn
AstraZeneca
GlaxoSmithKline
Esteve
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
OpenLink
pmu@openphacts.org
Editor's Notes
Work Stream 1: Open Pharmacological Space (OPS) Service LayerStandardised software layer to allow public DD resource integrationDefine standards and construct OPS service layerDevelop interface (API) for data access, integration and analysisDevelop secure access modelsExisting Drug Discovery (DD) Resource IntegrationModify existing public DD resources to operate with OPS service layerIMI funds cost of modificationsConsider partnership with international resourcesWS2: Development of exemplar work packagesDevelop exemplar services to test OPS Service Layer proof of concept of the functionality developed in work stream 1services will depend on the expertise of the consortiumSome possible exemplar services:Target Dossier (Data Integration)Integrate target info from diverse sourcese.g. bioactive molecules, druggability, orthology, drug expression signatures Pharmacological Network Navigator (Data Visualisation)Profile “chemical space” of screening librariesVisualisation to assist lead hoppingCompound Dossier (Data Analysis)Integrate compound informationAlgorithms to predict drug action
Mx/psa, how calculated who did it?Mash up. With your data too,- top layer join together but need them allcommerical
10Can go get everythingOPS not a repo of the world, specific sources