Building linked-data, large-scale chemistry
platform: challenges, lessons and solutions
Valery Tkachenko, Alexey Pshenichnov, Aileen Day,
Colin Batchelor, Peter Corbett
Royal Society of Chemistry
ACS Spring 2016
San Diego, CA
March 13th 2016
ChemSpider – 2007 - 2011
OpenPHACTS – 2011 - 2014
Chemistry Data Platform – 2014 - …
• 45 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our
journals and our collaborators
• A structure centric hub for web-searching
ChemSpider
Chemical vendors and datasources
ChemSpider
Properties - experimental
Literature and patents references
Classification
Spectra
Multimedia
Tagging
ChemSpider - Summary
• Simple, flattish data model
• InChI as a primary identifier
• Linked by synonyms
• Linked by “ExtId”
• Standard searches (identity, substructure,
similarity)
• Very little semantics
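
The flat model summarised above can be pictured as one record per structure, reachable by identifier, synonym or external ID. The following is only a minimal sketch under stated assumptions: the class names, the field layout and the `ExampleVendor` / `EV-0001` external ID are hypothetical and do not reflect the actual ChemSpider schema. The InChIKey is the one quoted on the Gleevec slide later in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """One flat record per structure, keyed by InChIKey (hypothetical layout)."""
    inchikey: str
    synonyms: set[str] = field(default_factory=set)
    ext_ids: dict[str, str] = field(default_factory=dict)  # source name -> external ID


class FlatCompoundIndex:
    """In-memory lookup tables linking records by synonym and by external ID."""

    def __init__(self):
        self._by_key = {}
        self._by_synonym = {}
        self._by_ext_id = {}

    def add(self, record):
        self._by_key[record.inchikey] = record
        for name in record.synonyms:
            # Synonyms are not unique, so each one maps to a set of keys.
            self._by_synonym.setdefault(name.lower(), set()).add(record.inchikey)
        for source, ext_id in record.ext_ids.items():
            self._by_ext_id[(source, ext_id)] = record.inchikey

    def find_by_synonym(self, name):
        """Return the sorted InChIKeys of all records sharing this synonym."""
        return sorted(self._by_synonym.get(name.lower(), set()))

    def find_by_ext_id(self, source, ext_id):
        """Return the InChIKey registered for (source, ext_id), or None."""
        return self._by_ext_id.get((source, ext_id))


index = FlatCompoundIndex()
index.add(CompoundRecord(
    inchikey="YLMAHDNUQAMNNX-UHFFFAOYSA-N",
    synonyms={"Gleevec", "Imatinib mesylate"},
    ext_ids={"ExampleVendor": "EV-0001"},  # hypothetical vendor and ID
))
print(index.find_by_synonym("gleevec"))
```

Everything here is a plain lookup with no typed relationships between records, which is one way to read "very little semantics".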
Open PHACTS Mission:
Integrate Multiple Research
Biomedical Data Resources
Into A Single Open & Sustainable
Access Point
OpenPHACTS: 2011-2014
info@openphactsfoundation.org @Open_PHACTS
Open PHACTS Practical Semantics
OpenPHACTS
GlaxoSmithKline – Coordinator
Universität Wien – Managing entity
Technical University of Denmark
University of Hamburg, Center for Bioinformatics
BioSolveIT GmbH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
Esteve
Almirall
OpenLink
Scibite
The Open PHACTS Foundation
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität Bonn
AstraZeneca
Pfizer
Why is it so hard to…
• Competitors?
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known Pathways?
• Working On Now?
• Connections to disease?
• Expressed in right cell type?
• IP?
@gray_alasdair Big Data Integration 18
OpenPHACTS Discovery Platform
[Architecture diagram; labels as extracted]
• Apps
• Linked Data API (RDF/XML, TTL, JSON)
• Domain Specific Services
• Core Platform
• Semantic Workflow Engine
• Data Cache (Virtuoso Triple Store)
• Identity Resolution Service
• Identifier Management Service
• Chemistry Registration, Normalisation & Q/C
• Indexing
• Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
• Data source labels: Db, Nanopub, VoID
• Content: Public Content, Commercial, Public Ontologies, User Annotations
21 October 2014 Scientific Lenses – A. J. G. Gray 19
Gleevec®: Imatinib Mesylate
DrugBank ChemSpider PubChem
Imatinib
Mesylate Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
skos:exactMatch
(InChI)
Strict Relaxed
Analysing Browsing
Structure Lens
I need to compute an analysis, give me
details of the active compound in Gleevec.
Commercial ibuprofen is a racemic mixture containing the same proportion
of two chiral forms. Both chiral forms are equally active. Typically, the user
will wish to retrieve info for any stereoisomer.
CHEMBL427526
CHEMBL521
CHEMBL175
Lens Effects: Ibuprofen
Default Lens
Stereoisomer Lens
Mapping Generation
ops:OPS437281
✔
ops:OPS380297
has_stereoundefined_parent
[ci:CHEMINF_000456]
ops:OPS380297
is_stereoisomer_of
[ci:CHEMINF_000461]
Other relationships
• has part
• is tautomer of
• uncharged counterpart
• isotope
…
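
One way to read the lens idea is as a filter over typed links: a lens names the relationship types that count as "the same compound" for a given task, and a lookup is expanded only along those links. The sketch below is illustrative only. The identifiers are placeholders rather than real OPS identifiers, the lens definitions are assumptions, and treating every allowed relationship as symmetric and transitive is a simplification rather than a description of the Open PHACTS implementation.

```python
# Illustrative link table: (compound_a, relationship, compound_b).
# All identifiers are placeholders, not real OPS identifiers.
LINKS = [
    ("cmpd:racemate", "exact_match", "cmpd:racemate_copy"),
    ("cmpd:S_form", "has_stereoundefined_parent", "cmpd:racemate"),
    ("cmpd:R_form", "has_stereoundefined_parent", "cmpd:racemate"),
    ("cmpd:S_form", "is_stereoisomer_of", "cmpd:R_form"),
    ("cmpd:salt", "has_part", "cmpd:racemate"),
]

# A lens is modelled as the set of relationship types it allows (assumed definitions).
LENSES = {
    "default": {"exact_match"},
    "stereoisomer": {
        "exact_match",
        "has_stereoundefined_parent",
        "is_stereoisomer_of",
    },
}


def expand(compound, lens):
    """Return every compound reachable from `compound` via links the lens allows."""
    allowed = LENSES[lens]
    seen = {compound}
    frontier = [compound]
    while frontier:
        current = frontier.pop()
        for a, relationship, b in LINKS:
            if relationship not in allowed:
                continue
            # Links are followed in both directions (a simplifying assumption).
            if a == current:
                neighbour = b
            elif b == current:
                neighbour = a
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen


print(sorted(expand("cmpd:racemate", "default")))
print(sorted(expand("cmpd:racemate", "stereoisomer")))
```

Under the stricter lens only exact matches are merged; the relaxed lens also pulls in the individual stereoisomers, which matches the ibuprofen use case described on the earlier slides.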
OpenPHACTS UI
http://explorer.openphacts.org/
Explorer Screenshot
OpenPHACTS - Summary
• Principal difference – inter-domain links
• More complex, but still structure-centric
data model
• Ontological relationships introduced
• Chemical Lenses – new type of search
Chemistry Data Platform – 2014 - …
Dimensions and complexity of science
RSC Archive – since 1841
Digitally Enabling RSC Archive
ChemSpider Synthetic Pages
Compounds
Reaction
Analytical Data
Text and References
RSC Databases
RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…
Compounds domain
Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and
private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
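
To make "automated quality control" concrete, here is a toy validation rule of the kind such a system might run. It is a sketch under stated assumptions: the `Atom` type, the rule and the message format are hypothetical and are not taken from CVSP, which works on full structure records rather than bare atom lists.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """Minimal stand-in for an atom in a structure record (hypothetical type)."""
    element: str
    formal_charge: int = 0


def net_charge(atoms):
    """Sum the formal charges over all atoms in the record."""
    return sum(atom.formal_charge for atom in atoms)


def validate(atoms):
    """Return a list of issue messages; an empty list means no issue was found."""
    issues = []
    if not atoms:
        issues.append("empty structure")
        return issues
    total = net_charge(atoms)
    if total != 0:
        issues.append(f"charge imbalance: net formal charge {total:+d}")
    return issues


# A salt with both ions present is balanced; a record missing its counterion is flagged.
balanced = [Atom("Na", +1), Atom("O", -1), Atom("C"), Atom("C"), Atom("O")]
unbalanced = [Atom("N", +1), Atom("C"), Atom("C")]
print(validate(balanced))
print(validate(unbalanced))
```

A rule like this only flags the record; whether a net charge is an error or a legitimate ion is a curation decision the sketch does not attempt.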
Chemistry Validation and Standardization Platform
Reactions domain
Analytical data domain
Crystallography domain
Chemistry Data Platform - Summary
• Simplified models within domain
• Each domain is described with its own models
with embedded semantics
• No proper domain-specific identifiers
• Extensive quality control – CVSP (DOI
10.1186/s13321-015-0072-8)
There is no way back
Thank you
Email: tkachenkov@rsc.org
Slides:
http://www.slideshare.net/valerytkachenko16


Editor's Notes

  • #17 Remember this: some of these questions are easier to answer than others
  • #19 Open PHACTS was developed to support the key questions of drug discovery. Business questions have been at the heart of Open PHACTS and have driven the development of the platform. Mx/psa: how calculated, who did it? Mash up, with your data too; top layer join together but need them all commercial. Data provided by many publishers. Originally in many formats: relational, SD files and RDF. Worked closely with publishers. Data licensing was a major issue. Over 5 billion triples; 14 datasets and growing. Hosted on beefy hardware; data in memory (aim). Extensive memcaching. Pose complex queries to extract data.
  • #20 Import data into cache. API calls populate SPARQL queries. Integration approach: data kept in original model; data cached in central triple store; API call translated to SPARQL query; query expressed in terms of original data; queries expanded by IMS to cover URIs of original datasets.
  • #21 Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases gives different results.
  • #22 Interested in physicochemical properties of Gleevec
  • #26 Validate structure: source data is messy! Identify common problems: charge imbalance, stereochemistry. Compute physicochemical properties. Identify related properties based on structure. 17 relationship types.
  • #29 Pharmacology count 2370 → 3044
  • #32 What about science and chemistry in particular?
  • #43 Information typically associated with reactions
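
Note #20 describes API calls being translated into SPARQL queries that IMS then expands to cover the URIs used by the original datasets. A minimal sketch of that expansion step might look like the following. The mapping table, every `example.org` URI and the function names are hypothetical stand-ins; the real service and query templates are not shown in these slides.

```python
# Hypothetical mapping table standing in for the IMS lookup described in note #20.
EQUIVALENT_URIS = {
    "http://example.org/compound/1": [
        "http://example.org/sourceA/abc",
        "http://example.org/sourceB/123",
    ],
}


def equivalent_uris(uri):
    """Return the input URI followed by any URIs mapped to it (assumed lookup)."""
    return [uri] + EQUIVALENT_URIS.get(uri, [])


def build_query(uri, predicate):
    """Build a SPARQL SELECT that matches any URI equivalent to `uri`."""
    values = " ".join(f"<{u}>" for u in equivalent_uris(uri))
    return (
        "SELECT ?s ?value WHERE {\n"
        f"  VALUES ?s {{ {values} }}\n"
        f"  ?s <{predicate}> ?value .\n"
        "}"
    )


print(build_query(
    "http://example.org/compound/1",
    "http://example.org/vocab/molecularWeight",  # hypothetical predicate
))
```

Because the data stay in their original models, the expansion happens in the query (here through a SPARQL 1.1 `VALUES` clause) rather than by rewriting the stored triples.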