
Building support for the semantic web for chemistry at the Royal Society of Chemistry



The Royal Society of Chemistry provides a variety of databases and services covering multiple domains of Chemistry. That includes our electronic publishing platform, ChemSpider and its related databases, the National Chemistry Database and digital access to the RSC archive that spans over 170 years. In order to support the rising tide of semantic web technologies we are now working on exposing our data to conform with the linked data paradigm. This presentation will provide an overview of our work to introduce semantic structure to all RSC electronic resources as well as outlining ways to access this information using standard formats and various APIs.

Published in: Technology, Education


  1. 1. Building support for the semantic web for chemistry at the Royal Society of Chemistry. Presented by Karen Karapetyan, Colin Batchelor, Jonathan Steele, David Sharpe, Valery Tkachenko, Antony Williams. ACS Indianapolis, September 2013.
  2. 2. Open PHACTS is an Innovative Medicines Initiative (IMI) project that aims to reduce the barriers to drug discovery in industry, academia and small businesses. The semantic web is one of its cornerstones.
  3. 3. Data: ChEMBL, HMDB, DrugBank
      Chemistry Validation and Standardization Platform (CVSP) at
      • Validation
      • Standardization
      • Parent generation
      • Run on Hadoop-based farm
      RDF Export
  4. 4. CVSP: chemical validation
      Free chemistry validation platform that performs:
      • Structure validation
        • Atoms
        • Bonds
        • Valence
        • Stereo
        • If aromatic, check that the structure can be dearomatized uniquely
        • Strongest acid not ionized first in a partially ionized system
      • Cross-matching of SDF fields
        • Synonyms
        • InChIs
        • SMILES
  5. 5. Input formats supported: CDX, MOL, SDF; Zip and Gz archives; tab-delimited text files
  6. 6. CVSP: standardization modules • Custom processing lets the user assemble a workflow from a list of predefined standardization modules
  7. 7. Data set examples
      • ChemSpider (100K records passed through CVSP)
        • All records are planned to pass through CVSP
      • DrugBank (~6.5K records)
      • ChEMBL (~1.2 million records)
  8. 8. ChemSpider issues
  9. 9. DrugBank dataset (6516 records): ~60 records that can’t be dearomatized unambiguously. Examples: DB04283, DB04462
  10. 10. ~30 records with bonds that do not make sense. Examples: DB04283, DB04009
  11. 11. 2 records where SMILES, InChI, and name did not match the structure: DB00611, DB01547
  12. 12. ~40 records where InChIs did not match the structure
      DrugBank ID: DB00755
      InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
      DrugBank ID: DB00614
  13. 13. 7 records with 2 stereo bonds at chiral atoms. Examples: DB08128, DB06287. Reference: J. Brecher, IUPAC Graphical Representation of Stereochemical Configuration, Section ST-1.1.10
  14. 14. CVSP validation of ChEMBL 16 (~1.3 million records)
      • Overall, 0.7% of records had validation issues
      • Stereo problems (~82%)
        • Directions of bonds do not make sense (~63%)
        • Ambiguous stereo: 2 stereo bonds at chiral center (~19%)
  15. 15. “Direction of bond makes no sense” – 63%
  16. 16. “Stereo types of the opposite bonds mismatch” – 15%
  17. 17. “Stereo types of non-opposite bonds match” – 2%
  18. 18. “Atom not recognized” – 3%
      • Isotopes
      • Should be atom from periodic table
      • No mass difference in atom line
      • No “M ISO” in connection table
      In molfile:
  19. 19. CVSP: standardization
      • Standardization workflow was developed for the Open PHACTS registration system
      • Workflow includes modules like
        • SMIRKS rules derived from FDA SRS manual
        • Resetting symmetric stereo
        • Dearomatize
        • Layout
        • Fix “fixable” stereo issues
        • Disconnect all metals from N, O, F
        • Fold non-stereo hydrogens
        • Handle partial ionization of acid-base
        • etc.
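The idea of assembling a workflow from predefined modules (slides 6 and 19) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the record is a toy dict, the module bodies are placeholders, and the names `MODULES` and `run_workflow` are hypothetical rather than part of CVSP.

```python
from typing import Callable, Dict, List

# Toy stand-in for a chemical record; real CVSP modules operate on
# chemical structures, which this sketch does not attempt to model.
Record = Dict[str, object]
Module = Callable[[Record], Record]


def dearomatize(rec: Record) -> Record:
    # Placeholder: only flips a flag instead of assigning explicit bond orders.
    return {**rec, "aromatic": False}


def fold_non_stereo_hydrogens(rec: Record) -> Record:
    # Placeholder: drops the count of explicit non-stereo hydrogens to zero.
    return {**rec, "explicit_h": 0}


# Hypothetical registry of predefined modules, keyed by name.
MODULES: Dict[str, Module] = {
    "dearomatize": dearomatize,
    "fold_non_stereo_hydrogens": fold_non_stereo_hydrogens,
}


def run_workflow(rec: Record, module_names: List[str]) -> Record:
    """Apply the named modules in the order the user listed them."""
    for name in module_names:
        rec = MODULES[name](rec)  # an unknown module name raises KeyError
    return rec


result = run_workflow(
    {"aromatic": True, "explicit_h": 4},
    ["dearomatize", "fold_non_stereo_hydrogens"],
)
```

Keeping modules as plain functions in a registry means a custom workflow is just an ordered list of names, which is one simple way to let a user compose steps without writing code.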
  20. 20. Open PHACTS chemical registry system: what do we use as chemical identity?
      • Standard InChI/InChIKey (currently used by ChemSpider)
      • Absolute SMILES (isomeric canonical)
      Drawbacks
      • SMILES: many flavors
      • Standard InChI
        • does not include unknown/undefined stereo unless at least one defined stereo is present
        • does not distinguish between undefined and unknown stereo (always “?”)
        • does some basic tautomer canonicalization, which we wanted to prevent in order to distinguish between all tautomers (sometimes useful for linking spectral data to a specific tautomer)
        • assumes absolute stereo or no stereo at all
      Path we took: non-standard InChI with options SUU SLUUD FixedH SUCF
      • Always include unknown/undefined stereo (‘u’, ‘?’)
      • Add Fixed H layer (to distinguish between tautomers)
      • Use chiral flag in MOL/SD record (ON: absolute stereo, OFF: relative)
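A standard InChI and a non-standard one, as discussed on slide 20, can be told apart from the string itself: standard identifiers start with `InChI=1S/`, non-standard ones with `InChI=1/`, and a fixed-H layer appears as a layer beginning with `f`. The sketch below relies only on those prefix conventions. The function name `describe_inchi` is hypothetical, and the non-standard acetic acid string is an illustrative example that has not been regenerated with the InChI software here.

```python
def describe_inchi(inchi: str) -> dict:
    """Report whether an InChI is standard and whether it has a fixed-H layer."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, _, layers = inchi[len("InChI="):].partition("/")
    layer_list = layers.split("/")
    return {
        # Standard InChI carries an 'S' after the version number.
        "standard": version.endswith("S"),
        # Skip the first layer (the formula); the fixed-H layer starts with 'f'.
        "has_fixed_h_layer": any(l.startswith("f") for l in layer_list[1:]),
    }


# Standard InChI quoted on slide 12 for DrugBank ID DB00755.
standard = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-"
    "20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)"
    "/b9-6+,12-11+,15-8+,16-14+"
)
# Illustrative non-standard InChI with a fixed-H layer (assumed example).
non_standard = "InChI=1/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/f/h3H"

standard_info = describe_inchi(standard)
non_standard_info = describe_inchi(non_standard)
```

A check like this does not validate the identifier; it only shows which kind of InChI a registry entry carries, which matters when identifiers from different sources are compared.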
  21. 21. For each compound (CSID), parent generation is attempted. Reference: “Tautomerism in large databases”, Sitzmann et al., J. Comput. Aided Mol. Des. (2010)
      Parent: Description
      • Charge-Insensitive: An attempt is made to neutralize ionized acids and bases. Envisioned to be an ongoing improvement as new cases appear.
      • Isotope-Insensitive: Isotopes are replaced by the common weight.
      • Stereo-Insensitive: Stereo is stripped.
      • Tautomer-Insensitive: Tautomer canonicalization attempts to generate a “reasonable” tautomer.
      • Super-Insensitive: This parent is all of the above.
      There is no fragment-insensitive parent: we treat all fragments as equal entities.
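The relationship between the parent types on slide 21 can be illustrated with a toy model in which each parent strips one aspect of a compound and the super parent applies all of them. Everything here is an assumption made for illustration: `ToyCompound` and its fields are hypothetical stand-ins, and real parent generation works on chemical structures rather than on labelled fields.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyCompound:
    skeleton: str                        # stand-in for the connection table
    charge: int = 0
    isotopes: Tuple[str, ...] = ()       # e.g. ("13C",)
    stereo: Tuple[str, ...] = ()         # e.g. ("R",)
    tautomer: str = "as-drawn"


def charge_parent(c: ToyCompound) -> ToyCompound:
    # Toy version of "attempt to neutralize ionized acids and bases".
    return replace(c, charge=0)


def isotope_parent(c: ToyCompound) -> ToyCompound:
    # Isotope labels are dropped, i.e. replaced by the common weight.
    return replace(c, isotopes=())


def stereo_parent(c: ToyCompound) -> ToyCompound:
    # Stereo is stripped.
    return replace(c, stereo=())


def tautomer_parent(c: ToyCompound) -> ToyCompound:
    # Placeholder for tautomer canonicalization.
    return replace(c, tautomer="canonical")


def super_parent(c: ToyCompound) -> ToyCompound:
    """Apply every parent transformation in turn."""
    for step in (charge_parent, isotope_parent, stereo_parent, tautomer_parent):
        c = step(c)
    return c


compound = ToyCompound("X", charge=-1, isotopes=("13C",), stereo=("R",))
parent = super_parent(compound)
```

In this toy model two compounds share a super parent exactly when they differ only in charge, isotopes, stereo, or tautomeric form, which is the grouping the parent hierarchy is meant to expose.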
  22. 22. Deposited SDF record (CTAB, REGID1, DataSource, Synonym1, Synonym2, XRef1, etc.)
      Standardized entity (OPS_ID1)
      Fragments: Fragment (OPS_ID2), Fragment (OPS_ID3)
      Parents: Stereo Parent (OPS_ID4), Isotope Parent (OPS_ID5), Tautomer Parent (OPS_ID6), Charge Parent (OPS_ID7), Super Parent (OPS_ID8)
  23. 23. Data
      Chemistry Validation and Standardization Platform (CVSP) at
      • Validation
      • Standardization
      • Parent generation
      RDF Export
  24. 24. Data is being imported from ChemSpider to Open PHACTS in RDF/Turtle
  25. 25. RDF/VoID
      – VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data.
        • skos:exactMatch (SKOS: Simple Knowledge Organisation System), e.g. to link compounds in OPS with compounds in ChEBI.
        • skos:closeMatch, e.g. to link Stereo Insensitive Parents to their Children within OPS.
        • skos:relatedMatch, e.g. to link Parent compounds that contain others as Fragments.
      – Recommendations on how to create the VoID have been specified by Manchester here:
  26. 26. Diagram nodes: OPS1 (DrugBank ID DB07241), OPS2, OPS3, OPS4, OPS5, OPS6
      ops:OPS1 skos:exactMatch <http://www4.wiwiss.fu-> .
      ops:OPS2 skos:relatedMatch ops:OPS1 .
      ops:OPS3 skos:relatedMatch ops:OPS1 .
      ops:OPS3 skos:closeMatch ops:OPS4 .
      ops:OPS3 skos:closeMatch ops:OPS5 .
      ops:OPS4 skos:closeMatch ops:OPS6 .
      ops:OPS5 skos:closeMatch ops:OPS6 .
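The internal links on slide 26 can be written out as Turtle with a few lines of Python, shown here as a rough sketch. The `ops:` namespace URI is a placeholder (the real one is not given on the slides), the helper names are hypothetical, and the external `skos:exactMatch` link is left out because its target URL is truncated in the source. Only the SKOS namespace is the published one.

```python
from typing import List, Tuple

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
OPS_NS = "http://example.org/ops/"  # placeholder namespace, an assumption

# (subject, SKOS mapping property, object) for the links within OPS.
LINKS: List[Tuple[str, str, str]] = [
    ("OPS2", "relatedMatch", "OPS1"),
    ("OPS3", "relatedMatch", "OPS1"),
    ("OPS3", "closeMatch", "OPS4"),
    ("OPS3", "closeMatch", "OPS5"),
    ("OPS4", "closeMatch", "OPS6"),
    ("OPS5", "closeMatch", "OPS6"),
]


def to_turtle(links: List[Tuple[str, str, str]]) -> str:
    """Serialize the links as prefixed Turtle triples."""
    header = [f"@prefix ops: <{OPS_NS}> .", f"@prefix skos: <{SKOS_NS}> .", ""]
    body = [f"ops:{s} skos:{p} ops:{o} ." for s, p, o in links]
    return "\n".join(header + body)


def close_matches(links: List[Tuple[str, str, str]], subject: str) -> List[str]:
    """Return the objects linked from `subject` via skos:closeMatch."""
    return [o for s, p, o in links if s == subject and p == "closeMatch"]


turtle = to_turtle(LINKS)
ops3_close = close_matches(LINKS, "OPS3")
```

Printing `turtle` gives two prefix declarations followed by one triple per link. A production export would more likely use an RDF library than string formatting, so treat this only as a way to read the slide's triples.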
  27. 27. Future work
      Enabling full semantic web capabilities:
      • Establish an RDF server with all relationships (including parent-child relationships)
      • Develop SPARQL capability for querying RDF
      Validate all records in ChemSpider by passing them through CVSP