
The RSC chemical validation and standardization platform, a potential path to quality-conscious databases



High-quality chemical databases are struggling to protect their data from the flow of wild machine-generated chemistry and lower-quality data. The period of primarily human curation prior to deposition in a database is gone, and quality-conscious databases need to rely heavily on automated validation checks. An automated chemical validation system is being developed by the cheminformatics team at the Royal Society of Chemistry to be the “quality gatekeeper” of databases at the point of deposition. ChemSpider is leading a community-wide standardization approach, starting with our support of the Open PHACTS semantic web project, an Innovative Medicines Initiative project. The Chemical Validation and Standardization Platform (CVSP) is being designed as an open, flexible platform that validates and standardizes chemical records. This presentation will review the existing beta version of the system and work in progress.

Published in: Technology


  1. Chemistry Validation and Standardization Platform
     Modularization and “Hadoop”ization
     Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
     ACS New Orleans, April 2013
  2. Overview
     • Motivation
     • What we support
     • Modularization
     • Parallelization
     • Examples
  3. Motivation: validation
     Open and free chemical validation system for:
     • Structure validation
       – Warn on query atoms, pseudo atoms, polymers, etc.
       – Nonsensical stereo
     • SDF field mapping for validating depositor-provided names, InChI, SMILES
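The structure and field checks named on this slide can be illustrated with a minimal sketch. It assumes V2000 SD records, an assumed set of query and pseudo atom symbols (`QUERY_SYMBOLS`), and hypothetical field names; it is not CVSP's implementation. Comparing a depositor-provided InChI or SMILES against the structure would additionally need a cheminformatics toolkit, which is left out here.

```python
import re

# Assumed set of V2000 atom-block symbols treated as query or pseudo atoms.
QUERY_SYMBOLS = {"A", "Q", "*", "L", "R", "R#"}


def parse_sd_record(record):
    """Return (atom symbols, data fields) for one V2000 SD record."""
    lines = record.splitlines()
    n_atoms = int(lines[3][0:3])  # counts line: atom count is in the first 3 columns
    symbols = [lines[4 + i][31:34].strip() for i in range(n_atoms)]
    fields = {}
    for i, line in enumerate(lines[:-1]):
        header = re.match(r">\s*<([^>]+)>", line)
        if header:
            # Only the first data line of each field is kept in this sketch.
            fields[header.group(1)] = lines[i + 1].strip()
    return symbols, fields


def validate_record(record, required_fields=("SMILES",)):
    """Collect human-readable issues for one record."""
    symbols, fields = parse_sd_record(record)
    issues = []
    for index, symbol in enumerate(symbols, start=1):
        if symbol in QUERY_SYMBOLS:
            issues.append(f"atom {index}: query or pseudo atom '{symbol}'")
    for name in required_fields:
        if name not in fields:
            issues.append(f"missing field '{name}'")
    return issues


def _atom_line(symbol, x):
    # Fixed-width V2000 atom line: three 10-column coordinates, a space, the symbol.
    return f"{x:10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3} 0  0  0  0  0"


# Hypothetical two-atom record with an R-group atom and a SMILES data field.
EXAMPLE_RECORD = "\n".join([
    "example",
    "  hypothetical-tool",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    _atom_line("C", 0.0),
    _atom_line("R#", 1.5),
    "  1  2  1  0",
    "M  END",
    "> <SMILES>",
    "C*",
    "",
    "$$$$",
])

issues = validate_record(EXAMPLE_RECORD, required_fields=("SMILES", "InChI"))
```

For the example record, `issues` lists the R-group atom and the missing `InChI` field; a real checker would report these as warnings or errors according to its own severity rules.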
  4. Motivation: standardization
     Allows users to use CVSP default standardization workflow (or FDA, Open PHACTS and so on)
     Allows users to put together their own workflow using modules provided:
     • Apply default CVSP or user-defined SMIRKS rules
     • Layout
     • Neutralize
     • Get canonical tautomer using ChemAxon’s algorithms
     • Get biggest organic fragment
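The idea of assembling a workflow from modules can be sketched as an ordered list of functions applied to each record. Everything below is an assumption for illustration: the function names are hypothetical, records are plain SMILES strings, the fragment picker simply counts carbon atoms, and the "neutralize" step is a toy string rule for one carboxylate pattern. CVSP's real modules (SMIRKS rules, layout, ChemAxon tautomers) are not reproduced.

```python
import re

# One SMILES token: a bracket atom, a two-letter organic-subset atom, or a single letter.
_TOKEN = re.compile(r"\[([^\]]*)\]|Cl|Br|[A-Za-z]")
# Inside brackets: optional isotope digits, then the element symbol.
_BRACKET_SYMBOL = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")


def carbon_count(fragment):
    """Count aliphatic and aromatic carbon atoms in a SMILES fragment."""
    count = 0
    for match in _TOKEN.finditer(fragment):
        token = match.group(0)
        if token in ("C", "c"):
            count += 1
        elif token.startswith("["):
            symbol = _BRACKET_SYMBOL.match(match.group(1))
            if symbol and symbol.group(1) in ("C", "c"):
                count += 1
    return count


def largest_organic_fragment(smiles):
    """Keep the dot-separated fragment with the most carbon atoms (crude heuristic)."""
    return max(smiles.split("."), key=carbon_count)


def neutralize_carboxylate_toy(smiles):
    """Toy rule: protonate carboxylates written exactly as C(=O)[O-]."""
    return smiles.replace("C(=O)[O-]", "C(=O)O")


def run_workflow(smiles, modules):
    """Apply each module in order; the output of one is the input of the next."""
    for module in modules:
        smiles = module(smiles)
    return smiles


# A user-defined workflow is just a different list of modules.
EXAMPLE_WORKFLOW = [largest_organic_fragment, neutralize_carboxylate_toy]
result = run_workflow("CC(=O)[O-].[Na+]", EXAMPLE_WORKFLOW)
```

Here sodium acetate is reduced to its organic fragment and then written as the neutral acid. Reordering or swapping entries in `EXAMPLE_WORKFLOW` is the sketch's stand-in for composing a custom workflow.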
  5. What we support
     • SD files and mol files
     • ChemDraw files (in-house code)
     • Tab-delimited text files of names, InChIs, SMILES
     • Zipped files
     • GZipped files
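Handling plain, zipped and gzipped uploads through one code path might look like the following sketch. It assumes the compression type can be inferred from the file extension and that tab-delimited files carry a header row; the column names `name` and `smiles` are hypothetical.

```python
import csv
import gzip
import io
import zipfile


def read_text_members(filename, data):
    """Yield (member name, text) for plain, gzipped or zipped input bytes."""
    lowered = filename.lower()
    if lowered.endswith(".gz"):
        yield filename[:-3], gzip.decompress(data).decode("utf-8")
    elif lowered.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for name in archive.namelist():
                if not name.endswith("/"):  # skip directory entries
                    yield name, archive.read(name).decode("utf-8")
    else:
        yield filename, data.decode("utf-8")


def parse_tab_delimited(text):
    """Parse a tab-delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Usage: the same table, once gzipped and once inside a zip archive.
table = "name\tsmiles\nethanol\tCCO\n"
gz_bytes = gzip.compress(table.encode("utf-8"))

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("compounds.txt", table)

from_gz = [(n, parse_tab_delimited(t)) for n, t in read_text_members("compounds.txt.gz", gz_bytes)]
from_zip = [(n, parse_tab_delimited(t)) for n, t in read_text_members("upload.zip", zip_buffer.getvalue())]
```

Both paths yield the same parsed rows, so downstream validation does not need to know how the file was packaged.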
  6. CVSP: modularization
  7. Reusable workflows
  8. SMIRKS-based rules
  9. “Hadoop”ization
     Apache Hadoop is a framework for the distributed processing of large datasets across clusters of computers.
     CVSP is written in C#. To run it on Linux machines we use Mono (a cross-platform .NET runtime environment).
     Farm:
     • 28 CPU cores
     • 42 GB memory
     • 2 TB disk space
     Processor intensive tasks
     • Tautomerization
  10. Workflow diagram: Input file → Convert to SD format → Deposit ID in database → Upload to farm for Hadoop processing → Processing on Hadoop → Upload results to database for user preview → Download results
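Converting every submission to SD format before cluster processing makes it easy to cut the input into independent pieces, since SD files end each record with a `$$$$` line. The sketch below shows that splitting step only; the batch size and function names are hypothetical, and nothing here is taken from CVSP's actual Hadoop job.

```python
def split_sd_records(sd_text):
    """Split SD file text into individual records, each ending with '$$$$'."""
    records = []
    current = []
    for line in sd_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    # Keep a trailing record that lacks the terminator, if it has any content.
    if any(line.strip() for line in current):
        records.append("\n".join(current) + "\n")
    return records


def make_batches(records, batch_size):
    """Group records into lists of at most batch_size for independent processing."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


# Usage with three placeholder records (structure blocks omitted for brevity).
sd_text = "mol1\n$$$$\nmol2\n$$$$\nmol3\n$$$$\n"
records = split_sd_records(sd_text)
batches = make_batches(records, batch_size=2)
```

Each batch can then be handed to a separate worker, because no record depends on another.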
  11. Hadoop queues
      Three Hadoop queues (capacity queues) are used to prioritize CVSP submissions by size:
      • “Small” submission queue for submissions under 500 records
      • Large submissions queue
      • Internal queue
        – For internal projects, e.g. tautomer analysis of ChemSpider or ChemSpider standardization
      All records have to be processed on Hadoop before the user can see the results (no partial preview)
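The routing rule described on this slide reduces to a small decision function. The 500-record threshold comes from the slide; the queue names and the `internal` flag are hypothetical labels chosen for this sketch.

```python
SMALL_LIMIT = 500  # from the slide: "small" submissions are under 500 records


def choose_queue(n_records, internal=False):
    """Pick a queue name for a submission (queue names are illustrative)."""
    if internal:
        return "internal"
    return "small" if n_records < SMALL_LIMIT else "large"
```

Keeping small submissions in their own queue means a quick upload is not stuck behind a very large one.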
  12. Examples
      DrugBank
      • ~6500 records, approximately 2 records per second
      PubMed
      • ~100 000 records, about 9 h
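The two figures are quoted in different units, so a short calculation helps compare them. This sketch uses only the rounded numbers from the slide, so the results are rough estimates.

```python
def duration_minutes(records, rate_per_second):
    """Total processing time in minutes at a given throughput."""
    return records / rate_per_second / 60


def throughput_per_second(records, hours):
    """Average throughput in records per second."""
    return records / (hours * 3600)


drugbank_minutes = duration_minutes(6500, 2)        # about 54 minutes
pubmed_rate = throughput_per_second(100_000, 9)     # about 3.1 records per second
```

On these numbers the larger dataset ran at a somewhat higher average rate, roughly 3 records per second against 2.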
  13. Rate-limiting step?
      Canonical tautomerization
      This molecule took 45 min to canonicalize.
  14. DrugBank dataset (6516 records)
      Errors
      • 2 records with query (any) bond
      • 2 records with R groups
      • 3 polymers
      • 18 porphyrins with metal coordinated inside with one of the metal-nitrogen bonds stereogenic
      • Unusual valence: ~20
      Warnings
      • InChI not matching structure (100+)
      • SMILES not matching structure (100+)
  15. DrugBank ID: DB00755
      InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
      DrugBank ID: DB00614
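The InChI shown for DB00755 carries its double-bond stereo in the `/b` layer. A minimal sketch of pulling that layer apart is given below; it assumes each layer prefix occurs only once in the string, which holds for this example but not for every InChI, and it reports the parity marks as written without interpreting them.

```python
import re


def inchi_layers(inchi):
    """Split an InChI into a dict keyed by layer prefix (simplified)."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]  # first character is the layer prefix
    return layers


def double_bond_stereo(inchi):
    """Return (atom, atom, parity mark) tuples from the /b layer, if present."""
    layer = inchi_layers(inchi).get("b", "")
    bonds = []
    for item in filter(None, layer.split(",")):
        match = re.fullmatch(r"(\d+)-(\d+)([-+?u])", item)
        if match:
            bonds.append(match.groups())
    return bonds


DB00755_INCHI = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5"
    "/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+"
)
stereo = double_bond_stereo(DB00755_INCHI)
```

For this record the layer describes four stereo double bonds. Comparing such a depositor-supplied layer with one computed from the drawn structure is one way an "InChI not matching structure" warning could arise.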
  16. Stereo issues
      J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277
      DB08128, DB06287
  17. Please try CVSP at http://cv.beta.rsc-us.org
      Thank you