Chemical Validation and Standardization Platform (CVSP) from Royal Society of Chemistry

Published in: Technology

  1. SERMACS 2012
     Chemical Validation and Standardization Platform (CVSP)
     Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
  2. A quality chemical record
     • Correct file format (SDF, MOL, CDX, etc.)
     • “Valid” chemical structure
     • Valid atoms (not query atoms)
     • Valid bonds
     • Valid valences
     • Valid charges
     • Stereo (defined/unknown/undefined)
     • Synonyms
     • Names (name to structure)
     • SMILES, InChIs (SMILES/InChI to structure)
     • XRefs
  3. “Quality-conscious” databases
     • Is not just for storage but shares responsibility for the data
     • Understands that bad data will eventually propagate to others
     • Uses a deposition system with strong validation rules
     • Warns users when conflicts are detected and when records are ambiguous
     • Stops users from depositing when critical chemical errors are suspected
     • Helps depositors “self-curate” their own data
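The warn-or-stop behaviour described on this slide can be sketched as a small decision function. This is only an illustration: the `Severity` levels and the `deposition_decision` name are hypothetical and are not taken from CVSP.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels for validation messages."""
    INFO = 0
    WARNING = 1
    ERROR = 2


def deposition_decision(severities):
    """Decide what to do with one record from its message severities.

    A record with any ERROR is stopped, a record whose worst message is a
    WARNING is accepted but flagged, and anything else is accepted.
    """
    worst = max(severities, default=Severity.INFO)
    if worst >= Severity.ERROR:
        return "reject"
    if worst == Severity.WARNING:
        return "accept_with_warnings"
    return "accept"
```

A deposition front end could call such a function once per record and only block the upload for the records that come back as `"reject"`.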
  4. A Vision of a Validation System
     • Validation rules are transparent to the community
     • Validation rules can be revised by users for their particular purpose
     • Free to the community
     • Accessible in batch mode via Web GUI and Web service
     • Could be integrated into any chemical deposition system
     • Rules are based on SMARTS/SMILES for finding offending patterns in molecules. Non-standard InChI is used for determining unknown/undefined stereo descriptors
  5. Standardization
     • Who is concerned with standardization?
       • Database maintainers
       • Chemists who prepare data for property calculations
       • Scientists investigating structure-activity relationships
     • They can all have different business rules in mind
     • CVSP is an open and flexible platform for different flavors of standardization
     • Rules are based on SMIRKS transformations
  6. CVSP: select rule set
  7. CVSP: Acid-Base competitive ionization
  8. CVSP: define rule variables
  9. CVSP: result of processing
  10. DrugBank dataset (6516 records)
      • Marked as Errors (arbitrary)
        • 2 records with query bonds
        • 3 records with invalid atoms (asterisk in polymers)
        • Unusual valence: ~70 (oxygen 3, sulfur 3 and 5, Mg 4, B 5, etc.)
      • Warnings
        • InChI not matching structure (100+)
        • SMILES not matching structure (100+)
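The unusual-valence findings on this slide (oxygen 3, sulfur 3 and 5, Mg 4, B 5) suggest a table-driven check. The sketch below is not CVSP's rule set: the `TYPICAL_VALENCES` table is a simplified assumption covering only neutral atoms of the elements named on the slide, and `unusual_valences` is a hypothetical helper.

```python
# Simplified, assumed table of typical valences for NEUTRAL atoms.
# A real rule set would also account for formal charge (for example,
# a positively charged oxygen can legitimately have three bonds).
TYPICAL_VALENCES = {
    "B": {3},
    "O": {2},
    "Mg": {2},
    "S": {2, 4, 6},
}


def unusual_valences(atoms):
    """Return the atoms whose valence is not in the table.

    `atoms` is an iterable of (index, element, valence) tuples, where
    valence is the sum of bond orders including implicit hydrogens.
    Elements missing from the table are skipped rather than flagged.
    """
    flagged = []
    for index, element, valence in atoms:
        allowed = TYPICAL_VALENCES.get(element)
        if allowed is not None and valence not in allowed:
            flagged.append((index, element, valence))
    return flagged
```

Keeping the table as plain data makes it easy for users to revise it for their own purpose, in line with the vision of transparent, adjustable rules.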
  11. • DrugBank ID: DB00755
      • InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
      • DrugBank ID: DB00614
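An InChI such as the one on this slide is a sequence of layers separated by `/`: the version, the molecular formula, then layers introduced by a one-letter prefix (for example `c` for connectivity, `h` for hydrogens, `b` for double-bond stereo). The sketch below splits a string into those layers and counts the atoms in the formula. It assumes a single-component InChI in which no layer prefix repeats, and the helper names are hypothetical; it is not a description of how CVSP compares InChIs with structures.

```python
import re


def inchi_layers(inchi):
    """Split an InChI into (version, formula, {prefix: layer_body})."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    if len(parts) < 2:
        raise ValueError("InChI has no formula layer")
    version, formula = parts[0], parts[1]
    # Later layers start with a one-letter prefix; with repeated prefixes
    # the last one would win, hence the single-occurrence assumption.
    layers = {part[0]: part[1:] for part in parts[2:] if part}
    return version, formula, layers


def formula_counts(formula):
    """Count atoms in a simple formula such as 'C20H28O2'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts
```

Comparing `formula_counts` for a deposited InChI against atom counts taken from the deposited structure is one inexpensive way to notice that the two disagree before running a full InChI regeneration.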
  12. Lessons
      • Don’t bury a depositor under tons of validation messages. Rank them by chemical severity
      • Bring the depositor’s attention to records with the highest warning count
      • Records that have inconsistent InChI, SMILES, and names certainly have issues
      • Only bring the most ambiguous records to a depositor’s attention and let them “self-curate”. Raise records that are suspected to have critical chemical errors to the top
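The ranking advice in these lessons can be sketched as two sort keys: one that orders records for review and one that trims the messages shown per record. The `RecordReport` class, the severity labels, and the `limit` default are assumptions made for illustration, not CVSP's data model.

```python
from dataclasses import dataclass, field

# Assumed ordering of severity labels, most severe first.
SEVERITY_ORDER = {"error": 0, "warning": 1, "info": 2}


@dataclass
class RecordReport:
    """Validation messages for one deposited record."""
    record_id: str
    messages: list = field(default_factory=list)  # (severity, text) pairs

    def count(self, severity):
        return sum(1 for level, _ in self.messages if level == severity)


def rank_for_review(reports):
    """Records with errors come first, then those with the most warnings."""
    return sorted(
        reports,
        key=lambda r: (-r.count("error"), -r.count("warning")),
    )


def top_messages(report, limit=3):
    """Show only the most severe messages instead of the full list."""
    ranked = sorted(report.messages, key=lambda m: SEVERITY_ORDER[m[0]])
    return ranked[:limit]
```

Because `sorted` is stable, records with the same error and warning counts keep their deposition order, which keeps the review list predictable.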
  13. Where will we use CVSP?
      • For external projects we are involved with, for example Open PHACTS, where we are responsible for hosting chemistry data mapped to biology data
      • Made available to RSC authors to check their chemical compounds prior to submitting articles
      • To be integrated into the ChemSpider deposition system to validate and standardize new data
      • For validating and standardizing existing data in ChemSpider and other RSC compound databases
  14. Thanks
      We would appreciate some feedback on CVSP. Please email us to request access. If you would like us to run some datasets through CVSP and send you the results, please email us as well.
      • Email: