  1. SERMACS 2012: Chemical Validation and Standardization Platform (CVSP). Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
  2. A quality chemical record  Correct file format (SDF, MOL, CDX, etc.)  “Valid” chemical structure  Valid atoms (not query atoms)  Valid bonds  Valid valences  Valid charges  Stereo (defined/unknown/undefined)  Synonyms  Names (name to structure)  SMILES, InChIs (SMILES/InChI to structure)  XRefs
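The valence check in the list above can be sketched as a lookup against a table of allowed valences per element. This is a minimal illustration, not CVSP's actual implementation; the table and the charge handling are deliberately simplified.

```python
# Minimal sketch of per-atom valence validation (not CVSP's actual code).
ALLOWED_VALENCES = {  # illustrative subset; a real table covers far more elements
    "H": {1}, "C": {4}, "N": {3}, "O": {2}, "S": {2, 4, 6},
}

def check_valence(element, bond_order_sum, charge=0):
    """Return None if the valence looks normal, else a warning string."""
    allowed = ALLOWED_VALENCES.get(element)
    if allowed is None:
        return f"Unknown element: {element}"
    # Crude charge adjustment; real valence rules are considerably more nuanced.
    effective = bond_order_sum - charge
    if effective not in allowed:
        return f"Unusual valence {effective} for {element}"
    return None
```

With this table, neutral trivalent oxygen (one of the DrugBank findings later in the deck) is flagged, while a protonated oxygen with three bonds passes.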
  3. “Quality-conscious” databases  Not just for storage; shares responsibility for the data  Understands that bad data will eventually propagate to others  Uses a deposition system with strong validation rules  Warns users when conflicts are detected and when records are ambiguous  Stops users from depositing when critical chemical errors are suspected  Helps depositors to “self-curate” their own data
  4. A Vision of a Validation System  Validation rules are transparent to the community  Validation rules can be revised by users for their particular purpose  Free to the community  Accessible in batch mode via Web GUI and Web service  Could be integrated into any chemical deposition system  Rules are based on SMARTS/SMILES for finding offending patterns in molecules; non-standard InChI is used for determining unknown/undefined stereo descriptors
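A SMARTS-driven rule set like the one this slide describes can be sketched as plain data plus a pluggable matcher. The patterns and messages below are illustrative, not CVSP's rules; in practice the matcher would be a cheminformatics toolkit's substructure search (e.g. RDKit's), injected here as a plain function so the rule set itself stays engine-independent.

```python
from dataclasses import dataclass

@dataclass
class ValidationRule:
    smarts: str       # SMARTS pattern describing the offending substructure
    message: str
    severity: str     # "error" or "warning"

# Illustrative rules only, not CVSP's actual rule set.
RULES = [
    ValidationRule("[O;v3;+0]", "Neutral trivalent oxygen", "error"),
    ValidationRule("[#6]~[#12]", "Carbon-magnesium bond", "warning"),
]

def validate(mol, matcher, rules=RULES):
    """matcher(mol, smarts) -> bool, e.g. a substructure search.
    Returns the (severity, message) pairs for every rule that fires."""
    return [(r.severity, r.message) for r in rules if matcher(mol, r.smarts)]
```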
  5. Standardization  Who is concerned with standardization?  Database maintainers  Chemists that prepare data for property calculations  Scientists investigating structure-activity relationships  They all can have different business rules in mind  CVSP is an open and flexible platform for different flavors of standardization  Rules are based on SMIRKS transformations
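A SMIRKS-based standardization pass of the kind this slide describes can be sketched as an ordered list of transformations applied by whatever engine the user plugs in (a toolkit's reaction-SMARTS machinery in practice). The rule below is illustrative, not part of CVSP's default set.

```python
# Hedged sketch of a pluggable standardization pass (not CVSP's actual code).
STANDARDIZATION_RULES = [
    # pentavalent nitro -> charge-separated nitro (illustrative SMIRKS)
    ("nitro normalization", "[N;v5:1](=[O:2])=[O:3]>>[N+:1](=[O:2])[O-:3]"),
]

def standardize(mol, apply_smirks, rules=STANDARDIZATION_RULES):
    """apply_smirks(mol, smirks) -> transformed mol; rules run in order,
    so different rule sets (default, cloned, FDA SRS, ...) are just data."""
    for name, smirks in rules:
        mol = apply_smirks(mol, smirks)
    return mol
```

Because the rule set is plain data, swapping in a different flavor of standardization means swapping the list, not the code.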
  6. CVSP: select rule set
  7. CVSP: Acid-Base competitive ionization
  8. CVSP: define rule variables
  9. CVSP: result of processing
  10. DrugBank dataset (6516 records)  Marked as Errors (arbitrary)  2 records with query bonds  3 records with invalid atoms (asterisk in polymers)  Unusual valences: ~70 (oxygen 3, sulfur 3 and 5, Mg 4, B 5, etc.)  Warnings  InChI not matching structure (100+)  SMILES not matching structure (100+)
  11.  DrugBank ID: DB00755  InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+  DrugBank ID: DB00614
  12. Lessons  Don’t bury a depositor under tons of validation messages; rank them by chemical severity  Bring the depositor’s attention to records with the highest warning count  Records with inconsistent InChIs, SMILES, and names certainly have issues  Only bring the most ambiguous records to a depositor’s attention and let them “self-curate”; raise records suspected of critical chemical errors to the top
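The ranking idea in these lessons can be sketched as a sort key that puts records with suspected critical errors first and, within a severity level, records with the most messages first. A hypothetical sketch, not CVSP's implementation:

```python
# Triage validation results so the records most worth a depositor's
# attention come first (illustrative, not CVSP's actual code).
SEVERITY_RANK = {"error": 0, "warning": 1, "info": 2}

def triage(records):
    """records: list of (record_id, [(severity, message), ...]).
    Sort by worst severity, then by descending message count."""
    def key(item):
        _, issues = item
        worst = min((SEVERITY_RANK[s] for s, _ in issues), default=3)
        return (worst, -len(issues))
    return sorted(records, key=key)
```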
  13. Where will we use CVSP?  For external projects we are involved with, for example Open PHACTS, where we are responsible for hosting chemistry data mapped to biology data  Made available to RSC authors to check their chemical compounds prior to submitting articles  To be integrated into the ChemSpider deposition system to validate and standardize new data  For validating and standardizing existing data in ChemSpider and other RSC compound databases
  14. Thanks  We would appreciate feedback on CVSP; please email us to request access. Also, if you would like us to run some datasets through CVSP and send you the results, please email us.  Email:

Editor's Notes

  • I would like to start by defining what a “quality” record means, because that is what the validation part of CVSP is about. A chemical record has several aspects to its quality. The one that is easiest to check is file-format correctness: each file format has its own formatting rules that a record in that format needs to follow, and this type of validation is done by all database maintainers that have deposition systems. Another, more relevant, type is chemical validation. A record can be perfectly formatted from the file-format point of view but make no sense chemically, and structure validation is something that is usually overlooked or not prioritized highly. Chemical validations include atom validation (checking that an atom is a legal chemical atom and that its charges and valences are reasonable) and checking that stereo is defined. Synonym validation is very useful for spotting records that are inconsistent and worth pointing the depositor to. Often during data export/import, synonyms and/or structures are manipulated and the relationships between them become faulty, so attempting to verify that a synonym and structure actually match is worth doing. The same goes for SMILES and InChIs: the relationship between a chemical record and the depositor-provided InChI or SMILES can be faulty. As I will show later, this inconsistency can reveal a systematic issue with a data set, as sometimes the InChI or SMILES does not match the structure.
  • Going from a quality record to a quality-conscious database: if database maintainers care and try their best to maintain the quality of their records, that is what I call a quality-conscious database. So what does it take? First, a cultural shift. Let’s imagine that I am a database owner. Of course, it is easier for me to shift full responsibility for the quality of the data to the original depositors. While this is understandable, each database owner has to understand that bad data in their database will propagate to others as well, and that they share responsibility for data quality. To avoid propagation, one needs to build a set of strong validation business rules into the deposition system. Depositors need to be prominently warned about records with unusual valences or charges, ambiguous records, and so on. Bad records need to be stopped at deposition, and the user has to revise records suspected of critical chemical errors. Let depositors “self-curate” their data. Big data starts with small data: raising quality standards for small data will positively affect big data as well.
  • So what makes a “perfect” validation system? Several things. A system with transparent, open validation rules; community-based rules are best. The chemistry community has benefited from IUPAC in terms of chemical names, graphical representation, and the development of InChI, and developing validation standards could be another beneficial endeavor for IUPAC or any other working group. It is understandable that for some users standard validation is not enough; open SMARTS-based validation with custom warnings would make the system flexible. The system should be free to the community. Validation should be easy to use for organic chemists via a GUI, but power users should be able to run remote validation via web services. And any chemical deposition system should be able to link to the validation system and use it as part of its workflow.
  • Standardization, or normalization, is the term used for collapsing or transforming different representations of the same molecule into a “favorite” standard one. Some additional database-specific transformations could also be in order. Transformations could include kekulization, metal-carbon bond mending or disconnection, de-esterification, removal of solvents, enol-to-ketone transformations, etc. Each database has its own flavor of standardization and its own business rules, and unfortunately they won’t share their know-how. Tautomer standardization is another problem that each database solves differently. So what we are developing is an open, transparent platform with a flexible business rule set. We are not forcing users to use the CVSP rule set; we offer users a choice depending on their purpose. They can use the CVSP default rule set, clone and revise our rule set, or use the InChI normalization rule set, the FDA SRS rule set, etc. We believe it is unnecessary for each database to develop its own chemical workflow components. One has to develop a flexible infrastructure so that the rest of the community can benefit.
  • Here is the current interface for the CVSP user profile, particularly the rule settings. As you can see, a user can choose the CVSP rule set, upload their own rules as an XML file, or clone the CVSP rule set and revise it later.
  • Basically, here you see acid/base rules with ranking. We used the FDA SRS manual to establish the acid-base competitive ionization ranking. The higher the rank, the weaker the acid. The strongest acid will be deprotonated and the strongest base protonated.
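Under the convention just described (higher rank = weaker), selecting which sites to ionize reduces to a pair of minimums over the candidate sites. A hypothetical sketch with made-up site names and ranks, not the FDA SRS values:

```python
# Illustrative competitive-ionization site selection (not CVSP's code).
# Sites are (name, rank) pairs; lower rank = stronger, per the slide.
def pick_ionization_sites(acid_sites, base_sites):
    """Return (site to deprotonate, site to protonate): the strongest
    acid loses a proton and the strongest base gains one."""
    acid = min(acid_sites, key=lambda s: s[1], default=None)
    base = min(base_sites, key=lambda s: s[1], default=None)
    return acid, base
```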
  • One handy feature of the interface is the ability for users to create macro variables that can be used in SMIRKS-based rules. For example, {M} can stand for a certain list of metals and can then be used as a variable in SMIRKS.
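Macro expansion of this kind can be sketched as simple placeholder substitution performed before the SMIRKS string is handed to the rule engine. The metal list below is illustrative, not CVSP's actual {M} definition:

```python
import re

# Hedged sketch of macro-variable expansion in SMIRKS rules.
MACROS = {"M": "[Li,Na,K,Mg,Ca]"}  # illustrative metal atom list

def expand_macros(smirks, macros=MACROS):
    """Replace {NAME} placeholders with their SMARTS fragments.
    Raises KeyError for an undefined macro name."""
    return re.sub(r"\{(\w+)\}", lambda m: macros[m.group(1)], smirks)
```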
  • The result of processing is a list of records with validation messages in the middle. If a record was standardized, a “Standardized” column is present with the structure.
  • Here is the bigger DrugBank dataset we have processed. Some warnings are shown in the dropdown list: warnings about metals, stereo, enol presence, etc.
  • The DrugBank data set was used as a guinea pig in this case: it was passed through CVSP and several issues were found. Of course, marking issues as warnings or errors is somewhat arbitrary, but the marks correlate with the level of attention that depositors need to pay when reviewing their submission. There were 2 records where non-aromatic query bonds were used and 3 records with invalid atoms (e.g. “A” in polymers). About 70 records had unusual valences: mostly radicals, inorganics, and coordination compounds. These are records that experts in inorganic or coordination chemistry need to look at and give their verdict on.
  • The first example is one where the InChI shows known stereo on double bonds but the structure doesn’t. The second example is one where neither the InChI nor the name matches the structure. This is a case where the primary data, here the original DrugBank dataset, has propagated with these inconsistencies to a few big databases. It would be beneficial if the original dataset owners passed their data through validation and fixed them. We would be happy to assist anybody who is interested in the quality of their data sets.