ChemValidator – an online service for validating and standardizing chemical structure files

Producing valid chemical structure representations that are appropriate for deposition into chemical structure databases and for inclusion in scientific publications requires a set of pre-processing filters and standardization procedures. As part of our ongoing effort to improve the quality of data deposited into the RSC ChemSpider database, to provide a means of validating and preparing data for publication, and to provide a valuable service to the chemistry community, we have delivered the ChemValidator online service. The website offers an intuitive user interface for uploading chemical compounds in various formats, pre-processing and standardizing them against a defined set of standards, and validating them against a number of rules, including checks for hypervalency, absence of stereochemistry and charge balance. This presentation will report on the development of ChemValidator.

This presentation was given by David Sharpe at the ACS Fall Meeting in 2012.

  • This chemical is in many places…Wolfram Alpha, PubChem etc…. Is there value in “duplicates/triplicates”? A record with two instances of the same molecule? What about 15 waters? Can you show an example in PubChem that we inherited???
  • We believe our dataset may collapse significantly…

Presentation Transcript

  • Delivering an online service for validating and standardizing chemical structure files using the ChemSpider platform
  • Overview
      • Introduction
          – Why do we need to validate/standardise data
          – Examples of problems in general
          – Examples of problems in ChemSpider
          – Why InChI is not enough
          – FDA rules
  • What are we trying to achieve?
      • Everyone wants high quality data
      • The ChemSpider team is building a reputation on data quality
      • Many data sources have errors
      • We need to identify:
          – Errors
          – Inconsistencies
          – Data duplication/inappropriate separation of data
      • Requires a process of validation and standardization
  • What do we mean by Validation and Standardisation?
      • Validated
          – Check for hypervalency, charge balance, missing stereo
          – Name-structure relationships, etc.
      • Standardized
          – Use standard rules to “standardize” compounds: nitro groups, O-metal bonds, tautomers, etc.
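The three validation checks named on this slide can be sketched in plain Python over a hand-built atom list. This is a minimal illustration only: the `Atom` record, the `MAX_VALENCE` table and the `validate` function are hypothetical names introduced here, the valence limits are simplified assumptions, and none of it is the CVSP implementation (which, according to a later slide, uses the Indigo API).

```python
from dataclasses import dataclass

# Illustrative maximum valences for neutral atoms (an assumption for this
# sketch; a real rule set would be more complete and charge-aware).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


@dataclass
class Atom:
    symbol: str
    bond_order_sum: int             # sum of bond orders, implicit H included
    charge: int = 0
    is_stereocentre: bool = False   # atom could carry stereo information
    has_stereo_label: bool = False  # stereo information is actually drawn


def validate(atoms):
    """Return a list of warnings: hypervalency, missing stereo, net charge."""
    warnings = []
    for i, atom in enumerate(atoms):
        limit = MAX_VALENCE.get(atom.symbol)
        # A positively charged nitrogen may carry one extra bond (ammonium).
        if limit is not None and atom.symbol == "N" and atom.charge == 1:
            limit += 1
        if limit is not None and atom.bond_order_sum > limit:
            warnings.append(f"atom {i} ({atom.symbol}): hypervalent")
        if atom.is_stereocentre and not atom.has_stereo_label:
            warnings.append(f"atom {i} ({atom.symbol}): missing stereo")
    total_charge = sum(atom.charge for atom in atoms)
    if total_charge != 0:
        warnings.append(f"unbalanced charge: {total_charge:+d}")
    return warnings


# A neutral nitrogen with five bonds, as in a nitro group drawn N(=O)=O.
print(validate([Atom("N", bond_order_sum=5)]))
```

Returning warnings as a list rather than raising on the first problem lets one record accumulate several findings, which matches the idea of reporting issues back to a depositor.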
  • Where will CVSP be useful?
      • Currently, a standalone system
      • In the future, validation/standardisation routines will be used:
          – Built in to our deposition system
          – At registration for new compounds
          – To improve existing data in ChemSpider (pass through the ChemSpider backfile)
      • Potential to offer optional checking service to authors
  • What we want to avoid
  • What do we do now?
      • Currently, ChemSpider uses structures (as InChIs) as the database key
      • Need structures for depositions
      • 2 steps:
          – Pre-processing prior to deposition
          – InChI algorithm: provides standardisation and mapping
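Using the InChI string as the database key means that records from different sources collapse onto one entry whenever their InChIs match. A minimal sketch of that grouping is shown here; the record identifiers and the `group_by_inchi` helper are hypothetical, and the InChI strings are supplied as literals rather than computed.

```python
from collections import defaultdict


def group_by_inchi(records):
    """Map each InChI string to the list of record ids that share it."""
    groups = defaultdict(list)
    for record_id, inchi in records:
        groups[inchi].append(record_id)
    return dict(groups)


# Hypothetical deposited records: (source record id, InChI).
records = [
    ("src-A/1", "InChI=1S/H2O/h1H2"),                  # water
    ("src-B/7", "InChI=1S/H2O/h1H2"),                  # water again
    ("src-A/2", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),  # ethanol
]

groups = group_by_inchi(records)
duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(duplicates)
```

Grouping of this kind merges different drawings of the same molecule, but it says nothing about which of the merged drawings is a poor representation. That gap is what the validation and standardization steps are meant to address.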
  • What are the common errors?
      • Records without a structure
      • Incorrect valences
      • Atom labels
  • What are the common errors?
      • Unbalanced charge
          – Name-structure errors
      • Salts
      • Polymers/Organometallics
      • Missing stereochemistry
  • Side Effects of InChI on ChemSpider: Sort of helpful
  • Side Effects of InChI on ChemSpider
      • Advantages and disadvantages
          – The depictions are meant to represent the same molecule
          – Not easy to pick out “bad” representations
  • Substance Registry System
      • How do you decide your standardisation rules?
      • Avoid standards in isolation
        http://www.fda.gov/downloads/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/ucm127743.pdf
      • Note: This document is only a starting point
  • Salt and Ionic Bonds
  • Nitro groups
  • Ammonium salts
  • Validation rules
      In XML: code generated dynamically from rule set. Indigo API used behind the scenes.
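The idea of generating checks dynamically from an XML rule set can be sketched as follows. The transcript does not show the real CVSP rule format, so the `<rule>` element, its attributes and the record fields below are entirely hypothetical, and the sketch uses only the Python standard library instead of the Indigo API mentioned on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set; the real CVSP XML schema is not shown in the slides.
RULES_XML = """
<rules>
  <rule severity="error" field="atom_count" op="eq" value="0"
        message="record has no structure"/>
  <rule severity="warning" field="net_charge" op="ne" value="0"
        message="charge is not balanced"/>
</rules>
"""

# Comparison operators that a rule may name in its "op" attribute.
OPS = {
    "eq": lambda a, b: a == b,
    "ne": lambda a, b: a != b,
    "gt": lambda a, b: a > b,
}


def compile_rules(xml_text):
    """Turn each <rule> element into a (metadata, predicate) pair."""
    compiled = []
    for element in ET.fromstring(xml_text.strip()).iter("rule"):
        field = element.get("field")
        op = OPS[element.get("op")]
        value = int(element.get("value"))
        # Bind the loop values as defaults so each predicate keeps its own.
        predicate = lambda record, f=field, o=op, v=value: o(record[f], v)
        compiled.append((dict(element.attrib), predicate))
    return compiled


def run_rules(compiled, record):
    """Return (severity, message) for every rule that fires on the record."""
    return [(meta["severity"], meta["message"])
            for meta, predicate in compiled if predicate(record)]


compiled = compile_rules(RULES_XML)
print(run_rules(compiled, {"atom_count": 0, "net_charge": 0}))
```

Keeping the rules in data rather than code is what would allow the pre-defined and user-uploaded rule sets described later in the deck to share one engine.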
  • Standardization rules
      Corrections stored in database: SMIRKS-based corrections and also proximity-based metal–non-metal reconnection.
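Proximity-based metal–non-metal reconnection can be illustrated with a small distance test over 2D coordinates. This is a sketch under stated assumptions: the `METALS` subset, the `MAX_DISTANCE` cut-off and the `reconnect` function are hypothetical, and the slides do not say which threshold or bond type CVSP actually uses.

```python
import math

METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn"}  # illustrative subset
MAX_DISTANCE = 2.5  # hypothetical cut-off, in the file's coordinate units


def reconnect(atoms, bonds):
    """Add a bond between every metal and each nearby non-metal.

    atoms: list of (symbol, x, y) tuples
    bonds: set of frozenset({i, j}) index pairs
    Returns a new bond set; the input set is left unchanged.
    """
    new_bonds = set(bonds)
    for i, (symbol_i, xi, yi) in enumerate(atoms):
        if symbol_i not in METALS:
            continue
        for j, (symbol_j, xj, yj) in enumerate(atoms):
            if symbol_j in METALS:
                continue  # only metal/non-metal pairs are considered
            if math.dist((xi, yi), (xj, yj)) <= MAX_DISTANCE:
                new_bonds.add(frozenset((i, j)))
    return new_bonds


# Sodium drawn next to, but not bonded to, an oxygen that is bonded to carbon.
atoms = [("Na", 0.0, 0.0), ("O", 1.5, 0.0), ("C", 3.0, 0.0)]
bonds = {frozenset((1, 2))}
print(reconnect(atoms, bonds))
```

Only the oxygen falls within the cut-off here, so the carbon stays unconnected to the metal. What the added connection represents (covalent, ionic or a flag for review) would be decided by the rule set in use.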
  • Case study: DrugBank
      • DrugBank (http://www.drugbank.ca/) maintained by David Wishart
      • Database contains 6711 structures
      • Widely regarded as a well curated, high quality dataset
      DrugBank 3.0: a comprehensive resource for omics research on drugs. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo AC, Wishart DS., Nucleic Acids Res., 2011, 39, Jan, D1035-41.
  • ChemSpider Standardization
      • Entire ChemSpider database will be standardized using modified FDA rule set
      • Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, names) will be regenerated
      • Standardization procedures automatically applied to all future depositions
  • CVSP as a Flexible System
      • There will be various rule sets
          – Rigid pre-defined rules: e.g. meeting FDA specifications as written, Open PHACTS modified rule set, etc.
          – Flexible user-defined rules: users upload their rules in our custom format (XML)
          – The Open PHACTS rule set will be open to the community to reuse
  • Incorporating CVSP into data processing platforms: Knime
      • The workflow includes:
          – SDF reader
          – Indigo nodes
          – Calls for ChemSpider validation Web services
  • Incorporating CVSP into data processing platforms: Knime
      • Warning is returned as a result of processing
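The shape of such a workflow (read an SD file, pass each record to a validator, collect the warnings) can be sketched outside Knime in plain Python. The `stub_validate` function below stands in for the ChemSpider validation web service, whose interface is not given in the slides. It and `run_workflow` are hypothetical names, and the only real check performed is reading the atom count from the V2000 counts line.

```python
def split_sdf(text):
    """Yield each record of an SD file; every record ends with a '$$$$' line."""
    current = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield "\n".join(current)
            current = []
        else:
            current.append(line)


def stub_validate(molblock):
    """Placeholder for a remote validation call (hypothetical).

    Reads the atom count from the V2000 counts line (fourth line, first
    three characters) and warns when the record has no atoms.
    """
    counts_line = molblock.split("\n")[3]
    atom_count = int(counts_line[0:3])
    return ["record has no structure"] if atom_count == 0 else []


def run_workflow(sdf_text, validate):
    """Apply `validate` to every record and keep (index, warnings) pairs."""
    return [(index, validate(record))
            for index, record in enumerate(split_sdf(sdf_text))]


# Two tiny records: a single oxygen atom, then an empty structure.
sdf_text = "\n".join([
    "water", "", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "$$$$",
    "empty", "", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "$$$$",
])

results = run_workflow(sdf_text, stub_validate)
print(results)
```

Passing the validator in as a callable keeps the reader independent of how validation is performed, so the stub could be swapped for a real service client without touching the rest of the pipeline.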
  • Summary
      • Will release back results of DrugBank
      • Alpha version of CVSP available: http://cv.beta.rsc-us.org/Batches.aspx
      • Will be a resource for the community
      • Will improve ChemSpider
      • Still a long way to go…
  • Thank you
      Email: chemspider@rsc.org
      Twitter: ChemSpider
      http://www.chemspider.com
      http://cssp.chemspider.com/