The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

High-quality chemical databases struggle to protect their data from the influx of unchecked machine-generated chemistry and lower-quality data. The period of primarily human curation prior to deposition in a database is over, and quality-conscious databases need to rely heavily on automated validation checks. An automated chemical validation system is being developed by the cheminformatics team at the Royal Society of Chemistry to be the “quality gatekeeper” of databases at the point of deposition. ChemSpider is leading a community-wide standardization approach, starting with our support of the Open PHACTS semantic web project, an Innovative Medicines Initiative project. The Chemical Validation and Standardization Platform (CVSP) is being designed as an open, flexible platform that validates and standardizes chemical records. This presentation will review the existing beta version of the system and work in progress.

Usage Rights: CC Attribution-NonCommercial License


    Presentation Transcript

    • Chemistry Validation and Standardization Platform: Modularization and “Hadoop”ization
      Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
      ACS New Orleans, April 2013
    • Overview
        • Motivation
        • What we support
        • Modularization
        • Parallelization
        • Examples
    • Motivation: validation
      Open and free chemical validation system for:
        • Structure validation
            – Warn on query atoms, pseudo atoms, polymers, etc.
            – Nonsensical stereo
        • SDF field mapping for validating depositor-provided names, InChI, SMILES
    • Motivation: standardization
      Allows users to use the CVSP default standardization workflow (or FDA, Open PHACTS and so on)
      Allows users to put together their own workflow using the modules provided:
        • Apply default CVSP or user-defined SMIRKS rules
        • Layout
        • Neutralize
        • Get canonical tautomer using ChemAxon’s algorithms
        • Get biggest organic fragment
    • What we support
        • SD files and mol files
        • ChemDraw files (in-house code)
        • Tab-delimited text files of names, InChIs, SMILES
        • Zipped files
        • GZipped files
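The structure checks listed on this slide can be illustrated with a minimal sketch that scans a V2000 molfile for query atoms, pseudo atoms and query (any) bonds. This is not CVSP's code: the function names, the symbol sets and the message wording are assumptions made for illustration, and only fixed-column V2000 input is handled.

```python
# Illustrative symbol sets (assumptions, not CVSP's actual rule list).
QUERY_ATOMS = {"A", "Q", "*", "L"}
PSEUDO_ATOMS = {"R", "R#"}
ANY_BOND = 8  # V2000 bond type 8 denotes an "any" bond


def validate_molblock(molblock: str) -> list[str]:
    """Return warnings for query atoms, pseudo atoms and query (any) bonds."""
    lines = molblock.splitlines()
    counts = lines[3]  # the fourth line of a V2000 molfile is the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    issues = []
    for index, line in enumerate(lines[4:4 + n_atoms], start=1):
        symbol = line[31:34].strip()  # atom symbol sits in columns 32-34
        if symbol in QUERY_ATOMS:
            issues.append(f"atom {index}: query atom '{symbol}'")
        elif symbol in PSEUDO_ATOMS:
            issues.append(f"atom {index}: pseudo atom '{symbol}'")
    bond_start = 4 + n_atoms
    for line in lines[bond_start:bond_start + n_bonds]:
        first, second, bond_type = int(line[0:3]), int(line[3:6]), int(line[6:9])
        if bond_type == ANY_BOND:
            issues.append(f"bond {first}-{second}: query (any) bond")
    return issues


def make_molblock(atoms, bonds):
    """Build a minimal V2000 molblock (all coordinates zero) for the demo."""
    header = ["example", "", ""]
    counts = f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000"
    atom_lines = [
        f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3} 0  0  0  0  0  0  0  0  0  0  0  0"
        for symbol in atoms
    ]
    bond_lines = [f"{a:3d}{b:3d}{t:3d}  0" for a, b, t in bonds]
    return "\n".join(header + [counts] + atom_lines + bond_lines + ["M  END"])


demo = make_molblock(["C", "A", "R#"], [(1, 2, 8), (2, 3, 1)])
demo_issues = validate_molblock(demo)
for issue in demo_issues:
    print(issue)
```

Checks such as nonsensical stereo, or comparing a depositor-provided InChI or SMILES with the drawn structure, need a full cheminformatics toolkit and are outside this sketch.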
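The idea of assembling a workflow from modules might look roughly like the sketch below. It is only an illustration: `run_workflow`, `biggest_organic_fragment` and `placeholder_step` are hypothetical names, the fragment picker is a naive string-based approximation working on SMILES, and steps such as SMIRKS rules, layout, neutralization and tautomer canonicalization are represented by a no-op because they need a real cheminformatics toolkit.

```python
import re

# Naive SMILES atom tokenizer: bracket atoms or organic-subset symbols.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def _is_carbon(token: str) -> bool:
    """True if the atom token is carbon, either bare or inside brackets."""
    if token in ("C", "c"):
        return True
    match = re.match(r"\[\d*([A-Z][a-z]?|[a-z])", token)
    return bool(match) and match.group(1) in ("C", "c")


def biggest_organic_fragment(smiles: str) -> str:
    """Keep the dot-separated fragment with the most atoms that contains carbon."""
    organic = []
    for fragment in smiles.split("."):
        atoms = ATOM.findall(fragment)
        if any(_is_carbon(atom) for atom in atoms):
            organic.append((len(atoms), fragment))
    if not organic:
        return smiles  # nothing organic found: leave the record unchanged
    return max(organic, key=lambda pair: pair[0])[1]


def placeholder_step(smiles: str) -> str:
    """Stand-in for toolkit-backed modules (layout, neutralize, tautomer)."""
    return smiles


def run_workflow(smiles: str, steps):
    """Apply (name, function) steps in order; report which steps changed the record."""
    changed = []
    for name, step in steps:
        result = step(smiles)
        if result != smiles:
            changed.append(name)
        smiles = result
    return smiles, changed


workflow = [
    ("biggest_organic_fragment", biggest_organic_fragment),
    ("neutralize (placeholder)", placeholder_step),
]
standardized, changed_steps = run_workflow("CC(=O)[O-].[Na+]", workflow)
print(standardized, changed_steps)
```

Keeping each module as a plain function with the same signature is one way to let a default workflow and a user-defined workflow share the same building blocks.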
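As a rough illustration of handling several of these input types, the sketch below guesses the format of an upload from its first bytes and content, and unpacks zipped or gzipped submissions. The function names and the returned labels are assumptions, ChemDraw files are not covered, and the text heuristics are deliberately simple.

```python
import gzip
import io
import zipfile


def sniff_format(data: bytes) -> str:
    """Guess the submission format from magic bytes and simple text markers."""
    if data.startswith(b"\x1f\x8b"):      # gzip magic number
        return "gzip"
    if data.startswith(b"PK\x03\x04"):    # zip local file header
        return "zip"
    text = data.decode("utf-8", errors="replace")
    if "$$$$" in text:                    # SD record delimiter
        return "sdf"
    if "M  END" in text:                  # end of a molfile connection table
        return "mol"
    if "\t" in text:
        return "tab-delimited"
    return "unknown"


def expand(data: bytes) -> list[bytes]:
    """Return the uncompressed payload(s) contained in a submission."""
    kind = sniff_format(data)
    if kind == "gzip":
        return [gzip.decompress(data)]
    if kind == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return [archive.read(name) for name in archive.namelist()]
    return [data]


# Demo payloads built in memory.
table = b"name\tSMILES\nethanol\tCCO\n"
gzipped = gzip.compress(table)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("records.sdf", "example\n\n\n  0  0\nM  END\n$$$$\n")
zipped = buffer.getvalue()

for payload in (table, gzipped, zipped):
    inner = [sniff_format(part) for part in expand(payload)]
    print(sniff_format(payload), "->", inner)
```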
    • CVSP: modularization
    • Reusable workflows
    • SMIRKS-based rules
    • “Hadoop”ization
      Apache Hadoop is a framework for the distributed processing of large datasets across clusters of computers.
      CVSP is written in C#. To run it on Linux machines we use Mono (a cross-platform .NET runtime environment).
      Farm:
        • 28 CPU cores
        • 42 GB memory
        • 2 TB disk space
      Processor-intensive tasks:
        • Tautomerization
    • Input file → Convert to SD format → Deposit ID in database → Upload to farm for Hadoop processing → Processing on Hadoop → Upload results to database for user preview → Download results
    • Hadoop queues
      Three Hadoop queues (capacity queues) are used to prioritize CVSP submissions of different sizes:
        • “Small” submission queue for submissions under 500 records
        • Large submissions queue
        • Internal queue
            – For internal projects, e.g. tautomer analysis of ChemSpider or ChemSpider standardization
      All records have to be processed on Hadoop before the user can see the results (no partial preview).
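The routing rule described on this slide can be sketched in a few lines. The queue names and the `choose_queue` function are hypothetical; only the 500-record threshold and the existence of small, large and internal queues come from the slide.

```python
SMALL_SUBMISSION_LIMIT = 500  # submissions under this many records count as "small"


def choose_queue(record_count: int, internal: bool = False) -> str:
    """Pick a queue name (hypothetical labels) for a CVSP submission."""
    if internal:
        return "internal"
    if record_count < SMALL_SUBMISSION_LIMIT:
        return "small"
    return "large"


for count, internal in [(120, False), (500, False), (6516, False), (2_000_000, True)]:
    print(count, internal, choose_queue(count, internal))
```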
    • Examples
      DrugBank
        • ~6500 records, approximately 2 records per second
      PubMed
        • ~100 000 records, about 9 h
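The two figures on this slide can be put on a common footing with a little arithmetic. The sketch below uses only the approximate numbers quoted above, so the results are rough estimates rather than measured timings.

```python
# Approximate figures quoted on the slide.
drugbank_records = 6500
drugbank_rate = 2.0            # records per second

pubmed_records = 100_000
pubmed_hours = 9.0

# Implied DrugBank run time at ~2 records per second.
drugbank_minutes = drugbank_records / drugbank_rate / 60

# Implied throughput for the larger run.
pubmed_rate = pubmed_records / (pubmed_hours * 3600)

print(f"DrugBank: about {drugbank_minutes:.0f} min")
print(f"Larger run: about {pubmed_rate:.1f} records per second")
```

On these numbers the DrugBank run works out to a little under an hour, and the larger run averages roughly three records per second.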
    • Rate-limiting step?
      Canonical tautomerization
      This molecule took 45 min to canonicalize.
    • DrugBank dataset (6516 records)
      Errors
        • 2 records with query (any) bond
        • 2 records with R groups
        • 3 polymers
        • 18 porphyrins with metal coordinated inside with one of the metal-nitrogen bonds stereogenic
        • Unusual valence: ~20
      Warnings
        • InChI not matching structure (100+)
        • SMILES not matching structure (100+)
    • DrugBank ID: DB00755
      InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
      DrugBank ID: DB00614
    • Stereo issues
      J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277
      DB08128, DB06287
    • Please try CVSP at http://cv.beta.rsc-us.org
      Thank you
      E-mail: karapetyank@rsc.org, batchelorc@rsc.org