Your SlideShare is downloading. ×
Chemical Database Projects Delivered by RSC eScience
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Chemical Database Projects Delivered by RSC eScience


Published on

This presentation is an overview of some of the projects we are involved with at RSC eScience. The presentation was given at the FDA Meeting regarding the “Development of a Freely Distributable Data …

This presentation is an overview of some of the projects we are involved with at RSC eScience. The presentation was given at the FDA Meeting regarding the “Development of a Freely Distributable Data System for the Registration of Substances

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide
  • The result of processing is a list of records with validation messages in the middle. If record was standardized then “Standardized” column is present with the structure.
  • Here is the bigger DrugBank dataset we have processed. Some warnings are shown in the dropdown list. Warnings about metals, stereo, enol presence, etc.
  • DrugBank data set was used as Guinea pig in this case and passed through CVSP and several issues were found. OF course, issues are marked as warnings or errors arbitrarily, but they correlate with the level of attention that depositors need to pay attention when reviewing their submission. There were 2 records where non aromatic query bonds were used, 3 records with invalid atoms (e.g. ”A” in polymers). About 70 records had issues: mostly radicals, inorganics, and coordination compounds have unusual valences. But these are records that experts in inorganic chemistry or coordination chemistry need to look at and have their verdict.
  • First example is where InChI shows known stereo on double bonds but structure doesn’t. Second example is where neither the InChI nor the name match the structure. This is an example where the primary data – in this case original DrugBank dataset – with these inconsistencies have propagated to few big databases. It would be beneficial if original dataset owners pass their data through validation and fix them. We would be happy to assist in this adventure to anybody that is interested in quality of their data sets.
  • Transcript

    • 1. Chemical Database Projects Delivered by RSC eScience at the FDA Meeting “Development of a Freely Distributable Data System for the Registration of Substances” Antony Williams
    • 2. RSC eScience What was once just ChemSpider is much more…  ChemSpider Reactions  Chemicals Validation and Standardization Platform  Learn Chemistry Wiki  National Chemical Database Service  Open PHACTS  PharmaSea  Global Chemistry Hub
    • 3. We are known for ChemSpider… The Free Chemical Database A central hub for chemists to source information  >28 million unique chemical records  Aggregated from >400 data sources  Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions A central hub for chemists to deposit & curate data
    • 4. We Want to Answer Questions Questions a chemist might ask…  What is the melting point of n-heptanol?  What is the chemical structure of Xanax?  Chemically, what is phenolphthalein?  What are the stereocenters of cholesterol?  Where can I find publications about xylene?  What are the different trade names for Ketoconazole?  What is the NMR spectrum of Aspirin?  What are the safety handling issues for Thymol Blue?
    • 5. I want to know about “Vincristine”
    • 6. Vincristine: Identifiers and Properties
    • 7. Vincristine: Vendors and Sources
    • 8. Vincristine: Articles
    • 9. How did we build it? We deal in Molfiles or SDF files – with coordinates Deposit anything that has an InChI – we support what InChI can handle, good and bad Standardization based on “InChI standardization” InChIs aggregate (certain) tautomers
    • 10. The InChI Identifier
    • 11. Downsides of InChI Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules” InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
    • 12. Side Effects of InChI Usage
    • 13. SMILES by comparison…
    • 14. Side Effects of InChI Usage
    • 15. Searches: The INTERNET
    • 16. Search by InChI
    • 17. ChemSpider Google Search
    • 18. How did we build it? We deal in Molfiles or SDF files – with coordinates Deposit anything that has an InChI – we support what InChI can handle, good and bad Standardization based on “InChI standardization” InChIs aggregate (certain) tautomers We deal with “various forms” of data
    • 19. Crowdsourced “Annotations” Users can add  Descriptions/Syntheses/Commentaries  Links to PubMed articles  Links to articles via DOIs  Add spectral data  Add Crystallographic Information Files  Add photos  Add MP3 files  Add Videos
    • 20. ChemSpider : Spectra Linked
    • 21. ChemSpider ID 24528095 H1 NMR
    • 22. ChemSpider ID 24528095 HHCOSY
    • 23. How did we build it? We deal in Molfiles or SDF files – with coordinates Deposit anything that has an InChI – we support what InChI can handle, good and bad Standardization based on “InChI standardization” InChIs aggregate (certain) tautomers We deal with “various forms” of data We are challenged with the complexities of chemical names
    • 24. Antony Williams vs Identifiers Passport ID Dad, Tony, others5 email addresses LicenseChemSpiderman (blog, SSNTwitter account,Facebook, Friendfeed)OpenID…. Green Card
    • 25. Aspirin names and synonyms • Text searches depend on correct association • >300 suggested identifiers for Aspirin just on PubChem • Disambiguation dictionaries are necessary, not just for authors!
    • 26. The Final Search Strategy
    • 27. All Those Names, One Structure
    • 28. Curated Dictionaries Matter
    • 29. Crowdsourcing ChemSpider ChemSpider is crowdsourced Community deposition, annotation and curation Anyone can “Leave Feedback” Registered users can add data
    • 30. “Curate” Identifiers
    • 31. “Curate” Identifiers
    • 32. “Curate” Identifiers
    • 33. Success Depends on Dictionaries
    • 34. Vincristine: Identifiers and Properties
    • 35. Vincristine: PatentsLinked by Name
    • 36. Validated Names for Searching…
    • 37. And yes..there are challenges
    • 38. Licensing Data is Tough…
    • 39. Data Licensing, Open Data The use of CAS data in third party Data Mining Tools is permitted as long as CAS Records are downloaded via STN® AnaVist™. All of these new "freedoms" are aimed at further enabling the dissemination of scientific information and the advancement of scientific research. CAS does not permit the building of Databases that have wide and general availability and no longer fulfill the purpose of individual or team research that CAS permits but instead serve as a substitute for the use of CAS Databases.
    • 40. A Comment on Quality For >28 million chemical compounds there are some errors:  “Incorrect” structure representations  Mismatched name-structure relationships  Experimental properties (the values, the units)  Real vs. virtual compounds – text-mining and conversion  We have deprecated a LOT of data…
    • 41. Identifier Dictionaries Reciprocal curation processes…share curation with each other. If a database has a compound already then use InChiKeys to match “suggested” validation against the compound. A series of “added” and “removed” synonyms against InChIKeys for matching.
    • 42. Federated Data Curation SharingWho wants to work with us?
    • 43. Structure Validation using feed Look for approved synonyms Compare feed InChIKey with database InChIKey If different, flag for inspection
    • 44. Many Problems Can be Solved… Clean up databases – structure validation, structure standardization Warn about  Valency, charge balance, depiction issues, bond types, absent stereo, and another 100 rules (or so…) Standardize  Agree community rules to “Standardize”
    • 45. Structure Validation
    • 46. Structure Validation - Fixed
    • 47. What needs to happen? If we could validate  Catch errors in databases (and clean)  Proactively catch errors in publications/patents  Reduce junk in the ether – improve QUALITY! If we standardized  Interlinking should improve
    • 48. CVSP: result of processing
    • 49. NCATS Dataset
    • 50. DrugBank dataset (6516 records) Marked as Errors (arbitrary)  2 records with query bonds  3 records with invalid atoms (asterisk in polymers)  Unusual valence: ~70 (oxygen 3, sulfur 3 and 5, Mg 4, B 5, etc.) Warnings  INCHI not matching structure (100+)  SMILES not matching structure (100+)
    • 51.  DrugBank ID: DB00755 InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18- 17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3, (H,21,22)/b9-6+,12-11+,15-8+,16-14+ DrugBank ID: DB00614
    • 52. Connecting Chemistry across the web So much of what is seen on ChemSpider is retrieved in real time using services
    • 53. Connecting Chemistry across the web
    • 54. Online Predictions
    • 55. Web Services Open Up Collaboration Agilent, Bruker, Waters and Thermo all use our web-based services for compound lookup Many academic sites integrating directly – metabonomics, name lookup, semantic markup Mobile app integration Commercial structure drawing packages
    • 56. Web Services
    • 57. ChemSpider Everywhere: Spectral Game
    • 58. ChemSpider EverywhereCrowdsourced Curation of Spectra
    • 59. Web Services Integrate INTERNAL Projects Integration between ChemSpider and…  Our publishing platform for structure display  ChemSpider SyntheticPages  LearnChemistry Wiki  National Chemical Database Service  And….a growing list….
    • 60. What ChemSpider Does Not Handle Polymers Markush structures Organometallics Many Inorganics Materials Reactions…but….
    • 61. ChemSpider Reactions
    • 62. DERA Digitally Enabling the RSC Archive…back to 1841  Extracting data and making available via appropriate platform  Chemicals  Reactions  Analytical Data  Figures
    • 63. Chemical Database Service
    • 64. Data for life sciences IP? IP? What’s the What’s the structure? structure? Are they in Are they in our file? our file? What’s What’s similar? similar? What’s the What’s the Pharmacology Pharmacology target? target? data? data? Known Known Pathways? Pathways? Competitors? Competitors? Working On Working On Connections Connections Now? Now? to disease? to disease? Expressed in Expressed in right cell type? right cell type?
    • 65. OpenPHACTS Open PHACTS is an Innovative Medicines Initiative (IMI) – 3 years project To reduce the barriers to drug discovery in industry, academia and for small businesses To build an open platform, integrating chemistry and biology data from public domain resources Semantic web platform Open Standards, Open Data and Open Source
    • 66.  Crowdsourcing across drug discovery Open PHACTS : partnership between European Community and European Pharma Companies 22 partners, 8 pharmaceutical companies, 3 biotechs working together for 3 years Freely accessible for knowledge discovery and verification.  Data on chemistry and biology  Pharmacological profiles  Proprietary and public data sources.
    • 67. PharmaSea• FP7 Initiative. PharmaSea: increasing value and flow in the marine biodiscovery pipeline (2012-2017)• Improve the quality, volume and value of active agents discovered in the marine environment and increase the speed at which they can be delivered• RSC: Providing dereplication via ChemSpider, analytical data algorithms, integration with computer-assisted structure elucidation algorithms
    • 68. Conclusions RSC eScience supporting increasing number of grant-based projects ChemSpider grows daily – community depositions and data from RSC Content with a focus on expanding data while improving quality ChemSpider is an integration platform for MANY projects through web services CVSP processing is available to use and provide feedback – will be available as a service also We believe in curation sharing - who wants to collaborate?
    • 69. Thank youEmail: williamsa@rsc.orgTwitter: ChemConnectorBlog: Blog: www.chemconnector.comSLIDES: