ChemSpider – A Crowdsourcing Environment for Hosting and Validating Chemistry Resources

ChemSpider is a structure-centric database hosted by the Royal Society of Chemistry that links over 25 million chemical compounds to over 400 internet-based resources, including many public domain databases, Wikipedia, chemical vendors, patents, publications and other web-based services. The intention is for ChemSpider to become one of the primary online hubs where chemists source chemistry-related data. During the development of the ChemSpider database we have used numerous approaches to standardize, curate and validate the data supplied to us for hosting and integration. This presentation gives an overview of our initial development of the ChemSpider database and of our present processes and procedures for handling incoming data depositions. We will also discuss how crowdsourcing can help to expand, curate and validate the data in the ChemSpider database.

Usage Rights

CC Attribution-NonCommercial License

ChemSpider – A Crowdsourcing Environment for Hosting and Validating Chemistry Resources Presentation Transcript

  • 1. ChemSpider – A Crowdsourcing Environment for Hosting and Validating Chemistry Resources. Antony Williams, 5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011
  • 2. I want to know about “Vincristine”
  • 3. Vincristine: Identifiers and Properties
  • 4. Vincristine: Vendors and Sources
  • 5. Vincristine: Patents
  • 6. Vincristine: Articles
  • 7. Vincristine: RSC Databases
  • 8. Searches: The INTERNET
  • 9. Validated Names for Searching…
  • 10. And InChIs…
  • 11. ChemSpider
    • The Free Chemical Database
    • A central hub for chemists to source information
      • >26 million unique chemical records
      • Aggregated from >400 data sources
      • Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions
    • A central hub for chemists to deposit & curate data
  • 12. Essential aspects of ChemSpider
    • ChemSpider is a BIG database… and growing
    • Our focus has increasingly become QUALITY over quantity
    • Data curation and validation is our strength – crowdsourcing is contributing, more is required
    • Validated data has enabled linking of the internet
  • 13. “All That Glisters is Not Gold”: What is the structure of Discodermolide?
  • 14. How to distinguish…who’s wrong?
  • 15. Neither is wrong
  • 16. Data Curation…long torturous task
    • Data curation – JUST structure-name validation is a long, torturous, iterative task.
    • How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra
  • 17. PHYSPROP Database
    • The freely downloadable database underlying the EPI Suite prediction software
    • Very basic filters suggest data quality issues
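
As a rough illustration of what "very basic filters" might look like, the Python sketch below flags records whose values look physically implausible. The record layout, the thresholds and the sample records are all assumptions made for illustration; they are not taken from PHYSPROP or from the filters actually used for this slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhysChemRecord:
    """Hypothetical record layout; the real PHYSPROP schema may differ."""
    name: str
    melting_point_c: Optional[float] = None
    boiling_point_c: Optional[float] = None
    log_p: Optional[float] = None


def basic_quality_flags(rec: PhysChemRecord) -> List[str]:
    """Return human-readable warnings for values that look implausible."""
    flags = []
    # No temperature can lie below absolute zero.
    for label, value in (("melting point", rec.melting_point_c),
                         ("boiling point", rec.boiling_point_c)):
        if value is not None and value < -273.15:
            flags.append(f"{label} below absolute zero")
    # A melting point above the boiling point deserves a second look.
    if (rec.melting_point_c is not None
            and rec.boiling_point_c is not None
            and rec.melting_point_c > rec.boiling_point_c):
        flags.append("melting point above boiling point")
    # Illustrative range check; the bounds are an arbitrary assumption.
    if rec.log_p is not None and not -10.0 <= rec.log_p <= 15.0:
        flags.append("logP outside illustrative range")
    return flags


# Made-up sample records, not real measurements.
records = [
    PhysChemRecord("example A", melting_point_c=5.5, boiling_point_c=80.1, log_p=2.1),
    PhysChemRecord("example B", melting_point_c=250.0, boiling_point_c=120.0),
    PhysChemRecord("example C", boiling_point_c=-400.0, log_p=42.0),
]
flagged = {r.name: basic_quality_flags(r) for r in records if basic_quality_flags(r)}
```

Filters of this kind only surface candidates for review; a flagged value still needs a curator to decide whether the record is actually wrong.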
  • 18. The Stereochemistry challenge. 12500 chemicals with “missed” stereo
  • 19. NIST Webbook
  • 20. NLM’s DailyMed
  • 21. NLM’s DailyMed
  • 22. NLM’s DailyMed
  • 23. PubChem
  • 24. Linking
  • 25.  
  • 26.  
  • 27.  
  • 28.  
  • 29. Patents
  • 30. Patents
  • 31. WYSIWYG compounds
  • 32. WYSIWYG compounds
  • 33. Data Curation…long torturous task
    • Data curation – JUST structure-name validation is a long, torturous, iterative task.
    • How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra
    • The crowd in crowdsourcing is …generally small
    • Which of the large databases are doing careful curation? How can we share the workload? Hmm…
  • 34.
    • Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
  • 35.  
  • 36.
    Drug Name  | Generic Name         | ChEBI   | ChemSpider | CAS Common Chemistry | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia
    Spiriva    | Tiotropium Bromide   | No Hits |            | No Hits              |            |          |          | 4/0     |
    Depakote   | Valproate semisodium |         |            |                      |            |          |          |         | No Structure
    Basen      | Voglibose            |         |            | No Hits              |            | No Hits  |          | 2/1     |
    Symbicort  | 1) Budesonide        |         |            |                      |            |          |          | 8/1     |
    Symbicort  | 2) Formoterol        | WRONG   |            | No Hits              |            |          |          | 6/1     |
    Vytorin    | 1) Ezetimibe         |         |            | No Hits              |            |          |          |         |
    Vytorin    | 2) Simvastatin       |         |            |                      |            |          |          | 2/1     |
    Taxol      | Paclitaxel           |         |            |                      |            |          |          | 44/1    |
    Thalidomid | Thalidomide          | No Hits |            |                      |            |          |          |         |
    Zocor      | Simvastatin          |         |            |                      |            |          |          | 2/1     |
    Crestor    | Rosuvastatin         |         |            | No Hits              |            |          |          | 2/1     |
  • 37. ChemSpider can “do it” for us
    • ChemSpider has built a curation interface used by the community and ourselves for curating.
    • All curation activities are available for review, online immediately, iteratively checked.
    • Curators have different abilities based on their profile: There are only a few “Master Curators”.
    • Can we “share” the curation workload?
  • 38. Identifier Dictionaries
    • Reciprocal curation processes…share curation with each other.
    • If a database has a compound already then use InChIKeys to match “suggested” validation against the compound.
    • A series of “added” and “removed” synonyms against InChIKeys for matching.
    • Who will participate???
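
The exchange described on this slide could be sketched as follows: one database publishes "added" and "removed" synonym decisions keyed by InChIKey, and another applies them only to compounds it already holds. The `SynonymDelta` format, the `apply_deltas` function and the InChIKeys below are hypothetical placeholders, not an existing ChemSpider interface.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class SynonymDelta(NamedTuple):
    """One shared curation decision (hypothetical exchange format)."""
    inchikey: str
    action: str   # "added" or "removed"
    synonym: str


def apply_deltas(dictionary: Dict[str, Set[str]],
                 deltas: Iterable[SynonymDelta]) -> Dict[str, List[SynonymDelta]]:
    """Apply deltas to compounds already present in the local dictionary.

    Deltas for unknown InChIKeys are skipped and reported, so a curator
    can review what was and was not taken over.
    """
    report: Dict[str, List[SynonymDelta]] = {"applied": [], "skipped": []}
    for delta in deltas:
        if delta.action not in ("added", "removed"):
            raise ValueError(f"unknown action: {delta.action!r}")
        synonyms = dictionary.get(delta.inchikey)
        if synonyms is None:
            # The local database does not hold this compound.
            report["skipped"].append(delta)
            continue
        if delta.action == "added":
            synonyms.add(delta.synonym)
        else:
            synonyms.discard(delta.synonym)
        report["applied"].append(delta)
    return report


# Placeholder InChIKeys and names, made up for illustration.
local = {"AAAAAAAAAAAAAA-BBBBBBBBSA-N": {"good name", "misassigned name"}}
shared = [
    SynonymDelta("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "added", "trade name"),
    SynonymDelta("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "removed", "misassigned name"),
    SynonymDelta("CCCCCCCCCCCCCC-DDDDDDDDSA-N", "added", "other name"),
]
report = apply_deltas(local, shared)
```

In practice a receiving database would probably treat such deltas as suggestions for its own curators, in line with the “suggested” validation wording on the slide, rather than applying them automatically as this sketch does.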
  • 39. Proof of Concept Data Curation Sharing
  • 40. Lessons Learned: Big vs Good!
  • 41. 15 compounds called Yohimbine; 54 skeletons for Yohimbine
  • 42. Aggregators suffer dilution…
  • 43. User Understanding of Data
    • Users searching “Yohimbine” expect to find it…not labeled versions of it, not ambiguous stereochemistries, not partial stereochemistries.
    • Data “aggregation” into a meaningful form is a major challenge. e.g. Assays for radiolabeled compounds linked to actual drugs.
    • Data curation efforts such as ChEMBL are essential!
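
One way to see how labeled, partial-stereo and fully defined versions of a compound sit side by side in an aggregated database is to group records by the first block of the standard InChIKey. That 14-character block is derived from the connectivity part of the InChI, while stereochemistry and isotopic labeling affect the second block. The sketch below assumes records are available as (name, InChIKey) pairs; the keys and names are made-up placeholders, not real identifiers for Yohimbine.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def skeleton_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    blocks = inchikey.split("-")
    if len(blocks) != 3 or len(blocks[0]) != 14:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return blocks[0]


def group_by_skeleton(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group record names that share the same connectivity block."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, inchikey in records:
        groups[skeleton_block(inchikey)].append(name)
    return dict(groups)


# Placeholder keys: three variants of one skeleton plus an unrelated compound.
records = [
    ("compound X (full stereo)", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("compound X (no stereo)", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("compound X (isotope-labeled)", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    ("compound Y", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),
]
groups = group_by_skeleton(records)
```

Grouping like this only collects the variants; deciding which record a user searching by name should land on remains a curation decision.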
  • 44. SciMobileApps.com
  • 45. SciDBs.com (Coming soon)
  • 46.
    • Open PHACTS: partnership between European Community and EFPIA
    • Freely accessible for knowledge discovery and verification.
      • Data on small molecules
      • Pharmacological profiles
      • Pharmacokinetics
      • ADMET data
      • Biological targets and pathways
      • Proprietary and public data sources.
  • 47. Standardization and Quality
    • Our initial approaches to standardization were imperfect. We are revisiting them to support Open PHACTS.
    • We were highly dependent on InChI and did not standardize enough before InChI generation.
    • InChI is excellent, though acknowledged to be imperfect. It is far better than SMILES for linking the internet!
  • 48. Conclusions
    • ChemSpider is one of many important chemistry resources on the internet
    • We have assumed an important role of curating and validating data – specifically name-structure dictionaries are of high importance but data validation is also key
    • We are a part of the federation of internet databases serving chemistry. MORE collaboration can serve us all better…how?
  • 49. Acknowledgments
    • Our development team – headed by THAT man..
    • Many in this room: InChI, PubChem, DSSTox, FDA, ChEBI/ChEMBL, SureChem, many more
    • Curators – special gratitude to Barrie Walker!
    • Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
  • 50. Thank you
    • Email: williamsa@rsc.org
    • Twitter: ChemConnector
    • Personal Blog: www.chemconnector.com
    • Slides: www.slideshare.net/AntonyWilliams