ChemSpider - Connecting and Curating Online Chemistry Resources



This presentation was given at the European Bioinformatics Institute (EBI) in Cambridge on December 1st 2010, at an EMBL-EBI Industry Program Workshop on "Chemical Structure Resources". In it I unveiled details of the intra/inter-validation studies of drug structures across multiple public domain chemistry databases. I also unveiled early results from the SurveyMonkey study of the "trust" that the community has in public domain chemistry resources.

Published in: Technology, Education


  1. 1. ChemSpider - Connecting and Curating Online Chemistry Resources
       • Antony Williams, EBI, November 30th 2010
  2. 2. Chemistry on the Internet
       • 100s of websites serving up chemistry data, SDF files of structures and data
       • Some primary resources: PubChem, ChEBI, DrugBank, ChemIDPlus, Wikipedia
       • ChemSpider “links” chemistry on the internet
           • Almost 25 million compounds, 400 data sources
           • Allows community deposition, curation, annotation
           • Integrating properties, publications, patents, media
           • Text, structure, substructure (in testing) searching
  3. 3.
  4. 4. Search for a Chemical
  5. 5. Available Information…
       • Linked to vendors, safety data, toxicity, metabolism
  6. 6. We Have Delivered the Vision
       • “Build a Structure Centric Community to Serve Chemists”
       • Integrate chemical structure data on the web
       • Create a “structure-based hub” to information, data and algorithmic predictions
       • Let chemists contribute their own data
       • Allow the community to curate/correct data
  7. 7. How Did We Build It?
       • We deal in Molfiles or SDF files, including coordinates
       • We do rudimentary filtering (valence checking, charge imbalance) prior to deposition
       • We have our own “business logic” to standardize
       • We use InChI to “aggregate tautomers” to one record
       • Link out to external sites where possible using IDs
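Slide 7 names two steps that are easy to picture in code: rudimentary filtering before deposition, and using InChI to fold several records into one. The sketch below is only an illustration of that idea, not ChemSpider's actual pipeline. The record layout, the `MAX_VALENCE` table, the function names and the placeholder InChI values are all assumptions made for this example, and the InChI strings are taken as already computed by some other tool.

```python
from collections import defaultdict

# Crude, assumed valence limits for neutral atoms; a real pipeline
# would need a fuller table and charge-aware rules.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def net_charge(atoms):
    """Sum formal charges over a list of (symbol, charge) pairs."""
    return sum(charge for _symbol, charge in atoms)


def valence_errors(atoms, bonds):
    """Return indices of neutral atoms whose bond orders exceed the limit."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [
        idx
        for idx, (symbol, charge) in enumerate(atoms)
        if charge == 0 and symbol in MAX_VALENCE and used[idx] > MAX_VALENCE[symbol]
    ]


def passes_filter(record):
    """Accept a record only if it is charge balanced and has no valence errors."""
    return net_charge(record["atoms"]) == 0 and not valence_errors(
        record["atoms"], record["bonds"]
    )


def aggregate_by_inchi(records):
    """Group record ids that share the same InChI string."""
    groups = defaultdict(list)
    for record in records:
        groups[record["inchi"]].append(record["id"])
    return dict(groups)


# Toy records; ids and InChI values are placeholders, not real identifiers.
records = [
    # r1 and r2 stand in for two depositions that received the same InChI.
    {"id": "r1", "atoms": [("C", 0), ("O", 0)], "bonds": [(0, 1, 2)], "inchi": "placeholder-A"},
    {"id": "r2", "atoms": [("C", 0), ("O", 0)], "bonds": [(0, 1, 2)], "inchi": "placeholder-A"},
    # Carbon 0 carries five single bonds, so it fails the valence check.
    {"id": "r3", "atoms": [("C", 0)] * 6, "bonds": [(0, k, 1) for k in range(1, 6)], "inchi": "placeholder-B"},
    # A lone cation with no counter-ion fails the charge-balance check.
    {"id": "r4", "atoms": [("N", 1)], "bonds": [], "inchi": "placeholder-C"},
]

accepted = [r for r in records if passes_filter(r)]
merged = aggregate_by_inchi(accepted)
print(merged)
```

Keying the merge on the identifier string rather than on the drawn structure is what lets differently drawn depositions land in one record; whether two tautomers actually share an InChI depends on how the identifier was generated.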
  8. 8. Inherited Errors
       • We have inherited errors from every database… all public compound databases, including ours, have errors
       • “Incorrect” structures: assertions, timelines, etc.
       • “Incorrect” names associated with structures
       • Properties
       • Links
       • Publications
       • ENORMOUS CHALLENGE
  9. 9. What is the Structure of Vitamin K?
  10. 10. MeSH
       • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K3 (menadione). Vitamin K3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
  11. 11. What is the Structure of Vitamin K1?
  12. 12. What is the Structure of Vitamin K1?
  13. 13. CAS’s Common Chemistry
  14. 14. Wikipedia
  15. 17. ChEBI – Manual Curation
  16. 20. PubChem
  17. 22.
       • “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
       • Variants of systematic names on PubChem
       • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
       • 2-methyl-3-(3,7,11,15-tetramethyl
       • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  18. 23. Public Domain Chemistry Databases
       • Our databases are a mess…
       • Non-curated databases are proliferating errors
       • We source and deposit data between databases
       • Original sources of errors hard to determine
       • Curation is time-consuming, challenging and exacting
       • An examination of quality in databases: inter/intra lab comparison of processes for 150 drugs
  19. 25. Vytorin: Ezetimibe/Simvastatin
  20. 26. Vytorin: Ezetimibe/Simvastatin
  21. 27. Vytorin: Ezetimibe/Simvastatin
  22. 28. Vytorin: Ezetimibe/Simvastatin
  23. 29. Vytorin: Ezetimibe/Simvastatin
  24. 30. Symbicort: Budesonide + Formoterol
  25. 31. Symbicort: Budesonide + Formoterol ChemIDPlus Wikipedia
  26. 32. DrugBank: Search Symbicort…
  27. 33. Symbicort: Budesonide + Formoterol
       • PubChem
           • 8 structures called Budesonide. 1 “correct”
           • 6 structures called Formoterol. 1 “correct”
           • Search on “Symbicort” gives 1 structure.
  28. 34. Taxol: Paclitaxel 44 structures
  29. 35. Taxol: Paclitaxel Bioassay Data
  30. 36. Taxol: Paclitaxel Bioassay Data
       • Most Bioassay data associated with structure with one ambiguous stereocenter
  31. 37. Data on the Web – Good or Bad? Taken from: Rafael Sidi’s Blog
  32. 38. Data on the Registry
  33. 39. Data on the Registry
  34. 40. Data on the Registry
  35. 41. How are data handled in Pharma?
       • Algorithms for “collapsing” data? Skeletons only?
       • Processing structure-name pairs?
       • Manual curation?
       • Does it matter relative to the noise in the measurements?
       • Do correct structure representations matter, and to whom?
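Slide 41 asks whether data are "collapsed" to skeletons only. One way to picture such a collapse is to drop the stereochemistry layers from an InChI string and group records on what remains. The sketch below assumes that approach purely for illustration; the slides do not say which algorithm any company uses. The function names are hypothetical, and the three alanine-style strings are sample input included for illustration, not verified identifiers.

```python
from collections import defaultdict

# InChI layer prefixes that carry stereochemistry:
# b = double bond, t = tetrahedral, m and s = stereo flags.
STEREO_PREFIXES = ("b", "t", "m", "s")


def skeleton_key(inchi):
    """Drop stereo layers, keeping the version prefix and formula untouched.

    The result is only a grouping key, not a valid standard InChI.
    """
    layers = inchi.split("/")
    head, rest = layers[:2], layers[2:]
    kept = [layer for layer in rest if not layer.startswith(STEREO_PREFIXES)]
    return "/".join(head + kept)


def collapse(inchis):
    """Group InChI strings that share the same stereo-free skeleton key."""
    groups = defaultdict(list)
    for inchi in inchis:
        groups[skeleton_key(inchi)].append(inchi)
    return dict(groups)


# Sample input: two enantiomer-style strings and one without stereo layers.
examples = [
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1",
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1",
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)",
]

for key, members in collapse(examples).items():
    print(key, len(members))
```

Collapsing this way trades precision for recall: all three inputs fall into one group, which is convenient for linking data but hides exactly the stereochemical differences the earlier slides show to be a common source of error.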
  36. 42. NLM’s DailyMed
  37. 43. NLM’s DailyMed
  38. 44. NLM’s DailyMed
  39. 45.
       • Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
  40. 47.

       | Drug Name  | Generic Name         | ChEBI   | ChemSpider | CAS Common Chemistry | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia    |
       |------------|----------------------|---------|------------|----------------------|------------|----------|----------|---------|--------------|
       | Spiriva    | Tiotropium Bromide   | No Hits |            | No Hits              |            |          |          | 4/0     |              |
       | Depakote   | Valproate semisodium |         |            |                      |            |          |          |         | No Structure |
       | Basen      | Voglibose            |         |            | No Hits              |            | No Hits  |          | 2/1     |              |
       | Symbicort  | 1) Budesonide        |         |            |                      |            |          |          | 8/1     |              |
       | Symbicort  | 2) Formoterol        | WRONG   |            | No Hits              |            |          |          | 6/1     |              |
       | Vytorin    | 1) Ezetimibe         |         |            | No Hits              |            |          |          |         |              |
       | Vytorin    | 2) Simvastatin       |         |            |                      |            |          |          | 2/1     |              |
       | Taxol      | Paclitaxel           |         |            |                      |            |          |          | 44/1    |              |
       | Thalidomid | Thalidomide          | No Hits |            |                      |            |          |          |         |              |
       | Zocor      | Simvastatin          |         |            |                      |            |          |          | 2/1     |              |
       | Crestor    | Rosuvastatin         |         |            | No Hits              |            |          |          | 2/1     |              |
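The PubChem column on slide 47 appears to pair the number of structures returned for a name with the number judged correct, in line with slide 33 ("8 structures called Budesonide. 1 “correct”"). Assuming that reading, a tally of this kind could be sketched as follows. The function name, the drug names and the structure keys are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict


def tally_by_name(hits, reference):
    """Count distinct structures per name and how many match the reference.

    hits: iterable of (name, structure_key) pairs returned by a name search.
    reference: mapping of name -> structure_key accepted as correct.
    Returns a mapping of name -> "found/correct" string.
    """
    distinct = defaultdict(set)
    for name, key in hits:
        distinct[name].add(key)
    summary = {}
    for name, keys in distinct.items():
        correct = 1 if reference.get(name) in keys else 0
        summary[name] = f"{len(keys)}/{correct}"
    return summary


# Placeholder search results: names and keys are invented for illustration.
hits = [
    ("drug-x", "key-1"),
    ("drug-x", "key-2"),
    ("drug-x", "key-2"),  # the same structure deposited twice counts once
    ("drug-x", "key-3"),
    ("drug-y", "key-4"),
    ("drug-y", "key-5"),
]
reference = {"drug-x": "key-2", "drug-y": "key-9"}

print(tally_by_name(hits, reference))
```

Here `drug-x` comes out as `3/1` and `drug-y` as `2/0`, the latter being the case where a name search returns structures but none of them is the accepted one.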
  41. 48. Why Curated Dictionaries Matter
  42. 49. Success Depends on Dictionaries
  43. 50. Online Curation
       • Online databases generally do NOT allow curation or annotation
       • If you find errors they stay there!
       • ChemSpider allows immediate curation
  44. 51. Crowdsourcing Works
       • Over 100 people have deposited data (structures, spectra, etc.) and participated in data curation
       • Curators at different levels check each other’s work
       • Wikipedia is the modern primary example
       • Some curators are “madmen”…
  45. 52. Crowdsourcing Works
       • Over 100 people have deposited data (structures, spectra, etc.) and participated in data curation
       • Curators at different levels check each other’s work
       • Wikipedia is the modern primary example
       • Some curators are “madmen”…
       • The Oxford English Dictionary
  46. 53. Collaborative Data Curation
       • How can we COLLECTIVELY clean online data?
       • ChemSpider has inherited junk from >400 data sources. Some of this has proliferated into PubChem. We should deprecate it.
       • We need to develop a way to share curation actions back to original data sources
       • A mindset of bigger is better is problematic. How many “real chemicals” are in the public databases?
  47. 54. ChemSpider
       • ChemSpider is free to use.
       • Multiple web services are available.
       • New data added daily.
       • Curation and data validation ongoing every day.
       • Provided by the RSC.
  48. 55. Thank you Email: Twitter: ChemConnector Blog: Personal Blog: SLIDES: