ChemSpider - Connecting and Curating Online Chemistry Resources, Antony Williams, EBI, November 30th 2010
This presentation was given at the European Bioinformatics Institute (EBI) in Cambridge on December 1st 2010, at an EMBL-EBI Industry Program Workshop on "Chemical Structure Resources". It is where I unveiled details of the intra/inter-validation studies of drug structures across multiple public domain chemistry databases. I also unveiled early results from the SurveyMonkey study of the "trust" that the community has in public domain chemistry resources.

Transcript of "ChemSpider - Connecting and Curating Online Chemistry Resources"

  1. 1. ChemSpider - Connecting and Curating Online Chemistry Resources, Antony Williams, EBI, November 30th 2010
  2. 2. Chemistry on the Internet
       • 100s of websites serving up chemistry data, SDF files of structures and data
       • Some primary resources: PubChem, ChEBI, DrugBank, ChemIDPlus, Wikipedia
       • ChemSpider “links” chemistry on the internet
           • Almost 25 million compounds, 400 data sources
           • Allows community deposition, curation, annotation
           • Integrating properties, publications, patents, media
           • Text, structure, substructure (in testing) searching
  3. 3. www.chemspider.com
  4. 4. Search for a Chemical
  5. 5. Available Information…
       • Linked to vendors, safety data, toxicity, metabolism
  6. 6. We Have Delivered the Vision
       • “Build a Structure Centric Community to Serve Chemists”
       • Integrate chemical structure data on the web
       • Create a “structure-based hub” to information, data and algorithmic predictions
       • Let chemists contribute their own data
       • Allow the community to curate/correct data
  7. 7. How Did We Build It?
       • We deal in Molfiles or SDF files – including coordinates
       • We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
       • We have our own “business logic” to standardize
       • We use InChI to “aggregate tautomers” to one record
       • Link out to external sites where possible using IDs
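The filtering and aggregation steps on this slide can be pictured with a minimal Python sketch. It is not ChemSpider's code: the record layout, the source names and the `passes_charge_check` and `aggregate_by_inchi` helpers are hypothetical, and the InChI strings are assumed to have been computed upstream by a real InChI toolkit. Whether two tautomers actually end up with the same InChI depends on that toolkit and its options, which the sketch does not model; it only shows the grouping step.

```python
from collections import defaultdict

# Hypothetical deposited records: source name, identifier within that source,
# per-atom formal charges, and an InChI string computed elsewhere.
RECORDS = [
    {"source": "source_a", "id": "A-1", "charges": [0, 0, 0],
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},  # ethanol
    {"source": "source_b", "id": "B-7", "charges": [0, 0, 0],
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},  # ethanol again
    {"source": "source_a", "id": "A-2", "charges": [0],
     "inchi": "InChI=1S/CH4/h1H4"},                  # methane
    {"source": "source_b", "id": "B-9", "charges": [1, 0],
     "inchi": None},  # deliberately unbalanced toy record
]


def passes_charge_check(record):
    """Rudimentary filter: accept only records whose formal charges sum to zero."""
    return sum(record["charges"]) == 0


def aggregate_by_inchi(records):
    """Collect the (source, id) links of accepted records under one key per InChI."""
    merged = defaultdict(list)
    for rec in records:
        if passes_charge_check(rec):
            merged[rec["inchi"]].append((rec["source"], rec["id"]))
    return dict(merged)


merged = aggregate_by_inchi(RECORDS)
for inchi, links in merged.items():
    print(inchi, links)
```

Keeping the `(source, id)` pairs on each merged record is what makes the "link out using IDs" step possible afterwards.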
  8. 8. Inherited Errors
       • We have inherited errors from every database… all public compound databases, including ours, have errors
       • “Incorrect” structures – assertions, timelines etc.
       • “Incorrect” names associated with structures
       • Properties
       • Links
       • Publications
       • ENORMOUS CHALLENGE
  9. 9. What is the Structure of Vitamin K?
  10. 10. MeSH
       • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K3 (menadione). Vitamin K3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
  11. 11. What is the Structure of Vitamin K1?
  12. 12. What is the Structure of Vitamin K1?
  13. 13. CAS’s Common Chemistry
  14. 14. Wikipedia
  15. 17. ChEBI – Manual Curation
  16. 20. PubChem
  17. 22. “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
       • Variants of systematic names on PubChem
           • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
           • 2-methyl-3-(3,7,11,15-tetramethyl
           • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  18. 23. Public Domain Chemistry Databases
       • Our databases are a mess…
       • Non-curated databases are proliferating errors
       • We source and deposit data between databases
       • Original sources of errors hard to determine
       • Curation is time-consuming, challenging and exacting
       • An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs
  19. 25. Vytorin: Ezetimibe/Simvastatin
  20. 26. Vytorin: Ezetimibe/Simvastatin
  21. 27. Vytorin: Ezetimibe/Simvastatin
  22. 28. Vytorin: Ezetimibe/Simvastatin
  23. 29. Vytorin: Ezetimibe/Simvastatin
  24. 30. Symbicort: Budesonide + Formoterol
  25. 31. Symbicort: Budesonide + Formoterol ChemIDPlus Wikipedia
  26. 32. DrugBank: Search Symbicort…
  27. 33. Symbicort: Budesonide + Formoterol
       • PubChem
           • 8 structures called Budesonide. 1 “correct”
           • 6 structures called Formoterol. 1 “correct”
           • Search on “Symbicort” gives 1 structure.
  28. 34. Taxol: Paclitaxel 44 structures
  29. 35. Taxol: Paclitaxel Bioassay Data
  30. 36. Taxol: Paclitaxel Bioassay Data
       • Most Bioassay data associated with structure with one ambiguous stereocenter
  31. 37. Data on the Web – Good or Bad?? Taken from: Rafael Sidis’ Blog
  32. 38. Data on the Registry
  33. 39. Data on the Registry
  34. 40. Data on the Registry
  35. 41. How are data handled in Pharma?
       • Algorithms for “collapsing” data? Skeletons only?
       • Processing structure-name pairs?
       • Manual curation?
       • Does it matter relative to the noise in the measurements?
       • Do correct structure representations matter, and to whom?
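One possible reading of “collapsing” to skeletons is to compare structures with their stereochemistry ignored. The Python sketch below does this by dropping the stereo layers (`/b`, `/t`, `/m`, `/s`) from an InChI string. It is an illustration only, not a description of how any pharma registry or database works: the `skeleton_key` helper is hypothetical, the alanine InChI strings are included purely as small test inputs, and the result is a grouping key rather than a valid InChI.

```python
# Layer prefixes that carry stereochemistry in an InChI string:
# /b double-bond, /t tetrahedral, /m and /s stereo type markers.
STEREO_PREFIXES = ("b", "t", "m", "s")


def skeleton_key(inchi):
    """Return a grouping key: the InChI with its stereo layers removed."""
    parts = inchi.split("/")
    version, formula, rest = parts[0], parts[1], parts[2:]
    kept = [layer for layer in rest if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([version, formula, *kept])


# Small test inputs: alanine with L, D and unspecified stereochemistry.
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
FLAT_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

keys = {skeleton_key(s) for s in (L_ALA, D_ALA, FLAT_ALA)}
print(len(keys), "distinct skeleton(s)")
```

Collapsing this way trades precision for recall: all stereo variants of a name land in one group, which is useful for finding candidates to curate but loses exactly the distinctions the curation is meant to settle.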
  36. 42. EPA’s DailyMed
  37. 43. EPA’s DailyMed
  38. 44. EPA’s DailyMed
  39. 45. Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
  40. 47. Drug Name  Generic Name  ChEBI  ChemSpider  CAS Com. Chem  ChemIDPlus  DailyMed  DrugBank  PubChem  Wikipedia
       Spiriva  Tiotropium Bromide  No Hits  No Hits    4/0
       Depakote  Valproate semisodium        No Structure
       Basen  Voglibose   No Hits  No Hits  2/1
       Symbicort  1) Budesonide       8/1
       Symbicort  2) Formoterol  WRONG  No Hits    6/1
       Vytorin  1) Ezetimibe   No Hits
       Vytorin  2) Simvastatin       2/1
       Taxol  Paclitaxel       44/1
       Thalidomid  Thalidomide  No Hits
       Zocor  Simvastatin       2/1
       Crestor  Rosuvastatin   No Hits    2/1
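Entries such as 8/1 appear to pair the number of structures returned for a name with the number judged correct, as the Symbicort slide spells out for budesonide (8 structures, 1 “correct”). Under that assumption, a short Python sketch can rank the names by how ambiguous the search result is. The counts are copied from the table; the `parse_cell` and `fraction_correct` helpers and the reading of N/M as hits/correct are assumptions for illustration.

```python
# Counts copied from the table; "hits/correct" is an assumed reading of N/M.
COUNTS = {
    "Tiotropium Bromide": "4/0",
    "Voglibose": "2/1",
    "Budesonide": "8/1",
    "Formoterol": "6/1",
    "Simvastatin": "2/1",
    "Paclitaxel": "44/1",
    "Rosuvastatin": "2/1",
}


def parse_cell(cell):
    """Split an 'N/M' cell into (structures returned, structures judged correct)."""
    hits, correct = (int(x) for x in cell.split("/"))
    return hits, correct


def fraction_correct(cell):
    """Share of returned structures that were judged correct."""
    hits, correct = parse_cell(cell)
    return correct / hits if hits else 0.0


# Most ambiguous names first (lowest share of correct structures).
ranked = sorted(COUNTS, key=lambda name: fraction_correct(COUNTS[name]))
for name in ranked:
    hits, correct = parse_cell(COUNTS[name])
    print(f"{name}: {correct} of {hits} structures judged correct")
```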
  41. 48. Why Curated Dictionaries Matter
  42. 49. Success Depends on Dictionaries
  43. 50. Online Curation
       • Online databases generally do NOT allow curation or annotation
       • If you find errors they stay there!
       • ChemSpider allows immediate curation
  44. 51. Crowdsourcing Works
       • Over 100 people have deposited data (structures, spectra, etc.) and participated in data curation
       • Curators at different levels check each other's work
       • Wikipedia is the modern primary example
       • Some curators are “madmen”…
  45. 52. Crowdsourcing Works
       • Over 100 people have deposited data (structures, spectra, etc.) and participated in data curation
       • Curators at different levels check each other's work
       • Wikipedia is the modern primary example
       • Some curators are “madmen”…
       • The Oxford English Dictionary
  46. 53. Collaborative Data Curation
       • How can we COLLECTIVELY clean online data?
       • ChemSpider has inherited junk from >400 data sources. Some of this has proliferated into PubChem. We should deprecate it.
       • We need to develop a way to share curation actions back to original data sources
       • A mindset of bigger is better is problematic. How many “real chemicals” are in the public databases?
  47. 54. ChemSpider
       • ChemSpider is free to use.
       • Multiple web services are available.
       • New data added daily.
       • Curation and data validation ongoing every day.
       • Provided by the RSC.
       • www.chemspider.com
  48. 55. Thank you
       • Email: williamsa@rsc.org
       • Twitter: ChemConnector
       • Blog: www.chemspider.com/blog
       • Personal Blog: www.chemconnector.com
       • SLIDES: www.slideshare.net/AntonyWilliams