ChemSpider - Connecting and Curating Online Chemistry Resources

This is a presentation given at the European Bioinformatics Institute (EBI) in Cambridge on December 1st 2010, at an EMBL-EBI Industry Program Workshop on "Chemical Structure Resources". In it I unveiled details of the intra- and inter-validation studies of drug structures across multiple public domain chemistry databases. I also unveiled early results from the SurveyMonkey study of the "trust" that the community places in public domain chemistry resources.


Transcript

  • 1. ChemSpider - Connecting and Curating Online Chemistry Resources Antony Williams EBI, November 30th 2010
  • 2. Chemistry on the Internet
    • 100s of websites serving up chemistry data, SDF files of structures and data
    • Some primary resources: PubChem, ChEBI, DrugBank, ChemIDPlus, Wikipedia
    • ChemSpider “links” chemistry on the internet
      • Almost 25 million compounds, 400 data sources
      • Allows community deposition, curation, annotation
      • Integrating properties, publications, patents, media
      • Text, structure, substructure (in testing) searching
  • 3. www.chemspider.com
  • 4. Search for a Chemical
  • 5. Available Information…
    • Linked to vendors, safety data, toxicity, metabolism
  • 6. We Have Delivered the Vision
      • “Build a Structure Centric Community to Serve Chemists”
      • Integrate chemical structure data on the web
      • Create a “structure-based hub” to information, data and algorithmic predictions
      • Let chemists contribute their own data
      • Allow the community to curate/correct data
  • 7. How Did We Build It?
    • We deal in Molfiles or SDF files – including coordinates
    • We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
    • We have our own “business logic” to standardize
    • We use InChI to “aggregate tautomers” to one record
    • Link out to external sites where possible using IDs
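The filtering and aggregation steps on this slide can be sketched roughly as follows. This is a minimal illustration, not ChemSpider's actual pipeline: the record fields (`name`, `source`, `inchi`, `molblock`), the source labels and the function names are all hypothetical, the InChI strings are assumed to have been computed upstream by an InChI toolkit, and the charge check reads only V2000 `M  CHG` property lines (only that part of each molfile is shown).

```python
from collections import defaultdict


def net_charge(molblock: str) -> int:
    """Sum the formal charges declared on V2000 'M  CHG' property lines."""
    total = 0
    for line in molblock.splitlines():
        if line.startswith("M  CHG"):
            fields = line[6:].split()
            n_pairs = int(fields[0])  # number of (atom index, charge) pairs
            charges = fields[2 : 2 * n_pairs + 1 : 2]
            total += sum(int(c) for c in charges)
    return total


def aggregate(records):
    """Reject charge-imbalanced records, then merge the rest by InChI."""
    merged = defaultdict(lambda: {"names": set(), "sources": set()})
    rejected = []
    for rec in records:
        if net_charge(rec["molblock"]) != 0:
            rejected.append(rec)  # rudimentary filter before deposition
            continue
        entry = merged[rec["inchi"]]  # one record per InChI string
        entry["names"].add(rec["name"])
        entry["sources"].add(rec["source"])
    return dict(merged), rejected


# Hypothetical deposits: two sources supply ethanol, one supplies a bare cation.
RECORDS = [
    {"name": "ethanol", "source": "source_a",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "molblock": "M  END"},
    {"name": "ethyl alcohol", "source": "source_b",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "molblock": "M  END"},
    {"name": "sodium cation", "source": "source_b",
     "inchi": "InChI=1S/Na/q+1", "molblock": "M  CHG  1   1   1\nM  END"},
]

merged, rejected = aggregate(RECORDS)
```

Here the two ethanol deposits collapse into a single entry carrying both names and both sources, while the unbalanced cation is set aside for review. A real pipeline would also need valence checking and structure standardization, which this sketch omits.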
  • 8. Inherited Errors
    • We have inherited errors from every database… all public compound databases, including ours, have errors
    • “Incorrect” structures – assertions, timelines etc
    • “Incorrect” names associated with structures
    • Properties
    • Links
    • Publications
    • ENORMOUS CHALLENGE
  • 9. What is the Structure of Vitamin K?
  • 10. MeSH
    • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K3 (menadione). Vitamin K3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
  • 11. What is the Structure of Vitamin K1?
  • 12. What is the Structure of Vitamin K1?
  • 13. CAS’s Common Chemistry
  • 14. Wikipedia
  • 15.  
  • 16.  
  • 17. ChEBI – Manual Curation
  • 18.  
  • 19.  
  • 20. PubChem
  • 21.  
  • 22.
    • “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
    • Variants of systematic names on PubChem
    • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
    • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
    • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
    • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
    • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
    • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
    • 2-methyl-3-(3,7,11,15-tetramethyl
    • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
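The name variants listed on this slide differ only in their bracketed stereodescriptors. A small sketch, assuming the truncated prefixes exactly as transcribed above, shows that removing the descriptors collapses all of them to the same flat prefix. The regular expression and the function name are illustrative only; a full systematic name would also need its closing bracket handled, and proper name normalization is better done by parsing names to structures.

```python
import re

# Matches a bracketed stereodescriptor prefix such as "[(E,7R,11R)-" or "[(E)-".
STEREO_PREFIX = re.compile(r"\[\((?:[EZ]|\d+[RS])(?:,(?:[EZ]|\d+[RS]))*\)-")


def strip_stereo(name: str) -> str:
    """Replace the stereodescriptor prefix with a plain opening parenthesis."""
    return STEREO_PREFIX.sub("(", name)


# Truncated name prefixes as they appear in the slide transcript.
VARIANTS = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Every variant reduces to the same stereo-free prefix.
flat_names = {strip_stereo(v) for v in VARIANTS}
```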
  • 23. Public Domain Chemistry Databases
    • Our databases are a mess…
    • Non-curated databases are proliferating errors
    • We source and deposit data between databases
    • Original sources of errors hard to determine
    • Curation is time-consuming, challenging and exacting
    • An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs
  • 24.  
  • 25. Vytorin: Ezetimibe/Simvastatin
  • 26. Vytorin: Ezetimibe/Simvastatin
  • 27. Vytorin: Ezetimibe/Simvastatin
  • 28. Vytorin: Ezetimibe/Simvastatin
  • 29. Vytorin: Ezetimibe/Simvastatin
  • 30. Symbicort: Budesonide + Formoterol
  • 31. Symbicort: Budesonide + Formoterol ChemIDPlus Wikipedia
  • 32. DrugBank: Search Symbicort…
  • 33. Symbicort: Budesonide + Formoterol
    • PubChem
      • 8 structures called Budesonide. 1 “correct”
      • 6 structures called Formoterol. 1 “correct”
      • Search on “Symbicort” gives 1 structure.
  • 34. Taxol: Paclitaxel 44 structures
  • 35. Taxol: Paclitaxel Bioassay Data
  • 36. Taxol: Paclitaxel Bioassay Data
    • Most Bioassay data associated with structure with one ambiguous stereocenter
  • 37. Data on the Web – Good or Bad?? Taken from: Rafael Sidis’ Blog
  • 38. Data on the Registry
  • 39. Data on the Registry
  • 40. Data on the Registry
  • 41. How are data handled in Pharma?
    • Algorithms for “collapsing” data? Skeletons only?
    • Processing structure-name pairs?
    • Manual curation?
    • Does it matter relative to the noise in the measurements?
    • Do correct structure representations matter, and to whom?
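One way to picture a "skeletons only" collapse is to drop the stereochemistry layers from an InChI string and group records on what remains. The sketch below is an assumption about how such an algorithm might look, not a description of any pharma company's practice; the function names are hypothetical and the alanine and glycine InChI strings serve purely as illustrative inputs.

```python
from collections import defaultdict

# InChI layer prefixes that carry stereochemistry:
# b = double bond, t = tetrahedral, m = inversion flag, s = stereo type.
STEREO_LAYERS = ("b", "t", "m", "s")


def skeleton(inchi: str) -> str:
    """Return the InChI with its stereochemistry layers removed."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if layer[:1] not in STEREO_LAYERS]
    return "/".join([head, *kept])


def collapse(records):
    """Group (name, inchi) pairs by stereo-free skeleton."""
    groups = defaultdict(list)
    for name, inchi in records:
        groups[skeleton(inchi)].append(name)
    return dict(groups)


RECORDS = [
    ("L-alanine", "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"),
    ("D-alanine", "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"),
    ("alanine", "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"),
    ("glycine", "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"),
]

groups = collapse(RECORDS)
```

The three alanine records fall into one group and glycine stays on its own. Whether that loss of stereochemical detail is acceptable is exactly the question the slide raises, since bioassay data attached to the collapsed record can no longer be traced to a specific stereoisomer.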
  • 42. NLM’s DailyMed
  • 43. NLM’s DailyMed
  • 44. NLM’s DailyMed
  • 45.
    • Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
  • 46.  
  • 47.
    | Drug Name    | Generic Name         | ChEBI   | ChemSpider | CAS Com. Chem | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia    |
    |--------------|----------------------|---------|------------|---------------|------------|----------|----------|---------|--------------|
    | Spiriva      | Tiotropium Bromide   | No Hits |            | No Hits       |            |          |          | 4/0     |              |
    | Depakote     | Valproate semisodium |         |            |               |            |          |          |         | No Structure |
    | Basen        | Voglibose            |         |            | No Hits       |            | No Hits  |          | 2/1     |              |
    | Symbicort 1) | Budesonide           |         |            |               |            |          |          | 8/1     |              |
    | Symbicort 2) | Formoterol           | WRONG   |            | No Hits       |            |          |          | 6/1     |              |
    | Vytorin 1)   | Ezetimibe            |         |            | No Hits       |            |          |          |         |              |
    | Vytorin 2)   | Simvastatin          |         |            |               |            |          |          | 2/1     |              |
    | Taxol        | Paclitaxel           |         |            |               |            |          |          | 44/1    |              |
    | Thalidomid   | Thalidomide          | No Hits |            |               |            |          |          |         |              |
    | Zocor        | Simvastatin          |         |            |               |            |          |          | 2/1     |              |
    | Crestor      | Rosuvastatin         |         |            | No Hits       |            |          |          | 2/1     |              |
  • 48. Why Curated Dictionaries Matter
  • 49. Success Depends on Dictionaries
  • 50. Online Curation
    • Online databases generally do NOT allow curation or annotation
    • If you find errors they stay there!
    • ChemSpider allows immediate curation
  • 51. Crowdsourcing Works
    • Over 100 people have deposited data (structures, spectra, etc) and participated in data curation
    • Different level curators check each other's work
    • Wikipedia is the modern primary example
    • Some curators are “madmen”…
  • 52. Crowdsourcing Works
    • Over 100 people have deposited data (structures, spectra, etc) and participated in data curation
    • Different level curators check each other's work
    • Wikipedia is the modern primary example
    • Some curators are “madmen”…
    • The Oxford English Dictionary
  • 53. Collaborative Data Curation
    • How can we COLLECTIVELY clean online data?
    • ChemSpider has inherited junk from >400 data sources. Some of this has proliferated into PubChem. We should deprecate it.
    • We need to develop a way to share curation actions back to original data sources
    • A mindset of bigger is better is problematic. How many “real chemicals” are in the public databases?
  • 54. ChemSpider
    • ChemSpider is free to use.
    • Multiple web services are available.
    • New data added daily.
    • Curation and data validation ongoing every day.
    • Provided by the RSC.
    • www.chemspider.com
  • 55. Thank you Email: williamsa@rsc.org Twitter: ChemConnector Blog: www.chemspider.com/blog Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams