ChemSpider: Hosting, Linking and Curating Chemistry Data for the Community


Published in: Technology, Education

  1. 1. ChemSpider – Hosting, Linking and Curating Chemistry Data for the Community. Valery Tkachenko, SLA Meeting, June 2011
  2. 2. Chemistry on the Internet
      • 100s of websites hosting chemistry-related data
      • Chemistry information is generally “compound-based”
          • Chemical “structures”
          • Identifiers, names and synonyms
          • Properties
          • Analytical data
          • How to synthesize
          • Articles, patents, safety information
      • Chemistry “language and dialects”
  3. 3. Dialects describing chemicals
  4. 4. A Pragmatic Vision
      • “Build a Structure-Centric Community”
      • Integrate chemistry across the internet based on “chemical structure”
          • A “structure-based hub” to information and data
          • Let chemists contribute their own data
          • Allow the community to curate & annotate data
  5. 5.
  6. 6. Answering Questions for Chemists
      • Questions a chemist might ask…
          • What is the melting point of n-heptanol?
          • What is the chemical structure of Xanax?
          • Chemically, what is phenolphthalein?
          • What are the stereocenters of cholesterol?
          • Where can I find publications about xylene?
          • What are the different trade names for Aspirin?
          • What is the NMR spectrum of Benzoic Acid?
          • What are the safety handling issues for toluene?
  7. 7. Search for a Chemical…by name
  8. 8. Available Information…
      • Linked to chemical vendors, safety data, toxicity, metabolism…
  9. 9. Available Information…
  10. 10. ChemSpider Today
      • Over 26 million unique chemicals
      • Over 420 data sources
      • Grows daily – community and RSC depositions
      • Community annotation and curation
      • We curate, edit, change, enhance data daily
  11. 11. Three Years of Experience
      • Internet-based chemistry is a mess!
      • Public compound databases are contaminated
      • The annotation/curation of data online is difficult
      • Most database hosts are non-responsive to feedback – “We are a host/repository of data”
      • Who cares? We all should!!!
  12. 12. Linked Data on the Web
  13. 13. Where is chemistry online?
      • Encyclopedic articles (Wikipedia)
      • Chemical vendor databases
      • Metabolic pathway databases
      • Property databases
      • Patents with chemical structures
      • Drug Discovery data
      • Scientific publications
      • Compound aggregators
      • Blogs/Wikis and Open Notebook Science
  14. 14. What is the Structure of Vitamin K1?
  15. 15. What is the Structure of Vitamin K1?
  16. 16. Chemical Abstracts “Common Chemistry” Database
  17. 17. Wikipedia
  18. 20. Internet-Based Chemistry is a Mess
      • Algorithms can only get you so far
      • Human curation is necessary
      • Only the crowds can help with big data… ChemSpider holds over 26 million compounds
      • Imagine if we worked together to create a centralized, validated structure-name dictionary! It would enhance text-mining, searching, linking…
  19. 21. Search “Vitamin H”
  20. 22. Search “Vitamin H”
  21. 23. “Curate” Identifiers
  22. 24. “Curate” Identifiers
  23. 25. “Curate” Identifiers
  24. 26. Crowd-sourcing Chemistry Curation
      • Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
  25. 27. “Curate” Identifiers
      • General curation activities
          • Remove incorrect names
          • Correct spellings
          • Add multilingual names
          • Add alternative names
      • In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually
      • 130 people have participated in validation or annotation. “Crowds” can be quite small!
  26. 28. Vancomycin – Curate This!!!
  27. 29. Vancomycin on ChemSpider: 1 compound – 3 days
  28. 30. Crowdsourced “Annotations”
      • Users can add
          • Descriptions/Syntheses/Commentaries
          • Links to articles
          • Spectral data
          • Photos
          • MP3 files
          • Videos
  29. 31. Multimedia Content Holder
  30. 32. Gaming for Validation of Spectra
  31. 33. Crowdsourced Validation of Spectra
  32. 34. “Game-based” Validation of Data
  33. 35. ChemSpider SyntheticPages
  34. 36. Sharing Our Activities
      • Presently defining approaches with other public compound databases to share results of curation activities
      • Member of a large European project to link data from the life sciences. Sharing results of curation is essential
      • Making curation and contribution interfaces mobile
  35. 37. Thank you Email: Twitter: ChemConnector Blog: Personal Blog: SLIDES:
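
The “structure-based hub” of slide 4 and the identifier curation of slides 26 and 27 can be illustrated with a small Python sketch. This is not ChemSpider's implementation: the class names, the status labels and the placeholder structure keys are all hypothetical, and a real system would key records on a canonical structure representation rather than an opaque string.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical curation outcomes; the slides do not name specific statuses.
VALID_STATUSES = {"validated", "rejected"}


@dataclass
class CompoundRecord:
    """Everything the hub knows about one chemical structure."""
    structure_key: str                                    # assumed canonical key
    names: Dict[str, str] = field(default_factory=dict)   # name -> curation status
    links: Dict[str, str] = field(default_factory=dict)   # data source -> link or ID


class StructureHub:
    """Toy structure-centric index: one record per structure key."""

    def __init__(self) -> None:
        self._records: Dict[str, CompoundRecord] = {}

    def deposit(self, structure_key: str, name: str, source: str, link: str) -> CompoundRecord:
        """Attach a name and a source link to a structure, creating the record if needed."""
        record = self._records.setdefault(structure_key, CompoundRecord(structure_key))
        record.names.setdefault(name, "unvalidated")  # new names start uncurated
        record.links[source] = link
        return record

    def curate(self, structure_key: str, name: str, status: str) -> None:
        """Record a curator's decision on one structure-identifier relationship."""
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status}")
        record = self._records[structure_key]
        if name not in record.names:
            raise KeyError(name)
        record.names[name] = status

    def search_by_name(self, name: str) -> List[str]:
        """Return structure keys whose non-rejected names match, ignoring case."""
        wanted = name.casefold()
        return [
            record.structure_key
            for record in self._records.values()
            if any(
                n.casefold() == wanted and status != "rejected"
                for n, status in record.names.items()
            )
        ]


# Two sources attach the same name to different structures; a curator rejects one.
hub = StructureHub()
hub.deposit("STRUCTURE-KEY-BIOTIN", "Vitamin H", "vendor-db", "vendor-db/123")
hub.deposit("STRUCTURE-KEY-OTHER", "vitamin h", "aggregator", "aggregator/987")
hub.curate("STRUCTURE-KEY-OTHER", "vitamin h", "rejected")
print(hub.search_by_name("Vitamin H"))  # ['STRUCTURE-KEY-BIOTIN']
```

Rejected names are kept in the record rather than deleted, so a later deposition of the same wrong name does not silently reintroduce it as a search hit.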