Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data? Antony Williams, Wolfram Summit, September 2010

This presentation was given at the Wolfram Data Summit in Washington, DC, on September 9, 2010, as part of a panel series of presentations and discussions on crowdsourcing approaches for data. It was a rant by me on the quality of what's online, and it asks "who cares?"

  1. Crowdsourced Curation of Chemistry Data. How Bad is Online Chemistry Data? Antony Williams, Wolfram Summit, September 2010
  2. A Pragmatic Vision
     • “Build a Structure Centric Community”
     • Integrate chemistry across the internet based on “chemical structure”
       • A “structure-based hub” to information and data
       • Let chemists contribute their own data
       • Allow the community to curate/correct data
  3. www.chemspider.com
  4. We Answer Questions for Chemists
     • Questions a chemist might ask…
       • What is the melting point of n-heptanol?
       • What is the chemical structure of Xanax?
       • Chemically, what is phenolphthalein?
       • What are the stereocenters of cholesterol?
       • Where can I find publications about xylene?
       • What are the different trade names for Aspirin?
       • What is the NMR spectrum of Benzoic Acid?
       • What are the safety handling issues for toluene?
  5. Search for a Chemical…by name
  6. Available Information…
     • Linked to vendors, safety data, toxicity, metabolism
  7. Available Information…
  8. Search for chemicals
  9. ChemSpider Today
     • 24.8 million structures
     • 400 data sources
     • Grows daily
     • Community annotation and curation
     • We curate, edit, change, enhance data daily
  10. Linked Data on the Web
  11. Three Years of Experience
     • Internet-based chemistry is a mess!
     • Most public compound databases on the web are contaminated. Including ours!
     • The annotation/curation of data online is difficult
     • Most database hosts are non-responsive to feedback – “We are a host/repository of data”
     • Who cares?
  12. Where is chemistry online?
     • Encyclopedic articles (Wikipedia)
     • Chemical vendor databases
     • Metabolic pathway databases
     • Property databases
     • Patents with chemical structures
     • Drug Discovery data
     • Scientific publications
     • Compound aggregators
     • Blogs/Wikis and Open Notebook Science
  13. What is the Structure of Vitamin K?
  14. MeSH – Medical Subject Headings
     • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K3 (menadione). Vitamin K3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
  15. What is the Structure of Vitamin K1?
  16. What is the Structure of Vitamin K1?
  17. Chemical Abstracts “Common Chemistry” Database
  18. Wikipedia
  20. Incorrect Structures
  21. Wow!
  22. Lack of Stereochemistry
  23. Does stereochemistry matter?
     • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide
  27. PubChem
  29. “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
     • Variants of systematic names on PubChem
       • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
       • 2-methyl-3-(3,7,11,15-tetramethyl
       • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  30. ChEBI – Manual Curation
  34. What’s Methane?
  35. What’s Methane?
  36. What ELSE is Methane???
  37. The EXPERTS must get it right?!
  38. Wikipedia, C&E News, PubChem
     • C&E News (from ACS)
  39. Internet-Based Chemistry is a Mess
     • Algorithms can only get you so far
     • Human curation is necessary
     • Only the crowds can help with big data… ChemSpider is approaching 25 million compounds
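As one illustration of where an algorithm stops and a human has to take over, the sketch below flags any name that points at more than one structure (the "What ELSE is Methane???" situation) and hands it to a reviewer. The record IDs and structure keys are placeholders invented for the example. A real system would presumably compare something like an InChIKey, but that is an assumption, not a description of how ChemSpider works.

```python
from collections import defaultdict

# Hypothetical records: (record_id, structure_key, name).
RECORDS = [
    ("rec-1", "structure-A", "methane"),
    ("rec-2", "structure-B", "Methane"),
    ("rec-3", "structure-A", "marsh gas"),
    ("rec-4", "structure-C", "vitamin H"),
]


def names_needing_review(records):
    """Map each name that is attached to several structures to those structures."""
    structures_by_name = defaultdict(set)
    for _record_id, structure_key, name in records:
        # Compare names case-insensitively and ignore stray whitespace.
        structures_by_name[name.strip().lower()].add(structure_key)
    return {
        name: sorted(keys)
        for name, keys in structures_by_name.items()
        if len(keys) > 1
    }


flagged = names_needing_review(RECORDS)
print(flagged)  # {'methane': ['structure-A', 'structure-B']}
```

The check can say that the two records conflict. It cannot say which structure is right, and that is the part left to a curator.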
  40. Search “Vitamin H”
  41. Search “Vitamin H”
  42. “Curate” Identifiers
  43. “Curate” Identifiers
  44. “Curate” Identifiers
  45. “Curate” Identifiers
     • General curation activities
       • Remove incorrect names
       • Correct spellings
       • Add multilingual names
       • Add alternative names
     • In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually
     • 130 people have participated in validation or annotation. “Crowds” can be quite small!
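The curation activities on slide 45 can be pictured as status changes on individual name-to-structure links. What follows is a minimal sketch under that assumption. The class names, the three status values, and the example record are all hypothetical and do not reflect ChemSpider's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class NameAssertion:
    """One name-to-structure link plus who has curated it."""
    name: str
    status: str = "unvalidated"  # or "validated" / "rejected"
    curators: list = field(default_factory=list)


@dataclass
class CompoundRecord:
    record_id: str
    names: dict = field(default_factory=dict)

    def add_name(self, name):
        """Attach a name if it is new and return its assertion."""
        return self.names.setdefault(name, NameAssertion(name))

    def _mark(self, name, status, curator):
        assertion = self.add_name(name)
        assertion.status = status
        assertion.curators.append(curator)

    def validate(self, name, curator):
        self._mark(name, "validated", curator)

    def reject(self, name, curator):
        self._mark(name, "rejected", curator)

    def correct_spelling(self, wrong, right, curator):
        """Reject the misspelt name and record the corrected one as validated."""
        self.reject(wrong, curator)
        self.validate(right, curator)

    def display_names(self):
        """Names still shown to users: everything not rejected."""
        return sorted(n for n, a in self.names.items() if a.status != "rejected")


record = CompoundRecord("hypothetical-record-1")
for name in ("vitamin H", "bioitn", "vitamin K1"):
    record.add_name(name)

record.validate("vitamin H", curator="user-a")
record.correct_spelling("bioitn", "biotin", curator="user-b")
record.reject("vitamin K1", curator="user-b")  # belongs to a different compound

print(record.display_names())  # ['biotin', 'vitamin H']
```

Rejected names are kept with their status instead of being deleted, so the same bad name is recognised if a later data deposit brings it back.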
  46. Crowdsourced “Annotations”
     • Registered users can add
       • Descriptions/Syntheses/Commentaries
       • Links to articles, blogs, wikis etc.
       • Spectral data
       • Photos
       • MP3 files
       • Videos
  47. Data Validation – Not Vitamin K1
  48. Data Validation – Not Beclomethasone Dipropionate
     • DailyMed Article
  49. Data Validation …NOT Cholesterol
  50. Data Validation – ONE Cymarin
     • Question Quality in Big Databases
  51. First request to Database Hosts!
     • Every public compound database host should add ONE feature – “Leave Comments”
  52. Second request to Database Hosts!
     • Show Comments
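The two requests amount to a very small feature: let a user attach a comment to a record, and show every comment on that record to the next visitor. A minimal in-memory sketch follows. The class and method names are invented for the example, and a real host would need authentication, storage, and moderation on top of it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Comment:
    record_id: str
    author: str
    text: str
    created: datetime


class CommentStore:
    """Hypothetical per-record comment store kept in memory."""

    def __init__(self):
        self._by_record = {}

    def leave_comment(self, record_id, author, text):
        """First request: accept a comment on a record."""
        text = text.strip()
        if not text:
            raise ValueError("comment text must not be empty")
        comment = Comment(record_id, author, text, datetime.now(timezone.utc))
        self._by_record.setdefault(record_id, []).append(comment)
        return comment

    def show_comments(self, record_id):
        """Second request: return the record's comments, oldest first."""
        return list(self._by_record.get(record_id, []))


store = CommentStore()
store.leave_comment("hypothetical-record-1", "user-a",
                    "This structure is missing its stereochemistry.")
for comment in store.show_comments("hypothetical-record-1"):
    print(f"{comment.author}: {comment.text}")
```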
  53. Always Question Online Chemistry
  54. Thank you
     • Email: williamsa@rsc.org
     • Twitter: ChemConnector
     • Blog: www.chemspider.com/blog
     • Personal Blog: www.chemconnector.com
     • Slides: www.slideshare.net/AntonyWilliams
