ChemSpider and How The Wisdom Of The Crowds Can Improve The Quality Of Chemistry On The Internet

This is a presentation I gave at the FDA on December 1st 2009 in Washington DC as part of a symposium involving PubChem, ChemIDPlus, PillBox, DailyMed and other related systems. The focus was, as usual, on the quality of data online and how to clean it up, with specific attention to the quality of data on the FDA's DailyMed and our efforts to apply semantic markup to the DailyMed articles.

Published in: Technology, Education
1 Comment
  • A great presentation. You made an excellent point about the difference between a ’chemical’ and a ’chemical structure,’ as well as the importance of ’identifiers.’

Transcript of "ChemSpider and How The Wisdom Of The Crowds Can Improve The Quality Of Chemistry On The Internet"

  1. 1. ChemSpider: How the Wisdom of the Crowd Can Improve the Quality of Chemistry on the Internet
  2. 2. Overview
      • Chemistry on the Internet
      • An introduction to ChemSpider
      • How crowds can enhance quality of public databases
      • The DailyMed Project on ChemSpider
          • Observed Quality Issues
          • The Curation Platform
          • Semantic Markup and Integration
  3. 3. Linked Data on the Web Taken from: Rafael Sidis’ Blog
  4. 4. Where Would You look? What Do You Trust?
  5. 5. Lots of “Public Compound” Databases
      • PubChem
      • Drugbank
      • ChEBI/ChEMBL
      • KEGG
      • LipidMAPs
      • ChemIDPlus
      • eMolecules
      • ZINC
      • Lots of chemical vendors
      • ChemSpider
  6. 6. What is a compound?
  7. 7. Connecting Chemistry on the Web
      • The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
      • Chemistry articles are indexed and searchable by a free online service
      • The web is linked together through the “language of chemistry”
  8. 8. Antony Williams vs Identifiers
      • Passport ID
      • Dad, Tony, others
      • SSN
      • Green Card
      • License
      • 5 email addresses
      • ChemSpiderman (blog, Twitter account, Facebook, Friendfeed)
      • OpenID
      • …
  9. 9. Aspirin vs Chemical Identifiers
  10. 10. Aspirin names and synonyms
      • Text searches depend on correct association
      • 335 suggested identifiers for Aspirin just on PubChem!
      • Disambiguation dictionaries are necessary
  11. 14. The Final Search Strategy
  12. 15. All Those Names, One Structure
  13. 16. Connections Can Lead Anywhere
  14. 17. The InChI Identifier
  15. 18. Multiple Layers
  16. 19. InChI Strings Hash to InChIKeys
  17. 20. Oleoylethanolamine
  18. 21. Search Engine Dependencies
  19. 22. Search Engine Dependencies
  20. 23. InChIs have traction…
  21. 24. Vancomycin
  22. 26. Vancomycin
      • Who will curate?
      • How would you clean such a large dataset?
  23. 27. What is ChemSpider?
      • ChemSpider is:
          • Building a Structure Centric Community for Chemists
          • Ca. 23 million compounds, ca. 300 data sources
          • A deposition and curation platform
          • A publishing platform for the community
          • Grows daily – more depositions, more links, more data sources
  24. 28. Search Cholesterol
  25. 29. Search Cholesterol
  26. 30. Search Cholesterol
  27. 31. Search Cholesterol
  28. 32. Search Cholesterol
  29. 33. Search Cholesterol
  30. 34. Linked across the internet
  31. 35. Kyoto Encyclopedia of Genes and Genomes
  32. 36. Link off a structure in ChemSpider
      • Chemical suppliers
      • Other publications
      • Analytical Data
      • Related Reactions
      • Wikipedia
      • Patents
      • “Everything”
  33. 37. Links to Patents based on structure
  34. 38. Pubmed Articles Linked
  35. 39. Answering Questions for Chemists
      • Questions a chemist might ask…
          • What is the melting point of n-butanol?
          • What is the chemical structure of Xanax?
          • Chemically, what is phenolphthalein?
          • What are the stereocenters of cholesterol?
          • Where can I find publications about xylene?
          • What are the different trade names for Ketoconazole?
          • What is the NMR spectrum of Aspirin?
          • What are the safety handling issues for Thymol Blue?
  36. 40. Complex Data and Information
  37. 41. ChemSpider Searches
  38. 42. ChemSpider Complex Searches
  39. 43. Chemistry on the Internet
      • Much of the information is based on assertions: User Beware!
      • The quality of the available information varies widely, so how does the user know what is and is not “correct”?
  40. 44. Question Everything online: www.dhmo.org
  41. 45. Vancomycin
  42. 46. Vancomycin on ChemSpider
  43. 47. Vancomycin
  44. 48. Vancomycin (screenshot labels: Search Molecular SKELETON; Search Full Molecule)
  45. 49. Full Skeleton Search: 104 Hits
  46. 50. Full Molecule Search: 4 Hits
  47. 51. The InChI “Resolver”
  48. 52. The EXPERTS must get it right?!
  49. 53. Wikipedia, C&E News, PubChem
      • C&E News (from ACS)
  50. 54. Feedback from Steve Ritter
      • “Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
      • “It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
  51. 55. What About Digitonin?
  52. 56. CAS as an authority
  53. 57. The Blogging Community Participate
  54. 58. Collaborative Knowledge Management
  55. 59. Assertion and Chemical Entities
      • Who says what Taxol is?
      • What is the “timeline” for a molecule?
      • How do we clean up the Public data?
  56. 60. Crowd-sourcing Chemistry Curation
      • Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
  57. 61. Multi-level Curation and Approval
  58. 62. DailyMed
  59. 63. The Intention
      • Make DailyMed structure searchable via ChemSpider
      • In the process curate data on ChemSpider and validate data on DailyMed
      • Improve the curation platform on ChemSpider
      • Perform markup of DailyMed articles to enhance the reading experience
  60. 64. Poor Representations
  61. 65. Lack of Stereochemistry
  62. 66. Simply Wrong
  63. 67. Missing Fragments?
  64. 68. Hmmmm….
  65. 69. Does it Matter?
      • Does it matter to the consumer that the structures are wrong? No…what matters is that what is in the bottle is the right medication!
      • To make DailyMed structure searchable it DOES matter
      • To data mine DailyMed it matters
      • To mark up DailyMed it matters
  66. 70. The Process
      • Import all XML files from DailyMed
      • Use “home-built” entity extraction based on our dictionary of chemical names
      • Articles online here:
          • http://www.chemspider.com/DailyMed.aspx
          • Example Article: http://www.chemspider.com/DailyMedArticle.aspx?id=2
  67. 71. State of the Data
  68. 72. Tolinase: DailyMed on ChemSpider
  69. 73. OTHER Mentioned Chemicals
  70. 74. One Name – Multiple Structures (structure labels: NO Stereo, Full Stereo, Partial Stereo, Partial Stereo)
  71. 75. Editing a Record
      • Do NOT deprecate record…remove association between name and chemical structure
  72. 77. Partial Stereochemistry
  73. 78. Loop of Assertions
      • Reduce to ONE structure – with full explicit stereo
  74. 79. How bad can it get??? And who is right????
  75. 80. Name-Structure Pairs
      • Cleaning up the associations of names and structures is torturous and time-consuming
      • Decisions get made and can be challenged
      • Names are not “removed”…they are still on the database
      • Such a curated “dictionary” is very valuable
  76. 81. ChemMantis
      • Chemical Markup And Nomenclature Transformation Integrated System – ChemMantis
      • A platform for entity extraction for chemistry documents, markup and integration to online information sources – Wikipedia, ChemSpider, Entrez…
      • Web-based submission, markup and publishing platform
  77. 82. Name-Structure Pairs
  78. 83. Species – linked to Wikipedia
  79. 84. Semantic Linking of Structures
      • What would you want to link off a structure?
          • Chemical suppliers
          • Other publications
          • Analytical Data
          • Related Reactions
          • Wikipedia
          • Patents
          • “Everything”
  80. 85. Outlinks…
  81. 86. Back to DailyMed
  82. 87. The Difference…
  83. 88. ChemMantis Markup
  84. 89. Species Markup
  85. 90. Dictionaries are Easily Enhanced
      • Copy-Paste into appropriate Entity Dictionary
      • Impacts all future markups
      • Expanding knowledgebases of information
      • Linked out to rich sources of information
  86. 91. Expand Dictionaries
  87. 92. Where To From Here?
      • The platform is built…it’s all eyeballs for curation now
      • As structure-identifier pairs are curated DailyMed will improve
      • The project is now on hold – no resources to continue
  88. 93. If We Had Our Way…
      • Convert every DailyMed Label to a ChemMantis marked up document
      • Use the XML segregation of the Tablet Labels to tag where chemicals are in the label
      • Allow data mining based on “where” in a label the chemicals are (drug-drug interactions etc.)
      • Mark up and mine property data out of the labels using new dictionaries related to properties such as IC50 and toxicity
  89. 94. Citizen Scientists
  90. 95. Become a Data Source
  91. 97. Synthesis Procedures
  92. 98. Links to Data or Deposit Data
  93. 99. ChemSpider Everywhere: Embed
  94. 100. ChemSpider Everywhere: Spectral Game
  95. 101. ChemSpider Everywhere: Crowdsourced Curation of Spectra
  96. 102. ChemSpider Everywhere: ChemMobi
  97. 103. ChemSpider Web Services
  98. 104. Linked Data on the Web Taken from: Rafael Sidis’ Blog
  99. 106. Linking to Resources
      • Linking to resources by structures or name
      • Example integration: HSDB or ChemIDPlus
  100. 107. It’s a long road ahead…
  101. 108. Conclusions
      • The internet enables chemistry, at a reduced cost
      • Web 2.0 is here and improving quality
      • Question Quality! All data sources are imperfect – some more imperfect than others
      • Crowdsourcing to expand, curate and integrate
      • Structures submitted to DailyMed need checking
  102. 109. Thank you
      • [email_address]
      • Twitter: ChemSpiderman
      • www.chemspider.com/blog
      • SLIDES: www.slideshare.net/AntonyWilliams
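
Slides 17 to 19 and 48 to 50 of the transcript mention InChI layers, InChIKeys and the difference between a skeleton search and a full-molecule search, but the transcript carries only the slide titles. The Python sketch below is one way to picture those identifiers; it is not ChemSpider code. The helper names (inchi_layers, split_inchikey, group_by_skeleton) are hypothetical, the parser is deliberately simplified (no validation, no special handling of fixed-H or reconnected layers), and treating the first InChIKey block as the "skeleton" is an assumption about how such a search could be approximated. The aspirin and alanine identifiers are illustrative values and are worth checking against an authoritative source before reuse.

```python
from collections import defaultdict

# Prefix letters of some InChI layers that can follow the formula layer.
# Unlisted prefixes are kept under their own letter.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "i": "isotopes",
}


def inchi_layers(inchi):
    """Split an InChI string into named layers (simplified, non-validating).

    Assumes the second segment is the formula layer, which holds for
    ordinary compounds but not for every valid InChI.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if not part:
            continue  # skip empty segments rather than fail
        layers[LAYER_NAMES.get(part[0], part[0])] = part[1:]
    return layers


def split_inchikey(key):
    """Return the three hyphen-separated blocks of a standard InChIKey.

    Block 1 (14 letters) hashes the connectivity, block 2 (10 letters)
    hashes the remaining layers and ends with flag and version letters,
    block 3 (1 letter) encodes protonation.
    """
    blocks = key.split("-")
    if [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError("unexpected InChIKey shape")
    return tuple(blocks)


def group_by_skeleton(keys):
    """Group InChIKeys that share the same first (connectivity) block."""
    groups = defaultdict(list)
    for key in keys:
        groups[split_inchikey(key)[0]].append(key)
    return dict(groups)


# Illustrative identifiers: aspirin, then three stereo variants of alanine.
aspirin_inchi = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
alanine_keys = [
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # no stereo defined
    "QNAYBMKLOCPYGJ-REOHZSRLSA-N",  # L-alanine
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # D-alanine
]

for name, value in inchi_layers(aspirin_inchi).items():
    print(f"{name}: {value}")
for skeleton, members in group_by_skeleton(alanine_keys).items():
    print(skeleton, len(members), "record(s)")
```

Under these assumptions, matching on the first block alone pulls together records that differ only in stereochemistry or other non-connectivity layers, while matching the whole key keeps them apart. That is one plausible reading of why a skeleton search returns far more hits than a full-molecule search in the Vancomycin slides.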