ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community


This is the presentation I gave at OpenSciNY 2010. It was a great gathering of librarians and people interested in Open Science. Sharing the stage with Beth Brown, Jean-Claude Bradley and Heather Joseph was, as usual, a good opportunity to discuss how openness and online data sharing are changing the way we access and share data. We live in interesting and exciting times.

Published in: Technology

  1. 1. ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community OpenSciNY, New York, May 2010
  2. 2. Once Upon a Time Over a “Coffee”
  3. 3. Which is better for Plants? Vodka, Sprite or Viagra?
  4. 4. It Works – Viagra Wins the Day
  5. 5. Now Which is Better?
       • Viagra or Cialis?
       • Images sourced from Wikipedia
  6. 6. Cialis
       • I want…
           • The structure
           • Any patent information
           • Related publications
           • Where can I buy it?
           • Metabolic pathway info
           • What else is easy to find…
  7. 7. Cialis on Google?
  8. 8. What is Cialis?
  9. 9. What is Cialis? Can we trust Wikipedia?
  10. 10. What is Cialis? 6 hits on PubChem
  11. 11. What is Cialis?
  12. 12. Search by Trade Name
  13. 13. Are there other names???
  14. 14. Are there other names???
       • PubMed hits:
           • 736 Tadalafil
           • 744 Cialis
  15. 15. Are there other names???
  16. 16. Are There Other Names?
  17. 17. IC351 on PubChem? 5 HITS for IC351 ZERO HITS for IC 351
  18. 18. Chemistry on the Web
       • Text searching the web is far from optimal
       • The quality of data on the web is a problem
       • It may be hard to find but it is “out there”
       • What was once locked up behind an expensive license can generally be found
       • Structure searching the web is already possible!
  19. 19. Text Searching the Web
       • Text searching the web for chemical compounds is an enormous challenge
       • RSC has multiple databases, >500,000 articles and a lot of other resources. How do we do?
  20. 20. The RSC Publishing Platform (Beta)
  21. 21. 2+2 = 4 Articles?
  22. 22. CAS Number Search
  23. 23. Text Searching the Web
       • Disambiguation dictionaries of name-structure relationships would be very enabling.
           • IC351 = IC 351 = Tadalafil = Cialis = …
       • Creating validated dictionaries is an enormous challenge to cover chemistry
  24. 24. CAS Registry – LOTS of Chemicals!
  25. 27. The Final Search Strategy A “Disambiguation Query!”
  26. 28. All Those Names, One Structure A problem to solve…
  27. 29. ChemSpider - A Pragmatic Vision
       • “Build a Structure Centric Community to Serve Chemists”
       • Aggregate and integrate chemical structure data on the web – names, structures, links
       • Create a “structure-based hub” to information, data and algorithmic predictions
       • Let chemists contribute their own data
       • Allow the community to curate/correct data
  28. 30. What do humans want?
       • As few interfaces as possible
  29. 31. Aggregating Data – Who to Trust???
       • Encyclopedic articles (Wikipedia)
       • Chemical vendor databases
       • Metabolic pathway databases
       • Property databases
       • Patents with chemical structures
       • Drug Discovery data
       • Scientific publications
       • Compound aggregators
       • Blogs/Wikis and Open Notebook Science
  30. 32. Just “Public Compound” Databases
       • PubChem
       • Drugbank
       • ChEBI/ChEMBL
       • KEGG
       • LipidMAPs
       • ChemIDPlus
       • eMolecules
       • ZINC
       • Lots of chemical vendors
  31. 33. Question Everything online:
  32. 34. Di-Hydrogen Monoxide
       • 2H
  33. 35. Di-Hydrogen Monoxide
       • 2H + 1O
  34. 36. Di-Hydrogen Monoxide
       • H2O
  35. 37. Di-Hydrogen Monoxide
       • H2O
       • Water
  36. 38. It’s all on Wikipedia…
  37. 39. What About Gases? Methane…
  38. 40. What’s Methane?
  39. 41. What’s Methane?
  40. 42. What ELSE is Methane???
  41. 43. Structural Data for Life Sciences DailyMed
  42. 44. Lack of Stereochemistry
  43. 45. Incorrect Structures
  44. 46. Pragmatic Vision Delivered…
       • Aggregate, integrate and link data from across the internet
       • Almost 25 million structures from > 300 data sources
       • Linked to vendors, literature, online databases (open and commercial), open notebook science, patents and….
       • Robotic and Crowdsourced Curation
  45. 47. Search “OEA”
  46. 48. Search OEA
  47. 49. Search OEA
  48. 50. Search OEA
  49. 51. Linked Patents for OEA
  50. 52. Answering Questions…
       • Questions a student might ask…
           • What is the structure of levulinic acid?
           • Chemically, what is phenolphthalein?
           • What are the stereocenters of cholesterol?
           • Where can I find publications about xylene?
           • What are the different trade names for Ketoconazole?
           • What is the NMR spectrum of Aspirin?
           • How can I synthesize 2,4-dichlorophenol?
           • What are the safety handling issues for Thymol Blue?
  51. 53. Back to Cialis…
  52. 54. Cialis on ChemSpider: 1 hit
       • Chemicals are curated/validated on ChemSpider by ourselves and the community
       • Based on assertions from various sources. Iterative, time-consuming and exacting!
       • We believe we know the structure now
       • What is linked and available?
  53. 55. Google Patents
  54. 56. ChemSpider – Patents Linked SURECHEM PATENTS GOOGLE
  55. 57. Google Books
  56. 58. Microsoft Academic Search
  57. 59. Google Scholar – Articles were found by CAS Number !
  58. 60. Identifiers for Tadalafil
  59. 61. How Many Articles in RSC Journals?
       • Based on 171596-29-5 there are 13 articles in RSC journals
       • What about if we VALIDATE identifiers?
  60. 62. Validated Dictionaries Hit APIs This is data curation...
  61. 63. Does this generate more results?
  62. 64. RSC Journals
  63. 65. RSC Journals REMEMBER 2+2 = 4
  64. 66. PubMed
  65. 67. Google Scholar – Expanded Hit Set
  66. 68. Microsoft Academic Search
  67. 69. Microsoft Academic Search
       • Be careful! More mussels than drugs…
  68. 70. Searching Chemistry on the Internet
       • Do we get a complete result set if we search for “chemicals” only by name?
       • Is there a better way to link chemistry databases? Linking by “names” is dangerous
       • Chemists want structure and SUBstructure searching
  69. 71. Structure Searching the Web
       • We have resources about Tadalafil actively linked to ChemSpider
       • What about searching the web for Tadalafil by structure…not based on the various identifiers
       • How?
  70. 72. Link the Internet with InChIKeys! Taken from: Rafael Sidis’ Blog
  71. 73. The InChI Identifier
  72. 74. Multiple Layers
  73. 75. InChIStrings Hash to InChIKeys
  74. 76. Cialis – Searching the Web by InChI Search Molecular SKELETON Search Full Molecule
  75. 77. InChI Search the Web by Skeleton 78 Hits by Skeleton
  76. 78. InChI Search the Web Exact Match 32 Hits by InChIKey
  77. 79. InChI Search the Web Exact Match 6 Hits by Standard InChIKey
  78. 80. InChIfying the Web
       • There are more than twice as many “skeleton” hits for Cialis as exact matches – different stereo? Mistakes?
       • Our judgment…MISTAKES
  79. 81. Vancomycin – Search the Internet
  80. 82. Full Molecule Search: 4 Hits
  81. 83. Full Skeleton Search: 104 Hits
  82. 84. InChIKeys: Make the internet searchable by adding InChIKeys. Publishers add InChIKeys to papers now… But what is the structure???
  83. 85. We need an InChI “Resolver”
  84. 86. InChI Resolver to DOIs Structure Search the Web
  85. 87. Semantic Markup: Project Prospect
  86. 88. Depends on Validated Dictionaries Link to a Structure or the Right Structure?
  87. 89. Name-Structure Pairs
  88. 90. Semantic Linking of Structures
       • What would you want to link off a structure?
           • Chemical suppliers
           • Other publications
           • Analytical Data
           • Related Reactions
           • Wikipedia
           • Patents
           • “Everything”
           • Through ChemSpider!
  89. 91. Unpublished Chemistry
       • Only a fraction of chemistry is published
       • Only a tiny fraction of chemistry is patented
       • What of the “Lost Chemistry” - never published and cannot be abstracted
           • Reactions performed
           • Structures made and studied
           • Spectra acquired and then disposed of
           • Available chemicals never found
  90. 92. Org Prep Daily (Blog)
  91. 93. ChemSpider SyntheticPages
  92. 94. Submission Process
       • Register as a user
       • Use the Submit button and fill in the fields…
  93. 95. Submission Process
       • Submissions reviewed by editorial board
       • Published as is or comments sent to author
       • Online Peer Review process
       • Data supported include web movies, images, live spectra etc.
  94. 96. Micro- and Nano-publications
       • Blogs, wiki entries and even Amazon book reviews are micro/nano-publications
       • ChemSpider SyntheticPages will be DOI’ed – students can add these “micro-publications” to their resume
       • Structures and spectra are nano-publications – these can also be tracked and referenced (depositions, curations, etc.). Students participate in building one of the premier sources of chemistry data.
  95. 97. ChemSpider : Spectra Linked
  96. 98. Spectra Linked
  97. 99. Spectra Linked
  98. 100. Not Just NMR Data
  99. 101.
  100. 102. Spectral Game
  101. 103. Increasing Complexity
  102. 104. Spectral Game
  103. 105. ChemSpider Content
       • ChemSpider is a container…supports multimedia
           • Spectra
           • Crystal structures
           • Images
           • MP3s
           • Videos
  104. 106. Roses’ Crystal Image Collection
  105. 107. MP3s and Videos : Titanium
  106. 108. Periodic Table Images
  107. 109. How Can You Help ChemSpider?
       • Deposit your data and share with the community
           • Structures – one or many
           • Spectra
           • Links
           • Syntheses into SyntheticPages
       • Curate data – most basic level…just add comments
       • Spread the word – ChemSpider is an untapped resource
  108. 110. Community Contribution
       • We can make a bigger contribution to the community if the community shares via ChemSpider
       • Don’t underestimate what others will find of value
       • ChemSpider wins “Community contribution” best practice award
  109. 111. Chemistry on the Internet FUTURE
       • The semantic web for chemistry is in place
       • Crowdsourced contributions are commonplace
       • Chemists will search by structure/substructure
       • Chemistry articles indexed and searchable
       • Reduced number of searches to find data
       • Data are integrated – compounds, vendors, syntheses, data, publications and patents
       • A world of Open Access and Open Data
  110. 112. Thank you [email_address] Twitter: ChemSpiderman SLIDES:
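
Two ideas from the slides can be sketched in a few lines of Python: the disambiguation dictionary (IC351 = IC 351 = Tadalafil = Cialis) and the difference between a "skeleton" hit and an exact InChIKey hit. This is only an illustration under stated assumptions, not how ChemSpider implements either. The normalization rule, the function names and the dictionary layout are hypothetical, and the keys are made-up placeholders rather than real InChIKeys. The one fact relied on is that the first 14-character block of an InChIKey hashes the connectivity, while stereochemistry and other layers are hashed into the second block.

```python
import re

# Placeholder keys in InChIKey shape (14-10-1 characters). They are invented
# for this sketch and do not identify any real compound.
KEY_CORRECT_STEREO = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
KEY_OTHER_STEREO = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"
KEY_UNRELATED = "ZZZZZZZZZZZZZZ-BBBBBBBBSA-N"


def normalize_name(name: str) -> str:
    """Lower-case a name and drop spaces and hyphens.

    Assumed rule for this sketch: it makes 'IC351' and 'IC 351' collide,
    but it is far too crude for systematic chemical names.
    """
    return re.sub(r"[\s\-]+", "", name).lower()


# Hypothetical validated dictionary: every synonym points at one structure key.
SYNONYMS = {
    normalize_name(n): KEY_CORRECT_STEREO
    for n in ("Tadalafil", "Cialis", "IC351")
}


def lookup(name: str):
    """Return the structure key for a name, or None if it is not validated."""
    return SYNONYMS.get(normalize_name(name))


def skeleton(inchikey: str) -> str:
    """Return the first block of an InChIKey (the connectivity hash)."""
    return inchikey.split("-")[0]


def classify(query_key: str, found_key: str) -> str:
    """Classify a web hit as an 'exact', 'skeleton' or 'none' match."""
    if found_key == query_key:
        return "exact"
    if skeleton(found_key) == skeleton(query_key):
        return "skeleton"  # same connectivity, different stereo or other layers
    return "none"
```

With such a dictionary, `lookup("IC 351")` and `lookup("cialis")` resolve to the same key, and a hit classified as "skeleton" but not "exact" is a candidate for the stereochemistry mistakes the slides describe.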