Navigating the Complex Web of Chemistry Using ChemSpider



The internet has revolutionized the sharing of data and information, and in chemistry many resources are now available to support research. In recent years, various online resources have been introduced that give users access to information, properties and data associated with chemical entities. CAS has declared that its registry now holds over 50 million unique chemical entities, and the number of chemical structures distributed across the internet also runs to the tens of millions. Dozens of internet databases host chemical structures together with data reflecting the specific nature of each collection: metabolic pathways, spectral data collections, chemical vendor collections, biological assay data and crystal structures are examples. Unfortunately, there has been no single way to search across all of these resources. ChemSpider has taken on the task of integrating these online resources into a single database, using the chemical structure as the primary key and retaining the link out and attribution to the original data source. In this manner ChemSpider intends to become a structure-centric hub for the chemistry community. This talk will provide an overview of the ChemSpider platform, how it is being used as a crowdsourcing platform for community-based curation of the data, and the future vision of ChemSpider as one of the pillars of the semantic web of chemistry.
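The idea of using the chemical structure as the primary key while keeping attribution to each original source can be illustrated with a small sketch. This is not ChemSpider's implementation; the record layout, the structure keys, the source names and the example.org URLs are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict

# Hypothetical source records: (structure key, source name, link back to the source).
# The keys stand in for any structure-derived identifier, such as an InChIKey.
RECORDS = [
    ("STRUCTURE-KEY-1", "Vendor catalogue", "https://example.org/vendor/123"),
    ("STRUCTURE-KEY-1", "Spectral collection", "https://example.org/spectra/abc"),
    ("STRUCTURE-KEY-2", "Assay database", "https://example.org/assay/42"),
]


def build_hub(records):
    """Group records by structure key, keeping the source and link for each one."""
    hub = defaultdict(list)
    for key, source, url in records:
        hub[key].append({"source": source, "url": url})
    return dict(hub)


hub = build_hub(RECORDS)
for key, links in hub.items():
    print(key, [link["source"] for link in links])
```

Because every entry keeps its source name and URL, a single structure lookup returns all linked resources without losing track of where each piece of data came from.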

Published in: Technology, Education

Navigating the Complex Web of Chemistry Using ChemSpider

  1. 1. Navigating the Complex Web of Chemistry Using ChemSpider
  2. 2. Imagine a time when… <ul><li>The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar) </li></ul><ul><li>Chemistry articles are indexed and searchable by a free online service </li></ul><ul><li>The web is linked together through the “language of chemistry” </li></ul><ul><li>Publicly funded research data can be shared and discussed in the open, maybe as Open Notebook Science (ONS)? </li></ul>
  3. 3. It’s Coming…Linked Data Cloud
  4. 4. For Synthesis…
  5. 5. Org Prep Daily (Blog)
  6. 6. Molbank (Open Access Journal)
  7. 7. Synthetic Pages (Website)
  8. 8. For Chemical Compounds <ul><li>Vendor sites – Aldrich, Alfa Aesar, TCI and 100s of others </li></ul><ul><li>Government databases – PubChem, DSSTox, FDA databases, ChemIDPlus,… </li></ul><ul><li>Biological Databases – Protein Database, Stitch, KEGG, ChEBI,… </li></ul><ul><li>Analytical databases – Red Hen Spectra, NMRShiftDB,… </li></ul>
  9. 9. For Chemical Compounds <ul><li>Vendor sites – Aldrich, Alfa Aesar, TCI and 100s of others </li></ul><ul><li>Government databases – PubChem, DSSTox, FDA databases, ChemIDPlus,… </li></ul><ul><li>Biological Databases – Protein Database, Stitch, KEGG, ChEBI,… </li></ul><ul><li>Analytical databases – Red Hen Spectra, NMRShiftDB,… </li></ul>
  10. 10. What is ChemSpider? <ul><li>ChemSpider is: </li></ul><ul><ul><li>Building a Structure Centric Community for Chemists </li></ul></ul><ul><ul><li>22.2 million compounds, >200 data sources </li></ul></ul><ul><ul><li>A deposition and curation platform </li></ul></ul><ul><ul><li>A publishing platform for the community </li></ul></ul><ul><ul><li>Grows daily – more depositions, more links, more data sources </li></ul></ul>
  11. 11. How Was ChemSpider Built? <ul><li>ChemSpider was a “hobby project” </li></ul><ul><li>Housed in a basement and running off three servers – one bought, two built </li></ul><ul><li>Sensitive to weather and power stability </li></ul><ul><li>Went live at ACS Spring 2007 in Chicago </li></ul>
  12. 12. Search Cholesterol
  13. 13. Search Cholesterol
  14. 14. Search Cholesterol
  15. 15. Search Cholesterol
  16. 16. Search Cholesterol
  17. 17. Search Cholesterol
  18. 18. Linked across the internet
  19. 19. Kyoto Encyclopedia of Genes and Genomes
  20. 20. Link off a structure in ChemSpider <ul><ul><li>Chemical suppliers </li></ul></ul><ul><ul><li>Other publications </li></ul></ul><ul><ul><li>Analytical Data </li></ul></ul><ul><ul><li>Related Reactions </li></ul></ul><ul><ul><li>Wikipedia </li></ul></ul><ul><ul><li>Patents </li></ul></ul><ul><ul><li>“Everything” </li></ul></ul>
  21. 21. Links to Patents based on structure
  22. 22. Clickthrough to Patent (SureChem)
  23. 23. Pubmed Articles Linked
  24. 24. Answering Questions for Chemists <ul><li>Questions a chemist might ask… </li></ul><ul><ul><li>What is the melting point of n-butanol? </li></ul></ul><ul><ul><li>What is the chemical structure of Xanax? </li></ul></ul><ul><ul><li>Chemically, what is phenolphthalein? </li></ul></ul><ul><ul><li>What are the stereocenters of cholesterol? </li></ul></ul><ul><ul><li>Where can I find publications about xylene? </li></ul></ul><ul><ul><li>What are the different trade names for Ketoconazole? </li></ul></ul><ul><ul><li>What is the NMR spectrum of Aspirin? </li></ul></ul><ul><ul><li>What are the safety handling issues for Thymol Blue? </li></ul></ul>
  25. 25. Complex Data and Information
  26. 26. ChemSpider is a structure-centric hub <ul><li>ChemSpider aggregates and links out across the internet </li></ul><ul><li>Data aggregate based on “structures and links” </li></ul><ul><li>What defines a chemical compound? </li></ul>
  27. 27. What is a compound?
  28. 28. Question Everything online:
  29. 29. PubChem
  30. 30. Caution! Question Everything!
  31. 31. Vancomycin <ul><li>Who will curate? </li></ul><ul><li>PubChem is not resourced to clean these errors </li></ul><ul><li>How would you clean such a large dataset? </li></ul>
  32. 32. Vancomycin on ChemSpider 1 compound – 3 days
  33. 33. The EXPERTS must get it right?!
  34. 34. Wikipedia, C&E News, PubChem <ul><li>C&E News (from ACS) </li></ul>
  35. 35. Feedback from Steve Ritter <ul><li>“As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel. </li></ul><ul><li>…we always double check them against one or more primary sources, typically Merck Index and SciFinder. </li></ul><ul><li>Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.” </li></ul>
  36. 36. Feedback from Steve Ritter <ul><li>“As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.” </li></ul><ul><li>“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.” </li></ul>
  37. 37. What About Digitonin?
  38. 38. Comments on the Blog <ul><li>Kirill Degtyarenko says: </li></ul><ul><li>September 15th, 2009 at 1:57 pm It looks like both ChEBI and Wikipedia structures are wrong as far as aglycon is concerned. According to </li></ul><ul><li>“… for the first time to confirm beyond all doubt the structure suggested by Tschesche and Wulff for digitonin by means of modern NMR techniques, and to assign all proton and carbon resonances.” Structure 1 shows methyl group at C-20 going UP, i.e. 20β (while by default spirostan is 20α). </li></ul>
  39. 39. CAS as an authority
  40. 40. The Blogging Community Participate
  41. 41. Will it ever end? <ul><li>The community says the structure of digitonin has an “up” 20-methyl group. </li></ul><ul><li>If so, then multiple substances related to digitonin have the OPPOSITE stereochemistry at the 20-methyl group </li></ul><ul><li>The spirostane skeleton is considered to have a “down” methyl group, so all spirostane-related structures would be wrong </li></ul><ul><li>The ACD/Dictionary has 24 structures with a closely related skeleton, and all have the “down” methyl group. </li></ul>
  42. 42. The FDA’s DailyMed
  43. 43. Structures on DailyMed
  44. 44. Lack of Stereochemistry
  45. 45. Incorrect Structures
  46. 46. Wow!
  47. 47. Collaborative Knowledge Management
  48. 48. DrugBank
  49. 49. Taxol on PubChem
  50. 50. FDA’s DailyMed
  51. 51. The InChI Identifier
  52. 52. Multiple Layers
  53. 53. InChIStrings Hash to InChIKeys
  54. 54. InChIs for Taxol
  55. 55. Back to Taxol <ul><li>DrugBank: RCINICONZNJXQF-CLDWUXIMDD </li></ul><ul><li>ChEBI: RCINICONZNJXQF-GXKQXQCDDN </li></ul><ul><li>Wikipedia: RCINICONZNJXQF-MZXODVADBJ </li></ul><ul><li>Which one is correct??? </li></ul>
  56. 56. InChIKeys for Taxol <ul><li>DrugBank: RCINICONZNJXQF-CLDWUXIMDD </li></ul><ul><li>ChEBI: RCINICONZNJXQF-GXKQXQCDDN </li></ul><ul><li>Wikipedia: RCINICONZNJXQF-MZXODVADBJ </li></ul><ul><li>ChEBI and Wikipedia are the SAME structure </li></ul><ul><li>DrugBank is a DIFFERENT structure – ONE stereocenter </li></ul>
  57. 57. Does one stereocenter matter?
  58. 58. Does one stereocenter matter? <ul><li>Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon </li></ul>
  59. 59. Does one stereocenter matter? <ul><li>Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon </li></ul>
  60. 60. Building a Structure Centric Community for Chemists
  61. 61. Assertion and Chemical Entities <ul><li>Who says what Taxol is? </li></ul><ul><li>What is the “timeline” for a molecule? </li></ul><ul><li>How do we clean up the Public data? </li></ul><ul><li>The Quality source is Chemical Abstracts Service… </li></ul>
  62. 62. ChemSpider Searches
  63. 63. ChemSpider Searches
  64. 64. ChemSpider Complex Searches
  65. 65. ChemSpider Searches
  66. 66. ChemSpider Searches
  67. 67. InChIKey Searches Work
  68. 68. The InChI “Resolver”
  69. 69. Content is King and Quality Costs <ul><li>Chemistry “content” is big money </li></ul><ul><ul><li>Patent searching </li></ul></ul><ul><ul><li>Structures and properties </li></ul></ul><ul><ul><li>Drug databases </li></ul></ul><ul><ul><li>Literature databases </li></ul></ul><ul><li>Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information </li></ul><ul><ul><li>101 years of content </li></ul></ul><ul><ul><li>$260 million revenue (2006) </li></ul></ul><ul><ul><li>>50 million substances </li></ul></ul><ul><ul><li>>60 million sequences </li></ul></ul>
  70. 70. Crowd-sourcing Chemistry Curation <ul><li>Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate </li></ul>
  71. 71. Chemistry – A Deposition Platform <ul><li>CAS indexes published literature, patents and chemical vendors </li></ul><ul><li>CAS indexes ChemSpider – >303,000 records </li></ul><ul><li>“ Lost Chemistry” – syntheses in theses, lab notebooks? Compounds in private collections? </li></ul><ul><li>ChemSpider accepts public depositions, linking to websites, hosting of details etc. Accepts structures, text, spectra, images. </li></ul>
  72. 72. Structure Searching Articles… <ul><li>Searching articles based on chemical structure and substructure is very expensive, but this is changing </li></ul><ul><li>The web IS ready: when will publishers deliver? </li></ul><ul><ul><li>Structures can be shown </li></ul></ul><ul><ul><li>Spectra can be interactive </li></ul></ul><ul><ul><li>Graphics don’t need to be static </li></ul></ul><ul><ul><li>Publishers can enhance their articles </li></ul></ul>
  73. 73. Semantic Mark-up for Chemistry <ul><li>Semantic mark-up for chemistry is here </li></ul><ul><ul><li>RSC Project Prospect (structure linking, IUPAC Gold Book ontology and other ontologies) </li></ul></ul><ul><ul><li>Nature Publishing Group compound linking </li></ul></ul><ul><ul><li>ChemSpider Journal of Chemistry </li></ul></ul>
  74. 74. Nature Chemistry Compound Pages
  75. 75. Project Prospect
  76. 76. ChemSpider and Publishing <ul><li>The curation efforts on ChemSpider led to a set of validated dictionaries </li></ul><ul><li>Integrate best-in-class entity extraction with validated name dictionaries </li></ul><ul><li>Additional dictionaries gave reactions, groups, families, hardware and software vendors etc </li></ul>
  77. 77. Name Recognition <ul><li>Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). </li></ul>
  78. 79. ChemMantis and CJOC
  79. 80. Name-Structure Pairs
  80. 81. Converting Detected Names… <ul><li>Names are searched against a validated dictionary (this expands as ChemSpider is curated) </li></ul><ul><li>If not found then they are passed through a Name to Structure algorithm </li></ul><ul><li>If they cannot convert then ChemSpider is searched for non-validated names </li></ul>
  81. 82. Manual Curation is Necessary
  82. 83. Deposit Structures
  83. 84. Custom Dictionaries <ul><li>Entity Extraction built around modified algorithms from SureChem </li></ul><ul><li>Optimized for “publications” </li></ul><ul><li>Dictionaries for chemical entities, groups, reactions, elements, families, species… </li></ul><ul><li>Dictionaries can be expanded </li></ul>
  84. 85. Species – linked to Wikipedia
  85. 86. Semantic Linking of Structures <ul><li>What would you want to link off a structure? </li></ul><ul><ul><li>Chemical suppliers </li></ul></ul><ul><ul><li>Other publications </li></ul></ul><ul><ul><li>Analytical Data </li></ul></ul><ul><ul><li>Related Reactions </li></ul></ul><ul><ul><li>Wikipedia </li></ul></ul><ul><ul><li>Patents </li></ul></ul><ul><ul><li>“Everything” </li></ul></ul>
  86. 87. RSC Supplementary Info
  87. 88. RSC Supplementary Info
  88. 89. ChemSpider Synthesis <ul><li>ChemSpider Synthesis will be a home for all things “synthetic” </li></ul><ul><li>An online resource for synthetic procedures from blogs, other online resources, RSC supplementary info, other publishers etc. </li></ul><ul><li>Public peer-review and feedback for synthetic procedures </li></ul>
  89. 90. Online Journals and Live Data
  90. 91. ChemSpider Everywhere <ul><li>Linked from Wikipedia </li></ul><ul><li>Linked from Open Notebook Science sites using EMBED </li></ul><ul><li>Linked from Blogs using Structure/Spectra EMBED </li></ul><ul><li>Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets </li></ul><ul><li>Integrated to software offerings from Thermo, Waters, Agilent, Bruker </li></ul>
  91. 92. ChemSpider Everywhere : Embed
  92. 93. ChemSpider Everywhere: Spectral Game
  93. 94. ChemSpider Everywhere : ChemMobi
  94. 95. Not in a basement now...
  95. 96. Conclusions <ul><li>The internet enables chemistry, at a reduced cost </li></ul><ul><li>Web 2.0 is here and improving quality </li></ul><ul><li>Question Quality! </li></ul><ul><li>Crowdsourcing to expand, curate and integrate </li></ul><ul><li>InChIs are enabling chemistry on the internet </li></ul>
  96. 97. You are invited… <ul><li>Deposit your data with us </li></ul><ul><ul><li>Structures </li></ul></ul><ul><ul><li>Spectra </li></ul></ul><ul><ul><li>Synthesis procedures </li></ul></ul><ul><li>ChemSpider Synthesis is under development </li></ul><ul><li>What is Digitonin? </li></ul>
  97. 98. Acknowledgments <ul><li>Valery Tkachenko and Sergey Golotvin </li></ul><ul><li>RSC infrastructure team </li></ul><ul><li>The ChemSpider advisory group </li></ul><ul><li>The Wikipedia Chemistry team </li></ul><ul><li>JC Bradley, Andy Lang – Spectral Game </li></ul>
  98. 99. Thank you [email_address] Twitter: ChemSpiderman
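
The InChIKey slides for Taxol list three keys that share a prefix but differ after the hyphen. A minimal sketch of that comparison follows, assuming the usual InChIKey layout in which the first hyphen-separated block is a hash of the connectivity and the remainder covers the other layers, such as stereochemistry. The helper names are illustrative only, and the comparison shows where keys differ, not which structure is correct.

```python
from itertools import combinations

# InChIKeys exactly as quoted on the slides.
KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_key(key):
    """Split a key at the first hyphen into (first_block, remainder)."""
    first, _, rest = key.partition("-")
    return first, rest


def compare(a, b):
    """Report whether two keys agree in the first block and in the remainder."""
    first_a, rest_a = split_key(a)
    first_b, rest_b = split_key(b)
    return {"same_first_block": first_a == first_b, "same_remainder": rest_a == rest_b}


# Compare every pair of sources.
report = {
    (s1, s2): compare(KEYS[s1], KEYS[s2]) for s1, s2 in combinations(KEYS, 2)
}
for pair, result in report.items():
    print(pair, result)
```

A matching first block suggests the sources agree on the skeleton, so any disagreement lies in the layers hashed into the remainder.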
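
The “Converting Detected Names…” slide describes a three-step fallback: look the name up in a validated dictionary, then try a name-to-structure algorithm, then search non-validated names. The sketch below illustrates only that ordering. The dictionaries, the toy name-to-structure function and the SMILES strings are stand-ins assumed for the example, not ChemSpider data or APIs.

```python
from typing import Callable, Dict, Optional, Tuple

# Toy data standing in for the real resources (all hypothetical).
VALIDATED: Dict[str, str] = {"aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}
UNVALIDATED: Dict[str, str] = {"wood alcohol": "CO"}


def toy_name_to_structure(name: str) -> Optional[str]:
    """Stand-in for a real name-to-structure algorithm; knows two names only."""
    return {"ethanol": "CCO", "methane": "C"}.get(name.strip().lower())


def resolve_name(
    name: str,
    validated: Dict[str, str],
    name_to_structure: Callable[[str], Optional[str]],
    unvalidated: Dict[str, str],
) -> Tuple[Optional[str], str]:
    """Return (structure, route), trying the three routes in the slide's order."""
    key = name.strip().lower()
    # 1. Validated dictionary lookup.
    if key in validated:
        return validated[key], "validated dictionary"
    # 2. Name-to-structure conversion.
    structure = name_to_structure(name)
    if structure is not None:
        return structure, "name-to-structure"
    # 3. Fall back to non-validated names.
    if key in unvalidated:
        return unvalidated[key], "non-validated search"
    return None, "unresolved"


for n in ["Aspirin", "ethanol", "wood alcohol", "unobtainium"]:
    print(n, resolve_name(n, VALIDATED, toy_name_to_structure, UNVALIDATED))
```

Returning the route alongside the structure keeps track of how trustworthy each match is, which matters when results from the non-validated step need manual curation.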