Online Public Compound Databases


This is a workshop I gave on "Online Public Compound Databases" at the BCCE in Dallas, Texas on August 3rd 2010. It is an overview of online resources, InChI, linking data, online data quality, searching and ChemSpider.

Published in: Technology

Online Public Compound Databases

  1. 1. Online Public Compound Databases Antony Williams
  2. 2. Introductions… <ul><li>Hi… I’m Antony Williams, ChemSpiderman </li></ul><ul><ul><li>NMR Spectroscopist by training </li></ul></ul><ul><ul><li>Worked in gov’t lab, academia, Fortune 500, start-up, founded ChemSpider, now work for the Royal Society of Chemistry </li></ul></ul><ul><ul><li>I am the host of ChemSpider… </li></ul></ul><ul><ul><ul><li>25 million compounds </li></ul></ul></ul><ul><ul><ul><li>Linked to 400 data sources </li></ul></ul></ul><ul><ul><ul><li>You’ll hear more… </li></ul></ul></ul>
  3. 3. What’s your interest in Public Compound DBs? <ul><li>What public compound databases do you use? </li></ul><ul><li>What are you looking to find? </li></ul><ul><li>What proprietary databases do you presently use? </li></ul><ul><li>What do you trust? </li></ul><ul><li>Why are for-fee databases not enough? </li></ul><ul><li>What issues do you have with free chemistry databases/resources online? </li></ul><ul><li>What would the ideal solution provide? </li></ul>
  4. 4. Content is King and Quality Costs <ul><li>Chemistry “content” is big money </li></ul><ul><ul><li>Patent searching </li></ul></ul><ul><ul><li>Structures and properties </li></ul></ul><ul><ul><li>Drug databases </li></ul></ul><ul><ul><li>Literature databases </li></ul></ul><ul><li>Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information </li></ul><ul><ul><li>103 years of content </li></ul></ul><ul><ul><li>$260 million revenue (2006) </li></ul></ul><ul><ul><li>>50 million substances </li></ul></ul><ul><ul><li>>60 million sequences </li></ul></ul>
  5. 5. What’s the Status of Chemistry online? <ul><li>Encyclopedic articles (Wikipedia) </li></ul><ul><li>Chemical vendor databases (eMolecules) </li></ul><ul><li>Metabolic pathway databases (WikiPathways) </li></ul><ul><li>Virtual Screening databases (ZINC DB) </li></ul><ul><li>Property databases (Beilstein) </li></ul><ul><li>Screening assay results (PubChem) </li></ul><ul><li>Patents with chemical structures (SureChem) </li></ul><ul><li>ADME/Tox data (OEChem) </li></ul><ul><li>Scientific publications (Many publishers) </li></ul><ul><li>Compound aggregators (ChemSpider) </li></ul><ul><li>Blogs/Wikis and Open Notebook Science (Many) </li></ul>
  6. 6. Synthesis Blogs…
  7. 7. Org Prep Daily
  8. 8. Molbank (Open Access Journal)
  9. 9. Synthetic Pages (Website)
  10. 10. Lots of “Public Compound” Databases <ul><li>PubChem </li></ul><ul><li>DrugBank </li></ul><ul><li>ChEBI/ChEMBL </li></ul><ul><li>KEGG </li></ul><ul><li>LIPID MAPS </li></ul><ul><li>ChemIDplus </li></ul><ul><li>eMolecules </li></ul><ul><li>ZINC </li></ul><ul><li>ChemSpider </li></ul><ul><li>Lots of chemical vendors </li></ul><ul><li>What’s missing? What do you use online? </li></ul>
  11. 11. Where Would You Look? What Do You Trust?
  12. 12. Linked Data on the Web (taken from Rafael Sidi’s blog)
  13. 13. What is a compound?
  14. 14. Connections Can Lead Anywhere
  15. 15. Where Would You Look? What Do You Trust?
  16. 16. PubChem
  17. 17. PubChem <ul><li>PubChem is “a repository of screening data” </li></ul><ul><li>BUT publishers, vendors and many other chemical data holders also deposit into it: almost 30 million compounds </li></ul><ul><li>Properties, 3D optimized structures, links to various databases, chemical names, registry numbers, synonyms, trade names </li></ul><ul><li>PubChem is a repository – non-curated, with no way to annotate or clean data. Data are free, not “open” </li></ul>
  18. 18. LIVE DEMO of PubChem <ul><li>Name a chemical compound – search and review </li></ul><ul><li>Next slides: </li></ul><ul><ul><li>Methane… </li></ul></ul><ul><ul><li>Diamond </li></ul></ul><ul><ul><li>Vancomycin </li></ul></ul><ul><ul><li>Taxol </li></ul></ul><ul><ul><li>Cholesterol </li></ul></ul>
  19. 19. Chemistry on the Internet Is Messy
  20. 20. It’s Methane…
  21. 21. What’s Methane?
  22. 22. What’s Methane?
  23. 23. What ELSE is Methane???
  24. 24. The Challenges of Internet Data <ul><li>Text-based searches will commonly get you to “representative data” </li></ul><ul><li>Accurate chemical structures are hard to find! </li></ul><ul><li>Wikipedia IS a good source of accurate chemistry data: not perfect, but good. </li></ul><ul><ul><li>See Tacrolimus </li></ul></ul><ul><ul><li>Tell the story of Domoic Acid – Next Slide </li></ul></ul>
  25. 25. The EXPERTS must get it right?!
  26. 26. Wikipedia, C&E News, PubChem <ul><li>C&E News (from ACS) </li></ul>
  27. 27. Feedback from Steve Ritter <ul><li>“As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel. </li></ul><ul><li>… we always double check them against one or more primary sources, typically Merck Index and SciFinder. </li></ul><ul><li>Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.” </li></ul>
  28. 28. Feedback from Steve Ritter <ul><li>“As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.” </li></ul><ul><li>“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.” </li></ul>
  29. 29. The Challenges of Internet Data <ul><li>Text-based searches will commonly get you to “representative data” </li></ul><ul><li>Accurate chemical structures are hard to find! </li></ul><ul><li>Wikipedia IS a good source of accurate chemistry data: not perfect, but good. </li></ul><ul><ul><li>See Tacrolimus </li></ul></ul><ul><ul><li>Tell the story of Domoic Acid </li></ul></ul><ul><li>Unfortunately, question everything </li></ul>
  30. 30. Question Everything Online
  31. 31. The FDA’s DailyMed
  32. 32. Structures on DailyMed
  33. 33. Lack of Stereochemistry
  34. 34. Does Stereochemistry Matter?
  35. 35. Does one stereocenter matter? <ul><li>Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon </li></ul>
  36. 36. Does one stereocenter matter? <ul><li>Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide </li></ul>
  37. 37. Incorrect Structures
  38. 38. Wow!
  39. 39. Collaborative Knowledge Management
  40. 40. Taxol on PubChem
  41. 41. Drugbank
  42. 42. Digitonin? More Crowdsourcing…
  43. 43. Comments on the Blog <ul><li>September 15th, 2009 at 1:57 pm It looks like both ChEBI and Wikipedia structures are wrong as far as aglycon is concerned. According to </li></ul><ul><li>“… for the first time to confirm beyond all doubt the structure suggested by Tschesche and Wulff for digitonin by means of modern NMR techniques, and to assign all proton and carbon resonances.” Structure 1 shows methyl group at C-20 going UP, i.e. 20β (while by default spirostan is 20α). </li></ul>
  44. 44. CAS as an authority
  45. 45. The Blogging Community Participates
  46. 46. Will it ever end? <ul><li>The community says the structure of digitonin has “up” 20-Methyl. </li></ul><ul><li>If so, then multiple substances related to digitonin have OPPOSITE stereo at 20-Methyl </li></ul><ul><li>The spirostane skeleton is considered to have a “down” Methyl group so all spirostane-related structures would be wrong </li></ul><ul><li>The ACD/Dictionary has 24 structures with close skeleton and all have the “down” Methyl group. </li></ul>
  47. 47. Chemistry is REALLY Messy
  48. 48. Vancomycin <ul><li>Who will curate? </li></ul><ul><li>How would you clean such a large dataset? </li></ul><ul><li>Assertions!!! </li></ul>
  49. 49. An Introduction to the InChI Identifier
  50. 50. Multiple Layers
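The "multiple layers" idea can be illustrated with a short sketch. An InChI string is a sequence of slash-separated layers: a version prefix, the molecular formula, and then layers that each begin with a letter (for example `c` for connections, `h` for hydrogens, `t`/`m`/`s` for stereochemistry). The function name `split_inchi_layers` and the L-alanine and methane strings below are illustrative choices, not taken from the slides, and the parser is deliberately simplified: it assumes the second layer is a formula, and if a layer letter occurs more than once (as can happen after a fixed-H or isotopic layer) the later one overwrites the earlier one.

```python
def split_inchi_layers(inchi: str) -> dict[str, str]:
    """Split an InChI string into its layers (simplified sketch).

    Returns a dict with 'version', 'formula' and one entry per
    letter-prefixed layer (e.g. 'c', 'h', 't', 'm', 's').
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")

    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        # Assumption: the layer right after the version is the formula.
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:  # skip empty segments
            # The first character names the layer; the rest is its content.
            layers[part[0]] = part[1:]
    return layers


# Illustrative inputs (not from the slides): methane and L-alanine.
methane = split_inchi_layers("InChI=1S/CH4/h1H4")
alanine = split_inchi_layers(
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
)

for name, value in alanine.items():
    print(f"{name:8s} {value}")
```

Dropping the `t`, `m` and `s` entries from the L-alanine result leaves a description of the same skeleton with no stereochemistry, which is the kind of loss the DailyMed slides illustrate.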
  51. 51. InChI Strings Hash to InChIKeys
  52. 52. InChIs for Taxol
  53. 53. Back to Taxol <ul><li>DrugBank: RCINICONZNJXQF-CLDWUXIMDD </li></ul><ul><li>ChEBI: RCINICONZNJXQF-GXKQXQCDDN </li></ul><ul><li>Wikipedia: RCINICONZNJXQF-MZXODVADBJ </li></ul><ul><li>Which one is correct??? </li></ul>
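One way to read the three keys on the slide above: the block before the hyphen is a hash of the connectivity (the skeleton), while the block after it covers the remaining layers, such as stereochemistry. The following is a minimal sketch of that comparison, using the keys exactly as they appear on the slide; the helper name `compare_inchikeys` is hypothetical.

```python
# InChIKeys for Taxol as listed on the slide.
TAXOL_KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def compare_inchikeys(keys: dict[str, str]) -> dict[str, bool]:
    """Report whether sources agree on the skeleton and on the full key."""
    # First block (before the first hyphen): connectivity hash.
    skeletons = {key.split("-", 1)[0] for key in keys.values()}
    # Whole key: connectivity plus the remaining layers.
    full_keys = set(keys.values())
    return {
        "same_skeleton": len(skeletons) == 1,
        "same_full_structure": len(full_keys) == 1,
    }


result = compare_inchikeys(TAXOL_KEYS)
print(result)
```

For these three keys the first block is identical and the second block differs in every case, so the sources agree on how the atoms are connected but not on the rest of the structure. The comparison can only flag the disagreement; it cannot say which record is correct.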
  54. 54. Vancomycin – Search the Internet
  55. 55. Full Molecule Search: 4 Hits
  56. 56. Full Skeleton Search: 104 Hits
  57. 57. Vancomycin on ChemSpider: 1 compound – 3 days
  58. 58. Assertion and Chemical Entities <ul><li>Who says what Taxol is? </li></ul><ul><li>What is the “timeline” for a molecule? </li></ul><ul><li>How do we clean up the Public data? </li></ul>
  59. 59. InChIKeys for Taxol <ul><li>DrugBank: RCINICONZNJXQF-CLDWUXIMDD </li></ul><ul><li>ChEBI: RCINICONZNJXQF-GXKQXQCDDN </li></ul><ul><li>Wikipedia: RCINICONZNJXQF-MZXODVADBJ </li></ul><ul><li>Structure validation is tough work! </li></ul><ul><li>Who is validating chemistry data online??? </li></ul>
  60. 60. Bio-Break <ul><li>Next Up – QUALITY CHOICES for online data </li></ul><ul><li>An introduction to ChemSpider </li></ul><ul><li>Crowdsourced Participation and Curation </li></ul>
  61. 61. Tony’s Quality Choices For Data <ul><li>Chemical Abstracts Service and Reaxys – not free but definitely high quality! </li></ul><ul><li>Wikipedia Chemistry is good </li></ul><ul><li>ChEBI (look for “starred” compounds to indicate manual curation) </li></ul><ul><li>DSSTox – manually curated EPA database. Very high quality </li></ul><ul><li>ChemIDplus – ongoing curation and good quality </li></ul><ul><li>The databases of David Wishart – manually curated. Good but not perfect – DrugBank, HMDB, FooDB and others </li></ul>
  62. 62. ChEBI: <ul><li>Chemical Entities of Biological Interest from the European Bioinformatics Institute </li></ul>
  63. 63. DSSTox: <ul><li>Distributed Structure-Searchable Toxicity database network from Ann Richard at the Environmental Protection Agency </li></ul>
  64. 64. ChemIDplus – 350,000 Compounds
  65. 65. DrugBank:
  66. 66. And Our Own Work... ChemSpider <ul><li>ChemSpider is: </li></ul><ul><ul><li>Building a Structure Centric Community for Chemists </li></ul></ul><ul><ul><li>>25 million compounds, >400 data sources </li></ul></ul><ul><ul><li>A deposition and curation platform </li></ul></ul><ul><ul><li>A publishing platform for the community </li></ul></ul><ul><ul><li>Grows daily – more depositions, more links, more data sources </li></ul></ul>
  67. 67. How Was ChemSpider Built? <ul><li>ChemSpider was a “hobby project” </li></ul><ul><li>Housed in a basement and running off three servers – one bought, two built </li></ul><ul><li>Sensitive to weather and power stability </li></ul><ul><li>Went live at ACS Spring 2007 in Chicago </li></ul>
  68. 68. Search Cholesterol
  69. 69. Live DEMO <ul><li>ChemSpider demo… </li></ul>
  70. 70. Link off a structure in ChemSpider <ul><ul><li>Chemical suppliers </li></ul></ul><ul><ul><li>Other publications </li></ul></ul><ul><ul><li>Analytical Data </li></ul></ul><ul><ul><li>Related Reactions </li></ul></ul><ul><ul><li>Wikipedia </li></ul></ul><ul><ul><li>Patents </li></ul></ul><ul><ul><li>“Everything” </li></ul></ul>
  71. 71. Answering Questions for Chemists <ul><li>Questions a chemist might ask… </li></ul><ul><ul><li>What is the melting point of n-butanol? </li></ul></ul><ul><ul><li>What is the chemical structure of Xanax? </li></ul></ul><ul><ul><li>Chemically, what is phenolphthalein? </li></ul></ul><ul><ul><li>What are the stereocenters of cholesterol? </li></ul></ul><ul><ul><li>Where can I find publications about xylene? </li></ul></ul><ul><ul><li>What are the different trade names for Ketoconazole? </li></ul></ul><ul><ul><li>What is the NMR spectrum of Aspirin? </li></ul></ul><ul><ul><li>What are the safety handling issues for Thymol Blue? </li></ul></ul>
  72. 72. Complex Data and Information
  73. 73. Crowd-sourcing Chemistry Curation <ul><li>Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate </li></ul>
  74. 74. Multi-level Curation and Approval
  75. 75. ChemSpider SyntheticPages <ul><li>ChemSpider Synthesis will be a home for all things “synthetic” </li></ul><ul><li>An online resource for synthetic procedures from blogs, other online resources, RSC supplementary info, other publishers etc. </li></ul><ul><li>Public peer-review and feedback for synthetic procedures </li></ul>
  76. 76. ChemSpider Everywhere: Embed
  77. 77. ChemSpider Everywhere: Spectral Game
  78. 78. ChemSpider Everywhere: Crowdsourced Curation of Spectra
  79. 79. ChemSpider Everywhere: ChemMobi
  80. 80. Where is ChemSpider Lacking? <ul><li>More databases coming online monthly </li></ul><ul><li>Quality of data remains the primary issue </li></ul><ul><li>ChemSpider is limited to “defined chemicals”. No support for: </li></ul><ul><ul><li>Polymers </li></ul></ul><ul><ul><li>Minerals </li></ul></ul><ul><ul><li>Markush structures </li></ul></ul>
  81. 81. It’s a long road ahead…
  82. 82. Conclusions <ul><li>The internet enables chemistry, at a reduced cost </li></ul><ul><li>Web 2.0 is here and improving quality </li></ul><ul><li>Question Quality! </li></ul><ul><li>Crowdsourcing to expand, curate and integrate </li></ul><ul><li>InChIs are enabling chemistry on the internet </li></ul>
  83. 83. Thank you [email_address] Twitter: ChemSpiderman SLIDES: