Online Public Compound Databases

This is a workshop I gave on "Online Public Compound Databases" at the BCCE in Dallas, Texas on August 3rd 2010. It is an overview of online resources, InChI, linking data, online data quality, searching and ChemSpider.

Online Public Compound Databases

  1. 1. Online Public Compound Databases Antony Williams
  2. 2. Introductions…. <ul><li>Hi….I’m Antony Williams, ChemSpiderman </li></ul><ul><ul><li>NMR Spectroscopist by training </li></ul></ul><ul><ul><li>Worked in gov’t lab, academia, Fortune 500, start-up, founded ChemSpider, now work for the Royal Society of Chemistry </li></ul></ul><ul><ul><li>I am the host of ChemSpider… </li></ul></ul><ul><ul><ul><li>25 million compounds </li></ul></ul></ul><ul><ul><ul><li>Linked to 400 data sources </li></ul></ul></ul><ul><ul><ul><li>You’ll hear more…. </li></ul></ul></ul>
  3. 3. What’s your interest in Public Compound DBs ? <ul><li>What public compound databases do you use? </li></ul><ul><li>What are you looking to find? </li></ul><ul><li>What proprietary databases do you presently use? </li></ul><ul><li>What do you trust? </li></ul><ul><li>Why are for-fee databases not enough? </li></ul><ul><li>What issues do you have with free chemistry databases/resources online? </li></ul><ul><li>What would the ideal solution provide???? </li></ul>
  4. 4. Content is King and Quality Costs <ul><li>Chemistry “content” is big money </li></ul><ul><ul><li>Patent searching </li></ul></ul><ul><ul><li>Structures and properties </li></ul></ul><ul><ul><li>Drug databases </li></ul></ul><ul><ul><li>Literature databases </li></ul></ul><ul><li>Chemical Abstracts Service (CAS), the “Gold Standard” in Chemistry related information </li></ul><ul><ul><li>103 years of content </li></ul></ul><ul><ul><li>$260 million revenue (2006) </li></ul></ul><ul><ul><li>>50 million substances </li></ul></ul><ul><ul><li>>60 million sequences </li></ul></ul>
  5. 5. What’s the Status of Chemistry online? <ul><li>Encyclopedic articles (Wikipedia) </li></ul><ul><li>Chemical vendor databases (eMolecules) </li></ul><ul><li>Metabolic pathway databases (WikiPathways) </li></ul><ul><li>Virtual Screening databases (ZINC DB) </li></ul><ul><li>Property databases (Beilstein) </li></ul><ul><li>Screening assay results (PubChem) </li></ul><ul><li>Patents with chemical structures (SureChem) </li></ul><ul><li>ADME/Tox data (OEChem) </li></ul><ul><li>Scientific publications (Many publishers) </li></ul><ul><li>Compound aggregators (ChemSpider) </li></ul><ul><li>Blogs/Wikis and Open Notebook Science (Many) </li></ul>
  6. 6. Synthesis Blogs…TotallySynthetic.com
  7. 7. Org Prep Daily
  8. 8. Molbank (Open Access Journal)
  9. 9. Synthetic Pages (Website)
  10. 10. Lots of “Public Compound” Databases <ul><li>PubChem </li></ul><ul><li>DrugBank </li></ul><ul><li>ChEBI/ChEMBL </li></ul><ul><li>KEGG </li></ul><ul><li>LIPID MAPS </li></ul><ul><li>ChemIDPlus </li></ul><ul><li>eMolecules </li></ul><ul><li>ZINC </li></ul><ul><li>ChemSpider </li></ul><ul><li>Lots of chemical vendors </li></ul><ul><li>What’s missing??? What do you use online? </li></ul>
  11. 11. Where Would You look? What Do You Trust?
  12. 12. Linked Data on the Web Taken from: Rafael Sidis’ Blog
  13. 13. What is a compound?
  14. 14. Connections Can Lead Anywhere
  15. 15. Where Would You look? What Do You Trust?
  16. 16. PubChem
  17. 17. PubChem <ul><li>PubChem is “a repository of screening data” </li></ul><ul><li>BUT publishers, vendors and many other chemical data holders deposit into it too: almost 30 million compounds </li></ul><ul><li>Properties, 3D optimized structures, links to various databases, chemical names, registry numbers, synonyms, trade names </li></ul><ul><li>PubChem is a repository: non-curated, with no way to annotate or clean data. Data are free, not “open” </li></ul>
  18. 18. LIVE DEMO of PubChem <ul><li>Name a chemical compound – search and review </li></ul><ul><li>Next slides: </li></ul><ul><ul><li>Methane… </li></ul></ul><ul><ul><li>Diamond </li></ul></ul><ul><ul><li>Vancomycin </li></ul></ul><ul><ul><li>Taxol </li></ul></ul><ul><ul><li>Cholesterol </li></ul></ul>
  19. 19. Chemistry on The Internet Is Messy
  20. 20. It’s Methane…
  21. 21. What’s Methane?
  22. 22. What’s Methane?
  23. 23. What ELSE is Methane???
  24. 24. The Challenges of Internet Data <ul><li>Text-based searches commonly will get you to “representative data” </li></ul><ul><li>Accurate chemical structures are hard to find! </li></ul><ul><li>Wikipedia IS a good source of accurate chemistry data: not perfect, but good. </li></ul><ul><ul><li>See Tacrolimus </li></ul></ul><ul><ul><li>Tell the story of Domoic Acid – Next Slide </li></ul></ul>
  25. 25. The EXPERTS must get it right?!
  26. 26. Wikipedia, C&E News, PubChem <ul><li>C&E News (from ACS) </li></ul>
  27. 27. Feedback from Steve Ritter <ul><li>“As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel. </li></ul><ul><li>…we always double check them against one or more primary sources, typically Merck Index and SciFinder. </li></ul><ul><li>Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.” </li></ul>
  28. 28. Feedback from Steve Ritter <ul><li>“As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.” </li></ul><ul><li>“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.” </li></ul>
  29. 29. The Challenges of Internet Data <ul><li>Text-based searches commonly will get you to “representative data” </li></ul><ul><li>Accurate chemical structures are hard to find! </li></ul><ul><li>Wikipedia IS a good source of accurate chemistry data: not perfect, but good. </li></ul><ul><ul><li>See Tacrolimus </li></ul></ul><ul><ul><li>Tell the story of Domoic Acid </li></ul></ul><ul><li>Unfortunately, question everything </li></ul>
  30. 30. Question Everything online: www.dhmo.org
  31. 31. The FDA’s DailyMed
  32. 32. Structures on DailyMed
  33. 33. Lack of Stereochemistry
  34. 34. Does Stereochemistry Matter?
  35. 35. Does one stereocenter matter? <ul><li>Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon </li></ul>
  36. 36. Does one stereocenter matter? <ul><li>Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide </li></ul>
  37. 37. Incorrect Structures
  38. 38. Wow!
  39. 39. Collaborative Knowledge Management
  40. 40. Taxol on PubChem
  41. 41. DrugBank
  42. 42. Digitonin? More Crowdsourcing…
  43. 43. Comments on the Blog <ul><li>September 15th, 2009 at 1:57 pm It looks like both ChEBI and Wikipedia structures are wrong as far as aglycon is concerned. According to http://www3.interscience.wiley.com/journal/20330/abstract </li></ul><ul><li>“… for the first time to confirm beyond all doubt the structure suggested by Tschesche and Wulff for digitonin by means of modern NMR techniques, and to assign all proton and carbon resonances.” Structure 1 shows methyl group at C-20 going UP, i.e. 20β (while by default spirostan is 20α). </li></ul>
  44. 44. CAS as an authority
  45. 45. The Blogging Community Participate
  46. 46. Will it ever end? <ul><li>The community says the structure of digitonin has “up” 20-Methyl. </li></ul><ul><li>If so, then multiple substances related to digitonin have OPPOSITE stereo at 20-Methyl </li></ul><ul><li>The spirostane skeleton is considered to have a “down” Methyl group so all spirostane-related structures would be wrong </li></ul><ul><li>The ACD/Dictionary has 24 structures with close skeleton and all have the “down” Methyl group. </li></ul>
  47. 47. Chemistry is REALLY Messy
  48. 48. Vancomycin <ul><li>Who will curate? </li></ul><ul><li>How would you clean such a large dataset? </li></ul><ul><li>Assertions!!! </li></ul>
  49. 49. An Introduction to the InChI Identifier
  50. 50. Multiple Layers
  51. 51. InChIStrings Hash to InChIKeys
  52. 52. InChIs for Taxol
  53. 53. Back to Taxol <ul><li>DrugBank: RCINICONZNJXQF-CLDWUXIMDD </li></ul><ul><li>ChEBI: RCINICONZNJXQF-GXKQXQCDDN </li></ul><ul><li>Wikipedia: RCINICONZNJXQF-MZXODVADBJ </li></ul><ul><li>Which one is correct??? </li></ul>
  54. 54. Vancomycin – Search the Internet
  55. 55. Full Molecule Search: 4 Hits
  56. 56. Full Skeleton Search: 104 Hits
  57. 57. Vancomycin on ChemSpider 1 compound – 3 days
  58. 58. Assertion and Chemical Entities <ul><li>Who says what Taxol is? </li></ul><ul><li>What is the “timeline” for a molecule? </li></ul><ul><li>How do we clean up the Public data? </li></ul>
  59. 59. InChIKeys for Taxol <ul><li>DrugBank: RCINICONZNJXQF-CLDWUXIMDD </li></ul><ul><li>ChEBI: RCINICONZNJXQF-GXKQXQCDDN </li></ul><ul><li>Wikipedia: RCINICONZNJXQF-MZXODVADBJ </li></ul><ul><li>Structure validation is tough work! </li></ul><ul><li>Who is validating chemistry data online??? </li></ul>
  60. 60. Bio-Break <ul><li>Next Up – QUALITY CHOICES for online data </li></ul><ul><li>An introduction to ChemSpider </li></ul><ul><li>Crowdsourced Participation and Curation </li></ul>
  61. 61. Tony’s Quality Choices For Data <ul><li>Chemical Abstracts Service and Reaxys – not free but definitely high quality! </li></ul><ul><li>Wikipedia Chemistry is good </li></ul><ul><li>ChEBI (look for “starred” compounds to indicate manual curation) </li></ul><ul><li>DSSTox – manually curated EPA database. Very high quality </li></ul><ul><li>ChemIDPlus – ongoing curation and good quality </li></ul><ul><li>The databases of David Wishart – manually curated. Good but not perfect – DrugBank, HMDB, FooDB and others </li></ul>
  62. 62. ChEBI: http://www.ebi.ac.uk/chebi/ <ul><li>Chemical Entities of Biological Interest from the European Bioinformatics Institute </li></ul>
  63. 63. DSSTox: http://www.epa.gov/comptox/dsstox/ <ul><li>Distributed Structure-Searchable Toxicity (DSSTox) database from Ann Richard at the Environmental Protection Agency </li></ul>
  64. 64. ChemIDPlus – 350,000 Compounds http://chem.sis.nlm.nih.gov/chemidplus/
  65. 65. DrugBank: http://www.drugbank.ca/
  66. 66. And Our Own Work... ChemSpider <ul><li>ChemSpider is: </li></ul><ul><ul><li>Building a Structure Centric Community for Chemists </li></ul></ul><ul><ul><li>>25 million compounds, >400 data sources </li></ul></ul><ul><ul><li>A deposition and curation platform </li></ul></ul><ul><ul><li>A publishing platform for the community </li></ul></ul><ul><ul><li>Grows daily – more depositions, more links, more data sources </li></ul></ul>
  67. 67. How Was ChemSpider Built? <ul><li>ChemSpider was a “hobby project” </li></ul><ul><li>Housed in a basement and running off three servers – one bought, two built </li></ul><ul><li>Sensitive to weather and power stability </li></ul><ul><li>Went live at ACS Spring 2007 in Chicago </li></ul>
  68. 68. Search Cholesterol
  69. 69. Live DEMO <ul><li>ChemSpider demo… </li></ul>
  70. 70. Link off a structure in ChemSpider <ul><ul><li>Chemical suppliers </li></ul></ul><ul><ul><li>Other publications </li></ul></ul><ul><ul><li>Analytical Data </li></ul></ul><ul><ul><li>Related Reactions </li></ul></ul><ul><ul><li>Wikipedia </li></ul></ul><ul><ul><li>Patents </li></ul></ul><ul><ul><li>“Everything” </li></ul></ul>
  71. 71. Answering Questions for Chemists <ul><li>Questions a chemist might ask… </li></ul><ul><ul><li>What is the melting point of n-butanol? </li></ul></ul><ul><ul><li>What is the chemical structure of Xanax? </li></ul></ul><ul><ul><li>Chemically, what is phenolphthalein? </li></ul></ul><ul><ul><li>What are the stereocenters of cholesterol? </li></ul></ul><ul><ul><li>Where can I find publications about xylene? </li></ul></ul><ul><ul><li>What are the different trade names for Ketoconazole? </li></ul></ul><ul><ul><li>What is the NMR spectrum of Aspirin? </li></ul></ul><ul><ul><li>What are the safety handling issues for Thymol Blue? </li></ul></ul>
  72. 72. Complex Data and Information
  73. 73. Crowd-sourcing Chemistry Curation <ul><li>Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate </li></ul>
  74. 74. Multi-level Curation and Approval
  75. 75. ChemSpider SyntheticPages <ul><li>ChemSpider Synthesis will be a home for all things “synthetic” </li></ul><ul><li>An online resource for synthetic procedures from blogs, other online resources, RSC supplementary info, other publishers etc. </li></ul><ul><li>Public peer-review and feedback for synthetic procedures </li></ul>
  76. 76. ChemSpider Everywhere : Embed
  77. 77. ChemSpider Everywhere: Spectral Game
  78. 78. ChemSpider Everywhere Crowdsourced Curation of Spectra
  79. 79. ChemSpider Everywhere ChemMobi
  80. 80. Where is ChemSpider Lacking? <ul><li>More databases coming online monthly </li></ul><ul><li>Quality of data remains the primary issue </li></ul><ul><li>ChemSpider is limited to “defined chemicals”. No support for: </li></ul><ul><ul><li>Polymers </li></ul></ul><ul><ul><li>Minerals </li></ul></ul><ul><ul><li>Markush structures </li></ul></ul>
  81. 81. It’s a long road ahead…
  82. 82. Conclusions <ul><li>The internet enables chemistry, at a reduced cost </li></ul><ul><li>Web 2.0 is here and improving quality </li></ul><ul><li>Question Quality! </li></ul><ul><li>Crowdsourcing to expand, curate and integrate </li></ul><ul><li>InChIs are enabling chemistry on the internet </li></ul>
  83. 83. Thank you [email_address] Twitter: ChemSpiderman www.chemspider.com/blog SLIDES: www.slideshare.net/AntonyWilliams
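
The InChI slides (50 to 53, and 59) can be made a little more concrete with a short Python sketch. It is an illustration under stated assumptions, not the official InChI software: the helper names `split_inchi_layers`, `split_inchikey` and `compare_keys` are hypothetical, the L-alanine string is an example that does not appear in the slides, and the layer parser only handles simple InChI strings with no fixed-H or reconnected sections. The three Taxol keys are copied from the slides as shown. The sketch relies only on the fact that the first hyphen-separated block of an InChIKey is derived from the connectivity (the "skeleton"), while the remainder reflects the other layers, such as stereochemistry.

```python
# Illustrative sketch only. Helper names are hypothetical and are not part
# of any InChI library; real work should use the official InChI software.

# Single-letter prefixes of the most common InChI layers.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi_layers(inchi):
    """Split a simple InChI string into a dict of named layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version, "formula": formula}
    for part in rest:
        if part:  # skip empty segments
            layers[LAYER_NAMES.get(part[0], part[0])] = part[1:]
    return layers


def split_inchikey(key):
    """Return (first block, remainder) of an InChIKey."""
    first, _, rest = key.partition("-")
    return first, rest


def compare_keys(keys_by_source):
    """Report whether sources agree on the skeleton and on the finer details."""
    firsts = {split_inchikey(k)[0] for k in keys_by_source.values()}
    rests = {split_inchikey(k)[1] for k in keys_by_source.values()}
    return {"same_skeleton": len(firsts) == 1, "same_details": len(rests) == 1}


# Example InChI (L-alanine); an assumption for illustration, not from the slides.
ALANINE_INCHI = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

# Taxol InChIKeys exactly as quoted on the slides.
TAXOL_KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}

if __name__ == "__main__":
    for name, value in split_inchi_layers(ALANINE_INCHI).items():
        print(f"{name}: {value}")
    print(compare_keys(TAXOL_KEYS))
```

Comparing the first block alone roughly corresponds to the "Full Skeleton Search" on slide 56, and comparing the whole key to the "Full Molecule Search" on slide 55. For the three Taxol keys the sources agree on the skeleton but not on the remaining layers, which is why the slides ask which one is correct.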