Online Public Compound Databases

This is a workshop I gave on "Online Public Compound Databases" at the BCCE in Dallas, Texas on August 3rd 2010. It is an overview of online resources, InChI, linking data, online data quality, searching and ChemSpider.

Online Public Compound Databases Presentation Transcript

  • 1. Online Public Compound Databases Antony Williams
  • 2. Introductions….
    • Hi….I’m Antony Williams, ChemSpiderman
      • NMR Spectroscopist by training
      • Worked in gov’t lab, academia, Fortune 500, start-up, founded ChemSpider, now work for the Royal Society of Chemistry
      • I am the host of ChemSpider…
        • 25 million compounds
        • Linked to 400 data sources
        • You’ll hear more….
  • 3. What’s your interest in Public Compound DBs?
    • What public compound databases do you use?
    • What are you looking to find?
    • What proprietary databases do you presently use?
    • What do you trust?
    • Why are for-fee databases not enough?
    • What issues do you have with free chemistry databases/resources online?
    • What would the ideal solution provide????
  • 4. Content is King and Quality Costs
    • Chemistry “content” is big money
      • Patent searching
      • Structures and properties
      • Drug databases
      • Literature databases
    • Chemical Abstracts Service (CAS), the “Gold Standard” in Chemistry related information
      • 103 years of content
      • $260 million revenue (2006)
      • >50 million substances
      • >60 million sequences
  • 5. What’s the Status of Chemistry online?
    • Encyclopedic articles (Wikipedia)
    • Chemical vendor databases (eMolecules)
    • Metabolic pathway databases (WikiPathways)
    • Virtual Screening databases (ZINC DB)
    • Property databases (Beilstein)
    • Screening assay results (PubChem)
    • Patents with chemical structures (SureChem)
    • ADME/Tox data (OEChem)
    • Scientific publications (Many publishers)
    • Compound aggregators (ChemSpider)
    • Blogs/Wikis and Open Notebook Science (Many)
  • 6. Synthesis Blogs…TotallySynthetic.com
  • 7. Org Prep Daily
  • 8. Molbank (Open Access Journal)
  • 9. Synthetic Pages (Website)
  • 10. Lots of “Public Compound” Databases
    • PubChem
    • DrugBank
    • ChEBI/ChEMBL
    • KEGG
    • LIPID MAPS
    • ChemIDplus
    • eMolecules
    • ZINC
    • ChemSpider
    • Lots of chemical vendors
    • What’s missing??? What do you use online?
  • 11. Where Would You look? What Do You Trust?
  • 12. Linked Data on the Web (taken from Rafael Sidi’s blog)
  • 13. What is a compound?
  • 14. Connections Can Lead Anywhere
  • 15. Where Would You look? What Do You Trust?
  • 16. PubChem
  • 17. PubChem
    • PubChem is “a repository of screening data”
    • BUT it also holds data from publishers, vendors and many other chemical data holders: almost 30 million compounds
    • Properties, 3D optimized structures, links to various databases, chemical names, registry numbers, synonyms, trade names
    • PubChem is a repository – non-curated, no way to annotate or clean data. Data are free, not “open”
  • 18. LIVE DEMO of PubChem
    • Name a chemical compound – search and review
    • Next slides:
      • Methane…
      • Diamond
      • Vancomycin
      • Taxol
      • Cholesterol
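The name searches in this demo were run through the PubChem web page. For readers who want to repeat them programmatically, here is a minimal sketch that only builds the lookup URLs. It assumes PubChem's PUG REST interface, which was not part of the original demo, and the helper name `name_lookup_url` is made up for illustration. No network request is made.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service (an assumption of this sketch;
# the workshop itself used the PubChem web pages).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_lookup_url(name, properties=("InChIKey", "MolecularFormula")):
    """Build a PUG REST URL that looks a compound up by name."""
    props = ",".join(properties)
    # Percent-encode the name so that spaces and slashes are safe in the path.
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/property/{props}/JSON"


# The compounds named on the slide.
for compound in ["methane", "diamond", "vancomycin", "taxol", "cholesterol"]:
    print(name_lookup_url(compound))
```

Whatever a name search returns is still only the database's assertion about that name, so the results need the same scrutiny as the web pages shown in the following slides.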
  • 19. Chemistry on The Internet Is Messy
  • 20. It’s Methane…
  • 21. What’s Methane?
  • 22. What’s Methane?
  • 23. What ELSE is Methane???
  • 24. The Challenges of Internet Data
    • Text-based searches commonly will get you to “representative data”
    • Accurate chemical structures are hard to find!
    • Wikipedia IS a good source of accurate chemistry data: not perfect, but good.
      • See Tacrolimus
      • Tell the story of Domoic Acid – Next Slide
  • 25. The EXPERTS must get it right?!
  • 26. Wikipedia, C&E News, PubChem
    • C&E News (from ACS)
  • 27. Feedback from Steve Ritter
    • “As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel.
    • ..we always double check them against one or more primary sources, typically Merck Index and SciFinder.
    • Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
  • 28. Feedback from Steve Ritter
    • “As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.”
    • “It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
  • 29. The Challenges of Internet Data
    • Text-based searches commonly will get you to “representative data”
    • Accurate chemical structures are hard to find!
    • Wikipedia IS a good source of accurate chemistry data: not perfect, but good.
      • See Tacrolimus
      • Tell the story of Domoic Acid
    • Unfortunately, question everything 
  • 30. Question Everything online: www.dhmo.org
  • 31. The FDA’s DailyMed
  • 32. Structures on DailyMed
  • 33. Lack of Stereochemistry
  • 34. Does Stereochemistry Matter?
  • 35. Does one stereocenter matter?
    • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
  • 36. Does one stereocenter matter?
    • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide
  • 37. Incorrect Structures
  • 38. Wow!
  • 39. Collaborative Knowledge Management
  • 40. Taxol on PubChem
  • 41. Drugbank
  • 42. Digitonin? More Crowdsourcing…
  • 43. Comments on the Blog
    • September 15th, 2009 at 1:57 pm: It looks like both ChEBI and Wikipedia structures are wrong as far as aglycon is concerned. According to http://www3.interscience.wiley.com/journal/20330/abstract
    • “… for the first time to confirm beyond all doubt the structure suggested by Tschesche and Wulff for digitonin by means of modern NMR techniques, and to assign all proton and carbon resonances.” Structure 1 shows methyl group at C-20 going UP, i.e. 20β (while by default spirostan is 20α).
  • 44. CAS as an authority
  • 45. The Blogging Community Participates
  • 46. Will it ever end?
    • The community says the structure of digitonin has “up” 20-Methyl.
    • If so, then multiple substances related to digitonin have OPPOSITE stereo at 20-Methyl
    • The spirostane skeleton is considered to have a “down” Methyl group so all spirostane-related structures would be wrong
    • The ACD/Dictionary has 24 structures with close skeleton and all have the “down” Methyl group.
  • 47. Chemistry is REALLY Messy
  • 48. Vancomycin
    • Who will curate?
    • How would you clean such a large dataset?
    • Assertions!!!
  • 49. An Introduction to the InChI Identifier
  • 50. Multiple Layers
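The slide image is not part of this transcript, so here is a small sketch of what "multiple layers" means in practice. An InChI string is a sequence of layers separated by "/", and most layers start with a one-letter prefix. The prefix table below covers only the common layers, and the L-alanine string is an example chosen for this sketch, not one taken from the talk.

```python
# Prefix letters of the most common InChI layers.
# The version and formula layers come first and carry no prefix letter.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi(inchi):
    """Return an ordered list of (layer name, text) pairs for one InChI string.

    Simplification: assumes a formula layer is present, which holds for
    ordinary compounds but not for every valid InChI.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = [("version", version), ("formula", formula)]
    for segment in rest:
        if segment:
            layers.append((LAYER_NAMES.get(segment[0], "other"), segment[1:]))
    return layers


# Example input chosen for this sketch: L-alanine, which has one stereocenter.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
for name, text in split_inchi(L_ALANINE):
    print(f"{name:24s}{text}")
```

A record drawn without stereochemistry lacks the stereo layers entirely, which is how two databases can agree on formula and connectivity while still describing different compounds.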
  • 51. InChIStrings Hash to InChIKeys
  • 52. InChIs for Taxol
  • 53. Back to Taxol
    • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
    • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
    • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
    • Which one is correct???
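The three keys can be compared mechanically. The sketch below splits each key at the hyphen, on the understanding that the first block of an InChIKey is hashed from the connectivity (the skeleton) and the rest from the remaining layers such as stereochemistry. The keys are copied from the slide and use the older 25-character format; the helper name `split_key` is made up for illustration.

```python
# InChIKeys exactly as quoted on the slide (older 25-character key format).
TAXOL_KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_key(key):
    """Split an InChIKey into (first block, remainder) at the first hyphen."""
    first, _, rest = key.partition("-")
    return first, rest


# Collect the distinct first blocks and the distinct remainders.
skeletons = {split_key(key)[0] for key in TAXOL_KEYS.values()}
details = {split_key(key)[1] for key in TAXOL_KEYS.values()}

print("sources share one connectivity block:", len(skeletons) == 1)
print("distinct remaining blocks:", len(details))
```

All three sources agree on the skeleton and disagree on everything after it. A key comparison can show that the records differ, but it cannot say which of them is right.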
  • 54. Vancomycin – Search the Internet
  • 55. Full Molecule Search: 4 Hits
  • 56. Full Skeleton Search: 104 Hits
  • 57. Vancomycin on ChemSpider 1 compound – 3 days
  • 58. Assertion and Chemical Entities
    • Who says what Taxol is?
    • What is the “timeline” for a molecule?
    • How do we clean up the Public data?
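One way to make "who says what Taxol is?" concrete is to store each claim together with its source, so that disagreements can be found automatically. This is only a sketch of that idea, not how ChemSpider or any other database stores its data. The `Assertion` record and the `find_conflicts` helper are made up for illustration, and the keys are the ones quoted for Taxol on slide 53.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    """One claim: a source says that a name corresponds to a structure."""
    source: str    # who makes the claim
    name: str      # the name being claimed
    inchikey: str  # the structure the source attaches to that name


# Keys as quoted for Taxol on slide 53.
ASSERTIONS = [
    Assertion("DrugBank", "Taxol", "RCINICONZNJXQF-CLDWUXIMDD"),
    Assertion("ChEBI", "Taxol", "RCINICONZNJXQF-GXKQXQCDDN"),
    Assertion("Wikipedia", "Taxol", "RCINICONZNJXQF-MZXODVADBJ"),
]


def find_conflicts(assertions):
    """Return {name: {inchikey: [sources]}} for names with more than one key."""
    by_name = defaultdict(lambda: defaultdict(list))
    for claim in assertions:
        by_name[claim.name][claim.inchikey].append(claim.source)
    return {
        name: dict(keys)
        for name, keys in by_name.items()
        if len(keys) > 1
    }


for name, keys in find_conflicts(ASSERTIONS).items():
    print(f"{name}: {len(keys)} competing structures")
    for key, sources in keys.items():
        print(f"  {key}  asserted by {', '.join(sources)}")
```

Adding a date field to each record would also give the "timeline" for a molecule. Flagging a conflict is the easy part; resolving it still takes a curator.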
  • 59. InChIKeys for Taxol
    • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
    • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
    • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
    • Structure validation is tough work!
    • Who is validating chemistry data online???
  • 60. Bio-Break
    • Next Up – QUALITY CHOICES for online data
    • An introduction to ChemSpider
    • Crowdsourced Participation and Curation
  • 61. Tony’s Quality Choices For Data
    • Chemical Abstracts Service and Reaxys – not free but definitely high quality!
    • Wikipedia Chemistry is good
    • ChEBI (look for “starred” compounds to indicate manual curation)
    • DSSTox – manually curated EPA database. Very high quality
    • ChemIDplus – ongoing curation and good quality
    • The databases of David Wishart – manually curated. Good but not perfect – DrugBank, HMDB, FooDB and others
  • 62. ChEBI: http://www.ebi.ac.uk/chebi/
    • Chemical Entities of Biological Interest from the European Bioinformatics Institute
  • 63. DSSTox: http://www.epa.gov/comptox/dsstox/
    • Distributed Structure-Searchable Toxicity (DSSTox) database from Ann Richard at the Environmental Protection Agency
  • 64. ChemIDplus – 350,000 Compounds http://chem.sis.nlm.nih.gov/chemidplus/
  • 65. DrugBank: http://www.drugbank.ca/
  • 66. And Our Own Work... ChemSpider
    • ChemSpider is:
      • Building a Structure Centric Community for Chemists
      • >25 million compounds, >400 data sources
      • A deposition and curation platform
      • A publishing platform for the community
      • Grows daily – more depositions, more links, more data sources
  • 67. How Was ChemSpider Built?
    • ChemSpider was a “hobby project”
    • Housed in a basement and running off three servers – one bought, two built
    • Sensitive to weather and power stability
    • Went live at ACS Spring 2007 in Chicago
  • 68. Search Cholesterol
  • 69. Live DEMO
    • ChemSpider demo…
  • 70. Link off a structure in ChemSpider
    • Chemical suppliers
    • Other publications
    • Analytical Data
    • Related Reactions
    • Wikipedia
    • Patents
    • “Everything”
  • 71. Answering Questions for Chemists
    • Questions a chemist might ask…
      • What is the melting point of n-butanol?
      • What is the chemical structure of Xanax?
      • Chemically, what is phenolphthalein?
      • What are the stereocenters of cholesterol?
      • Where can I find publications about xylene?
      • What are the different trade names for Ketoconazole?
      • What is the NMR spectrum of Aspirin?
      • What are the safety handling issues for Thymol Blue?
  • 72. Complex Data and Information
  • 73. Crowd-sourcing Chemistry Curation
    • Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
  • 74. Multi-level Curation and Approval
  • 75. ChemSpider SyntheticPages
    • ChemSpider SyntheticPages will be a home for all things “synthetic”
    • An online resource for synthetic procedures from blogs, other online resources, RSC supplementary info, other publishers etc.
    • Public peer-review and feedback for synthetic procedures
  • 76. ChemSpider Everywhere : Embed
  • 77. ChemSpider Everywhere: Spectral Game
  • 78. ChemSpider Everywhere Crowdsourced Curation of Spectra
  • 79. ChemSpider Everywhere ChemMobi
  • 80. Where is ChemSpider Lacking?
    • More databases coming online monthly
    • Quality of data remains the primary issue
    • ChemSpider is limited to “defined chemicals”. No support for:
      • Polymers
      • Minerals
      • Markush structures
  • 81. It’s a long road ahead…
  • 82. Conclusions
    • The internet enables chemistry, at a reduced cost
    • Web 2.0 is here and improving quality
    • Question Quality!
    • Crowdsourcing to expand, curate and integrate
    • InChIs are enabling chemistry on the internet
  • 83. Thank you
    • [email_address]
    • Twitter: ChemSpiderman
    • www.chemspider.com/blog
    • SLIDES: www.slideshare.net/AntonyWilliams