Online Public Compound Databases Antony Williams
Introductions: Hi, I’m Antony Williams, ChemSpiderman. NMR spectroscopist by training. Worked in a gov’t lab, academia, a Fortune 500 company and a start-up; founded ChemSpider; now work for the Royal Society of Chemistry. I am the host of ChemSpider: 25 million compounds, linked to 400 data sources. You’ll hear more…
What’s your interest in Public Compound DBs? What public compound databases do you use? What are you looking to find? What proprietary databases do you presently use? What do you trust? Why are for-fee databases not enough? What issues do you have with free chemistry databases/resources online? What would the ideal solution provide?
Content is King and Quality Costs. Chemistry “content” is big money: patent searching, structures and properties, drug databases, literature databases. Chemical Abstracts Service (CAS) is the “Gold Standard” in chemistry-related information: 103 years of content, $260 million revenue (2006), >50 million substances, >60 million sequences.
What’s the Status of Chemistry online? Encyclopedic articles (Wikipedia) Chemical vendor databases (eMolecules) Metabolic pathway databases (WikiPathways) Virtual Screening databases (ZINC DB) Property databases (Beilstein) Screening assay results (PubChem) Patents with chemical structures (SureChem) ADME/Tox data (OEChem) Scientific publications (Many publishers) Compound aggregators (ChemSpider) Blogs/Wikis and Open Notebook Science (Many)
Synthesis Blogs…TotallySynthetic.com
Org Prep Daily
Molbank (Open Access Journal)
Synthetic Pages (Website)
Lots of “Public Compound” Databases: PubChem, DrugBank, ChEBI/ChEMBL, KEGG, LIPID MAPS, ChemIDplus, eMolecules, ZINC, ChemSpider, lots of chemical vendors. What’s missing? What do you use online?
Where Would You Look? What Do You Trust?
Linked Data on the Web. Taken from: Rafael Sidi’s Blog
What is a compound?
Connections Can Lead Anywhere
Where Would You Look? What Do You Trust?
PubChem
PubChem: PubChem is “a repository of screening data”, BUT it also holds deposits from publishers, vendors and lots of other chemical data holders: almost 30 million compounds. Properties, 3D optimized structures, links to various databases, chemical names, registry numbers, synonyms, trade names. PubChem is a repository: non-curated, with no way to annotate or clean data. Data are free, not “open”.
LIVE DEMO of PubChem: Name a chemical compound, then search and review. Next slides: Methane… Diamond Vancomycin Taxol Cholesterol
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What’s Methane?
What ELSE is Methane???
The Challenges of Internet Data: Text-based searches commonly will get you to “representative data”. Accurate chemical structures are hard to find! Wikipedia IS a good source of accurate chemistry data: not perfect, but good. See Tacrolimus. Tell the story of Domoic Acid (next slide).
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem C&E News (from ACS)
Feedback from Steve Ritter: “As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel. …we always double check them against one or more primary sources, typically Merck Index and SciFinder. Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
Feedback from Steve Ritter: “As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.” “It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
The Challenges of Internet Data: Text-based searches commonly will get you to “representative data”. Accurate chemical structures are hard to find! Wikipedia IS a good source of accurate chemistry data: not perfect, but good. See Tacrolimus. Tell the story of Domoic Acid. Unfortunately, question everything.
Question Everything online: www.dhmo.org
The FDA’s DailyMed
  Structures on DailyMed
Lack of Stereochemistry
Does Stereochemistry Matter?
Does one stereocenter matter? Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
Does one stereocenter matter? Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon,  Thalidomide
  Incorrect Structures
Wow!
Collaborative Knowledge Management
Taxol on PubChem
Drugbank
Digitonin? More Crowdsourcing…
Comments on the Blog September 15th, 2009 at 1:57 pm   It looks like both ChEBI and Wikipedia structures are wrong as far as aglycon is concerned. According to  http://www3.interscience.wiley.com/journal/20330/abstract “… for the first time to confirm beyond all doubt the structure suggested by Tschesche and Wulff for digitonin by means of modern NMR techniques, and to assign all proton and carbon resonances.” Structure 1 shows methyl group at C-20 going UP, i.e. 20β (while by default spirostan is 20α).
CAS as an authority
The Blogging Community Participate
Will it ever end? The community says the structure of digitonin has an “up” 20-methyl. If so, then multiple substances related to digitonin have OPPOSITE stereo at the 20-methyl. The spirostane skeleton is considered to have a “down” methyl group, so all spirostane-related structures would be wrong. The ACD/Dictionary has 24 structures with a close skeleton and all have the “down” methyl group.
Chemistry is REALLY Messy
Vancomycin: Who will curate? How would you clean such a large dataset? Assertions!
An Introduction to the InChI Identifier
Multiple Layers
InChI Strings Hash to InChIKeys
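As a rough illustration of the slide title: an InChIKey is a fixed-length, hash-based condensation of a variable-length InChI string. The sketch below is not the official InChIKey algorithm, which encodes a truncated SHA-256 digest into uppercase letters with its own block layout. The function name `toy_key` and the 14-character length are assumptions, chosen only to show the fixed-length, one-way behaviour.

```python
import hashlib


def toy_key(inchi: str, length: int = 14) -> str:
    """Hypothetical stand-in for an InChIKey: a truncated SHA-256 hex digest.

    NOT the real InChIKey encoding; it only illustrates the hashing idea.
    """
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    return digest[:length].upper()


# Standard InChI strings for two simple compounds.
methane = "InChI=1S/CH4/h1H4"
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

for name, inchi in [("methane", methane), ("ethanol", ethanol)]:
    # Input lengths differ; the derived key length does not.
    print(name, len(inchi), toy_key(inchi))
```

Because the key is a hash, it is convenient for web searching and exact matching, but the structure cannot be read back out of it; the InChI string itself is still needed for that.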
InChIs for Taxol
Back to Taxol. DrugBank: RCINICONZNJXQF-CLDWUXIMDD; ChEBI: RCINICONZNJXQF-GXKQXQCDDN; Wikipedia: RCINICONZNJXQF-MZXODVADBJ. Which one is correct???
Vancomycin: Search the Internet
Full Molecule Search: 4 Hits
Full Skeleton Search: 104 Hits
Vancomycin on ChemSpider: 1 compound, 3 days
Assertion and Chemical Entities: Who says what Taxol is? What is the “timeline” for a molecule? How do we clean up the public data?
InChIKeys for Taxol. DrugBank: RCINICONZNJXQF-CLDWUXIMDD; ChEBI: RCINICONZNJXQF-GXKQXQCDDN; Wikipedia: RCINICONZNJXQF-MZXODVADBJ. Structure validation is tough work! Who is validating chemistry data online???
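The three keys above can be compared block by block. In an InChIKey, the first 14-character block is a hash of the connectivity (the skeleton), and the block after the hyphen covers the remaining layers, such as stereochemistry. A minimal sketch, using a hypothetical helper `split_key` and the key values copied from the slide:

```python
from typing import Tuple


def split_key(key: str) -> Tuple[str, str]:
    """Split an InChIKey into (connectivity block, remainder)."""
    first, _, rest = key.partition("-")
    return first, rest


# Keys as quoted on the slide for Taxol.
taxol_keys = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}

skeletons = {split_key(k)[0] for k in taxol_keys.values()}
remainders = {split_key(k)[1] for k in taxol_keys.values()}

# One distinct skeleton block means the sources agree on connectivity.
same_skeleton = len(skeletons) == 1
# Three distinct remainders mean no two sources agree on the other layers.
all_details_differ = len(remainders) == len(taxol_keys)
print(same_skeleton, all_details_differ)
```

The sources agree on the skeleton and disagree on everything else. The comparison shows where the records diverge, but it cannot say which source, if any, is right; that still takes validation against the literature.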
Bio-Break. Next Up: QUALITY CHOICES for online data; an introduction to ChemSpider; crowdsourced participation and curation
Tony’s Quality Choices For Data: Chemical Abstracts Service and Reaxys: not free but definitely high quality! Wikipedia Chemistry is good. ChEBI (look for “starred” compounds to indicate manual curation). DSSTox: manually curated EPA database, very high quality. ChemIDplus: ongoing curation and good quality. The databases of David Wishart (DrugBank, HMDB, FooDB and others): manually curated, good but not perfect.
ChEBI: http://www.ebi.ac.uk/chebi/ Chemical Entities of Biological Interest from the European Bioinformatics Institute
DSSTox: http://www.epa.gov/comptox/dsstox/ Distributed Structure-Searchable Toxicity database from Ann Richard at the Environmental Protection Agency
ChemIDplus – 350,000 Compounds http://chem.sis.nlm.nih.gov/chemidplus/
DrugBank:  http://www.drugbank.ca/
And Our Own Work... ChemSpider. ChemSpider is: building a structure-centric community for chemists; >25 million compounds, >400 data sources; a deposition and curation platform; a publishing platform for the community. Grows daily: more depositions, more links, more data sources.
How Was ChemSpider Built? ChemSpider was a “hobby project”. Housed in a basement and running off three servers: one bought, two built. Sensitive to weather and power stability. Went live at ACS Spring 2007 in Chicago.
Search Cholesterol
Live DEMO ChemSpider demo…
Link off a structure in ChemSpider: chemical suppliers, other publications, analytical data, related reactions, Wikipedia, patents, “everything”
Answering Questions for Chemists. Questions a chemist might ask… What is the melting point of n-butanol? What is the chemical structure of Xanax? Chemically, what is phenolphthalein? What are the stereocenters of cholesterol? Where can I find publications about xylene? What are the different trade names for Ketoconazole? What is the NMR spectrum of Aspirin? What are the safety handling issues for Thymol Blue?
Complex Data and Information
Crowd-sourcing Chemistry Curation Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Multi-level Curation and Approval
ChemSpider SyntheticPages: ChemSpider Synthesis will be a home for all things “synthetic”. An online resource for synthetic procedures from blogs, other online resources, RSC supplementary info, other publishers etc. Public peer-review and feedback for synthetic procedures.
ChemSpider Everywhere: Embed
ChemSpider Everywhere: Spectral Game
ChemSpider Everywhere Crowdsourced Curation of Spectra
ChemSpider Everywhere ChemMobi
Where is ChemSpider Lacking? More databases coming online monthly. Quality of data remains the primary issue. ChemSpider is limited to “defined chemicals”, with no support for polymers, minerals or Markush structures.
It’s a long road ahead…
Conclusions: The internet enables chemistry, at a reduced cost. Web 2.0 is here and improving quality. Question Quality! Crowdsourcing to expand, curate and integrate. InChIs are enabling chemistry on the internet.
Thank you [email_address] Twitter: ChemSpiderman www.chemspider.com/blog SLIDES: www.slideshare.net/AntonyWilliams
