Integrating and curating internet based chemistry resources to serve life scientists


The internet now offers access to a myriad of online resources that can be of value to chemists working in the life sciences. While finding information online is, in many cases, a simple search away, the accuracy and validity of the associated data and information should be questioned. As more databases and resources are introduced online, often without integration with other resources, a scientist must perform multiple searches and then undertake the task of meshing and merging data. ChemSpider is a freely accessible online database that has taken on the challenge of meshing together distributed resources across the internet to provide a structure-based hub. It is a crowdsourcing environment hosting over 26 million unique compounds linked out to over 400 data sources. With well-defined programming interfaces, ChemSpider has been integrated into many commercial and open software packages and presently serves as the chemistry foundation for the IMI Open PHACTS project.

Published in: Technology

  1. 1. ChemSpider – Integrating and Curating Internet-Based Chemistry Resources to Serve Life Scientists Antony Williams PharmSciFair, July 2011
  2. 2. The Internet for Life Scientists
     - What resources are you using online?
     - How well are they working?
     - What problems exist, that you know of?
     - ChemSpider – “curating Chemistry with the world”
     - The benefits of crowdsourcing chemistry
     - A Semantic Web for the Life Sciences
     - An introduction to Open PHACTS
  3. 3. Where is chemistry online?
     - Encyclopedic articles (Wikipedia)
     - Chemical vendor databases
     - Metabolic pathway databases
     - Property databases
     - Patents with chemical structures
     - Drug Discovery data
     - Scientific publications
     - Compound aggregators
     - Blogs/Wikis and Open Notebook Science
  4. 4. Where can we find data online?
  5. 5. Life Scientists and Online Resources
     - Where do life scientists source information online?
       - PubChem
       - ChEBI/ChEMBL
       - Protein Data Bank (PDB)
       - DrugBank
       - Wikipedia
       - What else do you use??
       - What do you TRUST??
  6. 6. What is the Structure of Vitamin K1?
  7. 7. What is the Structure of Vitamin K1?
  8. 8. CAS’s Common Chemistry
  9. 9. Wikipedia
  10. 12. ChEBI – Manual Curation
  11. 15. PubChem
  12. 17. “2-methyl-3-(3,7,11,15-tetramethyl hexadec-2-enyl)naphthalene-1,4-dione”
      - Variants of systematic names on PubChem
      - 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
      - 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
      - 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
      - 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
      - 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
      - 2-methyl-3-[(E)-3,7,11,15-tetramethyl
      - 2-methyl-3-(3,7,11,15-tetramethyl
      - 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  13. 18. Public Domain Chemistry Databases
      - Our databases are a mess…
      - Non-curated databases are proliferating errors
      - We source and deposit data between databases
      - Original sources of errors hard to determine
      - Curation is time-consuming, challenging and exacting
  14. 19. Lipitor
      - What are people ACTUALLY measuring BioAssays on?
      - Does stereochemistry matter?
  15. 20. The FDA’s DailyMed
  16. 21. Structures on DailyMed
  17. 22. Lack of Stereochemistry
  18. 23. Incorrect Structures
  19. 24. PDSP
      - The database has 55440 Ki values for searching
  20. 25. PDSP Structures – Canonical SMILES Is Stereochemistry important???!!!
  21. 26. What’s Methane?
  22. 27. What’s Methane?
  23. 28. What ELSE is Methane???
  24. 29. Build Models with GOOD DATA!
  25. 31. So you want data on drugs???
      - Sourcing data based on drug names is difficult!
      - Where would you find the “correct chemical structures”?
      - What databases can you trust?
  26. 32. Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
  27. 34. Public Domain Chemistry Databases
      - An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs
  28. 36. Vytorin: Ezetimibe/Simvastatin
  29. 37. Vytorin: Ezetimibe/Simvastatin
  30. 38. Vytorin: Ezetimibe/Simvastatin
  31. 39. Vytorin: Ezetimibe/Simvastatin
  32. 40. Vytorin: Ezetimibe/Simvastatin
  33. 41. Taxol: Paclitaxel 44 structures
  34. 42.
      | Drug Name | Generic Name | ChEBI | ChemSpider | CAS Com. Chem | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia |
      |---|---|---|---|---|---|---|---|---|---|
      | Spiriva | Tiotropium Bromide | No Hits | | No Hits | | | | 4/0 | |
      | Depakote | Valproate semisodium | | | | | | | | No Structure |
      | Basen | Voglibose | | | No Hits | | No Hits | | 2/1 | |
      | Symbicort | 1) Budesonide | | | | | | | 8/1 | |
      | Symbicort | 2) Formoterol | WRONG | | No Hits | | | | 6/1 | |
      | Vytorin | 1) Ezetimibe | | | No Hits | | | | | |
      | Vytorin | 2) Simvastatin | | | | | | | 2/1 | |
      | Taxol | Paclitaxel | | | | | | | 44/1 | |
      | Thalidomid | Thalidomide | No Hits | | | | | | | |
      | Zocor | Simvastatin | | | | | | | 2/1 | |
      | Crestor | Rosuvastatin | | | No Hits | | | | 2/1 | |
  35. 43. Vision: Connect Chemistry on the Web
      - The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
      - Chemistry articles are indexed and searchable by a free online service
      - The web is linked together through the “language of chemistry”
      - Publicly funded research data is linked
  36. 44. We Have Delivered the Vision
      - “Build a Structure Centric Community to Serve Chemists”
      - Integrate chemical structure data on the web
      - Create a “structure-based hub” to information, data and algorithmic predictions
      - Let chemists contribute their own data
      - Allow the community to curate/correct data
  37. 45.
  38. 46. We Want to Answer Questions
      - Questions a chemist might ask…
        - What is the melting point of n-heptanol?
        - What is the chemical structure of Xanax?
        - Chemically, what is phenolphthalein?
        - What are the stereocenters of cholesterol?
        - Where can I find publications about vancomycin?
        - What are the different trade names for Ketoconazole?
        - What is the NMR spectrum of Aspirin?
        - What are the safety handling issues for Thymol Blue?
  39. 47. Search for a Chemical…by name
  40. 48. Link off a structure in ChemSpider
      - Chemical suppliers
      - Other publications
      - Analytical Data
      - Related Reactions
      - Wikipedia
      - Patents
      - “Everything”
  41. 49. Available Information…
      - Linked to vendors, safety data, toxicity, metabolism
  42. 50. Available Information….
  43. 51. What else is available?
      - Links to patents – SureChem and Google Patents
      - Links to literature – PubMed, Google Scholar, RSC backfile and databases
      - Measured and experimental physchem data
      - Links to prediction algorithms
      - Links to suppliers
  44. 52. Structure and substructure searches
  45. 53. Crowdsourced “Annotations”
      - Users can add
        - Descriptions/Syntheses/Commentaries
        - Links to PubMed articles
        - Links to articles via DOIs
        - Spectral data
        - Crystallographic Information Files
        - Photos
        - MP3 files
        - Videos
  46. 55. Content is King and Quality Costs
      - Curated Chemistry “content” is expensive to create
        - Patent searching
        - Structures and properties
        - Drug databases
        - Literature databases
      - Chemical Abstracts Service (CAS), lauded as the “Gold Standard” in Chemistry related information
        - 104 years of content
        - >50 million substances
        - Proprietary platform
  47. 56. With Great Fanfare…also costs…
  48. 57. NPC Browser
  49. 58. Curation required
  50. 59. Curation required
  51. 60. My favorite
  52. 61. Neomycin
  53. 62. Inherited Errors
      - Inherited errors from every database… all public compound databases, including ours, have errors
      - “Incorrect” structures – assertions, timelines etc
      - “Incorrect” names associated with structures
      - ENORMOUS CHALLENGE
  54. 63. Online Curation
      - Online databases generally do NOT allow curation or annotation
      - If you find errors they stay there!
      - ChemSpider allows immediate curation
  55. 64. Search “Vitamin H”
  56. 65. “Curate” Identifiers
  57. 66. “Curate” Identifiers
  58. 67. “Curate” Identifiers
  59. 68. Crowd-sourcing Chemistry Curation
  60. 69. Crowdsourcing Works
      - >130 people have deposited data and participated in data curation
      - Different level curators check each other
      - Wikipedia is the modern primary example
      - It ALSO works for crowdsourcing SYNTHESIS
  61. 70. ChemSpider SyntheticPages
  62. 71. Why Curated Dictionaries Matter
  63. 72. Nature Chemistry
  64. 73. Success Depends on Dictionaries
  65. 74. Validated Name-Structure Dictionaries
      - Chemical name dictionaries are used for:
        - Text-mining (publications, patents)
          - Used to index PubMed and link to Google Patents
        - Linking to other databases – think Biology!
          - When structures are not available drug names link
        - Searching the web
          - Names link to structures link to InChIs
  66. 75. The InChI Identifier
  67. 76. Multiple Layers
  68. 77. InChIStrings Hash to InChIKeys
  69. 78. Vancomycin – Search the Internet
  70. 79. Full Skeleton Search: 104 Hits
  71. 80. Full Molecule Search: 4 Hits
  72. 81. OpenTox uses InChIs
  73. 82. There will always be gaps...
      - What ChemSpider does not deal with, yet...
        - Materials
        - Minerals
        - Polymers
        - Biological macromolecules
        - Mappings to diseases, targets etc. ONLY to chemicals in other databases
        - Daily updates to chemistry!
  74. 83. Continuous changes... June 2011 USANs
  75. 84. The Future of Chemistry on the Web?
      - Public compound databases federate & build a linked environment of validated data!
      - Data validation needs are not ignored
      - Publishers layer on information to make publications discoverable
      - Public-Private databases can be linked
      - Open Data proliferate
      - The “Semantic Web” in action
      - It will require COLLABORATION
  76. 85. The Future: Open PHACTS
  77. 87. Open PHACTS Overview
      - Develop a set of robust standards to enable:
        - Integration between data sources via semantic technologies
        - Development of high quality assertions
        - Workflows and analysis pipelines across resources
  78. 88. Open PHACTS Overview
      - Implement standards in a semantic integration hub (“Open Pharmacological Space”)
        - Develop an open, public domain infrastructure for drug discovery data integration
        - Development of open web-services for drug discovery
        - Development of a secure access model to enable queries with proprietary data
  79. 89. Open PHACTS Overview
      - Deliver services to support ongoing drug discovery programs in pharma and public domain
        - Align development of standards, vocabularies and data integration to selected drug discovery issues
  80. 90. Open PHACTS Project Partners
      - Collaboration between pharmaceutical companies, medicinal chemists, cheminformaticians, semantic web scientists and publishers
      - Will include public-private data sharing
  81. 91. Acknowledgments
      - RSC|ChemSpider team
      - The “Crowd” of curators
      - All Data Source providers
      - The Open PHACTS team – a large cast!!!
      - GGA Software Services
      - ACD/Labs
      - OpenEye
      - Accelrys
  82. 92. Thank you Email: Twitter: ChemConnector Personal Blog: SLIDES:
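
Slides 77, 79 and 80 note that InChI strings hash to InChIKeys and contrast a "full skeleton" search (104 hits) with a "full molecule" search (4 hits). The following is a minimal Python sketch of that distinction, not ChemSpider's own search code. It assumes the standard InChIKey layout of three hyphen-separated blocks (14, 10 and 1 letters), where the first block is derived from the connectivity ("skeleton") part of the InChI and the second from the remaining layers such as stereochemistry. The function names and the `EXAMPLE_KEYS` list are hypothetical and serve only as illustrative input.

```python
from collections import defaultdict

# Illustrative input only (an assumption, not data from the slides):
# three keys sharing one first block, plus one unrelated key.
EXAMPLE_KEYS = [
    "QNAYBMKLOCPYGJ-REOHZPFBSA-N",
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
]


def split_inchikey(key):
    """Split an InChIKey into (skeleton block, second block, final character).

    Raises ValueError if the key does not have the 14-10-1 letter layout.
    """
    parts = key.strip().upper().split("-")
    if [len(p) for p in parts] != [14, 10, 1] or not all(p.isalpha() for p in parts):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return parts[0], parts[1], parts[2]


def group_by_skeleton(keys):
    """Group keys by first block: a 'full skeleton' style match."""
    groups = defaultdict(list)
    for key in keys:
        skeleton, _, _ = split_inchikey(key)
        groups[skeleton].append(key.strip().upper())
    return dict(groups)


def full_matches(query, keys):
    """Return keys identical to the query: a 'full molecule' style match."""
    wanted = "-".join(split_inchikey(query))
    return [k for k in keys if "-".join(split_inchikey(k)) == wanted]


if __name__ == "__main__":
    for skeleton, members in group_by_skeleton(EXAMPLE_KEYS).items():
        print(skeleton, len(members))
    print(full_matches(EXAMPLE_KEYS[0], EXAMPLE_KEYS))
```

Matching on the first block alone returns every record with the same connectivity regardless of stereochemistry, which is why a skeleton search is expected to return at least as many hits as a full-key search.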