AZ of Chemspider February 2011


Published in: Education, Technology


  1. 1. ChemSpider – The Vision and Challenges Associated with Building a Free Online Community Resource for Chemists Antony Williams AZ, February 2011
  2. 2. What’s the Status of Chemistry online?
       • Encyclopedic articles (Wikipedia)
       • Chemical vendor databases
       • Metabolic pathway databases
       • Virtual Screening databases
       • Property databases
       • Screening assay results
       • Patents with chemical structures
       • ADME/Tox data
       • Scientific publications
       • Compound aggregators
       • Blogs/Wikis and Open Notebook Science
  3. 3. For Synthesis…
  4. 4. Org Prep Daily (Blog)
  5. 5. Molbank (Open Access Journal)
  6. 6. Lots of “Public Compound” Databases
       • PubChem
       • DrugBank
       • ChEBI/ChEMBL
       • KEGG
       • LIPID MAPS
       • ChemIDPlus
       • eMolecules
       • ZINC
       • Lots of chemical vendors
       • ChemSpider
  7. 7. Where Would You look? What Do You Trust?
  8. 8. Linked Data on the Web
  9. 9. What is a compound? “ARTAs”
  10. 10. Vision: Connect Chemistry on the Web
       • The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
       • Chemistry articles are indexed and searchable by a free online service
       • The web is linked together through the “language of chemistry”
       • Publicly funded research data is linked
  11. 11. We Have Delivered the Vision
       • “Build a Structure Centric Community to Serve Chemists”
       • Integrate chemical structure data on the web
       • Create a “structure-based hub” to information, data and algorithmic predictions
       • Let chemists contribute their own data
       • Allow the community to curate/correct data
  12. 12. How Was ChemSpider Built?
       • ChemSpider was a “hobby project”
       • Housed in a basement and running off three servers – one bought, two built
       • Sensitive to weather and power stability
       • Went live at ACS Spring 2007 in Chicago
  13. 13. How Did We Build It?
       • We deal in Molfiles or SDF files
       • We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
       • We have our own “business logic” to standardize
       • Link out to external sites where possible using IDs
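
The rudimentary filtering named on this slide (valence checking and charge imbalance) can be sketched roughly as follows. This is only an illustration under stated assumptions: the tuple-based data model, the small valence table and the function names `valence_problems` and `net_charge` are invented for this sketch and do not come from ChemSpider.

```python
# Illustrative pre-deposition checks. The data model, the valence table and
# the function names are assumptions for this sketch, not ChemSpider code.

# Usual valences of a few neutral elements (deliberately incomplete).
ALLOWED_VALENCES = {"H": {1}, "C": {4}, "N": {3}, "O": {2}, "F": {1}, "Cl": {1}}


def valence_problems(atoms, bonds):
    """Return indices of neutral atoms whose valence is not an allowed value.

    atoms: list of (element, formal_charge, attached_hydrogens)
    bonds: list of (atom_index_a, atom_index_b, bond_order)
    """
    totals = [hydrogens for _, _, hydrogens in atoms]
    for a, b, order in bonds:
        totals[a] += order
        totals[b] += order
    problems = []
    for index, (element, charge, _) in enumerate(atoms):
        allowed = ALLOWED_VALENCES.get(element)
        if allowed is None or charge != 0:
            continue  # unknown element or charged atom: outside this sketch
        if totals[index] not in allowed:
            problems.append(index)
    return problems


def net_charge(atoms):
    """Sum of formal charges; a non-zero value flags a charge imbalance."""
    return sum(charge for _, charge, _ in atoms)


# Ethanol, CH3-CH2-OH, with hydrogens folded into the heavy atoms.
ethanol_atoms = [("C", 0, 3), ("C", 0, 2), ("O", 0, 1)]
ethanol_bonds = [(0, 1, 1), (1, 2, 1)]

# Same skeleton with one hydrogen too many on the first carbon.
bad_atoms = [("C", 0, 4), ("C", 0, 2), ("O", 0, 1)]

# Acetate drawn without its counterion: CH3-C(=O)-O(-)
acetate_atoms = [("C", 0, 3), ("C", 0, 0), ("O", 0, 0), ("O", -1, 0)]
acetate_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

print(valence_problems(ethanol_atoms, ethanol_bonds))  # no problems expected
print(valence_problems(bad_atoms, ethanol_bonds))      # flags the first carbon
print(net_charge(acetate_atoms))                       # non-zero: imbalance
```

A real pipeline would read these atoms and bonds from the Molfile or SDF records and would need charge-aware valence rules; the sketch skips charged atoms rather than guess at them.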
  14. 14.
  15. 15. We Want to Answer Questions
       • Questions a chemist might ask…
           • What is the melting point of n-heptanol?
           • What is the chemical structure of Xanax?
           • Chemically, what is phenolphthalein?
           • What are the stereocenters of cholesterol?
           • Where can I find publications about xylene?
           • What are the different trade names for Ketoconazole?
           • What is the NMR spectrum of Aspirin?
           • What are the safety handling issues for Thymol Blue?
  16. 16. Search for a Chemical…by name
  17. 17. Link off a structure in ChemSpider
       • Chemical suppliers
       • Other publications
       • Analytical Data
       • Related Reactions
       • Wikipedia
       • Patents
       • “Everything”
  18. 18. Available Information…
       • Linked to vendors, safety data, toxicity, metabolism
  19. 19. Available Information….
  20. 20. Clickthrough to Patent (SureChem)
  21. 21. Crowdsourced “Annotations”
       • Registered users can add:
           • Descriptions/Syntheses/Commentaries
           • Links to PubMed articles
           • Links to articles via DOIs
           • Spectral data
           • Crystallographic Information Files
           • Photos
           • MP3 files
           • Videos
  22. 23. Spectra Linked
  23. 24. Spectra Linked
  24. 25. Search for a chemical…by structure Substructure search coming…
  25. 26. Inherited Errors
       • Inherited errors from every database… all public compound databases, including ours, have errors
       • “Incorrect” structures – assertions, timelines etc.
       • “Incorrect” names associated with structures
       • ENORMOUS CHALLENGE
  26. 27. What is the Structure of Vitamin K?
  27. 28. MeSH
       • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
  28. 29. What is the Structure of Vitamin K1?
  29. 30. What is the Structure of Vitamin K1?
  30. 31. Vitamin K1
  31. 33.
       • “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
       • Variants of systematic names on PubChem
           • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
           • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
           • 2-methyl-3-(3,7,11,15-tetramethyl
           • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  32. 34. Question Everything online:
  33. 35. It’s all on Wikipedia…
  34. 36. Chemistry on The Internet Is Messy
  35. 37. It’s Methane…
  36. 38. What’s Methane?
  37. 39. What’s Methane?
  38. 40. What ELSE is Methane???
  39. 43. NLM’s DailyMed
  40. 44. NLM’s DailyMed
  41. 45. NLM’s DailyMed
  42. 46. Public Domain Chemistry Databases
       • Our databases are a mess…
       • Non-curated databases are proliferating errors
       • We source and deposit data between databases
       • Original sources of errors hard to determine
       • Curation is time-consuming, challenging and exacting
       • An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs
  43. 47.
       Drug Name    | Generic Name         | ChEBI   | ChemSpider | CAS Common Chemistry | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia
       Spiriva      | Tiotropium Bromide   | No Hits |            | No Hits              |            |          |          | 4/0     |
       Depakote     | Valproate semisodium |         |            |                      |            |          |          |         | No Structure
       Basen        | Voglibose            |         |            | No Hits              |            | No Hits  |          | 2/1     |
       Symbicort 1) | Budesonide           |         |            |                      |            |          |          | 8/1     |
       Symbicort 2) | Formoterol           | WRONG   |            | No Hits              |            |          |          | 6/1     |
       Vytorin 1)   | Ezetimibe            |         |            | No Hits              |            |          |          |         |
       Vytorin 2)   | Simvastatin          |         |            |                      |            |          |          | 2/1     |
       Taxol        | Paclitaxel           |         |            |                      |            |          |          | 44/1    |
       Thalomid     | Thalidomide          | No Hits |            |                      |            |          |          |         |
       Zocor        | Simvastatin          |         |            |                      |            |          |          | 2/1     |
       Crestor      | Rosuvastatin         |         |            | No Hits              |            |          |          | 2/1     |
  44. 48. Symbicort: Budesonide + Formoterol
  45. 49. Symbicort: Budesonide + Formoterol ChemIDPlus Wikipedia
  46. 50. DrugBank: Search Symbicort…
  47. 51. Symbicort: Budesonide + Formoterol
       • PubChem
           • 8 structures called Budesonide. 1 “correct”
           • 6 structures called Formoterol. 1 “correct”
           • Search on “Symbicort” gives 1 structure.
  48. 52. Taxol: Paclitaxel 44 structures
  49. 53. Taxol: Paclitaxel Bioassay Data
  50. 54. Taxol: Paclitaxel Bioassay Data
       • Most Bioassay data associated with structure with one ambiguous stereocenter
  51. 55. Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
  52. 60. The Final Search Strategy
  53. 61. All Those Names, One Structure
  54. 62. Searching Chemistry on the Internet
       • How complete a result set will we get if we search for “chemicals” by name?
       • Is there a better way to link chemistry databases? Linking by “names” is dangerous
       • Chemists want structure and SUBstructure searching
  55. 63. The InChI Identifier
  56. 64. Multiple Layers
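
The layered structure of an InChI string can be illustrated with a short sketch. It assumes the common case in which the first segment after `InChI=` is the version and the second is the molecular formula, with every later segment carrying a one-letter prefix (for example `c` for connectivity and `h` for hydrogens). The function name `split_inchi` is hypothetical; the ethanol string is a standard InChI.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers.

    Assumes the second segment is the molecular formula, which holds for
    typical organic compounds but not for every valid InChI.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len(prefix):].split("/")
    if len(parts) < 2:
        raise ValueError("InChI has no layers after the version")
    version, formula = parts[0], parts[1]
    # Keep layers as an ordered list: a prefix letter may occur more than once.
    layers = [(segment[0], segment[1:]) for segment in parts[2:] if segment]
    return version, formula, layers


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
version, formula, layers = split_inchi(ETHANOL)
print(version)  # 1S
print(formula)  # C2H6O
print(layers)   # [('c', '1-2-3'), ('h', '3H,2H2,1H3')]
```

Because each layer adds detail (connectivity first, then hydrogens, and in other molecules charge, stereochemistry and isotopes), two records can be compared at whatever level of detail a search needs.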
  57. 65. InChIStrings Hash to InChIKeys
  58. 66. Oleoylethanolamine
  59. 67. InChIs have traction…
  60. 68. Vancomycin
  61. 69. Vancomycin
  62. 70. Vancomycin Search Molecular SKELETON Search Full Molecule
  63. 71. Full Molecule Search: 4 Hits
  64. 72. Full Skeleton Search: 104 Hits
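
One way to picture the difference between the full-molecule and skeleton searches is through InChIKeys, whose first 14-character block is a hash of the connectivity while the second block covers further layers such as stereochemistry. The sketch below is an assumption about how such a lookup could work, not a description of ChemSpider's search; the keys are placeholders built from repeated letters, and the function names are hypothetical.

```python
def skeleton_block(inchikey):
    """Return the first block of an InChIKey (hash of the connectivity)."""
    return inchikey.split("-")[0]


def full_matches(query, keys):
    """Keys identical to the query: same skeleton and same further layers."""
    return [key for key in keys if key == query]


def skeleton_matches(query, keys):
    """Keys sharing only the first block: same skeleton, any stereo variant."""
    wanted = skeleton_block(query)
    return [key for key in keys if skeleton_block(key) == wanted]


def placeholder_key(first_letter, second_letter):
    """Build a made-up key with the 14-10-1 block layout of an InChIKey."""
    return first_letter * 14 + "-" + second_letter * 8 + "SA-N"


# Placeholder records: three variants of one skeleton plus an unrelated one.
database = [
    placeholder_key("A", "B"),
    placeholder_key("A", "C"),
    placeholder_key("A", "D"),
    placeholder_key("Z", "B"),
]
query = placeholder_key("A", "B")

print(len(full_matches(query, database)))      # 1
print(len(skeleton_matches(query, database)))  # 3
```

The gap between the two counts is the curation problem in miniature: every extra skeleton hit is a record that differs from the query only in layers such as stereochemistry and may or may not be a correct depiction of the same compound.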
  65. 73. Vancomycin
       • Who will curate?
       • How would you clean such a large dataset?
  66. 74. Vancomycin on ChemSpider
  67. 81. Name Searching is “Easier”
  68. 82. Name Searching is “Easier”
  69. 85. Content is King and Quality Costs
       • Curated Chemistry “content” is expensive to create
           • Patent searching
           • Structures and properties
           • Drug databases
           • Literature databases
       • Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information
           • 104 years of content
           • >50 million substances
           • Proprietary platform
  70. 86. The EXPERTS must get it right?!
  71. 87. Wikipedia, C&E News, PubChem
       • C&E News (from ACS)
  72. 88. Feedback from Steve Ritter
       • “Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
       • “It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
  73. 90. Search OEA
  74. 91. Search OEA
  75. 92. Search OEA
  76. 93. Semantic Mark-up for Chemistry
       • Semantic mark-up for chemistry is here
           • RSC Project Prospect (structure linking, IUPAC Gold Book ontology and other ontologies)
           • Nature Publishing Group compound linking
  77. 94. Nature Chemistry Compound Pages
  78. 95. Project Prospect
  79. 96. Entity-Extraction, Mark-up, Annotate
  80. 97. Entity-Extraction, Mark-up, Annotate
  81. 98. And linked to STITCH…
  82. 99. Success Depends on Dictionaries
  83. 100. Online Curation
       • Online databases generally do NOT allow curation or annotation
       • If you find errors they stay there!
       • ChemSpider allows immediate curation
  84. 101. Search “Vitamin H”
  85. 102. “Curate” Identifiers
  86. 103. “Curate” Identifiers
  87. 104. “Curate” Identifiers
  88. 105. Crowd-sourcing Chemistry Curation
  89. 106. Crowdsourcing Works
       • >130 people have deposited data and participated in data curation
       • Different level curators check each other
       • Wikipedia is the modern primary example
  90. 107. ChemSpider and Publishing
       • The curation efforts on ChemSpider led to a set of validated dictionaries
       • Integrate best-in-class entity extraction with validated name dictionaries
       • Already text-mined the RSC archive and presently linking!
  91. 108. Crowdsourcing Synthesis ChemSpider SyntheticPages
  92. 109. Crowdsourcing Synthesis ChemSpider SyntheticPages
  93. 110. ChemSpider Everywhere: What do computers want?
       • Web services
  94. 111. Web Services
  95. 112. ChemSpider Everywhere
       • Linked from Wikipedia and many Public Databases
       • Linked from Open Notebook Science sites
       • Linked from Blogs using Structure/Spectra EMBED
       • Integrated into structure drawing packages
       • Integrated with software offerings from Thermo, Waters, Agilent, Bruker
  96. 113. ChemSpider Everywhere: Embed
  97. 114. ChemSpider Everywhere: Spectral Game
  98. 115. ChemSpider Everywhere Crowdsourced Curation of Spectra
  99. 116. ChemSpider Everywhere: ChemMobi
  100. 117. Structure Database Lookup
  101. 118. Structure Database Lookup
  102. 119. Reaction Database Look-up
  103. 120. Reaction Database Look-up
  104. 121. There will always be gaps...
       • What ChemSpider does not deal with, yet...
           • Materials
           • Minerals
           • Polymers
           • Biological macromolecules
  105. 122. Collaborative Data Curation
       • How can we COLLECTIVELY clean online data?
       • Developing ways to share curation actions back to original data sources
       • A “bigger is better” mindset is problematic. How many “real chemicals” are in the public databases?
  106. 123. Future Work
       • Continue curation work
       • Extend search capabilities
       • Expand existing databases
       • Text-mine RSC archive and link chemistry
       • Project: pre-competitive data sharing and linking for Life Sciences
       • Integrate with metabolic pathway tools
  107. 124. The Future of Chemistry on the Web?
       • Public compound databases federate & build a linked environment of validated data!
       • Data validation needs are not ignored
       • Publishers layer on information to make publications discoverable
       • Public-Private databases can be linked
       • Open Data proliferate
       • The “Semantic Web” in action
  108. 125. It’s a long road ahead…
  109. 126. Thank you Email: Twitter: ChemConnector Personal Blog: SLIDES: