
Improving online chemistry one structure at a time



RSC|ChemSpider is one of the world’s largest online resources for chemistry-related data and services. Developed to deliver access to structure-based chemistry data via the internet, the ChemSpider platform hosts over 26 million unique chemical compounds aggregated from over 400 data sources and provides an environment for the community to annotate and curate these existing data as well as deposit new data to the system. The search system delivers flexible querying capabilities together with links to external sites for publication and patent data. ChemSpider has spawned a number of projects, including ChemSpider SyntheticPages for hosting openly peer-reviewed chemical synthesis articles. This presentation will review the present capabilities of the ChemSpider system, providing direct examples of how to use the system to source high-quality data of value to pharmaceutical companies. We will discuss some of the challenges associated with validating data quality, examine how ChemSpider is a part of the semantic web for chemistry, and investigate approaches to integrating ChemSpider with analytical instrumentation.

Published in: Technology, Education


  1. 1. Improving Online Chemistry – One Structure at a Time. Antony Williams. AstraZeneca, February 10th, 2012
  2. 2. We Have …Too Much Data!!!
  3. 3. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  4. 4. Pharma Information… and the web: Literature, Patents, News, Pipeline, SAR, CSRs, Safety, In vivo, etc.
  5. 5. The World of Online Chemistry
     - Property databases
     - Compound aggregators
     - Screening assay results
     - Scientific publications
     - Encyclopedic articles (Wikipedia)
     - Metabolic pathway databases
     - ADME/Tox data – eTOX for example
     - Blogs/Wikis and Open Notebook Science
     - Contributing Open Source code to projects
  6. 6. PubChem
  7. 7. ChEMBL
  8. 8. Collaborative Knowledge Management
  9. 9. e-Science and Primary Data
     - How much data generated in a lab, that COULD go public, is lost forever?
  10. 10. e-Science and Primary Data
     - How much data generated in a lab, that COULD go public, is lost forever?
     - Public Domain reference databases of value?
       - Syntheses
       - Properties
       - Spectra
       - CIFs
       - Images
  11. 11. e-Science and Primary Data
     - How much data generated in a lab, that COULD go public, is lost forever?
     - Public Domain reference databases of value?
       - Syntheses
       - Properties
       - Spectra
       - CIFs
       - Images
     - Much of chemistry is chemical structure-based – where and how could we host these data?
  12. 12. RSC’s ChemSpider
  13. 13. We Want to Answer Questions
     - Questions a chemist might ask…
       - What is the melting point of n-heptanol?
       - What is the chemical structure of Xanax?
       - Chemically, what is phenolphthalein?
       - What are the stereocenters of cholesterol?
       - Where can I find publications about xylene?
       - What are the different trade names for Ketoconazole?
       - What is the NMR spectrum of Aspirin?
       - What are the safety handling issues for Thymol Blue?
  14. 14. Available Information…
     - Linked to vendors, safety data, toxicity, metabolism
  15. 15. Available Information….
  16. 16. Crowdsourced “Annotations”
     - Users can add
       - Descriptions/Syntheses/Commentaries
       - Links to PubMed articles
       - Links to articles via DOIs
       - Add spectral data
       - Add Crystallographic Information Files
       - Add photos
       - Add MP3 files
       - Add Videos
  17. 18. Spectra
  18. 19. Data on the Web
  19. 21. Chemistry Data online is messy
     - We have inherited errors
     - All public compound databases, including ours, have errors
     - “Incorrect” structures – assertions, timelines etc
     - “Incorrect” names associated with structures
     - Properties
     - Links
     - Publications
     - ENORMOUS CHALLENGE
  20. 22. What could create change?
     - Harvard Business Review (2010)
     - “One change would make a substantial difference [to drug R&D]: the creation of agreed-upon standards for digitally representing drug assets.”
     - Consider drug structures ONLY…
  21. 23. The Structure of Vitamin K?
  22. 24. MeSH
     - A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
  23. 25. The Structure of Vitamin K1?
  24. 26. What is the Structure of Vitamin K1?
  25. 27. CAS’s Common Chemistry
  26. 28. Wikipedia
  27. 31. ChEBI – Manual Curation
  28. 35.
     - “2-methyl-3-(3,7,11,15-tetramethyl hexadec-2-enyl)naphthalene-1,4-dione”
     - Variants of systematic names on PubChem
     - 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
     - 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
     - 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
     - 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
     - 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
     - 2-methyl-3-[(E)-3,7,11,15-tetramethyl
     - 2-methyl-3-(3,7,11,15-tetramethyl
     - 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  29. 36. What’s Methane?
  30. 37. What’s Methane?
  31. 38. What ELSE is Methane???
  32. 40. NLM’s DailyMed
  33. 41. NLM’s DailyMed
  34. 42. NLM’s DailyMed
  35. 43. With Great Fanfare…
  36. 44. NPC Browser
  37. 45. NPC Browser
  38. 47. Openness and Quality Issues. Williams and Ekins, DDT, 16: 747-750 (2011); Science Translational Medicine 2011
  39. 48. Content is King and Quality Costs
     - Curated chemistry “content” is expensive to create
       - Patent searching
       - Structures and properties
       - Drug databases
       - Literature databases
     - Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information
       - 104 years of content
       - >50 million substances
       - Proprietary platform
  40. 49. The EXPERTS must get it right?!
  41. 50. Wikipedia, C&E News, PubChem
     - C&E News (from ACS)
  42. 51. Feedback from Steve Ritter
     - “Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
     - “It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
  43. 52. Public Domain Databases
     - Our databases are a mess…
     - Non-curated databases are proliferating errors
     - We source and deposit data between databases
     - Original sources of errors hard to determine
     - Curation is time-consuming and challenging
  44. 53. Stop Whining – Fix it
  45. 54. Crowdsourced Curation
     - Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
  46. 55. Search “Vitamin H”
  47. 56. “Curate” Identifiers
  48. 57. “Curate” Identifiers
  49. 58. “Curate” Identifiers
  50. 59. Standards: Structure Standardization
  51. 60. Standards: Structure Standardization
  52. 61. Standards: Structure Standardization
  53. 62. What needs to happen?
     - Standards
       - Standardization of structures
         - ChEBI/PubChem sharing
         - InChI adoption
  54. 63. The InChI Identifier
  55. 64. Multiple Layers
  56. 65. InChIStrings Hash to InChIKeys
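As a rough illustration of the slide title: a standard InChIKey is a fixed-length, hashed form of an InChI string. It has a 14-letter block derived from the connectivity (the molecular skeleton), a second block covering the remaining layers plus a standard/non-standard flag and a version letter, and a final protonation letter. The sketch below only splits a key into those parts; it does not compute the hash, and the helper name `split_inchikey` is hypothetical. The example key is the standard InChIKey for aspirin.

```python
import re

# Layout of an InChIKey: 14 letters, then 8 letters + flag + version, then 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its blocks (hypothetical helper, no hashing done)."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, other_layers, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,          # hash of the connectivity layers
        "other_layers": other_layers,  # hash of stereo, isotope and other layers
        "is_standard": flag == "S",    # "S" = standard InChI, "N" = non-standard
        "version": version,            # "A" denotes InChI version 1
        "protonation": protonation,    # "N" = neutral
    }


aspirin = split_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
print(aspirin["skeleton"], aspirin["is_standard"])
```

Because the key is a hash, it can be used for lookup and web search but cannot be converted back into a structure without a database that stores the original InChI.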
  57. 66. Vancomycin – Search the Internet
  58. 67. Vancomycin: Search Molecular SKELETON, Search Full Molecule
  59. 68. Full Skeleton Search: 104 Hits
  60. 69. Full Molecule Search: 4 Hits
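Assuming the skeleton search matches only the first (connectivity) block of the InChIKey while the full-molecule search matches the entire key, the difference between the two hit counts can be sketched as follows. The keys are made-up placeholders, not real compounds, and both function names are hypothetical.

```python
def skeleton_hits(query_key, indexed_keys):
    """Return keys sharing the query's first block (same skeleton, any stereo)."""
    prefix = query_key.split("-")[0]
    return [key for key in indexed_keys if key.split("-")[0] == prefix]


def full_molecule_hits(query_key, indexed_keys):
    """Return only keys identical to the query (same skeleton and same stereo)."""
    return [key for key in indexed_keys if key == query_key]


# Placeholder keys for illustration only; they do not describe real molecules.
INDEXED_KEYS = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",  # query molecule
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",  # same skeleton, different stereo layer
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # same skeleton, no stereo defined
    "DDDDDDDDDDDDDD-UHFFFAOYSA-N",  # unrelated skeleton
]

query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
print(len(skeleton_hits(query, INDEXED_KEYS)), len(full_molecule_hits(query, INDEXED_KEYS)))
```

The skeleton search is the broader of the two, which is consistent with the slides reporting far more skeleton hits than full-molecule hits.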
  61. 70. Crowdsourcing Works
     - >130 people have deposited data and participated in data curation
     - Different level curators check each other
     - More curators and depositors are encouraged!
  62. 71. What needs to happen?
     - Standards
       - Standardization of structures
         - ChEBI/PubChem sharing
         - InChI adoption
     - Collaboration
       - Stop reinventing the wheel
       - Share data, share efforts and speed the process
  63. 72. Antony Williams vs Identifiers: Passport ID; Dad, Tony, others; SSN; Green Card; License; 5 email addresses; ChemSpiderman (blog, Twitter account, Facebook, Friendfeed); OpenID; …
  64. 73. Aspirin names and synonyms
     - Text searches depend on correct association
     - 335 suggested identifiers for Aspirin just on PubChem!
     - Disambiguation dictionaries are necessary, not just for authors!
  65. 75. The Final Search Strategy
  66. 76. All Those Names, One Structure
  67. 77. Curated Dictionaries Matter
  68. 78. Success Depends on Dictionaries
  69. 79. Validated Name-Structure Dictionaries
     - Chemical name dictionaries are used for:
       - Text-mining (publications, patents)
         - Used to index PubMed and link to Google Patents
       - Linking to other databases – think Biology!
         - When structures are not available, drug names provide the link
       - Searching the web
         - Names link to structures, which link to InChIs
  70. 80. I want to know about “Vincristine”. If all algorithms work, then everything on the page is correct by default except the name-structure relationship!
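A minimal sketch of what a name-structure dictionary does, assuming a curated table that maps each structure identifier to its validated names. The table contents, the `normalize` rule and the `lookup` function are illustrative assumptions, not ChemSpider's implementation; the three names listed are well-known synonyms for aspirin, keyed by its standard InChIKey.

```python
# Curated table: one structure identifier (InChIKey) to many validated names.
CURATED_NAMES = {
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N": [
        "Aspirin",
        "Acetylsalicylic acid",
        "2-Acetoxybenzoic acid",
    ],
}


def normalize(name):
    """Lower-case and collapse whitespace so trivial variants map to one entry."""
    return " ".join(name.lower().split())


# Inverted index used for text searches: normalized name to structure identifier.
NAME_TO_INCHIKEY = {
    normalize(name): inchikey
    for inchikey, names in CURATED_NAMES.items()
    for name in names
}


def lookup(name):
    """Return the InChIKey for a name, or None if the name is not in the dictionary."""
    return NAME_TO_INCHIKEY.get(normalize(name))


print(lookup("  acetylsalicylic   ACID "))
print(lookup("vitamin K"))
```

Every downstream link (patents, articles, other databases) resolved through such a table is only as good as the name-to-structure assignments in it, which is why the curation step matters.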
  71. 81. Vincristine: Identifiers and Properties
  72. 82. Vincristine: Patents Linked by Name
  73. 83. Vincristine: Articles Linked by Name
  74. 84. ChemSpider Everywhere: What do computers want?
     - Web services
  75. 85. Mass Spec Analysis
  76. 86. ChemSpider Interface
  77. 88. Tinuvin 328
  78. 89. Position sorted by references
  79. 90. Position 1 only
  80. 91. Web Services
  81. 92. Web Services Open Up Collaboration
     - Agilent, Bruker, Waters and Thermo all use our web-based services for compound lookup
     - Many academic sites integrating directly – metabonomics, name lookup, semantic markup
     - Mobile app integration
     - Commercial structure drawing packages
  82. 93. ChemSpider Everywhere: Embed
  83. 94. ChemSpider Everywhere: Spectral Game
  84. 95. ChemSpider Everywhere: Crowdsourced Curation of Spectra
  85. 96. ChemSpider Everywhere: ChemMobi
  86. 97. Structure Database Lookup
  87. 98. ChemSpider Resources for Chemistry
  88. 99. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  89. 100. Open PHACTS Project
     - Develop a set of robust standards…
     - Implement the standards in a semantic integration hub
     - Deliver services to support drug discovery programs in pharma and public domain
     - 22 partners, 8 pharmaceutical companies, 3 biotechs
     - 36-month project
     - Guiding principle is open access, open usage, open source (key to standards adoption)
  90. 102. The Future: Commercial Software, Pre-competitive Data, Open Science, Open Data, Publishers, Educators, Open Databases, Chemical Vendors, Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals, Internet Data
  91. 103. The Future of Chemistry on the Web?
     - Public compound databases federate & build a linked environment of validated data!
     - Data validation needs are not ignored
     - Publishers layer on information to make publications discoverable
     - Public-Private databases can be linked
     - Open Data proliferate
     - The “Semantic Web” in action
  92. 104. Acknowledgments
     - The ChemSpider team
     - Our data providers, depositors, collaborators and curators
     - Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
  93. 105. Thank you Email: Twitter: ChemConnector Blog: Personal Blog: SLIDES: