Chemistry Online: The Vision and Challenges Associated with Building the ChemSpider Resource for Chemists


Today ChemSpider is one of the community’s primary online resources for chemists. Now hosting over 28 million unique chemical compounds linked to over 400 data sources, ChemSpider offers its users a structure-centric platform that facilitates access to publications and patents, experimental and predicted property data, spectral data and many other forms of data and information that can benefit a chemist. ChemSpider is also a crowdsourcing platform: the community can contribute directly to the database by depositing and sharing structure data, properties, spectra and reaction syntheses. Crowdsourcing also allows existing data to be annotated and curated, so the community can assist in the much-needed curation and validation of chemistry data on the internet. This work is imperative in order to provide the chemistry underpinnings for semantic web projects such as Open PHACTS, from which Merck is sure to benefit when it is released to the community. This presentation will provide an overview of the ChemSpider platform and will also examine the challenges of dealing with heterogeneous data quality when attempting to provide a rich resource of data for the community. If you use the internet to research chemistry-based data, this presentation will be an essential guide to sourcing high-quality data.

Published in: Technology

  1. 1. Chemistry Online – The Vision and Challenges Associated With Building the ChemSpider Resource for Chemists. Antony Williams, Merck, October 2012
  2. 2. We Have… Too Much Data!!!
  3. 3. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? Competitors? IP?
  4. 4. The World of Online Chemistry • Property databases • Compound aggregators • Screening assay results • Scientific publications • Encyclopedic articles (Wikipedia) • Metabolic pathway databases • ADME/Tox data – eTOX for example • Blogs/Wikis and Open Notebook Science • Contributing Open Source code to projects
  5. 5. PubChem
  6. 6. ChEMBL
  7. 7. Collaborative Knowledge Management
  8. 8. Data on the Web
  9. 9. RSC’s ChemSpider
  10. 10. We Want to Answer Questions. Questions a chemist might ask… • What is the melting point of n-heptanol? • What is the chemical structure of Xanax? • Chemically, what is phenolphthalein? • What are the stereocenters of cholesterol? • Where can I find publications about xylene? • What are the different trade names for Ketoconazole? • What is the NMR spectrum of Aspirin? • What are the safety handling issues for Thymol Blue?
  11. 11. Available Information… Linked to vendors, safety data, toxicity, metabolism
  12. 12. Available Information…
  13. 13. Crowdsourced “Annotations”. Users can add: • Descriptions/Syntheses/Commentaries • Links to PubMed articles • Links to articles via DOIs • Spectral data • Crystallographic Information Files • Photos • MP3 files • Videos
  14. 14. ChemSpider : Spectra Linked
  15. 15. Spectra Linked
  16. 16. Spectra Linked
  17. 17. Chemistry Data Online is Messy • We have inherited errors • All public compound databases, including ours, have errors • “Incorrect” structures – assertions, timelines etc. • “Incorrect” names associated with structures • Properties • Links • Publications • ENORMOUS CHALLENGE
  18. 18. What could create change? Harvard Business Review (2010): “One change would make a substantial difference [to drug R&D]: the creation of agreed-upon standards for digitally representing drug assets.” Consider drug structures ONLY…
  19. 19. The Structure of Vitamin K?
  20. 20. MeSH A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
  21. 21. The Structure of Vitamin K1?
  22. 22. What is the Structure of Vitamin K1?
  23. 23. CAS’s Common Chemistry
  24. 24. Wikipedia
  25. 25. ChEBI – Manual Curation
  26. 26. “2-methyl-3-(3,7,11,15-tetramethylhexadec-2- enyl)naphthalene-1,4-dione” Variants of systematic names on PubChem 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl 2-methyl-3-(3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  27. 27. Chemistry on The Internet Is Messy
  28. 28. It’s Methane…
  29. 29. What’s Methane?
  30. 30. What’s Methane?
  31. 31. What ELSE is Methane???
  32. 32. NLM’s DailyMed
  33. 33. NLM’s DailyMed
  34. 34. NLM’s DailyMed
  35. 35. With Great Fanfare…
  36. 36. NPC Browser
  37. 37. NPC Browser
  38. 38. The EXPERTS must get it right?!
  39. 39. Wikipedia, C&E News, PubChem C&E News (from ACS)
  40. 40. People Use Trusted Resources…
  41. 41. Earlier this month…
  42. 42. Stop Whining – Fix it
  43. 43. Crowdsourced Curation: identify/tag errors, edit names and synonyms, identify records to deprecate
  44. 44. Search “Vitamin H”
  45. 45. “Curate” Identifiers
  46. 46. “Curate” Identifiers
  47. 47. “Curate” Identifiers
  48. 48. What is the outcome of this? IF we can get the community to help clean up the internet of chemistry then we have: • High quality online reference resources • Freely available reference data • Ongoing iterative curation – how many chemical structures are “reworked”? • And what is the value of “curated chemical dictionaries”?
  49. 49. Successful Semantic Markup Depends on Dictionaries
  50. 50. Dictionaries Enhance Publications
  51. 51. I want to know about “Vincristine”
  52. 52. Vincristine Identifiers
  53. 53. Vincristine: Patents Linked by Name
  54. 54. Vincristine: Articles Linked by Name
  55. 55. What are the names for this compound just in patents?
  56. 56. A disambiguation NIGHTMARE!
  57. 57. Ambiguity in Identifiers
  58. 58. Crowdsourcing Works • >130 people have deposited data and participated in data curation • Different level curators check each other • More curators and depositors encouraged! • 28 million chemicals is a long list…
  59. 59. ChemSpider for Analytical Sciences. ChemSpider is being developed with the intention of: • Being the world’s richest resource of freely accessible curated analytical data • Serving as a platform for structure verification and dereplication • Providing access to supporting prediction algorithms
  60. 60. Spectral Uploading: Locate the structure of interest and deposit spectrum. Supported formats: JCAMP, PDF
  61. 61. Spectral Uploading: Various types of NMR spectra supported
  62. 62. Regular Updates
  63. 63. Multiple Spectra for One Structure
  64. 64. ChemSpider ID 24528095 H1 NMR
  65. 65. ChemSpider ID 24528095 C13 NMR
  66. 66. ChemSpider ID 24528095 HHCOSY
  67. 67. ChemSpider ID 24528095 HSQC
  68. 68. ChemSpider ID 24528095 HMBC
  69. 69. Full C13 assignment uploaded
  70. 70. Available Spectra
  71. 71. How do these data get curated? • Every spectrum can be commented on • Incorrect spectra have been annotated and curated by users… • But curation through gaming is also possible…
  72. 72. Web Services
  73. 73. www.SpectralGame.com
  74. 74. Spectral Game
  75. 75. Increasing Complexity
  76. 76. Spectral Game
  77. 77. Reversed Spectrum
  78. 78. True Curation of Data
  79. 79. SpectralGame in the hand
  80. 80. In progress… Storage and display of ASSIGNED spectra
  81. 81. Mass Spec Analysis
  82. 82. ChemSpider Interface
  83. 83. Tinuvin 328
  84. 84. Position sorted by references
  85. 85. Position 1 only
  86. 86. Web Services
  87. 87. Web Services Open Up Collaboration • Agilent, Bruker, Waters and Thermo all use our web-based services for compound lookup • Many academic sites are integrating directly – metabonomics, name lookup, semantic markup
  88. 88. Where do data come from? • ChemSpider users deposit data • Some contributions from NIST • Chemical vendors are starting to provide data. Synthonix is one of our major contributors
  89. 89. Commercial Database Access. Recently deposited to ChemSpider: • EPA/NIST IR Database, >5000 spectra. Presently under development: • NIST MS database, >200,000 MS spectra
  90. 90. Where next with Analytical Support? PharmaSea project for the identification of natural products – dereplication approaches • Use mass spectrometry searches of natural product slices to identify • Pre-fragment compounds and develop searches • Dereplication using NMR data • NMR features • Predicted spectra and “Verification approaches”
  91. 91. NMRShiftDB:
  92. 92. NMR Prediction
  93. 93. NMRShiftDB Data Review • High quality NMR shift set of ca. 100,000 shifts • Derived prediction algorithms give very similar performance statistics to commercial algorithms
  94. 94. Crowdsourcing Chemical Synthesis: How much data generated in a lab, that COULD go public, is lost forever?
  95. 95. Crowdsourcing Chemical Synthesis: How much data generated in a lab, that COULD go public, is lost forever? Public Domain reference databases of value? • Properties • Spectra • CIFs • Images • Syntheses
  96. 96. An Adventure into the World of Small but Significant Contributions…
  97. 97. ChemSpider SyntheticPages
  98. 98. Micropublishing with Peer Review (a chemical synthesis blog?)
  99. 99. Multi-Step Synthesis
  100. 100. Interactive Data
  101. 101. MOBILE Structure Database Lookup
  102. 102. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? Competitors? IP?
  103. 103. Open PHACTS Project • Develop a set of robust standards… • Implement the standards in a semantic integration hub • Deliver services to support drug discovery programs in pharma and public domain • 22 partners, 8 pharmaceutical companies, 3 biotechs • 36-month project, goes live next month • Guiding principle is open access, open usage, open source: key to standards adoption
  104. 104. The Future: Internet Data • Commercial Software • Pre-competitive Data • Open Science • Open Data • Publishers • Educators • Open Databases • Chemical Vendors • Small organic molecules • Undefined materials • Organometallics • Nanomaterials • Polymers • Minerals • Particle bound • Links to Biologicals
  105. 105. The Future of Chemistry on the Web? • Public compound databases federate & build a linked environment of validated data! • Data validation needs are not ignored • Publishers layer on information to make publications discoverable • Public-Private databases can be linked • Open Data proliferate • The “Semantic Web” in action
  106. 106. Can Merck Contribute to this Project? Do you have any data that you can release into the public domain? • Measured property data • How many “common” spectra are thrown away? • How many syntheses are published and locked behind paywalls? Can your scientists contribute annotations and curations if they use ChemSpider? Is the challenge of Legal Clearance too big?
  107. 107. Thank you • Email: williamsa@rsc.org • Twitter: ChemConnector • Blog: www.chemconnector.com • SLIDES:
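
The name-variant problem on slide 26 can be illustrated with a short sketch. It takes the name fragments exactly as they appear (truncated) on that slide, strips the stereodescriptor block from each, and groups the results by the remaining skeleton name. The regular expression and the function names `strip_stereo` and `group_by_skeleton` are assumptions made for this illustration; they are not part of ChemSpider or PubChem, and a real curation pipeline would compare structures rather than strings.

```python
import re

# Name fragments exactly as listed (truncated) on slide 26.
VARIANTS = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Hypothetical pattern: a parenthesised, comma-separated run of
# stereodescriptors such as "(E)" or "(E,7R,11R)", optionally preceded
# by "[" and followed by "-". It only targets the fragments above and
# does not rebalance a closing "]" in a full name.
STEREO = re.compile(r"\[?\((?:\d*[EZRS])(?:,\d*[EZRS])*\)-")


def strip_stereo(name: str) -> str:
    """Replace the stereodescriptor block with a plain opening parenthesis."""
    return STEREO.sub("(", name)


def group_by_skeleton(names):
    """Map each stereo-free skeleton name to the set of spellings seen."""
    groups = {}
    for name in names:
        groups.setdefault(strip_stereo(name), set()).add(name)
    return groups


groups = group_by_skeleton(VARIANTS)
for skeleton, members in groups.items():
    print(f"{skeleton}: {len(members)} distinct spellings")
```

String normalization of this kind only shows that the variants share one skeleton; it cannot say which stereoisomer is the correct structure for Vitamin K1, which is why the slides argue for curated dictionaries.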