Chemistry Online: The Vision and Challenges Associated With Building the ChemSpider Resource for Chemists


Today ChemSpider is one of the community’s primary online resources for chemists. Now hosting over 28 million unique chemical compounds linked to over 400 data sources, ChemSpider offers its users a structure-centric platform facilitating access to publications and patents, experimental and predicted property data, spectral data and many other forms of data and information that can benefit a chemist. ChemSpider is a crowdsourcing platform that lets the community contribute data directly to the database by depositing and sharing structure data, properties, spectra and reaction syntheses. Crowdsourcing also allows the annotation and curation of existing data, so the community can assist in the much-needed curation and validation of chemistry data on the internet. This work is imperative in order to provide the chemistry underpinnings to semantic web projects such as Open PHACTS, from which Merck is sure to benefit when it is released to the community. This presentation will provide an overview of the ChemSpider platform and will also examine the challenges of dealing with heterogeneous data quality when attempting to provide a rich resource of data for the community. If you use the internet to research chemistry-based data, this presentation will be an essential guide to sourcing high-quality data.



  • 1. Chemistry Online – The Vision and Challenges Associated With Building the ChemSpider Resource for Chemists. Antony Williams. Merck, October 2012
  • 2. We Have… Too Much Data!!!
  • 3. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  • 4. The World of Online Chemistry: Property databases; Compound aggregators; Screening assay results; Scientific publications; Encyclopedic articles (Wikipedia); Metabolic pathway databases; ADME/Tox data (eTOX, for example); Blogs/Wikis and Open Notebook Science; Contributing Open Source code to projects
  • 5. PubChem
  • 6. ChEMBL
  • 7. Collaborative Knowledge Management
  • 8. Data on the Web
  • 9. RSC’s ChemSpider
  • 10. We Want to Answer Questions. Questions a chemist might ask: What is the melting point of n-heptanol? What is the chemical structure of Xanax? Chemically, what is phenolphthalein? What are the stereocenters of cholesterol? Where can I find publications about xylene? What are the different trade names for Ketoconazole? What is the NMR spectrum of Aspirin? What are the safety handling issues for Thymol Blue?
  • 11. Available Information… Linked to vendors, safety data, toxicity, metabolism
  • 12. Available Information…
  • 13. Crowdsourced “Annotations”. Users can add: descriptions/syntheses/commentaries; links to PubMed articles; links to articles via DOIs; spectral data; Crystallographic Information Files; photos; MP3 files; videos
  • 14. ChemSpider : Spectra Linked
  • 15. Spectra Linked
  • 16. Spectra Linked
  • 17. Chemistry Data Online Is Messy. We have inherited errors. All public compound databases, including ours, have errors: “incorrect” structures (assertions, timelines, etc.); “incorrect” names associated with structures; properties; links; publications. ENORMOUS CHALLENGE
  • 18. What could create change? Harvard Business Review (2010): “One change would make a substantial difference [to drug R&D]: the creation of agreed-upon standards for digitally representing drug assets.” Consider drug structures ONLY…
  • 19. The Structure of Vitamin K?
  • 20. MeSH A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
  • 21. The Structure of Vitamin K1?
  • 22. What is the Structure of Vitamin K1?
  • 23. CAS’s Common Chemistry
  • 24. Wikipedia
  • 25. ChEBI – Manual Curation
  • 26. “2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”. Variants of systematic names on PubChem (truncated as on the slide): 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl…; 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl…; 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl…; 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl…; 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl…; 2-methyl-3-[(E)-3,7,11,15-tetramethyl…; 2-methyl-3-(3,7,11,15-tetramethyl…; 2-methyl-3-[(E)-3,7,11,15-tetramethyl…
  • 27. Chemistry on The Internet Is Messy
  • 28. It’s Methane…
  • 29. What’s Methane?
  • 30. What’s Methane?
  • 31. What ELSE is Methane???
  • 32. EPA’s DailyMed
  • 33. EPA’s DailyMed
  • 34. EPA’s DailyMed
  • 35. With Great Fanfare…
  • 36. NPC Browser
  • 37. NPC Browser
  • 38. The EXPERTS must get it right?!
  • 39. Wikipedia, C&E News, PubChem C&E News (from ACS)
  • 40. People Use Trusted Resources…
  • 41. Earlier this month…
  • 42. Stop Whining – Fix it
  • 43. Crowdsourced Curation: identify/tag errors, edit names and synonyms, identify records to deprecate
  • 44. Search “Vitamin H”
  • 45. “Curate” Identifiers
  • 46. “Curate” Identifiers
  • 47. “Curate” Identifiers
  • 48. What is the outcome of this??? IF we can get the community to help clean up the internet of chemistry then we have: high-quality online reference resources; freely available reference data; ongoing iterative curation (how many chemical structures are “reworked”?). And what is the value of “curated chemical dictionaries”???
  • 49. Successful Semantic Markup Depends on Dictionaries
  • 50. Dictionaries Enhance Publications
  • 51. I want to know about “Vincristine”
  • 52. Vincristine Identifiers
  • 53. Vincristine: Patents Linked by Name
  • 54. Vincristine: Articles Linked by Name
  • 55. What are the names for this compound just in patents????
  • 56. A disambiguation NIGHTMARE!
  • 57. Ambiguity in Identifiers
  • 58. Crowdsourcing Works: more than 130 people have deposited data and participated in data curation; curators at different levels check each other; more curators and depositors are encouraged! 28 million chemicals is a long list…
  • 59. ChemSpider for Analytical Sciences. ChemSpider is being developed with the intention of: being the world’s richest resource of freely accessible curated analytical data; serving as a platform for structure verification and dereplication; providing access to supporting prediction algorithms
  • 60. Spectral Uploading: locate the structure of interest and deposit the spectrum. Supported formats: JCAMP, PDF
  • 61. Spectral Uploading: various types of NMR spectra supported
  • 62. Regular Updates
  • 63. Multiple Spectra for One Structure
  • 64. ChemSpider ID 24528095 H1 NMR
  • 65. ChemSpider ID 24528095 C13 NMR
  • 66. ChemSpider ID 24528095 HHCOSY
  • 67. ChemSpider ID 24528095 HSQC
  • 68. ChemSpider ID 24528095 HMBC
  • 69. Full C13 assignment uploaded
  • 70. Available Spectra
  • 71. How do these data get curated? Every spectrum can be commented on. Incorrect spectra have been annotated and curated by users… But curation through gaming is also possible…
  • 72. Web Services
  • 73. www.SpectralGame.com
  • 74. Spectral Game
  • 75. Increasing Complexity
  • 76. Spectral Game
  • 77. Reversed Spectrum
  • 78. True Curation of Data
  • 79. SpectralGame in the hand
  • 80. In progress… Storage and display of ASSIGNED spectra
  • 81. Mass Spec Analysis
  • 82. ChemSpider Interface
  • 83. Tinuvin 328
  • 84. Position sorted by references
  • 85. Position 1 only
  • 86. Web Services
  • 87. Web Services Open Up Collaboration. Agilent, Bruker, Waters and Thermo all use our web-based services for compound lookup. Many academic sites are integrating directly: metabonomics, name lookup, semantic markup
  • 88. Where do data come from? ChemSpider users deposit data. Some contributions come from NIST. Chemical vendors are starting to provide data. Synthonix is one of our major contributors
  • 89. Commercial Database Access. Recently deposited to ChemSpider: EPA/NIST IR Database (>5000 spectra). Presently under development: NIST MS database (>200,000 MS spectra)
  • 90. Where next with Analytical Support? PharmaSea project for the identification of natural products, using dereplication approaches: use mass spectrometry searches of natural product slices to identify; pre-fragment compounds and develop searches; dereplication using NMR data; NMR features; predicted spectra and “verification approaches”
  • 91. NMRShiftDB:
  • 92. NMR Prediction
  • 93. NMRShiftDB Data Review: high-quality NMR shift set of ca. 100,000 shifts; derived prediction algorithms give very similar performance statistics to commercial algorithms
  • 94. Crowdsourcing Chemical Synthesis How much data generated in a lab, that COULD go public, is lost forever?
  • 95. Crowdsourcing Chemical Synthesis. How much data generated in a lab, that COULD go public, is lost forever? Public Domain reference databases of value? Properties; Spectra; CIFs; Images; Syntheses
  • 96. An Adventure into the World of Small but significant contribution…
  • 97. ChemSpider SyntheticPages
  • 98. Micropublishing with Peer Review (a chemical synthesis blog?)
  • 99. Multi-Step Synthesis
  • 100. Interactive Data
  • 101. MOBILE Structure Database Lookup
  • 102. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  • 103. Open PHACTS Project. Develop a set of robust standards… Implement the standards in a semantic integration hub. Deliver services to support drug discovery programs in pharma and public domain. 22 partners, 8 pharmaceutical companies, 3 biotechs. 36-month project (goes live next month). Guiding principle is open access, open usage, open source: key to standards adoption
  • 104. The Future: Internet Data. Commercial Software; Pre-competitive Data; Open Science; Open Data; Publishers; Educators; Open Databases; Chemical Vendors. Small organic molecules; Undefined materials; Organometallics; Nanomaterials; Polymers; Minerals; Particle bound; Links to Biologicals
  • 105. The Future of Chemistry on the Web? Public compound databases federate & build a linked environment of validated data! Data validation needs are not ignored. Publishers layer on information to make publications discoverable. Public-Private databases can be linked. Open Data proliferate. The “Semantic Web” in action
  • 106. Can Merck Contribute to this Project? Do you have any data that you can release into the public domain? Measured property data? How many “common” spectra are thrown away? How many syntheses are published and locked behind paywalls? Can your scientists contribute annotations and curations if they use ChemSpider? Is the challenge of Legal Clearance too big?
  • 107. Thank you. Email: williamsa@rsc.org Twitter: ChemConnector Blog: www.chemconnector.com SLIDES:
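
The identifier ambiguity raised in slides 19 to 22 and 57 (one name, several structures) can be illustrated with a small sketch. This is not ChemSpider's curation code or API. The record IDs below are made up for illustration, and the names are taken from the MeSH description of vitamin K quoted on slide 20. The idea is simply to index names to records and flag any name that points at more than one record as a candidate for curation.

```python
from collections import defaultdict

# Hypothetical toy records: the IDs are invented for this sketch and are
# NOT real ChemSpider IDs. Names follow the MeSH text quoted on slide 20.
RECORDS = {
    "rec-1": ["Vitamin K", "Vitamin K1", "phytomenadione"],
    "rec-2": ["Vitamin K", "Vitamin K2", "menaquinone"],
    "rec-3": ["Vitamin K", "Vitamin K3", "menadione"],
}


def build_name_index(records):
    """Map each case-normalized name to the set of record IDs that use it."""
    index = defaultdict(set)
    for record_id, names in records.items():
        for name in names:
            index[name.strip().lower()].add(record_id)
    return index


def ambiguous_names(index):
    """Return only the names that resolve to more than one record."""
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}


index = build_name_index(RECORDS)
flagged = ambiguous_names(index)
print(flagged)  # {'vitamin k': ['rec-1', 'rec-2', 'rec-3']}
```

In this toy data only the generic name "vitamin K" is flagged, while the specific names (phytomenadione, menaquinone, menadione) each resolve to a single record. A real dictionary would also need structure-level comparison, which this sketch does not attempt.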