Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry



The original abstract for the talk is below, but the talk itself changed in response to strong interest in InChI and its possible use in a Semantic Web for chemistry.

The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. However, freedom has a cost, and in many cases the cost is quality. ChemSpider is a free access website for chemists built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 150 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of how a curated platform can become the centralized hub for sourcing information about chemical entities. We will also present ChemMantis, an entity extraction platform for extracting chemical names and scientific terms in documents and providing a platform for structure-based searching of Open Access chemistry literature.

Published in: Technology

  1. Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry. Antony Williams
  2. The Language of Chemistry
       • My language…
  3. And its dialects…
  4. From Yesterday
       • Approaches to linking data
       • RDF’ing, OWL’ing, SPARQL’ing
       • Triples and stores
       • All are appropriate technologies…
       • Online data linked to by the pharma industry
           • DrugBank, PubChem, DailyMed, KEGG, ChEBI
       • But what of the Quality of data?
  5. Question Everything
  6. PubChem
  7. Quality is a Major Issue: Search Butanol
  8. Caution! Question Everything!
  9. The FDA’s DailyMed
  10. Quality of Structures!!!
  11. Quality of Structures
       • If the “Authority” isn’t doing the work to curate, then who will?
  12. Collaborative Knowledge Management for Chemists
  13. DrugBank
  14. Taxol on PubChem
  15. DailyMed
  16. The InChI Identifier
  17. Multiple Layers
       • Source: Unofficial InChI FAQ page
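
The layered structure named on slide 17 can be illustrated with a short sketch. An InChI string is a sequence of layers separated by `/`, where each layer after the formula starts with a one-letter prefix. The function name `split_inchi_layers` and the human-readable layer labels are assumptions made for this illustration, and the sketch is simplified: it assumes the second field is the formula, and it does not group sublayers that repeat inside the fixed-hydrogen, isotopic, or reconnected layers.

```python
# One-letter prefixes that introduce InChI layers after the formula.
# The labels on the right are informal names chosen for this sketch.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
    "f": "fixed hydrogens",
    "r": "reconnected metals",
}


def split_inchi_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI string into (layer name, layer content) pairs."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    if len(parts) < 2:
        raise ValueError("InChI has no layers after the version")
    # Simplifying assumption: the first field is the version and the
    # second is the chemical formula, which carries no prefix letter.
    layers = [("version", parts[0]), ("formula", parts[1])]
    for part in parts[2:]:
        if not part:
            continue  # skip empty segments
        prefix = part[0]
        name = LAYER_NAMES.get(prefix, f"unknown ({prefix})")
        layers.append((name, part[1:]))
    return layers


# Ethanol as a small example.
for name, content in split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"):
    print(f"{name:>14}: {content}")
```

A list of pairs is used rather than a dictionary so that repeated prefixes are not silently overwritten.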
  18. InChI Strings Hash to InChIKeys
  19. InChIs for Taxol
  20. Back to Taxol
       • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
       • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
       • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
       • Which one is correct???
  21. InChIKeys for Taxol
       • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
       • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
       • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
       • ChEBI and Wikipedia are the SAME structure
       • DrugBank is a DIFFERENT structure: ONE stereocenter
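
The comparison on slides 20 and 21 can be sketched in a few lines. The first 14 letters of an InChIKey hash the connectivity of the molecule, and the block after the hyphen covers the remaining layers, stereochemistry among them. The helper `split_inchikey` and the field names `skeleton` and `remainder` are hypothetical names for this illustration; the keys are copied from the slides. Note that a differing second block shows only that some remaining layer differs: as slide 21 states, the ChEBI and Wikipedia keys still describe the same structure, so the second block alone does not settle which record is correct.

```python
from typing import NamedTuple


class KeyBlocks(NamedTuple):
    skeleton: str   # first block: hash of the connectivity
    remainder: str  # everything after the first hyphen


def split_inchikey(key: str) -> KeyBlocks:
    """Split an InChIKey at its first hyphen into the two parts compared here."""
    skeleton, _, remainder = key.partition("-")
    if len(skeleton) != 14 or not remainder:
        raise ValueError(f"unexpected InChIKey shape: {key!r}")
    return KeyBlocks(skeleton, remainder)


# Keys as listed on the slides for Taxol.
TAXOL_KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}

blocks = {source: split_inchikey(key) for source, key in TAXOL_KEYS.items()}
same_skeleton = len({b.skeleton for b in blocks.values()}) == 1
distinct_remainders = {b.remainder for b in blocks.values()}

print("All sources share one connectivity block:", same_skeleton)
print("Distinct second blocks:", len(distinct_remainders))
```

All three sources agree on connectivity, which narrows the disagreement to the layers hashed into the second block.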
  22. Does one stereocenter matter?
  23. Does one stereocenter matter?
       • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
  24. Does one stereocenter matter?
       • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
  25. Too Much Variability in InChIs
       • Source: Unofficial InChI FAQ page
  26. NEW: Resolve Variability with StdInChI
  27. Assertion and Chemical Entities
       • Who says what Taxol is?
       • What is the “timeline” for a molecule?
       • How do we clean up the Public data?
       • The Quality source is Chemical Abstracts Service…
  28. Wikipedia Chemistry Curation project
       • > 6000 organic structures
       • Over 1 year of work for a team of 6
       • Many errors removed in the process
       • Slow and torturous process
       • CAS now collaborating in the process
       • InChIs and InChIKeys will be added
  30. Stereoisomers
  31. Content is King and Quality Costs
       • Chemistry “content” is big money: chemistry publishing and content are worth $100s of millions/year
           • Patent searching
           • Structures and properties
           • Drug databases
           • Literature databases
       • Chemical Abstracts Service (CAS), a division of the ACS, is the “Gold Standard” in chemistry-related information
           • 101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences
  32.
       • Free access website for chemists to research structure-based information
           • Structure/substructure searches
           • Text-based searches
           • Prediction of properties
           • Web service-based integration
       • Platform for deposition, curation, integration of data
           • Structures, analytical data, annotations, links to resources
           • Annotation and curation of data in real-time
       • A platform to assist discovery?
  33. ChemSpider Data
       • The database contains >21.5 million compounds obtained from >150 data sources and is growing weekly; 0.5 million compounds are awaiting deposition
           • Chemical vendors
           • Publishers
           • Commercial Database Vendors
           • US and international patents
           • Structure aggregators
           • Scraped from websites
           • Deposited by users
  34. Example Search 1
       • Is there any information about “Quesnoin”?
       • OR…
       • Type in the name (and there may be many) or other identifier
       • Paste the InChI String, InChIKey or SMILES
       • Draw the structure
  35. Example Search 1
  36. Example Search 1
  37. Complex Search
  38. Wikipedia via ChemSpider…
  39. Searching and Reading Articles…
       • Searching articles based on chemical structure and substructure is very expensive, but this is changing
       • The web IS “tool-ready”, so when will publishers deliver?
           • Structures can be shown
           • Spectra can be interactive
           • Graphics don’t need to be static
           • Publishers can enhance their articles (Project Prospect from the RSC is an example)
  40. Publishers should adopt/add InChIs. RSC and Nature Publishing Group have!
  42. Document Mark-up and Linking
  43. Structure Searching
  44. Species…
  45.
       • Entity Extraction built around modified algorithms from SureChem
       • Optimized for “publications”
       • Dictionaries for chemical entities, groups, reactions, elements, families, species…
       • Dictionaries can be expanded; presently adding PDB
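
Slide 45 describes entity extraction driven by dictionaries of chemical entities, groups, reactions, elements and so on. The following is a toy sketch of dictionary lookup in general, not the SureChem-derived algorithms the slide refers to; the dictionary contents and the names `DICTIONARIES`, `build_matcher` and `extract_entities` are invented for this illustration.

```python
import re

# Tiny illustrative dictionaries keyed by category (contents are made up).
DICTIONARIES = {
    "chemical entity": ["taxol", "butanol"],
    "group": ["hydroxyl"],
    "reaction": ["oxidation"],
    "element": ["carbon"],
}


def build_matcher(dictionaries):
    """Compile one case-insensitive pattern covering every dictionary term."""
    lookup = {
        term.lower(): category
        for category, terms in dictionaries.items()
        for term in terms
    }
    # Longest terms first, so a longer name wins over a shorter one inside it.
    ordered = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return pattern, lookup


def extract_entities(text, dictionaries=DICTIONARIES):
    """Return (matched text, category, start offset) for each dictionary hit."""
    pattern, lookup = build_matcher(dictionaries)
    return [
        (match.group(0), lookup[match.group(0).lower()], match.start())
        for match in pattern.finditer(text)
    ]


sentence = "Oxidation of butanol removes the hydroxyl hydrogen."
for found, category, start in extract_entities(sentence):
    print(f"{start:>3}  {found:<10} {category}")
```

Expanding coverage, as the last bullet on the slide suggests, amounts to adding another category and term list to the dictionary structure.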
  46. The InChI Resolver
  47. The InChI “Resolver”
  48. The InChI “Resolver”
  49. Google Searches on InChI: String limit
  50. InChIKey Searches Work
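
Slides 49 and 50 contrast searching on full InChI strings, which run into a string limit, with searching on InChIKeys. One way to see the difference: an InChI grows with the size of the molecule, whereas a hash of it has a fixed length. The sketch below uses a plain SHA-256 hex digest as a stand-in to show that property only. The real InChIKey applies its own encoding to the hash and is not reproduced here, and `toy_fixed_length_key` is a hypothetical name.

```python
import hashlib


def toy_fixed_length_key(inchi: str) -> str:
    """Return a fixed-length digest of an InChI string (NOT a real InChIKey)."""
    return hashlib.sha256(inchi.encode("ascii")).hexdigest()


examples = {
    "methane": "InChI=1S/CH4/h1H4",
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}

for name, inchi in examples.items():
    digest = toy_fixed_length_key(inchi)
    # The InChI length varies with the molecule; the digest length does not.
    print(f"{name}: InChI length {len(inchi)}, digest length {len(digest)}")
```

Because the digest is one short token of predictable length, it can be used as an exact-match search term regardless of how large the underlying structure is.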
  51. InChIs are incomplete
       • What is NOT supported, yet:
           • polymers
           • organometallics
           • Markush structures
           • 3-D structures
           • excited states
           • interlocking structures (e.g. rotaxanes)
           • host-guest complexes
  52. Crowdsourcing for Curation
       • Chemistry databases enhanced by crowdsourcing
       • Chemistry databases can be connected to articles, vendors, properties, spectra, etc.
       • A platform for deposition, curation and distribution?
       • This is the future… existing business models are at risk
  53. Post Comments
       • Anyone can “Post Comments” associated with a structure. To curate data, a login is required so that edits can be tracked
  54. Conclusions
       • The internet enables chemistry, and at a reduced cost
       • Web 2.0 is here and improving quality, to the benefit of Web 3.0
       • Question Quality!
       • Crowdsourcing for expansion, curation and integration
       • Classical models may die quite quickly: business models must change soon or fail
       • Publishers: heed the proliferation of InChIs for Chemistry
  55. Blogs and Contacts
       • The InChI resolver
           • (goes live at ACS Spring)
       • The ChemSpider blog
       • Contact
           • [email_address]