Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

There is an increasing availability of free and open access resources for scientists to use on the internet. Coupled with the increasing availability of Open Source software tools, we are in the middle of a revolution in data availability and in tools to manipulate these data. However, freedom has a cost, and in many cases the cost is quality. ChemSpider is a free-access website built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information, at present over 21.5 million unique chemical entities from over 150 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of the issue of quality in many chemistry-related databases, approaches to cleaning up the data, and how a curated platform can become the centralized hub for sourcing information about chemical entities. This includes experimental and predicted properties, analytical data, publications, suppliers and integrated databases. I will detail three efforts: 1) the curation of chemistry on Wikipedia; 2) an examination of structure integrity on the FDA DailyMed website, a website of medication content and labeling as found in medication package inserts; and 3) recognizing chemical names in documents and providing a platform for structure-based searching of Open Access chemistry literature.
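
As a rough illustration of the third effort, recognizing chemical names in documents, the sketch below marks up text against a small entity dictionary. It is only a minimal example under stated assumptions: the abstract does not describe how the name recognition is implemented, and the dictionary contents, the `record:` identifiers and the `mark_up` function are all hypothetical.

```python
import re

# Hypothetical entity dictionary: lowercase name -> placeholder identifier.
# A real system would map names to curated structure records.
ENTITY_DICTIONARY = {
    "cholesterol": "record:cholesterol",
    "vancomycin": "record:vancomycin",
}


def mark_up(text, dictionary):
    """Wrap every dictionary name found in text as [name|identifier]."""
    if not dictionary:
        return text
    # Try longer names first so they win over shorter names they contain.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f"[{m.group(0)}|{dictionary[m.group(0).lower()]}]",
        text,
    )


sample = "Cholesterol and taxol were compared with vancomycin."
before = mark_up(sample, ENTITY_DICTIONARY)

# Adding one entry to the dictionary affects every later markup run.
ENTITY_DICTIONARY["taxol"] = "record:taxol"
after = mark_up(sample, ENTITY_DICTIONARY)
```

In this sketch, `before` leaves "taxol" untouched, while `after` tags it once the name has been added to the dictionary. A plain lookup like this cannot tell whether the linked record carries the correct structure, which is why curation of the underlying data still matters.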

  • 1. Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry. Antony Williams, Bio-IT World 2009
  • 2. Linked Data Cloud
  • 3. Chemistry on the Internet
    • Much of the information online is User Beware!
    • The Quality of information is “diverse”
    • Technologies can “link and connect” information but validation and curation are key to providing quality
    • The LinkedData web is of less value when the data linked are “wrong”
  • 4. Quality Costs
    • Chemical Abstracts Service (CAS), a division of the ACS, is the “Gold Standard” in chemistry-related information
      • 101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences
      • But online…
  • 5. What is “wrong”?
  • 6. Languages and Links in Chemistry
  • 7.
    • A platform for:
      • Data deposition, curation and annotation
      • Supporting Open Notebook Science efforts
      • Chemistry document mark-up with ChemMantis
      • The Open Access ChemSpider Journal of Chemistry
  • 8. Search Cholesterol
  • 9. Search Cholesterol
  • 10. Search Cholesterol
  • 11. Search Cholesterol
  • 12. Search Cholesterol
  • 13. Search Cholesterol
  • 14. Complex Data and Information
  • 15. Online Data
    • Many websites host structure-based information
    • Question quality!!!
  • 16.  
  • 17. Wikipedia, C&E News, PubChem
    • C&E News (from ACS)
  • 18. Does one stereocenter matter?
  • 19. Vancomycin
    • Who will curate?
    • PubChem is not resourced to clean these errors 
    • How would you clean such a large dataset?
  • 20. Vancomycin ChemSpider: 1 compound – 3 days
  • 21. Question Everything
  • 22. DailyMed
      • “DailyMed provides high quality information about marketed drugs.
      • This information includes FDA approved labels (package inserts).”
  • 23. The FDA’s DailyMed
  • 24. Structures on DailyMed: Poor Representations
  • 25. Structures on DailyMed: Lack of Stereochemistry
  • 26. Incorrect Structures: Scanning (?) Issues
  • 27. Incorrect Structures
  • 28. Does it Matter?
    • Does it matter to the consumer that the structures are wrong? No…what matters is that what is in the bottle is the right medication!
    • To make DailyMed structure-searchable it DOES matter
    • To data mine DailyMed it matters
    • To mark up DailyMed it matters
  • 29. Collaborative Knowledge Management for Chemists
  • 30. Wikipedia Links to Drugbank
  • 31. Taxol on PubChem
  • 32. Taxol on DailyMed
  • 33. The InChI Identifier
  • 34. Multiple Layers
    • Source: Unofficial InChI FAQ page
  • 35. InChI Strings Hash to InChIKeys
  • 36. InChIs for Taxol
  • 37. Back to Taxol
    • Which one is correct???
  • 38. InChIKeys for Taxol
    • ChEBI and Wikipedia are the SAME structure
    • Drugbank is a DIFFERENT structure – ONE stereocenter
  • 39. The InChI Resolver
  • 40.  
  • 41. Coming Soon…Linked Articles
  • 42. How bad can it get??? And who is right????
  • 43. ChemMantis
    • Chemical Markup And Nomenclature Transformation Integrated System (ChemMantis)
    • A platform for entity extraction for chemistry documents, markup and integration to online information sources – Wikipedia, ChemSpider, Entrez…
    • Web-based submission, markup and publishing platform now hosting the ChemSpider Journal of Chemistry
  • 44. ChemMantis Markup
  • 45. Enable Electronic Articles…
    • Structures are the language of chemistry
    • Show structures to chemists and search/link from there…
  • 46. Species Markup
  • 47. Dictionaries are Easily Enhanced
    • Copy-Paste into appropriate Entity Dictionary
    • Impacts all future markups
    • Expanding knowledgebases of information
    • Linked out to rich sources of information
  • 48. Build Dictionaries; Ontologies Next
  • 49. Outlinks…
  • 50. Publishers and Document Mark-Up
  • 51. ChemSpider Everywhere
    • Linked from Wikipedia
    • Linked from Open Notebook Science sites using EMBED
    • Linked from Blogs using Structure/Spectra EMBED
    • Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets
    • Integrated to software offerings from Thermo, Waters, Agilent, Bruker
  • 52. ChemSpider Everywhere: Embed Functionality (like YouTube)
  • 53. ChemSpider Everywhere
  • 54. ChemSpider Everywhere: Crowdsourced Curation of Spectra
  • 55. ChemSpider Everywhere: RSC Compounds
  • 56. ChemSpider Everywhere: Nature Chemistry
    • Nature Chemistry articles are annotated to identify all of the chemical compounds mentioned throughout the text.
    • Those compounds are linked out to other information resources including PubChem and ChemSpider.
  • 57. ChemSpider Everywhere: ChemMobi
  • 58. Structure RSS Feeds with InChIs
  • 59.  
  • 60. Acknowledgments
    • Richard Kidd, Royal Society of Chemistry
    • Jason Wilde, Nature Publishing Group
    • Martin Walker and the Wikipedia Chemistry team
    • Microsoft – Rudy Potenzone
    • Symyx – Keith Taylor and James Jack
    • SureChem – Nicko Goncharoff
    • Spectral game - Andrew Lang and Jean-Claude Bradley
    • The InChI team and Advisory Group
  • 61. Conclusions
    • InChIs and Internet Chemistry
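
The Taxol comparison on slides 36 to 38 can be sketched in code. In a standard InChIKey, the first block of 14 letters hashes the connectivity of the molecule, while the first 8 letters of the second block hash the remaining layers, such as stereochemistry and isotopes. Two keys that share the first block but differ in the second therefore point to records that very likely have the same skeleton but differ in a detail such as one stereocenter. The example below is a minimal sketch under assumptions: the function names are hypothetical, and the keys are made-up placeholders, not the actual Taxol keys from the slides.

```python
import re

# InChIKey layout: 14 letters, hyphen, 8 letters + flag letter + version
# letter, hyphen, 1 protonation letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its blocks; raise ValueError if malformed."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # hash of the connectivity
        "detail": detail,            # hash of stereo, isotope, other layers
        "flag": flag,
        "version": version,
        "protonation": protonation,
    }


def compare_inchikeys(key_a, key_b):
    """Classify how two InChIKeys relate to each other."""
    a, b = split_inchikey(key_a), split_inchikey(key_b)
    if a == b:
        return "same structure"
    if a["skeleton"] == b["skeleton"]:
        return "same skeleton, different detail (e.g. stereochemistry)"
    return "different skeleton"


# Placeholder keys for illustration only; these are NOT real InChIKeys.
source_one = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
source_two = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
source_three = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"

same = compare_inchikeys(source_one, source_two)
differs = compare_inchikeys(source_one, source_three)
```

Because the blocks are hashes, a match is strong evidence but not proof of identity, and the comparison cannot say which record is the correct one. That question still needs a curator.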