ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemistry

This was a presentation I gave to an audience at Nature Publishing Group in New York on May 7th 2009. It's a long presentation, over an hour in length. Not much new here relative to other presentations, just a knitting together of many of the others on here.

There is an increasing availability of free and open access resources for scientists to use on the internet. Coupled with an increasing number of Open Source software programs, we are in the middle of a revolution in data availability and tools to manipulate these data. ChemSpider is a free access website built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 190 separate data sources, ChemSpider has taken on the task of both robotically and manually integrating and curating publicly available data sources. ChemSpider has also provided an environment for users to deposit, curate and annotate chemistry-related information. This has allowed the community to enhance ChemSpider by adding analytical data, associating synthetic pathways and publications, and connecting to social networking resources. I will discuss how ChemSpider is fast becoming the premier curated platform and centralized hub for resourcing information about chemical entities and how the platform provides the foundation data for services allowing the analysis of analytical data and collaborative science.

Published in: Technology, Education


  • 1. Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry
    • Antony Williams
    • Nature Publishing Group Presentation, New York, May 2009
  • 2. The Changing Scope of My Roles and Information Access
    • Postdoctoral position, NRC, Canada (1988-1990)
    • NMR Facility Director, University of Ottawa (1990-1991)
    • NMR Technology Leader, Eastman-Kodak (1991-1997).
    • VP and Chief Science Officer for ACD/Labs, a commercial scientific software company (1997-2007).
    • Consultant to cheminformatics/scientific software companies, publishers, academia
    • “ChemSpiderman” – host of a rich resource of Free Access web-based information for chemists.
  • 3. Access to Information
    • For me…
      • PhD : Libraries primary source of information
      • PostDoc/Academia: Libraries and librarians
      • Eastman Kodak: Software tools and databases
      • Kodak and ACD/Labs: Replaced by the internet
      • Today: The Internet enhanced by a network of collaborators…
  • 4. Content is King
    • Chemistry “content” is big money – Chemistry publishing and content is worth $100s of millions/year
      • Patent searching
      • Structures and properties
      • Drug databases
      • Literature databases
    • Chemical Abstracts Service (CAS), a division of the ACS is “Gold Standard” in Chemistry related information
      • 101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences
  • 5. The Language of Chemistry
    • My language….
  • 6. And its dialects….
  • 7. As a chemist…
    • I look for information about chemicals/chemistry
      • What is a particular structure?
      • What alternative names/identifiers?
      • Reaction synthesis?
      • Physical properties?
      • Analytical data?
      • Purchase?
      • Tell me more?
      • Similar stuff – what other compounds are “like” mine?
  • 8. Imagine a time when ….
    • The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
    • Chemistry articles are indexed and searchable by a free online service
    • The web is linked together through the “language of chemistry”
    • Publicly funded research data can be shared and discussed in the Open, maybe as ONS?
    • Cheminformatics has as much of a public face as bioinformatics
  • 9. Linked Data Cloud
  • 10. Chemistry on the Internet
    • Much of the information online is User Beware!
    • The Quality of information is “diverse”
    • Technologies can “link and connect” information but validation and curation is key to providing quality
    • The LinkedData web is of less value when the data linked are “wrong”
  • 11. “Good Stuff”
  • 12. PubChem
  • 13. What is “wrong”?
  • 14.
    • Questions a chemist might ask…
      • What is the melting point of n-butanol?
      • What is the chemical structure of Xanax?
      • Chemically, what is phenolphthalein?
      • What are the stereocenters of cholesterol?
      • Where can I find publications about xylene?
      • What are the different trade names for Ketoconazole?
      • What is the NMR spectrum of Aspirin?
      • What are the safety handling issues for Thymol Blue?
      • ChemSpider can answer all of these questions
  • 15. Search Cholesterol
  • 16. Search Cholesterol
  • 17. Search Cholesterol
  • 18. Search Cholesterol
  • 19. Search Cholesterol
  • 20. Search Cholesterol
  • 21. Link outs
  • 22. Links out to KEGG Kyoto Encyclopedia of Genes and Genomes
  • 23. Complex Data and Information
  • 24. Other Searches
    • What compounds have a mass of 300 ± 0.001?
    • or search a combination of intrinsic/predicted properties
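The mass-window query on this slide can be illustrated with a minimal Python sketch. It assumes a tiny in-memory compound list (`COMPOUNDS`) and helper names (`monoisotopic_mass`, `search_by_mass`) that are hypothetical and not part of ChemSpider; it only shows the idea of filtering by monoisotopic mass within a tolerance.

```python
# Monoisotopic masses (daltons) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Hypothetical stand-in for a compound database: name -> element counts.
COMPOUNDS = {
    "caffeine": {"C": 8, "H": 10, "N": 4, "O": 2},
    "aspirin": {"C": 9, "H": 8, "O": 4},
    "ethanol": {"C": 2, "H": 6, "O": 1},
}


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in an element-count dict."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def search_by_mass(target, tolerance, compounds=COMPOUNDS):
    """Return the names of compounds whose mass lies within target +/- tolerance."""
    return sorted(
        name
        for name, formula in compounds.items()
        if abs(monoisotopic_mass(formula) - target) <= tolerance
    )
```

Calling `search_by_mass(194.0804, 0.001)` on this toy list selects only caffeine; a real service would run the same kind of window query against an indexed mass column rather than a Python dict.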
  • 25. Other Searches
  • 26. Complex Search
  • 27. The Quality of Online Chemistry…
  • 28. Quality is a Major Issue- Search Butanol OLD fixed
  • 29.  
  • 30. Wikipedia, C&E News, PubChem
    • C&E News (from ACS)
  • 31. Does one stereocenter matter? Thalidomide
  • 32. Vancomycin
    • Who will curate?
    • PubChem is not resourced to clean these errors 
    • How would you clean such a large dataset?
  • 33. Vancomycin ChemSpider: 1 compound – 3 days
  • 34. Question Everything
  • 35. DailyMed
      • “DailyMed provides high quality information about marketed drugs.
      • This information includes FDA approved labels (package inserts).”
  • 36. The FDA’s DailyMed
  • 37. Structures on DailyMed Poor Representations
  • 38. Structures on DailyMed Lack of Stereochemistry
  • 39. Incorrect Structures Scanning (?) Issues
  • 40. Incorrect Structures
  • 41. Does it Matter?
    • Does it matter to the consumer that the structures are wrong? No…what matters is what is in the bottle is the right medication!
    • To make DailyMed structure searchable it DOES matter
    • To data mine DailyMed it matters
    • To mark up DailyMed it matters
  • 42. Wikis for Science
    • Who in the room hasn’t used Wikipedia?
    • Is it trustworthy?
    • What are the advantages and disadvantages of the Wiki environment?
    • How suitable is it for Chemistry?
  • 43. Collaborative Knowledge Management for Chemists
  • 44. Wikipedia Chemistry Curation project
    • Only ca. 6000 organic structures, 8000 total structures
    • Over 18 months of work for a team of 6 people
    • Many errors removed in the process. Curation process is a daily event for users/depositors
    • Slow and torturous process
  • 45. Wikipedia Curation
    • Looking for self-consistency across a Wikipedia Page
    • Primary key is the article TITLE
    • The chemical shown needs to match the title
    • Cyclic self-consistency – and decisions must get made
  • 46. Wikipedia Links to Drugbank
  • 47. Taxol on PubChem
  • 48. Taxol on DailyMed
  • 49. Differences between ChemSpider/Wikipedia
    • Wikipedia: ~6000 organics, 2000 others. ChemSpider: >21 million unique structures
    • Wikipedia: Text. ChemSpider: Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, …
    • Wikipedia: Detailed compound monographs. ChemSpider: Compound monographs linked
    • Wikipedia: ????. ChemSpider: 6000 people/day; 3800 registered
    • Wikipedia: No. ChemSpider: Prediction of properties
    • Wikipedia: Active editors > 50 (?). ChemSpider: Active depositors/curators – 30
    • Wikipedia: No, but links. ChemSpider: Analytical Data
  • 50. Differences between Wikipedia/ChemSpider
    • ChemSpider: Primarily Microsoft .NET technologies with OS components. Wikipedia: Supported by tried and tested Media-Wiki platform.
    • ChemSpider: “Out of a basement” on three servers and 5 volunteers. Wikipedia: Established infrastructure and Wikipedia Foundation Team
    • ChemSpider: Growing team of advocates, curators and users. Wikipedia: Strong team of WP:Chem advocates, curators and admins
    • ChemSpider: Mixed “licensing”. Wikipedia: GFL licensing for everything
    • ChemSpider: Chemistry is the focus of ‘Spider. Wikipedia: Chemistry is a subset of the ‘Pedia
    • ChemSpider: Growing reputation as focused on quality. Wikipedia: Worldwide reputation as quality source – good and bad
  • 51. The InChI Identifier
  • 52. Multiple Layers
    • Source: Unofficial InChI FAQ page
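To make the layered structure of an InChI concrete, here is a minimal Python sketch that splits an InChI string into its layers. The function name `parse_inchi_layers` is hypothetical, and the sketch assumes a simple InChI with a formula layer and no repeated layer prefixes; it is not a substitute for the official InChI software.

```python
def parse_inchi_layers(inchi):
    """Split an InChI string into a dict of layers.

    Layers are separated by "/". The first part is the version, the
    second the molecular formula, and later parts start with a one-letter
    prefix (c = connectivity, h = hydrogens, t/m/s = stereochemistry, ...).
    """
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string: %r" % inchi)
    parts = body.split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:  # skip empty segments defensively
            layers[part[0]] = part[1:]
    return layers


# Ethanol as a small example.
ethanol_layers = parse_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

For ethanol this yields the version `1S`, the formula `C2H6O`, a connectivity layer `c` and a hydrogen layer `h`; stereochemical layers appear only when the structure defines stereocenters.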
  • 53. InChIs Structure but NOT substructure
  • 54. InChI Strings Hash to InChIKeys
  • 55. InChIs for Taxol
  • 56. Back to Taxol
    • Which one is correct???
  • 57. InChIKeys for Taxol
    • ChEBI and Wikipedia are the SAME structure
    • Drugbank is a DIFFERENT structure – ONE stereocenter
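The kind of comparison made on this slide can be sketched in Python. An InChIKey has three hyphen-separated blocks: the first is a hash of the connectivity information, the second a hash of the remaining layers (including stereochemistry). The function name `compare_inchikeys` is hypothetical, and the two keys used are illustrative keys for the alanine enantiomers, not the Taxol keys from the slides.

```python
def compare_inchikeys(key_a, key_b):
    """Classify how two InChIKeys differ, block by block."""
    blocks_a = key_a.split("-")
    blocks_b = key_b.split("-")
    if len(blocks_a) != 3 or len(blocks_b) != 3:
        raise ValueError("an InChIKey has three hyphen-separated blocks")
    if key_a == key_b:
        return "same structure"
    if blocks_a[0] == blocks_b[0]:
        # Same connectivity hash; the difference lies in stereochemistry,
        # isotopes or other later layers.
        return "same connectivity, different stereo or other layers"
    return "different connectivity"


# Illustrative keys for L- and D-alanine (assumed, not from the slides).
L_ALANINE = "QNAYBMKLOCPYGJ-REOHGBTMSA-N"
D_ALANINE = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"
verdict = compare_inchikeys(L_ALANINE, D_ALANINE)
```

Because the keys are hashes, a match is strong evidence of identity rather than a proof, and the key alone cannot say which stereocenter differs; that requires comparing the full InChI strings or the structures.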
  • 58. The InChI Resolver
  • 59.  
  • 60. Coming Soon…Linked Articles
  • 61. How bad can it get??? And who is right????
  • 62. Curating ChemSpider
    • Anyone can “Post Comments” associated with a structure. To curate data we require login to track
  • 63. Multi-level Curation and Approval
  • 64. Crowd-sourcing Chemistry
    • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
    • ALSO
    • Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
  • 65. Structure-Centric
    • We want to search “information” by structure, substructure, similarity of structure
    • Specific focus on Open Chemistry at present
    • Standard approaches would be:
      • Identify chemical names “entity extraction”
      • Convert chemical names to structures and index
    • ChemSpider has a validated dictionary of structure-name pairs
    • Use name extraction, name-conversion and dictionary look-up. THEN curate.
  • 66. “Entity Extraction”
    • Rule-based recognition of systematic names:
      • Use a lexeme of name fragments
      • Rules for identifying bounds of a name
    • Look-up dictionary:
      • Drug Names
      • Trivial Names
      • Numbers : Registry IDs, EINECS/ELINCS
      • Massive look-up dictionary of validated identifiers on ChemSpider
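The two approaches on this slide, dictionary look-up and rule-based recognition, can be combined in a minimal Python sketch. Everything here is an assumption for illustration: the three-entry `DICTIONARY` (names mapped to SMILES), the crude suffix rule and the function name `extract_entities` are hypothetical and far simpler than a production name recognizer.

```python
import re

# Hypothetical validated dictionary: lower-case name -> SMILES.
DICTIONARY = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "ethanol": "CCO",
    "dichloromethane": "ClCCl",
}

# Crude stand-in for rule-based recognition: common systematic-name endings.
SYSTEMATIC_SUFFIX = re.compile(r"(ane|ene|yne|anol|anone|oate)$")

TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def extract_entities(text):
    """Return (token, method, smiles) tuples for candidate chemical names."""
    hits = []
    for match in TOKEN.finditer(text):
        token = match.group()
        key = token.lower()
        if key in DICTIONARY:
            hits.append((token, "dictionary", DICTIONARY[key]))
        elif len(key) > 5 and SYSTEMATIC_SUFFIX.search(key):
            # Rule hit only: no structure yet, needs name-to-structure conversion.
            hits.append((token, "rule", None))
    return hits


hits = extract_entities("The solid was washed with dichloromethane and hexane.")
```

The suffix rule also fires on ordinary words such as "membrane", which is one reason automatic extraction has to be followed by curation.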
  • 67.  
  • 68. Name Recognition
    • Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
    • The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%).
  • 69. How Many Chemical Names?
    • “She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
  • 70. ChemMantis
    • Chemical Markup And Nomenclature Transformation Integrated System
  • 71. Making Open Access Articles Searchable Proof of Concept
    • Can we HOST Chemistry Open Access articles on ChemSpider and add-value
    • Can we identify chemical names in Open Access articles in a user-friendly manner
    • Can we convert names to structures in Open-Access articles and expand ChemSpider and provide structure searching of Open Access chemistry articles?
    • Can we provide an environment for chemists to mark-up their own articles and crowd-source markup of an archive?
  • 72. Document markup
  • 73. Markup – 3 seconds!
  • 74. On the fly conversion
  • 75. Shorthand Formulae Supported
  • 76. One Click to more Info…
  • 77. A Platform for Markup
    • Can we provide a platform for document markup for chemists?
    • Workflow:
      • Upload word docs, RTF files or point to HTML and load
      • Apply entity extraction, convert names to structures, mark-up automatically and ask for user participation
      • Deposit all structures on ChemSpider under embargo and wait for article DOI to release
  • 78. Challenges
    • Computer software can generate chemical names better than the majority of chemists
    • The majority of chemical names are generated by humans and are incorrect: they convert to the wrong structure or are ambiguous
    • One name, Multiple Structures
  • 79. Names and Structures
    • Dichloroacetone
    • Trichloromethylsilane
  • 80. Ambiguity
  • 81. Ambiguity in Abbreviations - DPA
  • 82. Ambiguity in Abbreviations - THF
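The "one name, multiple structures" problem from the preceding slides can be sketched as a look-up that refuses to pick a structure when a name is ambiguous. The `CANDIDATES` table and the function name `resolve` are hypothetical; the candidate expansions listed are common readings of each name, chosen for illustration.

```python
# Hypothetical look-up table: lower-case name or abbreviation -> candidate compounds.
CANDIDATES = {
    "dichloroacetone": ["1,1-dichloroacetone", "1,3-dichloroacetone"],
    "trichloromethylsilane": ["trichloro(methyl)silane", "(trichloromethyl)silane"],
    "thf": ["tetrahydrofuran", "tetrahydrofolate"],
    "dpa": ["diphenylamine", "dipicolinic acid", "docosapentaenoic acid"],
    "aspirin": ["acetylsalicylic acid"],
}


def resolve(name):
    """Return (status, candidates) for a name instead of guessing a structure."""
    candidates = CANDIDATES.get(name.strip().lower(), [])
    if len(candidates) == 1:
        return ("resolved", candidates)
    if candidates:
        # More than one plausible structure: flag for a human curator.
        return ("ambiguous", candidates)
    return ("unknown", [])


status, options = resolve("THF")
```

Returning the full candidate list, rather than the first match, keeps the decision with a curator or with surrounding context such as the article's subject area.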
  • 83. Import is Easy
    • Make articles Public/Private (embargo date soon)
    • Auto-markup and check by user
  • 84. IUPAC PAC Articles
  • 85. Supports Word .DOC, HTML, RTF
  • 86. Nature Publications
  • 87. Entity Balloons
    • Structures are the language of chemistry
    • Show structures to chemists and search/link from there
  • 88. Integrations Out to Other Sources
  • 89. Integrations Out to Other Sources
  • 90. Reactions
  • 91. Manual Curation is Always Necessary
  • 92. Text-Indexing and ChemSpider?
    • ChemSpider text-indexes almost 500,000 Open Access and Free Access articles
    • Collection is growing and more publishers have already agreed. Including theses in the future.
  • 93. Open Access Literature Search
  • 94. Dictionaries are Easily Enhanced
    • Copy-Paste into appropriate Entity Dictionary
    • Impacts all future markups
    • Expanding knowledgebases of information
    • Linked out to rich sources of information
  • 95. Build Dictionaries Ontologies Next
  • 96. Other Dictionaries - Species
    • We are considering
      • Bacteria
      • Fungi
      • Enzymes
      • Viruses
      • PDB codes….
  • 97. ChemSpider Everywhere
    • Linked from Wikipedia
    • Linked from Open Notebook Science sites using EMBED
    • Linked from Blogs using Structure/Spectra EMBED
    • Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets
    • Integrated to software offerings from Thermo, Waters, Agilent, Bruker
  • 98. ChemSpider Everywhere Embed Functionality (like YouTube)
  • 99. ChemSpider Everywhere
  • 100. ChemSpider Everywhere Crowdsourced Curation of Spectra
  • 101. ChemSpider Everywhere RSC Compounds
  • 102. ChemSpider Everywhere Nature Chemistry
    • Nature Chemistry articles are annotated to identify all of the chemical compounds mentioned throughout the text.
    • Those compounds are linked out to other information resources including PubChem and ChemSpider.
  • 103. ChemSpider Everywhere ChemMobi
  • 104. Structure RSS Feeds with InChIs
  • 105.  
  • 106. Conclusions
    • InChIs and Internet Chemistry