ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemistry


This is a presentation I gave to an audience at Nature Publishing Group in New York on May 7th, 2009. It is a long presentation, running over an hour. There is not much new here relative to my other presentations; it mostly knits together material from many of the others posted here.

There is an increasing availability of free and open access resources for scientists to use on the internet. Coupled with an increasing number of Open Source software programs, this puts us in the middle of a revolution in data availability and in the tools to manipulate these data. ChemSpider is a free access website built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 190 separate data sources, ChemSpider has taken on the task of both robotically and manually integrating and curating publicly available data sources. ChemSpider has also provided an environment for users to deposit, curate and annotate chemistry-related information. This has allowed the community to enhance ChemSpider by adding analytical data, associating synthetic pathways and publications, and connecting to social networking resources. I will discuss how ChemSpider is fast becoming the premier curated platform and centralized hub for sourcing information about chemical entities, and how the platform provides the foundation data for services that support the analysis of analytical data and collaborative science.

Published in: Technology, Education

  1. 1. Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry. Antony Williams. Nature Publishing Group Presentation, New York, May 2009
  2. 2. The Changing Scope of My Roles and Information Access <ul><li>Postdoctoral position, NRC, Canada (1988-1990) </li></ul><ul><li>NMR Facility Director, University of Ottawa (1990-1991) </li></ul><ul><li>NMR Technology Leader, Eastman Kodak (1991-1997) </li></ul><ul><li>VP and Chief Science Officer for ACD/Labs, a commercial scientific software company (1997-2007) </li></ul><ul><li>Consultant to cheminformatics/scientific software companies, publishers, academia </li></ul><ul><li>“ChemSpiderman” – host of a rich resource of Free Access web-based information for chemists </li></ul>
  3. 3. Access to Information <ul><li>For me… </li></ul><ul><ul><li>PhD: Libraries primary source of information </li></ul></ul><ul><ul><li>PostDoc/Academia: Libraries and librarians </li></ul></ul><ul><ul><li>Eastman Kodak: Software tools and databases </li></ul></ul><ul><ul><li>Kodak and ACD/Labs: Replaced by the internet </li></ul></ul><ul><ul><li>Today: The internet enhanced by a network of collaborators… </li></ul></ul>
  4. 4. Content is King <ul><li>Chemistry “content” is big money – Chemistry publishing and content is worth $100s of millions/year </li></ul><ul><ul><li>Patent searching </li></ul></ul><ul><ul><li>Structures and properties </li></ul></ul><ul><ul><li>Drug databases </li></ul></ul><ul><ul><li>Literature databases </li></ul></ul><ul><li>Chemical Abstracts Service (CAS), a division of the ACS, is the “Gold Standard” in chemistry-related information </li></ul><ul><ul><li>101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences </li></ul></ul>
  5. 5. The Language of Chemistry <ul><li>My language…. </li></ul>
  6. 6. And its dialects….
  7. 7. As a chemist… <ul><li>I look for information about chemicals/chemistry </li></ul><ul><ul><li>What is a particular structure? </li></ul></ul><ul><ul><li>What alternative names/identifiers? </li></ul></ul><ul><ul><li>Reaction synthesis? </li></ul></ul><ul><ul><li>Physical properties? </li></ul></ul><ul><ul><li>Analytical data? </li></ul></ul><ul><ul><li>Purchase? </li></ul></ul><ul><ul><li>Tell me more? </li></ul></ul><ul><ul><li>Similar stuff – what other compounds are “like” mine? </li></ul></ul>
  8. 8. Imagine a time when… <ul><li>The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar) </li></ul><ul><li>Chemistry articles are indexed and searchable by a free online service </li></ul><ul><li>The web is linked together through the “language of chemistry” </li></ul><ul><li>Publicly funded research data can be shared and discussed in the Open, maybe as Open Notebook Science (ONS)? </li></ul><ul><li>Cheminformatics has as much of a public face as bioinformatics </li></ul>
  9. 9. Linked Data Cloud
  10. 10. Chemistry on the Internet <ul><li>Much of the information online is User Beware! </li></ul><ul><li>The quality of information is “diverse” </li></ul><ul><li>Technologies can “link and connect” information, but validation and curation are key to providing quality </li></ul><ul><li>The LinkedData web is of less value when the data linked are “wrong” </li></ul>
  11. 11. “Good Stuff”
  12. 12. PubChem
  13. 13. What is “wrong”?
  14. 14. <ul><li>Questions a chemist might ask… </li></ul><ul><ul><li>What is the melting point of n-butanol? </li></ul></ul><ul><ul><li>What is the chemical structure of Xanax? </li></ul></ul><ul><ul><li>Chemically, what is phenolphthalein? </li></ul></ul><ul><ul><li>What are the stereocenters of cholesterol? </li></ul></ul><ul><ul><li>Where can I find publications about xylene? </li></ul></ul><ul><ul><li>What are the different trade names for Ketoconazole? </li></ul></ul><ul><ul><li>What is the NMR spectrum of Aspirin? </li></ul></ul><ul><ul><li>What are the safety handling issues for Thymol Blue? </li></ul></ul><ul><ul><li>ChemSpider can answer all of these questions </li></ul></ul>
  15. 15. Search Cholesterol
  16. 16. Search Cholesterol
  17. 17. Search Cholesterol
  18. 18. Search Cholesterol
  19. 19. Search Cholesterol
  20. 20. Search Cholesterol
  21. 21. Link outs
  22. 22. Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)
  23. 23. Complex Data and Information
  24. 24. Other Searches <ul><li>What compounds have a mass of 300+/-0.001? </li></ul><ul><li>or search a combination of intrinsic/predicted properties </li></ul>
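A mass-window query like the one on this slide can be sketched in a few lines. This is only a minimal illustration, not how ChemSpider implements its search: the three-record list, the function names and the simple formula parser (no brackets, four elements only) are all assumptions made for the example.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of a few common elements.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def monoisotopic_mass(formula: str) -> float:
    """Sum isotope masses for a simple formula such as 'C9H8O4' (no brackets)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[symbol] * (int(count) if count else 1)
    return total


def search_by_mass(records, target, tolerance):
    """Return names of records whose mass lies within target +/- tolerance."""
    return [name for name, formula in records
            if abs(monoisotopic_mass(formula) - target) <= tolerance]


# Hypothetical three-record "database"; a real service would query an index.
RECORDS = [("aspirin", "C9H8O4"), ("caffeine", "C8H10N4O2"), ("cholesterol", "C27H46O")]

hits = search_by_mass(RECORDS, target=180.042, tolerance=0.001)
print(hits)
```

A linear scan is fine for a sketch; over millions of records one would presumably keep the masses sorted or indexed so that the window can be located without touching every record.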
  25. 25. Other Searches
  26. 26. Complex Search
  27. 27. <ul><li>The Quality of </li></ul><ul><li>Online Chemistry… </li></ul>
  28. 28. Quality is a Major Issue – Search Butanol OLD fixed
  29. 30. Wikipedia, C&E News, PubChem <ul><li>C&E News (from ACS) </li></ul>
  30. 31. Does one stereocenter matter? Thalidomide
  31. 32. Vancomycin <ul><li>Who will curate? </li></ul><ul><li>PubChem is not resourced to clean these errors </li></ul><ul><li>How would you clean such a large dataset? </li></ul>
  32. 33. Vancomycin ChemSpider: 1 compound – 3 days
  33. 34. Question Everything
  34. 35. DailyMed <ul><ul><li>“DailyMed provides high quality information about marketed drugs. </li></ul></ul><ul><ul><li>This information includes FDA approved labels (package inserts).” </li></ul></ul>
  35. 36. The FDA’s DailyMed
  36. 37. Structures on DailyMed: Poor Representations
  37. 38. Structures on DailyMed: Lack of Stereochemistry
  38. 39. Incorrect Structures: Scanning (?) Issues
  39. 40. Incorrect Structures
  40. 41. Does it Matter? <ul><li>Does it matter to the consumer that the structures are wrong? No… what matters is that what is in the bottle is the right medication! </li></ul><ul><li>To make DailyMed structure searchable it DOES matter </li></ul><ul><li>To data mine DailyMed it matters </li></ul><ul><li>To mark up DailyMed it matters </li></ul>
  41. 42. Wikis for Science <ul><li>Who in the room hasn’t used Wikipedia? </li></ul><ul><li>Is it trustworthy? </li></ul><ul><li>What are the advantages and disadvantages of the Wiki environment? </li></ul><ul><li>How suitable is it for Chemistry? </li></ul>
  42. 43. Collaborative Knowledge Management for Chemists
  43. 44. Wikipedia Chemistry Curation project <ul><li>Only ca. 6000 organic structures, 8000 total structures </li></ul><ul><li>Over 18 months of work for a team of 6 people </li></ul><ul><li>Many errors removed in the process. Curation process is a daily event for users/depositors </li></ul><ul><li>Slow and torturous process </li></ul>
  44. 45. Wikipedia Curation <ul><li>Looking for self-consistency across a Wikipedia Page </li></ul><ul><li>Primary key is the article TITLE </li></ul><ul><li>The chemical shown needs to match the title </li></ul><ul><li>Cyclic self-consistency – and decisions must get made </li></ul>
  45. 46. Wikipedia Links to Drugbank
  46. 47. Taxol on PubChem
  47. 48. Taxol on DailyMed
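  48. 49. Differences between ChemSpider/Wikipedia
      <table>
      <tr><th>ChemSpider</th><th>Wikipedia</th></tr>
      <tr><td>>21 million unique structures</td><td>~6000 organics, 2000 others</td></tr>
      <tr><td>Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, …</td><td>Text</td></tr>
      <tr><td>Compound monographs linked</td><td>Detailed compound monographs</td></tr>
      <tr><td>6000 people/day; 3800 registered</td><td>????</td></tr>
      <tr><td>Prediction of properties</td><td>No</td></tr>
      <tr><td>Active depositors/curators – 30</td><td>Active editors > 50 (?)</td></tr>
      <tr><td>Analytical Data</td><td>No, but links.</td></tr>
      </table>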
  49. 50. Differences between Wikipedia/ChemSpider
      <table>
      <tr><th>Wikipedia</th><th>ChemSpider</th></tr>
      <tr><td>Supported by tried and tested MediaWiki platform</td><td>Primarily Microsoft .NET technologies with OS components</td></tr>
      <tr><td>Established infrastructure and Wikimedia Foundation team</td><td>“Out of a basement” on three servers and 5 volunteers</td></tr>
      <tr><td>Strong team of WP:Chem advocates, curators and admins</td><td>Growing team of advocates, curators and users</td></tr>
      <tr><td>GFDL licensing for everything</td><td>Mixed “licensing”</td></tr>
      <tr><td>Chemistry is a subset of the ‘Pedia</td><td>Chemistry is the focus of ‘Spider</td></tr>
      <tr><td>Worldwide reputation as quality source – good and bad</td><td>Growing reputation as focused on quality</td></tr>
      </table>
  50. 51. The InChI Identifier
  51. 52. Multiple Layers <ul><li>Source: Unofficial InChI FAQ page </li></ul>
  52. 53. InChIs Structure but NOT substructure
  53. 54. InChI Strings Hash to InChIKeys
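The idea of hashing a layered InChI string into a short, fixed-length key can be illustrated with a toy function. To be clear about the assumptions: this is not the official InChIKey algorithm, which uses its own truncated SHA-256 encoding plus flag characters, and it will not reproduce real InChIKeys. It only shows why two stereoisomers end up sharing the first block of their key while differing in the second. The function names and the choice of L- and D-alanine as inputs are illustrative.

```python
import hashlib


def _letters(text: str, length: int) -> str:
    """Hash text with SHA-256 and map the first `length` bytes to A-Z."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "".join(chr(ord("A") + byte % 26) for byte in digest[:length])


def toy_key(inchi: str) -> str:
    """Two-block key: skeleton layers first, all remaining layers second."""
    layers = inchi.split("/")
    # Version prefix, formula, and the /c and /h layers stand for the skeleton here.
    skeleton = layers[:2] + [l for l in layers[2:] if l.startswith(("c", "h"))]
    rest = [l for l in layers[2:] if not l.startswith(("c", "h"))]
    return _letters("/".join(skeleton), 14) + "-" + _letters("/".join(rest), 8)


# Two enantiomers: identical connectivity, different stereo (/m) layer.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

key_l, key_d = toy_key(L_ALANINE), toy_key(D_ALANINE)
print(key_l)
print(key_d)
```

Because the key is a hash, it supports exact-match lookup of a structure but carries no information that could be used for substructure searching.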
  54. 55. InChIs for Taxol
  55. 56. Back to Taxol <ul><li>DrugBank: RCINICONZNJXQF-CLDWUXIMDD </li></ul><ul><li>ChEBI: RCINICONZNJXQF-GXKQXQCDDN </li></ul><ul><li>Wikipedia: RCINICONZNJXQF-MZXODVADBJ </li></ul><ul><li>Which one is correct??? </li></ul>
  56. 57. InChIKeys for Taxol <ul><li>DrugBank: RCINICONZNJXQF-CLDWUXIMDD </li></ul><ul><li>ChEBI: RCINICONZNJXQF-GXKQXQCDDN </li></ul><ul><li>Wikipedia: RCINICONZNJXQF-MZXODVADBJ </li></ul><ul><li>ChEBI and Wikipedia are the SAME structure </li></ul><ul><li>DrugBank is a DIFFERENT structure – ONE stereocenter </li></ul>
  57. 58. The InChI Resolver
  58. 60. Coming Soon…Linked Articles
  59. 61. How bad can it get??? And who is right????
  60. 62. Curating ChemSpider <ul><li>Anyone can “Post Comments” associated with a structure. To curate data we require a login for tracking </li></ul>
  61. 63. Multi-level Curation and Approval
  62. 64. Crowd-sourcing Chemistry <ul><li>Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation </li></ul><ul><li>ALSO </li></ul><ul><li>Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data) </li></ul>
  63. 65. Structure-Centric <ul><li>We want to search “information” by structure, substructure, similarity of structure </li></ul><ul><li>Specific focus on Open Chemistry at present </li></ul><ul><li>Standard approaches would be: </li></ul><ul><ul><li>Identify chemical names “entity extraction” </li></ul></ul><ul><ul><li>Convert chemical names to structures and index </li></ul></ul><ul><li>ChemSpider has a validated dictionary of structure-name pairs </li></ul><ul><li>Use name extraction, name-conversion and dictionary look-up. THEN curate. </li></ul>
  64. 66. “Entity Extraction” <ul><li>Rule-based recognition of systematic names: </li></ul><ul><ul><li>Use a lexeme of name fragments </li></ul></ul><ul><ul><li>Rules for identifying bounds of a name </li></ul></ul><ul><li>Look-up dictionary: </li></ul><ul><ul><li>Drug Names </li></ul></ul><ul><ul><li>Trivial Names </li></ul></ul><ul><ul><li>Numbers : Registry IDs, EINECS/ELINCS </li></ul></ul><ul><ul><li>Massive look-up dictionary of validated identifiers on ChemSpider </li></ul></ul>
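The dictionary look-up half of entity extraction can be sketched as a longest-match scan over the text. This is a minimal illustration under stated assumptions: the four-entry dictionary, the sample sentence and the function name are invented for the example, and a real system would combine this with rule-based recognition of systematic names and a far larger validated dictionary.

```python
import re

# Tiny illustrative dictionary (name -> molecular formula); a hypothetical
# stand-in for a large validated look-up dictionary.
DICTIONARY = {
    "dichloromethane": "CH2Cl2",
    "ethanol": "C2H6O",
    "aspirin": "C9H8O4",
    "magnesium sulfate": "MgSO4",
}


def find_names(text: str, dictionary: dict) -> list:
    """Return (start, matched text, formula) for each dictionary name found."""
    # Longest names first, so multi-word names win over names nested inside them.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE
    )
    return [(m.start(), m.group(0), dictionary[m.group(0).lower()])
            for m in pattern.finditer(text)]


TEXT = ("The mixture was filtered and washed with dichloromethane. "
        "The red solid was recrystallized from ethanol.")

for start, name, formula in find_names(TEXT, DICTIONARY):
    print(start, name, formula)
```

The word-boundary anchors keep "ethanol" from matching inside "methanol", which is the kind of false positive a naive substring search would produce.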
  65. 68. Name Recognition <ul><li>Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). </li></ul><ul><li>The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91 %) </li></ul>
  66. 69. How Many Chemical Names? <ul><li>“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.” </li></ul>
  67. 70. ChemMantis <ul><li>Chemical Markup And Nomenclature Transformation Integrated System </li></ul>
  68. 71. Making Open Access Articles Searchable: Proof of Concept <ul><li>Can we HOST Chemistry Open Access articles on ChemSpider and add value? </li></ul><ul><li>Can we identify chemical names in Open Access articles in a user-friendly manner? </li></ul><ul><li>Can we convert names to structures in Open Access articles, expand ChemSpider and provide structure searching of Open Access chemistry articles? </li></ul><ul><li>Can we provide an environment for chemists to mark up their own articles and crowd-source markup of an archive? </li></ul>
  69. 72. Document markup
  70. 73. Markup – 3 seconds!
  71. 74. On the fly conversion
  72. 75. Shorthand Formulae Supported
  73. 76. One Click to more Info…
  74. 77. A Platform for Markup <ul><li>Can we provide a platform for document markup for chemists? </li></ul><ul><li>Workflow: </li></ul><ul><ul><li>Upload Word docs, RTF files or point to HTML and load </li></ul></ul><ul><ul><li>Apply entity extraction, convert names to structures, mark up automatically and ask for user participation </li></ul></ul><ul><ul><li>Deposit all structures on ChemSpider under embargo and wait for article DOI to release </li></ul></ul>
  75. 78. Challenges <ul><li>Computer software can generate chemical names better than the majority of chemists </li></ul><ul><li>The majority of chemical names are generated by humans and are incorrect: they either convert to the wrong structure or are ambiguous </li></ul><ul><li>One name, multiple structures </li></ul>
  76. 79. Names and Structures <ul><li>Dichloroacetone </li></ul><ul><li>Trichloromethylsilane </li></ul>
  77. 80. Ambiguity
  78. 81. Ambiguity in Abbreviations - DPA
  79. 82. Ambiguity in Abbreviations - THF
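The "one name, multiple structures" problem from the preceding slides can be made concrete with a look-up that refuses to guess. This is a sketch only: the candidate lists are common textbook readings of each name chosen for the example, not data taken from the slides or from ChemSpider, and the `resolve` function and its return shape are hypothetical.

```python
# Illustrative name -> candidate-structure lists; the entries are common
# examples chosen for this sketch, not data taken from the slides.
CANDIDATES = {
    "dichloroacetone": ["1,1-dichloroacetone", "1,3-dichloroacetone"],
    "trichloromethylsilane": ["trichloro(methyl)silane", "(trichloromethyl)silane"],
    "THF": ["tetrahydrofuran", "tetrahydrofolic acid"],
    "DPA": ["diphenylamine", "dipicolinic acid"],
    "aspirin": ["acetylsalicylic acid"],
}


def resolve(name: str) -> dict:
    """Classify a name as resolved, ambiguous, or unknown."""
    options = CANDIDATES.get(name, [])
    if len(options) == 1:
        return {"name": name, "status": "resolved", "structure": options[0]}
    status = "ambiguous" if options else "unknown"
    # Ambiguous and unknown names are left for a human curator to decide.
    return {"name": name, "status": status, "candidates": options}


for name in ["aspirin", "THF", "DPA", "xylene"]:
    print(resolve(name))
```

Returning the full candidate list, rather than silently picking the first entry, keeps the decision with a curator, in line with the point that automated markup still needs manual checking.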
  80. 83. Import is Easy <ul><li>Make articles Public/Private (embargo date soon) </li></ul><ul><li>Auto-markup and check by user </li></ul>
  81. 84. IUPAC PAC Articles
  82. 85. Supports Word .DOC, HTML, RTF
  83. 86. Nature Publications
  84. 87. Entity Balloons <ul><li>Structures are the language of chemistry </li></ul><ul><li>Show structures to chemists and search/link from there </li></ul>
  85. 88. Integrations Out to Other Sources
  86. 89. Integrations Out to Other Sources
  87. 90. Reactions
  88. 91. Manual Curation is Always Necessary
  89. 92. Text-Indexing and ChemSpider? <ul><li>ChemSpider text-indexes almost 500,000 Open Access and Free Access articles </li></ul><ul><li>The collection is growing and more publishers have already agreed to participate. Theses will be included in the future. </li></ul>
  90. 93. Open Access Literature Search
  91. 94. Dictionaries are Easily Enhanced <ul><li>Copy-Paste into appropriate Entity Dictionary </li></ul><ul><li>Impacts all future markups </li></ul><ul><li>Expanding knowledgebases of information </li></ul><ul><li>Linked out to rich sources of information </li></ul>
  92. 95. Build Dictionaries: Ontologies Next
  93. 96. Other Dictionaries - Species <ul><li>We are considering </li></ul><ul><ul><li>Bacteria </li></ul></ul><ul><ul><li>Fungi </li></ul></ul><ul><ul><li>Enzymes </li></ul></ul><ul><ul><li>Viruses </li></ul></ul><ul><ul><li>PDB codes…. </li></ul></ul>
  94. 97. ChemSpider Everywhere <ul><li>Linked from Wikipedia </li></ul><ul><li>Linked from Open Notebook Science sites using EMBED </li></ul><ul><li>Linked from Blogs using Structure/Spectra EMBED </li></ul><ul><li>Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets </li></ul><ul><li>Integrated to software offerings from Thermo, Waters, Agilent, Bruker </li></ul>
  95. 98. ChemSpider Everywhere Embed Functionality (like YouTube)
  96. 99. ChemSpider Everywhere
  97. 100. ChemSpider Everywhere Crowdsourced Curation of Spectra
  98. 101. ChemSpider Everywhere RSC Compounds
  99. 102. ChemSpider Everywhere: Nature Chemistry <ul><li>Nature Chemistry articles are annotated to identify all of the chemical compounds mentioned throughout the text. </li></ul><ul><li>Those compounds are linked out to other information resources including PubChem and ChemSpider. </li></ul>
  100. 103. ChemSpider Everywhere ChemMobi
  101. 104. Structure RSS Feeds with InChIs
  102. 106. Conclusions <ul><li>InChIs and Internet Chemistry </li></ul>