The Possibilities and Pitfalls of Internet-Based Chemical Data


In less than a decade the internet has provided us access to enormous quantities of chemistry data. Chemists have embraced the web as a rich source of data and knowledge. However, all that glisters is not gold. While online searches can now provide us access to information associated with many tens of millions of chemicals, and can allow us to traverse patents, publications and public domain databases, the promise of high-quality data on the web needs to be tempered with caution.
In recent years the crowdsourcing approach to developing curated content has been growing. Can such approaches bring the collective wisdom of the crowd to bear on validating and enhancing the availability of trusted chemistry data online, or are algorithms likely to be more powerful for validating data? It is now possible to search the web using a query form natural to chemists, that of “structure searching the web”, but scientists are increasingly likely to have to accept joint responsibility for the quality of data online for the foreseeable future. Their participation is likely to come through engaging in open science, providing data under open licenses and offering their skills to the community.
This presentation will provide an overview of the present state of chemistry data online, the challenges and risks of managing and accessing data in the wild and how an internet for chemistry continues to expand in scope and possibilities.
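
To make the “structure searching the web” idea above a little more concrete: slides 17 and 18 of the deck point to InChI and InChIKeys as the identifiers behind it. The following is a minimal Python sketch, not ChemSpider's or any search engine's implementation. It only checks that a string has the published InChIKey layout (14 letters, hyphen, 10 letters, hyphen, 1 letter) and builds a quoted search phrase from it. The function names are hypothetical, and the example key is the standard InChIKey for aspirin.

```python
import re

# Layout of an InChIKey: 14 uppercase letters, 10 uppercase letters,
# then a single uppercase letter, separated by hyphens.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(key: str) -> bool:
    """Check the layout only; this does not prove the key maps to a real structure."""
    return INCHIKEY_PATTERN.fullmatch(key) is not None


def split_inchikey(key: str) -> dict:
    """Split a well-formed InChIKey into its blocks (hypothetical helper)."""
    if not is_wellformed_inchikey(key):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, rest, protonation = key.split("-")
    return {
        "connectivity": skeleton,         # hash of the molecular skeleton
        "other_layers": rest[:8],         # hash of stereo, isotopes, etc.
        "is_standard": rest[8] == "S",    # 'S' = standard, 'N' = non-standard
        "version": rest[9],               # 'A' = InChI version 1
        "protonation": protonation,       # 'N' = neutral
    }


def search_phrase(key: str, ignore_stereo: bool = False) -> str:
    """Build a quoted phrase to paste into a web search engine.

    With ignore_stereo=True only the first block is used, so pages about
    any stereoisomer sharing the same skeleton can match.
    """
    parts = split_inchikey(key)
    term = parts["connectivity"] if ignore_stereo else key
    return f'"{term}"'


ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(search_phrase(ASPIRIN_KEY))
print(search_phrase(ASPIRIN_KEY, ignore_stereo=True))
```

Because the key is a fixed-length string with no special characters, an exact-phrase text search for it behaves like a structure search, which is the point slide 18 makes.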

Published in: Technology

  1. 1. The Possibilities and Pitfalls of Internet-Based Chemical Data Antony Williams Royal Society of Chemistry
  2. 2. About Me…as a Chemist
      ∗ I’ve performed a few dozen chemical syntheses
      ∗ I’ve run thousands of analytical spectra
      ∗ I’ve generated thousands of NMR assignments
      ∗ I’ve probably published <5% of all work
      ∗ But things can be different today….
  3. 3. My Early Scientific Computing
  4. 4. If it was not just about me…
  5. 5. If it was not just about me…
      ∗ Together we might:
          ∗ build an encyclopedia
          ∗ …and rate restaurants
          ∗ …provide book reviews to each other
          ∗ …or movie reviews
          ∗ …or reviews of service providers
          ∗ …organize sit-ins and social action
          ∗ …and more data might just be Open
  6. 6. If it was not just about me…
      ∗ Together we might:
          ∗ build an encyclopedia
          ∗ …and rate restaurants
          ∗ …provide book reviews to each other
          ∗ …or movie reviews
          ∗ …or reviews of service providers
          ∗ …organize sit-ins and social action
          ∗ …and more data might just be Open
          ∗ …more Chemists might share rather than just take!
  7. 7. A story of a hobby gone wild… Years 1 and 2
      ∗ A hobby-project to connect chemistry data on the web
      ∗ Three servers – one purchased, two hand-built
      ∗ Software begged and borrowed – and thanks to Microsoft!
      ∗ Some late nights – 10pm to 2am for over a year
      ∗ Some survival of the naysayers in the community
      ∗ …and taking advantage of a changing world of data availability and the crowdsourcing of willing participants
      ∗ NO formal funding. Simply passion and abilities lining up.
  8. 8. ChemSpider (Year 2-present)
      ∗ Building a Free Chemical Database
      ∗ A central hub for chemists to source information
          ∗ >28 million unique chemical records
          ∗ Aggregated from >400 data sources
          ∗ Chemicals, analytical data, movies, images, podcasts, links to patents, publications, predictions
          ∗ Web services for integration
          ∗ Daily updates of data
  9. 9. Answer Questions for Chemists
      ∗ Questions a chemist might ask…
          ∗ What is the melting point of n-heptanol?
          ∗ What is the chemical structure of Xanax?
          ∗ Chemically, what is phenolphthalein?
          ∗ What are the stereocenters of cholesterol?
          ∗ Where can I find publications about xylene?
          ∗ What are the different trade names for Ketoconazole?
          ∗ What is the NMR spectrum of Aspirin?
          ∗ What are the safety handling issues for Thymol Blue?
  10. 10. A LITTLE Chemistry First
  11. 11. Structural Diagrams
  12. 12. Structural Diagrams
  13. 13. Analytical Data
  14. 14. Does Stereochemistry Matter?
  15. 15. Does one stereocenter matter?  Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide
  16. 16. Structural Representations
  17. 17. The InChI Standard
  18. 18. InChIKeys: Search the Web by Structure
  19. 19. I want to know about “Vincristine”
  20. 20. Vincristine: Identifiers and Properties
  21. 21. Vincristine: Vendors and Sources
  22. 22. Vincristine: Patents
  23. 23. Chemical Names and Synonyms VALIDATION OF NAMES
  24. 24. Validated Names for Searching…
  25. 25. Information System Architecture [diagram components: Input, Filtering, Curation, Archival, Presentation, Storage, Indexing, API, Processing, Search, Browse]
  26. 26. The Quality of Chemical Data Online
      What is the Structure of Vitamin K?
      A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria & synthetic naphthoquinone provitamins, VITAMIN K3 (menadione).
  27. 27. What is the Structure of Vitamin K1?
  28. 28. What is the Structure of Vitamin K1?
  29. 29. CAS’s Common Chemistry
  30. 30. Wikipedia
  31. 31. Wolfram Alpha
  32. 32. DailyMed
  33. 33. People Use Trusted Resources…
  34. 34. Just Yesterday…
  35. 35. How will it improve? Participation and contribution
  36. 36. ALL Different, ALL “Domoic Acids”
  37. 37. ALL Different, ALL “Domoic Acids”
  38. 38. The EXPERTS must get it right?!
  39. 39. Question Everything Online:
  40. 40. Deposition, Annotation and Validation
      ∗ ANYBODY can annotate a record on ChemSpider
      ∗ Registered users can deposit new data
      ∗ Registered users can validate existing data
  41. 41. CURATION Search “Vitamin H”
  42. 42. “Curate” Identifiers
  43. 43. “Curate” Identifiers
  44. 44. ChemSpider Web Services
  45. 45. Open APIs for Science
      ∗ ChemSpider via web service access
          ∗ For structure identification for mass spectrometry
          ∗ For name and structure resolution
          ∗ For structure and substructure searching
          ∗ For an “innovative medicines initiative” semantic web project…
  46. 46. Open PHACTS Project
      ∗ Develop a set of robust standards
      ∗ Integrate Chemistry and Biology data by implementing the standards in a semantic integration hub
      ∗ Deliver services to support drug discovery programs in pharma and public domain
      ∗ INITIALLY 22 partners, 8 pharmaceutical companies, 3 biotechs
      ∗ 36 months project – first public release version is imminent
      Guiding principle is open access, open usage, open source. Key to standards adoption.
  47. 47. RDF and the semantic web
      ∗ Using RDF permalinks
      ∗ Using a Search Term
  48. 48. RDF and the semantic web
  49. 49. www.SpectralGame.com
  50. 50. The World of Contribution
      ∗ Times have changed
          ∗ Immediacy of social networks
          ∗ Commenting on articles/data is here
          ∗ The “participating scientist” has high profile
          ∗ And who can be a scientist now???
  51. 51. A Ten Year Old Scientist
  52. 52. Challenging a Publication
  53. 53. Oops…
  54. 54. >2 Years to Resolution
  55. 55. What of Hexacyclinol?
  56. 56. The Blogosphere “Discusses”…
  57. 57. Oxidation by Sodium Hydride?
  58. 58. The Blogosphere Analyzes…
  59. 59. The Blogosphere Analyzes…
  60. 60. How much is in the archives?
  61. 61. Open Notebook Science Analysis
  62. 62. Motivation: Faster Science, Better Science
  63. 63. Openness – Still Carries Licensing
      ∗ Openness may be hard..
      ∗ Open Access flavors
      ∗ Open Source licenses
      ∗ Open Data licenses
      ∗ Open Notebook Science
  64. 64. We Suggest Rules for Licensing Data
      ∗ License data based on GOALS: scientific, commercial, or mixed
      ∗ Explore the benefits of open licensing and drawbacks of enclosure
      ∗ Provide simple explanations of terms of use
      ∗ If you can’t make the data public domain, make the metadata public domain.
  65. 65. We Suggest Rules for Licensing Data
  66. 66. Challenged in the Twittersphere
  67. 67. Annotating Articles Today…
  68. 68. Attribution to me…
  69. 69. Other Publications to Annotate…
  70. 70. Other Publications to Annotate…
  71. 71. Publications to Annotate…
      “We then established a collaboration with professor Sum Ting Wong, a fugitive from the North Korean University Hu Yu Hai Ding”
      “..identified as the new protein Wai So Dim”
  72. 72. A New World for Publishing?
  73. 73. An Adventure into the World of Small but significant contribution..
  74. 74. ChemSpider SyntheticPages
  75. 75. Micropublishing with Peer Review (a chemical synthesis blog?)
  76. 76. Multi-Step Synthesis
  77. 77. Interactive Data
  78. 78. A New Route for Scientific Recognition?
  79. 79. The Measure of a Scientist?
      ∗ How do “we” measure a scientist?
      ∗ The funding bodies, department heads etc. use
          ∗ Publication profile
          ∗ Impact factors
          ∗ An index – h, m, g, i10, c, s …
          ∗ Grants brought in
      ∗ Scientists are notable in different ways – technology can help measure different types of “impact”
  80. 80. What makes a Scientist Notable?
  81. 81. Public Profiles of Scientists
      ∗ Online tools track activities of scientists
      ∗ Some are totally opt-in, an increasing number are about you and need checking!
      ∗ Take responsibility for your profile online
      ∗ Actively BUILD your online profile
  82. 82. Microsoft Academic Search
  83. 83. My Academic Search Profile
  84. 84. My Co-author Graph
  85. 85. Q: How Often Do You Contribute? Annotation and Validation
      ∗ How many times do you see errors where:
          ∗ 1) You have not been able to annotate or curate
          ∗ 2) You have chosen not to annotate or curate
  86. 86. My Co-author Graph
  87. 87. Contribute when you can!
  88. 88. Contribute when you can!
  89. 89. Scientists and Orcids?
  90. 90.
      ∗ A unique identifier for a scientist – a Scientist’s InChI!
      ∗ Will enable aggregation of a scientist’s activities
      ∗ ORCIDs associated with publications, data, blog comments, other contributions (Wikipedia, reviews etc.) will be a way to measure their impact
  91. 91. The Alt-Metrics Manifesto
  92. 92. ImpactStory
  93. 93. ImpactStory
  94. 94. SlideShare
  95. 95. SlideShare via ImpactStory
  96. 96. ImpactStory
  97. 97. Where do I contribute? How might I be measured?
  98. 98. Article Level Metrics
  99. 99. Article Level Metrics
  100. 100. New Measures of Impact
      ∗ Impact will be an aggregate measure of
          ∗ Publications – classic measures and article level metrics
          ∗ Data, algorithms and code – and its distribution and reuse
          ∗ Contributions as comments, annotation and curation activities
          ∗ New “impact factors” will develop with time
  101. 101. The Challenges
      ∗ Some challenges are technology based
          ∗ The growth in data – storage and compute speed
          ∗ Ontologies, dictionaries and trusted sources
      ∗ Many challenges are “about us”
          ∗ Licenses and rights
          ∗ Rewards and recognition
          ∗ Participation, contribution and collaboration
  102. 102. Tear Down Walls between Government Labs
      ∗ There are many government institutions building public compound databases that should collaborate more:
          ∗ National Cancer Institute (NCI)
          ∗ National Institutes of Health (NIH)
          ∗ Environmental Protection Agency (EPA)
          ∗ Food and Drug Administration (FDA)
          ∗ National Library of Medicine (NLM)
  103. 103. Release STRUCTURES Please!
  104. 104. What Does the Future Hold?
  105. 105. The Linked Network Will Grow
  106. 106. The Data Deluge Will Not Go Away
  107. 107. RSC Activities in Development
      ∗ Deliver a Global Chemistry Hub
      ∗ “Data enable” the RSC archive back to 1841:
          ∗ Extract chemistry – chemicals, reactions, experimental data points, complex data
          ∗ Enrich the articles for interactive viewing and crowdsourced annotation and curation
          ∗ Enhance queries possible across the archive
  108. 108. Federated Data Segregation
  109. 109. Future System Architecture [diagram; recoverable component labels: Input, Filtering, Curation, Archival, Presentation, Storage, Indexing, API, Processing, Search, Browse; recoverable annotations: “Smarter algorithms”, “No more complex”, “Elastic, distributed”, “Complexity is hidden”, “New algorithms”, “Over federated systems”, “Distributed”]
  110. 110. Data Validation is Exacting Work
  111. 111. “Challenge” the Community
  112. 112. Chemistry Data at RSC
      ∗ Chemistry is NOT just small molecules!
      ∗ Data in RSC publications will be “enabled”
      ∗ Data available for validation and curation
      ∗ The delivery of the “Datument”
      ∗ Data will be fed to models for validation, to retrain the models, full provenance retained
      ∗ Algorithms will be provided to the community
  113. 113. Enhanced Mark-Up?
  114. 114. An Error in my Abstract?
  115. 115. An Error in my Abstract?
      Chemists have embraced the web as a rich source of data and knowledge. However, all that glisters is not gold
  116. 116. Thanks Shakespeare
  117. 117. Acknowledgments
      ∗ RSC and RSC|Cheminformatics team
      ∗ All data source providers, curators and annotators
      ∗ All software providers: commercial and open source
      ∗ Contributors, curators, collaborators
      ∗ Trusted Advisors: Jean-Claude Bradley, Sean Ekins, Lee Harland, Gary Martin, Martin Walker and…
  118. 118. Meet Valery… We’d love to chat…
  119. 119. Thank you
      Email: williamsa@rsc.org
      Twitter: ChemConnector
      Personal Blog: www.chemconnector.com
      SLIDES:
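
Slide 79 lists citation indices (h, i10 and others) among the measures that funding bodies and department heads use. As a reminder of what two of them compute, here is a minimal Python sketch using the usual definitions: the h-index is the largest h such that h papers each have at least h citations, and the i10-index is the number of papers with at least 10 citations. The citation counts below are invented for illustration and describe no real scientist.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Invented citation counts for six hypothetical papers.
example = [25, 8, 5, 3, 3, 0]
print("h-index:", h_index(example))
print("i10-index:", i10_index(example))
```

Neither number says anything about data deposited, records curated or comments contributed, which is the gap the later slides on alt-metrics and article level metrics address.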