The Possibilities and Pitfalls of Internet-Based Chemical Data

In less than a decade the internet has given us access to enormous quantities of chemistry data, and chemists have embraced the web as a rich source of data and knowledge. However, all that glisters is not gold. Online searches now give us access to information on many tens of millions of chemicals and let us traverse patents, publications and public domain databases, but the promise of high-quality data on the web needs to be tempered with caution.
In recent years the crowdsourcing approach to developing curated content has been growing. Can such approaches bring the collective wisdom of the crowd to bear on validating and enhancing the availability of trusted chemistry data online, or are algorithms likely to be more powerful for validating data? It is now possible to search the web using a query form natural to chemists, that of “structure searching the web”, yet for the foreseeable future scientists will increasingly have to accept joint responsibility for the quality of data online. Their participation is likely to come through engaging in open science, providing data under open licenses and offering their skills to the community.
This presentation provides an overview of the present state of chemistry data online, the challenges and risks of managing and accessing data in the wild, and how an internet for chemistry continues to expand in scope and possibilities.

Transcript

  • 1. The Possibilities and Pitfalls of Internet-Based Chemical Data. Antony Williams, Royal Society of Chemistry
  • 2. About Me…as a Chemist
      ∗ I’ve performed a few dozen chemical syntheses
      ∗ I’ve run thousands of analytical spectra
      ∗ I’ve generated thousands of NMR assignments
      ∗ I’ve probably published <5% of all work
      ∗ But things can be different today….
  • 3. My Early Scientific Computing
  • 4. If it was not just about me…
  • 5. If it was not just about me…
      ∗ Together we might:
          ∗ build an encyclopedia
          ∗ …and rate restaurants
          ∗ …provide book reviews to each other
          ∗ …or movie reviews
          ∗ …or reviews of service providers
          ∗ …organize sit-ins and social action
          ∗ …and more data might just be Open
  • 6. If it was not just about me…
      ∗ Together we might:
          ∗ build an encyclopedia
          ∗ …and rate restaurants
          ∗ …provide book reviews to each other
          ∗ …or movie reviews
          ∗ …or reviews of service providers
          ∗ …organize sit-ins and social action
          ∗ …and more data might just be Open
          ∗ …more Chemists might share rather than just take!
  • 7. A story of a hobby gone wild… Years 1 and 2
      ∗ A hobby-project to connect chemistry data on the web
      ∗ Three servers – one purchased, two hand-built
      ∗ Software begged and borrowed – and thanks to Microsoft!
      ∗ Some late nights – 10pm to 2am for over a year
      ∗ Some survival of the naysayers in the community
      ∗ …and taking advantage of a changing world of data availability and the crowdsourcing of willing participants
      ∗ NO formal funding. Simply passion and abilities lining up.
  • 8. ChemSpider (Year 2-present)
      ∗ Building a Free Chemical Database
      ∗ A central hub for chemists to source information
          ∗ >28 million unique chemical records
          ∗ Aggregated from >400 data sources
          ∗ Chemicals, analytical data, movies, images, podcasts, links to patents, publications, predictions
          ∗ Web services for integration
          ∗ Daily updates of data
  • 9. Answer Questions for Chemists
      ∗ Questions a chemist might ask…
          ∗ What is the melting point of n-heptanol?
          ∗ What is the chemical structure of Xanax?
          ∗ Chemically, what is phenolphthalein?
          ∗ What are the stereocenters of cholesterol?
          ∗ Where can I find publications about xylene?
          ∗ What are the different trade names for Ketoconazole?
          ∗ What is the NMR spectrum of Aspirin?
          ∗ What are the safety handling issues for Thymol Blue?
  • 10. A LITTLE Chemistry First
  • 11. Structural Diagrams
  • 12. Structural Diagrams
  • 13. Analytical Data
  • 14. Does Stereochemistry Matter?
  • 15. Does one stereocenter matter?  Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide
  • 16. Structural Representations
  • 17. The InChI Standard
  • 18. InChIKeys: Search the Web by Structure
  • 19. I want to know about “Vincristine”
  • 20. Vincristine: Identifiers and Properties
  • 21. Vincristine: Vendors and Sources
  • 22. Vincristine: Patents
  • 23. Chemical Names and Synonyms VALIDATION OF NAMES
  • 24. Validated Names for Searching…
  • 25. Information System Architecture (diagram components: Input, Filtering, Curation, Archival, Presentation, Storage, Indexing, API, Processing, Search, Browse)
  • 26. The Quality of Chemical Data Online: What is the Structure of Vitamin K? A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria & synthetic naphthoquinone provitamins, VITAMIN K3 (menadione).
  • 27. What is the Structure of Vitamin K1?
  • 28. What is the Structure of Vitamin K1?
  • 29. CAS’s Common Chemistry
  • 30. Wikipedia
  • 31. Wolfram Alpha
  • 32. DailyMed
  • 33. People Use Trusted Resources…
  • 34. Just Yesterday…
  • 35. How will it improve? Participation and contribution
  • 36. ALL Different, ALL “Domoic Acids”
  • 37. ALL Different, ALL “Domoic Acids”
  • 38. The EXPERTS must get it right?!
  • 39. Question Everything Online: www.dhmo.org
  • 40. Deposition, Annotation and Validation
      ∗ ANYBODY can annotate a record on ChemSpider
      ∗ Registered users can deposit new data
      ∗ Registered users can validate existing data
  • 41. CURATION Search “Vitamin H”
  • 42. “Curate” Identifiers
  • 43. “Curate” Identifiers
  • 44. ChemSpider Web Services
  • 45. Open APIs for Science
      ∗ ChemSpider via web service access
          ∗ For structure identification for mass spectrometry
          ∗ For name and structure resolution
          ∗ For structure and substructure searching
          ∗ For an “innovative medicines initiative” semantic web project…
  • 46. Open PHACTS Project
      ∗ Develop a set of robust standards
      ∗ Integrate Chemistry and Biology data by implementing the standards in a semantic integration hub
      ∗ Deliver services to support drug discovery programs in pharma and public domain
      ∗ INITIALLY 22 partners, 8 pharmaceutical companies, 3 biotechs
      ∗ 36-month project – first public release version is imminent
      Guiding principle is open access, open usage, open source: key to standards adoption
  • 47. RDF and the semantic web
      ∗ Using RDF permalinks
          ∗ http://www.chemspider.com/Chemical-Structure.7787.rdf
      ∗ Using a Search Term
          ∗ http://www.chemspider.com/rdf.ashx?q=cyclohexane
          ∗ http://rdf.chemspider.com/cyclohexane
  • 48. RDF and the semantic web
  • 49. www.SpectralGame.com, http://www.jcheminf.com/content/1/1/9
  • 50. The World of Contribution
      ∗ Times have changed
          ∗ Immediacy of social networks
          ∗ Commenting on articles/data is here
          ∗ The “participating scientist” has high profile
          ∗ And who can be a scientist now???
  • 51. A Ten Year Old Scientist
  • 52. Challenging a Publication
  • 53. Oops…
  • 54. >2 Years to Resolution
  • 55. What of Hexacyclinol?
  • 56. The Blogosphere “Discusses”…
  • 57. Oxidation by Sodium Hydride?
  • 58. The Blogosphere Analyzes…
  • 59. The Blogosphere Analyzes…
  • 60. How much is in the archives?
  • 61. Open Notebook Science Analysis
  • 62. Motivation: Faster Science, Better Science
  • 63. Openness – Still Carries Licensing
      ∗ Openness may be hard..
      ∗ Open Access flavors
      ∗ Open Source licenses
      ∗ Open Data licenses
      ∗ Open Notebook Science
  • 64. We Suggest Rules for Licensing Data
      ∗ License data based on GOALS: scientific, commercial, or mixed
      ∗ Explore the benefits of open licensing and drawbacks of enclosure
      ∗ Provide simple explanations of terms of use
      ∗ If you can’t make the data public domain, make the metadata public domain.
  • 65. We Suggest Rules for Licensing Data
  • 66. Challenged in the Twittersphere
  • 67. Annotating Articles Today…
  • 68. Attribution to me…
  • 69. Other Publications to Annotate…
  • 70. Other Publications to Annotate…
  • 71. Publications to Annotate…
      “We then established a collaboration with professor Sum Ting Wong, a fugitive from the North Korean University Hu Yu Hai Ding”
      “..identified as the new protein Wai So Dim”
  • 72. A New World for Publishing?
  • 73. An Adventure into the World of Small but significant contribution..
  • 74. ChemSpider SyntheticPages
  • 75. Micropublishing with Peer Review (a chemical synthesis blog?)
  • 76. Multi-Step Synthesis
  • 77. Interactive Data
  • 78. A New Route for Scientific Recognition?
  • 79. The Measure of a Scientist?
      ∗ How do “we” measure a scientist?
      ∗ The funding bodies, department heads etc. use
          ∗ Publication profile
          ∗ Impact factors
          ∗ An index – h, m, g, i10, c, s …
          ∗ Grants brought in
      ∗ Scientists are notable in different ways – technology can help measure different types of “impact”
  • 80. What makes a Scientist Notable?
  • 81. Public Profiles of Scientists
      ∗ Online tools track activities of scientists
      ∗ Some are totally opt-in, an increasing number are about you and need checking!
      ∗ Take responsibility for your profile online
      ∗ Actively BUILD your online profile
  • 82. Microsoft Academic Search
  • 83. My Academic Search Profile
  • 84. My Co-author Graph
  • 85. Q: How Often Do You Contribute? Annotation and Validation
      ∗ How many times do you see errors where:
          ∗ 1) You have not been able to annotate or curate
          ∗ 2) You have chosen not to annotate or curate
  • 86. My Co-author Graph
  • 87. Contribute when you can!
  • 88. Contribute when you can!
  • 89. Scientists and Orcids?
  • 90.
      ∗ A unique identifier for a scientist – a Scientist’s InChI!
      ∗ Will enable aggregation of a scientist’s activities
      ∗ ORCIDs associated with publications, data, blog comments, other contributions (Wikipedia, reviews etc.) will be a way to measure their impact
  • 91. The Alt-Metrics Manifesto
      ∗ http://altmetrics.org/manifesto/
  • 92. ImpactStory
  • 93. ImpactStory
  • 94. SlideShare
  • 95. SlideShare via ImpactStory
  • 96. ImpactStory
  • 97. Where do I contribute? How might I be measured?
  • 98. Article Level Metrics
  • 99. Article Level Metrics
  • 100. New Measures of Impact
      ∗ Impact will be an aggregate measure of
          ∗ Publications – classic measures and article level metrics
          ∗ Data, algorithms and code – and its distribution and reuse
          ∗ Contributions as comments, annotation and curation activities
      ∗ New “impact factors” will develop with time
  • 101. The Challenges
      ∗ Some challenges are technology based
          ∗ The growth in data – storage and compute speed
          ∗ Ontologies, dictionaries and trusted sources
      ∗ Many challenges are “about us”
          ∗ Licenses and rights
          ∗ Rewards and recognition
          ∗ Participation, contribution and collaboration
  • 102. Tear Down Walls between Government Labs
      ∗ There are many government institutions building public compound databases that should collaborate more:
          ∗ National Cancer Institute (NCI)
          ∗ National Institutes of Health (NIH)
          ∗ Environmental Protection Agency (EPA)
          ∗ Food and Drug Administration (FDA)
          ∗ National Library of Medicine (NLM)
  • 103. Release STRUCTURES Please!
  • 104. What Does the Future Hold?
  • 105. The Linked Network Will Grow
  • 106. The Data Deluge Will Not Go Away
  • 107. RSC Activities in Development
      ∗ Deliver a Global Chemistry Hub
      ∗ “Data enable” the RSC archive back to 1841:
          ∗ Extract chemistry – chemicals, reactions, experimental data points, complex data
          ∗ Enrich the articles for interactive viewing and crowdsourced annotation and curation
          ∗ Enhance queries possible across the archive
  • 108. Federated Data Segregation
  • 109. Future System Architecture (diagram; recoverable labels: Input, Filtering, Curation, Archival, Presentation, Storage, Indexing, API, Processing, Search, Browse; annotations: smarter algorithms, new algorithms, elastic and distributed storage, complexity is hidden, search and processing over federated, distributed systems)
  • 110. Data Validation is Exacting Work
  • 111. “Challenge” the Community
  • 112. Chemistry Data at RSC
      ∗ Chemistry is NOT just small molecules!
      ∗ Data in RSC publications will be “enabled”
      ∗ Data available for validation and curation
      ∗ The delivery of the “Datument”
      ∗ Data will be fed to models for validation, to retrain the models, full provenance retained
      ∗ Algorithms will be provided to the community
  • 113. Enhanced Mark-Up?
  • 114. An Error in my Abstract?
  • 115. An Error in my Abstract? Chemists have embraced the web as a rich source of data and knowledge. However, all that glisters is not gold
  • 116. Thanks Shakespeare
  • 117. Acknowledgments
      ∗ RSC and RSC|Cheminformatics team
      ∗ All data source providers, curators and annotators
      ∗ All software providers: commercial and open source
      ∗ Contributors, curators, collaborators
      ∗ Trusted Advisors: Jean-Claude Bradley, Sean Ekins, Lee Harland, Gary Martin, Martin Walker and…
  • 118. Meet Valery… We’d love to chat…
  • 119. Thank you
      Email: williamsa@rsc.org
      Twitter: ChemConnector
      Personal Blog: www.chemconnector.com
      SLIDES: www.slideshare.net/AntonyWilliams
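
Slide 47 lists three URL patterns for retrieving RDF from ChemSpider: a permalink keyed on a record ID and two forms keyed on a search term. The sketch below is a minimal illustration, assuming those patterns exactly as they appear on the slide. The function names are hypothetical and are not part of any ChemSpider API, and the code only assembles the strings; it does not check whether the endpoints still respond.

```python
from urllib.parse import quote, urlencode

# Base addresses copied from the URLs shown on slide 47.
CHEMSPIDER_BASE = "http://www.chemspider.com"
CHEMSPIDER_RDF_BASE = "http://rdf.chemspider.com"


def rdf_permalink(record_id: int) -> str:
    """Build the RDF permalink form for a numeric record ID (hypothetical helper)."""
    return f"{CHEMSPIDER_BASE}/Chemical-Structure.{record_id}.rdf"


def rdf_search_url(term: str) -> str:
    """Build the rdf.ashx query form for a search term (hypothetical helper)."""
    # urlencode percent-escapes the term so names with spaces stay valid.
    return f"{CHEMSPIDER_BASE}/rdf.ashx?{urlencode({'q': term})}"


def rdf_short_url(term: str) -> str:
    """Build the rdf.chemspider.com path form for a search term (hypothetical helper)."""
    # quote with safe="" escapes every reserved character, including "/".
    return f"{CHEMSPIDER_RDF_BASE}/{quote(term, safe='')}"


if __name__ == "__main__":
    print(rdf_permalink(7787))
    print(rdf_search_url("cyclohexane"))
    print(rdf_short_url("cyclohexane"))
```

Run as a script, this prints the three URLs from the slide for record 7787 and the term cyclohexane. Escaping the search term is a design choice made here, not something the slide specifies.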