How the web has weaved a web of interlinked chemistry data final


The internet has provided access to unprecedented quantities of data. In chemistry specifically, over the past decade the web has become populated with tens of millions of chemical structures and related properties of assays, together with tens of thousands of spectra and syntheses. The data have, to a large extent, remained disparate and disconnected. In recent years, with the wave of Web 2.0 participation, any chemist can contribute to both the sharing and validation of chemistry-related data, whether via Wikipedia, the online encyclopedia, or one of the multiple public compound databases. The presentation will offer a perspective on what is available today, our experiences of building a public compound database to link together the internet, and a suggested path forward for enabling even greater integration and connectivity for chemistry data, for the masses both to use and to participate in developing.

Published in: Technology, Education

  1. 1. How the Web Has Weaved a Web of Interlinked Chemistry Data. Antony Williams, ACS Anaheim, March 29th, 2011
  2. 2. Data on the Web
  3. 3. Where is Chemistry Online?
       • Property databases
       • Compound aggregators
       • Screening assay results
       • Scientific publications
       • Encyclopedic articles (Wikipedia)
       • Metabolic pathway databases
       • ADME/Tox data
       • Blogs/Wikis and Open Notebook Science
       • Contributing Open Source code to projects
  4. 4. How to Connect Chemicals…
  5. 5. Chemistry on the Internet
       • 100s of websites serving up chemistry data, SDF files of structures and data
       • RSC’s ChemSpider “links” chemistry on the internet
           • Over 25 million compounds, over 400 data sources
           • Allows community deposition, curation, annotation
           • Integrating properties, publications, patents, media
           • Text, structure, substructure, similarity searching
  6. 6.
  7. 7. Search for a Chemical
  8. 8. We Have Delivered the Vision
       • “Build a Structure Centric Community to Serve Chemists”
       • Integrate chemical structure data on the web
  9. 9. How Did We Build It?
       • We deal in Molfiles or SDF files
       • We do rudimentary filtering prior to deposition: valence checking, charge imbalance, etc.
       • We have our own “business logic” to standardize
       • We use InChI to “aggregate tautomers”
       • Link out to external sites where possible using IDs
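The filtering and aggregation steps on this slide can be sketched as follows. This is a minimal illustration, not the ChemSpider pipeline: it assumes each deposited record already carries a precomputed InChI string, it reads charges only from the `M  CHG` property lines of a V2000 Molfile (charges given in the atom block are ignored), and the names `net_charge`, `aggregate_by_inchi`, and the `source_*` labels are hypothetical.

```python
from collections import defaultdict


def net_charge(molfile: str) -> int:
    """Sum the formal charges listed on 'M  CHG' lines of a V2000 Molfile."""
    total = 0
    for line in molfile.splitlines():
        if line.startswith("M  CHG"):
            # Drop 'M', 'CHG' and the entry count; the rest is atom/charge pairs.
            fields = line.split()[3:]
            # Every second field (index 1, 3, ...) is a charge value.
            total += sum(int(value) for value in fields[1::2])
    return total


def aggregate_by_inchi(records):
    """Group deposited records that share the same InChI string."""
    groups = defaultdict(list)
    for record in records:
        groups[record["inchi"]].append(record["source"])
    return dict(groups)


# Illustrative records; the source labels are made up for this sketch.
records = [
    {"source": "source_a", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"source": "source_b", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"source": "source_c", "inchi": "InChI=1S/H2O/h1H2"},
]
merged = aggregate_by_inchi(records)

# A zwitterion-style charge line: +1 on atom 1 and -1 on atom 4.
balanced = net_charge("M  CHG  2   1   1   4  -1\nM  END")
```

A record whose net charge is non-zero could be flagged for review before deposition, and records that land in the same InChI group could be shown on a single compound page with links back to each source.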
  10. 10. Inherited Errors
       • We have inherited errors
       • All public compound databases, including ours, have errors
       • “Incorrect” structures: assertions, timelines, etc.
       • “Incorrect” names associated with structures
       • Properties
       • Links
       • Publications
       • ENORMOUS CHALLENGE
  11. 11. The Structure of Vitamin K?
  12. 12. MeSH
       • A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
  13. 13. The Structure of Vitamin K1?
  14. 14. What is the Structure of Vitamin K1?
  15. 15. CAS’s Common Chemistry
  16. 16. Wikipedia
  17. 19. ChEBI – Manual Curation
  18. 22. PubChem
  19. 24.
       • “2-methyl-3-(3,7,11,15-tetramethyl hexadec-2-enyl)naphthalene-1,4-dione”
       • Variants of systematic names on PubChem
       • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
       • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
       • 2-methyl-3-(3,7,11,15-tetramethyl
       • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  20. 25. Public Domain Databases
       • Our databases are a mess…
       • Non-curated databases are proliferating errors
       • We source and deposit data between databases
       • Original sources of errors hard to determine
       • Curation is time-consuming and challenging
  21. 27.
       • Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
  22. 29. To report at Denver ACS…
       • An examination of quality in databases: inter/intra lab comparison of processes for 150 drugs
       • Five separate organizations, 8 individuals
       • The Wikipedia List of the “200 Top Selling Drugs”
  23. 31. Vytorin: Ezetimibe/Simvastatin
  24. 32. Vytorin: Ezetimibe/Simvastatin
  25. 33. Vytorin: Ezetimibe/Simvastatin
  26. 34. Vytorin: Ezetimibe/Simvastatin
  27. 35. Vytorin: Ezetimibe/Simvastatin
  28. 36. Taxol: Paclitaxel 44 structures
  29. 37. Taxol: Paclitaxel Bioassay Data
  30. 38. Taxol: Paclitaxel Bioassay Data
       • Most Bioassay data associated with structure with one ambiguous stereocenter
  31. 39.
       | Drug Name | Generic Name | ChEBI | ChemSpider | CAS Com. Chem | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia |
       | Spiriva | Tiotropium Bromide | No Hits | | No Hits | | | | 4/0 | |
       | Depakote | Valproate semisodium | | | | | | | | No Structure |
       | Basen | Voglibose | | | No Hits | | No Hits | | 2/1 | |
       | Symbicort | 1) Budesonide | | | | | | | 8/1 | |
       | Symbicort | 2) Formoterol | WRONG | | No Hits | | | | 6/1 | |
       | Vytorin | 1) Ezetimibe | | | No Hits | | | | | |
       | Vytorin | 2) Simvastatin | | | | | | | 2/1 | |
       | Taxol | Paclitaxel | | | | | | | 44/1 | |
       | Thalidomid | Thalidomide | No Hits | | | | | | | |
       | Zocor | Simvastatin | | | | | | | 2/1 | |
       | Crestor | Rosuvastatin | | | No Hits | | | | 2/1 | |
  32. 40. Entity-Extraction and Mark-up
  33. 41. Entity-Extraction and Mark-up
  34. 42. Success Depends on Dictionaries
  35. 43. Nature Chemistry
  36. 44. RSC Prospect
  37. 45. Validated “Dictionaries”
       • The following resources do NOT have structures to link to ChemSpider…but are linked:
           • Google Scholar
           • PubMed
           • DailyMed
           • RSC Databases and Backfile
       • How did we link these resources to ChemSpider? Validated Name Look-up!
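A validated name look-up of the kind this slide describes can be sketched as a dictionary match over free text. This is only an illustration under stated assumptions: the three-entry dictionary, the function name `find_chemical_names`, and the use of InChI strings as the linked identifier are all hypothetical, and a production dictionary would be far larger and curated.

```python
import re

# Hypothetical validated dictionary: chemical name -> structure identifier.
VALIDATED_NAMES = {
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "ethyl alcohol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "water": "InChI=1S/H2O/h1H2",
}


def find_chemical_names(text, dictionary=VALIDATED_NAMES):
    """Return (name, identifier) pairs for dictionary names found in text."""
    hits = []
    # Check names in alphabetical order so the result order is predictable.
    for name in sorted(dictionary):
        # Whole-word, case-insensitive match of the validated name.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append((name, dictionary[name]))
    return hits


hits = find_chemical_names("The sample was dissolved in Ethanol and water.")
```

Because both "ethanol" and "ethyl alcohol" map to the same identifier, text-only resources that use either synonym would link to the same structure record, which is why the quality of the dictionary matters so much.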
  38. 46. Extend the Vision
       • “Build a Structure Centric Community to Serve Chemists”
       • Integrate chemical structure data on the web
       • Create a “structure-based hub” to information, data and algorithmic predictions
  39. 47. Integrate other services…
       • We will integrate with systems of value to the community
       • Many interfaces now available for integration
           • NMRShiftDB
           • ACD/Labs Name Generation
           • ChemAxon Chemicalize
           • What others???
  40. 49. Extend the Vision
       • “Build a Structure Centric Community to Serve Chemists”
       • Integrate chemical structure data on the web
       • Create a “structure-based hub” to information, data and algorithmic predictions
       • Let chemists contribute their own data
       • Allow the community to curate/correct data
  41. 50. How Did We Build It (cont.)
       • Ask users to add…
           • Descriptions/Syntheses/Commentaries
           • Links to PubMed articles
           • Links to articles via DOIs
           • Add spectral data
           • Add Crystallographic Information Files
           • Add photos
           • Add MP3 files
           • Add Videos
  42. 51. Complex Data and Information
  43. 53. Kind Contributions!
  44. 54. Crowdsourcing “Vitamin H”
  45. 55. “Curate” Identifiers
  46. 56. “Curate” Identifiers
  47. 57. Crowdsourcing Works
       • >130 people have deposited data and participated in data curation
       • Different level curators check each other
       • More curators and depositors are encouraged!
  48. 58. Accessibility and Reuse
       • It’s a shame to go it alone!!!
       • Can we “collectively” improve the quality of chemistry on the Internet?
  49. 59. All DBs should take comments!
  50. 60. Proof-of-concept curation sharing
       • Presently collaborating with DrugBank to enable “curation sharing”
       • Setting up services for monitoring curations and edits, starting with “identifiers”
  51. 61. The Social Network
       • Career-wise NOT having a personal presence online will be a detriment
           • Self-marketing
           • Establishing a profile
           • Getting on the record
           • Collaborative Science
           • Demonstrating a skill set
           • Measured using alternative metrics
           • Contributing to the public peer review process
  52. 62. Social Networking Tools
       • A growing number of social networking tools:
           • Facebook
           • Twitter
           • LinkedIn
           • Flickr
           • YouTube
           • Blogs
           • Communities
           • Collaborative environments
  53. 63. Chemistry Social Networking
       • Methods of sharing MY chemistry online include:
           • Wikis or blogs
           • Slideshare for presentations
           • YouTube for videos
           • Flickr, Wikimedia etc. for images
           • PubChem for assay data
           • NMRShiftDB for NMR assignments
           • GoogleDocs for data
  54. 64. Drivers in the Social Network
       • Anonymity is a choice in the social networks
       • Anonymity in peer-review will likely become less important and may be generational
       • I may want acknowledgment if…
           • I share my data
           • I review a paper
           • I share my expertise
  55. 65. The Alt-Metrics Manifesto
  56. 66. Enabled by ORCID…
  57. 67. What will enhance OUR network?
       • The “semantic web”
       • Mobile technologies
       • More participation
       • Use of standards: JCAMP, InChI
  58. 68. RDF and the semantic web
       • Using RDF permalinks
       • Using a Search Term
  59. 70. Enabled through InChIs
  60. 71. Mobile Support
  61. 72. Licensing “My Work” Online
       • The complex nature of licensing “my” chemistry
           • Blogs: copyrighted and Creative Commons
           • Wikis: mixed licensing, depends on the host(s)
           • Data: much value in sharing data as “Open Data”
       • Often, people can make money from your work!
       • Police your own “licensing”: how many people have read the Facebook and Twitter agreements?!
  62. 73. Who declares data as Open?
       • Data licensing is very interesting and can spark “interesting” conversations. Opinions differ:
           • Are images data? Are assertions data?
           • What on a ChemSpider record is data?
       • We allow people to declare their data as Open and add an Open Data button at upload
  63. 74. Acknowledgments
       • RSC|ChemSpider team
       • All data source providers
       • >100 curators and annotators, and growing…
       • Service providers:
           • ACD/Labs
           • ChemAxon
           • GGA Software Services
           • Google
           • PubMed
           • …
  64. 75. ChemSpider Training Session
       • ChemSpider: A Community Resource for Chemical Data
       • Wednesday, March 30th
       • 8:30-11:00 AM
       • Anaheim Convention Center, Room 211 A