Enhancing Discoverability Across Royal Society of Chemistry Content by Integrating to ChemSpider, an Online Database of Chemical Structures

The ability to query across a chemistry publisher’s content using chemical structure searching can dramatically enhance discoverability. RSC has been applying a number of procedures to integrate its ChemSpider community resource with its published content and databases. These include: 1) entity extraction procedures; 2) chemical name conversion procedures using software algorithms and curated dictionaries; 3) semantic markup; and 4) crowdsourced curation processes. This presentation will provide an overview of the processes we have used to provide structure-based integration to RSC content. We will discuss our ongoing efforts to extend these approaches to mining data from the rich supplementary information sections of many RSC publications. Our intention is to provide access to synthesis procedures and analytical data, and to further enrich the ChemSpider database for the benefit of the chemistry community.

Published in: Technology

  1. Enhancing discoverability across Royal Society of Chemistry content by integrating to ChemSpider, an online database of chemical structures
  2. A Pragmatic Vision
     • “Build a Structure Centric Community to Serve Chemists”
     • Integrate chemical structure data on the web
     • Create a “structure-based hub” to information, data and algorithmic predictions
     • Let chemists contribute their own data
     • Allow the community to curate/correct data
  3. ChemSpider Today
     • Over 25 million unique compounds
     • Sourced from over 300 data sources
     • Growing daily – new compounds, annotations, data
         • Structures, text, spectra, images, movies, syntheses
     • Text searching the web is far from optimal
     • Structure searching the web is not a dream
     • The quality of data on the web is a problem
     • An example…
  4. Keep Your Plants Healthy-Looking
  5. Which is better for Plants? Vodka, Sprite or Viagra?
  6. It Works – Viagra Wins the Day
  7. Now Which is Better?
     • Viagra or Cialis?
     • Images sourced from Wikipedia
  8. Cialis
     • I want…
         • The structure
         • Any patent information
         • Related publications
         • Where can I buy it?
         • Metabolic pathway info
         • What else is easy to find…
  9. Cialis on Google?
  10. What is Cialis?
  11. What is Cialis? Can we trust Wikipedia?
  12. What is Cialis? 6 hits on PubChem
  13. What is Cialis?
  14. Search by Trade Name
  15. Search by CAS Number (from Wikipedia)
  16. Are there other names???
  17. Are there other names???
     • PubMed hits:
         • 736 Tadalafil
         • 744 Cialis
  18. Are there other names???
  19. Are there other names?
  20. Are There Other Names?
  21. IC351 on PubChem? 5 HITS for IC351, ZERO HITS for IC 351
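
The IC351 versus IC 351 result on the slide above shows how a single space can defeat a plain text search. A minimal sketch of one possible mitigation follows, assuming that synonyms are normalized before comparison. The function names and the tiny `INDEX` mapping are hypothetical and are not part of PubChem or ChemSpider.

```python
import re


def normalize_name(name: str) -> str:
    """Drop whitespace and hyphens and fold case, so that
    'IC 351', 'IC-351' and 'ic351' all compare equal."""
    return re.sub(r"[\s\-]+", "", name).casefold()


# Hypothetical mini-index: normalized synonym -> preferred name.
INDEX = {normalize_name("IC351"): "tadalafil"}


def lookup(query: str):
    """Return the preferred name for a query, or None if it is unknown."""
    return INDEX.get(normalize_name(query))


print(lookup("IC 351"))   # matches despite the embedded space
print(lookup("ic-351"))   # matches despite case and hyphen
```

This normalization suits short code names and trade names only. Removing hyphens and spaces from systematic chemical names would destroy locants and other meaningful punctuation, so a real dictionary would need per-name-type rules.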
  22. Text Searching the Web
     • Text searching the web for chemical compounds is an enormous challenge
     • RSC has multiple databases, >500,000 articles and a lot of other resources. How do we do?
  23. The RSC Publishing Platform (Beta)
  24. 2+2 = 4 Articles?
  25. CAS Number Search
  26. Text Searching the Web
     • Text searching RSC Publishing for chemical compounds to retrieve ALL hits is a challenge
     • Dictionaries of name-structure relationships could be very enabling. Creating validated dictionaries is also an enormous challenge
  27. Search ChemSpider for Cialis
  28. Cialis on ChemSpider: 1 hit
     • Chemicals are curated/validated on ChemSpider by ourselves and the community
     • Based on assertions from various sources. Iterative, time-consuming and exacting!
     • We believe we know the structure now
  29. Cialis – Searching the Web by InChI: Search Molecular SKELETON or Search Full Molecule
  30. InChI Search the Web by Skeleton: 78 Hits by Skeleton
  31. InChI Search the Web, Exact Match: 32 Hits by InChIKey
  32. InChI Search the Web, Exact Match: 6 Hits by Standard InChIKey
  33. InChIfying the Web
     • Different versions of InChI lead to complex search results
     • There are more than twice as many “skeleton” hits for Cialis as exact matches – different stereo? Mistakes?
     • Our judgment, based on the following experience: MISTAKES
  34. Vancomycin – Search the Internet
  35. Full Molecule Search: 4 Hits
  36. Full Skeleton Search: 104 Hits
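
The difference between skeleton and full-molecule hits on the slides above follows from how an InChIKey is laid out: a 14-character first block hashes the connectivity (the skeleton), a second block covers the remaining layers such as stereochemistry and carries a standard/non-standard flag, and a final character encodes protonation. The sketch below classifies a candidate key against a query key under that layout. The keys used are placeholders built for illustration, not real compound keys, and the function names are assumptions rather than anything ChemSpider exposes.

```python
def split_inchikey(key: str):
    """Split a 27-character InChIKey into (skeleton, detail, protonation)."""
    parts = key.split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not an InChIKey: {key!r}")
    return tuple(parts)


def is_standard(key: str) -> bool:
    """The ninth character of the second block is 'S' for a Standard
    InChIKey and 'N' for a non-standard one."""
    return split_inchikey(key)[1][8] == "S"


def match_level(query: str, candidate: str) -> str:
    """Return 'exact', 'skeleton' (same connectivity block only) or 'none'."""
    q, c = split_inchikey(query), split_inchikey(candidate)
    if q == c:
        return "exact"
    if q[0] == c[0]:
        return "skeleton"
    return "none"


# Placeholder keys (NOT real compounds), built to the 14-10-1 layout.
QUERY = "A" * 14 + "-" + "B" * 8 + "SA-N"
OTHER_STEREO = "A" * 14 + "-" + "C" * 8 + "SA-N"
NON_STANDARD = "A" * 14 + "-" + "B" * 8 + "NA-N"

print(match_level(QUERY, OTHER_STEREO))  # same skeleton, different second block
print(is_standard(NON_STANDARD))
```

A skeleton-only match is what inflates the hit counts on the slides: records that share connectivity but differ in stereochemistry, or that were generated with non-standard options, land in the skeleton set without being exact matches.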
  37. ChemSpider – Patents Linked: SureChem Patents, Google
  38. Google Patents
  39. Google Books
  40. Microsoft Academic Search
  41. Google Scholar – Found By CAS #
  42. Identifiers for Tadalafil
  43. Validated Registry Number: Same Result as Searching PubMed
  44. How Many Articles in RSC Journals?
     • Based on 171596-29-5 there are 13 articles in RSC journals
     • What about if we VALIDATE identifiers?
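
One cheap, format-level check that can run before any curation is the CAS Registry Number check digit: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, … counted from the right) modulo 10. The sketch below applies it to the registry number quoted on the slide. The function name is an assumption, and passing this check only rules out transcription errors; it does not confirm that the number belongs to the intended compound, which is the kind of validation the slides mean.

```python
import re


def is_valid_cas(cas: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    # Weight the rightmost digit by 1, the next by 2, and so on.
    total = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))


print(is_valid_cas("171596-29-5"))  # registry number from the slide
print(is_valid_cas("171596-29-4"))  # same digits with a wrong check digit
```

Because the pattern is matched against the whole string, invisible characters picked up when copying a number from a web page (a zero-width space, for example) also cause the check to fail, which is a useful side effect when harvesting identifiers from the web.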
  45. How Many Articles in RSC Journals?
  46. How Many Articles in RSC Journals?
  47. RSC Journals
  48. RSC Journals: REMEMBER 2+2 = 4
  49. RSC Books
  50. PubMed
  51. Google Books – Expanded Hit Set
  52. Google Scholar – Expanded Hit Set
  53. Microsoft Academic Search
  54. Microsoft Academic Search
     • More mussels than drugs…
  55. RSC Databases
  56. media.obsessable.com
     • As few interfaces as possible
     • Did we solve this problem now?
  57. What Do We Know?
     • Validated Name-Structure Dictionaries enable “structure-searching” the web.
     • Search the structure on ChemSpider and we have integrated many services online
         • NCBI Entrez
         • PubMed
         • Google Scholar, Books, Patents
         • Microsoft Academic Search
         • SureChem Patents
         • …
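
The “expanded hit set” idea on the preceding slides can be sketched as combining all validated identifiers for one structure into a single OR query that is then handed to a text search service. This is a minimal illustration only: the identifier list comes from the slides, while the function names, the `q` parameter and the `https://example.org/search` endpoint are placeholders, not the query syntax of any service named above.

```python
from urllib.parse import urlencode


def build_or_query(identifiers):
    """Quote each identifier and join with OR, dropping blanks and
    case-insensitive duplicates while preserving order."""
    seen, terms = set(), []
    for ident in identifiers:
        term = ident.strip()
        if term and term.casefold() not in seen:
            seen.add(term.casefold())
            terms.append(f'"{term}"')
    return " OR ".join(terms)


def search_url(base_url, identifiers, param="q"):
    """Build a URL for a (hypothetical) search endpoint."""
    return f"{base_url}?{urlencode({param: build_or_query(identifiers)})}"


names = ["Tadalafil", "Cialis", "IC351", "171596-29-5", "cialis"]
print(build_or_query(names))
print(search_url("https://example.org/search", names))
```

The value of the approach depends entirely on the identifier list being validated first; an unvalidated synonym widens the hit set with articles about a different compound.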
  58. Semantic Markup: Project Prospect
  59. Prospected Compound Deposition
  60. Success Depends on Dictionaries: Link to a Structure or the Right Structure?
  61. Name-Structure Pairs
  62. Semantic Linking of Structures
     • What would you want to link off a structure?
         • Chemical suppliers
         • Other publications
         • Analytical Data
         • Related Reactions
         • Wikipedia
         • Patents
         • “Everything”
  63. ChemSpider SyntheticPages
  64. Other RSC Resources…
     • Once we have validated name-structure dictionaries we can tap other RSC resources
     • There is ALWAYS a validation stage
     • Ultimately crowdsourced curation is necessary
  65. Roses’ Crystal Image Collection
  66. MP3s and Videos: Titanium
  67. Beautiful Elements
  68. Periodic Table Images
  69. Other system enhancements?
     • What ChemSpider doesn’t deal with yet...
         • Markush structures and other “non-defineds”
         • Materials
         • Minerals
         • Polymers
         • Biological macromolecules
  70. Leaving Markush to Patent Indexers
  71. What’s Next?
     • Continue the curation effort and keep cleaning
     • Enhanced integration with RSC publishing workflows and databases
     • Tighter integration to RSC databases
         • Natural Product Updates
         • Methods of Organic Synthesis
     • Use ChemSpider dictionaries to enhance markup precision and recall
  72. What’s Next?
     • Use entity extraction approaches and ChemSpider dictionaries to analyze the entire RSC archive
     • Deposit structures into ChemSpider from the backfile
     • Use crowdsourced curation approaches to optimize the results
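
Dictionary-driven entity extraction of the kind described in the two “What’s Next?” slides can be sketched as a lookup of known names in article text, with each hit mapped to a structure identifier. The example below is a simplified stand-in and not the Project Prospect or ChemSpider pipeline: the dictionary contents, the `STRUCT-…` identifiers and the function names are all hypothetical.

```python
import re

# Hypothetical validated dictionary: lower-cased name -> structure ID.
NAME_TO_STRUCTURE = {
    "tadalafil": "STRUCT-1",
    "cialis": "STRUCT-1",
    "ic351": "STRUCT-1",
    "sildenafil": "STRUCT-2",
    "viagra": "STRUCT-2",
}


def build_pattern(names):
    """Compile one case-insensitive pattern, longest names first so that
    a longer name wins over a shorter name it contains."""
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def extract_structures(text, dictionary=NAME_TO_STRUCTURE):
    """Return (matched text, structure ID) pairs in order of appearance."""
    pattern = build_pattern(dictionary)
    return [(m.group(0), dictionary[m.group(0).lower()])
            for m in pattern.finditer(text)]


sample = "Cialis (tadalafil, IC351) was compared with Viagra."
for name, structure_id in extract_structures(sample):
    print(name, "->", structure_id)
```

Several names resolving to one structure ID is what lets a structure search retrieve articles that never use the same name twice. The curation step stressed on the slides still applies, since a wrong dictionary entry links every matching article to the wrong structure.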
  73. The InChI “Resolver”
  74. InChI Resolver to DOIs: Structure Search the Web
  75. Most Chemistry is NOT Published
     • Only a fraction of chemistry is published
     • Only a tiny fraction of chemistry is patented
     • What of the “Lost Chemistry” – never published and cannot be abstracted
         • Reactions performed
         • Structures made and studied
         • Spectra acquired and then disposed of
     • ChemSpider can give it all a home…
  76. Chemistry on the Internet: FUTURE
     • The semantic web for chemistry is in place
     • Crowdsourced contributions are commonplace
     • Chemists will search by structure/substructure
     • Chemistry articles indexed and searchable
     • Reduced number of searches to find data
     • Data are integrated – compounds, vendors, syntheses, data, publications and patents
     • A world of Open Access and Open Data
     • Classical business models will have to morph
  77. Anyone from Penn State here?
     • Please see me afterwards…
  78. Thank you
     • [email_address]
     • Twitter: ChemSpiderman
     • www.chemspider.com/blog
     • SLIDES: www.slideshare.net/AntonyWilliams