Enhancing Discoverability Across Royal Society Of Chemistry Content By Integrating To Chem Spider, An Online Database Of Chemical Structures

The ability to query across a chemistry publisher's content using chemical structure searching can dramatically enhance discoverability. RSC has been applying a number of procedures to integrate its ChemSpider community resource with our published content and databases. These include: 1) entity extraction procedures, 2) chemical name conversion procedures using software algorithms and curated dictionaries, 3) semantic markup, and 4) crowdsourced curation processes. This presentation will provide an overview of the processes we have used to provide structure-based integration to RSC content. We will discuss our ongoing efforts to extend these approaches to mining data from the rich supplementary information sections of many RSC publications. Our intention is to provide access to synthesis procedures and analytical data and to further enrich the ChemSpider database for the benefit of the chemistry community.

Presentation Transcript

  • Enhancing discoverability across Royal Society of Chemistry content by integrating to ChemSpider, an online database of chemical structures
  • A Pragmatic Vision
      • “Build a Structure Centric Community to Serve Chemists”
      • Integrate chemical structure data on the web
      • Create a “structure-based hub” to information, data and algorithmic predictions
      • Let chemists contribute their own data
      • Allow the community to curate/correct data
  • ChemSpider Today
    • Over 25 million unique compounds
    • Sourced from over 300 data sources
    • Growing daily – new compounds, annotations, data
      • Structures, text, spectra, images, movies, syntheses
    • Text searching the web is far from optimal
    • Structure searching the web is not a dream
    • The quality of data on the web is a problem
    • An example…
  • Keep Your Plants Healthy-Looking
  • Which is better for Plants? Vodka, Sprite or Viagra?
  • It Works – Viagra Wins the Day
  • Now Which is Better?
    • Viagra or Cialis?
    • Images sourced from Wikipedia
  • Cialis
    • I want…
        • The structure
        • Any patent information
        • Related publications
        • Where can I buy it?
        • Metabolic pathway info
        • What else is easy to find…
  • Cialis on Google?
  • What is Cialis?
  • What is Cialis? Can we trust Wikipedia?
  • What is Cialis? 6 hits on PubChem
  • What is Cialis?
  • Search by Trade Name
  • Search by CAS Number (from Wikipedia)
  • Are there other names???
  • Are there other names???
    • PubMed hits:
      • 736 Tadalafil
      • 744 Cialis
  • Are there other names???
  • Are there other names?
  • Are There Other Names?
  • IC351 on PubChem?
    • 5 hits for “IC351”
    • Zero hits for “IC 351”
  • Text Searching the Web
    • Text searching the web for chemical compounds is an enormous challenge
    • RSC has multiple databases, >500,000 articles and a lot of other resources. How do we do?
  • The RSC Publishing Platform (Beta)
  • 2+2 = 4 Articles?
  • CAS Number Search
  • Text Searching the Web
    • Text searching RSC Publishing for chemical compounds to retrieve ALL hits is a challenge
    • Dictionaries of name-structure relationships could be very enabling. Creating validated dictionaries is also an enormous challenge
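
To illustrate the idea of a name-structure dictionary, here is a minimal sketch in Python. It is not how ChemSpider or the RSC platform is implemented. The class name, the `normalize` rule, and the record ID `record:tadalafil` are assumptions made for illustration; the synonyms are the ones that appear in these slides. Every known name maps to one structure record, so a query for any name can be expanded to all of them, and a spacing variant such as "IC 351" still resolves.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and drop whitespace/hyphens so 'IC 351' and 'IC351' collide."""
    return re.sub(r"[\s\-]+", "", name).lower()


class NameStructureDictionary:
    """Hypothetical name -> structure-record lookup (illustrative only)."""

    def __init__(self):
        self._name_to_record = {}
        self._record_to_names = defaultdict(list)

    def add(self, record_id, names):
        """Register every name as a synonym of one structure record."""
        for name in names:
            self._name_to_record[normalize(name)] = record_id
            self._record_to_names[record_id].append(name)

    def lookup(self, name):
        """Return the record ID for a name, or None if the name is unknown."""
        return self._name_to_record.get(normalize(name))

    def expand(self, name):
        """Return all synonyms of the matched record (or just the input name)."""
        record_id = self.lookup(name)
        if record_id is None:
            return [name]
        return list(self._record_to_names[record_id])


names = NameStructureDictionary()
# "record:tadalafil" is a made-up identifier, not a real ChemSpider ID.
names.add("record:tadalafil", ["Tadalafil", "Cialis", "IC351", "171596-29-5"])

print(names.lookup("IC 351"))   # resolves despite the extra space
print(names.expand("cialis"))   # every synonym, usable as an OR query
```

The hard part the slides point to is not the lookup but deciding which name-structure pairs are correct in the first place; this sketch assumes that validation has already happened.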
  • Search ChemSpider for Cialis
  • Cialis on ChemSpider : 1 hit
    • Chemicals are curated/validated on ChemSpider by ourselves and the community
    • Based on assertions from various sources. Iterative, time-consuming and exacting!
    • We believe we know the structure now
  • Cialis – Searching the Web by InChI
    • Search molecular skeleton
    • Search full molecule
  • InChI Search the Web by Skeleton: 78 hits
  • InChI Search the Web, Exact Match: 32 hits by InChIKey
  • InChI Search the Web, Exact Match: 6 hits by Standard InChIKey
  • InChifying the Web
    • Different versions of InChI lead to complex search results
    • There are more than 2X as many “skeleton” hits for Cialis as exact matches – different stereo? Mistakes?
    • Our judgment…based on the following experience. MISTAKES
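
The difference between a skeleton search and an exact-match search can be sketched from the layout of an InChIKey: the first 14-character block hashes the connectivity (the skeleton), the second block hashes the remaining layers such as stereochemistry and ends with a standard/non-standard flag and a version character, and the last character encodes protonation. The Python below is a rough sketch under that assumption; the function names are hypothetical, and the sample keys are illustrative inputs (intended as stereo variants of alanine), not data from these slides.

```python
import re
from collections import defaultdict

# 14 letters, 8 letters + S/N flag + version letter, 1 protonation letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key: str) -> dict:
    """Split an InChIKey into its skeleton block and standard flag."""
    key = key.strip()
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, _layers, flag, _version, _protonation = match.groups()
    return {"full": key, "skeleton": skeleton, "is_standard": flag == "S"}


def group_by_skeleton(keys):
    """Group keys that share a connectivity block (a 'skeleton' match)."""
    groups = defaultdict(set)
    for key in keys:
        groups[parse_inchikey(key)["skeleton"]].add(key.strip())
    return dict(groups)


# Illustrative keys: same first block, different second block.
sample = [
    "QNAYBMKLOCPYGJ-REOHASRDSA-N",
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",
]
groups = group_by_skeleton(sample)
print(len(groups))                            # one skeleton
print(len(groups["QNAYBMKLOCPYGJ"]))          # three distinct full keys
```

A skeleton search corresponds to matching only the first block, so it returns every stereo variant and every mis-drawn stereo form together; an exact match compares the whole key and also separates standard from non-standard keys.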
  • Vancomycin – Search the Internet
  • Full Molecule Search: 4 Hits
  • Full Skeleton Search: 104 Hits
  • ChemSpider – Patents Linked: SureChem Patents, Google
  • Google Patents
  • Google Books
  • Microsoft Academic Search
  • Google Scholar – Found By CAS #
  • Identifiers for Tadalafil
  • Validated Registry Number: Same Result as Searching PubMed
  • How Many Articles in RSC Journals ?
    • Based on 171596-29-5 there are 13 articles in RSC journals
    • What about if we VALIDATE identifiers?
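
One small, mechanical part of identifier validation can be shown in code: a CAS Registry Number carries a check digit, computed by weighting the other digits 1, 2, 3, … from right to left and taking the sum modulo 10. The Python sketch below checks only that. It is an assumption that a checksum test like this would be one step of validation; it says nothing about whether the number belongs to the right structure, which is the curation problem the slides are about. The function name is hypothetical.

```python
import re


def is_valid_cas(cas: str) -> bool:
    """Check the format and check digit of a CAS Registry Number string."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("171596-29-5"))  # True: the number quoted in these slides
print(is_valid_cas("171596-29-6"))  # False: wrong check digit
```

A passing checksum only rules out typos; a well-formed number can still be attached to the wrong compound in a source.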
  • How Many Articles in RSC Journals ?
  • How Many Articles in RSC Journals ?
  • RSC Journals
  • RSC Journals: Remember 2+2 = 4
  • RSC Books
  • PubMed
  • Google Books – Expanded Hit Set
  • Google Scholar – Expanded Hit Set
  • Microsoft Academic Search
  • Microsoft Academic Search
    • More mussels than drugs…
  • RSC Databases
  • media.obsessable.com
    • As few interfaces as possible
    • Did we solve this problem now?
  • What Do We Know?
    • Validated Name-Structure Dictionaries enable “structure-searching” the web.
    • Search by structure on ChemSpider, where we have integrated many online services
      • NCBI Entrez
      • PubMed
      • Google Scholar, Books, Patents
      • Microsoft Academic Search
      • SureChem Patents
      • …
  • Semantic Markup: Project Prospect
  • Prospected Compound Deposition
  • Success Depends on Dictionaries: Link to a Structure or the Right Structure?
  • Name-Structure Pairs
  • Semantic Linking of Structures
    • What would you want to link off a structure?
      • Chemical suppliers
      • Other publications
      • Analytical Data
      • Related Reactions
      • Wikipedia
      • Patents
      • “Everything”
  • ChemSpider SyntheticPages
  • Other RSC Resources…
    • Once we have validated name-structure dictionaries we can tap other RSC resources
    • There is ALWAYS a validation stage
    • Ultimately crowdsourced curation is necessary
  • Roses’ Crystal Image Collection
  • MP3s and Videos : Titanium
  • Beautiful Elements
  • Periodic Table Images
  • Other system enhancements?
    • What ChemSpider doesn’t deal with yet...
      • Markush structures and other “non-defineds”
      • Materials
      • Minerals
      • Polymers
      • Biological macromolecules
  • Leaving Markush to Patent Indexers
  • What’s Next?
    • Continue the curation effort and keep cleaning
    • Enhanced integration with RSC publishing workflows and databases
    • Tighter integration to RSC databases
      • Natural Product Updates
      • Methods of Organic Synthesis
    • Use ChemSpider dictionaries to enhance markup precision and recall
  • What’s Next?
    • Use entity extraction approaches and ChemSpider dictionaries to analyze the entire RSC archive
    • Deposit structures into ChemSpider from the backfile
    • Use crowdsourced curation approaches to optimize the results
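
As a rough illustration of dictionary-driven entity extraction, the Python sketch below scans a passage for names found in a name-to-record dictionary and reports where each one occurs. It is a simplified stand-in, not the extraction pipeline used on the RSC archive. The function name, the record ID `record:tadalafil`, and the sample sentence are assumptions made for the example.

```python
import re


def extract_entities(text, dictionary):
    """Find dictionary names in text.

    dictionary maps a chemical name to a (hypothetical) structure record ID.
    Returns a list of (start, end, matched_text, record_id) tuples.
    """
    if not dictionary:
        return []
    by_lower = {name.lower(): record for name, record in dictionary.items()}
    # Longest names first, so a longer name wins over a name it contains.
    names = sorted(by_lower, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        record_id = by_lower[match.group(0).lower()]
        hits.append((match.start(), match.end(), match.group(0), record_id))
    return hits


# Made-up record ID and sample sentence, for illustration only.
dictionary = {
    "Tadalafil": "record:tadalafil",
    "Cialis": "record:tadalafil",
    "IC351": "record:tadalafil",
}
text = "Tadalafil (Cialis, IC351) was compared with sildenafil."
for hit in extract_entities(text, dictionary):
    print(hit)
```

Names missing from the dictionary (here, sildenafil) are silently skipped, which is why recall depends on dictionary coverage and why the extracted hits still need a curation pass.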
  • The InChI “Resolver”
  • InChI Resolver to DOIs: Structure Search the Web
  • Most Chemistry is NOT Published
    • Only a fraction of chemistry is published
    • Only a tiny fraction of chemistry is patented
    • What of the “Lost Chemistry” – never published, so it cannot be abstracted?
      • Reactions performed
      • Structures made and studied
      • Spectra acquired and then disposed of
    • ChemSpider can give it all a home…
  • Chemistry on the Internet FUTURE
    • The semantic web for chemistry is in place
    • Crowdsourced contributions are commonplace
    • Chemists will search by structure/substructure
    • Chemistry articles indexed and searchable
    • Reduced number of searches to find data
    • Data are integrated – compounds, vendors, syntheses, data, publications and patents
    • A world of Open Access and Open Data
    • Classical business models will have to morph
  • Anyone from Penn State here?
    • Please see me afterwards…
  • Thank you
    • [email_address]
    • Twitter: ChemSpiderman
    • www.chemspider.com/blog
    • Slides: www.slideshare.net/AntonyWilliams