Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry
The original abstract for the talk is below, but the talk itself changed in response to strong interest in InChI and its possible use in a Semantic Web for chemistry.

The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. However, freedom has a cost, and in many cases that cost is quality. ChemSpider is a free-access website for chemists, built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information, at present over 21.5 million unique chemical entities from over 150 separate data sources, ChemSpider has taken on the task of curating publicly available data sources both robotically and manually. This presentation will provide an overview of how a curated platform can become the centralized hub for resourcing information about chemical entities. We will also present ChemMantis, an entity extraction platform that extracts chemical names and scientific terms from documents and supports structure-based searching of Open Access chemistry literature.

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry Antony Williams
    • The Language of Chemistry
      • My language….
    • And its dialects….
    • From Yesterday
      • Approaches to linking data
      • RDF’ing, OWL’ing, SPARQL’ing
      • Triples and stores
      • All are appropriate technologies….
      • Online data linked to by the pharma industry
        • DrugBank, PubChem, DailyMed, KEGG, ChEBI
      • But what of the Quality of data?
    • Question Everything www.dhmo.org
    • PubChem
    • Quality is a Major Issue: Search Butanol
    • Caution! Question Everything!
    • The FDA’s DailyMed
    • Quality of Structures!!!
    • Quality of Structures
      • If the “Authority” isn’t doing the work to curate then who will?
    • Collaborative Knowledge Management for Chemists
    • DrugBank
    • Taxol on PubChem
    • DailyMed
    • The InChI Identifier
    • Multiple Layers
      • Source: Unofficial InChI FAQ page
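The layered structure of an InChI string can be seen by splitting it on its `/` separators. The sketch below is a simplified illustration, not part of the talk: the function name `split_inchi_layers` is hypothetical, and it assumes each layer after the formula starts with a one-letter prefix (for example `c` for connectivity, `h` for hydrogens, `t`, `m` and `s` for stereochemistry). It does not handle strings where a prefix repeats, as can happen in fixed-hydrogen or isotopic sections.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed layers.

    Simplified sketch: later layers with a repeated prefix overwrite earlier ones.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # The first character names the layer; the rest is its content.
            layers[part[0]] = part[1:]
    return layers


# Ethanol, used here only as a small, well-known example.
ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```

Two records that agree on the formula and `c` layers but differ in a later layer describe the same skeleton with different detail, which is the situation the Taxol slides explore.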
    • InChI Strings Hash to InChIKeys
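The idea of hashing an InChI string down to a short, fixed-length key can be illustrated with a toy version. This is explicitly NOT the official InChIKey algorithm, which uses a truncated SHA-256 digest with its own letter encoding plus flag and check characters. The names `letters` and `toy_key`, the choice of which layers count as "skeleton", and the byte-to-letter mapping are all assumptions made for illustration.

```python
import hashlib
import string


def letters(text: str, count: int) -> str:
    """Map the first `count` bytes of a SHA-256 digest to uppercase letters."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "".join(string.ascii_uppercase[b % 26] for b in digest[:count])


def toy_key(inchi: str) -> str:
    """Toy two-block key: skeleton layers hashed separately from the rest.

    ASSUMPTION: version, formula, /c and /h layers form the 'skeleton' block.
    """
    if "=" not in inchi:
        raise ValueError("not an InChI string")
    parts = inchi.split("=", 1)[1].split("/")
    skeleton = [p for i, p in enumerate(parts) if i < 2 or p[:1] in ("c", "h")]
    rest = [p for i, p in enumerate(parts) if i >= 2 and p[:1] not in ("c", "h")]
    return letters("/".join(skeleton), 14) + "-" + letters("/".join(rest), 8)


# InChI strings for L- and D-alanine, which differ only in a stereo layer.
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
print(toy_key(l_alanine))
print(toy_key(d_alanine))
```

The two toy keys share their first block and differ in the second, which mirrors how keys for the same skeleton with different stereochemistry relate to each other.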
    • InChIs for Taxol
    • Back to Taxol
      • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
      • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
      • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
      • Which one is correct???
    • InChIKeys for Taxol
      • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
      • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
      • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
      • ChEBI and Wikipedia are the SAME structure
      • DrugBank is a DIFFERENT structure: it differs at ONE stereocenter
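The comparison on this slide can be sketched in a few lines. The three keys are the ones quoted above; the helper name `group_by_first_block` is hypothetical. The first block of an InChIKey encodes the connectivity, so grouping on it shows whether sources agree on the skeleton. A mismatch in the second block only flags a record for inspection: as the slide notes, ChEBI and Wikipedia carry different second blocks yet describe the same structure.

```python
from collections import defaultdict

# InChIKeys for Taxol as quoted on the slide.
taxol_keys = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def group_by_first_block(keys: dict) -> dict:
    """Group source names by the first (connectivity) block of their key."""
    groups = defaultdict(list)
    for source, key in keys.items():
        groups[key.split("-")[0]].append(source)
    return dict(groups)


groups = group_by_first_block(taxol_keys)
second_blocks = {source: key.split("-")[1] for source, key in taxol_keys.items()}

for block, sources in groups.items():
    print(block, "shared by", ", ".join(sorted(sources)))
print("distinct second blocks:", len(set(second_blocks.values())))
```

All three sources fall into one group, so they agree on connectivity, while each has its own second block and needs a closer look at the structure itself.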
    • Does one stereocenter matter?
      • Thalidomide trade names: Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
    • Too Much Variability in InChIs
      • Source: Unofficial InChI FAQ page
    • NEW: Resolve Variability with StdInChI
    • Assertion and Chemical Entities
      • Who says what Taxol is?
      • What is the “timeline” for a molecule?
      • How do we clean up the Public data?
      • The Quality source is Chemical Abstracts Service…
    • Wikipedia Chemistry Curation project
      • > 6000 organic structures
      • Over 1 year of work for a team of 6
      • Many errors removed in the process
      • Slow and torturous process
      • CAS now collaborating in the process
      • InChIs and InChIKeys will be added
    • Stereoisomers
    • Content is King and Quality Costs
      • Chemistry “content” is big money: chemistry publishing and content are worth hundreds of millions of dollars per year
        • Patent searching
        • Structures and properties
        • Drug databases
        • Literature databases
      • Chemical Abstracts Service (CAS), a division of the ACS, is the “Gold Standard” in chemistry-related information
        • 101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences
    • www.chemspider.com
      • Free-access website for chemists to research structure-based information
        • Structure/substructure searches
        • Text-based searches
        • Prediction of properties
        • Web service-based integration
      • Platform for deposition, curation, integration of data
        • Structures, analytical data, annotations, links to resources
        • Annotation and curation of data in real-time
      • A platform to assist discovery?
    • ChemSpider Data
      • The database contains > 21.5 million compounds obtained from > 150 data sources and is growing weekly, with a further 0.5 million compounds awaiting deposition
        • Chemical vendors
        • Publishers
        • Commercial Database Vendors
        • US and international patents
        • Structure aggregators
        • Scraped from websites
        • Deposited by users
    • Example Search 1
      • Is there any information about “Quesnoin”?
      • OR…
      • Type in the name (and there may be many) or other identifier
      • Paste the InChI String, InChIKey or SMILES
      • Draw the structure
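A search box that accepts a name, an InChI string, an InChIKey or a SMILES string first has to guess what was pasted in. The sketch below is an assumption about how such routing could work, not a description of ChemSpider's implementation; `classify_query` and the category labels are hypothetical. The pattern accepts both the 14-10 keys shown in this talk and the later 14-10-1 standard form. Names and SMILES are not told apart here, since doing so reliably needs a chemistry toolkit.

```python
import re

# 14 letters, hyphen, 10 letters, with an optional final hyphen and letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}(-[A-Z])?$")


def classify_query(text: str) -> str:
    """Return a rough label for the kind of identifier in a search query."""
    query = text.strip()
    if query.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(query):
        return "inchikey"
    # Telling a trivial name from a SMILES string is left to a later step.
    return "name_or_smiles"


for example in [
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "RCINICONZNJXQF-CLDWUXIMDD",
    "Quesnoin",
]:
    print(example, "->", classify_query(example))
```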
    • Complex Search
    • Wikipedia via ChemSpider …
    • Searching and Reading Articles…
      • Searching articles based on chemical structure and substructure is very expensive, but this is changing
      • The web IS “tool-ready” so when will publishers deliver?
        • Structures can be shown
        • Spectra can be interactive
        • Graphics don’t need to be static
        • Publishers can enhance their articles (Project Prospect from the RSC is an example)
    • Publishers should adopt/add InChIs: RSC and Nature Publishing Group have!
    • Document Mark-up and Linking
    • Structure Searching
    • Species..
      • Entity Extraction built around modified algorithms from SureChem
      • Optimized for “publications”
      • Dictionaries for chemical entities, groups, reactions, elements, families, species…
      • Dictionaries can be expanded – presently adding PDB
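Dictionary-based entity extraction can be sketched as a lookup of each token against per-category term lists. This is a minimal illustration of the general idea only; it is not the SureChem-derived algorithm mentioned on the slide, and the function `extract_entities`, the tiny `DICTIONARIES` table and the sample sentence are all made up for the example.

```python
import re

# Hypothetical, deliberately tiny dictionaries keyed by category.
DICTIONARIES = {
    "chemical": {"taxol", "butanol"},
    "element": {"carbon", "oxygen"},
    "group": {"hydroxyl", "methyl"},
}


def extract_entities(text: str, dictionaries: dict = DICTIONARIES) -> list:
    """Return (matched text, category, character offset) for dictionary hits."""
    lookup = {term: cat for cat, terms in dictionaries.items() for term in terms}
    hits = []
    for match in re.finditer(r"[A-Za-z][A-Za-z\-]*", text):
        token = match.group(0).lower()
        if token in lookup:
            hits.append((match.group(0), lookup[token], match.start()))
    return hits


sample = "Taxol has several hydroxyl groups; butanol contains carbon and oxygen."
for entity in extract_entities(sample):
    print(entity)
```

Expanding coverage then amounts to adding terms or whole categories to the table, which matches the slide's point that the dictionaries can grow.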
    • The InChI Resolver
    • The InChI “Resolver”
    • Google Searches on InChI – String limit
    • InChIKey Searches Work
    • InChIs are incomplete
      • What is NOT supported, yet:
        • polymers
        • organometallics
        • Markush structures
        • 3-D structures
        • excited states
        • interlocking structures (e.g. rotaxanes)
        • host-guest complexes
    • Crowdsourcing for Curation
      • Chemistry databases enhanced by crowdsourcing
      • Chemistry databases can be connected to articles, vendors, properties, spectra, etc.
      • A platform for deposition, curation and distribution?
      • This is the future… existing business models are at risk
    • Post Comments
      • Anyone can “Post Comments” associated with a structure. Curating data requires a login so that changes can be tracked
    • Conclusions
      • The internet enables chemistry – and at a reduced cost
      • Web 2.0 is here and improving quality – to benefit 3.0
      • Question Quality!
      • Crowdsourcing for expansion, curation and integration
      • Classical models may die quite quickly – business models must change soon or fail
      • Publishers: heed the proliferation of InChIs for Chemistry
    • Blogs and Contacts
      • The InChI resolver
        • http://inchis.chemspider.com (goes live at ACS Spring)
      • The ChemSpider blog
        • www.chemspider.com/blog
      • Contact
        • [email_address]