Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry


    Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry - Presentation Transcript

    1. Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry Antony Williams
    2. The Language of Chemistry
      • My language….
    3. And its dialects….
    4. From Yesterday
      • Approaches to linking data
      • RDF’ing, OWL’ing, SPARQL’ing
      • Triples and stores
      • All are appropriate technologies….
      • Online data linked to by the pharma industry
        • Drugbank, PubChem, Daily Med, KEGG, ChEBI
      • But what of the Quality of data?
    5. Question Everything www.dhmo.org
    6. PubChem
    7. Quality is a Major Issue: Search Butanol
    8. Caution! Question Everything!
    9. The FDA’s DailyMed
    10. Quality of Structures!!!
    11. Quality of Structures
      • If the “Authority” isn’t doing the work to curate then who will?
    12. Collaborative Knowledge Management for Chemists
    13. Drugbank
    14. Taxol on PubChem
    15. Daily Med
    16. The InChI Identifier
    17. Multiple Layers
      • Source: Unofficial InChI FAQ page
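To make the "multiple layers" idea on slide 17 concrete, the sketch below splits an InChI string into its slash-separated layers. It is only an illustration: the `split_inchi` function and the informal layer labels in `LAYER_NAMES` are assumptions made for this example, not part of the talk or of the official InChI software, and the parser ignores subtleties such as a missing formula layer.

```python
# Informal labels for common InChI layer prefixes (hypothetical naming).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
    "f": "fixed hydrogens",
    "r": "reconnected",
}


def split_inchi(inchi):
    """Return the layers of an InChI string as (label, value) pairs.

    Simplification: the second field is always treated as the formula.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = [("version", parts[0])]
    if len(parts) > 1:
        layers.append(("formula", parts[1]))
    for part in parts[2:]:
        if not part:
            continue  # skip empty fields
        label = LAYER_NAMES.get(part[0], part[0])
        layers.append((label, part[1:]))
    return layers


if __name__ == "__main__":
    # Standard InChI for ethanol.
    for label, value in split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"):
        print(f"{label}: {value}")
```

Because each layer adds detail (stereochemistry, isotopes and so on), two sources can agree on the early layers and still disagree on later ones, which is the situation the Taxol slides go on to show.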
    18. InChIStrings Hash to InChIKeys
    19. InChIs for Taxol
    20. Back to Taxol
      • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
      • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
      • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
      • Which one is correct???
    21. InChIKeys for Taxol
      • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
      • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
      • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
      • ChEBI and Wikipedia are the SAME structure
      • Drugbank is a DIFFERENT structure – ONE stereocenter
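The comparison on slides 20 and 21 can be partly automated. In an InChIKey, the block before the hyphen is a hash of the connectivity, and the characters after it cover the remaining layers such as stereochemistry. The sketch below groups the three keys quoted on the slide by their first block; the function names are hypothetical, chosen only for this example.

```python
# InChIKeys for Taxol exactly as quoted on the slide.
KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_key(key):
    """Split a key into (first block, everything after the first hyphen)."""
    first, _, rest = key.partition("-")
    return first, rest


def group_by_first_block(keys):
    """Group source names by the connectivity block of their key."""
    groups = {}
    for source, key in keys.items():
        first, _ = split_key(key)
        groups.setdefault(first, []).append(source)
    return groups


if __name__ == "__main__":
    for block, sources in group_by_first_block(KEYS).items():
        print(block, "->", ", ".join(sources))
    for source, key in KEYS.items():
        print(source, "remainder:", split_key(key)[1])
```

All three sources share the first block, so they agree on the skeleton of the molecule. A key comparison of this kind only flags that records differ somewhere in the remaining layers; deciding which records depict the same structure, as the slide does, still requires the full InChIs or the drawn structures.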
    22. Does one stereocenter matter?
    23. Does one stereocenter matter?
      • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
    24. Does one stereocenter matter?
      • Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
    25. Too Much Variability in InChIs
      • Source: Unofficial InChI FAQ page
    26. NEW: Resolve Variability with StdInChI
    27. Assertion and Chemical Entities
      • Who says what Taxol is?
      • What is the “timeline” for a molecule?
      • How do we clean up the Public data?
      • The Quality source is Chemical Abstracts Service…
    28. Wikipedia Chemistry Curation project
      • > 6000 organic structures
      • Over 1 year of work for a team of 6
      • Many errors removed in the process
      • Slow and torturous process
      • CAS now collaborating in the process
      • InChIs and InChIKeys will be added
    29.  
    30. Stereoisomers
    31. Content is King and Quality Costs
      • Chemistry “content” is big money – Chemistry publishing and content is worth $100s of millions/year
        • Patent searching
        • Structures and properties
        • Drug databases
        • Literature databases
      • Chemical Abstracts Service (CAS), a division of the ACS, is the “Gold Standard” in chemistry-related information
        • 101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences
    32. www.chemspider.com
      • Free access website for chemists to research structure based information
        • Structure/substructure searches
        • Text-based searches
        • Prediction of properties
        • Web service-based integration
      • Platform for deposition, curation, integration of data
        • Structures, analytical data, annotations, links to resources
        • Annotation and curation of data in real-time
      • A platform to assist discovery?
    33. ChemSpider Data
      • The database contains >21.5 million compounds obtained from >150 data sources and is growing weekly; 0.5 million compounds are awaiting deposition
        • Chemical vendors
        • Publishers
        • Commercial Database Vendors
        • US and international patents
        • Structure aggregators
        • Scraped from websites
        • Deposited by users
    34. Example Search 1
      • Is there any information about “Quesnoin”?
      • OR…
      • Type in the name (and there may be many) or other identifier
      • Paste the InChI String, InChIKey or SMILES
      • Draw the structure
    35. Example Search 1
    36. Example Search 1
    37. Complex Search
    38. Wikipedia via ChemSpider …
    39. Searching and Reading Articles…
      • Searching articles based on chemical structure and substructure is very expensive, but this is changing
      • The web IS “tool-ready” so when will publishers deliver?
        • Structures can be shown
        • Spectra can be interactive
        • Graphics don’t need to be static
        • Publishers can enhance their articles (Project Prospect from the RSC is an example)
    40. Publishers should adopt/add InChIs: RSC and Nature Publishing Group have!
    41.  
    42. Document Mark-up and Linking
    43. Structure Searching
    44. Species..
      • Entity Extraction built around modified algorithms from SureChem
      • Optimized for “publications”
      • Dictionaries for chemical entities, groups, reactions, elements, families, species…
      • Dictionaries can be expanded – presently adding PDB
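Slide 44 describes entity extraction driven by dictionaries of chemical entities, groups, reactions, elements and so on. The sketch below shows the general idea of dictionary lookup over tokenised text. It is a minimal illustration under stated assumptions: the dictionary contents and the `extract_entities` function are invented for this example and do not represent the SureChem algorithms or the ChemSpider implementation.

```python
import re

# Tiny hypothetical dictionary: lower-cased term -> category.
DICTIONARY = {
    "taxol": "chemical entity",
    "butanol": "chemical entity",
    "hydroxyl": "group",
    "oxidation": "reaction",
    "carbon": "element",
}

# A token is a letter followed by letters, digits or hyphens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def extract_entities(text, dictionary=DICTIONARY):
    """Return (matched text, category, character offset) for each hit."""
    hits = []
    for match in TOKEN.finditer(text):
        term = match.group(0).lower()
        if term in dictionary:
            hits.append((match.group(0), dictionary[term], match.start()))
    return hits


if __name__ == "__main__":
    sentence = "Oxidation of butanol gave no Taxol."
    for hit in extract_entities(sentence):
        print(hit)
```

Expanding coverage in this scheme means adding entries to the dictionary, which matches the slide's remark that dictionaries can be extended with new sources. Single-token lookup is the simplest possible design; real systems also need to handle multi-word names and systematic nomenclature.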
    45. The InChI Resolver
    46. The InChI “Resolver”
    47. The InChI “Resolver”
    48. Google Searches on InChI – String limit
    49. InChIKey Searches Work
    50. InChIs are incomplete
      • What is NOT supported, yet:
        • polymers
        • organometallics
        • Markush structures
        • 3-D structures
        • excited states
        • interlocking structures (e.g. rotaxanes)
        • host-guest complexes
    51. Crowdsourcing for Curation
      • Chemistry databases enhanced by crowdsourcing
      • Chemistry databases can be connected to articles, vendors, properties, spectra, etc.
      • A platform for deposition, curation and distribution?
      • This is the future… existing business models are at risk
    52. Post Comments
      • Anyone can “Post Comments” associated with a structure. To curate data we require login to track
    53. Conclusions
      • The internet enables chemistry – and at a reduced cost
      • Web 2.0 is here and improving quality – to benefit 3.0
      • Question Quality!
      • Crowdsourcing for expansion, curation and integration
      • Classical models may die quite quickly – business models must change soon or fail
      • Publishers – heed the proliferation of InChIs for Chemistry
    54. Blogs and Contacts
      • The InChI resolver
        • http://inchis.chemspider.com (goes live at ACS Spring)
      • The ChemSpider blog
        • www.chemspider.com/blog
      • Contact
        • [email_address]

    Antony Williams, ChemSpiderman, 9 months ago


    © All Rights Reserved
