ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemistry

    ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemistry - Presentation Transcript

    1. Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry. Antony Williams. Nature Publishing Group Presentation, New York, May 2009
    2. The Changing Scope of My Roles and Information Access
      • Postdoctoral position, NRC, Canada (1988-1990)
      • NMR Facility Director, University of Ottawa (1990-1991)
      • NMR Technology Leader, Eastman-Kodak (1991-1997).
      • VP and Chief Science Officer for ACD/Labs, a commercial scientific software company (1997-2007).
      • Consultant to cheminformatics/scientific software companies, publishers, academia
      • “ChemSpiderman” – host of a rich resource of Free Access web-based information for chemists.
    3. Access to Information
      • For me…
        • PhD : Libraries primary source of information
        • PostDoc/Academia: Libraries and librarians
        • Eastman Kodak: Software tools and databases
        • Kodak and ACD/Labs: Replaced by the internet
        • Today: The Internet enhanced by a network of collaborators…
    4. Content is King
      • Chemistry “content” is big money – Chemistry publishing and content is worth $100s of millions/year
        • Patent searching
        • Structures and properties
        • Drug databases
        • Literature databases
      • Chemical Abstracts Service (CAS), a division of the ACS, is the “Gold Standard” in chemistry-related information
        • 101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences
    5. The Language of Chemistry
      • My language….
    6. And its dialects….
    7. As a chemist…
      • I look for information about chemicals/chemistry
        • What is a particular structure ?
        • What alternative names/identifiers?
        • Reaction synthesis?
        • Physical properties?
        • Analytical data?
        • Purchase?
        • Tell me more?
        • Similar stuff – what other compounds are “like” mine?
    8. Imagine a time when ….
      • The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
      • Chemistry articles are indexed and searchable by a free online service
      • The web is linked together through the “language of chemistry”
      • Publicly funded research data can be shared and discussed in the Open, maybe as ONS?
      • Cheminformatics has as much of a public face as bioinformatics
    9. Linked Data Cloud
    10. Chemistry on the Internet
      • Much of the information online is User Beware!
      • The Quality of information is “diverse”
      • Technologies can “link and connect” information but validation and curation are key to providing quality
      • The LinkedData web is of less value when the data linked are “wrong”
    11. “Good Stuff”: TotallySynthetic.com
    12. PubChem
    13. What is “wrong”?
      • Questions a chemist might ask…
        • What is the melting point of n-butanol?
        • What is the chemical structure of Xanax?
        • Chemically, what is phenolphthalein?
        • What are the stereocenters of cholesterol?
        • Where can I find publications about xylene?
        • What are the different trade names for Ketoconazole?
        • What is the NMR spectrum of Aspirin?
        • What are the safety handling issues for Thymol Blue?
        • ChemSpider can answer all of these questions
    14. Search Cholesterol
    15. Search Cholesterol
    16. Search Cholesterol
    17. Search Cholesterol
    18. Search Cholesterol
    19. Search Cholesterol
    20. Link outs
    21. Links out to KEGG Kyoto Encyclopedia of Genes and Genomes
    22. Complex Data and Information
    23. Other Searches
      • What compounds have a mass of 300+/-0.001?
      • or search a combination of intrinsic/predicted properties
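The mass query above is, in essence, a tolerance window on monoisotopic mass. The following is a minimal sketch of that idea, not ChemSpider's actual search: the tiny in-memory compound table and the helper names `monoisotopic_mass` and `search_by_mass` are assumptions made for illustration.

```python
# Monoisotopic masses of the most abundant isotopes (standard values, rounded).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Hypothetical mini-database: compound name -> elemental composition.
COMPOUNDS = {
    "aspirin": {"C": 9, "H": 8, "O": 4},
    "caffeine": {"C": 8, "H": 10, "N": 4, "O": 2},
    "cholesterol": {"C": 27, "H": 46, "O": 1},
}


def monoisotopic_mass(formula):
    """Sum isotope masses over a dict of element -> atom count."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def search_by_mass(compounds, target, tolerance):
    """Return the names whose monoisotopic mass lies within target +/- tolerance."""
    return sorted(
        name
        for name, formula in compounds.items()
        if abs(monoisotopic_mass(formula) - target) <= tolerance
    )


if __name__ == "__main__":
    print(search_by_mass(COMPOUNDS, 180.0423, 0.001))
    print(search_by_mass(COMPOUNDS, 300.0, 0.001))
```

A real service would index precomputed masses so that the window becomes a range query rather than a scan; the linear scan here only keeps the sketch short.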
    24. Other Searches
    25. Complex Search
      • The Quality of Online Chemistry…
    26. Quality is a Major Issue - Search Butanol (old example, now fixed)
    27.  
    28. Wikipedia, C&E News, PubChem
      • C&E News (from ACS)
    29. Does one stereocenter matter? Thalidomide
    30. Vancomycin
      • Who will curate?
      • PubChem is not resourced to clean these errors 
      • How would you clean such a large dataset?
    31. Vancomycin ChemSpider: 1 compound – 3 days
    32. Question Everything www.dhmo.org
    33. DailyMed
        • “DailyMed provides high quality information about marketed drugs. This information includes FDA approved labels (package inserts).”
    34. The FDA’s DailyMed
    35. Structures on DailyMed Poor Representations
    36. Structures on DailyMed Lack of Stereochemistry
    37. Incorrect Structures Scanning (?) Issues
    38. Incorrect Structures
    39. Does it Matter?
      • Does it matter to the consumer that the structures are wrong? No… what matters is that what is in the bottle is the right medication!
      • To make DailyMed structure-searchable it DOES matter
      • To data mine DailyMed it matters
      • To mark up DailyMed it matters
    40. Wikis for Science
      • Who in the room hasn’t used Wikipedia?
      • Is it trustworthy?
      • What are the advantages and disadvantages of the Wiki environment?
      • How suitable is it for Chemistry?
    41. Collaborative Knowledge Management for Chemists
    42. Wikipedia Chemistry Curation project
      • Only ca. 6000 organic structures, 8000 total structures
      • Over 18 months of work for a team of 6 people
      • Many errors removed in the process. Curation process is a daily event for users/depositors
      • Slow and torturous process
      • http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure
    43. Wikipedia Curation
      • Looking for self-consistency across a Wikipedia Page
      • Primary key is the article TITLE
      • The chemical shown needs to match the title
      • Cyclic self-consistency – and decisions must get made
    44. Wikipedia Links to Drugbank
    45. Taxol on PubChem
    46. Taxol on DailyMed
    47. Differences between ChemSpider/Wikipedia
      • Wikipedia: ~6000 organics, 2000 others | ChemSpider: >21 million unique structures
      • Wikipedia: Text | ChemSpider: Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, …
      • Wikipedia: Detailed compound monographs | ChemSpider: Compound monographs linked
      • Wikipedia: ???? | ChemSpider: 6000 people/day; 3800 registered
      • Wikipedia: No | ChemSpider: Prediction of properties
      • Wikipedia: Active editors > 50 (?) | ChemSpider: Active depositors/curators – 30
      • Wikipedia: No, but links. | ChemSpider: Analytical Data
    48. Differences between Wikipedia/ChemSpider
      • ChemSpider: Primarily Microsoft .NET technologies with OS components | Wikipedia: Supported by the tried and tested MediaWiki platform
      • ChemSpider: “Out of a basement” on three servers and 5 volunteers | Wikipedia: Established infrastructure and Wikimedia Foundation team
      • ChemSpider: Growing team of advocates, curators and users | Wikipedia: Strong team of WP:Chem advocates, curators and admins
      • ChemSpider: Mixed “licensing” | Wikipedia: GFDL licensing for everything
      • ChemSpider: Chemistry is the focus of ‘Spider | Wikipedia: Chemistry is a subset of the ‘Pedia
      • ChemSpider: Growing reputation as focused on quality | Wikipedia: Worldwide reputation as quality source – good and bad
    49. The InChI Identifier
    50. Multiple Layers
      • Source: Unofficial InChI FAQ page
    51. InChIs Structure but NOT substructure
    52. InChIStrings Hash to InChIKeys
    53. InChIs for Taxol
    54. Back to Taxol
      • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
      • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
      • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
      • Which one is correct???
    55. InChIKeys for Taxol
      • DrugBank: RCINICONZNJXQF-CLDWUXIMDD
      • ChEBI: RCINICONZNJXQF-GXKQXQCDDN
      • Wikipedia: RCINICONZNJXQF-MZXODVADBJ
      • ChEBI and Wikipedia are the SAME structure
      • Drugbank is a DIFFERENT structure – ONE stereocenter
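The comparison on this slide can be partly mechanised by splitting each InChIKey at its hyphen. The sketch below assumes the usual InChIKey design, in which the first block is a hash of the connectivity and the remainder is a hash of the other layers (stereochemistry among them). The helper names are hypothetical and the keys are copied from the slide as transcribed.

```python
from collections import defaultdict

# InChIKeys for Taxol exactly as listed on the slide.
KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_key(key):
    """Split an InChIKey into (first block, everything after the first hyphen)."""
    first, _, rest = key.partition("-")
    return first, rest


def group_by_first_block(keys):
    """Group source names by the first (connectivity) block of their key."""
    groups = defaultdict(list)
    for source, key in keys.items():
        groups[split_key(key)[0]].append(source)
    return dict(groups)


if __name__ == "__main__":
    for block, sources in group_by_first_block(KEYS).items():
        print(block, "->", sources)
    for source, key in KEYS.items():
        print(source, "second block:", split_key(key)[1])
```

All three sources share the first block, which is consistent with a common skeleton. The second blocks as transcribed differ for every source, so string comparison alone does not reproduce the slide's conclusion; deciding which records really differ by one stereocenter needs the full InChI strings, which are not part of this transcript.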
    56. The InChI Resolver
    57.  
    58. Coming Soon…Linked Articles
    59. How bad can it get??? And who is right????
    60. Curating ChemSpider
      • Anyone can “Post Comments” associated with a structure. To curate data we require login to track
    61. Multi-level Curation and Approval
    62. Crowd-sourcing Chemistry
      • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
      • ALSO
      • Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
    63. Structure-Centric
      • We want to search “information” by structure, substructure, similarity of structure
      • Specific focus on Open Chemistry at present
      • Standard approaches would be:
        • Identify chemical names “entity extraction”
        • Convert chemical names to structures and index
      • ChemSpider has a validated dictionary of structure-name pairs
      • Use name extraction, name-conversion and dictionary look-up. THEN curate.
    64. “Entity Extraction”
      • Rule-based recognition of systematic names:
        • Use a lexeme of name fragments
        • Rules for identifying bounds of a name
      • Look-up dictionary:
        • Drug Names
        • Trivial Names
        • Numbers : Registry IDs, EINECS/ELINCS
        • Massive look-up dictionary of validated identifiers on ChemSpider
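The two routes listed above, dictionary look-up and rule-based recognition from name fragments, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ChemSpider implementation: `DICTIONARY` and `LEXICON` are tiny hypothetical stand-ins, and the only "rule" is that a token must be tiled exactly by at least two known fragments.

```python
import re

# Hypothetical look-up dictionary of validated trivial and drug names.
DICTIONARY = {"aspirin", "taxol", "xanax"}

# Hypothetical, deliberately tiny lexicon of systematic-name fragments.
LEXICON = ["di", "tri", "chloro", "amino", "phenyl", "meth", "eth",
           "an", "ane", "ol", "one", "yl"]


def fragment_count(word, lexicon=LEXICON):
    """Smallest number of lexicon fragments that tile `word` exactly, or 0 if none."""
    best = [0] + [None] * len(word)  # best[i]: fewest fragments covering word[:i]
    for end in range(1, len(word) + 1):
        for frag in lexicon:
            start = end - len(frag)
            if start >= 0 and best[start] is not None and word[start:end] == frag:
                if best[end] is None or best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
    return best[len(word)] or 0


def extract_names(text):
    """Return (surface form, method) pairs for tokens that look like chemical names."""
    hits = []
    for token in text.split():
        surface = token.strip(".,;:")
        letters = re.sub(r"[^a-z]", "", surface.lower())  # drop locants and punctuation
        if letters in DICTIONARY:
            hits.append((surface, "dictionary"))
        elif fragment_count(letters) >= 2:
            hits.append((surface, "rule"))
    return hits


if __name__ == "__main__":
    sample = ("The mixture was washed with dichloromethane, "
              "recrystallized from ethanol; he took an aspirin.")
    print(extract_names(sample))
```

Requiring two fragments keeps ordinary words such as "an" or "one" from matching on their own, but a lexicon this small will still miss most real names and can accept nonsense, which is why a curation step follows extraction.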
    65.  
    66. Name Recognition
      • Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
      • The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)
    67. How Many Chemical Names?
      • “She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
    68. ChemMantis
      • Chemical Markup And Nomenclature Transformation Integrated System
    69. Making Open Access Articles Searchable Proof of Concept
      • Can we HOST Chemistry Open Access articles on ChemSpider and add value?
      • Can we identify chemical names in Open Access articles in a user-friendly manner?
      • Can we convert names to structures in Open Access articles, expand ChemSpider, and provide structure searching of Open Access chemistry articles?
      • Can we provide an environment for chemists to mark up their own articles and crowd-source markup of an archive?
    70. Document markup
    71. Markup – 3 seconds!
    72. On the fly conversion
    73. Shorthand Formulae Supported
    74. One Click to more Info…
    75. A Platform for Markup
      • Can we provide a platform for document markup for chemists?
      • Workflow:
        • Upload word docs, RTF files or point to HTML and load
        • Apply entity extraction, convert names to structures, mark-up automatically and ask for user participation
        • Deposit all structures on ChemSpider under embargo and wait for article DOI to release
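The last step of this workflow, holding deposited structures under embargo until the article's DOI exists, can be modelled very simply. The sketch below is an assumption about how such a gate might look, not a description of ChemSpider's code: the `Deposition` record, the `release` and `visible` helpers, and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deposition:
    """One structure extracted from an uploaded article (hypothetical record)."""
    structure: str              # e.g. a SMILES string
    article_id: str             # internal identifier of the uploaded document
    doi: Optional[str] = None   # stays None while the article is embargoed

    @property
    def public(self) -> bool:
        # A deposition becomes visible only once its article has a DOI.
        return self.doi is not None


def release(depositions: List[Deposition], article_id: str, doi: str) -> None:
    """Lift the embargo on every structure belonging to one article."""
    for dep in depositions:
        if dep.article_id == article_id:
            dep.doi = doi


def visible(depositions: List[Deposition]) -> List[str]:
    """Structures that may currently be shown to the public."""
    return [dep.structure for dep in depositions if dep.public]


if __name__ == "__main__":
    queue = [
        Deposition("CCO", "article-1"),
        Deposition("ClCCl", "article-1"),
        Deposition("CC(=O)O", "article-2"),
    ]
    print(visible(queue))                                 # nothing released yet
    release(queue, "article-1", "10.0000/example-doi")    # placeholder DOI
    print(visible(queue))
```

Tying visibility to the presence of a DOI, rather than to a separate flag, means a structure cannot be published without a citable article to point back to.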
    76. Challenges
      • Computer software can generate chemical names better than the majority of chemists
      • The majority of chemical names are generated by humans, and are often incorrect: they convert to the wrong structure or are ambiguous
      • One name, Multiple Structures
    77. Names and Structures
      • Dichloroacetone
      • Trichloromethylsilane
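Both names above can be read as more than one compound, so a name-to-structure step has to report ambiguity instead of silently picking one reading. A minimal sketch, assuming a hand-built candidate table: the `CANDIDATES` mapping and the `resolve` helper are hypothetical, and the SMILES strings are written out by hand for illustration.

```python
# Hypothetical candidate table: lower-cased name -> {systematic reading: SMILES}.
CANDIDATES = {
    "dichloroacetone": {
        "1,1-dichloroacetone": "CC(=O)C(Cl)Cl",
        "1,3-dichloroacetone": "ClCC(=O)CCl",
    },
    "trichloromethylsilane": {
        "trichloro(methyl)silane": "C[Si](Cl)(Cl)Cl",
        "(trichloromethyl)silane": "[SiH3]C(Cl)(Cl)Cl",
    },
    "ethanol": {"ethanol": "CCO"},
}


def resolve(name):
    """Classify a name as 'resolved', 'ambiguous' or 'unknown' and list its readings."""
    readings = CANDIDATES.get(name.strip().lower(), {})
    if not readings:
        status = "unknown"
    elif len(readings) == 1:
        status = "resolved"
    else:
        status = "ambiguous"  # more than one structure: send to a curator
    return status, sorted(readings.items())


if __name__ == "__main__":
    for query in ("Ethanol", "Dichloroacetone", "Trichloromethylsilane", "unobtainium"):
        status, readings = resolve(query)
        print(query, "->", status, readings)
```

Returning every reading, not only a status flag, lets a curator or the article's author choose the intended structure during markup.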
    78. Ambiguity
    79. Ambiguity in Abbreviations - DPA
    80. Ambiguity in Abbreviations - THF
    81. Import is Easy
      • Make articles Public/Private (embargo date soon)
      • Auto-markup and check by user
    82. IUPAC PAC Articles
    83. Supports Word .DOC, HTML, RTF
    84. Nature Publications
    85. Entity Balloons
      • Structures are the language of chemistry
      • Show structures to chemists and search/link from there
    86. Integrations Out to Other Sources
    87. Integrations Out to Other Sources
    88. Reactions
    89. Manual Curation is Always Necessary
    90. Text-Indexing and ChemSpider?
      • ChemSpider text-indexes almost 500,000 Open Access and Free Access articles
      • Collection is growing and more publishers have already agreed. Including theses in the future.
    91. Open Access Literature Search
    92. Dictionaries are Easily Enhanced
      • Copy-Paste into appropriate Entity Dictionary
      • Impacts all future markups
      • Expanding knowledgebases of information
      • Linked out to rich sources of information
    93. Build Dictionaries Ontologies Next
    94. Other Dictionaries - Species
      • We are considering
        • Bacteria
        • Fungi
        • Enzymes
        • Viruses
        • PDB codes….
    95. ChemSpider Everywhere
      • Linked from Wikipedia
      • Linked from Open Notebook Science sites using EMBED
      • Linked from Blogs using Structure/Spectra EMBED
      • Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets
      • Integrated to software offerings from Thermo, Waters, Agilent, Bruker
    96. ChemSpider Everywhere Embed Functionality (like YouTube)
    97. ChemSpider Everywhere www.spectralgame.com
    98. ChemSpider Everywhere Crowdsourced Curation of Spectra
    99. ChemSpider Everywhere RSC Compounds
    100. ChemSpider Everywhere Nature Chemistry
      • Nature Chemistry articles are annotated to identify all of the chemical compounds mentioned throughout the text.
      • Those compounds are linked out to other information resources including PubChem and ChemSpider.
    101. ChemSpider Everywhere ChemMobi
    102. Structure RSS Feeds with InChIs
    103.  
    104. Conclusions
      • www.chemspider.com
      • www.chemspider.com/journal
      • InChIs and Internet Chemistry
      • http://inchis.chemspider.com

    Antony Williams, ChemSpiderman, 6 months ago


    This was a presentation I gave to an audience at Na more

    © All Rights Reserved
