A Presentation at Nature Publishing Group Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry
 


This is a presentation I gave at the Nature Publishing Group offices in London, UK. It covers general information about ChemSpider and our efforts with ChemMantis.

    Presentation Transcript

    • Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry Antony Williams
    • Imagine a time when…
      • The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
      • Chemistry articles are indexed and searchable by a free online service
      • The web is linked together through the “language of chemistry”
      • Publicly funded research data can be shared and discussed in the Open, maybe as ONS?
      • Cheminformatics has as much of a public face as bioinformatics
      Building a Structure Centric Community for Chemists
    • ChemSpider - A Search Engine for Chemists
      Questions a chemist might ask…
      • What is the melting point of n-butanol?
      • What is the chemical structure of Xanax?
      • Chemically, what is phenolphthalein?
      • What are the stereocenters of cholesterol?
      • Where can I find publications about xylene?
      • What are the different trade names for Ketoconazole?
      • What is the NMR spectrum of Aspirin?
      • What are the safety handling issues for Thymol Blue?
      ChemSpider can answer all of these questions
    • What is a Structure? Ask a computer…ask a chemist
    • Tell Me About Glutathione
    • Link outs
    • Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)
    • How many names does a compound have?
    • ChemSpider Data Content
      • Over 21.5 million unique chemical structures from ca. 150 data sources
      • Online Databases – PubChem, DrugBank, KEGG, Wikipedia
      • Literature – PubMed, J Het Chem, Nature, RSC, Open Access
      • Chemical Vendors – over 40 different vendors and growing
      • Personal Depositions – individual contributions
      • Content database vendors
      • Analytical data collections
      • Patents
      • Web scraping
      • Content is linked back to the original data sources
    • Other Searches
      • What compounds have a mass of 300 ± 0.001?
      • Or search a combination of intrinsic/predicted properties
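A mass-window query like the one on this slide can be pictured as a simple tolerance filter. The following is a minimal sketch, not ChemSpider's implementation: the `Compound` record, the `search_by_mass` function and the example masses are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    monoisotopic_mass: float  # daltons; placeholder values, not measured data


def search_by_mass(compounds, target, tolerance):
    """Return the compounds whose mass lies within target ± tolerance."""
    return [c for c in compounds if abs(c.monoisotopic_mass - target) <= tolerance]


# Hypothetical records used only to exercise the filter.
RECORDS = [
    Compound("compound A", 300.0004),
    Compound("compound B", 300.0020),
    Compound("compound C", 299.9995),
    Compound("compound D", 301.1000),
]

hits = search_by_mass(RECORDS, target=300.0, tolerance=0.001)
print([c.name for c in hits])
```

A real service would run this as an indexed range query over millions of records rather than a linear scan; the sketch only shows the matching rule.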
    • Complex Search
    • The Quality of Data Online…
      • Aggregating data opens up quality issues
      • Structure-identifier associations are “dirty”
      • Structures are COMMONLY incorrect
      • Manual curation of small databases is enough work – what about millions of structures?
      • Structures are far from perfect. What is a “correct structure”? Full stereochemistry? Historical timeline of structure? Who is the authority?
    • Who holds THE Quality Authority?
      • Chemical Abstracts Service is the structural authority today.
      • 1400 employees, world standard in chemistry information
      • 101 years of knowledge, process and expertise.
      • How can an online, free access system peacefully co-exist with the authority?
    • Quality is a Major Issue: Search Butanol (OLD EXAMPLE, now fixed)
    • Wikipedia Chemistry Curation project
      • Only ca. 5000 organic structures, 7000 total structures
      • Almost a year of work so far for a team of 6 people
      • Many errors removed in the process.
      • Curation process is a daily event for users/depositors
      • Slow and torturous process
      • http://en.wikipedia.org/wiki/Talk:Tacrolimus#IUPAC_Name_and_structure
    • Wikipedia Curation
      • Looking for self-consistency across a Wikipedia Page
      • Primary key is the article TITLE
      • The chemical shown needs to match the title
      • Cyclic self-consistency – and decisions must be made
    • Viagra or Sildenafil
    • Other issues…
    • Charges
    • Sugars – Machine Readable vs Aesthetics: Haworth, Stereo, Fischer
    • Wikipedia – Crowdsourcing Chemistry
    • Thymol Blue on ChemSpider
      Data online includes:
      • UV-vis spectrum
      • Measured experimental properties
      • Link to Wikipedia article
      • Links to chromatography details
      • Multiple identifiers/trade names etc.
      • Links to vendors/suppliers/other databases
      • Safety information
      http://www.chemspider.com/q/thymol%20blue
    • Differences between ChemSpider/Wikipedia
      • ChemSpider: >21 million unique structures. Wikipedia: ~5000 organics, 2000 others
      • ChemSpider: complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, … Wikipedia: Text
      • ChemSpider: prediction of properties. Wikipedia: No
      • ChemSpider: Analytical Data. Wikipedia: No, but links.
      • ChemSpider: active depositors/curators – 30. Wikipedia: active editors > 50 (?)
      • ChemSpider: 6000 people/day; 1900 registered. Wikipedia: ????
      • ChemSpider: compound monographs linked. Wikipedia: detailed compound monographs
    • Differences between Wikipedia/ChemSpider
      • Wikipedia: supported by the tried and tested MediaWiki platform. ChemSpider: primarily Microsoft .NET technologies with OS components
      • Wikipedia: established infrastructure and Wikimedia Foundation team. ChemSpider: “Out of a basement” on three servers and 5 volunteers
      • Wikipedia: Chemistry is a subset of the ‘Pedia. ChemSpider: Chemistry is the focus of ‘Spider
      • Wikipedia: GFDL licensing for everything. ChemSpider: mixed “licensing”
      • Wikipedia: strong team of WP:Chem advocates, curators and admins. ChemSpider: growing team of advocates, curators and users
      • Wikipedia: worldwide reputation as quality source – good and bad. ChemSpider: growing reputation as focused on quality
    • Crowd-sourcing Curation
      • How to curate data for millions of structures?
      • Robot processes can clean up depositions
      • Search for Chloride and check molecular formula for Cl
      • Check for stereochemistry and remove names with stereo
      • Provide a simple-to-use platform to curate, annotate and tag data
      • Provide curator administration to prevent vandalism (Veropedia)
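The "search for Chloride and check molecular formula for Cl" robot check can be sketched as a name-versus-formula consistency test. This is an illustrative assumption about how such a check might look, not the actual ChemSpider robot: `elements_in`, `flag_chloride_mismatches` and the sample records are hypothetical, and the parser only handles simple formulae without charges, hydrates or brackets.

```python
import re


def elements_in(formula):
    """Return the set of element symbols in a simple formula such as 'C7H7Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def flag_chloride_mismatches(records):
    """Return names that mention 'chloride' while the formula contains no Cl."""
    return [
        name
        for name, formula in records
        if "chloride" in name.lower() and "Cl" not in elements_in(formula)
    ]


# Hypothetical deposition records: (name, molecular formula).
records = [
    ("sodium chloride", "NaCl"),
    ("benzyl chloride", "C7H7Cl"),
    ("methyl chloride", "CH4"),  # deliberately inconsistent, for the demo
    ("ethanol", "C2H6O"),
]

flagged = flag_chloride_mismatches(records)
print(flagged)
```

Flagged records would then go to a human curator rather than being corrected automatically, in line with the curation workflow described in the slides.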
    • Post Comments
      • Anyone can “Post Comments” associated with a structure.
      • To curate data we require login to track
    • Multi-level Curation and Approval
    • Crowd-sourcing Chemistry
      • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
      • ALSO crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
    • DailyMed
    • Quality of Structures
    • Quality of Structures!!!
    • Structure-Centric
      • We want to search “information” by structure, substructure, similarity of structure
      • Specific focus on Open Chemistry at present
      • Standard approaches would be:
        • Identify chemical names: “entity extraction”
        • Convert chemical names to structures and index
      • ChemSpider has a validated dictionary of structure-name pairs
      • Use name extraction, name-conversion and dictionary look-up. THEN curate.
    • “Entity Extraction”
      • Rule-based recognition of systematic names:
        • Use a lexeme of name fragments
        • Rules for identifying bounds of a name
      • Look-up dictionary:
        • Drug Names
        • Trivial Names
        • Numbers: Registry IDs, EINECS/ELINCS
      • Massive look-up dictionary of validated identifiers on ChemSpider
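The two passes named on this slide, dictionary look-up and rule-based recognition from name fragments, can be illustrated with a toy example. Everything here is an assumption made for illustration: the dictionary entries, the placeholder record identifiers, the fragment list and the "at least two fragments" rule are hypothetical stand-ins, far cruder than a production name recognizer.

```python
import re

# Hypothetical look-up dictionary: lower-cased name -> placeholder record id.
DICTIONARY = {
    "dichloromethane": "rec-1",
    "ethanol": "rec-2",
    "magnesium sulfate": "rec-3",
}

# Hypothetical fragments that hint at a systematic name.
FRAGMENTS = ("phenyl", "amino", "chloro", "hydroxy", "nitro")


def dictionary_hits(text, dictionary, max_words=3):
    """Longest-match look-up of one- to max_words-word phrases."""
    words = re.findall(r"[A-Za-z0-9-]+", text)
    hits, i = [], 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n]).lower()
            if phrase in dictionary:
                hits.append((phrase, dictionary[phrase]))
                i += n
                break
        else:
            i += 1  # no phrase starting here matched
    return hits


def rule_hits(text, fragments=FRAGMENTS, min_fragments=2):
    """Flag tokens containing at least min_fragments name fragments."""
    hits = []
    for token in text.split():
        token = token.strip(".,;")
        count = sum(token.lower().count(f) for f in fragments)
        if count >= min_fragments:
            hits.append(token)
    return hits


TEXT = (
    "To the solution was added (3,4-diaminophenyl)phenyl methanone. "
    "The mixture was dried over magnesium sulfate, washed with "
    "dichloromethane and recrystallized from ethanol."
)

print(dictionary_hits(TEXT, DICTIONARY))
print(rule_hits(TEXT))
```

The rule pass only catches the first token of the systematic name; finding where a multi-word name begins and ends is exactly the "bounds of a name" problem the slide mentions, and it is not solved by this sketch.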
    • Name Recognition
      Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and a excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystalized from ethanol 95% (1.28 g, 91 %)
    • How Many Chemical Names?
      “She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
    • ChemMantis: Chemical Markup And Nomenclature Transformation Integrated System
    • Making Open Access Articles Searchable: Proof of Concept
      • Can we HOST Chemistry Open Access articles on ChemSpider and add value?
      • Can we identify chemical names in Open Access articles in a user-friendly manner?
      • Can we convert names to structures in Open Access articles, expand ChemSpider and provide structure searching of Open Access chemistry articles?
      • Can we provide an environment for chemists to mark up their own articles and crowd-source markup of an archive?
    • Document markup
      • ChemSpider now hosting Open Access articles from MDPI, Molecular Diversity Preservation International
      • Hosting the Molbank collection at present
    • A Standard for Document Markup?
      • NLM-DTD: National Library of Medicine Document Type Definition
      • Approved markup definitions to apply to journal articles – extended as necessary for our purposes
    • NLM/DTD markup
    • Chemistry and Biology Menus can be extended as necessary
    • Document markup
    • Markup – 3 seconds!
    • On the fly conversion
    • Shorthand Formulae Supported
    • One Click to more Info…
    • Structure Image Conversion
    • Two Seconds Later
    • Not Always Perfect….
    • A Platform for Markup
      • Can we provide a platform for document markup for chemists?
      • Workflow:
        • Upload Word docs, RTF files or point to HTML and load
        • Apply entity extraction, convert names to structures, mark up automatically and ask for user participation
        • Publish final version with NLM-DTD markup
        • Deposit all structures on ChemSpider under embargo and wait for article DOI to release
    • Challenges
      • Computer software can generate chemical names better than the majority of chemists
      • The majority of chemical names are generated by humans, and many are incorrect: they convert to the wrong structure or are ambiguous
      • One name, Multiple Structures
    • Names and Structures: Dichloroacetone, Trichloromethylsilane
    • Ambiguity
    • Ambiguity in Abbreviations - DPA
    • Ambiguity in Abbreviations - THF
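The "one name, multiple structures" problem can be made concrete with a look-up that returns every candidate structure for a name and routes anything other than a single match to a curator. This is a sketch under stated assumptions: the candidate readings and SMILES strings below are my own illustrations of how these two names can be parsed, not the structures shown on the slides, and `resolve` and `needs_curation` are hypothetical helpers.

```python
# Hypothetical candidate table: lower-cased name -> [(reading, SMILES), ...]
CANDIDATES = {
    "dichloroacetone": [
        ("1,1-dichloroacetone", "CC(=O)C(Cl)Cl"),
        ("1,3-dichloroacetone", "ClCC(=O)CCl"),
    ],
    "trichloromethylsilane": [
        ("trichloro(methyl)silane", "C[Si](Cl)(Cl)Cl"),
        ("(trichloromethyl)silane", "ClC(Cl)(Cl)[SiH3]"),
    ],
    "aspirin": [
        ("aspirin", "CC(=O)Oc1ccccc1C(=O)O"),
    ],
}


def resolve(name):
    """Return every candidate structure recorded for a name (possibly none)."""
    return CANDIDATES.get(name.strip().lower(), [])


def needs_curation(name):
    """A name is safe to convert automatically only if it has exactly one reading."""
    return len(resolve(name)) != 1


for name in ("Dichloroacetone", "Trichloromethylsilane", "aspirin"):
    status = "ambiguous or unknown" if needs_curation(name) else "unique"
    print(name, "->", status, [reading for reading, _ in resolve(name)])
```

Abbreviations such as DPA and THF fit the same pattern: the table would simply list each expansion as a separate candidate.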
    • Import is Easy
      • Make articles Public/Private (embargo date soon)
      • Auto-markup and check by user
    • IUPAC PAC Articles
    • Supports Word .DOC, HTML, RTF
    • Drexel University Documents
    • Patents
    • Single Configuration File defines entities for markup
      • Algorithms can be built for certain entities but the majority are dictionaries – vendors, Phys Properties, Analytical
      • We can extend our system to support your needs based on dictionaries – what does NPG need/not need?
    • Nature Publications
    • Entity Balloons
      • Structures are the language of chemistry
      • Show structures to chemists and search/link from there
    • Other Dictionaries - Species
      We are considering:
      • Bacteria
      • Fungi
      • Enzymes
      • Viruses
      • PDB codes….
    • Integrations Out to Other Sources
    • Reactions
    • Manual Curation is Always Necessary
    • Text-Indexing and ChemSpider?
      • ChemSpider text-indexes almost 500,000 Open Access and Free Access articles
      • Collection is growing and more publishers have already agreed.
      • Including theses in the future.
    • Open Access Literature Search
    • Conclusions
      • The quality of structure-based data online should always be questioned – that includes ChemSpider
      • Data on ChemSpider are being added and curated on a daily basis, but we always need more eyeballs helping
      • ChemSpider has a large validated structure-name dictionary
      • Chemical name extraction and document markup is very enabling
    • Oops…