Web Crawling Chemistry

    Web Crawling Chemistry - Presentation Transcript

    1. Weaving a New Web for Chemistry Antony Williams
    2. Structure-Enabled Articles
    3. Searching from the Structure Balloon
    4. Imagine a time when ….
      • The internet is searchable by chemical structure and substructure
      • Chemistry articles are indexed and searchable by a free online service
      • Publicly funded research data can be shared and discussed in the Open
      • Cheminformatics has as much of a public face as bioinformatics
    5. ChemSpider - A Search Engine for Chemists
      • Questions a chemist might ask…
        • What is the melting point of n-butanol?
        • What is the chemical structure of Xanax?
        • Chemically, what is phenolphthalein?
        • What are the stereocenters of cholesterol?
        • Where can I find publications about xylene?
        • What are the different trade names for Ketoconazole?
        • What is the NMR spectrum of Aspirin?
        • What are the safety handling issues for Thymol Blue?
        • ChemSpider can answer all of these questions
    6. ChemSpider Data Content
      • Over 21.5 million unique chemical structures from ca. 150 data sources
        • Online Databases – PubChem, DrugBank, HMDB, Wikipedia
        • Chemical Vendors – over 40 different vendors and growing
        • Personal Depositions – individual contributions
        • Journal Publishers
        • Content database vendors
        • Analytical data collections
        • Patents (9 MILLION structures being deposited now)
        • Web scraping
        • Content is generally linked back to the original data sources
    7. Tell me about Aspirin
    8. Tell me about Aspirin
    9. Link outs
    10. Links out to KEGG Kyoto Encyclopedia of Genes and Genomes
    11. Tell me about Aspirin
    12. Tell me about Aspirin
    13. Tell me about Aspirin
    14. Tell me about Aspirin
    15. Tell me about Aspirin
    16. Text-Indexing and ChemSpider?
      • ChemSpider text-indexes almost 500,000 Open Access and Free Access articles
      • Collection is growing weekly and more publishers have already agreed
    17. Open Access Literature Search
    18. Search PubMed – ChemSpider
    19. Other Searches
      • What compounds have a mass of 300 ± 0.001?
      • or search a combination of intrinsic/predicted properties
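
A mass-window query of this kind can be sketched in a few lines. This is only an illustration, not ChemSpider's implementation: the function names, the tiny isotope table and the three-record "database" are all assumptions made for the example, and the formula parser handles only simple formulas without brackets.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes; a deliberately small table.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "Cl": 34.9688527,
}

def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C9H8O4' (no brackets)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[symbol] * (int(count) if count else 1)
    return total

def search_by_mass(records, target, tolerance=0.001):
    """Return names of records whose mass lies within target +/- tolerance."""
    return [
        name
        for name, formula in records
        if abs(monoisotopic_mass(formula) - target) <= tolerance
    ]

# Hypothetical mini-database of (name, formula) pairs.
RECORDS = [("aspirin", "C9H8O4"), ("n-butanol", "C4H10O"), ("thymol", "C10H14O")]
```

Calling `search_by_mass(RECORDS, 180.0423)` would scan the three hypothetical records with the default ±0.001 window; a real service would index precomputed masses rather than recompute them per query.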
    20. Other Searches
    21. Complex Search
    22. The Quality of Data Online…
      • Aggregating data opens up quality issues
      • Structure-identifier associations are “dirty”
      • Structures are COMMONLY incorrect – stereochem issues
      • Manual curation of small databases is enough work – what about millions of structures?
      • Structures are far from perfect. What is a “correct structure”?
        • Full stereochemistry?
        • Historical timeline of structure?
        • Who is the authority?
    23. Who holds THE Quality Authority?
      • Chemical Abstracts Service is the structural authority today. 1400 (?) employees, world standard in chemistry information
      • 101 years of knowledge, process and expertise. MANUAL curation is key. Robotic curation is enabling
      • How can an online, free access system peacefully co-exist with the authority?
    24. Quality is a Major Issue - Search Butanol
    25. Wikipedia – Crowdsourcing Chemistry
    26. Wikipedia Chemistry Curation project
      • Only ca. 5000 organic structures, 7000 total structures
      • MONTHS of work so far for a team of 6 people
      • Many errors removed in the process. Curation process is a daily event for users/depositors
      • Slow and tortuous process for stereo molecules.
    27. Thymol Blue on ChemSpider
      • Data online includes:
        • UV-vis spectrum
        • Measured experimental properties
        • Link to Wikipedia article
        • Links to chromatography details
        • Multiple identifiers/trade names etc.
        • Links to vendors/suppliers/other databases
        • Safety information
    28. Differences between ChemSpider/Wikipedia
      • Wikipedia: No analytical data
      • Wikipedia: Active editors – about 50 (?) | ChemSpider: Active depositors/curators – 30
      • Wikipedia: No prediction of properties
      • Wikipedia: ???? | ChemSpider: 5000 people/day; 1100 registered
      • Wikipedia: Detailed compound monographs | ChemSpider: Compound monographs linked
      • Wikipedia: Text | ChemSpider: Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, …
      • Wikipedia: ~5000 organics, 2000 others | ChemSpider: >21 million unique structures
    29. Differences between Wikipedia/ChemSpider
      • ChemSpider: Growing reputation as focused on quality | Wikipedia: Worldwide reputation as quality source
      • ChemSpider: Chemistry is the focus of the ‘Spider | Wikipedia: Chemistry is a subset of the ‘Pedia
      • ChemSpider: Mixed “licensing” | Wikipedia: GFDL licensing for everything
      • ChemSpider: Growing team of WP:Chem advocates, curators and admins | Wikipedia: Strong team of WP:Chem advocates, curators and admins
      • ChemSpider: “Out of a basement” on three servers and 5 volunteers | Wikipedia: Established infrastructure and Wikimedia Foundation team
      • ChemSpider: Primarily Microsoft .NET technologies with OS components | Wikipedia: Supported by the tried and tested MediaWiki platform
    30. Crowd-sourcing Curation
      • How to curate data for millions of structures?
      • Robot processes can clean up depositions
        • Search for Chloride and check molecular formula for Cl
        • Check for stereochemistry and remove names with stereo
      • Provide a simple-to-use platform to curate, annotate and tag data
      • Provide curator administration to prevent vandalism (Veropedia)
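
The "search for Chloride and check molecular formula for Cl" robot rule can be illustrated roughly as follows. This is a sketch under stated assumptions, not the actual ChemSpider robot: the rule table, function names and record layout are hypothetical, and only simple bracket-free formulas are handled.

```python
import re

def elements_in(formula):
    """Return the set of element symbols in a simple formula like 'C7H7Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))

# Hypothetical rules: a name fragment and the element it implies.
NAME_RULES = {"chloride": "Cl", "bromide": "Br", "fluoride": "F"}

def suspicious_names(names, formula):
    """Flag names that mention a halide the molecular formula does not contain."""
    present = elements_in(formula)
    flagged = []
    for name in names:
        lowered = name.lower()
        for fragment, element in NAME_RULES.items():
            if fragment in lowered and element not in present:
                flagged.append(name)
                break  # one mismatch is enough to flag this name
    return flagged
```

For a record with formula `C7H8` and the names `["Benzyl chloride", "Toluene"]`, only the first name would be flagged. Flagged names are candidates for human review rather than automatic deletion, in keeping with the point that robots help but curators are still needed.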
    31. Multi-level Curation and Approval
    32. Post Comments
      • Anyone can “Post Comments” associated with a structure. To curate data we require login to track
    33. Crowd-sourcing Chemistry
      • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
      • ALSO
      • Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
    34. But, when registered and logged in…
      • Ability to curate and add to the database
        • Add structures
        • “Clean” structures
        • Add data (spectra, CIFs, images)
        • Add links to other pages (URLs)
        • Add publication details
    35. Adding to the Database - Structure
    36. Adding New Text Data
      • Add Publication
      • Add Identifier
      • Add URL
    37. Adding Supplementary Info to a Structure
    38. ChemSpider TouchGraph
    39. Structure-Centric
      • We want to search Open-Access articles by structure, substructure, similarity of structure
      • Standard approaches would be:
        • Identify chemical names “entity extraction”
        • Convert chemical names to structures and index
      • ChemSpider has a validated dictionary of structure-name pairs
      • Use name extraction, name-conversion and dictionary look-up. THEN curate.
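
The dictionary look-up step can be sketched as below. The three-entry dictionary and the function name are assumptions made for illustration; the real validated dictionary is far larger, and the algorithmic name-to-structure conversion and the curation step mentioned above are not shown.

```python
import re

# Hypothetical slice of a validated dictionary: lowercase name -> SMILES.
NAME_TO_SMILES = {
    "dichloromethane": "ClCCl",
    "ethanol": "CCO",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}

def index_article(text):
    """Map each dictionary name found in the text to its structure."""
    found = {}
    for name, smiles in NAME_TO_SMILES.items():
        # Whole-word, case-insensitive match so 'ethanol' does not hit 'methanol'.
        if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE):
            found[name] = smiles
    return found
```

Once every article is reduced to a set of structures like this, structure, substructure and similarity searches can run over the structures and point back to the articles that mention them.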
    40. “Entity Extraction”
      • Rule-based recognition of systematic names:
        • Use a lexeme of name fragments
        • Rules for identifying bounds of a name
      • Look-up dictionary:
        • Drug Names
        • Trivial Names
        • Numbers : Registry IDs, EINECS/ELINCS/Beilstein IDs
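
Numeric identifiers lend themselves to pattern matching plus validation. The sketch below assumes that "Registry IDs" includes CAS Registry Numbers and uses their published check-digit rule (digits weighted 1, 2, 3, … from the right, sum taken modulo 10); the function names are hypothetical and the other ID types listed above are not covered.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")

def is_valid_cas(candidate):
    """Check the format and the check digit of a CAS Registry Number string."""
    match = CAS_PATTERN.fullmatch(candidate)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    checksum = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return checksum % 10 == int(match.group(3))

def find_cas_numbers(text):
    """Return every substring that looks like a CAS number and passes the checksum."""
    return [
        m.group(0)
        for m in CAS_PATTERN.finditer(text)
        if is_valid_cas(m.group(0))
    ]
```

The checksum filters out many look-alike strings such as lot numbers, which keeps false positives out of the extracted entity list before any dictionary look-up takes place.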
    41. Name Recognition
      • Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)
    42. Name Recognition
      • Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
      • The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)
    43. How Many Chemical Names?
      • “She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
    44. How Many Chemical Names?
      • “She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
    45. Making Open Access Articles Searchable Proof of Concept
      • Can we HOST Chemistry Open Access articles on ChemSpider and add value?
      • Can we identify chemical names in Open Access articles in a user-friendly manner?
      • Can we convert names to structures in Open Access articles, expand ChemSpider, and provide structure searching of Open Access chemistry articles?
      • Can we provide an environment for chemists to mark up their own articles and crowd-source markup of an archive?
    46. Document markup
      • ChemSpider is now hosting Open Access articles from MDPI (Molecular Diversity Preservation International)
      • Hosting the Molbank collection at present
    47. A Standard for Document Markup?
      • NLM-DTD: National Library of Medicine; Document Type Definition
      • Approved markup definitions to apply to journal articles – extended as necessary for our purposes
    48. NLM/DTD markup
    49. Chemistry and Biology
    50. Chemistry and Biology
      • Menus can be extended as necessary
    51. Document markup
    52. Searching from the Structure Balloon
    53. A Platform for Markup
      • Can we provide a platform for document markup for chemists?
      • Workflow:
        • Upload word docs, RTF files or point to HTML and load
        • Apply entity extraction, convert names to structures, mark-up automatically and ask for user participation
        • Publish final version with NLM-DTD markup
        • Deposit all structures on ChemSpider under embargo and wait for article DOI to release
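
The automatic mark-up step in this workflow can be sketched as wrapping each recognized name in an XML element. The example assumes the NLM DTD's generic `named-content` element is used for the purpose; the `content-type` value `"chemical"` and the function name are hypothetical, and names that begin or end with punctuation (such as locant brackets) would need a more careful boundary rule.

```python
import re
from xml.sax.saxutils import escape

OPEN_TAG = '<named-content content-type="chemical">'
CLOSE_TAG = "</named-content>"

def mark_up(text, names):
    """Wrap each listed name in a named-content element and escape the rest."""
    if not names:
        return escape(text)
    # Try longer names first so a short name never splits a longer one.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )
    pieces, last = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[last:match.start()]))
        pieces.append(OPEN_TAG + escape(match.group(0)) + CLOSE_TAG)
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)
```

The output of a pass like this is what a chemist would then review and correct before the final version is published and the structures are released from embargo.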
    54. Online Markup
    55. Automated markup
    56. Name to Structure Conversion
    57. Conversion of Structure Images
      • Not all compounds have a “name”
      • Structure images can be converted to connection tables
    58. Cryptomisrine
    59. Structure Conversion from Images-CLiDE
      • Conversion depends on the zoom factor; the right zoom can give perfect conversion!
    60. Supports Word .DOC, HTML, RTF
    61. Extensible Markup Process
      • Markup process is easily extendable
      • Configurable from one XML file
      • NLM/DTD is incorporated but is easy to extend
    62. Tipping Point
      • Tipping point - the point at which a slow gradual change becomes irreversible and then proceeds with gathering pace
    63. Our Challenges
      • There are “no employees”
      • ChemSpider is non-funded
      • System is hyper-dependent on ISP, power and limited compute power
      • We are upsetting some people – specifically “closed” data content providers
    64. What’s Coming?
      • Agreement with Royal Society of Chemistry that we can add their structure-based RSS feeds to ChemSpider
      • Agreement with Nature Publishing Group to add their Nature Chemical Biology structure collections to ChemSpider as they issue
      • Presently indexing Acta Chemica Scandinavica, 1947-1999 PDF backfile – our first foray into OCR
      • Presently indexing PLoS journals directly
      • More publishers have agreed…
    65. Conclusions
      • The quality of structure-based data online should always be questioned – that includes ChemSpider
      • Robots and software algorithms can help but eyeballs are necessary
      • Data on ChemSpider are being added and curated on a daily basis but we need more eyeballs helping always
      • ChemSpider now has a large validated structure-name dictionary
    66. Further reading
      • www.chemspider.com/blog
      • Internet-based tools for communication and collaboration in chemistry, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 502-506, doi:10.1016/j.drudis.2008.03.015
      • A perspective of publicly accessible/open-access chemistry databases, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 495-501, doi:10.1016/j.drudis.2008.03.017
    67. ChemSpider Forums/Blogs
      • Forum.chemspider.com
      • www.chemspider.com/blog
    68. Acknowledgments
      • The ChemSpider team of volunteer developers
      • ChemSpider Advisory Group
      • Our curators, depositors and users
      • Suppliers of commercial software – Microsoft, ACD/Labs, OpenEye, ChemAxon, SimBioSys
      • SureChem – Structure Based Online Patent Searching

    Antony Williams, ChemSpiderman, 2 years ago
