Web Crawling Chemistry
Web Crawling Chemistry Presentation Transcript

  • 1. Weaving a New Web for Chemistry Antony Williams
  • 2. Structure-Enabled Articles
  • 3. Searching from the Structure Balloon
  • 4. Imagine a time when ….
    • The internet is searchable by chemical structure and substructure
    • Chemistry articles are indexed and searchable by a free online service
    • Publicly funded research data can be shared and discussed in the Open
    • Cheminformatics has as much of a public face as bioinformatics
  • 5. ChemSpider - A Search Engine for Chemists
    • Questions a chemist might ask…
      • What is the melting point of n-butanol?
      • What is the chemical structure of Xanax?
      • Chemically, what is phenolphthalein?
      • What are the stereocenters of cholesterol?
      • Where can I find publications about xylene?
      • What are the different trade names for Ketoconazole?
      • What is the NMR spectrum of Aspirin?
      • What are the safety handling issues for Thymol Blue?
      • ChemSpider can answer all of these questions
  • 6. ChemSpider Data Content
    • Over 21.5 million unique chemical structures from ca. 150 data sources
      • Online Databases – PubChem, DrugBank, HMDB, Wikipedia
      • Chemical Vendors – over 40 different vendors and growing
      • Personal Depositions – individual contributions
      • Journal Publishers
      • Content database vendors
      • Analytical data collections
      • Patents (9 million structures being deposited now)
      • Web scraping
      • Content is generally linked back to the original data sources
  • 7. Tell me about Aspirin
  • 8. Tell me about Aspirin
  • 9. Link outs
  • 10. Links out to KEGG Kyoto Encyclopedia of Genes and Genomes
  • 11. Tell me about Aspirin
  • 12. Tell me about Aspirin
  • 13. Tell me about Aspirin
  • 14. Tell me about Aspirin
  • 15. Tell me about Aspirin
  • 16. Text-Indexing and ChemSpider?
    • ChemSpider text-indexes almost 500,000 Open Access and Free Access articles
    • Collection is growing weekly and more publishers have already agreed
  • 17. Open Access Literature Search
  • 18. Search PubMed – ChemSpider
  • 19. Other Searches
    • What compounds have a mass of 300 ± 0.001?
    • or search a combination of intrinsic/predicted properties
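
A mass-window query like the one on this slide can be sketched as a simple tolerance filter. This is a minimal illustration, not ChemSpider's implementation: the `Compound` record, the `search_by_mass` function and the three sample masses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    # Hypothetical record type; real records carry far more fields.
    name: str
    monoisotopic_mass: float


def search_by_mass(compounds, target, tolerance):
    """Return the compounds whose mass lies within target ± tolerance."""
    return [c for c in compounds
            if abs(c.monoisotopic_mass - target) <= tolerance]


# Made-up masses chosen only to show which records fall inside the window.
records = [
    Compound("hypothetical A", 300.0004),
    Compound("hypothetical B", 300.0020),
    Compound("hypothetical C", 299.9995),
]

hits = search_by_mass(records, target=300.0, tolerance=0.001)
print([c.name for c in hits])
```

On a collection of millions of structures a linear scan would be replaced by a range query on an indexed mass column, but the acceptance test is the same.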
  • 20. Other Searches
  • 21. Complex Search
  • 22. The Quality of Data Online…
    • Aggregating data opens up quality issues
    • Structure-identifier associations are “dirty”
    • Structures are COMMONLY incorrect – stereochem issues
    • Manual curation of small databases is enough work – what about millions of structures?
    • Structures are far from perfect. What is a “correct structure”?
      • Full stereochemistry?
      • Historical timeline of structure?
      • Who is the authority?
  • 23. Who holds THE Quality Authority?
    • Chemical Abstracts Service is the structural authority today. 1400 (?) employees, world standard in chemistry information
    • 101 years of knowledge, process and expertise. MANUAL curation is key. Robotic curation is enabling
    • How can an online, free access system peacefully co-exist with the authority?
  • 24. Quality is a Major Issue - Search Butanol
  • 25. Wikipedia – Crowdsourcing Chemistry
  • 26. Wikipedia Chemistry Curation project
    • Only ca. 5000 organic structures, 7000 total structures
    • MONTHS of work so far for a team of 6 people
    • Many errors removed in the process. Curation process is a daily event for users/depositors
    • Slow and torturous process for stereo molecules.
  • 27. Thymol Blue on ChemSpider
    • Data online includes:
      • UV-vis spectrum
      • Measured experimental properties
      • Link to Wikipedia article
      • Links to chromatography details
      • Multiple identifiers/trade names etc.
      • Links to vendors/suppliers/other databases
      • Safety information
  • 28. Differences between ChemSpider/Wikipedia
    • Wikipedia: ~5000 organics, 2000 others | ChemSpider: >21 million unique structures
    • Wikipedia: Text | ChemSpider: Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, …
    • Wikipedia: Detailed compound monographs | ChemSpider: Compound monographs linked
    • Wikipedia: ???? | ChemSpider: 5000 people/day; 1100 registered
    • Wikipedia: Active editors – about 50 (?) | ChemSpider: Active depositors/curators – 30
    • Wikipedia: No Prediction of properties
    • Wikipedia: No Analytical Data
  • 29. Differences between Wikipedia/ChemSpider
    • ChemSpider: Growing reputation as focused on quality | Wikipedia: Worldwide reputation as quality source
    • ChemSpider: Chemistry is the focus of ‘Spider | Wikipedia: Chemistry is a subset of the ‘Pedia
    • ChemSpider: Mixed “licensing” | Wikipedia: GFDL licensing for everything
    • ChemSpider: Growing team of WP:Chem advocates, curators and admins | Wikipedia: Strong team of WP:Chem advocates, curators and admins
    • ChemSpider: “Out of a basement” on three servers and 5 volunteers | Wikipedia: Established infrastructure and Wikimedia Foundation team
    • ChemSpider: Primarily Microsoft .NET technologies with OS components | Wikipedia: Supported by tried and tested MediaWiki platform
  • 30. Crowd-sourcing Curation
    • How to curate data for millions of structures?
    • Robot processes can clean up depositions
      • Search for Chloride and check molecular formula for Cl
      • Check for stereochemistry and remove names with stereo
    • Provide a simple-to-use platform to curate, annotate and tag data
    • Provide curator administration to prevent vandalism (Veropedia)
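
The first robot check on this slide (a name containing "chloride" should belong to a structure whose molecular formula contains Cl) can be sketched as below. It is an illustration under stated assumptions: the function names and sample records are hypothetical, and the formula parser only handles flat formulas without parentheses, charges or isotopes.

```python
import re

# One element symbol (capital letter plus optional lowercase letter)
# followed by an optional count, e.g. "C7", "H", "Cl2".
ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def elements_in(formula):
    """Return the set of element symbols in a flat molecular formula."""
    return {symbol for symbol, _count in ELEMENT_TOKEN.findall(formula)}


def flag_chloride_mismatch(records):
    """Return names that mention chloride while the formula has no Cl."""
    flagged = []
    for name, formula in records:
        if "chloride" in name.lower() and "Cl" not in elements_in(formula):
            flagged.append(name)
    return flagged


# Hypothetical depositions; the second pairs a chloride name with the
# formula of toluene to simulate a dirty name-structure association.
records = [
    ("sodium chloride", "NaCl"),
    ("benzyl chloride", "C7H8"),
    ("ethanol", "C2H6O"),
]

print(flag_chloride_mismatch(records))
```

A check like this only flags candidates; deciding whether the name or the structure is wrong still needs a curator.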
  • 31. Multi-level Curation and Approval
  • 32. Post Comments
    • Anyone can “Post Comments” associated with a structure. To curate data we require a login so that changes can be tracked
  • 33. Crowd-sourcing Chemistry
    • Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
    • ALSO
    • Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
  • 34. But, when registered and logged in…
    • Ability to curate and add to the database
      • Add structures
      • “Clean” structures
      • Add data (spectra, CIFs, images)
      • Add links to other pages (URLs)
      • Add publication details
  • 35. Adding to the Database - Structure
  • 36. Adding New Text Data: Add Publication, Add Identifier, Add URL
  • 37. Adding Supplementary Info to a Structure
  • 38. ChemSpider TouchGraph
  • 39. Structure-Centric
    • We want to search Open-Access articles by structure, substructure, similarity of structure
    • Standard approaches would be:
      • Identify chemical names “entity extraction”
      • Convert chemical names to structures and index
    • ChemSpider has a validated dictionary of structure-name pairs
    • Use name extraction, name-conversion and dictionary look-up. THEN curate.
  • 40. “Entity Extraction”
    • Rule-based recognition of systematic names:
      • Use a lexicon of name fragments
      • Rules for identifying bounds of a name
    • Look-up dictionary:
      • Drug Names
      • Trivial Names
      • Numbers : Registry IDs, EINECS/ELINCS/Beilstein IDs
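
Number-type identifiers lend themselves to pattern matching plus a checksum. The sketch below assumes that "Registry IDs" refers to CAS Registry Numbers, which end in a check digit equal to the weighted sum of the preceding digits (weights 1, 2, 3, … counted from the right) modulo 10. The function names are hypothetical, and EINECS/ELINCS/Beilstein IDs are not handled.

```python
import re

# CAS Registry Number shape: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def has_valid_check_digit(body_digits, check_digit):
    """Weight digits 1, 2, 3, ... from the right and compare sum mod 10."""
    total = sum(int(digit) * weight
                for weight, digit in enumerate(reversed(body_digits), start=1))
    return total % 10 == check_digit


def find_registry_numbers(text):
    """Return substrings that look like CAS numbers and pass the checksum."""
    found = []
    for match in CAS_PATTERN.finditer(text):
        body = match.group(1) + match.group(2)
        if has_valid_check_digit(body, int(match.group(3))):
            found.append(match.group(0))
    return found


# 50-78-2 is the registry number of aspirin; 50-78-3 fails the checksum.
text = "Aspirin (50-78-2) was weighed; the code 50-78-3 is something else."
print(find_registry_numbers(text))
```

The checksum filters out most strings that merely look like registry numbers, which keeps a dictionary look-up from being flooded with false candidates.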
  • 41. Name Recognition
    • Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)
  • 42. Name Recognition
    • Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
    • The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)
  • 43. How Many Chemical Names?
    • “She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
  • 44. How Many Chemical Names?
    • “She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
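
The point of this pair of slides, that a plain dictionary look-up will flag ordinary English words, can be reproduced with a small sketch. The dictionary below is purely illustrative: it is an assumption, not taken from the slides, that a names dictionary would contain entries such as "karate" and "recoil" (for example as product names); only "aspirin" is an unambiguous chemical name here.

```python
import re

# Hypothetical look-up dictionary (lower-cased). The entries other than
# "aspirin" are assumed, for illustration only, to be product names that
# happen to coincide with everyday English words.
NAME_DICTIONARY = {"aspirin", "karate", "recoil"}


def dictionary_hits(text, dictionary):
    """Return every word in the text that matches a dictionary entry."""
    words = re.findall(r"[A-Za-z]+", text)
    return [word for word in words if word.lower() in dictionary]


sentence = ("She was well versed in Karate, causing him to recoil. "
            "He went home and took an aspirin.")

print(dictionary_hits(sentence, NAME_DICTIONARY))
```

Every hit is a valid dictionary match, yet only one is meant as a chemical in context, which is why automatic extraction has to be followed by curation.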
  • 45. Making Open Access Articles Searchable Proof of Concept
    • Can we HOST Chemistry Open Access articles on ChemSpider and add value?
    • Can we identify chemical names in Open Access articles in a user-friendly manner?
    • Can we convert names to structures in Open-Access articles and expand ChemSpider and provide structure searching of Open Access chemistry articles?
    • Can we provide an environment for chemists to mark-up their own articles and crowd-source markup of an archive?
  • 46. Document markup
    • ChemSpider now hosting Open Access articles from MDPI, Molecular Diversity Preservation International
    • Hosting the Molbank collection at present
  • 47. A Standard for Document Markup?
    • NLM-DTD: National Library of Medicine Document Type Definition
    • Approved markup definitions to apply to journal articles – extended as necessary for our purposes
  • 48. NLM/DTD markup
  • 49. Chemistry and Biology
  • 50. Chemistry and Biology
    • Menus can be extended as necessary
  • 51. Document markup
  • 52. Searching from the Structure Balloon
  • 53. A Platform for Markup
    • Can we provide a platform for document markup for chemists?
    • Workflow:
      • Upload Word docs, RTF files or point to HTML and load
      • Apply entity extraction, convert names to structures, mark-up automatically and ask for user participation
      • Publish final version with NLM-DTD markup
      • Deposit all structures on ChemSpider under embargo and wait for article DOI to release
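
The automatic mark-up step of this workflow can be sketched as wrapping each recognized name in an XML element. This is a minimal sketch, not the ChemSpider implementation: `mark_up_names` is a hypothetical helper, and the use of a `named-content` element with `content-type="chemical-name"` is an assumption about how one might tag names within the NLM DTD.

```python
import re
from xml.sax.saxutils import escape


def mark_up_names(text, names):
    """Wrap each known name found in the text in a named-content element."""
    if not names:
        return escape(text)
    # Try longer names first so a long name wins over a name it contains.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(r"\b(?:%s)\b" % alternatives, re.IGNORECASE)

    pieces = []
    last = 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[last:match.start()]))
        pieces.append(
            '<named-content content-type="chemical-name">%s</named-content>'
            % escape(match.group(0)))
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)


sentence = "The mixture was filtered and washed with dichloromethane & dried."
marked = mark_up_names(sentence, ["dichloromethane", "ethanol"])
print(marked)
```

The word-boundary match is a simplification: systematic names that begin with a bracket or locant, such as (3,4-diaminophenyl)phenyl methanone, need the rule-based boundary detection described on the entity-extraction slide.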
  • 54. Online Markup
  • 55. Automated markup
  • 56. Name to Structure Conversion
  • 57. Conversion of Structure Images
    • Not all compounds have a “name”
    • Structure images can be converted to connection tables
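
For readers unfamiliar with the term, a connection table is simply a list of atoms plus a list of bonds between them. The sketch below shows one for ethanol and derives the molecular formula from it. The layout, the `TYPICAL_VALENCE` table and the function names are illustrative assumptions; this is not the output format of any particular image-conversion tool.

```python
from collections import Counter

# Ethanol as a heavy-atom connection table: atoms by index, and bonds as
# (atom index, atom index, bond order). Hydrogens are left implicit.
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]

# Simplified valences used to fill in implicit hydrogens (an assumption
# that holds for neutral C, N and O in ordinary organic molecules).
TYPICAL_VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(atoms, bonds):
    """Return the implicit hydrogen count for each atom."""
    used = [0] * len(atoms)
    for first, second, order in bonds:
        used[first] += order
        used[second] += order
    return [TYPICAL_VALENCE[symbol] - bonded
            for symbol, bonded in zip(atoms, used)]


def molecular_formula(atoms, bonds):
    """Return the formula in Hill order: C, H, then the rest alphabetically."""
    counts = Counter(atoms)
    counts["H"] += sum(implicit_hydrogens(atoms, bonds))
    ordered = [e for e in ("C", "H") if counts[e]]
    ordered += sorted(e for e in counts if e not in ("C", "H") and counts[e])
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "")
                   for e in ordered)


print(molecular_formula(atoms, bonds))
```

Once an image has been reduced to this form, the structure can be searched, compared and checked like any other record.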
  • 58. Cryptomisrine
  • 59. Structure Conversion from Images-CLiDE
    • Conversion depends on the zoom factor; at the right zoom it can give a perfect conversion!
  • 60. Supports Word .DOC, HTML, RTF
  • 61. Extensible Markup Process
    • Markup process is easily extendable
    • Configurable from one XML file
    • NLM/DTD is incorporated but is easy to extend
  • 62. Tipping Point
    • Tipping point - the point at which a slow gradual change becomes irreversible and then proceeds with gathering pace
  • 63. Our Challenges
    • There are “no employees”
    • ChemSpider is non-funded
    • System is hyper-dependent on ISP, power and limited compute power
    • We are upsetting some people – specifically “closed” data content providers
  • 64. What’s Coming?
    • Agreement with Royal Society of Chemistry that we can add their structure-based RSS feeds to ChemSpider
    • Agreement with Nature Publishing Group to add their Nature Chemical Biology structure collections to ChemSpider as they issue
    • Presently indexing Acta Chemica Scandinavica, 1947-1999 PDF backfile – our first foray into OCR
    • Presently indexing PLoS journals directly
    • More publishers have agreed…
  • 65. Conclusions
    • The quality of structure-based data online should always be questioned – that includes ChemSpider
    • Robots and software algorithms can help but eyeballs are necessary
    • Data on ChemSpider are being added and curated on a daily basis but we always need more eyeballs helping
    • ChemSpider now has a large validated structure-name dictionary
  • 66. Further reading
    • www.chemspider.com/blog
    • Internet-based tools for communication and collaboration in chemistry, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 502-506, doi:10.1016/j.drudis.2008.03.015
    • A perspective of publicly accessible/open-access chemistry databases, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 495-501, doi:10.1016/j.drudis.2008.03.017
  • 67. ChemSpider Forums/Blogs
    • Forum.chemspider.com
    • www.chemspider.com/blog
  • 68. Acknowledgments
    • The ChemSpider team of volunteer developers
    • ChemSpider Advisory Group
    • Our curators, depositors and users
    • Suppliers of commercial software – Microsoft, ACD/Labs, OpenEye, ChemAxon, SimBioSys
    • SureChem – Structure Based Online Patent Searching