Text Mining for Chemistry and Building a Public Platform for Document Markup


Identifying chemical names in documents has enabled platforms for structure-based searching of patents and for marking up chemistry publications. A natural extension is to make chemistry articles, blog pages, wiki pages and other documents searchable by the extracted chemical structures. ChemSpider is built on a database of over 21 million unique chemical entities from close to 200 data sources and provides a rich resource of information for chemists. We will report on our efforts to integrate chemical name extraction with the ChemSpider platform to enable structure searching of Open Access chemistry articles and online chemistry materials. We will unveil our online document markup platform, which lets chemists make both their open- and closed-access publications searchable by the language of chemistry: the structure.

Published in: Technology, Education

  1. 1. Text mining for chemistry and building a public platform for document markup Antony Williams
  2. 2. Searching and Reading Articles…
       • Online search tools for chemistry articles are generally text-based
       • Searching articles by chemical structure and substructure is very expensive, but this is changing
       • Text mining is a “hot area” of research, but what is public? What depends on public curation?
  3. 3. Text-Based Search Tools
       • Google
       • Pubmed
       • Google Scholar
       • Publishers' websites
       • And tens of other resources…
  4. 4. Vancomycin Through PubChem
  5. 5. Vancomycin Text Searches
       • Pubmed
       • Google Scholar
  6. 6. Online Structure Searching of Articles
       • Some capabilities from publishers are starting to show up
  7. 7. Publishers should adopt/add InChIs: RSC and Nature Publishing Group have!
  8. 9. ChemMantis - Single Click Mark-up
  9. 10. Name-Structure Pairs
  10. 11. Converting Detected Names…
       • Names are searched against a validated dictionary (this expands as ChemSpider is curated)
       • If not found, they are passed through a Name to Structure (NTS) algorithm
       • If they cannot be converted, ChemSpider is searched for non-validated names
  11. 12. RED Underline: Non-validated, Cannot Convert through NTS
       • “Names” can be added to the Suppress List
  12. 13. BLUE Underline: Name to Structure Converted
  13. 14. Deposit Structures
  14. 15.
       • Entity Extraction built around modified algorithms from SureChem
       • Optimized for “publications”
       • Dictionaries for chemical entities, groups, reactions, elements, families, species…
       • Dictionaries can be expanded – presently adding PDB
  15. 16. Species..
  16. 17. What do you do with a markup system?
       • Test it, show it off and make it available…
       • Tested on chemistry articles, so why not HOST articles?
       • …and create an online journal…
  17. 18. The ChemSpider Journal
  18. 19. Open Access Community Journal
  19. 20. Deposit Article
       • Import URL or Document
       • Copy-Paste
       • Markup
  20. 21. Copy-Paste Version Martin Walker Monthly Article
  21. 22. Chemical names
  22. 23. Names, Elements, Groups, Families
  23. 24. Outlinks
  24. 25. Mark Up Open Access Article
  25. 26. Online Journals and Live Data
  26. 27. A Community Resource of Spectra
       • Spectra deposited on ChemSpider as “Open Data” are available for anybody to “Embed” in their articles, blogs, wikis, etc.
  27. 28. Present Dictionaries
       • Chemical names - ChemSpider Validated Names
       • Reactions - Wikipedia Named Reactions and RSC Reaction Ontology reactions
       • Species – Wikipedia “species”
       • To add – New Dictionaries
            • PDB codes
            • IUPAC Gold Book
  28. 29. Conclusions
       • The internet enables chemistry – and at a reduced cost
       • Web 2.0 is here and improving quality – to benefit 3.0
       • Question Quality!
       • Crowdsourcing for expansion, curation and integration
       • Classical models may die quite quickly – business models must change soon or fail
       • Publishers – heed the proliferation of InChIs for Chemistry
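
The lookup order described on slide 11 (validated dictionary, then Name to Structure, then non-validated names, with a Suppress List for false positives) can be sketched in a few lines of Python. This is only an illustration of that cascade, not the ChemMantis or ChemSpider implementation: the function `resolve_name`, the `Resolution` record, the toy dictionaries and the stub converter are all hypothetical, and structures are written as SMILES strings purely for convenience.

```python
from dataclasses import dataclass
from typing import AbstractSet, Callable, Mapping, Optional


@dataclass(frozen=True)
class Resolution:
    name: str                  # the name as it appeared in the text
    structure: Optional[str]   # SMILES here (an assumption); None if no structure
    source: str                # which step of the cascade produced the result


def resolve_name(
    name: str,
    validated: Mapping[str, str],
    name_to_structure: Callable[[str], Optional[str]],
    non_validated: Mapping[str, str],
    suppress: AbstractSet[str] = frozenset(),
) -> Resolution:
    """Try each lookup in the order given on the slide; the first hit wins."""
    key = " ".join(name.lower().split())  # normalise case and whitespace

    # Suppressed "names" are known false positives and are never marked up.
    if key in suppress:
        return Resolution(name, None, "suppressed")

    # Step 1: validated dictionary.
    if key in validated:
        return Resolution(name, validated[key], "validated")

    # Step 2: Name to Structure conversion (a blue underline on the slides).
    structure = name_to_structure(key)
    if structure is not None:
        return Resolution(name, structure, "name_to_structure")

    # Step 3: non-validated names (a red underline on the slides).
    if key in non_validated:
        return Resolution(name, non_validated[key], "non_validated")

    return Resolution(name, None, "unresolved")


# Toy data, invented for this sketch only.
VALIDATED = {"water": "O"}
NON_VALIDATED = {"grain alcohol": "CCO"}
SUPPRESS = {"reagent"}


def toy_name_to_structure(name: str) -> Optional[str]:
    """Stand-in for a real Name to Structure algorithm; knows one name."""
    return {"ethanol": "CCO"}.get(name)


def demo(name: str) -> Resolution:
    return resolve_name(
        name, VALIDATED, toy_name_to_structure, NON_VALIDATED, SUPPRESS
    )


if __name__ == "__main__":
    for candidate in ["Water", "ethanol", "Grain Alcohol", "reagent", "unobtainium"]:
        print(demo(candidate))
```

Recording the `source` of each hit is what would let a markup layer choose between underline colours, and it keeps curated and machine-converted structures distinguishable when they are later reviewed.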