The application of text and data mining to enhance the RSC publication archive

The Royal Society of Chemistry (RSC) is one of the world’s most prominent scientific societies and STM publishers. Our contributions to the scientific community include a wide range of resources that help chemists access chemistry-related data, information and knowledge. These include ChemSpider, a compound-centric platform linking over 30 million chemical compounds with internet-based resources. Using this compound database and its associated chemical identifiers as a basis, the RSC is applying text and data mining to data-enable its archive of scientific publications. This presentation provides an overview of our technical approaches to text and data mining the archive, how we are developing an integrated database of chemical compounds, reactions, and physical and analytical data, and how it will be used to facilitate scientific discovery.

Published in: Science, Technology, Education

  1. 1. The Application of Text and Data Mining to Enhance the Royal Society of Chemistry Publication Archive • Antony Williams • Emerging Trends in Scholarly Publishing™ Seminar, Washington, April 24th 2014
  2. 2. So, I’m writing an article…
  3. 3. With lots of these….
  4. 4. And these…I will lose data 
  5. 5. Data in Publications • This is not new, you know the story… • So much data of value is contained within a publication and delivered in PDF form • PDF files, and unclear licensing/copyright, limit access to data that I want to rework, reuse, repurpose, text mine etc. • “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with the capabilities I need, and the publishers should just do it”
  6. 6. And over the years, progress… • There is much progress with open access, data access, licensing, enhanced articles, open data, free online tools, open source codes, publishers waking up, scientists contributing • We should be excited at what is available now, what the future holds, what opportunities exist in front of us
  7. 7. It is so difficult to navigate… • What’s the structure? • Are they in our file? • What’s similar? • What’s the target? • Pharmacology data? • Known pathways? • Working on now? • Connections to disease? • Expressed in right cell type? • Competitors? • IP?
  8. 8. “Data enable” publications? • We would LOVE to bring data out of our archive • What could we do? • Find chemical names and generate structures • Find chemical images and generate structures • Find reactions – and make a database! • Find data (MP, BP, LogP) and host. Build models! • Find figures and database them • Find spectra (and link to structures) • Validate the data algorithmically
  9. 9. RSC Archive – since 1841
  10. 10. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  11. 11. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  12. 12. But names = structures • Systematic names can be generated FROM chemical structures algorithmically
  13. 13. But names = structures • …and structures from systematic names
  14. 14. But what of trivial names? • What about trivial names, trade names, CAS numbers, multilingual names etc.?
  15. 15. Searching that lipid in patents
  16. 16. • ~30 million chemicals and growing • Data sourced from >500 different sources • Crowd sourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure centric hub for web-searching • …and a really big dictionary!!!
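
As a rough illustration of how a dictionary can resolve names that name-to-structure algorithms cannot parse (trivial names, trade names and so on), the sketch below tags known names in a sentence by longest match. The four-entry dictionary and the function name `tag_names` are hypothetical and stand in for a curated synonym list; this is not a description of the ChemSpider or DERA implementation.

```python
import re

# Toy synonym dictionary (hypothetical): lower-case name -> SMILES.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "acetylsalicylic acid": "CC(=O)Oc1ccccc1C(=O)O",
    "thionyl chloride": "O=S(Cl)Cl",
    "benzene": "c1ccccc1",
}


def tag_names(text, dictionary=NAME_TO_SMILES):
    """Return (start, end, matched text, SMILES) for each dictionary hit."""
    # Try longer names first so a multi-word name wins over any
    # shorter name that starts at the same position.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

The word-boundary guards keep "benzene" from matching inside a longer name such as "nitrobenzene", which would need its own dictionary entry or a name-to-structure conversion.
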
  17. 17. ChemSpider
  18. 18. ChemSpider
  19. 19. Experimental/Predicted Properties
  20. 20. Literature references
  21. 21. Patents references
  22. 22. Books
  23. 23. Chemical vendors and data sources
  24. 24. Aspirin on ChemSpider
  25. 25. Data Enabling the RSC Archive
  26. 26. How is DERA going? • We have text-mined all 21st century articles… >100k articles from 2000-2013 • Marked up with XML and published onto the HTML forms of the articles • Required multiple iterations based on dictionaries, markup, text mining iterations • New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
  27. 27. Work in Progress
  28. 28. Work in Progress
  29. 29. Work in Progress
  30. 30. Work in Progress
  31. 31. But Context Gives Reactions The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  32. 32. ChemSpider Reactions
  33. 33. Is It Easy?
  34. 34. Is It Easy? (workflow diagram, labels only) • Production processes • Marked-up XML • Text-mining • Dictionary (chemistry): curated dictionaries for known names • Dictionary (ontologies): RSC ontologies (methods, reactions) • Unknown names: automated name to structure conversion (ACD N2S, OPSIN) • Chemical structures (SD file) • CDX integration (coming soon) • XML ready for publication
  35. 35. So..compounds and reactions • ChemSpider is a compounds repository • We are building a Reactions Repository • “Reaction Validation” procedures to check data • Ontological approaches to classify the reactions • But why stop at chemicals and reactions?
  36. 36. Compounds Database
  37. 37. Reactions Database
  38. 38. Analytical Data Database
  39. 39. But publication data is FIGURES
  40. 40. So Turn “Figures” Into Data EXTRACTED DATA FIGURE
  41. 41. Early Test Experiments • 74 supplementary data documents / 3444 pages • Extracted content in 1069 page instances to produce 1151 spectra, > 80% of peaks extracted to within 1-2 decimal places • Working on batch extraction and production of spectral data
  42. 42. Validating Spectra • How will we check data consistency? • How do we know the structure and the spectra match? • Predict spectra and use algorithmic checking. • Flag “suspect data” and crowd source data checking
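
One simple algorithmic check, far simpler than predicting a spectrum and offered only as an illustration, is whether the protons integrated in a reported 1H NMR spectrum add up to the hydrogens in the molecular formula. The function names `hydrogen_count` and `is_suspect` are hypothetical, and the formula parser assumes a flat formula without brackets or charges.

```python
import re


def hydrogen_count(formula):
    """Count H atoms in a simple molecular formula such as 'C9H8O4'."""
    total = 0
    # Each token is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "H":
            total += int(count) if count else 1
    return total


def is_suspect(formula, integrals):
    """Flag a record whose integrated protons do not match the formula."""
    return sum(integrals) != hydrogen_count(formula)
```

A mismatch is only a flag for human review, not proof of an error: exchangeable protons (OH, NH) are often not reported, so a record flagged this way would still need the crowd-sourced checking mentioned on the slide.
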
  43. 43. ESI – Text Spectra
  44. 44. Lots of “Textual Spectra”
  45. 45. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
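
A textual spectrum like the one above is regular enough to be turned into structured peaks with a pattern match. The following is a minimal sketch, assuming each peak is written as a shift (or shift range) followed by "(multiplicity, nH, …)"; the names `Peak`, `PEAK_RE` and `parse_text_spectrum` are hypothetical, and coupling constants and assignments are deliberately ignored.

```python
import re
from typing import NamedTuple, Optional


class Peak(NamedTuple):
    shift_ppm: float                 # chemical shift, or low end of a range
    shift_high_ppm: Optional[float]  # high end when a range is reported
    multiplicity: str                # e.g. "m", "d", "dd"
    protons: int                     # integrated number of protons


PEAK_RE = re.compile(
    r"(\d+\.\d+)(?:\s*[–-]\s*(\d+\.\d+))?"  # shift, optionally a range
    r"\s*\(\s*([a-z]+)\s*,\s*(\d+)H"        # "(m, 4H"
)


def parse_text_spectrum(text):
    """Extract the peaks from a 1H NMR description written as text."""
    peaks = []
    for low, high, mult, n in PEAK_RE.findall(text):
        peaks.append(
            Peak(float(low), float(high) if high else None, mult, int(n))
        )
    return peaks
```

Requiring an opening parenthesis straight after the number is what keeps coupling constants such as "4.8 Hz" and the "400 MHz" field strength from being read as peaks.
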
  46. 46. Visualization of Spectral Data • For spectra associated with compounds we will be viewing “interactive spectra”
  47. 47. What are we extracting? • Compounds from compound names • Reactions from the text • Spectral extraction – from figures and text • Extraction of data from “tables” – not only CSV files but tables in the publication
  48. 48. BUT I hate text mining data • DERA: using pipelining tools for text-mining so we will be able to process documents for mark-up • Compound extraction/markup • Reaction extraction/conversion • Extract data from tables • Convert “text spectra” to generate spectral libraries • REALLY???? AGGHHHHH!
  49. 49. DERA is FINE for an archive. The WRONG WAY otherwise! • We should NOT be mining data out of future publications • Structures should be submitted “correctly” • Spectra should be submitted in digital spectral formats, not as images • ESI should be RICH and interactive • Data should be open, available, with metadata and provenance
  50. 50. Advanced ESI
  51. 51. We can solve for Authors here Will it be used though???
  52. 52. ChemSpider as a Foundation • >30 million chemicals (and growing) with associated experimental and predicted property data, analytical data, links out to hundreds of data sources, patents, journal articles, books etc…is a lot of data! • ChemSpider is free to access for everyone – and the API means people program against it • What projects can we benefit?
  53. 53. Support grant-based services • Multiple European consortium-based grants • PharmaSea (FP7 funded) • Open PHACTS (IMI funded) • UK National Chemical Database Service (http://cds.rsc.org) – developing a data repository for lab data, integrating Electronic Lab Notebooks • Open Drug Discovery projects
  54. 54. • 3-year Innovative Medicines Initiative project • Integrating chemistry and biology data using semantic web technologies • Open code, open data, open standards • Academics, Pharmas, Publishers… • To put medicines in the pipeline…
  55. 55. The Open PHACTS community ecosystem
  56. 56. Open Source Drug Discovery India
  57. 57. Conclusions • Great progress in mining the archive for compounds • Reaction extraction and spectral data are underway • All of the resulting data will be available to the chemistry community
  58. 58. And that article I’m writing
  59. 59. The Figures will be data too
  60. 60. Every compound will live
  61. 61. And linking will InChI forward
  62. 62. Structure Searching the Web
  63. 63. Thank you Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
