The application of text and data mining to enhance the RSC publication archive

The Royal Society of Chemistry (RSC) is one of the world’s most prominent scientific societies and STM publishers. We deliver a wide range of resources that help the chemistry community access chemistry-related data, information and knowledge. These include ChemSpider, a compound-centric platform linking over 30 million chemical compounds with internet-based resources. Using this compound database and its associated chemical identifiers as a basis, the RSC is applying text and data mining to data-enable its archive of scientific publications. This presentation gives an overview of our technical approaches to text- and data-enabling that archive, how we are developing an integrated database of chemical compounds, reactions, and physical and analytical data, and how it will be used to facilitate scientific discovery.

Published in: Science, Technology, Education

Transcript of "The application of text and data mining to enhance the RSC publication archive"

  1. The Application of Text and Data Mining to Enhance the Royal Society of Chemistry Publication Archive. Antony Williams. Emerging Trends in Scholarly Publishing™ Seminar, Washington, April 24th 2014
  2. So, I’m writing an article…
  3. With lots of these…
  4. And these… I will lose data :(
  5. Data in Publications • This is not new, you know the story… • So much valuable data is contained within a publication and delivered in PDF form • PDF files, and unclear licensing/copyright, limit my access to data that I want to rework, reuse, repurpose, text mine, etc. • “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with the capabilities I need, and the publishers should just do it”
  6. And over the years, progress… • There is much progress with open access, data access, licensing, enhanced articles, open data, free online tools, open source codes, publishers waking up, scientists contributing • We should be excited at what is available now, what the future holds, what opportunities exist in front of us
  7. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? Competitors? IP?
  8. “Data enable” publications? • We would LOVE to bring data out of our archive • What could we do? • Find chemical names and generate structures • Find chemical images and generate structures • Find reactions – and make a database! • Find data (MP, BP, LogP) and host. Build models! • Find figures and database them • Find spectra (and link to structures) • Validate the data algorithmically
  9. RSC Archive – since 1841
  10. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  11. Text Mining (same passage as slide 10)
  12. But names = structures • Systematic names can be generated FROM chemical structures algorithmically
  13. But names = structures • …and structures from systematic names
  14. But what of trivial names? • What about trivial names, trade names, CAS numbers, multilingual names etc.?
  15. Searching that lipid in patents
  16. • ~30 million chemicals and growing • Data sourced from >500 different sources • Crowd sourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure centric hub for web-searching • …and a really big dictionary!!!
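
The "really big dictionary" idea, resolving trivial names, CAS numbers and multilingual names to one structure, can be sketched as a synonym lookup. This is a minimal illustration and not ChemSpider's implementation: `SYNONYMS`, `normalize` and `lookup` are hypothetical names, and the four aspirin entries stand in for a curated list of millions of synonyms.

```python
from typing import Optional

ASPIRIN_SMILES = "CC(=O)Oc1ccccc1C(=O)O"

# Illustrative entries only (assumption): every synonym maps to one structure.
SYNONYMS = {
    "aspirin": ASPIRIN_SMILES,                  # trivial name
    "acetylsalicylic acid": ASPIRIN_SMILES,     # common chemical name
    "acide acétylsalicylique": ASPIRIN_SMILES,  # French name
    "50-78-2": ASPIRIN_SMILES,                  # CAS Registry Number
}


def normalize(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def lookup(name: str) -> Optional[str]:
    """Return the structure (as SMILES) for a known synonym, else None."""
    return SYNONYMS.get(normalize(name))


if __name__ == "__main__":
    for query in ["Aspirin", "50-78-2", "Acetylsalicylic   acid", "unobtainium"]:
        print(query, "->", lookup(query))
```

Names that miss the dictionary return `None`, which is the point at which an automated name-to-structure converter would take over.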
  17. ChemSpider
  18. ChemSpider
  19. Experimental/Predicted Properties
  20. Literature references
  21. Patents references
  22. Books
  23. Chemical vendors and data sources
  24. Aspirin on ChemSpider
  25. Data Enabling the RSC Archive
  26. How is DERA going? • We have text-mined all 21st-century articles: >100k articles from 2000–2013 • Marked up with XML and published onto the HTML forms of the articles • Required multiple iterations of dictionaries, markup and text mining • New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
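
The dictionary-driven XML markup step can be illustrated roughly as below. This is a sketch under stated assumptions, not the DERA pipeline: the `<compound id="…">` element, the `mark_up` function and the identifiers `c1`/`c2` are invented for the example, and dictionary keys are assumed to be lower-case.

```python
import re
from typing import Dict
from xml.sax.saxutils import escape, quoteattr


def mark_up(text: str, dictionary: Dict[str, str]) -> str:
    """Wrap every dictionary name found in `text` in a <compound> element.

    `dictionary` maps lower-case names to identifiers (hypothetical IDs here).
    """
    if not dictionary:
        return escape(text)
    # Longest names first, so "thionyl chloride" wins over a shorter overlap.
    names = sorted(dictionary, key=len, reverse=True)
    alternatives = "|".join(re.escape(n) for n in names)
    # The lookarounds stop "benzene" from matching inside "chlorobenzene".
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)

    parts = []
    pos = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        ident = dictionary[match.group(0).lower()]
        parts.append(
            f"<compound id={quoteattr(ident)}>{escape(match.group(0))}</compound>"
        )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    names = {"thionyl chloride": "c1", "benzene": "c2"}
    print(mark_up("Thionyl chloride & chlorobenzene in benzene", names))
```

Escaping the untouched text keeps the output well-formed XML even when the article text contains `&` or `<`.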
  27. Work in Progress
  28. Work in Progress
  29. Work in Progress
  30. Work in Progress
  31. But Context Gives Reactions The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
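
As a rough illustration of how context around a name yields reaction data, the sketch below pulls "name ( amount unit )" triples out of a fragment of the passage. It is a simple regular-expression heuristic written for this example, not the method used for ChemSpider Reactions; `extract_quantities`, the unit list and the leading-word cleanup are assumptions.

```python
import re
from typing import List, Tuple

# A name made of letters, spaces and hyphens, followed by "( <number> <unit> )".
QUANTITY = re.compile(
    r"([A-Za-z][A-Za-z\- ]*?)\s*\(\s*(\d+(?:\.\d+)?)\s*(mmol|mol|ml|mg|g)\s*\)"
)
# Connecting words that the name pattern may pick up at its start.
LEADING_WORDS = re.compile(r"^(?:and|with|in)\s+")


def extract_quantities(text: str) -> List[Tuple[str, float, str]]:
    """Return (name, amount, unit) for each parenthesised quantity in `text`."""
    results = []
    for match in QUANTITY.finditer(text):
        name = LEADING_WORDS.sub("", match.group(1).strip())
        results.append((name, float(match.group(2)), match.group(3)))
    return results


if __name__ == "__main__":
    fragment = (
        "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) "
        "were charged into a glass reaction vessel"
    )
    for row in extract_quantities(fragment):
        print(row)
```

A heuristic like this cannot handle systematic names containing digits, commas or brackets; those need the dictionary and name-to-structure steps first.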
  32. ChemSpider Reactions
  33. Is It Easy?
  34. Is It Easy? [Workflow diagram; box labels: RSC ontologies (methods, reactions) · Dictionary (ontologies) · Dictionary (chemistry) · Curated dictionaries for known names · Unknown names: automated name-to-structure conversion (ACD N2S, OPSIN) · Text-mining · Marked-up XML · XML ready for publication · Production processes · CDX integration (coming soon) · Chemical structures · SD file]
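
The two routes named in the diagram, curated dictionaries for known names and automated name-to-structure conversion for unknown names, can be sketched as a lookup with fallbacks. Everything here is hypothetical scaffolding: `resolve` and `toy_converter` are invented names, and the toy converter stands in for a real tool such as ACD N2S or OPSIN, whose actual interfaces are not shown.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A converter takes a chemical name and returns a structure string or None.
Converter = Callable[[str], Optional[str]]


def resolve(
    name: str,
    dictionary: Dict[str, str],
    converters: List[Tuple[str, Converter]],
) -> Tuple[Optional[str], str]:
    """Return (structure, source): dictionary first, then each converter."""
    key = " ".join(name.lower().split())
    if key in dictionary:
        return dictionary[key], "dictionary"
    for label, convert in converters:
        structure = convert(name)
        if structure:
            return structure, label
    return None, "unresolved"


def toy_converter(name: str) -> Optional[str]:
    """Placeholder for a name-to-structure tool; it only knows one name."""
    return {"ethanol": "CCO"}.get(name.strip().lower())


if __name__ == "__main__":
    curated = {"benzene": "c1ccccc1"}
    tools = [("toy-converter", toy_converter)]
    for query in ["Benzene", "ethanol", "not a chemical"]:
        print(query, "->", resolve(query, curated, tools))
```

Recording which route produced each structure makes it easier to review dictionary hits and algorithmic conversions separately.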
  35. So… compounds and reactions • ChemSpider is a compounds repository • We are building a Reactions Repository • “Reaction Validation” procedures to check data • Ontological approaches to classify the reactions • But why stop at chemicals and reactions?
  36. Compounds Database
  37. Reactions Database
  38. Analytical Data Database
  39. But publication data is FIGURES
  40. So Turn “Figures” Into Data (image labels: FIGURE, EXTRACTED DATA)
  41. Early Test Experiments • 74 supplementary data documents / 3444 pages • Extracted content in 1069 page instances to produce 1151 spectra, > 80% of peaks extracted to within 1-2 decimal places • Working on batch extraction and production of spectral data
  42. Validating Spectra • How will we check data consistency? • How do we know the structure and the spectra match? • Predict spectra and use algorithmic checking. • Flag “suspect data” and crowd source data checking
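
One way to picture the "predict, check algorithmically, flag suspect data" idea is a tolerance comparison between extracted and predicted peak positions. This is only a sketch: the functions, the 0.2 ppm tolerance and the 25% threshold are assumptions, and the peak lists in the usage example are invented for illustration rather than taken from any prediction tool.

```python
from typing import List


def unmatched_peaks(
    observed: List[float], predicted: List[float], tol: float = 0.2
) -> List[float]:
    """Observed shifts (ppm) with no predicted shift within `tol` ppm."""
    return [
        obs for obs in observed
        if not any(abs(obs - pred) <= tol for pred in predicted)
    ]


def is_suspect(
    observed: List[float],
    predicted: List[float],
    tol: float = 0.2,
    max_unmatched_fraction: float = 0.25,
) -> bool:
    """Flag a spectrum when too many observed peaks lack a predicted partner."""
    if not observed:
        return True  # nothing extracted: send for manual checking
    missing = unmatched_peaks(observed, predicted, tol)
    return len(missing) / len(observed) > max_unmatched_fraction


if __name__ == "__main__":
    observed = [2.57, 4.24, 9.80]   # invented example values
    predicted = [2.50, 4.30, 7.20]  # invented example values
    print(unmatched_peaks(observed, predicted))
    print(is_suspect(observed, predicted))
```

Spectra flagged this way would then go to the crowd-sourced checking step instead of being rejected automatically.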
  43. ESI – Text Spectra
  44. Lots of “Textual Spectra”
  45. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
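
A "textual spectrum" like the one on this slide is regular enough to parse into peak records. The sketch below is a minimal regular-expression parser written for this example, not the RSC's converter; `parse_peaks` and the tuple layout (low shift, high shift, multiplicity, hydrogen count) are assumptions.

```python
import re
from typing import List, Tuple

# shift or shift range, then "(multiplicity, nH": e.g. "7.18–7.94 (m, 11H"
PEAK = re.compile(
    r"(\d+\.\d+)(?:[–-](\d+\.\d+))?\s*\(\s*([a-z]+)\s*,\s*(\d+)H"
)

Peak = Tuple[float, float, str, int]  # (low ppm, high ppm, multiplicity, nH)


def parse_peaks(text: str) -> List[Peak]:
    """Extract peaks from a textual 1H NMR listing."""
    peaks = []
    for low, high, mult, n_h in PEAK.findall(text):
        low_ppm = float(low)
        high_ppm = float(high) if high else low_ppm
        peaks.append((low_ppm, high_ppm, mult, int(n_h)))
    return peaks


SLIDE_TEXT = (
    "δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), "
    "4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), "
    "4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), "
    "7.18–7.94 (m, 11H, ArH)"
)

if __name__ == "__main__":
    peaks = parse_peaks(SLIDE_TEXT)
    for peak in peaks:
        print(peak)
    print("total H:", sum(p[3] for p in peaks))
```

Summing the hydrogen counts gives a quick consistency check against the molecular formula of the assigned structure. Coupling constants are ignored here because they are not followed by a bracketed multiplicity.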
  46. Visualization of Spectral Data • For spectra associated with compounds we will be viewing “interactive spectra”
  47. What are we extracting? • Compounds from compound names • Reactions from the text • Spectral extraction – from figures and text • Extraction of data from “tables” – not only CSV files but tables in the publication
  48. BUT I hate text mining data • DERA: using pipelining tools for text-mining so we will be able to process documents for mark-up • Compound extraction/markup • Reaction extraction/conversion • Extract data from tables • Convert “text spectra” to generate spectral libraries • REALLY???? AGGHHHHH!
  49. DERA is FINE for an archive. The WRONG WAY otherwise! • We should NOT be mining data out of future publications • Structures should be submitted “correctly” • Spectra should be in digital spectral formats, not images • ESI should be RICH and interactive • Data should be open and available, with metadata and provenance
  50. Advanced ESI
  51. We can solve for Authors here. Will it be used though???
  52. ChemSpider as a Foundation • >30 million chemicals (and growing) with associated experimental and predicted property data, analytical data, links out to hundreds of data sources, patents, journal articles, books etc… is a lot of data! • ChemSpider is free to access for everyone – and the API means people program against it • What projects can we benefit?
  53. Support grant-based services • Multiple European consortium-based grants • PharmaSea (FP7 funded) • Open PHACTS (IMI funded) • UK National Chemical Database Service (http://cds.rsc.org) – developing a data repository for lab data, integrating Electronic Lab Notebooks • Open Drug Discovery projects
  54. • 3-year Innovative Medicines Initiative project • Integrating chemistry and biology data using semantic web technologies • Open code, open data, open standards • Academics, Pharmas, Publishers… • To put medicines in the pipeline…
  55. The Open PHACTS community ecosystem
  56. Open Source Drug Discovery India
  57. Conclusions • Great progress in mining the archive for compounds • Reaction extraction and spectral data are underway • All of the resulting data will be available to the chemistry community
  58. And that article I’m writing
  59. The Figures will be data too
  60. Every compound will live
  61. And linking will InChI forward
  62. Structure Searching the Web
  63. Thank you • Email: williamsa@rsc.org • ORCID: 0000-0002-2668-4821 • Twitter: @ChemConnector • Personal Blog: www.chemconnector.com • SLIDES: www.slideshare.net/AntonyWilliams