The application of text and data mining to enhance the RSC publication archive

The Royal Society of Chemistry (RSC) is one of the world’s most prominent scientific societies and STM publishers. Our contributions to the scientific community include a wide range of resources that help chemists access chemistry-related data, information and knowledge. These include ChemSpider, a compound-centric platform linking over 30 million chemical compounds with internet-based resources. Using this compound database and its associated chemical identifiers as a basis, the RSC is applying text and data mining approaches to data-enable our archive of scientific publications. This presentation provides an overview of our technical approaches to text- and data-enabling the archive, how we are developing an integrated database of chemical compounds, reactions, and physical and analytical data, and how it will be used to facilitate scientific discovery.

Usage Rights: CC Attribution-NonCommercial License

    Presentation Transcript

    • The Application of Text and Data Mining to Enhance the Royal Society of Chemistry Publication Archive Antony Williams Emerging Trends in Scholarly Publishing™ Seminar, Washington, April 24th 2014
    • So, I’m writing an article…
    • With lots of these….
    • And these…I will lose data 
    • Data in Publications • This is not new, you know the story… • So much data of value is contained within a publication and delivered in a PDF form • PDF files, and unclear licensing/copyright, limit access to data so I can rework, reuse, repurpose, text mine etc. • “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with the capabilities I need, and the publishers should just do it”
    • And over the years, progress… • There is much progress with open access, data access, licensing, enhanced articles, open data, free online tools, open source codes, publishers waking up, scientists contributing • We should be excited at what is available now, what the future holds, what opportunities exist in front of us
    • It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? Competitors? IP?
    • “Data enable” publications? • We would LOVE to bring data out of our archive • What could we do? • Find chemical names and generate structures • Find chemical images and generate structures • Find reactions – and make a database! • Find data (MP, BP, LogP) and host. Build models! • Find figures and database them • Find spectra (and link to structures) • Validate the data algorithmically
    • RSC Archive – since 1841
    • Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
    • But names = structures • Systematic names can be generated FROM chemical structures algorithmically
    • But names = structures • …and structures from systematic names
    • But what of trivial names? • What about trivial names, trade names, CAS numbers, multilingual names etc.?
    • Searching that lipid in patents
    • • ~30 million chemicals and growing • Data sourced from >500 different sources • Crowd sourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure centric hub for web-searching • …and a really big dictionary!!!
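
The "really big dictionary" idea can be illustrated with a minimal sketch: many names (systematic, trivial, registry numbers) resolve to one structure key. The table below is a hand-made toy, not ChemSpider data, and `normalize` and `lookup` are hypothetical helper names. The InChIKeys shown are the standard keys for aspirin and benzene.

```python
import re

# Toy synonym dictionary (an assumption for illustration only):
# every normalized name maps to a single structure key.
ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
BENZENE = "UHOVQNZJYSORNB-UHFFFAOYSA-N"

SYNONYMS = {
    "aspirin": ASPIRIN,                 # trivial name
    "acetylsalicylic acid": ASPIRIN,    # common name
    "2-acetoxybenzoic acid": ASPIRIN,   # systematic name
    "50-78-2": ASPIRIN,                 # CAS registry number
    "benzene": BENZENE,
}


def normalize(name):
    """Lower-case the name and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", name.strip().lower())


def lookup(name):
    """Return the structure key for a name, or None if it is unknown."""
    return SYNONYMS.get(normalize(name))


same_compound = lookup("  Acetylsalicylic   Acid ") == lookup("aspirin")
unknown = lookup("not a real compound")
```

Names that miss the dictionary (here `unknown` is `None`) are the cases that would fall through to automated name-to-structure conversion.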
    • ChemSpider
    • Experimental/Predicted Properties
    • Literature references
    • Patent references
    • Books
    • Chemical vendors and data sources
    • Aspirin on ChemSpider
    • Data Enabling the RSC Archive
    • How is DERA going? • We have text-mined all 21st century articles… >100k articles from 2000-2013 • Marked up with XML and published onto the HTML forms of the articles • Required multiple iterations based on dictionaries, markup, text mining iterations • New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
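
As a rough sketch of dictionary-driven markup, the function below wraps recognized names in an XML element. The `<compound ref="…">` element, the `cmpd-N` identifiers and the `mark_up` name are all assumptions made for illustration; the actual RSC schema is not shown in the slides.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def mark_up(text, dictionary):
    """Wrap dictionary names found in text in <compound> tags.

    `dictionary` maps lower-case names to identifiers.
    """
    if not dictionary:
        return escape(text)
    # Longest names first so "thionyl chloride" wins over "chloride".
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    out = []
    pos = 0
    for m in pattern.finditer(text):
        out.append(escape(text[pos:m.start()]))
        ident = dictionary[m.group(1).lower()]
        out.append(
            "<compound ref=%s>%s</compound>" % (quoteattr(ident), escape(m.group(1)))
        )
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


# Placeholder identifiers, not real database IDs.
DICTIONARY = {
    "thionyl chloride": "cmpd-1",
    "benzene": "cmpd-2",
    "chloride": "cmpd-3",
}

marked = mark_up("thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged", DICTIONARY)
```

Sorting the names longest-first is the design choice that matters here: without it a short entry could claim part of a longer name.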
    • Work in Progress
    • But Context Gives Reactions The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
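
To show how context can turn recognized names into reaction data, here is a minimal sketch that pulls quantities and the product out of an abridged copy of the paragraph above. The regular expressions are tuned to this one passage and the helper names `amounts_for` and `product_of` are hypothetical. A production pipeline would need real language processing rather than patterns like these.

```python
import re

# Abridged copy of the example paragraph.
TEXT = (
    "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a "
    "glass reaction vessel . The reaction mixture was heated at reflux . "
    "The benzene and unreacted thionyl chloride were stripped to yield the "
    "desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-"
    "1,3,4-thiadiazol-5-yl)urea as a solid residue"
)

# An amount in parentheses directly after a name, e.g. "( 5 ml )".
AMOUNT = r"\s*\(\s*(\d+(?:\.\d+)?)\s*(ml|l|mg|g|mmol|mol)\s*\)"


def amounts_for(text, names):
    """Map each known name to (value, unit) if an amount follows it."""
    found = {}
    for name in names:
        m = re.search(re.escape(name) + AMOUNT, text, re.IGNORECASE)
        if m:
            found[name] = (float(m.group(1)), m.group(2).lower())
    return found


def product_of(text):
    """Return the token after 'to yield ... product', or None."""
    m = re.search(r"to yield (?:the )?(?:desired )?product\s+(\S+)", text)
    return m.group(1) if m else None


inputs = amounts_for(TEXT, ["thionyl chloride", "benzene"])
product = product_of(TEXT)
```

The sketch only separates materials charged with a stated amount from the named product. It does not try to decide which input is the reagent and which is the solvent.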
    • ChemSpider Reactions
    • Is It Easy?
    • Is It Easy? [Workflow diagram. Labels: XML ready for publication; Text-mining; Dictionary (ontologies): RSC ontologies (methods, reactions); Dictionary (chemistry): curated dictionaries for known names; Unknown names: automated name to structure conversion (ACD N2S, OPSIN); Marked-up XML; Production processes; Chemical structures: SD file; CDX integration (coming soon)]
    • So..compounds and reactions • ChemSpider is a compounds repository • We are building a Reactions Repository • “Reaction Validation” procedures to check data • Ontological approaches to classify the reactions • But why stop at chemicals and reactions?
    • Compounds Database
    • Reactions Database
    • Analytical Data Database
    • But publication data is FIGURES
    • So Turn “Figures” Into Data [Figure panels: FIGURE, EXTRACTED DATA]
    • Early Test Experiments • 74 supplementary data documents / 3444 pages • Extracted content in 1069 page instances to produce 1151 spectra, > 80% of peaks extracted to within 1-2 decimal places • Working on batch extraction and production of spectral data
    • Validating Spectra • How will we check data consistency? • How do we know the structure and the spectra match? • Predict spectra and use algorithmic checking. • Flag “suspect data” and crowd source data checking
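
One possible form of the algorithmic check is sketched below: compare extracted shifts with predicted ones and flag a spectrum when an extracted peak has no prediction nearby. The predicted values, the 0.3 ppm tolerance and the function names are all assumptions chosen for illustration. The slides do not say which prediction method or thresholds are used.

```python
def unmatched_peaks(observed, predicted, tolerance=0.3):
    """Observed shifts (ppm) with no predicted shift within tolerance."""
    return [
        o for o in observed
        if all(abs(o - p) > tolerance for p in predicted)
    ]


def is_suspect(observed, predicted, tolerance=0.3, max_unmatched=0):
    """Flag the spectrum if too many observed peaks are unexplained."""
    return len(unmatched_peaks(observed, predicted, tolerance)) > max_unmatched


# Made-up predicted shifts for a hypothetical structure.
predicted = [2.5, 4.3, 7.0]

consistent = is_suspect([2.6, 4.2, 7.1], predicted)   # every peak explained
mismatch = is_suspect([2.6, 4.2, 9.8], predicted)     # 9.8 ppm unexplained
```

Spectra flagged this way would be the candidates for the crowdsourced checking mentioned above.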
    • ESI – Text Spectra
    • Lots of “Textual Spectra”
    • 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
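
A "textual spectrum" like the one above can be turned into structured peaks with a small parser. This is a minimal sketch, assuming the common layout of a shift (or shift range) followed by a parenthesized field list. `parse_peaks` and `balanced_end` are hypothetical names, and real supplementary data varies far more than this handles.

```python
import re

SPECTRUM = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)

# A shift or shift range followed by an opening parenthesis.
SHIFT = re.compile(r"(\d+(?:\.\d+)?)(?:\s*[–-]\s*(\d+(?:\.\d+)?))?\s*\(")


def balanced_end(text, start):
    """Index just past the ')' that closes the '(' at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced parentheses")


def parse_peaks(spectrum):
    """Parse the peak list that follows the delta symbol."""
    body = spectrum.split("δ", 1)[1]
    peaks = []
    pos = 0
    while True:
        m = SHIFT.search(body, pos)
        if not m:
            break
        open_idx = m.end() - 1
        end = balanced_end(body, open_idx)
        fields = [f.strip() for f in body[open_idx + 1:end - 1].split(",")]
        protons = next(
            (int(f[:-1]) for f in fields if re.fullmatch(r"\d+H", f)), None
        )
        peaks.append({
            "shift": float(m.group(1)),
            "shift_end": float(m.group(2)) if m.group(2) else None,
            "multiplicity": fields[0],
            "protons": protons,
        })
        pos = end  # skip past this peak's field list
    return peaks


peaks = parse_peaks(SPECTRUM)
total_protons = sum(p["protons"] for p in peaks)
```

The summed integrations (`total_protons`) give a simple consistency check: the total can be compared with the hydrogen count of the structure the spectrum is attached to.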
    • Visualization of Spectral Data • For spectra associated with compounds we will be viewing “interactive spectra”
    • What are we extracting? • Compounds from compound names • Reactions from the text • Spectral extraction – from figures and text • Extraction of data from “tables” – not only CSV files but tables in the publication
    • BUT I hate text mining data • DERA: using pipelining tools for text-mining so we will be able to process documents for mark-up • Compound extraction/markup • Reaction extraction/conversion • Extract data from tables • Convert “text spectra” to generate spectral libraries • REALLY???? AGGHHHHH!
    • DERA is FINE for an archive. The WRONG WAY otherwise! • We should NOT be mining data out of future publications • Structures should be submitted “correctly” • Spectra should be in digital spectral formats, not images • ESI should be RICH and interactive • Data should be open and available, with metadata and provenance
    • Advanced ESI
    • We can solve for Authors here. Will it be used though???
    • ChemSpider as a Foundation • >30 million chemicals (and growing) with associated experimental and predicted property data, analytical data, links out to hundreds of data sources, patents, journal articles, books etc…is a lot of data! • ChemSpider is free to access for everyone – and the API means people program against it • What projects can we benefit?
    • Support grant-based services • Multiple European consortium-based grants • PharmaSea (FP7 funded) • Open PHACTS (IMI funded) • UK National Chemical Database Service (http://cds.rsc.org) – developing data repository for lab data, integrate Electronic Lab Notebooks • Open Drug Discovery projects
    • • 3-year Innovative Medicines Initiative project • Integrating chemistry and biology data using semantic web technologies • Open code, open data, open standards • Academics, Pharmas, Publishers… • To put medicines in the pipeline…
    • The Open PHACTS community ecosystem
    • Open Source Drug Discovery India
    • Conclusions • Great progress in mining the archive for compounds • Reaction extraction and spectral data are underway • All of the resulting data will be available to the chemistry community
    • And that article I’m writing
    • The Figures will be data too
    • Every compound will live
    • And linking will InChI forward
    • Structure Searching the Web
    • Thank you Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams