CrossRef Text & Data Mining - UKSG 2015


CrossRef Text & Data Mining presentation given at UKSG 2015

Published in: Technology

  1. 1. CrossRef Text and Data Mining Services: one year in. Rachael Lammey, Product Manager, CrossRef. UKSG 2015.
  2. 2. • Not-for-profit association of scholarly publishers • All subjects, all business models • 5,000+ organizations from all over the world • 83 non-publisher affiliates, 2,000 library affiliates • 72 million+ DOIs assigned to content items
  3. 3. 10.1098/rstl.1665.0001
  4. 4. • User clicks on CrossRef DOI reference link in Journal A: Tani, N., N. Tomaru, M. Araki, and K. Ohba. 1996. Genetic diversity and differentiation in populations of Japanese stone pine (Pinus pumila) in Japan. Canadian Journal of Forest Research 26: 1454–1462. [CrossRef] • DOI directory returns URL • User accesses cited article in Journal B
  5. 5. 100,000,000
  6. 6. A Text and Data Mining Hub for Researchers
  7. 7. What is Text and Data Mining (TDM)? Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. It uses powerful computers to find links between drugs and side effects, or genes and diseases, that are hidden within the vast scientific literature. These are discoveries that a person scouring through papers one by one may never notice.
  8. 8. Why? • Researchers find it impractical to negotiate multiple bilateral agreements with hundreds of subscription-based publishers in order to authorise TDM of subscribed content. • Subscription-based publishers find it impractical to negotiate multiple bilateral agreements with thousands of researchers and institutions in order to authorise TDM of subscribed content. • All parties would benefit from support of standard APIs and data representations in order to enable TDM across both open access and subscription-based publishers.
  9. 9. Build Cross-Publisher API for TDM
  10. 10. Access To Full Text Problem: Researchers want to get full text content from publishers’ sites for OA or subscribed content. Solution: Common API (protocol) for requesting machine-readable full text from many different publishers.
  11. 11. Negotiating Permissions Problem: Researchers want to know whether text and data mining is allowed, and if not, get permission. Solution: Licensing information embedded in article metadata and a registry for supplemental text and data mining terms and conditions (licenses).
  12. 12. Text and Data Mining Steps • Define problem • Identify potential corpus to mine • Discovery (full text links) • Identification of subset which can be accessed (license information) • Download identified corpus • Text and data mine corpus
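The discovery and license-filtering steps above can be sketched offline as follows. This is only an illustration, not CrossRef's implementation: the record layout (a `DOI` string plus `link` and `license` lists with `URL` keys) is an assumption loosely modeled on CrossRef metadata, and the sample DOIs and URLs are made up.

```python
# Hypothetical metadata records; field names and values are assumptions.
SAMPLE_RECORDS = [
    {
        "DOI": "10.5555/example.1",
        "link": [{"URL": "https://publisher.example/ft/1.xml"}],
        "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
    },
    {
        "DOI": "10.5555/example.2",
        "link": [{"URL": "https://publisher.example/ft/2.xml"}],
        "license": [{"URL": "https://publisher.example/tdm-terms"}],
    },
    {"DOI": "10.5555/example.3", "link": [], "license": []},
]


def select_minable(records, accepted_licenses):
    """Return (doi, full_text_url) pairs for records that have a full-text
    link and at least one license the researcher has accepted."""
    selected = []
    for record in records:
        links = [item["URL"] for item in record.get("link", [])]
        licenses = {item["URL"] for item in record.get("license", [])}
        if links and licenses & accepted_licenses:
            selected.append((record["DOI"], links[0]))
    return selected


accepted = {"https://creativecommons.org/licenses/by/4.0/"}
corpus = select_minable(SAMPLE_RECORDS, accepted)
print(corpus)
```

The resulting list is what the "download identified corpus" step would iterate over; the download itself is left out because delivery stays with each publisher.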
  13. 13. The Basic Workflow
  14. 14. Publisher Participation To enable their content for use by the service, publishers have to provide CrossRef with two additional pieces of metadata: • Full text URIs (to show where the full text is located) • License URIs (to show the terms and conditions under which researchers can use it) Publishers can also implement rate limiting. CrossRef doesn’t charge publishers for participating in this service.
  15. 15. Researcher Use • The CrossRef REST API is the main aspect of this service • It is designed to allow researchers to easily harvest full text documents from all participating publishers regardless of their business model (e.g. open access, subscription). • It makes use of CrossRef DOI content negotiation to provide researchers with links to the full text of content located on the publisher’s site. • The publisher remains responsible for actually delivering the full text of the content requested • CrossRef does not charge researchers for using the service
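As a rough sketch of how a researcher's tool might look up full-text links for one DOI: the code below builds a metadata URL and pulls link entries out of a record. The endpoint `https://api.crossref.org/works/` and the `link` / `content-type` / `URL` field names are assumptions that should be checked against the current API documentation, and the sample record is hypothetical. No network request is made.

```python
from urllib.parse import quote

API_BASE = "https://api.crossref.org/works/"  # assumed endpoint


def works_url(doi):
    """Build the metadata URL for one DOI, percent-encoding unsafe characters."""
    return API_BASE + quote(doi, safe="/")


def full_text_links(message):
    """Map content type -> full-text URL from a record's 'link' list."""
    return {
        link.get("content-type", "unspecified"): link["URL"]
        for link in message.get("link", [])
    }


# Hypothetical record, shaped like the "message" part of a JSON response.
sample_message = {
    "DOI": "10.5555/example.1",
    "link": [
        {"URL": "https://publisher.example/ft/1.xml", "content-type": "application/xml"},
        {"URL": "https://publisher.example/ft/1.pdf", "content-type": "application/pdf"},
    ],
}

print(works_url(sample_message["DOI"]))
print(full_text_links(sample_message))
```

The returned URLs point at the publisher's site, which matches the slide's point that the publisher, not CrossRef, delivers the full text.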
  16. 16. Publisher Metadata for CrossRef TDM: Hindawi
  17. 17. Publisher Metadata for CrossRef TDM: Elsevier
  18. 18. CrossRef TDM Demo
  19. 19. Click-Through Service
  20. 20. Extended Workflow
  21. 21. Researcher View
  22. 22. Publisher View
  23. 23. • Researcher queries DOI using content negotiation (CN) + API token • Publisher verifies API token • If token verified AND access control allows, publisher returns full text (frequency at publisher discretion)
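The token flow on this slide can be modeled with two small functions, one per side. This is a simplified sketch: the header name `CR-Clickthrough-Client-Token`, the token value, and the HTTP status codes are assumptions chosen for illustration, and rate limiting ("frequency at publisher discretion") is not modeled.

```python
TOKEN_HEADER = "CR-Clickthrough-Client-Token"  # assumed header name


def build_tdm_headers(api_token, accept="application/xml"):
    """Researcher side: headers for a content-negotiated full-text request."""
    return {"Accept": accept, TOKEN_HEADER: api_token}


def publisher_decision(headers, valid_tokens, has_access):
    """Publisher side: allow full text only if the token verifies AND
    access control allows it. Status codes are illustrative choices."""
    token = headers.get(TOKEN_HEADER)
    if token not in valid_tokens:
        return 401  # token missing or not verified
    if not has_access:
        return 403  # token verified, but no entitlement to this content
    return 200  # full text would be returned here


headers = build_tdm_headers("hypothetical-token-123")
print(publisher_decision(headers, {"hypothetical-token-123"}, has_access=True))
print(publisher_decision(headers, set(), has_access=True))
```

Keeping the two checks separate mirrors the slide's "token verified AND access control allows" condition: a valid token alone does not grant access to subscribed content.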
  24. 24. Benefits • Streamlines researcher access to distributed full text for TDM • Enables machine-to-machine, automated access for recognized TDM (i.e. researchers won’t be locked out of publisher sites) • Enables article-level licensing info and easy mechanism for supplemental T&Cs for text and data mining (publishers discussing model license via STM)
  25. 25. Publishers Over 14 million articles with full-text links and license information deposited
  26. 26. Usable as is:
  27. 27.
  28. 28.
  29. 29. How can researchers use the service? • Modify TDM tools to make use of the API token • Modify TDM tools to look for <lic_ref> elements • Register with the click-through service and accept/decline licenses (if applicable) • Details at:
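One way a TDM tool might "look for `<lic_ref>` elements" is sketched below with the Python standard library. The sample XML is hypothetical: the slides name the element but not its surrounding structure, so the code matches on the element's local name and ignores any namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata fragment; real responses may nest and namespace
# these elements differently.
SAMPLE_XML = """
<record>
  <doi>10.5555/example.1</doi>
  <lic_ref>https://creativecommons.org/licenses/by/4.0/</lic_ref>
  <lic_ref>https://publisher.example/tdm-terms</lic_ref>
</record>
"""


def license_refs(xml_text):
    """Return the text of every element whose local name is 'lic_ref',
    ignoring any XML namespace."""
    root = ET.fromstring(xml_text)
    refs = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "lic_ref" and element.text:
            refs.append(element.text.strip())
    return refs


print(license_refs(SAMPLE_XML))
```

A tool could compare the returned license URIs against the set of licenses the researcher has accepted before attempting a download.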
  30. 30. Why use the DOI? Using the DOI as the basis for a common text and data mining API provides several benefits. For example, the DOI provides: • An easy way to de-duplicate documents that may be found on several sites. • Persistent provenance information. • An easy way to document, share and compare corpora without having to exchange the actual documents. • A mechanism to ensure the reproducibility of TDM results using the source documents. • A mechanism to track the impact of updates, corrections, retractions and withdrawals on corpora.
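The de-duplication benefit can be illustrated with a short sketch. It assumes each harvested document carries a `doi` field (a hypothetical layout) and relies on DOI names being case-insensitive; the list of resolver prefixes it strips is illustrative, not exhaustive.

```python
import re

# Resolver prefixes sometimes found in front of a DOI (illustrative only).
_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_doi(raw):
    """Strip resolver prefixes and lowercase; DOI names are case-insensitive."""
    return _PREFIX.sub("", raw.strip()).lower()


def dedupe_by_doi(documents):
    """Keep the first document seen for each normalized DOI."""
    seen = {}
    for doc in documents:
        seen.setdefault(normalize_doi(doc["doi"]), doc)
    return list(seen.values())


docs = [
    {"doi": "10.1098/rstl.1665.0001", "source": "publisher site"},
    {"doi": "http://dx.doi.org/10.1098/RSTL.1665.0001", "source": "repository"},
    {"doi": "doi:10.5555/example.2", "source": "aggregator"},
]
unique = dedupe_by_doi(docs)
print([d["source"] for d in unique])
```

The same normalized DOI list doubles as a compact description of a corpus, which is how corpora can be shared and compared without exchanging the documents themselves.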