
Analysing the performance of open access papers discovery tools

Open Access (OA) discovery tools aim to locate freely available copies of research papers that might be behind a paywall on a publisher’s website. Our study provides a large-scale quantitative performance comparison of several OA discovery tools on a randomly selected sample of 100k DOIs from Crossref. We use the knowledge acquired from this analysis to build a new discovery tool: CORE Discovery.

Published in: Data & Analytics

  1. Analysing the performance of open access papers discovery tools
     Petr Knoth, Matteo Cancellieri
     June 13, 2019 – OR 2019, Hamburg, Germany
     CORE
     Big Scientific Data and Text Analytics Group
     Knowledge Media Institute, The Open University
  2. Why Open Access (OA) Discovery?
     • Automating the process of finding the full text of a research paper
     • Identifying free copies of paywalled papers
     • Reducing the access process to just one click
     • Analysis and monitoring of OA, subscription negotiations
     • Discovery tools: browser extensions, system integrations, public APIs
  3. Search vs Discovery
     • Search: given a query, find relevant papers
     • Discovery: given one or more document identifiers, give me the full text
  4. Aims of this work
     In scope:
     • Quantitatively compare and evaluate OA discovery tools using widely established information retrieval metrics
     • Identify gaps for improvement of OA discovery tools
     • Design a tool that maximises performance (CORE Discovery)
     Not in scope:
     • Discovery beyond freely available content
     • Illegal tools
  5. What are OA discovery systems?
     Task definition: given a document identifier (DOI), give me the URL of a freely accessible version of the document.
  6. Most successful OA discovery methods
     1. Using Crossref as the primary data source and systematically crawling full texts based on Crossref links or other information.
     2. Calling a wide range of external APIs in real time.
     We implemented method 1 as a baseline (plus a call to CORE in advance) to understand to what extent the available methods are better.
  7. How do OA discovery systems work?
     Tools compared (table columns): Unpaywall, OA Button, K[.*]io, Baseline
     Comparison criteria (table rows):
     • Aims to find freely available copies of articles
     • Helps subscribed users access non-OA content
     • Enriches data to obtain more OA links than already provided by underlying infrastructures
     • Builds a database of DOI -> URL mappings
     • Calls external infrastructure services while serving users’ requests
     Disclaimer: It is not possible for us to specify the name of the tool labelled as K[.*]io. Any potential similarity with an existing tool on the market is purely coincidental. We make no claims and take no responsibility for any interpretations that might arise.
  8. Reliance on other infrastructures
  9. Evaluation methodology
     • Test all tools on the same data sample (DOIs) and capture the results
     • Query all tools as if they were executed by a user
     • Baseline method:
         • Collecting links from Crossref and crawling them to find full texts
         • Calling CORE data via the CORE API (as a batch prior to execution)
     • Evaluation metrics:
         • Hit rate: proportion of DOIs for which a URL is returned
         • Precision: proportion of true positives, i.e. correctly identified freely available article copy URLs, in the set of all returned URLs
     • Analysis of the returned results
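
The two evaluation metrics can be illustrated with a minimal Python sketch. The function names, the example DOIs and URLs, and the idea of a set of "verified" DOIs (those whose returned URL was confirmed to be a free copy of the right paper) are assumptions made for illustration; they are not taken from the study's code.

```python
from typing import Dict, Optional, Set


def hit_rate(results: Dict[str, Optional[str]]) -> float:
    """Proportion of queried DOIs for which the tool returned any URL."""
    if not results:
        return 0.0
    returned = sum(1 for url in results.values() if url)
    return returned / len(results)


def precision(results: Dict[str, Optional[str]], verified: Set[str]) -> float:
    """Proportion of returned URLs that were verified as correct free copies."""
    returned = [doi for doi, url in results.items() if url]
    if not returned:
        return 0.0
    true_positives = sum(1 for doi in returned if doi in verified)
    return true_positives / len(returned)


# Hypothetical responses of one tool for four DOIs (None = no URL returned).
results = {
    "10.1000/a": "https://example.org/a.pdf",
    "10.1000/b": None,
    "10.1000/c": "https://example.org/c.pdf",
    "10.1000/d": "https://example.org/d.html",
}
# Hypothetical outcome of checking the returned URLs.
verified = {"10.1000/a", "10.1000/c"}

print(f"hit rate:  {hit_rate(results):.2%}")
print(f"precision: {precision(results, verified):.2%}")
```

Note that hit rate is computed over all queried DOIs, while precision is computed only over the DOIs for which a URL came back.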
  10. Data sample
      • 100k DOIs randomly sampled from Crossref
      • At a 99% confidence level, this sample size gives a confidence interval of 0.41%, i.e. below 1%.
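
The 0.41% figure is consistent with the standard margin-of-error formula for a proportion. The slides do not say how the interval was computed, so the sketch below is an assumption: it uses the normal approximation, the worst-case proportion of 0.5, and no finite population correction.

```python
import math
from statistics import NormalDist


def margin_of_error(sample_size: int, confidence: float = 0.99,
                    proportion: float = 0.5) -> float:
    """Half-width of the confidence interval for a sampled proportion.

    Assumes the normal approximation and ignores the finite population
    correction; proportion=0.5 is the worst case (widest interval).
    """
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


moe = margin_of_error(100_000)
print(f"margin of error: {moe:.2%}")
```

Under these assumptions, a 100k sample yields a margin of roughly 0.41 percentage points.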
  11. Hit rate
  12. Hit rate with respect to paper publication year
  13. Precision
      • Responding with a URL to a given DOI does not guarantee that the provided URL leads to a freely available version of the correct paper.
      • We crawl all URLs returned by each tool and test whether:
          • the crawled resource contains the string of the article’s title as recorded in Crossref,
          • the text of the resource is the full version of the content (difficult to automate).
      Limitation: the automated check overestimates precision (a manual check is needed).
      No major differences between tools on the automated check.
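
The automated title check could look roughly like the following Python sketch. The normalisation steps (Unicode normalisation, case folding, replacing punctuation with spaces, collapsing whitespace) and the function names are assumptions; the slides only state that the crawled resource is tested for the Crossref title string.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Case-fold, replace punctuation with spaces and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def contains_title(page_text: str, crossref_title: str) -> bool:
    """True if the normalised title occurs in the normalised page text."""
    title = normalise(crossref_title)
    return bool(title) and title in normalise(page_text)


title = "Analysing the performance of open access papers discovery tools"
page = "Analysing the   Performance of Open-Access papers\n discovery tools. Abstract"

print(contains_title(page, title))
print(contains_title("An unrelated landing page", title))
```

A check like this passes for any page that merely mentions the title, such as an abstract page or a reference list, which is one reason an automated check of this kind overestimates precision.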
  14. Are some tools better for some disciplines?
      No significant differences across disciplines.
  15. Pairwise overlap of the returned URLs
      Overlap lower than expected.
  16. What hit rate can be achieved if tools are combined?
      We can improve hit rate by combining the outputs from multiple discovery tools.
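
Why low overlap translates into a higher combined hit rate can be shown with a small Python sketch. The slides do not state how overlap was measured, so the Jaccard measure used here is an assumption, as are the tool names and DOI sets.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def pairwise_overlap(hits: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard overlap of the DOI sets resolved by each pair of tools."""
    overlaps = {}
    for a, b in combinations(sorted(hits), 2):
        union = hits[a] | hits[b]
        overlaps[(a, b)] = len(hits[a] & hits[b]) / len(union) if union else 0.0
    return overlaps


def combined_hit_rate(hits: Dict[str, Set[str]], sample_size: int) -> float:
    """Hit rate when a DOI counts as found if any tool returned a URL for it."""
    resolved = set().union(*hits.values())
    return len(resolved) / sample_size


# Hypothetical DOIs resolved by two tools out of a sample of 10 DOIs.
hits = {
    "tool_a": {"10.1000/1", "10.1000/2", "10.1000/3"},
    "tool_b": {"10.1000/3", "10.1000/4"},
}

print(pairwise_overlap(hits))
print(combined_hit_rate(hits, sample_size=10))
```

In this toy example the tools individually resolve 3 and 2 of 10 DOIs, but because they share only one DOI, the combination resolves 4 of 10.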
  17. Introducing CORE Discovery
      • High coverage of freely available content
      • Free service for researchers by researchers. No company controlling the pipes.
      • Best grip on open repository content
      • Repository integration
      • Discovering documents without a DOI
      https://core.ac.uk/services/discovery/
  18. How CORE Discovery works
      • Run a process on a big data cluster that merges data from MAG, Crossref and Unpaywall (2018 dump) with CORE to find free links in advance.
      • Crawl the provided links to find full texts.
      • If no full text is found, call EPMC.
      • Originally started with:
          • OA Button: increased hit rate but significantly decreased precision. Based on a manual check, ~32.59% of the links discovered by OA Button but not by CORE Discovery or Unpaywall were wrong.
          • K[.*]io: removed the possibility to call its API early in 2019. It is also not used in CORE Discovery because of doubts regarding the delivery of many ResearchGate URL links.
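
The "precomputed links first, external service as a fallback" pattern described on this slide can be sketched as a resolver chain in Python. This is not CORE Discovery's implementation: the resolver functions, the in-memory mapping and the example URLs are hypothetical stand-ins, and a real fallback would be a network call.

```python
from typing import Callable, Optional, Sequence

Resolver = Callable[[str], Optional[str]]


def discover(doi: str, resolvers: Sequence[Resolver]) -> Optional[str]:
    """Return the URL from the first resolver that finds one, else None."""
    for resolve in resolvers:
        url = resolve(doi)
        if url:
            return url
    return None


# Hypothetical DOI -> URL mapping built in advance by a batch process.
PRECOMPUTED = {"10.1000/a": "https://example.org/a.pdf"}


def lookup_precomputed(doi: str) -> Optional[str]:
    """Stand-in for a lookup in the precomputed database."""
    return PRECOMPUTED.get(doi)


def call_fallback_service(doi: str) -> Optional[str]:
    """Stand-in for an external service queried only when the lookup fails."""
    return "https://example.org/fallback/b.pdf" if doi == "10.1000/b" else None


chain = [lookup_precomputed, call_fallback_service]
for doi in ("10.1000/a", "10.1000/b", "10.1000/c"):
    print(doi, "->", discover(doi, chain))
```

Ordering the resolvers this way keeps most requests on the fast precomputed path. Adding a further resolver to the chain can raise hit rate, but as the OA Button experience on this slide shows, it may lower precision.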
  19. CORE Discovery demonstration
  20. Performance of CORE Discovery: hit rate
      • 10k random sample from Crossref.

                      CORE Discovery    Unpaywall
      Not found       7374              7474
      Discovered      2626              2526
      Hit rate        26.26%            25.26%
  21. Performance of CORE Discovery
      • Manually checked 200 responses where CORE Discovery and Unpaywall both returned a URL.
      • Precision:
          • CORE Discovery: 95.94%
          • Unpaywall: 93.40%

                                     CORE      Unpaywall
      Display page with PDF link     9.64%     5.08%
      HTML                           7.61%     7.61%
      HTML + PDF                     3.55%     1.52%
      PDF                            70.56%    78.17%
      PDF in another language        1.02%     1.02%
      TOC link                       3.55%     0.00%
      Dead link                      0.51%     0.51%
      HTML (abstract only)           0.51%     0.51%
      DOI not detected               1.02%     3.55%
      Wrong                          1.02%     1.02%
      Wrong PDF                      1.02%     1.02%
      Correct                        95.94%    93.40%
  22. CORE Discovery repository integration
      • The majority of articles in repositories are metadata only.
      • CORE Discovery repository plugin:
          • turns dead ends of user journeys into journeys fulfilling users’ information needs
          • makes repository content more discoverable.
  23. Conclusions
      • First study to quantitatively analyse the performance of OA discovery systems
      • We identified:
          • Significant differences in the way OA discovery systems operate
          • Strategies that are successful
          • Potential for further improvement
      • We developed CORE Discovery, which offers one-click access to free copies of research papers whenever you hit a paywall.
      • Install the CORE Discovery browser extension and/or our repository plugin.
  24. Acknowledgements
      Feedback: CORE Ambassadors, KMI staff, UK Repository Managers
      Lucas Anastasiou, Viktor Yakubiv, Harriett Cornish, Sergei Misak, Nancy Pontika, Svetlana Rumyanceva, Samuel Pearce, Balviar Notay, Chris Biggs, Alan Stiles
  25. Thank you!
      https://core.ac.uk/services/discovery
