Open Access (OA) discovery tools aim to locate freely available copies of research papers that might be behind a paywall on a publisher’s website. Our study provides a large-scale quantitative performance comparison of several OA discovery tools on a randomly selected sample of 100k DOIs from Crossref. We use the knowledge acquired from this analysis to build a new discovery tool, CORE Discovery.
• Highest coverage of freely available content. Our tests show that CORE Discovery finds more free content than any other discovery system.
• Free service for researchers by researchers. CORE Discovery is the only free content discovery extension developed by researchers for researchers. No major publisher or enterprise controls or profits from your usage data.
• Best grip on open repository content. Because CORE is a leader in harvesting open access literature, CORE Discovery has the best grip on content from open repositories, unlike other services that focus disproportionately on content indexed in major commercial databases.
• Repository integration and discovering documents without a DOI. CORE Discovery is the only service offering seamless and free integration into repositories. It is also the only discovery system that can locate scientific content for items whose DOI is unknown or that have no DOI.
Analysing the performance of open access papers discovery tools
June 13, 2019 – OR 2019, Hamburg, Germany
Big Scientific Data and Text Analytics Group
Knowledge Media Institute, The Open University
Why Open Access (OA) Discovery?
• Automating the process of finding the full text of a research paper
• Identifying free copies of paywalled papers
• Reducing the access process to a single click
• Analysis and monitoring of OA, subscription negotiation
• Discovery tools: Browser extensions, system integrations, public
Search vs Discovery
• Search: Given a query, find relevant papers
• Discovery: Given one or more document identifiers, give me the full text
Aims of this work
• Quantitatively compare and evaluate OA discovery tools using
widely established information retrieval metrics
• Identify gaps for improvement of OA discovery tools
• Design a tool that maximises performance (CORE Discovery)
Not in scope:
• Discovery beyond freely available content
• Illegal tools
What are OA discovery systems?
Task definition: Given a document identifier
(DOI), give me the URL of a freely accessible
version of the document.
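The task definition above can be read as a simple interface: DOI in, URL (or nothing) out. The following is a minimal sketch of that interface, not any tool's actual API. The names `normalize_doi` and `DictDiscovery` and the example URL are hypothetical. The normalisation step relies on the fact that DOIs are case-insensitive.

```python
from typing import Mapping, Optional

# Common prefixes under which a DOI may be written (assumed list, not exhaustive).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Strip a resolver prefix and lowercase the DOI (DOIs are case-insensitive)."""
    doi = doi.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


class DictDiscovery:
    """Toy discovery tool backed by an in-memory DOI -> URL mapping."""

    def __init__(self, mapping: Mapping[str, str]) -> None:
        self._mapping = {normalize_doi(k): v for k, v in mapping.items()}

    def discover(self, doi: str) -> Optional[str]:
        """Return the URL of a freely accessible copy, or None if unknown."""
        return self._mapping.get(normalize_doi(doi))
```

Returning `None` rather than raising keeps "no free copy known" a normal outcome, which is what the hit rate metric counts.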
Most successful OA discovery methods
1. Using Crossref as a primary data source and systematically
crawling full text based on Crossref links or other information.
2. Calling a wide range of external APIs in real-time.
We implemented method 1 as a baseline (+ call to CORE in
advance), to understand to what extent are the available
How do OA discovery systems work?
Tools compared: Unpaywall, OA Button, K[.*]io, Baseline
Characteristics compared:
• Aims to find freely available copies of articles
• Helps subscribed users access non-OA content
• Enriches data to obtain more OA links than already provided by underlying infrastructures
• Builds a database of DOI -> URL mappings
• Calls external infrastructure services while serving
Disclaimer: It is not possible for us to disclose the name of the tool labelled K[.*]io. Any similarity with an
existing tool on the market is purely coincidental. We make no claims and take no responsibility for any interpretations
that might arise.
• Test all tools on the same data sample (DOIs) and capture the
• Query all tools as if they were executed by the user
• Baseline method:
• Collecting links from Crossref and crawling them to find full texts.
• Querying CORE data via the CORE API (as a batch prior to execution)
• Evaluation metrics:
• Hit rate - proportion of DOIs for which a URL is returned
• Precision - proportion of true positives, i.e. correctly identified freely
available article copy URLs, in the set of all returned URLs.
• Analysis of the returned results
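The two metrics defined above can be written down directly. This is a minimal sketch assuming each tool's output is a mapping from DOI to the returned URL (or `None`), and that correctness judgements are available per DOI. The function and parameter names are illustrative, not taken from the study's code.

```python
from typing import Mapping, Optional


def hit_rate(responses: Mapping[str, Optional[str]]) -> float:
    """Proportion of queried DOIs for which a URL was returned."""
    if not responses:
        return 0.0
    returned = sum(1 for url in responses.values() if url)
    return returned / len(responses)


def precision(
    responses: Mapping[str, Optional[str]],
    is_correct: Mapping[str, bool],
) -> float:
    """Proportion of returned URLs judged to be a freely available copy."""
    returned = [doi for doi, url in responses.items() if url]
    if not returned:
        return 0.0
    # DOIs without a judgement are counted as incorrect (a conservative assumption).
    true_positives = sum(1 for doi in returned if is_correct.get(doi, False))
    return true_positives / len(returned)
```

Note that the two denominators differ: hit rate is measured over all queried DOIs, precision only over the DOIs for which a URL came back.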
• 100k sample of
• 99% confidence
0.41%, i.e. below
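The 0.41% figure is consistent with the worst-case margin of error for a proportion estimated from a simple random sample of 100k items at 99% confidence. The sketch below shows one way to arrive at it. The slides do not state the formula used, so the normal approximation, the worst-case proportion p = 0.5, and the absence of a finite-population correction are all assumptions.

```python
import math
from statistics import NormalDist


def margin_of_error(n: int, confidence: float, p: float = 0.5) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion.

    p = 0.5 gives the widest (worst-case) interval.
    """
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)


moe = margin_of_error(100_000, 0.99)
print(f"{moe:.2%}")  # 0.41%
```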
Hit rate with respect to paper publication
• Responding with a URL to a given DOI does not guarantee that
the provided URL leads to a freely available version of the
• We crawl all URLs returned by each tool and test whether:
• the resource contains the string of the article’s title as recorded in Crossref,
• the text of the resource is the full version of the content
(difficult to automate).
Limitation:
No major differences on the automated check
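The automated title check described above could look roughly like the sketch below. It assumes the crawled resource has already been converted to plain text. The normalisation rules (accent stripping, lowercasing, collapsing punctuation and whitespace) are illustrative choices, not the ones used in the study.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse non-alphanumeric runs to one space."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def contains_title(page_text: str, title: str) -> bool:
    """True if the normalised title occurs as a whole-word phrase in the page."""
    needle = normalize(title)
    if not needle:
        return False
    # Padding with spaces prevents matches that start or end mid-word.
    return f" {needle} " in f" {normalize(page_text)} "
```

A check like this only shows that the title appears somewhere in the resource. It cannot tell a full text from an abstract page, which is the part the slides describe as difficult to automate.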
Are some tools better for some
Pairwise overlap of the returned URLs
What hit rate can be achieved if tools are
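A minimal sketch of how the overlap between tools, and the hit rate of their pooled results, might be computed. It assumes each tool's output has been reduced to the set of DOIs for which it returned a URL. The function names and the choice to count shared DOIs (rather than, say, identical URLs) are illustrative assumptions, not the study's exact procedure.

```python
from itertools import combinations
from typing import Dict, Mapping, Set, Tuple


def pairwise_overlap(found: Mapping[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Number of DOIs answered by both tools, for every pair of tools."""
    return {
        (a, b): len(found[a] & found[b])
        for a, b in combinations(sorted(found), 2)
    }


def pooled_hit_rate(found: Mapping[str, Set[str]], sample_size: int) -> float:
    """Hit rate if a DOI counts as found when any tool returns a URL for it."""
    if sample_size <= 0:
        return 0.0
    pooled = set().union(*found.values())
    return len(pooled) / sample_size
```

The pooled figure is an upper bound on what a combination of these tools could return. It says nothing about precision, which would need to be checked separately.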
Introducing CORE Discovery
• High coverage of freely available content
• Free service for researchers by researchers
company controlling the
• Best grip on open repository content
• Repository integration
• Discovering documents
without a DOI.
How CORE Discovery works
• Run a process on a big data cluster that merges data from MAG (Microsoft Academic Graph),
Crossref and Unpaywall (2018 dump) with CORE to
find free links in advance.
• Crawl the provided links to find full texts.
• If nothing is found, call EPMC (Europe PMC).
• Originally started with:
• OA Button: increased hit rate but significantly decreased precision.
~32.59% of the links discovered by OA Button but by neither
CORE Discovery nor Unpaywall were wrong, based on a manual
• K[.*]io removed the ability to call its API early in 2019. It is also not used in
CORE Discovery because of doubts about its delivery of many
ResearchGate URLs.
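The lookup order described above (precomputed links first, then crawling, then a fallback call to EPMC) can be sketched as follows. This is an illustration of the control flow only. The `crawl` and `call_epmc` callables are hypothetical stand-ins for components whose real interfaces are not given in the slides.

```python
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical component types: each returns a full-text URL or None.
Crawler = Callable[[str], Optional[str]]       # candidate link -> full-text URL
EpmcLookup = Callable[[str], Optional[str]]    # DOI -> full-text URL


def discover(
    doi: str,
    precomputed_links: Mapping[str, Sequence[str]],
    crawl: Crawler,
    call_epmc: EpmcLookup,
) -> Optional[str]:
    """Try links merged in advance, crawling each; fall back to EPMC."""
    # Links found ahead of time by the batch merge step, keyed by lowercase DOI.
    for link in precomputed_links.get(doi.lower(), ()):
        full_text_url = crawl(link)
        if full_text_url is not None:
            return full_text_url
    # Nothing usable was found in the precomputed data.
    return call_epmc(doi)
```

Doing the expensive merge in advance keeps the request path to a dictionary lookup plus, at most, one external call.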
Performance of CORE Discovery: hit rate
• 10k random sample from Crossref.
              CORE Discovery   Unpaywall
Not found     7374             7474
Discovered    2626             2526
Hit rate      26.26%           25.26%
Performance of CORE Discovery
• Manually checked 200 responses where CORE Discovery and Unpaywall both
returned a URL.
• Correct responses:
• CORE Discovery: 95.94%
• Unpaywall: 93.40%
Response type                 CORE Discovery   Unpaywall
Display page with PDF link    9.64%            5.08%
HTML                          7.61%            7.61%
HTML + PDF                    3.55%            1.52%
PDF                           70.56%           78.17%
PDF in another language       1.02%            1.02%
TOC link                      3.55%            0.00%
Dead link                     0.51%            0.51%
HTML (abstract only)          0.51%            0.51%
DOI not detected              1.02%            3.55%
Wrong                         1.02%            1.02%
Wrong PDF                     1.02%            1.02%
Correct                       95.94%           93.40%
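As a quick arithmetic check on the breakdown above, the "Correct" row can be reproduced by summing a subset of the response types. Which rows count as correct is not stated explicitly. The sketch assumes it is the first six (everything from "Display page with PDF link" down to "TOC link"), because that is the grouping that matches the reported totals for both CORE Discovery and Unpaywall.

```python
# (CORE Discovery %, Unpaywall %) per response type.
# Assumption: these six rows are the ones counted as correct.
CORRECT_ROWS = {
    "Display page with PDF link": (9.64, 5.08),
    "HTML": (7.61, 7.61),
    "HTML + PDF": (3.55, 1.52),
    "PDF": (70.56, 78.17),
    "PDF in another language": (1.02, 1.02),
    "TOC link": (3.55, 0.00),
}

# Values from the "Correct" row of the table.
REPORTED_CORRECT = (95.94, 93.40)


def column_totals(rows):
    """Sum each tool's column, rounded to two decimals."""
    core = sum(core_pct for core_pct, _ in rows.values())
    unpaywall = sum(unpaywall_pct for _, unpaywall_pct in rows.values())
    return round(core, 2), round(unpaywall, 2)


totals = column_totals(CORRECT_ROWS)
# Each row is itself rounded, so a difference of about 0.01 is expected.
for total, reported in zip(totals, REPORTED_CORRECT):
    assert abs(total - reported) <= 0.05
```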
CORE Discovery Repository integration
• The majority of articles in
repositories are metadata only.
• CORE Discovery
• turns dead ends of user
journeys into journeys
fulfilling users’ information
• makes repository content
• First study to quantitatively analyse the performance of OA
• We identified:
• Significant differences in the way OA discovery systems operate.
• Strategies that are successful
• Potential for further improvement
• We developed CORE Discovery, which offers one-click access
to free copies of research papers whenever you hit a paywall.
• Install CORE Discovery browser extension and/or our repository
Feedback: CORE Ambassadors, KMI staff, UK Repository
Viktor Yakubiv Harriett Cornish Sergei Misak Nancy Pontika Svetlana Rumyanceva
Samuel Pearce Balviar Notay Chris Biggs Alan Stiles