This talk was given at the IIPC General Assembly in Paris in May 2014. It introduces the distributed, parallel extraction framework provided by the Web Data Commons project. The framework is publicly accessible and tailored for the Amazon Web Services stack. In addition, the presentation includes an excerpt of the datasets that were extracted from over 100 TB of crawl data and are also available at http://webdatacommons.org.
In the Open Data world we are encouraged to try to publish our data as “5-star” Linked Data because of the semantic richness and ease of integration that the RDF model offers. For many people and organisations this is a new world and some learning and experimenting is required in order to gain the necessary skills and experience to fully exploit this way of working with data. This workshop will re-assert the case for RDF and provide a guided tour of some examples of RDF publication that can act as a guide to those making a first venture into the field.
Extending DSpace 7: DSpace-CRIS and DSpace-GLAM for empowered repositories an... | 4Science
Presentation given at OR2019 in Hamburg, Germany
In recent years there has been an increasing need to position institutional repositories in a broader context that enhances research opportunities and facilitates the discovery of resources. This presentation covers DSpace-CRIS and DSpace-GLAM in their new versions compatible with DSpace 7, with renewed features built on the updated DSpace 7 technology stack (Angular and the REST API); it describes their characteristics and novelties, and how their adoption can empower the role of repositories within academic, research, and cultural heritage institutions. The migration process will be presented both for existing DSpace-CRIS/GLAM users and for DSpace users who want to enhance their repository with the additional features and capabilities provided by version 7. DSpace-CRIS and DSpace-GLAM are continuously aligned with DSpace versions, and support is provided through the same community channels. Finally, the future roadmap of the project will be discussed; as in the last ten years, ideas and features that blossomed in DSpace-CRIS are later adopted by the standard DSpace distribution. The community is large and growing, and the exchange of experiences benefits all organizations.
CLARIAH Toogdag 2018: A distributed network of digital heritage information | Enno Meijers
Slides of my keynote at the CLARIAH Toogdag 2018 on 9 March at the National Library of the Netherlands. The main topics were the development of the distributed digital heritage network and its alignment with, and cooperation with, the CLARIAH infrastructure and data. It also points to some current limitations of semantic web technology.
Dataset Descriptions in Open PHACTS and HCLS | Alasdair Gray
This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given, together with the tools developed to enable the specification's uptake.
Over the course of the last 12 months, the W3C Health Care and Life Sciences Interest Group has been developing a community profile for dataset descriptions, drawing on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange project http://www.ndexbio.org/ on 2 April 2014.
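As a loose illustration of what such a dataset description can look like in practice (the properties below are illustrative picks from widely used vocabularies such as Dublin Core, VoID and PAV, not a quotation of the Open PHACTS or HCLS specifications; the IRIs are placeholders), here is a minimal sketch using rdflib:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

VOID = Namespace("http://rdfs.org/ns/void#")
PAV = Namespace("http://purl.org/pav/")

g = Graph()
ds = URIRef("http://example.org/dataset/example-v1")  # placeholder dataset IRI

# Minimal title, licensing, versioning and provenance metadata
# for one version of a dataset.
g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCTERMS.title, Literal("Example dataset, version 1")))
g.add((ds, DCTERMS.license, URIRef("http://example.org/license")))
g.add((ds, PAV.version, Literal("1.0")))
g.add((ds, PAV.retrievedFrom, URIRef("http://example.org/source.ttl")))

print(g.serialize(format="turtle"))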
DSpace-CRIS: new features and contribution to the DSpace mainstream | Andrea Bollini
The presentation focuses on the latest releases of DSpace-CRIS, compatible with DSpace 5 and 6, with exciting new features. Particularly interesting is the recent integration between DSpace-CRIS and CKAN, released as an independent module. The DSpace-CKAN Integration Module has already been released as open source (under the same license as DSpace) and can easily be adopted by standard DSpace installations too, whether JSPUI or XMLUI.
Starting with DSpace-CRIS 5.6.1, along with the security fixes of DSpace JSPUI 5.6, the following features have been introduced: an extendible UI to deliver bitstreams with dedicated viewers; simple metadata editing of any DSpace object; editing of archived items using the submission UI; a deduplication and duplicate-alert tool; improved ORCiD synchronization; an improved submission form; an improved security model for CRIS entities; creation of CRIS objects as part of the submission process; automatic calculation of metrics; an advanced import framework; on-demand DOI registration; and template services.
The DSpace-CKAN Integration Module allows users to preview, directly from DSpace, dataset content deposited in a CKAN instance via a “curation task”. DSpace-CRIS and DSpace-CKAN will also be supported by 4Science for future major versions of the platform, and the roadmap to DSpace 7 compatibility will be presented as well.
Conference “Opening Science to Meet Future Challenges”, Warsaw, March 11, 2014, organized by the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw.
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information | Kai Schlegel
Presentation for the 5th International Workshop on Data Engineering meets the Semantic Web (DESWeb), held in conjunction with ICDE 2014, Chicago, IL, USA, March 31, 2014, given by Kai Schlegel.
OSFair2017 training | Machine accessibility of Open Access scientific publica... | Open Science Fair
Petr Knoth talks about machine accessibility of Open Access scientific publications from publisher systems via ResourceSync
Training title: TDM – unlocking a goldmine of information
Training overview:
Text and Data Mining (TDM) is a natural ‘next step’ in open science. It can lead to new and unexpected discoveries and increase the impact of publications and repositories. This workshop showcases examples of successful TDM and infrastructural solutions for researchers. We will also discuss what is needed to make the most of infrastructures and how publishers and repositories can open up their content.
DAY 2 - PARALLEL SESSION 4 & 5
Presented by Michele C. Weigle, June 4, 2015
Columbia University Web Archiving Collaboration: New Tools and Models
Work by Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson
Mining the Web of Linked Data with RapidMiner | Heiko Paulheim
Lots of data from different domains is published as Linked Open Data. While there are quite a few browsers for that data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this challenge entry, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining platform RapidMiner, and offers operators for accessing Linked Open Data in RapidMiner, allowing for using it in sophisticated data analysis workflows without the need to know SPARQL or RDF. As an example, we show how statistical data on scientific publications, published as an RDF data cube, can be linked to further datasets and analyzed using additional background knowledge from various LOD datasets.
Jo Lambert (Jisc), Paul Needham (Cranfield University)
The success of COUNTER in supporting adoption of a standard to measure e-resource usage over the past 15 years is apparent. The prevalence of global OA policies and mandates, and the role of institutional repositories within this context, prompts demand for more granular metrics. It also raises the profile of data sharing of item-level usage and research data metrics. The need for reliable and authoritative measures is key. This burgeoning interest is complemented by a number of initiatives to explore the measurement and tracking of usage of a broad range of objects outside traditional publisher platforms. Drawing on examples such as OpenAIRE, IRUSdata-UK, Crossref’s distributed usage logging and DOI event tracker projects, COAR Next Generation Repositories and IRUS-UK, this session will provide an update on progress in this area and discuss some challenges and current approaches to tackling them.
COAR Next Generation Repositories WG - Text mining and Recommender system sto... | petrknoth
One of the key aims of the COAR NGR group is to help us to overcome the challenges that still make it difficult to move beyond repositories as document silos. The group wants to see a globally interoperable network of repositories and global services built on top of repositories fulfilling the expectations of users in the 21st century. During this talk, I will address two use cases the COAR NGR working group aims to enable: text and data mining and recommender systems.
Presentation of the CORE APIv3, which provides seamless programmable access to metadata and content from across the global repository network, delivered at Open Repositories 2022.
OpenAIRE Guidelines for data providers: new Metadata Application Profile for ... | OpenAIRE
Presentation at the "OpenAIRE webinar series for repository managers 2017/2018" - Nov. 14, 2017 (11h00 CET) | "OpenAIRE Guidelines for data providers: new Metadata Application Profile for Literature Repositories", presented by Jochen Schirrwagen, Univ. Bielefeld.
OpenAIRE Content Providers Community Call, July 1st, 2020
This call focused on data repositories, covering the OpenAIRE Research Graph and data repositories, the OpenAIRE Content Acquisition Policy, and the Guidelines for Data Archive Managers.
It was also an opportunity to share the most recent updates and novelties in the OpenAIRE Content Provider Dashboard and to get feedback from the community.
Follow the Community activities at https://www.openaire.eu/provide-community-calls
The field of Text and Data Mining (TDM) is growing in importance with an increasing number of researchers interested in mining scholarly content. CrossRef Text and Data Mining Services launched in May 2014 and focuses on providing one common way to retrieve the full text of articles for the purposes of TDM for interested parties. This session will provide an introduction to and update on this service, and a short demonstration of it in action.
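As a rough sketch of how a client can consume such a service (an illustration against the public CrossRef REST API, not CrossRef's own example code; the DOI is a placeholder), a lookup of the full-text links a publisher has registered for text mining might look like this:

```python
import requests

def tdm_full_text_links(doi: str):
    """Return full-text links that CrossRef advertises for TDM use.

    Publishers opt in to the TDM service, so the 'link' field
    may be absent entirely for a given work.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    links = resp.json()["message"].get("link", [])
    # Keep only links whose declared intended application is text mining.
    return [
        (link["content-type"], link["URL"])
        for link in links
        if link.get("intended-application") == "text-mining"
    ]

if __name__ == "__main__":
    # Placeholder DOI; substitute a real one from a participating publisher.
    for content_type, url in tdm_full_text_links("10.5555/example"):
        print(content_type, url)
```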
The Other Side of the Journal ToCs Interface | Phil Barker
Presentation given to Journal ToCs workshop on 20 Nov 2009, examining where the Journal ToCs API fits into the repository ecology: what is its role and how might it interact with institutional repository systems.
OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA... | OpenAIRE
Presentation by Pedro Principe and Paolo Manghi at the OpenAIRE Open Access Week webinar, Friday October 28, 2016. Webinar on OpenAIRE compatibility guidelines and the dashboard for repository managers, with Pedro Principe (University of Minho) and Paolo Manghi (CNR/ISTI).
Mind the gap! Reflections on the state of repository data harvesting | Simeon Warner
A 24x7 presentation at Open Repositories 2017 in Brisbane, Australia.
I start with an opinionated history of the evolution of repository data harvesting from the late 1990s to the present. A conclusion is that we are currently in danger of creating a repository environment with fewer cross-repository services than before, with the potential to reinforce the silos we hope to open. I suggest that the community needs to agree upon a new solution, and further suggest that the solution should be ResourceSync.
Similar to “Seamless access to the world’s open access research papers via ResourceSync”
Qui Bono? Cumulative advantage in open access publishing | petrknoth
We identify whether barriers related to accessing research literature, such as being located at an institution with limited access to non-OA literature, affect the citation behaviour of scholars. Question 1: Do scholars located in less developed regions or at less prestigious institutions rely more on (consume more) OA because their access to subscription literature is limited?
Question 2: Do those who benefit from OA also produce more OA or are production and consumption independent?
OAI Identifiers: Decentralised PIDs for Research Outputs in Repositories | petrknoth
An OAI (Open Archives Initiative) identifier is a unique identifier of a metadata record. OAI identifiers are used in the context of repositories using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH); however, the process by which they are assigned can, in principle, be used more broadly elsewhere.
In comparison to DOIs, OAI identifiers are registered in a distributed rather than centralised manner, and there is therefore no cost for minting them. OAI identifiers are persistent identifiers in repositories that declare their level of support for deleted documents as “persistent” in the deletedRecord element of the Identify response. CORE recommends that repositories provide this persistent level of support.
OAI identifiers are viable PIDs for repositories which, as opposed to DOIs, can be minted in a distributed fashion and cost-free, and which can resolve directly to the repository rather than to the publisher.
This approach has the potential to increase the importance of repositories in the process of disseminating knowledge. CORE provides a global OAI Resolver built on top of the CORE research outputs aggregation system.
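As a minimal sketch of the persistence check described above, the snippet below issues a standard OAI-PMH Identify request and reads the deletedRecord element; the repository base URL is a placeholder:

```python
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def deleted_record_support(base_url: str):
    """Return the repository's declared deletedRecord policy
    ('no', 'transient' or 'persistent') from an OAI-PMH Identify response."""
    with urllib.request.urlopen(f"{base_url}?verb=Identify", timeout=30) as resp:
        tree = ET.parse(resp)
    # The policy lives at OAI-PMH/Identify/deletedRecord.
    node = tree.find(f"{OAI_NS}Identify/{OAI_NS}deletedRecord")
    return node.text if node is not None else None

if __name__ == "__main__":
    # Placeholder base URL; substitute a real repository's OAI endpoint.
    policy = deleted_record_support("https://repository.example.org/oai")
    print("OAI identifiers are persistent PIDs here:", policy == "persistent")
```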
Tracking compliance of the REF2021 policy with the CORE Repository Dashboard | petrknoth
CORE and the REF2021 audit. Where CORE is mentioned in the REF 2021 OA policy. How CORE collects data. A CORE Repository Dashboard demonstration.
Getting access to the Dashboard
Better together: building services for public good on top of content from the... | petrknoth
CORE hosts the world’s largest collection of open access full texts, offering seamless, unrestricted access to research for citizens, researchers, libraries, software developers, funders and others. CORE’s aggregated content comes from thousands of institutional and subject repositories as well as journals and covers all research disciplines. In January 2019, CORE hit the mark of 10 million monthly active users (10.41 million users). In September 2019, core.ac.uk made it into the top 5k websites globally by user engagement as measured by the independent Alexa Rank, making it clearly one of the world’s most widely used Open Access services.
In this talk, Petr and Nancy will explain the role of CORE in the open science ecosystem. They will introduce the solutions CORE offers for improving the delivery of research literature, including tools for discovering freely available copies of papers that might be behind publishers’ paywalls, as well as a recommender system for open access literature. The use of CORE data to monitor compliance with open access policies has also recently received attention. The presenters will then reflect on the challenges in the sector and share their experience of building value-added services for society on top of open content offered by libraries and their affiliated institutional repositories and open access journals.
Despite being controversial, research metrics are becoming a key component of research evaluation processes globally. Nevertheless, accessing research metrics to support these processes in a timely manner is not a straightforward task, as it requires either having access to expensive commercial solutions such as Elsevier SciVal or Clarivate Analytics' InCites, or having substantial knowledge of existing APIs and data sources as well as the ability and skills needed to analyse large amounts of raw scholarly data in-house. This is especially the case on a department or institutional level where large amounts of data have to be aggregated prior to analysis. To alleviate this problem we have designed and prototyped CORE Analytics Dashboard – a tool for analytical evaluation of research outputs of universities. The aim of the CORE Analytics Dashboard is to help universities analyse their performance using a variety of metrics captured from openly available data sources, including citation counts and social media metrics, and to help them compare their performance with other institutions. This paper presents the motivation behind developing this dashboard and its main features.
Analysing the performance of open access papers discovery tools | petrknoth
Open Access discovery tools aim to locate freely available copies of research papers which might be behind the paywall on a publisher’s website. Our study provides a large scale quantitative performance comparison of several OA discovery tools on a randomly selected sample of 100k DOIs from CrossRef. We use the acquired knowledge from this analysis to build a new discovery tool - CORE Discovery.
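The study does not publish its harness, but the shape of such a measurement is simple: for each DOI in the sample, ask a discovery tool whether it can locate a free copy, then report the hit rate. A sketch using the public Unpaywall API as one example discovery source (the email parameter and DOIs below are placeholders):

```python
import requests

def found_free_copy(doi: str, email: str = "you@example.org") -> bool:
    """Ask one OA discovery source (Unpaywall) whether a free copy exists."""
    resp = requests.get(
        f"https://api.unpaywall.org/v2/{doi}", params={"email": email}, timeout=30
    )
    if resp.status_code != 200:
        return False
    # best_oa_location is null when no free copy was located.
    return resp.json().get("best_oa_location") is not None

def coverage(dois) -> float:
    """Fraction of a DOI sample for which a free copy was found."""
    hits = sum(found_free_copy(doi) for doi in dois)
    return hits / len(dois)

if __name__ == "__main__":
    sample = ["10.5555/example1", "10.5555/example2"]  # placeholder DOIs
    print(f"OA coverage on sample: {coverage(sample):.1%}")
```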
Assessing Compliance with the UK REF 2021 Open Access Policy | petrknoth
The recent increase in Open Access (OA) policies has brought forth important questions concerning the effect these policies have on the practice of publishing Open Access. In particular, is there evidence to support that mandating OA increases the proportion of OA outputs (in other words, do authors comply with relevant policies)? Furthermore, does mandating OA reduce the time from acceptance to the public availability of research outputs, and can compliance with OA mandates be effectively tracked? This work studies compliance with the UK REF 2021 Open Access policy. We use data from CrossRef and from CORE to create a dataset containing 1.6 million publications. We show that after the introduction of the UK OA policy, the proportion of OA research outputs in the UK has increased significantly, and the time lag between the acceptance of a publication and its Open Access availability has decreased, although there are significant differences in compliance between different repositories. We have developed a tool that can be used to assess publications' compliance with the policy based on a list of DOIs.
Integrating research indicators for use in the repositories infrastructure | petrknoth
The current repository infrastructure, which consists of thousands of repositories, does not make effective use of research indicators largely exploited by commercial players in the area. Research indicators, including citation counts and Mendeley reader counts, enable the development and improvement of functionality researchers use on a daily basis. For example, they make it possible to increase the performance in information retrieval and recommendation tasks and serve as an enabler for the development of research analytics & metrics functionality, such as the analysis of research trends or collaboration networks. We believe that there is a strong case for making a better use of these indicators within the repositories infrastructure to improve the functionality of services users rely on.
Towards effective research recommender systems for repositories | petrknoth
In this paper, we argue why and how the integration of recommender systems for research can enhance the functionality and user experience in repositories. We present the latest technical innovations in the CORE Recommender, which provides research article recommendations across the global network of repositories and journals. The CORE Recommender has been recently redeveloped and released into production in the CORE system and has also been deployed in several third-party repositories. We explain the design choices of this unique system and the evaluation processes we have in place to continue raising the quality of the provided recommendations. By drawing on our experience, we discuss the main challenges in offering a state-of-the-art recommender solution for repositories. We highlight two of the key limitations of the current repository infrastructure with respect to developing research recommender systems: 1) the lack of a standardised protocol and capabilities for exposing anonymised user-interaction logs, which represent critically important input data for recommender systems based on collaborative filtering and 2) the lack of a voluntary global sign-on capability in repositories, which would enable the creation of personalised recommendation and notification solutions based on past user interactions.
Semantometrics: Towards Fulltext-based Research Evaluation | petrknoth
Over the recent years, there has been a growing interest in developing new scientometric measures that could go beyond the traditional citation-based bibliometric measures. This interest is motivated on one side by the wider availability or even emergence of new information evidencing research performance, such as article downloads, views and twitter mentions, and on the other side by the continued frustrations and problems surrounding the application of citation-based metrics to evaluate research performance in practice.
Semantometrics are a new class of research evaluation metrics which build on the premise that full text is needed to assess the value of a publication. This talk will present the results of an investigation into the properties of the semantometric contribution measure (Knoth & Herrmannova, 2014). We will provide a comparative evaluation of the contribution measure with traditional bibliometric measures based on citation counting.
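For orientation, the contribution measure can be written as follows (our paraphrase of Knoth & Herrmannova, 2014, not a quotation: here X is the set of publications cited by p, Y the set of publications citing p, and dist a semantic distance computed over the full texts):

```latex
\mathrm{contribution}(p) = \frac{1}{|X|\,|Y|} \sum_{x \in X} \sum_{y \in Y} \mathrm{dist}(x, y)
```

The intuition is that a publication bridging semantically distant citing and cited literature contributes more than one whose surrounding literature largely overlaps.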
My repository is being aggregated: a blessing or a curse? | petrknoth
Usage statistics are frequently used by repositories to justify their value to the management who decide about the funding to support the repository infrastructure. Another reason for collecting usage statistics at repositories is the increased use of webometrics in the process of assessing the impact of publications and researchers. Consequently, one of the worries repositories sometimes have about their content being aggregated is that aggregations may have a detrimental effect on the accuracy of the statistics they collect. They believe that this potential decrease in reported usage can negatively influence the funding provided by their own institutions. This raises the fundamental question of whether repositories should allow aggregators to harvest their metadata and content. In this paper, we discuss the benefits of allowing content aggregators to harvest repository content and investigate how to overcome the drawbacks.
Seamless access to the world’s open access research papers via ResourceSync
1. Seamless access to the world’s open access research papers via ResourceSync
Petr Knoth
2. Use Case 1: ResourceSync as a seamless layer over heterogeneous APIs
3. Use Case 1: What is CORE?
[Diagram: OA repositories and OA journals, harvested mostly via OAI-PMH]
CORE aggregates and provides free access to millions of research articles aggregated from thousands of OA repositories and journals.
4. Use Case 1: What is CORE?
»Enrichment and harmonisation of aggregated data
»Products/services:
›Portal
›API
›Data dumps
›Recommendation system for libraries
›Repository dashboard
›B2B and analytical services
5. Use Case 1: What is CORE?
»70 million+ metadata records
»Over 6 million full texts hosted on CORE
»~1.5 million monthly active users
»Aggregating from 2,500 repositories and 10k OA journals
6. Use Case 1: Key issue
Key players do not provide interoperability for machine access to metadata and content of research papers.
[Pie charts: accessing full text by harvesting the website (major search engines; recognised services upon approval); restricting access to full text (don’t restrict access in any way; specify a crawl delay; allow access to specific robots); reference to an article’s full text in metadata (direct link to full text; interface supporting full-text transfer); content access standards (OAI; own API; Z39.50); file formats (PDF; HTML; plain text; JSON); automated downloads of OA full text (website; API; FTP)]
7. Use Case 1: Approach
[Diagram: OA repositories and OA journals (mostly OAI-PMH) and key publishers (OA + hybrid OA, with a range of bespoke APIs and many others) feeding the Publisher Connector]
Provide seamless access over non-standardised APIs. What protocol?
8. Use Case 1: Approach
»Why not OAI-PMH?
›Slow and very inefficient for big repositories (see the sketch below).
›Standardised for metadata transfer but not for content transfer.
›Very difficult to represent the richness of metadata from a broad range of data providers.
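To make the first point concrete: an OAI-PMH harvest is a strictly sequential loop over resumption tokens, so each page can only be requested once the previous response has arrived, a large repository cannot be paged in parallel, and the protocol carries metadata only, never the full-text files. A minimal client sketch (the endpoint URL is a placeholder):

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url: str, prefix: str = "oai_dc"):
    """Sequentially page through ListRecords via resumption tokens.

    Each page depends on the token from the previous one, so the
    harvest cannot be parallelised -- one source of OAI-PMH's slowness.
    """
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    while True:
        url = f"{base_url}?{urllib.parse.urlencode(params)}"
        with urllib.request.urlopen(url, timeout=60) as resp:
            tree = ET.parse(resp)
        yield from tree.iter(f"{OAI}record")
        token = tree.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no further pages
        # When resuming, only the verb and the token are sent.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Usage (placeholder endpoint):
# for record in harvest("https://repository.example.org/oai"):
#     ...
```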
9. Use Case 1: ResourceSync as a seamless access layer
»Very scalable implementation on both the server and the client side
»Interpretation of metadata happens using the existing pipeline at the aggregator.
»1.5 million OA publications from Elsevier, Springer and others already exposed.
»Available at: https://publisher-connector.core.ac.uk/resourcesync (see the client sketch below)
[Diagram: OA repositories, OA journals and key publishers feeding the Publisher Connector, now exposed via ResourceSync]
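A ResourceSync resource list is an ordinary sitemap extended with the http://www.openarchives.org/rs/terms/ namespace, so a basic client needs nothing beyond an XML parser. A minimal sketch of enumerating the resources behind such an endpoint (a real client would typically start from the source's capability list to find the resource list URL; error handling is omitted):

```python
import urllib.request
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def resource_list_entries(url: str):
    """Yield (loc, lastmod) pairs from a ResourceSync resource list.

    Resource lists are standard sitemaps extended with the ResourceSync
    namespace, so sitemap parsing suffices for basic synchronisation.
    """
    with urllib.request.urlopen(url, timeout=60) as resp:
        tree = ET.parse(resp)
    for entry in tree.iter(f"{SM}url"):
        yield entry.findtext(f"{SM}loc"), entry.findtext(f"{SM}lastmod")

# Usage: read the capability list advertised by the source (e.g. under
# https://publisher-connector.core.ac.uk/resourcesync), locate the
# resource list it points to, then download each <loc>, using <lastmod>
# to decide what has changed since the last sync.
```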
10. Use Case 2: Exposing enriched data for Text and Data Mining (TDM) via ResourceSync
11. Use Case 2: Subscribing to ResourceSync
[Diagram: the Publisher Connector’s ResourceSync endpoint feeding other aggregators]
»Other aggregators can subscribe to the Publisher Connector to make use of its ingestion pipelines and enrichment technologies
12. Use Case 2: Content ingestion in OpenMinTeD
[Diagram: the Publisher Connector exposed via ResourceSync; content delivered onwards via OMTD-SHARE (over REST)]
»CORE and OpenAIRE are content sources in the OpenMinTeD TDM platform (an EU infrastructure project) being developed to enable the mining of scholarly literature.
13. Use Case 2: Exposing enriched data for TDM
[Diagram: enriched data re-exposed from the aggregator via a second ResourceSync endpoint]
»But others want similar solutions … typically, they want to be able to sync and host the data.
14. Use Case 3: Make repositories and journals adopt ResourceSync
15. Use Case 3: Replace OAI-PMH with ResourceSync
[Diagram: repositories and journals exposing ResourceSync alongside OAI-PMH]
»Will be a game changer …
»Advocated by the COAR Next Generation Repositories WG
17. What’s new about our implementation of ResourceSync?
»Scales to many millions of resources, as required by aggregators (as opposed to existing implementations for repositories, which scale to tens of thousands of resources)
»Real-time updating of ResourceLists and ChangeLists (avoiding unnecessary batch processes)
»Combination of real-time updates and scalability
18. Architectural choices
»Based on the principle of changes being communicated to a controller as they happen (rather than having to be detected prior to ResourceList/ChangeList updates)
»Uses Elasticsearch as a database
»Hashing mechanism to distribute the size of each ResourceList link, and a clever mechanism for iterative updating of ResourceLists (see the sketch below)
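The slides do not spell the hashing mechanism out, but one consistent reading can be sketched as follows: hash every resource URI into a fixed number of ResourceList parts, so each part stays bounded in size and an update rewrites only the single part containing the changed resource. The part count and file naming below are our assumptions, not the CORE implementation:

```python
import hashlib

N_PARTS = 1024  # assumed number of ResourceList parts

def part_for(uri: str) -> int:
    """Map a resource URI to a stable ResourceList part.

    Stable hashing means a changed resource affects exactly one part,
    so parts can be regenerated iteratively instead of in one big batch.
    """
    digest = hashlib.sha1(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_PARTS

def part_url(uri: str) -> str:
    # Hypothetical naming scheme for the per-part sitemaps.
    return f"/resourcesync/resourcelist_{part_for(uri):04d}.xml"

print(part_url("https://publisher-connector.core.ac.uk/paper/123"))
```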
19. Conclusions
»ResourceSync:
›broad range of uses in scholarly communication
›solves problems with aggregating content over OAI-PMH; faster & more efficient aggregation => fresher data in aggregators compared to OAI-PMH
»We used ResourceSync to “liberate” over 1.5 million OA papers (and growing) from key publishers
»CORE soon to provide access to over 8 million OA full texts via ResourceSync
»CORE actively contributes to the adoption of ResourceSync in the repositories community (as part of OpenMinTeD and COAR NGR)