From Open Access Metadata to Open Access Content: Two Principles for Increased Visibility of Open Access Content

1/31
From Open Access Metadata to
Open Access Content
Two Principles for Increased Visibility of Open
Access Content
Petr Knoth
CORE (Connecting REpositories) project
Knowledge Media institute
The Open University
@petrknoth, #diggicore

2/31
Why from OA metadata to OA content?
• Free online availability of research outputs (not metadata of
research outputs) is the main goal of open access (OA)
• Repositories are one of the recommended ways to achieve this
(BOAI,2002).
• Despite large amount of OA content already available online
(Laakso & Bjork, 2012), OA content is not necessarily easily
discoverable (Morrisson, 2012; Konkiel, 2012).
• Often available, but difficult to find …
• Inhibiting the OA impact – accessibility, discoverability, reuse …
• Discoverability of OA content on the Web can be dramatically
increased by adopting two simple principles!

3/31
Outline
1. Goals of repositories
2. The bleak truth about availability of OA metadata vs content
3. Content referencing practises in repositories
4. Two principles to increase visibility of OA content

4/31
Outline
1. Goals of repositories (repositories as large metadata silos)

5/31
The primary purpose of repositories
• Institutional repositories (IRs) serve a number of purposes; such
collecting and curating digital outputs, providing statistics,
research excellence, etc.
• The primary goal of repositories is to open and disseminate
research outputs to a worldwide audience (Crow, 2002) –
SPARC’s position paper on the case for institutional repositories.

6/31
SPARC’s position paper on IRs
“For the repository to provide access to the broader research
community, users outside the university must be able to find and
retrieve information from the repository. Therefore, institutional
repository systems must be able to support interoperability in order
to provide access via multiple search engines and other discovery
tools. An institution does not necessarily need to implement
searching and indexing functionality to satisfy this demand: it could
simply maintain and expose metadata, allowing other services to
harvest and search the content. This simplicity lowers the barrier to
repository operation for many institutions, as it only requires a file
system to hold the content and the ability to create and share
metadata with external systems.”

7/31
COAR: About harvesting and aggregations …
“Each individual repository is of limited value for research: the real
power of Open Access lies in the possibility of connecting and tying
together repositories, which is why we need interoperability. In
order to create a seamless layer of content through connected
repositories from around the world, Open Access relies on
interoperability, the ability for systems to communicate with each
other and pass information back and forth in a usable format.
Interoperability allows us to exploit today's computational power so
that we can aggregate, data mine, create new tools and services,
and generate new knowledge from repository content.’’
[COAR manifesto]

8/31
We need OA to content (not just metadata)
• Repositories (even the most prominent) often seen by
aggregation systems as large metadata.
• OA to metadata is not disruptive. Little difference to the
traditional publishing model.

9/31
Outline

10/31
Study
• 83 repositories (mainly EPrints with pdf research outputs)
• 1,461,016 metadata records
• Ratio of metadata to content
• Data acquired from CORE (Knoth & Zdrahal, 2012)

12/31
“[The institutional repository] is like a roach motel. Data goes in, but
it doesn’t come out.” (Salo, 2008)

13/31
Why is this a problem?
• Lower accessibility of papers (we have them, but cannot find
them)
• Text-mining
• Cannot monitor growth
• Loosing a strong argument for the adoption of OA!

14/31
Outline

15/31
OAI-PMH and content referencing
• OAI-PMH supports representing metadata in multiple formats,
but at a minimum repositories must be able to return records
with metadata expressed in the Dublin Core format (OAI-PMH
v2.0, 2008)
• If repositories want to satisfy the SPARC guidelines (Crow, 2002),
they must provide a link to the content as part of the exposed
metadata.

16/31
The Open Research Online repository (Eprints) links directly to the
resource from metadata.
Cranfield repository (DSpace) identifies the resource by providing a
link to a page from which the resource (if available) can be accessed.

17/31
The OAI-PMH specification states on this topic that:
“The nature of a resource identifier is outside the scope of the OAI-
PMH. To facilitate access to the resource associated with harvested
metadata, repositories should use an element in metadata records
to establish a linkage between the record (and the identifier of its
item) and the identifier (URL, URN, DOI, etc.) of the associated
resource. The mandatory Dublin Core format provides the identifier
element that should be used for this purpose.”

18/31
• What is an identifier of the associated resource?
Is a splash page an identifier? According to OAI-PMH examples it is:
<dc:identifier>http://arXiv.org/abs/cs/0112017</dc:identifier>
• The standard is pretty weak on this aspect …

19/31
Outline

20/31
The principles of the principles 
• Pragmatic rather than exciting.
• Generating maximum benefit for a minimum investment.
• Deliberately use current standards to minimise adoption time.
• Respecting differences across systems and backwards
compatibility.
• Emphasizes the need for easy to use compliance mechanisms to
assist repository managers in ensuring systems interoperability.

21/31
Principle 1 – Content referencing
Open repositories should always establish a link from the metadata
record to the item the metadata record describes using a
dereferencable identifier pointing to the version held in the
repository. The dereferencable identifier should be provided in the
appropriate metadata element in the used metadata format (i.e.
dc:identifier in the case of Dublin Core).

22/31
Implications: Principle 1 – Content referencing
• Repositories can use different standards to deliver metadata over
OAI-PMH (DC, METS, MPEG-21 DIDL)
• Identifier must resolve (be actionable) to the object it identifies
• In the case of DC, if more identifiers are present, use the first
identifier as the identifier of the object
• Should resolve to the version of the object in the local repository
• Similarity with RIOXX identifier field
• The principle is easily applicable in the OA domain: each item can
be freely resolved

23/31
Open access statistics and principle 1
• Only dereferencable
items are OA
• Increases stats acuracy
• Avoids anecdotal
situations (e.g. 23,380
Dark Items)

24/31
Principle 2 – Content accessibility to machines
Open repositories must provide universal access to machines with
the same level of access as humans have. It is the role of open
repositories to allow machines harvest the entire content of the
repository in a reasonable time to enable harvesting systems to
acquire and maintain up-to-date information about the repository
content.

25/31
Example from Arxiv.org
• Googlebot: unrestricted
• Yahoo/MSN: can
reharvest in 6 months
• Researchers: access
denied

26/31
Implications: Principle 2 – Content accessibility to machines
• Accessibility of repository content by machines
• Enabling reuse through new services, such as those relying on
text-mining
• Open Repositories should not discriminate, except for abusive
behavior
• Presumption of innocence

27/31
Validation tools
• Key to adoption – Repository managers should not be left alone
• Repository Analytics

28/31
Conclusions
• Proportion of OA content that can be harvested is fairly low in
comparison to metadata
• Inhibiting the benefits of OA
• Two principles:
1) Dereferencable identifiers - Open Repositories provide open
access to content and not just meatadata
2) Machine access – Open Repositories should provide free access
to content (for anybody, but mainly researchers)
• Compliance validation tools are needed

29/31
Thank you!
Open access needs open repositories

30/31
References 1/2
[BOAI, 2002] Budapest Open Access Initiative. (2002)
http://www.opensocietyfoundations.org/openaccess/boai-10-recommendations
[Crow, 2002] Crow, R. (2002). The case for institutional repositories: a SPARC position
paper. ARL Bimonthly Report 223.
[Knoth & Zdrahal, 2012] Knoth, P. and Zdrahal, Z. (2012) CORE: Three Access Levels to
Underpin Open Access, D-Lib Magazine, 18, 11/12, Corporation for National Research
Initiatives, http://dx.doi.org/10.1045/november2012-knoth
[Konkiel, 2012] Konkiel, S. (2012) Are Institutional Repositories Doing Their Job?
https://blogs.libraries.iub.edu/scholcomm/2012/09/11/are-institutional-repositories-
doing-their-job/
[Laakso & Bjork, 2012] Laakso, M., & Björk, B. C. (2012). Anatomy of open access
publishing: a study of longitudinal development and internal structure. BMC Medicine,
10(1), 124.

31/31
References 2/2
[Morrison, 2012] Morrison, Louise (2012) 5 reasons why I can’t find Open Access
publications. http://mmitscotland.wordpress.com/2012/08/06/5-reasons-why-i-cant-find-
open-access-publications-2/
[OAI-PMH v2.0, 2008] The Open Archives Initiative Protocol for Metadata Harvesting
Version 2.0 (OAI-PMH), Impementation Guidelines (2008).
http://www.openarchives.org/OAI/openarchivesprotocol.html
[ResourceSync draft, 2013] ResourceSync protocol draft. 2013
http://www.niso.org/workrooms/resourcesync/
[Salo, 2008] Salo, D. (2008). Innkeeper at the roach motel. Library Trends, 57(2), 98-123.
[Van de Sompel et al, 2004] Van de Sompel, H., Nelson, M. L., Lagoze, C., & Warner, S.
(2004). Resource harvesting within the OAI-PMH framework. D-lib magazine, 10(12), 1082-
9873.

From Open Access Metadata to Open Access Content: Two Principles for Increased Visibility of Open Access Content

More Related Content

What's hot

Viewers also liked

Similar to From Open Access Metadata to Open Access Content: Two Principles for Increased Visibility of Open Access Content

More from petrknoth

Recently uploaded

From Open Access Metadata to Open Access Content: Two Principles for Increased Visibility of Open Access Content

Editor's Notes