How can repositories support the text-mining of their content and why?

How can repositories support the
text-mining of their content and
why?
@openminted_eu
Dr. Petr Knoth and Dr. Nancy Pontika
Knowledge Media institute, The Open University
United Kingdom
Twitter: @oacore

Why should repositories
support TDM?
@openminted_eu

Repositories and TDM
@openminted_eu
Institutional
Repositories
Subject
Repositories
Publishers/
OA journals
Other sources:
Research
Networking
Services
Primary Research
Data...
Text Mining Services

TDM & Repositories
Managers
@openminted_eu
• Established and maintain a close collaboration with
researchers
• Extensive experience in advocacy, i.e. open access
• Knowledgeable about the repository’s collection
• Participate in the Academic Institution’s Research
Committees
• Knowledgeable of your repository’s collection
• Familiarity with Copyright issues and Creative Commons
Licenses

How can repositories support
TDM?
TDM is all about processing text and data at
scale. The role of repositories is to facilitate the
aggregation of research papers at a full-text level
(and beyond) effectively enabling TDM services
to operate seamlessly on all available research
content.
7

What is the problem?
@openminted_eu
• A small study (Knoth, 2013)
• 83 repositories - mainly Eprints with PDF research
outputs
• 1,461,016 metadata records
metadata linked
to content
content
downloadable
content
machine
readable
Mean 54.1% 34.4% 27.6%
Median 39.5% 16.7% 13.0%
Standard
deviation
39.2% 34.2% 31.0%

How is content aggregated
today?
@openminted_eu
• DC over OAI-PMH: vast majority of repositories, never
intended to support content harvesting. The main problem:
linking metadata with content.
“The nature of a resource identifier is outside the scope of the OAI-
PMH. To facilitate access to the resource associated with harvested
metadata, repositories should use an element in metadata records
to establish a linkage between the record (and the identifier of its
item) and the identifier (URL, URN, DOI, etc.) of the associated
resource. The mandatory Dublin Core format provides the identifier
element that should be used for this purpose.”

How is content aggregated
today?
@openminted_eu
• RIOXX: Just one identifier, recommends the identifier
points to the actual resource being described.
• OpenAIRE Guidelines: identifier links to either the
resource or a jump-off page. Does allow multiple
identifiers.
• ResourceSync
• CrossRef: comercial publishers/journals

The content referencing
problem
@openminted_eu

Principle 1: content
referencing
Repositories should always establish a link from
the metadata record to the item the metadata
record describes using a dereferencable identifier
pointing to the version held locally in the
repository. The dereferencable identifier should
be provided in the appropriate metadata element
in the used metadata format (i.e. dc:identifier in
the case of Dublin Core). If multiple identifiers are
used, it is recommended listing the local
dereferencable identifier first.
1

The accessibility of
repositories to harvesting
systems
@openminted_eu

Principle 2: Content
accessibility to machines
Repositories must provide universal access to
machines with the same level of access as
humans have. It is the role of repositories to
allow aggregators to harvest the entire content of
the repository in a reasonable time to enable
acquiring and maintain up-to-date information
about the repository content.
1

What can repositories do?
@openminted_eu
• Ensure correct referencing of content from metadata:
• Dereferencable link which resolves to content
• Locally held (content under its control)
• Using a standard repository platform can help
• Check robots.txt
• Register your repository
• Advocate for good pdf (media) quality of deposited content
• Use monitoring tools
• CORE Repository Dashboard
• OpenAIRE Repository Manager Dashboard
• Machine readable licensing

beyond Open Access
MAKING SENSE OF
LARGE VOLUMES OF
SCIENTIFIC CONTENT
1

Interested in how to TDM
research papers?
@openminted_eu
We have 3 more
talks tomorrow!
Developer track 1, 11:00
Mining Open Access
publications with CORE

research papers?
@openminted_eu
We have 3 more
talks tomorrow!
Developer track 1, 11:20
Oxford vs Cambridge
Contest: Collecting Open
Research Evaluation
Metrics for University
Ranking

research papers?
@openminted_eu
We have 3 more
talks tomorrow!
Papers 4, 4:00
Exploring
Semantometrics:
full text-based
research
evaluation for
open repositories

Thank you
Dr. Pert Knoth,, Research Fellow
petr.knoth@open.ac.uk
Dr. Nancy Pontika, Open Access Aggregation
Officer
nancy.pontika@open.ac.uk
.
2

How can repositories support the text-mining of their content and why?

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (16)

Similar to How can repositories support the text-mining of their content and why?

Similar to How can repositories support the text-mining of their content and why? (20)

More from Nancy Pontika

More from Nancy Pontika (19)

Recently uploaded

Recently uploaded (20)

How can repositories support the text-mining of their content and why?

Editor's Notes