Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
How can repositories support the
text-mining of their content and
why?
@openminted_eu
Dr. Petr Knoth and Dr. Nancy Pontika...
Why should repositories
support TDM?
@openminted_eu
@openminted_eu
In the UK
Repositories and TDM
@openminted_eu
Institutional
Repositories
Institutional
Repositories
Subject
Repositories
Subject
Rep...
TDM & Repositories
Managers
@openminted_eu
• Established and maintain a close collaboration with
researchers
• Extensive e...
How can repositories
support TDM?
TDM is all about processing text and data at scale.
The role of repositories is to facil...
What is the problem?
@openminted_eu
• A small study (Knoth, 2013)
• 83 repositories - mainly Eprints with PDF research
out...
How is content aggregated
today?
@openminted_eu
• DC over OAI-PMH: vast majority of repositories, never
intended to suppor...
How is content aggregated
today?
@openminted_eu
• RIOXX: Just one identifier, recommends the identifier
points to the actu...
The content referencing
problem
@openminted_eu
Principle 1: content
referencing
Repositories should always establish a link from the
metadata record to the item the meta...
The accessibility of
repositories to harvesting
systems
@openminted_eu
Principle 2: Content
accessibility to machines
Repositories must provide universal access to
machines with the same level ...
What can repositories
do?
@openminted_eu
• Ensure correct referencing of content from
metadata:
• Dereferencable link whic...
beyond Open
Access
MAKING SENSE OF
LARGE VOLUMES
OF
SCIENTIFIC
CONTENT
16
Interested in how to
TDM research papers?
@openminted_eu
We have 3 more
talks tomorrow!
Developer track 1, 11:00
Mining Op...
Interested in how to
TDM research papers?
@openminted_eu
We have 3 more
talks tomorrow!
Developer track 1, 11:20
Oxford vs...
Interested in how to
TDM research papers?
@openminted_eu
We have 3 more
talks tomorrow!
Papers 4, 4:00
Exploring
Semantome...
Thank you
Dr. Pert Knoth,, Research Fellow
petr.knoth@open.ac.uk
Dr. Nancy Pontika, Open Access
Aggregation Officer
nancy....
How can repositories support the text mining of their content and why?
Upcoming SlideShare
Loading in …5
×

How can repositories support the text mining of their content and why?

783 views

Published on

Petr Knoth, Nancy Pontika

Published in: Data & Analytics
  • Be the first to comment

How can repositories support the text mining of their content and why?

  1. 1. How can repositories support the text-mining of their content and why? @openminted_eu Dr. Petr Knoth and Dr. Nancy Pontika Knowledge Media institute, The Open University United Kingdom Twitter: @oacore
  2. 2. Why should repositories support TDM? @openminted_eu
  3. 3. @openminted_eu In the UK
  4. 4. Repositories and TDM @openminted_eu Institutional Repositories Institutional Repositories Subject Repositories Subject Repositories Publishers/ OA journals Publishers/ OA journals Other sources: Research Networking Services Primary Research Data... Other sources: Research Networking Services Primary Research Data... Text Mining Services
  5. 5. TDM & Repositories Managers @openminted_eu • Established and maintain a close collaboration with researchers • Extensive experience in advocacy, i.e. open access • Knowledgeable about the repository’s collection • Participate in the Academic Institution’s Research Committees • Knowledgeable of your repository’s collection • Familiarity with Copyright issues and Creative Commons Licenses
  6. 6. How can repositories support TDM? TDM is all about processing text and data at scale. The role of repositories is to facilitate the aggregation of research papers at a full-text level (and beyond) effectively enabling TDM services to operate seamlessly on all available research content. 7
  7. 7. What is the problem? @openminted_eu • A small study (Knoth, 2013) • 83 repositories - mainly Eprints with PDF research outputs • 1,461,016 metadata records   metadata linked to content content downloadable content machine readable Mean 54.1% 34.4% 27.6% Median 39.5% 16.7% 13.0% Standard deviation 39.2% 34.2% 31.0%
  8. 8. How is content aggregated today? @openminted_eu • DC over OAI-PMH: vast majority of repositories, never intended to support content harvesting. The main problem: linking metadata with content. “The nature of a resource identifier is outside the scope of the OAI- PMH. To facilitate access to the resource associated with harvested metadata, repositories should use an element in metadata records to establish a linkage between the record (and the identifier of its item) and the identifier (URL, URN, DOI, etc.) of the associated resource. The mandatory Dublin Core format provides the identifier element that should be used for this purpose.”
  9. 9. How is content aggregated today? @openminted_eu • RIOXX: Just one identifier, recommends the identifier points to the actual resource being described. • OpenAIRE Guidelines: identifier links to either the resource or a jump-off page. Does allow multiple identifiers. • ResourceSync • CrossRef: comercial publishers/journals
  10. 10. The content referencing problem @openminted_eu
  11. 11. Principle 1: content referencing Repositories should always establish a link from the metadata record to the item the metadata record describes using a dereferencable identifier pointing to the version held locally in the repository. The dereferencable identifier should be provided in the appropriate metadata element in the used metadata format (i.e. dc:identifier in the case of Dublin Core). If multiple identifiers are used, it is recommended listing the local dereferencable identifier first. 12
  12. 12. The accessibility of repositories to harvesting systems @openminted_eu
  13. 13. Principle 2: Content accessibility to machines Repositories must provide universal access to machines with the same level of access as humans have. It is the role of repositories to allow aggregators to harvest the entire content of the repository in a reasonable time to enable acquiring and maintain up- to-date information about the repository content. 14
  14. 14. What can repositories do? @openminted_eu • Ensure correct referencing of content from metadata: • Dereferencable link which resolves to content • Locally held (content under its control) • Using a standard repository platform can help • Check robots.txt • Register your repository • Advocate for good pdf (media) quality of deposited content • Use monitoring tools • CORE Repository Dashboard • OpenAIRE Repository Manager Dashboard • Machine readable licensing
  15. 15. beyond Open Access MAKING SENSE OF LARGE VOLUMES OF SCIENTIFIC CONTENT 16
  16. 16. Interested in how to TDM research papers? @openminted_eu We have 3 more talks tomorrow! Developer track 1, 11:00 Mining Open Access publications with CORE
  17. 17. Interested in how to TDM research papers? @openminted_eu We have 3 more talks tomorrow! Developer track 1, 11:20 Oxford vs Cambridge Contest: Collecting Open Research Evaluation Metrics for University Ranking
  18. 18. Interested in how to TDM research papers? @openminted_eu We have 3 more talks tomorrow! Papers 4, 4:00 Exploring Semantometrics : full text-based research evaluation for open repositories
  19. 19. Thank you Dr. Pert Knoth,, Research Fellow petr.knoth@open.ac.uk Dr. Nancy Pontika, Open Access Aggregation Officer nancy.pontika@open.ac.uk . 20

×