SlideShare a Scribd company logo
1 of 20
How can repositories support the
text-mining of their content and
why?
@openminted_eu
Dr. Petr Knoth and Dr. Nancy Pontika
Knowledge Media institute, The Open University
United Kingdom
Twitter: @oacore
Why should repositories
support TDM?
@openminted_eu
@openminted_eu
In the UK
Repositories and TDM
@openminted_eu
Institutional
Repositories
Institutional
Repositories
Subject
Repositories
Subject
Repositories
Publishers/
OA journals
Publishers/
OA journals
Other sources:
Research
Networking
Services
Primary Research
Data...
Other sources:
Research
Networking
Services
Primary Research
Data...
Text Mining Services
TDM & Repositories
Managers
@openminted_eu
• Established and maintain a close collaboration with
researchers
• Extensive experience in advocacy, i.e. open access
• Knowledgeable about the repository’s collection
• Participate in the Academic Institution’s Research
Committees
• Knowledgeable of your repository’s collection
• Familiarity with Copyright issues and Creative Commons
Licenses
How can repositories
support TDM?
TDM is all about processing text and data at scale.
The role of repositories is to facilitate the aggregation
of research papers at a full-text level (and beyond)
effectively enabling TDM services to operate
seamlessly on all available research content.
7
What is the problem?
@openminted_eu
• A small study (Knoth, 2013)
• 83 repositories - mainly Eprints with PDF research
outputs
• 1,461,016 metadata records
  metadata
linked to
content
content
downloadable
content
machine
readable
Mean 54.1% 34.4% 27.6%
Median 39.5% 16.7% 13.0%
Standard
deviation
39.2% 34.2% 31.0%
How is content aggregated
today?
@openminted_eu
• DC over OAI-PMH: vast majority of repositories, never
intended to support content harvesting. The main
problem: linking metadata with content.
“The nature of a resource identifier is outside the scope of the OAI-
PMH. To facilitate access to the resource associated with harvested
metadata, repositories should use an element in metadata records
to establish a linkage between the record (and the identifier of its
item) and the identifier (URL, URN, DOI, etc.) of the associated
resource. The mandatory Dublin Core format provides the identifier
element that should be used for this purpose.”
How is content aggregated
today?
@openminted_eu
• RIOXX: Just one identifier, recommends the identifier
points to the actual resource being described.
• OpenAIRE Guidelines: identifier links to either the
resource or a jump-off page. Does allow multiple
identifiers.
• ResourceSync
• CrossRef: comercial publishers/journals
The content referencing
problem
@openminted_eu
Principle 1: content
referencing
Repositories should always establish a link from the
metadata record to the item the metadata record
describes using a dereferencable identifier pointing to
the version held locally in the repository. The
dereferencable identifier should be provided in the
appropriate metadata element in the used metadata
format (i.e. dc:identifier in the case of Dublin Core).
If multiple identifiers are used, it is recommended
listing the local dereferencable identifier first.
12
The accessibility of
repositories to harvesting
systems
@openminted_eu
Principle 2: Content
accessibility to machines
Repositories must provide universal access to
machines with the same level of access as humans
have. It is the role of repositories to allow aggregators
to harvest the entire content of the repository in a
reasonable time to enable acquiring and maintain up-
to-date information about the repository content.
14
What can repositories
do?
@openminted_eu
• Ensure correct referencing of content from
metadata:
• Dereferencable link which resolves to content
• Locally held (content under its control)
• Using a standard repository platform can help
• Check robots.txt
• Register your repository
• Advocate for good pdf (media) quality of deposited content
• Use monitoring tools
• CORE Repository Dashboard
• OpenAIRE Repository Manager Dashboard
• Machine readable licensing
beyond Open
Access
MAKING SENSE OF
LARGE VOLUMES
OF
SCIENTIFIC
CONTENT
16
Interested in how to
TDM research papers?
@openminted_eu
We have 3 more
talks tomorrow!
Developer track 1, 11:00
Mining Open Access
publications with CORE
Interested in how to
TDM research papers?
@openminted_eu
We have 3 more
talks tomorrow!
Developer track 1, 11:20
Oxford vs Cambridge
Contest: Collecting
Open Research
Evaluation Metrics for
University Ranking
Interested in how to
TDM research papers?
@openminted_eu
We have 3 more
talks tomorrow!
Papers 4, 4:00
Exploring
Semantometrics
: full text-based
research
evaluation for
open
repositories
Thank you
Dr. Pert Knoth,, Research Fellow
petr.knoth@open.ac.uk
Dr. Nancy Pontika, Open Access
Aggregation Officer
nancy.pontika@open.ac.uk
.
20

More Related Content

What's hot

LIBER on the path towards Open Science: Libraries as enablers
LIBER on the path towards Open Science:  Libraries as enablers LIBER on the path towards Open Science:  Libraries as enablers
LIBER on the path towards Open Science: Libraries as enablers LIBER Europe
 
Zenodo - The catch-all repository
Zenodo - The catch-all repository Zenodo - The catch-all repository
Zenodo - The catch-all repository OpenAccessBelgium
 
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...BigData_Europe
 
re3data.org – a Registry of Research Data Repositories
re3data.org – a Registry of Research Data Repositoriesre3data.org – a Registry of Research Data Repositories
re3data.org – a Registry of Research Data RepositoriesHeinz Pampel
 
Big Data Europe SC6 WS 3: Where we are and are going for Big Data in OpenScie...
Big Data Europe SC6 WS 3: Where we are and are going for Big Data in OpenScie...Big Data Europe SC6 WS 3: Where we are and are going for Big Data in OpenScie...
Big Data Europe SC6 WS 3: Where we are and are going for Big Data in OpenScie...BigData_Europe
 
re3data.org – Registry of Research Data Repositories
re3data.org – Registry of Research Data Repositoriesre3data.org – Registry of Research Data Repositories
re3data.org – Registry of Research Data RepositoriesHeinz Pampel
 
Open content opens up new avenues of research
Open content opens up new avenues of researchOpen content opens up new avenues of research
Open content opens up new avenues of researchFelix Lohmeier
 
Making Research Data Repositories Visible – The re3data.org Registry
Making Research Data Repositories Visible – The re3data.org RegistryMaking Research Data Repositories Visible – The re3data.org Registry
Making Research Data Repositories Visible – The re3data.org RegistryHeinz Pampel
 
(Big) bibliographic data @ ScaDS project meeting - 2015-06-12
(Big) bibliographic data @ ScaDS project meeting - 2015-06-12(Big) bibliographic data @ ScaDS project meeting - 2015-06-12
(Big) bibliographic data @ ScaDS project meeting - 2015-06-12Felix Lohmeier
 
Rebecca Grant - DRI Training Series: 1. Organising Your Collection
Rebecca Grant - DRI Training Series: 1. Organising Your Collection Rebecca Grant - DRI Training Series: 1. Organising Your Collection
Rebecca Grant - DRI Training Series: 1. Organising Your Collection dri_ireland
 
FREYA - Connected Open Identifiers for Discovery, Access and Use of Research ...
FREYA - Connected Open Identifiers for Discovery, Access and Use of Research ...FREYA - Connected Open Identifiers for Discovery, Access and Use of Research ...
FREYA - Connected Open Identifiers for Discovery, Access and Use of Research ...EUDAT
 
Libraries in the Big Data Era: Strategies and Challenges in Archiving and Sha...
Libraries in the Big Data Era: Strategies and Challenges in Archiving and Sha...Libraries in the Big Data Era: Strategies and Challenges in Archiving and Sha...
Libraries in the Big Data Era: Strategies and Challenges in Archiving and Sha...Peter Löwe
 
Open Science in HORIZON Grant Agreement
Open Science in HORIZON Grant AgreementOpen Science in HORIZON Grant Agreement
Open Science in HORIZON Grant AgreementMilan Zdravković
 

What's hot (20)

Scholze liber 2015-06-25_final
Scholze liber 2015-06-25_finalScholze liber 2015-06-25_final
Scholze liber 2015-06-25_final
 
LIBER on the path towards Open Science: Libraries as enablers
LIBER on the path towards Open Science:  Libraries as enablers LIBER on the path towards Open Science:  Libraries as enablers
LIBER on the path towards Open Science: Libraries as enablers
 
Zenodo - The catch-all repository
Zenodo - The catch-all repository Zenodo - The catch-all repository
Zenodo - The catch-all repository
 
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
 
re3data.org – a Registry of Research Data Repositories
re3data.org – a Registry of Research Data Repositoriesre3data.org – a Registry of Research Data Repositories
re3data.org – a Registry of Research Data Repositories
 
Scholze goportis 4-11-14
Scholze goportis 4-11-14Scholze goportis 4-11-14
Scholze goportis 4-11-14
 
Imac 090924
Imac 090924Imac 090924
Imac 090924
 
Elab 16 5-13-re3data-scholze-final
Elab 16 5-13-re3data-scholze-finalElab 16 5-13-re3data-scholze-final
Elab 16 5-13-re3data-scholze-final
 
Big Data Europe SC6 WS 3: Where we are and are going for Big Data in OpenScie...
Big Data Europe SC6 WS 3: Where we are and are going for Big Data in OpenScie...Big Data Europe SC6 WS 3: Where we are and are going for Big Data in OpenScie...
Big Data Europe SC6 WS 3: Where we are and are going for Big Data in OpenScie...
 
re3data.org – Registry of Research Data Repositories
re3data.org – Registry of Research Data Repositoriesre3data.org – Registry of Research Data Repositories
re3data.org – Registry of Research Data Repositories
 
Open content opens up new avenues of research
Open content opens up new avenues of researchOpen content opens up new avenues of research
Open content opens up new avenues of research
 
Making Research Data Repositories Visible – The re3data.org Registry
Making Research Data Repositories Visible – The re3data.org RegistryMaking Research Data Repositories Visible – The re3data.org Registry
Making Research Data Repositories Visible – The re3data.org Registry
 
(Big) bibliographic data @ ScaDS project meeting - 2015-06-12
(Big) bibliographic data @ ScaDS project meeting - 2015-06-12(Big) bibliographic data @ ScaDS project meeting - 2015-06-12
(Big) bibliographic data @ ScaDS project meeting - 2015-06-12
 
Connecting Museums with Linked Data
Connecting Museums with Linked DataConnecting Museums with Linked Data
Connecting Museums with Linked Data
 
Scholze imcw 2014-11-25
Scholze imcw 2014-11-25Scholze imcw 2014-11-25
Scholze imcw 2014-11-25
 
Rebecca Grant - DRI Training Series: 1. Organising Your Collection
Rebecca Grant - DRI Training Series: 1. Organising Your Collection Rebecca Grant - DRI Training Series: 1. Organising Your Collection
Rebecca Grant - DRI Training Series: 1. Organising Your Collection
 
FREYA - Connected Open Identifiers for Discovery, Access and Use of Research ...
FREYA - Connected Open Identifiers for Discovery, Access and Use of Research ...FREYA - Connected Open Identifiers for Discovery, Access and Use of Research ...
FREYA - Connected Open Identifiers for Discovery, Access and Use of Research ...
 
Libraries in the Big Data Era: Strategies and Challenges in Archiving and Sha...
Libraries in the Big Data Era: Strategies and Challenges in Archiving and Sha...Libraries in the Big Data Era: Strategies and Challenges in Archiving and Sha...
Libraries in the Big Data Era: Strategies and Challenges in Archiving and Sha...
 
Research data management: DMP & repository
Research data management: DMP & repositoryResearch data management: DMP & repository
Research data management: DMP & repository
 
Open Science in HORIZON Grant Agreement
Open Science in HORIZON Grant AgreementOpen Science in HORIZON Grant Agreement
Open Science in HORIZON Grant Agreement
 

Similar to How can repositories support the text mining of their content and why?

How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? Nancy Pontika
 
A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining Chris Shillum
 
Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publicationspetrknoth
 
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...Chris Shillum
 
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...Pedro Príncipe
 
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE
 
OSFair2017 training | Machine accessibility of Open Access scientific publica...
OSFair2017 training | Machine accessibility of Open Access scientific publica...OSFair2017 training | Machine accessibility of Open Access scientific publica...
OSFair2017 training | Machine accessibility of Open Access scientific publica...Open Science Fair
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Lucy McKenna
 
OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA...
OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA...OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA...
OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA...OpenAIRE
 
Next generation repositories
Next generation repositoriesNext generation repositories
Next generation repositoriesPaul Walk
 
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...Open Science Fair
 
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)OpenAIRE
 
OpenAIRE compatibility for repositories - Webinar on the OpenAIRE Guidelines
OpenAIRE compatibility for repositories - Webinar on the OpenAIRE GuidelinesOpenAIRE compatibility for repositories - Webinar on the OpenAIRE Guidelines
OpenAIRE compatibility for repositories - Webinar on the OpenAIRE GuidelinesPedro Príncipe
 
IDCC workshop: OpenAIRE services and tools for Open Research Data in H2020
IDCC workshop: OpenAIRE services and tools for Open Research Data in H2020IDCC workshop: OpenAIRE services and tools for Open Research Data in H2020
IDCC workshop: OpenAIRE services and tools for Open Research Data in H2020OpenAIRE
 
Open Science, Open Data: towards a new transparent and reproducible ecosystem
Open Science, Open Data:   towards a new transparent and reproducible ecosystemOpen Science, Open Data:   towards a new transparent and reproducible ecosystem
Open Science, Open Data: towards a new transparent and reproducible ecosystemLIBER Europe
 
OA Repositories for DE in Myanmar presentation
OA Repositories for DE in Myanmar presentationOA Repositories for DE in Myanmar presentation
OA Repositories for DE in Myanmar presentationaduchesne1
 
Webinar on OpenAIRE compatibility for repositories: DSpace repository platform
Webinar on OpenAIRE compatibility for repositories: DSpace repository platformWebinar on OpenAIRE compatibility for repositories: DSpace repository platform
Webinar on OpenAIRE compatibility for repositories: DSpace repository platformOpenAIRE
 
The once and future library - reimagining the national library as infrastruct...
The once and future library - reimagining the national library as infrastruct...The once and future library - reimagining the national library as infrastruct...
The once and future library - reimagining the national library as infrastruct...ukcorr
 

Similar to How can repositories support the text mining of their content and why? (20)

How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why?
 
A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining
 
Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publications
 
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
 
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
 
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
 
OSFair2017 training | Machine accessibility of Open Access scientific publica...
OSFair2017 training | Machine accessibility of Open Access scientific publica...OSFair2017 training | Machine accessibility of Open Access scientific publica...
OSFair2017 training | Machine accessibility of Open Access scientific publica...
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...
 
OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA...
OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA...OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA...
OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA...
 
Next generation repositories
Next generation repositoriesNext generation repositories
Next generation repositories
 
Repositories Update (UK)
Repositories Update (UK) Repositories Update (UK)
Repositories Update (UK)
 
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
 
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
(Open) Research Data Management in H2020 (ISERD – Tel Aviv, Oct 31, 2016)
 
OpenAIRE compatibility for repositories - Webinar on the OpenAIRE Guidelines
OpenAIRE compatibility for repositories - Webinar on the OpenAIRE GuidelinesOpenAIRE compatibility for repositories - Webinar on the OpenAIRE Guidelines
OpenAIRE compatibility for repositories - Webinar on the OpenAIRE Guidelines
 
IDCC workshop: OpenAIRE services and tools for Open Research Data in H2020
IDCC workshop: OpenAIRE services and tools for Open Research Data in H2020IDCC workshop: OpenAIRE services and tools for Open Research Data in H2020
IDCC workshop: OpenAIRE services and tools for Open Research Data in H2020
 
Patham "NISO-ODI (Open Discovery Initiative) Standards Update"
Patham "NISO-ODI (Open Discovery Initiative) Standards Update"Patham "NISO-ODI (Open Discovery Initiative) Standards Update"
Patham "NISO-ODI (Open Discovery Initiative) Standards Update"
 
Open Science, Open Data: towards a new transparent and reproducible ecosystem
Open Science, Open Data:   towards a new transparent and reproducible ecosystemOpen Science, Open Data:   towards a new transparent and reproducible ecosystem
Open Science, Open Data: towards a new transparent and reproducible ecosystem
 
OA Repositories for DE in Myanmar presentation
OA Repositories for DE in Myanmar presentationOA Repositories for DE in Myanmar presentation
OA Repositories for DE in Myanmar presentation
 
Webinar on OpenAIRE compatibility for repositories: DSpace repository platform
Webinar on OpenAIRE compatibility for repositories: DSpace repository platformWebinar on OpenAIRE compatibility for repositories: DSpace repository platform
Webinar on OpenAIRE compatibility for repositories: DSpace repository platform
 
The once and future library - reimagining the national library as infrastruct...
The once and future library - reimagining the national library as infrastruct...The once and future library - reimagining the national library as infrastruct...
The once and future library - reimagining the national library as infrastruct...
 

More from openminted_eu

Supporting the uptake of TDM
Supporting the uptake of TDMSupporting the uptake of TDM
Supporting the uptake of TDMopenminted_eu
 
OpenMinTeD, LIBER conference 2017
OpenMinTeD, LIBER conference 2017OpenMinTeD, LIBER conference 2017
OpenMinTeD, LIBER conference 2017openminted_eu
 
Resource sync overview and real-world use cases for discovery, harvesting, an...
Resource sync overview and real-world use cases for discovery, harvesting, an...Resource sync overview and real-world use cases for discovery, harvesting, an...
Resource sync overview and real-world use cases for discovery, harvesting, an...openminted_eu
 
Seamless access to the world's open access research papers via resources sync
Seamless access to the world's open access research papers via resources syncSeamless access to the world's open access research papers via resources sync
Seamless access to the world's open access research papers via resources syncopenminted_eu
 
Webinar slides: Interoperability between resources involved in TDM at the lev...
Webinar slides: Interoperability between resources involved in TDM at the lev...Webinar slides: Interoperability between resources involved in TDM at the lev...
Webinar slides: Interoperability between resources involved in TDM at the lev...openminted_eu
 
Legal issues Text and Data Mining
Legal issues Text and Data MiningLegal issues Text and Data Mining
Legal issues Text and Data Miningopenminted_eu
 
Tentative steps in mining UK theses
Tentative steps in mining UK thesesTentative steps in mining UK theses
Tentative steps in mining UK thesesopenminted_eu
 
OpenMinTeD - Une infrastructure text-mining au service des scientifiques
OpenMinTeD - Une infrastructure text-mining au service des scientifiquesOpenMinTeD - Une infrastructure text-mining au service des scientifiques
OpenMinTeD - Une infrastructure text-mining au service des scientifiquesopenminted_eu
 
Infrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProInfrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProopenminted_eu
 
Experiences of Text Mining; the National Library of Austria perspective
Experiences of Text Mining; the National Library of Austria perspectiveExperiences of Text Mining; the National Library of Austria perspective
Experiences of Text Mining; the National Library of Austria perspectiveopenminted_eu
 
Text and Data Mining at the Royal Library in the Netherlands
Text and Data Mining at the Royal Library in the NetherlandsText and Data Mining at the Royal Library in the Netherlands
Text and Data Mining at the Royal Library in the Netherlandsopenminted_eu
 

More from openminted_eu (11)

Supporting the uptake of TDM
Supporting the uptake of TDMSupporting the uptake of TDM
Supporting the uptake of TDM
 
OpenMinTeD, LIBER conference 2017
OpenMinTeD, LIBER conference 2017OpenMinTeD, LIBER conference 2017
OpenMinTeD, LIBER conference 2017
 
Resource sync overview and real-world use cases for discovery, harvesting, an...
Resource sync overview and real-world use cases for discovery, harvesting, an...Resource sync overview and real-world use cases for discovery, harvesting, an...
Resource sync overview and real-world use cases for discovery, harvesting, an...
 
Seamless access to the world's open access research papers via resources sync
Seamless access to the world's open access research papers via resources syncSeamless access to the world's open access research papers via resources sync
Seamless access to the world's open access research papers via resources sync
 
Webinar slides: Interoperability between resources involved in TDM at the lev...
Webinar slides: Interoperability between resources involved in TDM at the lev...Webinar slides: Interoperability between resources involved in TDM at the lev...
Webinar slides: Interoperability between resources involved in TDM at the lev...
 
Legal issues Text and Data Mining
Legal issues Text and Data MiningLegal issues Text and Data Mining
Legal issues Text and Data Mining
 
Tentative steps in mining UK theses
Tentative steps in mining UK thesesTentative steps in mining UK theses
Tentative steps in mining UK theses
 
OpenMinTeD - Une infrastructure text-mining au service des scientifiques
OpenMinTeD - Une infrastructure text-mining au service des scientifiquesOpenMinTeD - Une infrastructure text-mining au service des scientifiques
OpenMinTeD - Une infrastructure text-mining au service des scientifiques
 
Infrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProInfrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKPro
 
Experiences of Text Mining; the National Library of Austria perspective
Experiences of Text Mining; the National Library of Austria perspectiveExperiences of Text Mining; the National Library of Austria perspective
Experiences of Text Mining; the National Library of Austria perspective
 
Text and Data Mining at the Royal Library in the Netherlands
Text and Data Mining at the Royal Library in the NetherlandsText and Data Mining at the Royal Library in the Netherlands
Text and Data Mining at the Royal Library in the Netherlands
 

Recently uploaded

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 

Recently uploaded (20)

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 

How can repositories support the text mining of their content and why?

  • 1. How can repositories support the text-mining of their content and why? @openminted_eu Dr. Petr Knoth and Dr. Nancy Pontika Knowledge Media institute, The Open University United Kingdom Twitter: @oacore
  • 2. Why should repositories support TDM? @openminted_eu
  • 4. Repositories and TDM @openminted_eu Institutional Repositories Institutional Repositories Subject Repositories Subject Repositories Publishers/ OA journals Publishers/ OA journals Other sources: Research Networking Services Primary Research Data... Other sources: Research Networking Services Primary Research Data... Text Mining Services
  • 5.
  • 6. TDM & Repositories Managers @openminted_eu • Established and maintain a close collaboration with researchers • Extensive experience in advocacy, i.e. open access • Knowledgeable about the repository’s collection • Participate in the Academic Institution’s Research Committees • Knowledgeable of your repository’s collection • Familiarity with Copyright issues and Creative Commons Licenses
  • 7. How can repositories support TDM? TDM is all about processing text and data at scale. The role of repositories is to facilitate the aggregation of research papers at a full-text level (and beyond) effectively enabling TDM services to operate seamlessly on all available research content. 7
  • 8. What is the problem? @openminted_eu • A small study (Knoth, 2013) • 83 repositories - mainly Eprints with PDF research outputs • 1,461,016 metadata records   metadata linked to content content downloadable content machine readable Mean 54.1% 34.4% 27.6% Median 39.5% 16.7% 13.0% Standard deviation 39.2% 34.2% 31.0%
  • 9. How is content aggregated today? @openminted_eu • DC over OAI-PMH: vast majority of repositories, never intended to support content harvesting. The main problem: linking metadata with content. “The nature of a resource identifier is outside the scope of the OAI- PMH. To facilitate access to the resource associated with harvested metadata, repositories should use an element in metadata records to establish a linkage between the record (and the identifier of its item) and the identifier (URL, URN, DOI, etc.) of the associated resource. The mandatory Dublin Core format provides the identifier element that should be used for this purpose.”
  • 10. How is content aggregated today? @openminted_eu • RIOXX: Just one identifier, recommends the identifier points to the actual resource being described. • OpenAIRE Guidelines: identifier links to either the resource or a jump-off page. Does allow multiple identifiers. • ResourceSync • CrossRef: comercial publishers/journals
  • 12. Principle 1: content referencing Repositories should always establish a link from the metadata record to the item the metadata record describes using a dereferencable identifier pointing to the version held locally in the repository. The dereferencable identifier should be provided in the appropriate metadata element in the used metadata format (i.e. dc:identifier in the case of Dublin Core). If multiple identifiers are used, it is recommended listing the local dereferencable identifier first. 12
  • 13. The accessibility of repositories to harvesting systems @openminted_eu
  • 14. Principle 2: Content accessibility to machines Repositories must provide universal access to machines with the same level of access as humans have. It is the role of repositories to allow aggregators to harvest the entire content of the repository in a reasonable time to enable acquiring and maintain up- to-date information about the repository content. 14
  • 15. What can repositories do? @openminted_eu • Ensure correct referencing of content from metadata: • Dereferencable link which resolves to content • Locally held (content under its control) • Using a standard repository platform can help • Check robots.txt • Register your repository • Advocate for good pdf (media) quality of deposited content • Use monitoring tools • CORE Repository Dashboard • OpenAIRE Repository Manager Dashboard • Machine readable licensing
  • 16. beyond Open Access MAKING SENSE OF LARGE VOLUMES OF SCIENTIFIC CONTENT 16
  • 17. Interested in how to TDM research papers? @openminted_eu We have 3 more talks tomorrow! Developer track 1, 11:00 Mining Open Access publications with CORE
  • 18. Interested in how to TDM research papers? @openminted_eu We have 3 more talks tomorrow! Developer track 1, 11:20 Oxford vs Cambridge Contest: Collecting Open Research Evaluation Metrics for University Ranking
  • 19. Interested in how to TDM research papers? @openminted_eu We have 3 more talks tomorrow! Papers 4, 4:00 Exploring Semantometrics : full text-based research evaluation for open repositories
  • 20. Thank you Dr. Pert Knoth,, Research Fellow petr.knoth@open.ac.uk Dr. Nancy Pontika, Open Access Aggregation Officer nancy.pontika@open.ac.uk . 20

Editor's Notes

  1. Mining individual repositories is not intersteing. TDM is about processing at scale. The role of repositories is: …
  2. So why am I talking about what the role of the repositories is? Well I think we have a slight problem here … We have done a study to …
  3. The main problem: linking metadata with content.
  4. OpenAIRE guidelines: https://guidelines.openaire.eu/en/latest/literature/field_resourceidentifier.html The ideal use of this element is to use a direct link or a link to a jump-off page (persistent URL) fromdc:identifier in the metadata record to the digital resource or a jump-off page.
  5. <dc:identifier> field: The aim of the Dublin Core Metadata tags is to ensure online interoperability of metadata standards. The importance of the <dc:identifier> tag is that it describes the resource of the harvested output. CORE expects in this field to find the direct URL of the PDF. When the information in this field is not presented properly, the CORE crawler needs to crawl for the PDF and the success of finding it cannot be guaranteed. This also causes additional server processing time and bandwidth both for the harvester and the hosting institution.There are also three additional points that need to be considered with regards to the <dc:identifier>; a) this field should describe an absolute path to the file, b) it should contain an appropriate file name extension, for example “.pdf” and c) the full-text items should be stored under the same repository domain.
  6. The problem is not multiple metadata formats, but the fact that none of them is good enough! Thinking that by supporting the guidelines you allow content aggregation is an issue. Locally means within the repositories control. <dc:identifier> field: The aim of the Dublin Core Metadata tags is to ensure online interoperability of metadata standards. The importance of the <dc:identifier> tag is that it describes the resource of the harvested output. CORE expects in this field to find the direct URL of the PDF. When the information in this field is not presented properly, the CORE crawler needs to crawl for the PDF and the success of finding it cannot be guaranteed. This also causes additional server processing time and bandwidth both for the harvester and the hosting institution.There are also three additional points that need to be considered with regards to the <dc:identifier>; a) this field should describe an absolute path to the file, b) it should contain an appropriate file name extension, for example “.pdf” and c) the full-text items should be stored under the same repository domain.
  7. Arxiv has now a slightly nicer robots.txt where anoyone is allowed access with a 15s delay. Still not doable …
  8. Platform: For those who haven’t deployed a repository yet, it is highly advised that the repository platform is not built in house, but one of the industry standard platforms is chosen. The benefits of choosing one of the existing platforms is that they provide frequent content updates, constant support and extend repository functionality through plug-ins.
  9. Our ultimate goal is to put in place infrastructure that will enable anyone to make sense of large volumes of scientific data. The infrastructure is open and transparent.
  10. If you are interested in how we makes sense of the large volumes of scientific content.
  11. If you are interested in how we makes sense of the large volumes of scientific content.
  12. If you are interested in how we makes sense of the large volumes of scientific content.