1/38
From Open Access Metadata to
Open Access Content:
Towards an Infrastructure for
Mining Scientific Publications
Petr Knoth
CORE (Connecting REpositories) project
Knowledge Media Institute
The Open University
@petrknoth, #diggicore
2/38
What is Open Access exactly?
By “open access” to [peer-reviewed research literature], we mean
its free availability on the public internet, permitting any users to
read, download, copy, distribute, print, search, or link to the full
texts of these articles, crawl them for indexing, pass them as data
to software, or use them for any other lawful purpose, without
financial, legal, or technical barriers other than those inseparable
from gaining access to the internet itself.
[BOAI, 2002]
3/38
How to achieve OA?
Two routes:
• Self-archiving: Institutional/Open Repositories
• Open Access Journals
4/38
Why from OA metadata to OA content?
• Despite the large amount of OA content already available online
(Laakso & Bjork, 2012), OA content is not necessarily easily
discoverable (Morrison, 2012; Konkiel, 2012).
• Often available, but difficult to find …
• Inhibiting the OA impact – accessibility, discoverability, reuse …
• Discoverability of OA content on the Web can be dramatically
increased by adopting two simple principles!
5/38
Outline
1. Goals of repositories
2. The bleak truth about availability of OA metadata vs content
3. Content referencing practices in repositories
4. Two principles to increase visibility of OA content
5. How to data mine OA aggregated data and why
6/38
Outline
1. Goals of repositories (repositories as large metadata silos)
2. The bleak truth about availability of OA metadata vs content
3. Content referencing practices in repositories
4. Two principles to increase visibility of OA content
5. How to data mine OA aggregated data and why
7/38
The primary purpose of repositories
• Institutional repositories (IRs) serve a number of purposes, such
as collecting and curating digital outputs, providing statistics,
research excellence, etc.
• The primary goal of repositories is to open and disseminate
research outputs to a worldwide audience (Crow, 2002) –
SPARC’s position paper on the case for institutional repositories.
8/38
SPARC’s position paper on IRs
“For the repository to provide access to the broader research
community, users outside the university must be able to find and
retrieve information from the repository. Therefore, institutional
repository systems must be able to support interoperability in order
to provide access via multiple search engines and other discovery
tools. An institution does not necessarily need to implement
searching and indexing functionality to satisfy this demand: it could
simply maintain and expose metadata, allowing other services to
harvest and search the content. This simplicity lowers the barrier to
repository operation for many institutions, as it only requires a file
system to hold the content and the ability to create and share
metadata with external systems.”
9/38
COAR: About harvesting and aggregations …
“Each individual repository is of limited value for research: the real
power of Open Access lies in the possibility of connecting and tying
together repositories, which is why we need interoperability. In
order to create a seamless layer of content through connected
repositories from around the world, Open Access relies on
interoperability, the ability for systems to communicate with each
other and pass information back and forth in a usable format.
Interoperability allows us to exploit today's computational power so
that we can aggregate, data mine, create new tools and services,
and generate new knowledge from repository content.”
[COAR manifesto]
10/38
We need OA to content (not just metadata)
• Repositories (even the most prominent) are often seen by
aggregation systems as large metadata silos.
• OA to metadata is not disruptive. Little difference to the
traditional publishing model.
11/38
Outline
1. Goals of repositories (repositories as large metadata silos)
2. The bleak truth about availability of OA metadata vs content
3. Content referencing practices in repositories
4. Two principles to increase visibility of OA content
5. How to data mine OA aggregated data and why
12/38
Study
• 83 repositories (mainly EPrints with PDF research outputs)
• 1,461,016 metadata records
• Ratio of metadata to content
• Data acquired from CORE (Knoth & Zdrahal, 2012)
13/38
“[The institutional repository] is like a roach motel. Data goes in, but
it doesn’t come out.” (Salo, 2008)
14/38
Why is this a problem?
• Lower accessibility of papers (we have them, but cannot find
them)
• Text mining is hindered
• Cannot monitor growth
• Losing a strong argument for the adoption of OA!
15/38
Outline
1. Goals of repositories (repositories as large metadata silos)
2. The bleak truth about availability of OA metadata vs content
3. Content referencing practices in repositories
4. Two principles to increase visibility of OA content
5. How to data mine OA aggregated data and why
16/38
OAI-PMH and content referencing
• OAI-PMH supports representing metadata in multiple formats,
but at a minimum repositories must be able to return records
with metadata expressed in the Dublin Core format (OAI-PMH
v2.0, 2008)
• If repositories want to satisfy the SPARC guidelines (Crow, 2002),
they must provide a link to the content as part of the exposed
metadata.
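A minimal sketch of what a harvester sends to obtain those Dublin Core records. The verb ListRecords and the metadata prefix oai_dc come from the OAI-PMH v2.0 specification; the base URL and the function name are hypothetical placeholders, not a real endpoint or library call.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hypothetical repository endpoint, used only for illustration.
url = list_records_url("https://repository.example.org/oai")
print(url)
```

The response to such a request is where the link to the content has to appear if the repository is to be more than a metadata source.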
17/38
OAI-PMH and content referencing
The Open Research Online repository (EPrints) links directly to the
resource from metadata.
Cranfield repository (DSpace) identifies the resource by providing a
link to a page from which the resource (if available) can be accessed.
18/38
OAI-PMH and content referencing
The OAI-PMH specification states on this topic that:
“The nature of a resource identifier is outside the scope of the OAI-
PMH. To facilitate access to the resource associated with harvested
metadata, repositories should use an element in metadata records
to establish a linkage between the record (and the identifier of its
item) and the identifier (URL, URN, DOI, etc.) of the associated
resource. The mandatory Dublin Core format provides the identifier
element that should be used for this purpose.”
19/38
OAI-PMH and content referencing
• What is an identifier of the associated resource?
Is a splash page an identifier? According to OAI-PMH examples it is:
<dc:identifier>http://arXiv.org/abs/cs/0112017</dc:identifier>
• The standard is pretty weak on this aspect …
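To make the ambiguity concrete, here is a rough sketch of how a harvester might pull identifiers out of a Dublin Core record. The record is a trimmed, hand-written example that reuses the identifier quoted above; the title is a placeholder and the helper name is an assumption.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Trimmed, hand-written record for illustration; the title is a placeholder.
RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title (placeholder)</dc:title>
  <dc:identifier>http://arXiv.org/abs/cs/0112017</dc:identifier>
</oai_dc:dc>
"""


def dc_identifiers(record_xml: str) -> list[str]:
    """Return the text of every dc:identifier element, in document order."""
    root = ET.fromstring(record_xml.strip())
    return [el.text.strip() for el in root.iter(f"{{{DC_NS}}}identifier") if el.text]


identifiers = dc_identifiers(RECORD)
print(identifiers)
```

Nothing in the extracted value tells the harvester whether the URL leads to the full text or to a splash page, which is the weakness noted above.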
20/38
Outline
1. Goals of repositories (repositories as large metadata silos)
2. The bleak truth about availability of OA metadata vs content
3. Content referencing practices in repositories
4. Two principles to increase visibility of OA content
5. How to data mine OA aggregated data and why
21/38
The principles of the principles
• Pragmatic rather than exciting.
• Generating maximum benefit for a minimum investment.
• Deliberately use current standards to minimise adoption time.
• Respecting differences across systems and backwards
compatibility.
• Emphasising the need for easy-to-use compliance mechanisms to
assist repository managers in ensuring systems interoperability.
22/38
Principle 1 – Content referencing
Open repositories should always establish a link from the metadata
record to the item the metadata record describes using a
dereferenceable identifier pointing to the version held in the
repository. The dereferenceable identifier should be provided in the
appropriate element of the metadata format in use (e.g.
dc:identifier in the case of Dublin Core).
23/38
Implications: Principle 1 – Content referencing
• Repositories can use different standards to deliver metadata over
OAI-PMH (DC, METS, MPEG-21 DIDL)
• Identifier must resolve (be actionable) to the object it identifies
• In the case of DC, if several identifiers are present, use the first
identifier as the identifier of the object
• Should resolve to the version of the object in the local repository
• Similarity with RIOXX identifier field
• The principle is easily applicable in the OA domain: each item can
be freely resolved
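One way a compliance check for Principle 1 could look, as a sketch only. It takes the first identifier, as suggested above for Dublin Core, and asks what it resolves to. The rule "an HTML response is a splash page" is an assumption made for illustration, and the HTTP lookup is replaced by a dictionary of made-up URLs so the example runs offline.

```python
from typing import Callable, Optional, Sequence


def resolves_to_content(
    identifiers: Sequence[str],
    fetch_content_type: Callable[[str], Optional[str]],
) -> bool:
    """Check whether the first identifier resolves to the object itself."""
    if not identifiers:
        return False
    content_type = fetch_content_type(identifiers[0])
    if content_type is None:  # the identifier did not resolve at all
        return False
    # Assumption: an HTML response is a splash page, anything else is the object.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type != "text/html"


# Stand-in for a real HTTP HEAD request; URLs and responses are made up.
FAKE_RESPONSES = {
    "https://repository.example.org/1234/paper.pdf": "application/pdf",
    "https://repository.example.org/1234/": "text/html; charset=utf-8",
}
lookup = FAKE_RESPONSES.get

print(resolves_to_content(["https://repository.example.org/1234/paper.pdf"], lookup))  # True
print(resolves_to_content(["https://repository.example.org/1234/"], lookup))  # False
```

Passing the lookup in as a function keeps the rule separate from the network code, so a validation tool could swap in a real HTTP client without changing the check.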
24/38
Open access statistics and principle 1
• Only dereferenceable items are OA
• Increases stats accuracy
• Avoids anecdotal situations (e.g. 23,380 Dark Items)
25/38
Principle 2 – Content accessibility to machines
Open repositories must provide universal access to machines with
the same level of access as humans have. It is the role of open
repositories to allow machines to harvest the entire content of the
repository in a reasonable time to enable harvesting systems to
acquire and maintain up-to-date information about the repository
content.
26/38
Example from arXiv.org
• Googlebot: unrestricted
• Yahoo/MSN: can reharvest in 6 months
• Researchers: access denied
27/38
Implications: Principle 2 – Content accessibility to machines
• Accessibility of repository content by machines
• Enabling reuse through new services, such as those relying on
text-mining
• Open Repositories should not discriminate, except for abusive
behavior
• Presumption of innocence
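A small sketch of how such discrimination can be detected from a repository's robots.txt, using Python's standard urllib.robotparser. The robots.txt content, the URL, and the "ResearchHarvester" agent name are all made up for illustration and are not copied from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that favours one crawler over everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report, per user agent, whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["Googlebot", "ResearchHarvester"],  # the second agent name is made up
    "https://repository.example.org/1234/paper.pdf",
)
print(result)
```

A repository that follows Principle 2 would return the same answer for every well-behaved agent.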
28/38
Validation tools
• Key to adoption – Repository managers should not be left alone
• Repository Analytics
29/38
Outline
1. Goals of repositories (repositories as large metadata silos)
2. The bleak truth about availability of OA metadata vs content
3. Content referencing practices in repositories
4. Two principles to increase visibility of OA content
5. How to data mine OA aggregated data and why
30/38
CORE API
Enables external systems to interact with OA data (JSON or XML)
• Search, download metadata and content
• Content recommendation
• Citation references
• Statistics
• …
31/38
Data dumps
• About 11.5 million records
• Over 1 million full-texts
• Cleaned and enriched with additional information
• Distributed as two large zip files: metadata + full-texts
32/38
Examples of usage
• Author disambiguation
• Mining URLs from papers to detect trends
• Tagging of chemical compounds for image retrieval
• Citation analysis
• Content recommendation
• Detecting collaboration patterns of scientific communities
• Monitoring of OA growth
• Any form of text or data mining …
• API useful for services and data dumps for offline experiments
33/38
Why use it?
• It is only OA, thus you can legally mine it …
• You can redistribute it: essential for reproducible research
• Very large and growing
• Kept up-to-date
• Ability to rerun experiments with new data
• All research content will soon be OA (UK HEFCE policy)
• Status of a UK national aggregator
• 0.5 million monthly visits, up from only 150k six months ago
34/38
Why use it?
• Open infrastructure for open science
• Not owned or managed by a for-profit company, so you can run
your own services: new opportunities, and no giving away of your
research to commercial companies
35/38
Conclusions
• Visibility of OA content can be significantly improved by adopting
two principles:
1) Dereferenceable identifiers - Open Repositories provide open
access to content and not just metadata
2) Machine access - Open Repositories should provide free access
to content (for anybody and mainly researchers)
• Compliance validation tools are needed to support repositories
• Researchers who want to mine content or build services can
rely on aggregators to acquire datasets
• Researchers can deploy their solutions, not just rely on
commercial providers.
36/38
Thank you!
Open access needs open repositories and open science
37/38
References 1/2
[BOAI, 2002] Budapest Open Access Initiative. (2002)
http://www.opensocietyfoundations.org/openaccess/boai-10-recommendations
[Crow, 2002] Crow, R. (2002). The case for institutional repositories: a SPARC position
paper. ARL Bimonthly Report 223.
[Knoth & Zdrahal, 2012] Knoth, P. and Zdrahal, Z. (2012) CORE: Three Access Levels to
Underpin Open Access, D-Lib Magazine, 18, 11/12, Corporation for National Research
Initiatives, http://dx.doi.org/10.1045/november2012-knoth
[Konkiel, 2012] Konkiel, S. (2012) Are Institutional Repositories Doing Their Job?
https://blogs.libraries.iub.edu/scholcomm/2012/09/11/are-institutional-repositories-doing-their-job/
[Laakso & Bjork, 2012] Laakso, M., & Björk, B. C. (2012). Anatomy of open access
publishing: a study of longitudinal development and internal structure. BMC Medicine,
10(1), 124.
38/38
References 2/2
[Morrison, 2012] Morrison, Louise (2012) 5 reasons why I can’t find Open Access
publications.
http://mmitscotland.wordpress.com/2012/08/06/5-reasons-why-i-cant-find-open-access-publications-2/
[OAI-PMH v2.0, 2008] The Open Archives Initiative Protocol for Metadata Harvesting
Version 2.0 (OAI-PMH), Implementation Guidelines (2008).
http://www.openarchives.org/OAI/openarchivesprotocol.html
[ResourceSync draft, 2013] ResourceSync protocol draft. 2013
http://www.niso.org/workrooms/resourcesync/
[Salo, 2008] Salo, D. (2008). Innkeeper at the roach motel. Library Trends, 57(2), 98-123.
[Van de Sompel et al, 2004] Van de Sompel, H., Nelson, M. L., Lagoze, C., & Warner, S.
(2004). Resource harvesting within the OAI-PMH framework. D-Lib Magazine, 10(12), 1082-9873.

More Related Content

What's hot

Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Frank Oellien
 
OAI and Publishers’ metadata: Using the static repositories approach to discl...
OAI and Publishers’ metadata: Using the static repositories approach to discl...OAI and Publishers’ metadata: Using the static repositories approach to discl...
OAI and Publishers’ metadata: Using the static repositories approach to discl...R. John Robertson
 
Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)
Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)
Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)Treparel
 
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreAndy Powell
 
Hughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication RepositoriesHughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication RepositoriesASIS&T
 
CNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsCNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsAnita de Waard
 
EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 7, 2016|...
EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 7, 2016|...EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 7, 2016|...
EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 7, 2016|...EUDAT
 
Northumbria University Geospatial Metadata Workshop 20110505
Northumbria University Geospatial Metadata Workshop 20110505Northumbria University Geospatial Metadata Workshop 20110505
Northumbria University Geospatial Metadata Workshop 20110505EDINA, University of Edinburgh
 
Horizon 2020 open access and open data mandates
Horizon 2020 open access and open data mandatesHorizon 2020 open access and open data mandates
Horizon 2020 open access and open data mandatesMartin Donnelly
 
H2020 Open Research Data pilot
H2020 Open Research Data pilotH2020 Open Research Data pilot
H2020 Open Research Data pilotSarah Jones
 
Grid Computing July 2009
Grid Computing July 2009Grid Computing July 2009
Grid Computing July 2009Ian Foster
 
EPSRC Policy Compliance: What researchers need to know
EPSRC Policy Compliance: What researchers need to knowEPSRC Policy Compliance: What researchers need to know
EPSRC Policy Compliance: What researchers need to knowHistoric Environment Scotland
 
Open Data: Sharing the Main Actor of a Scientific Story - Paola Masuzzo
Open Data: Sharing the Main Actor of a Scientific Story - Paola MasuzzoOpen Data: Sharing the Main Actor of a Scientific Story - Paola Masuzzo
Open Data: Sharing the Main Actor of a Scientific Story - Paola MasuzzoOpenAIRE
 
Research data policy
Research data policyResearch data policy
Research data policySarah Jones
 
PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013Frauke Ziedorn
 
OpenAIRE and Eudat services and tools to support FAIR DMP implementation
OpenAIRE and Eudat services and tools to support FAIR DMP implementation OpenAIRE and Eudat services and tools to support FAIR DMP implementation
OpenAIRE and Eudat services and tools to support FAIR DMP implementation Research Data Alliance
 
H2020 Open Data Pilot
H2020 Open Data PilotH2020 Open Data Pilot
H2020 Open Data PilotSarah Jones
 

What's hot (20)

Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
 
OAI and Publishers’ metadata: Using the static repositories approach to discl...
OAI and Publishers’ metadata: Using the static repositories approach to discl...OAI and Publishers’ metadata: Using the static repositories approach to discl...
OAI and Publishers’ metadata: Using the static repositories approach to discl...
 
Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)
Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)
Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)
 
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
 
Hughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication RepositoriesHughes RDAP11 Data Publication Repositories
Hughes RDAP11 Data Publication Repositories
 
CNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsCNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data Commons
 
EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 7, 2016|...
EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 7, 2016|...EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 7, 2016|...
EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 7, 2016|...
 
Northumbria University Geospatial Metadata Workshop 20110505
Northumbria University Geospatial Metadata Workshop 20110505Northumbria University Geospatial Metadata Workshop 20110505
Northumbria University Geospatial Metadata Workshop 20110505
 
SomeSlides
SomeSlidesSomeSlides
SomeSlides
 
Horizon 2020 open access and open data mandates
Horizon 2020 open access and open data mandatesHorizon 2020 open access and open data mandates
Horizon 2020 open access and open data mandates
 
H2020 Open Research Data pilot
H2020 Open Research Data pilotH2020 Open Research Data pilot
H2020 Open Research Data pilot
 
Grid Computing July 2009
Grid Computing July 2009Grid Computing July 2009
Grid Computing July 2009
 
Open Access Repository Junction
Open Access Repository JunctionOpen Access Repository Junction
Open Access Repository Junction
 
EPSRC Policy Compliance: What researchers need to know
EPSRC Policy Compliance: What researchers need to knowEPSRC Policy Compliance: What researchers need to know
EPSRC Policy Compliance: What researchers need to know
 
Open Data: Sharing the Main Actor of a Scientific Story - Paola Masuzzo
Open Data: Sharing the Main Actor of a Scientific Story - Paola MasuzzoOpen Data: Sharing the Main Actor of a Scientific Story - Paola Masuzzo
Open Data: Sharing the Main Actor of a Scientific Story - Paola Masuzzo
 
Authentication Methods: Shibboleth
Authentication Methods: ShibbolethAuthentication Methods: Shibboleth
Authentication Methods: Shibboleth
 
Research data policy
Research data policyResearch data policy
Research data policy
 
PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013
 
OpenAIRE and Eudat services and tools to support FAIR DMP implementation
OpenAIRE and Eudat services and tools to support FAIR DMP implementation OpenAIRE and Eudat services and tools to support FAIR DMP implementation
OpenAIRE and Eudat services and tools to support FAIR DMP implementation
 
H2020 Open Data Pilot
H2020 Open Data PilotH2020 Open Data Pilot
H2020 Open Data Pilot
 

Viewers also liked

Core presentation
Core presentationCore presentation
Core presentationpetrknoth
 
Amicable resources corporate presentation- Human resource company
Amicable resources corporate presentation- Human resource companyAmicable resources corporate presentation- Human resource company
Amicable resources corporate presentation- Human resource companyrachna1122
 
Snail 12345
Snail 12345Snail 12345
Snail 12345reblyn1
 
DiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected RepositoriesDiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected Repositoriespetrknoth
 
CORE projects family
CORE projects familyCORE projects family
CORE projects familypetrknoth
 
Text mining in CORE (OR2012)
Text mining in CORE (OR2012)Text mining in CORE (OR2012)
Text mining in CORE (OR2012)petrknoth
 
The murder of a student.
The murder of a student.The murder of a student.
The murder of a student.selimkaradag
 
DEVCSI Core Mobile
DEVCSI Core MobileDEVCSI Core Mobile
DEVCSI Core Mobilepetrknoth
 
Ali’S Careers Power Point
Ali’S Careers Power PointAli’S Careers Power Point
Ali’S Careers Power Pointguestb4db5a8
 
FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)petrknoth
 
My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?petrknoth
 
Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...petrknoth
 
CORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open AccessCORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open Accesspetrknoth
 
Semantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research EvaluationSemantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research Evaluationpetrknoth
 
93136540 spider-cloud-small-cell-cluster-case-study-091911-final
93136540 spider-cloud-small-cell-cluster-case-study-091911-final93136540 spider-cloud-small-cell-cluster-case-study-091911-final
93136540 spider-cloud-small-cell-cluster-case-study-091911-finalZarobiza
 

Viewers also liked (18)

Core presentation
Core presentationCore presentation
Core presentation
 
Amicable resources corporate presentation- Human resource company
Amicable resources corporate presentation- Human resource companyAmicable resources corporate presentation- Human resource company
Amicable resources corporate presentation- Human resource company
 
Snail 12345
Snail 12345Snail 12345
Snail 12345
 
DiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected RepositoriesDiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected Repositories
 
All Joke Photos
All Joke PhotosAll Joke Photos
All Joke Photos
 
CORE projects family
CORE projects familyCORE projects family
CORE projects family
 
Text mining in CORE (OR2012)
Text mining in CORE (OR2012)Text mining in CORE (OR2012)
Text mining in CORE (OR2012)
 
The murder of a student.
The murder of a student.The murder of a student.
The murder of a student.
 
DEVCSI Core Mobile
DEVCSI Core MobileDEVCSI Core Mobile
DEVCSI Core Mobile
 
Ali’S Careers Power Point
Ali’S Careers Power PointAli’S Careers Power Point
Ali’S Careers Power Point
 
FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)
 
My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?
 
Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...
 
CORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open AccessCORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open Access
 
Suman Pandit
Suman PanditSuman Pandit
Suman Pandit
 
The Clown Doctor
The Clown DoctorThe Clown Doctor
The Clown Doctor
 
Semantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research EvaluationSemantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research Evaluation
 
93136540 spider-cloud-small-cell-cluster-case-study-091911-final
93136540 spider-cloud-small-cell-cluster-case-study-091911-final93136540 spider-cloud-small-cell-cluster-case-study-091911-final
93136540 spider-cloud-small-cell-cluster-case-study-091911-final
 

Similar to Towards an Infrastructure for Mining Scientific Publications

Open Archives Initiatives For Metadata Harvesting
Open Archives Initiatives For Metadata   HarvestingOpen Archives Initiatives For Metadata   Harvesting
Open Archives Initiatives For Metadata HarvestingNikesh Narayanan
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)floyd taag
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)floyd taag
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)floyd taag
 
-Open Archives Initiatives(final)
-Open Archives Initiatives(final)-Open Archives Initiatives(final)
-Open Archives Initiatives(final)floyd taag
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)marevil awas
 
OSFair2017 training | Machine accessibility of Open Access scientific publica...
OSFair2017 training | Machine accessibility of Open Access scientific publica...OSFair2017 training | Machine accessibility of Open Access scientific publica...
OSFair2017 training | Machine accessibility of Open Access scientific publica...Open Science Fair
 
Open Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and ExchangeOpen Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and Exchangelagoze
 
How can repositories support the text mining of their content and why?
How can repositories support the text mining of their content and why?How can repositories support the text mining of their content and why?
How can repositories support the text mining of their content and why?openminted_eu
 
CORE - Petr Knoth, Research Associate
CORE - Petr Knoth, Research AssociateCORE - Petr Knoth, Research Associate
CORE - Petr Knoth, Research AssociateThe European Library
 
How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? Nancy Pontika
 
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...
Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining P...Chris Shillum
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Anita de Waard
 
Sharing with the Open Archives Initiative
Sharing with the Open Archives InitiativeSharing with the Open Archives Initiative
Sharing with the Open Archives InitiativeJenn Riley
 
RDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest GroupRDA-WDS Publishing Data Interest Group
RDA-WDS Publishing Data Interest GroupAnita de Waard
 

Towards an Infrastructure for Mining Scientific Publications

  • 1. 1/38 From Open Access Metadata to Open Access Content: Towards an Infrastructure for Mining Scientific Publications Petr Knoth CORE (Connecting REpositories) project Knowledge Media Institute The Open University @petrknoth, #diggicore
  • 2. 2/38 What is Open Access exactly? By “open access” to [peer-reviewed research literature], we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. [BOAI, 2002]
  • 3. 3/38 How to achieve OA? Two routes: • Self-archiving: Institutional/Open Repositories • Open Access Journals
  • 4. 4/38 Why from OA metadata to OA content? • Despite the large amount of OA content already available online (Laakso & Bjork, 2012), OA content is not necessarily easily discoverable (Morrison, 2012; Konkiel, 2012). • Often available, but difficult to find … • Inhibiting the OA impact – accessibility, discoverability, reuse … • Discoverability of OA content on the Web can be dramatically increased by adopting two simple principles!
  • 5. 5/38 Outline 1. Goals of repositories 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
  • 6. 6/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
  • 7. 7/38 The primary purpose of repositories • Institutional repositories (IRs) serve a number of purposes, such as collecting and curating digital outputs, providing statistics, research excellence, etc. • The primary goal of repositories is to open and disseminate research outputs to a worldwide audience (Crow, 2002) – SPARC’s position paper on the case for institutional repositories.
  • 8. 8/38 SPARC’s position paper on IRs “For the repository to provide access to the broader research community, users outside the university must be able to find and retrieve information from the repository. Therefore, institutional repository systems must be able to support interoperability in order to provide access via multiple search engines and other discovery tools. An institution does not necessarily need to implement searching and indexing functionality to satisfy this demand: it could simply maintain and expose metadata, allowing other services to harvest and search the content. This simplicity lowers the barrier to repository operation for many institutions, as it only requires a file system to hold the content and the ability to create and share metadata with external systems.”
  • 9. 9/38 COAR: About harvesting and aggregations … “Each individual repository is of limited value for research: the real power of Open Access lies in the possibility of connecting and tying together repositories, which is why we need interoperability. In order to create a seamless layer of content through connected repositories from around the world, Open Access relies on interoperability, the ability for systems to communicate with each other and pass information back and forth in a usable format. Interoperability allows us to exploit today's computational power so that we can aggregate, data mine, create new tools and services, and generate new knowledge from repository content.” [COAR manifesto]
  • 10. 10/38 We need OA to content (not just metadata) • Repositories (even the most prominent) are often seen by aggregation systems as large metadata silos. • OA to metadata is not disruptive: it differs little from the traditional publishing model.
  • 11. 11/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
  • 12. 12/38 Study • 83 repositories (mainly EPrints with pdf research outputs) • 1,461,016 metadata records • Ratio of metadata to content • Data acquired from CORE (Knoth & Zdrahal, 2012)
  • 13. 13/38 “[The institutional repository] is like a roach motel. Data goes in, but it doesn’t come out.” (Salo, 2008)
  • 14. 14/38 Why is this a problem? • Lower accessibility of papers (we have them, but cannot find them) • Text-mining • Cannot monitor growth • Losing a strong argument for the adoption of OA!
  • 15. 15/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
  • 16. 16/38 OAI-PMH and content referencing • OAI-PMH supports representing metadata in multiple formats, but at a minimum repositories must be able to return records with metadata expressed in the Dublin Core format (OAI-PMH v2.0, 2008) • If repositories want to satisfy the SPARC guidelines (Crow, 2002), they must provide a link to the content as part of the exposed metadata.
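A harvester relies on the Dublin Core minimum by requesting records with the `oai_dc` metadata prefix. The following is a minimal sketch of how such a request URL could be built; the base URL and the token value are hypothetical placeholders, and the function name is an illustration rather than part of any particular harvester.

```python
from urllib.parse import urlencode


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL.

    base_url is the repository's OAI-PMH endpoint (hypothetical here).
    """
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # with the verb only, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # oai_dc is the prefix for the mandatory unqualified Dublin Core format.
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and token, for illustration only.
first_page = list_records_url("https://repository.example.org/oai")
next_page = list_records_url(
    "https://repository.example.org/oai", resumption_token="token-123"
)
```

Keeping the URL construction separate from the network call makes the paging logic easy to check without contacting a repository.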
  • 17. 17/38 OAI-PMH and content referencing The Open Research Online repository (EPrints) links directly to the resource from metadata. Cranfield repository (DSpace) identifies the resource by providing a link to a page from which the resource (if available) can be accessed.
  • 18. 18/38 OAI-PMH and content referencing The OAI-PMH specification states on this topic that: “The nature of a resource identifier is outside the scope of the OAI-PMH. To facilitate access to the resource associated with harvested metadata, repositories should use an element in metadata records to establish a linkage between the record (and the identifier of its item) and the identifier (URL, URN, DOI, etc.) of the associated resource. The mandatory Dublin Core format provides the identifier element that should be used for this purpose.”
  • 19. 19/38 OAI-PMH and content referencing • What is an identifier of the associated resource? Is a splash page an identifier? According to OAI-PMH examples it is: <dc:identifier>http://arXiv.org/abs/cs/0112017</dc:identifier> • The standard is pretty weak on this aspect …
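To make the `dc:identifier` discussion concrete, here is a small sketch of how a harvester might read identifiers out of an `oai_dc` record. The sample record is constructed for illustration (only the identifier value comes from the slide above; the title is a placeholder), and `extract_identifiers` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the oai_dc metadata format.
DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Illustrative record; the title is a placeholder, not real metadata.
SAMPLE_RECORD = f"""
<oai_dc:dc xmlns:oai_dc="{OAI_DC_NS}" xmlns:dc="{DC_NS}">
  <dc:title>Sample record</dc:title>
  <dc:identifier>http://arXiv.org/abs/cs/0112017</dc:identifier>
</oai_dc:dc>
""".strip()


def extract_identifiers(oai_dc_xml):
    """Return all dc:identifier values of a record, in document order."""
    root = ET.fromstring(oai_dc_xml)
    return [
        element.text.strip()
        for element in root.iter(f"{{{DC_NS}}}identifier")
        if element.text
    ]


identifiers = extract_identifiers(SAMPLE_RECORD)
```

The parser can return the identifier, but nothing in the record says whether it points to the full text or to a splash page, which is the weakness the slide describes.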
  • 20. 20/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
  • 21. 21/38 The principles of the principles • Pragmatic rather than exciting. • Generating maximum benefit for a minimum investment. • Deliberately use current standards to minimise adoption time. • Respecting differences across systems and backwards compatibility. • Emphasizes the need for easy-to-use compliance mechanisms to assist repository managers in ensuring systems interoperability.
  • 22. 22/38 Principle 1 – Content referencing Open repositories should always establish a link from the metadata record to the item the metadata record describes using a dereferencable identifier pointing to the version held in the repository. The dereferencable identifier should be provided in the appropriate metadata element in the used metadata format (i.e. dc:identifier in the case of Dublin Core).
  • 23. 23/38 Implications: Principle 1 – Content referencing • Repositories can use different standards to deliver metadata over OAI-PMH (DC, METS, MPEG-21 DIDL) • Identifier must resolve (be actionable) to the object it identifies • In the case of DC, if more identifiers are present, use the first identifier as the identifier of the object • Should resolve to the version of the object in the local repository • Similarity with RIOXX identifier field • The principle is easily applicable in the OA domain: each item can be freely resolved
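A compliance check for Principle 1 could be sketched as follows. This is an assumption-laden illustration, not a validation tool from the talk: it takes the first identifier as the object's identifier (as the implication above suggests for Dublin Core) and uses the response's media type as a rough heuristic to tell a direct link from a splash page. All function names are hypothetical, and the heuristic misclassifies items whose full text is itself HTML.

```python
import urllib.request


def primary_identifier(identifiers):
    """Return the first identifier, treated as the identifier of the object."""
    return identifiers[0] if identifiers else None


def looks_like_full_text(content_type):
    """Heuristic: a non-HTML media type (e.g. application/pdf) suggests the
    identifier resolves to the item itself rather than to a splash page."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type not in {"text/html", "application/xhtml+xml"}


def fetch_content_type(url, timeout=10):
    """Ask the server for the media type of a URL with a HEAD request.
    Defined for completeness; not called below, so the sketch runs offline."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type()


# Hypothetical identifiers for one record.
record_identifiers = [
    "https://repository.example.org/123/1/paper.pdf",
    "https://repository.example.org/123/",
]
chosen = primary_identifier(record_identifiers)
pdf_is_content = looks_like_full_text("application/pdf")
html_is_content = looks_like_full_text("text/html; charset=UTF-8")
```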
  • 24. 24/38 Open access statistics and principle 1 • Only dereferencable items are OA • Increases stats accuracy • Avoids anecdotal situations (e.g. 23,380 Dark Items)
  • 25. 25/38 Principle 2 – Content accessibility to machines Open repositories must provide universal access to machines with the same level of access as humans have. It is the role of open repositories to allow machines to harvest the entire content of the repository in a reasonable time to enable harvesting systems to acquire and maintain up-to-date information about the repository content.
  • 26. 26/38 Example from arXiv.org • Googlebot: unrestricted • Yahoo/MSN: can reharvest in 6 months • Researchers: access denied
  • 27. 27/38 Implications: Principle 2 – Content accessibility to machines • Accessibility of repository content by machines • Enabling reuse through new services, such as those relying on text-mining • Open Repositories should not discriminate, except for abusive behavior • Presumption of innocence
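One way a repository manager might check for the kind of discrimination described above is to test a robots.txt policy against several user agents. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `ResearchHarvester` agent name, and the URL are hypothetical and do not reproduce any real repository's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy that favours one commercial crawler over everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def can_harvest(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://repository.example.org/download/123.pdf"
googlebot_allowed = can_harvest(ROBOTS_TXT, "Googlebot", url)
researcher_allowed = can_harvest(ROBOTS_TXT, "ResearchHarvester", url)
```

A policy consistent with Principle 2 would give the same answer for both agents, restricting only clients that behave abusively.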
  • 28. 28/38 Validation tools • Key to adoption – Repository managers should not be left alone • Repository Analytics
  • 29. 29/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
  • 30. 30/38 CORE API Enables external systems to interact with OA data (JSON or XML) • Search, download metadata and content • Content recommendation • Citation references • Statistics • …
  • 31. 31/38 Data dumps • About 11.5 million records • Over 1 million full-texts • Cleaned and enriched with additional information • Distributed as two large zip files: metadata + full-texts
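For offline experiments on a large zip archive, it is usually preferable to stream members one at a time rather than unpack everything. The sketch below shows that pattern with Python's standard `zipfile` module on a toy in-memory archive; the internal layout and file names are assumptions made for illustration, since the slide does not describe the structure of the dumps.

```python
import io
import zipfile


def iter_members(zip_source):
    """Yield (name, raw_bytes) for each file in a zip archive, one at a time,
    without extracting the archive to disk."""
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as member:
                yield info.filename, member.read()


# Build a toy archive in memory; the layout is hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("fulltext/1.txt", "first paper text")
    archive.writestr("fulltext/2.txt", "second paper text")
buffer.seek(0)

# Example processing step: record the size in bytes of each full text.
sizes = {name: len(data) for name, data in iter_members(buffer)}
```

The same generator accepts a file path instead of a buffer, so a mining step can be developed on a small sample and then pointed at the full dump.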
  • 32. 32/38 Examples of usage • Author disambiguation • Mining URLs from papers to detect trends • Tagging of chemical compounds for image retrieval • Citation analysis • Content recommendation • Detecting collaboration patterns of scientific communities • Monitoring of OA growth • Any form of text or data mining … • API useful for services and data dumps for offline experiments
  • 33. 33/38 Why use it? • It is only OA, thus you can legally mine it … • You can redistribute it: essential for reproducible research • Very large and growing • Kept up-to-date • Ability to rerun experiments with new data • All research content will soon be OA (UK HEFCE policy) • Status of a UK national aggregator • 0.5 million monthly visits, up from only 150k six months ago
  • 34. 34/38 Why use it? • Open infrastructure for open science • Not owned or managed by a for-profit company => Ability to run your own services = new opportunities and no giving away of your research to commercial companies
  • 35. 35/38 Conclusions • Visibility of OA content can be significantly improved by adopting two principles: 1) Dereferencable identifiers – Open Repositories provide open access to content and not just metadata 2) Machine access – Open Repositories should provide free access to content (for anybody and mainly researchers) • Compliance validation tools are needed to support repositories • Researchers who want to mine content or build services can rely on aggregators to acquire datasets • Researchers can deploy their solutions, not just rely on commercial providers.
  • 36. 36/38 Thank you! Open access needs open repositories and open science
  • 37. 37/38 References 1/2 [BOAI, 2002] Budapest Open Access Initiative. (2002) http://www.opensocietyfoundations.org/openaccess/boai-10-recommendations [Crow, 2002] Crow, R. (2002). The case for institutional repositories: a SPARC position paper. ARL Bimonthly Report 223. [Knoth & Zdrahal, 2012] Knoth, P. and Zdrahal, Z. (2012) CORE: Three Access Levels to Underpin Open Access, D-Lib Magazine, 18, 11/12, Corporation for National Research Initiatives, http://dx.doi.org/10.1045/november2012-knoth [Konkiel, 2012] Konkiel, S. (2012) Are Institutional Repositories Doing Their Job? https://blogs.libraries.iub.edu/scholcomm/2012/09/11/are-institutional-repositories-doing-their-job/ [Laakso & Bjork, 2012] Laakso, M., & Björk, B. C. (2012). Anatomy of open access publishing: a study of longitudinal development and internal structure. BMC Medicine, 10(1), 124.
  • 38. 38/38 References 2/2 [Morrison, 2012] Morrison, Louise (2012) 5 reasons why I can’t find Open Access publications. http://mmitscotland.wordpress.com/2012/08/06/5-reasons-why-i-cant-find-open-access-publications-2/ [OAI-PMH v2.0, 2008] The Open Archives Initiative Protocol for Metadata Harvesting Version 2.0 (OAI-PMH), Implementation Guidelines (2008). http://www.openarchives.org/OAI/openarchivesprotocol.html [ResourceSync draft, 2013] ResourceSync protocol draft. 2013 http://www.niso.org/workrooms/resourcesync/ [Salo, 2008] Salo, D. (2008). Innkeeper at the roach motel. Library Trends, 57(2), 98-123. [Van de Sompel et al, 2004] Van de Sompel, H., Nelson, M. L., Lagoze, C., & Warner, S. (2004). Resource harvesting within the OAI-PMH framework. D-lib magazine, 10(12), 1082-9873.

Editor's Notes

  1. We live in the era of Open Access metadata. Europeana liberated the market by applying CC0 to metadata, but what about the content?
  2. In this paper, we use the term institutional repositories (which are the main interest of the Open Repositories conference), to refer also to subject-based repositories or archives or systems for depositing research outputs used by open access publishers. As a result, the conclusions of this paper and recommendations are equally valid for both the green (self-archiving) and gold (OA publishing) routes to OA.
  4. Let's have a look at what some key players in the field think about the purpose of repositories.
  6. It will not be possible to move to an OA culture unless we change the environment so that there are clear benefits for researchers who participate in OA. These benefits should be technical rather than political.
  9. I can specify the differences between this and RIOXX