Towards an Infrastructure for Mining Scientific Publications
Published in: Technology, Education
  • We live in the era of Open Access metadata. Europeana liberated the market by applying CC0 to metadata, but what about the content?
  • In this paper, we use the term institutional repositories (which are the main interest of the Open Repositories conference), to refer also to subject-based repositories or archives or systems for depositing research outputs used by open access publishers. As a result, the conclusions of this paper and recommendations are equally valid for both the green (self-archiving) and gold (OA publishing) routes to OA.
  • Let's have a look at what some key players in the field think about the purpose of repositories
  • It will not be possible to transfer to an OA culture unless we change the environment so that there are clear benefits for researchers to participate in OA. These benefits should be technical rather than political.
  • I can specify the differences between this and RIOXX
  • Transcript

    • 1. 1/38 From Open Access Metadata to Open Access Content: Towards an Infrastructure for Mining Scientific Publications Petr Knoth CORE (Connecting REpositories) project Knowledge Media institute The Open University @petrknoth, #diggicore
    • 2. 2/38 What is Open Access exactly? By “open access” to [peer-reviewed research literature], we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. [BOAI, 2002]
    • 3. 3/38 How to achieve OA? Two routes: • Self-archiving: Institutional/Open Repositories • Open Access Journals
    • 4. 4/38 Why from OA metadata to OA content? • Despite the large amount of OA content already available online (Laakso & Bjork, 2012), OA content is not necessarily easily discoverable (Morrison, 2012; Konkiel, 2012). • Often available, but difficult to find … • Inhibiting the OA impact – accessibility, discoverability, reuse … • Discoverability of OA content on the Web can be dramatically increased by adopting two simple principles!
    • 5. 5/38 Outline 1. Goals of repositories 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
    • 6. 6/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
    • 7. 7/38 The primary purpose of repositories • Institutional repositories (IRs) serve a number of purposes, such as collecting and curating digital outputs, providing statistics, research excellence, etc. • The primary goal of repositories is to open and disseminate research outputs to a worldwide audience (Crow, 2002) – SPARC’s position paper on the case for institutional repositories.
    • 8. 8/38 SPARC’s position paper on IRs “For the repository to provide access to the broader research community, users outside the university must be able to find and retrieve information from the repository. Therefore, institutional repository systems must be able to support interoperability in order to provide access via multiple search engines and other discovery tools. An institution does not necessarily need to implement searching and indexing functionality to satisfy this demand: it could simply maintain and expose metadata, allowing other services to harvest and search the content. This simplicity lowers the barrier to repository operation for many institutions, as it only requires a file system to hold the content and the ability to create and share metadata with external systems.”
    • 9. 9/38 COAR: About harvesting and aggregations … “Each individual repository is of limited value for research: the real power of Open Access lies in the possibility of connecting and tying together repositories, which is why we need interoperability. In order to create a seamless layer of content through connected repositories from around the world, Open Access relies on interoperability, the ability for systems to communicate with each other and pass information back and forth in a usable format. Interoperability allows us to exploit today's computational power so that we can aggregate, data mine, create new tools and services, and generate new knowledge from repository content.” [COAR manifesto]
    • 10. 10/38 We need OA to content (not just metadata) • Repositories (even the most prominent) are often seen by aggregation systems as large metadata silos. • OA to metadata is not disruptive. Little difference to the traditional publishing model.
    • 11. 11/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
    • 12. 12/38 Study • 83 repositories (mainly EPrints with pdf research outputs) • 1,461,016 metadata records • Ratio of metadata to content • Data acquired from CORE (Knoth & Zdrahal, 2012)
    • 13. 13/38 “[The institutional repository] is like a roach motel. Data goes in, but it doesn’t come out.” (Salo, 2008)
    • 14. 14/38 Why is this a problem? • Lower accessibility of papers (we have them, but cannot find them) • Text-mining • Cannot monitor growth • Losing a strong argument for the adoption of OA!
    • 15. 15/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
    • 16. 16/38 OAI-PMH and content referencing • OAI-PMH supports representing metadata in multiple formats, but at a minimum repositories must be able to return records with metadata expressed in the Dublin Core format (OAI-PMH v2.0, 2008) • If repositories want to satisfy the SPARC guidelines (Crow, 2002), they must provide a link to the content as part of the exposed metadata.
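A harvester asks a repository for its Dublin Core records with an OAI-PMH `ListRecords` request. The sketch below only builds the request URL; the base URL `http://repository.example.org/oai` and the helper name `list_records_url` are assumptions made for illustration, while `verb`, `metadataPrefix`, `from` and `resumptionToken` are arguments defined by OAI-PMH v2.0.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # oai_dc is the Dublin Core format every repository must support.
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Optional selective harvesting: only records changed since this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # The base URL is a placeholder, not a real repository endpoint.
    print(list_records_url("http://repository.example.org/oai"))
```

Passing `from_date` keeps repeat harvests small, and a returned `resumptionToken` is fed back into the same helper to fetch the next page of a long result list.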
    • 17. 17/38 OAI-PMH and content referencing The Open Research Online repository (Eprints) links directly to the resource from metadata. Cranfield repository (DSpace) identifies the resource by providing a link to a page from which the resource (if available) can be accessed.
    • 18. 18/38 OAI-PMH and content referencing The OAI-PMH specification states on this topic that: “The nature of a resource identifier is outside the scope of the OAI- PMH. To facilitate access to the resource associated with harvested metadata, repositories should use an element in metadata records to establish a linkage between the record (and the identifier of its item) and the identifier (URL, URN, DOI, etc.) of the associated resource. The mandatory Dublin Core format provides the identifier element that should be used for this purpose.”
    • 19. 19/38 OAI-PMH and content referencing • What is an identifier of the associated resource? Is a splash page an identifier? According to OAI-PMH examples it is: <dc:identifier>http://arXiv.org/abs/cs/0112017</dc:identifier> • The standard is pretty weak on this aspect …
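To see what a harvester actually receives, here is a minimal sketch that reads the `dc:identifier` values out of one `oai_dc` record. The record is a hand-written illustration (only the identifier comes from the slide above; the title is made up), and `extract_identifiers` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hand-written sample record; the title is invented for illustration.
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Illustrative record</dc:title>
  <dc:identifier>http://arXiv.org/abs/cs/0112017</dc:identifier>
</oai_dc:dc>
"""


def extract_identifiers(record_xml):
    """Return all dc:identifier values of a record, in document order."""
    root = ET.fromstring(record_xml.strip())
    tag = "{%s}identifier" % DC_NS
    return [el.text.strip() for el in root.iter(tag) if el.text]


if __name__ == "__main__":
    print(extract_identifiers(SAMPLE_RECORD))
```

Nothing in the record tells the harvester whether the identifier leads to the full text or to a splash page, which is the weakness the slide points at.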
    • 20. 20/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
    • 21. 21/38 The principles of the principles • Pragmatic rather than exciting. • Generating maximum benefit for a minimum investment. • Deliberately using current standards to minimise adoption time. • Respecting differences across systems and backwards compatibility. • Emphasizing the need for easy to use compliance mechanisms to assist repository managers in ensuring systems interoperability.
    • 22. 22/38 Principle 1 – Content referencing Open repositories should always establish a link from the metadata record to the item the metadata record describes using a dereferencable identifier pointing to the version held in the repository. The dereferencable identifier should be provided in the appropriate metadata element in the used metadata format (i.e. dc:identifier in the case of Dublin Core).
    • 23. 23/38 Implications: Principle 1 – Content referencing • Repositories can use different standards to deliver metadata over OAI-PMH (DC, METS, MPEG-21 DIDL) • Identifier must resolve (be actionable) to the object it identifies • In the case of DC, if more identifiers are present, use the first identifier as the identifier of the object • Should resolve to the version of the object in the local repository • Similarity with RIOXX identifier field • The principle is easily applicable in the OA domain: each item can be freely resolved
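The implications above can be sketched as a small check a harvester might run on a record's identifiers. This is an illustration under stated assumptions, not a compliance test: all function names are hypothetical, the `.pdf` test is only a heuristic for telling a document link from a splash page, and a real check would also have to request the URL.

```python
from urllib.parse import urlparse


def object_identifier(identifiers):
    """Take the first identifier as the identifier of the object."""
    return identifiers[0] if identifiers else None


def is_actionable(identifier):
    """True if the identifier is an http(s) URL a client could try to resolve."""
    if identifier is None:
        return False
    parts = urlparse(identifier)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def looks_like_full_text(identifier):
    """Heuristic assumption: a path ending in .pdf points at the document itself."""
    return is_actionable(identifier) and urlparse(identifier).path.lower().endswith(".pdf")


if __name__ == "__main__":
    # Placeholder identifiers on a made-up host.
    ids = ["http://repository.example.org/1234/1/paper.pdf", "doi:10.1000/example"]
    first = object_identifier(ids)
    print(first, is_actionable(first), looks_like_full_text(first))
```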
    • 24. 24/38 Open access statistics and principle 1 • Only dereferencable items are OA • Increases stats accuracy • Avoids anecdotal situations (e.g. 23,380 Dark Items)
    • 25. 25/38 Principle 2 – Content accessibility to machines Open repositories must provide universal access to machines with the same level of access as humans have. It is the role of open repositories to allow machines to harvest the entire content of the repository in a reasonable time to enable harvesting systems to acquire and maintain up-to-date information about the repository content.
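Counting only dereferenceable items changes the statistic a repository reports. A minimal sketch, assuming each record carries a hypothetical boolean field `full_text_resolves` set by an earlier check; the four toy records are made up and are not data from the study.

```python
def open_access_share(records):
    """Fraction of metadata records whose identifier resolves to content."""
    total = len(records)
    if total == 0:
        return 0.0
    resolvable = sum(1 for record in records if record.get("full_text_resolves"))
    return resolvable / total


if __name__ == "__main__":
    # Toy data for illustration only.
    toy_records = [
        {"id": "r1", "full_text_resolves": True},
        {"id": "r2", "full_text_resolves": False},
        {"id": "r3", "full_text_resolves": False},
        {"id": "r4"},  # no check result recorded: counted as not resolvable
    ]
    print(open_access_share(toy_records))
```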
    • 26. 26/38 Example from Arxiv.org • Googlebot: unrestricted • Yahoo/MSN: can reharvest in 6 months • Researchers: access denied
    • 27. 27/38 Implications: Principle 2 – Content accessibility to machines • Accessibility of repository content by machines • Enabling reuse through new services, such as those relying on text-mining • Open Repositories should not discriminate, except for abusive behavior • Presumption of innocence
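One place where unequal treatment of machines shows up is a site's `robots.txt`. The sketch below uses Python's standard `urllib.robotparser` on a made-up file that admits one crawler and turns everyone else away; the rules, the host and the agent name `ResearchHarvester` are assumptions for illustration and are not the actual policy of any repository.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is welcome, all others are not.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    paper = "http://repository.example.org/1234/1/paper.pdf"
    for agent in ("Googlebot", "ResearchHarvester"):
        print(agent, allowed(agent, paper))
```

A repository following Principle 2 would serve rules under which both calls return the same answer.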
    • 28. 28/38 Validation tools • Key to adoption – Repository managers should not be left alone • Repository Analytics
    • 29. 29/38 Outline 1. Goals of repositories (repositories as large metadata silos) 2. The bleak truth about availability of OA metadata vs content 3. Content referencing practices in repositories 4. Two principles to increase visibility of OA content 5. How to data mine OA aggregated data and why
    • 30. 30/38 CORE API Enables external systems to interact with OA data (JSON or XML) • Search, download metadata and content • Content recommendation • Citation references • Statistics • …
    • 31. 31/38 Data dumps • About 11.5 million records • Over 1 million full-texts • Cleaned and enriched with additional information • Distributed as two large zip files: metadata + full-texts
    • 32. 32/38 Examples of usage • Author disambiguation • Mining URLs from papers to detect trends • Tagging of chemical compounds for image retrieval • Citation analysis • Content recommendation • Detecting collaboration patterns of scientific communities • Monitoring of OA growth • Any form of text or data mining … • API useful for services and data dumps for offline experiments
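As a rough illustration of one use listed above, mining URLs from papers, the sketch below counts which hosts are linked from a set of full texts. It is not CORE's own method: the regular expression is a simplification, `count_hosts` is a hypothetical name, and the two sample texts are invented.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simplified pattern: an http(s) URL runs until whitespace or a bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def count_hosts(texts):
    """Count how often each host name is linked across the given texts."""
    hosts = Counter()
    for text in texts:
        for url in URL_PATTERN.findall(text):
            # Drop sentence punctuation that the pattern may have swallowed.
            host = urlparse(url.rstrip(".,;")).netloc.lower()
            if host:
                hosts[host] += 1
    return hosts


if __name__ == "__main__":
    # Invented snippets standing in for extracted full texts.
    sample_texts = [
        "Code is at http://code.example.org/tool. See also http://data.example.org/set1",
        "Data from http://data.example.org/set2.",
    ]
    print(count_hosts(sample_texts).most_common())
```

Run over a data dump grouped by publication year, the same counts would show which resources gain or lose attention over time.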
    • 33. 33/38 Why use it? • It is only OA, thus you can legally mine it … • You can redistribute it: essential for reproducible research • Very large and growing • Kept up-to-date • Ability to rerun experiments with new data • All research content will soon be OA (UK HEFCE policy) • Status of a UK national aggregator • 0.5 million monthly visits, but only 150k six months ago
    • 34. 34/38 Why use it? • Open infrastructure for open science • Not owned or managed by a for-profit company => Ability to run your own services = new opportunities and no giving away of your research to commercial companies
    • 35. 35/38 Conclusions • Visibility of OA content can be significantly improved by adopting two principles: 1) Dereferencable identifiers - Open Repositories provide open access to content and not just metadata 2) Machine access – Open Repositories should provide free access to content (for anybody and mainly researchers) • Compliance validation tools are needed to support repositories • Researchers who want to mine content or build services can rely on aggregators to acquire datasets • Researchers can deploy their solutions, not just rely on commercial providers.
    • 36. 36/38 Thank you! Open access needs open repositories and open science
    • 37. 37/38 References 1/2 [BOAI, 2002] Budapest Open Access Initiative. (2002) http://www.opensocietyfoundations.org/openaccess/boai-10-recommendations [Crow, 2002] Crow, R. (2002). The case for institutional repositories: a SPARC position paper. ARL Bimonthly Report 223. [Knoth & Zdrahal, 2012] Knoth, P. and Zdrahal, Z. (2012) CORE: Three Access Levels to Underpin Open Access, D-Lib Magazine, 18, 11/12, Corporation for National Research Initiatives, http://dx.doi.org/10.1045/november2012-knoth [Konkiel, 2012] Konkiel, S. (2012) Are Institutional Repositories Doing Their Job? https://blogs.libraries.iub.edu/scholcomm/2012/09/11/are-institutional-repositories-doing-their-job/ [Laakso & Bjork, 2012] Laakso, M., & Björk, B. C. (2012). Anatomy of open access publishing: a study of longitudinal development and internal structure. BMC Medicine, 10(1), 124.
    • 38. 38/38 References 2/2 [Morrison, 2012] Morrison, Louise (2012) 5 reasons why I can’t find Open Access publications. http://mmitscotland.wordpress.com/2012/08/06/5-reasons-why-i-cant-find-open-access-publications-2/ [OAI-PMH v2.0, 2008] The Open Archives Initiative Protocol for Metadata Harvesting Version 2.0 (OAI-PMH), Implementation Guidelines (2008). http://www.openarchives.org/OAI/openarchivesprotocol.html [ResourceSync draft, 2013] ResourceSync protocol draft. 2013 http://www.niso.org/workrooms/resourcesync/ [Salo, 2008] Salo, D. (2008). Innkeeper at the roach motel. Library Trends, 57(2), 98-123. [Van de Sompel et al, 2004] Van de Sompel, H., Nelson, M. L., Lagoze, C., & Warner, S. (2004). Resource harvesting within the OAI-PMH framework. D-lib magazine, 10(12), 1082-9873.