Open Archives Initiative Object Reuse and Exchange: Infrastructure to Support New Models of Scholarly Publication. Carl Lagoze, Information Science, Cornell University [email_address]
This is a talk about plumbing….
…motivated by exciting final results
Infrastructure to Support New Models of Scholarly Publication: Exploit open access to scholarly results from multiple disciplines (we have lots more stuff available, let’s do something interesting with it). Take advantage of new forms of scholarly results combining text, data, simulation, etc. (and the stuff is getting more interesting). Expose the process of scholarship. Enable new workflows that allow new services to add value to scholarly content. Facilitate reuse of scholarly information for education, industry, etc.
Scholarly Communication and Digital Libraries Group Cornell Information Science arXiv Fedora  Legal Information Institute Open Archives Initiative National Science Digital Library
The cast of collaborators and supporters: Pathways project (NSF IIS-0430906): Cornell University (Carl Lagoze, Sandy Payette, Simeon Warner), Los Alamos National Laboratory (Herbert Van de Sompel). Fedora Open Source Repository Project (Mellon): Cornell University (Sandy Payette), University of Virginia (Thornton Staples). OAI Object Reuse and Exchange, OAI-ORE (Mellon): Cornell University (Carl Lagoze), Los Alamos National Laboratory (Herbert Van de Sompel).
 
Meeting in NYC, April 20-21 2006. Supported by Microsoft, Mellon Foundation, Coalition for Networked Information, Digital Library Federation, JISC. Representatives from institutional Repository projects, scholarly content Repositories, Registry projects, and various projects that touch on interoperability. See http://msc.mellon.org/Meetings/Interop/ for Agenda, Participants, Topics & Goals, Terminology, Presentations, Prototype demonstration. Report available since beginning of August 2006.
Motivation: Scholarship is Changing Influenced by: High performance computing and connectivity Peta-scale data storage Advanced data mining and data storage Web services, Web 2.0 Open Access movement Evolution towards: Highly collaborative Network-based Data-driven Visible in Science & Engineering but also in humanities And, there are increasing links between these formerly separated fields (benefits of having “everything” online)
Motivation: “Everything Online” E-print repositories – arXiv, Cogprints Institutional repositories – DSpace, FEDORA, ePrints.org Publication repositories – PubMed Central Data Repositories – NVO, NCBI Interoperability architecture – DCMI, OAI-PMH But many of these are changes in  form  rather than in  nature Or, at best, not solutions that generalize across disciplines
Setting More Ambitious Goals In many cases we’ve only created an electronic equivalent of the paper-based system. While ‘open access’ is important, it should not be our only focus.  The networked environment provides opportunities for more radical changes.  Expose component products and process Allow components to move across multiple workflows Promote recombination, refactoring, and transformation of information Transform repositories/databases from passive storage to active building blocks for higher level services
Providing the fabric for an Information Layer  Over Core Resources
Motivation 1: Richer cross-Repository services. Distributed Repositories provide source materials for cross-Repository overlay services such as discovery services. The manner in which those materials are exposed must allow a variety of discovery services that fit individual disciplinary and cross-disciplinary needs (i.e., not just text scraping).
Richer cross-Repository services: Scenario. Scenario 1: Chemical search engine. A search engine monitors scholarly repositories but is only interested in making the machine-readable chemical structures contained in Digital Objects available from those repositories searchable. This constitutes re-use of (part of) the Digital Objects by a service overlaid upon the monitored repositories. And, of course, a chemical compound discovered via the search engine can be cited in some new paper, i.e. the value chain does not stop here.
Motivation 2: Scholarly communication workflow. Distributed Repositories of a digital scholarly communication system. Scholarly communication as a global workflow (value chain) across those Repositories. Digital Objects from Repositories are the subject of the workflow; they are used and re-used in many contexts.
Scholarly communication workflow: Scenarios. Scenario 2: Citation. An author writes a paper (to be Put into her institutional repository) and cites 10 papers available from other repositories. This citation should be more than crude text-pasting or URL linking; it should be characterized as a form of re-use of the cited object in a new context. And, of course, the new paper can be cited (reused) too, i.e. the value chain does not stop here.
Scholarly communication workflow : Scenarios Scenario 3: Overlay journal The editor of an overlay journal selects papers from 3 different repositories for inclusion in the next issue of the overlay journal.  Each of those articles is being re-used in a new context, with  value being added.  And, the overlay journal can be mirrored for preservation purposes, i.e. the value chain does not stop here.
Scholarly communication workflow : Scenarios Scenario 4: eScience A researcher uses datasets from 2 different dataset repositories, performs operations on those, and creates a publication that contains a resulting new dataset and an accompanying paper, and deposits this publication in her institutional repository. This constitutes re-use of the origin datasets, and value added through the creation of the new publication. And, of course, the new dataset can be re-used too, i.e. the value chain does not stop here.
Augmenting interoperability across  Repositories DSpace Fedora NIH NVO arXiv Nature Individual Data Models and Services Shared Data Model and Services
Some Meta-Observations on Interoperability Scholarly communication is a long-term endeavor: Dependent on stability and integrity of participants  Need  abstract definitions  of  models and interfaces  that can be instantiated on the basis of various technologies as time goes by Identification   is particularly important: Scalable Agnostic about existing identification schemes Granular Object decomposition Repository origination Workflows do not require transfer of all digital object content The content that needs to be transferred depends on the nature of the workflow In many cases full transfer is not permitted, impractical or superfluous
The Challenge:  Keeping it Simple and Sufficiently Functional Web Crawling search engines Metadata Dublin Core Metadata Harvesting OAI-PMH Grid Protocols and APIs Overhead Functionality
Building Block I - Repositories Networked system that provides  services  pertaining to a  managed collection of digital objects. Institutional repositories, online journals, dataset stores, learning objects, etc.
Building Block II: Digital Objects. Abstract units of scholarly communication. Compound aggregations consisting of: multiple media types, linkage to services. Have a persistent identifier. Can be recursive: digital objects within digital objects. Instantiated in various implementations.
Complex/Compound Digital Objects Text Data Simulations Images Video Computations Automated Analyses Virtual aggregations of distributed components Data
Augmenting interoperability across  Repositories DSpace Fedora aDORe ePrints arXiv Nature Individual Data Models and Services m Obtain Harvest Put
Building Block III: Common Data Model. Provides a common abstraction for describing digital objects despite their (repository, service)-specific implementation. A common denominator: does not completely cover implementation-specific features; features conform to requirements of the interoperability fabric (e.g., identity, workflow support, etc.).
Model Core Requirements: Recursion for n levels of information containment. Identity independent of specific schemes. Lineage relationships among objects: evidence of workflow, for evidential citation. Semantics associated with entities, to facilitate service mapping. Link to concrete representation. Assertion of persistence levels.
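The core requirements above can be illustrated with a small in-memory sketch. This is only an illustration under stated assumptions, not the model defined by the project: the class name `DigitalObject`, all field names, and the `info:example/...` identifiers are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DigitalObject:
    """Hypothetical in-memory form of a common data model (all names assumed)."""
    identifier: str                            # opaque string, any identifier scheme
    semantic_type: Optional[str] = None        # semantics, e.g. "article" or "dataset"
    representation_url: Optional[str] = None   # link to a concrete representation
    persistence: Optional[str] = None          # asserted persistence level
    lineage: List[str] = field(default_factory=list)  # ids of objects this one derives from
    components: List["DigitalObject"] = field(default_factory=list)  # recursive containment


def depth(obj: DigitalObject) -> int:
    """Number of containment levels, counting the object itself."""
    return 1 + max((depth(c) for c in obj.components), default=0)


def all_identifiers(obj: DigitalObject) -> List[str]:
    """Identifiers of the object and every nested component, depth-first."""
    ids = [obj.identifier]
    for component in obj.components:
        ids.extend(all_identifiers(component))
    return ids


# A paper that contains a dataset and records the dataset it was derived from.
dataset = DigitalObject("info:example/ds-1", semantic_type="dataset")
paper = DigitalObject(
    "info:example/paper-1",
    semantic_type="article",
    lineage=["info:example/ds-0"],
    components=[dataset],
)
```

Treating the identifier as an opaque string is one way to stay agnostic about identification schemes, and nesting `components` gives containment to any depth.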
Basis for a Network of Linked Digital Objects
Building Block IV: Serialization. Surrogates represent Digital Objects according to the data model for transmission over the net. Default to shallow (by-reference) rather than deep (by-value) copy. Full asset transfer is only required for certain applications. Avoid IP issues at the level of the interoperability framework. Allow Surrogates to flow freely, independent of the business models of the underlying content.
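A shallow, by-reference serialization might look like the following sketch. The slides do not specify a wire format, so JSON is used here purely as an assumption for illustration; the function name `make_surrogate` and the keys `id`, `type`, `lineage`, `datastreams`, `mediaType`, and `href` are likewise hypothetical.

```python
import json
from typing import Any, Dict, List


def make_surrogate(identifier: str, semantic_type: str,
                   datastreams: List[Dict[str, str]],
                   lineage: List[str]) -> str:
    """Serialize a digital object by reference: datastream URLs only, never the bytes."""
    surrogate: Dict[str, Any] = {
        "id": identifier,
        "type": semantic_type,
        "lineage": list(lineage),
        # Copy only the media type and the link; any embedded content is dropped,
        # which keeps the surrogate free of the underlying assets.
        "datastreams": [
            {"mediaType": d["mediaType"], "href": d["href"]}
            for d in datastreams
        ],
    }
    return json.dumps(surrogate, sort_keys=True)


text = make_surrogate(
    "info:example/obj-1",
    "article",
    [{"mediaType": "application/pdf",
      "href": "http://repo.example.org/obj-1.pdf",
      "content": "full text that should not travel with the surrogate"}],
    ["info:example/obj-0"],
)
```

Because only links are copied, the surrogate can be passed around without transferring the content itself; a consumer that needs the full asset follows `href` under whatever access terms apply.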
Building Block V: Common Services and APIs. Obtain interface: a Repository interface that supports the request of services pertaining to individual Digital Objects (including their component Datastreams). The core service is the request of a Surrogate for a Digital Object. Harvest interface: a Repository interface that exposes Surrogates for incremental collecting/harvesting. Put interface: a Repository interface that supports submission of one or more Surrogates into the Repository, thereby facilitating the addition of Digital Objects to the collection of the Repository.
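The three interfaces can be sketched as methods on a toy in-memory repository. This is a minimal sketch under assumptions, not the project's API: the class `Repository`, the method signatures, the dictionary-shaped surrogates, and the use of a sequence number to drive incremental harvesting are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


class Repository:
    """Toy repository exposing hypothetical Obtain, Harvest, and Put operations."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._store: Dict[str, dict] = {}       # identifier -> latest surrogate
        self._log: List[Tuple[int, str]] = []   # (sequence number, identifier)

    def put(self, surrogate: dict) -> None:
        """Put: submit a surrogate, adding the object to the collection."""
        self._store[surrogate["id"]] = surrogate
        self._log.append((len(self._log) + 1, surrogate["id"]))

    def obtain(self, identifier: str) -> dict:
        """Obtain: request the surrogate for one digital object."""
        return self._store[identifier]

    def harvest(self, since: int = 0) -> List[dict]:
        """Harvest: surrogates put after sequence number `since`, each listed once."""
        ids: List[str] = []
        for seq, identifier in self._log:
            if seq > since and identifier not in ids:
                ids.append(identifier)
        return [self._store[i] for i in ids]


repo = Repository("repo1")
repo.put({"id": "info:example/a"})
repo.put({"id": "info:example/b"})
```

A harvester that remembers the last sequence number it saw can call `harvest(since=...)` repeatedly and receive only what is new, which is the incremental behaviour the Harvest interface describes.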
Surrogate is at the core of the value chain. [Diagram labels: id, Lineage, providerInfo, Obtain, Put, recombine & add value]
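One step of such a value chain (obtain surrogates, recombine them, put the result elsewhere) can be sketched as follows. Everything here is an assumption made for illustration: repositories are reduced to plain dictionaries, and the function `recombine`, the keys `lineage`, `components`, and `ref`, and the reading of `providerInfo` as the name of the originating repository are hypothetical.

```python
from typing import Dict, List


def recombine(new_id: str, sources: List[Dict]) -> Dict:
    """Build a new surrogate that aggregates `sources` by reference."""
    return {
        "id": new_id,
        # Lineage records which objects were re-used to make this one.
        "lineage": [s["id"] for s in sources],
        # Components point back to the sources instead of copying them.
        "components": [
            {"ref": s["id"], "providerInfo": s.get("providerInfo")}
            for s in sources
        ],
    }


# Two source repositories and one target, each modelled as a plain dict.
repo1 = {"info:example/ds-1": {"id": "info:example/ds-1", "providerInfo": "repo1"}}
repo2 = {"info:example/ds-2": {"id": "info:example/ds-2", "providerInfo": "repo2"}}
target: Dict[str, Dict] = {}

# Obtain from both sources, recombine, then Put into the target.
obtained = [repo1["info:example/ds-1"], repo2["info:example/ds-2"]]
publication = recombine("info:example/pub-1", obtained)
target[publication["id"]] = publication
```

Since the new surrogate can itself be obtained and recombined later, its `lineage` entries chain back through earlier objects, which is what keeps the value chain from stopping at any one step.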
[Diagram: Repo1 and Repo2, each exposing Obtain, Harvest, and Put interfaces, and a service]
References Van de Sompel, Payette, Erickson, Lagoze, Warner, “Rethinking Scholarly Communication”,  D-Lib  September 2004 Warner, Bekaert, Lagoze, Liu, Payette, Van de Sompel, “Pathways: Augmenting Interoperability for Scholarly Repositories”,  International Journal of Digital Libraries Van de Sompel, Lagoze, Bekaert, Liu, Payette, Warner, “Interoperability for Distributed Scholarly Workflows”,  D-Lib  October 2006
Questions/Comments?
