Open Archives Initiative Object Reuse and Exchange

Presentation at Microsoft eScience Workshop at Johns Hopkins in October 2006


  1. 1. Open Archives Initiative Object Reuse and Exchange: Infrastructure to Support New Models of Scholarly Publication. Carl Lagoze, Information Science, Cornell University [email_address]
  2. 2. This is a talk about plumbing….
  3. 3. …motivated by exciting final results
  4. 4. Infrastructure to Support New Models of Scholarly Publication
      • Exploit open access to scholarly results from multiple disciplines (we have lots more stuff available, let’s do something interesting with it)
      • Take advantage of new forms of scholarly results combining text, data, simulation, etc. (and the stuff is getting more interesting)
      • Expose the process of scholarship
      • Enable new workflows that allow new services to add value to scholarly content
      • Facilitate reuse of scholarly information for education, industry, etc.
  5. 5. Scholarly Communication and Digital Libraries Group, Cornell Information Science
      • arXiv
      • Fedora
      • Legal Information Institute
      • Open Archives Initiative
      • National Science Digital Library
  6. 6. The cast of collaborators and supporters:
      • Pathways project (NSF IIS-0430906)
          • Cornell University (Carl Lagoze, Sandy Payette, Simeon Warner)
          • Los Alamos National Laboratory (Herbert Van de Sompel)
      • Fedora Open Source Repository Project (Mellon)
          • Cornell University (Sandy Payette)
          • University of Virginia (Thornton Staples)
      • OAI Object Reuse and Exchange, OAI-ORE (Mellon)
          • Cornell University (Carl Lagoze)
          • Los Alamos National Laboratory (Herbert Van de Sompel)
  7. 8. Meeting in NYC, April 20-21, 2006
      • Supported by Microsoft, Mellon Foundation, Coalition for Networked Information, Digital Library Federation, JISC
      • Representatives from institutional Repository projects, scholarly content Repositories, Registry projects, and various projects that touch on interoperability
      • See http://msc.mellon.org/Meetings/Interop/ for Agenda, Participants, Topics & Goals, Terminology, Presentations, Prototype demonstration
      • Report available since beginning of August 2006
  8. 9. Motivation: Scholarship is Changing
      • Influenced by:
          • High performance computing and connectivity
          • Peta-scale data storage
          • Advanced data mining and data storage
          • Web services, Web 2.0
          • Open Access movement
      • Evolution towards:
          • Highly collaborative
          • Network-based
          • Data-driven
      • Visible in Science & Engineering but also in humanities
      • And there are increasing links between these formerly separated fields (a benefit of having “everything” online)
  9. 10. Motivation: “Everything Online”
      • E-print repositories: arXiv, Cogprints
      • Institutional repositories: DSpace, FEDORA, ePrints.org
      • Publication repositories: PubMed Central
      • Data repositories: NVO, NCBI
      • Interoperability architecture: DCMI, OAI-PMH
      • But many of these are changes in form rather than in nature
      • Or, at best, not solutions that generalize across disciplines
  10. 11. Setting More Ambitious Goals
      • In many cases we’ve only created an electronic equivalent of the paper-based system.
      • While ‘open access’ is important, it should not be our only focus.
      • The networked environment provides opportunities for more radical changes:
          • Expose component products and process
          • Allow components to move across multiple workflows
          • Promote recombination, refactoring, and transformation of information
          • Transform repositories/databases from passive storage to active building blocks for higher level services
  11. 12. Providing the fabric for an Information Layer Over Core Resources
  12. 13. Motivation 1: Richer cross-Repository services
      • Distributed Repositories provide source materials for cross-Repository overlay services such as discovery services
      • The manner in which those materials are exposed must allow a variety of discovery services that fit individual disciplinary and cross-disciplinary needs (i.e., not just text scraping)
  13. 14. Richer cross-Repository services: Scenario
      • Scenario 1: Chemical search engine
      • A search engine monitors scholarly repositories but is only interested in making searchable the machine-readable chemical structures contained in Digital Objects available from those repositories.
      • This constitutes re-use of (part of) the Digital Objects by a service overlaid upon the monitored repositories.
      • And, of course, a chemical compound discovered via the search engine can be cited in some new paper, i.e. the value chain does not stop here.
  14. 15. Motivation 2: Scholarly communication workflow
      • Distributed Repositories of a digital scholarly communication system
      • Scholarly communication as a global workflow (value chain) across those Repositories
      • Digital Objects from Repositories are the subject of the workflow; they are used and re-used in many contexts.
  15. 16. Scholarly communication workflow: Scenarios
      • Scenario 2: Citation
      • An author writes a paper (to be Put into her institutional repository) and cites 10 papers available from other repositories.
      • This citation should be more than crude text-pasting or URL linking; it should be characterized as a form of re-use of the cited object in a new context.
      • And, of course, the new paper can be cited (reused) too, i.e. the value chain does not stop here.
  16. 17. Scholarly communication workflow: Scenarios
      • Scenario 3: Overlay journal
      • The editor of an overlay journal selects papers from 3 different repositories for inclusion in the next issue of the overlay journal.
      • Each of those articles is being re-used in a new context, with value being added.
      • And the overlay journal can be mirrored for preservation purposes, i.e. the value chain does not stop here.
  17. 18. Scholarly communication workflow: Scenarios
      • Scenario 4: eScience
      • A researcher uses datasets from 2 different dataset repositories, performs operations on them, creates a publication that contains a resulting new dataset and an accompanying paper, and deposits this publication in her institutional repository.
      • This constitutes re-use of the original datasets, with value added through the creation of the new publication.
      • And, of course, the new dataset can be re-used too, i.e. the value chain does not stop here.
  18. 19. Augmenting interoperability across Repositories [Diagram labels: DSpace, Fedora, NIH, NVO, arXiv, Nature; Individual Data Models and Services; Shared Data Model and Services]
  19. 20. Some Meta-Observations on Interoperability
      • Scholarly communication is a long-term endeavor:
          • Dependent on stability and integrity of participants
          • Need abstract definitions of models and interfaces that can be instantiated on the basis of various technologies as time goes by
      • Identification is particularly important:
          • Scalable
          • Agnostic about existing identification schemes
          • Granular
              • Object decomposition
              • Repository origination
      • Workflows do not require transfer of all digital object content:
          • The content that needs to be transferred depends on the nature of the workflow
          • In many cases full transfer is not permitted, impractical, or superfluous
  20. 21. The Challenge: Keeping it Simple and Sufficiently Functional [Chart: Overhead versus Functionality. Labels: Web Crawling / search engines; Metadata / Dublin Core; Metadata Harvesting / OAI-PMH; Grid Protocols and APIs]
  21. 22. Building Block I: Repositories
      • Networked system that provides services pertaining to a managed collection of digital objects
      • Institutional repositories, online journals, dataset stores, learning objects, etc.
  22. 23. Building Block II: Digital Objects
      • Abstract units of scholarly communication
      • Compound aggregations consisting of:
          • Multiple media types
          • Linkage to services
      • Have a persistent identifier
      • Can be recursive: digital objects within digital objects
      • Instantiated in various implementations
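The recursive, identified structure described on this slide can be sketched in a few lines. This is an illustration only, not the Pathways or OAI-ORE data model: the `DigitalObject` class, its field names, the default media type, and the `info:example/...` identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class DigitalObject:
    """Hypothetical compound object: an identifier plus nested components."""
    identifier: str                                  # persistent, scheme-agnostic
    media_type: str = "application/x-compound"       # illustrative default
    components: List["DigitalObject"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "DigitalObject"]]:
        """Yield (depth, object) for this object and everything nested in it."""
        yield depth, self
        for child in self.components:
            yield from child.walk(depth + 1)


# A paper that aggregates a PDF and a dataset, which is itself a compound object.
paper = DigitalObject("info:example/paper-1", components=[
    DigitalObject("info:example/paper-1/text", "application/pdf"),
    DigitalObject("info:example/dataset-7", components=[
        DigitalObject("info:example/dataset-7/table", "text/csv"),
    ]),
])

outline = [(depth, obj.identifier) for depth, obj in paper.walk()]
```

Because every component is itself a `DigitalObject`, the same traversal works at any nesting depth, which is the point of the "digital objects within digital objects" bullet.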
  23. 24. Complex/Compound Digital Objects
      • Text
      • Data
      • Simulations
      • Images
      • Video
      • Computations
      • Automated Analyses
      • Virtual aggregations of distributed components
  24. 25. Augmenting interoperability across Repositories [Diagram labels: DSpace, Fedora, aDORe, ePrints, arXiv, Nature; Individual Data Models and Services; shared model (m); Obtain, Harvest, Put]
  25. 26. Building Block III: Common Data Model
      • Provides a common abstraction for describing digital objects despite their (repository, service)-specific implementation
      • A common denominator:
          • Does not completely cover implementation-specific features
          • Features conform to requirements of the interoperability fabric (e.g., identity, workflow support, etc.)
  26. 27. Model Core Requirement
      • Recursion for n levels of information containment
      • Identity independent of specific schemes
      • Lineage relationships among objects
          • Evidence of workflow for evidential citation
      • Semantics associated with entities
          • Facilitate service mapping
      • Link to concrete representation
      • Assertion of persistence levels
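The lineage requirement can be made concrete with a small sketch. It assumes, purely for illustration, that each entity records the identifiers it was derived from; the `Entity` class, the `derived_from` field, and the `ds:`/`doc:` identifiers are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    """Hypothetical model entity: identity, semantics, lineage."""
    identifier: str                 # opaque string, so any scheme can be used
    semantic_type: str              # e.g. "dataset" or "article" (illustrative)
    derived_from: List[str] = field(default_factory=list)  # lineage links


def lineage(identifier: str, registry: Dict[str, Entity]) -> List[str]:
    """Return all ancestors of an entity, nearest first, without duplicates."""
    seen: List[str] = []
    queue = list(registry[identifier].derived_from)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        if current in registry:  # ancestors held elsewhere may be unknown here
            queue.extend(registry[current].derived_from)
    return seen


registry = {
    e.identifier: e
    for e in [
        Entity("ds:raw-a", "dataset"),
        Entity("ds:raw-b", "dataset"),
        Entity("ds:merged", "dataset", ["ds:raw-a", "ds:raw-b"]),
        Entity("doc:paper", "article", ["ds:merged"]),
    ]
}

ancestors = lineage("doc:paper", registry)
```

Walking `derived_from` links in this way is one possible reading of "evidence of workflow": a citation of `doc:paper` can be traced back to the datasets it was built from.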
  27. 28. Basis for a Network of Linked Digital Objects
  28. 29. Building Block IV: Serialization Surrogates
      • Represent Digital Objects according to the data model for transmission over the net
      • Default to shallow (by-reference) rather than deep (by-value) copy
          • Full asset transfer is only required for certain applications
          • Avoid IP issues at the level of the interoperability framework
          • Allow Surrogates to flow freely, independent of the business models of the underlying content
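The difference between a shallow (by-reference) and a deep (by-value) Surrogate can be sketched as follows. JSON is used here only for brevity; the talk does not prescribe a serialization format, and the `make_surrogate` function, the key names, and the example URL are assumptions made for this illustration.

```python
import json
from typing import Any, Dict, List


def make_surrogate(identifier: str, components: List[Dict[str, str]],
                   by_value: bool = False) -> str:
    """Serialize an object description; content is referenced unless by_value."""
    entries: List[Dict[str, Any]] = []
    for comp in components:
        entry = {"id": comp["id"], "mediaType": comp["mediaType"]}
        if by_value:
            entry["content"] = comp["content"]   # deep copy: ship the content
        else:
            entry["href"] = comp["href"]         # shallow copy: ship a pointer
        entries.append(entry)
    return json.dumps({"id": identifier, "components": entries}, sort_keys=True)


figure = {
    "id": "info:example/fig-2",
    "mediaType": "image/png",
    "href": "https://repo.example.org/fig-2.png",   # hypothetical location
    "content": "<encoded image bytes>",              # placeholder payload
}

shallow = make_surrogate("info:example/paper-1", [figure])
deep = make_surrogate("info:example/paper-1", [figure], by_value=True)
```

The shallow form carries only identifiers and locations, so it can circulate without moving the underlying content, which is the default the slide argues for.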
  29. 30. Building Block V: Common Services and APIs
      • Obtain interface: a Repository interface that supports the request of services pertaining to individual Digital Objects (including their component Datastreams). The core service is the request of a Surrogate for a Digital Object.
      • Harvest interface: a Repository interface that exposes Surrogates for incremental collecting/harvesting.
      • Put interface: a Repository interface that supports submission of one or more Surrogates into the Repository, thereby facilitating the addition of Digital Objects to the collection of the Repository.
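A toy in-memory repository shows how the three interfaces might fit together. This is a sketch under stated assumptions, not a specified API: the class and method names are hypothetical, Surrogates are plain dictionaries, and incremental harvesting uses a simple sequence number where a real protocol such as OAI-PMH uses datestamps.

```python
from typing import Dict, List, Tuple


class InMemoryRepository:
    """Hypothetical repository exposing Obtain, Harvest, and Put."""

    def __init__(self) -> None:
        self._surrogates: Dict[str, dict] = {}
        self._log: List[Tuple[int, str]] = []   # (sequence number, identifier)
        self._seq = 0

    def put(self, surrogate: dict) -> int:
        """Put: accept a Surrogate and return the sequence number assigned."""
        self._seq += 1
        self._surrogates[surrogate["id"]] = surrogate
        self._log.append((self._seq, surrogate["id"]))
        return self._seq

    def obtain(self, identifier: str) -> dict:
        """Obtain: return the Surrogate for one Digital Object."""
        return self._surrogates[identifier]

    def harvest(self, since: int = 0) -> List[dict]:
        """Harvest: return Surrogates added or updated after `since`."""
        ids: List[str] = []
        for seq, identifier in self._log:
            if seq > since and identifier not in ids:
                ids.append(identifier)
        return [self._surrogates[i] for i in ids]


repo = InMemoryRepository()
first = repo.put({"id": "info:example/a"})
second = repo.put({"id": "info:example/b"})
new_items = repo.harvest(since=first)
```

A harvester that remembers the last sequence number it saw only receives what changed since then, which is the "incremental collecting" the Harvest interface describes.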
  30. 31. Surrogate is at the core of the value chain [Diagram labels: id, Lineage, providerInfo, Obtain, Put, recombine & add value]
  31. 32. [Diagram: Repo1 and Repo2, each with Obtain, Harvest, and Put interfaces, and a service between them. Labels: Put 1, Harvest 1, Obtain 1, Put 2, Harvest 2, Obtain 2, Put service]
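One possible reading of the two-repository diagram is a service that harvests Surrogates from one repository and Puts enriched copies into another. The sketch below assumes that reading; the `relay` function, the use of plain lists as stand-in repositories, and the way `providerInfo` and `lineage` are filled in are all hypothetical.

```python
from typing import Callable, Dict, List

Surrogate = Dict[str, object]


def relay(harvest: Callable[[], List[Surrogate]],
          put: Callable[[Surrogate], None],
          source_name: str,
          target_name: str) -> int:
    """Copy Surrogates from a source to a target, recording their origin."""
    count = 0
    for surrogate in harvest():
        enriched = dict(surrogate)                     # leave the source untouched
        enriched["id"] = f"{target_name}/{surrogate['id']}"  # new identity in target
        enriched["providerInfo"] = source_name         # where the Surrogate came from
        enriched["lineage"] = [surrogate["id"]]        # link back to the original
        put(enriched)
        count += 1
    return count


# Plain lists stand in for the Harvest side of Repo1 and the Put side of Repo2.
repo1: List[Surrogate] = [{"id": "r1:obj-1"}, {"id": "r1:obj-2"}]
repo2: List[Surrogate] = []

moved = relay(lambda: list(repo1), repo2.append, "repo1", "repo2")
```

Each copy in the target keeps a lineage link to the object it came from, so a later consumer of `repo2` can still Obtain the original from `repo1`.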
  32. 33. References
      • Van de Sompel, Payette, Erickson, Lagoze, Warner, “Rethinking Scholarly Communication”, D-Lib, September 2004
      • Warner, Bekaert, Lagoze, Liu, Payette, Van de Sompel, “Pathways: Augmenting Interoperability for Scholarly Repositories”, International Journal of Digital Libraries
      • Van de Sompel, Lagoze, Bekaert, Liu, Payette, Warner, “Interoperability for Distributed Scholarly Workflows”, D-Lib, October 2006
  33. 34. Questions/Comments?
