The aDORe Federation Architecture

  1. 1. The aDORe Federation Architecture
     Herbert Van de Sompel (1), Ryan Chute (2), Luydimilla Balakireva (3)
     Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory
     (1) herbertv@lanl.gov http://public.lanl.gov/herbertv/ (2) rchute@lanl.gov (3) ludab@lanl.gov
     Acknowledgments: Jeroen Bekaert, Patrick Hochstenbach, Henry Jerez, Xiaoming Liu
     The aDORe Federation Architecture Herbert Van de Sompel, Ryan Chute, Luydimilla Balakireva Open Repositories 2008, University of Southampton, UK, April 1-4 2008
  2. 2. The Fedora Adoration Architecture? (the mandatory joke at the start of a presentation)
  3. 3. Presentation Based on the Paper
     Herbert Van de Sompel, Ryan Chute, Patrick Hochstenbach. The aDORe Federation: Digital Repositories at Scale. 40 pages. International Journal on Digital Libraries. Special Issue on Very Large Digital Libraries. In Publication, 2008. Preprint available at http://arxiv.org/abs/0803.4511
  4. 4. The aDORe Project: Background
     • Initial motivation:
       o Severe deficiencies in the new information discovery environment developed for the LANL Research Library:
         - Metadata-centric: descriptive metadata records are first-class citizens; actual digital assets are auxiliary data.
         - Tens of millions of digital assets stored as files in a file system.
         - Tight integration between content collection and discovery application, preventing other applications from leveraging the rich content base.
       o Obvious solution:
         - Replace the metadata-centric approach with a compound object approach.
         - Bundle digital assets into storage containers that dramatically reduce the number of files in the file system.
         - Cleanly separate the storage repository from the applications that leverage the stored assets by providing the necessary machine interfaces.
     • Implementation of the obvious solution led to the aDORe R&D project, 2003-2007
  5. 5. The aDORe Project: Major Drivers
     • Concrete need to design and implement a solution to ingest, store, and access the vast and growing collection of the LANL Research Library.
       o Scale, scale, scale!
       o Existing open source solutions (at that time) did not meet our scale requirements
         - e.g. static binding of disseminators to objects in Fedora.
     • Interest in repository interoperability, cf. involvement in OAI-PMH, NISO OpenURL, OAI-ORE
     • Interest in digital preservation, cf. NDIIPP funding
  6. 6. The aDORe Project: Major Design Constraints
     • Leverage existing standards and technologies to make development and migration more straightforward.
       o Read: Laziness as a strategy
     • Use a distributed, component-based approach to meet challenges of scale.
  7. 7. The aDORe Project: Some of the Results
     • Conceptual:
       o aDORe Federation architecture: A high-level, 3-Tier architecture for the federation of distributed repositories.
     • Concrete:
       o The aDORe Archive storage solution (XMLtapes/ARCfiles): Tier-1 of the aDORe Federation architecture.
       o The aDORe Federation software: an implementation of the 3-Tier architecture, with the aDORe Archive in Tier-1.
     • And a lot more, but that’s not for today.
  8. 8. The aDORe Federation software
     • Available today at: http://african.lanl.gov/aDORe/aDOReFederation
     • Thanks to Ryan Chute & Luydimilla Balakireva
     • This is a major update to the aDORe Archive:
       o Updates the Tier-1 aDORe Archive
       o Implements the 3 Tiers of the architecture instead of only Tier-1
     • In production at LANL Research Library for over 1 year
     • Attractive for large collections of relatively stable objects
     • Could be used as a plug-in storage component for IR solutions
  9. 9. The aDORe Archive @ LANL, March 31 2008
     • 87,000,000 Compound Digital Objects
     • 208,000,000 Stored bitstreams
     • ~ 9,200 autonomous repositories:
       o ~ 4,000 XMLtapes: XML-based serializations of Digital Objects
       o ~ 5,200 ARCfiles: bitstreams
     • > 500,000,000 identifiers
     • More about the aDORe Federation software later
  10. 10. The aDORe Federation Architecture: Goal
     • Facilitate a uniform manner for client applications to discover and access content objects available in a group of distributed repositories.
     • Single repository behavior for a group of distributed repositories.
     • Note that these distributed repositories can very well be “hidden” and that only the federated result is made “public”.
       o Cf. Peter Murray-Rust departmental repositories
     • Not about uniform approaches to add, update, delete objects in repositories.
       o Considered the responsibility of individual repositories.
       o However, changes are made apparent to the federation.
  11. 11. aDORe Federation Architecture
  12. 12. Basic Design Choices
     • All entities in the architecture are identified by means of URIs: content objects, repositories, machine interfaces.
       o Turns entities into uniquely identified Web resources
       o Avoids unwanted identifier collisions, for example, for different content objects from various repositories.
     • Depending on use case, certain Content Objects are identified by either:
       o Protocol-based URIs
       o Non-protocol-based URIs
     • All machine interfaces are (HTTP) protocol-based
  13. 13. Content Objects
     • 3 types of Content Objects:
       o Digital Objects
       o Datastreams
       o Surrogates
     • These types do not need to be natively embraced by repositories; they are supported at federation-facing machine interfaces. They are abstractions.
     • Other properties of Content Objects may be expressed, but the architecture only serves to convey them.
     • Core enabling properties of the architecture, for each of these Content Objects:
       o identification
       o location
       o time-stamp
  14. 14. Content Objects: Digital Object
     • Cf. Kahn-Wilensky, cf. most repository systems
     • A Digital Object is an identified aggregation of one or more Datastreams and properties pertaining to the Datastreams and to the aggregation.
     • A Digital Object is the perspective of a repository’s native compound object that is shared with the federation.
  15. 15. Content Objects: Digital Object
     • Identification: DO-URI
       o Inherited from another environment and/or minted by the repository.
         - Cf. Internet Archive HTTP URIs of stored objects; identifiers assigned to scanned images
       o One or more per Digital Object
       o Digital Objects with the same DO-URI may exist in multiple repositories
         - Cf. paper with same DOI in multiple IRs; HTTP URI in Internet Archive
       o Protocol-based, non-protocol-based
         - http://some.repo.org/do/1234
         - info:some-repo/do/1234
       o Always treated as non-protocol-based, i.e. never resolved (in the federation) using its native resolution protocol, but rather conveyed as a parameter in requests against the federation’s machine interfaces.
         - Cf. Internet Archive.
     • Time-stamping:
       o Digital Objects change over time
       o Changes are communicated to the federation via Surrogates
  16. 16. Content Objects: Surrogates
     • A Surrogate is the serialization of a Digital Object into a machine-readable representation that is made accessible by a repository, e.g. DIDL, METS, ORE Atom or RDF/XML.
     • Surrogates are the vehicles repositories use to keep the federation informed about the availability of their Digital Objects and about changes those Digital Objects undergo.
     • One or more Surrogates can correspond with a Digital Object in a federation:
       o A Digital Object with the same URI may exist in multiple repositories
       o A single repository may have multiple Surrogates for a Digital Object
     • Minimally expresses:
       o DO-URI of the DO it serializes
       o Datastream-URI/URLs of constituent Datastreams
       o Identifier of the Surrogate itself
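The minimal content of a Surrogate can be sketched as a small record type. This is only an illustration in Python: the class name, field names, and the helper `surrogates_for` are assumptions made for this sketch, not part of the aDORe software, and real Surrogates are serialized as DIDL, METS, ORE Atom, or RDF/XML rather than held as in-memory objects.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Surrogate:
    """Hypothetical record holding the minimum a Surrogate expresses."""
    surrogate_id: str                 # Surrogate-URI or Surrogate-URL
    do_uri: str                       # DO-URI of the Digital Object it serializes
    datastream_ids: Tuple[str, ...]   # Datastream-URIs/URLs of the constituents

    def __post_init__(self) -> None:
        if not self.surrogate_id or not self.do_uri:
            raise ValueError("a Surrogate needs its own identifier and a DO-URI")
        if not self.datastream_ids:
            raise ValueError("a Digital Object aggregates one or more Datastreams")


def surrogates_for(do_uri: str, surrogates: List[Surrogate]) -> List[Surrogate]:
    """Return every Surrogate that serializes the given Digital Object."""
    return [s for s in surrogates if s.do_uri == do_uri]
```

Because several Surrogates, possibly from different repositories, may carry the same DO-URI, a lookup by DO-URI returns a list rather than a single record.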
  17. 17. Content Objects: Surrogates
     • Identification: Identifier minted by the repository that makes the Surrogate available.
       o Protocol-based: Surrogate-URL (native resolution)
       o Non-protocol-based: Surrogate-URI (resolution via interfaces)
     • Time-stamping: Surrogate-datetime changes when a change to a Digital Object needs to be communicated to the federation. Minimally when constituency changes.
     • Update Policies:
       o New Surrogate Policy:
         - Change to Digital Object => New Surrogate, new Surrogate-URI/URL, new Surrogate-datetime
         - Previous Surrogate remains available
       o Update Surrogate Policy:
         - Change to Digital Object => Update Surrogate, same Surrogate-URI/URL, new Surrogate-datetime
         - Previous Surrogate no longer available
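The difference between the two update policies can be illustrated with a toy in-memory store. Everything here (the `SurrogateStore` class, its method names, the `"new"`/`"update"` policy labels) is hypothetical and chosen for the sketch; it only mirrors the rules on the slide and is not how the aDORe Archive stores Surrogates.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SurrogateRecord:
    surrogate_id: str        # Surrogate-URI or Surrogate-URL
    do_uri: str              # DO-URI of the serialized Digital Object
    surrogate_datetime: str  # e.g. an ISO 8601 UTC timestamp


class SurrogateStore:
    """Toy store applying either the New or the Update Surrogate Policy."""

    def __init__(self, policy: str) -> None:
        if policy not in ("new", "update"):
            raise ValueError("policy must be 'new' or 'update'")
        self.policy = policy
        self.records: Dict[str, SurrogateRecord] = {}

    def add(self, record: SurrogateRecord) -> None:
        self.records[record.surrogate_id] = record

    def record_change(self, old_id: str, changed_at: str, new_id: str = "") -> SurrogateRecord:
        """Register a change to the Digital Object behind Surrogate old_id."""
        old = self.records[old_id]
        if self.policy == "new":
            # New Surrogate Policy: mint a new identifier, keep the previous Surrogate.
            if not new_id or new_id in self.records:
                raise ValueError("the New Surrogate Policy needs a fresh identifier")
            fresh = SurrogateRecord(new_id, old.do_uri, changed_at)
            self.records[new_id] = fresh
            return fresh
        # Update Surrogate Policy: same identifier, only the datetime moves forward.
        old.surrogate_datetime = changed_at
        return old
```

Under the New Surrogate Policy the store grows with every change and older Surrogates stay retrievable; under the Update Surrogate Policy the store keeps one record per Surrogate identifier and the earlier state is lost.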
  18. 18. Content Objects: Datastreams
     • A Datastream is a retrievable bitstream of whichever media type made available by a repository to the federation.
     • It is a perspective of a repository’s native bitstream that is shared with an aDORe federation.
     • Can be, e.g.:
       o Dissemination of locally stored bitstream
       o Dissemination of externally stored bitstream
       o Result of applying a service to a (local or external) bitstream
     • A Datastream can be part of multiple Digital Objects, but there is one repository in the federation that owns/serves it.
  19. 19. Content Objects: Datastreams
     • Identification: Identifier minted by the repository that makes the Datastream available.
       o Protocol-based: Datastream-URL (native resolution)
       o Non-protocol-based: Datastream-URI (resolution via interfaces)
     • Time-stamping: Datastream-datetime changes when a change to a Datastream needs to be communicated to the federation.
     • Update Policies:
       o New Datastream Policy:
         - Update of retrievable bitstream => new Datastream, new Datastream-URI/URL, new Datastream-datetime
         - Cf. digital preservation: migration of a JPEG image (URI-1) leads to JPEG-2000 (URI-2), original JPEG maintained.
       o Update Datastream Policy:
         - Update of retrievable bitstream => same Datastream, same Datastream-URI/URL, new Datastream-datetime
         - Original bitstream no longer available
  20. 20. aDORe Federation Architecture: Tier-1
  21. 21. Tier-1: Surrogate and (sometimes) Datastream Repositories
     • Surrogate Repositories, Datastream Repositories as well as their Interfaces are identified by URI
     • Interfaces leverage identification and time-stamping of Content Objects
     • A Datastream Repository is needed only when using (non-protocol-based) Datastream-URIs
  22. 22. aDORe Federation Architecture: Tier-1
  23. 23. aDORe Federation Architecture: Tier-1
  24. 24. Tier-1: Locate Surrogates
     • Use: Repositories that have multiple Surrogates for a given Digital Object, or that have Digital Objects that share Datastreams.
     • http://some.repo.org/openurl?url_ver=z39.88-2004&rft_id=http://some.repo.org/ds/5678&svc_id=info:ourfederation/svc/LocateSurrogates
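A request like the one on this slide can be assembled from its three parameters. The sketch below uses only the Python standard library; the function name `build_openurl` is hypothetical, and the parameter values are the example values from the slide. Note that `urlencode` percent-encodes the identifier carried in `rft_id`, whereas the slide shows it unencoded, presumably for readability.

```python
from urllib.parse import urlencode


def build_openurl(base_url: str, rft_id: str, svc_id: str) -> str:
    """Build an OpenURL request: rft_id names the Content Object,
    svc_id names the requested service."""
    query = urlencode({
        "url_ver": "z39.88-2004",  # version string as written on the slide
        "rft_id": rft_id,
        "svc_id": svc_id,
    })
    return f"{base_url}?{query}"


# Example values taken from the slide; the host is the slide's placeholder.
locate_url = build_openurl(
    "http://some.repo.org/openurl",
    "http://some.repo.org/ds/5678",
    "info:ourfederation/svc/LocateSurrogates",
)
```

The identifier in `rft_id` is passed as an opaque parameter and is not dereferenced by the client, which matches the way the federation treats DO-URIs as non-protocol-based.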
  25. 25. Tier-1: Obtain Datastream
     • Use: For repositories that use Datastream-URIs, not Datastream-URLs
     • http://some.repo.org/openurl?url_ver=z39.88-2004&rft_id=info:some-repo/ds/5678&svc_id=info:ourfederation/svc/ObtainDatastream
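How a client might choose between native resolution and the Obtain Datastream interface can be sketched as follows. This assumes, for illustration only, that a protocol-based Datastream-URL is recognizable by an `http` or `https` scheme; the function name and that test are assumptions of this sketch, not rules stated in the presentation.

```python
from urllib.parse import urlencode, urlsplit

OBTAIN_DATASTREAM = "info:ourfederation/svc/ObtainDatastream"  # svc_id from the slide


def datastream_request_url(datastream_id: str, repo_openurl_base: str) -> str:
    """Return the URL to fetch for a Datastream identifier."""
    if urlsplit(datastream_id).scheme in ("http", "https"):
        # Datastream-URL: protocol-based, resolved natively.
        return datastream_id
    # Datastream-URI: non-protocol-based, conveyed as a parameter
    # to the repository's Obtain Datastream interface.
    query = urlencode({
        "url_ver": "z39.88-2004",
        "rft_id": datastream_id,
        "svc_id": OBTAIN_DATASTREAM,
    })
    return f"{repo_openurl_base}?{query}"
```

In a full federation the base URL of the repository's OpenURL interface would not be hard-wired but looked up in Tier-2.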
  26. 26. aDORe Federation Architecture: Tier-2
  27. 27. Tier-2: Service Registry
     • Keeps track of all components in the federation. In essence 2 look-up tables.
     • Look-up table 1:
       o URI of component (e.g. Repository-URI)
       o Matching Interface-URIs (and Interface type)
     • Look-up table 2:
       o Interface-URI
       o Interface-URL
     • Exposes minimally 1 interface, Obtain Registry Record
     • Cf. Information Environment Service Registry
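The two look-up tables can be modelled with two dictionaries. This is a minimal in-memory sketch: the class, the method signatures, and the shape of the returned record are assumptions for illustration, while the released software uses an IESR-based database schema.

```python
from typing import Dict, List, Tuple


class ServiceRegistry:
    """Toy Service Registry built from the two look-up tables on the slide."""

    def __init__(self) -> None:
        # Table 1: component URI -> [(Interface-URI, interface type), ...]
        self._interfaces: Dict[str, List[Tuple[str, str]]] = {}
        # Table 2: Interface-URI -> Interface-URL
        self._urls: Dict[str, str] = {}

    def register(self, component_uri: str, interface_uri: str,
                 interface_type: str, interface_url: str) -> None:
        self._interfaces.setdefault(component_uri, []).append(
            (interface_uri, interface_type))
        self._urls[interface_uri] = interface_url

    def obtain_registry_record(self, component_uri: str) -> List[Dict[str, str]]:
        """Join both tables for one component; empty list if it is unknown."""
        return [
            {"interface_uri": uri, "interface_type": kind,
             "interface_url": self._urls[uri]}
            for uri, kind in self._interfaces.get(component_uri, [])
        ]
```

Keeping Interface-URIs separate from Interface-URLs means an interface can move to a new network location by changing one entry in the second table, without touching the components that reference it.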
  28. 28. Tier-2: Identifier Locator
     • Look-up table:
       o Identifiers of Content Objects in the federation
       o Identifiers of Datastream or Surrogate Repositories that make these Content Objects accessible
       o Necessarily will store this information for non-protocol-based identifiers
         - Minimally DO-URI (remember: treated as non-protocol-based)
     • Populated by recurrently interacting with the Harvest Surrogates and Harvest Datastream Identifier interfaces of all Tier-1 repositories.
     • Identifier Locator knows about these interfaces via the Service Registry.
     • Exposes minimally 1 interface, Locate Repositories:
       o In: DO-URI, Surrogate-URI, Datastream-URI
       o Out: List of Repository-URIs
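The Identifier Locator's look-up table and its Locate Repositories interface can be sketched as a mapping from identifiers to sets of Repository-URIs. The class and method names below are hypothetical, and the harvesting step is reduced to a call that hands over already-harvested identifiers; the released software uses a MySQL-based solution and populates it through the Tier-1 harvest interfaces.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class IdentifierLocator:
    """Toy look-up table: Content Object identifier -> Repository-URIs."""

    def __init__(self) -> None:
        self._repos: Dict[str, Set[str]] = defaultdict(set)

    def ingest_harvest(self, repository_uri: str, identifiers: Iterable[str]) -> None:
        """Record identifiers (DO-URIs, Surrogate-URIs, Datastream-URIs)
        obtained from one repository's harvest interfaces."""
        for identifier in identifiers:
            self._repos[identifier].add(repository_uri)

    def locate_repositories(self, identifier: str) -> List[str]:
        """Locate Repositories: identifier in, list of Repository-URIs out."""
        # .get avoids creating empty entries for unknown identifiers.
        return sorted(self._repos.get(identifier, set()))
```

A set per identifier reflects that a Digital Object with the same DO-URI may exist in several repositories, so a single look-up can legitimately return more than one Repository-URI.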
  29. 29. aDORe Federation Architecture: Tier-3
  30. 30. aDORe Federation software
  31. 31. The aDORe Archive, new release
     • An updated aDORe Archive Installer containing:
       o XMLtape Toolkit (updated)
       o ARCfile Toolkit (updated)
       o XMLtape Registry (updated)
       o ARCfile Registry (updated)
       o XMLtape OpenURL Resolver (new)
         - Provides Core Surrogate Services (e.g. Obtain, Locate, Harvest Identifiers) through a simple and efficient OpenURL Service Interface.
       o XMLtape OpenURL XQuery Resolver (new)
         - Provides a configuration-based solution for complex ad-hoc queries. Built upon Nux <http://dsd.lbl.gov/nux/>, an open-source Java toolkit that provides a scalable solution for non-index-based search of large XML repositories.
  32. 32. The aDORe Archive, new release
     • A new aDORe Federation Installer providing:
       o An aDORe Archive installation
       o IESR-based Service Registry (new)
         - Uses the Ockham Service Registry IESR-based database schema and provides OAI-PMH and OpenURL Services.
       o Identifier Locator (new)
         - A fast, in-memory MySQL-based solution used for efficient resolution of Digital Object, Datastream, and Surrogate URIs to Repository URIs.
       o OAI-PMH Federator (new)
         - Provides access to multiple aDORe Archive installations through a common OAI-PMH interface.
       o OpenURL Disseminator (new)
         - OpenURL Service interface that provides federated access to all repository content and performs dynamic dissemination services using a rule-engine-based plug-in framework.
  33. 33. The aDORe Archive, new release
     • The documentation will provide:
       o Detailed presentations of available service interfaces and available pluggable interfaces.
       o A sample ingestion of DIDL content to illustrate the various key service interfaces available in the aDORe Federation
       o A tutorial showing how to create processing implementations and the necessary configurations
       o A public demo version of the aDORe Federation using public domain content (coming soon)
