The document describes the aDORe Federation Architecture, which was developed to address challenges of scale in digital repositories. The key aspects are:
1) It is a 3-tier architecture that federates distributed digital object repositories to provide unified access to content.
2) The first tier consists of Surrogate Repositories and, where needed, Datastream Repositories, which expose serializations of digital objects and their constituent bitstreams to the federation.
3) The architecture leverages URIs to identify digital objects, surrogates, repositories and interfaces to allow federated access across repositories.
The aDORe Federation Architecture
1. The aDORe Federation Architecture
Herbert Van de Sompel (1), Ryan Chute (2), Luydimilla Balakireva (3)
Digital Library Research & Prototyping Team
Research Library
Los Alamos National Laboratory
(1) herbertv@lanl.gov
http://public.lanl.gov/herbertv/
(2) rchute@lanl.gov
(3) ludab@lanl.gov
Acknowledgments:
Jeroen Bekaert, Patrick Hochstenbach, Henry Jerez, Xiaoming Liu
The aDORe Federation Architecture
Herbert Van de Sompel, Ryan Chute, Luydimilla Balakireva
Open Repositories 2008, University of Southampton, UK, April 1-4 2008
2. The Fedora Adoration Architecture?
(the mandatory joke at the start of a presentation)
3. Presentation Based on the Paper
Herbert Van de Sompel, Ryan Chute, Patrick Hochstenbach.
The aDORe Federation: Digital Repositories at Scale.
International Journal on Digital Libraries, Special Issue on Very Large Digital Libraries.
40 pages. In publication, 2008.
Preprint available at http://arxiv.org/abs/0803.4511
4. The aDORe Project: Background
• Initial motivation:
o Severe deficiencies in the new information discovery
environment developed for the LANL Research Library:
- Metadata-centric: descriptive metadata records first class citizens;
actual digital assets auxiliary data.
- Tens of millions of digital assets stored as files in file system.
- Tight integration between content collection and discovery application,
preventing other applications from leveraging the rich content base.
o Obvious solution:
- Replace metadata-centric approach by compound object approach.
- Bundle digital assets into storage containers that dramatically reduce
the number of files in a file system.
- Cleanly separate storage repository from applications that leverage the
stored assets by providing necessary machine interfaces.
• Implementation of the obvious solution led to the aDORe R&D
project 2003-2007
5. The aDORe Project: Major Drivers
• Concrete need to design and implement a solution to ingest,
store, and access the vast and growing collection of the LANL
Research Library.
o Scale, scale, scale!
o Existing open source solutions (at that time) did not meet our
scale requirements
- e.g. static binding of disseminators to objects in Fedora.
• Interest in repository interoperability, cf. involvement in OAI-PMH,
NISO OpenURL, OAI-ORE
• Interest in digital preservation, cf. NDIIPP funding
6. The aDORe Project: Major Design Constraints
• Leverage existing standards and technologies to make
development and migration more straightforward.
o Read: Laziness as a strategy
• Use a distributed, component based approach to meet
challenges of scale.
7. The aDORe Project: Some of the Results
• Conceptual:
o aDORe Federation architecture: A high-level, 3-Tier architecture
for the federation of distributed repositories.
• Concrete:
o The aDORe Archive storage solution (XMLtapes/ARCfiles) -
Tier-1 of the aDORe Federation architecture.
o The aDORe Federation software - an implementation of the
3-Tier architecture, with the aDORe Archive in Tier-1.
• And a lot more, but that’s not for today.
8. The aDORe Federation software
• Available today at:
http://african.lanl.gov/aDORe/aDOReFederation
• Thanks to Ryan Chute & Luydimilla Balakireva
• This is a major update to the aDORe Archive:
o Updates the Tier-1 aDORe Archive
o Implements the 3 Tiers of the architecture instead of only Tier-1
• In production at LANL Research Library for over 1 year
• Attractive for large collections of relatively stable objects
• Could be used as a plug-in storage component for IR solutions
9. The aDORe Archive @ LANL, March 31 2008
• 87,000,000 Compound Digital Objects
• 208,000,000 Stored bitstreams
• ~ 9,200 autonomous repositories:
o ~ 4,000 XMLtapes: XML-based serializations of Digital
Objects
o ~ 5,200 ARCfiles: bitstreams
• > 500,000,000 identifiers
• More about the aDORe Federation software later
10. The aDORe Federation Architecture: Goal
• Facilitate a uniform manner for client applications to discover and
access content objects available in a group of distributed
repositories.
• Single repository behavior for a group of distributed repositories.
• Note that these distributed repositories can very well be “hidden”
and that only the federated result is made “public”.
o Cf. Peter Murray-Rust departmental repositories
• Not about uniform approaches to add, update, delete objects in
repositories.
o Considered the responsibility of individual repositories.
o However, changes are made apparent to the federation.
11. aDORe Federation Architecture
12. Basic Design Choices
• All entities in the architecture are identified by means of URIs:
content objects, repositories, machine interfaces.
o Turns entities into uniquely identified Web resources
o Avoids unwanted identifier collisions, for example, for different
content objects from various repositories.
• Depending on use case, certain Content Objects are identified
by either:
o Protocol-based URIs
o Non-protocol-based URIs
• All machine-interfaces are (HTTP) protocol-based
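The distinction between protocol-based and non-protocol-based URIs can be illustrated with a minimal sketch. The set of schemes counted as "protocol-based" and the helper name `is_protocol_based` are assumptions made for illustration; the slides only state that machine interfaces are HTTP-based. The two sample identifiers follow the pattern of the DO-URI examples used elsewhere in the presentation.

```python
from urllib.parse import urlsplit

# Schemes treated here as "protocol-based". This set is an assumption for
# illustration; the slides do not enumerate schemes.
PROTOCOL_SCHEMES = {"http", "https"}


def is_protocol_based(uri: str) -> bool:
    """Return True if the URI scheme is one that can be dereferenced natively."""
    return urlsplit(uri).scheme.lower() in PROTOCOL_SCHEMES


for uri in ("http://some.repo.org/do/1234", "info:some-repo/do/1234"):
    kind = "protocol-based" if is_protocol_based(uri) else "non-protocol-based"
    print(f"{uri} -> {kind}")
```

A non-protocol-based URI such as an `info:` URI cannot be fetched directly, so it has to be passed as a parameter to one of the federation's HTTP interfaces.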
13. Content Objects
• 3 types of Content Objects:
o Digital Objects,
o Datastreams,
o Surrogates
• These types do not need to be natively embraced by
repositories; they are supported at federation-facing machine
interfaces. They are abstractions.
• Other properties of Content Objects may be expressed, but the
architecture only serves to convey them.
• Core enabling properties of the architecture:
o identification,
o location,
o time-stamp
of these Content Objects
14. Content Objects: Digital Object
• Cf. Kahn-Wilensky, cf. most repository systems
• A Digital Object is an identified aggregation of one or more
Datastreams and properties pertaining to the Datastreams and to
the aggregation.
• A Digital Object is the perspective of a repository’s native
compound object that is shared with the federation.
15. Content Objects: Digital Object
• Identification: DO-URI
o Inherited from other environment and/or minted by repository.
- Cf Internet Archive HTTP URIs of stored objects; identifiers assigned to
scanned images
o One or more per Digital Object
o Digital Objects with same DO-URI may exist in multiple repositories
- Cf paper with same DOI in multiple IRs; HTTP URI in Internet Archive
o Protocol-based, non-protocol-based
- http://some.repo.org/do/1234
- info:some-repo/do/1234
o Always treated as non-protocol-based, i.e. never resolved (in the
federation) using its native resolution protocol, but rather conveyed as
parameter in request against the federation’s machine interfaces.
- Cf Internet Archive.
• Time-stamping:
o Digital Objects change over time
o Changes communicated to federation via Surrogates
16. Content Objects: Surrogates
• A Surrogate is the serialization of a Digital Object into a machine-
readable representation that is made accessible by a repository.
e.g. DIDL, METS, ORE Atom or RDF/XML.
• Surrogates are the vehicles repositories use to keep the federation
informed about the availability of their Digital Objects and about
changes those Digital Objects undergo.
• One or more Surrogates can correspond with a Digital Object in a
federation:
o Digital Object with same URI may exist in multiple repositories
o Single repository may have multiple Surrogates for a Digital Object
• Minimally expresses:
o DO-URI of the DO it serializes,
o Datastream-URI/URLs of constituent Datastreams,
o Identifier of the Surrogate itself.
17. Content Objects: Surrogates
• Identification: Identifier minted by the repository that makes the
Surrogate available.
o Protocol-based: Surrogate-URL (native resolution)
o Non-protocol-based: Surrogate-URI (resolution via interfaces)
• Time-stamping: Surrogate-datetime changes when a change to a
Digital Object needs to be communicated to the federation.
Minimally when constituency changes.
• Update Policies:
o New Surrogate Policy:
- Change to Digital Object => New Surrogate, new Surrogate-
URI/URL, new Surrogate-datetime
- Previous Surrogate remains available
o Update Surrogate Policy
- Change to Digital Object => Update Surrogate, same Surrogate-
URI/URL, new Surrogate-datetime
- Previous Surrogate no longer available
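The two update policies can be contrasted with a small sketch. Everything here is hypothetical scaffolding rather than the aDORe software's data model: the `Surrogate` class, the two policy functions, the dictionary used as a store, and the `info:some-repo/sur/...` identifiers are all assumptions. The sketch only carries the three properties a Surrogate minimally expresses, plus its datetime.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class Surrogate:
    surrogate_id: str             # Surrogate-URI or Surrogate-URL
    do_uri: str                   # DO-URI of the Digital Object it serializes
    datastream_ids: tuple         # identifiers of the constituent Datastreams
    surrogate_datetime: datetime  # Surrogate-datetime


def new_surrogate_policy(store, old, datastream_ids, new_id, now):
    """Mint a new Surrogate; the previous one stays in the store."""
    store[new_id] = Surrogate(new_id, old.do_uri, tuple(datastream_ids), now)
    return store[new_id]


def update_surrogate_policy(store, old, datastream_ids, now):
    """Overwrite the Surrogate under the same identifier with a new datetime."""
    store[old.surrogate_id] = replace(
        old, datastream_ids=tuple(datastream_ids), surrogate_datetime=now
    )
    return store[old.surrogate_id]


# Hypothetical identifiers and timestamps, for illustration only.
t0 = datetime(2008, 4, 1, tzinfo=timezone.utc)
t1 = datetime(2008, 4, 2, tzinfo=timezone.utc)
first = Surrogate("info:some-repo/sur/1", "info:some-repo/do/1234",
                  ("info:some-repo/ds/5678",), t0)
changed = ("info:some-repo/ds/5678", "info:some-repo/ds/5679")

new_store = {first.surrogate_id: first}
new_surrogate_policy(new_store, first, changed, "info:some-repo/sur/2", t1)
print(sorted(new_store))   # identifiers available under the New Surrogate Policy

upd_store = {first.surrogate_id: first}
update_surrogate_policy(upd_store, first, changed, t1)
print(sorted(upd_store))   # identifiers available under the Update Surrogate Policy
```

Under the first policy the store ends up holding both versions; under the second, the identifier is unchanged and only the datetime signals to the federation that something changed.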
18. Content Objects: Datastreams
• A Datastream is a retrievable bitstream of whichever media type
made available by a repository to the federation.
• It is a perspective of a repository’s native bitstream that is shared
with an aDORe federation.
• Can be, e.g.:
o Dissemination of locally stored bitstream,
o Dissemination of externally stored bitstream,
o Result of applying a service to a (local or external) bitstream.
• A Datastream can be part of multiple Digital Objects, but there is
one repository in the federation that owns/serves it.
19. Content Objects: Datastreams
• Identification: Identifier minted by the repository that makes the
Datastream available.
o Protocol-based: Datastream-URL (native resolution)
o Non-protocol-based: Datastream-URI (resolution via interfaces)
• Time-stamping: Datastream-datetime changes when a change to
a Datastream needs to be communicated to the federation.
• Update Policies:
o New Datastream Policy:
- Update of retrievable bitstream => new Datastream, new
Datastream-URI/URL, new Datastream-datetime
- Cf. digital preservation: migration of a JPEG image (URI-1)
leads to JPEG-2000 (URI-2), original JPEG maintained.
o Update Datastream Policy
- Update of retrievable bitstream => same Datastream, same
Datastream-URI/URL, new Datastream-datetime
- Original bitstream no longer available
20. aDORe Federation Architecture: Tier-1
21. Tier-1: Surrogate and (sometimes) Datastream Repositories
• Surrogate Repositories, Datastream Repositories as well as their
Interfaces identified by URI
• Interfaces leverage identification, time-stamping of Content Objects
• Datastream Repository only when using (non-protocol-based)
Datastream-URIs
22. aDORe Federation Architecture: Tier-1
23. aDORe Federation Architecture: Tier-1
24. Tier-1: Locate Surrogates
• Use: Repositories that have multiple Surrogates for a given
Digital Object, or that have Digital Objects that share
Datastreams.
• http://some.repo.org/openurl?url_ver=z39.88-2004&rft_id=http://some.repo.org/ds/5678&svc_id=info:ourfederation/svc/LocateSurrogates
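A request of the kind shown above could be assembled as follows. This is a minimal sketch: the helper name `build_openurl` is hypothetical, and the parameter values are taken from the slide's example. Note that `urlencode` percent-encodes the identifier values, whereas the slide shows them unencoded for readability.

```python
from urllib.parse import urlencode


def build_openurl(resolver_url: str, rft_id: str, svc_id: str) -> str:
    """Build an OpenURL request carrying a referent identifier and a service identifier."""
    params = {
        "url_ver": "z39.88-2004",  # version string as written on the slide
        "rft_id": rft_id,          # identifier the request is about
        "svc_id": svc_id,          # requested service
    }
    return f"{resolver_url}?{urlencode(params)}"


locate_url = build_openurl(
    "http://some.repo.org/openurl",
    "http://some.repo.org/ds/5678",
    "info:ourfederation/svc/LocateSurrogates",
)
print(locate_url)
```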
25. Tier-1: Obtain Datastream
• Use: For repositories that use Datastream-URIs, not
Datastream-URLs
• http://some.repo.org/openurl?url_ver=z39.88-2004&rft_id=info:some-repo/ds/5678&svc_id=info:ourfederation/svc/ObtainDatastream
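On the receiving side, a repository would need to read the `rft_id` and `svc_id` parameters and route the request to the matching service. The sketch below shows one way to do that; the handler function, the `HANDLERS` table, and the tuple it returns are hypothetical placeholders, not the aDORe implementation.

```python
from urllib.parse import parse_qs, urlsplit


def parse_openurl(url: str) -> dict:
    """Return the first value of each query parameter in the request URL."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


def obtain_datastream(rft_id: str):
    # Placeholder: a real repository would return the bitstream here.
    return ("ObtainDatastream", rft_id)


# Hypothetical routing table from service identifier to handler.
HANDLERS = {"info:ourfederation/svc/ObtainDatastream": obtain_datastream}


def dispatch(url: str):
    """Route an OpenURL request to the handler registered for its svc_id."""
    params = parse_openurl(url)
    handler = HANDLERS.get(params.get("svc_id"))
    if handler is None:
        raise ValueError(f"unsupported svc_id: {params.get('svc_id')!r}")
    return handler(params["rft_id"])


request = (
    "http://some.repo.org/openurl?url_ver=z39.88-2004"
    "&rft_id=info:some-repo/ds/5678"
    "&svc_id=info:ourfederation/svc/ObtainDatastream"
)
print(dispatch(request))
```

The Datastream-URI is never resolved natively; it only travels as the value of `rft_id`.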
26. aDORe Federation Architecture: Tier-2
27. Tier-2: Service Registry
• Keeps track of all components in the federation. In essence 2
look-up tables.
• Look-up table 1:
o URI of component (e.g. Repository-URI)
o Matching Interface-URIs (and Interface type)
• Look-up table 2:
o Interface-URI
o Interface-URL
• Exposes minimally 1 interface, Obtain Registry Record
• Cf Information Environment Service Registry
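The two look-up tables can be pictured as two dictionaries joined on the Interface-URI. This is a sketch under stated assumptions: the Repository-URI, the Interface-URI, the interface type label, and the shape of the returned record are all invented for illustration, and plain dictionaries stand in for whatever storage the registry actually uses.

```python
# Look-up table 1: component URI -> list of (Interface-URI, interface type).
# All identifiers below are hypothetical.
COMPONENT_INTERFACES = {
    "info:ourfederation/repo/a": [
        ("info:ourfederation/if/a-openurl", "ObtainDatastream"),
    ],
}

# Look-up table 2: Interface-URI -> Interface-URL.
INTERFACE_URLS = {
    "info:ourfederation/if/a-openurl": "http://some.repo.org/openurl",
}


def obtain_registry_record(component_uri: str) -> list:
    """Join both tables to list the interfaces of a component and their URLs."""
    return [
        {
            "interface_uri": interface_uri,
            "type": interface_type,
            "interface_url": INTERFACE_URLS[interface_uri],
        }
        for interface_uri, interface_type in COMPONENT_INTERFACES.get(component_uri, [])
    ]


record = obtain_registry_record("info:ourfederation/repo/a")
print(record)
```

Keeping the Interface-URL in a separate table means an interface can move to a new network location without changing its identifier.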
28. Tier-2: Identifier Locator
• Look-up table:
o Identifiers of Content Objects in the federation
o Identifiers of Datastream or Surrogate Repositories that make
these Content Objects accessible
o Necessarily will store this information for non-protocol-based
identifiers
- Minimally DO-URI (remember: treated as non-protocol-based)
• Populated by recurrently interacting with Harvest Surrogates and
Harvest Datastream Identifier interfaces of all Tier-1 repositories.
• Identifier Locator knows about these interfaces via the Service
Registry.
• Exposes minimally 1 interface, Locate Repositories:
o In: DO-URI, Surrogate-URI, Datastream-URI
o Out: List of Repository-URIs
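The Identifier Locator's behaviour can be sketched as an index from identifier to the set of repositories that expose it. The class name, method names, and Repository-URIs below are hypothetical, and an in-memory dictionary is used purely as a stand-in for the real storage. Harvesting itself is not modelled; `record_harvest` simply accepts the identifiers a harvest would have returned.

```python
from collections import defaultdict


class IdentifierLocator:
    """Maps Content Object identifiers to the repositories that make them accessible."""

    def __init__(self):
        self._index = defaultdict(set)

    def record_harvest(self, repository_uri: str, identifiers) -> None:
        """Register identifiers obtained from harvesting one Tier-1 repository."""
        for identifier in identifiers:
            self._index[identifier].add(repository_uri)

    def locate_repositories(self, identifier: str) -> list:
        """Locate Repositories: identifier in, list of Repository-URIs out."""
        return sorted(self._index.get(identifier, ()))


locator = IdentifierLocator()
# The same DO-URI may be exposed by more than one repository.
locator.record_harvest("info:ourfederation/repo/a", ["info:some-repo/do/1234"])
locator.record_harvest("info:ourfederation/repo/b", ["info:some-repo/do/1234",
                                                     "info:some-repo/ds/5678"])
print(locator.locate_repositories("info:some-repo/do/1234"))
```

A client would then look up each returned Repository-URI in the Service Registry to find the interface to call.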
29. aDORe Federation Architecture: Tier-3
30. aDORe Federation software
31. The aDORe Archive, new release
• An updated aDORe Archive Installer containing:
o XMLtape Toolkit (updated)
o ARCfile Toolkit (updated)
o XMLtape Registry (updated)
o ARCfile Registry (updated)
o XMLtape OpenURL Resolver (new)
- Provides Core Surrogate Services (e.g. Obtain, Locate, Harvest
Identifiers) through a simple and efficient OpenURL Service Interface.
o XMLtape OpenURL XQuery Resolver (new)
- Provides a configuration-based solution for complex ad-hoc queries.
Built upon Nux <http://dsd.lbl.gov/nux/>, an open-source Java toolkit
that provides a scalable solution for non-index-based search of large
XML repositories
32. The aDORe Archive, new release
• A new aDORe Federation Installer providing:
o An aDORe Archive installation
o IESR-based Service Registry (new)
- Uses the Ockham Service Registry IESR-based database schema and
provides OAI-PMH and OpenURL Services.
o Identifier Locator (new)
- A fast, in-memory MySQL-based solution used for efficient resolution of
Digital Object, Datastream, and Surrogate URIs to Repository URIs.
o OAI-PMH Federator (new)
- Provides access to multiple aDORe Archive installations through a
common OAI-PMH interface.
o OpenURL Disseminator (new)
- OpenURL Service interface that provides federated access to all repository
content and performs dynamic dissemination services using a
rule-engine-based plug-in framework.
33. The aDORe Archive, new release
• The documentation will provide:
o Detailed presentations of available service interfaces and
available pluggable interfaces.
o A sample ingestion of DIDL content to illustrate the various key
service interfaces available in the aDORe Federation
o A tutorial showing how to create processing implementations
and the necessary configurations
o A public demo version of the aDORe Federation using public
domain content (coming soon)