Digital repositories allow for the storage and management of digital publications and related content beyond simple PDF files. They support complex, heterogeneous publications that may include various media types and relationships between components. Repository systems like Fedora, EPrints and DSpace provide services for ingesting, preserving, discovering and accessing publications and their related content and metadata over time while maintaining identifiers and workflows. Repositories aim to enable reuse of content and establish policies around ownership, access, and long-term preservation of information within a networked scholarly communications environment.
1. How do Digital Repositories Work?
Thornton Staples
Fedora Commons, Inc.
2. What do we mean by “Institutional
Repository”?
• Is it a place to cherish some PDF files?
• Is it about use and re-use of the content or just
preservation?
• Is a repository part of a network of inter-related
repositories?
• Is a repository used to construct and manage the
publication from the beginning?
3. What is the nature of the
“publication”?
• A PDF file?
• An illustrated narrative?
• A science article that includes datasets?
• A virtual exhibition?
• An electronic critical edition?
• Does it support annotation and other social
participation?
4. Beyond a PDF....
• These entities are usually heterogeneous combinations
of more than one type of content
• They can include complex relationships among the
content components
• They will increasingly include components that are not
under the control of the author or sponsoring institution
• Increasingly, they will include as components, or have
important relationships to, born-digital content, not just
surrogates for physical objects
5. Repository Systems
• EPrints is a vertically integrated application that is
specifically oriented around documents and articles
• DSpace is also a vertically integrated application that has
a more general conception of content
• If the content that you are managing fits their models, both
provide a reasonably complete system for managing it.
• Fedora is a foundation of services upon which many
information management applications can be built
• Applications built upon Fedora for domain-specific
information management are beginning to appear
6. Persistent Identifiers (PIDs)
• Names for resources that uniquely identify them
without respect to their location
• One of the main jobs of a repository is to manage
the content on the back-end, while publicly
maintaining the PIDs
• A repository can maintain a unique PID for an
object that is then exposed using one or more
schemes
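A minimal sketch of the PID idea above, assuming a hypothetical in-memory registry: the PID stays fixed while the back-end location changes, and the same PID can be exposed under more than one scheme. The class name `PidRegistry`, the PID `demo:1`, the host `repo.example.org`, and the URI templates are illustrative assumptions, not the API of any particular repository system.

```python
class PidRegistry:
    """Toy registry mapping location-independent PIDs to current storage locations."""

    # Hypothetical exposure schemes; the URI templates are illustrative only.
    SCHEMES = {
        "info": "info:fedora/{pid}",
        "http": "https://repo.example.org/objects/{pid}",
    }

    def __init__(self):
        self._locations = {}  # pid -> current back-end location

    def register(self, pid, location):
        """Assign a PID once; re-registering an existing PID is an error."""
        if pid in self._locations:
            raise ValueError(f"PID already registered: {pid}")
        self._locations[pid] = location

    def move(self, pid, new_location):
        """Change where the content lives without changing its public name."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_location

    def resolve(self, pid):
        """Return the current back-end location for a PID."""
        return self._locations[pid]

    def expose(self, pid, scheme):
        """Render the same PID under one of the supported public schemes."""
        self.resolve(pid)  # raises KeyError for unknown PIDs
        return self.SCHEMES[scheme].format(pid=pid)


registry = PidRegistry()
registry.register("demo:1", "/store/a/obj1")
registry.move("demo:1", "/store/b/obj1")  # back-end change, PID unchanged
print(registry.expose("demo:1", "info"))
print(registry.expose("demo:1", "http"))
```

Callers only ever hold the PID, so moving content on the back-end never breaks a published reference.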
7. Workflow
• Provide submission processes that enforce content
and metadata standards at ingestion
• Relationships among the content components must
be maintained throughout the process
• Audit trails for actions on the repository must be
maintained
• It is important to maintain versions of content upon
updating
• Should provide for review and approval processes
• Social processes require a complete workflow
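The workflow requirements above can be sketched roughly as follows. This is an illustration under stated assumptions, not a real repository's ingest pipeline: the required fields in `REQUIRED_FIELDS`, the class `ToyRepository`, and the audit record layout are all hypothetical. It shows metadata being checked at ingestion, prior versions being kept on update, and every action, including rejections, landing in an audit trail.

```python
from datetime import datetime, timezone

# Assumed metadata standard for this sketch; a real repository would
# validate against its own schema.
REQUIRED_FIELDS = ("title", "creator")


class ToyRepository:
    def __init__(self):
        self.versions = {}   # pid -> list of (metadata, content), oldest first
        self.audit_log = []  # append-only list of action records

    def _audit(self, action, pid, user):
        """Record who did what to which object, and when (UTC)."""
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "pid": pid,
            "user": user,
        })

    def ingest(self, pid, metadata, content, user):
        """Validate metadata, store a new version, and return its number."""
        missing = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
        if missing:
            self._audit("reject", pid, user)
            raise ValueError(f"missing required metadata: {missing}")
        action = "update" if pid in self.versions else "ingest"
        # Earlier versions are never overwritten, only appended to.
        self.versions.setdefault(pid, []).append((dict(metadata), content))
        self._audit(action, pid, user)
        return len(self.versions[pid])  # version numbers start at 1
```

A review and approval step could be added as a further state between validation and storage; it is left out here to keep the sketch short.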
8. Discovery and Access
• Expose metadata or full text to harvesting services,
such as Google's crawler or OAI-PMH harvesters
• Provide specialist access to indexed metadata or full
text
• As publications become more complex, more
metadata standards will have to be supported
• Endlessly federating searches does not seem to be
the answer
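As one concrete reading of "exposing metadata to harvesting services", assuming "OAI" here refers to the OAI-PMH protocol: a harvester asks a repository for records with a plain HTTP request. The sketch below only builds such a request URL. `ListRecords`, `metadataPrefix`, `set`, `from`, and the `oai_dc` prefix come from the OAI-PMH specification; the function name and the base URL are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # restrict the harvest to one set
    if from_date:
        params["from"] = from_date  # selective harvest: records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only to show the shape of the request.
print(list_records_url("https://repo.example.org/oai", from_date="2008-01-01"))
```

Supporting an additional metadata standard then amounts to answering the same request with a different `metadataPrefix`.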
9. Use and Re-use of Content
• The point of searching is finding!
• Content should be available in flexible ways to be
used by an array of tools
• Content discovered in one context should be
reusable in another
• The repository must be able to exchange content
with other repositories
10. Sustaining Digital Information
• Preserving the components such that they can be
reloaded in case of loss
• Keeping the data technically viable, migrating the
content to new encodings, as necessary.
• Vouching for the veracity and authenticity of the
information, both content and structure
• Relationships to content that is outside the
repository's control must be maintained or
gracefully degraded
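One common way to support the "vouching for authenticity" and "reload in case of loss" points is a fixity check: record a checksum for each component at ingest, then recompute it periodically and flag anything that is missing or has changed. The sketch below is an assumption about how that might look, not a description of any specific repository; the function names and the choice of SHA-256 are illustrative.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of a component's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_fixity(stored, manifest):
    """Return the sorted names of components that are missing or altered.

    stored:   component name -> bytes currently held by the repository
    manifest: component name -> checksum recorded at ingest
    """
    damaged = []
    for name, expected in manifest.items():
        data = stored.get(name)
        if data is None or checksum(data) != expected:
            damaged.append(name)
    return sorted(damaged)
```

Any component reported by `verify_fixity` is a candidate for reloading from a preserved copy.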
11. Establishing and Enforcing Policies
• Policies must be established for the entire life-cycle
of the information
– Ownership and workflow policies
– Access and use policies
– Policies associated with sustaining (or not!)
• Policy information can be based on assumptions
built into the software, encoded in metadata to be
handed off to other processes, and/or managed and
used as data in the repository
• Policies must be expressed for end users
• Policies must also be expressed for machine access
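To illustrate policy "managed and used as data in the repository", here is a minimal sketch in which one policy object serves both audiences: `permits` answers a machine's access question, and `describe` renders the same rule for an end user. The class `AccessPolicy`, its fields, the role names, and the embargo rule are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    allowed_roles: frozenset  # roles that may read the content
    embargo_year: int = 0     # closed to non-owners before this year; 0 = no embargo

    def permits(self, role, year):
        """Machine-facing check: may this role read the content in this year?"""
        if role == "owner":  # assumption: owners always have access
            return True
        return role in self.allowed_roles and year >= self.embargo_year

    def describe(self):
        """Human-facing rendering of the same policy data."""
        roles = ", ".join(sorted(self.allowed_roles)) or "nobody"
        text = f"Readable by: {roles}"
        if self.embargo_year:
            text += f" (from {self.embargo_year})"
        return text


policy = AccessPolicy(frozenset({"public", "staff"}), embargo_year=2010)
print(policy.describe())
print(policy.permits("public", 2009))
```

Keeping the rule in one place means the statement shown to users and the check applied to machine clients cannot drift apart.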
12. Towards Community Repositories
• The repository becomes the medium for scholarly
communication within a network
• Content creators own their content and control all
policies associated with it
• Authorship becomes a process of adding nodes and
arcs to a worldwide network of content
• Transfer ownership to other sophisticated
repositories (digital libraries?) to be sustained in the
long term
• Community repositories could be hosted by
publishers or professional societies