Oxford Common File Layout (OCFL)

Oxford Common File Layout
Rosalyn Metz (Emory),
Simeon Warner (Cornell)
Samvera Connect 2018
http://bit.ly/ocfl-samcon2018

Not just us...
OCFL Editorial Group
● Andrew Hankinson (Oxford)
● Neil Jefferies (Oxford)
● Julian Morley (Stanford)
● Andrew Woods (DuraSpace)
● and us (Rosalyn and Simeon)
Community input from pasig-discuss and
ocfl-community groups, and from others

BagIt
Well established and implemented specification for handling sets of files
● Being formally standardized as RFC:
https://tools.ietf.org/html/draft-kunze-bagit-17
● Used for transfer and (somewhat less) for files at rest
● Good fixity support
● No explicit versioning support
○ Could use local conventions for version inside a bag
○ Could use bag-per-version

Moab: A Brief History
Slides adapted from Julian Morley's in the OR2018 OCFL presentation
● Moab is the closest ancestor of OCFL
● Developed at Stanford Libraries by Richard Anderson
○ Article: http://journal.code4lib.org/articles/8482
● Named after Moab, UT

Moab: A Brief History
● Moab is a versioned, forward delta file
structure that supports fixity and file
de-duplication.
● You can preserve anything with it (even
cat pictures found on the internet)
● The tools to manage and create Moabs
are an open source Ruby gem
○ https://github.com/sul-dlss/moab-versioning

Moab is part of the
Stanford Digital Repository
Here be Moabs!

Moab in Practice @ Stanford
We have many Moabs in the SDR
● 1.6 million Moab objects
● 5 million version directories
● 50+ million files
● 500+ TB of data (25TB added last month)
● Spread across 15 NFS volumes on NetApp filers
● Backed up by IBM Spectrum Protect (formerly TSM)
○ 1 tape copy kept in local tape frame; 1 sent to Iron Mountain

A sample Moab object on disk
ab123cd4567
├── v0001
│   ├── data
│   │   ├── content
│   │   │   ├── title.jpg
│   │   │   ├── intro.jpg
│   │   │   ├── page1.jpg
│   │   │   ├── page2.jpg
│   │   │   └── page3.jpg
│   │   └── metadata
│   │       ├── versionMetadata.xml
│   │       ├── descMetadata.xml
│   │       └── identityMetadata.xml
│   └── manifests
│       ├── versionInventory.xml
│       ├── signatureCatalog.xml
│       ├── versionAdditions.xml
│       ├── fileInventoryDifference.xml
│       └── manifestInventory.xml
└── v0002
    ├── data
    │   ├── content
    │   │   └── page2.jpg
    │   └── metadata
    │       ├── versionMetadata.xml
    │       └── technicalMetadata.xml
    └── manifests
        ├── versionInventory.xml
        ├── signatureCatalog.xml
        ├── versionAdditions.xml
        ├── fileInventoryDifference.xml
        └── manifestInventory.xml
Two version directories: /v0001 & /v0002
/data content comes from upstream and could be anything, but our systems create data in /content and /metadata directories.
/manifests directories are for Moab metadata. This is where we store all the checksums and change information for deduplication and forward deltas.

CULAR @ 2017
It worked, what
now?
● Fedora 3 no longer being
developed, Fedora 4 not an
appropriate option
● Decision not to buy
"preservation services",
primarily on cost grounds
● Decision that we want one local
copy for legal access reasons
Short term ⇒ use local disk and
AWS S3. Build tools over
filesystem and object stores

Those files sure
are piling up!
Nearly 100TB now, planning
100TB/year digitization
● Plan to purchase a scalable
local (object) storage system
for 1 copy
● Two more copies in cloud
(perhaps tape)
● Content will outlast any
application or software system
● Content will outlast any storage
system
● Expect change and hence
migration ⇒ KISS

OCFL object
OCFL storage root(s)?

Shared Cornell and OCFL Goals
● Provide an application and vendor neutral storage arrangement that can be
used with filesystems and object stores
○ Allow easy replication between multiple storage environments
○ Allow easy migration between storage systems (modulo the inherent burdens)
○ Allow use with multiple and changing applications
● Support package versioning at low cost (complexity and storage use)
● Support internal package validation for completeness and fixity
● Support audit and self-description of entire store
● Have an easy migration path from current archival storage arrangements
● Develop a shared model that is useful at multiple institutions so that all benefit
from community developed tools and expertise.

Lessons from Emory: Deliverables
Actively engaged in a multi-year effort to gather requirements, design, and
develop a digital repository based on the Samvera framework.
Selected deliverables included...
Develop object definitions/types (e.g.
collections, objects, other entities) and their
relationships to one another; determine
preservation objects inside and outside of
Fedora.
Identify needs for AIP structure.
Identify storage requirements (e.g. number of
copies, file access scenarios)

Lessons from Emory: Identified requirements
The means to distribute digital objects to third-party preservation services.
A well understood and well documented model for storing digital objects.
Ability to place multiple copies of digital objects into diverse storage services
(AWS, local storage, etc.).
Easily allow for fixity checking of digital objects.

Digital Object
Content Files (Primary or Supplemental)
● Content file 1
● Content file 2
● Content file 3...
● … + additional
The content itself: relationships provided in structural metadata
Metadata (Actionable/Indexed)
● Desc. metadata
● Technical metadata (File-level)
● Preservation Events/Audits
● Administrative metadata
● Structural metadata (PCDM)
Metadata converted to RDF for Hyrax/Fedora - editable and/or searchable
Supplemental Preservation Files (Metadata/Administrative Files)
● Source Metadata (binary file)
● Desc. Metadata record (binary file)
● METS (binary file)
● License/agreement (binary file)
● Supplemental PREMIS (binary file)
Variable supplemental info stored as files (not directly system-readable): staff can view or download file to read it
Major Emory Entities PCDM Context - Simple Example
● Collections reflect the process the libraries followed when deciding to collect materials.
● Digital Objects
○ must be a part of an Administrative Collection and optionally in one or more Collections
○ may contain one or more files
● Digital Objects and Collections receive Emory-defined metadata and relationships
● Individual Agreements
○ contain information about the Administrative Collection
○ may contain one or more files
○ are assigned to objects through their parent Collection
Example entities in the diagram (linked by "Is a member of" relationships):
● Collection: Ancient Egyptian Collection
● Administrative Collection: Carlos Museum Administrative
● Individual Agreement: Carlos Museum Agreement
● Digital Object: Statuette of a Cat.
● Collection: Divine Felines Exhibition
OCFL Requirements
1) Completeness, so that a repository can be
rebuilt from the files it stores,
2) Parsability, both by humans and machines,
most importantly in the absence of original
software,
3) Robustness, against errors, corruption, and
migration between storage technologies, and
4) Storage, on a variety of infrastructures
including cloud object stores.
Many existing digital preservation standards like:
● TDR (ISO 16363)
● OAIS (ISO 14721)
● NDSA Levels of Preservation
● BagIt
discuss the need for these requirements, but none provides a standardized way to meet them.

OCFL the specification
https://ocfl.io/draft/spec/

OCFL Object
A group of one or more content files and
administrative information identified by a
URI.
The object may contain a sequence of versions
of the files organized into version directories.
The base directory of the object may contain a
logs directory.
A NAMASTE file indicating conformance.
An object contains an inventory digest file
which provides a digest for the
inventory.json file.
[object root]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1
│   ├── empty.txt
│   ├── foo
│   │   └── bar.xml
│   ├── image.tiff
│   ├── inventory.json
│   └── inventory.json.sha512
├── v2
│   ├── foo
└── v3
    └── inventory.json.sha512
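The inventory digest file described above can be checked with a few lines of Python. This is a minimal sketch, not a validator: the function name `inventory_digest_ok` is hypothetical, and it assumes the sidecar file's first whitespace-separated token is the hex sha512 digest of `inventory.json` (the exact sidecar layout should be confirmed against the spec draft).

```python
import hashlib
from pathlib import Path


def inventory_digest_ok(object_root: Path) -> bool:
    """Return True if inventory.json matches its inventory.json.sha512 sidecar."""
    inventory = object_root / "inventory.json"
    sidecar = object_root / "inventory.json.sha512"
    # Assumption: the first whitespace-separated token in the sidecar is the
    # hex digest; anything after it (e.g. a filename) is ignored.
    expected = sidecar.read_text().split()[0].lower()
    actual = hashlib.sha512(inventory.read_bytes()).hexdigest()
    return actual == expected
```

The same check could be run on the inventory copy inside each version directory by passing that directory instead of the object root.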

OCFL Object
An object contains an inventory.json file
which inventories the contents of an object.
The manifest block lists all the digests and
existing file paths for all of the object’s content.
The versions block identifies the logical file path
and the digest for each version of the object’s
content.
Separating the logical file path from the
existing file path and using digests to refer to
files allows for deduplication of content.
{
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "manifest": {
    "4d27c8...b53": [ "v2/foo/bar.xml" ],
    "7dcc35...c31": [ "v1/foo/bar.xml" ],
    "cf83e1...a3e": [ "v1/empty.txt" ],
    "ffccf6...62e": [ "v1/image.tiff" ]
  },
  "type": "Object",
  "versions": [
    {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import",
      "state": {
        "7dcc35...c31": [ "foo/bar.xml" ],
        "cf83e1...a3e": [ "empty.txt" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "type": "Version",
      "user": {
        "address": "alice@example.com",
        "name": "Alice"
      },
      "version": "v1"
    },
    {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml, remove image.tiff,

OCFL Storage Root
The base directory of an OCFL storage layout.
Should also contain the OCFL specification in
human-readable plain-text format.
Should contain the conformance declaration
OCFL Objects may conform to the same or
earlier version of the specification.
The storage hierarchy must terminate with an
OCFL Object Root.
[storage root]
├── 0=ocfl_1.0
├── ocfl_1.0.txt (optional)
├── ab12cd34
│   ├── 0=ocfl_object_1.0
│   ├── inventory.json.sha512
│   └── v1
│       ├── file.txt
└── ef56gh78
    ├── 0=ocfl_object_1.0
    ├── v1
    │   ├── foo
    └── v2
        ├── foo
        │   └── bar.xml

OCFL Storage Root
Storage hierarchies must not include files
within intermediate directories
Storage hierarchies must be terminated by
OCFL Object Roots
Storage hierarchies within the same OCFL
Storage Root should use just one layout
pattern
Storage hierarchies within the same OCFL
Storage Root should consistently use either a
directory hierarchy of OCFL Objects or
top-level OCFL Objects
[storage root]
├── 0=ocfl_1.0
└── ab
    └── 12
        └── cd
            └── 34
                └── ab12cd34
                    ├── 0=ocfl_object_1.0
                    ├── v1
                    │   ├── foo
                    └── v2
                        ├── foo
                        │   └── bar.xml
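The rules above suggest a simple way to discover objects: walk the storage root and stop at every directory that holds an object conformance declaration. The sketch below assumes declarations are files whose names start with `0=ocfl_object_` (as in the trees on these slides); `find_object_roots` is a hypothetical helper. It also reports files found in intermediate directories, which the rules above disallow. It does not flag hierarchies that fail to terminate in an object root.

```python
import os
from pathlib import Path

# Assumption: object roots are marked by a file named like "0=ocfl_object_1.0".
OBJECT_DECLARATION_PREFIX = "0=ocfl_object_"


def find_object_roots(storage_root):
    """Return (object_roots, stray_files) found under storage_root."""
    storage_root = Path(storage_root)
    object_roots, stray_files = [], []
    for dirpath, dirnames, filenames in os.walk(storage_root):
        here = Path(dirpath)
        if any(name.startswith(OBJECT_DECLARATION_PREFIX) for name in filenames):
            object_roots.append(here)
            dirnames.clear()  # do not descend into the object itself
        elif here != storage_root:
            # Files in an intermediate directory (not the storage root, not an object).
            stray_files.extend(here / name for name in filenames)
    return sorted(object_roots), sorted(stray_files)
```

Because the walk never enters an object root, the cost depends on the number of objects and intermediate directories, not on the number of content files.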

OCFL implementation patterns
https://ocfl.io/draft/implementation-notes/

Rebuildability
● Key OCFL goal -- be able to rebuild repo
from an OCFL storage root
● Therefore, in OAIS terms: must include
all the descriptive, administrative,
structural, representation, and
preservation metadata relevant to the
object.
● Optionally include copy of spec in top level
of OCFL storage root
● More complete option would be a specific
OCFL object that contains this
documentation and to have a pointer to its
location in the storage root.
Filesystem metadata
e.g. permissions, access, and creation times
● not portable between filesystems
● not preservable through file transfer operations
● ill-defined fixity
⇒ out-of-scope
If important, use filesystem image format or extract as metadata

Empty Directories
● OCFL preserves files and their
content
● Directories serve as an
organizational convention
● Empty directories not directly
supported
⇒ Use zero-length `.keep` file as
necessary (à la `git`, BagIt)
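The `.keep` convention can be applied before content is written into an object. A minimal sketch, assuming the content is staged in an ordinary directory first; `add_keep_files` is a hypothetical helper name:

```python
import os
from pathlib import Path


def add_keep_files(content_dir):
    """Create a zero-length .keep file in every empty directory under content_dir."""
    created = []
    for dirpath, dirnames, filenames in os.walk(content_dir):
        if not dirnames and not filenames:
            keep = Path(dirpath) / ".keep"
            keep.touch()  # zero-length placeholder so the directory survives
            created.append(keep)
    return created
```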
Data and Metadata
Only special files are the inventory, its digest file, and conformance declaration files
Otherwise OCFL makes no distinction between different types of files.
⇒ Use local conventions as needed

Storage
● Filesystem or Object Store -- you choose
● Original filename or Normalized filename -- you choose
● Deduplication & Forward delta differencing (at file level) --
optional but likely desirable/normal
"logical file path" - path of file in content as part of state for a particular version
"existing file path" - path of file in OCFL object
content addressing ties these two together

Storage Root Hierarchy - flat, pairtree, ex-wye-zee
[storage_root]
├── 0=ocfl_1.0
├── d45be626e024
|   ├── 0=ocfl_object_1.0
|   ├── inventory.json
|   ├── inventory.json.sha512
|   └── v1...
├── d45be626e036
|   └── v1...
├── 3104edf0363a
|   └── v1...
[storage_root]
├── 0=ocfl_1.0
├── d4
|   └── 5b
|       └── e6
|           └── 26
|               └── e0
|                   ├── 24
|                   |   └── d45be626e024
|                   |       ├── 0=ocfl_object_1.0
|                   |       └── ...
|                   └── 36
|                       └── d45be626e036
|                           ├── 0=ocfl_object_1.0
|                           └── ...
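The two layouts above differ only in how an object identifier is turned into a directory path. A sketch of both mappings, under the assumption that identifiers are already filesystem-safe (no character escaping is attempted, which a full pairtree implementation would need); the function names are hypothetical:

```python
def flat_path(object_id):
    """Flat layout: the object root sits directly under the storage root."""
    return object_id


def pair_path(object_id, width=2):
    """Pairtree-style layout: split the id into fixed-width segments,
    then use the full id as the final (object root) directory."""
    segments = [object_id[i:i + width] for i in range(0, len(object_id), width)]
    return "/".join(segments + [object_id])
```

With the identifier from the slide, `pair_path("d45be626e024")` gives `d4/5b/e6/26/e0/24/d45be626e024`, matching the second tree.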

File operations
(mungification?)
● Inheritance
● Addition
● Updating
● Renaming
● Deletion
● Reinstatement
● Purging ⇒ choices:
a. rebuild new object
b. break immutability and
rewrite (not recommended)
Yes - OCFL supports that...
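Most of these operations only change a version's state (digest to logical paths) and store no new content. A minimal sketch of renaming and deletion as pure functions on a state dictionary; the function names and the digests `d1` and `d2` in the usage note are placeholders, not values from any real object:

```python
import copy


def rename(state, old_path, new_path):
    """Return a new state in which old_path is known as new_path (same digest)."""
    new_state = copy.deepcopy(state)
    for paths in new_state.values():
        if old_path in paths:
            paths[paths.index(old_path)] = new_path
            return new_state
    raise KeyError(old_path)


def delete(state, path):
    """Return a new state without path; stored content is left untouched."""
    new_state, found = {}, False
    for digest, paths in state.items():
        kept = [p for p in paths if p != path]
        found = found or len(kept) != len(paths)
        if kept:
            new_state[digest] = kept
    if not found:
        raise KeyError(path)
    return new_state
```

Starting from `{"d1": ["foo/bar.xml"], "d2": ["image.tiff"]}`, renaming `foo/bar.xml` to `foo/baz.xml` and deleting `image.tiff` yields `{"d1": ["foo/baz.xml"]}`. The file behind `d2` remains in the earlier version directory, so reinstating it later is just a matter of listing `d2` in a new version's state again.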

Version Immutability
OCFL supports systems where versions (everything in a given version directory) are immutable once written.
● It is recommended to follow this practice
● BUT you can rewrite objects if you really want to
Deduplication
OCFL supports (in fact, enforces for internal references) deduplication through digests
● Only within an object
● File level
● sha512 digest recommended

Forward Delta
Each version need only include new
and changed files
● Files from previous version
included by reference
● Reference by content (digest)
supports renaming without
duplicating
(You can avoid this and include files again if you
really want. But why?)
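The forward-delta idea can be sketched as a function that builds the next version from a set of files and stores only content whose digest is not yet in the manifest. This is an illustration under assumptions: `add_version` is a hypothetical helper, files are passed in memory as bytes, and new content is stored at `<version>/<logical path>` as in the examples on the earlier slides.

```python
import hashlib


def add_version(manifest, version, files):
    """Build a new version from files (logical path -> bytes).

    Returns (new_manifest, state, to_store), where to_store holds only the
    content that is new to the object (existing path -> bytes).
    """
    new_manifest = {digest: list(paths) for digest, paths in manifest.items()}
    state, to_store = {}, {}
    for logical_path, content in sorted(files.items()):
        digest = hashlib.sha512(content).hexdigest()
        state.setdefault(digest, []).append(logical_path)
        if digest not in new_manifest:
            # New or changed content: store it in this version's directory.
            existing = f"{version}/{logical_path}"
            new_manifest[digest] = [existing]
            to_store[existing] = content
        # Otherwise the content is included by reference to an earlier version.
    return new_manifest, state, to_store
```

A renamed file has an unchanged digest, so it appears in the new state under its new logical path while `to_store` stays empty for it.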
Fixity
1. Digests used for reference already provide basis for strong fixity checks (pref. sha512)
2. Additional digests may be included to support legacy fixity information (e.g. md5)
(Fixity of inventory files themselves handled by sidecar file, e.g. inventory.json.sha512)
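Because the manifest is keyed by digest, a fixity check amounts to recomputing each stored file's digest and comparing it with its key. A minimal sketch, assuming the manifest maps hex digests to paths relative to the object root; `file_digest` and `fixity_failures` are hypothetical helper names:

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha512", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def fixity_failures(object_root, manifest, algorithm="sha512"):
    """Return the existing file paths that are missing or whose digest differs."""
    failures = []
    for digest, paths in manifest.items():
        for relative in paths:
            path = Path(object_root) / relative
            if not path.is_file() or file_digest(path, algorithm) != digest.lower():
                failures.append(relative)
    return sorted(failures)
```

The same function could be pointed at a separate digest-to-paths map with `algorithm="md5"` to check legacy fixity values, assuming those are recorded in the same shape.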

Log Information
logs directory in OCFL object available for information not in the object's content and not versioned
● form not specified
● will be ignored in object validation
Small Files
Objects with many small files may cause problems with some storage infrastructures and may make validation/fixity time consuming
● package in single file (ZIP recommended)
(Options for a later version of the OCFL spec are ZIPped objects and/or ZIP by version)
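Packaging many small files into one ZIP before adding them to an object can be done with the Python standard library. A minimal sketch: `zip_directory` is a hypothetical helper, the choice of uncompressed storage (`ZIP_STORED`) is an assumption made here for simplicity, and the ZIP should be written outside the directory being packed.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir, zip_path):
    """Pack every file under source_dir into a single ZIP at zip_path."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir, with forward slashes.
                archive.write(path, arcname=path.relative_to(source_dir).as_posix())
    return Path(zip_path)
```

The resulting ZIP is then a single content file as far as the object is concerned, so it gets one manifest entry and one digest.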

Roadmap
Alpha (yesterday)
● Released(ish) on October 10 community call
(OCFL Editors and PASIG Discuss)
● Feedback for November community call
Beta (date based on feedback)
● Experimental validation tool
● Determine which other groups and communities to
seek input from
Release 1.0 (2019)
● One production-ready validator
● Test suite and fixture objects
● Two institutions committed to backing the
initiative (should define that)

Thank You
https://ocfl.io
https://github.com/OCFL
ocfl-community@googlegroups.com
