Oxford Common File Layout
Rosalyn Metz (Emory),
Simeon Warner (Cornell)
Samvera Connect 2018
http://bit.ly/ocfl-samcon2018
Not just us...
OCFL Editorial Group
● Andrew Hankinson (Oxford)
● Neil Jefferies (Oxford)
● Julian Morley (Stanford)
● Andrew Woods (DuraSpace)
● and us (Rosalyn and Simeon)
Community input from pasig-discuss and
ocfl-community groups, and from others
Closest parents: BagIt & Moab
BagIt
Well established and implemented specification for handling sets of files
● Being formally standardized as RFC:
https://tools.ietf.org/html/draft-kunze-bagit-17
● Used for transfer and (somewhat less) for files at rest
● Good fixity support
● No explicit versioning support
○ Could use local conventions for version inside a bag
○ Could use bag-per-version
Moab: A Brief History
Slides adapted from Julian Morley's slides in the OR2018 OCFL presentation
● Moab is the closest ancestor of OCFL
● Developed at Stanford Libraries by Richard Anderson
○ Article: http://journal.code4lib.org/articles/8482
● Named after Moab, UT
Moab: A Brief History
● Moab is a versioned, forward delta file
structure that supports fixity and file
de-duplication.
● You can preserve anything with it (even
cat pictures found on the internet)
● The tools to manage and create Moabs
are an open source Ruby gem
○ https://github.com/sul-dlss/moab-versioning
Moab is part of the
Stanford Digital Repository
Here be Moabs!
Moab in Practice @ Stanford
We have many Moabs in the SDR
● 1.6 million Moab objects
● 5 million version directories
● 50+ million files
● 500+ TB of data (25TB added last month)
● Spread across 15 NFS volumes on NetApp filers
● Backed up by IBM Spectrum Protect (formerly TSM)
○ 1 tape copy kept in local tape frame; 1 sent to Iron Mountain
ab123cd4567
├── v0001
│   ├── data
│   │   ├── content
│   │   │   ├── title.jpg
│   │   │   ├── intro.jpg
│   │   │   ├── page1.jpg
│   │   │   ├── page2.jpg
│   │   │   └── page3.jpg
│   │   └── metadata
│   │       ├── versionMetadata.xml
│   │       ├── descMetadata.xml
│   │       └── identityMetadata.xml
│   └── manifests
│       ├── versionInventory.xml
│       ├── signatureCatalog.xml
│       ├── versionAdditions.xml
│       ├── fileInventoryDifference.xml
│       └── manifestInventory.xml
└── v0002
    ├── data
    │   ├── content
    │   │   └── page2.jpg
    │   └── metadata
    │       ├── versionMetadata.xml
    │       └── technicalMetadata.xml
    └── manifests
        ├── versionInventory.xml
        ├── signatureCatalog.xml
        ├── versionAdditions.xml
        ├── fileInventoryDifference.xml
        └── manifestInventory.xml
two version directories; /v0001 & /v0002
A sample Moab object on disk
/data content comes from upstream and could
be anything, but our systems create data in
/content and /metadata directories.
/manifests directories are for Moab metadata.
This is where we store all the checksums and
change information for deduplication and
forward deltas.
Lessons from Cornell
CULAR @ 2017
It worked, what
now?
● Fedora 3 no longer being
developed, Fedora 4 not an
appropriate option
● Decision not to buy
"preservation services",
primarily on cost grounds
● Decision that we want one local
copy for legal access reasons
Short term ⇒ use local disk and
AWS S3. Build tools over
filesystem and object stores
Those files sure
are piling up!
Nearly 100TB now, planning
100TB/year digitization
● Plan to purchase a scalable
local (object) storage system
for 1 copy
● Two more copies in cloud
(perhaps tape)
● Content will outlast any
application or software system
● Content will outlast any storage
system
● Expect change and hence
migration ⇒ KISS
OCFL object
OCFL storage root(s)?
Shared Cornell and OCFL Goals
● Provide an application and vendor neutral storage arrangement that can be
used with filesystems and object stores
○ Allow easy replication between multiple storage environments
○ Allow easy migration between storage systems (modulo the inherent burdens)
○ Allow use with multiple and changing applications
● Support package versioning at low cost (complexity and storage use)
● Support internal package validation for completeness and fixity
● Support audit and self-description of entire store
● Have an easy migration path from current archival storage arrangements
● Develop a shared model that is useful at multiple institutions so that all benefit
from community developed tools and expertise.
Lessons from Emory
Lessons from Emory: Deliverables
Actively engaged in a multi-year effort to gather requirements, design, and
develop a digital repository based on the Samvera framework.
Selected deliverables included...
Develop object definitions/types (e.g.
collections, objects, other entities) and their
relationships to one another; determine
preservation objects inside and outside of
Fedora.
Identify needs for AIP structure.
Identify storage requirements (e.g. number of
copies, file access scenarios)
Lessons from Emory: Identified requirements
The means to distribute digital objects to third-party preservation services.
A well understood and well documented model for storing digital objects.
Ability to place multiple copies of digital objects into diverse storage services
(AWS, local storage, etc.).
Easily allow for fixity checking of digital objects.
Digital
Object
Content Files
(Primary or Supplemental)
Content file 1
Content file 2
Content file 3...
… + additional
… + additional
The content itself:
relationships provided in
structural metadata
Metadata (Actionable/Indexed)
Desc. metadata
Technical metadata (File-level)
Preservation Events/Audits
Administrative metadata
Structural metadata (PCDM)
Metadata converted to RDF
for Hyrax/Fedora - editable
and/or searchable
Supplemental Preservation Files
(Metadata/Administrative Files)
Source Metadata (binary file)
Desc. Metadata record (binary file)
METS (binary file)
License/agreement (binary file)
Supplemental PREMIS (binary file)
Variable supplemental info
stored as files (not directly
system-readable):
staff can view or download
file to read it
Major Emory Entities: PCDM Context - Simple Example
● Administrative Collections reflect the process the libraries followed when deciding to collect materials.
● Digital Objects must be part of an Administrative Collection and optionally in one or more Collections.
● Digital Objects may contain one or more files.
● Digital Objects and Collections receive Emory-defined metadata and relationships.
● Individual Agreements contain information about the Administrative Collection.
● Individual Agreements may contain one or more files.
● Individual Agreements are assigned to objects through their parent Collection.
Diagram entities, linked by four "Is a member of" relationships:
● Administrative Collection: Carlos Museum
● Individual Agreement: Carlos Museum Agreement
● Collection: Ancient Egyptian Collection
● Collection: Divine Felines Exhibition
● Digital Object: Statuette of a Cat.
Goals of OCFL
OCFL Requirements
1) Completeness, so that a repository can be
rebuilt from the files it stores,
2) Parsability, both by humans and machines,
most importantly in the absence of original
software,
3) Robustness, against errors, corruption, and
migration between storage technologies, and
4) Storage, on a variety of infrastructures
including cloud object stores.
Many existing digital preservation
standards like:
● TDR (ISO 16363)
● OAIS (ISO 14721)
● NDSA Levels of Preservation
● BagIt
discuss the need for these
requirements, but none provides a
standardized way to meet them.
OCFL the specification
https://ocfl.io/draft/spec/
OCFL Object
A group of one or more content files and
administrative information identified by a
URI.
The object may contain a sequence of versions
of the files organized into version directories.
The base directory of the object may contain a
logs directory.
A NAMASTE file indicating conformance.
An object contains an inventory digest file
which provides a digest for the
inventory.json file.
[object root]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1
│   ├── empty.txt
│   ├── foo
│   │   └── bar.xml
│   ├── image.tiff
│   ├── inventory.json
│   └── inventory.json.sha512
├── v2
│   ├── foo
│   │   └── bar.xml
│   ├── inventory.json
│   └── inventory.json.sha512
└── v3
    ├── inventory.json
    └── inventory.json.sha512
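The role of the inventory digest file can be illustrated with a minimal sketch. It assumes the sidecar's first whitespace-separated token is the hex SHA-512 of inventory.json (the layout used by common checksum tools); the function name `inventory_digest_ok` and the demo files are hypothetical, not part of the specification.

```python
import hashlib
import tempfile
from pathlib import Path


def inventory_digest_ok(object_root: Path) -> bool:
    """Compare the SHA-512 of inventory.json with the value in its sidecar file."""
    inventory = object_root / "inventory.json"
    sidecar = object_root / "inventory.json.sha512"
    actual = hashlib.sha512(inventory.read_bytes()).hexdigest()
    # Assumption: the first whitespace-separated token in the sidecar is the digest.
    recorded = sidecar.read_text().split()[0].lower()
    return actual == recorded


# Demo on a throwaway stand-in object root (not a complete OCFL object).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    payload = b'{"head": "v1"}\n'
    (root / "inventory.json").write_bytes(payload)
    (root / "inventory.json.sha512").write_text(
        hashlib.sha512(payload).hexdigest() + "  inventory.json\n"
    )
    before = inventory_digest_ok(root)
    # Simulate corruption of the inventory after the sidecar was written.
    (root / "inventory.json").write_bytes(b"tampered")
    after = inventory_digest_ok(root)

print(before, after)
```

Because each version directory carries its own inventory.json and sidecar, the same check could be run per version as well as at the object root.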
OCFL Object
An object contains an inventory.json file
which inventories the contents of an object.
The manifest block lists all the digests and
existing file paths for all of the object’s content.
The versions block identifies the logical file path
and the digest for each version of the object’s
content.
Separating the logical file path from the
existing file path and using digests to refer to
files allows for deduplication of content.
{
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "manifest": {
    "4d27c8...b53": [ "v2/foo/bar.xml" ],
    "7dcc35...c31": [ "v1/foo/bar.xml" ],
    "cf83e1...a3e": [ "v1/empty.txt" ],
    "ffccf6...62e": [ "v1/image.tiff" ]
  },
  "type": "Object",
  "versions": [
    {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import",
      "state": {
        "7dcc35...c31": [ "foo/bar.xml" ],
        "cf83e1...a3e": [ "empty.txt" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "type": "Version",
      "user": {
        "address": "alice@example.com",
        "name": "Alice"
      },
      "version": "v1"
    },
    {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml, remove image.tiff,
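How the manifest and versions blocks work together can be sketched in a few lines of Python. The dictionary below copies only the manifest and the v1 state from the draft inventory on this slide (digests stay abbreviated as shown); the helper name `resolve` is hypothetical, and the sketch assumes the draft shape in which "versions" is a list.

```python
# Excerpt of the slide's draft inventory: full manifest, v1 state only.
inventory = {
    "manifest": {
        "4d27c8...b53": ["v2/foo/bar.xml"],
        "7dcc35...c31": ["v1/foo/bar.xml"],
        "cf83e1...a3e": ["v1/empty.txt"],
        "ffccf6...62e": ["v1/image.tiff"],
    },
    "versions": [
        {
            "version": "v1",
            "state": {
                "7dcc35...c31": ["foo/bar.xml"],
                "cf83e1...a3e": ["empty.txt"],
                "ffccf6...62e": ["image.tiff"],
            },
        }
    ],
}


def resolve(inventory: dict, version: str, logical_path: str) -> str:
    """Map a logical file path in one version to an existing file path."""
    for entry in inventory["versions"]:
        if entry["version"] != version:
            continue
        # The state maps digest -> logical paths; the digest is the join key.
        for digest, logical_paths in entry["state"].items():
            if logical_path in logical_paths:
                # Any existing path recorded for the digest holds the same bytes.
                return inventory["manifest"][digest][0]
    raise KeyError(f"{logical_path!r} not found in {version}")


print(resolve(inventory, "v1", "foo/bar.xml"))
```

The lookup never touches the version directory name directly, which is why a later version can point at content stored under an earlier one.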
OCFL Storage Root
The base directory of an OCFL storage layout.
Should also contain the OCFL specification in
human-readable plain-text format.
Should contain the conformance declaration
OCFL Objects may conform to the same or
earlier version of the specification.
The storage hierarchy must terminate with an
OCFL Object Root.
[storage root]
├── 0=ocfl_1.0
├── ocfl_1.0.txt (optional)
├── ab12cd34
│   ├── 0=ocfl_object_1.0
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── v1
│       ├── file.txt
│       ├── inventory.json
│       └── inventory.json.sha512
└── ef56gh78
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1
    │   ├── empty.txt
    │   ├── foo
    │   │   └── bar.xml
    │   ├── image.tiff
    │   ├── inventory.json
    │   └── inventory.json.sha512
    └── v2
        ├── foo
        │   └── bar.xml
        ├── inventory.json
        └── inventory.json.sha512
OCFL Storage Root
Storage hierarchies must not include files
within intermediate directories
Storage hierarchies must be terminated by
OCFL Object Roots
Storage hierarchies within the same OCFL
Storage Root should use just one layout
pattern
Storage hierarchies within the same OCFL
Storage Root should consistently use either a
directory hierarchy of OCFL Objects or
top-level OCFL Objects
[storage root]
├── 0=ocfl_1.0
├── ocfl_1.0.txt (optional)
└── ab
    └── 12
        └── cd
            └── 34
                └── ab12cd34
                    ├── 0=ocfl_object_1.0
                    ├── inventory.json
                    ├── inventory.json.sha512
                    ├── v1
                    │   ├── empty.txt
                    │   ├── foo
                    │   │   └── bar.xml
                    │   ├── image.tiff
                    │   ├── inventory.json
                    │   └── inventory.json.sha512
                    └── v2
                        ├── foo
                        │   └── bar.xml
                        ├── inventory.json
                        └── inventory.json.sha512
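The two "must" rules above (hierarchies end at object roots, no files in intermediate directories) lend themselves to a simple walk of the storage root. This is a sketch only: the function name `scan_storage_root` is hypothetical, and it assumes an object root can be recognized by a file whose name starts with `0=ocfl_object_`, as in the trees on these slides.

```python
import os
import tempfile
from pathlib import Path

OBJECT_DECLARATION_PREFIX = "0=ocfl_object_"


def scan_storage_root(storage_root):
    """Return (object roots, files found in intermediate directories)."""
    storage_root = Path(storage_root)
    object_roots, stray_files = [], []
    for dirpath, dirnames, filenames in os.walk(storage_root):
        here = Path(dirpath)
        if any(name.startswith(OBJECT_DECLARATION_PREFIX) for name in filenames):
            object_roots.append(here)
            dirnames.clear()  # the hierarchy terminates here; do not descend
        elif here != storage_root:
            # Files below the storage root but above an object root break the rule.
            stray_files.extend(here / name for name in filenames)
    return sorted(object_roots), sorted(stray_files)


# Demo on a throwaway tree shaped like the slide's example.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "0=ocfl_1.0").touch()
    obj = root / "ab" / "12" / "cd" / "34" / "ab12cd34"
    (obj / "v1").mkdir(parents=True)
    (obj / "0=ocfl_object_1.0").touch()
    (obj / "v1" / "file.txt").write_text("x")
    (root / "ab" / "note.txt").write_text("should not be here")
    found, stray = scan_storage_root(root)
    found_rel = [p.relative_to(root).as_posix() for p in found]
    stray_rel = [p.relative_to(root).as_posix() for p in stray]

print(found_rel, stray_rel)
```

The same walk works for flat and nested layouts, since it relies only on the object declaration file and not on the directory pattern.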
OCFL implementation patterns
https://ocfl.io/draft/implementation-notes/
Rebuildability
● Key OCFL goal -- be able to rebuild repo
from an OCFL storage root
● Therefore, in OAIS terms: must include
all the descriptive, administrative,
structural, representation, and
preservation metadata relevant to the
object.
● Optionally include copy of spec in top level
of OCFL storage root
● More complete option would be a specific
OCFL object that contains this
documentation and to have a pointer to its
location in the storage root.
Filesystem metadata
e.g. permissions, access, and
creation times
● not portable between filesystems
● not preservable through file
transfer operations
● ill-defined fixity
⇒ out-of-scope
If important, use filesystem image
format or extract as metadata
Empty Directories
● OCFL preserves files and their
content
● Directories serve as an
organizational convention
● Empty directories not directly
supported
⇒ Use zero-length `.keep` file as
necessary (à la `git`, BagIt)
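The `.keep` convention could be applied before ingest with a short helper. This is a sketch under stated assumptions: `add_keep_files` is a hypothetical name, and treating "empty" as "no files and no subdirectories" is a local choice, not something the slides define.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def add_keep_files(content_dir: Path) -> List[Path]:
    """Create a zero-length .keep file in every empty directory."""
    created = []
    for dirpath, dirnames, filenames in os.walk(content_dir):
        if not dirnames and not filenames:
            keep = Path(dirpath) / ".keep"
            keep.touch()  # zero-length placeholder so the directory survives
            created.append(keep)
    return sorted(created)


# Demo on a throwaway tree with one empty and one non-empty directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a" / "empty").mkdir(parents=True)
    (root / "a" / "full").mkdir()
    (root / "a" / "full" / "file.txt").write_text("data")
    created = [p.relative_to(root).as_posix() for p in add_keep_files(root)]

print(created)
```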
Data and Metadata
Only special files are the inventory,
its digest file, and conformance
declaration files
Otherwise OCFL makes no
distinction between different types of
files.
⇒ Use local conventions as
needed
Storage
● Filesystem or Object Store -- you choose
● Original filename or Normalized filename -- you choose
● Deduplication & Forward delta differencing (at file level) --
optional but likely desirable/normal
"logical file path" - path of file in content as part of state for a particular version
"existing file path" - path of file in OCFL object
content addressing ties these two together
Storage Root Hierarchy - flat, pairtree, ex-wye-zee
[storage_root]
├── 0=ocfl_1.0
├── ocfl_1.0.txt (optional)
├── d45be626e024
│   ├── 0=ocfl_object_1.0
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── v1...
├── d45be626e036
│   ├── 0=ocfl_object_1.0
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── v1...
├── 3104edf0363a
│   ├── 0=ocfl_object_1.0
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── v1...
[storage_root]
├── 0=ocfl_1.0
├── ocfl_1.0.txt (optional)
├── d4
│   └── 5b
│       └── e6
│           └── 26
│               └── e0
│                   ├── 24
│                   │   └── d45be626e024
│                   │       ├── 0=ocfl_object_1.0
│                   │       └── ...
│                   └── 36
│                       └── d45be626e036
│                           ├── 0=ocfl_object_1.0
│                           └── ...
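The nested layout above can be derived from the object name with a few lines of Python. This is a minimal sketch: `pair_path` is a hypothetical helper, and it assumes the identifier is already filesystem-safe (a full pairtree implementation also escapes special characters, which is not shown here).

```python
def pair_path(object_id: str, width: int = 2) -> str:
    """Split an identifier into fixed-width segments and append the full name."""
    segments = [object_id[i:i + width] for i in range(0, len(object_id), width)]
    # The object root directory itself keeps the complete identifier.
    return "/".join(segments + [object_id])


print(pair_path("d45be626e024"))
print(pair_path("d45be626e036"))
```

Identifiers with a shared prefix, such as the two above, share their upper directories and only diverge at the last segment, which keeps any single directory from holding too many entries.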
File operations
(mungification?)
● Inheritance
● Addition
● Updating
● Renaming
● Deletion
● Reinstatement
● Purging ⇒ choices:
a. rebuild new object
b. break immutability and
rewrite (not recommended)
Yes - OCFL supports that...
Version Immutability
OCFL supports systems where
versions (everything in a given
version directory) are immutable once
written.
● It is recommended to follow this
practice
● BUT you can rewrite objects if
you really want to
Deduplication
OCFL supports (in fact, enforces for
internal references) deduplication
through digests
● Only within an object
● File level
● sha512 digest recommended
Forward Delta
Each version need only include new
and changed files
● Files from previous version
included by reference
● Reference by content (digest)
supports renaming without
duplicating
(You can avoid this and include files again if you
really want. But why?)
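A forward-delta update can be sketched as follows. The helper names `sha512_hex` and `add_version` are hypothetical, and existing file paths follow the `vN/<logical path>` pattern seen in the slide's draft inventory; a real implementation would also write the files and the new inventory.

```python
import hashlib
from typing import Dict, List, Tuple


def sha512_hex(data: bytes) -> str:
    return hashlib.sha512(data).hexdigest()


def add_version(
    manifest: Dict[str, List[str]], version: str, files: Dict[str, bytes]
) -> Tuple[Dict[str, List[str]], Dict[str, bytes]]:
    """Build the state for a new version and report which files must be stored.

    `manifest` is updated in place; content whose digest is already listed is
    included by reference only.
    """
    state: Dict[str, List[str]] = {}
    to_write: Dict[str, bytes] = {}
    for logical_path, data in sorted(files.items()):
        digest = sha512_hex(data)
        state.setdefault(digest, []).append(logical_path)
        if digest not in manifest:
            existing_path = f"{version}/{logical_path}"
            manifest[digest] = [existing_path]
            to_write[existing_path] = data
    return state, to_write


manifest: Dict[str, List[str]] = {}
state1, write1 = add_version(
    manifest, "v1", {"foo/bar.xml": b"<a/>", "image.tiff": b"II*", "empty.txt": b""}
)
# v2 changes bar.xml and renames image.tiff; only the changed file is stored.
state2, write2 = add_version(
    manifest, "v2", {"foo/bar.xml": b"<b/>", "renamed.tiff": b"II*", "empty.txt": b""}
)
print(sorted(write1), sorted(write2))
```

The rename costs nothing on disk: the v2 state lists the new logical path under the same digest, and the manifest still points at the bytes stored in v1.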
Fixity
1. Digests used for reference
already provide basis for strong
fixity checks (pref. sha512)
2. Additional digests may be
included to support legacy fixity
information (e.g. md5)
(Fixity of inventory files themselves handled by
sidecar file, e.g. inventory.json.sha512)
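Since the manifest already maps digests to stored files, a fixity check is a loop over it. The sketch below assumes sha512 digests and a manifest of the shape shown earlier; `check_fixity` is a hypothetical name, and files are read whole for brevity (large files would be hashed in chunks).

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def check_fixity(
    object_root: Path, manifest: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Return (existing file path, problem) pairs; an empty list means all good."""
    problems = []
    for digest, paths in manifest.items():
        for rel in paths:
            target = object_root / rel
            if not target.is_file():
                problems.append((rel, "missing"))
            elif hashlib.sha512(target.read_bytes()).hexdigest() != digest.lower():
                problems.append((rel, "digest mismatch"))
    return problems


# Demo on a throwaway object root holding one zero-length file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "v1").mkdir()
    (root / "v1" / "empty.txt").write_bytes(b"")
    good = {hashlib.sha512(b"").hexdigest(): ["v1/empty.txt"]}
    bad = {hashlib.sha512(b"x").hexdigest(): ["v1/empty.txt", "v1/gone.txt"]}
    good_report = check_fixity(root, good)
    bad_report = check_fixity(root, bad)

print(good_report)
print(bad_report)
```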
Log Information
logs directory in OCFL object
available for information not in
object's content and not versioned
● form not specified
● will be ignored in object
validation
Small Files
Objects with many small files may
cause problems with some storage
infrastructures and may make
validation/fixity time consuming
● package in single file (ZIP
recommended)
(Options for a later version of the OCFL spec
are ZIPped objects and/or ZIP by version)
Roadmap
Alpha (yesterday)
● Released(ish) on October 10 community call
(OCFL Editors and PASIG Discuss)
● Feedback for November community call
Beta (date based on feedback)
● Experimental validation tool
● Determine what other groups/communities to
seek input from
Release 1.0 (2019)
● One production-ready validator
● Test suite and fixture objects
● Two institutions committed to backing the
initiative (should define that)
Thank You
https://ocfl.io
https://github.com/OCFL
ocfl-community@googlegroups.com
