HydraDAM2: Repository Challenges and Solutions for Large Media Files

Karen Cariani, Senior Director, WGBH Media Library and Archives
Jon Dunn, Assistant Dean for Library Technologies, Indiana University
Who we are:
WGBH Media Library and Archives
Challenges of Audio and Video
• Descriptive metadata
• Technical metadata
• Large preservation files
• Multiple files with similar metadata
• Storage dependent on frequency of access
— Bandwidth capability
Preservation Needs
• Multiple copies
• Save original files
• Validity – checksums
• Regular storage migration
• Persistence
• File format issues
— Ease of migration
— Future playback
• Fixity checks on big files (see the sketch below)
• Big files
— Speed of access to preservation files for reuse
— Processing speed
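The checksum and fixity items above amount to streaming large files through a digest. As a minimal sketch (not the project's actual code, and with a placeholder path and recorded digest), this is what that looks like:

```python
import hashlib

def fixity_checksum(path, algorithm="sha256", chunk_size=64 * 1024 * 1024):
    """Stream a large preservation file in 64 MB chunks so it never has
    to fit in memory, and return the hex digest for fixity comparison."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder usage: compare against the checksum recorded at ingest.
# recorded_at_ingest = "..."  # value stored in the object's technical metadata
# assert fixity_checksum("/archive/masters/item0001.mkv") == recorded_at_ingest
```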
Some History: HydraDAM1
• Began with HydraDAM 1, which was based on Sufia and Fedora 3
— Self-deposit institutional repository application
• Adapted to add bulk ingest, bulk edit, characterization of files, and transcoding of proxies
• Limitations:
— Assumed a full workflow pipeline for ingestion of A/V materials
— Processing performance problems
Indiana University Context
• Over 3 million special collections items at IU Bloomington
  • Within and outside the Libraries
• Many sources of A/V
  • Music and other performing arts
  • Ethnomusicology, anthropology
  • Public broadcasting stations
  • Film collections
  • Athletics
MDPI: IU Media Digitization and Preservation Initiative
• Goal: “To digitize, preserve and make universally available by IU’s Bicentennial—subject to copyright or other legal restrictions—all of the time-based media objects on all campuses of IU judged important by experts.”
• 280,000+ items
• ~7 PB over 4 years (rough throughput check below)
• 9 TB per day at peak
• http://mdpi.iu.edu/
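As a rough sanity check on these throughput figures (assuming 1 PB ≈ 1,000 TB and, unrealistically, a uniform digitization rate):

\[
\frac{7000\ \mathrm{TB}}{4 \times 365\ \mathrm{days}} \approx 4.8\ \mathrm{TB/day}\ \text{(average)}, \qquad \frac{9\ \mathrm{TB/day}\ \text{(peak)}}{4.8\ \mathrm{TB/day}} \approx 1.9
\]

so the peak ingest rate is roughly twice the average rate.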
IU MDPI Repository Needs
[Diagram: media files and metadata enter the Digital Preservation Repository (HydraDAM2), which holds masters and mezzanines, replicates to out-of-region storage, and supplies transcodes to an Access Repository.]
HydraDAM2 Project Objectives
• To extend the HydraDAM digital asset management system to operate in conjunction with Fedora 4
— A Hydra “Head” for digital audio/video preservation
• Develop Fedora 4 content models for digital asset preservation objects, including descriptive, structural, and digital provenance metadata, based on current standards and practices and utilizing new features in Fedora 4 for storage and indexing of RDF
• Implement support in HydraDAM for different storage models, appropriate to different types of institutions
• Integrate HydraDAM into preservation workflows that feed access systems at IU (Avalon) and WGBH (Open Vault), and conduct testing of large files and high-throughput workflows
• Document and disseminate information about our implementation and experience to the library, archive, digital repository, audiovisual preservation, and Hydra communities
NEH Desired Outcomes
• How hard is it to do?
• Is it implementable elsewhere?
• Is it feasible for broad use?
• NEH Preservation and Access R&D Grant: January 2015 – January 2017
Project progress
• Slow start getting developers in place
• Coordinating work across organizations
• Developing data models: what is shared and what differs
• Determining where the code splits for different storage needs
• Working out an agile development schedule split across geographically distant organizations
Storage use cases
• WGBH: storing files offline on LTO tape, written directly from a local workstation
— Bandwidth issues moving large preservation files across the network
— Easier for us to hand-deliver
• Indiana University: using a central HSM system for nearline storage
— Automated delivery of large files over the network
Storage use cases
• Not storing media preservation files in Fedora, or in a filesystem managed by Fedora
— WGBH: just the location of the files on LTO tape
— IU: a URL in Fedora that redirects to a download of the content from the HSM
• How do we accommodate both needs with common code? (see the adapter sketch below)
— Where does the code split off?
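One way to picture where the code could split is a single storage interface with institution-specific adapters behind it. The sketch below is purely illustrative: the class names, the tape-catalog format, and the HSM URL pattern are hypothetical, not the project's actual design.

```python
from abc import ABC, abstractmethod

class StorageAdapter(ABC):
    """Interface the shared HydraDAM2 code would program against."""

    @abstractmethod
    def locate(self, file_id: str) -> str:
        """Return whatever the repository should record for this file."""

class WGBHTapeAdapter(StorageAdapter):
    """Offline LTO: record only a tape location; retrieval is manual."""

    def __init__(self, tape_catalog):
        # e.g. {"item-1": "LTO-0042:/masters/item-1.mov"}  (hypothetical format)
        self.tape_catalog = tape_catalog

    def locate(self, file_id):
        return self.tape_catalog[file_id]

class IUHsmAdapter(StorageAdapter):
    """Nearline HSM: record a URL that redirects to an automated download."""

    def __init__(self, base_url):
        self.base_url = base_url

    def locate(self, file_id):
        return f"{self.base_url}/download/{file_id}"
```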
Not in Fedora because
• Files are big
• Costly, in performance terms, to push them into and out of Fedora
• Federation/projection in Fedora would let Fedora register external content, but it has limitations
— Now deprecated in Fedora
• Petabyte-scale volumes of data are too costly to keep on spinning disk
• So we are storing on tape
HydraDAM2 Architecture Components
• Fedora 4
• Curation Concerns 0.14
• Hydra::Works
• PCDM (Portland Common Data Model)
HydraDAM2 PCDM Model (IU Case)
[Diagram: a PCDM:Collection hasMember a Hydra GenericWork (extends PCDM:Object), which hasMember a Hydra FileSet (extends PCDM:Object); the FileSet hasFile PCDM:File binaries for the master, mezzanine, and access files (files A and B), plus PCDM:File binaries for POD XML, Memnon/IU XML, and MODS XML metadata.]
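To make the relationships in the diagram concrete, here is a minimal sketch using Python's rdflib; the URIs are placeholders, and in practice these objects are created through the Hydra stack (Curation Concerns / Hydra::Works) rather than by hand-writing RDF.

```python
from rdflib import Graph, Namespace, URIRef

PCDM = Namespace("http://pcdm.org/models#")
BASE = "http://localhost:8080/fcrepo/rest/"  # placeholder Fedora 4 base URI

g = Graph()
collection = URIRef(BASE + "collections/1")
work       = URIRef(BASE + "works/A")           # Hydra GenericWork (a pcdm:Object)
fileset    = URIRef(BASE + "works/A/fileset")   # Hydra FileSet (a pcdm:Object)
master     = URIRef(BASE + "works/A/fileset/master")

g.add((collection, PCDM.hasMember, work))
g.add((work, PCDM.hasMember, fileset))
g.add((fileset, PCDM.hasFile, master))
# ...and likewise hasFile triples for the mezzanine, access, and XML metadata
# files (POD, Memnon/IU, MODS) shown in the diagram.

print(g.serialize(format="turtle"))
```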
Fedora 4 Asynchronous Storage: Proof of Concept
[Diagram: an Asynchronous Storage Proxy (a Rails application with an AS UI gem), reached through Apache Camel routes, sits in front of local tape storage services, large files on disk, and cloud storage services, each behind a service translation blueprint, with a notify step. The asynchronous-aware user interface provides the interactions; the proxy provides an API with common endpoints and responses; the translation blueprints map from the common API to specific storage APIs; this should be able to be an API-X sharable service.]
[Diagram: inside Fedora 4, an RDF resource container node holds a non-RDF resource node that issues a URL redirect to the Asynchronous Interactions UI, which drives the Asynchronous Storage Proxy through Apache Camel routes in front of a slow storage service. Asynchronous interactions are invoked from the Fedora 4 API. The redirecting node is set via the external-body MIME type, using the Fedora 4 API or Hydra::Works file behaviors. The URL to redirect to would be wherever the Asynchronous Interactions UI is deployed, immediately invoking interactions for a unique identifier (preferably using persistent URLs). Access to redirecting nodes via the Fedora 4 API invokes an immediate redirect to the stored URL.]
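The redirect mechanism described above can be exercised directly against the Fedora 4 REST API. The sketch below uses placeholder URLs and raw HTTP (in HydraDAM2 the external-body type is set via Hydra::Works file behaviors, per the diagram); it simply shows a pointer being stored and the redirect that results.

```python
import requests

FEDORA = "http://localhost:8080/fcrepo/rest"         # placeholder Fedora 4 base URL
TARGET = "https://hsm.example.edu/retrieve/item-1"   # placeholder slow-storage URL

# Create a non-RDF resource whose content lives elsewhere: Fedora stores only
# the pointer, via the external-body MIME type mentioned above.
resp = requests.put(
    f"{FEDORA}/masters/item-1",
    headers={"Content-Type": f'message/external-body; access-type=URL; URL="{TARGET}"'},
)
resp.raise_for_status()

# A later request for the node is answered with a redirect to the stored URL.
check = requests.get(f"{FEDORA}/masters/item-1", allow_redirects=False)
print(check.status_code, check.headers.get("Location"))
```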
Demo
Where We’re Going
• Ensure content models are on the right track
• Continue development
— Build out storage proxy interaction with IU mass storage
— Build out WGBH storage implementation
— Additional user functionality
— Build out descriptive metadata / PBCore support
• Batch ingest
• Feed to/from Avalon Media System
• Pilot implementation
• Production implementation
Questions?
• https://github.com/WGBH/hydradam2
• karen_cariani@wgbh.org
• jwd@iu.edu


Editor's Notes

  • #3 Who are we? WGBH is Boston’s public television station. We produce fully one third of the content broadcast on PBS, including the series you see here, as well as Downton Abbey and Sherlock. In addition to television, we have two radio stations and a large, award-winning Interactive department that is the number one producer of the sites you’ll find on PBS.org. As you can see, we produce a wide variety of programming, from public affairs to history and science, children’s programs, arts, culture, drama, and how-tos. We have been on the air since 1951 with radio and 1955 with television. At heart, and through our mission, we are an educational and cultural institution; we originated out of a consortium of academic universities in the Boston area. Because we have produced so much, we have a large archive of educational programming that is of interest to scholars and researchers, in addition to the public.
  • #5 A quick check on preservation needs – so this digital stuff really sucks; film or stone is a much longer-lasting medium, but digital gives us much better and broader access. So how do we preserve this fragile stuff that needs migration every 3–5 years? You need multiple copies, and you save the originals because they should be whole. Checksums – validity checks on files to make sure you have all the bits. Migration not only of the content (the files) but also of all the technology, systems, and storage you use. And doing this with big media files is hard, time-consuming, and prone to damage and errors.
  • #11 We were generously awarded a grant to see if we could build a media preservation DAM system using open-source software. In particular, we wanted to test the Hydra tech stack: see what it would take to build, what it would take for others to install (better documentation), and how to really integrate with an open-source community.