HydraDAM2: Repository Challenges and Solutions for Large Media Files

Karen Cariani, Senior Director, WGBH Media Library and Archives
Jon Dunn, Assistant Dean for Library Technologies, Indiana University
Who we are:
WGBH Media Library and Archives
Challenges of Audio and Video
• Descriptive metadata
• Technical metadata
• Large preservation files
• Multiple files with similar metadata
• Storage dependent on frequency of access
— Bandwidth capability
Preservation Needs
• Multiple copies
• Save original files
• Validity – checksums
• Regular storage migration
• Persistence
• File format issues
— Ease of migration
— Future playback
• Fixity checks on big files (see the sketch below)
• Big files
— Speed of access to preservation files for reuse
— Processing speed
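The checksum and fixity items above amount to streaming large files through a digest. As a minimal sketch (not the project's actual code, and with a placeholder path and recorded digest), this is what that looks like:

```python
import hashlib

def fixity_checksum(path, algorithm="sha256", chunk_size=64 * 1024 * 1024):
    """Stream a large preservation file in 64 MB chunks so it never has
    to fit in memory, and return the hex digest for fixity comparison."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder usage: compare against the checksum recorded at ingest.
# recorded_at_ingest = "..."  # value stored in the object's technical metadata
# assert fixity_checksum("/archive/masters/item0001.mkv") == recorded_at_ingest
```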
Some History: HydraDAM1
• Began with HydraDAM 1, which was based on Sufia and Fedora 3
— Self-deposit institutional repository application
• Adapted to add bulk ingest, bulk edit, characterization of files, and transcoding of proxies
• Limitations:
— Assumed a full workflow pipeline for ingestion of A/V materials
— Processing performance problems
Indiana University Context
• Over 3 million special collections items at IU Bloomington
  • Within and outside the Libraries
• Many sources of A/V
  • Music and other performing arts
  • Ethnomusicology, anthropology
  • Public broadcasting stations
  • Film collections
  • Athletics
MDPI: IU Media Digitization and Preservation Initiative
• Goal: “To digitize, preserve and make universally available by IU’s Bicentennial—subject to copyright or other legal restrictions—all of the time-based media objects on all campuses of IU judged important by experts.”
• 280,000+ items
• ~7 PB over 4 years (rough throughput check below)
• 9 TB per day at peak
• http://mdpi.iu.edu/
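As a rough sanity check on these throughput figures (assuming 1 PB ≈ 1,000 TB and, unrealistically, a uniform digitization rate):

\[
\frac{7000\ \mathrm{TB}}{4 \times 365\ \mathrm{days}} \approx 4.8\ \mathrm{TB/day}\ \text{(average)}, \qquad \frac{9\ \mathrm{TB/day}\ \text{(peak)}}{4.8\ \mathrm{TB/day}} \approx 1.9
\]

so the peak ingest rate is roughly twice the average rate.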
IU MDPI Repository Needs
[Diagram: media files and metadata enter the Digital Preservation Repository (HydraDAM2), which holds masters and mezzanines, replicates to out-of-region storage, and supplies transcodes to an Access Repository.]
HydraDAM2 Project Objectives
• To extend the HydraDAM digital asset management system to operate in conjunction with Fedora 4
— A Hydra “Head” for digital audio/video preservation
• Develop Fedora 4 content models for digital asset preservation objects, including descriptive, structural, and digital provenance metadata, based on current standards and practices and utilizing new features in Fedora 4 for storage and indexing of RDF
• Implement support in HydraDAM for different storage models, appropriate to different types of institutions
• Integrate HydraDAM into preservation workflows that feed access systems at IU (Avalon) and WGBH (Open Vault), and conduct testing of large files and high-throughput workflows
• Document and disseminate information about our implementation and experience to the library, archive, digital repository, audiovisual preservation, and Hydra communities
NEH Desired Outcomes
• How hard is it to do?
• Is it implementable elsewhere?
• Is it feasible for broad use?
• NEH Preservation and Access R&D Grant: January 2015 – January 2017
Project progress
• Slow start getting developers in place
• Coordinating work across organizations
• Developing data models: what is shared and what differs
• Determining where the code splits for different storage needs
• Working out an agile development schedule split across geographically distant organizations
Storage use cases
• WGBH: storing files offline on LTO tape, written directly from a local workstation
— Bandwidth issues moving large preservation files across the network
— Easier for us to hand-deliver
• Indiana University: using a central HSM system for nearline storage
— Automated delivery of large files over the network
Storage use cases
• Not storing media preservation files in Fedora, or in a filesystem managed by Fedora
— WGBH: just the location of the files on LTO tape
— IU: a URL in Fedora that redirects to a download of the content from the HSM
• How do we accommodate both needs with common code? (see the adapter sketch below)
— Where does the code split off?
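One way to picture where the code could split is a single storage interface with institution-specific adapters behind it. The sketch below is purely illustrative: the class names, the tape-catalog format, and the HSM URL pattern are hypothetical, not the project's actual design.

```python
from abc import ABC, abstractmethod

class StorageAdapter(ABC):
    """Interface the shared HydraDAM2 code would program against."""

    @abstractmethod
    def locate(self, file_id: str) -> str:
        """Return whatever the repository should record for this file."""

class WGBHTapeAdapter(StorageAdapter):
    """Offline LTO: record only a tape location; retrieval is manual."""

    def __init__(self, tape_catalog):
        # e.g. {"item-1": "LTO-0042:/masters/item-1.mov"}  (hypothetical format)
        self.tape_catalog = tape_catalog

    def locate(self, file_id):
        return self.tape_catalog[file_id]

class IUHsmAdapter(StorageAdapter):
    """Nearline HSM: record a URL that redirects to an automated download."""

    def __init__(self, base_url):
        self.base_url = base_url

    def locate(self, file_id):
        return f"{self.base_url}/download/{file_id}"
```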
Not in Fedora because
• Files are big
• Costly, in performance terms, to push them into and out of Fedora
• Federation/projection in Fedora would let Fedora register external content, but it has limitations
— Now deprecated in Fedora
• Petabyte-scale volumes of data are too costly to keep on spinning disk
• So we are storing on tape
HydraDAM2 Architecture Components
• Fedora 4
• Curation Concerns 0.14
• Hydra::Works
• PCDM (Portland Common Data Model)
HydraDAM2 PCDM Model (IU Case)
[Diagram: a PCDM:Collection hasMember a Hydra GenericWork (extends PCDM:Object), which hasMember a Hydra FileSet (extends PCDM:Object); the FileSet hasFile PCDM:File binaries for the master, mezzanine, and access files (files A and B), plus PCDM:File binaries for POD XML, Memnon/IU XML, and MODS XML metadata.]
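To make the relationships in the diagram concrete, here is a minimal sketch using Python's rdflib; the URIs are placeholders, and in practice these objects are created through the Hydra stack (Curation Concerns / Hydra::Works) rather than by hand-writing RDF.

```python
from rdflib import Graph, Namespace, URIRef

PCDM = Namespace("http://pcdm.org/models#")
BASE = "http://localhost:8080/fcrepo/rest/"  # placeholder Fedora 4 base URI

g = Graph()
collection = URIRef(BASE + "collections/1")
work       = URIRef(BASE + "works/A")           # Hydra GenericWork (a pcdm:Object)
fileset    = URIRef(BASE + "works/A/fileset")   # Hydra FileSet (a pcdm:Object)
master     = URIRef(BASE + "works/A/fileset/master")

g.add((collection, PCDM.hasMember, work))
g.add((work, PCDM.hasMember, fileset))
g.add((fileset, PCDM.hasFile, master))
# ...and likewise hasFile triples for the mezzanine, access, and XML metadata
# files (POD, Memnon/IU, MODS) shown in the diagram.

print(g.serialize(format="turtle"))
```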
Fedora 4 Asynchronous Storage: Proof of Concept
[Diagram: an Asynchronous Storage Proxy (a Rails application with an AS UI gem), reached through Apache Camel routes, sits in front of local tape storage services, large files on disk, and cloud storage services, each behind a service translation blueprint, with a notify step. The asynchronous-aware user interface provides the interactions; the proxy provides an API with common endpoints and responses; the translation blueprints map from the common API to specific storage APIs; this should be able to be an API-X sharable service.]
[Diagram: inside Fedora 4, an RDF resource container node holds a non-RDF resource node that issues a URL redirect to the Asynchronous Interactions UI, which drives the Asynchronous Storage Proxy through Apache Camel routes in front of a slow storage service. Asynchronous interactions are invoked from the Fedora 4 API. The redirecting node is set via the external-body MIME type, using the Fedora 4 API or Hydra::Works file behaviors. The URL to redirect to would be wherever the Asynchronous Interactions UI is deployed, immediately invoking interactions for a unique identifier (preferably using persistent URLs). Access to redirecting nodes via the Fedora 4 API invokes an immediate redirect to the stored URL.]
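The redirect mechanism described above can be exercised directly against the Fedora 4 REST API. The sketch below uses placeholder URLs and raw HTTP (in HydraDAM2 the external-body type is set via Hydra::Works file behaviors, per the diagram); it simply shows a pointer being stored and the redirect that results.

```python
import requests

FEDORA = "http://localhost:8080/fcrepo/rest"         # placeholder Fedora 4 base URL
TARGET = "https://hsm.example.edu/retrieve/item-1"   # placeholder slow-storage URL

# Create a non-RDF resource whose content lives elsewhere: Fedora stores only
# the pointer, via the external-body MIME type mentioned above.
resp = requests.put(
    f"{FEDORA}/masters/item-1",
    headers={"Content-Type": f'message/external-body; access-type=URL; URL="{TARGET}"'},
)
resp.raise_for_status()

# A later request for the node is answered with a redirect to the stored URL.
check = requests.get(f"{FEDORA}/masters/item-1", allow_redirects=False)
print(check.status_code, check.headers.get("Location"))
```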
Demo
Where We’re Going
• Ensure content models are on the right track
• Continue development
— Build out storage proxy interaction with IU mass storage
— Build out WGBH storage implementation
— Additional user functionality
— Build out descriptive metadata / PBCore support
• Batch ingest
• Feed to/from Avalon Media System
• Pilot implementation
• Production implementation
Questions?
• https://github.com/WGBH/hydradam2
• karen_cariani@wgbh.org
• jwd@iu.edu


Editor's Notes

  • #3 Who are we? WGBH is Boston’s public television station. We produce fully one third of the content broadcast on PBS, including the series you see here, as well as Downton Abbey and Sherlock. In addition to television, we have two radio stations and a large, award-winning Interactive department that is the number one producer of the sites you’ll find on PBS.org. As you can see, we produce a wide variety of programming, from public affairs to history and science, children’s programs, arts, culture, drama, and how-tos. We have been on the air since 1951 with radio and 1955 with television. At heart, and through our mission, we are an educational and cultural institution; we originated out of a consortium of academic universities in the Boston area. Because we have produced so much, we have a large archive of educational programming that is of interest to scholars and researchers, in addition to the public.
  • #5 A quick check on preservation needs – so this digital stuff really sucks; film or stone is a much longer-lasting medium, but digital gives us much better and broader access. So how do we preserve this fragile stuff that needs migration every 3–5 years? You need multiple copies, and you save the originals because they should be whole. Checksums – validity checks on files to make sure you have all the bits. Migration not only of the content (the files) but also of all the technology, systems, and storage you use. And doing this with big media files is hard, time-consuming, and prone to damage and errors.
  • #11 We were generously awarded a grant to see if we could build a media preservation DAM system using open-source software. In particular, we wanted to test the Hydra tech stack: see what it would take to build, what it would take for others to install (better documentation), and how to really integrate with an open-source community.