HydraDAM2:
Repository Challenges and
Solutions for Large Media Files
Karen Cariani
Senior Director
WGBH Media Library and
Archives
Jon Dunn
Assistant Dean for Library
Technologies
Indiana University
Who we are:
WGBH Media Library and Archives
Challenges of Audio and Video
 Descriptive metadata
 Technical metadata
 Large preservation files
 Multiple files with similar metadata
 Storage dependent on frequency of access
—Bandwidth capability
Preservation Needs
 Multiple Copies
 Save original files
 Validity: checksum
 Regular storage migration
 Persistence
 File format issues
— Migration ease
— Future playback
 Fixity check big files
 Big files
— Speed of access of preservation
files for reuse
— Processing speed
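The fixity checks listed above can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the 8 MiB chunk size, and the choice of SHA-256 are assumptions. Reading in chunks keeps memory use flat however large the preservation file is.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=8 * 1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected_hex, algorithm="sha256"):
    """Return True when the file's current checksum matches the recorded one."""
    return file_checksum(path, algorithm) == expected_hex.lower()
```

Run time still grows with file size, since every byte has to be read from storage, which is why fixity checking of big files is listed as a need of its own.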
Some History: HydraDAM1
 Began with HydraDAM 1, which was based on Sufia and
Fedora 3
—Self-deposit institutional repository application
 Adapted to add bulk ingest, bulk edit, characterization
of files, transcoding of proxies
 Limitations:
—Assumed full workflow pipeline for ingestion of A/V
materials
—Processing performance problems
Indiana University Context
• Over 3 million special collections
items at IU Bloomington
• Within and outside the
Libraries
• Many sources of A/V
• Music and other performing
arts
• Ethnomusicology,
anthropology
• Public broadcasting stations
• Film collections
• Athletics
MDPI:
IU Media Digitization and Preservation Initiative
 Goal: “To digitize, preserve and make universally available
by IU’s Bicentennial—subject to copyright or other legal
restrictions—all of the time-based media objects on all
campuses of IU judged important by experts.”
 280,000+ items
 ~7PB over 4 years
 9TB per day peak
 http://mdpi.iu.edu/
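To put the volumes above in network terms, here is a rough back-of-the-envelope conversion. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 1000 TB), 365-day years, and a perfectly even transfer over 24 hours. The function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_gbit_per_s(terabytes_per_day):
    """Average line rate needed to move a daily volume, in Gbit/s."""
    bytes_per_day = terabytes_per_day * 10**12
    return bytes_per_day * 8 / SECONDS_PER_DAY / 10**9


def average_tb_per_day(petabytes, years):
    """Average daily volume implied by a total spread evenly over some years."""
    return petabytes * 1000 / (years * 365)


print(round(sustained_gbit_per_s(9), 2))   # peak day: 0.83 Gbit/s
print(round(average_tb_per_day(7, 4), 1))  # 7 PB over 4 years: 4.8 TB/day
```

Under these assumptions a 9 TB peak day needs most of a 1 Gbit/s link running around the clock, before any protocol overhead or retries.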
IU MDPI Repository Needs
[Diagram. Labels: Media Files and Metadata; Digital Preservation Repository (HydraDAM2); Access Repository; Masters, Mezzanines; Transcodes; Out-of-Region Storage]
HydraDAM2 Project Objectives
 To extend the HydraDAM digital asset management system to operate in
conjunction with Fedora 4.
— Hydra “Head” for digital audio/video preservation
 Develop Fedora 4 content models for digital asset preservation objects, including
descriptive, structural, and digital provenance metadata, based on current
standards and practices and utilizing new features in Fedora 4 for storage and
indexing of RDF.
 Implement support in HydraDAM for different storage models, appropriate to
different types of institutions.
 Integrate HydraDAM into preservation workflows that feed access systems at IU
(Avalon) and WGBH (Open Vault) and conduct testing of large files and
high-throughput workflows.
 Document and disseminate information about our implementation and experience
to the library, archive, digital repository, audiovisual preservation, and Hydra
communities.
NEH Desired Outcomes
 How hard is it to do?
 Is it implementable elsewhere?
 Is it feasible for broad use?
 NEH Preservation and Access R&D Grant:
January 2015 – January 2017
Project progress
 Slow start getting developers in place
 Coordinating work across organizations
 Developing data models, both shared and institution-specific
 Determining where code splits for different storage
needs
 Working out an agile development schedule across
geographically distributed organizations
Storage use cases
 WGBH storing files offline on LTO tape directly from
local workstation
—Bandwidth issues to move large preservation files
across the network
—Easier for us to hand deliver
 Indiana University utilizing a central HSM system for
nearline storage
—Auto delivery of large files through network
Storage use cases
 Not storing media preservation files in Fedora or in
filesystem managed by Fedora
— WGBH: Just the location of the files on LTO tape
— IU: URL in Fedora that redirects to download of
content from HSM
 How do we accommodate both needs with common
code?
— Where does the code split off?
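One way to picture the split is a shared record type with one small storage-specific branch. The sketch below is an assumption for illustration only: the class names, fields, and messages are hypothetical and do not come from the HydraDAM2 code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TapeLocation:
    """Offline copy: the repository only records where the file lives."""
    barcode: str  # hypothetical LTO cartridge label
    path: str     # path of the file on that tape


@dataclass(frozen=True)
class RedirectLocation:
    """Nearline copy: the repository records a URL that leads to a download."""
    url: str


def describe_retrieval(location):
    """The single place where behaviour differs by storage model."""
    if isinstance(location, TapeLocation):
        return f"Request manual retrieval of {location.path} from tape {location.barcode}"
    if isinstance(location, RedirectLocation):
        return f"Redirect to {location.url}"
    raise TypeError(f"Unsupported storage location: {location!r}")
```

Everything above `describe_retrieval` (metadata, content models, user interface) could then stay common to both institutions.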
Not in Fedora because
 Files are big
 Costly in terms of performance to push in and out of
Fedora
 Federation or projection in Fedora would let Fedora register content without
moving it in and out, but has limitations
—Now deprecated in Fedora
 Volumes (petabytes) of data too large to put on
spinning disk because too costly
 So storing on tape
HydraDAM2 Architecture Components
 Fedora 4
 Curation Concerns 0.14
 Hydra::Works
 PCDM (Portland Common Data Model)
HydraDAM2 PCDM: IU case
[Diagram. Nodes and relationships as labelled on the slide:]
PCDM:Collection
hasMember: Hydra GenericWork (extends PCDM:Object)
hasMember: Hydra FileSet (extends PCDM:Object)
hasFile: PCDM:File, master file A (binary)
hasFile: PCDM:File, mezzanine file A (binary)
hasFile: PCDM:File, access file A (binary)
hasFile: PCDM:File, master file B (binary)
hasFile: PCDM:File, mezzanine file B (binary)
hasFile: PCDM:File, access file B (binary)
PCDM:File, POD XML (binary)
PCDM:File, Memnon/IU XML (binary)
PCDM:File, MODS XML (binary)
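The containment pattern in this diagram (a collection has member objects, objects have member objects and files) can be sketched with plain data classes. This is a simplified illustration, not Hydra::Works or an RDF model. The class names are hypothetical and only the "A" files from the slide are used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PcdmFile:
    label: str


@dataclass
class PcdmObject:
    label: str
    members: List["PcdmObject"] = field(default_factory=list)  # pcdm:hasMember
    files: List[PcdmFile] = field(default_factory=list)        # pcdm:hasFile


@dataclass
class PcdmCollection:
    label: str
    members: List[PcdmObject] = field(default_factory=list)    # pcdm:hasMember


def all_files(obj):
    """Yield every file attached to an object or to any of its members."""
    yield from obj.files
    for member in obj.members:
        yield from all_files(member)


file_set = PcdmObject("FileSet A", files=[
    PcdmFile("master file A"),
    PcdmFile("mezzanine file A"),
    PcdmFile("access file A"),
])
work = PcdmObject("GenericWork", members=[file_set])
collection = PcdmCollection("Collection", members=[work])

print([f.label for f in all_files(work)])
```

In this sketch the work and the file set share one class because both extend PCDM:Object on the slide.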
[Diagram. Components: Rails application with AS UI gem; Asynchronous Storage Proxy (Apache Camel routes); Notify; Local Tape Storage Services; Large files on Disk; Cloud Storage Services; one service translation blueprint per storage service]
Asynchronous-aware user interface provides interactions
Proxy provides an API with common endpoints and responses
Translations map from the common API to specific storage APIs
Should be able to be a sharable API-X service
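The idea of a proxy with common endpoints and per-service translations can be sketched in a few lines. This is an in-memory illustration only, not Apache Camel: the class name, the action names `status` and `stage`, and the response shape are assumptions made for the example.

```python
class AsyncStorageProxy:
    # Hypothetical common actions; the slides do not name the real endpoints.
    COMMON_ACTIONS = ("status", "stage")

    def __init__(self):
        self._blueprints = {}

    def register(self, service_name, blueprint):
        """Attach a translation blueprint: a mapping of action name to callable."""
        missing = [a for a in self.COMMON_ACTIONS if a not in blueprint]
        if missing:
            raise ValueError(f"{service_name} blueprint lacks: {', '.join(missing)}")
        self._blueprints[service_name] = blueprint

    def call(self, service_name, action, identifier):
        """Translate a common request and wrap the answer in a common response."""
        if action not in self.COMMON_ACTIONS:
            raise ValueError(f"Unknown action: {action}")
        result = self._blueprints[service_name][action](identifier)
        return {"service": service_name, "action": action,
                "id": identifier, "result": result}


proxy = AsyncStorageProxy()
proxy.register("tape", {
    "status": lambda file_id: "offline",
    "stage": lambda file_id: f"retrieval ticket opened for {file_id}",
})
proxy.register("hsm", {
    "status": lambda file_id: "nearline",
    "stage": lambda file_id: f"recall queued for {file_id}",
})
```

Callers see the same request and response shape for every service. Only the blueprint knows how a given storage system is actually addressed.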
Fedora 4 Asynchronous Storage: Proof of Concept
[Diagram. Components: Fedora 4 (RDF resource container node; non-RDF resource node with URL redirect); Asynchronous Interactions UI; Asynchronous Storage Proxy (Apache Camel routes); slow storage service]
Invoking asynchronous interactions from the Fedora 4 API
Redirecting node via external-body MIME type; can be set using the Fedora 4 API and also via Hydra Works file behaviors
The URL to redirect to would be wherever the Asynchronous Interactions UI is deployed, immediately invoking interactions for a unique identifier (preferably using persistent URLs)
Access to redirecting nodes via the Fedora 4 API invokes an immediate redirect to the stored URL
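As a rough sketch of the redirect mechanism, the snippet below builds the `Content-Type` value that Fedora 4 accepts for an external-body node, without making any request. The header format follows the Fedora 4 documentation for external content. The UI base URL and the URL layout in `interaction_url` are hypothetical.

```python
from urllib.parse import quote


def interaction_url(ui_base, identifier):
    """Build the UI address for one identifier (the path layout is an assumption)."""
    return f"{ui_base.rstrip('/')}/{quote(identifier, safe='')}"


def external_body_content_type(redirect_url):
    """Content-Type value that marks a non-RDF node as a redirect to a URL."""
    if '"' in redirect_url:
        raise ValueError("redirect URL must not contain a double quote")
    return f'message/external-body; access-type=URL; URL="{redirect_url}"'


target = interaction_url("https://async-ui.example.org/interactions/", "abc/123")
print(external_body_content_type(target))
```

Sending this value as the `Content-Type` of a PUT or POST that creates the binary resource is what would register the redirect. That request is left out here because it depends on the local Fedora deployment.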
Demo
Where We’re Going
 Ensure content models are on the right track
 Continue development
— Build out storage proxy interaction with IU mass storage
— Build out WGBH storage implementation
— Additional user functionality
— Build out descriptive metadata / PBCore support
 Batch ingest
 Feed to/from Avalon Media System
 Pilot implementation
 Production implementation
Questions?
 https://github.com/WGBH/hydradam2
 karen_cariani@wgbh.org
 jwd@iu.edu

HydraDAM2: Repository Challenges and Solutions for Large Media Files

  • 1. HydraDAM2: Repository Challenges and Solutions for Large Media Files Karen Cariani Senior Director WGBH Media Library and Archives Jon Dunn Assistant Dean for Library Technologies Indiana University
  • 2. Who we are: WGBH Media Library and Archives
  • 3. Challenges of Audio and Video  Descriptive metadata  Technical metadata  Large preservation files  Multiple files with similar metadata  Storage dependent on frequency of access —Bandwidth capability
  • 4. Preservation Needs  Multiple Copies  Save original files  Validity – check sum  Regular storage migration  Persistence  File format issues — Migration ease — Future playback  Fixity check big files  Big files — Speed of access of preservation files for reuse — Processing speed
  • 5. Some History: HydraDAM1  Began with HydraDAM 1 that was based on Sufia and Fedora 4 —Self-deposit institutional repository application  Adapted to add bulk ingest, bulk edit, characterization of files, transcoding of proxies  Limitations: —Assumed full workflow pipeline for ingestion of A/V materials —Processing performance problems
  • 6. Indiana University Context • Over 3 million special collections items at IU Bloomington • Within and outside the Libraries • Many sources of A/V • Music and other performing arts • Ethnomusicology, anthropology • Public broadcasting stations • Film collections • Athletics
  • 7. MDPI: IU Media Digitization and Preservation Initiative  Goal: “To digitize, preserve and make universally available by IU’s Bicentennial—subject to copyright or other legal restrictions—all of the time-based media objects on all campuses of IU judged important by experts.“  280,000+ items  ~7PB over 4 years  9TB per day peak  http://mdpi.iu.edu/
  • 8. IU MDPI Repository Needs Media Files and Metadata Digital Preservation Repository (HydraDAM2) Access Repository Masters, Mezzanines Transcodes Out-of-Region Storage
  • 9. HydraDAM2 Project Objectives • To extend the HydraDAM digital asset management system to operate in conjunction with Fedora 4. — Hydra “Head” for digital audio/video preservation • Develop Fedora 4 content models for digital asset preservation objects, including descriptive, structural, and digital provenance metadata, based on current standards and practices and utilizing new features in Fedora 4 for storage and indexing of RDF. • Implement support in HydraDAM for different storage models, appropriate to different types of institutions. • Integrate HydraDAM into preservation workflows that feed access systems at IU (Avalon) and WGBH (Open Vault) and conduct testing of large files and high-throughput workflows. • Document and disseminate information about our implementation and experience to the library, archive, digital repository, audiovisual preservation, and Hydra communities.
  • 10. NEH Desired Outcomes • How hard is it to do? • Is it implementable elsewhere? • Is it feasible for broad use? • NEH Preservation and Access R&D Grant: January 2015 – January 2017
  • 11. Project progress • Slow start getting developers in place • Coordinating work across organizations • Developing data models, both shared and institution-specific • Determining where code splits for different storage needs • Finding a workable agile development schedule across geographically separated organizations
  • 12. Storage use cases • WGBH storing files offline on LTO tape directly from local workstation —Bandwidth issues to move large preservation files across the network —Easier for us to hand deliver • Indiana University utilizing a central HSM system for nearline storage —Auto delivery of large files through network
  • 13. Storage use cases • Not storing media preservation files in Fedora or in filesystem managed by Fedora — WGBH: Just the location of the files on LTO tape — IU: URL in Fedora that redirects to download of content from HSM • How do we accommodate both needs with common code? — Where does the code split off?
  • 14. Not in Fedora because • Files are big • Costly in terms of performance to push in and out of Fedora • Federation or projection in Fedora would allow Fedora to register content held outside it, but with limitations —Now deprecated in Fedora • Volumes (petabytes) of data too large to put on spinning disk because too costly • So storing on tape
  • 15. HydraDAM2 Architecture Components • Fedora 4 • Curation Concerns 0.14 • Hydra::Works • PCDM (Portland Common Data Model)
  • 16. HydraDAM2 PCDM IU case • Diagram objects: PCDM:Collection; Hydra: GenericWork (extends PCDM:Object); Hydra: FileSet (extends PCDM:Object) • Diagram relationships: hasMember links between the collection, work, and file set objects; hasFile links from objects to PCDM:File binaries • PCDM:File binaries shown: master file A, mezzanine file A, access file A; master file B, mezzanine file B, access file B; POD XML, Memnon/IU XML, MODS XML
  • 17. Fedora 4 Asynchronous Storage: Proof of Concept • Diagram components: Rails application with AS UI gem; Asynchronous Storage Proxy; Apache Camel Routes; Notify; a service translation blueprint for each of Local Tape Storage Services, Large files on Disk, and Cloud Storage Services • Asynchronous aware user interface provides interactions • Proxy provides API with common endpoints and responses • Translations map from common API to specific storage APIs • Should be able to be an API-X sharable service
  • 18. Invoking asynchronous interactions from Fedora 4 API • Diagram components: Fedora 4 RDF resource container node; Non-RDF resource node; URL redirect; Asynchronous Interactions UI; Apache Camel Routes; Asynchronous Storage Proxy; Slow storage service • Redirecting node via external-body MIME type; can be set using Fedora 4 API and also via Hydra Works file behaviors • The URL to redirect to would be wherever the Asynchronous Interactions UI is deployed, immediately invoking interactions for a unique identifier (preferably using persistent URLs) • Access to redirecting nodes via Fedora 4 API invokes immediate redirect to stored URL
  • 19. Demo
  • 33. Where We’re Going • Ensure content models are on the right track • Continue development — Build out storage proxy interaction with IU mass storage — Build out WGBH storage implementation — Additional user functionality — Build out descriptive metadata / PBCore support • Batch ingest • Feed to/from Avalon Media System • Pilot implementation • Production implementation
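
Slide 17 describes a proxy that exposes common endpoints and responses, with per-service translations mapping the common API onto specific storage APIs. The Python sketch below is one way to picture that split, not the project's actual code: every class, method, and state name here is hypothetical, and the two backends are in-memory stand-ins loosely modeled on the WGBH (offline tape) and IU (nearline HSM) use cases.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class StorageResponse:
    """Common response shape returned by the proxy (hypothetical fields)."""
    identifier: str
    state: str   # assumed states: "offline", "staging", "available"
    detail: str


class Translation(ABC):
    """Maps the common API onto one specific storage service."""

    @abstractmethod
    def status(self, identifier: str) -> StorageResponse: ...

    @abstractmethod
    def stage(self, identifier: str) -> StorageResponse: ...


class TapeTranslation(Translation):
    """Offline tape stand-in: files can be located but not staged automatically."""

    def __init__(self, locations: dict):
        self.locations = locations  # identifier -> tape label

    def status(self, identifier: str) -> StorageResponse:
        tape = self.locations[identifier]
        return StorageResponse(identifier, "offline", f"on tape {tape}")

    def stage(self, identifier: str) -> StorageResponse:
        return StorageResponse(identifier, "offline", "manual retrieval required")


class NearlineTranslation(Translation):
    """Nearline stand-in: a stage request is accepted, then the file is available."""

    def __init__(self):
        self.staged = set()

    def status(self, identifier: str) -> StorageResponse:
        state = "available" if identifier in self.staged else "offline"
        return StorageResponse(identifier, state, "nearline store")

    def stage(self, identifier: str) -> StorageResponse:
        # Simulation only: a real service would complete staging later.
        self.staged.add(identifier)
        return StorageResponse(identifier, "staging", "stage request accepted")


class StorageProxy:
    """Common entry point; the code 'splits' only inside the translations."""

    def __init__(self):
        self.translations = {}

    def register(self, service: str, translation: Translation) -> None:
        self.translations[service] = translation

    def status(self, service: str, identifier: str) -> StorageResponse:
        return self.translations[service].status(identifier)

    def stage(self, service: str, identifier: str) -> StorageResponse:
        return self.translations[service].stage(identifier)
```

In this arrangement the user interface only ever sees `StorageResponse` objects, so adding a cloud backend would mean writing one more `Translation` subclass rather than changing the proxy or the UI.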

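Slide 18 mentions redirecting nodes set up through the external-body MIME type. As a rough illustration, the Python sketch below builds and parses a content type of the form `message/external-body; access-type=URL; URL="..."`, which is the form used for external content in Fedora 4. The base URL, the path layout of the Asynchronous Interactions UI, and all function names are assumptions made for the example.

```python
import re


def interactions_url(ui_base: str, identifier: str) -> str:
    """Build the URL a redirecting node would point at.

    The '<base>/<identifier>' layout is hypothetical.
    """
    return f"{ui_base.rstrip('/')}/{identifier}"


def external_body_content_type(url: str) -> str:
    """Content type that marks a binary as a redirect to an external URL."""
    return f'message/external-body; access-type=URL; URL="{url}"'


def redirect_target(content_type: str):
    """Return the stored URL for a redirecting node, or None for ordinary content."""
    if not content_type.lower().startswith("message/external-body"):
        return None
    match = re.search(r'URL="([^"]+)"', content_type, re.IGNORECASE)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical deployment location and identifier.
    url = interactions_url("https://storage-ui.example.edu/interactions/", "mdpi-0001")
    header = external_body_content_type(url)
    print(header)
    print(redirect_target(header))
```

A client that receives such a content type can follow the stored URL to the interactions UI instead of expecting the media bytes from the repository itself, which is what keeps the large files out of Fedora.
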
Editor's Notes

  1. Who are we? WGBH is Boston’s public television station. We produce fully one third of the content broadcast on PBS, including the series you see here, as well as Downton Abbey and Sherlock. In addition to television, we have two radio stations and a large, award-winning Interactive department that is the number one producer for the sites you’ll find on PBS.org. As you can see, we produce a wide variety of programming, from public affairs to history and science to children’s programming, arts, culture, drama and how-tos. We have been on the air since 1951 with radio and 1955 with television. At heart and through our mission we are an educational and cultural institution. We originated out of a consortium of universities in the Boston area. Because we have produced so much, we have a large archive of educational programming that is of interest to scholars and researchers, in addition to the public.
  2. A quick check on preservation needs: this digital stuff really sucks. Film and stone are much longer-lasting media. But digital gives us much better and broader access. So how do we preserve this fragile stuff that needs migration every 3-5 years? Well, you need multiple copies, and you need to save the originals because they should be whole. Checksums are validity checks on files to make sure you have all the bits. Migration covers not only the content (the files) but also all the technology, systems, and storage you use. And doing this with big media files is hard, time-consuming, and subject to damage and errors.
  3. We were generously awarded a grant to see if we could build a media preservation DAM system using open source software. In particular we wanted to test the Hydra tech stack, see what it would take to build, what it would take for others to install (better documentation) and really see how to integrate with an open source community.