Slide 1 - Tetherless World Constellation - Tetherless World Wiki

  • 1. Persistent Archives: Long-term sustainability of data based on policy and data virtualization
    • Arcot (Raja) Rajasekar, University of North Carolina at Chapel Hill, [email_address]
    • NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype” (2008-2012)
    • NSF SDCI 0721400 “Data Grids for Community Driven Applications” (2007-2010)
  • 2. Topics
    • Data Grids for Preservation & Sharing
      • Brief Intro
      • Why are they suitable for deploying scalable persistent archives?
      • iRODS as an exemplar Data Grid
    • Two Examples:
      • DIGARCH: Preservation of Multi-media Collection
      • TPAP: NARA Testbed of Persistent Archives
  • 3. Data Preservation Challenges
    • Data driven research generates massive data collections
      • Data sources are remote and distributed
      • Collaborators are remote
      • Wide variety of data types: observational data, experimental data, simulation data, real-time data, office products, web pages, multi-media
    • Collections contain millions of files
      • Logical arrangement is needed for distributed data
      • Discovery requires the addition of descriptive metadata
    • Long-term retention requires migration of output into a reference collection
      • Automation of administrative functions is essential to minimize long-term labor support costs
      • Creation of representation information for describing file context
      • Validation of assessment criteria (authenticity, integrity)
  • 4. What is a Data Grid?
    • Geographically distributed heterogeneous resources that are managed autonomously
    • Active with data resources being added and removed
    • Users like to share/discover data using contextual information
  • 5. What is a Data Grid?
    • Data Grid – a network of data resources that is presented as a single, accessible collection of data.
    • Data Grid – provisions for associating metadata & annotations
    • Data Grid – enables discovery, access & server-side processing
    • Metadata-based data virtualization
    • Policy Virtualization
  • 6. Why Data Grids?
    • Data Virtualization: Shared Collections Concept
      • Common Abstract Name Spaces: physical-independence
        • Data objects and collections : logical names
        • Users/collaborators : global user name space
        • Shared resources & uniform access : location & protocol transparency
        • Common typing conventions for objects & actions
      • Provide technology independence
        • Platform & Vendor-independence
        • High scalability
      • Need discovery metadata
        • Descriptive attributes for each name space
        • System & Domain-specific information
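The logical name space idea above can be illustrated with a small sketch. This is only a toy in-memory model, not how any particular data grid stores its catalog; the class names, resource names, and paths (`Catalog`, `disk-unc`, `/grid/home/project/...`) are all hypothetical. It shows one logical name resolving to several physical replicas, and discovery through descriptive metadata instead of physical location.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a data object (hypothetical fields)."""
    resource: str       # storage resource name, e.g. a disk or tape system
    physical_path: str  # location on that resource


@dataclass
class DataObject:
    logical_name: str
    replicas: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Toy in-memory stand-in for a data grid metadata catalog."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_name, resource, physical_path, **metadata):
        """Record a physical replica and its metadata under a logical name."""
        obj = self._objects.setdefault(logical_name, DataObject(logical_name))
        obj.replicas.append(Replica(resource, physical_path))
        obj.metadata.update(metadata)
        return obj

    def resolve(self, logical_name):
        """Return the physical replicas behind a logical name."""
        return list(self._objects[logical_name].replicas)

    def find(self, **criteria):
        """Discover logical names whose metadata matches all criteria."""
        return sorted(
            name for name, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


catalog = Catalog()
catalog.register("/grid/home/project/scan001.tif", "disk-unc", "/vault/a/17.dat", data_type="image")
catalog.register("/grid/home/project/scan001.tif", "tape-ucsd", "/hpss/x/9932", data_type="image")
catalog.register("/grid/home/project/notes.txt", "disk-unc", "/vault/a/18.dat", data_type="text")

images = catalog.find(data_type="image")
copies = catalog.resolve("/grid/home/project/scan001.tif")
```

Users only ever see the logical name; the physical paths can change (for example during a technology migration) without breaking references.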
  • 7. Why Data Grids?
    • Policy Virtualization: Automate Operations
      • System-centric Policies & Obligations:
        • Manage retention, disposition, distribution, replication, integrity, authenticity, chain of custody, access controls, representation information, descriptive information requirement, logical arrangement, audit trails, authorization, authentication
      • Domain-specific Policies:
        • Identification & Extraction of Metadata
        • Ingestion Control for Provenance Attribution
        • Processing of Data on Ingestion
          • Creation of multi-resolution images, type-identification, anonymization,…
        • Processing of Data on Access
          • IRB Approval for data access, Data sub-setting, Merging of multiple images, conversion, redaction, …
  • 8. Preservation is an Integral Part of the Data Life Cycle
    • Organize project data into a shared collection
    • Publish data in a digital library for use by other researchers
    • Enable data-discovery & data-driven analyses
    • Preserve reference collection for use by future research initiatives
    • Associate new collection against prior state-of-the-art data
    • Define & Enforce Policies for long-term management and curation
  • 9. Exemplar Data Grid: iRODS
    • Integrated Rule-Oriented Data System
    • It is a data grid system – data virtualization
      • A distributed file system, based on a client-server architecture.
      • Allows users to access files seamlessly across a distributed environment, based upon their attributes & GUID rather than locations
      • It replicates, syncs and archives data, connecting heterogeneous resources in a logical and abstracted manner.
    • It is a server-side workflow system – policy virtualization
      • Actions are coded as functions/scripts (micro-services)
      • Micro-services can be chained into Policies (rules)
      • Rules are interpreted by a distributed rule engine
      • The chains can be triggered on an event and condition (rules)
      • Micro-services communicate through parameters, shared contexts, and out-of-band message queues.
      • Open Policy and Uniform Access
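The relationship between micro-services, rules, events, and conditions can be sketched as follows. This is a simplified illustration in Python, not the iRODS rule language or its engine; `RuleEngine`, `Rule`, the `post_ingest` event name, and the two micro-services are assumptions made for the example. Micro-services communicate here only through a shared context dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A micro-service is a plain function that reads and updates a shared context.
MicroService = Callable[[Dict], None]


@dataclass
class Rule:
    event: str                         # name of the triggering event
    condition: Callable[[Dict], bool]  # guard evaluated against the context
    chain: List[MicroService]          # micro-services run in order


class RuleEngine:
    def __init__(self):
        self.rules: List[Rule] = []

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def fire(self, event: str, context: Dict) -> Dict:
        """Run every rule registered for the event whose condition holds."""
        for rule in self.rules:
            if rule.event == event and rule.condition(context):
                for micro_service in rule.chain:
                    micro_service(context)
        return context


def extract_metadata(ctx):
    # Hypothetical micro-service: derive metadata from the file name.
    ctx["metadata"] = {"extension": ctx["path"].rsplit(".", 1)[-1]}


def log_audit(ctx):
    # Hypothetical micro-service: append an entry to an audit trail.
    ctx.setdefault("audit", []).append("ingested " + ctx["path"])


engine = RuleEngine()
engine.add_rule(Rule(
    event="post_ingest",
    condition=lambda ctx: ctx["path"].endswith(".tif"),
    chain=[extract_metadata, log_audit],
))

result = engine.fire("post_ingest", {"path": "/grid/coll/scan001.tif"})
skipped = engine.fire("post_ingest", {"path": "/grid/coll/readme.txt"})
```

The second call matches the event but fails the condition, so no micro-service runs and the context comes back unchanged.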
  • 10. Policy/Rule Examples
    • Automatically extract metadata from files of certain types and store it in a domain-centric metadata catalog
    • Notify the owner if a file's metadata is missing N days after ingestion
    • Automatically “audit” derived datasets – provenance gathering
    • Periodically check for integrity of files in a collection and repair them if needed/possible
    • Allow only users with certificate-based login to access files from a collection – multi-lock control
    • Automatically migrate a file to “slow” storage location after N days of non-use – storage management
    • Automatically replicate a file added to a collection to 3 geo-distributed sites – replication strategies
    • When too many users from site A are using a file from site B, keep a copy in site A – data placement strategies
    • Send a notification when a file with a certain type of data is ingested.
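As one illustration of the policies listed above, the periodic integrity check with repair might look like the sketch below. It assumes a checksum was recorded at ingestion and that replicas are available as byte strings keyed by a hypothetical site name; a real deployment would read from storage resources and run on a schedule.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest used as the recorded integrity value."""
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(replicas: dict, expected: str) -> dict:
    """Check each replica against the recorded checksum and repair bad ones.

    replicas maps a (hypothetical) site name to the bytes stored there.
    Returns a report of the action taken per site.
    """
    good = [site for site, data in replicas.items() if checksum(data) == expected]
    report = {}
    for site in list(replicas):
        if site in good:
            report[site] = "ok"
        elif good:
            replicas[site] = replicas[good[0]]  # overwrite with a verified copy
            report[site] = "repaired"
        else:
            report[site] = "unrecoverable"      # no verified copy is left
    return report


original = b"interview transcript"
recorded = checksum(original)
replicas = {"site-a": original, "site-b": b"interview tr@nscript", "site-c": original}
report = verify_and_repair(replicas, recorded)
```

Repair is only possible while at least one replica still matches the recorded checksum, which is one reason the replication policy and the integrity policy are usually applied together.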
  • 11. Overview of iRODS Architecture
    • User: can search, access, add and manage data & metadata (access data with a Web-based browser, the iRODS GUI, or command-line clients)
    • iRODS Data Server: disk, tape, etc.
    • iRODS Metadata Catalog: tracks data
    • iRODS Rule Engine: tracks policies
  • 12. integrated Rule-Oriented Data System (architecture diagram)
    • Components shown: Client Interface, Admin Interface, Current State, Rule Invoker, Rule Engine, Rule Base, Service Manager, Micro Service Modules (Metadata-based Services), Micro Service Modules (Resource-based Services), Resources, Confs, Metadata Persistent Repository, Rule Modifier Module, Config Modifier Module, Metadata Modifier Module, Consistency Check Modules
  • 13. iRODS Components
    • Rule Engine
    • Execution Control
    • Messaging System
    • Execution Engine
    • Virtualization
    • Server Side Workflow
    • Persistent State information
    • Scheduling
    • Policy Management
    • Data Transport
    • Metadata Catalog
  • 14. iRODS Applications
    • Institutional repositories
      • Carolina Digital Repository at University of North Carolina
      • Duke Medical Archive
    • Regional data grids
      • RENCI data grid linking 7 engagement centers in North Carolina
      • HASTAC data grid linking humanities collections across 9 UC campuses
    • National data grids
      • NARA Transcontinental Persistent Archive Prototype
      • NSF Temporal Dynamics of Learning Center data grid
      • NSF Ocean Observatories Initiative data grid
      • NASA Center for Computational Sciences archive
      • JPL Planetary Data System data grid
    • International data grids
      • Australian Research Collaboration Service - ARCS
      • French National Library
  • 15. User Interfaces
    • C library calls - Application level
    • Unix shell commands - Scripting languages
    • Java I/O class library (JARGON) - Web services
    • SAGA - Grid API
    • Web browser (Java-python) - Web interface
    • Windows browser - Windows interface
    • WebDAV - iPhone interface
    • Fedora digital library middleware - Digital library middleware
    • DSpace digital library - Digital library services
    • Parrot - Unification interface
    • Kepler workflow - Scientific workflow
    • FUSE user-level file system - Unix file system
  • 16. Case 1: NARA TPAP
    • National Archives Electronic Records Administration Research Program (funded thru NSF)
    • Transcontinental Persistent Archive Prototype
      • Use federation of data grid technology to build a preservation environment
      • Conduct research on preservation concepts
        • Infrastructure independence
        • Enforcement of preservation properties
        • Validation of assessment criteria
        • Automation of administrative processes
        • Show technology migration
      • Demonstrate preservation on selected NARA digital holdings
  • 17. National Archives and Records Administration Transcontinental Persistent Archive Prototype
    • Federation of Seven Independent Data Grids, each with its own MCAT
      • U Md
      • UCSD
      • NARA I
      • U NC
      • Georgia Tech
      • NARA II
      • Rocket Center
    • Extensible Environment, can federate with additional research and education sites.
    • Each data grid uses different vendor products.
  • 18. ISO MOIMS-repository assessment criteria
    • We are developing 150 rules that implement the assessment criteria
    • Examples:
      • 90: Verify descriptive metadata and source against SIP template and set SIP compliance flag
      • 91: Verify descriptive metadata against semantic term list
      • 92: Verify status of metadata catalog backup (create a snapshot of metadata catalog)
      • 93: Verify consistency of preservation metadata after hardware change or error
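A check in the spirit of rule 91 (verifying descriptive metadata against a semantic term list) could be sketched as below. The function name, the attribute names, and the allowed terms are hypothetical; the actual rules are written for the data grid's rule engine, and this Python version only illustrates the kind of validation involved.

```python
def verify_against_term_list(metadata: dict, term_list: dict) -> list:
    """Return a list of problems; an empty list means the record complies.

    term_list maps each controlled attribute to its set of allowed values.
    """
    problems = []
    for attribute, allowed in term_list.items():
        if attribute not in metadata:
            problems.append(f"missing attribute: {attribute}")
        elif metadata[attribute] not in allowed:
            problems.append(f"unknown term for {attribute}: {metadata[attribute]}")
    return problems


# Hypothetical controlled vocabulary for two descriptive attributes.
term_list = {"language": {"eng", "spa"}, "genre": {"interview", "lecture"}}

compliant = verify_against_term_list({"language": "eng", "genre": "interview"}, term_list)
flagged = verify_against_term_list({"language": "english"}, term_list)
```

Returning the list of problems, instead of a bare pass/fail flag, leaves a record that can be written to an audit trail when a record fails the assessment.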
  • 19. Case Study 2: DIGARCH
    • Preservation of Video Files
      • By integrating a video production pipeline with a preservation workflow
  • 20. Digital Preservation Lifecycle Management
    • Building a demonstration prototype for the preservation of large-scale multi-media collections
    • San Diego Supercomputer Center, Univ. of California, San Diego
      • Arcot Rajasekar (PI), Richard Marciano, Reagan Moore, Chien-Yi Hou, Francine Berman (co-PI)
    • UCSD-TV, Univ. of California, San Diego
      • Lynn Burstan (co-PI), Steve Anderson, Mellisa McEwen, Bee Bornheimer
    • UCTV-Berkeley
      • Harry Kreisler
    • UCSD Libraries, Univ. of California, San Diego
      • Brian Schottlaender (co-PI), Luc DeClerck, Brad Westbrook, Arwen Hutt, Ardys Kozbial, Chris Frymann, Vivian Chu
  • 21. Our Proposal
    • Design and Development of a Prototype for Preserving Digital Video Collections
      • Management of Authenticity, Integrity, and Infrastructure Independence
      • Preservation Life-cycle meshing seamlessly with the content production
        • Minimal impact to production life-cycle
      • Workflow system that automates accession, description, organization and preservation of video and associated contents
        • Metadata definition, extraction and ingestion
        • Long-term retention and technology migration
      • At risk Collection: ‘Conversation with History’ video collection
        • Video, audio, text transcripts, web-based material
        • Databases of administrative and descriptive metadata
        • Derived products
  • 22. Exemplar Collection
    • Conversation with History - UCTV - from 1982
      • Hour-long interviews with internationally prominent individuals
      • Institute of International Affairs, UC Berkeley
      • Available in 15 million homes nationwide via UCTV
      • 40 program segments annually
      • Web-site for downloading older segments
      • Among UCTV's most accessed on-line programs
      • Programs used in educational material
  • 23. Workflow diagram
    • Lanes: TV production Lifecycle, Metadata Definition & Capture Workflow, Persistent Archival Workflow
    • Step labels: Pre-Interview, Interview, Transcription, Post-Interview, Metadata Analysis, Schema Generation, SIP/AIP Definitions, Capture Scripts, Metadata DB, Capture Interview, Metadata Capture, Make SIPS, Aggregate AIP/Verify, Store/Replicate/Preserve, Broadcast/Transfer, Metadata Validation
  • 24. Preservation Processes
    • Generation of a Globally Unique Identifier (GUID) for each interview session
    • Retrieval of the original video session
    • Retrieval of each of the segments associated with a video session
    • Retrieval of the transmission scripts for each video segment
    • Retrieval of the material published on the Web page for each segment
    • Processing of each Web page to redirect internal URLs into handles within the preservation logical name space for digital entities
    • Retrieval of the rights statement for each session
    • Retrieval of the header associated with each video segment
    • Retrieval of the trailer associated with each video segment
    • Retrieval of the administrative, structural, and descriptive metadata stored in the FileMaker Pro database
    • Retrieval of the annotations stored with the Web pages
    • Specification of Preservation Metadata for AIP
    • Creation of AIPs for the above material
    • Creation of containers for physically aggregating material for storage in the preservation environment
    • Storage of containers within the preservation environment
    • Specification of preservation management metadata such as access controls, storage location, and replication
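Several of the steps above (GUID generation, AIP creation, and aggregation into a container) can be sketched together. This is a minimal illustration under stated assumptions: the manifest layout, the `manifest.json` file name, and the use of a tar archive as the container are choices made for the example and are not taken from the project itself.

```python
import hashlib
import io
import json
import tarfile
import uuid


def build_aip(session_files: dict, preservation_metadata: dict) -> bytes:
    """Package one interview session as a tar container with a manifest.

    session_files maps a file name to its bytes (video segments, scripts,
    rights statement, and so on).
    """
    guid = str(uuid.uuid4())  # globally unique identifier for the session
    manifest = {
        "guid": guid,
        "preservation_metadata": preservation_metadata,
        "files": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in session_files.items()
        },
    }
    members = dict(session_files)
    members["manifest.json"] = json.dumps(manifest, indent=2).encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as container:
        for name, data in members.items():
            info = tarfile.TarInfo(name=f"{guid}/{name}")
            info.size = len(data)
            container.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Placeholder content standing in for real session material.
aip = build_aip(
    {"segment1.mpg": b"video bytes", "rights.txt": b"rights statement"},
    {"replicas": 3},
)

# Read the container back to confirm what was stored.
with tarfile.open(fileobj=io.BytesIO(aip)) as container:
    names = sorted(m.name.split("/", 1)[1] for m in container.getmembers())
```

Storing per-file checksums in the manifest lets a later integrity check validate the contents of the container without any external record.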
  • 26. Utility of Data Grids
    • Logical Universal Identifier
    • Uniform Access to Distributed Data
    • Data Discovery through Metadata
    • Open Policy for Provenance & Data Management
    • Rule-based workflow for Metadata Extraction & Analysis
    • Audit trails to capture provenance
    • Provides a Platform for Data Publication
    • Provides a means to Uniquely identify datasets
    • Can be used to enforce metadata requirement - policy
    • Cross-referencing provenance through workflow-derived data
    • Provides a means to perform data attribution