Policy-based Data Management

Data grids are an emerging technology that enables the formation of sharable collections from data distributed across multiple storage resources. The integrated Rule Oriented Data System (iRODS) is a data grid developed by the DICE Center at UNC-CH. The iRODS data grid enforces management policies that control properties of the collection. Examples of policies include retention, disposition, distribution, replication, metadata extraction, time-dependent access controls, data processing, data redaction, and integrity checking. Policies can automate administrative functions (file migration and replication) and validate assessment criteria (authenticity, integrity, chain of custody). iRODS is used to build data sharing environments, digital libraries, and preservation environments. At UNC-CH, the iRODS data grid supports the Carolina Digital Repository, the LifeTime Library for the School of Information and Library Science, data grids for the Renaissance Computing Institute (RENCI), collaborations within North Carolina, and both national and international data sharing. At RENCI, the TUCASI data grid supports shared collections between UNC-CH, Duke, and NCSU. The RENCI data grid is federated with ten other data grids, including the National Climatic Data Center, the Texas Advanced Computing Center data grid, and the Ocean Observatories Initiative data grid. International applications include the CyberSKA Square Kilometer Array for radio astronomy and the French National Institute of Nuclear and Particle Physics (IN2P3). The assembled collections may contain hundreds of millions of files and petabytes of data. A specific goal is the integration of institutional repositories with the national data infrastructure that is being assembled under the NSF DataNet program. The software is available as an open source distribution from http://irods.diceresearch.org.

Transcript

  • 1. Policy-Based Data Management
    • Arcot Rajasekar, Mike Conway, Reagan W. Moore
    • University of North Carolina at Chapel Hill
  • 2. Topics
    • Policy-based data management
    • Integrated Rule-Oriented Data System
      • DataNet Federation Consortium
      • SILS’s LifeTime Library
    • Simple Demonstration
  • 3. Expectations
    • Data collection sizes will increase
      • Now petabytes, soon exabytes
        • 1 PB/year = 32 MB/sec
        • 1 PB/day = 11.6 GB/sec
    • Moving data is becoming a problem
      • Need to keep data in a centralized location
        • Store once
        • Use from anywhere
      • Need to do data analyses at the storage system
    • Digital legacy is becoming a frightful nightmare
      • How to control our digital lifecycle
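
The two rate conversions on slide 3 can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a 365-day year, neither of which the slide states; `sustained_rate` is an illustrative helper name, not part of any tool named in the talk.

```python
# Sustained transfer rate needed to move a given volume in a given time.
# Assumptions (not stated on the slide): decimal units and a 365-day year.
PB = 10**15
GB = 10**9
MB = 10**6

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def sustained_rate(num_bytes: float, seconds: float) -> float:
    """Return the average rate in bytes per second."""
    return num_bytes / seconds


pb_per_year_mb_s = sustained_rate(PB, SECONDS_PER_YEAR) / MB
pb_per_day_gb_s = sustained_rate(PB, SECONDS_PER_DAY) / GB

print(f"1 PB/year = {pb_per_year_mb_s:.1f} MB/sec")
print(f"1 PB/day  = {pb_per_day_gb_s:.1f} GB/sec")
```

Under these assumptions the values come out at about 31.7 MB/sec and 11.6 GB/sec, which matches the slide's rounded figures.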
  • 4. Applications
    • Digital libraries
      • Personal Libraries
      • email, photos, music, video, ebooks, documents, maps
      • social networking content
      • Need for continuous indexing of contents
    • Scientific/Office data collections
      • Extraction of features from data sets
      • Discovery and Access
      • Creation of derived data products
      • Sharing with collaborators and other users
      • Keeping it for reference & repurpose
    • …… .
  • 5. Policy-based Data Sharing
    • Diagram labels: Client; two Providers, each with iRODS controlled workflows and Storage; Shared Collection
    • Consensus on policies and procedures controls the shared data
  • 6. Overview of iRODS Architecture
    • User w/ client: can search, access, add, and manage data & metadata
    • Access distributed data with a Web-based browser, the iRODS GUI, or command-line clients
    • iRODS Middleware
    • iRODS Data Server: disk, tape, etc.
    • iRODS Metadata Catalog: track information
    • iRODS Rule Engine: track policies
  • 7. Data Virtualization
    • Data grid layers: Access Interface → Policy Enforcement Points → Standard Micro-services → Standard I/O Operations → Storage Protocol → Storage System
    • Map from the actions requested by the client to multiple policy enforcement points
    • Map from policy to standard micro-services
    • Map from micro-services to standard POSIX I/O operations
    • Map standard I/O operations to the protocol supported by the storage system
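
The chain of mappings on slide 7 can be pictured as a set of lookup tables stacked on top of a storage driver. The following is a minimal sketch under stated assumptions, not the iRODS implementation: every name in it (`LocalDiskDriver`, `msi_store`, `msi_checksum`, `msi_log`, `POLICIES`, `ACTIONS`, `perform`, `demo`) is hypothetical, and a local directory stands in for the storage system.

```python
import hashlib
import os
import tempfile


class LocalDiskDriver:
    """Storage protocol layer: maps standard I/O calls onto one storage system."""

    def __init__(self, root):
        self.root = root

    def write(self, name, data):
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(data)

    def read(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()


# Micro-services: small procedures that use only the standard I/O operations.
def msi_store(ctx):
    ctx["driver"].write(ctx["name"], ctx["data"])


def msi_checksum(ctx):
    stored = ctx["driver"].read(ctx["name"])
    ctx["checksum"] = hashlib.sha256(stored).hexdigest()


def msi_log(ctx):
    ctx["audit"].append(f"put {ctx['name']} sha256={ctx['checksum']}")


# Each policy enforcement point maps to an ordered list of micro-services.
POLICIES = {
    "pre_put": [],
    "put": [msi_store],
    "post_put": [msi_checksum, msi_log],
}

# Each client action maps to several policy enforcement points.
ACTIONS = {"put": ["pre_put", "put", "post_put"]}


def perform(action, ctx):
    """Run every micro-service attached to every enforcement point of an action."""
    for pep in ACTIONS[action]:
        for micro_service in POLICIES[pep]:
            micro_service(ctx)
    return ctx


def demo():
    """Store one small file through all layers and return the resulting context."""
    with tempfile.TemporaryDirectory() as root:
        ctx = {"driver": LocalDiskDriver(root), "name": "a.txt",
               "data": b"hello", "audit": []}
        return perform("put", ctx)
```

In this sketch, changing a policy means editing the `POLICIES` table, and supporting another storage system means writing another driver with the same `read` and `write` methods; the client-facing `perform` call stays the same in both cases.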
  • 8. iRODS Distributed Data Management
  • 9. iRODS Extensible Infrastructure
    • Clients – specific to discipline and life cycle state
    • Policies – specific to discipline
    • Procedures – specific to discipline
    • Remaining infrastructure is generic
      • Network transport – Parallel I/O
      • Authentication / Authorization – Single Sign-on
      • Distributed storage access – Protocol mediation
      • Remote execution – Deferred/periodic
      • Metadata management – Catalog
      • Message passing – Debugging/progress
      • Rule engine – Policy control
  • 10. Generic Capabilities
    • Replication
    • Registration of files into the data grid
    • Synchronization of remote directory
    • Managed file transport (iDrop)
    • Automated metadata extraction
    • Queries on metadata, tags
    • Server-side workflows (loop over result sets)
    • Parallel I/O streams & RBUDP transport
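
Two of the capabilities on slide 10, metadata queries and server-side workflows that loop over result sets, combine naturally. The sketch below illustrates the idea with an in-memory list standing in for the metadata catalog. It is an assumption-laden toy, not the iRODS query or rule interface: `DataObject`, `query`, `replicate`, `replicate_matching`, the resource names, and the paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One catalog entry: logical path, metadata tags, and replica locations."""
    path: str
    metadata: dict = field(default_factory=dict)
    replicas: list = field(default_factory=list)


def query(catalog, **conditions):
    """Yield objects whose metadata matches every attribute=value condition."""
    for obj in catalog:
        if all(obj.metadata.get(k) == v for k, v in conditions.items()):
            yield obj


def replicate(obj, resource):
    """Record a replica on `resource` unless one is already there."""
    if resource in obj.replicas:
        return False
    obj.replicas.append(resource)
    return True


def replicate_matching(catalog, resource, **conditions):
    """Workflow: loop over a query result set and replicate each hit."""
    return [obj.path for obj in query(catalog, **conditions)
            if replicate(obj, resource)]


catalog = [
    DataObject("/zone/home/alice/a.jpg", {"kind": "photo"}, ["disk1"]),
    DataObject("/zone/home/alice/b.txt", {"kind": "notes"}, ["disk1"]),
    DataObject("/zone/home/alice/c.jpg", {"kind": "photo"}, ["disk1", "tape1"]),
]

# Only a.jpg is copied: b.txt does not match, and c.jpg is already on tape1.
copied = replicate_matching(catalog, "tape1", kind="photo")
```

Running the loop next to the catalog means only the list of affected paths has to travel back to the client, rather than the full result set.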
  • 11. Policies
    • Retention, disposition, distribution, arrangement
    • Authenticity, provenance, description
    • Integrity, replication, synchronization
    • Deletion, trash cans, versioning
    • Archiving, staging, caching
    • Authentication, authorization, redaction
    • Access, approval, IRB, audit trails, report generation
    • Assessment criteria, validation
    • Derived data product generation, format parsing
    • Federation of independent data grids
  • 12. Highly Controlled Environment
    • All accesses are authenticated
      • GSI / Kerberos / Challenge-response / Shibboleth
    • All operations are authorized
      • ACLs on files, storage
      • User groups, storage groups
    • All policies evaluate a constraint
      • Constraints based on persistent state information and session information
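
Slide 12's statement that every policy evaluates a constraint over persistent state and session information can be made concrete with a time-dependent access control, one of the policy types the abstract lists. This is a hedged sketch only: the `Session` and `ObjectState` records, their fields, and the embargo rule itself are assumptions chosen for illustration, not iRODS data structures.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Session:
    """Session information: who is asking, and when."""
    user: str
    groups: frozenset
    today: date


@dataclass(frozen=True)
class ObjectState:
    """Persistent state information kept in the catalog for one file."""
    owner: str
    read_groups: frozenset
    embargo_until: date


def may_read(state, session):
    """Owners can always read; group members only once the embargo has passed."""
    if session.user == state.owner:
        return True
    in_group = bool(state.read_groups & session.groups)
    return in_group and session.today >= state.embargo_until


state = ObjectState("alice", frozenset({"lab"}), date(2012, 1, 1))
early = Session("bob", frozenset({"lab"}), date(2011, 6, 1))
late = Session("bob", frozenset({"lab"}), date(2012, 6, 1))

print(may_read(state, early), may_read(state, late))
```

The constraint is a pure function of the two records, so the same check can be evaluated at any policy enforcement point without extra context.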
  • 13. Applications
    • Data grids – PB-size distributed collections
      • Astronomy – NOAO, CyberSKA, LSST
      • High Energy Physics – BaBar, KEK
      • Earth Systems – NASA (MODIS data set)
      • Australian Research Collaboration Service
      • Plant biology – iPlant Collaborative
    • Institutional repositories
      • Carolina Digital Repository
    • Libraries
      • Texas Digital Libraries
      • Seismology – Southern California Earthquake Center
    • Archives
      • Ocean Observatories Initiative
  • 14. LifeTime Library
    • Student personal digital library
      • Manage course material
      • Photograph collections
      • Video collections
      • Reference collection (soft links to information)
    • Choose favorite access mechanism
      • iDrop (synchronize local directory, share data)
      • iDrop cloud browser (add tags, metadata)
      • Unix tools (execute personal rules)
  • 15. LifeTime Library
    • Data grid layers: Access Interface, Policy Enforcement Points, Standard Micro-services, Standard I/O Operations, Storage Protocol, Storage System
    • Multiple clients
    • Policies to
      • automate replication
      • turn on versioning
      • set audit trails
      • enforce strict ACLs
      • replicate metadata
    • Micro-services to
      • create thumbnails
      • extract metadata
      • assign organization
  • 16. DataNet Federation Consortium
    • Implement national data grid
      • Federate existing discipline-specific data management systems to enable national research collaborations
    • Enable collaborative research on shared data collections
      • Manage collection life cycle as the user community broadens
    • Integrate “live” research data into education initiatives
      • Enable student research participation through control policies
    • Cyber-infrastructure Partners:
      • Univ. of North Carolina, Chapel Hill
      • Univ. of California, San Diego
      • Arizona State University
      • Drexel University
      • Duke University
      • University of Arizona
      • University of South Carolina
    • Science and Engineering Initiatives:
      • Ocean Observatories Initiative
      • the iPlant Collaborative
      • CUAHSI
      • CIBER-U
      • Odum Social Science Institute
      • Temporal Dynamics of Learning Center
    • National Science Foundation Cooperative Agreement: OCI-0940841
    • Policy-based data management, Collection Life Cycle (diagram labels): Project, Shared Collection, Processing Pipeline, Digital Library, Reference Collection, Federation
  • 17. iRODS - Open Source Software
    • Community driven software development
      • Focus on features required by user communities
      • Focus on bug-free software
      • Focus on highly reliable software
      • Focus on highly extensible software
      • Approximately 3-4 software releases per year
    • Distributed under a BSD license
      • International collaborations on software development
      • IN2P3 (France), SHAMAN (UK), ARCS (Australia), Academia Sinica (Taiwan)
    • Highly Successful
  • 18. iRODS - Open Source Software
    • Reagan W. Moore
    • http://irods.diceresearch.org
    • NSF OCI-0940841 “DataNet Federation Consortium”
    • NSF OCI-1032732 “SDCI Data Improvement: Improvement and Sustainability of iRODS Data Grid Software for Multi-Disciplinary Community Driven Application”
    • NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”
    • NSF SDCI-0721400 “Data Grids for Community Driven Applications”