Data Management Systems

  • Data Management Systems
    • Richard Marciano, Reagan W. Moore, Wayne Schroeder, Arcot Rajasekar, Mike Wan
    • San Diego Supercomputer Center
    • [email_address]
    • http://www.sdsc.edu/srb/
  • Key Data Management Standards
    • Data sharing - data grids
      • Storage Resource Broker - SRB
        • Enables each community to use their preferred API
        • Supports federation between data grids
      • Integrated Rule Oriented Data System - iRODS
        • Automates execution of management policies
    • Data publication - digital libraries
      • DSpace and Fedora support metadata standards
        • OAI-PMH
        • METS
    • Data preservation - persistent archives
      • Integrated Rule Oriented Data System - iRODS
        • RLG/NARA Trusted Repository Assessment criteria
        • OAIS
        • ISADG metadata
        • DFDL data format characterization
  • Data Standards
    • What are your plans for supporting these?
      • International collaboration on the development of iRODS.
      • Integration with other data management solutions including LOCKSS, IBP/Lstore, DataVerse
    • In your opinion, which area most lacks standardization (either because standards are absent or because existing standards are insufficient)?
      • Need progress in representation information
      • Need DFDL standard
  • Data Management Standards
    • Which (future) standards are seen as important?
      • Virtual Machine technology
      • Workflow virtualization
    • How close is your implementation to the (published) standard(s)?
      • Most standards provide access methods, but not data management
      • Port access standards as they become heavily used
    • Which open source extensions to the employed standards did you have to make, and why?
      • Generic interface for manipulating structured information
        • Need the ability to query a structured information resource to read and write internal information
      • Generic interface for characterizing data management policies
        • Need the ability to virtualize management policies across administrative domains
  • Grid Infrastructure
    • Interoperation Challenges
      • Virtualization of trust
        • Support for multiple authentication and authorization mechanisms
      • Data management virtualization
        • Characterization of management policies as server-side workflows
      • Virtualization of workflows
        • Ability to migrate workflows between client-side and server-side
    • What do you suggest are the best ways to tackle these problem areas?
      • Bottom-up interoperability development by the principal software system developers
    • Roadmap document
      • Wiki -
    • What is your funding status in a mid- and long-term perspective?
      • Sustained funding for the next three years (NSF, NARA)
  • William Charles Wentworth (1790-1872)
    • Noted Australian explorer and statesman
    • Ancestry
      • 8,979 ancestors
      • 84,628 descents from Charlemagne
    • Cousins
      • Queen Elizabeth: 10th cousin, 3 times removed
      • George Washington: 11th cousin, 3 times removed
      • Reagan Wentworth Moore: 10th cousin, 4 times removed
  • Data Grid Evolution
    • Data grids
      • Infrastructure independence
      • Data sharing through data and trust virtualization
        • SRB - Storage Resource Broker
    • Rule-based data grids
      • Automation of management policies
      • Management virtualization
      • Open source software
        • iRODS - integrated Rule-Oriented Data System
  • Data Management Applications
    • Data grids
      • Share data - organize distributed data as a collection
    • Digital libraries
      • Publish data - support browsing and discovery
    • Persistent archives
      • Preserve data - manage technology evolution
    • Real-time sensor systems
      • Federate sensor data - integrate across sensor streams
    • Workflow systems
      • Analyze data - integrate client- & server-side workflows
  • Generic Infrastructure
    • Data grids organize distributed data into shared collections
      • Persistent name spaces for files, users, storage
      • Collection attributes
        • Provenance, descriptive, system metadata
    • Data grids manage heterogeneous storage systems
      • Standard operations across file systems, tape archives, object ring buffers
      • Enable technology evolution
        • When new technology becomes available, both the old and new systems can be integrated
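As a rough illustration of persistent name spaces, the sketch below maps a logical file name to its physical replicas, so a file can move to new storage without changing the name users see. The class and method names (`LogicalNamespace`, `register`, `migrate`, `resolve`) and all paths are hypothetical; this is not the SRB or iRODS API.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a file (hypothetical structure)."""
    resource: str   # name of the storage system holding the copy
    path: str       # location inside that storage system


@dataclass
class LogicalNamespace:
    """Maps persistent logical names to physical replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, resource, path):
        # Add a physical copy under a new or existing logical name.
        self.entries.setdefault(logical_name, []).append(Replica(resource, path))

    def migrate(self, logical_name, old_resource, new_resource, new_path):
        # Replace the copy on old_resource with one on new_resource.
        # The logical name itself is untouched.
        self.entries[logical_name] = [
            Replica(new_resource, new_path) if r.resource == old_resource else r
            for r in self.entries[logical_name]
        ]

    def resolve(self, logical_name):
        # Return every physical location recorded for the logical name.
        return [(r.resource, r.path) for r in self.entries[logical_name]]


ns = LogicalNamespace()
ns.register("/zone/home/survey/image001.fits", "old-tape", "/vol7/a1b2")
ns.migrate("/zone/home/survey/image001.fits", "old-tape", "new-disk", "/data/a1b2")
locations = ns.resolve("/zone/home/survey/image001.fits")
```

After the migration, `locations` points at the new storage system while the logical name is the same one that was registered.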
  • Using a Data Grid - in Abstract
    • User asks for data from the data grid
    • The data is found and returned
      • Where & how details are hidden
    • [Figure: the user asks the data grid for data; the data is delivered]
  • Using a Data Grid - Details
    • User asks for data
    • Data request goes to iRODS Server
    • Server looks up information in catalog
    • Catalog tells which iRODS server has data
    • 1st server asks 2nd for data
    • The 2nd iRODS server applies rules
    • [Figure: two iRODS servers and the metadata catalog database]
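The request path on this slide can be sketched as follows. This is a simplified in-process model, not iRODS code: the `Catalog` and `Server` classes, their methods, and the server and file names are all assumptions made for illustration, and the rule step is reduced to a log entry.

```python
class Catalog:
    """Stand-in for the metadata catalog: logical name -> name of holding server."""

    def __init__(self):
        self.locations = {}

    def lookup(self, logical_name):
        return self.locations[logical_name]


class Server:
    def __init__(self, name, catalog, servers):
        self.name = name
        self.catalog = catalog
        self.servers = servers   # shared registry: server name -> Server
        self.storage = {}        # data held locally
        self.log = []            # rules applied on this server
        servers[name] = self

    def apply_rules(self, logical_name, user):
        # Placeholder for the rule step; a real grid would enforce policies here.
        self.log.append(("access", logical_name, user))

    def read_local(self, logical_name, user):
        # The server that holds the data applies its rules before returning it.
        self.apply_rules(logical_name, user)
        return self.storage[logical_name]

    def get(self, logical_name, user):
        # Look up the holder in the catalog, then serve locally or ask that server.
        holder = self.catalog.lookup(logical_name)
        target = self if holder == self.name else self.servers[holder]
        return target.read_local(logical_name, user)


catalog = Catalog()
servers = {}
first = Server("server-a", catalog, servers)
second = Server("server-b", catalog, servers)
second.storage["/zone/data/run42.dat"] = b"payload"
catalog.locations["/zone/data/run42.dat"] = "server-b"

data = first.get("/zone/data/run42.dat", user="alice")
```

The user only talks to the first server; the rules are applied on the second server, where the data actually lives.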
  • Extremely Successful
    • Storage Resource Broker (SRB) manages 2 PBs of data in internationally shared collections
    • Data collections for NSF, NARA, NASA, DOE, DOD, NIH, LC, NHPRC, IMLS; APAC, UK e-Science, IN2P3, KEK, …
      • Astronomy - data grid
      • Bio-informatics - digital library
      • Earth Sciences - data grid
      • Ecology - collection
      • Education - persistent archive
      • Engineering - digital library
      • Environmental science - data grid
      • High energy physics - data grid
      • Humanities - data grid
      • Medical community - digital library
      • Oceanography - real-time sensor data, persistent archive
      • Seismology - digital library, real-time sensor data
    • Goal has been generic infrastructure for distributed data
  • BaBar High-Energy Physics
    • Stanford Linear Accelerator
    • IN2P3
    • Lyon, France
    • Rome, Italy
    • San Diego
    • RAL, UK
    • A functioning international Data Grid for high-energy physics
    • Manchester-SDSC mirror
    • Moved over 300 TBs of data
    • Increasing to 5 TBs per day
  • Requirements Driving Evolution
    • Observe that as the size of the shared collections grows, the administrative tasks can become onerous.
      • Data grids provide mechanisms to manage recovery from all errors that occur in the distributed environment
    • Need to minimize labor support through automation of administrative functions
      • File ingestion tasks
      • Verification of desired collection properties
      • Integrity checks and replica management
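One way to picture an automated integrity check with replica repair is the sketch below. It assumes a checksum was registered at ingestion and that SHA-256 is the checksum algorithm; the slides name neither, and the function and replica names are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_replicas(expected: str, replicas: dict) -> dict:
    """Return {replica name: True/False} for whether each copy matches."""
    return {name: checksum(data) == expected for name, data in replicas.items()}


def repair_replicas(expected: str, replicas: dict) -> list:
    """Overwrite corrupted copies from an intact one; return the names repaired."""
    status = verify_replicas(expected, replicas)
    good = [name for name, ok in status.items() if ok]
    if not good:
        raise ValueError("no intact replica available for repair")
    repaired = []
    for name, ok in status.items():
        if not ok:
            replicas[name] = replicas[good[0]]
            repaired.append(name)
    return repaired


original = b"observation record 17"
registered = checksum(original)           # stored when the file was ingested
replicas = {
    "disk-sdsc": original,
    "tape-archive": original,
    "disk-remote": b"observation rec\x00rd 17",   # simulated corruption
}
repaired = repair_replicas(registered, replicas)
```

Run periodically over a collection, a check like this replaces a manual administrative task with an automated one.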
  • Requirements Driving Evolution
    • Observe that each community has unique management policies
      • User administration
      • File retention & deletion
      • Time-dependent access controls
      • Data distribution and replication
      • File update (versions, backups)
      • Descriptive metadata
  • Requirements Driving Evolution
    • Socialization of collections
      • The creators of the collection have specific properties that they assert the collection will possess
        • Completeness
        • Authoritative sources
        • Authenticity
      • The users of the collection have their own criteria for the properties they expect
    • Socialization is the mapping from creator assertions to user expectations
  • Data Grid Mechanisms
    • Essential components needed for synergism implemented in SRB
      • Infrastructure independence
      • Data and trust virtualization
    • Components needed for specific management policies and processes implemented in iRODS
      • Map policies to rules that control all processes
      • Map processes to standard micro-services
  • Data Management iRODS - integrated Rule-Oriented Data System
  • Rules
    • Rule classes
      • System enforced rules
      • Administrator controlled rules
      • User defined rules
    • Rule execution
      • Atomic rules - executed on each operation invoked by a client
      • Deferred rules - executed at a future time
      • Periodic rules - executed to validate assessment criteria and enforce desired properties (integrity)
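The three execution modes can be sketched with a small scheduler driven by a simulated clock. This is an illustration of the scheduling idea only; `RuleScheduler`, its methods, and the rule names are hypothetical and do not reflect how the iRODS rule engine is implemented.

```python
import heapq
import itertools


class RuleScheduler:
    def __init__(self):
        self.now = 0                    # simulated clock, in seconds
        self.queue = []                 # heap of (run_at, seq, rule name, interval)
        self.seq = itertools.count()    # tie-breaker for equal run times
        self.history = []               # (time, rule name) for each execution

    def run_atomic(self, name):
        # Atomic: executed immediately as part of the client operation.
        self.history.append((self.now, name))

    def defer(self, name, delay):
        # Deferred: executed once at a future time.
        heapq.heappush(self.queue, (self.now + delay, next(self.seq), name, None))

    def every(self, name, interval):
        # Periodic: executed repeatedly at a fixed interval.
        heapq.heappush(self.queue, (self.now + interval, next(self.seq), name, interval))

    def advance(self, until):
        # Run every queued rule whose time has come, in time order.
        while self.queue and self.queue[0][0] <= until:
            run_at, _, name, interval = heapq.heappop(self.queue)
            self.now = run_at
            self.history.append((run_at, name))
            if interval is not None:
                # Re-queue periodic rules for their next run.
                heapq.heappush(
                    self.queue, (run_at + interval, next(self.seq), name, interval)
                )
        self.now = until


sched = RuleScheduler()
sched.run_atomic("set-access-control-on-ingest")
sched.defer("replicate-to-archive", delay=60)
sched.every("verify-checksums", interval=100)
sched.advance(until=250)
```

In this run the atomic rule fires at time 0, the deferred rule once at 60, and the periodic rule at 100 and 200.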
  • iRODS Rule Syntax
    • Event | Condition | Action-set | Recovery-set
      • Event - triggered by operation or queued rule
      • Condition - composed of tests on any attributes in the persistent state information
      • Action-set - composed from both micro-services and rules
      • Recovery-set - used to ensure transaction semantics and consistent state information
    • Executed by a rule engine installed at each storage location - server side workflows
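The Event | Condition | Action-set | Recovery-set structure can be illustrated in Python as below. This is not the iRODS rule language. The `Rule` class, the `fire` function, and the example actions are hypothetical, and the sketch assumes that a failing action leaves no partial effect of its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Each step pairs an action with the recovery that undoes it.
Step = Tuple[Callable[[dict], None], Callable[[dict], None]]


@dataclass
class Rule:
    event: str
    condition: Callable[[dict], bool]
    steps: List[Step]


def fire(rules, event, state):
    """Run every rule matching the event; roll back a rule whose action fails."""
    for rule in rules:
        if rule.event != event or not rule.condition(state):
            continue
        done = []
        try:
            for action, recovery in rule.steps:
                action(state)
                done.append(recovery)
        except Exception:
            # Undo completed actions in reverse order to restore consistent state.
            for recovery in reversed(done):
                recovery(state)
            return False
    return True


def add_replica(state):
    state["replicas"] += 1


def remove_replica(state):
    state["replicas"] -= 1


def register_checksum(state):
    raise RuntimeError("catalog unavailable")   # simulated failure


def noop(state):
    pass


rules = [Rule(
    event="put",
    condition=lambda s: s["size"] > 0,
    steps=[(add_replica, remove_replica), (register_checksum, noop)],
)]
state = {"size": 10, "replicas": 1}
ok = fire(rules, "put", state)
```

Because the second action fails, the recovery for the first action runs and the replica count returns to its starting value, which is the transaction-like behavior the recovery-set is meant to provide.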
  • Micro-Services
    • Challenge is that storage systems do not provide desired processes
      • Have “minimal” set of standard operations that are performed at the storage system
      • Have actions required by clients such as replication, metadata extraction
      • Create standard micro-services that aggregate storage operations into modules that can be used to implement desired processes.
  • Data Virtualization
    • Map from the actions requested by the access method to a standard set of micro-services.
    • The standard micro-services are mapped to the operations supported by the storage system
    • [Figure labels: Access Interface, Standard Micro-services, Data Grid, Standard Operations, Storage Protocol, Storage System]
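A minimal sketch of this two-level mapping follows, assuming a standard operation set of just `read` and `write`. The driver classes, the `replicate` micro-service, and the `MICRO_SERVICES` table are hypothetical names chosen for the example, not the iRODS driver interface.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Standard operations every storage system must provide (assumed set)."""

    @abstractmethod
    def read(self, path): ...

    @abstractmethod
    def write(self, path, data): ...


class InMemoryDisk(StorageDriver):
    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


class InMemoryTape(StorageDriver):
    """Different internals, same standard operations."""

    def __init__(self):
        self.blocks = []
        self.index = {}

    def read(self, path):
        return self.blocks[self.index[path]]

    def write(self, path, data):
        self.index[path] = len(self.blocks)
        self.blocks.append(data)


def replicate(path, source: StorageDriver, target: StorageDriver):
    """Micro-service built only from standard operations."""
    target.write(path, source.read(path))


# Actions requested through an access interface map to micro-services by name.
MICRO_SERVICES = {"replicate": replicate}

disk, tape = InMemoryDisk(), InMemoryTape()
disk.write("/data/file1", b"abc")
MICRO_SERVICES["replicate"]("/data/file1", disk, tape)
copied = tape.read("/data/file1")
```

The micro-service never touches a storage system's internals, so adding a new kind of storage only requires a new driver.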
  • integrated Rule-Oriented Data System
    • [Architecture diagram labels: Client Interface, Admin Interface, Rule Invoker, Rule Engine, Rule Base, Rule Modifier Module, Config Modifier Module, Metadata Modifier Module, Consistency Check Modules, Service Manager, Micro Service Modules (Metadata-based Services, Resource-based Services), Current State, Confs, Metadata Persistent Repository, Resources]
  • Distributed Management System
    • [Diagram labels: Rule Engine, Data Transport, Metadata Catalog, Execution Control, Messaging System, Execution Engine, Virtualization, Server Side Workflow, Persistent State information, Scheduling, Policy Management]
  • Micro-service Classes
    • Test
    • System
    • Workflow control
    • Client
    • iCAT catalog
    • User level invoked by “irule”
    • Image manipulation
  • Digital Preservation
    • Preservation community is defining the rules needed to assert trustworthiness of a digital repository
      • RLG/NARA - Trustworthy Repositories Audit & Certification: Criteria and Checklist.
    • Defined 105 rules that are being implemented in iRODS
  • RLG/NARA Assessment
    • Example TRAC assessment criteria
      • 90 - Verify descriptive metadata and source against SIP template and set SIP compliance flag
      • 91 - Verify descriptive metadata against semantic term list
      • 92 - Verify status of metadata catalog backup (create a snapshot of metadata catalog)
      • 93 - Verify consistency of preservation metadata after hardware change or error
  • Classes of Assessment Criteria
    • Collection properties
      • List properties of associated name spaces
      • Verify properties
      • Compare properties with assertions
    • Collection operations
      • Transform file formats
      • Migrate data
      • Generate audit trails
    • Structured information
      • Parse audit trails to generate compliance reports
      • Apply templates to extract information
      • Apply templates to format state information
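Parsing an audit trail into a compliance report might look like the sketch below. The pipe-delimited record format, the event names, and the report fields are assumptions made for this example; the slides do not specify how iRODS stores audit trails.

```python
# Hypothetical audit-trail format: "timestamp|object|event|outcome"
AUDIT_TRAIL = """\
2007-06-01T10:00:00|/zone/coll/a.tif|checksum_verify|ok
2007-06-01T10:00:05|/zone/coll/b.tif|checksum_verify|fail
2007-06-02T09:30:00|/zone/coll/a.tif|replicate|ok
2007-06-02T09:31:00|/zone/coll/b.tif|checksum_verify|ok
"""


def parse_audit_trail(text):
    """Turn each non-empty line into a record dictionary."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        timestamp, obj, event, outcome = line.split("|")
        records.append(
            {"time": timestamp, "object": obj, "event": event, "outcome": outcome}
        )
    return records


def compliance_report(records, event):
    """For one event type, report the latest outcome per object."""
    latest = {}
    # ISO timestamps sort correctly as plain strings.
    for rec in sorted(records, key=lambda r: r["time"]):
        if rec["event"] == event:
            latest[rec["object"]] = rec["outcome"]
    failing = sorted(obj for obj, outcome in latest.items() if outcome != "ok")
    return {"checked": len(latest), "failing": failing, "compliant": not failing}


report = compliance_report(parse_audit_trail(AUDIT_TRAIL), "checksum_verify")
```

Here the earlier failure for `b.tif` is superseded by a later successful check, so the collection is reported as compliant for that criterion.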
  • iRODS Development
    • NSF - SDCI grant “Adaptive Middleware for Community Shared Collections”
      • iRODS development, SRB maintenance
    • NARA - Transcontinental Persistent Archive Prototype
      • Trusted repository assessment criteria
    • NSF - Ocean Research Interactive Observatory Network (ORION)
      • Real-time sensor data stream management
    • NSF - Temporal Dynamics of Learning Center data grid
      • Management of Institutional Review Board approval
  • iRODS Development Status
    • Current release is version 0.9.2
      • June 2007
    • Production release will be version 1.0
      • Fall quarter 2007
    • International collaborations
      • SHAMAN - University of Liverpool
        • Sustaining Heritage Access through Multivalent ArchiviNg
      • UK e-Science data grid
      • IN2P3 in Lyon, France
      • DSpace policy management
  • Planned Development
    • GSI support
    • Time-limited sessions via a one-way hash authentication
    • Python Client library
    • GUI Browser (AJAX in development)
    • Driver for HPSS (in development)
    • Driver for SAM-QFS
    • Porting to additional versions of Unix/Linux
    • Porting to Windows
    • Support for MySQL as the metadata catalog
    • API support packages based on existing mounted collection driver
    • MCAT to ICAT migration tools
    • Extensible Metadata including Databases Access Interface
    • Zones/Federation
    • Auditing - mechanisms to record and track iRODS persistent state changes
  • For More Information (iRODS Tutorial on Thursday)
    • Reagan W. Moore
    • San Diego Supercomputer Center
    • [email_address]