Data Management Systems
Presentation Transcript

  • Data Management Systems
    • Richard Marciano, Reagan W. Moore, Wayne Schroeder, Arcot Rajasekar, Mike Wan
    • San Diego Supercomputer Center
    • [email_address]
    • http://irods.sdsc.edu
    • http://www.sdsc.edu/srb/
  • Key Data Management Standards
    • Data sharing - data grids
      • Storage Resource Broker - SRB
        • Enables each community to use their preferred API
        • Supports federation between data grids
      • Integrated Rule Oriented Data System - iRODS
        • Automates execution of management policies
    • Data publication - digital libraries
      • DSpace and Fedora support metadata standards
        • OAI-PMH
        • METS
    • Data preservation - persistent archives
      • Integrated Rule Oriented Data System - iRODS
        • RLG/NARA Trusted Repository Assessment criteria
        • OAIS
        • ISADG metadata
        • DFDL data format characterization
  • Data Standards
    • What are your plans for supporting these?
      • International collaboration on the development of iRODS.
      • Integration with other data management solutions including LOCKSS, IBP/Lstore, DataVerse
    • Which is in your opinion the area which lacks standardization most (either because of the absence of standardization or because of insufficient standards)?
      • Need progress in representation information
      • Need DFDL standard
  • Data Management Standards
    • Which (future) standards are seen as important?
      • Virtual Machine technology
      • Workflow virtualization
    • How close is your implementation to the (published) standard(s)?
      • Most standards provide access methods, but not data management
      • Port access standards as they become heavily used
    • Which are the open source extensions to employed standards you had to make and why?
      • Generic interface for manipulating structured information
        • Need the ability to query a structured information resource to read and write internal information
      • Generic interface for characterizing data management policies
        • Need the ability to virtualize management policies across administrative domains
  • Grid Infrastructure
    • Interoperation Challenges
      • Virtualization of trust
        • Support for multiple authentication and authorization mechanisms
      • Data management virtualization
        • Characterization of management policies as server-side workflows
      • Virtualization of workflows
        • Ability to migrate workflows between client-side and server-side
    • What do you suggest are the best ways to tackle these problem areas?
      • Bottom-up interoperability development by the principal software system developers
    • Roadmap document
      • Wiki - http://irods.sdsc.edu
    • What is your funding status in a mid- and long-term perspective?
      • Sustained funding for the next three years (NSF, NARA)
  • William Charles Wentworth (1790-1872)
    • Noted Australian explorer and statesman
    • Ancestry
      • 8,979 ancestors
      • 84,628 descents from Charlemagne
    • Cousins
      • Queen Elizabeth 10th cousin, 3 removes
      • George Washington 11th cousin, 3 removes
      • Reagan Wentworth Moore 10th cousin, 4 removes
  • Data Grid Evolution
    • Data grids
      • Infrastructure independence
      • Data sharing through data and trust virtualization
        • SRB - Storage Resource Broker
    • Rule-based data grids
      • Automation of management policies
      • Management virtualization
      • Open source software
        • iRODS - integrated Rule-Oriented Data System
  • Data Management Applications
    • Data grids
      • Share data - organize distributed data as a collection
    • Digital libraries
      • Publish data - support browsing and discovery
    • Persistent archives
      • Preserve data - manage technology evolution
    • Real-time sensor systems
      • Federate sensor data - integrate across sensor streams
    • Workflow systems
      • Analyze data - integrate client- & server-side workflows
  • Generic Infrastructure
    • Data grids organize distributed data into shared collections
      • Persistent name spaces for files, users, storage
      • Collection attributes
        • Provenance, descriptive, system metadata
    • Data grids manage heterogeneous storage systems
      • Standard operations across file systems, tape archives, object ring buffers
      • Enable technology evolution
        • At the point in time when new technology is available, both the old and new systems can be integrated
  • Using a Data Grid – in Abstract
    • User asks for data from the data grid
    • The data is found and returned
      • Where & how details are hidden
    • Diagram labels: Ask for data, Data Grid, Data delivered
  • Using a Data Grid - Details
    • User asks for data
    • Data request goes to iRODS Server
    • Server looks up information in catalog
    • Catalog tells which iRODS server has data
    • 1st server asks 2nd for data
    • The 2nd iRODS server applies rules
    • Diagram labels: iRODS Server, iRODS Server, Metadata Catalog, DB
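The request path on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Catalog` and `Server` classes, their methods, and the logical path are hypothetical stand-ins, not the iRODS API.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical stand-in for the metadata catalog: logical name -> server name."""
    locations: dict = field(default_factory=dict)

    def locate(self, logical_name):
        return self.locations[logical_name]


@dataclass
class Server:
    """Hypothetical stand-in for a data grid server holding some files locally."""
    name: str
    catalog: Catalog
    files: dict = field(default_factory=dict)
    peers: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def apply_rules(self, logical_name, user):
        # Placeholder policy: only record the access; a real rule could deny it.
        self.log.append((user, logical_name))

    def get(self, logical_name, user):
        owner = self.catalog.locate(logical_name)  # look up the location in the catalog
        if owner != self.name:
            # This server does not hold the data, so it asks the one that does.
            return self.peers[owner].get(logical_name, user)
        self.apply_rules(logical_name, user)  # rules run where the data lives
        return self.files[logical_name]


catalog = Catalog({"/zone/home/demo/a.txt": "server2"})
server2 = Server("server2", catalog, files={"/zone/home/demo/a.txt": b"hello"})
server1 = Server("server1", catalog, peers={"server2": server2})

# The user only talks to server1; where and how the data is stored stays hidden.
data = server1.get("/zone/home/demo/a.txt", user="alice")
```

The client never names `server2`: the catalog lookup and the server-to-server hop are the "where & how" details that the data grid hides.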
  • Extremely Successful
    • Storage Resource Broker (SRB) manages 2 PBs of data in internationally shared collections
    • Data collections for NSF, NARA, NASA, DOE, DOD, NIH, LC, NHPRC, IMLS; APAC, UK e-Science, IN2P3, KEK, …
      • Astronomy: Data grid
      • Bio-informatics: Digital library
      • Earth Sciences: Data grid
      • Ecology: Collection
      • Education: Persistent archive
      • Engineering: Digital library
      • Environmental science: Data grid
      • High energy physics: Data grid
      • Humanities: Data grid
      • Medical community: Digital library
      • Oceanography: Real-time sensor data, persistent archive
      • Seismology: Digital library, real-time sensor data
    • Goal has been generic infrastructure for distributed data
  • BaBar High-Energy Physics
    • Stanford Linear Accelerator
    • IN2P3
    • Lyon, France
    • Rome, Italy
    • San Diego
    • RAL, UK
    • A functioning international Data Grid for high-energy physics
    • Manchester-SDSC mirror
    • Moved over 300 TBs of data
    • Increasing to 5 TBs per day
  • Requirements Driving Evolution
    • Observe that as the size of the shared collections grows, the administrative tasks can become onerous.
      • Data grids provide mechanisms to manage recovery from all errors that occur in the distributed environment
    • Need to minimize labor support through automation of administrative functions
      • File ingestion tasks
      • Verification of desired collection properties
      • Integrity checks and replica management
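As one possible illustration of automating integrity checks and replica management, the sketch below compares each replica against a registered checksum and repairs bad copies from a good one. The function names, the resource names, and the choice of SHA-256 are assumptions made for the example, not a description of how SRB or iRODS implement this.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an arbitrary choice for this sketch.
    return hashlib.sha256(data).hexdigest()


def verify_replicas(expected: str, replicas: dict) -> dict:
    """Return {resource_name: True/False} depending on whether each replica matches."""
    return {name: checksum(data) == expected for name, data in replicas.items()}


def repair(expected: str, replicas: dict) -> list:
    """Overwrite damaged replicas with an intact copy; return the names repaired."""
    status = verify_replicas(expected, replicas)
    good = next((name for name, ok in status.items() if ok), None)
    if good is None:
        raise ValueError("no intact replica left to repair from")
    bad = [name for name, ok in status.items() if not ok]
    for name in bad:
        replicas[name] = replicas[good]
    return bad


original = b"observation data"
registered = checksum(original)  # checksum recorded at ingestion time
replicas = {"disk-sd": original, "tape-uk": b"observation dat\x00"}  # one corrupted copy
repaired = repair(registered, replicas)
```

Run periodically over a collection, a check like this replaces a manual administrative task with a policy that is enforced automatically.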
  • Requirements Driving Evolution
    • Observe that each community has unique management policies
      • User administration
      • File retention & deletion
      • Time-dependent access controls
      • Data distribution and replication
      • File update (versions, backups)
      • Descriptive metadata
  • Requirements Driving Evolution
    • Socialization of collections
      • The creators of the collection have specific properties that they assert the collection will possess
        • Completeness
        • Authoritative sources
        • Authenticity
      • The users of the collection have their own criteria for the properties they expect
    • Socialization is the mapping from creator assertions to user expectations
  • Data Grid Mechanisms
    • Essential components needed for synergism implemented in SRB
      • Infrastructure independence
      • Data and trust virtualization
    • Components needed for specific management policies and processes implemented in iRODS
      • Map policies to rules that control all processes
      • Map processes to standard micro-services
  • Data Management iRODS - integrated Rule-Oriented Data System
  • Rules
    • Rule classes
      • System enforced rules
      • Administrator controlled rules
      • User defined rules
    • Rule execution
      • Atomic rules - executed on each operation invoked by a client
      • Deferred rules - executed at a future time
      • Periodic rules - executed to validate assessment criteria and enforce desired properties (integrity)
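The three execution modes can be illustrated with a toy scheduler. This is a minimal sketch driven by an explicit clock value rather than real timers; `RuleScheduler` and its methods are hypothetical names, not part of iRODS.

```python
import heapq
import itertools


class RuleScheduler:
    """Hypothetical scheduler driven by an explicit clock value (no real timers)."""

    def __init__(self):
        self._queue = []               # entries: (due_time, seq, period, action)
        self._seq = itertools.count()  # unique tie-breaker so actions are never compared

    def run_atomic(self, action):
        # Atomic: executed immediately, as part of the client operation.
        return action()

    def defer(self, due, action):
        # Deferred: executed once, at or after time `due`.
        heapq.heappush(self._queue, (due, next(self._seq), None, action))

    def every(self, period, action, start=0):
        # Periodic: first due at start + period, then re-queued after each run.
        heapq.heappush(self._queue, (start + period, next(self._seq), period, action))

    def tick(self, now):
        """Run every queued rule that is due at time `now`, earliest first."""
        while self._queue and self._queue[0][0] <= now:
            due, _, period, action = heapq.heappop(self._queue)
            action()
            if period is not None:
                heapq.heappush(self._queue, (due + period, next(self._seq), period, action))


events = []
scheduler = RuleScheduler()
scheduler.run_atomic(lambda: events.append("atomic"))
scheduler.defer(5, lambda: events.append("deferred"))
scheduler.every(3, lambda: events.append("periodic"))
scheduler.tick(10)  # advance the clock to t = 10
```

With the clock advanced to 10, the periodic rule fires at 3, 6 and 9, and the deferred rule fires once at 5.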
  • iRODS Rule Syntax
    • Event | Condition | Action-set | Recovery-set
      • Event - triggered by operation or queued rule
      • Condition - composed of tests on any attributes in the persistent state information
      • Action-set - composed from both micro-services and rules
      • Recovery-set - used to ensure transaction semantics and consistent state information
    • Executed by a rule engine installed at each storage location - server side workflows
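The Event | Condition | Action-set | Recovery-set structure can be sketched as follows. This is a Python illustration of the idea, not the iRODS rule language: `Rule`, `fire`, and the example actions are hypothetical, and undoing only the completed actions in reverse order is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = dict  # stand-in for the persistent state information


@dataclass
class Rule:
    event: str
    condition: Callable[[State], bool]
    # Each step pairs an action with the recovery that undoes it.
    steps: List[Tuple[Callable[[State], None], Callable[[State], None]]]


def fire(rules, event, state):
    """Apply the first rule that matches the event and whose condition holds.

    Returns True if a rule ran to completion, False otherwise.
    """
    for rule in rules:
        if rule.event != event or not rule.condition(state):
            continue
        done = []
        try:
            for action, recovery in rule.steps:
                action(state)
                done.append(recovery)
            return True
        except Exception:
            # Undo the completed actions in reverse order to restore consistency.
            for recovery in reversed(done):
                recovery(state)
            return False
    return False


def replicate(state):
    state["replicas"] += 1


def unreplicate(state):
    state["replicas"] -= 1


def register_checksum(state):
    if state["size"] == 0:
        raise ValueError("cannot checksum an empty file")
    state["checksum"] = "set"


def clear_checksum(state):
    state.pop("checksum", None)


rule = Rule(
    event="put",
    condition=lambda s: s["collection"] == "/archive",
    steps=[(replicate, unreplicate), (register_checksum, clear_checksum)],
)

ok_state = {"collection": "/archive", "size": 10, "replicas": 1}
bad_state = {"collection": "/archive", "size": 0, "replicas": 1}
ok = fire([rule], "put", ok_state)       # both actions succeed
failed = fire([rule], "put", bad_state)  # second action fails, first is rolled back
```

In the failing case the replica count returns to its original value, which is the transaction-like behavior the recovery-set is meant to provide.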
  • Micro-Services
    • Challenge is that storage systems do not provide desired processes
      • Have “minimal” set of standard operations that are performed at the storage system
      • Have actions required by clients such as replication, metadata extraction
      • Create standard micro-services that aggregate storage operations into modules that can be used to implement desired processes.
  • Data Virtualization
    • Map from the actions requested by the access method to a standard set of micro-services.
    • The standard micro-services are mapped to the operations supported by the storage system.
    • Diagram labels: Access Interface, Standard Micro-services, Data Grid, Standard Operations, Storage Protocol, Storage System
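The two-level mapping (client action to micro-services, micro-services to storage operations) might look like the sketch below. Everything here is hypothetical: `MemoryStore` stands in for a storage driver, and the `msi_copy` and `msi_verify` functions and the `ACTIONS` table are invented for illustration rather than taken from iRODS.

```python
class MemoryStore:
    """Hypothetical storage driver exposing only a minimal set of operations."""

    def __init__(self):
        self._blobs = {}

    def read(self, path):
        return self._blobs[path]

    def write(self, path, data):
        self._blobs[path] = data

    def exists(self, path):
        return path in self._blobs


# Micro-services: small functions composed from the standard storage operations.
def msi_copy(src, dst, path):
    dst.write(path, src.read(path))


def msi_verify(src, dst, path):
    if not dst.exists(path) or dst.read(path) != src.read(path):
        raise IOError(f"replica of {path} does not match the source")


# Map each client-visible action to a sequence of micro-services.
ACTIONS = {
    "replicate": [msi_copy, msi_verify],
}


def perform(action, src, dst, path):
    for micro_service in ACTIONS[action]:
        micro_service(src, dst, path)


disk, tape = MemoryStore(), MemoryStore()
disk.write("/demo/a.dat", b"payload")
perform("replicate", disk, tape, "/demo/a.dat")
```

Because the micro-services only call `read`, `write` and `exists`, a new storage system can be added by writing one more driver class, without changing the actions that clients request.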
  • integrated Rule-Oriented Data System
    • Diagram labels: Client Interface, Admin Interface, Current State, Rule Invoker, Micro Service Modules (Metadata-based Services), Micro Service Modules (Resource-based Services), Resources, Rule Modifier Module, Config Modifier Module, Metadata Modifier Module, Consistency Check Module (three instances), Confs, Rule Base, Metadata Persistent Repository, Service Manager, Rule Engine
  • Distributed Management System
    • Diagram labels: Rule Engine, Data Transport, Metadata Catalog, Execution Control, Messaging System, Execution Engine, Virtualization, Server Side Workflow, Persistent State information, Scheduling, Policy Management
  • Micro-service Classes
    • Test
    • System
    • Workflow control
    • Client
    • iCAT catalog
    • User level invoked by “irule”
    • Image manipulation
  • Digital Preservation
    • Preservation community is defining the rules needed to assert the trustworthiness of a digital repository
      • RLG/NARA - Trustworthy Repositories Audit & Certification: Criteria and Checklist.
      • http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf
    • Defined 105 rules that are being implemented in iRODS
  • RLG/NARA Assessment
    • Example TRAC assessment criteria
      • 90 - Verify descriptive metadata and source against SIP template and set SIP compliance flag
      • 91 - Verify descriptive metadata against semantic term list
      • 92 - Verify status of metadata catalog backup (create a snapshot of metadata catalog)
      • 93 - Verify consistency of preservation metadata after hardware change or error
  • Classes of Assessment Criteria
    • Collection properties
      • List properties of associated name spaces
      • Verify properties
      • Compare properties with assertions
    • Collection operations
      • Transform file formats
      • Migrate data
      • Generate audit trails
    • Structured information
      • Parse audit trails to generate compliance reports
      • Apply templates to extract information
      • Apply templates to format state information
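Parsing an audit trail into a compliance report could be sketched as below. The pipe-delimited record format, the sample records, and the compliance criterion (the most recent checksum of each object succeeded) are all assumptions made for this example and are not taken from iRODS or the TRAC checklist.

```python
# Hypothetical audit-trail format: "timestamp|user|operation|object|status"
AUDIT_TRAIL = """\
2007-06-01T10:00:00|alice|put|/archive/a.tif|ok
2007-06-01T10:00:05|system|checksum|/archive/a.tif|ok
2007-06-02T09:30:00|bob|put|/archive/b.tif|ok
2007-06-02T09:30:04|system|checksum|/archive/b.tif|ok
2007-06-03T02:00:00|system|checksum|/archive/a.tif|failed
"""


def parse(trail):
    """Turn each non-empty line into a dict keyed by field name."""
    fields = ("timestamp", "user", "operation", "object", "status")
    return [dict(zip(fields, line.split("|"))) for line in trail.splitlines() if line]


def compliance_report(records):
    """For each object, report whether its most recent checksum succeeded."""
    latest = {}
    seen = set()
    # ISO 8601 timestamps in one time zone sort correctly as plain strings.
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        seen.add(rec["object"])
        if rec["operation"] == "checksum":
            latest[rec["object"]] = rec["status"] == "ok"
    # Objects that were never checksummed are reported as non-compliant.
    return {obj: latest.get(obj, False) for obj in sorted(seen)}


report = compliance_report(parse(AUDIT_TRAIL))
```

Here `/archive/a.tif` is flagged because its latest integrity check failed, even though an earlier one passed.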
  • iRODS Development
    • NSF - SDCI grant “Adaptive Middleware for Community Shared Collections”
      • iRODS development, SRB maintenance
    • NARA - Transcontinental Persistent Archive Prototype
      • Trusted repository assessment criteria
    • NSF - Ocean Research Interactive Observatory Network (ORION)
      • Real-time sensor data stream management
    • NSF - Temporal Dynamics of Learning Center data grid
      • Management of Institutional Review Board approval
  • iRODS Development Status
    • Current release is version 0.9.2
      • June 2007
    • Production release will be version 1.0
      • Fall quarter 2007
    • International collaborations
      • SHAMAN - University of Liverpool
        • Sustaining Heritage Access through Multivalent ArchiviNg
      • UK e-Science data grid
      • IN2P3 in Lyon, France
      • DSpace policy management
  • Planned Development
    • GSI support
    • Time-limited sessions via a one-way hash authentication
    • Python Client library
    • GUI Browser (AJAX in development)
    • Driver for HPSS (in development)
    • Driver for SAM-QFS
    • Porting to additional versions of Unix/Linux
    • Porting to Windows
    • Support for MySQL as the metadata catalog
    • API support packages based on existing mounted collection driver
    • MCAT to ICAT migration tools
    • Extensible Metadata including Databases Access Interface
    • Zones/Federation
    • Auditing - mechanisms to record and track iRODS persistent state changes
  • For More Information (iRODS Tutorial on Thursday)
    • Reagan W. Moore
    • San Diego Supercomputer Center
    • [email_address]
    • http://www.sdsc.edu/srb/
    • http://irods.sdsc.edu/