Integration of Data Grids, Digital Libraries, and Persistent ...
 

Presentation Transcript

  • Integration of Data Grids, Digital Libraries, and Persistent Archives (Storage Resource Broker - SRB)
    • Arcot Rajasekar, Michael Wan, Reagan W. Moore
    • (sekar, mwan, moore)@sdsc.edu
  • SDSC SRB Team
    • Reagan Moore
    • Michael Wan
    • Arcot Rajasekar
    • Wayne Schroeder
    • Arun Jagatheesan
    • Charlie Cowart
    • Lucas Gilbert
    • George Kremenek
    • Sheau-Yen Chen
    • Bing Zhu
    • Roman Olschanowsky (BIRN)
    • Vicky Rowley (BIRN)
    • Marcio Faerman (SCEC)
    • Antoine De Torcy (IN2P3)
    • Students & emeritus
      • Erik Vandekieft
      • Reena Mathew
      • Xi (Cynthia) Sheng
      • Allen Ding
      • Grace Lin
      • Qiao Xin
      • Daniel Moore
      • Ethan Chen
      • Jon Weinburg
  • Topics
    • Concepts behind data management
    • Production data grid examples
    • Integration of data grids with digital libraries and persistent archives
  • Data Grid
    • Support data sharing between institutions
      • Discover relevant data without knowing the file name
      • Access data without knowing the storage location or storage access protocol
      • Retrieve data using your preferred API
    • Organize distributed data in a collection hierarchy
    • Manage latency in wide-area networks
    • Manage petabytes of data and hundreds of millions of files
  • Digital Library
    • Provide curation services
      • Organization, description, and management of data
      • Support schema extension
    • Provide access services
      • Discovery, browsing, presentation, and manipulation of data
    • Federate semantics across collections
      • Digital library crosswalks
  • Persistent Archive
    • Support archival processes
      • Appraisal, accession, arrangement, description, preservation, and access
    • Manage technology evolution while preserving integrity and authenticity of data
    • Minimize risk of data loss
      • Preserve collections for hundreds of years
      • Data replication
  • Challenges
    • Each community assigns different meanings to terms used to describe their requirements
    • Data grid community
      • Persistent Archive is the infrastructure that manages storage technology evolution while preserving a collection
    • Archivist community
      • Persistent Archive is the collection that is being preserved in some choice of infrastructure
    • Together they define a preservation environment
  • Challenges
    • Preservation community traditionally views technology evolution as the problem rather than the solution
      • Preservation requires the ability to manipulate old formats
    • Digital library community attempts to assert exact meaning for semantics.
      • Metadata Encoding and Transmission Standard is one approach towards the creation of a metadata framework with the ability to support extension schema
    • Data grid community has not chosen standards for distributed data management
      • Computer science is just starting to understand how to characterize and manage data, information, and knowledge
  • To Make Progress
    • Develop the simplest possible description of data, information, and knowledge management
    • Identify common infrastructure components
    • Apply in production settings
      • Iterate, based on new expectations for data management
  • Common Requirements for Data Management
    • Distributed data sources
      • Management across administrative domains
    • Heterogeneity
      • Multiple types of storage repositories
    • Scalability
      • Support for billions of digital entities and petabytes of data
    • Preservation
      • Management of technology evolution
  • SRB Collections at SDSC
  • Data Management Concepts (Elements)
    • Collection
      • The organization of digital entities to simplify management and access.
    • Context
      • The information that describes the digital entities in a collection.
    • Content
      • The digital entities in a collection
  • Types of Context Metadata
    • Descriptive
      • Provenance information, discovery attributes
    • Administrative
      • Location, ownership, size, time stamps
    • Structural
      • Data model, internal components
    • Behavioral
      • Display and manipulation operations
    • Authenticity
      • Audit trails, checksums, access controls
  • Metadata Standards
    • METS - Metadata Encoding and Transmission Standard
      • Defines standard structure and schema extension
    • OAIS - Open Archival Information System
      • Preservation packages for submission, archiving, distribution
    • OAI - Open Archives Initiative
      • Metadata retrieval based on Dublin Core provenance attributes
  • Data Management Concepts (Mechanisms)
    • Curation
      • The process of creating the context
    • Closure
      • Assertion that the collection has global properties, including completeness and homogeneity under specified operations
    • Consistency
      • Assertion that the context represents the content
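The curation and consistency mechanisms above can be illustrated with a minimal sketch. It assumes the context is reduced to a size and a SHA-256 checksum; the `Context` class, the function names, and the logical name are hypothetical and are not part of SRB.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical context record for one digital entity."""
    logical_name: str
    size: int
    sha256: str


def curate(logical_name, content):
    """Curation: create the context from the content."""
    return Context(logical_name, len(content), hashlib.sha256(content).hexdigest())


def is_consistent(context, content):
    """Consistency: check that the recorded context still represents the content."""
    return (context.size == len(content)
            and context.sha256 == hashlib.sha256(content).hexdigest())


content = b"anelastic wave propagation output"
ctx = curate("/collection/run1.dat", content)   # hypothetical logical name
print(is_consistent(ctx, content))              # True
print(is_consistent(ctx, content + b"x"))       # False: the content has changed
```

A real catalog would hold many more attributes; the point is only that consistency is a checkable assertion relating context to content.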
  • Information Technologies
    • Data collecting
      • Sensor systems, object ring buffers and portals
    • Data organization
      • Collections, manage data context
    • Data sharing
      • Data grids, manage heterogeneity
    • Data publication
      • Digital libraries, support discovery
    • Data preservation
      • Persistent archives, manage technology evolution
    • Data analysis
      • Processing pipelines, manage knowledge extraction
  • Assertion
    • Data Grids provide the underlying abstractions required to support
      • Digital libraries
        • Curation processes
        • Distributed collections
        • Discovery and presentation services
      • Persistent archives
        • Management of technology evolution
        • Preservation of authenticity
    • The management of data requires the use of information (semantic labels).
    • The management of information requires the use of knowledge (relationships).
  • Data Grid Terms
    • Data
      • Bits - zeros and ones
    • Digital Entity
      • The bits that form an image of reality (file, object, image, data, metadata, string of bits, structured sets of strings of bits)
    • Information
      • Semantic labels applied to data
    • Metadata
      • Semantic label and the associated data (attribute name and attribute value)
    • Knowledge
      • Relationships between semantic labels applied to data
      • Relationships used to assert the application of a semantic label
  • Data Grid Components
    • Federated client-server architecture
      • Servers can talk to each other independently of the client
    • Infrastructure independent naming
      • Logical names for users, resources, files, applications
    • Collective ownership of data
      • Collection-owned data, with infrastructure independent access control lists
    • Context management
      • Record state information in a metadata catalog from data grid services such as replication
    • Abstractions for dealing with heterogeneity
  • Data Grid Abstractions
    • Logical name space for files
      • Global persistent identifier
    • Storage repository virtualization
      • Standard operations supported on storage systems
    • Information repository virtualization
      • Standard operations to manage collections in databases
    • Access virtualization
      • Standard interface to support alternate APIs
    • Latency management mechanisms
      • Aggregation, parallel I/O, replication, caching
    • Security interoperability
      • GSSAPI, inter-realm authentication, collection-based authorization
  • Storage Repository Virtualization
    • User Application
    • Common set of operations for interacting with every type of storage repository
      • Remote operations: Unix file system, latency management, procedures, transformations, third party transfer, filtering, queries
    • Storage repositories: Archive, Database, File System
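One way to picture a common operation set over different repositories is an abstract driver interface. The sketch below is an illustration under stated assumptions, not the SRB driver API: `StorageDriver`, `FileSystemDriver`, `InMemoryDriver`, and `copy` are hypothetical names, and the in-memory class merely stands in for an archive or database back end.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageDriver(ABC):
    """Hypothetical common operation set that every repository driver implements."""

    @abstractmethod
    def write(self, physical_name, data): ...

    @abstractmethod
    def read(self, physical_name): ...


class FileSystemDriver(StorageDriver):
    def __init__(self, root):
        self.root = Path(root)

    def write(self, physical_name, data):
        (self.root / physical_name).write_bytes(data)

    def read(self, physical_name):
        return (self.root / physical_name).read_bytes()


class InMemoryDriver(StorageDriver):
    """Stand-in for an archive or database back end."""

    def __init__(self):
        self.blobs = {}

    def write(self, physical_name, data):
        self.blobs[physical_name] = data

    def read(self, physical_name):
        return self.blobs[physical_name]


def copy(src, dst, name):
    """Application code uses only the common operations, never the back end directly."""
    dst.write(name, src.read(name))


with tempfile.TemporaryDirectory() as tmp:
    fs, mem = FileSystemDriver(tmp), InMemoryDriver()
    fs.write("a.dat", b"bits")
    copy(fs, mem, "a.dat")
    copied = mem.read("a.dat")
print(copied)   # b'bits'
```

Supporting a new storage system then means writing one more driver class, while application code such as `copy` stays unchanged.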
  • Mappings on Resource Name Space
    • Define logical resource name
      • List of physical resources
    • Replication
      • Write to logical resource completes when all physical resources have a copy
    • Load balancing
      • Write to a logical resource completes when a copy exists on the next physical resource in the list
    • Fault tolerance
      • Write to a logical resource completes when copies exist on “k” of “n” physical resources
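The three completion rules above can be sketched as policies on a logical resource. This is a simplified illustration with in-memory stand-ins; `PhysicalResource`, `LogicalResource`, and the method names are hypothetical and do not reflect how SRB implements them.

```python
class PhysicalResource:
    """In-memory stand-in for one storage system; `up` simulates availability."""

    def __init__(self, name, up=True):
        self.name, self.up, self.files = name, up, {}

    def store(self, logical_file, data):
        if not self.up:
            return False
        self.files[logical_file] = data
        return True


class LogicalResource:
    """A logical resource name mapped to a list of physical resources."""

    def __init__(self, resources):
        self.resources = list(resources)
        self._next = 0

    def write_replicated(self, name, data):
        """Replication: complete only when every physical resource has a copy."""
        return all([r.store(name, data) for r in self.resources])

    def write_balanced(self, name, data):
        """Load balancing: complete when the next resource in the list has a copy."""
        resource = self.resources[self._next % len(self.resources)]
        self._next += 1
        return resource.store(name, data)

    def write_fault_tolerant(self, name, data, k):
        """Fault tolerance: complete when at least k of the n resources have a copy."""
        copies = sum(r.store(name, data) for r in self.resources)
        return copies >= k


disk = PhysicalResource("disk")
tape = PhysicalResource("tape")
remote = PhysicalResource("remote", up=False)
logical = LogicalResource([disk, tape, remote])

print(logical.write_replicated("f1", b"x"))            # False: "remote" is down
print(logical.write_fault_tolerant("f2", b"x", k=2))   # True: 2 of 3 copies exist
print(logical.write_balanced("f3", b"x"))              # True: written to "disk"
```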
  • Containers
    • Archivists store hardcopy in “cardboard boxes”
    • A container is the digital equivalent, the aggregation of digital files into a single file, with an associated “packing list”
    • Containers are used to minimize access latency, keep similar digital entities together
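A container with a packing list can be sketched as a single byte string plus a table of offsets. The layout below is an assumption made for illustration; it is not the SRB container format, and `pack` and `extract` are hypothetical helpers.

```python
import io


def pack(files):
    """Aggregate files into one byte string plus a packing list of (offset, length)."""
    container = io.BytesIO()
    packing_list = {}
    for name, data in files.items():
        packing_list[name] = (container.tell(), len(data))
        container.write(data)
    return container.getvalue(), packing_list


def extract(container, packing_list, name):
    """Read one member by slicing the container; no other member is touched."""
    offset, length = packing_list[name]
    return container[offset:offset + length]


container, packing_list = pack({"a.txt": b"alpha", "b.txt": b"beta"})
print(packing_list)                                # {'a.txt': (0, 5), 'b.txt': (5, 4)}
print(extract(container, packing_list, "b.txt"))   # b'beta'
```

Storing one large container instead of many small files means an archive handles a single object, which is the latency benefit the slide describes.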
  • Data Stored at SDSC
    • HPSS archive
      • Stores 1 Petabyte of data
      • Stores 17 million files
    • Storage Resource Broker data grid
      • Stores 114 Terabytes of data
      • Stores 31 million files
      • Containers are used to aggregate files before loading into HPSS
  • SDSC Storage Resource Broker & Meta-data Catalog (architecture diagram)
    • Application
    • Access APIs
      • C, C++, Libraries
      • Unix Shell
      • Linux I/O
      • DLL / Python
      • Java, NT Browsers
      • GridFTP
      • OAI
      • WSDL
    • SRB Server
      • Consistency Management / Authorization-Authentication
      • Logical Name Space
      • Latency Management
      • Data Transport
      • Metadata Transport
    • Catalog Abstraction
      • Databases: DB2, Oracle, Sybase, SQLServer
    • Storage Abstraction
    • Drivers
      • Archives: HPSS, ADSM, UniTree, DMF
      • HRM
      • Databases: DB2, Oracle, Postgres
      • File Systems: Unix, NT, Mac OSX
  • Production Data Grid
    • SDSC Storage Resource Broker
      • Federated client-server system, managing
        • Over 100 TBs of data at SDSC
        • Over 25 million files
      • Manages data collections stored in
        • Archives (HPSS, UniTree, ADSM, DMF)
        • Hierarchical Resource Managers
        • Tapes, tape robots
        • File systems (Unix, Linux, Mac OS X, Windows)
        • FTP sites
        • Databases (Oracle, DB2, Postgres, SQLserver, Sybase, Informix)
        • Virtual Object Ring Buffers
  • Data Virtualization
    • User Application
    • Common naming convention and set of attributes for describing digital entities
      • Logical name space
      • Location independent identifier
      • Persistent identifier
      • Collection owned data
      • Access controls
      • Audit trails
      • Checksums
      • Descriptive metadata
      • Inter-realm authentication
      • Single sign-on system
    • Storage: Archive at SDSC, Database at U Md, File System at U Texas
  • Logical Name Space
    • Persistent, location-independent identifiers for digital entities
      • Organized as collection hierarchy
      • Attributes mapped to logical name space
        • Attributes managed in a database
    • Types of administrative metadata
      • Physical location of file
      • Owner, size, creation time, update time
      • Access controls
  • File Identifiers
    • Logical file name
      • Infrastructure independent
      • Used to organize files into a collection hierarchy
    • Globally unique identifier
      • GUID for asserting equivalence across collections
    • Descriptive metadata
      • Support discovery
    • Physical file name
      • Location of file
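The relationship between the four identifiers can be sketched as a small catalog. This is a minimal illustration, assuming an in-memory dictionary in place of the metadata catalog; `CatalogEntry`, `Catalog`, and the example names and URLs are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    logical_name: str                              # infrastructure independent
    physical_names: list                           # location of each replica
    metadata: dict = field(default_factory=dict)   # descriptive metadata
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


class Catalog:
    def __init__(self):
        self.entries = {}

    def register(self, logical_name, physical_name, **metadata):
        entry = CatalogEntry(logical_name, [physical_name], metadata)
        self.entries[logical_name] = entry
        return entry

    def migrate(self, logical_name, old, new):
        """Change the physical location; the logical name and GUID stay fixed."""
        names = self.entries[logical_name].physical_names
        names[names.index(old)] = new

    def discover(self, **conditions):
        """Find logical names by descriptive metadata, without knowing file names."""
        return [e.logical_name for e in self.entries.values()
                if all(e.metadata.get(k) == v for k, v in conditions.items())]


catalog = Catalog()
entry = catalog.register("/scec/run1/output.dat", "hpss://archive/0001", region="socal")
guid_before = entry.guid
catalog.migrate("/scec/run1/output.dat", "hpss://archive/0001", "file://disk7/0001")
print(catalog.discover(region="socal"))   # ['/scec/run1/output.dat']
```

After the migration the logical name and the GUID are unchanged; only the physical file name differs.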
  • Information Repository Virtualization
    • User Application
    • Common operations for managing a catalog in a database
    • Operations used to manage administrative, descriptive, user-defined metadata
      • Import from XML file
      • Export to XML file
      • Bulk load
      • Bulk unload
      • Schema extension
      • Access controls
      • Dynamic SQL generation
    • Choice of database for Metadata Catalog
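As a rough illustration of schema extension and dynamic SQL generation, the sketch below stores user-defined attributes as rows in an attribute-value table and builds a parameterized query from a set of conditions. The table layout and `build_query` are assumptions for this example, using Python's built-in `sqlite3` module; they do not describe the MCAT schema.

```python
import sqlite3


def build_query(conditions):
    """Generate a parameterized SQL query from attribute conditions."""
    if not conditions:
        raise ValueError("at least one condition is required")
    select = "SELECT logical_name FROM metadata WHERE attr = ? AND value = ?"
    parts, params = [], []
    for attr, value in conditions.items():
        parts.append(select)
        params += [attr, value]
    # An entity must satisfy every condition, hence INTERSECT.
    return " INTERSECT ".join(parts) + " ORDER BY logical_name", params


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metadata (logical_name TEXT, attr TEXT, value TEXT)")
db.executemany("INSERT INTO metadata VALUES (?, ?, ?)", [
    ("/c/run1", "model", "velocity"), ("/c/run1", "site", "SDSC"),
    ("/c/run2", "model", "fault"),    ("/c/run2", "site", "SDSC"),
])

sql, params = build_query({"site": "SDSC", "model": "fault"})
rows = [r[0] for r in db.execute(sql, params)]
print(rows)   # ['/c/run2']
```

Because each attribute is a row rather than a column, adding a new user-defined attribute needs no change to the table definition, which is one simple way to support schema extension.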
  • Access Virtualization
    • Application
      • C, C++, Libraries
      • Unix Shell
      • Linux I/O
      • DLL / Python
      • Java, NT Browsers
      • GridFTP
      • OAI
      • WSDL
    • Map from API to remote operations
    • Common operations performed on all storage repositories
      • Remote operations: Unix file system, latency management, procedures, transformations, third party transfer, filtering, queries
  • Technology Evolution
    • All components of the “Persistent Archive” will evolve
      • Hardware systems
      • Software systems
      • Protocols
      • Access methods
      • Encoding syntax for digital entities
    • Create drivers for each new storage repository protocol
      • Migrate data to each new storage system
    • Manage evolution of the encoding syntax through either transformative migration or emulation
  • Are Repeated Media Migrations Feasible?
    • At SDSC, cartridge capacity has increased from 200 Mbytes to 200 Gbytes for the same cartridge cost
    • Only migrate to new technology when the cost per Gigabyte is a factor of two lower
    • Then the total media cost, summed over all migrations, is bounded at twice the cost of the original media
      • (1 + 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + …) = 2
    • SDSC migrates to new media to reduce cost
      • All tapes are stored in robots to minimize labor costs
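The bounded-cost argument is a geometric series, which a short calculation makes concrete. The sketch assumes a fixed volume of data and a media cost that exactly halves at every migration; `total_media_cost` is a hypothetical helper.

```python
def total_media_cost(initial_cost, migrations, factor=2.0):
    """Media cost of the initial purchase plus `migrations` later migrations,
    assuming each migration costs 1/factor of the previous one."""
    return sum(initial_cost / factor**i for i in range(migrations + 1))


for n in (0, 1, 5, 20):
    # The running total approaches twice the initial cost without reaching it.
    print(n, total_media_cost(1.0, n))
```

With the initial cost normalized to 1, five migrations bring the running total to 1.96875, and further migrations add progressively less.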
  • Transformative Migration versus Emulation versus Digital Ontology
    • Transformative Migration
      • Transform the encoding format to a new standard
      • Can combine encoding format transformation with media migration
    • Emulation
      • Create a transportable parser for the original encoding format
      • Migrate emulator forward in time
      • Example - Multivalent Browser (written in Java) for parsing PDF, LaTeX, …
    • Digital ontology
      • Characterize the structures and relationships present within the digital entity
      • Migrate the characterization forward in time
  • Persistent Archives
    • When migrating from an old technology to a new technology, both versions are available.
    • Virtualization mechanisms used for federation across space can be used to manage migration over time
    • Persistent archives can be built on data grid infrastructure
  • Automation of Archival Processes (Archival Process: Functionality)
    • Appraisal: Assessment of digital entities
    • Accession: Import of digital entities
    • Description: Assignment of preservation metadata
    • Arrangement: Logical organization of digital entities
    • Preservation: Long-term storage
    • Access: Discovery and retrieval
  • Data Grid Core Capabilities
    • Storage repository abstraction
      • Storage interface to at least one repository
      • Standard data access mechanism
      • Standard data movement protocol support
      • Containers for data
    • Logical name space
      • Registration of files in logical name space
      • Retrieval by logical name
      • Logical name space structural independence from physical file
      • Persistent handle
  • Information Repository Abstraction
    • Collection owned data
    • Collection hierarchy for organizing logical name space
    • Standard metadata attributes (controlled vocabulary)
    • Attribute creation and deletion
    • Scalable metadata insertion
    • Access control lists for logical name space
    • Attributes for mapping from logical file name to physical file
    • Encoding format specification attributes
    • Data referenced by catalog query
    • Containers for metadata
  • Distributed Resilient Architecture
    • Specification of system availability
    • Standard error messages
    • Status checking
    • Authentication mechanism
    • Specification of reliability against permanent data loss
    • Specification of mechanism to validate integrity of data
    • Specification of mechanism to assure integrity of data
  • Virtual Data Grid
    • Knowledge repositories for managing collection properties
    • Characterization of the application of transformative migrations on encoding format
    • Characterization of the application of archival processes
  • Federated SRB server model
    • Components: Read Application, SRB servers, SRB agents, MCAT, replicas R1 and R2
    • Application request: Logical Name or Attribute Condition
    • MCAT functions
      • 1. Logical-to-Physical mapping
      • 2. Identification of Replicas
      • 3. Access & Audit Control
    • Interaction labels: Peer-to-peer Brokering, Server(s) Spawning, Data Access, Parallel Data Access
    • Step markers in the diagram: 1, 2, 3, 4, 5, 6 (5/6 shown together)
  • Latency Management - Bulk Operations
    • Bulk register
      • Create a logical name for a file
      • Load context (metadata)
    • Bulk load
      • Create a copy of the file on a data grid storage repository
    • Bulk unload
      • Provide containers to hold small files and pointers to each file location
    • Bulk delete
    • Requests for bulk operations for access control, …
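The latency benefit of bulk operations can be shown with a toy model in which every client call stands for one wide-area round trip. `CatalogClient` and its methods are hypothetical and exist only to count round trips; they are not SRB client calls.

```python
class CatalogClient:
    """Hypothetical client; each method call stands for one wide-area round trip."""

    def __init__(self):
        self.round_trips = 0
        self.registered = {}

    def register(self, logical_name, metadata):
        self.round_trips += 1
        self.registered[logical_name] = metadata

    def bulk_register(self, items):
        # All logical names and their context travel in a single request.
        self.round_trips += 1
        self.registered.update(items)


items = {f"/coll/file{i:04d}": {"size": i} for i in range(1000)}

one_by_one, bulk = CatalogClient(), CatalogClient()
for name, meta in items.items():
    one_by_one.register(name, meta)
bulk.bulk_register(items)

print(one_by_one.round_trips, bulk.round_trips)   # 1000 1
```

Both clients end with the same registered state; only the number of round trips differs, which is what dominates when network latency is high.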
  • SRB Latency Management
    • Replication
    • Server-initiated I/O
    • Streaming
    • Parallel I/O
    • Caching
    • Client-initiated I/O
    • Remote Proxies, Staging
    • Data Aggregation, Containers
    • Prefetch
    • Diagram endpoints: Source, Network, Destination
  • Southern California Earthquake Center
    • Build community digital library
    • Manage simulation and observational data
      • Anelastic wave propagation output
      • 10 TBs, 1.5 million files
    • Provide web-based interface
      • Support standard services on digital library
    • Manage data distributed across multiple sites
      • USC, SDSC, UCSB, SDSU, SIO
    • Provide standard metadata
      • Community based descriptive metadata
      • Administrative metadata
      • Application specific metadata
  • SCEC Digital Library Technologies
    • Portals
      • Knowledge interface to the library, presenting a coherent view of the services
    • Knowledge Management Systems
      • Organize relationships between SCEC concepts and semantic labels
    • Process management systems
      • Data processing pipelines to create derived data products
    • Web services
      • Uniform capabilities provided across SCEC collections
    • Data grid
      • Management of collections of distributed data
    • Computational grid
      • Access to distributed compute resources
    • Persistent archive
      • Management of technology evolution
  • Metadata Organization (Domain View versus Run View)
    • Diagram labels (as extracted): Domain List, Formatting, Output, Run, Provenance, Velocity Model, Fault Model, Physical, Numerical, Spatial, Temporal, Domain ..., Simulation Model, Program, Computer System
  • Zone SRB Federation
    • Mechanisms to impose consistency and access constraints when sharing:
      • Resources
        • Controls on which zones may use a resource
      • User names (user-name / domain / SRB-zone)
        • Users may be registered into another domain, but retain their home zone, similar to Shibboleth
      • Data files
        • Controls on who specifies replication of data
      • Context metadata
        • Controls on who manages updates to metadata
  • Data Grid Federation - zoneSRB (architecture diagram)
    • Application
      • C, C++, Java Libraries
      • Unix Shell
      • Linux I/O
      • DLL / Python, Perl
      • Java, NT Browsers
      • OAI, WSDL, OGSA
      • HTTP
    • Federation Management
    • Consistency & Metadata Management / Authorization-Authentication Audit
    • Logical Name Space, Latency Management, Data Transport, Metadata Transport
    • Catalog Abstraction
      • Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
    • Storage Repository Virtualization
      • Archives: Tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS
      • ORB
      • Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
      • File Systems: Unix, NT, Mac OSX
  • Peer-to-Peer Federation
    • Occasional Interchange - for specified users
    • Replicated Catalogs - entire state information replication
    • Resource Interaction - data replication
    • Replicated Data Zones - no user interactions between zones
    • Master-Slave Zones - slaves replicate data from master zone
    • Snow-Flake Zones - hierarchy of data replication zones
    • User / Data Replica Zones - user access from remote to home zone
    • Nomadic Zones “SRB in a Box” - synchronize local zone to parent zone
    • Free-floating “myZone” - synchronize without a parent zone
    • Archival “BackUp Zone” - synchronize to an archive
    • SRB Version 3.0.1 released December 19, 2003
  • Principal peer-to-peer federation approaches (1536 possible combinations)
  • Peer-to-peer federation approaches (diagram labels)
    • Zone groupings: Peer-to-Peer Zones, Replication Zones, Hierarchical Zones
    • Approaches: Occasional Interchange, Free Floating, Resource Interaction, User and Data Replica, Nomadic, Snow Flake, Master Slave, Replicated Data, Replicated Catalog, Archival
    • Distinguishing constraints
      • User-ID sharing: Partial User-ID Sharing, One Shared User-ID, Complete User-ID Sharing
      • Resource sharing: No Resource Sharing, Partial Resource Sharing, Complete Resource Sharing
      • Metadata synchronization: No Metadata Synch, System Controlled Partial Synch, System Controlled Complete Synch
      • Replication and access: System Managed Replication, System Set Access Controls, Connection From Any Zone
      • Zone organization: Hierarchical Zone Organization, Super Administrator Zone Control
  • Deep Archive
    • Impose sharing constraints:
      • Only system administrator access
      • Selected replication of files
      • Write once, with versions created on changes to data
    • Impose consistency constraints
      • Coordinate update of preservation metadata with file replication
    • Manage replication of both data and metadata
    • Use federation to guarantee preservation against
      • Local hardware and software failures
      • Local operation errors
      • Local disasters
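The "write once, with versions created on changes to data" constraint can be sketched as an append-only version list. `WriteOnceStore` is a hypothetical in-memory illustration and says nothing about how a deep archive zone is actually configured.

```python
class WriteOnceStore:
    """Hypothetical store in which existing versions are never overwritten."""

    def __init__(self):
        self._versions = {}

    def put(self, logical_name, data):
        """Append a new version only when the data changed; return the version count."""
        versions = self._versions.setdefault(logical_name, [])
        if not versions or versions[-1] != data:
            versions.append(data)
        return len(versions)

    def get(self, logical_name, version=None):
        """Return the latest version, or a specific 1-based version number."""
        versions = self._versions[logical_name]
        return versions[-1] if version is None else versions[version - 1]


store = WriteOnceStore()
store.put("/archive/report", b"v1")
store.put("/archive/report", b"v1")            # unchanged data, no new version
latest = store.put("/archive/report", b"v2")   # changed data, version 2 created
print(latest, store.get("/archive/report", version=1))   # 2 b'v1'
```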
  • Research
    • Information (semantic label) is an assertion that some criteria were met for the application of the label
      • Need to describe and manage the assertions (rules and relationships) used to apply semantic labels
    • Information (semantic label) expresses a context-related meaning that should be associated with a digital entity
      • Meaning is determined by the context
    • Characterization of information requires the ability to describe
      • The context that defines the assertions for assigning the label
      • The context that explains the meaning of the label
    • Organization of information requires the use of relationships (knowledge)
  • Knowledge Based Data Grid Roadmap
    • Service columns: Ingest Services, Management, Access Services
    • Knowledge layer
      • Relationships Between Concepts
      • Knowledge Repository for Rules
      • Knowledge or Topic-Based Query / Browse
    • Information layer
      • Attributes, Semantics
      • Information Repository
      • Attribute-based Query
    • Data layer
      • Fields, Containers, Folders
      • Storage (Replicas, Persistent IDs)
      • Feature-based Query
    • Other diagram labels: (Model-based Access), (Data Handling System), MCAT/HDF, Grids, XML DTD, SDLIP, XTM DTD, Rules - KQL
  • For More Information
    • Reagan W. Moore, San Diego Supercomputer Center
    • [email_address]
    • http://www.npaci.edu/DICE
    • http://www.npaci.edu/DICE/SRB
    • http://www.npaci.edu/dice/srb/mySRB/mySRB.html