A Taxonomy for Distributed Data Sharing, Management and Processing Chris Sosa VCGR 2007
  • Hierarchical – single source of data being pushed to distribution worldwide
    Federation – Each site manages its own data; databases already pre-existing
    Sensors (bottom-up) – Sensors push to a central DB. Flow of data is from the bottom -> up.
    Hybrid – Combos!
    Scope – difference between generic and adaptation to a particular domain
    VOs
      Collaborative – Created by entities that share a (single) common goal.
      Regulated – Controlled by a single organization.
      Economy-Based – enter into collaborations with consumers due to profit motive (SLAs etc.)
      Reputation-Based – Inviting entities to join based on the level of services that they are known to provide
    Data sources – no real notion of transient data yet
    Management – self-explanatory
  • Function – 3-tier stack (everything above implicitly has everything below). File I/O = remote appears as if local. Overlay network manages routing.
    Security – Also mutually exclusive. Can have multiple for Auth. Fine-grained, more flexible ownership of data – certs with tickets etc.
    Fault Tolerance – Cache transfer = store-forward
    Transfer Mode – Latency Management
  • Centralized (master / copy) or Decentralized – many copies with no master
    Storage Integration – Control over FS (kernel level) or using file system (high-level)
    Transfer Protocols – Open = data available outside of rep method
    Metadata – Two types of attributes (user-defined = VOs etc.)
    Update Type – how it is updated
    Replica Update Propagation – Epidemic vs on-demand
    Catalog – Replica Catalog – Tree, hash, DB
  • When and where to create a replica of the data.
    Method – Whether to adapt to changes in demand, bandwidth, or storage availability (more overhead)
    Granularity – how big
    Objective Function – Why –
  • Application model they are targeted towards
    Scope – Community-based (QoS, SLAs etc.) vs individual uses
    Data Replication – Attach to replication
    Utility – makespan – time it takes for all jobs to go in a se
    Locality
      Spatial – locating a job in such a way that all the data for the job is available on data hosts that are located close to the point of computation (moving jobs to the data)
      Temporal – if the data is close to the compute node, subsequent jobs which require the same data are scheduled to the same node (moving data to jobs)
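The spatial-locality note above (moving jobs to the data) can be illustrated with a small placement rule. This is only a sketch under stated assumptions: the function name `pick_node`, the node names, and the file names are all hypothetical, and "close to the data" is simplified to "already holds the most input files".

```python
def pick_node(job_inputs, node_files):
    """Pick the node already holding the most of the job's input files.

    Simplifying assumption: locality is measured only by file overlap.
    Ties are broken by node name so the result is deterministic.
    """
    needed = set(job_inputs)
    return min(
        sorted(node_files),
        key=lambda node: -len(needed & node_files[node]),
    )

# Hypothetical cluster state: which files each node currently stores.
node_files = {
    "node1": {"a.dat"},
    "node2": {"a.dat", "b.dat"},
    "node3": set(),
}

chosen = pick_node(["a.dat", "b.dat"], node_files)
```

Here `chosen` is the node that needs no extra transfers for this job. A real scheduler would also weigh bandwidth, load, and queue length.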
  • Data Grid Taxonomies

    1. 1. A Taxonomy for Distributed Data Sharing, Management and Processing Chris Sosa VCGR 2007
    2. 2. Overview
       • Discussion of Data Grids
           • What they are
           • Why they are useful
           • Why they are difficult
       • Taxonomies
       • Classification of Data Grids using Taxonomies
       • Attempted Classification of Genesis II
    3. 3. What are Data Grids?
       • Aggregation of geographically distributed, heterogeneous computing, storage, and network resources to provide unified, secure, and pervasive access.
       • Large data sets that can be shared worldwide
       • Data is a 1st-class resource
    4. 4. Why are Data Grids Important?
       • Proliferation of Data: Seeing GB -> PB
       • Geographical Distribution: They’re everywhere!
       • Sharing with Site Autonomy
       • A Single Source for a variety of data
       • Want to be able to search and discover suitable resources
    5. 5. Issues Related to Data Grids
       • Site Autonomy
       • Heterogeneity
       • Limited Resources
       • Single Source
       • Access Restrictions
       • Unified Namespace
    6. 6. Related Technologies
       • Content Delivery Network
           • Collection of non-source servers that offload work by delivering content on their behalf
           • For load-balancing mostly
       • Peer-to-Peer Network
           • Protection against volatility with scalability and reliability
           • Ad hoc aggregation of resources to form a decentralized system
       • Distributed Databases
           • ACID requirements
           • Logically organized collection of data stored at different sites w/ each site having some autonomy.
    7. 8. A Taxonomy for Data Grids
       • What is a Taxonomy?
           • Technique for classifying something into groups.
           • Technique used here is making a Graph (looks like a Tree but things can be multi-classified into different leaves).
       • Taxonomy broken into four sub-taxonomies
           • Organization
           • Data Transport
           • Data Replication
           • Scheduling
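The graph-rather-than-tree point can be illustrated with a small Python sketch. The layout below (a mapping from sub-taxonomy to a set of leaves) and the function name `classify` are assumptions made for illustration. The leaf names are copied from the Genesis II slide later in this deck.

```python
# Hypothetical sketch: a classification stored as {sub_taxonomy: set_of_leaves}.
# Because each value is a set, one system can sit under several leaves of the
# same sub-taxonomy at once, which is what makes this a graph and not a tree.
from collections import defaultdict

def classify(assignments):
    """Group (sub_taxonomy, leaf) pairs into {sub_taxonomy: set_of_leaves}."""
    result = defaultdict(set)
    for sub_taxonomy, leaf in assignments:
        result[sub_taxonomy].add(leaf)
    return dict(result)

# Leaf names taken from the Genesis II classification slide.
genesis_ii = classify([
    ("Organization", "Federated"),
    ("Organization", "Collaborative"),
    ("Data Transport", "File I/O"),
    ("Data Transport", "Restart"),
    ("Scheduling", "Makespan"),
    ("Scheduling", "Spatial"),
])
```

Looking up `genesis_ii["Organization"]` returns both "Federated" and "Collaborative", showing one system classified under more than one leaf.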
    8. 9. Organization Sub-Taxonomy
    9. 10. Data Transport Sub-Taxonomy
    10. 11. Data Replication and Storage Sub-Taxonomy
    11. 12. Replication Architecture Sub-sub-Taxonomy
    12. 13. Replication Strategy Sub-sub-Taxonomy (cont’d)
    13. 14. Resource Allocation and Scheduling Sub-Taxonomy
    14. 15. Classification Time
       • For complete classification see section 5 in the paper.
       • Next few slides highlight interesting aspects of classifying technologies
       • End with discussion of Genesis II classification
    15. 16. Classification Time: Organization
       • HEP – hierarchical and shared facilities for computing and storage (collaborative)
       • Astronomy – organizing VOs to find single source data. Federated model.
       • Bio-Informatics – Federated model (over DBs) and providing common data formats.
       • Earth Sciences (NEESgrid) – bottom-up model.
    16. 17. Classification Time: Data Transport
       • GASS (Globus Toolkit)
           • Data access mechanism.
           • Goal to provide uniform access.
           • Remote I/O mechanism for Grid apps.
           • Fetches entire file onto “cache”. Can use prestaging etc.
       • IBP (Internet Backplane Protocol)
           • Optimizes data transfer with a “store-and-forward” protocol.
           • Fixed-size byte arrays in a global addressing space
           • Security is capabilities-based.
       • GridFTP (misconception: doesn’t require Globus Toolkit)
           • Extension of the default FTP protocol to provide additional Grid functionality.
           • Allows GSI- and Kerberos-based authentication.
           • Multiple TCP streams over the same channel; also handles data striping.
           • Restart capability
       • Kangaroo
           • IBP but hidden from user (cannot be explicitly told how to route)
           • Reads/writes are in the background (non-blocking) unless told otherwise
           • Uses hops
    17. 18. Classification Time: Data Transport (cont’d)
       • Legion I/O
           • Object-oriented middleware
           • Single system image (distributed file system)
           • Transparent access by native and legacy apps
           • Uses X.509 Proxies to handle security for file transfers (data not encrypted while in transit)
       • SRB I/O (Storage Resource Broker)
           • Uniform and transparent interface to heterogeneous storage systems
           • Parallel I/O and 3rd-party transfers
           • Fine-grained security.
           • Remote procedures
       • Stork
           • Scheduler for data placement jobs
           • Can translate between mutually incompatible transfer protocols
           • Can create DAGs (directed acyclic graphs) to plan higher-level transfers (data pipelines)
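The idea of planning a data pipeline as a DAG can be sketched with a topological sort. This is a minimal illustration using Python's standard `graphlib` module (Python 3.9+). It is not Stork's submit format or API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical data pipeline: each key is a job, each value is the set of
# jobs that must finish before it can start. Names are illustrative only.
pipeline = {
    "stage_in": set(),
    "transfer": {"stage_in"},
    "verify": {"transfer"},
    "cleanup": {"verify"},
}

def plan(dag):
    """Return one valid execution order for the jobs in the DAG."""
    return list(TopologicalSorter(dag).static_order())

order = plan(pipeline)
```

Because this pipeline is a simple chain, only one order is valid. With branching dependencies, `static_order()` returns one of several valid orders, and a cycle raises `graphlib.CycleError`.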
    18. 19. Classification Time: Data Replication and Storage
       • GFarm (Grid DataFarm) – for data-intensive programs
           • GFarm’s (parallel) file system unifies the file addressing space over all nodes
           • Replica management is dynamic and coupled with scheduling
           • Data in a file can be broken into fragments on multiple disks
           • Files are write-once
       • Giggle (GIGa-scale Global Location Engine) – architecture framework for a replica location service (RLS).
           • Data represented by a logical file name (LFN)
           • Physical location identified by a physical file name (PFN), a URL
           • Local Replica Catalogs (LRCs) hold the mappings between LFNs and PFNs
           • Replica Location Index (RLI) creates an index of replica catalogs (pointer from LFNs to LRCs). Periodically updated via polling.
           • Aimed at write-once, read-many. Only provides indexing.
       • GDMP – to provide secure and high-speed file transfer services.
           • Based on pub-sub model.
           • GSI as security model (auth + authz).
           • HEP uses it.
           • Client replicates from central storage
       • SRB – to enable creation of shared collections
           • Unified view of data files which is analogous to the UNIX fs structure.
           • Static replication with replication managed at the container / dataset level
           • SRB focuses on preservation of the data
           • Decentralized model with Hybrid scheme – Tree for naming, ring for replication
           • Replication is organized with a DB
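The two-level lookup described for Giggle (an index that points from LFNs to catalogs, and catalogs that map LFNs to PFNs) can be sketched with plain dictionaries. This is an in-memory illustration only, not the RLS API. The site names, LFN, URLs, and the functions `build_rli` and `locate` are all hypothetical.

```python
# Hypothetical in-memory stand-ins for the two Giggle components.
# Each Local Replica Catalog (LRC) maps LFN -> list of PFNs (URLs) at one site.
lrcs = {
    "site_a": {"lfn://run42.dat": ["gsiftp://a.example.org/data/run42.dat"]},
    "site_b": {"lfn://run42.dat": ["gsiftp://b.example.org/store/run42.dat"]},
}

def build_rli(catalogs):
    """Build the index: LFN -> set of LRC names that know about that LFN."""
    index = {}
    for name, catalog in catalogs.items():
        for lfn in catalog:
            index.setdefault(lfn, set()).add(name)
    return index

def locate(lfn, rli, catalogs):
    """Resolve an LFN to all known PFNs: ask the index first, then each LRC."""
    pfns = []
    for name in sorted(rli.get(lfn, ())):
        pfns.extend(catalogs[name][lfn])
    return pfns

rli = build_rli(lrcs)
replicas = locate("lfn://run42.dat", rli, lrcs)
```

The index stores only pointers to catalogs, not physical locations, which matches the slide's remark that the service only provides indexing. In a deployed system the index would be refreshed periodically and could be stale between refreshes.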
    19. 20. Classification Time: Allocation and Scheduling
    20. 21. Genesis II Classification
       • Organization – Federated, Interdomain, Collaborative, Stable, Managed
       • Data Transport – File I/O (RNS), Cryptographic Keys (WS-Security), SSL, Fine-grained (through delegation), Restart, Block + Stream (ByteIO)
       • Replica Architecture and Strategy – TBD
       • Scheduling – Process-Oriented, Individual, Makespan, Spatial.
    21. 22. Questions?