Robust Grid Computing


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Robust Grid Computing

    1. 1. Robust Grid Computing Using Peer-to-Peer Services Alan Sussman Department of Computer Science
    2. 2. Desktop Grids and P2P Systems Convergence of Grid and P2P Centralized Server-Client Architecture Complex scientific applications From: Decentralized Peer Networks File Sharing From:
    3. 3. P2P Desktop Grid System <ul><li>Executing applications on a (widely) distributed set of resources </li></ul><ul><ul><li>Decentralized , Robust , Highly available and Scalable </li></ul></ul><ul><li>Goal is to enable resource sharing </li></ul><ul><ul><li>create ad-hoc grids within (research) communities </li></ul></ul><ul><li>Applications </li></ul><ul><ul><li>Large computational requirements and relatively low I/O requirements </li></ul></ul><ul><ul><ul><li>Physical simulations and data analysis related to astronomy, physics, other scientific disciplines - parameter sweeps </li></ul></ul></ul><ul><ul><ul><ul><li>binary asteroid formation (co-I Richardson) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>comet composition (Deep Impact, co-I Wellnitz) </li></ul></ul></ul></ul><ul><ul><ul><li>Monte Carlo simulations </li></ul></ul></ul><ul><ul><ul><li>Bioinformatics applications </li></ul></ul></ul>
    4. 4. Hard Problems / Issues <ul><li>Job Submission </li></ul><ul><ul><li>Submit a job into the decentralized P2P system </li></ul></ul><ul><li>Matchmaking </li></ul><ul><ul><li>Find a resource that can meet the minimum resource requirements of a job </li></ul></ul><ul><li>Load balance </li></ul><ul><ul><li>Distribute the load (jobs) across the nodes in the system </li></ul></ul><ul><ul><li>And no node has too much overhead (messages) </li></ul></ul><ul><li>Resilience to failure </li></ul><ul><ul><li>Overall system must not lose jobs from failures, and allow nodes to join and leave (or fail) dynamically </li></ul></ul>
    5. 6. Goals of Matchmaking Algorithm <ul><li>Expressiveness </li></ul><ul><ul><li>Allow users to specify any type of resource requirements </li></ul></ul><ul><li>Load balance </li></ul><ul><ul><li>Distribute load across multiple capable candidates </li></ul></ul><ul><li>Parsimony </li></ul><ul><ul><li>Resources should not be wasted </li></ul></ul><ul><li>Completeness </li></ul><ul><ul><li>A valid assignment of a job to a node must be found if such an assignment exists </li></ul></ul><ul><li>Low overhead </li></ul><ul><ul><li>Routing must not add significant overhead </li></ul></ul>
    6. 7. Basic Assumptions <ul><li>Underlying Distributed Hash Table (DHT) </li></ul><ul><ul><li>Object location and routing in P2P network </li></ul></ul><ul><ul><li>Reformulate the problem of matchmaking to one of routing </li></ul></ul><ul><li>Job in the system </li></ul><ul><ul><li>Data and associated profile (requirements) </li></ul></ul><ul><ul><li>All jobs are independent </li></ul></ul><ul><li>Optimization criterion </li></ul><ul><ul><li>Minimize time to complete all jobs (combination of throughput and response time) </li></ul></ul><ul><ul><li>Queuing time used as proxy for response time </li></ul></ul>
    7. 8. Modified Content-Addressable Network <ul><li>Basic CAN </li></ul><ul><ul><li>Logical d -dimensional space </li></ul></ul><ul><ul><ul><li>zone, neighbors, greedy forwarding </li></ul></ul></ul><ul><li>Formulate the matchmaking problem as a routing problem in CAN space </li></ul><ul><ul><li>Each resource type is a CAN dimension </li></ul></ul><ul><ul><li>Map nodes and jobs into the CAN space </li></ul></ul><ul><ul><ul><li>Resource capabilities and requirements, respectively </li></ul></ul></ul><ul><ul><li>Search for a node whose coordinates in all dimensions meet or exceed the job’s requirements </li></ul></ul>
    8. 9. CAN-based Algorithm <ul><li>Employ dynamic aggregated resource information for load balancing </li></ul>CPU Dimension Memory Dimension Node A Routing <ul><li>Aggregate Resource Information </li></ul>Job <ul><li>Choose the least loaded direction </li></ul><ul><li>Push a job into under-loaded region </li></ul><ul><li>Stop pushing </li></ul><ul><li>Choose the best run node </li></ul>S Node B Job
    9. 10. Different Resource Types <ul><li>Categorical Resources </li></ul><ul><ul><li>Architecture, Operating system type (version), etc. </li></ul></ul><ul><ul><li>Exact match </li></ul></ul><ul><li>Continuous Resources </li></ul><ul><ul><li>CPU speed, Memory amount, Disk space, etc. </li></ul></ul><ul><ul><li>Minimum match </li></ul></ul><ul><li>Until recently, our work considered only continuous resource types </li></ul>
    10. 11. Integrating Categorical Resources <ul><li>How to integrate all types of resources in CAN? </li></ul><ul><ul><li>Could add new CAN dimensions for categorical resource types </li></ul></ul><ul><ul><li>Could create a distinct CAN space for each combination of categorical resource types </li></ul></ul><ul><li>Our Approach </li></ul><ul><ul><li>Virtual Peers + 1-dimensional transformation </li></ul></ul>
    11. 12. Virtual Peer+1-D Transformation <ul><li>1-dimensional Transformation </li></ul><ul><ul><li>Transform all categorical resource dimensions into a single dimension </li></ul></ul><ul><ul><ul><li>To make virtual peer management and failure recovery simple, and re-use our existing algorithms </li></ul></ul></ul><ul><ul><li>Use Hilbert Space-Filling Curve </li></ul></ul><ul><li>Virtual peers cover the unoccupied parts of the CAN space in the categorical dimension </li></ul><ul><ul><li>Effectively divide the CAN space into multiple disjoint sub-CANs and connect them </li></ul></ul>
    12. 13. Memory C- Dim B A C V BE E F V EH V H H D Linux & Intel OSX & PPC Windows & Intel AIX & Power Solaris & Sparc M J I G Job J OS == Solaris && Arch == Sparc && Memory >= M J Job J Owner Run Node Job J
    13. 14. Improving Overall Scalability <ul><li>Partial Updates </li></ul><ul><ul><li>Limit the size of a single update message </li></ul></ul><ul><li>Probabilistic Heartbeat Messaging </li></ul><ul><ul><li>Limit the number of outgoing update messages </li></ul></ul><ul><li>Specialized Routing along T-dimension </li></ul><ul><ul><li>Balance the load of processing routing messages from a sub-CAN to another sub-CAN </li></ul></ul><ul><li>All without affecting correctness of CAN algorithms </li></ul><ul><ul><li>Must quantify performance and reliability effects </li></ul></ul>
    14. 15. Experimental Results
    15. 16. Current Status <ul><li>Resource discovery algorithms thoroughly simulated and verified </li></ul><ul><li>CAN-based implementation complete and being tested </li></ul><ul><ul><li>All CAN services working – node join, node leave/fail, job assignment, load aggregation </li></ul></ul><ul><ul><li>Basic authentication mechanism in place, based on certificates and public-key authentication </li></ul></ul><ul><ul><li>Job management and client interface completed </li></ul></ul><ul><ul><li>Categorical resource types still being implemented </li></ul></ul><ul><li>Large-scale deployment and measurement in process </li></ul><ul><ul><li>binary asteroid formation application (Richardson) </li></ul></ul>
    16. 17. Ongoing Work <ul><li>Deploying the prototype system for real workloads and real machines </li></ul><ul><ul><li>practical issues, such as firewalls, user interface for job submission </li></ul></ul><ul><ul><li>with serious logging for measuring system behavior </li></ul></ul><ul><li>Dealing with multicore and multiprocessor machines </li></ul><ul><li>Better characterization of real workloads </li></ul><ul><ul><li>via consultation with Astronomy collaborators, logging behavior of the peer, and mining of Condor system logs </li></ul></ul><ul><li>High quality peer software, with good user tools </li></ul>
    17. 18. The Project Team <ul><li>Faculty members </li></ul><ul><ul><li>Alan Sussman, Pete Keleher, Bobby Bhattacharjee, Derek Richardson (Astronomy), Dennis Wellnitz (Astronomy) </li></ul></ul><ul><li>Matchmaking algorithms and simulations </li></ul><ul><ul><li>Jik-Soo Kim (RA) </li></ul></ul><ul><ul><li>Jaehwan Lee (RA) </li></ul></ul><ul><li>Peer implementation and other tools </li></ul><ul><ul><li>Michael Marsh (research scientist), Beomseok Nam (postdoc) </li></ul></ul><ul><ul><li>Sukhyun Song, San Ratanasanya (RAs) </li></ul></ul>
    18. 19. End of Talk