Distributed Checkpointing on an Enterprise Desktop Grid

Doctoral dissertation defense from Colorado Technical University



  1. Brent Wilson, DCS-3 Dissertation Defense
  2. • Brief Introduction of Topic
     • Introduction of Research Problem/Question
     • Research Methodology
     • Experiment Results/Analysis
     • Future Research – The next step…
  3. • A client-server architecture that allows volunteer donation of idle CPU cycles from non-dedicated computers
     • A central scheduler distributes tasks among the worker nodes and awaits their results
     • An application finishes only when all of its tasks have been completed
     • Commonly known as “volunteer computing” (a minimal sketch of the pattern follows)
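A minimal sketch of the scheduler/worker loop described above, assuming workers are plain Python callables; this is only an illustration of the pattern, not the dissertation's middleware, which dispatches tasks to remote machines over the network.

```python
# Illustrative master/worker loop for a desktop grid application.
# Assumption: "workers" are local callables standing in for remote client nodes.
from queue import Queue

def run_application(tasks, workers):
    """Distribute tasks to workers and collect results; the application
    finishes only when every task has completed."""
    pending = Queue()
    for task in tasks:
        pending.put(task)
    results = []
    while len(results) < len(tasks):
        task = pending.get()
        worker = workers[len(results) % len(workers)]  # naive round-robin dispatch
        try:
            results.append(worker(task))               # await this task's result
        except Exception:
            pending.put(task)                          # failed tasks are rescheduled
    return results

# Example: square ten numbers using two "worker" nodes.
print(run_application(list(range(10)), [lambda x: x * x, lambda x: x * x]))
```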
  4. • Loss of computation
       ◦ Uncompleted tasks need to be rescheduled
     • Volatility of resources
       ◦ Network outages
       ◦ Machine crashes
     • Interference failures
       ◦ Machine owners retake control of their machines
  5. • A checkpoint is a snapshot of a process's state
     • Clients save checkpoints periodically during execution of grid tasks
     • Checkpoints are used to restart client applications from a previously saved state (see the sketch below)
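A minimal sketch of periodic checkpointing for a single grid task, assuming the task's state fits in a picklable dict written to a local file; the checkpoint format, interval, and storage locations used by the dissertation's middleware may differ.

```python
# Periodic checkpoint/restart sketch for one grid task (illustrative only).
import os
import pickle

CHECKPOINT_FILE = "task.ckpt"       # hypothetical local checkpoint path
CHECKPOINT_INTERVAL = 1_000_000     # work units between checkpoints (assumed)

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"completed": 0, "partial_result": 0}

def save_checkpoint(state):
    """Write the snapshot atomically so a crash mid-write stays recoverable."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def run_task(total_work):
    state = load_checkpoint()
    while state["completed"] < total_work:
        state["partial_result"] += 1    # stand-in for the real computation step
        state["completed"] += 1
        if state["completed"] % CHECKPOINT_INTERVAL == 0:
            save_checkpoint(state)
    return state["partial_result"]
```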
  6. • Easy Distributed Grid for Advanced Research (EDGAR)
       ◦ Developed at George Fox University, Fall 2008
       ◦ Grid middleware/framework:
         ▪ Developed in Java to allow for client heterogeneity
         ▪ All grid communications use Java Sockets
         ▪ Communication classes are extensible to allow for various checkpointing architectures
       ◦ Grid applications:
         ▪ Grid applications are developed in Python
         ▪ Simple syntax allows for Grid access by a larger scientific community
  7. • The performance of a desktop grid degrades when client nodes fail: the execution time increases due to failed nodes.
     • To reduce this degradation, either:
       ◦ Client nodes must not fail, or
       ◦ The impact of client failures on the grid’s performance must be minimized
  8. • Can the distribution of shared checkpoints within a neighborhood of nodes, as compared to the centralization of shared checkpoints on a checkpoint server, significantly improve the completion turnaround time of enterprise desktop grid applications?
  9. • H0: There is no significant difference in turnaround time between desktop grid applications run using distributed shared checkpoints and those using centralized shared checkpoints.
     • H1: There is a significant improvement in the completion time of desktop grid applications utilizing distributed shared checkpoints compared to centralized shared checkpoints.
  10. µ1 = mean execution time of the centralized shared checkpoint grid
      µ2 = mean execution time of the distributed shared checkpoint grid

      H0: µ1 ≤ µ2        H1: µ1 > µ2

      Large-sample test of two means with unknown standard deviations; reject H0 when z > 1.645 (α = 0.05).
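For reference, the test statistic behind this comparison (not spelled out on the slide) is the standard large-sample two-mean z, with the sample standard deviations standing in for the unknown population values:

```latex
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^{2}}{n_1} + \dfrac{s_2^{2}}{n_2}}}
```

where x̄_i, s_i, and n_i are the sample mean, sample standard deviation, and sample size for each architecture.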
  11. • Hardware/Network Topology
  12. • Checkpointing Architectures
        ◦ Centralized Checkpointing – Checkpoint Server
          ▪ Clients save their checkpoint data both locally and on the checkpoint server
          ▪ When a client fails, the grid server reassigns its task to an available client, which resumes from the last checkpoint (see the sketch after this list)
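A minimal sketch of the centralized flow, assuming checkpoints are held by a single server object; the class and method names here are illustrative, not the middleware's actual API, which exchanges this data over sockets.

```python
# Centralized shared-checkpoint flow (illustrative sketch).

class CheckpointServer:
    """Holds the most recent checkpoint for every task."""
    def __init__(self):
        self.checkpoints = {}                  # task_id -> latest checkpoint

    def store(self, task_id, checkpoint):
        self.checkpoints[task_id] = checkpoint

    def latest(self, task_id):
        return self.checkpoints.get(task_id)   # None means start from scratch

def client_checkpoint(server, task_id, local_store, state):
    """A client persists each checkpoint both locally and on the central server."""
    local_store[task_id] = state
    server.store(task_id, state)

def reassign_failed_task(server, task_id, available_client):
    """The grid server restarts a failed task on another client from the most
    recently stored checkpoint (available_client.resume is hypothetical)."""
    return available_client.resume(task_id, server.latest(task_id))
```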
  13. • Checkpointing Architectures
        ◦ Distributed Checkpointing – Neighborhood of Nodes
          ▪ Neighborhoods are created from geographically close nodes
          ▪ Clients save their checkpoint data locally and with each neighbor
          ▪ When a client fails, the first of its neighbors to complete its own task picks up the failed task from the last checkpoint (see the sketch after this list)
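A matching sketch of the neighborhood-of-nodes scheme, assuming each node holds in-memory replicas of its neighbors' checkpoints; again, the names are illustrative rather than the middleware's API.

```python
# Distributed (neighborhood) shared-checkpoint flow (illustrative sketch).

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []       # geographically close nodes
        self.replicas = {}        # neighbor name -> that neighbor's last checkpoint
        self.own_ckpt = None
        self.busy = True          # still working on its own task?

    def checkpoint(self, state):
        """Save the checkpoint locally and replicate it to every neighbor."""
        self.own_ckpt = state
        for neighbor in self.neighbors:
            neighbor.replicas[self.name] = state

def takeover(failed_node):
    """The first neighbor that has finished its own task resumes the failed
    task from the replica it already holds."""
    for neighbor in failed_node.neighbors:
        if not neighbor.busy and failed_node.name in neighbor.replicas:
            return neighbor, neighbor.replicas[failed_node.name]
    return None, None   # no neighbor free yet; the task waits
```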
  14. • The Grid Application
        ◦ A parallel implementation that approximates PI using a dart-throwing (Monte Carlo) method (a sketch of one task follows this list)
        ◦ Darts are “thrown” at a circle inscribed in a square
        ◦ PI ≈ 4.0 × (darts in circle) / (total darts)
        ◦ Grid tasks (darts to be thrown) were randomly sized at between 1% and 2% of the total darts
        ◦ The total darts per application were also randomly selected, between 10 and 12 billion darts
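A sketch of one grid task of the dart-throwing application; the per-task dart count here is arbitrary, and the real application sums the hit counts from all tasks before applying the final formula.

```python
# One task of the Monte Carlo PI approximation: throw darts at a circle
# inscribed in a square and count the hits.
import random

def throw_darts(n_darts):
    """Count darts (uniform points in the square [-1, 1] x [-1, 1]) that land
    inside the inscribed unit circle."""
    hits = 0
    for _ in range(n_darts):
        x, y = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    darts = 1_000_000                          # one small task's share of darts
    print(4.0 * throw_darts(darts) / darts)    # prints roughly 3.14
```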
  15. • Client Node Failures
        ◦ After each checkpoint, a failure process was run (see the sketch after this list)
        ◦ Each machine had a 2% randomized chance of failure
      • Network Load
        ◦ Experiments were performed once with no network load other than normal grid communications
        ◦ Experiments were also performed with a heavy network load, 150% of normal
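A minimal sketch of the failure-injection step, assuming the flat 2% per-machine probability stated on the slide; how the experiment actually signalled or implemented a failure is not shown here.

```python
# Failure process run after each checkpoint (illustrative sketch).
import random

FAILURE_PROBABILITY = 0.02   # 2% chance per machine, per checkpoint

def maybe_fail(machine_name):
    """Simulate a client node failure with the configured probability."""
    if random.random() < FAILURE_PROBABILITY:
        raise RuntimeError(f"simulated failure on {machine_name}")
```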
  16. • Experiment – Run 100 Grid Applications under each of four conditions:
        ◦ No network traffic other than grid traffic, using a centralized shared checkpoint server architecture
        ◦ No network traffic other than grid traffic, using a distributed shared checkpoint architecture (Neighborhood of Nodes)
        ◦ Heavy network traffic, using a centralized checkpoint server
        ◦ Heavy network traffic, using a distributed shared checkpoint architecture (Neighborhood of Nodes)
  17. • Data Collection – For each of the 100 runs under the four conditions:
        ◦ Total Execution Time
        ◦ Total Node Failures
        ◦ Checkpoint Server Failure (if applicable)
        ◦ Checkpointing Architecture
        ◦ Network Load
  18. Experiments 1 & 2

                                    Checkpoint Server   Neighborhood of Nodes
      Mean Exec Time                1858.72 sec         1820.07 sec
      Mean Node Failures            3.87                3.91
      Checkpoint Server Failures    1                   N/A
      Architecture                  Chkpt Srvr          Neighborhood
      Network Load                  None                None
  19. Experiments 3 & 4

                                    Checkpoint Server   Neighborhood of Nodes
      Mean Exec Time                1898.05 sec         1844.61 sec
      Mean Node Failures            4.02                4.09
      Checkpoint Server Failures    2                   N/A
      Architecture                  Chkpt Srvr          Neighborhood
      Network Load                  Heavy               Heavy
  20.                               Checkpoint Server   Neighborhood of Nodes
                                    (No Net Load)       (No Net Load)
      Mean Execution Time           1858.72             1820.07
      Standard Deviation            164.27              147.07
      Sample Size                   100                 100

      z-Score = 1.753
  21.                               Checkpoint Server   Neighborhood of Nodes
                                    (Heavy Net Load)    (Heavy Net Load)
      Mean Execution Time           1898.05             1844.61
      Standard Deviation            156.53              181.93
      Sample Size                   100                 100

      z-Score = 2.227
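As a quick sanity check, the reported z-scores can be reproduced from the slide statistics with the two-sample z formula:

```python
# Recompute the z-scores from the means, standard deviations, and sample sizes
# shown on slides 20 and 21.
from math import sqrt

def z_score(mean1, sd1, n1, mean2, sd2, n2):
    return (mean1 - mean2) / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# No network load: prints 1.753
print(round(z_score(1858.72, 164.27, 100, 1820.07, 147.07, 100), 3))

# Heavy network load: prints 2.227
print(round(z_score(1898.05, 156.53, 100, 1844.61, 181.93, 100), 3))
```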
  22. • Network Isolation
        ◦ The z-score of 1.753 is greater than 1.645
      • Loaded Network
        ◦ The z-score of 2.227 is greater than 1.645
      • At the 95% confidence level the null hypothesis is rejected in both cases.
      • Therefore, there is a significant difference in turnaround time between the distributed shared checkpointing architecture and the centralized shared checkpointing architecture.
  23. • The Neighborhood of Nodes architecture performed better under a heavy network load. At what level of network load will this architecture begin to experience diminishing returns?
      • Neighborhoods had a maximum of three neighbors. What effect will increasing the neighborhood size have on the performance of this architecture?
  24. • Neighborhoods were pre-established through static means. What effect will a dynamic protocol for assigning neighborhoods have on overall performance?
  25. Questions? Comments? bwilson@georgefox.edu
  26. • DIF – Distributed Infrastructure Network
        ◦ A "different" perspective on grid computing
      • SINC – Simple Infrastructure for Network Computing
        ◦ Don't let your CPU cycles go down the drain!
      • SANCT – Simple Architecture for Network Computing Today
        ◦ SANCT-ify your code!
      • QUAKER – QUick Architecture Kit for Enterprise Research
        ◦ Replace your manpower with a "friend"
      • NERD – Network Enterprise Research Distribution
        ◦ Get a Nerd on your side
      • GEEKI – Grid Enhanced Enterprise Kit Infrastructure
        ◦ Professor Wilson said he wanted a "geeky" name
      • SINER – Simple Infrastructure for Network Enterprise Research
        ◦ Redeem your CPU!
      • EDGAR – Easy Distributed Grid for Advanced Research
      • PyGI – Python Grid Infrastructure
        ◦ "This one went to your CPU..."
      • SIMPLE – Simple Integration of Multiple PCs for Large Enterprises
  27. • Anderson, D. (2004). BOINC: A system for public-resource computing and storage. Paper presented at the 5th IEEE/ACM International Workshop on Grid Computing, Pittsburgh, PA, USA.
      • Casanova, H. (2002). Distributed computing research issues in grid computing. SIGACT News, 33(3), 50-70.
      • Coulson, G., Grace, P., Blair, G., Mathy, L., Duce, D., Cooper, C., et al. (2004). Towards a component-based middleware framework for configurable and reconfigurable Grid computing. Paper presented at the 13th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE'04).
      • Darby, P. J., & Tzeng, N. F. (2007). Peer-to-peer checkpointing arrangement for mobile grid computing systems. Paper presented at the 16th International Symposium on High Performance Distributed Computing.
      • de Camargo, R. Y., Cerqueira, R., & Kon, F. (2005). Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems. Paper presented at the 3rd International Workshop on Middleware for Grid Computing.
      • Domingues, P., Andrzejak, A., & Silva, L. M. (2006). Using checkpointing to enhance turnaround time on institutional desktop grids. Paper presented at the Second IEEE International Conference on e-Science and Grid Computing (e-Science'06), Amsterdam, The Netherlands.
      • Domingues, P., Araujo, F., & Silva, L. M. (2006). A DHT-based infrastructure for sharing checkpoints in desktop grid computing. Paper presented at the Second IEEE International Conference on e-Science and Grid Computing (e-Science'06), Amsterdam, The Netherlands.
      • Domingues, P., Marques, P., & Silva, L. M. (2005). Resource usage of Windows computer laboratories. Paper presented at the International Conference on Parallel Processing, Oslo, Norway.
      • Fedak, G., Germain, C., Neri, V., & Cappello, F. (2001). XtremWeb: A generic global computing system. Paper presented at the 1st International Symposium on Cluster Computing and the Grid, Brisbane, Australia.
      • Goldchleger, A., Kon, F., Goldman, A., Finger, M., & Bezerra, G. (2000). InteGrade: Object-oriented Grid middleware leveraging idle computing power of desktop machines. University of Sao Paulo, Brazil.
      • Kurniawan, D., & Abramson, D. (2007). An integrated Grid development environment in Eclipse. Paper presented at the Third IEEE International Conference on e-Science and Grid Computing (e-Science'07), Bangalore, India.
      • Litzkow, M., Livny, M., & Mutka, M. (1988). Condor: A hunter of idle workstations. Paper presented at the 8th International Conference on Distributed Computing Systems, Washington, DC, USA.
      • Pheatt, C. (2007). An easy to use distributed computing framework. Paper presented at the 38th SIGCSE Technical Symposium on Computer Science Education.
