Brent Wilson
              DCS-3
Dissertation Defense
   Brief Introduction of Topic

   Introduction of Research Problem/Question

   Research Methodology

   Experiment Results/Analysis

   Future Research – The next step…
   A client-server architecture that allows for
    volunteer donation of idle CPU cycles from
    non-dedicated computers

   A central scheduler distributes tasks among
    the worker nodes and awaits their results.

   An application only finishes when all tasks
    have been completed

   “volunteer computing”
   Loss of computation
    ◦ Uncompleted tasks must be rescheduled
   Volatility of Resources
    ◦ Network outages
    ◦ Machine crashes
   Interference Failures
    ◦ Machine owners retake control of the machine
   A checkpoint is a process state snapshot

   Clients save checkpoints periodically during
    execution of grid tasks

   Checkpoints are used to restart client
    applications from a previously saved state
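The save/restore cycle above can be sketched in Python (the language EDGAR grid applications use); the file name and state layout here are illustrative assumptions, not EDGAR's actual checkpoint format:

```python
import os
import pickle

CHECKPOINT_FILE = "task.ckpt"  # hypothetical file name

def save_checkpoint(state):
    """Snapshot the task state to disk; write-then-rename keeps the file atomic."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def load_checkpoint():
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(CHECKPOINT_FILE):
        return None
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)

# A restarted client resumes from the snapshot instead of starting over.
state = load_checkpoint() or {"iteration": 0, "partial_result": 0}
```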
   Easy Distributed Grid for Advanced Research
    ◦ Developed at George Fox University, Fall 2008
    ◦ Grid middleware/framework:
      Developed in Java to allow for client heterogeneity
      All grid communications use Java Sockets
      Communication classes are extensible to allow for
       various checkpointing architectures

    ◦ Grid applications:
      Grid applications are developed in Python
      Simple syntax allows for Grid access by the larger scientific community
   The performance of a desktop grid degrades
    when client nodes fail: work lost on failed
    nodes must be redone, increasing execution time.

   To reduce this degradation either:

    ◦ Client nodes must not fail or

     ◦ The impact of client failures on the grid’s
       performance must be minimized
   Can distributing shared checkpoints within a
    neighborhood of nodes, rather than
    centralizing shared checkpoints on a
    checkpoint server, significantly improve
    completion turnaround time for enterprise
    desktop grid applications?
   H0: There is no significant difference in
    turnaround time of applications run on a
    desktop grid using distributed shared
    checkpoints as compared to centralized
    shared checkpoints.
   H1: There is a significant improvement in the
    completion time of desktop grid applications
    utilizing distributed shared checkpoints
    compared to centralized shared checkpoints.
µ1 = Mean execution time of centralized shared checkpoint grid.
µ2 = Mean execution time of distributed shared checkpoint grid.

H0: µ1 ≤ µ2

H1: µ1 > µ2

Large-sample test of two means with unknown standard deviations

Reject H0 when z > 1.645 (α = 0.05)
   Hardware/Network Topology
   Checkpointing Architectures

    ◦ Centralized Checkpointing – Checkpoint Server

      Clients save their checkpoint data both locally and on
       the checkpoint server

      When a client fails, the grid server reassigns the task
       to an available client, which resumes from the last
       checkpoint
   Checkpointing Architectures
    ◦ Distributed Checkpointing – Neighborhood of Nodes

      Neighborhoods are created from geographically close
       nodes

      Clients save their checkpoint data locally and with each
       neighbor

      When a client fails, the first of its neighbors to
       complete its own task picks up the failed task from
       the last checkpoint
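This recovery rule can be sketched as follows; the `Node` class and `recover` helper are hypothetical illustrations (EDGAR's real communication classes are Java), not the framework's actual API:

```python
# Hypothetical sketch of the neighborhood recovery rule: checkpoints are
# replicated to each neighbor, and when a client fails, the first neighbor
# to finish its own task adopts the failed task from the last checkpoint.

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbor_checkpoints = {}  # failed node's name -> replicated state

    def receive_checkpoint(self, owner_name, state):
        """Store a replica of a neighbor's checkpoint."""
        self.neighbor_checkpoints[owner_name] = dict(state)

def recover(failed_name, finish_order):
    """The first finished neighbor holding a replica resumes the failed task."""
    for node in finish_order:
        if failed_name in node.neighbor_checkpoints:
            return node, node.neighbor_checkpoints[failed_name]
    return None, None

# Node "a" replicated its checkpoint to neighbors b and c; c finished first,
# so c adopts a's task from the last checkpoint.
b, c = Node("b"), Node("c")
b.receive_checkpoint("a", {"iteration": 120})
c.receive_checkpoint("a", {"iteration": 120})
adopter, state = recover("a", [c, b])
```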
   The Grid Application
    ◦ A parallel implementation that approximates PI
      using a dart-throwing (Monte Carlo) method.

    ◦ Darts are “thrown” at a circle inscribed in a square.

    ◦ PI = 4.0 * darts in circle / total darts.

    ◦ Grid tasks (darts to be thrown) were randomly
      selected between 1% and 2% of the total darts.

    ◦ Total darts per application were also randomly
      selected between 10 and 12 billion darts.
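The dart-throwing estimator can be sketched as below; the dart counts are scaled far below the 10-12 billion used in the study so the sketch runs quickly, and the fixed seed and task split are illustrative choices:

```python
import random

def throw_darts(n, rng):
    """Count darts landing inside the quarter circle (equivalent, by symmetry,
    to a circle inscribed in a square)."""
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

rng = random.Random(42)
total_darts = 1_000_000        # scaled down from the 10-12 billion in the study
tasks = 100                    # each task throws ~1% of the darts
hits = sum(throw_darts(total_darts // tasks, rng) for _ in range(tasks))
pi_estimate = 4.0 * hits / total_darts
```

On a grid, each task's `throw_darts` call would run on a different client, and the server would sum the returned hit counts.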
   Client Node Failures
    ◦ After each checkpoint, a failure process was run.

    ◦ Each machine had a 2% randomized chance of
      failure.

   Network Load

    ◦ Experiments were performed once with no network
      load other than normal grid communications

    ◦ Experiments were also performed with heavy
      network load, 150% of normal.
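The failure process can be sketched as a Bernoulli trial after each checkpoint. The checkpoint count of 200 per run is an assumption chosen so the expected failure count (200 × 0.02 = 4) lines up with the observed means of roughly 3.9-4.1:

```python
import random

FAILURE_PROBABILITY = 0.02  # 2% chance per checkpoint, as in the experiments

def maybe_fail(rng):
    """Run after each checkpoint: True means this client fails now."""
    return rng.random() < FAILURE_PROBABILITY

# Assumed ~200 checkpoint events per run, giving an expected 4 failures.
rng = random.Random(7)
failures = sum(maybe_fail(rng) for _ in range(200))
```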
   Experiment – Run 100 Grid Applications
    ◦ No network traffic other than grid traffic using a
      centralized shared checkpoint server architecture.
    ◦ No network traffic other than grid traffic using a
      distributed shared checkpoint architecture
     (Neighborhood of Nodes).
    ◦ Heavy network traffic using a centralized
      checkpoint server.  
    ◦ Heavy network traffic using a distributed shared
      checkpoint architecture (Neighborhood of Nodes).
   Data Collection
    ◦ For each of the 100 runs under the four conditions
      Total Execution Time
      Total Node Failures
      Checkpoint Server Failure (if applicable)
      Checkpointing Architecture
      Network Load
Experiments 1 & 2         Checkpoint    Neighborhood of
                          Server        Nodes
Mean Exec Time            1858.72 sec   1820.07 sec

Mean Node Failure         3.87          3.91

Checkpoint Svr Failures   1             N/A

Architecture              Chkpt Srvr    Neighborhood

Network Load              None          None
Experiments 3 & 4         Checkpoint    Neighborhood of
                          Server        Nodes
Mean Exec Time            1898.05 sec   1844.61 sec

Mean Node Failure         4.02          4.09

Checkpoint Svr Failures   2             N/A

Architecture              Chkpt Srvr    Neighborhood

Network Load              Heavy         Heavy
                      Checkpoint Server   Neighborhood
                      No Net Load         No Net Load
Mean Execution Time   1858.72 sec         1820.07 sec
Standard Deviation    164.27              147.07
Sample Size           100                 100


z-Score = 1.753
                      Checkpoint Server   Neighborhood
                      Heavy Net Load      Heavy Net Load
Mean Execution Time   1898.05 sec         1844.61 sec
Standard Deviation    156.53              181.93
Sample Size           100                 100


z-Score = 2.227
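Both reported z-scores can be reproduced from the summary statistics with the standard large-sample two-mean formula:

```python
import math

def z_score(mean1, sd1, mean2, sd2, n=100):
    """Large-sample z for H1: mu1 > mu2 (mu1 = centralized, mu2 = distributed)."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n + sd2 ** 2 / n)

z_isolated = z_score(1858.72, 164.27, 1820.07, 147.07)  # ~1.753 > 1.645
z_loaded = z_score(1898.05, 156.53, 1844.61, 181.93)    # ~2.227 > 1.645
```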
   Network Isolation
    ◦ z-Score of 1.753 is greater than 1.645
   Loaded Network
    ◦ z-Score of 2.227 is greater than 1.645
   At the 95% confidence level, the null
    hypothesis is rejected.
   Therefore, the distributed shared
    checkpointing architecture yields significantly
    better turnaround time than the centralized
    shared checkpointing architecture.
   The Neighborhood of Nodes architecture
    performed better under heavy network load.
    At what level of network load does this
    architecture begin to experience diminishing
    returns?

   Neighborhoods had a maximum of three
    neighbors. What effect will increasing the
    neighborhood have on the performance of
    this architecture?
   Neighborhoods were pre-established through
    static means. What effect will developing a
    dynamic protocol for assigning
    neighborhoods have on overall performance?
Questions? Comments?


bwilson@georgefox.edu
   DIF    Distributed Infrastructure Network
    ◦ A "different" perspective on grid computing
   SINC   Simple Infrastructure for Network Computing
    ◦ Don't let your CPU cycles go down the drain!
   SANCT Simple Architecture for Network Computing Today
    ◦ SANCT-ify your code!
   QUAKER QUick Architecture Kit for Enterprise Research
     ◦ Replace your manpower with a "friend"
   NERD Network Enterprise Research Distribution
    ◦ Get a Nerd on your side
   GEEKI Grid Enhanced Enterprise Kit Infrastructure
    ◦ Professor Wilson said he wanted a "geeky" name
   SINER Simple Infrastructure for Network Enterprise Research
    ◦ Redeem your CPU!
   EDGAR Easy Distributed Grid for Advanced Research
   PyGI Python Grid Infrastructure
    ◦ "This one went to your CPU..."
   SIMPLE Simple Integration of Multiple PCs for Large Enterprises

Distributed Checkpointing on an Enterprise Desktop Grid
