Brent Wilson
              DCS-3
Dissertation Defense
   Brief Introduction of Topic

   Introduction of Research Problem/Question

   Research Methodology

   Experiment Results/Analysis

   Future Research – The next step…
   A client-server architecture that allows for
    volunteer donation of idle CPU cycles from
    non-dedicated computers

   A central scheduler distributes tasks among
    the worker nodes and awaits their results.

   An application only finishes when all tasks
    have been completed

   “volunteer computing”
   Loss of computation
    ◦ Uncompleted tasks must be rescheduled
   Volatility of Resources
    ◦ Network outages
    ◦ Machine crashes
   Interference Failures
    ◦ Machine owners retake control of the machine
   A checkpoint is a process state snapshot

   Clients save checkpoints periodically during
    execution of grid tasks

   Checkpoints are used to restart client
    applications from a previously saved state
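The save/restore cycle above can be sketched in Python (the language EDGAR grid applications use); the file name and state layout here are illustrative assumptions, not EDGAR's actual checkpoint format:

```python
import os
import pickle

CHECKPOINT_FILE = "task.ckpt"  # hypothetical file name

def save_checkpoint(state):
    """Snapshot the task state to disk; write-then-rename keeps the file atomic."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def load_checkpoint():
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(CHECKPOINT_FILE):
        return None
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)

# A restarted client resumes from the snapshot instead of starting over.
state = load_checkpoint() or {"iteration": 0, "partial_result": 0}
```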
   Easy Distributed Grid for Advanced Research
    ◦ Developed at George Fox University, Fall 2008
    ◦ Grid middleware/framework:
      Developed in Java to allow for client heterogeneity
      All grid communications use Java Sockets
      Communication classes are extensible to allow for
       various checkpointing architectures

    ◦ Grid applications:
      Grid applications are developed in Python
      Simple syntax allows for Grid access by the larger scientific community
   The performance of a desktop grid degrades
    when client nodes fail: work lost on failed
    nodes must be redone, increasing execution time.

   To reduce this degradation either:

    ◦ Client nodes must not fail or

     ◦ The impact of client failures on the grid’s
       performance must be minimized
   Can distributing shared checkpoints within a
    neighborhood of nodes, rather than
    centralizing shared checkpoints on a
    checkpoint server, significantly improve
    completion turnaround time for enterprise
    desktop grid applications?
   H0: There is no significant difference in
    turnaround time of applications run on a
    desktop grid using distributed shared
    checkpoints as compared to centralized
    shared checkpoints.
   H1: There is a significant improvement in the
    completion time of desktop grid applications
    utilizing distributed shared checkpoints
    compared to centralized shared checkpoints.
µ1 = Mean execution time of centralized shared checkpoint grid.
µ2 = Mean execution time of distributed shared checkpoint grid.

H0: µ1 ≤ µ2

H1: µ1 > µ2

Large-sample test of two means with unknown standard deviations

Reject H0 when z > 1.645 (α = 0.05)
   Hardware/Network Topology
   Checkpointing Architectures

    ◦ Centralized Checkpointing – Checkpoint Server

      Clients save their checkpoint data both locally and on
       the checkpoint server

      When a client fails, the grid server reassigns the task
       to an available client, which resumes from the last
       checkpoint
   Checkpointing Architectures
    ◦ Distributed Checkpointing – Neighborhood of Nodes

      Neighborhoods are created from geographically close
       nodes

      Clients save their checkpoint data locally and with each
       neighbor

      When a client fails, the first of its neighbors to
       complete its own task picks up the failed task from
       the last checkpoint
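This recovery rule can be sketched as follows; the `Node` class and `recover` helper are hypothetical illustrations (EDGAR's real communication classes are Java), not the framework's actual API:

```python
# Hypothetical sketch of the neighborhood recovery rule: checkpoints are
# replicated to each neighbor, and when a client fails, the first neighbor
# to finish its own task adopts the failed task from the last checkpoint.

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbor_checkpoints = {}  # failed node's name -> replicated state

    def receive_checkpoint(self, owner_name, state):
        """Store a replica of a neighbor's checkpoint."""
        self.neighbor_checkpoints[owner_name] = dict(state)

def recover(failed_name, finish_order):
    """The first finished neighbor holding a replica resumes the failed task."""
    for node in finish_order:
        if failed_name in node.neighbor_checkpoints:
            return node, node.neighbor_checkpoints[failed_name]
    return None, None

# Node "a" replicated its checkpoint to neighbors b and c; c finished first,
# so c adopts a's task from the last checkpoint.
b, c = Node("b"), Node("c")
b.receive_checkpoint("a", {"iteration": 120})
c.receive_checkpoint("a", {"iteration": 120})
adopter, state = recover("a", [c, b])
```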
   The Grid Application
    ◦ A parallel implementation that approximates PI
      using a dart-throwing (Monte Carlo) method.

    ◦ Darts are “thrown” at a circle inscribed in a square.

    ◦ PI = 4.0 * darts in circle / total darts.

    ◦ Grid tasks (darts to be thrown) were randomly
      selected between 1% and 2% of the total darts.

    ◦ Total darts per application were also randomly
      selected between 10 and 12 billion darts.
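The dart-throwing estimator can be sketched as below; the dart counts are scaled far below the 10-12 billion used in the study so the sketch runs quickly, and the fixed seed and task split are illustrative choices:

```python
import random

def throw_darts(n, rng):
    """Count darts landing inside the quarter circle (equivalent, by symmetry,
    to a circle inscribed in a square)."""
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

rng = random.Random(42)
total_darts = 1_000_000        # scaled down from the 10-12 billion in the study
tasks = 100                    # each task throws ~1% of the darts
hits = sum(throw_darts(total_darts // tasks, rng) for _ in range(tasks))
pi_estimate = 4.0 * hits / total_darts
```

On a grid, each task's `throw_darts` call would run on a different client, and the server would sum the returned hit counts.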
   Client Node Failures
    ◦ After each checkpoint, a failure process was run.

    ◦ Each machine had a 2% randomized chance of
      failure.

   Network Load

    ◦ Experiments were performed once with no network
      load other than normal grid communications

    ◦ Experiments were also performed with heavy
      network load, 150% of normal.
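The failure process can be sketched as a Bernoulli trial after each checkpoint. The checkpoint count of 200 per run is an assumption chosen so the expected failure count (200 × 0.02 = 4) lines up with the observed means of roughly 3.9-4.1:

```python
import random

FAILURE_PROBABILITY = 0.02  # 2% chance per checkpoint, as in the experiments

def maybe_fail(rng):
    """Run after each checkpoint: True means this client fails now."""
    return rng.random() < FAILURE_PROBABILITY

# Assumed ~200 checkpoint events per run, giving an expected 4 failures.
rng = random.Random(7)
failures = sum(maybe_fail(rng) for _ in range(200))
```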
   Experiment – Run 100 Grid Applications
    ◦ No network traffic other than grid traffic using a
      centralized shared checkpoint server architecture.
    ◦ No network traffic other than grid traffic using a
      distributed shared checkpoint architecture
     (Neighborhood of Nodes).
    ◦ Heavy network traffic using a centralized
      checkpoint server.  
    ◦ Heavy network traffic using a distributed shared
      checkpoint architecture (Neighborhood of Nodes).
   Data Collection
    ◦ For each of the 100 runs under the four conditions
      Total Execution Time
      Total Node Failures
      Checkpoint Server Failure (if applicable)
      Checkpointing Architecture
      Network Load
Experiments 1 & 2         Checkpoint    Neighborhood of
                          Server        Nodes
Mean Exec Time            1858.72 sec   1820.07 sec

Mean Node Failure         3.87          3.91

Checkpoint Svr Failures   1             N/A

Architecture              Chkpt Srvr    Neighborhood

Network Load              None          None
Experiments 3 & 4         Checkpoint    Neighborhood of
                          Server        Nodes
Mean Exec Time            1898.05 sec   1844.61 sec

Mean Node Failure         4.02          4.09

Checkpoint Svr Failures   2             N/A

Architecture              Chkpt Srvr    Neighborhood

Network Load              Heavy         Heavy
                      Checkpoint Server   Neighborhood
                      No Net Load         No Net Load
Mean Execution Time   1858.72 sec         1820.07 sec
Standard Deviation    164.27              147.07
Sample Size           100                 100


z-Score = 1.753
                      Checkpoint Server   Neighborhood
                      Heavy Net Load      Heavy Net Load
Mean Execution Time   1898.05 sec         1844.61 sec
Standard Deviation    156.53              181.93
Sample Size           100                 100


z-Score = 2.227
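Both reported z-scores can be reproduced from the summary statistics with the standard large-sample two-mean formula:

```python
import math

def z_score(mean1, sd1, mean2, sd2, n=100):
    """Large-sample z for H1: mu1 > mu2 (mu1 = centralized, mu2 = distributed)."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n + sd2 ** 2 / n)

z_isolated = z_score(1858.72, 164.27, 1820.07, 147.07)  # ~1.753 > 1.645
z_loaded = z_score(1898.05, 156.53, 1844.61, 181.93)    # ~2.227 > 1.645
```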
   Network Isolation
    ◦ z-Score of 1.753 is greater than 1.645
   Loaded Network
    ◦ z-Score of 2.227 is greater than 1.645
   At the 95% confidence level, the null
    hypothesis is rejected.
   Therefore, the distributed shared
    checkpointing architecture yields significantly
    better turnaround time than the centralized
    shared checkpointing architecture.
   The Neighborhood of Nodes architecture
    performed better under heavy network load.
    At what level of network load does this
    architecture begin to experience diminishing
    returns?

   Neighborhoods had a maximum of three
    neighbors. What effect will increasing the
    neighborhood have on the performance of
    this architecture?
   Neighborhoods were pre-established through
    static means. What effect will developing a
    dynamic protocol for assigning
    neighborhoods have on overall performance?
Questions? Comments?


bwilson@georgefox.edu
   DIF    Distributed Infrastructure Network
    ◦ A "different" perspective on grid computing
   SINC   Simple Infrastructure for Network Computing
    ◦ Don't let your CPU cycles go down the drain!
   SANCT Simple Architecture for Network Computing Today
    ◦ SANCT-ify your code!
   QUAKER QUick Architecture Kit for Enterprise Research
     ◦ Replace your manpower with a "friend"
   NERD Network Enterprise Research Distribution
    ◦ Get a Nerd on your side
   GEEKI Grid Enhanced Enterprise Kit Infrastructure
    ◦ Professor Wilson said he wanted a "geeky" name
   SINER Simple Infrastructure for Network Enterprise Research
    ◦ Redeem your CPU!
   EDGAR Easy Distributed Grid for Advanced Research
   PyGI Python Grid Infrastructure
    ◦ "This one went to your CPU..."
   SIMPLE Simple Integration of Multiple PCs for Large Enterprises

Distributed Checkpointing on an Enterprise Desktop Grid
