A comparative analysis of minimum process coordinated checkpointing


Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

A comparative analysis of minimum process coordinated checkpointing

  1. 1. International Journal of Computer Engineering (IJCET), ISSN 0976 – 6367(Print), International Journal of Computer Engineering and Technology ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEMEand Technology (IJCET), ISSN 0976 – 6367(Print)ISSN 0976 – 6375(Online) Volume 1 IJCETNumber 1, May - June (2010), pp. 46-56 ©IAEME© IAEME, http://www.iaeme.com/ijcet.html A COMPARATIVE ANALYSIS OF MINIMUM-PROCESS COORDINATED CHECKPOINTING ALGORITHMS FOR MOBILE DISTRIBUTED SYSTEMS Preeti gupta Singhania University Pacheri Bari, Rajasthan Parveen kumar Meerut Institute of Engineering & Technology Meerut, Email:pk223475@yahoo.com Anil kumar solanki Meerut Institute of Engineering & Technology Meerut, India ABSTRACT Fault Tolerance Techniques enable systems to perform tasks in the presence of faults. A checkpoint is a local state of a process saved on stable storage. While dealing with Mobile Distributed systems, we come across some issues like: mobility, low bandwidth of wireless channels and lack of stable storage on mobile nodes, disconnections, limited battery power and high failure rate of mobile nodes. These issues make traditional check pointing techniques designed for Distributed systems unsuitable for Mobile environments. To take a checkpoint, an MH has to transfer a large amount of checkpoint data to its local MSS over the wireless network. Since the wireless network has low bandwidth and MHs have low computation power, all-process check pointing will waste the scarce resources of the mobile system on every checkpoint. Minimum-process coordinated check pointing is a preferred approach for mobile distributed systems. In this paper, we discuss various existing minimum-process check pointing protocols for mobile distributed systems. Keywords: Check pointing algorithms; parallel & distributed computing; rollback recovery; fault-tolerant systems 46
  2. 2. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEME1. INTRODUCTION Parallel computing with clusters of workstations is being used extensively as theyare cost-effective and scalable, and are able to meet the demands of high performancecomputing. Increase in the number of components in such systems increases the failureprobability. It is, thus, necessary to examine both hardware and software solutions toensure fault tolerance of such parallel computers. To provide fault tolerance, it isessential to understand the nature of the faults that occur in these systems. There aremainly two kinds of faults: permanent and transient. Permanent faults are caused bypermanent damage to one or more components and transient faults are caused by changesin environmental conditions. Permanent faults can be rectified by repair or replacementof components. Transient faults remain for a short duration of time and are difficult todetect and deal with. Hence it is necessary to provide fault tolerance particularly fortransient failures in parallel computers. Fault-tolerant techniques enable a system toperform tasks in the presence of faults. It is easier and more cost effective to providesoftware fault tolerance solutions than hardware solutions to cope with transient failures[1, 2]. Local checkpoint is the saved state of a process at a processor at a given instance.Global checkpoint is a collection of local checkpoints, one from each process. A globalstate is said to be “consistent” if it contains no orphan message; i.e., a message whosereceive event is recorded, but its send event is lost. A transit message is a message whosesend event has been recorded by the sending process but whose receive event has notbeen recorded by the receiving process [1, 7]. The problem of taking a checkpoint in a message passing distributed system isquite complex because any arbitrary set of checkpoints cannot be used for recovery [9].This is due to the fact that the set of checkpoints used for recovery must form a consistentglobal state. Checkpointing is classified into following categories: • Asynchronous/Uncoordinated Checkpointing • Synchronous/Coordinated Checkpointing • Quasi-Synchronous or Communication-induced Checkpointing 47
  3. 3. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEME • Message Logging based Checkpointing The problem of taking a checkpoint in a message passing distributed system isquite complex because any arbitrary set of checkpoints cannot be used for recovery.This is due to the fact that the set of checkpoints used for recovery must form a consistentglobal state.1.1 Asynchronous/ Checkpointing Under the asynchronous approach, checkpoints at each process are takenindependently without any synchronization among the processes. Because of absence ofsynchronization, there is no guarantee that a set of local checkpoints taken will be aconsistent set of checkpoints. Thus, a recovery algorithm has to search for the most recentconsistent set of checkpoints before it can initiate recovery [8], [9].1.2 Synchronous / Coordinated Checkpointing In coordinated or synchronous checkpointing, processes coordinate their localcheckpointing actions such that the set of all recent checkpoints in the system isguaranteed to be consistent [add reference list……]. In case of a fault, every processrestarts from its most recent permanent/committed checkpoint. Hence, this approachsimplifies recovery and it does not suffer from domino-effect. Furthermore, coordinatedcheckpointing requires each process to maintain only one permanent checkpoint on stablestorage, reducing storage overhead and eliminating the need for garbage collection. Itsmain disadvantage is the large latency involved in output commits [15]. A straightforward approach to coordinate checkpointing is to blockcommunications while the checkpointing process executes. A coordinator takes acheckpoint and broadcasts a request message to all processes, asking them to take acheckpoint. When a process receives a message, it stops its execution, flushes all thecommunication channels, takes a tentative checkpoint, and sends an acknowledgementmessage back to the coordinator. After the coordinator receives acknowledgement fromall processes, it broadcasts a commit message that completes the two phase checkpointingprotocol. After receiving the commit message, each process receives the old permanentcheckpoint and makes the tentative checkpoint permanent. The process is then free toresume execution and exchange messages with other processes. The coordinated 48
  4. 4. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEMEcheckpointing algorithms can also be classified into following two categories: minimum-process and all process algorithms. In all-process coordinated checkpointing algorithms,every process is required to take its checkpoint in an initiation [7], [8]. In minimum-process algorithms, minimum interacting processes are required to take their checkpointsin an initiation. The coordinated checkpointing protocols can be classified into two types:blocking and non-blocking. In blocking algorithms, as mentioned above, some blockingof processes takes place during checkpointing [4]. In non-blocking algorithms, noblocking of processes is required for checkpointing [5]. In a centralized algorithm like Chandy-lamport [7], there is one node whichalways initiates the checkpoints and coordinates the participating nodes. Thedisadvantage of a centralized algorithm is that all nodes have to initiate checkpointswhenever the centralized node decides to checkpoint. Nodes can be given autonomy ininitiating checkpoints by allowing any node in the system to initiate checkpoints.1. 3 Mobile Distributed Computing System A mobile computing system consists of a large number of MHs and relativelyfewer MSSs. The distributed computation we consider consists of n spatially separatedsequential processes denoted by P0, P1, ..., Pn-1, running on fail-stop MHs or on MSSs.Each MH or MSS has one process running on it. The static network provides reliable,sequenced delivery of messages between any two MSSs, with arbitrary message latency.The wireless network within a cell ensures FIFO delivery of messages between an MSSand a local MH, i.e., there exists a FIFO channel from an MH to its local MSS, andanother FIFO channel from the MSS to the MH. If an MH did not leave the cell, thenevery message sent to it from the local MSS would be received in the sequence in whichthey are sent [2]. To send a message from an MH h1 to another MH h2, h1 first sends the message toits local MSS over the wireless network. This MSS then forwards the message to thelocal MSS of h2 which forwards it to h2 over its local wireless network. Since the locationof an MH within the network is neither fixed nor universally known to the network,because, an MH switches cells frequently. The local MSS of h1 needs to first determine 49
  5. 5. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEMEthe MSS that currently serves h2. This is essentially the problem that has been tackledthrough a variety of routing protocols at the network layer. Hence, the cost incurred forrouting and delivering a message to an MH, varies with the specific routing protocolbeing used. Here, we have assumed that any message destined for an MH incurs a fixedsearch cost [2].1.4 Checkpointing Issues in Mobile Distributed Computing Systems The existence of mobile nodes in a distributed system introduces new issues thatneed proper handling while designing a checkpointing algorithm for such systems. Theseissues are mobility, disconnections, finite power source, vulnerable to physical damage,lack of stable storage etc. [2]. The location of an MH within the network, as representedby its current local MSS, changes with time. Checkpointing schemes that send controlmessages to MH’s, will need to first locate the MH within the network, and thereby incura search overhead [2]. Due to vulnerability of mobile computers to catastrophic failures,disk storage of an MH is not acceptably stable for storing message logs or localcheckpoints. Checkpointing schemes must therefore, rely on an alternative stablerepository for an MH’s local checkpoint [2]. Disconnections of one or more MHs shouldnot prevent recording the global state of an application executing on MHs. It should benoted that disconnection of an MH is a voluntary operation, and frequent disconnectionsof MHs is an expected feature of the mobile computing environments [2]. The battery atthe MH has limited life. To save energy, the MH can power down individual componentsduring periods of low activity [2]. This strategy is referred to as the doze mode operation.An MH in doze mode is awakened on receiving a message. Therefore, energyconservation and low bandwidth constraints require the checkpointing algorithms tominimize the number of synchronization messages and the number of checkpoints.2 SOME MINIMUM-PROCESS COORDINATED CHECKPOINTINGPROTOCOLS FOR MOBILE DISTRIBUTED SYSTEMS2.1 GUOHONG CAO AND MUKESH SINGHAL ALGORITHM [5] Cao and Singhal achieved non-intrusiveness in the minimum-process algorithmby introducing the concept of mutable checkpoints. In their algorithm, initiator, say Pin,sends the checkpoint request to any process, say Pj, only if Pin receives m from Pj in the 50
  6. 6. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEMEcurrent CI. Pj takes its tentative checkpoint if Pj has sent m to Pin in the current CI;otherwise, Pj concludes that the checkpoint request is a useless one. Similarly, when Pjtakes its tentative checkpoint, it propagates the checkpoint request to other processes.This process is continued till the checkpoint request reaches all the processes on whichthe initiator transitively depends and a checkpointing tree is formed. Duringcheckpointing, if Pi receives m from Pj such that Pj has taken some checkpoint in thecurrent initiation before sending m, Pi may be forced to take a checkpoint, called mutablecheckpoint. If Pi is not in the minimum set, its mutable checkpoint is useless and isdiscarded on commit. The huge data structure MR[] is also attached with the checkpointrequests to reduce the number of useless checkpoint requests. The response from eachprocess is sent directly to initiator.2.2 KUMAR AND KUMAR ALGORITHM [14] They proposed an algorithm which is based on keeping track of directdependencies of processes. Initiator MSS collects the direct dependency vectors of allprocesses, computes the tentative minimum set (minimum set or its subset), and sends thecheckpoint request along with the tentative minimum set to all MSSs. This step is takento reduce the time to collect the coordinated checkpoint. It will also reduce the number ofuseless checkpoints and the blocking of the processes. Suppose, during the execution ofthe checkpointing algorithm, Pi takes its checkpoint and sends m to Pj. Pj receives msuch that it has not taken its checkpoint for the current initiation and it does not knowwhether it will get the checkpoint request. If Pj takes its checkpoint after processing m, mwill become orphan. In order to avoid such orphan messages, they propose the followingtechnique. If Pj has sent at least one message to a process, say Pk and Pk is in thetentative minimum set, there is a good probability that Pj will get the checkpoint request.Therefore, Pj takes its induced checkpoint before processing m. An induced checkpoint issimilar to the mutable checkpoint [18]. In this case, most probably, Pj will get thecheckpoint request and its induced checkpoint will be converted into permanent one.There is a less probability that Pj will not get the checkpoint request and its inducedcheckpoint will be discarded. Alternatively, if there is not a good probability that Pj willget the checkpoint request, Pj buffers m till it takes its checkpoint or receives the commitmessage. They have tried to minimise the number of useless checkpoints and blocking of 51
  7. 7. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEMEthe process by using the probabilistic approach and buffering selective messages at thereceiver end. Exact dependencies among processes are maintained. It abolishes theuseless checkpoint requests and reduces the number of duplicate checkpoint requests.2.3 SILVA AND SILVA ALGORITHM [15] The authors proposed all process coordinated checkpointing protocol fordistributed systems. The non-intrusiveness during checkpointing is achieved bypiggybacking monotonically increasing checkpoint number along with computationalmessage. When a process receives a computational message with the high checkpointnumber, it takes its checkpoint before processing the message. When it actually gets thecheckpoint request from the initiator, it ignores the same. If each process of thedistributed program is allowed to initiate the checkpoint operation, the network may beflooded with control messages and process might waste their time making unnecessarycheckpoints. In order to avoid this, Silva and Silva give the key to initiate checkpointalgorithm to one process. The checkpoint event is triggered periodically by a local timermechanism. When this timer expires, the initiator process the checkpoint state of processrunning in the machine and force all the others to take checkpoint by sending a broadcastmessage. The interval between adjacent checkpoints is called checkpoint interval.2.4 KOO-TOUEG’S ALGORITHM [10] They proposed a minimum process blocking checkpointing algorithm fordistributed systems. The algorithm consists of two phases. During the first phase, thecheckpoint initiator identifies all processes with which it has communicated since the lastcheckpoint and sends them a request. Upon receiving the request, each process in turnidentifies all process it has communicated with since the last checkpoint and sends them arequest, and so on, until no more processes can be identified. During the second phase, allprocess identified in the first phase take a checkpoint. The result is a consistentcheckpoint that involves only the participating processes. In this protocol, after a processtakes a checkpoint, it cannot send any message until the second phase terminatessuccessfully, although receiving messages after the checkpoint is permissible.2.5 COA AND SINGHAL BLOCKING ALGORITHM [4] They presented a minimum process checkpointing algorithm in which, thedependency information is recorded by a Boolean vector. This algorithm is a two-phase 52
  8. 8. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEMEprotocol and saves two kinds of checkpoint on the stable storage. In the first phase, theinitiator sends a request to all processes to send their dependency vector. On receiving therequest, each process sends its dependency vector. Having received all the dependencyvectors, the initiator constructs an N*N dependency matrix with one row per process,represented by the dependency vector of the process, Based on the dependency matrix,the initiator can locally calculate all the process on which the initiator transitively depend.After the initiator finds all the process that need to take their checkpoints, it adds them tothe set Sforced and asks them to take checkpoints. Any process receiving a checkpointrequest takes the checkpoint and sends a reply. The process has to be blocked afterreceiving the dependency vectors request and resumes its computation after receiving acheckpoint request.2.6 PRAKASH-SINGHAL ALGORITHM [16] It forces only a minimum number of processes to take checkpoints and does notblock the underlying computation during checkpointing. Prakash Singhal algorithmforces only part of processes to take checkpoints, the csn of some processes may be out-of-date, and may not be able to avoid inconsistencies [8]. Prakash-Singhal algorithmattempts to solve this problem by having each process maintains an array to save the csn,where csni[i] has been the expected csn of Pi. Note that Pi’s csnj[i] may be different fromPjs csnj [i] if there is no communication between Pi and Pj for several checkpointintervals. By using csn and the initiator identification number, they claim that their non-blocking algorithm can avoid inconsistencies and minimize the number of checkpointsduring checkpointing.2.7. P. KUMAR’S HYBRID ALGORITHM [13] In minimum-process coordinated checkpointing, some processes may notcheckpoint for several checkpoint initiations. In the case of a recovery after a fault, suchprocesses may rollback to far earlier checkpointed state and thus may cause greater lossof computation. In all-process coordinated checkpointing, the recovery line is advancedfor all processes but the checkpointing overhead may be exceedingly high. To optimizeboth matrices, the checkpointing overhead and the loss of computation on recovery,P.Kumar proposed a hybrid checkpointing algorithm, wherein an all-process coordinatedcheckpoint is taken after the execution of minimum-process coordinated checkpointing 53
  9. 9. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEMEalgorithm for a fixed number of times. Thus, the Mobile nodes with low activity or indoze mode operation may not be disturbed in the case of minimum-process checkpointingand the recovery line is advanced for each process after an all-process checkpoint.Additionally, he tried to minimize the information piggybacked onto each computationmessage. For minimum-process checkpointing, he designed a blocking algorithm, whereno useless checkpoints are taken and an effort has been made to optimize the blocking ofprocesses. He proposed to delay selective messages at the receiver end. By doing so,processes are allowed to perform their normal computation, send messages and partiallyreceive them during their blocking period. The proposed minimum-process blockingalgorithm forces zero useless checkpoints at the cost of very small blocking.3. CONCLUSION The existence of mobile nodes in a distributed system introduces new issues thatneed proper handling while designing a checkpointing algorithm for such systems. Theseissues are mobility, disconnections, finite power source, vulnerable to physical damage,lack of stable storage, etc. The new issues make traditional checkpointing techniquesunsuitable to checkpoint mobile distributed systems. A good checkpointing protocol formobile distributed systems should have low memory overheads on MHs, low overheadson wireless channels and should avoid awakening of an MH in doze mode operation. Thedisconnection of an MH should not lead to infinite wait state. The algorithm should benon-intrusive and should force minimum number of processes to take their localcheckpoints. Minimum-process coordinated checkpointing is a suitable approach to introducefault tolerance in mobile distributed systems transparently. This approach is domino-free,requires at most two checkpoints of a process on stable storage, and forces only aminimum number of processes to checkpoint. It may require blocking of processes, extrasynchronization messages, piggybacking of some information along with computationmessages, or taking some useless checkpoints. 54
  10. 10. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEME In this paper we have given introductory concepts related to checkpointing inmobile distributed systems. We also gave an analysis of various minimum-processcheckpointing schemes especially designed for mobile distributed systems.REFERENCES[1] Singhal, M, N G Shivratri, “Advanced concepts in Operating Systems, Tata Mc Graw Hill, 1994.[2] Acharya A., “Structuring Distributed Algorithms and Services for networks with Mobile Hosts”, Ph.D. Thesis, Rutgers University, 1995.[3] Cao G. and Singhal M., “On coordinated checkpointing in Distributed Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 9, no.12, pp. 1213-1225, Dec 1998.[4] Cao G. and Singhal M., “On the Impossibility of Min-process Non-blocking Checkpointing and an Efficient Checkpointing Algorithm for Mobile Computing Systems,” Proceedings of International Conference on Parallel Processing, pp. 37- 44, August 1998.[5] Cao G. and Singhal M., “Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing systems,” IEEE Transaction On Parallel and Distributed Systems, vol. 12, no. 2, pp. 157-172, February 2001.[6] Cao G. and Singhal M., “Checkpointing with Mutable Checkpoints”, Theoretical Computer Science, 290(2003), pp. 1127-1148.[7] Chandy K. M. and Lamport L., “Distributed Snapshots: Determining Global State of Distributed Systems,” ACM Transaction on Computing Systems, vol. 3, No. 1, pp. 63- 75, February 1985.[8] Elnozahy E.N., Alvisi L., Wang Y.M. and Johnson D.B., “A Survey of Rollback- Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.[9] Kalaiselvi, S., Rajaraman, V., “A Survey of Checkpointing Algorithms for Parallel and Distributed Systems”, Sadhna, Vol. 25, Part 5, October 2000, pp. 489-510. 55
  11. 11. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print),ISSN 0976 – 6375(Online) Volume 1, Number 1, May - June (2010), © IAEME[10] Koo R. and Toueg S., “Checkpointing and Roll-Back Recovery for Distributed Systems,” IEEE Trans. on Software Engineering, vol. 13, no. 1, pp. 23-31, January 1987.[11] Parveen Kumar, Lalit Kumar, R K Chauhan, V K Gupta “A Non-Intrusive Minimum Process Synchronous Checkpointing Protocol for Mobile Distributed Systems” Proceedings of IEEE ICPWC-2005, January 2005.[12] Pushpendra Singh, Gilbert Cabillic, “A Checkpointing Algorithm for Mobile Computing Environment”, LNCS, No. 2775, pp 65-74, 2003.[13] Parveen Kumar, “A Low-Cost Hybrid Coordinated Checkpointing Protocol for Mobile Distributed Systems”, Mobile Information Systems [An International Journal from IOS Press, Netherlands] pp 13-32, Vol. 4, No. 1, 2007.[14] Lalit Kumar, Parveen Kumar “A Synchronous Checkpointing Protocol for Mobile Distributed Systems: A Probabilistic Approach”, International Journal of Information and Computer Security [An International Journal from Inderscience Publishers, USA], pp 298-314, Vol. 3 No. 1, 2007.[15] Silva L, Silva J 1992 Global checkpointing for distributed programs. Proc. IEEE 11th Symp. On Reliable Distributed Syst. pp 155-162.[16] R. Prakash and M. Singhal. “Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems”. IEEE Trans. on Parallel and Distributed System, pages 1035-1048,Oct. 1996. 56