Paradigms in Fault Tolerant Checkpointing Protocols in Distributed Mobile Systems
Abstract
• Distributed mobile systems are ubiquitous nowadays.
• Distributed mobile systems are not fault tolerant; they introduce new challenges in the area of fault-tolerant computing.
• Mobile computing raises many issues, such as low throughput and high latency, low bandwidth of wireless channels, lack of stable storage on mobile hosts, connection breakdowns, and inadequate battery life.
• This paper surveys the algorithms that restore the system to a consistent state after a failure.
• Various techniques and algorithms have been devised in this regard. One commonly applied solution to these failures is the Checkpoint/Restart scheme.
• The problem with this technique is that it rolls back all the processors to an earlier state, even if only a single processor crashes.
• The idea behind most fault tolerance protocols is to roll back only the crashed processor instead of rolling back all the processors.
• In such cases, processors that do not depend on the results of the crashed processor can continue their tasks without waiting.
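The difference between rolling back every processor and rolling back only the affected ones can be sketched as follows. This is a minimal illustration under stated assumptions, not a protocol from the surveyed literature: the `Process` class, the `depends_on` sets, and the `recover` function are hypothetical names, dependencies are assumed to be known in advance, and a process that used the crashed process's results is assumed to roll back with it.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """A process with in-memory state and one saved checkpoint (hypothetical model)."""
    name: str
    state: int = 0
    checkpoint: int = 0
    depends_on: set = field(default_factory=set)  # processes whose results it used

    def take_checkpoint(self):
        self.checkpoint = self.state

    def rollback(self):
        self.state = self.checkpoint


def affected_by(crashed, processes):
    """Return the crashed process plus every process that transitively depends on it."""
    affected = {crashed}
    changed = True
    while changed:
        changed = False
        for p in processes.values():
            if p.name not in affected and p.depends_on & affected:
                affected.add(p.name)
                changed = True
    return affected


def recover(crashed, processes):
    """Roll back only the affected processes; the rest keep their progress."""
    rolled_back = affected_by(crashed, processes)
    for name in rolled_back:
        processes[name].rollback()
    return rolled_back


procs = {
    "P1": Process("P1"),
    "P2": Process("P2", depends_on={"P1"}),
    "P3": Process("P3"),  # independent of P1
}
for p in procs.values():
    p.state = 10
    p.take_checkpoint()
    p.state = 25  # work done after the checkpoint

rolled_back = recover("P1", procs)
```

Here `P1` and `P2` return to their checkpoints, while the independent `P3` keeps the work it did after its checkpoint. A plain Checkpoint/Restart scheme would instead call `rollback()` on all three.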
• A “distributed transaction” is a group of several sub-transactions, each running and updating data on a different computer system.
• Each participating system has a local “transaction manager” whose purpose is to enlist, prepare, commit, and abort the calls made by the distributed transactions.
• Before any distributed transaction takes effect, each participating transaction manager must agree to commit the action, for example an update.
Failure Models in Mobile Distributed Systems
1) Timing faults - occur when a module does not complete its service in time;
2) Omission faults - occur when a module completely fails to perform its service;
3) Crash faults - occur when a module either stops operating completely or never returns to a valid state;
4) Byzantine faults - faults that are arbitrary (random) in nature.
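To make the four classes concrete, the sketch below maps a single observation of a module to one of them. The rules, the `Fault` enum, and the `classify` function with its parameters are illustrative assumptions only. In a real system these classes often cannot be told apart from one observation (a crashed node and a node that omits a reply look the same from outside), and a Byzantine fault is not generally detectable by a simple validity check.

```python
from enum import Enum
from typing import Optional


class Fault(Enum):
    NONE = "none"
    TIMING = "timing"
    OMISSION = "omission"
    CRASH = "crash"
    BYZANTINE = "byzantine"


def classify(alive: bool, response_time: Optional[float], deadline: float,
             result_valid: bool = True) -> Fault:
    """Map one observation of a module to a fault class (illustrative rules only)."""
    if not alive:
        return Fault.CRASH       # the module stopped operating
    if response_time is None:
        return Fault.OMISSION    # the service was never delivered
    if not result_valid:
        return Fault.BYZANTINE   # the module answered with an arbitrary result
    if response_time > deadline:
        return Fault.TIMING      # the service was delivered, but too late
    return Fault.NONE


late_reply = classify(alive=True, response_time=2.0, deadline=1.0)
no_reply = classify(alive=True, response_time=None, deadline=1.0)
```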
FAULT TOLERANCE PROTOCOLS
The two-phase commit (2PC) protocol: a distributed algorithm that ensures the reliable termination of a transaction in a distributed environment.
Phase-I Protocol for the coordinator:
Start
i) Send transaction to the participating nodes.
ii) Wait for signal (YES/NO) from all participating nodes.
Stop
Phase-I Protocol for the participating nodes:
Start
i) Receive transaction from the coordinator.
ii) Do local processing.
iii) Send signal (YES/NO) to the coordinator node.
Stop
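The Phase-I steps above can be sketched in a single process, with method calls standing in for network messages. The `Participant` class, its `can_commit` flag, and the `phase_one` function are hypothetical names chosen for illustration. Timeouts and lost messages are deliberately left out.

```python
class Participant:
    """A participating node (hypothetical in-process stand-in for a remote node)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumed outcome of local processing
        self.pending = None

    def prepare(self, transaction):
        # i) receive the transaction, ii) do local processing
        self.pending = transaction
        # iii) send the YES/NO signal back to the coordinator
        return "YES" if self.can_commit else "NO"


def phase_one(transaction, participants):
    """Coordinator side: send the transaction and wait for every vote."""
    return {p.name: p.prepare(transaction) for p in participants}


nodes = [Participant("A"), Participant("B"), Participant("C", can_commit=False)]
votes = phase_one({"op": "update", "key": "x", "value": 1}, nodes)
```

The coordinator ends Phase I holding one vote per node. In this run node `C` votes NO, so the transaction cannot be committed.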
Decision-making phase (YES)
Phase-II Agreement Protocol for the coordinator:
Start
i) Send commit signal to the participating nodes.
ii) Receive acknowledgment from all participating nodes.
iii) Commit or complete the transaction.
Stop
Phase-II Agreement Protocol for the participating nodes:
Start
i) Receive commit signal from the coordinator.
ii) Commit the transaction.
iii) Release the resources.
iv) Send acknowledgment to the coordinator node.
Stop
In case of (NO)
Phase-II Failure Protocol for the coordinator:
Start
i) Send switchback signal to the participating nodes.
ii) Receive acknowledgment from all participating nodes.
iii) Undo transaction.
Stop
Phase-II Failure Protocol for the participating nodes:
Start
i) Receive switchback signal from the coordinator.
ii) Undo transaction.
iii) Release the resources.
iv) Send acknowledgment to the coordinator node.
Stop
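Both Phase-II outcomes can be sketched together: the coordinator commits only if every Phase-I vote was YES, and otherwise sends the switchback signal. As before, this is a minimal single-process sketch. The `Node` class, its `vote` field, and the `phase_two` function are hypothetical names, and node or coordinator crashes during Phase II are not handled.

```python
class Node:
    """A participating node holding a tentative change (hypothetical model)."""

    def __init__(self, name, vote):
        self.name = name
        self.vote = vote              # "YES" or "NO", as sent in Phase I
        self.state = "PREPARED"
        self.holds_resources = True

    def on_commit(self):
        self.state = "COMMITTED"      # ii) commit the transaction
        self.holds_resources = False  # iii) release the resources
        return "ACK"                  # iv) acknowledge to the coordinator

    def on_switchback(self):
        self.state = "UNDONE"         # ii) undo the transaction
        self.holds_resources = False  # iii) release the resources
        return "ACK"                  # iv) acknowledge to the coordinator


def phase_two(nodes):
    """Coordinator side: commit only if every Phase-I vote was YES."""
    commit = all(n.vote == "YES" for n in nodes)
    # i) send the commit or switchback signal, ii) collect acknowledgments
    acks = [n.on_commit() if commit else n.on_switchback() for n in nodes]
    if any(a != "ACK" for a in acks):
        raise RuntimeError("missing acknowledgment")
    # iii) complete or undo the transaction
    return "COMMITTED" if commit else "UNDONE"


ok = phase_two([Node("A", "YES"), Node("B", "YES")])
failed_nodes = [Node("A", "YES"), Node("B", "NO")]
failed = phase_two(failed_nodes)
```

A single NO vote is enough to undo the transaction on every node, including those that voted YES, which is what keeps the participants in a consistent state.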
Conclusion
• Reliability in mobile distributed systems can be restored using the techniques described above.
• However, new challenges will continue to arise, so these protocols are not yet fully suited to such systems.
• Further protocols can be developed to add reliability to such systems.
• This paper provides a further step toward restoring the system to a consistent state even in the presence of a failure.