ISSN: 2278 – 1323                                          International Journal of Advanced Research in Computer Engineer...
ISSN: 2278 – 1323                                         International Journal of Advanced Research in Computer Engineeri...
ISSN: 2278 – 1323                                            International Journal of Advanced Research in Computer Engine...
ISSN: 2278 – 1323                                             International Journal of Advanced Research in Computer Engin...
ISSN: 2278 – 1323                                           International Journal of Advanced Research in Computer Enginee...
ISSN: 2278 – 1323                                          International Journal of Advanced Research in Computer Engineer...
Upcoming SlideShare
Loading in …5

347 352


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

347 352

  1. 1. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 Fault Tolerant Environment in web crawler Using Hardware Failure Detection Anup Garje *, Prof. Bhavesh Patel**, Dr. B. B. Mesharm***ABSTRACT These areas require that their downtime is negligible. The deployment of distributed systems in theseFault Tolerant Environment is a complete programming areas has put extra demand on their reliabilityandenvironment for the reliable execution of distributed availaability. In a distributed system that is running numberapplication programs. Fault Tolerant Distributed applications it is important to provide fault-tolerance, toEnvironment encompasses all aspects of modern fault- avoid the waste of computations accomplished on the wholetolerant distributed computing. The built-in user- distributed system when one of its nodes fails to ensuretransparent error detection mechanism covers processornode crashes and hardware transient failures. The failure transparency. Consistency is also one of the measuremechanism also integrates user-assisted error checks requirements. In on-line and mission-critical systems a faultinto the system failure model. The nucleus non-blocking in the system operation can disrupt control systems, hurtcheckpointing mechanism combined with a novel low sales, or endanger human lives. Distributed environmentsoverhead roll forward recovery scheme delivers an running such applications must be highlyefficient, low-overload backup and recovery mechanism available, i.e. they should continue to provide stable andfor distributed processes. Fault Tolerant Distributed accurate services despite faults in the underlying hardware.Environment also provides a means of remote automaticprocess allocation on distributed system nodes. In case Fault tolerant environment was developed toof recovery is not possible, we can use new harness the computational power of interconnectedmicrorebooting approach to store the system to stablestate. workstations to deliver reliable distributed computing services in the presence of hardware faults affecting individual nodes in a distributed system. To achieve the1. INTRODUCTION stated objective, Fault tolerant environment has to support Though cloud computing is rapidly developing the autonomic distribution or the application processes andfield, it is generally accepted that distributed systems provide means for user-transparent fault-tolerance in arepresent the backbone of today’s computing world. One of multi-node environment[10].their obvious benefits is that distributed systems possess theability to solve complex computational problems requiring Addressing the reliability issues of distributedlarge computational by dividing them into smaller systems involves tackling two problems: error detection andproblems. Distributed systems help to exploit parallelism to process recovery. Error detection is concerned withspeed-up execution of computation-hungry applications permanent and transient computer hardware faults as wellsuch as neural-network training or various system as faults in the application software and the communicationmodeling. Another benefit of distributed systems is that links. The recovery of failed distributed applicationsthey reflect the global business and social environment in requires recovering the local execution state of thewhich we live and work. The implementation of electronic processes, as well as taking into consideration the statecommerce, flight reservation systems, satellite surveillance of the communication channels between them at the time ofsystems or real-time telemetry systems is unthinkable failure[10].without the services or intra and global distributed systems. 2 RELATED WORK Fault-tolerance methods for distributed systems*(Dep artm ent of Com p u ter Technology, Veetm ata Jijabai have developed in two streams: checkpointing/rollbackTechnilogical Institu te,Matu nga, Mu m bai.Ind ia(anu p g.007@gm ) recovery and process-replication mechanisms.**(Dep artm ent of Com p u ter Technology, Veetm ata Jijabai Process replication techniques have been widelyTechnilogical Institu te,Matu nga, Mu m bai. Ind ia(bh_p studied by many researchers. In this technique, required processes are replicated and executed on different***(Head of Dept. of Computer Technology, Veermata Jijabai machines. The assumption is made that all replicas of sameTechnological Institute,Matunga Mumbai. Ind ia( process will not fail at the same point of time and an unfailed replica can be used to recover other replicas. Although these techniques incur a smaller degradation in performance when compared to checkpointing mechanisms, 347 All Rights Reserved © 2012 IJARCET
  2. 2. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012they are not overhead-free. Updating of one replica requires of the domino phenomenon in a distributed computationthat the other replicas must be updated to maintain while at the same time will offer a recovery mechanism thatconsistency. However, the main hindrance to wide adoption is as simple as in the synchronous checkpointing approach.of process-replication methods in various areas is the heavy In order to achieve its objective, processes go on takingcost of the redundant hardware needed for the execution of checkpoints (basic checkpoints) asynchronously wheneverthe replicas. needed whereas the roll-forward checkpointing algorithm runs periodically (say the time period is T) by an initiator In contrast with process replication mechanisms, process to determine the GCCs. During the execution of thethe other technique does not require duplication of algorithm an application process P is forced to take ahardware or replication of processes. Instead, each process checkpoint if it has sent an application message m after itsperiodically records its current state and/or some history of latest basic checkpoint which was taken by processthe system in stable storage, and this action called asynchronously. It means that the message m cannot remaincheckpointing. If a failure occurs, processes return to the an orphan because of the presence of the forced checkpointprevious checkpoint (rollback) and resume their execution because every sent message is recorded. It implies that infrom this checkpoint. A checkpoint is a snapshot of the the event of a failure occurring in the distributed systemlocal state of a process along timeline, saved on local non- before the next periodic execution of the algorithm, processvolatile storage to survive process failures. A global P can restart simply from this forced checkpoint`t after thecheckpoint of an n-process distributed system consists of n system recovers from the failure. However, if process P hascheckpoints (local) such that each of these n checkpoints not sent any message after its latest basic checkpoint, thecorresponds uniquely to one of the n processes. A global algorithm does not force the process to take a checkpoint.checkpoint C is defined as a consistent global checkpoint if In such a situation process P can restart simply from itsno message is sent after a checkpoint of C and received latest basic checkpoint.before another checkpoint of C. The checkpoints belongingto a consistent global checkpoint are called globally 2.2 Difference between Roll forward and Rollbackconsistent checkpoints (GCCs).The overhead that this approachtechnique incurs is greater than that of process replicationmechanisms because checkpoints are taken during failure-  Roll forward stores only latest checkpoint andfree operation of processes, and rollback-recovery requires rollback stores all checkpoints and requirescertain actions to be performed to ensure consistency of truncation.system when processes recover from crash.  Roll forward takes two kinds of checkpoints The concept of roll-forward checkpointing is forced and co-ordinated while roll back takes onlyconsidered to achieve a simple recovery comparable to that coordinatedin the synchronous approach. This concept helps in limitingthe amount of rollback of a process (known as domino  In roll forward, every process takes forcedeffect) in the event of a failure. The roll-forward checkpoint after it sends message, it is notcheckpointing approach has been chosen as the basis of the necessary in rollback approachbecause of its simplicity and some important advantages it  Roll forward guarantees that no orphans existoffers from the viewpoints of both checkpointing and while roll back makes no such promise.recovery. In case the recovery using r0ll-forward approach is 3 PROPOSED SYSTEMnot possible, then we have no choice other than restoringthe system to a best known stable state. To restore the 3.1 Overview of the Fault tolerant environmentsystem to a best known stable state, the best option isrebooting the system. Rebooting involves restarting of all Fault tolerant environment is designed to supportthe components of the system including those once which computation-intensive applications executing on networkedwere working correctly before the system failed. Restarting workstations such as Railway reservation system. While thethe whole system can be time consuming sometime which ‘economy of effort’ stipulates the introduction of measuresincreases the downtime of the system. This is not tolerable that prevent the loss of the long-running computationin the distributed systems that work 24/7. Therefore we can results executing on distributed nodes, frequently it isminimize the rebooting using microreboot approach. In this difficult to justify the use or expansive hardware replicationapproach, only those components are rebooted which failed fault-tolerance techniques. Another relevant area for thein the system. application of Fault tolerant environment is on-line distributed programs such as decision support control2.1 Work done in Roll-forward checkpointing mechanisms and flight reservation systems. Such systems The objective of the algorithm is to design a can tolerate a short delay in services -for fault-management,checkpointing / recovery algorithm that will limit the effect but a complete halt of the system upon the occurrence of 348 All Rights Reserved © 2012 IJARCET
  3. 3. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012faults cannot be accepted i.e. they require immediaterecovery for the failure. The Acknowledgement timeout is calculated as Figure presents an abstract view of Fault tolerantenvironment operation. The error detection and faultrecovery modules run on a central node (server) that isassumed to be fault-tolerant, i.e. the probability of its failureis negligible. The required reliability of the central nodemight be obtained by hardware duplication. It is assumedthat the probability of the hardware failure of central server Whereis negligible. Upon system start-up, Fault tolerant  RTTi is the current estimate of round-trip timeenvironment identities the configuration of the underlyingnetwork and presents it to the user, which selects the  RTTi+1, is the new computed value, andnetwork node(s) on which the application processes shouldbe executed. Fault tolerant environment spawns the  α is a constant between 0 and 1 that controls howapplication processes on the specified nodes and rapidly the estimated rtt adapts to the change inperiodically triggers their checkpointing to save their network loadexecution image together with any inter-process messageswhich are saved into stable storage Nodes participating inthe application execution are continuously monitored, and 3.2 Detecting application-process failuresin the event of no& crash, checkpoints of all the processesrunning on the failed node arc extracted from stable storage When a user-process exits, analyses the processand the processes arc restarted on an operative node. exit status is analyzed by Fault tolerant environment to determine whether it exited normally or prematurely due to a failure, in which case the failed process recovery is3.2 Detection of faults in the hardware environment initiated. With regard to the user-assisted error detection, a special signal handler was dedicated to service the detection The starting point for all fault-tolerant strategies is of such errors. All the programmer has to do is to raise anthe detection of an erroneous state that, in the absence of interrupt with a predefined signal number and the detectionany corrective actions, could lead to a failure of the system. mechanism will handle the error as if it was raised by theFault tolerant environment error detection mechanism kerne1 (OS) detection mechanism (KDM).(EDM) identifies two types of hardware faults: processornode crashes (as caused by power failure) and transient For a centralized detection mechanism - such ashardware failures (temporary memory flips, bus errors, etc.) Fault tolerant environment’s, it is vital to consider thethat cause the failure of a single application process, and latency or detecting errors on the distributed system nodes.also allows the integration of user-programmed(application-specific) error checks.3.3 Detecting node failures Detection of node failures is based on a centralnode monitoring task that periodically sendsacknowledgement requests to all the nodes in the system. 3.3 Recovery of distributed application processesEach node must acknowledge within a predefined time 3.3.1 Creation of checkpointsinterval (acknowledgment timeout), otherwise it will beconsidered as having ‘crashed’. 349 All Rights Reserved © 2012 IJARCET
  4. 4. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 Assume that the distributed system has n processes only once irrespective of how many messages process P1(P0, P1, . . . , Pi, . . . , Pn-1). Let Cxi (0 ≤ i ≤ n - 1, x > 0) has sent before taking the checkpoint C11. Process P1 hasdenote the x-th checkpoint of process Pi, where i is the not sent any message between checkpoints C11 and C12. So,process identifier, and x is the checkpoint number. Each c1 remains at 0. Also it is clear why c1 still remains at 0 afterprocess Pi maintains a flag ci (Boolean). The flag is initially the checkpoint C12. Process P0 sets its flag c0 to 1 when itset at zero. It is set at 1 only when process Pi sends its first decides to send the message m3 after its latest checkpointapplication message after its latest checkpoint. It is reset to C01 .0 again after process Pi takes a checkpoint. Flag ci is storedin local RAM of the processor running process Pi for itsfaster updating. Note that the flag ci is set to 1 only once 3.3.2 Significance of forced checkpointsindependent of how many messages process Pi sends afterits latest checkpoint. In addition, process Pi maintains an Consider the system of Fig. 1 (ignore theinteger variable Ni which is initially set at 0 and is checkpoint C20 for the time being). Suppose at time T2 aincremented by 1 each time the algorithm is invoked. As in failure ‘f’ occurs. According to the asynchronous approachthe classical synchronous approach, we assume that besides processes P0 and P1 will restart their computation from C01the system of n application processes, there exists an and C11, since these are the latest GCCs.initiator process PI that invokes the execution of thealgorithm to determine the GCCs periodically. However, we Now, consider a different approach. Suppose, athave shown later that the proposed algorithm can easily be time T1, an attempt is made to determine the GCCs usingmodified so that the application processes can assume the the idea of forced checkpoints. We start with the recentrole of the initiator process in turn. We assume that a checkpoints C01 and C12 , and find that the message m3 is ancheckpoint Cxi will be stored in stable storage if it is a GCC; orphan. Observe that the flag co of process P0 is 1, whichotherwise in the disk unit of the processor running the means that process P0 has not yet taken a checkpoint afterprocess Pi replacing its previous checkpoint Cx-1i. We have sending the message m3. However, if at time T1 process P0shown that the proposed algorithm considers only the recent is forced to take the checkpoint C02 (which is not a basiccheckpoints of the processes to determine a consistent checkpoint of P0), this newly created checkpoint C02global checkpoint of the system. We assume that the becomes consistent with C12 . Now, if a failure ‘f’ occurs atinitiator process PI broadcasts a control message Mask to all time T2, then after recovery, P0 and P1 can simply restartprocesses asking them to take their respective checkpoints. their computation from their respective consistent states C02The time between successive invocations of the algorithm is and C12 . Observe that process P1 now restarts from C1 2 inassumed to be much larger than the individual time periods the new situation instead of restarting from C11. Thereforeof the application processes used to take their basic the amount of rollback per process has been reduced. Notecheckpoints. that these two latest checkpoints form a recent consistent global checkpoint as in the synchronous approach. The In this work, unless otherwise specified by ‘a following condition states when a process has to take aprocess’ we mean an application (computing) process. forced checkpoint.Example 1: Consider the system shown in Fig. 1. Examine Condition C: For a given set of the latestthe diagram (left of the dotted line). At the starting states of checkpoints (basic), each from a different process in athe processes P0 and P1, the flags c0 and c1 are initialised to distributed system, a process Pi is forced to take azero. The flag c1 is set at 1 when process P1 decides to send checkpoint Cim+1, if after its previous checkpoint Cimthe message m1 to P0. It is reset to 0 when process P1 takes belonging to the set, its flag ci = 1.its basic checkpoint C11. Observe that the flag c1 is set to 1 350 All Rights Reserved © 2012 IJARCET
  5. 5. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 PI for the k th execution of the algorithm,3.3.3 Non-blocking approach (2) Pi has taken a decision if it needs to take a forced We explain first the problem associated with non- checkpoint and has implemented it,blocking approach. Consider a system of two processes Piand Pj. Assume that both processes have sent messages after (3) Pi has resumed its normal operation and then has senttheir last checkpoints. So both ci and cj are set at 1. Assume this application message mi.that the initiator process PI has sent the request messageMask. Let the request reach Pi before Pj. Then Pi takes its (4) The sending event of message mi has not yet beencheckpoint Cik because ci = 1 and sends a message mi to Pj. recorded by Pi.Now consider the following scenario. Suppose a little laterprocess Pj receives mi and still Pj has not received Mask. So,Pj processes the message. Now the request from PI arrives 3.4 Microrebootingat Pj. Process Pj finds that cj = 1. So it takes a checkpointCjr. We find that message mi has become an orphan because Microrebooting in a new approach for restoringof the checkpoint Cjr. Hence, Cik and Cjr cannot be the system to a stable state. Rebooting is generally acceptedconsistent. as a universal form of recovery for many software failures in the industry, even when the exact causes of failure are3.3.4 Solution to the non-blocking problem unknown. Rebooting provides a high-confidence way to reclaim stale or leaked resources. Rebooting is easy to To solve this problem, we propose that a process perform and automate and it returns the system to the bestbe allowed to send both piggybacked and non-piggybacked known, understood and tested state.application messages. We explain the idea below. Eachprocess Pi maintains an integer variable Ni, initially set at 0 Unfortunately rebooting for an unexpected crashand is incremented by 1 each time process Pi receives the can take a very long time to reconstruct the state. Amessage Mask from the initiator. Thus variable Ni microreboot is the selective crash-restart of only those partsrepresents how many times the check pointing algorithm of a system that trigger the observed failure. This techniquehas been executed including the current one (according to aims to preserve the recovery advantages of rebooting whilethe knowledge of process Pi). Note that at any given time t, mitigating the drawbacks. In general, a small subset offor any two processes Pi and Pj, their corresponding components is often responsible for a global system failure,variables Ni and Nj may not have the same values. It thus making the microreboot an effective technique fordepends on which process has received the message Mask system-global recovery.first. However, it is obvious that |Ni - Nj| is either 0 or 1.Below we first state the solution for a two-process system. By reducing the recovery to the smaller subset ofThe idea is similarly applicable for an n process system as components, microrebooting minimizes the amount of statewell. loss and reconstruction. To reduce the state loss, we need to store the state that must survive the microrebooting processTwo-process solution: into separate repositories which are crash safe. This separates the problem of data recovery from application- Consider a distributed system of two processes Pi logic recovery and lets us perform the latter at finer grainand Pj only. Assume that Pi has received Mask from the than the process level.initiator process PI for the k th execution of the algorithm,and has taken a decision whether to take a checkpoint or Microreboots are largely as effective as fullnot, and then has implemented its decision. Also assume reboots but 30 times faster. In our prototype, microrebootsthat Pi now wants to send an application message mi for the recover from a large category of failures for which systemfirst time to Pj after it finished participating in the k th administrators normally restart the application, includingexecution of the algorithm. Observe that Pi has no idea deadlocked or hung threads, memory leaks, and corruptwhether Pj has received Mask yet and has taken its volatile data. If a component microreboot doesn’t correctcheckpoint. To make sure that the message mi can never be thean orphan, Pi piggybacks mi with the variable Ni. Process Pj failure, we can progressively restart larger subsets ofreceives the piggybacked message <mi, Ni> from Pi. We explain why message mi can never been an orphan.Note that Ni = k; that is it is the kth execution of the Because the component-level reboot time isalgorithm that process Pi has last been involved with. It determined by how long the system takes to restart themeans the following to the receiver Pj of this piggybacked component and the component takes to reinitialize, amessage: microrebootable application should aim for components that are as small as possible.(1) Process Pi has already received Mask from the initiator 351 All Rights Reserved © 2012 IJARCET
  6. 6. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 Microrebooting just the necessary componentsreduces not only recovery time but also its effects on the REFERENCESsystem’s end users. [1] Wang, Y.-M.: ‘Consistent global checkpoints that4 CONCLUSION contain a given set of local checkpoints’, IEEE Trans. Comput., 1997, 46, (4), pp. 456–468 Fault tolerant environment was designed toprovide a fault-tolerant distributed environment thatprovides distributed system users and parallel programmers [2] Koo, R., and Toueg, S.: ‘Check pointing andwith an integrated processing environment, where they can rollback-recovery for distributed systems’, IEEEreliably execute their concurrent(distributed) applications Trans. Software Eng., 1987, 13, (1), pp. 23–31despite errors that might occur in the underlying hardware. Fault tolerant environment user-transparent error [3] Venkatesan, S., Juang, T.T.-Y., and Alagar, S.:detection mechanism covers processor node crashes and ‘Optimistic crash recovery without changinghardware transient failures, and allows for the integration of application messages’, IEEE Trans. Paralleluser-programmed error checks into the detectable errors Distrib. Syst., 1997, 8, (3), pp. 263–271database. A non-blocking checkpointing policy was adopted [4] Cao, G., and Singhal, M.: ‘On coordinated checkto backup and restore the state of the application processes. pointing in distributed systems’, IEEE Trans.The checkpointing mechanism forks an exact copy Parallel Distrib. Syst., 1998, 9, (12), pp. 1213–(thread)of the application program, this thread performs all 1225the checkpointing routines without suspending theexecution of the application code, thus significantly [5] Pradhan, D.K., and Vaidya, N.H.: ‘Roll-forwardreducing the checkpointing overhead. check pointing scheme: a novel fault-tolerant architecture’, IEEE Trans. Comput., 1994, 43, In order to co-ordinate the operation of the (10), pp. 1163–1174checkpointing mechanism in distributed computingenvironments, a novel approach to reliable distributedcomputing for messages-passing applications was devised. [6] Gass, R.C., and Gupta, B.: ‘An efficient checkIt takes advantage of the low failure-free overhead of pointing scheme for mobile computing systems’.coordinated checkpointing methods with logging messages Proc. ISCA 13th Int. Conf. Computer Applicationsthat cross the recovery line to avoid blocking the in Industry and Engineering, Honolulu, USA,application process during the checkpointing protocol. The November 2000, pp. 323–328low failure-free overhead is at the expense of a longerrollback time, which is admissible because of the extendedexecution time of the targeted application. [7] Gupta, B., Banerjee, S.K., and Liu, B.: ‘Design of new roll-forward recovery approach for The noteworthy point of the presented approach is distributed systems’, IEE Proc., Comput. Digit.that a process receiving a message does not need to worry Tech., 2002, 149, (3), pp. 105–112whether the received message may become an orphan ornot. It is the responsibility of the sender of the message to [8] George Candea, Aaron B. Brown, Armando Foxmake it non-orphan. Because of this, each process is able to and David Patterson: ‘Recovery-Orientedperform its responsibility independently and simultaneously Computing: Building Multitier Dependability’,with others just by testing its local Boolean flag. This IEEE Computer Society, November 2004, pp 60-makes the algorithm a single phase one and thereby, in 67effect, makes the algorithm fast, simple and efficient. [9] B. Gupta, S. Rahimi and Z. Liu : ‘Novel low- A microreboot is the selective crash-restart of only overhead roll-forward recovery scheme forthose parts of a system that trigger the observed failure. distributed systems’, IET Comput. Digit. Tech.,Microreboots are largely as effective as full reboots but 30 2007, 1, (4), pp. 397–404times faster [10] T. Osman and A. Bargiela: ‘FADI: A fault tolerant environment for open distributed computing’ , IEE Proc.-Softw., Vol. 147, No. 3, pp 91-99 352 All Rights Reserved © 2012 IJARCET