On the quality of service of crash recovery




IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 3, JULY-SEPTEMBER 2010

On the Quality of Service of Crash-Recovery Failure Detectors

Tiejun Ma, Jane Hillston, and Stuart Anderson

Abstract: We model the probabilistic behavior of a system comprising a failure detector and a monitored crash-recovery target. We extend failure detectors to take account of failure recovery in the target system. This involves extending QoS measures to include the recovery detection speed and proportion of failures detected. We also extend estimating the parameters of the failure detector to achieve a required QoS to configuring the crash-recovery failure detector. We investigate the impact of the dependability of the monitored process on the QoS of our failure detector. Our analysis indicates that variation in the MTTF and MTTR of the monitored process can have a significant impact on the QoS of our failure detector. Our analysis is supported by simulations that validate our theoretical results.

Index Terms: Failure detectors, crash recovery, quality of service, availability, dependability, performance.

1 INTRODUCTION

Fault tolerance is one of the most important issues for achieving dependable distributed systems. One of the most challenging problems in this research area is to tolerate the Byzantine failure, which is also sometimes called the arbitrary failure. This means that a process may behave in an arbitrary manner, producing arbitrary responses at arbitrary time [1]. It is the most difficult failure to detect. One possible solution of Byzantine failure detection is adopting consensus algorithms. To achieve $K$ fault tolerance, $3K + 1$ service replications are needed [2]. In the worst case, the $K$ faulty services may send incorrect values, or incorrectly represent the values of others, but the remaining $2K + 1$ services can still return the same correct answer. Crash failure detection is one of the most important building blocks to achieve a successful consensus. However, detecting crash failures is a difficult problem. In [3], Fischer et al. show the impossibility of separating a crashed process and a very slow one, in a pure asynchronous system, known as the Fischer-Lynch-Paterson's impossibility result. Subsequently, failure detector oracles, which give possibly erroneous information about the state of the monitored target, have been proposed. In [4], Chandra and Toueg introduce the concept of unreliable crash failure detectors to detect the eventual crash behavior of a process and classify a set of abstract failure detectors based on the failure detectors' eventual behavior to solve a certain set of membership problems. This work inspired many researchers to study the quality of service (QoS), such as the speed and accuracy, of crash failure detector implementations and failure detection algorithms, e.g., [5], [6], [7], [8], [9], [10].

It is important to note that most of this previous work focused on the QoS of crash failure detectors is based on the crash-stop or fail-free assumption. The fail-free assumption assumes that failures do not occur. The crash-stop assumption assumes that there is only one failure and the monitoring procedure terminates once that crash failure is detected. The algorithms based on these assumptions focus on how to estimate the probabilistic message arrival time and a suitable time-out period for a failure detector to ensure a required QoS.

However, fail-free and crash-stop can be strong assumptions. An alternative approach is to consider the crash-recovery paradigm as discussed by Guerraoui and Rodrigues [11]. A process can keep crashing and recovering infinitely often and it is eventually always up and running. In theory, a process recovery can be achieved by adopting stable storage and the state information of the process can be stored and retrieved from the storage. After a crash is detected, the recovery procedure can be initiated to retrieve the latest stored process information. In practice, in order to provide high availability, self-repairing and self-healing mechanisms are widely adopted in fault-tolerant systems to achieve automatic recovery after a crash occurs. Particularly, in middleware systems, many techniques and algorithms have been proposed to achieve the self-repairing or self-healing goal, e.g., [12], [13], [14], [15].

In such systems, it is assumed that the system undergoes periodic crashes. During a crash period, the system is unable to service any requests or send any messages, externally behaving as if the system is unreachable. The end of the crash period is marked by a recovery, after which the system returns to normal service and its internal state is restored to the state before the crash failure occurred.

For such systems, crash-recovery failure needs to be considered as a frequently occurring failure type to be detected. However, the crash-recovery case has been little studied, due to the fact that there are more possible discrepancies between the failure detector and the monitored target, increasing the size of the state space of the process, making the QoS analysis for such a paradigm more complicated.

T. Ma is with the Department of Computing, Imperial College London, South Kensington Campus, 180 Queens Gate London, SW7 2AZ, UK.
J. Hillston and S. Anderson are with the Laboratory for Foundations of Computer Science, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK. E-mail: {jeh, soa}
Manuscript received 19 Feb. 2008; revised 21 Apr. 2009; accepted 30 June 2009; published online 11 Aug. 2009.
For information on obtaining reprints of this article, reference IEEECS Log Number TDSC-2008-02-0037.
Digital Object Identifier no. 10.1109/TDSC.2009.36.
1545-5971/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.
In [16], we presented an evaluation of the QoS of a crash-recovery failure detector based on a simple time-out algorithm. A crash-recovery target was modeled as an alternating renewal process. The simulation results showed that the crash-recovery behavior of the monitored target will impact the QoS of such a failure detector, which implied that the crash-recovery paradigm merited further study. Such an analysis was presented in [17]. In that paper, we outlined how to model the failure detection pair in a crash-recovery run and how to configure the failure detector to satisfy a given QoS requirement. The current paper represents a substantial expansion of [17]. We present more analytical details and support the results with further simulation studies. Analytical results, derived directly from the equations in this paper, are also plotted and compared with the simulation results. We are then able to present a detailed analysis for each of the QoS metrics, which shows the validity of our model.

1.1 Our Contribution

We show how to remove the fail-free or crash-stop assumption and model the probabilistic behavior of a failure detector with respect to a crash-recovery target, taking into consideration general dependability metrics, such as mean time to failure (MTTF) and mean time to recovery (MTTR). We outline how the QoS of a failure detector is limited by the dependability of the monitored target. Moreover, we establish that the crash-stop or fail-free models are special cases of the crash-recovery model.

In order to effectively assess the QoS of the failure detector in a crash-recovery run, we have defined new QoS metrics to measure the recovery detection speed and the proportion of the failures of the monitored target which are detected. To make an accurate estimation of the failure detector's parameters needed to achieve a required QoS, a configuration procedure for a crash-recovery failure detector is outlined. We demonstrate how to achieve the QoS from a given set of requirements based on the NFD-S algorithm (see Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2009.36) proposed by Chen et al. [5] with suitable modifications. To the best of our knowledge, none of these aspects of QoS of failure detectors have been presented before.

1.2 Related Work

In [5], Chen et al. propose a set of QoS metrics to measure the accuracy and speed of a failure detector. Their model contains a pair of processes: one is the monitor process, the other is the monitored process, and there is only one crash during the monitoring period. The analysis is based on two separate stages of failure detection: the precrash stage, which is a fail-free run; and the postcrash stage, which is a crash-stop run when the monitoring procedure will be terminated. In order to formally define the QoS metrics, Chen et al. [5] define state transitions of a failure detector monitoring a target process under the fail-free assumption. At any time, the failure detector's state is either Trust or Suspect with respect to the monitored process's liveness. If a failure detector moves from a Trust state to a Suspect state, then an S-transition occurs; if the failure detector moves from a Suspect state to a Trust state, then a T-transition occurs. Fig. 1 shows the state transitions of the failure detector and the QoS metrics. In terms of the transitions defined above and the fail-free assumption, Chen et al. define the following QoS metrics for a failure detector: failure detection time ($T_D$), mistake recurrence time ($T_{MR}$), mistake duration ($T_M$), good period duration ($T_G$), and query accuracy probability ($P_A$).

Fig. 1. The QoS metrics without considering false positive mistakes.

Some recent research has extended the QoS work of [5] in a number of ways. For example, the authors of [6], [9], [10], [18] refine the model with different probabilistic message delay and loss estimation methods. Meanwhile, others, such as [7], [8], [19], [20], [21], focus on the scalability and adaptivity of crash failure detection. But all of these papers are based on eventual crash-stop behavior of the monitored process or the fail-free assumption. Crash-recovery failure detectors have been considered by several groups, e.g., Boichat and Guerraoui [22] implemented reliable and total order broadcast primitives, assuming a practical asynchronous crash-recovery model in which the processes and channels may crash and recover or crash and never recover; [23], [24], [25], [26], each of which proposes failure detectors to solve consensus problems rather than focusing on the QoS of the failure detector itself. In [23], the monitored process is characterized as always-up, eventually-up, eventually-down, or unstable. A process which crashes and recovers infinitely many times is regarded as unstable. But crash-recovery looping behavior exists for most systems. From the perspective of stochastic theory, crash-recovery behavior can be regarded as a regenerative process in which the probabilistic live and recovery times are not zero. In the following sections, we will analyze such a crash-recovery paradigm and its failure detector from a QoS perspective.

This paper is organized as follows: in Section 2.1, we model a crash-recovery service with general dependability metrics. Then, we show our model of the probabilistic message communication and its QoS metrics. In Section 3, we show how to model the crash-recovery failure detector's probabilistic behavior. We refine the completeness of a crash-recovery failure detector and extend the QoS metrics to measure the completeness and the recovery detection speed of such a failure detector. Then, we show how to involve the general dependability metrics for an approximate analysis of the QoS of a failure detector and how to configure a crash-recovery failure detector to satisfy a given set of QoS requirements. Moreover, we discuss the impact of the dependability of the crash-recovery service on the QoS of failure detectors in detail. In Section 4, the estimation of the input parameters of a crash-recovery failure detector is presented. We show how to estimate the message delay, message loss, MTTF, MTTR, etc., in a crash-recovery run. In Section 5, the analytical and simulation results are plotted and analyzed in detail. We show that the dependability of a crash-recovery target has an impact on the QoS of a failure detector and our analysis is valid. In Section 6, a brief summary of the paper is presented. Appendix A provides a notation table for the variables used in the paper. Appendix B shows the pseudocode of the NFD-S algorithm. Appendix C presents the main proofs of the lemmas and theorems presented in this paper.

2 CRASH-RECOVERY SERVICE AND QoS OF MESSAGE COMMUNICATION

In this section, we outline the assumptions underlying our framework, considering the crash-recovery behavior of the target service, its dependability characteristics, and the behavior of the communication channel which supports the failure detection process.

2.1 The Crash-Recovery Service Modeling

For a crash-recovery target service (CR-TS), we consider that the service might crash at arbitrary time and take some time to be repaired and restart again after it fails. Let $S$ be the state space of a stochastic process $Z := \{Z(t), t \ge 0\}$, where $Z$ captures a CR-TS's lifetime. Then, $S$ can be regarded as {Alive, Crash} and the CR-TS can periodically switch between these two states. A transition occurs when the state of the CR-TS changes. Fig. 2 shows the state transitions of a CR-TS, where a C-transition occurs when the state of the CR-TS switches from the Alive state to the Crash state; an R-transition occurs when the state of the CR-TS switches from the Crash state to the Alive state.

Fig. 2. Crash-recovery service modeling.

Assumption 1. If the service's recovery is treated as a restart, then the CR-TS's lifetime $Z$ is a regenerative process.

Assumption 1 will be used in the following. It is based on the following observations. The CR-TS will periodically crash and recover, leading to a sequence of time points, $S_1, S_2, \ldots, S_n, \ldots$ ($n \ge 0$), representing the times of the CR-TS's recovery. The behavior of the system after $S_n$ ($n \ge 0$) is independent of what has occurred before, and thus, $S_n$ can be regarded as a restart. Moreover, the probability of $S_n$ occurring is 1. This makes the time points $S_1, S_2, \ldots, S_n$ regeneration points.

Since the CR-TS's lifetime $Z$ is a regenerative process and the sequence $\{S_1, S_2, \ldots, S_n, \ldots\}$ characterizes the lifetime of the service, we can give an alternative definition of the stochastic process $Z$. The stochastic process $Z$ is a set of random variables $\{X(n), n \in N\}$, where $X(n)$ is the random variable representing the time which elapses from the time of the $n$th regeneration point to the $(n+1)$th one (i.e., $X(n) = S_{n+1} - S_n$). For simplicity of presentation, we use $X$ instead of $X(n)$ in the following since it is sufficient to consider a single regeneration period. Furthermore, we can consider $X$ to be the sum of two independent random variables: $X_a$ and $X_c$. Here, $X_a$ represents the time which elapses from the time that the CR-TS starts a regeneration period to the time the CR-TS fails and $X_c$ represents the time from when the CR-TS fails until the time of the next regeneration point.

Lemma 1. In steady state, the CR-TS is an alternating renewal process and the time between any two consecutive recovery time points is one period of the crash-recovery service's lifetime.

Thus, we assert that in order to design a failure detector for the CR-TS, which is sensitive to the CR-TS's behavior, we only need to consider one period of the CR-TS since all of the other periods are independent and identically distributed.

2.2 Dependability of a Crash-Recovery Service

Dependability, one of the most important issues for computer systems, is a complex attribute. Laprie et al. [1] define the concept of dependability as the property of a computer system such that reliance can justifiably be placed on the service it delivers. Associating timing information with the behavior of a system, its dependability can be described quantitatively. Generally speaking, the dependability of a system can be measured according to a number of different aspects such as reliability, availability, consistency, usability, security, etc. In order to simplify the measurements which are related to failure detection, here, we only introduce reliability, availability, and consistency, which are strongly related to the QoS of failure detectors.

In [27], Knight and Strunk give a definition of software reliability and availability. We extend this with a definition of consistency as follows:

  • Reliability: is the probability that the system will operate correctly in a specified operating environment up until time $t$ ($t > 0$).
  • Availability: is the probability that the system will be operational at time $t$.
  • Consistency: is the probability that in a specified operating environment, the system will return to normal operation correctly after a failure within time $t$.

These three metrics present different aspects of the system dependability. Generally, reliability presents how long a system will operate correctly and can be captured by MTTF, which records the likelihood of a service to persist without a failure. Availability presents the probability that a system is accessible or reachable with correct operation at an arbitrary time and can be captured by mean time to failure divided by mean time between failure ($\mathrm{MTTF}/\mathrm{MTBF}$). Consistency presents the ability of a system to recover from a failure state to the correct operation state and can be captured by MTTR, which records how quickly a system recovers.

In different scenarios, different aspects of dependability may be given greater relative importance. For example, consistency may be valued more than reliability in a system designed to be always accessible. This means that fault-tolerance mechanisms should be able to adapt to reflect differing dependability requirements.
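The link between the alternating renewal model of Section 2.1 and the availability measure $\mathrm{MTTF}/\mathrm{MTBF}$ can be checked numerically. The following is a minimal Python sketch and not part of the paper: it assumes exponentially distributed $X_a$ and $X_c$ (the text does not fix a distribution at this point) and takes $\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$, which follows from $X = X_a + X_c$. The function names `simulate_availability` and `analytic_availability` are hypothetical.

```python
import random


def simulate_availability(mttf, mttr, periods=100_000, seed=1):
    """Estimate the long-run fraction of time a CR-TS spends in the Alive state.

    Each regeneration period X is the sum of an alive time X_a and a crash
    time X_c. Both are drawn from exponential distributions here, which is
    an assumption made only for this illustration.
    """
    rng = random.Random(seed)
    alive_time = 0.0
    total_time = 0.0
    for _ in range(periods):
        x_a = rng.expovariate(1.0 / mttf)  # time from recovery to the next crash
        x_c = rng.expovariate(1.0 / mttr)  # time from crash to the next recovery
        alive_time += x_a
        total_time += x_a + x_c
    return alive_time / total_time


def analytic_availability(mttf, mttr):
    """Availability as MTTF / MTBF, with MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    estimate = simulate_availability(100.0, 10.0)
    exact = analytic_availability(100.0, 10.0)
    print(f"simulated: {estimate:.4f}, analytic: {exact:.4f}")
```

Because all regeneration periods are independent and identically distributed, the simulated fraction of alive time should approach the analytic ratio as the number of periods grows, regardless of the seed chosen.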
2.3 QoS of Message Communication

In order to measure the communication between the FDS and target service quantitatively, we define the communication path between the FDS and the target service as a channel. Each communication component pair holds one or more virtual one-way, source-to-destination channels. Messages can only flow from the source component to the destination component. In addition, the channel model in this paper relies on the assumption of a basic unreliable communication channel with fairness, no-creation, and no-duplication [28]. This has some similarities with the Stubborn channels in [28], but they allow duplicated messages and we assume that there are no duplicated messages in our model.

This channel-based communication, which maintains the interaction between the FDS and the CR-TS, can be characterized by the QoS of the communication, the adopted failure detection algorithm, and the adopted communication protocol, each of which has some associated properties. In particular, we take the message transmission behavior to be probabilistic: we describe the message delay or loss as probabilistic behaviors associated with the communication channel.

Definition 1. Let $D$ be a random variable representing the time which elapses from the time a message is sent until the time it arrives at the destination and $E(D)$ be the average message delay; let $p_L$ be the probability of a message loss during the transmission; let $X_L$ be a random variable representing the number of consecutive messages lost and $E(X_L)$ be the average number of consecutive messages lost.

From these definitions, properties such as the following can be derived:

Lemma 2. If each message's transmission and loss behavior are independent, then the probability that $x$ ($x \ge 1$) consecutive messages are lost is

$$\Pr(X_L = x) = p_L^{x} \cdot (1 - p_L).$$

Overall, the QoS of this channel-based communication between the FDS and the CR-TS can be captured by $E(D)$, $p_L$ and $E(X_L)$. In the following sections, we analyze how the FDS monitors the CR-TS and how the FDS can be configured based on the characteristics of this channel-based communication.

3 QoS OF THE CRASH-RECOVERY FDS

3.1 System Model

We consider a distributed system model with two services: one FDS and one CR-TS, distributed over a wide-area network. The FDS and the CR-TS are connected by an unreliable communication channel (see Section 2.3). Liveness (heartbeat) messages are transmitted through the channel. The communication channel does not create or duplicate liveness messages, but the messages might be lost or delayed indefinitely during transmission (see footnote 1). The CR-TS can fail by crashing but can be repaired and restart to run again after some repair time, i.e., it behaves as a crash-recovery model. The drift of the local clocks of the FDS and the CR-TS is small enough to be ignored and their local clocks are sufficiently synchronized (this can be guaranteed by some time synchronization service such as the Network Time Protocol used in [6]) to be regarded as a clock synchronized system. The failure detection algorithm we adopt is the NFD-S algorithm proposed in [5].

Footnote 1. This channel-based message transmission is the same as the probabilistic network model in [5].

3.2 Modeling a Push-Style Crash-Recovery FDS

The failure detector (FDS) has a set of suspicion levels $S^{s} := \{Trust, Suspect\}$ as in [5]. The FDS can either trust or suspect a CR-TS's liveness. Thus, for a fail-free run, a service only has one state: Alive. The state space of an FDS is $S^{f} := \{Trust\text{-}Alive, Suspect\text{-}Alive\}$, and the event space of an FDS is $F := \{S\text{-}transition, T\text{-}transition\}$ (Fig. 3a). For a fail-free run, the QoS metrics of an FDS can be measured quite straightforwardly. The average time spent in the Trust state is the mean length of the good period $E(T_G)$; the average time spent in the Suspect state is the mean time of the mistake duration $E(T_M)$; the average time between two consecutive transfers to the Suspect state (two consecutive S-transitions) is the mean time of the mistake recurrence $E(T_{MR})$.

Fig. 3. State space in a crash-recovery run. (a) Fail-free transition. (b) Crash-recovery transition.

However, precisely speaking, the state space of an FDS is $S^{c} := S \times S^{s}$, where $S$ is the state space of the target service. Therefore, for a CR-TS with failures, the state space of its FDS increases because the service has more than one state (see Fig. 3b). If the suspicion level is more than two, then $S^{c}$ will increase as well. The QoS metrics of an FDS are no longer as simple as for fail-free runs.

For a fail-free run ($\mathrm{MTTF} \to +\infty$) or a crash-stop run ($\mathrm{MTTR} \to +\infty$), the CR-TS's current state $S_{CR\text{-}TS}$ will be Alive for all time up to the crash, and it is easy to deduce the FDS's accuracy $S_A$ directly from the FDS's current state. However, for a crash-recovery run, since the CR-TS could fail or recover at arbitrary time, $S_A$ cannot be deduced solely from the state of the FDS.

Furthermore, compared with a fail-free or crash-stop run, there are more mistake types in a crash-recovery run. In previous work, such as [5], [6], [8], [9], [10], [18], [20], only the mistakes caused by the message transmission behaviors (message delay and loss) are considered. But in a crash-recovery run, a mistake starts whenever the CR-TS's and FDS's states diverge. Thus, there are also mistakes caused by the CR-TS's crash (see $T_F$ in Fig. 1 or $T_M^3$ in Fig. 4c) and recovery (see Fig. 4d) due to the delayed detection of such events. Fig. 4 shows the four types of mistake which could occur within a crash-recovery run. $T_M^1$ in Fig. 4a represents a mistake caused by a message delay. $T_M^2$ in Fig. 4b represents a mistake caused by a message loss. $T_M^3$ in Fig. 4c represents a mistake caused by CR-TS's crash, while the FDS still trusts the CR-TS. $T_M^4$ in Fig. 4d represents a mistake caused by CR-TS's recovery, while the FDS still suspects the CR-TS. A message loss or delay will result in a Suspect-Alive mistake of the FDS (see Fig. 3b). A crash failure will result in a Trust-Crash mistake. A recovery event will result in a Suspect-Alive mistake. Mistakes caused by different reasons will result in different FDS parameter reconfiguration plans. For instance, the best way for the FDS to tolerate more message losses or a longer message delay is to increase the time-out duration; the best way for the FDS to minimize the mistake duration caused by a crash event is to decrease the time-out duration; and the best way to minimize the mistake duration caused by a recovery event is to increase the liveness message sending frequency. Thus, we can see that an inaccurate mistake type identification might reduce the QoS of an FDS and should be avoided.

Fig. 4. The analysis of possible $T_M$ in a crash-recovery run. (a) $T_M^1$. (b) $T_M^2$. (c) $T_M^3$. (d) $T_M^4$.

From the above analysis, we can see that due to the increasing mistake types in a crash-recovery run, the definition of the QoS metrics in [5] using transitions is not valid in a crash-recovery run. Thus, we redefine them as below:

  • Detection time ($T_D$): The elapsed time from when the monitored target crashes until the failure detector correctly suspects the monitored target.
  • Mistake recurrence time ($T_{MR}$): The time between the occurrence of two consecutive mistakes.
  • Mistake duration ($T_M$): The time to correct a mistaken suspect or trust.
  • Good period duration ($T_G$): The duration for which the failure detector maintains the correct state information.
  • Query accuracy probability ($P_A$): The probability that the state information from the failure detector is correct at an arbitrary time.

The above QoS metrics can measure some QoS aspects of a failure detector in a crash-recovery run. However, they cannot measure how fast a recovery can be detected, the proportion of the detected failures over the occurred failures (completeness), etc. In the following section, we extend the QoS metrics to measure the recovery detection speed and the completeness of a failure detector.

3.3 Extended QoS Metrics for a Crash-Recovery FDS

For an FDS in a crash-recovery run, in addition to the QoS metrics introduced above, we propose some new QoS metrics.

First, in order to measure the speed with which an FDS can discover a recovery of the CR-TS, we define the recovery detection time ($T_{DR}$), a random variable which represents the time that elapses from the CR-TS's recovery time (an R-transition) to the time when the FDS discovers the recovery.

Second, in a crash-recovery run there is no eventual behavior of a CR-TS, and a fast recovery could make a failure undetectable by the FDS. Under such circumstances, the completeness property of a failure detector defined in [4] can no longer be satisfied. In order to reflect this situation, we refine the definition of the completeness as follows:

  • Strong completeness: Every crash failure of a recoverable process will be detected.
  • Weak completeness: A specified proportion of the crash failures of a recoverable process will be detected.

Therefore, in order to measure the completeness property of a crash-recovery FDS, we propose a new QoS metric. The detected failure proportion ($R_{DF}$) is a random variable capturing the ratio of the detected crashes over the occurred crashes ($0 \le R_{DF} \le 1$). When no crash failures are detected, $R_{DF} = 0$. When all of the occurring crashes are detected, $R_{DF} = 1$. The strong completeness property of an FDS requires that $E(R_{DF}) = 1$ (where $E$ denotes expectation). The weak completeness property requires that $E(R_{DF}) \ge R^{L}_{DF}$,
where $R^{L}_{DF}$ is the specified lower bound of the detected failure proportion and $0 \le R^{L}_{DF} \le 1$.

Overall, the QoS for a crash-recovery FDS can be captured by $P_A$, $T_M$, $T_{MR}$, $T_D$, $T_{DR}$, and $R_{DF}$. In the next section, we will analyze the QoS bounds of the FDS based on the NFD-S algorithm in a crash-recovery run by adopting the proposed basic and extended QoS metrics.

3.4 QoS Estimate of the Crash-Recovery FDS Based on the NFD-S Algorithm

In a crash-recovery run, as the state of a CR-TS can switch between Alive and Crash, these crash or recovery events will force the output of the FDS to be accurate or inaccurate. For analyzing the behavior of the failure detection pair, we want to pick an observation period, which will cover all the events which may possibly occur. In our model, we pick one MTBF period as the observation period. This is because, as we discussed in Section 2.1, in order to study the steady state behavior of a CR-TS throughout its lifetime, we only need to observe the time period between two consecutive regeneration points (recovery times) of the CR-TS and the average duration between the two consecutive regeneration points is MTBF. In the following, we will treat these as also regeneration points of the system consisting of the failure detection pair. This is an approximation made for pragmatic reasons but it can be justified as follows:

Fig. 5 shows the relationship between an FDS and a CR-TS on the interval $t \in [t_0, t_3)$, where both $t_0$ and $t_3$ are regeneration points. Obviously, the mean time between $t_0$ and $t_3$ is the MTBF. We split $[t_0, t_3)$ into three intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$:

  • $t_1$ is the time when the FDS detects the transition of the CR-TS from the Crash state to the Alive state.
  • $t_2$ is the time when the service crashes. Note that the period $[t_1, t_2)$ is without failures.

Fig. 5. The analysis of the FDS based on the NFD-S algorithm in a crash-recovery run.

Additionally, we define the following times:

  • $\sigma_s$ is the first liveness message sending time after a recovery;
  • $\sigma_f$ is the sending time of the last liveness message before a crash;
  • $\sigma_i$ is the sending time of a liveness message between $\sigma_s$ and $\sigma_f$;
  • $\eta$ is the liveness message sending interval;
  • $\tau_s$ is the first decision time after recovery (see footnote 2);
  • $\tau_i$ is the time of the $i$th freshness point corresponding to $\sigma_i$;
  • $\tau_b$ is the last freshness point (see footnote 3) before a crash; and
  • $\tau_f$ is the freshness point corresponding to $\sigma_f$.

Footnote 2. The actual arrival time of the first received valid liveness message.
Footnote 3. The expected arrival time of the liveness message.

Let time-out be the threshold waiting time for the expected arrival of the liveness message before suspecting the CR-TS ($\text{time-out} = \tau_i - \sigma_i$ in Fig. 5). Let $t_r^m$ ($m \ge 1$) be a recovery time of the current MTBF period (see Fig. 5). Then in our model, the key thing for the QoS bounds analysis is to derive the average number of mistakes that will happen between the $m$th and $(m+1)$th recovery times, and the average duration of each mistake. We make the following definitions as extensions of Definition 1 in [5]:

Definition 2. For the fail-free duration $[t_1, t_2)$ within each MTBF period:

  1. $k$: for any $i \ge 1$, let $k$ be the smallest integer such that for all $j \ge i + k$, $m_j$ is sent at or after time $\tau_i$, where $m_j$ is the $j$th heartbeat message (see footnote 4).
  2. For any $i \ge 1$, let $p_j^i(x)$ be the probability that the FDS does not receive the $(i+j)$th message $m_{i+j}$ by time $\tau_i + x$, for every $j \ge 0$ and every $x \ge 0$; let $p_0^i = p_0^i(0)$.
  3. For any $i \ge 2$, let $q_0^i$ be the probability that the FDS receives message $m_{i-1}$ before time $\tau_i$.
  4. For any $i \ge 1$, let $u_i(x)$ be the probability that the FDS suspects the CR-TS at time $\tau_i + x$, for every $x \in [0, \eta)$.
  5. $p_s^i$: for any $i \ge 2$, let $p_s^i$ be the probability that an S-transition occurs at time $\tau_i$.

Footnote 4. $k$ is assumed to be independent of $i$ approximately. In fact, in a crash-recovery run, $k$ is not completely independent of $i$. However, if the CR-TS will remain alive for a reasonable duration, $k$ will be almost independent of $i$ except for the last few messages before the CR-TS crashes.

According to the QoS analysis of the NFD-S algorithm in Proposition 3 in [5], we now analyze the basic QoS metrics of the FDS based on the NFD-S algorithm in a crash-recovery run and show the following relations hold:

Proposition 1.

  1. $k = \lceil \text{time-out} / \eta \rceil$.
  2. For all $j \ge 0$ and for all $x \ge 0$, $p_j^i(x) = \left(p_L + (1 - p_L) \cdot \Pr(D > \text{time-out} + x - j\eta)\right) \cdot \Pr\left(X_a > \tau_i - t_r^m + x\right)$.
  3. $q_0^i = (1 - p_L) \cdot \Pr(D < \text{time-out} + \eta) \cdot \Pr\left(X_a > \tau_i - t_r^m\right)$.
  4. For all $x \in [0, \eta)$, $u_i(x) = \prod_{j=0}^{k} p_j^i(x)$.
  5. $p_s^i = q_0^i \cdot u_i(0)$.

In Proposition 1, the bounds of each QoS metric are derived based on the analysis of the average number of possible mistakes within the distinct intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$. In consequence, the following theorem holds and can be used to estimate the FDS's parameters or QoS bounds within a crash-recovery run:

Theorem 1. The crash-recovery FDS based on the NFD-S algorithm has the following properties:
  • 7.
$$\mathrm{MTBF} \ge E(T_{MR}) \ge \frac{\mathrm{MTBF}}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2}. \tag{1}$$

If $X_c > \eta + \text{time-out}$, then

$$\frac{\mathrm{MTBF}}{2} \ge E(T_{MR}) \ge \frac{\mathrm{MTBF}}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2}, \tag{2}$$

$$P_A \ge 1 - \frac{E(T_D) + E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx}{\mathrm{MTBF}}, \tag{3}$$

$$E(T_M) \le \frac{E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx + E(T_D)}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + 1}, \tag{4}$$

$$E(T_{DR}) = E(D) + \eta \cdot E(X_L), \tag{5}$$

$$E(R_{DF}) \ge \Pr(X_c > \eta + \text{time-out}). \tag{6}$$

Details of the proof of the theorem can be found in [29] and Appendix C.2.

When the monitored target is fail-free or crash-stop (footnote 5), for the basic QoS metrics in [5], applying (1)-(4) of Theorem 1, we can easily deduce that

$$E(T_{MR}) \ge \frac{\eta}{p_s^i}, \tag{7}$$

$$E(T_M) \le \frac{1}{p_s^i} \cdot \int_0^{\eta} u^i(x)\,dx \le \frac{\eta}{q_0^i}, \tag{8}$$

$$P_A \ge 1 - \frac{1}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx. \tag{9}$$

Thus, $E(T_{MR})$, $E(T_M)$, and $P_A$ reduce exactly to the QoS analysis results in [5] (see Appendix C.4 for the details of the proof sketch). We can conclude that in terms of failure detection, a fail-free run or a crash-stop run with MTTF tending to infinity is a particular case of a crash-recovery run. If the monitored target's MTTF is not sufficiently long and the target is recoverable, then the impact of its dependability must also be taken into consideration. In the following section, we will introduce how to configure the crash-recovery FDS according to the QoS bounds we have derived from Theorem 1.

3.5 The Configuration of the Crash-Recovery FDS Based on the NFD-S Algorithm

For crash failure detectors, it is crucial to select some suitable input parameters (such as the liveness message intersending interval and the time-out duration) to satisfy a given set of QoS requirements. In this section, we will show how to achieve such steps in a crash-recovery run based on the NFD-S algorithm. In a crash-recovery run, an assumption that the sequence numbers of the heartbeat messages are continually increasing after every recovery of the CR-TS is needed to ensure that the NFD-S algorithm is still valid after each recovery. However, without persistent storage to snapshot the runtime information frequently, when a crash failure occurs, all of the current runtime information might be lost. Thus, continuously increasing the heartbeat sequence number cannot be guaranteed.

Since the NFD-S algorithm assumes that the local clocks of the FDS and the CR-TS are synchronized, we can compare the sending times of heartbeat messages instead of the heartbeat sequence numbers in the algorithm. Then, for a crash-recovery FDS, if the QoS requirements of the FDS are given, the configuration procedure is illustrated in Fig. 6. Initially, we can assume that the QoS of message communication is perfect (e.g., $p_L = 0$, $E(D)$ is small and $E(X_L) = 0$), and the CR-TS is fail-free. As the monitoring procedure continues, the estimation of the QoS of message communication and the dependability metrics of the CR-TS will become more accurate. Thus, the FDS will be reconfigured to adapt to changing input parameters, which help better estimate $\eta$ and time-out.

Fig. 6. The extended FDS configuration based on the NFD-S algorithm in a crash-recovery run.

Then for given QoS requirements, expressed as bounds, the following inequalities need to be satisfied where a superscript $U$ denotes an upper bound and a superscript $L$ denotes a lower bound:

$$T_D \le T_D^U, \quad E(T_{MR}) \ge T_{MR}^L, \quad P_A \ge P_A^L, \quad E(T_M) \le T_M^U, \quad E(T_{DR}) \le T_{DR}^U, \quad E(R_{DF}) \ge R_{DF}^L. \tag{10}$$

From Theorem 1, we can estimate the parameters ($\eta$ and time-out) of the NFD-S algorithm according to the following inequalities:

$$\eta + \text{time-out} \le T_D^U, \quad \eta > 0, \tag{11}$$

$$\frac{\mathrm{MTBF}}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2} \ge T_{MR}^L, \tag{12}$$

$$1 - \frac{E(T_D) + E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx}{\mathrm{MTBF}} \ge P_A^L, \tag{13}$$

$$\frac{E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx + E(T_D)}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + 1} \le T_M^U, \tag{14}$$

5. The precrash duration of the crash-stop process is a long run.
  • 8.
$$E(D) + \eta \cdot E(X_L) \le T_{DR}^U, \tag{15}$$

$$\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L. \tag{16}$$

Then, the task of the NFD-S algorithm is to find the largest $\eta$ satisfying inequalities (12)-(15) and, if such an $\eta$ exists, find the largest time-out that satisfies $\eta + \text{time-out} \le T_D^U$ and $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$. This can be done in the following steps:

Step I. If $T_{MR}^L \le \mathrm{MTBF}$, continue; else the QoS of the FDS cannot be achieved.

Step II. Find the largest $\eta$ that satisfies the inequalities (12)-(15); otherwise an appropriate $\eta$ cannot be found (the QoS cannot be achieved).

Step III. If $\eta > 0$, find the largest $\text{time-out} \le T_D^U - \eta$ such that $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$.

From the above steps, the estimation of $\eta$ and time-out for a crash-recovery FDS based on the NFD-S algorithm amounts to finding a numerical solution for the inequalities (11)-(16). This can be done using binary search similarly to the approach outlined in [5]. But the estimation of the input parameters of the configuration becomes more difficult because parameters, such as $E(X_L)$, MTTF, MTTR, etc., are needed. How to estimate these parameters will be discussed in Section 4.

Note that for this configuration procedure, choosing a different message transmission protocol (e.g., TCP and UDP) can imply different QoS for message communication. Thus, this new configuration can be more adaptive to the message transmission characteristics. For example, if the message loss probability or message delay is high for a certain protocol, then the FDS can switch to a more reliable protocol to achieve a better QoS without increasing the communication frequency or the time-out length.

In the next section, we will discuss how to estimate the QoS of message transmission and the dependability metrics of the CR-TS.

4 PARAMETER ESTIMATION

In the previous section, we explained how to configure a crash-recovery FDS. However, for this procedure, several input parameters are needed (see Fig. 6). In this section, we will show how to estimate these input parameters for an FDS configuration.

4.1 Dependability Metrics Estimation for the CR-TS

From the CR-TS modeling in Section 2, we see that there is an intimate relationship between the MTTF, MTTR, and MTBF and the QoS of the FDS. In order to estimate these dependability metrics, we only need to estimate the crash and recovery time of the CR-TS. We assume that the clocks between the FDS and the CR-TS are synchronized. Let $t_r^1$ be the CR-TS's first start time, then for $m \ge 1$, $t_r^m$ represents the $m$th recovery time; $t_{dr}^m$ represents the $m$th recovery detection time; $t_c^m$ represents the $m$th crash time; and $t_d^m$ represents the $m$th crash detection time (see Fig. 7). $t_r^m$ can be saved to the persistent storage by the CR-TS after a recovery has completed (see [29]). $t_d^m$ can be recorded by the FDS when a failure is detected. $E(T_D)$ can be estimated by using $\frac{1}{n}\sum_{m=1}^{n}(t_d^m - t_c^m)$ when $t_c^m$ is known. Actually, $t_c^m$ can be estimated by saving the latest successful message sending time $\sigma_l$ in the persistent storage. If a crash event happens uniformly distributed on $[\sigma_l, \sigma_l + \eta)$, then after a recovery has completed, the average $t_c^m$ can be estimated by $t_c^m = \sigma_l + \frac{\eta}{2}$. Notice that a smaller message intersending time ($\eta$) can result in a more accurate $t_c^m$ estimate.

Fig. 7. Dependability metrics estimation.

Then, the CR-TS's MTBF, MTTF, MTTR, and the probability that the CR-TS has not crashed up to time $\tau_i + x$ since its last recovery, $\Pr(X_a > \tau_i + x - t_r^m)$, can be estimated as follows:

Estimate MTBF. From the definition of MTBF, we know that MTBF is only related to the CR-TS's recovery times $t_r^m$. These $t_r^m$ can be obtained by adopting the recovery time estimation methods proposed in [29]. Thus, MTBF can be estimated as below:

$$\mathrm{MTBF} = E\bigl(t_r^{m+1} - t_r^m\bigr) = \frac{1}{n}\sum_{m=1}^{n}\bigl(t_r^{m+1} - t_r^m\bigr). \tag{17}$$

Estimate MTTF. MTTF can be estimated by using the recovery time ($t_r^m$) and the crash detection time ($t_d^m$) as $E(t_d^m - t_r^m) = \mathrm{MTTF} + E(T_D)$. Then,

$$\mathrm{MTTF} = E\bigl(t_d^m - t_r^m\bigr) - E(T_D) = \frac{1}{n}\sum_{m=1}^{n}\bigl(t_d^m - t_r^m\bigr) - E(T_D). \tag{18}$$

Estimate MTTR. MTTR can be estimated by using MTBF and MTTF directly for $\mathrm{MTTR} = \mathrm{MTBF} - \mathrm{MTTF}$ or by using $t_r^{m+1}$ and $t_d^m$. Hence, the MTTR can be estimated as $E(t_r^{m+1} - t_d^m) = \mathrm{MTTR} - E(T_D)$. Then,

$$\mathrm{MTTR} = E\bigl(t_r^{m+1} - t_d^m\bigr) + E(T_D) = \frac{1}{n}\sum_{m=1}^{n}\bigl(t_r^{m+1} - t_d^m\bigr) + E(T_D). \tag{19}$$

Estimate $\Pr(X_a > \tau_i + x - t_r^m)$. When the probability density function $f_a(x)$ or the probability distribution function $F_a(x)$ of $X_a$ is known, the probability that the CR-TS does not crash until $\tau_i + x$ after its last recovery can be estimated as

$$\Pr\bigl(X_a > \tau_i + x - t_r^m\bigr) = 1 - \int_0^{\tau_i + x - t_r^m} f_a(x)\,dx = 1 - F_a(x)\Big|_0^{\tau_i + x - t_r^m}. \tag{20}$$

When $x = 0$, we obtain that

$$\Pr\bigl(X_a > \tau_i - t_r^m\bigr) = 1 - \int_0^{\tau_i - t_r^m} f_a(x)\,dx = 1 - F_a(x)\Big|_0^{\tau_i - t_r^m}. \tag{21}$$
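The sample-mean estimators for MTBF, MTTF, and MTTR in (17)-(19), together with the midpoint estimate of a crash time, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the list-based inputs, and the example timestamps are all hypothetical, and the recovery and detection times are assumed to be already collected on synchronized clocks.

```python
from statistics import fmean


def estimate_crash_time(sigma_l, eta):
    """Midpoint estimate of a crash time.

    Assumes the crash is uniformly distributed on [sigma_l, sigma_l + eta),
    where sigma_l is the last successful sending time and eta the
    intersending interval (hypothetical argument names).
    """
    return sigma_l + eta / 2.0


def estimate_dependability(t_r, t_d, e_td):
    """Estimate (MTBF, MTTF, MTTR) from observed timestamps.

    t_r  : n + 1 recovery times, t_r[0] being the first start time
    t_d  : n crash detection times, t_d[m] following t_r[m]
    e_td : estimate of the mean detection time E(T_D)
    """
    n = len(t_d)
    if n == 0 or len(t_r) != n + 1:
        raise ValueError("need n >= 1 detection times and n + 1 recovery times")
    # Eq. (17): mean gap between consecutive recoveries.
    mtbf = fmean([t_r[m + 1] - t_r[m] for m in range(n)])
    # Eq. (18): recovery-to-detection gap, minus the detection delay.
    mttf = fmean([t_d[m] - t_r[m] for m in range(n)]) - e_td
    # Eq. (19): detection-to-next-recovery gap, plus the detection delay.
    mttr = fmean([t_r[m + 1] - t_d[m] for m in range(n)]) + e_td
    return mtbf, mttf, mttr


if __name__ == "__main__":
    recoveries = [0.0, 105.0, 210.0]   # made-up sample data
    detections = [101.0, 206.0]
    print(estimate_dependability(recoveries, detections, e_td=1.0))
```

With these made-up inputs the three estimates satisfy MTBF = MTTF + MTTR, which is the consistency that (17)-(19) imply by construction.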
  • 9. When the probability density function $f_a(x)$ and the probability distribution function $F_a(x)$ of $X_a$ are unknown, an empirical distribution function (EDF) estimation method can be adopted to estimate $f_a(x)$ or $F_a(x)$. In addition, $\Pr(X_a > \tau_i + x - t_r^m)$ is used to estimate the probability that an S-transition happens on $[t_1, t_2)$ (see Proposition 1), which is used to count the average number of mistakes in that period. If we maximize $\Pr(X_a > \tau_i + x - t_r^m)$, then a maximum average number of mistakes on $[t_1, t_2)$ will be obtained. Therefore, we will get stricter QoS bound estimates for $P_A$, $T_M$, and $T_{MR}$. Thus, we can adopt $i = 1$ and $x = 0$ to simplify the estimation of $\Pr(X_a > \tau_i + x - t_r^m)$. Notice that the above method is only for the strict bound estimation rather than an optimized estimation.

4.2 Message Loss Length Estimation

As discussed earlier, the parameters related to message transmission are the average message delay ($E(D)$), the probability of message loss ($p_L$), and the consecutive message loss number $X_L$ (see Fig. 6). Since $p_L$ and $E(D)$ estimation can be done very easily and have been introduced in many other papers such as [5], we do not discuss them here. The additional parameter $X_L$ is also used and captures the bursty message loss behavior. In this section, we propose a basic estimation method for $X_L$, assuming independent message transmissions.

Lemma 3. If each message's transmission and loss behavior is independent, then the mean number of consecutive message losses is $E(X_L) = \frac{p_L(1 - p_L^M)}{1 - p_L} - M p_L^{M+1}$, where $M$ is the maximum number of consecutive messages lost and $p_L$ is the probability that each message is lost during the transmission.

The proof can be found in [29].

Remark 1. When $M \to +\infty$ and $0 \le p_L < 1$, then $p_L^M \to 0$ and $M p_L^M \to 0$, and we obtain that

$$E(X_L) = \frac{p_L}{1 - p_L}.$$

From the above lemma, we see that if each liveness message's transmission is independent, $E(X_L)$ depends only on $p_L$ and can be computed straightforwardly.

4.3 The Impact of Service Dependability Metrics on the QoS of the FDS

A thorough analysis of the impact of the service dependability metrics on the QoS of the FDS has been presented in [16]. Here, we only highlight the main observations.

4.3.1 The Impact on $T_M$ and $T_D$

Generally, for an FDS, the time-out length governs the failure detection speed because the FDS makes its decision at the time-out points. As the time-out length decreases, the FDS will make faster, but less accurate, decisions. As time-out increases, $T_D$ slows down but the FDS can tolerate more message delays or losses, which can improve the detection accuracy to some extent. For a CR-TS, continually increasing the time-out length may mean that failures become undetectable, because its recovery duration could be shorter than $T_D$. Thus, $E(T_M)$ will not increase more than the recovery duration, MTTR (footnote 6).

4.3.2 The Impact on $T_{MR}$

For a fail-free run, Chen et al. showed that when the time-out length increases linearly, $T_{MR}$ increases exponentially (Fig. 12 in [5]). This implies that for such systems, an arbitrary level of $T_{MR}$ can be achieved. Roughly speaking, in a fail-free run, when time-out increases to $n\eta$ ($n \in \mathbb{Z}^+$ and $n \ge 1$), the FDS can tolerate around $n$ consecutive communication message losses. The mistake recurrence which is caused by message latency or loss decreases rapidly (by $\frac{1}{P^n}$), where

$$P = p_L + (1 - p_L) \cdot \Pr(\text{time-out} < \text{Delay} < +\infty).$$

For a crash-recovery run, mistakes may occur on both crash and recovery (see Fig. 3b) since message transmission latency will delay the detection of the CR-TS's state change. These mistakes are inevitable. This means that the upper bound on $T_{MR}$ is governed by MTTF and MTTR (see inequalities (1)-(2) in Theorem 1). Even if all message delays and losses can be tolerated, $E(T_{MR})$ cannot increase to an arbitrary level when MTTF is not $+\infty$ and MTTR is not $+\infty$ or 0. If failure is detectable, $E(T_{MR})$ cannot exceed $\frac{\mathrm{MTBF}}{2}$ since for each MTBF duration, there will be at least two mistakes, corresponding to the two changes of state in the CR-TS. When failure is undetectable, mistakes may happen at the CR-TS's crash or recovery time. Then, $E(T_{MR})$ cannot exceed MTBF. Thus, after $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, the overall $E(T_{MR})$ approaches MTBF gradually.

4.3.3 The Impact on $P_A$

$P_A$, the proportion of time that the FDS is not in a mistake state, will depend on the ratio of $E(T_M)$ and $E(T_{MR})$ ($P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$ in [5]). If a service is fail-free, $P_A$ can rapidly approach 1. But in a crash-recovery run, when the time-out length is increased, both $E(T_M)$ and $E(T_{MR})$ will eventually reach their upper bounds, meaning that $P_A$ will also be bounded. Generally, as time-out increases, fewer failures will be detected and the mistakes caused by failures (see $T_M^3$ in Fig. 4c) will have more impact on $E(T_M)$; thus, $E(T_M)$ will approach MTTR, since the maximum length of $E(T_M^3)$ is MTTR. As the time-out length becomes larger with respect to MTTR, more failures become undetectable. Thus, $E(T_M)$ will gradually approach MTTR.

The speed of increase of $T_{MR}$ will depend on when $T_{MR}$ reaches $\frac{\mathrm{MTBF}}{2}$. Before this bound is reached, as the time-out length increases, $T_{MR}$ can increase exponentially fast, as more message losses can be tolerated. After $T_{MR}$ exceeds $\frac{\mathrm{MTBF}}{2}$, it can only increase gradually to MTBF, as time-out increases and more and more crashes become undetectable. Thus, when $T_{MR}$ reaches its upper bound but $T_M$ has not yet reached its upper bound, $P_A$ will decrease as the time-out length increases. When both $T_M$ and $T_{MR}$ reach their upper bound, $P_A$ will approach $\frac{\mathrm{MTTF}}{\mathrm{MTBF}}$, which is equal to the availability of the CR-TS.

6. Assuming that $p_L$ and $D$ are not very large and $\mathrm{MTTR} \gg \eta$.

5 SIMULATION EVALUATION AND ANALYSIS

In previous sections, we have shown how to calculate the parameters of the FDS with a given set of QoS requirements and analyzed the QoS bounds of the crash-recovery FDS based on the NFD-S algorithm. In this section, we introduce
  • 10. our analytical and simulation results, which verify our previous analysis work.

5.1 Evaluation of the Crash-Recovery FDS Based on the NFD-S Algorithm

For the simulation studies, we fix the heartbeat interval at $\eta = 1$ and gradually increase the time-out length.

The message transmission parameters are $p_L = 0.01$ and $E(D) = 0.02$, and the delay is assumed to be exponentially distributed. These settings are similar to those used in the simulations in [5].

The CR-TS is defined as a recoverable process with various values of MTTF and MTTR (exponentially distributed). We choose the exponential distribution for the following reasons. First, exponential failures are widely adopted for reliability analysis in many practical systems; second, unlike some heavy tailed distributions such as the log-normal distribution, crash and recovery with an exponential distribution will occur with reasonable interarrival times, avoiding the CR-TS behaving like a fail-free or crash-stop process.

5.1.1 Analysis for the Basic QoS Metrics

We implemented the NFD-S algorithm presented in [5] to evaluate the QoS of the FDS and compared the results with the analytical results derived from Theorem 1. Figs. 8, 9, and 10 compare the QoS of the FDS based on the NFD-S algorithm (simulation results) and the corresponding analytical results from different perspectives. From these three figures, we have the following observations.

Fig. 8. The NFD-S algorithm: $E(T_M)$.

Fig. 9. The NFD-S algorithm: $E(T_{MR})$.

Fig. 8 presents the $E(T_M)$ of the FDS derived from simulation and analytical results for two values of MTTR, 5 and 50, with corresponding values of MTTF, 100 and 1,000. The simulation result for $\mathrm{MTTR} = 5$ shows that as the time-out length increases, $E(T_M)$ will tend to MTTR, i.e., $E(T_M)$ is bounded by MTTR. With the exponentially distributed MTTR used in the simulation, the proportion of the detectable crashes will decrease more gradually. Thus, $E(T_M)$ approaches MTTR more slowly than in the analytical results.

Simulation results for $\mathrm{MTTR} = 50$ confirm that if MTTR becomes large, as the time-out length increases, $E(T_M)$ can also grow large, since the bound is now large. Note that in the graph, we see only the linear part rather than the complete characteristics. If the time-out length was increased to 200, $E(T_M)$ would approach $\mathrm{MTTR} = 50$ closely.

An interesting phenomenon is visible in the graph as time-out increases from 0.5 to 1.1: $E(T_M)$ decreases (or increases more slowly), and then increases again. We analyze this phenomenon in detail as follows. Recall that for a given length of time-out, there are four aspects which have an impact on $T_M$: the message delay and loss, and the CR-TS's crash and recovery (see Fig. 4). $T_M$ caused by a message delay is governed by the ratio between $E(D)$ and $T_D$. For the same $E(D)$, as time-out increases, more delayed messages can be tolerated. Thus, $T_M$ caused by a message delay ($T_M^1$) will decrease and occur less frequently. $T_M$ caused by a message loss ($T_M^2$) is related to $\eta$, $p_L$, $E(D)$, and the time-out length. For constant message communication QoS (i.e., fixed $p_L$ and $E(D)$), $T_M$ caused by message loss is governed by the ratio between $\eta$ and $T_D$. Since more message losses can be tolerated as the time-out length increases, the average duration of $T_M^2$ will decrease, and $T_M^2$ will occur less frequently. $T_M$ caused by a crash ($T_M^3$) is mainly governed by $T_D$ (see Fig. 4c), because if a crash occurs, a false positive mistake will last until the time-out time or until the CR-TS recovers. For detectable crashes, as the time-out length increases, $T_M^3$ will increase. $T_M$ caused by a recovery ($T_M^4$) is mainly governed by $p_L$ and $E(D)$ (see Fig. 4d), since after the CR-TS's recovery, a recovery can be detected as soon as a valid liveness message is received.

From the above analysis, we know that for the same $\eta$, $p_L$, $E(D)$, MTTF, and MTTR, when the time-out length increases, the average mistake duration caused by message delays and message losses will decrease ($T_M^1 \downarrow$ and $T_M^2 \downarrow$), the average mistake duration caused by the CR-TS's crash will increase ($T_M^3 \uparrow$), and the average mistake caused by the CR-TS's recovery from a detectable crash is unaffected ($T_M^4$) but fewer crashes and recoveries will be detected. In the simulation with $p_L = 0.01$ and $\mathrm{MTBF} = 105$, when time-out is small, $T_M^2$ and $T_M^3$ occur with similar frequency. When time-out increases from 0.5 to 1.0 (the FDS can tolerate zero message loss and most message delays), $E(T_M)$ increases slowly because the impacts of $T_M^1 \downarrow$, $T_M^2 \downarrow$, $T_M^3 \uparrow$, and $T_M^4$ counterbalance. Overall, $E(T_M)$ is stable within this period. As the time-out length increases, $T_M^2$ will occur less frequently. But $T_M^3$ occurs every MTBF period. Thus, as
  • 11. the time-out increases, $T_M^3$ will dominate and $E(T_M)$ will increase gradually.

Fig. 10. The NFD-S algorithms: $P_A$.

In the simulation with $p_L = 0.01$ and $\mathrm{MTBF} = 1{,}050$, when the time-out length is small, $T_M^2$ will have more impact than $T_M^3$, because $T_M^2$ occurs more frequently than the crash and recovery. Therefore, as the time-out length increases, $T_M^2$ becomes shorter on average and occurs less frequently; $E(T_M)$ will increase more slowly or even decrease since more message losses are tolerated. But if time-out continues to increase, $T_M^3$ will become dominant and $E(T_M)$ will then increase gradually.

Overall, Fig. 8 shows that in a crash-recovery run, $E(T_M)$ exhibits quite different characteristics from a fail-free or crash-stop run. If the message delay and the probability of message loss are not very large, $E(T_M)$ is bounded by MTTR. From Fig. 8, we also observe that $E(T_M)$ can possibly be decreased for some time-out values. Unlike in a fail-free run, continually increasing the time-out length cannot achieve a better $E(T_M)$.

Fig. 9 presents $E(T_{MR})$ of the FDS derived analytically and from simulation with exponential MTTF and MTTR as above. We can see that with constant time-out length, as MTBF increases, $E(T_{MR})$ also increases. This implies that $E(T_{MR})$ is greatly impacted by the dependability of the CR-TS.

We can also see that for both these simulation cases, $E(T_{MR})$ initially increases exponentially fast but after $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, the rate of increase is reduced. For the CR-TS with exponential MTTR, $E(T_{MR})$ will increase gradually and approach MTBF, until all crashes become undetectable. This is because for nondeterministic MTTR, as the time-out length increases, the proportion of the detectable crashes decreases. Therefore, for the detectable crashes, $T_{MR} \le \frac{\mathrm{MTBF}}{2}$, and for the undetectable crashes, $T_{MR} \le \mathrm{MTBF}$. Thus, $E(T_{MR})$ will increase gradually between $[\frac{\mathrm{MTBF}}{2}, \mathrm{MTBF}]$, and finally stabilize at MTBF. All of these results match our analysis in Section 4.3 well and indicate that if a CR-TS is not fail-free ($\mathrm{MTTF} \to \infty$) or crash-stop ($\mathrm{MTTR} \to \infty$), $E(T_{MR})$ will be bounded by MTBF when failures are undetectable and by $\frac{\mathrm{MTBF}}{2}$ when failures are detectable.

Fig. 10 considers $P_A$ under the same communication QoS. We see that when MTBF increases, $P_A$ will be improved. This is because $E(T_{MR})$ also increases. Thus, from the equation $P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$, we know that for the same time-out length, when MTBF increases, a better $P_A$ can be achieved.

However, from Fig. 10, we can also see that as the time-out length increases, $P_A$ is not always increasing as in a fail-free or crash-stop run. Continually increasing time-out could decrease $P_A$. This is because $T_{MR}$ is bounded by MTBF or $\frac{\mathrm{MTBF}}{2}$ as discussed above. After $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, it increases slowly rather than exponentially fast but $E(T_M)$ increases linearly and faster than $E(T_{MR})$. Thus, $P_A$ decreases, and finally, $P_A$ will approach $\frac{\mathrm{MTTF}}{\mathrm{MTBF}}$, which is equal to the availability of the CR-TS.

The above results indicate that for a highly available CR-TS, a reasonable QoS for the FDS can be achieved even if the FDS always trusts the CR-TS, when only the QoS metrics defined in [5] are considered. This is especially true for a highly available and highly consistent but not highly reliable CR-TS. However, the completeness property of the FDS will not be satisfied. Consequently, these simulation results demonstrate the necessity of the additional QoS metrics we proposed in Section 3.3 to measure the completeness aspects and the speed of the recovery detection of a crash-recovery FDS. Furthermore, these results also demonstrate the necessity of adopting the recovery detection protocols in [29], which can improve the proportion of detected failures without reducing other QoS aspects.

In Figs. 8, 9, and 10, we can also observe how the dependability of a CR-TS can influence the QoS of the FDS. Particularly, for a highly available but not highly reliable CR-TS, the dependability of the CR-TS can have more impact than the performance of the algorithm and the QoS of message transmission. In such situations, the dependability of the CR-TS must be taken into account for the FDS design and implementation.

From Figs. 8, 9, and 10, we can see that $P_A$, $E(T_{MR})$, and $E(T_M)$ have bounds. Continually increasing the time-out length might not be a reasonable way to achieve better $P_A$, $E(T_{MR})$, and $E(T_M)$. A potential trade-off exists between the QoS metrics. For instance, for the NFD-S algorithm, $\text{time-out} \in (1, 1.1)$ ($\text{time-out} + \eta \in [2, 2.1]$) might achieve the best overall QoS.

In addition, $E(T_M)$ in a crash-recovery run exhibits quite different characteristics compared with a fail-free or crash-stop run. This is because in a crash-recovery run, the mistakes caused by the crash and recovery are taken into consideration, which means continually increasing the time-out length will not always decrease $E(T_M)$. It may have the effect of increasing false positive mistakes ($T_M^3$, see Fig. 4). As the time-out length increases, mistakes caused by message delays and losses will occur less frequently, and false positive mistakes (which were not considered previously) will start to dominate the QoS of the FDS.

From Figs. 8, 9, and 10, we can observe that the simulation results of $E(T_M)$ are smaller than the analytical results, and the simulation results of $E(T_{MR})$ and $P_A$ are larger than the analytical results, which indicates that the bound analysis of the basic QoS metrics in Theorem 1 is valid and the simulation results satisfy the QoS requirements according to the analysis. We can also observe a gap between the analytical and simulation results. This is caused by the overestimation or underestimation of some values within the analytical results. $E(T_M)$ is overestimated
  • 12. by using the total mistake duration over the underestimated average number of mistakes that might occur within a crash-recovery period. Thus, the analytical results of $E(T_M)$ will be larger than the simulation results. Similarly, $E(T_{MR})$ is underestimated by using the observation duration (MTBF) over an overestimation of the number of mistakes that might occur within a period. For instance, the number of mistakes within the period is estimated as $\left\lceil \frac{E(D)}{\eta} \right\rceil + 1$, which is an upper bound rather than the average number. It follows that $E(T_{MR})$ of the analytical results will be smaller than the simulation results. Finally, $P_A$ is underestimated by using one minus an overestimated total mistake duration over the observation period (MTBF). Thus, $P_A$ of the analytical results will be smaller than the simulation results. All of these results satisfy the QoS requirements $E(T_M) \le T_M^U$, $P_A \ge P_A^L$, and $E(T_{MR}) \ge T_{MR}^L$. In addition, according to the NFD-S algorithm, the failure detection time $T_D$ is bounded by $\eta + \text{time-out}$ regardless of the correctness of the detection; thus, $T_D \le T_D^U$ must be satisfied.

Fig. 11. The NFD-S algorithms: $E(R_{DF})$.

From Figs. 8, 9, and 10, we can also see that there are some gaps between the analytical results and the simulation results. This is mainly caused by the overestimating and underestimating method we adopted to restrict the failure detector's QoS bound as discussed above. In addition, we use MTBF, MTTF, and MTTR, which are the expected values rather than the real values for each failure and recovery. In the simulation, the results are calculated according to the randomly generated failure time and recovery time, which represent the real time to failure and recovery, and these random variables will deviate from the expected values. Thus, there will be some discrepancies between the simulation and analytical results. These gaps show that there is still room to improve the accuracy of the model and it would be interesting to investigate this point further in the future.

5.1.2 Analysis for the Extended QoS Metrics

We also plot the simulation and analytical results for the failure detection proportion ($R_{DF}$) defined in Section 3.3 to demonstrate the impact of the failure and recovery events on this metric.

Fig. 11 shows the proportion of failures detected by the FDS, for different dependability characteristics of the CR-TS, based on both simulation and analytical results. As the time-out length increases, $E(R_{DF})$ of the NFD-S algorithm decreases. When MTTR becomes shorter, $E(R_{DF})$ will decrease faster. This is because the smaller MTTR is, the sooner $\text{time-out} + \eta$ crosses MTTR ($T_D^U > \mathrm{MTTR}$). Therefore, more crashes remain undetected when the NFD-S algorithm is adopted. In Fig. 11, we can also see that the simulation results of $E(R_{DF})$ are larger than the analytical results, which means that the bound analysis of $E(R_{DF})$ is valid and the simulation results satisfy the QoS requirements in terms of $R_{DF}^L$. However, since most existing failure detection algorithms increase the time-out length to tolerate more message losses and delays, if a CR-TS is recoverable and recovers fast, it could be difficult for these algorithms to achieve the QoS in [5] and satisfy the completeness property at the same time. In such a situation, the recovery detection protocol introduced in [29] can be adopted, which can solve this problem reasonably well.

6 CONCLUSION

In this paper, the crash-recovery target and its failure detector are modeled as stochastic processes. We redefined previously proposed QoS metrics to be applicable to crash-recovery failure detection and introduced some new metrics to measure the recovery detection speed and the completeness property of a failure detector. We also discussed the impact of the monitored target's crash-recovery behavior on each QoS metric and showed that if a failure detector's parameters are to be accurately estimated, these dependability characteristics must be taken into account. Thus, we showed how to configure the failure detector to satisfy a given set of requirements based on the dependability characteristics in addition to the QoS of message transmission (see Fig. 12). This was based on the NFD-S algorithm [5]. Our analysis shows that the QoS analysis in [5] is a particular case of a crash-recovery run. Furthermore, we discussed how to estimate the input parameters for the algorithm.

Fig. 12. The QoS relationship between communication, CR-TS, and FDS.

Finally, the plotted simulation and analytical results demonstrate that our QoS bound analysis is valid and can be used as an approximate solution for the computation of the failure detector's parameters or the QoS bounds estimation if the failure detector's parameters are given. Our simulation results confirm that when a failure detector is designed and implemented, the dependability of the crash-recovery target needs to be considered in order to achieve more accurate parameter estimation. Furthermore, if the recovery of the monitored target needs to be detected, further enhancement of the existing algorithms is needed.
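As a small illustration of estimating QoS bounds when the detector's parameters are given, the sketch below evaluates the two extended metrics from relations stated earlier in the paper: the mean consecutive-loss count of Lemma 3 and Remark 1, the recovery detection time $E(T_{DR}) = E(D) + \eta \cdot E(X_L)$, and the detected-failure lower bound $\Pr(X_c > \eta + \text{time-out})$. It is a sketch under assumptions, not the authors' code: the function names are hypothetical, and the crash duration $X_c$ is assumed to be exponentially distributed with mean MTTR, as in the simulation setup.

```python
import math


def expected_loss_run(p_l, max_run=None):
    """Mean number of consecutive lost messages, E(X_L).

    With max_run=None this is the limiting form p_L / (1 - p_L);
    otherwise it is the closed form with a maximum run length M.
    """
    if not 0.0 <= p_l < 1.0:
        raise ValueError("p_l must lie in [0, 1)")
    if max_run is None:
        return p_l / (1.0 - p_l)
    m = max_run
    return p_l * (1.0 - p_l ** m) / (1.0 - p_l) - m * p_l ** (m + 1)


def recovery_detection_time(e_d, eta, p_l):
    """E(T_DR) = E(D) + eta * E(X_L), assuming independent losses."""
    return e_d + eta * expected_loss_run(p_l)


def detected_failure_lower_bound(eta, timeout, mttr):
    """Lower bound Pr(X_c > eta + time-out) on E(R_DF).

    Assumes X_c is exponential with mean mttr (an assumption of this sketch).
    """
    return math.exp(-(eta + timeout) / mttr)


if __name__ == "__main__":
    # Parameter values taken from the simulation setup: eta = 1,
    # p_L = 0.01, E(D) = 0.02, MTTR = 5; time-out = 1 is an arbitrary choice.
    print(recovery_detection_time(e_d=0.02, eta=1.0, p_l=0.01))
    print(detected_failure_lower_bound(eta=1.0, timeout=1.0, mttr=5.0))
```

Under the exponential assumption the lower bound falls as time-out grows and falls faster for smaller MTTR, which is the qualitative behavior described for Fig. 11.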
  • 13. ACKNOWLEDGMENTS

The authors would like to thank Isi Mitrani, Mahesh Marina, and the anonymous reviewers for their comments and suggestions which helped improve the quality of this paper. Hillston's work is supported in part by the SENSORIA project, an EU FET-IST GC 2 project (IST-3-016004-IP-09).

REFERENCES

[1] J. Laprie, A. Avizienis, and H. Kopetz, Dependability: Basic Concepts and Terminology. Springer-Verlag, 1992.
[2] L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem," ACM Trans. Programming Languages and Systems, vol. 4, no. 3, pp. 382-401, 1982.
[3] M.J. Fischer, N.A. Lynch, and M.S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," J. ACM, vol. 32, no. 2, pp. 374-382, Apr. 1985.
[4] T.D. Chandra and S. Toueg, "Unreliable Failure Detectors for Asynchronous Systems (Preliminary Version)," Proc. 10th ACM Symp. Principles of Distributed Computing (PODC '91), pp. 325-340, 1991.
[5] W. Chen, S. Toueg, and M.K. Aguilera, "On the Quality of Service of Failure Detectors," IEEE Trans. Computers, vol. 51, no. 5, pp. 561-580, May 2002.
[6] L. Falai and A. Bondavalli, "Experimental Evaluation of the QoS of Failure Detectors on Wide Area Network," Proc. Int'l Conf. Dependable Systems and Networks, pp. 624-633, July 2005.
[7] N. Hayashibara, A. Cherif, and T. Katayama, "Failure Detectors for Large-Scale Distributed Systems," Proc. 21st IEEE Symp. Reliable Distributed Systems, pp. 404-409, 2002.
[8] N. Hayashibara, X. Defago, R. Yared, and T. Katayama, "The Accrual Failure Detector," Proc. 23rd IEEE Int'l Symp. Reliable Distributed Systems, pp. 66-78, 2004.
[9] R.C. Nunes and I. Jansch-Porto, "QoS of Timeout-Based Self-Tuned Failure Detectors: The Effects of the Communication Delay Predictor and the Safety Margin," Proc. Int'l Conf. Dependable Systems and Networks, pp. 753-761, 2004.
[10] I. Sotoma and E.R.M. Madeira, "A Markov Model for Quality of Service of Failure Detectors in the Pressure of Loss Bursts," Proc. 18th Int'l Conf. Advanced Information Networking and Applications, vol. 2, pp. 62-67, 2004.
[11] R. Guerraoui and L. Rodrigues, Introduction to Reliable Distributed Programming. Springer, 2006.
[12] E.M. Dashofy, A. van der Hoek, and R.N. Taylor, "Towards Architecture-Based Self-Healing Systems," Proc. First Workshop Self-Healing Systems (WOSS '02), pp. 21-26, 2002.
[13] M.E. Shin and D. Cooke, "Connector-Based Self-Healing Mechanism for Components of a Reliable System," Proc. 2005 Workshop Design and Evolution of Autonomic Application Software, pp. 1-7, 2005.
[14] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.
[15] D. Manivannan and M. Singhal, "A Low-Overhead Recovery Technique Using Quasi Synchronous Checkpointing," Proc. IEEE Int'l Conf. Distributed Computing Systems, pp. 100-107, 1996.
[16] T.
[22] R. Boichat and R. Guerraoui, "Reliable and Total Order Broadcast in the Crash Recovery Model," PhD thesis, Ecole Polytechnique Fed., 2001.
[23] M.K. Aguilera, W. Chen, and S. Toueg, "Failure Detection and Consensus in the Crash-Recovery Model," Distributed Computing, vol. 13, no. 2, pp. 99-125, Apr. 2000.
[24] D. Dolev, R. Friedman, I. Keidar, and D. Malkhi, "Failure Detectors in Omission Failure Environments," Technical Report 96-1608, Dept. of Computer Science, Cornell Univ., 1996.
[25] M. Hurfin, A. Mostefaoui, and M. Raynal, "Consensus in Asynchronous Systems Where Processes Can Crash and Recover," Proc. 17th IEEE Symp. Reliable Distributed Systems, pp. 280-286, Oct. 1998.
[26] R. Oliveira, R. Guerraoui, and A. Schiper, "Consensus in the Crash-Recover Model," Technical Report 97-239, Dept. d'Informatique, EPFL, 1997.
[27] J.C. Knight and E.A. Strunk, "Software Dependability," Proc. Int'l Conf. Dependable Systems and Networks, Tutorials, June 2006.
[28] R. Guerraoui, R. Oliveira, and A. Schiper, "Stubborn Communication Channels," technical report, Dept. d'Informatique, EPFL, 1998.
[29] T. Ma, "Qos of Crash-Recovery Failure Detection," PhD dissertation, The Univ. of Edinburgh, Mar. 2007.

Tiejun Ma received the BEng degree in automation and the BEng degree in computer science from Dalian University of Technology, China, and the MSc and PhD degrees from the Laboratory for Foundations of Computer Science, School of Informatics, The University of Edinburgh, in 2003 and 2007, respectively. He is a postdoc research associate of the Large-Scale Distributed System Group, Department of Computing, Imperial College. Before moving to the Imperial College, he was a staff member at the Oxford e-Research Centre, Oxford University, United Kingdom. His principal research interests are large-scale distributed systems, dependable computing, fault tolerance, performance evaluation, and grid computing.

Jane Hillston received the BA degree in mathematics from the University of York, United Kingdom, the MSc degree in mathematics from Lehigh University, and the PhD degree in computer science from The University of Edinburgh in 1994. She is a professor of quantitative modeling in the School of Informatics at The University of Edinburgh, and holds an Advanced Research Fellowship from the Engineering and Physical Sciences Research Council. She is a fellow of the Royal Society of Edinburgh. After a brief period working in industry, she joined the Department of Computer Science at The University of Edinburgh, as a research assistant, in 1989. Her work on the stochastic process algebra PEPA was recognized by the British Computer Society in 2004, which awarded her the first Roger Needham Award. Currently, her principal research interests are in the use of stochastic process algebras to model and analyze computer systems and biological systems and the development of efficient solution techniques for such models.
Ma, J. Hillston, and S. Anderson, “Evaluation of the QoS of Stuart Anderson is a senior lecturer in the Crash-Recovery Failure Detection,” Proc. ACM Symp. Applied School of Informatics at the University of Computing (DADS Track), 2007. Edinburgh. His principal research interests are[17] T. Ma, J. Hillston, and S. Anderson, “On the Quality of Service of in the dependability of sociotechnical systems, in Crash-Recovery Failure Detectors,” Proc. Int’l Conf. Dependable particular, the analysis of the role of risk and Systems and Networks, June 2007. trust in such systems.[18] M. Bertier, O. Marin, and P. Sens, “Implementation and Performance Evaluation of an Adaptable Failure Detector,” Proc. Int’l Conf. Dependable Systems and Networks, pp. 354-363, 2002.[19] I. Gupta, T.D. Chandra, and G.S. Goldszmidt, “On Scalable and Efficient Distributed Failure Detectors,” Proc. 12th ACM Symp. . For more information on this or any other computing topic, Principles of Distributed Computing, pp. 170-179, 2001. please visit our Digital Library at[20] R.V. Renesse, Y. Minsky, and M. Hayden, “A Gossip-Style Failure Detection Service,” technical report, Cornell Univ., 1998.[21] P. Stelling, I. Foster, C. Kesselman, C.A. Lee, and G. von Laszewski, “A Fault Detection Service for Wide Area Distributed Computations,” Cluster Computing, vol. 2, no. 2, pp. 117-128, 1999.