Fault tolerant systems

Fault tolerance in the presence of partial diagnostic coverage.


  1. 1. FAULT–TOLERANT SYSTEM RELIABILITY IN THE PRESENCE OF IMPERFECT DIAGNOSTIC COVERAGE. By Glen B. Alleman, Irvine, California, Copyright © 1980. Submitted in Partial Fulfillment of the Masters in Systems Management (MSSM), University of Southern California, Los Angeles, California, June 1980. Revised and updated, Niwot, Colorado, Copyright © 1996, 2000, 2014.
  3. 3. FAULT–TOLERANT SYSTEM RELIABILITY IN THE PRESENCE OF IMPERFECT DIAGNOSTIC COVERAGE Glen B. Alleman The deployment of computer systems for the control of mission–critical processes has become the norm in many industrial and commercial markets. The reliability of these systems is usually understood in terms of the Mean Time to Failure. The design and analysis of high–reliability systems is now a mature science. Starting with fault–tolerant central office switches (ESS4), dual redundant and n–way redundant systems are now available in a variety of application domains. The technologies of microprocessor–based industrial controls and redundant central processor systems create the opportunity to build fault–tolerant computing systems on a much smaller scale than previously found in the commercial marketplace. The diagnostic facilities utilized in a modern Fault–Tolerant Computer System attempt to detect fault conditions present in the hardware and embedded software. Coverage is the figure of merit describing the effectiveness of the diagnostic system. This thesis examines the effects of less than perfect diagnostic coverage on system reliability. The mathematical background for analyzing the coverage factor of fault–tolerant systems is presented in detail, as well as specific examples of practical systems and their relative reliability measures. In a complex system, malfunction and even total nonfunction may not be detected for long periods, if ever. — John Gall
  4. 4. i TABLE OF CONTENTS INTRODUCTION ..................................................................................................... 10 Fault Tolerant System Definitions ....................................................................... 10 Fault–Tolerant System Functions ........................................................................ 11 Overview of This Thesis....................................................................................11 RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS........................... 13 Deterministic Models.............................................................................................. 14 Probabilistic Models ...........................................................................................14 Exponential and Poisson Relationships..........................................................15 Reliability Availability and Failure Density Functions..................................20 Mean Time to Failure.........................................................................................23 Mean Time to Repair..........................................................................................27 Mean Time Between Failure.............................................................................27 Mean Time to First Failure................................................................................27 General Availability Analysis.............................................................................31 Instantaneous Availability...........................................................................33 Limiting Availability.....................................................................................34 SYSTEM RELIABILITY........................................................................................... 
37 Series Systems......................................................................................................37 Parallel Systems....................................................................................................39 M–of–N Systems.................................................................................................39 Selecting the Proper Evaluation Parameters..................................................40 IMPERFECT FAULT COVERAGE AND RELIABILITY ............................ 42 Redundant System with Imperfect Coverage ................................................42 Generalized Imperfect Coverage .....................................................................44 MARKOV MODELS OF FAULT–TOLERANT SYSTEMS.......................... 49 Solving the Markov Matrix................................................................................52 Chapman–Kolmogorov Equations..........................................................52 Markov Matrix Notation....................................................................................55 Laplace Transform Techniques........................................................................56 Modeling a Duplex System.................................................................................... 58 Modeling a Triple–Redundant System ................................................................ 64 Modeling a Parallel System with Imperfect Coverage...................................... 68 Modeling A TMR System with Imperfect Coverage........................................ 74 Modeling A Generalized TMR System ............................................................... 76 Laplace Transform Solution to Systems of Equations.................................76 Specific Solution to the Generalized System .................................................78 PRACTICAL EFFECTS OF PARTIAL COVERAGE...................................... 
85 Determining Coverage Factors............................................................................. 85
  5. 5. ii Coverage Measurement Statistics..............................................................86 Coverage Factor Measurement Assumptions.........................................86 Coverage Measurement Sampling Method.............................................87 Normal Population Statistics .....................................................................87 Sample Size Computation ..........................................................................88 General Confidence Intervals....................................................................89 Proportion Statistics ....................................................................................90 Confidence Interval Estimate of the Proportion...................................91 Unknown Population Proportion.............................................................91 Clopper–Pearson Estimation.......................................................................92 Practical Sample Estimates.........................................................................93 Time Dependent Aspects of Fault Coverage Measurement................94 Common Cause Failure Effects............................................................................ 
95 Square Root Bounding Problem......................................................................97 Beta Factor Model...............................................................................................97 Multinomial Failure Rate (Shock Model)........................................................97 Binomial Failure Rate Model ............................................................................98 Multi–Dependent Failure Fraction Model .....................................................98 Basic Parameter Model.......................................................................................99 Multiple Greek Letter Model ............................................................................99 Common Load Model..................................................................................... 100 Nonidentical Components Model ................................................................ 100 Practical Example of Common Cause Failure Analysis............................ 100 Common Cause Software Reliability............................................................ 102 Software Reliability Concepts................................................................. 103 Software Reliability and Fail–Safe Operations .................................... 109 PARTIAL FAULT COVERAGE SUMMARY .................................................. 111 Effects of Coverage .............................................................................................. 112 REMAINING QUESTIONS................................................................................. 113 Realistic Probability Distributions...................................................................... 113 Multiple Failure Distributions........................................................................ 114 Weibull Distribution....................................................................................... 
116 Periodic Maintenance ........................................................................................... 118 Periodic Maintenance of Repairable Systems ............................................. 119 Reliability Improvement for a TMR System............................................... 122 CONCLUSIONS ....................................................................................................... 124 MARKOV CHAINS................................................................................................. 125 Definition A.1................................................................................................... 125 Definition A.2................................................................................................... 125 Definition A.3................................................................................................... 126 Theorem A.1..................................................................................................... 126 Proof of Theorem A.1 .................................................................................... 126
  6. 6. iii Lemma A.1........................................................................................................ 128 Theorem A.2..................................................................................................... 128 Proof of Theorem A.2 .................................................................................... 128 Theorem A.3..................................................................................................... 130 Proof of Theorem A.3 .................................................................................... 130 SOLUTIONS TO LINEAR SYSTEMS............................................................... 133 Theorem B.1 ..................................................................................................... 135 Proof of Theorem B.1..................................................................................... 136 PROBABILITY GENERATING FUNCTIONS.............................................. 139 Definition C.1................................................................................................... 139 Theorem C.1..................................................................................................... 140 Proof of Theorem C.1..................................................................................... 140 POISSON PROCESSES.......................................................................................... 142 Definition D.1................................................................................................... 143 Definition D.2................................................................................................... 145 Definition D.3................................................................................................... 145 Definition D.4................................................................................................... 
148 Definition D.5................................................................................... 148 Definition D.6................................................................................... 149 Theorem D.1..................................................................................... 151 RENEWAL THEORY............................................................................................. 152 Definition E.1................................................................................... 153 Theorem E.1..................................................................................... 154 Proof of Theorem E.1..................................................................................... 154 Theorem E.2..................................................................................... 155 Proof of Theorem E.2..................................................................................... 155 LAPLACE TRANSFORM GENERALIZED SOLUTION METHODS... 163 Definition F.1.................................................................................... 164 Definition F.2.................................................................................... 165 Definition F.3.................................................................................... 165 Definition F.4.................................................................................... 166
  7. 7. iv LIST OF FIGURES Number Page Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be used to develop a set of time dependent metrics used to evaluate various configurations..............................................................................................13 Figure 2 – Assumptions regarding the behavior of a random process that generates events following the Poisson probability distribution function. .....................................................................................................................16 Figure 3 – State Transition probabilities as a function of time in the Continuous–Time Markov chain that is subject to the constraints of the Chapman–Kolmogorov equation. ............................................................................................51 Figure 4 – Definition of the exponential order of a function............................................57 Figure 5 – The state transition diagram for a Parallel Redundant system with repair. State {2} represents the fault free operation mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode, the absorption state.........................................................................................................59 Figure 6 – The transition diagram for a Triple Modular Redundant system with repair. State {2} represents the fault free (TMR) operation mode, State {1} represents a single fault (Duplex) operation mode with a return path to the fault free mode, and State {0} represents the system failure mode, the absorbing state. ......................................................................................66 Figure 7 – The transition diagram for a Parallel Redundant system with repair and imperfect fault coverage. State {2} represents the fault free mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode. State {0} can be reached from State {2} through an uncovered fault, which causes the system to fail without the intermediate State {1} mode. ..........................................................................................................................69 Figure 8 – The state transition diagram for a Triple Modular Redundant system with repair and imperfect fault coverage. State {3} represents the fault free mode, State {2} represents the single fault (Duplex) mode, State {1} represents the two–fault (Simplex) mode, and State {0} represents the system failure mode...........................................................................................74
  8. 8. v Figure 9 – The state transition diagram for a Generalized Triple Modular Redundant system with repair and imperfect fault detection coverage. The system initially operates in the fault free state {0}. A fault in any module results in the transition to one of the states {1, …, N}. A second fault while in states {1, …, N} results in the system failure state {N + 1}.........................78 Figure 10 – Sample size requirement for a specified estimate as tabulated by Clopper and Pearson................................................................................................93 Figure 11 – Common Cause Failure mode guide figures for electronic programmable systems [HSE87]. These are ratios of non–CCF to CCF for various system configurations. CCFs are defined as non–random faults that are designed in or experienced through environmental damage to the system. Other sources [SINT88], [SINT89] provide different figures........................................................................................................................102 Figure 12 – Four Software Growth Model expressions. The exponential and hyperexponential growth models represent software faults that are time independent. The S–Shaped growth models represent time delayed and time inflection software fault growth rates [Mats88].......................................104 Figure 13 – MTTF of Simplex, Parallel Redundant, and TMR Systems........................111 Figure 14 – MTTF of Parallel Redundant and TMR Systems with varying degrees of coverage...............................................................................................................112 Figure 15 – Mean Time to Failure increases for a Triple Modular Redundant system with periodic maintenance. This graph shows that maintenance intervals which are greater than one–half of the mean time to failure for one module have little effect on increasing reliability. 
But frequent maintenance, even low quality maintenance, improves the system reliability considerably............................................................................................123
  9. 9. vi ACKNOWLEDGMENTS The author wishes to thank Dr. Wing Toy of AT&T Naperville Laboratories, Naperville, Illinois for his consultation on the ESS4 Central Office Switch and his contributions to this work; Dr. Victor Lowe of Ford Aerospace, Newport Beach, California for his consultation on the general forms of Markov model solutions; Mr. Henk Hinssen of Exxon Corporation, Antwerp, Belgium for his discussion of the effects of partial diagnostic coverage in Triple Modular Redundant Systems at the Exxon Polystyrene Plant, Antwerp, Belgium; Dr. Phil Bennet of The Centre for Software Engineering, Flixborough, England for his ideas regarding software reliability measurements in the presence of undetected faults; and Mr. Daniel Lelivre of Factory Systems, Paris, France for his comments and review of this work and its applicability to safety critical systems at Total, Mobil, and NorSoLor chemical plants. Several institutions have contributed source material for this work, including The Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology (SINTF), Trondheim, Norway and the United Kingdom Atomic Energy Authority, Systems Reliability Service, Culcheth, Warrington, England. This work is a derivative of an aborted PhD thesis in Computer Science at the University of California, Irvine and was submitted as a Thesis in completion of a Master's Degree in Systems Management, University of Southern California, 1980. This effort started in the early 1980's through TRW, when holding a PhD was a naïve dream, requiring much more work than I had the capacity to produce.
  10. 10. vii PREFACE This work was originally written to support the design and development of the Triple Modular Redundant (TMR) computer produced by Triconex Corporation of Irvine, California, while pursuing a PhD in Computer Science. In 1987, Triconex designed and manufactured its first digital TMR process control computer that was deployed in a variety of industrial environments, including: turbine controls, boiler controls, fire and gas systems, emergency shutdown systems, and general purpose fault–tolerant real–time control systems. The Tricon (a classic 1980’s product name) was based on several innovative technologies. As the manager of software development for Triconex, I was intimately involved in the software and hardware of the Tricon. In 1987, TMR was not a completely new concept. Flight control systems and navigation computers were found in aerospace applications. The Space Shuttle used a TMR+1 computer system and was well understood by the public. What was new to the market was an affordable TMR computer that could be deployed in a rugged industrial environment. The heart of the Tricon was a hardware voting system that performed a 2–out–of–3 vote for all digital input signals presented to the control program. The contents of memory and the computed digital outputs were again voted 2–out–of–3 at the physical output devices. Once the digital command had been applied to the output device, its driven state was verified and the results reported to the control program. The Tricon contained 3 independent (but identical) 32–bit battery powered microprocessors, a 2–out–of–3 voting digital serial bus connecting the three processors, a dual redundant power system using DC–to–DC converters (state of the art for 1987), and three separate isolated serial I/O buses connecting the I/O subsystem to the three main processors. The I/O subsystem cards were
  11. 11. viii themselves TMR, using onboard 8–bit processors and a quad output device to vote 2–out–of–3 on the digital commands received from the control program. The Tricon executed a control program on a periodic basis. The architecture of the operating software was modeled after the programmable controllers of the day, which were programmed in a ladder logic representing mechanical relays and timers. Both digital and analog devices provided input and output to the control program. The control program accepted input states from the I/O subsystem, evaluated the decision logic, and produced output commands, which were sent to the I/O subsystem. This cycle was performed every 10ms in a normally configured system. In the presence of faults, the key to the survivability of the Tricon was the combination of TMR hardware and fault diagnostic software. Diagnostic software was applied to each processor element and the digital I/O device. This diagnostic software was capable of detecting all single stuck–at faults, many multiple stuck–at faults, as well as many transient faults. A fault–injection and reliability evaluation technique developed by the author and described in this work was used to evaluate the coverage factor of the diagnostic software. Triconex no longer exists as an independent company, having been absorbed into a larger control systems vendor. The materials presented in this work were critical to the Tricon's TÜV and SINTF [SINTF89] certification for North Sea Norwegian Sector, German (then the Federal Republic), Belgian, and British Health and Safety Executive (HSE) industrial safety operations. The concept of fault–tolerant computing has become important again in the distributed computing marketplace. The Tandem Non–Stop processor, modern flight and navigation computers, as well as telecommunications computers all depend on some form of diagnostics to initiate the fault detection and recovery process. 
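The 2–out–of–3 vote described above can be sketched in a few lines. This is an illustrative model only, not Triconex's actual implementation; the bitwise form mirrors how a hardware voter treats each digital signal line independently.

```python
def vote_2oo3(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote over three redundant input words.

    Each bit of the result is the majority of the corresponding bits of
    a, b, and c, so a single faulty channel is outvoted bit by bit.
    """
    return (a & b) | (b & c) | (a & c)

# One channel disagrees on two bits; the two healthy channels outvote it.
healthy = 0b1010
assert vote_2oo3(healthy, healthy, 0b0110) == healthy
```

A single faulty channel can never change the voted output; two simultaneous faults on the same bit position can, which is why the diagnostic coverage of the fault detection software matters so much.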
A recent systems architectural paper mentioned TMR but without
  12. 12. ix sufficient attention to the underlying details. [1] The reissuing of this paper addresses several gaps in the literature: § The foundations of fault–tolerance and fault–tolerance modeling have faded from the computer science literature. The underlying mathematics of fault–tolerant systems presents a challenge for an industry focused on rapid software development and short time–to–market pressures. § The fact that unreliable and untrustworthy software systems are created by latent faults in both the hardware and software is poorly understood in this age of Object–Oriented programming and plug–and–play systems development. § The Markov models presented in this work have general applicability to distributed computer systems analysis and need to be restated. The application of these models to distributed processing systems with symmetric multi–processor computers is a reemerging science. With the advent of high–availability computing systems, the foundations of these systems need to be understood once again. § The current crop of computer science practitioners has very little understanding of the complexities and subtleties of the underlying hardware and firmware that make up the diagnostic systems of modern computers, their reliability models, and the mathematics of system modeling. Glen B. Alleman Niwot Colorado 80503 Updated, April 2000 1 “Attribute Based Architectural Styles,” Mark Klein and Rick Kazman, CMU/SEI–99–TR–022, Software Engineering Institute, Carnegie Mellon University, October 1999.
  13. 13. 10/196 C h a p t e r 1 INTRODUCTION Two approaches are available to increase the reliability of a digital computer system: fault avoidance (fault intolerance) and fault tolerance [Aviz75]. Fault avoidance results from conservative design techniques utilizing high–reliability components, system burn–in, and careful design and testing processes. The goal of fault avoidance is to reduce the possibility of a failure [Aviz84], [Rand75], [Kim86], [Ozak88]. A fault that occurs despite these efforts, however, results in system failure, negating all prior efforts to increase system reliability [Litt75], [Low72]. Fault–tolerance provides the system with the ability to withstand a system fault, maintain a safe state in the presence of that fault, and possibly continue to operate in its presence. FAULT TOLERANT SYSTEM DEFINITIONS A set of consistent definitions is used here to avoid confusion with existing definitions. These definitions are provided by the IFIP Working Group 10.4, Reliable Computing and Fault–Tolerance [Aviz84], [Aviz82], [Ande82], [Robi82], [Lapr84], [TUV86]: § A Failure occurs when the system user perceives that a service resource has ceased to deliver the expected results. § An Error occurs when some part of a system resource assumes an undesired state, a state contrary to the specification of the resource or to the expectation (requirement) of the user. § A Fault is detected when either a failure of the resource occurs, or an error is observed within the resource. The cause of the failure or error is said to be a fault.
  14. 14. 11/196 FAULT–TOLERANT SYSTEM FUNCTIONS In fault–tolerant systems, hardware and software redundancy provides information needed to negate the effects of a fault [Aviz67]. The design of fault– tolerant systems involves the selection of a coordinated failure response mechanism that follows four steps [Siew84], [Mell77], [Toy86]: § Fault Detection § Fault Location and Identification § Fault Containment and Isolation § Fault Masking During the fault detection process, diagnostics are used to gather and analyze information generated by the fault detection hardware and software. These diagnostics determine the appropriate fault masking and fault recovery actions [Euri84], [Rouq86], [Ossf80], [Gluc86], [John85], [John86], [Kirr86], [Chan70]. It is the less than perfect operation of the Fault Detection, Location, and Identification processes of the system that is examined in this work. The reliability of the fault–tolerant system depends on the ability of the diagnostic subsystem to correctly detect and analyze faults [Kirr87], [Gall81], [Cook73], [Brue76], [Lamp82]. The measure of the correct operation of the diagnostic subsystem is called the Coverage Factor. It is assumed in most fault–tolerant product offerings that the diagnostic coverage factor is perfect, i.e. 100%. This work addresses the question: What is the reliability of the Fault–Tolerant system in the presence of less than perfect coverage? To answer this question, some background in the mathematics of reliability theory is necessary. Overview of This Thesis The development of a reliability model of a Triple Modular Redundant (TMR) system with imperfect diagnostic coverage is the goal of this work. Along the
  15. 15. 12/196 way, the underlying mathematics for analyzing these models is developed. The Markov Chain method will be the primary technique used to model the failure and repair processes of the TMR system. The Laplace transform will be used to solve the differential equations representing the transition probabilities between the various states of the TMR system described by the Markov model. The models developed for a TMR system with partial coverage can be applied to actual systems. In order to make the models useful in the real world, a deeper understanding of diagnostic coverage and fault detection is presented. The appendices provide the background for the Markov models as well as the statistical processes. The mathematics of Markov Chains and the statistical processes that underlie system faults and their repair processes can be applied to a variety of other analytical problems, including system performance analysis. It is hoped the reader will gain some appreciation of the complexity and beauty of modern systems as well as the subtleties of their design and operation. If the reader is interested in skipping to the end, Chapter 7 provides a summary of the effects of partial coverage on various system configurations.
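The effect this thesis quantifies can be previewed with a small sketch. For the simplest case, a two–unit parallel system without repair, a commonly used closed form (stated here as an assumption; the repairable case is derived formally with Markov models in later chapters) is R(t) = e^(–2λt) + 2c·e^(–λt)·(1 – e^(–λt)): either both units survive, or exactly one unit fails and the fault is covered.

```python
import math

def r_duplex(t: float, lam: float, c: float) -> float:
    """Reliability of a two-unit parallel system, no repair, coverage c:
    both units alive, or exactly one failed with the fault covered."""
    up = math.exp(-lam * t)          # survival probability of one unit
    return up ** 2 + 2.0 * c * up * (1.0 - up)

lam, t = 100e-6, 5000.0              # illustrative: 100 f/10^6 hr, 5000 hr
simplex = math.exp(-lam * t)
# Perfect coverage beats a single unit; at c = 0.5 the redundancy buys
# nothing, since R collapses algebraically to the simplex value.
assert r_duplex(t, lam, 1.0) > simplex
assert abs(r_duplex(t, lam, 0.5) - simplex) < 1e-12
```

The second assertion is the punch line of the whole work in miniature: redundant hardware with poor diagnostic coverage can be no more reliable than a single unit.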
  16. 16. 13/196 C h a p t e r 2 RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS When presented with the reliability figures for a computer system, the user must often accept the stated value as factual and relevant, and construct a comparison matrix to determine the goodness of each product offering [Kraf81]. Difficulties often arise through the definition and interpretation of the term reliability. This chapter develops the necessary background for understanding the reliability criteria defined by the manufacturers of computer equipment. Figure 1 lists the criteria for defining system reliability [Siew82], [Ande72], [Ande79], [Ande81]. Deterministic Models: survival of at least k component failures. Probabilistic Models: z(t) – Hazard (failure rate) function; R(t) – Reliability function; µ – Repair rate; A(t) – Availability function. Single Parameter Models: MTTF – Mean Time to Failure; MTTR – Mean Time to Repair; MTBF – Mean Time Between Failures; c – Coverage. Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be used to develop a set of time dependent metrics used to evaluate various configurations.
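The criteria in Figure 1 can be made concrete with a short sketch. The failure rate below is an arbitrary illustration; the M–of–N closed form is the standard binomial expression linking the deterministic "survive at least k failures" criterion to the probabilistic models.

```python
import math
from math import comb

lam = 100e-6                  # illustrative failure rate: 100 per 10^6 hours
mttf = 1.0 / lam              # exponential model: MTTF = 1/lambda

def R(t: float) -> float:
    """Single-unit reliability R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)

def r_m_of_n(m: int, n: int, r: float) -> float:
    """Probability that at least m of n independent units, each with
    reliability r, are working -- e.g. 2-of-3 for a TMR configuration."""
    return sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(m, n + 1))

# Survival to t = MTTF is e^-1, about 0.368 -- not 0.5.
assert math.isclose(R(mttf), math.exp(-1))
# The familiar TMR form R = 3r^2 - 2r^3 is the 2-of-3 binomial case.
r = R(5000.0)
assert abs(r_m_of_n(2, 3, r) - (3 * r**2 - 2 * r**3)) < 1e-12
```

These single-parameter figures of merit assume perfect fault detection; the later chapters show how the coverage factor c in Figure 1 erodes them.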
DETERMINISTIC MODELS

The simplest reliability model is a deterministic one, in which the minimum number of component failures that can be tolerated without system failure is taken as the figure of merit for the system.

Probabilistic Models

The failure rate of electronic and mechanical devices varies as a function of time. This time dependent failure rate is defined by the hazard function, $z(t)$, also referred to as the hazard rate or mortality rate. For electronic components on the normal–life portion of their failure curve, the failure rate is assumed to be a constant, $\lambda$, rather than a function of time. The exponential probability distribution is the most common distribution encountered in reliability models, since it accurately describes most life testing aspects of electronic equipment [Kapu77]. The probability density function (pdf), Cumulative Distribution Function (CDF), reliability function $R(t)$, and hazard (failure rate) function $z(t)$ of the exponential distribution are expressed by the following [Kend77]:

$\text{pdf} = f(t) = \lambda e^{-\lambda t}$ (2.1)

$\text{CDF} = F(t) = 1 - e^{-\lambda t}$ (2.2)

$\text{Reliability} = R(t) = e^{-\lambda t}$ (2.3)

$\text{Hazard function} = z(t) = \lambda$ (2.4)

The failure rate parameter $\lambda$ describes the rate at which failures occur over time [DoD82]. In the analysis that follows, the failure rate is assumed to be constant, and is measured in failures per million hours. Although a time dependent failure rate could be used for un–aged electronic components, the aging of the electronic
components can remove the traditional bathtub curve failure distribution. The constant failure rate assumption is also extended to the firmware controlling the diagnostics of the system [Bish86], [Knig86], [Kell88], [Ehre78], [Eckh75], [Gmei79], [RTCA85].

Exponential and Poisson Relationships

In modeling the reliability functions associated with actual equipment, several simplifying assumptions must be made to render the resulting mathematics tractable. These assumptions do not reduce the applicability of the resulting models to real–world phenomena. One simplifying assumption is that the random variables associated with the failure process have exponential probability distributions. The property of the exponential distribution that makes it easy to analyze is that it does not decay with time. If the lifetime of a component is exponentially distributed, then after some amount of time in use the item is assumed to be as good as new. Formally, this property states that the random variable $X$ is memoryless if the expression $P\{X > s + t \mid X > t\} = P\{X > s\}$ is valid for all $s, t \ge 0$ [Cram66], [Ross83]. If the random variable $X$ is the lifetime of some item, then the probability that the item is functional at time $s + t$, given that it survived to time $t$, is the same as the initial probability that it was functional at time $s$. If the item is functional at time $t$, then the distribution of the remaining amount of time that it survives is the same as the original lifetime distribution; the item does not remember that it has already been in use for a time $t$. This property is equivalent to the expression $P\{X > s + t,\ X > t\} / P\{X > t\} = P\{X > s\}$, or $P\{X > s + t\} = P\{X > s\}\,P\{X > t\}$. Since the form of this expression is satisfied when the random variable $X$ is exponentially distributed (since
$e^{-\lambda(s+t)} = e^{-\lambda s}\,e^{-\lambda t}$), it follows that exponentially distributed random variables are memoryless. The recognition of this property is vital to the understanding of the models presented in this work. If the underlying failure process is not memoryless, then the exponential distribution model is not valid. The exponential probability distributions and the related Poisson processes used in the reliability models are formally based on the assumptions shown in Figure 2 [Cox62], [Thor26].

§ Failures occur completely randomly and are independent of any previous failure. A single failure event does not provide any information regarding the time of the next failure event.

§ The probability of a failure during any interval of time $[0, t]$ is proportional to the length of the interval, with a constant of proportionality $\lambda$. The longer one waits, the more likely it is that a failure will occur.

Figure 2 – Assumptions regarding the behavior of a random process that generates events following the Poisson probability distribution function.

An expression describing the random processes in Figure 2 results from the Poisson Theorem, which states that the probability of an event $A$ occurring $k$ times in $n$ trials is approximately [Papo65], [Pois37],

$\dfrac{n(n-1)\cdots(n-k+1)}{1 \cdot 2 \cdots k}\,p^k q^{n-k},$ (2.5)

where $p = P\{A\}$ is the probability of the event $A$ occurring in a single trial and $q = 1 - p$. This approximation is valid when $n \to \infty$, $p \to 0$, and the product $n \cdot p$ remains finite. It should be noted that a large number of different trials of independent systems is needed for this condition to hold, rather than a large number of repeated trials on the same system.
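As an illustrative check (not part of the formal development; the function names and the sample failure rate are arbitrary), the exponential expressions of Eqs. (2.1)–(2.4) and the memoryless identity above can be evaluated numerically:

```python
import math

def pdf(t, lam):
    return lam * math.exp(-lam * t)            # Eq. (2.1)

def cdf(t, lam):
    return 1.0 - math.exp(-lam * t)            # Eq. (2.2)

def reliability(t, lam):
    return math.exp(-lam * t)                  # Eq. (2.3)

def hazard(t, lam):
    return pdf(t, lam) / reliability(t, lam)   # Eq. (2.4): z(t) = f(t) / R(t)

lam = 5e-6   # 5 failures per million hours, an illustrative rate
for t in (1e3, 1e4, 1e5):
    assert abs(cdf(t, lam) + reliability(t, lam) - 1.0) < 1e-12  # F + R = 1
    assert abs(hazard(t, lam) - lam) < 1e-12                     # constant hazard

# Memorylessness: P{X > s + t} = P{X > s} P{X > t} for the exponential
for s in (1e3, 5e4):
    for t in (2e3, 8e4):
        lhs = reliability(s + t, lam)
        rhs = reliability(s, lam) * reliability(t, lam)
        assert abs(lhs - rhs) < 1e-12
```

The constant hazard and the factoring survival function are two faces of the same property; no other continuous distribution satisfies both.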
The Poisson Theorem can be simplified to the following approximation for the probability of an event occurring $k$ times in $n$ trials [Kend77]. Using Stirling's approximation $n! \approx \sqrt{2\pi n}\,n^n e^{-n}$ for the factorials, with $n \to \infty$, $p \to 0$, and $np$ finite,

$\binom{n}{k} p^k q^{n-k} = \dfrac{n!}{k!\,(n-k)!}\,p^k (1-p)^{n-k} \approx \dfrac{n^k}{k!}\,p^k e^{-np} = \dfrac{(np)^k e^{-np}}{k!}.$ (2.6)

The exponential and Poisson expressions are directly related. A detailed understanding of this relationship will aid in the development of the analysis that follows. Using the Poisson assumptions described in Figure 2, the probability of $n$ failures prior to time $t$ is,

$P\{N(t) = n\} = P_t(n).$ (2.7)

From Eq. (2.7), the probability that no failures occur ($n = 0$) between time $t$ and time $t + \Delta t$ is,

$P_{t+\Delta t}(0) = P_t(0)\,[1 - \lambda \Delta t],$ (2.8)

where the term $\lambda = np$ describing the total number of failures is of moderate magnitude [Fell67]. The probability that $n$ failures occur between time $t$ and time $t + \Delta t$ is then,

$P_{t+\Delta t}(n) = P_t(n)\,[1 - \lambda \Delta t] + P_t(n-1)\,[\lambda \Delta t], \quad n > 0.$ (2.9)
Using Eq. (2.9) and Eq. (2.8) and allowing $\Delta t \to 0$, a differential equation can be constructed describing the rate at which failures occur between time $t$ and time $t + \Delta t$,

$\dfrac{d}{dt} P_t(0) = -\lambda P_t(0),$
$\dfrac{d}{dt} P_t(n) = \lambda\,[P_t(n-1) - P_t(n)], \quad n > 0,$ (2.10)

with the initial conditions of,

$P_0(0) = 1, \quad P_0(n) = 0 \text{ for } n > 0.$ (2.11)

The unique solution to the differential equation in Eq. (2.10) is [Klie75],

$P_t(n) = \dfrac{(\lambda t)^n e^{-\lambda t}}{n!}, \quad n = 0, 1, 2, \ldots$ (2.12)

which is the Poisson distribution defined in Eq. (2.6). Using Eq. (2.12) to define a function $F(t)$ representing the probability that no failures have occurred as of time $t$ gives,

$F(t) = P_t\{n = 0\} = e^{-\lambda t}.$ (2.13)

The expression in Eq. (2.13) is also the definition of the Cumulative Distribution Function, CDF, of the Poisson failure process [Fell67]. By using Eq. (2.19), the probability distribution function, pdf, of the Poisson process can be given as,

$f(t) = \lambda e^{-\lambda t},$ (2.14)
which is the exponential probability distribution. [2] The following statement describes the relationship between the Poisson and exponential expressions [Cox65]: if the number of failures occurring over an interval of time is Poisson distributed, then the time between failures is exponentially distributed. An alternative method of relating the exponential and Poisson expressions is useful at this point. The functions defined in Eq. (2.1) and Eq. (2.2) are based on the interchangeability of the pdf and the CDF for any defined probability distribution. The Cumulative Distribution Function $F(x)$ of a random variable $X$ is defined as a function obeying the following relationship [Papo65],

$F(x) = P\{X \le x\}, \quad -\infty < x < \infty.$ (2.15)

The probability density function $f(x)$ of a random variable $X$ can be derived from the CDF using the following [Dave70],

$f(x) = \dfrac{d}{dx} F(x).$ (2.16)

The CDF can be obtained from the pdf by the following,

$F(x) = P\{X \le x\} = \int_{-\infty}^{x} f(t)\,dt, \quad -\infty < x < \infty.$ (2.17)

Using Eq. (2.16) and Eq. (2.17), the CDF and pdf expressions for an exponential distribution can be developed. If the mean time between failures (MTBF) is an exponentially distributed random variable, the CDF is,

[2] This development of the pdf is very informal. Making use of the forward reference to construct an expression is circular logic and would not be permitted in more formal circumstances. For the purposes of this work, this type of behavior can be tolerated, since the purpose of this development is to get to the results rather than dwell on the analysis process. This is a fundamental difference between mathematics and engineering.
$F(t) = \begin{cases} 1 - e^{-\lambda t}, & 0 \le t < \infty, \\ 0, & \text{otherwise.} \end{cases}$ (2.18)

The number of failures in the time interval $[0, t]$ is a Poisson distributed random variable with a probability density function of,

$f(t) = \dfrac{d}{dt} F(t) = \begin{cases} \lambda e^{-\lambda t}, & t > 0, \\ 0, & \text{otherwise,} \end{cases}$ (2.19)

where $t$ is a random variable denoting the time between failures.

Reliability, Availability and Failure Density Functions

An expression for the reliability of a system can be developed using the following technique. The probability of a failure as a function of time is defined as,

$P\{T \le t\} = F(t), \quad t \ge 0,$ (2.20)

where $t$ is a random variable denoting the failure time. $F(t)$ is a function defining the probability that the system will fail by time $t$; it is also the Cumulative Distribution Function (CDF) of the random variable $t$ [Papo65]. The probability that the system will perform as intended at a certain time $t$ is defined as the Reliability function,

$R(t) = 1 - F(t) = P\{T \ge t\}.$ (2.21)

If the random variable describing the time to failure $t$ has a probability density function $f(t)$, then using Eq. (2.21) the Reliability function is,

$R(t) = 1 - F(t) = 1 - \int_{-\infty}^{t} f(x)\,dx = \int_{t}^{\infty} f(x)\,dx.$ (2.22)
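The Poisson development above can also be exercised numerically: integrating the differential equations of Eq. (2.10) forward in time should reproduce the closed form of Eq. (2.12). The sketch below uses a simple Euler scheme with arbitrary rate and step size (illustrative choices, not from the text):

```python
import math

lam, t_end, dt = 0.5, 4.0, 1e-4
nmax = 20
# Initial conditions, Eq. (2.11): P_0(0) = 1, P_0(n) = 0 for n > 0
P = [1.0] + [0.0] * nmax

for _ in range(int(t_end / dt)):
    new = P[:]
    new[0] = P[0] - dt * lam * P[0]                   # dP(0)/dt = -lam P(0)
    for n in range(1, nmax + 1):
        new[n] = P[n] + dt * lam * (P[n - 1] - P[n])  # dP(n)/dt = lam [P(n-1) - P(n)]
    P = new

def poisson(n, lam_t):
    # Eq. (2.12): (lam t)^n e^{-lam t} / n!
    return (lam_t ** n) * math.exp(-lam_t) / math.factorial(n)

for n in range(6):
    assert abs(P[n] - poisson(n, lam * t_end)) < 1e-3
```

The agreement between the integrated probabilities and the Poisson formula is the numerical counterpart of the uniqueness claim made for Eq. (2.12).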
Assuming the time to failure random variable $t$ has an exponential distribution, its failure density defined by Eq. (2.19) is,

$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0, \; \lambda \ge 0.$ (2.23)

The resulting reliability function is then,

$R(t) = \int_{t}^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t}.$ (2.24)

A function describing the rate at which a system fails as a function of time is referred to as the Hazard function (Eq. (2.4)). Let $T$ be a random variable representing the service life remaining for a specified system. Let $F(x)$ be the distribution function of $T$ and let $f(x)$ be its probability density function. A new function $z(x)$, termed the Hazard Function or the Conditional Failure Function of $T$, is given by

$z(x) = \dfrac{f(x)}{1 - F(x)}.$

The quantity $z(x)\,dx$ is the conditional probability that the item will fail between $x$ and $x + dx$, given that it has survived a time $T$ greater than $x$. For a given hazard function $z(x)$ the corresponding distribution function is

$1 - F(x) = \left[1 - F(x_0)\right] \exp\left[-\int_{x_0}^{x} z(y)\,dy\right],$

where $x_0$ is an arbitrary value of $x$. In a continuous time reliability model the hazard function is defined as the instantaneous failure rate of the system [Kapu77],
$z(t) = \lim_{\Delta t \to 0} \dfrac{R(t) - R(t + \Delta t)}{\Delta t \cdot R(t)} = -\dfrac{1}{R(t)}\left[\dfrac{d}{dt}R(t)\right] = \dfrac{f(t)}{R(t)} = \dfrac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda.$ (2.25)

The quantity $z(t)\,dt$ represents the probability that a system of age $t$ will fail in the small interval of time $[t, t + dt]$. The hazard function is an important indicator of the change in the failure rate over the life of the system. For a system with an exponential failure rate, the hazard function is constant, as shown in Eq. (2.25), and the exponential is the only distribution that exhibits this property [Barl85]. Other reliability distributions with variable hazard rates will be shown in later chapters. If a system contains no redundancy – that is, every component must function properly for the system to continue operation – and if component failures are statistically independent, the system reliability function is the product of the component reliabilities and follows an exponential probability distribution. The failure rate of such a system is the sum of the failure rates of the individual components,

$R_{sys}(t) = \prod_{i=1}^{n} R_i(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = \exp\left[-t \sum_{i=1}^{n} \lambda_i\right].$ (2.26)

In most cases it is possible to repair or replace failed components, and accurate models of system reliability will consider this. As will be shown, the repair activity is not as easily modeled as the failure mechanisms.
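Eq. (2.26) can be verified directly: the product of independent exponential component reliabilities equals a single exponential whose rate is the sum of the component rates. A brief sketch (the component rates are illustrative values only):

```python
import math

rates = [4e-6, 12e-6, 7e-6]   # component failure rates per hour, illustrative

def r_component(t, lam):
    # Eq. (2.3): reliability of one exponential component
    return math.exp(-lam * t)

def r_system(t):
    # Eq. (2.26): series system reliability is the product of the parts
    r = 1.0
    for lam in rates:
        r *= r_component(t, lam)
    return r

t = 5e4
# The series system behaves as one component with rate sum(rates)
assert abs(r_system(t) - math.exp(-sum(rates) * t)) < 1e-12
```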
For systems that can be repaired, a new measure of reliability can be defined: the probability that the system is operational at time $t$. This new measure is the Availability, expressed as $A(t)$. Availability $A(t)$ differs from reliability $R(t)$ in that any number of system failures can occur prior to time $t$, but the system is considered available if those failures have been repaired prior to time $t$. For systems that can be repaired, it is assumed that the behavior of the repaired system and the original system are identical from a failure standpoint. In general this is not true, as perfect renewal of the system configuration is not possible. The terms Mean Time to First Failure and Mean Time to Second Failure now become relevant. Assuming a constant failure rate $\lambda$, a constant repair rate $\mu$, and identical failure behaviors between the repaired system and the original system, the steady–state system availability can be expressed as,

$A_{SS} = \dfrac{\mu}{\lambda + \mu}.$ (2.27)

The expression in Eq. (2.27) is an approximation; the exact expression for the availability with repair requires the solution of the appropriate Markov model, which will be developed in a later chapter.

Mean Time to Failure

The Mean Time to Failure (MTTF) is the expected time to the first failure in a population of identical systems, given a successful system startup at time $t = 0$. The Cumulative Distribution Function $F(x)$ in Eq. (2.15) and the probability density function $f(x)$ in Eq. (2.16) characterize the behavior of the probability distribution function of the underlying random failure process. These expressions
are in a continuous integral form and require the solution of integral equations to produce a usable result. A concise parameter that describes the expected value of the random process is useful for comparison of different reliability models. This parameter is the Mean or Expected Value of the random variable, denoted by $E[X]$ and defined by [Parz60], [Dave70],

$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.$ (2.28)

The expression in Eq. (2.28) denotes the expected value of the continuous function $f(x)$. It is important to note that this definition assumes $x f(x)$ is integrable on the interval $(-\infty, \infty)$. For an exponential probability density function of,

$f(x) = \lambda e^{-\lambda x}, \quad x > 0,$ (2.29)

the mean or expected value of the exponential function is given by,

$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{0}^{\infty} x\,\lambda e^{-\lambda x}\,dx.$ (2.30)

The evaluation of Eq. (2.30) can be done in a straightforward manner using the Gamma function [Arfk70], which is defined as,

$\Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha - 1} e^{-x}\,dx, \quad \alpha > 0,$ (2.31)

or alternately,

$\dfrac{\Gamma(\alpha)}{\lambda^{\alpha}} = \int_{0}^{\infty} x^{\alpha - 1} e^{-\lambda x}\,dx.$ (2.32)

Rewriting the expression in Eq. (2.30) for the expected value as,
$E[X] = \dfrac{1}{\lambda}\int_{0}^{\infty} u\,e^{-u}\,du,$ (2.33)

where substituting the variables,

$u = \lambda x \quad \text{and} \quad du = \lambda\,dx,$ (2.34)

results in,

$E[X] = \dfrac{1}{\lambda}\int_{0}^{\infty} u\,e^{-u}\,du = \dfrac{1}{\lambda}\,\Gamma(2) = \dfrac{1}{\lambda},$ (2.35)

which is the MTTF for a simple system. Although this expression is useful for simple systems, a general–purpose expression representing the MTTF is needed. This function can be developed in the following manner. Let $X$ denote the lifetime of a system, so that the reliability function is,

$R(t) = P\{X > t\},$ (2.36)

and the derivative of the reliability function, which is also given in Eq. (2.21) and Eq. (2.22), is again defined as,

$\dfrac{d}{dt}R(t) = -f(t).$ (2.37)

The expression for the expected value or MTTF using Eq. (2.28) is given by:

$E[X] = \int_{0}^{\infty} t f(t)\,dt = -\int_{0}^{\infty} t \left(\dfrac{d}{dt}R(t)\right) dt.$ (2.38)
Using the technique of integration by parts [Smai49], [Arfk70], shown in Eq. (2.39),

$\int_{a}^{b} f(x)\,\dfrac{d}{dx}g(x)\,dx = \Big[f(x)\,g(x)\Big]_{a}^{b} - \int_{a}^{b} g(x)\,\dfrac{d}{dx}f(x)\,dx,$ (2.39)

to evaluate Eq. (2.38), integrating by parts gives the expected value as,

$E[X] = \Big[-t\,R(t)\Big]_{0}^{\infty} + \int_{0}^{\infty} R(t)\,dt.$ (2.40)

Since $R(t)$ approaches zero faster than $t$ approaches infinity, Eq. (2.40) can be reduced to,

$E[X] = \int_{0}^{\infty} R(t)\,dt = MTTF,$ (2.41)

which is the expression for the Mean Time to Failure for a general system configuration. This direct relationship between MTTF and the system failure rate is one reason the constant failure rate assumption is often made when the supporting reliability data is scanty [Barl75]. Appendix G describes the analysis of the variance for this distribution. Using an exponential failure distribution implies two important behaviors for the system:

§ Since a used subsystem is stochastically as good as a new subsystem, a policy of scheduled replacement of used subsystems which are known to still be functioning does not increase the lifetime of the system.

§ In estimating the mean system life and reliability, data can be collected consisting only of the number of hours of observed life and the number of observed failures; the ages of the subsystems under observation are of no concern.
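Eq. (2.41) lends itself to a direct numerical check: integrating $R(t) = e^{-\lambda t}$ over a long enough horizon should return $1/\lambda$, in agreement with Eq. (2.35). A sketch (the helper name and parameters are arbitrary choices):

```python
import math

def mttf_from_reliability(R, t_max, n=200_000):
    # Trapezoidal evaluation of Eq. (2.41): MTTF = integral of R(t) dt
    dt = t_max / n
    total = 0.5 * (R(0.0) + R(t_max))
    for i in range(1, n):
        total += R(i * dt)
    return total * dt

lam = 1e-3
# Integrate out to about 20 mean lives, where R(t) is negligible
est = mttf_from_reliability(lambda t: math.exp(-lam * t), t_max=20.0 / lam)
assert abs(est - 1.0 / lam) < 1.0   # close to 1/lambda = 1000 hours
```

The same routine applies unchanged to any reliability function, which is the point of the general form in Eq. (2.41).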
Mean Time to Repair

The Mean Time to Repair (MTTR) is the expected time for the repair of a failed system or subsystem. For exponential distributions, $MTTF = 1/\lambda$ and $MTTR = 1/\mu$. The steady state availability $A_{SS}$ defined in Eq. (2.27) can be rewritten in terms of these parameters,

$A_{SS} = \dfrac{MTTF}{MTTR + MTTF}.$ (2.42)

Mean Time Between Failure

The Mean Time Between Failure (MTBF) is often mistakenly used in place of the Mean Time to Failure (MTTF). The MTBF is the mean time between failures in a system with repair, and is derived from a combination of repair and failure processes. The simplest approximation for MTBF is:

$MTBF = MTTF + MTTR.$ (2.43)

In this work it is assumed that $MTTR \ll MTTF$, so that MTTF can be used in place of MTBF. The Mean Time to Failure is considered since, in fault–tolerant systems, failure occurs only when the redundancy features of the system fail to function properly. In the presence of perfect coverage and perfect repair the system should operate continuously; therefore, failure of the system implies total loss of system capabilities.

Mean Time to First Failure

The Mean Time to Failure is defined as the expected time of the first failure in a population of identical systems. This development depends on the assumptions that the failure rate is constant (Eq. (2.25)) and exponentially distributed (Eq. (2.14)), and that the repair rate is a constant, $\mu$. In the general case, these assumptions may not
be valid, and the Mean Time to Failure (MTTF) is not equivalent to the Mean Time to First Failure (MTFF). By removing the exponential probability failure distribution restriction in Eq. (2.29), a generalized expression for the first failure time can be derived. Given a population of $n$ subsystems, each with a random variable $X_i,\ i = 1, 2, \ldots, n$, and a continuous pdf of $f(x)$, the failure time for the $n$th subsystem is given by summing all the failure times prior to the failure,

$S_n = X_1 + X_2 + \cdots + X_n = \sum_{i=1}^{n} X_i.$ (2.44)

If the random variables $\{X_1, X_2, \ldots, X_n\}$ are independent and identically distributed, all with pdfs of $f(x)$, the random process described by these variables is referred to as an Ordinary Renewal Process [Cox62], [Ross70]. The details of the Renewal Process are shown in Appendix E. Given the random process described by Eq. (2.44), the distribution function of $S_n$ is provided by convolving each individual distribution function $F(t)$. The convolution of two functions is defined as [Brac65], [Papo65]:

$f(x) \otimes g(x) \equiv \int_{-\infty}^{\infty} f(u)\,g(x - u)\,du.$ (2.45)

The resulting convolution function for the $(n+1)$th subsystem failure is given by:

$F_{n+1}(t) = \int_{0}^{t} F_n(t - x)\,dF(x).$ (2.46)

In renewal processes, the random variables are actually functions and can be substituted in the reliability computations when:
$N(t) = n \iff S_n \le t \le S_{n+1}.$ (2.47)

When the conditions in Eq. (2.47) are met, the probability of $n$ renewals in a time interval is given by,

$P\{N(t) = n\} = P\{S_n \le t \le S_{n+1}\} = P\{S_n \le t\} - P\{S_{n+1} \le t\} = F_n(t) - F_{n+1}(t).$ (2.48)

The renewal function $H(t)$ can be defined as the average number of subsystem failures and repairs as a function of time, and is given as,

$H(t) = E[N(t)].$ (2.49)

Using Eq. (2.48) in the evaluation of Eq. (2.49), and Eq. (2.30) as the definition of the expectation value, gives the following for the renewal function,

$H(t) = \sum_{n=0}^{\infty} n\,P\{N(t) = n\} = \sum_{n=0}^{\infty} n\left[F_n(t) - F_{n+1}(t)\right] = \sum_{n=1}^{\infty} F_n(t).$ (2.50)

Simplifying Eq. (2.50) results in an expression for the renewal function of,

$H(t) = F(t) + \sum_{n=1}^{\infty} F_{n+1}(t).$ (2.51)

The term $F_{n+1}$ is the convolution of $F_n$ and $F$, which gives,

$F_{n+1}(t) = \int_{0}^{t} F_n(t - x)\,dF(x),$ (2.52)

which results in the expression for the renewal function of,
$H(t) = F(t) + \sum_{n=1}^{\infty} \int_{0}^{t} F_n(t - x)\,dF(x).$ (2.53)

Rearranging the integral term in Eq. (2.53) gives,

$H(t) = F(t) + \int_{0}^{t} \left[\sum_{n=1}^{\infty} F_n(t - x)\right] dF(x).$ (2.54)

The summation term in Eq. (2.54) is the renewal function for the $n$th failure, giving,

$H(t) = F(t) + \int_{0}^{t} H(t - x)\,dF(x).$ (2.55)

Using Eq. (2.16), the renewal density function $h(t)$ is the derivative of the distribution function, giving,

$h(t) = \dfrac{d}{dt} H(t).$ (2.56)

Using Eq. (2.50) to evaluate the derivative results in,

$h(t) = \sum_{n=1}^{\infty} f_n(t),$ (2.57)

and using Eq. (2.54) as a substitute for the right–hand side of Eq. (2.57) results in,

$h(t) = f(t) + \int_{0}^{t} h(t - x)\,f(x)\,dx.$ (2.58)

Eq. (2.58) is known as the Renewal Equation [Ross70]. To solve the renewal equation, the Laplace transform will be used. The transform of the probability density function is,
$\mathcal{L}\{f(s)\} = \int_{0}^{\infty} e^{-sx} f(x)\,dx,$ (2.59)

and the transform of the renewal function is,

$\mathcal{L}\{h(s)\} = \int_{0}^{\infty} e^{-sx} h(x)\,dx.$ (2.60)

Using the convolution property of the Laplace transform [Brac65], an equation for the renewal distribution can be generated,

$\mathcal{L}\{h(s)\} = \mathcal{L}\{f(s)\} + \mathcal{L}\{h(s)\}\,\mathcal{L}\{f(s)\},$ (2.61)

and simplified to,

$\mathcal{L}\{h(s)\} = \dfrac{\mathcal{L}\{f(s)\}}{1 - \mathcal{L}\{f(s)\}}.$ (2.62)

Eq. (2.62) is now the generalized expression for the failure distribution for a random process with an arbitrary probability distribution.

General Availability Analysis

The steady state system availability defined in Eq. (2.42) assumes an exponential distribution for the failure rate of the system or subsystems. An important activity in the analysis of Fault–Tolerant systems is the development of a general–purpose availability expression, independent of the underlying failure distribution. In the analysis that follows, it will be assumed that when a subsystem fails it is repaired and the system restored to its functioning state. It will also be assumed that the restored system functions as if it were new, that is, with the failure probability function restarted at $t = 0$.
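Before moving on, the Renewal Equation Eq. (2.58) can be checked numerically: discretizing the convolution and solving forward in time should return the known renewal density for exponential interarrivals, $h(t) = \lambda$ (the constant-rate property of the Poisson process). A rough sketch (the step size and rate are arbitrary choices, and the simple rectangle rule carries an $O(\Delta t)$ bias):

```python
import math

lam, dt, n = 1.0, 1e-3, 1000   # exponential interarrivals with rate lam

f = [lam * math.exp(-lam * i * dt) for i in range(n)]   # pdf samples

# Discretized renewal equation, Eq. (2.58): h(t) = f(t) + int_0^t h(t-x) f(x) dx
h = [0.0] * n
for i in range(n):
    conv = sum(h[i - j] * f[j] for j in range(1, i)) * dt
    h[i] = f[i] + conv

# For exponential interarrivals the renewal density is constant: h(t) = lam
assert abs(h[-1] - lam) < 0.01
```

For any other interarrival distribution the same recursion applies with a different `f`, which is exactly what the Laplace-transform solution Eq. (2.62) expresses in closed form.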
Let $T_i$ be the duration of the $i$th functioning period, and let $D_i$ be the system downtime, because of the failure of the system, while the $i$th repair takes place. These durations will form the basis of the renewal process. By combining the subsystem failure interval and the subsystem repair duration, a random variable sequence is constructed such that,

$X_i = T_i + D_i; \quad i = 1, 2, \ldots$ (2.63)

It must be assumed that the durations of the functioning subsystems are identically distributed with a common Cumulative Distribution Function $W(t)$ and a common probability density function $w(t)$, and that the repair periods are also identically distributed with $G(t)$ and $g(t)$. Using these assumptions, the terms in Eq. (2.63) are also identically distributed, such that,

$\{X_i,\ i = 1, 2, \ldots\}$ (2.64)

meets the definition of a Renewal process developed in Eq. (2.44). Using this development, an expression for the convolution of the two independent random processes is given by,

$\mathcal{L}\{f(s)\} = \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}.$ (2.65)

Using Eq. (2.62) gives,

$\mathcal{L}\{h(s)\} = \dfrac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}.$ (2.66)

The average number of repairs $M(t)$ in the time interval $(0, t]$ has the Laplace transform:
$\mathcal{L}\{M(s)\} = \dfrac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{s\left[1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}\right]}.$ (2.67)

Instantaneous Availability

The steady state availability defined in Eq. (2.42) can now be replaced with the instantaneous availability $A(t)$. In the absence of a repair mechanism, the availability $A(t)$ is equivalent to the reliability $R(t)$ of the subsystem. The subsystem may be functioning at time $t$ for two mutually exclusive reasons:

§ The subsystem has not failed from the beginning.

§ The last renewal occurred within the time period and the subsystem has continued to function since that time.

The probability associated with the second case is the convolution of the reliability function and the renewal density, giving,

$\int_{0}^{t} R(t - x)\,h(x)\,dx,$ (2.68)

which results in an expression for the instantaneous availability of,

$A(t) = R(t) + \int_{0}^{t} R(t - x)\,h(x)\,dx.$ (2.69)

Taking the Laplace transform of both sides of Eq. (2.69) gives,

$\mathcal{L}\{A(s)\} = \mathcal{L}\{R(s)\} + \mathcal{L}\{R(s)\}\,\mathcal{L}\{h(s)\} = \mathcal{L}\{R(s)\}\left[1 + \mathcal{L}\{h(s)\}\right] = \mathcal{L}\{R(s)\}\left[1 + \dfrac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}\right].$ (2.70)
Since the reliability of the system is given as $R(t) = 1 - W(t)$,

$\mathcal{L}\{A(s)\} = \dfrac{1}{s} - \mathcal{L}\{W(s)\} = \dfrac{1}{s} - \dfrac{\mathcal{L}\{w(s)\}}{s} = \dfrac{1 - \mathcal{L}\{w(s)\}}{s}.$ (2.71)

Substituting gives,

$\mathcal{L}\{A(s)\} = \dfrac{1 - \mathcal{L}\{w(s)\}}{s\left[1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}\right]}.$ (2.72)

Given the failure–rate distribution and the repair–time distribution, Eq. (2.72) can be used to compute the instantaneous availability as a function of time.

Limiting Availability

An important question to ask is: what is the availability of the system after some long period of time? The limiting availability $A(t)$ as $t \to \infty$ is defined as $A$, or simply the Availability. To derive an expression for the limiting availability, the Final Value Theorem of the Laplace transform can be used [Doet61], [Widd46], [Brac65], [Ogat70], [Gupt66]. This theorem states that the steady state behavior of $f(t)$ is the same as the behavior of $sF(s)$ in the neighborhood of $s = 0$. Thus it is possible to obtain the value of $f(t)$ as $t \to \infty$. Let,

$F(t) = \int_{0}^{t} f(x)\,dx + F(0^-),$ (2.73)

then, using a table of Laplace transforms [Doet61], [Brac65],
$s\,\mathcal{L}\{F(s)\} - F(0^-) = \mathcal{L}\{f(s)\} = \int_{0}^{\infty} e^{-st} f(t)\,dt,$ (2.74)

and by letting $s \to 0$,

$\lim_{s \to 0} s\,\mathcal{L}\{F(s)\} = \int_{0}^{\infty} f(t)\,dt + F(0^-) = \lim_{t \to \infty}\left[\int_{0}^{t} f(x)\,dx + F(0^-)\right] = \lim_{t \to \infty} F(t).$ (2.75)

The limiting availability is then given as,

$A = \lim_{t \to \infty} A(t) = \lim_{s \to 0} s\,\mathcal{L}\{A(s)\}.$ (2.76)

For small values of $s$ the following approximations can be made [Apos74],

$e^{-st} \cong 1 - st,$ (2.77)

giving,

$\mathcal{L}\{w(s)\} = \int_{0}^{\infty} e^{-st} w(t)\,dt \cong \int_{0}^{\infty} w(t)\,dt - s\int_{0}^{\infty} t\,w(t)\,dt = 1 - \dfrac{s}{\lambda},$ (2.78)

where $1/\lambda = MTTF$, and,

$\mathcal{L}\{g(s)\} \cong 1 - \dfrac{s}{\mu},$ (2.79)
and where $1/\mu = MTTR$, giving the limiting availability as,

$A = \lim_{s \to 0} s\,\mathcal{L}\{A(s)\} = \lim_{s \to 0} \dfrac{1 - \left(1 - \dfrac{s}{\lambda}\right)}{s\left[1 - \left(1 - \dfrac{s}{\lambda}\right)\left(1 - \dfrac{s}{\mu}\right)\right]} = \dfrac{\dfrac{1}{\lambda}}{\dfrac{1}{\lambda} + \dfrac{1}{\mu}} = \dfrac{MTTF}{MTTF + MTTR}.$ (2.80)

Eq. (2.80) is an important result in the analysis of system reliability, because it shows that the limiting availability depends only on the Mean Time to Failure and the Mean Time to Repair, and not on the underlying distributions of the failure and repair times.
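The distribution–independence claimed by Eq. (2.80) can be illustrated by simulating an alternating renewal process whose up and down times are deliberately non–exponential; only their means should matter. A sketch (the distributions, target means, and seed are arbitrary choices):

```python
import random

random.seed(42)

# Uptime ~ uniform, downtime ~ triangular: neither is exponential, yet
# Eq. (2.80) predicts A = MTTF / (MTTF + MTTR) from the means alone.
mttf_target, mttr_target = 900.0, 100.0
up_total = down_total = 0.0
for _ in range(200_000):
    up_total += random.uniform(0.0, 2.0 * mttf_target)       # mean 900
    down_total += random.triangular(0.0, 2.0 * mttr_target)  # mean 100

a_est = up_total / (up_total + down_total)        # long-run fraction of uptime
a_pred = mttf_target / (mttf_target + mttr_target)  # Eq. (2.80)
assert abs(a_est - a_pred) < 0.01
```

Replacing the uniform and triangular distributions with any others of the same means leaves the long-run availability unchanged, which is the content of Eq. (2.80).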
Chapter 3. SYSTEM RELIABILITY

This chapter provides the basis for the computation of the overall system reliability given a redundant architecture with partial fault detection coverage. Redundant systems can be modeled under a variety of operational assumptions. Of most interest in this work are dual and triple redundant systems that contain repair facilities.

Series Systems

Creating a reliable system often involves a series or parallel combination of independent systems or subsystems. If $R_i(t)$ is the reliability of module $i$ and all the modules are statistically independent, then the overall system reliability of modules connected in series is,

$R_{series}(t) = \prod_i R_i(t).$ (3.1)

For a series redundant system the failure probability $F_{series}$ is given by,

$F_{series}(t) = 1 - R_{series}(t) = 1 - \prod_{i=1}^{n} R_i(t) = 1 - \prod_{i=1}^{n} \left(1 - F_i(t)\right).$ (3.2)

Expanding Eq. (3.1) will illustrate an aspect of the exponential distribution. For a system of $n$ subsystems connected in series, the reliability of the system is given by Eq. (3.1). If a general purpose hazard function is used for the failure rate [Shoo68], defined by,

$z_i(t) = \lambda_i + c_i t^k,$ (3.3)
where $\lambda_i$, $c_i$, and $k$ are constants, then the reliability function for the individual subsystem is given by,

$R_i(t) = \exp\left[-\left(\lambda_i t + c_i \dfrac{t^{k+1}}{k+1}\right)\right],$ (3.4)

and the reliability function for the system is given by,

$R_{series}(t) = \exp\left[-\left(\sum_{i=1}^{n} \lambda_i t + \sum_{i=1}^{n} c_i \dfrac{t^{k+1}}{k+1}\right)\right].$ (3.5)

Defining two new terms for the summation of the failure rates and a new term for the time constant adjustment, $\lambda^* = \sum_{i=1}^{n} \lambda_i$, $c^* = \sum_{i=1}^{n} c_i$, and $T = \lambda^* t$, results in the series reliability expression of,

$R_{series}(T) = \exp\left[-\left(T + \dfrac{c^*}{(k+1)\left(\lambda^*\right)^{k+1}}\,T^{k+1}\right)\right].$ (3.6)

As the number of subsystems grows large ($\lambda^* \to \infty$), the term $c^*/\left[(k+1)\lambda^*\right]$ remains bounded and the expression for the system reliability becomes,

$\lim_{n \to \infty} R_{series}(t) = e^{-T} = e^{-\lambda^* t}.$ (3.7)

Eq. (3.7) defines the failure distribution of the system as the number of subsystems grows without bound. This implies that a large complex system will tend to follow exponential failure models regardless of the internal organization of the subsystems.
Parallel Systems

In a parallel redundant configuration, the system fails only if all modules fail. The probability of a system failure in a parallel system is given by,

$F_{parallel}(t) = \prod_{i=1}^{n} F_i(t).$ (3.8)

The system reliability for a parallel system is given by,

$R_{parallel}(t) = 1 - F_{parallel}(t) = 1 - \prod_{i=1}^{n} F_i(t) = 1 - \prod_{i=1}^{n} \left(1 - R_i(t)\right).$ (3.9)

M–of–N Systems

An M–of–N system is a generalized form of the parallel system. Instead of requiring only one of the $N$ modules of the system to remain functional, $M$ modules are required. The system of interest in this work is a Triple Modular Redundant (TMR) configuration, in which two of the three modules must function for the system to operate properly [Lyons62], [Kuehn69]. [3] For a given module reliability of $R_m$, the TMR reliability is given by,

$R_{tmr} = R_m^3 + \binom{3}{2} R_m^2 \left(1 - R_m\right).$ (3.10)

In Eq. (3.10) all working states are enumerated. The $R_m^3$ term represents the state in which all three modules are functional. The $\binom{3}{2} R_m^2 \left(1 - R_m\right)$ term

[3] In practical TMR systems, a simplex mode is allowed, which usually places the system in a shutdown mode, allowing the controlled process to be safely stopped.
represents the three states in which any one module has failed and the remaining two modules are functional.

Selecting the Proper Evaluation Parameters

In comparing different redundant system configurations, it is desirable to summarize their reliability by a single parameter. The reliability may be an arbitrarily complex function of time, and the selection of the wrong summary parameter could lead to incorrect conclusions, as will be shown below. Consider a simplex system with a reliability function of,

$R_{simplex}(t) = e^{-\lambda t},$ (3.11)

and using Eq. (2.41) to derive the Mean Time to Failure results in,

$MTTF_{simplex} = \dfrac{1}{\lambda}.$ (3.12)

For a TMR system with an exponential reliability function,

$R_{tmr}(t) = e^{-3\lambda t} + \binom{3}{2} e^{-2\lambda t}\left(1 - e^{-\lambda t}\right) = 3e^{-2\lambda t} - 2e^{-3\lambda t},$ (3.13)

and using Eq. (2.41) results in a Mean Time to Failure of,

$MTTF_{tmr} = \dfrac{3}{2\lambda} - \dfrac{2}{3\lambda}.$ (3.14)

Comparing the simplex and TMR reliability expressions gives,

$MTTF_{tmr} = \dfrac{5}{6\lambda} \le \dfrac{1}{\lambda} = MTTF_{simplex}.$ (3.15)

By using the MTTF figure of merit, the TMR system can be shown to be less reliable than the simplex system. The above equations do not include the facility
for module repair. Once the TMR system has exhausted its redundancy, there is more hardware to fail than the remaining modules of the non–redundant system. This effect lowers the total system reliability. With online repair, the MTTF figure of merit for the TMR system becomes an important measure of the overall system reliability. These results illustrate why simplistic assumptions and calculations may result in erroneous information.
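The comparison above can be checked numerically. The sketch below (illustrative only — the failure rate is an assumed example value, not from the text) integrates the TMR reliability function of Eq. (3.13) and compares the result with the simplex MTTF of Eq. (3.12).

```python
import math

lam = 0.001  # assumed failure rate (per hour), illustrative only

def r_tmr(t):
    """TMR reliability without repair, Eq. (3.13)."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)

# Numerically integrate R(t) over [0, inf) to obtain the MTTF.
dt = 0.5
mttf_tmr = sum(r_tmr(i * dt) * dt for i in range(1, 60000))
mttf_simplex = 1 / lam

print(round(mttf_tmr))                    # ~833, i.e. 5/(6*lam)
print(round(mttf_tmr / mttf_simplex, 2))  # ~0.83, the 5/6 ratio of Eq. (3.15)
```

The numerical integral reproduces Eq. (3.14): without repair, TMR has a shorter mean life than a single module, exactly as the text argues.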
Chapter 4

IMPERFECT FAULT COVERAGE AND RELIABILITY

Reliability models of systems with dynamic redundancy usually depend on perfect fault detection [Arno73], [Stif80]. The ability of the system to detect faults that occur can be classified as [Geis84],

§ Covered – faults that are detected. The probability that a fault belongs to this class is given by c.
§ Uncovered – faults that are not detected. The probability that a fault belongs to this class is given by (1 − c).

The underlying diagnostic firmware and hardware may not provide perfect coverage for many reasons, primarily due to the complexity of the system under diagnosis [Rous79], [Cona72], [Wood79], [Soma86]. Because of this built–in complexity, an exhaustively tested set of diagnostics may not be possible. Another factor affecting the diagnostic coverage is the presence of intermittent faults [Dahb82], [Mall78]. The detection and analysis of these intermittent or permanent faults is further complicated by the presence of transient faults, which behave as real faults but are only present in the system for a short time [Glas82], [Sosn86]. Modeling a fault–tolerant system in the presence of imperfect fault coverage becomes an important aspect of predicting the overall system reliability.

Redundant System with Imperfect Coverage

Before developing the Markov method of analyzing Fault–Tolerant systems, a conditional probability method will be used to derive the MTTF and MTBF for a redundant system with imperfect fault detection [Bour69]. Assume that the failure rate for each subsystem of the redundant system is described by an independent random variable λ. Let X denote the lifetime of a system with two modules, one active and the other in standby mode. Assume that the module in the standby
mode does not experience a fault during the mission time interval. [4] Let Y be a random variable where Y = 0 if a fault is not covered and Y = 1 if a fault is covered; then,

P\{Y = 0\} = (1 - c) and P\{Y = 1\} = c.

To compute the MTTF of this system, the conditional expectation of the system lifetime X given the fault coverage state Y must be derived. If an uncovered fault occurs, the MTTF of the system is the MTTF of the initially active module,

E[X \mid Y = 0] = \frac{1}{\lambda}.  (4.1)

If a covered fault occurs, the MTTF of the system is the sum of the MTTF of the active module and the MTTF of the inactive module,

E[X \mid Y = 1] = \frac{2}{\lambda}.  (4.2)

The total expectation of the system lifetime is then given by,

E[X] = \frac{(1 - c)}{\lambda} + \frac{2c}{\lambda} = \frac{(1 + c)}{\lambda} = MTTF.  (4.3)

The computation of the system reliability depends on the combination of the two independent exponential distribution functions when a covered fault occurs,

f(x = t \mid y = 1) = \lambda^2 t e^{-\lambda t},  (4.4)

and when an uncovered fault occurs,

f(x = t \mid y = 0) = \lambda e^{-\lambda t}.  (4.5)

The joint exponential distribution function for both conditions is given by,

[4] This is an invalid assumption in a practical sense, but it greatly simplifies this example.
f(t, y) = f(x = t \mid y) \cdot P\{y\},
f(t, y) = \lambda(1 - c)e^{-\lambda t};  t > 0,\; y = 0,
f(t, y) = c\lambda^2 t e^{-\lambda t};  t > 0,\; y = 1,  (4.6)

and the marginal density function of X is computed by summing over the joint density function,

f(t) = c\lambda^2 t e^{-\lambda t} + \lambda(1 - c)e^{-\lambda t}.  (4.7)

The system reliability as a function of the coverage is then given by integrating the joint density function in Eq. (4.7) to give,

R(t) = 1 - \int_0^t f(x)\,dx = \int_t^\infty \left[c\lambda^2 x e^{-\lambda x} + \lambda(1 - c)e^{-\lambda x}\right] dx = \left(1 + c\lambda t\right)e^{-\lambda t}.  (4.8)

Generalized Imperfect Coverage

In the previous example, the system consisted of two modules, one in the active state and one in the standby state. The conditional probability that a fault will go undetected (uncovered) was computed using the conditional probability that the system will survive for a specified period. Cox [Cox55] analyzed the general case of a stage–type conditional probability distribution. The principle on which the method of stages is based is the memoryless property of the exponential distribution of Eq. (2.1) [Klie75]. The lack of memory is defined by the fact that the distribution of the time remaining for an exponentially distributed random variable is independent of the current age of the random variable, that is, the variable is memoryless. Appendix D develops further the memoryless property of random variables with exponential distributions.
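As a quick numerical cross-check of Eqs. (4.3) and (4.8), the sketch below (with assumed, illustrative values of λ and c) integrates the reliability function and recovers the MTTF of (1 + c)/λ.

```python
import math

lam, c = 0.002, 0.9  # assumed failure rate and coverage, illustrative only

def r(t):
    """Active/standby pair with imperfect coverage, Eq. (4.8)."""
    return (1.0 + c * lam * t) * math.exp(-lam * t)

# Integrating R(t) over [0, inf) should recover the MTTF of Eq. (4.3).
dt = 0.5
mttf = sum(r(i * dt) * dt for i in range(1, 40000))
print(round(mttf))            # ~950
print(round((1 + c) / lam))   # 950
```

A coverage of c = 0.9 thus yields 95% of the mean life that a perfectly covered standby pair (2/λ = 1000 hours here) would provide.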
In the generalized model, it is assumed that individual modules are always in one of two states – working or failed. It is also assumed that the modules are statistically independent and that module repair can take place while the remainder of the system continues to function. In the general case of N active and S standby modules, the lifetime of the system is defined by a stage–type distribution. An active module has an exponential failure distribution with a constant failure rate λ. Assume that the modules in the standby state can fail at a rate µ (presuming 0 ≤ µ ≤ λ). Let X_i (1 ≤ i ≤ N) be a random variable denoting the lifetime of the active modules and let Y_j (1 ≤ j ≤ S) be a random variable denoting the lifetime of the standby modules. The system lifetime L is then,

L(m, N, S) = W(N, S) + L(m, N, S - 1),
W(N, S) = \min\left(X_1, X_2, \ldots, X_N;\; Y_1, Y_2, \ldots, Y_S\right),  (4.9)

where W(N, S) is the time to first failure among the N + S modules. After the removal of the failed module, the system has N active modules and S − 1 standby modules. As a result, the N + S − 1 remaining modules have not aged, by the memoryless exponential assumption, and therefore the system lifetime is,

L(m, N, S) = L(m, N, 0) + \sum_{i=1}^{S} W(N, i).  (4.10)

Here L(m, N, 0) = L(m, N) is the lifetime of the m–out–of–N system and is therefore a k–th order statistic with k = N − m + 1 [Kend77]. The distribution of L(m, N, 0) is an (N − m + 1)–phase Hypoexponential distribution with parameters N\lambda, (N-1)\lambda, \ldots, m\lambda. The distribution for the time to first failure W(N, i) has an exponential distribution with the parameter N\lambda + i\mu.
Using Theorem D.1 in Appendix D, the distribution L(m, N, S) has an (N + S − m + 1)–stage Hypoexponential distribution [Koba78], [Cox55], [Ash70] with parameters N\lambda + S\mu, N\lambda + (S-1)\mu, \ldots, N\lambda + \mu, N\lambda, (N-1)\lambda, \ldots, m\lambda. Let R_{m,(N,S)}(t) denote the reliability of such a system; then the reliability function is defined as,

R_{m,(N,S)}(t) = \sum_{i=1}^{S} a_i e^{-(N\lambda + i\mu)t} + \sum_{i=m}^{N} b_i e^{-i\lambda t},  (4.11)

where,

a_i = \prod_{\substack{j=1 \\ j \neq i}}^{S} \frac{N\lambda + j\mu}{(j - i)\mu} \prod_{j=m}^{N} \frac{j\lambda}{j\lambda - N\lambda - i\mu},  (4.12)

and,

b_i = \prod_{j=1}^{S} \frac{N\lambda + j\mu}{N\lambda + j\mu - i\lambda} \prod_{\substack{j=m \\ j \neq i}}^{N} \frac{j\lambda}{(j - i)\lambda}.  (4.13)

Defining the constant K = \lambda/\mu gives a new expression for the active and standby terms in the reliability equation Eq. (4.11) of,
a_i = (-1)^{N - m + i}\, \frac{(NK + S)!}{(NK)!\,(NK + i)\,(i - 1)!\,(S - i)!} \cdot \frac{N!\, K^{N - m + 1}}{(m - 1)!\; i \prod_{l=1}^{N - m} (lK + i)}.  (4.14)

A similar expression can be developed for,

b_i = (-1)^{i - m}\, \frac{(NK + S)!\,\left[(N - i)K\right]!}{(NK)!\,\left[(N - i)K + S\right]!} \cdot \frac{N!}{(m - 1)!\; i\,(i - m)!\,(N - i)!}.  (4.15)

An expectation value of the reliability function derived from a general stage–type distribution can be found using the Laplace transform [Cox55]. The Laplace transform of a stage–type random variable X is,
\mathcal{L}_X(s) = \sum_{i=1}^{r} \beta_1 \beta_2 \cdots \beta_{i-1}\, \gamma_i \prod_{j=1}^{i} \frac{\mu_j}{s + \mu_j},  (4.16)

where \gamma_i + \beta_i = 1 for 1 \leq i \leq r and \gamma_{r+1} = 1. Defining the Laplace transform of the system described in Eq. (4.9) gives,

\mathcal{L}_X(s) = \sum_{i=1}^{S} c^{i-1}(1 - c) \prod_{j=1}^{i} \frac{N\lambda + (S - j + 1)\mu}{s + N\lambda + (S - j + 1)\mu} + c^S \prod_{j=1}^{S} \frac{N\lambda + j\mu}{s + N\lambda + j\mu} \prod_{j=m}^{N} \frac{j\lambda}{s + j\lambda}.  (4.17)

By inverting the transformation in Eq. (4.17), an expression for the MTTF with imperfect coverage can be given as,

MTTF = E[X] = \sum_{i=1}^{S} c^{i-1}(1 - c) \sum_{j=1}^{i} \frac{1}{N\lambda + (S - j + 1)\mu} + c^S \left\{ \sum_{j=1}^{S} \frac{1}{N\lambda + j\mu} + \sum_{j=m}^{N} \frac{1}{j\lambda} \right\}.  (4.18)

The details of the above development are described in more detail in [Ing76], [Chan72], [King69], [Saat65], [Math70], [Triv82]. In the example described above, the system does not provide for repair. When repairable systems are analyzed in this manner, the number of stages becomes infinite. To deal with the infinite number of conditional probabilities, a different technique must be employed. The Markov Chain is just such a technique, capable of dealing with a system configuration of many modules, each with repairability. An additional caution should be noted. The assumption of statistical independence is questionable in the case of stage–type failure distributions. In addition, the fixed probability distribution associated with each failure in the stage–type model should be removed in the detailed analysis [Rams76].
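Eq. (4.18), as reconstructed here, can be sanity-checked in code: for N = 1 active module, S = 1 standby, and µ = 0, it must reduce to the two-module result (1 + c)/λ of Eq. (4.3). The function below is an illustrative sketch of that formula (parameter values are assumed), not code from the thesis.

```python
def mttf_standby(N, S, m, lam, mu, c):
    """MTTF per the reconstructed Eq. (4.18): N active and S standby modules,
    m-out-of-N required, active rate lam, standby rate mu, coverage c."""
    # Paths where the i-th fault is the first uncovered one (prob (1-c)c^(i-1)).
    uncovered = sum(
        (1 - c) * c ** (i - 1)
        * sum(1.0 / (N * lam + (S - j + 1) * mu) for j in range(1, i + 1))
        for i in range(1, S + 1)
    )
    # Path where all S standby switchovers are covered (prob c^S).
    covered = c ** S * (
        sum(1.0 / (N * lam + j * mu) for j in range(1, S + 1))
        + sum(1.0 / (j * lam) for j in range(m, N + 1))
    )
    return uncovered + covered

lam, c = 0.5, 0.75  # illustrative values
print(mttf_standby(1, 1, 1, lam, 0.0, c))  # 3.5
print((1 + c) / lam)                       # 3.5
```

The agreement with Eq. (4.3) in this degenerate case is a useful consistency check, though it does not validate the general N, S form.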
Chapter 5

MARKOV MODELS OF FAULT–TOLERANT SYSTEMS

A generalized modeling technique is required to deal with an arbitrary number of modules, failure events, and repair events in the analysis of Fault–Tolerant systems [Boss82]. Several techniques are available, including Petri Nets [Duga84], [Duga85], Fault Tree Analysis [Fuss76], Failure Mode and Effects Analysis [Mil1629], [Jame74], Event Tree Analysis [Gree82], and Hazard and Operability Studies [Lee80], [Robi78], [Smit85]. When system components are not independent, a state based analysis technique is needed which includes redundancy and repair [Biro86], [Guid86]. A Continuous Parameter Markov Chain is a method used to analyze systems that have state transitions that include repair processes [Hoel72], [Kend50], [Kend53].

A Markov Process is a stochastic process whose dynamic behavior is such that the probability distributions for its future behavior depend only on the present state and not on how the process arrived in that state [Mark07], [Fell67], [Issa76], [Chun76], [Kulk84]. To illustrate the principles of a Markov process, consider a system S described in Figure 3, which is changing over time in such a way that its state at any instant in time v can be described in terms of a finite dimensional vector X(t) [Triv74], [Triv75a], [Triv75]. Assume that the state of the system at any time t, for t > v, can be described by a predetermined function of the starting state at v and the ending time t:

X(t) = G\left[X(v), t\right].  (5.1)
Given a set of reasonable starting conditions and the continuity of the function G, a differential equation for X(t), describing the rate at which transitions between each state of the system take place, can be derived by expanding both sides of Eq. (5.1) in powers of t to give,

\frac{dX}{dt} = \mathbf{H}\left[X(t)\right].  (5.2)

Finite–dimensional deterministic systems described by the set of state vectors are equivalent to systems described by sets of ordinary differential equations [Bell60], [Brau67], [Beiz78], [Brue80]. This property will serve as the basis for the analysis of fault–tolerant systems that include repair. It will be assumed that the system described by the set of differential equations in Eq. (5.2) can exist in only one of a finite number of states [Keme60], [Koba78]. The transition from state i to state j in this system takes place with some random probability defined by,

p_{ij}(v, t) = P\left\{X(t) = j \mid X(v) = i\right\};\; t \geq v,\; i, j \in S.  (5.3)

Eq. (5.3) is the conditional pdf of the system state transitions and satisfies the relation,

\sum_{\forall j \in S} p_{ij}(v, t) = 1;\; 0 \leq v \leq t.  (5.4)

The unconditional pdf of the state transition vector X(t) is given by,

p_j(t) = P\left\{X(t) = j\right\},\; j = 1, 2, 3, \ldots  (5.5)

with,

\sum_{\forall j \in S} p_j(t) = 1,\; \forall t > 0,  (5.6)
since the process at any time t must be in a unique state.

An Absorbing Markov Process is one in which transitions have the following properties [Gave73],

§ There is at least one absorbing state,
§ From every state, it is possible to get to the absorbing state.

Figure 3 – State Transition probabilities as a function of time in the Continuous–Time Markov chain that is subject to the constraints of the Chapman–Kolmogorov equation.

The fundamental assumption of the Markov model is that the probability of a given state transition depends only on the current state of the system and not on any previous state. For continuous–time Markov processes, that is, those described by ordinary differential equations, the length of time already spent in the current state does not influence either the probability distribution of the next state or the probability distribution of the remaining time in the same state before the next transition. The Markov model fits with the standard assumption of the reliability models developed so far in this work, that the failure rates are constant, leading to an exponentially distributed state transition time for failures and a Poisson distribution for the occurrence of these failures.
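The state-probability dynamics described here can be sketched numerically. The fragment below (an illustrative example, not from the text) integrates dP/dt = QP by forward Euler for the smallest repairable machine — one up state and one down state, with assumed rates λ and µ — and shows that the probabilities stay normalized and settle to the steady-state availability µ/(λ + µ).

```python
lam, mu = 0.01, 0.1  # assumed failure and repair rates, illustrative only

# Generator matrix laid out as in the matrix form used later in this chapter:
# column j holds the flow out of state j (diagonal) and into the other state.
Q = [[-lam,  mu],
     [ lam, -mu]]

P = [1.0, 0.0]  # start in the "up" state
dt = 0.01
for _ in range(200_000):  # integrate out to t = 2000, far past the transient
    dP = [sum(Q[i][j] * P[j] for j in range(2)) for i in range(2)]
    P = [P[i] + dt * dP[i] for i in range(2)]

print(round(sum(P), 6))  # 1.0: total probability is conserved
print(round(P[0], 4))    # 0.9091 = mu/(lam + mu), the steady-state availability
```

Because each column of Q sums to zero, the Euler update conserves total probability exactly, mirroring Eq. (5.6).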
Solving the Markov Matrix

In order to describe a continuous–time Markov process using transition matrices, it is necessary to specify the entire family of stochastic matrices, \{P(t)\}. Only those matrices that meet certain conditions are useful in finding the solution to the final absorption state rate of the system described by the Markov Chain [Cour77]. Initial value problems involving systems of equations may be solved using the Laplace transform. The advantage of this technique over traditional methods (Elimination, Eigenvalue solutions, and the Fundamental Matrix [Pipe63], [Cour43]) is that satisfaction of initial values is automatically provided. No special techniques are needed to find particular solutions of the fundamental matrix, such as repeated eigenvalues [Lome88].

Chapman–Kolmogorov Equations

A set of differential equations describing the transitions between each state can be derived if the following conditions are met by the transition probability matrix [Bhar60], [Parz62], [Howa71]. These equations are the Chapman–Kolmogorov Equations and are defined as the transition probabilities of the Markov chain that satisfy Eq. (5.7) for all i and j, using Figure 3 as an example,

p_{ij}(v, t) = \sum_{k} p_{ik}(v, u) \cdot p_{kj}(u, t).  (5.7)

A simplified notation for the matrix elements defined in Eq. (5.7) can be created where the elements of each matrix are given by,

\mathbf{H}(v, t) = \mathbf{H}(v, u)\,\mathbf{H}(u, t),\; v \leq u \leq t,  (5.8)

and where,

\mathbf{H}(t, t) = \mathbf{I},  (5.9)
is the identity matrix. The Forward Chapman–Kolmogorov Equation is now defined as,

\frac{\partial \mathbf{H}(v, t)}{\partial t} = \mathbf{H}(v, t)\,\mathbf{Q}(t),\; v \leq t,  (5.10)

where the new matrix \mathbf{Q}(t) is defined as,

\mathbf{Q}(t) = \lim_{\Delta t \to 0} \frac{\mathbf{P}(t, t + \Delta t) - \mathbf{I}}{\Delta t},  (5.11)

with,

\Delta t = t - v.  (5.12)

The matrix \mathbf{Q}(t) is now defined as the transition rate matrix [Papo65a]. The elements of \mathbf{Q}(t) are q_{ij}(t) and are defined by,

q_{ii}(t) = \lim_{\Delta t \to 0} \frac{p_{ii}(t, t + \Delta t) - 1}{\Delta t},  (5.13)

and

q_{ij}(t) = \lim_{\Delta t \to 0} \frac{p_{ij}(t, t + \Delta t)}{\Delta t},\; i \neq j.  (5.14)

If the system at time t is in state i, then the probability that a transition occurs to any state other than state i during the time interval t + \Delta t is given by,

-q_{ii}(t)\,\Delta t + o(\Delta t),  (5.15)

where o(h) is any function of h that approaches zero faster than h, that is, \lim_{h \to 0} o(h)/h = 0. Eq. (5.13) is the rate at which the process departs state i when starting in state i.
Similarly, given that the system is in state i at time t, the conditional probability that it will make a transition from state i to state j in the time interval [t, t + \Delta t] is given by,

q_{ij}(t)\,\Delta t + o(\Delta t).  (5.16)

Eq. (5.14) is the rate at which the process moves from state i to state j given that the system is in state i. Since,

\sum_{j} p_{ij}(v, t) = 1,  (5.17)

Eq. (5.13) and Eq. (5.14) imply,

\sum_{j} q_{ij}(t) = 0,\; \forall i \in S.  (5.18)

Using these developments, the Backward Chapman–Kolmogorov equation is given by,

\frac{\partial \mathbf{H}(v, t)}{\partial v} = -\mathbf{Q}(v)\,\mathbf{H}(v, t),\; v \leq t.  (5.19)

The forward equation may be expressed in terms of its elements,

\frac{\partial p_{ij}(v, t)}{\partial t} = q_{jj}(t)\,p_{ij}(v, t) + \sum_{k \neq j} q_{kj}(t)\,p_{ik}(v, t).  (5.20)

The initial state i at the initial time v affects the solution of this set of differential equations only through the following conditions,

p_{ij}(v, v) = 1 if i = j, and 0 if i ≠ j.  (5.21)

The backward matrix equation may be expressed in terms of its elements,

\frac{\partial p_{ij}(v, t)}{\partial v} = -q_{ii}(v)\,p_{ij}(v, t) - \sum_{k \neq i} q_{ik}(v)\,p_{kj}(v, t),  (5.22)
with the initial conditions,

p_{ij}(t, t) = 1 if i = j, and 0 if i ≠ j.  (5.23)

Markov Matrix Notation

The expressions developed in the previous section can be represented by a transition probability matrix [Papo62] of the form,

\mathbf{P} = \left[p_{ij}\right] = \begin{bmatrix} p_{mn} & \cdots & p_{m1} & p_{m0} \\ \vdots & & \vdots & \vdots \\ p_{1n} & \cdots & p_{11} & p_{10} \\ p_{0n} & \cdots & p_{01} & p_{00} \end{bmatrix}.

The entries in this matrix satisfy two properties: 0 \leq p_{ij} \leq 1 and \sum_j p_{ij} = 1, which is a restatement of Eq. (5.17). The Transition Probability Matrix can also be represented by a directed graph [Maye72], [Deo74]. A node labeled i in the directed graph represents state i of the Markov Chain, and a branch labeled p_{ij} from node i to node j implies that the conditional probability P\{X_n = j \mid X_{n-1} = i\} = p_{ij} is met by the Markov Process represented by the directed graph. The transition probabilities represent a set of differential equations describing the rate at which the transitions take place between each node in the directed graph. The differential equations are then represented by a matrix structure of,
\begin{bmatrix} \frac{dP_n}{dt} \\ \vdots \\ \frac{dP_1}{dt} \\ \frac{dP_0}{dt} \end{bmatrix} = \begin{bmatrix} p_{mn} & \cdots & p_{m0} \\ \vdots & \ddots & \vdots \\ p_{1n} & \cdots & p_{10} \\ p_{0n} & \cdots & p_{00} \end{bmatrix} \begin{bmatrix} P_n \\ \vdots \\ P_1 \\ P_0 \end{bmatrix}.

The solution to this set of linear homogeneous differential equations can be derived by elimination using the Laplace transform method.

Laplace Transform Techniques

Given the sets of differential equations in Eq. (5.20) and Eq. (5.22), the Laplace transform can be used to generate solutions to these equations [Lome88]. One advantage of using the Laplace transform method is its ability to handle initial conditions automatically, without having first to find a general solution and then having to evaluate the integration constants. The Laplace transform is defined as,

F(s) = \int_0^\infty e^{-st} f(t)\,dt = \mathcal{L}\left\{f(t)\right\}.  (5.24)

The differential equation solution method depends on the following operational property of the Laplace transform [Krey72]. The Laplace transform of the derivative of a function is,

\mathcal{L}\left\{f'(t)\right\} = \int_0^\infty e^{-st} f'(t)\,dt = \lim_{b \to \infty}\left[e^{-st} f(t)\right]_0^b + s \int_0^\infty e^{-st} f(t)\,dt.  (5.25)

In the limit, the integral appearing on the right–hand side of Eq. (5.25) is \mathcal{L}\left\{f(t)\right\}, so that the first term in Eq. (5.25) can be evaluated in the following manner [McLac39],
\lim_{b \to \infty} e^{-sb} f(b) - e^{0} f(0).  (5.26)

Using the property of absolute values and limits [Arfk70], Eq. (5.26) can be rewritten as,

\lim_{b \to \infty} \left|e^{-sb} f(b)\right| \leq \lim_{b \to \infty} e^{-sb} \left|f(b)\right|.  (5.27)

The term f(b) is of the order e^{\alpha b} as b \to \infty. For b > T, using the definition of exponential order, Eq. (5.27) can be reevaluated as,

\lim_{b \to \infty} e^{-sb} \left|f(b)\right| \leq \lim_{b \to \infty} e^{-sb} M e^{\alpha b} = \lim_{b \to \infty} M e^{-(s - \alpha)b}.  (5.28)

The function f(b) is said to be of exponential order as b \to \infty if there exists a constant \alpha such that e^{-\alpha b}\left|f(b)\right| is bounded for all b greater than some T. If this statement is true, there also exists a constant M such that \left|f(b)\right| < M e^{\alpha b} for b > T.

Figure 4 – Definition of the exponential order of a function.

If s > \alpha, then s - \alpha > 0, giving,

\lim_{b \to \infty} M e^{-(s - \alpha)b} = 0,  (5.29)

so that in the limit,

\lim_{b \to \infty} e^{-sb} f(b) = 0,  (5.30)

giving the final form of the Laplace transform of a differential equation as,

\mathcal{L}\left\{f'(t)\right\} = s\,\mathcal{L}\left\{f(t)\right\} - f(0).  (5.31)
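Eq. (5.31) can be verified numerically; the sketch below (with f(t) = e^{−2t} as an assumed example function of exponential order) approximates both sides by a direct Riemann sum of the defining integral, Eq. (5.24).

```python
import math

s = 1.0
f = lambda t: math.exp(-2 * t)        # example function of exponential order
fp = lambda t: -2 * math.exp(-2 * t)  # its derivative

def laplace(g, s, dt=1e-4, T=40.0):
    """Riemann-sum approximation of the Laplace transform, Eq. (5.24)."""
    return sum(math.exp(-s * i * dt) * g(i * dt) * dt for i in range(int(T / dt)))

lhs = laplace(fp, s)            # L{f'}(s)
rhs = s * laplace(f, s) - f(0)  # s L{f}(s) - f(0), per Eq. (5.31)
print(round(lhs, 3), round(rhs, 3))  # both -0.667, i.e. -2/3
```

Here L{f}(s) = 1/(s + 2), so both sides equal s/(s + 2) − 1 = −2/3 at s = 1.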
The notation for the Laplace transform of the differential equation for the rate of arrival at the transition state i is then given by,

\mathcal{L}\left\{P_i(t)\right\} \Rightarrow P_i(s).  (5.32)

From this point on, this Laplace transform notation will be used in the solution of the Markov transition matrix differential equations. Using the expression R(t) = 1 - F(t) = P\left\{T \geq t\right\} to define the system reliability, where F(t) is the probability distribution function of the time to failure, a new random variable, Y, can be defined which represents the expected time to system failure. A notation can be defined such that f_Y(t) = -\frac{dR(t)}{dt} = \frac{dP_0(t)}{dt} is the failure density of the random variable Y. The Laplace transform of this failure density is denoted by \mathcal{L}\left\{f_Y(t)\right\} \Rightarrow \mathcal{L}_Y(s) = f_Y(s) = s P_0(s). In this work P_0(s) represents the absorbing state of the Markov model. By using the Laplace transform notation in the solution of differential equations, the inverse transform can be used to generate the failure density function for the random variable Y. Using Eq. (2.38), the failure density function can be integrated to produce the Mean Time to Failure, MTTF = E[Y] = \int_0^\infty t \left(-\frac{dR(t)}{dt}\right) dt. The inversion of the Laplace transform may be straightforward in some cases and more complex in others.

MODELING A DUPLEX SYSTEM

Duplex systems or Parallel Redundant systems have been utilized in electronic central office switching systems and other high–reliability systems for the past 35 years [Toy78]. Parallel redundant systems depend on fault detection and recovery for their proper operation. In most dual redundant architectures both systems are
monitored continuously, providing fault detection in the primary subsystem as well as the standby subsystem. This section describes the detailed development of the Markov model for a parallel redundant system with perfect diagnostic coverage. The failure rate of both subsystems is assumed to be a constant λ and the repair rate a constant µ. The system is considered failed when both subsystems have failed. The number of properly functioning subsystems is described by the state space S \Rightarrow \{2, 1, 0\}, where \{0\} is the failure state of the system. The state diagram for the system is shown in Figure 5.

Figure 5 – The state transition diagram for a Parallel Redundant system with repair. State \{2\} represents the fault free operation mode, State \{1\} represents a single fault with a return path to the fault free mode by a repair operation, and State \{0\} represents the system failure mode, the absorption state.

The initial state of the system is \{2\} and the initial conditions for the transition equations are,

P_2(0) = 1,\; P_1(0) = P_0(0) = 0.  (5.33)

Using the initial conditions, the system of differential equations derived from the transition matrix,
\begin{bmatrix} \frac{dP_2(t)}{dt} \\ \frac{dP_1(t)}{dt} \\ \frac{dP_0(t)}{dt} \end{bmatrix} = \begin{bmatrix} -2\lambda & \mu & 0 \\ 2\lambda & -(\lambda + \mu) & 0 \\ 0 & \lambda & 0 \end{bmatrix} \begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix},

are given by,

\frac{dP_2(t)}{dt} = -2\lambda P_2(t) + \mu P_1(t),
\frac{dP_1(t)}{dt} = 2\lambda P_2(t) - (\lambda + \mu) P_1(t),
\frac{dP_0(t)}{dt} = \lambda P_1(t).  (5.34)

Using the Laplace transform solution technique described in the previous section and in detail in [Doet61], [Widd46], [Lome88], [Rea78], and [Lath65] gives the following set of equations in Laplace form,

s P_2(s) - 1 = -2\lambda P_2(s) + \mu P_1(s),
s P_1(s) = 2\lambda P_2(s) - (\lambda + \mu) P_1(s),
s P_0(s) = \lambda P_1(s).  (5.35)

Solving Eq. (5.35)(a) for state \{2\} gives,

s P_2(s) + 2\lambda P_2(s) = \mu P_1(s) + 1,
(s + 2\lambda) P_2(s) = \mu P_1(s) + 1,
P_2(s) = \frac{\mu P_1(s) + 1}{s + 2\lambda},  (5.36)

and solving Eq. (5.35)(b) for state \{2\} gives,
s P_1(s) = 2\lambda P_2(s) - (\lambda + \mu) P_1(s),
(s + \lambda + \mu) P_1(s) = 2\lambda P_2(s),
P_2(s) = \frac{(s + \lambda + \mu)\, P_1(s)}{2\lambda}.  (5.37)

Equating Eq. (5.36) and Eq. (5.37), a solution representing state \{1\} can be derived, giving,

\frac{(s + \lambda + \mu)\, P_1(s)}{2\lambda} = \frac{\mu P_1(s) + 1}{s + 2\lambda}.  (5.38)

Multiplying each side by \frac{1}{P_1(s)} gives,

\frac{s + \lambda + \mu}{2\lambda} = \frac{\mu + \frac{1}{P_1(s)}}{s + 2\lambda},

which results in,

(s + 2\lambda)(s + \lambda + \mu)\, P_1(s) = 2\lambda + 2\lambda\mu\, P_1(s).  (5.39)

Solving Eq. (5.39) for state \{1\} gives,

P_1(s) = \frac{2\lambda}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu}.  (5.40)

Expanding and simplifying Eq. (5.40) gives,

P_1(s) = \frac{2\lambda}{s^2 + (3\lambda + \mu)s + 2\lambda^2}.  (5.41)

Substituting Eq. (5.41) into Eq. (5.35)(c) gives the solution to the final absorbing state \{0\} as,
s P_0(s) = \lambda P_1(s),
s P_0(s) = \lambda \left[\frac{2\lambda}{s^2 + (3\lambda + \mu)s + 2\lambda^2}\right],
P_0(s) = \frac{2\lambda^2}{s\left[s^2 + (3\lambda + \mu)s + 2\lambda^2\right]}.  (5.42)

After producing the inverse Laplace transform of Eq. (5.42)(c), the probability that no subsystems are operating at time t > 0 is the result. Let the random variable Y be the time to failure of the system and P_0(t) be the probability that the system has failed at or before time t. The reliability of the system is then defined by,

R(t) = 1 - P_0(t).  (5.43)

Using Eq. (2.37), the failure density function for the random variable Y is given by,

f_Y(t) = -\frac{dR}{dt} = \frac{dP_0(t)}{dt},  (5.44)

and using Eq. (5.31), its Laplace transform is given by,

\mathcal{L}_Y(s) = f_Y(s) = s P_0(s) - P_0(0) = \frac{2\lambda^2}{s^2 + (3\lambda + \mu)s + 2\lambda^2}.  (5.45)

Inverting Eq. (5.45) gives the failure density of Y as,

f_Y(t) = \frac{2\lambda^2}{\alpha_1 - \alpha_2}\left(e^{-\alpha_2 t} - e^{-\alpha_1 t}\right),  (5.46)

where,

\alpha_1, \alpha_2 = \frac{(3\lambda + \mu) \pm \sqrt{\lambda^2 + 6\lambda\mu + \mu^2}}{2}.  (5.47)
Using Eq. (2.28), the MTTF of the Parallel Redundant system with repair is given by,

E[Y] = \int_0^\infty y\, f_Y(y)\,dy = \frac{2\lambda^2}{\alpha_1 - \alpha_2}\left[\int_0^\infty y e^{-\alpha_2 y}\,dy - \int_0^\infty y e^{-\alpha_1 y}\,dy\right],
= \frac{2\lambda^2}{\alpha_1 - \alpha_2}\left[\frac{1}{\alpha_2^2} - \frac{1}{\alpha_1^2}\right],
= \frac{2\lambda^2\left(\alpha_1 + \alpha_2\right)}{\left(\alpha_1 \alpha_2\right)^2},
= \frac{2\lambda^2\left(3\lambda + \mu\right)}{4\lambda^4},
= \frac{3}{2\lambda} + \frac{\mu}{2\lambda^2}.  (5.48)

The MTTF of a two element Parallel Redundant system without repair (\mu = 0) would have been equal to the first term in Eq. (5.48). The effect of adding a repair facility to the system increases the mean life of the system by,

MTTF_{\text{as a result of Repair}} = \frac{\mu}{2\lambda^2},  (5.49)

or a factor of,

\frac{\mu}{2\lambda^2} \Big/ \frac{3}{2\lambda} = \frac{\mu}{3\lambda},  (5.50)

over a system without repair facilities.
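A useful cross-check of Eq. (5.48) that avoids the transform inversion is first-step analysis on the chain of Figure 5: from state {2} the mean time to absorption is T2 = 1/(2λ) + T1, and from state {1} it is T1 = 1/(λ + µ) + [µ/(λ + µ)]T2. The sketch below (the rates are assumed, illustrative values) solves this pair by fixed-point iteration.

```python
lam, mu = 0.001, 0.05  # assumed failure and repair rates, illustrative only

# First-step analysis on the chain of Figure 5, solved by fixed-point iteration:
# from {2} the next event is always a (covered) failure; from {1} the system is
# repaired back to {2} with probability mu/(lam + mu), else it is absorbed.
T1 = T2 = 0.0
for _ in range(10_000):
    T2 = 1 / (2 * lam) + T1
    T1 = 1 / (lam + mu) + mu / (lam + mu) * T2

print(round(T2, 3))                                 # mean time to absorption from {2}
print(round(3 / (2 * lam) + mu / (2 * lam**2), 3))  # Eq. (5.48); the two agree
```

With these rates the repair term µ/(2λ²) dominates, illustrating the factor-of-µ/(3λ) improvement of Eq. (5.50).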
MODELING A TRIPLE–REDUNDANT SYSTEM

A Triple Modular Redundant (TMR) system continues to operate correctly as long as two of the three subsystems are functioning properly. A second subsystem failure causes the system to fail. This model is referred to as 3–2–0. A second architecture is possible in which the system will continue to operate in the presence of two (2) subsystem failures; this system operates in simplex mode, 3–2–1–0. The 3–2–0 model without coverage will be developed in this section.

Figure 6 describes a TMR system with a constant failure rate λ and a constant repair rate µ. The repair activity takes place with a constant response time whenever a subsystem fails, giving a Markov transition matrix of,

\begin{bmatrix} \frac{dP_2(t)}{dt} \\ \frac{dP_1(t)}{dt} \\ \frac{dP_0(t)}{dt} \end{bmatrix} = \begin{bmatrix} -3\lambda & \mu & 0 \\ 3\lambda & -(2\lambda + \mu) & 0 \\ 0 & 2\lambda & 0 \end{bmatrix} \begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix}.  (5.51)

The set of differential equations derived from the transition matrix is given by,

\frac{dP_2(t)}{dt} = -3\lambda P_2(t) + \mu P_1(t),
\frac{dP_1(t)}{dt} = 3\lambda P_2(t) - (2\lambda + \mu) P_1(t),
\frac{dP_0(t)}{dt} = 2\lambda P_1(t).  (5.52)

Rewriting the differential equations in the Laplace transform format gives,
s P_2(s) - 1 = -3\lambda P_2(s) + \mu P_1(s),
s P_1(s) = 3\lambda P_2(s) - (2\lambda + \mu) P_1(s),
s P_0(s) = 2\lambda P_1(s).  (5.53)

Using Eq. (5.53)(a) to solve for state \{2\} gives,

s P_2(s) + 3\lambda P_2(s) = \mu P_1(s) + 1,
(s + 3\lambda) P_2(s) = \mu P_1(s) + 1,
P_2(s) = \frac{\mu P_1(s) + 1}{s + 3\lambda}.  (5.54)
Figure 6 – The transition diagram for a Triple Modular Redundant system with repair. State \{2\} represents the fault free (TMR) operation mode, State \{1\} represents a single fault (Duplex) operation mode with a return path to the fault free mode, and State \{0\} represents the system failure mode, the absorbing state.

Using Eq. (5.53)(b) to solve for state \{2\} again gives,

s P_1(s) = 3\lambda P_2(s) - (2\lambda + \mu) P_1(s),
(s + 2\lambda + \mu) P_1(s) = 3\lambda P_2(s),
P_2(s) = \frac{(s + 2\lambda + \mu)\, P_1(s)}{3\lambda}.  (5.55)

Equating Eq. (5.54) and Eq. (5.55) and solving for state \{1\} gives,

\frac{(s + 2\lambda + \mu)\, P_1(s)}{3\lambda} = \frac{\mu P_1(s) + 1}{s + 3\lambda},
P_1(s) = \frac{3\lambda}{(s + 2\lambda + \mu)(s + 3\lambda) - 3\lambda\mu}.  (5.56)

Simplifying Eq. (5.56)(b) gives,

P_1(s) = \frac{3\lambda}{s^2 + (5\lambda + \mu)s + 6\lambda^2}.  (5.57)
Substituting the solution for state \{1\}, Eq. (5.57), into Eq. (5.53)(c) gives the solution for the final absorbing state \{0\},

s P_0(s) = 2\lambda P_1(s) = 2\lambda \left[\frac{3\lambda}{s^2 + (5\lambda + \mu)s + 6\lambda^2}\right],
P_0(s) = \frac{6\lambda^2}{s\left[s^2 + (5\lambda + \mu)s + 6\lambda^2\right]}.  (5.58)

Expanding and factoring the denominator of Eq. (5.58)(b) gives the transform of the absorption state as,

P_0(s) = \frac{6\lambda^2}{s\left(s + \frac{1}{2}\left(5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)\right)\left(s + \frac{1}{2}\left(5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)\right)}.  (5.59)

Expanding the partial fractions of Eq. (5.59) and taking the inverse Laplace transform results in the following reliability function,

R(t) = \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}\, e^{-\frac{1}{2}\left(5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)t} - \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}\, e^{-\frac{1}{2}\left(5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)t}.  (5.60)

Integrating Eq. (5.60) using Eq. (2.24) produces the MTTF of,

MTTF = \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\left(5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)} - \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\left(5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)}.  (5.61)

Simplifying Eq. (5.61) gives the MTTF for a TMR system with repair as,
MTTF = \frac{5\lambda + \mu}{6\lambda^2}.  (5.62)

Rearranging Eq. (5.62) and isolating the repair term from the failure term gives,

MTTF = \frac{5}{6\lambda} + \frac{\mu}{6\lambda^2}.  (5.63)

MODELING A PARALLEL SYSTEM WITH IMPERFECT COVERAGE

A more realistic model of a Parallel Redundant System assumes that not all faults are recoverable and that the coverage factor c denotes the conditional probability that the system detects the fault and survives. The state diagram for this system is shown in Figure 7.
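The same first-step analysis used for the duplex system applies to the TMR chain of Figure 6: T2 = 1/(3λ) + T1 and T1 = 1/(2λ + µ) + [µ/(2λ + µ)]T2, which solves directly to Eq. (5.62). The sketch below (with assumed, illustrative rates) confirms the algebra numerically.

```python
lam, mu = 0.001, 0.05  # assumed failure and repair rates, illustrative only

# Solving T2 = 1/(3 lam) + T1 with T1 = 1/(2 lam + mu) + k*T2, k = mu/(2 lam + mu),
# gives T2 = (1/(3 lam) + 1/(2 lam + mu)) / (1 - k).
k = mu / (2 * lam + mu)
T2 = (1 / (3 * lam) + 1 / (2 * lam + mu)) / (1 - k)

print(round(T2, 2))                                 # mean time to absorption from {2}
print(round(5 / (6 * lam) + mu / (6 * lam**2), 2))  # Eq. (5.63); the two agree
```

Note the structural parallel with Eq. (5.48): the repair rate again enters only through a µ/λ² term added to the unrepaired MTTF of Eq. (3.14).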
Figure 7 – The transition diagram for a Parallel Redundant system with repair and imperfect fault coverage. State \{2\} represents the fault free mode, State \{1\} represents a single fault with a return path to the fault free mode by a repair operation, and State \{0\} represents the system failure mode. State \{0\} can be reached from State \{2\} through an uncovered fault, which causes the system to fail without the intermediate State \{1\} mode.

The transition matrix for Figure 7 is,

\begin{bmatrix} \frac{dP_2(t)}{dt} \\ \frac{dP_1(t)}{dt} \\ \frac{dP_0(t)}{dt} \end{bmatrix} = \begin{bmatrix} -\left(2c\lambda + 2(1 - c)\lambda\right) & \mu & 0 \\ 2c\lambda & -(\lambda + \mu) & 0 \\ 2(1 - c)\lambda & \lambda & 0 \end{bmatrix} \begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix}.  (5.64)

With an initial state of \{2\}, producing the set of starting conditions,

P_2(0) = 1,\; P_1(0) = P_0(0) = 0,
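The effect of imperfect coverage on the chain of Figure 7 can be previewed with the same first-step analysis used above: an uncovered fault (probability 1 − c) absorbs directly, so T2 = 1/(2λ) + cT1 with T1 = 1/(λ + µ) + [µ/(λ + µ)]T2. The closed form below is an illustrative sketch of that analysis (with assumed rates), not an equation from the text; for c = 1 it reduces to Eq. (5.48).

```python
lam, mu = 0.001, 0.05  # assumed failure and repair rates, illustrative only

def mttf(c):
    """First-step analysis of Figure 7: an uncovered fault (prob 1-c) is
    absorbed directly from state {2}, bypassing state {1}."""
    return (1 / (2 * lam) + c / (lam + mu)) / (1 - c * mu / (lam + mu))

for c in (1.0, 0.99, 0.90):
    print(c, round(mttf(c)))
# Even a 1% loss of coverage cuts the MTTF sharply, because uncovered faults
# defeat the repair loop entirely; c = 1 recovers the perfect-coverage duplex.
```

With these rates, a coverage of 0.99 already costs roughly a third of the mean life of the perfectly covered system — the central point this thesis develops.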
