FAULT–TOLERANT SYSTEM
RELIABILITY IN THE
PRESENCE OF IMPERFECT
DIAGNOSTIC COVERAGE
By
Glen B. Alleman
Irvine California, Copyright © 1980
Submitted in Partial Fulfillment
Of
Masters in Systems Management (MSSM)
University of Southern California
Los Angeles, California
June 1980
Revised and updated
Niwot Colorado, Copyright © 1996, 2000, 2014
The deployment of computer systems for the control of mission critical processes
has become the norm in many industrial and commercial markets. The analysis of
the reliability of these systems is usually understood in terms of the Mean Time to
Failure. The design and analysis of high reliability systems is now a mature science.
Starting with fault–tolerant central office switches (ESS4), dual redundant and
n–way redundant systems are now available in a variety of application domains. The
technologies of microprocessor based industrial controls and redundant central
processor systems create the opportunity to build fault–tolerant computing
systems on a much smaller scale than previously found in the commercial
marketplace.
The diagnostic facilities utilized in a modern Fault–Tolerant Computer System
attempt to detect fault conditions present in the hardware and embedded
software. Coverage is the figure of merit describing the effectiveness of the
diagnostic system. This thesis examines the effects of less than perfect diagnostic
coverage on system reliability. The mathematical background for analyzing the
coverage factor of fault–tolerant systems is presented in detail as well as specific
examples of practical systems and their relative reliability measures.
In a complex system, malfunction and even total nonfunction may not be
detected for long periods, if ever.
— John Gall
TABLE OF CONTENTS
INTRODUCTION
Fault Tolerant System Definitions
Fault–Tolerant System Functions
Overview of This Thesis
RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS
Deterministic Models
Probabilistic Models
Exponential and Poisson Relationships
Reliability Availability and Failure Density Functions
Mean Time to Failure
Mean Time to Repair
Mean Time Between Failure
Mean Time to First Failure
General Availability Analysis
Instantaneous Availability
Limiting Availability
SYSTEM RELIABILITY
Series Systems
Parallel Systems
M–of–N Systems
Selecting the Proper Evaluation Parameters
Imperfect Fault Coverage And Reliability
Redundant System with Imperfect Coverage
Generalized Imperfect Coverage
Markov Models Of Fault–Tolerant Systems
Solving the Markov Matrix
Chapman–Kolmogorov Equations
Markov Matrix Notation
Laplace Transform Techniques
Modeling a Duplex System
Modeling a Triple–Redundant System
Modeling a Parallel System with Imperfect Coverage
Modeling A TMR System with Imperfect Coverage
Modeling A Generalized TMR System
Laplace Transform Solution to Systems of Equations
Specific Solution to the Generalized System
PRACTICAL EFFECTS OF PARTIAL COVERAGE
Determining Coverage Factors
Coverage Measurement Statistics
Coverage Factor Measurement Assumptions
Coverage Measurement Sampling Method
Normal Population Statistics
Sample Size Computation
General Confidence Intervals
Proportion Statistics
Confidence Interval Estimate of the Proportion
Unknown Population Proportion
Clopper–Pearson Estimation
Practical Sample Estimates
Time Dependent Aspects of Fault Coverage Measurement
Common Cause Failure Effects
Square Root Bounding Problem
Beta Factor Model
Multi–Nominal Failure Rate (Shock Model)
Binomial Failure Rate Model
Multi–Dependent Failure Fraction Model
Basic Parameter Model
Multiple Greek Letter Model
Common Load Model
Nonidentical Components Model
Practical Example of Common Cause Failure Analysis
Common Cause Software Reliability
Software Reliability Concepts
Software Reliability and Fail–Safe Operations
PARTIAL FAULT COVERAGE SUMMARY
Effects of Coverage
REMAINING QUESTIONS
Realistic Probability Distributions
Multiple Failure Distributions
Weibull Distribution
Periodic Maintenance
Periodic Maintenance of Repairable Systems
Reliability Improvement for a TMR System
CONCLUSIONS
MARKOV CHAINS
Definition A.1
Definition A.2
Definition A.3
Theorem A.1
Proof of Theorem A.1
Lemma A.1
Theorem A.2
Proof of Theorem A.2
Theorem A.3
Proof of Theorem A.3
SOLUTIONS TO LINEAR SYSTEMS
Theorem B.1
Proof of Theorem B.1
PROBABILITY GENERATING FUNCTIONS
Definition C.1
Theorem C.1
Proof of Theorem C.1
POISSON PROCESSES
Definition D.1
Definition D.2
Definition D.3
Definition D.4
Definition D.5
Definition D.6
Theorem D.1
RENEWAL THEORY
Definition E.1
Theorem E.1
Proof of Theorem E.1
Theorem E.2
Proof of Theorem E.2
LAPLACE TRANSFORM GENERALIZED SOLUTION METHODS
Definition F.1
Definition F.2
Definition F.3
Definition F.4
LIST OF FIGURES
Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be
used to develop a set of time dependent metrics used to evaluate
various configurations.
Figure 2 – Assumptions regarding the behavior of a random process that
generates events following the Poisson probability distribution
function.
Figure 3 – State transition probabilities as a function of time in the Continuous–
Time Markov chain that is subject to the constraints of the Chapman–
Kolmogorov equation.
Figure 4 – Definition of the exponential order of a function.
Figure 5 – The state transition diagram for a Parallel Redundant system with
repair. State $\{2\}$ represents the fault free operation mode, State $\{1\}$
represents a single fault with a return path to the fault free mode by a
repair operation, and State $\{0\}$ represents the system failure mode, the
absorption state.
Figure 6 – The transition diagram for a Triple Modular Redundant system with
repair. State $\{2\}$ represents the fault free (TMR) operation mode, State
$\{1\}$ represents a single fault (Duplex) operation mode with a return
path to the fault free mode, and State $\{0\}$ represents the system failure
mode, the absorbing state.
Figure 7 – The transition diagram for a Parallel Redundant system with repair and
imperfect fault coverage. State $\{2\}$ represents the fault free mode, State
$\{1\}$ represents a single fault with a return path to the fault free mode by
a repair operation, and State $\{0\}$ represents the system failure mode.
State $\{0\}$ can be reached from State $\{2\}$ through an uncovered fault,
which causes the system to fail without the intermediate State $\{1\}$
mode.
Figure 8 – The state transition diagram for a Triple Modular Redundant system
with repair and imperfect fault coverage. State $\{3\}$ represents the fault
free mode, State $\{2\}$ represents the single fault (Duplex) mode, State
$\{1\}$ represents the two–fault (Simplex) mode, and State $\{0\}$ represents
the system failure mode.
Figure 9 – The state transition diagram for a Generalized Triple Modular
Redundant system with repair and perfect fault detection coverage.
The system initially operates in a fault free state $\{0\}$. A fault in any
module results in the transition to state $\{1, \ldots, N\}$. A second fault while
in state $\{1, \ldots, N\}$ results in the system failure state $\{N+1\}$.
Figure 10 – Sample size requirement for a specified estimate as tabulated by
Clopper and Pearson.
Figure 11 – Common Cause Failure modes guide figures for electronic
programmable system [HSE87]. These are ratios of non–CCF to CCF for
various system configurations. CCFs are defined as non–random faults
that are designed in or experienced through environmental damage to
the system. Other sources [SINT88], [SINT89] provide different
figures.
Figure 12 – Four Software Growth Model expressions. The exponential and
hyperexponential growth models represent software faults that are time
independent. The S–Shaped growth models represent time delayed and
time inflection software fault growth rates [Mats88].
Figure 13 – MTTF of Simplex, Parallel Redundant, and TMR Systems.
Figure 14 – MTTF of Parallel Redundant and TMR Systems with varying degrees
of coverage.
Figure 15 – Mean Time to Failure increases for a Triple Modular Redundant
system with periodic maintenance. This graph shows that maintenance
intervals which are greater than one–half of the mean time to failure for
one module have little effect on increasing reliability. But frequent
maintenance, even low quality maintenance, improves the system
reliability considerably.
ACKNOWLEDGMENTS
The author wishes to thank Dr. Wing Toy of AT&T Naperville Laboratories,
Naperville, Illinois, for his consultation on the ESS4 Central Office Switch and his
contributions to this work; Dr. Victor Lowe of Ford Aerospace, Newport Beach,
California, for his consultation on the general forms of Markov model solutions;
Mr. Henk Hinssen of Exxon Corporation, Antwerp, Belgium, for his discussion of
the effects of partial diagnostic coverage in Triple Modular Redundant Systems at
the Exxon Polystyrene Plant, Antwerp, Belgium; Dr. Phil Bennet of The Centre
for Software Engineering, Flixborough, England, for his ideas regarding software
reliability measurements in the presence of undetected faults; and Mr. Daniel Lelivre
of Factory Systems, Paris, France, for his comments and review of this work and
its applicability to safety critical systems at Total, Mobile, and NorSoLor chemical
plants.
Several institutions have contributed source material for this work including The
Foundation for Scientific and Industrial Research at the Norwegian Institute of
Technology (SINTEF), Trondheim, Norway and the United Kingdom Atomic
Energy Authority, Systems Reliability Service, Culcheth, Warrington, England.
This work is submitted as a Thesis in completion of a Master's Degree in Systems
Management, University of Southern California, 1980. It was extended in support
of the efforts that gained compliance of the Tricon with process safety standards
in the United States, Europe, and United Kingdom.
PREFACE
This work was extended in support of the design and development of the Triple
Modular Redundant (TMR) computer produced by Triconex Corporation of
Irvine, California. In 1987, Triconex designed and manufactured its first digital
TMR process control computer that was deployed in a variety of industrial
environments, including: turbine controls, boiler controls, fire and gas systems,
emergency shutdown systems, and general-purpose fault–tolerant real–time
control systems.
The Tricon (a classic 1980s product name) was based on several innovative
technologies. As the manager of software development for Triconex, I was
intimately involved in the software and hardware of the Tricon. In 1987, TMR
was not a completely new concept. Flight control systems and navigation
computers were found in aerospace applications. The Space Shuttle used a
TMR+1 computer system and was well understood by the public. What was new
to the market was an affordable TMR computer that could be deployed in a rugged
industrial environment. The heart of the Tricon was a hardware voting system that
performed a 2–out–of–3 vote for all digital input signals presented to the control
program. The contents of memory and the computed digital outputs were again
voted 2–out–of–3 at the physical output devices. Once the digital command had
been applied to the output device, its driven state was verified and the results
reported to the control program.
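The 2–out–of–3 vote described above can be illustrated with a minimal sketch. The Tricon performed this vote in hardware; the function names `vote_2oo3` and `disagreeing_channels`, and the choice to represent each channel's digital inputs as an integer bit mask, are assumptions made purely for illustration and are not taken from the Tricon design.

```python
def vote_2oo3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three channel words.

    Each output bit is set when at least two of the three input bits are set.
    """
    return (a & b) | (a & c) | (b & c)


def disagreeing_channels(a: int, b: int, c: int) -> list[str]:
    """Hypothetical helper: name the channels whose word differs from the vote."""
    voted = vote_2oo3(a, b, c)
    return [name for name, word in (("A", a), ("B", b), ("C", c)) if word != voted]


if __name__ == "__main__":
    # Channel B reports bit 2 stuck at 1; the vote masks the single fault.
    a, b, c = 0b1010, 0b1110, 0b1010
    print(bin(vote_2oo3(a, b, c)))        # 0b1010
    print(disagreeing_channels(a, b, c))  # ['B']
```

A single faulty channel is outvoted by the other two, and the disagreement itself is information a diagnostic routine could use to locate the fault.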
The Tricon contained 3 independent (but identical) 32–bit battery powered
microprocessors, a 2–out–of–3 voting digital serial bus connecting the three
processors, a dual redundant power system using DC–to–DC converters (state of
the art for 1987), and three separate isolated serial I/O buses connecting the I/O
subsystem to the three main processors. The I/O subsystem cards were
themselves TMR, using onboard 8–bit processors and a quad output device to
vote 2–out–of–3 the digital commands received from the control program.
The Tricon executed a control program on a periodic basis. The architecture of
the operating software was modeled after the programmable controllers of the
day, which were programmed in a ladder logic representing mechanical relays and
timers. Both digital and analog devices provided input and output to the control
program. The control program accepted input states from the I/O subsystem,
evaluated the decision logic and produced output commands, which were sent to
the I/O subsystem. This cycle was performed every 10 ms in a normally
configured system.
In the presence of faults, the key to the survivability of the Tricon was the
combination of TMR hardware and fault diagnostic software. Diagnostic
software was applied to each processor element and the digital I/O device. This
diagnostic software was capable of detecting all single stuck–at faults, many
multiple stuck–at faults as well as many transient faults. A fault–injection and
reliability evaluation technique developed by the author and described in this
work was used to evaluate the coverage factor of the diagnostic software.
Triconex no longer exists as an independent company, having been absorbed into
a larger control systems vendor. The materials presented in this work were critical
to the Tricon’s TÜV and SINTEF [SINTF89] certification for industrial safety
operations in the Norwegian sector of the North Sea, in Germany (then the Federal
Republic), in Belgium, and in Britain under the Health and Safety Executive (HSE).
The concept of fault–tolerant computing has become important again in the
distributed computing marketplace. The Tandem Non–Stop processor, modern
flight and navigation computers, and telecommunications computers all
depend on some form of diagnostics to initiate the fault detection and recovery
process. A recent systems architecture paper mentioned TMR, but without
sufficient attention to the underlying details [1]. The reissuing of this paper
addresses several gaps in the literature:
• The foundations of fault–tolerance and fault–tolerance modeling have faded
from the computer science literature. The underlying mathematics of
fault–tolerant systems presents a challenge for an industry focused on rapid
software development and short time to market.
• The fact that unreliable and untrustworthy software systems are
created by latent faults in both the hardware and software is poorly
understood in this age of Object–Oriented programming and plug and play
systems development.
• The Markov models presented in this work have general applicability to
distributed computer systems analysis and need to be restated. The
application of these models to distributed processing systems with
symmetric multi–processor computers is a reemerging science. With the
advent of high–availability computing systems, the foundations of these
systems need to be understood once again.
• The current crop of computer science practitioners has very little
understanding of the complexities and subtleties of the underlying hardware
and firmware that make up the diagnostic systems of modern computers,
their reliability models, and the mathematics of system modeling.
Glen B. Alleman
Niwot Colorado 80503
Updated, April 2000
1 “Attribute Based Architectural Styles,” Mark Klein and Rick Kazman, CMU/SEI–99–TR–022, Software
Engineering Institute, Carnegie Mellon University, October 1999.
Chapter 1
INTRODUCTION
Two approaches are available to increase the reliability of a digital computer
system: fault avoidance (fault intolerance) and fault tolerance [Aviz75]. Fault
avoidance results from conservative design techniques utilizing high–reliability
components, system burn–in, and careful design and testing processes. The goal
of fault avoidance is to reduce the possibility of a failure [Aviz84], [Rand75],
[Kim86], [Ozak88]. Any fault that does occur, however, results in system failure,
negating all prior efforts to increase system reliability [Litt75], [Low72].
Fault–tolerance provides the system with the ability to withstand a system fault,
maintain a safe state in the presence of a fault, and possibly continue to operate in
the presence of this fault.
FAULT TOLERANT SYSTEM DEFINITIONS
A set of consistent definitions is used here to avoid confusion with existing
definitions. These definitions are provided by the IFIP Working Group 10.4,
Reliable Computing and Fault–Tolerance [Aviz84], [Aviz82], [Ande82], [Robi82],
[Lapr84], [TUV86]:
• A Failure occurs when the system user perceives that a service resource has
ceased to deliver the expected results.
• An Error occurs when some part of a system resource assumes an undesired
state. Such a state is contrary to the specification of the resource or to the
expectation (requirement) of the user.
• A Fault is detected when either a failure of the resource occurs, or an error is
observed within the resource. The cause of the failure or error is said to be a
fault.
FAULT–TOLERANT SYSTEM FUNCTIONS
In fault–tolerant systems, hardware and software redundancy provides
information needed to negate the effects of a fault [Aviz67]. The design of fault–
tolerant systems involves the selection of a coordinated failure response
mechanism that follows four steps [Siew84], [Mell77], [Toy86]:
• Fault Detection
• Fault Location and Identification
• Fault Containment and Isolation
• Fault Masking
During the fault detection process, diagnostics are used to gather and analyze
information generated by the fault detection hardware and software. These
diagnostics determine the appropriate fault masking and fault recovery actions
[Euri84], [Rouq86], [Ossf80], [Gluc86], [John85], [John86], [Kirr86], [Chan70]. It
is the less than perfect operation of the Fault Detection, Location, and
Identification processes of the system that is examined in this work.
The reliability of the fault–tolerant system depends on the ability of the diagnostic
subsystem to correctly detect and analyze faults [Kirr87], [Gall81], [Cook73],
[Brue76], [Lamp82]. The measure of the correct operation of the diagnostic
subsystem is called the Coverage Factor. It is assumed in most fault–tolerant
product offerings that the diagnostic coverage factor is perfect, i.e. 100%. This
work addresses the question:
What is the reliability of the Fault–Tolerant system in the presence of less
than perfect coverage?
To answer this question, some background in the mathematics of reliability
theory is necessary.
Overview of This Thesis
The development of a reliability model of a Triple Modular Redundant (TMR)
system with imperfect diagnostic coverage is the goal of this work. Along the
way, the underlying mathematics for analyzing these models is developed. The
Markov Chain method will be the primary technique used to model the failure
and repair processes of the TMR system. The Laplace transform will be used to
solve the differential equations representing the transition probabilities between
the various states of the TMR system described by the Markov model.
The models developed for a TMR system with partial coverage can be applied to
actual systems. In order to make the models useful in the real world, a deeper
understanding of the diagnostic coverage and fault detection is presented. The
appendices provide the background for the Markov models as well as the
statistical process.
The mathematics of Markov Chains and the statistical processes that underlie
system faults and their repair processes can be applied to a variety of other
analytical problems, including system performance analysis. It is hoped the reader
will gain some appreciation of the complexity and beauty of modern systems as
well as the subtleties of their design and operation.
If the reader is interested in skipping to the end, Chapter 7 provides a summary
of the effects of partial coverage on various system configurations.
Chapter 2
RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS
When presented with the reliability figures for a computer system, the user must
often accept the stated value as factual and relevant and construct a comparison
matrix to determine the goodness of each product offering [Kraf81]. Difficulties
often arise through the definition and interpretation of the term reliability.
This chapter develops the necessary background for understanding the reliability
criteria defined by the manufacturers of computer equipment. Figure 1 lists the
criteria for defining system reliability [Siew82], [Ande72], [Ande79], [Ande81].
Deterministic Models
Survival of at least k component failures
Probabilistic Models
$z(t)$ – Hazard (failure rate) function
$R(t)$ – Reliability function
$\mu$ – Repair Rate
$A(t)$ – Availability function
Single Parameter Models
MTTF – Mean Time to Failure
MTTR – Mean Time to Repair
MTBF – Mean Time Between Failure
c – Coverage
Figure 1 – Evaluation Criteria defining System
Reliability. These criteria will be used to
develop a set of time dependent metrics used
to evaluate various configurations.
DETERMINISTIC MODELS
The simplest reliability model is a deterministic one, in which the minimum
number of component failures that can be tolerated without system failure is
taken as the figure of merit for the system.
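As a small illustration of the deterministic figure of merit, the sketch below counts the failures an M–of–N configuration survives. The helper name `tolerated_failures` and the mapping of Simplex, Parallel Redundant, and TMR onto (M, N) pairs are assumptions made for this example. The count also assumes every fault is correctly detected, which is exactly the assumption that imperfect coverage removes.

```python
def tolerated_failures(m: int, n: int) -> int:
    """Failures an M-of-N system survives: it needs m of its n modules working."""
    if not 1 <= m <= n:
        raise ValueError("require 1 <= m <= n")
    return n - m


if __name__ == "__main__":
    # Assumed (M, N) pairs for the configurations discussed in this work.
    configurations = {
        "Simplex": (1, 1),
        "Parallel Redundant": (1, 2),
        "TMR": (2, 3),
    }
    for name, (m, n) in configurations.items():
        print(f"{name}: survives {tolerated_failures(m, n)} failure(s)")
```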
Probabilistic Models
The failure rate of electronic and mechanical devices varies as a function of time.
This time dependent failure rate is defined by the hazard function, $z(t)$. The
hazard function is also referred to as the hazard rate or mortality rate. For
electronic components on the normal–life portion of their failure curve, the
failure rate is assumed to be a constant, $\lambda$, rather than a function of time.
The exponential probability distribution is the most common distribution
encountered in reliability models, since it accurately describes most life testing
aspects of electronic equipment [Kapu77]. The probability density function (pdf),
Cumulative Distribution Function (CDF), reliability function ($R(t)$), and hazard
(failure rate) function ($z(t)$) of the exponential distribution are expressed by the
following [Kend77]:
$$\text{pdf} = f(t) = \lambda e^{-\lambda t} \tag{2.1}$$
$$\text{CDF} = F(t) = 1 - e^{-\lambda t} \tag{2.2}$$
$$\text{Reliability} = R(t) = e^{-\lambda t} \tag{2.3}$$
$$\text{Hazard Function} = z(t) = \lambda \tag{2.4}$$
The failure rate parameter $\lambda$ describes the rate at which failures occur over time
[DoD82]. In the analysis that follows, the failure rate is assumed to be constant,
and is measured in failures per million hours. Although a time dependent failure rate
could be used for un–aged electronic components, the aging of the electronic
components can remove the traditional bathtub curve failure distribution. The
constant failure rate assumption is also extended to the firmware controlling the
diagnostics of the system [Bish86], [Knig86], [Kell88], [Ehre78], [Eckh75],
[Gmei79], [RTCA85].
15/196
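As an illustrative sketch, Eq. (2.1) through Eq. (2.4) can be evaluated directly. The
function names and the failure rate of 50 failures per million hours are assumptions
chosen for the example, not values taken from this work.

```python
import math


def exp_pdf(t, lam):
    """Eq. (2.1): f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)


def exp_cdf(t, lam):
    """Eq. (2.2): F(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)


def exp_reliability(t, lam):
    """Eq. (2.3): R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)


def exp_hazard(t, lam):
    """Eq. (2.4): z(t) = f(t) / R(t), which reduces to the constant lam."""
    return exp_pdf(t, lam) / exp_reliability(t, lam)


if __name__ == "__main__":
    lam = 50e-6  # hypothetical rate: 50 failures per million hours
    for hours in (1_000, 10_000, 100_000):
        print(hours, exp_reliability(hours, lam), exp_hazard(hours, lam))
```

The hazard value printed for each age is the same, which is the constant failure rate
assumption stated above.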
Exponential and Poisson Relationships
In modeling the reliability functions associated with actual equipment, several
simplifying assumptions must be made to render the resulting mathematics
tractable. These assumptions do not reduce the applicability of the resulting
models to real–world phenomena. One simplifying assumption is that the
random variables associated with the failure process have exponential probability
distributions.
The property of the exponential distribution that makes it easy to analyze is that it
does not decay with time. If the lifetime of a component is exponentially
distributed, after some amount of time in use, the item is assumed to be good as
new. Formally, this property states that the random variable $X$ is memoryless, if the
expression $P\{X > s + t \mid X > t\} = P\{X > s\}$ is valid for all $s, t \ge 0$ [Cram66],
[Ross83]. If the random variable $X$ is the lifetime of some item, then the
probability that the item is functional at time $s + t$, given that it survived to time
t, is the same as the initial probability that it was functional at time s. If the item is
functional at time t, then the distribution of the remaining amount of time that it
survives is the same as the original lifetime distribution. The item does not
remember that it has already been in use for a time t.
This property is equivalent to the expression

$$ \frac{P\{X > s + t,\ X > t\}}{P\{X > t\}} = P\{X > s\} $$

or $P\{X > s + t\} = P\{X > s\}\,P\{X > t\}$. Since the form of this expression is
satisfied when the random variable X is exponentially distributed (since
$e^{-\lambda (s + t)} = e^{-\lambda s} e^{-\lambda t}$), it follows that exponentially distributed random variables
are memoryless. The recognition of this property is vital to the understanding of the
models presented in this work. If the underlying failure process is not
memoryless, then the exponential distribution model is not valid.
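The memoryless property can be checked numerically. The sketch below is an
illustration only: the function names are hypothetical, and the Weibull lifetime
with shape 2 is an assumed counterexample introduced here to show a distribution
that does remember its age.

```python
import math


def exp_survival(t, lam):
    """P{X > t} for an exponentially distributed lifetime."""
    return math.exp(-lam * t)


def conditional_survival(survival, s, t):
    """P{X > s + t | X > t} = P{X > s + t} / P{X > t}."""
    return survival(s + t) / survival(t)


def weibull_survival(t, scale, shape):
    """P{X > t} for a Weibull lifetime (assumed here only for contrast)."""
    return math.exp(-((t / scale) ** shape))


if __name__ == "__main__":
    lam = 0.002  # hypothetical failure rate per hour
    s, t = 100.0, 400.0
    print(conditional_survival(lambda x: exp_survival(x, lam), s, t))
    print(exp_survival(s, lam))  # same value: the item is "as good as new"
```

For the exponential case the two printed values agree for any choice of s and t. For
the Weibull case with shape greater than one, the conditional survival probability
is smaller than the unconditional one, so the exponential model would not apply.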
The exponential probability distributions and the related Poisson processes used
in the reliability models are formally based on the assumptions shown in Figure 2
[Cox62], [Thor26].
• Failures occur completely randomly and are independent of any previous
failure. A single failure event does not provide any information regarding
the time of the next failure event.
• The probability of a failure during any interval of time $[0, t]$ is proportional
to the length of the interval, with a constant of proportionality $\lambda$. The
longer one waits the more likely it is a failure will occur.
Figure 2 – Assumptions regarding the
behavior of a random process that generated
events following the Poisson probability
distribution function.
An expression describing the random processes in Figure 2 results from the
Poisson Theorem which states that the probability of an event A occurring k times
in n trials is approximately [Papo65], [Pois37],

$$ \frac{n (n - 1) \cdots (n - k + 1)}{1 \cdot 2 \cdots k}\, p^k q^{n - k}, \tag{2.5} $$

where $p = P\{A\}$ is the probability of an event A occurring in a single trial and
$q = 1 - p$. This approximation is valid when $n \to \infty$, $p \to 0$ and the product $n \cdot p$
remains finite. It should be noted that a large number of different trials of
independent systems is needed for this condition to hold, rather than a large
number of repeated trials on the same system.
The Poisson Theorem can be simplified to the following approximation for the
probability of an event occurring k times in n trials [Kend77],
$$
\begin{aligned}
\frac{n!}{k!\,(n - k)!}\, p^k q^{n - k}
  &= \frac{n!}{k!\,(n - k)!} \left( \frac{np}{n} \right)^{k} \left( 1 - \frac{np}{n} \right)^{n - k}, \\
  &\approx \frac{\sqrt{2 \pi n}\; n^{n} e^{-n}}{k!\,\sqrt{2 \pi (n - k)}\; (n - k)^{n - k} e^{-(n - k)}}
     \left( \frac{np}{n} \right)^{k} \left( 1 - \frac{np}{n} \right)^{n - k}, \\
  &\approx \frac{(np)^{k}}{k!} \left( 1 - \frac{np}{n} \right)^{n - k}, \\
  &\approx \frac{(np)^{k}}{k!}\, e^{-np}.
\end{aligned}
\tag{2.6}
$$

The exponential and Poisson expressions are directly related. A detailed
understanding of this relationship will aid in the development of the analysis
that follows.
Using the Poisson assumptions described in Figure 2, the probability of n
failures prior to time t is,

$$ P\{N = n \mid T \le t\} = P_t(n). \tag{2.7} $$

From Eq. (2.7), the probability that no failures $(n = 0)$ occur between time t
and time $t + \Delta t$ is,

$$ P_{t + \Delta t}(0) = P_t(0) \left[ 1 - \lambda \Delta t \right], \tag{2.8} $$

where the term $\lambda = np$ describing the total number of failures is of moderate
magnitude [Fell67]. The probability that n failures occur between time t and
time $t + \Delta t$ is then,

$$ P_{t + \Delta t}(n) = P_t(n) \left[ 1 - \lambda \Delta t \right] + P_t(n - 1) \left[ \lambda \Delta t \right], \quad n > 0. \tag{2.9} $$
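The quality of the approximation in Eq. (2.6) can be seen by comparing the
binomial probability of Eq. (2.5) with its Poisson limit. This is a minimal sketch;
the function names and the values n = 10,000 and p = 0.0002 are assumptions chosen
so that the product np = 2 stays of moderate magnitude.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials, Eq. (2.5)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Poisson approximation exp(-np) * (np)^k / k!, Eq. (2.6)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


if __name__ == "__main__":
    n, p = 10_000, 0.0002  # hypothetical: many trials, rare event, np = 2
    for k in range(6):
        print(k, binomial_pmf(k, n, p), poisson_pmf(k, n * p))
```

With n large and p small the two columns agree to several decimal places, which is
the condition under which the Poisson model of the failure process is used.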
Using Eq. (2.9) and Eq. (2.8) and allowing $\Delta t \to 0$, a differential equation can
be constructed describing the rate at which failures occur between time t and
time $t + \Delta t$,

$$
\begin{aligned}
\frac{d}{dt} P_t(0) &= -\lambda P_t(0), \\
\frac{d}{dt} P_t(n) &= \lambda \left[ P_t(n - 1) - P_t(n) \right], \quad \text{for } n > 0,
\end{aligned}
\tag{2.10}
$$

with the initial conditions of,

$$ P_0(0) = 1, \qquad P_0(n) = 0 \ \text{ for } n > 0. \tag{2.11} $$

The unique solution to the differential equation in Eq. (2.10) is [Klie75],

$$ P_t(n) = \frac{(\lambda t)^{n} e^{-\lambda t}}{n!}, \quad n = 0, 1, 2, \ldots \tag{2.12} $$

which is the Poisson distribution defined in Eq. (2.6). Using Eq. (2.12) to define
a function $F(t)$ representing the probability that no failures have occurred as
of time t gives,

$$ F(t) = P_t\{n = 0\} = e^{-\lambda t}. \tag{2.13} $$

Strictly, the expression in Eq. (2.13) is the survival probability of the Poisson
failure process; the Cumulative Distribution Function, CDF, of the time to the
first failure is its complement, $1 - e^{-\lambda t}$ [Fell67]. By using
Eq. (2.19), the probability density function, pdf, of the Poisson process can
be given as,

$$ f(t) = \lambda e^{-\lambda t}, \tag{2.14} $$
which is the exponential probability distribution. [2] The following statement
describes the relationship between the Poisson and exponential expressions
[Cox65],
If the number of failures occurring over an interval of time is Poisson
distributed, then the time between failures is exponentially distributed.
2 This development of the pdf is very informal. Making use of the forward reference to construct an
expression is circular logic and would not be permitted in more formal circumstances. For the purposes of
this work, this type of behavior can be tolerated, since the purpose of this development is to get to the
results rather than dwell on the analysis process. This is a fundamental difference between mathematics
and engineering.
An alternative method of relating the exponential and Poisson expressions is
useful at this point. The functions defined in Eq. (2.1) and Eq. (2.2) are based
on the interchangeability of the pdf and the CDF for any defined probability
distribution. The Cumulative Distribution Function $F(x)$ of a random variable
X is defined as a function obeying the following relationship [Papo65],

$$ F(x) = P\{X \le x\}, \quad -\infty < x < \infty. \tag{2.15} $$

The probability density function $f(x)$ of a random variable X can be derived
from the CDF using the following [Dave70],

$$ f(x) = \frac{d}{dx} F(x). \tag{2.16} $$

The CDF can be obtained from the pdf by the following,

$$ F(x) = P\{X \le x\} = \int_{-\infty}^{x} f(t)\,dt, \quad -\infty < x < \infty. \tag{2.17} $$

Using Eq. (2.16) and Eq. (2.17), the CDF and pdf expressions for an exponential
distribution can be developed. If the time between failures is an
exponentially distributed random variable, the CDF is,

$$
F(t) =
\begin{cases}
1 - e^{-\lambda t}, & 0 \le t \le \infty, \\
0, & \text{otherwise.}
\end{cases}
\tag{2.18}
$$

The number of failures in the time interval $[0, t]$ is a Poisson distributed random
variable, and the time between failures has a probability density function of,

$$
f(t) = \frac{d}{dt} F(t) =
\begin{cases}
\lambda e^{-\lambda t}, & t > 0, \\
0, & \text{otherwise,}
\end{cases}
\tag{2.19}
$$

where t is a random variable denoting the time between failures.
Reliability, Availability, and Failure Density Functions
An expression for the reliability of a system can be developed using the following
technique. The probability of a failure as a function of time is defined as,

$$ P\{T \le t\} = F(t), \quad t \ge 0, \tag{2.20} $$

where T is a random variable denoting the failure time. $F(t)$ is a function
defining the probability that the system will fail by time t. $F(t)$ is also the
Cumulative Distribution Function (CDF) of the random variable T [Papo65]. The
probability that the system will perform as intended at a certain time t is defined
as the Reliability function and is defined as,

$$ R(t) = 1 - F(t) = P\{T \ge t\}. \tag{2.21} $$

If the random variable describing the time to failure T has a probability density
function $f(t)$ then using Eq. (2.21) the Reliability function is,

$$ R(t) = 1 - F(t) = 1 - \int_{-\infty}^{t} f(x)\,dx = \int_{t}^{\infty} f(x)\,dx. \tag{2.22} $$

Assuming the time to failure random variable T has an exponential distribution its
failure density defined by Eq. (2.19) is,
$$ f(t) = \lambda e^{-\lambda t}, \quad t \ge 0, \ \lambda \ge 0. \tag{2.23} $$

The resulting reliability function is then,

$$ R(t) = \int_{t}^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t}. \tag{2.24} $$

A function describing the rate at which a system fails as a function of time is
referred to as the Hazard function (Eq. (2.4)). Let T be a random variable
representing the service life remaining for a specified system. Let $F(x)$ be the
distribution function of T and let $f(x)$ be its probability density function. A
new function $z(x)$ termed the Hazard Function or the Conditional Failure Function
of T is given by

$$ z(x) = \frac{f(x)}{1 - F(x)}. $$

The function $z(x)\,dx$ is the conditional
probability that the item will fail between x and $x + dx$ given it has survived a
time T greater than x.
For a given hazard function $z(x)$ the corresponding distribution function is

$$ 1 - F(x) = \left( 1 - F(x_0) \right) \exp\left[ -\int_{x_0}^{x} z(y)\,dy \right], $$

where $x_0$ is an arbitrary value of x. In
a continuous time reliability model the hazard function is defined as the
instantaneous failure rate of the system [Kapu77],
$$
\begin{aligned}
z(t) &= \lim_{\Delta t \to 0} \frac{R(t) - R(t + \Delta t)}{\Delta t \cdot R(t)}, \\
     &= \frac{1}{R(t)} \left[ -\frac{d}{dt} R(t) \right], \\
     &= \frac{f(t)}{R(t)}, \\
     &= \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}}, \\
     &= \lambda.
\end{aligned}
\tag{2.25}
$$

The quantity $z(t)\,dt$ represents the probability that a system of age t will fail in
the small interval of time $[t, t + dt]$. The hazard function is an important
indicator of the change in the failure rate over the life of the system. For a system
with an exponential failure rate, the hazard function is constant as shown in
Eq. (2.25) and it is the only distribution that exhibits this property [Barl85]. Other
reliability distributions will be shown in later chapters that have variable hazard
rates.
If a system contains no redundancy – that is, every component must function
properly for the system to continue operation – and if component failures are
statistically independent, the system reliability function is the product of the
component reliabilities and follows an exponential probability distribution. The
failure rate of such a system is the sum of the failure rates of the individual
components,

$$ R_{sys}(t) = \prod_{i=1}^{n} R_i(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = \exp\left[ -\sum_{i=1}^{n} \lambda_i t \right]. \tag{2.26} $$

In most cases it is possible to repair or replace failed components and accurate
models of system reliability will consider this. As will be shown the repair activity
is not as easily modeled as the failure mechanisms.
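The series result in Eq. (2.26) can be illustrated with a short sketch. The three
component failure rates below are hypothetical values in failures per hour, and the
function names are assumptions made for the example.

```python
import math


def series_reliability(t, failure_rates):
    """Product of the component reliabilities exp(-lam_i * t), Eq. (2.26)."""
    reliability = 1.0
    for lam in failure_rates:
        reliability *= math.exp(-lam * t)
    return reliability


def series_failure_rate(failure_rates):
    """The system failure rate is the sum of the component failure rates."""
    return sum(failure_rates)


if __name__ == "__main__":
    rates = [10e-6, 25e-6, 5e-6]  # hypothetical component rates per hour
    lam_sys = series_failure_rate(rates)
    t = 10_000.0
    print(series_reliability(t, rates), math.exp(-lam_sys * t))
    print("system failure rate:", lam_sys)
```

Both expressions give the same reliability, which shows that a non-redundant system
of exponential components is itself exponential with the summed rate.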
For systems that can be repaired, a new measure of reliability can be defined,
The probability that the system is operational at time “t.”
This new measure is the Availability and is expressed as $A(t)$. Availability $A(t)$
differs from reliability $R(t)$ in that any number of system failures can occur
prior to time t but the system is considered available if those failures have been
repaired prior to time t.
For systems that can be repaired, it is assumed that the behavior of the repaired
system and the original system are identical from a failure standpoint. In general,
this is not true, as perfect renewal of the system configuration is not possible. The
terms Mean Time to First Failure and Mean Time to Second Failure now become
relevant.
Assuming a constant failure rate $\lambda$, a constant repair rate $\mu$, and identical failure
behaviors between the repaired system and the original system, the steady–state
system availability can be expressed as,

$$ A_{SS} = \frac{\mu}{\lambda + \mu}. \tag{2.27} $$

The expression in Eq. (2.27) is an approximation; the exact expression for the
availability with repair requires the solution of the appropriate Markov model,
which will be developed in a later chapter.
Mean Time to Failure
The Mean Time to Failure (MTTF) is the expected time to the first failure in a
population of identical systems, given a successful system startup at time $t = 0$.
The Cumulative Distribution function $F(x)$ in Eq. (2.15) and the probability
density function $f(x)$ in Eq. (2.16) characterize the behavior of the probability
distribution function of the underlying random failure process. These expressions
are in a continuous integral form and require the solution of integral equations to
produce a useable result. A concise parameter that describes the expected value
of the random process is useful for comparison of different reliability models.
This parameter is the Mean or Expected Value of the random variable denoted by
$E[X]$ and is defined by [Parz60], [Dave70],

$$ E[X] = \int_{-\infty}^{\infty} x f(x)\,dx. \tag{2.28} $$

The expression in Eq. (2.28) denotes the expected value of the continuous
function $f(x)$. It is important to note that this definition assumes $x f(x)$ is
integrable in the interval $(-\infty, \infty)$.
For an exponential probability density function of,

$$ f(x) = \lambda e^{-\lambda x}, \quad x > 0, \tag{2.29} $$

the mean or expected value of the exponential function is given by,

$$ E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \lambda \int_{0}^{\infty} x e^{-\lambda x}\,dx. \tag{2.30} $$

The evaluation of Eq. (2.30) can be done in a straightforward manner using the
Gamma function [Arfk70], which is defined as,

$$ \Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha - 1} e^{-x}\,dx, \quad \alpha > 0, \tag{2.31} $$

or alternately,

$$ \int_{0}^{\infty} x^{\alpha - 1} e^{-\lambda x}\,dx = \frac{\Gamma(\alpha)}{\lambda^{\alpha}}. \tag{2.32} $$

Rewriting the expression in Eq. (2.30) for the expected values as,
$$ E[X] = \frac{1}{\lambda} \int_{0}^{\infty} u e^{-u}\,du, \tag{2.33} $$

where substituting the variables,

$$ u = \lambda x \quad \text{and} \quad du = \lambda\,dx, \tag{2.34} $$

results in,

$$
\begin{aligned}
E[X] &= \frac{1}{\lambda} \int_{0}^{\infty} u e^{-u}\,du, \\
     &= \frac{1}{\lambda}\,\Gamma(2), \\
     &= \frac{1}{\lambda},
\end{aligned}
\tag{2.35}
$$

which is the MTTF for a simple system. Although this expression is useful for
simple systems, a general–purpose expression representing the MTTF is needed.
This function can be developed in the following manner.
Let X denote the lifetime of a system so that the reliability function is,

$$ R(t) = P\{X > t\}, \tag{2.36} $$

and the derivative of the reliability function which is also given in Eq. (2.21) and
Eq. (2.22) is again defined as,

$$ \frac{d}{dt} R(t) = -f(t). \tag{2.37} $$

The expression for the expected value or MTTF using Eq. (2.28) is given by:

$$ E[X] = \int_{0}^{\infty} t f(t)\,dt = -\int_{0}^{\infty} t \left( \frac{d}{dt} R(t) \right) dt. \tag{2.38} $$
The technique of integration by parts [Smai49], [Arfk70], shown in
Eq. (2.39),

$$ \int_{a}^{b} f(x) \left( \frac{d}{dx} g(x) \right) dx = f(x) g(x) \Big|_{a}^{b} - \int_{a}^{b} g(x) \left( \frac{d}{dx} f(x) \right) dx, \tag{2.39} $$

is used to evaluate Eq. (2.38). Integrating by parts gives the expected value as,

$$ E[X] = -t R(t) \Big|_{0}^{\infty} + \int_{0}^{\infty} R(t)\,dt. \tag{2.40} $$

Since $R(t)$ approaches zero faster than t approaches infinity, Eq. (2.40) can be
reduced to,

$$ E[X] = \int_{0}^{\infty} R(t)\,dt = MTTF, \tag{2.41} $$

which is the expression for the Mean Time to Failure for a general system
configuration. This direct relationship between MTTF and the system failure rate
is one reason the constant failure rate assumption is often made when the
supporting reliability data is scanty [Barl75]. Appendix G describes the analysis of
the variance for this distribution.
Using an exponential failure distribution implies two important behaviors for the
system,
• Since a used subsystem is stochastically as good as a new subsystem, a policy
of scheduled replacement of used subsystems which are known to still be
functioning, does not increase the lifetime of the system.
• In estimating the mean system life and reliability, data can be collected
consisting only of the number of hours of observed life and the number of
observed failures; the ages of the subsystems under observation are of no
concern.
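Eq. (2.41) applies to any reliability function, so it can be evaluated numerically
when no closed form is convenient. The sketch below uses a simple trapezoidal rule;
the function name, the integration limit, and the failure rate are assumptions made
for illustration.

```python
import math


def mttf_from_reliability(reliability, t_max, steps=100_000):
    """Trapezoidal estimate of the integral of R(t) over [0, t_max], Eq. (2.41).

    t_max must be large enough that R(t_max) is negligible.
    """
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt


if __name__ == "__main__":
    lam = 1.0e-3  # hypothetical failure rate per hour

    def exponential_reliability(t):
        return math.exp(-lam * t)

    # Integrate out to 20 mean lifetimes, where R(t) is about 2e-9.
    print(mttf_from_reliability(exponential_reliability, 20.0 / lam))
    print(1.0 / lam)  # closed form from Eq. (2.35)
```

For the exponential case the numerical value agrees with $1/\lambda$ from Eq. (2.35).
The same routine can be given the reliability function of a redundant configuration.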
Mean Time to Repair
The Mean Time to Repair (MTTR) is the expected time for the repair of a failed
system or subsystem. For exponential distributions, $MTTF = \frac{1}{\lambda}$ and
$MTTR = \frac{1}{\mu}$. The steady state availability $A_{SS}$ defined in Eq. (2.27) can be
rewritten in terms of these parameters,

$$ A_{SS} = \frac{MTTF}{MTTR + MTTF}. \tag{2.42} $$

Mean Time Between Failure
The Mean Time Between Failure (MTBF) is often mistakenly used in place of Mean
Time to Failure (MTTF). The MTBF is the mean time between failures in a system
with repair, and is derived from a combination of repair and failure processes.
The simplest approximation for MTBF is:

$$ MTBF = MTTF + MTTR. \tag{2.43} $$

In this work, it is assumed $MTTR \ll MTTF$ so that MTTF is used in place of
MTBF. The Mean Time to Failure is considered since in fault–tolerant systems
failure occurs only when the redundancy features of the system fail to function
properly. In the presence of perfect coverage and perfect repair the system should
operate continuously. Therefore, failure of the system implies total loss of system
capabilities.
Mean Time to First Failure
The Mean Time to Failure is defined as the expected time of the first failure in a
population of identical systems. This development depends on the assumption
that the failure rate is constant Eq. (2.25), exponentially distributed Eq. (2.14),
and the repair rate is constant, $\mu$. In the general case, these assumptions may not
be valid and the Mean Time to Failure (MTTF) is not equivalent to the Mean Time to
First Failure (MTFF).
By removing the exponential probability failure distribution restriction in
Eq. (2.29) a generalized expression for the first failure time can be derived.
Given a population of n subsystems each with a random variable
$X_i,\ i = 1, 2, \ldots, n$ and a continuous pdf of $f(x)$, the failure time for the $n^{th}$
subsystem is given by summing all the failure times prior to the $n^{th}$ failure,

$$ S_n = X_1 + X_2 + \cdots + X_n = \sum_{i=1}^{n} X_i. \tag{2.44} $$

If the random variables $\{X_1, X_2, \ldots, X_n\}$ are independent and identically
distributed, all with pdf’s of $f(x)$, the random process described by these
variables is referred to as an Ordinary Renewal Process [Cox62], [Ross70]. The details
of the Renewal Process are shown in Appendix E.
Given the random process described by Eq. (2.44) the distribution function of
$S_n$ is provided by convolving each individual distribution function $F(t)$. The
convolution of two functions is defined as [Brac65], [Papo65]:

$$ f(x) \otimes g(x) \equiv \int_{-\infty}^{\infty} f(u)\, g(x - u)\,du. \tag{2.45} $$

The resulting convolution function for the n+1 subsystem failure, with
$f(x) = \frac{d}{dx} F(x)$ from Eq. (2.16), is given by:

$$ F^{(n+1)}(t) = \int_{0}^{t} F^{(n)}(t - x)\, f(x)\,dx. \tag{2.46} $$

In renewal processes, the random variables are actually functions and can be
substituted in the reliability computations when:
$$ N(t) = n \iff S_n \le t \le S_{n+1}. \tag{2.47} $$

When the conditions in Eq. (2.47) are met, the probability of n renewals in the time
interval $[0, t]$ is given by,

$$
\begin{aligned}
P\{N(t) = n\} &= P\{S_n \le t \le S_{n+1}\}, \\
              &= P\{S_n \le t\} - P\{S_{n+1} \le t\}, \\
              &= F^{(n)}(t) - F^{(n+1)}(t).
\end{aligned}
\tag{2.48}
$$

The renewal function $H(t)$ can be defined as the average number of subsystem
failures and repairs as a function of time, and is given as,

$$ H(t) = E\left[ N(t) \right]. \tag{2.49} $$

Using Eq. (2.48) in the evaluation of Eq. (2.49) and Eq. (2.28) as the definition of
the expectation value, gives the following for the renewal function,

$$
\begin{aligned}
H(t) &= \sum_{n=0}^{\infty} n\, P\{N(t) = n\}, \\
     &= \sum_{n=0}^{\infty} n F^{(n)}(t) - \sum_{n=0}^{\infty} n F^{(n+1)}(t), \\
     &= \sum_{n=0}^{\infty} n F^{(n)}(t) - \sum_{n=1}^{\infty} (n - 1) F^{(n)}(t).
\end{aligned}
\tag{2.50}
$$

Simplifying Eq. (2.50) results in an expression for the renewal function of,

$$ H(t) = F(t) + \sum_{n=1}^{\infty} F^{(n+1)}(t). \tag{2.51} $$

The term $F^{(n+1)}$ is the convolution of $F^{(n)}$ and F which gives,

$$ F^{(n+1)}(t) = \int_{0}^{t} F^{(n)}(t - x)\, f(x)\,dx, \tag{2.52} $$

which results in the expression for the renewal function of,
$$ H(t) = F(t) + \sum_{n=1}^{\infty} \int_{0}^{t} F^{(n)}(t - x)\, f(x)\,dx. \tag{2.53} $$

Rearranging the integral term in Eq. (2.53) gives,

$$ H(t) = F(t) + \int_{0}^{t} \left[ \sum_{n=1}^{\infty} F^{(n)}(t - x) \right] f(x)\,dx. \tag{2.54} $$

The summation term in Eq. (2.54) is the renewal function evaluated at $t - x$, the
sum over the distributions of the $n^{th}$ failure, giving,

$$ H(t) = F(t) + \int_{0}^{t} H(t - x)\, f(x)\,dx. \tag{2.55} $$

Using Eq. (2.16), the renewal density function $h(t)$ is the derivative of the
distribution function, giving,

$$ h(t) = \frac{d}{dt} H(t). \tag{2.56} $$

Using Eq. (2.50) to evaluate the derivative results in,

$$ h(t) = \sum_{n=1}^{\infty} f^{(n)}(t), \tag{2.57} $$

and using Eq. (2.54) as a substitute for the right–hand side of Eq. (2.57) results in,

$$ h(t) = f(t) + \int_{0}^{t} h(t - x)\, f(x)\,dx. \tag{2.58} $$

Eq. (2.58) is known as the Renewal Equation [Ross70]. To solve the renewal
equation, the Laplace transform will be used. The transform of the probability
density function is,
$$ \mathcal{L}\{f(s)\} = \int_{0}^{\infty} e^{-sx} f(x)\,dx, \tag{2.59} $$

and the transform of the renewal function is,

$$ \mathcal{L}\{h(s)\} = \int_{0}^{\infty} e^{-sx} h(x)\,dx. \tag{2.60} $$

Using the convolution property of the Laplace transform [Brac65], an equation
for the renewal distribution can be generated,

$$ \mathcal{L}\{h(s)\} = \mathcal{L}\{f(s)\} + \mathcal{L}\{h(s)\}\,\mathcal{L}\{f(s)\}, \tag{2.61} $$

and simplified to,

$$ \mathcal{L}\{h(s)\} = \frac{\mathcal{L}\{f(s)\}}{1 - \mathcal{L}\{f(s)\}}. \tag{2.62} $$

Eq. (2.62) is now the generalized expression for the failure distribution for a
random process with an arbitrary probability distribution.
General Availability Analysis
The steady state system availability defined in Eq. (2.42) assumes an exponential
distribution for the failure rate of the system or subsystems. An important activity
in the analysis of Fault–Tolerant systems is the development of a general–
purpose availability expression, independent of the underlying failure distribution.
In the analysis that follows, it will be assumed that when a subsystem fails it is
repaired and the system restored to its functioning state. It will also be assumed
that the restored system functions as if it were new, that is with the failure
probability function restarted at $t = 0$.
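Under the as-new restoration assumption, the Renewal Equation, Eq. (2.58), can
also be solved directly on a time grid instead of through the Laplace transform.
The sketch below is one possible discretization using the trapezoidal rule; the
function name and step sizes are assumptions. For an exponential density,
Eq. (2.62) gives $\mathcal{L}\{h(s)\} = \lambda / s$, so the renewal density should be
the constant $\lambda$, which provides a check on the numerical solution.

```python
import math


def solve_renewal_density(pdf, t_max, steps):
    """Solve h(t) = f(t) + integral_0^t h(t - x) f(x) dx on a uniform grid.

    The convolution integral is approximated with the trapezoidal rule.
    Returns the list [h(0), h(dt), ..., h(t_max)].
    """
    dt = t_max / steps
    f = [pdf(i * dt) for i in range(steps + 1)]
    h = [f[0]]  # at t = 0 the integral vanishes, so h(0) = f(0)
    for j in range(1, steps + 1):
        interior = sum(h[j - i] * f[i] for i in range(1, j))
        rhs = f[j] + dt * (interior + 0.5 * h[0] * f[j])
        # The x = 0 endpoint contributes 0.5 * dt * h(t_j) * f(0); it contains
        # the unknown h(t_j), so it is moved to the left-hand side.
        h.append(rhs / (1.0 - 0.5 * dt * f[0]))
    return h


if __name__ == "__main__":
    lam = 2.0  # hypothetical failure rate
    h = solve_renewal_density(lambda t: lam * math.exp(-lam * t), 1.0, 200)
    print(h[0], h[100], h[200])  # each value should be close to lam
```

The same routine accepts a non-exponential density, for which the renewal density
is no longer constant.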
Let $T_i$ be the duration of the ith functioning period and let $D_i$ be the system
downtime because of the failure of the system while the ith repair takes place.
These durations will form the basis of the renewal process.
By combining the subsystem failure interval and the subsystem repair duration, a
random variable sequence is constructed such that,

$$ X_i = T_i + D_i; \quad i = 1, 2, \ldots \tag{2.63} $$

It must be assumed that the durations of the functioning subsystems are identically
distributed with a common Cumulative Distribution Function $W(t)$ and a common
probability density function $w(t)$ and that the repair periods are also identically
distributed with $G(t)$ and $g(t)$. Using these assumptions the terms in Eq. (2.63)
are also identically distributed such that,

$$ \{X_i,\ i = 1, 2, \ldots\} \tag{2.64} $$

meets the definition of a Renewal process developed in Eq. (2.44). Using this
development an expression for the convolution of the two independent random
processes is given by,

$$ \mathcal{L}\{f(s)\} = \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}. \tag{2.65} $$

Using Eq. (2.62) gives,

$$ \mathcal{L}\{h(s)\} = \frac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}. \tag{2.66} $$

The average number of repairs $M(t)$ in the time interval $(0, t]$ has the Laplace
transform:
$$ \mathcal{L}\{M(s)\} = \frac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{s \left[ 1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\} \right]}. \tag{2.67} $$

Instantaneous Availability
The steady state availability defined in Eq. (2.42) can now be replaced with the
instantaneous availability $A(t)$. In the absence of a repair mechanism the
availability $A(t)$ is equivalent to the reliability, $R(t) = 1 - F(t)$, of the
subsystem.
The subsystem may be functioning at time t because of two mutually exclusive
reasons,
• The subsystem has not failed from the beginning.
• The last renewal occurred within the time period $(0, t]$ and the subsystem
continued to function since that time.
The probability associated with the second case is the convolution of the
reliability function and the renewal density, giving,

$$ \int_{0}^{t} R(t - x)\, h(x)\,dx, \tag{2.68} $$

which results in an expression for the instantaneous availability of,

$$ A(t) = R(t) + \int_{0}^{t} R(t - x)\, h(x)\,dx. \tag{2.69} $$

Taking the Laplace transform of both sides of Eq. (2.69) gives,

$$
\begin{aligned}
\mathcal{L}\{A(s)\} &= \mathcal{L}\{R(s)\} + \mathcal{L}\{R(s)\}\,\mathcal{L}\{h(s)\}, \\
                    &= \mathcal{L}\{R(s)\} \left[ 1 + \mathcal{L}\{h(s)\} \right], \\
                    &= \mathcal{L}\{R(s)\} \left[ 1 + \frac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}} \right].
\end{aligned}
\tag{2.70}
$$
Since the reliability of the system is given as $R(t) = 1 - W(t)$,

$$
\begin{aligned}
\mathcal{L}\{R(s)\} &= \frac{1}{s} - \mathcal{L}\{W(s)\}, \\
                    &= \frac{1}{s} - \frac{\mathcal{L}\{w(s)\}}{s} = \frac{1 - \mathcal{L}\{w(s)\}}{s}.
\end{aligned}
\tag{2.71}
$$

Substituting gives,

$$ \mathcal{L}\{A(s)\} = \frac{1 - \mathcal{L}\{w(s)\}}{s \left[ 1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\} \right]}. \tag{2.72} $$

Given the failure–time distribution and the repair–time distribution, Eq. (2.72) can
be used to compute the instantaneous availability as a function of time.
Limiting Availability
An important question to ask is – what is the availability of the system after some long
period of time? The limiting availability $A(t)$ as $t \to \infty$ is defined as A or simply
the Availability.
To derive an expression for the limiting availability the Final Value Theorem of
Laplace transform can be used [Doet61], [Widd46], [Brac65], [Ogat70], [Gupt66].
This theorem states that the steady state behavior of $f(t)$ is the same as the
behavior of $sF(s)$ in the neighborhood of $s = 0$. Thus it is possible to obtain the
value of $f(t)$ as $t \to \infty$.
Let,

$$ F(t) = \int_{0^-}^{t} f(x)\,dx + F(0^-), \tag{2.73} $$

then using a table of Laplace transforms [Doet61], [Brac65],
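$$ s\,\mathcal{L}\{F(s)\} - F(0^-) = \mathcal{L}\{f(s)\} = \int_{0^-}^{\infty} e^{-st} f(t)\,dt, \tag{2.74} $$

and by letting $s \to 0$,

$$
\begin{aligned}
\lim_{s \to 0} s\,\mathcal{L}\{F(s)\} &= \int_{0^-}^{\infty} f(t)\,dt + F(0^-), \\
  &= \lim_{t \to \infty} \left[ \int_{0^-}^{t} f(x)\,dx + F(0^-) \right], \\
  &= \lim_{t \to \infty} F(t).
\end{aligned}
\tag{2.75}
$$

The Limiting availability is then given as,

$$ A = \lim_{t \to \infty} A(t) = \lim_{s \to 0} s\,\mathcal{L}\{A(s)\}. \tag{2.76} $$

For small values of s the following approximations can be made [Apos74],

$$ e^{-st} \cong 1 - st, \tag{2.77} $$

giving,

$$
\begin{aligned}
\mathcal{L}\{w(s)\} &= \int_{0}^{\infty} e^{-st} w(t)\,dt, \\
  &\cong \int_{0}^{\infty} w(t)\,dt - s \int_{0}^{\infty} t\,w(t)\,dt, \\
  &= 1 - \frac{s}{\lambda},
\end{aligned}
\tag{2.78}
$$

where $MTTF = \frac{1}{\lambda}$ and,

$$ \mathcal{L}\{g(s)\} = 1 - \frac{s}{\mu}, \tag{2.79} $$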
and where $MTTR = \frac{1}{\mu}$ giving the limiting availability as,

$$
A = \lim_{s \to 0} \left[ \frac{1 - \left( 1 - \frac{s}{\lambda} \right)}{1 - \left( 1 - \frac{s}{\lambda} \right) \left( 1 - \frac{s}{\mu} \right)} \right]
  = \frac{\frac{1}{\lambda}}{\frac{1}{\lambda} + \frac{1}{\mu}}
  = \frac{MTTF}{MTTF + MTTR}.
\tag{2.80}
$$

Eq. (2.80) is an important result in the analysis of system reliability, because it
shows that the limiting availability depends only on the Mean Time to Failure and
the Mean Time to Repair and not on the underlying distributions of the failure and
repair times.
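The two availability results can be illustrated together. For exponential failure
and repair times, substituting $\mathcal{L}\{w(s)\} = \lambda/(s+\lambda)$ and
$\mathcal{L}\{g(s)\} = \mu/(s+\mu)$ into Eq. (2.72) and inverting gives
$A(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} e^{-(\lambda+\mu)t}$, which
decays to Eq. (2.80). The sketch below evaluates that closed form and also runs a
small simulation with non-exponential times to show the distribution independence
of the limit. The function names, the uniform uptime and fixed repair time, and all
numeric values are assumptions made for the example.

```python
import math
import random


def availability_exponential(t, lam, mu):
    """A(t) for exponential failure (rate lam) and repair (rate mu) times."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def limiting_availability(mttf, mttr):
    """Eq. (2.80): the limit depends only on the two means."""
    return mttf / (mttf + mttr)


def simulated_availability(cycles, draw_uptime, draw_downtime):
    """Long-run fraction of time spent up over alternating up/down periods."""
    up = sum(draw_uptime() for _ in range(cycles))
    down = sum(draw_downtime() for _ in range(cycles))
    return up / (up + down)


if __name__ == "__main__":
    lam, mu = 1.0 / 1000.0, 1.0 / 8.0  # hypothetical: MTTF 1000 h, MTTR 8 h
    for t in (0.0, 8.0, 80.0, 800.0):
        print(t, availability_exponential(t, lam, mu))
    print(limiting_availability(1000.0, 8.0))

    # Non-exponential case: uniform uptimes (mean 100 h), fixed 5 h repairs.
    random.seed(1)
    estimate = simulated_availability(
        20_000, lambda: random.uniform(50.0, 150.0), lambda: 5.0
    )
    print(estimate, limiting_availability(100.0, 5.0))
```

The simulated fraction of uptime approaches MTTF / (MTTF + MTTR) even though
neither distribution is exponential.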
Chapter 3
SYSTEM RELIABILITY
This chapter provides the basis for the computation of the overall system
reliability given a redundant architecture with partial fault detection coverage.
Redundant systems can be modeled under a variety of operational assumptions. Of
most interest in this work are dual and triple redundant systems that contain
repair facilities.
Series Systems
Creating a reliable system often involves a series or parallel combination of
independent systems or subsystems. If $R_i(t)$ is the reliability of module i and all
the modules are statistically independent, then the overall system reliability of
modules connected in series is,

$$ R_{series}(t) = \prod_{i} R_i(t). \tag{3.1} $$

For a series system the failure probability $F_{series}$ is given by,

$$
\begin{aligned}
F_{series}(t) &= 1 - R_{series}(t) = 1 - \prod_{i=1}^{n} R_i(t), \\
              &= 1 - \prod_{i=1}^{n} \left( 1 - F_i(t) \right).
\end{aligned}
\tag{3.2}
$$

Expanding Eq. (3.1) will illustrate an aspect of the exponential distribution. For a
system of n subsystems connected in series the reliability of the system is given by
Eq. (3.1). If a general purpose hazard function is used for the failure rate
[Shoo68] defined by,

$$ h_i(t) = \lambda_i + c_i t^{k}, \tag{3.3} $$
where $\lambda_i$, $c_i$, and k are constants, then the reliability function for the individual
subsystem is given by,

$$ R_i(t) = \exp\left[ -\left( \lambda_i t + c_i \frac{t^{k+1}}{k + 1} \right) \right], \tag{3.4} $$

and the reliability function for the system is given by,

$$ R_{series}(t) = \exp\left[ -\left( \sum_{i=1}^{n} \lambda_i t + \sum_{i=1}^{n} c_i \frac{t^{k+1}}{k + 1} \right) \right]. \tag{3.5} $$

Defining two new terms for the summation of the failure rate and a new term for
the time constant adjustment gives, $\lambda^{*} = \sum_{i=1}^{n} \lambda_i$, $c^{*} = \sum_{i=1}^{n} c_i$, and $T = \lambda^{*} t$, and results
in the series reliability expression of,

$$ R_{series}(t) = \exp\left[ -\left( T + \left( \frac{c^{*}}{(k + 1)\,\lambda^{*}} \right) \left( \frac{T^{k+1}}{(\lambda^{*})^{k}} \right) \right) \right]. \tag{3.6} $$

As the number of subsystems grows large $(\lambda^{*} \to \infty)$, the term $\frac{c^{*}}{(k + 1)\,\lambda^{*}}$ is
bounded and the expression for the system reliability becomes,

$$ \lim_{n \to \infty} R_{series}(t) = e^{-T} = e^{-\lambda^{*} t}. \tag{3.7} $$

Eq. (3.7) defines the failure distribution of the system as the number of
subsystems grows without bound. This implies that a large complex system will
tend to follow exponential distribution failure models regardless of the internal
organization of the subsystems.
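The convergence stated in Eq. (3.7) can be observed numerically. The sketch below
assumes n identical subsystems with hypothetical constants $\lambda_i$, $c_i$ and $k$,
and evaluates the series reliability at a fixed normalized time T; the function
names are illustrative.

```python
import math


def subsystem_reliability(t, lam, c, k):
    """Eq. (3.4): R_i(t) for the hazard h_i(t) = lam + c * t**k."""
    return math.exp(-(lam * t + c * t ** (k + 1) / (k + 1)))


def series_reliability(t, lams, cs, k):
    """Eq. (3.5): closed form for subsystems connected in series."""
    return math.exp(-(sum(lams) * t + sum(cs) * t ** (k + 1) / (k + 1)))


def series_reliability_at_fixed_T(n, lam, c, k, T):
    """Series reliability of n identical subsystems at t = T / (n * lam)."""
    t = T / (n * lam)
    return series_reliability(t, [lam] * n, [c] * n, k)


if __name__ == "__main__":
    lam, c, k = 1.0e-4, 1.0e-8, 1  # hypothetical subsystem constants
    for n in (1, 10, 100, 1000):
        print(n, series_reliability_at_fixed_T(n, lam, c, k, T=1.0))
    print("exp(-T):", math.exp(-1.0))
```

As n grows, the time dependent part of the hazard contributes less at a given T
and the printed reliability approaches $e^{-T}$.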
Parallel Systems
In a parallel redundant configuration, the system fails only if all modules fail. The
probability of a system failure in a parallel system given by,
(3.8)
The system reliability for a parallel system is given by,
(3.9)
M–of–N Systems
An M–of–N system is a generalized form the parallel system. Instead of requiring
only one of the N modules of the system to remain functional, M modules are
required. The system of interest in this work is a Triple Modular Redundant (TMR)
configuration in which two of the three modules must function for the system to
operate properly [Lyons 62], [Kuehn 69]. [3]
For a given module reliability of
the TMR reliability is given by,
(3.10)
In Eq. (3.10) all working states are enumerated. The term represents that
state in which all three modules are functional. The term
3 In practical TMR systems, a simplex mode is allowed, which usually places the system in a shutdown mode,
allowing the controlled process to be safely stopped.
( ) ( )
1
1 .
n
iparallel
i
F t F t
=
= -Õ
( ) ( ) ( )
( )( )
1
1
1 1 ,
1 1 .
n
iparallel parallel
i
n
i
i
R t F t F t
R t
=
=
= - = -
= - -
Õ
Õ
mR
( )3 2
3
1 .
2tm r m m mR R R R
æ ö
= + -ç ÷
è ø
3
mR
( )2
3
1
2 m mR R
æ ö
-ç ÷
è ø
40/196
represents the three states in which any one module has failed and the two states
in which a module is functional.
Selecting the Proper Evaluation Parameters
In comparing different redundant system configurations, it is desirable to summarize their reliability by a single parameter. The reliability may be an arbitrarily complex function of time. The selection of the wrong summary parameter could lead to incorrect conclusions, as will be shown below.
Consider a simplex system, with a reliability function of,
\[ R_{simplex}(t) = e^{-\lambda t}, \tag{3.11} \]
and using Eq. (2.41) to derive the Mean Time to Failure results in,
\[ MTTF_{simplex} = \frac{1}{\lambda}. \tag{3.12} \]
For a TMR system with an exponential reliability function,
\[ \begin{aligned}
R_{tmr}(t) &= \left(e^{-\lambda t}\right)^3 + \binom{3}{2} \left(e^{-\lambda t}\right)^2 \left(1 - e^{-\lambda t}\right), \\
&= 3e^{-2\lambda t} - 2e^{-3\lambda t},
\end{aligned} \tag{3.13} \]
and using Eq. (2.40) results in a Mean Time to Failure of,
\[ MTTF_{tmr} = \frac{3}{2\lambda} - \frac{2}{3\lambda}. \tag{3.14} \]
Comparing the simplex and TMR reliability expressions gives,
\[ MTTF_{tmr} = \frac{5}{6\lambda} \le MTTF_{simplex} = \frac{1}{\lambda}. \tag{3.15} \]
By using the MTTF figure of merit, the TMR system can be shown to be less reliable than the Simplex system. The above equations do not include the facility for module repair. Once the TMR system has exhausted its redundancy, there is more hardware to fail than the remaining modules of the non–redundant system. This effect lowers the total system reliability. With online repair, the MTTF figure of merit for the TMR system becomes an important measure of the overall system reliability.
These results illustrate why simplistic assumptions and calculations may result in erroneous information.
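The comparison in Eq. (3.11) through Eq. (3.15) can be checked numerically. The following is a minimal sketch, assuming an illustrative failure rate of 0.001 per hour and a simple trapezoidal integration of R(t); the function names are hypothetical and not part of the original work. It also locates the mission time ln(2)/λ at which the two reliability curves cross, which shows why the MTTF alone can mislead.

```python
import math


def integrate(f, upper, steps=200_000):
    """Composite trapezoidal rule for the integral of f on [0, upper]."""
    h = upper / steps
    total = 0.5 * (f(0.0) + f(upper))
    for k in range(1, steps):
        total += f(k * h)
    return total * h


def r_simplex(t, lam):
    """Simplex reliability, Eq. (3.11)."""
    return math.exp(-lam * t)


def r_tmr(t, lam):
    """TMR reliability without repair, Eq. (3.13)."""
    return 3.0 * math.exp(-2.0 * lam * t) - 2.0 * math.exp(-3.0 * lam * t)


lam = 1.0e-3           # assumed failure rate (per hour), illustrative only
horizon = 40.0 / lam   # truncation point; the neglected tail is of order e^-40

mttf_simplex = integrate(lambda t: r_simplex(t, lam), horizon)
mttf_tmr = integrate(lambda t: r_tmr(t, lam), horizon)

# Setting 3u^2 - 2u^3 = u with u = exp(-lam * t) gives u = 1/2,
# so the curves cross at t = ln(2) / lam.
crossover = math.log(2.0) / lam

print(f"MTTF simplex = {mttf_simplex:.3f}, MTTF TMR = {mttf_tmr:.3f}")
print(f"TMR is more reliable than simplex for t < {crossover:.1f}")
```

For missions shorter than the crossover time the TMR configuration has the higher reliability, even though its MTTF is lower.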
Chapter 4
IMPERFECT FAULT COVERAGE AND RELIABILITY
Reliability models of systems with dynamic redundancy usually depend on perfect
fault detection [Arno73], [Stif80]. The ability of the system to detect faults that
occur can be classified as [Geis84],
• Covered – faults that are detected. The probability that a fault belongs to this class is given by $c$.
• Uncovered – faults that are not detected. The probability that a fault belongs to this class is given by $(1 - c)$.
The underlying diagnostic firmware and hardware may not provide perfect
coverage for many reasons, primarily due to the complexity of the system under
diagnosis [Rous79], [Cona72], [Wood79], [Soma86]. Because of this built–in
complexity, an exhaustively tested set of diagnostics may not be possible.
Another factor affecting the diagnostic coverage is the presence of intermittent
faults [Dahb82], [Mall78]. The detection and analysis of these intermittent or
permanent faults is further complicated by the presence of transient faults which
behave as real faults but are only present in the system for a short time [Glas82],
[Sosn86]. Modeling a fault–tolerant system in the presence of imperfect fault
coverage becomes an important aspect in predicting the overall system reliability.
Redundant System with Imperfect Coverage
Before developing the Markov method of analyzing Fault–Tolerant systems, a conditional probability method will be used to derive the MTTF and MTBF for a redundant system with imperfect fault detection [Bour69]. Assume that the lifetime of each subsystem of the redundant system is an independent, exponentially distributed random variable with failure rate $\lambda$. Let X denote the lifetime of a system with two modules, one active and the other in standby mode. Assume that the module in the standby mode does not experience a fault during the mission time interval. [4] Let Y be a random variable where Y = 0 if a fault is not covered, and Y = 1 if a fault is covered; then $P\{Y = 0\} = (1 - c)$ and $P\{Y = 1\} = c$.
[4] This is an invalid assumption in a practical sense, but it greatly simplifies this example.
To compute the MTTF of this system, the conditional expectation value of the system lifetime X given the fault coverage state Y must be derived.
If an uncovered fault occurs the MTTF of the system is the MTTF of the initially active module,
\[ E[X \mid Y = 0] = \frac{1}{\lambda}. \tag{4.1} \]
If a covered fault occurs the MTTF of the system is the sum of the MTTF of the active module and the MTTF of the inactive module,
\[ E[X \mid Y = 1] = \frac{2}{\lambda}. \tag{4.2} \]
The total expectation value of the system lifetime is then given by,
\[ E[X] = \frac{(1 - c)}{\lambda} + \frac{2c}{\lambda} = \frac{(1 + c)}{\lambda} = MTTF. \tag{4.3} \]
The computation of the system reliability depends on the combination of the two independent exponential distribution functions when a covered fault occurs,
\[ f(x = t \mid y = 1) = \lambda^2 t e^{-\lambda t}, \tag{4.4} \]
and when an uncovered fault occurs
\[ f(x = t \mid y = 0) = \lambda e^{-\lambda t}. \tag{4.5} \]
The joint exponential distribution function for both conditions is given by,
\[ \begin{aligned}
f(t, y) &= f(X = t \mid y) \cdot P\{y\}, \\
f(t, y) &= \lambda (1 - c) e^{-\lambda t}; \quad t > 0,\ y = 0, \\
f(t, y) &= \lambda^2 c\, t e^{-\lambda t}; \quad t > 0,\ y = 1,
\end{aligned} \tag{4.6} \]
and the marginal density function of X is computed by summing over the joint density function,
\[ f(t) = \lambda^2 c\, t e^{-\lambda t} + \lambda (1 - c) e^{-\lambda t}. \tag{4.7} \]
The system reliability as a function of the coverage is then given by integrating the marginal density function in Eq. (4.7) to give,
\[ \begin{aligned}
R(t) &= 1 - \int_0^t f(x)\, dx, \\
&= 1 - \int_0^t \left[ \lambda^2 c\, x e^{-\lambda x} + \lambda (1 - c) e^{-\lambda x} \right] dx, \\
&= \int_t^\infty \left[ \lambda^2 c\, x e^{-\lambda x} + \lambda (1 - c) e^{-\lambda x} \right] dx, \\
&= \left(1 + c \lambda t\right) e^{-\lambda t}.
\end{aligned} \tag{4.8} \]
Generalized Imperfect Coverage
In the previous example, the system consisted of two modules, one in the active state and one in the standby state. The conditional probability that a fault will go undetected (uncovered) was computed using the conditional probability that the system will survive for a specified period. Cox [Cox55] analyzed the general case of a stage–type conditional probability distribution. The principle on which the method of stages is based is the memoryless property of the exponential distribution of Eq. (2.1) [Klie75]. The lack of memory is defined by the fact that the distribution of the time remaining for an exponentially distributed random variable is independent of the current age of the random variable; that is, the variable is memoryless. Appendix D develops further the memoryless property of random variables with exponential distributions.
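The two–module standby example can be cross-checked by simulation. The sketch below assumes the closed forms MTTF = (1 + c)/λ from Eq. (4.3) and R(t) = (1 + cλt)e^(−λt) from Eq. (4.8); the function names, the seed, and the parameter values are illustrative assumptions. Each simulated lifetime is one exponential stage, followed by a second stage only when the fault is covered.

```python
import math
import random


def mttf_standby(lam, c):
    """Closed-form MTTF of the standby pair, Eq. (4.3)."""
    return (1.0 + c) / lam


def reliability_standby(t, lam, c):
    """Closed-form reliability of the standby pair, Eq. (4.8)."""
    return (1.0 + c * lam * t) * math.exp(-lam * t)


def simulate_lifetimes(lam, c, n, seed=0):
    """Draw n system lifetimes X for the active/standby pair."""
    rng = random.Random(seed)
    lifetimes = []
    for _ in range(n):
        x = rng.expovariate(lam)        # lifetime of the active module
        if rng.random() < c:            # fault is covered: standby takes over
            x += rng.expovariate(lam)   # lifetime of the former standby module
        lifetimes.append(x)
    return lifetimes


lam, c = 1.0, 0.9                        # assumed values, for illustration only
samples = simulate_lifetimes(lam, c, 100_000)
sample_mean = sum(samples) / len(samples)
sample_survival = sum(1 for x in samples if x > 1.0) / len(samples)

print(sample_mean, mttf_standby(lam, c))
print(sample_survival, reliability_standby(1.0, lam, c))
```

The sample mean and the empirical survival fraction at t = 1 should agree with the closed forms to within sampling error.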
In the generalized model, it is assumed that individual modules are always in one of two states – working or failed. It is also assumed that the modules are statistically independent and module repair can take place while the remainder of the system continues to function.
In the general case of N active and S standby modules, the lifetime of the system is defined by a stage–type distribution. An active module has an exponential failure distribution with a constant failure rate $\lambda$. Assume that the modules in the standby state can fail at a rate $\mu$ (presuming $0 \le \mu \le \lambda$). Let $X_i$ $(1 \le i \le N)$ be a random variable denoting the lifetime of the active modules and let $Y_j$ $(1 \le j \le S)$ be a random variable denoting the lifetime of the standby modules. The system lifetime L is then,
\[ \begin{aligned}
L(m, N, S) &= \min\{X_1, X_2, \ldots, X_N; Y_1, Y_2, \ldots, Y_S\} + L(m, N, S - 1), \\
&= W(N, S) + L(m, N, S - 1),
\end{aligned} \tag{4.9} \]
where $W(N, S)$ is the time to first failure among the $N + S$ modules. After the removal of the failed module, the system has N active modules and $S - 1$ standby modules. As a result $N + S - 1$ modules have not aged by the memoryless exponential assumption and therefore the system lifetime is,
\[ L(m, N, S) = L(m, N, 0) + \sum_{i=1}^{S} W(N, i). \tag{4.10} \]
Here $L(m, N, 0)$, the value of $L(m, N, S)$ with no standby modules, is the lifetime of the m–out–of–N system and is therefore a $k^{th}$ order statistic with $k = N - m + 1$ [Kend77]. The distribution of $L(m, N, 0)$ is an $(N - m + 1)$–phase Hypoexponential distribution with parameters $N\lambda, (N - 1)\lambda, \ldots, m\lambda$. The distribution for the time to first failure $W(N, i)$ has an exponential distribution with the parameter $N\lambda + i\mu$.
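Because the mean of a Hypoexponential distribution is the sum of the means of its stages, the MTTF of the m–out–of–N system follows directly from the stage parameters Nλ, (N − 1)λ, …, mλ. The sketch below is illustrative; the function names are hypothetical. For m = 2 and N = 3 it reproduces the TMR value 5/(6λ) of Eq. (3.15).

```python
def stage_rates(m, n, lam):
    """Stage parameters N*lam, (N-1)*lam, ..., m*lam of L(m, N, 0)."""
    return [j * lam for j in range(n, m - 1, -1)]


def mttf_m_out_of_n(m, n, lam):
    """Mean of the (N - m + 1)-phase Hypoexponential lifetime L(m, N, 0)."""
    return sum(1.0 / rate for rate in stage_rates(m, n, lam))


lam = 1.0  # assumed failure rate, illustrative only
print(stage_rates(2, 3, lam))        # TMR: two phases with rates 3*lam and 2*lam
print(mttf_m_out_of_n(2, 3, lam))    # 1/(3*lam) + 1/(2*lam) = 5/(6*lam)
print(mttf_m_out_of_n(1, 1, lam))    # simplex: 1/lam
```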
Using Theorem D.1 in Appendix D, the distribution $L(m, N, S)$ has a $(N + S - m + 1)$–stage Hypoexponential distribution [Koba78], [Cox55], [Ash70] with parameters $N\lambda + S\mu, N\lambda + (S - 1)\mu, \ldots, N\lambda + \mu, N\lambda, (N - 1)\lambda, \ldots, m\lambda$.
Let $R_{[m \mid N, S]}(t)$ denote the reliability of such a system, then the reliability function is defined as,
\[ R_{[m \mid N, S]}(t) = \sum_{i=1}^{S} a_i e^{-(N\lambda + i\mu)t} + \sum_{i=m}^{N} b_i e^{-i\lambda t}, \tag{4.11} \]
where,
\[ a_i = \prod_{\substack{j=1 \\ j \ne i}}^{S} \frac{N\lambda + j\mu}{j\mu - i\mu} \prod_{j=m}^{N} \frac{j\lambda}{j\lambda - N\lambda - i\mu}, \tag{4.12} \]
and,
\[ b_i = \prod_{j=1}^{S} \frac{N\lambda + j\mu}{(N - i)\lambda + j\mu} \prod_{\substack{j=m \\ j \ne i}}^{N} \frac{j\lambda}{j\lambda - i\lambda}. \tag{4.13} \]
Defining the constant $K = \lambda / \mu$ gives a new expression for the active and standby terms in the reliability equation Eq. (4.11) of (with $x! = \Gamma(x + 1)$ for non–integer arguments),
\[ \begin{aligned}
a_i &= \prod_{\substack{j=1 \\ j \ne i}}^{S} \frac{NK + j}{j - i} \prod_{j=m}^{N} \frac{jK}{jK - NK - i}, \\
&= (-1)^{N - m + i}\, \frac{(NK + S)!}{(NK)!\,(NK + i)\,(i - 1)!\,(S - i)!} \cdot \frac{N!}{(m - 1)!} \cdot \frac{\left(\frac{i}{K} - 1\right)!}{\left(\frac{i}{K} + N - m\right)!}, \\
&= (-1)^{N - m + i}\, \frac{mK}{NK + i} \cdot \frac{\binom{NK + S}{S} \binom{S}{i} \binom{N}{m}}{\binom{\frac{i}{K} + N - m}{N - m}}.
\end{aligned} \tag{4.14} \]
A similar expression can be developed for,
\[ \begin{aligned}
b_i &= \prod_{j=1}^{S} \frac{NK + j}{(N - i)K + j} \prod_{\substack{j=m \\ j \ne i}}^{N} \frac{j}{j - i}, \\
&= (-1)^{i - m}\, \frac{(NK + S)!\,[(N - i)K]!}{(NK)!\,[(N - i)K + S]!} \cdot \frac{N!}{i\,(m - 1)!\,(i - m)!\,(N - i)!}, \\
&= (-1)^{i - m}\, \frac{m}{i} \cdot \frac{\binom{NK + S}{S} \binom{N}{i} \binom{i}{m}}{\binom{(N - i)K + S}{S}}.
\end{aligned} \tag{4.15} \]
An expectation value of the reliability function derived from a general stage–type distribution can be found using the Laplace transform [Cox 55]. The Laplace transform of a stage–type random variable X is,
\[ L_X(s) = \gamma_1 + \sum_{i=1}^{r} \beta_1 \beta_2 \cdots \beta_i\, \gamma_{i+1} \prod_{j=1}^{i} \frac{\mu_j}{s + \mu_j}, \tag{4.16} \]
where $\gamma_i + \beta_i = 1$ for $1 \le i \le r$ and $\gamma_{r+1} = 1$. Defining the Laplace transform of the system described in Eq. (4.9) gives,
\[ \begin{aligned}
L_X(s) = {} & \sum_{i=1}^{S} c^{\,i-1} (1 - c) \prod_{j=1}^{i} \frac{N\lambda + (S - j + 1)\mu}{s + N\lambda + (S - j + 1)\mu} \\
& + c^{S} \prod_{j=1}^{S} \frac{N\lambda + j\mu}{s + N\lambda + j\mu} \prod_{j=1}^{N - m + 1} \frac{(N - j + 1)\lambda}{s + (N - j + 1)\lambda}.
\end{aligned} \tag{4.17} \]
By inverting the transformation in Eq. (4.17) an expression for the MTTF with imperfect coverage can be given as,
\[ E[X] = (1 - c) \sum_{i=1}^{S} c^{\,i-1} \sum_{j=S-i+1}^{S} \frac{1}{N\lambda + j\mu} + c^{S} \left\{ \sum_{j=1}^{S} \frac{1}{N\lambda + j\mu} + \sum_{j=m}^{N} \frac{1}{j\lambda} \right\}. \tag{4.18} \]
The above development is described in more detail in [Ing76], [Chan72], [King69], [Saat65], [Math70], [Triv82]. In the example described above, the system does not provide for repair. When repairable systems are analyzed in this manner, the number of stages becomes infinite. To deal with the infinite number of conditional probabilities a different technique must be employed. The Markov Chain is just such a technique, capable of dealing with a system configuration of many modules, each with repairability.
An additional caution should be noted. The assumption of statistical independence is questionable in the case of stage–type failure distributions. In addition, the fixed probability distribution associated with each failure in the stage–type should be removed in the detailed analysis [Rams76].
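The MTTF with imperfect coverage can be evaluated directly once a form for Eq. (4.18) is fixed. The sketch below assumes Eq. (4.18) reads E[X] = (1 − c) Σ_{i=1..S} c^(i−1) Σ_{j=S−i+1..S} 1/(Nλ + jμ) + c^S { Σ_{j=1..S} 1/(Nλ + jμ) + Σ_{j=m..N} 1/(jλ) }; this reading and the function name are assumptions made for illustration. With N = 1, m = 1, S = 1 and a standby failure rate of zero it reduces to the (1 + c)/λ result of Eq. (4.3).

```python
def mttf_hybrid(m, n, s, lam, mu, c):
    """MTTF of an m-out-of-N system with S standby modules and coverage c.

    Assumed reading of Eq. (4.18); lam is the active failure rate and
    mu is the standby failure rate.
    """
    # Paths that end at the i-th fault because that fault is uncovered.
    uncovered = 0.0
    for i in range(1, s + 1):
        stage_time = sum(1.0 / (n * lam + j * mu) for j in range(s - i + 1, s + 1))
        uncovered += c ** (i - 1) * stage_time
    uncovered *= (1.0 - c)

    # Path on which all S standby modules are consumed by covered faults.
    standby_time = sum(1.0 / (n * lam + j * mu) for j in range(1, s + 1))
    active_time = sum(1.0 / (j * lam) for j in range(m, n + 1))
    return uncovered + c ** s * (standby_time + active_time)


print(mttf_hybrid(1, 1, 1, 1.0, 0.0, 0.9))   # two-module standby example
print(mttf_hybrid(2, 3, 0, 1.0, 0.0, 1.0))   # TMR without standby modules
```

Keeping the uncovered paths and the fully covered path as separate terms mirrors the two groups of terms in the assumed expression.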
Chapter 5
MARKOV MODELS OF FAULT–TOLERANT SYSTEMS
A generalized modeling technique is required to deal with an arbitrary number of modules, failure events, and repair events in the analysis of Fault–Tolerant systems [Boss82]. Several techniques are available, including Petri Nets [Duga84], [Duga85], Fault Tree Analysis [Fuss76], Failure Mode and Effects Analysis [Mil1629], [Jame74], Event Tree Analysis [Gree82], and Hazard and Operability Studies [Lee80], [Robi78], [Smit85]. When system components are not independent, a state based analysis technique is needed which includes redundancy and repair [Biro86], [Guid86].
A Continuous Parameter Markov Chain is a method used to analyze systems that have state transitions that include repair processes [Hoel72], [Kend50], [Kend53]. A Markov Process is a stochastic process whose dynamic behavior is such that the probability distributions for its future behavior depend only on the present state and not how the process arrived in that state [Mark07], [Fell67], [Issa76], [Chun76], [Kulk84].
To illustrate the principles of a Markov process, consider a system S described in Figure 3, which is changing over time in such a way that its state at any instant in time t can be described in terms of a finite dimensional vector $X(t)$, [Triv74], [Triv75a], [Triv75]. Assume that the state of the system at any time $t$, for $t > v$, can be described by a predetermined function of the state at the starting time v and the ending time t:
\[ X(t) = G\left[X(v), t\right]. \tag{5.1} \]
Given a set of reasonable starting conditions and the continuity of the function G, a differential equation for $X(t)$ describing the rate at which transitions between each state of the system take place can be derived by expanding both sides of Eq. (5.1) in powers of t to give,
\[ \frac{dX}{dt} = \mathbf{H}\left[X(t)\right]. \tag{5.2} \]
Finite–dimensional deterministic systems described by the set of state vectors are equivalent to systems described by sets of ordinary differential equations [Bell60], [Brau67], [Beiz78], [Brue80]. This property will serve as the basis for analysis of fault–tolerant systems that include repair.
It will be assumed that the system described by the set of differential equations in Eq. (5.2) can exist in only one of a finite number of states [Keme60], [Koba78]. The transition from state i to state j in this system takes place with some random probability defined by,
\[ p_{ij}(v, t) = P\{X(t) = j \mid X(v) = i\}, \quad t \ge v;\ i, j \in S. \tag{5.3} \]
Eq. (5.3) is the conditional pdf of the system of state transitions and satisfies the relation,
\[ \sum_{\forall j \in S} p_{ij}(v, t) = 1; \quad 0 \le v \le t. \tag{5.4} \]
The unconditional pdf of the state transition vector $X(t)$ is given by,
\[ p_j(t) = P\{X(t) = j\}, \quad j = 1, 2, 3, \ldots \tag{5.5} \]
with,
\[ \sum_{\forall j \in S} p_j(t) = 1, \quad \forall t > 0, \tag{5.6} \]
since the process at any time t must be in a unique state. An Absorbing Markov Process is one in which transitions have the following properties [Gave73],
• There is at least one absorbing state,
• From every state, it is possible to get to the absorbing state.
Figure 3 – State Transition probabilities as a
function of time in the Continuous–Time
Markov chain that is subject to the constraints
of the Chapman–Kolmogorov equation.
The fundamental assumption of the Markov model is that the probability of a
given state transition depends only on the current state of the system and not on
any previous state. For continuous–time Markov processes, that is, those
described by ordinary differential equations, the length of time already spent in
the current state does not influence either the probability distribution of the next
state or the probability distribution of the remaining time in the same state before
the next transition. The Markov model fits with the standard assumption of the
reliability models developed so far in this work, that the failure rates are constant,
leading to an exponentially distributed state transition time for failures and a
Poisson distribution for the occurrence of these failures.
[Figure 3 diagram: a path from state i at time v, through an intermediate state k at time u, to state j at time t.]
Solving the Markov Matrix
In order to describe a continuous–time Markov process using transition matrices, it is necessary to specify the entire family of stochastic matrices, $\{\mathbf{P}(t)\}$. Only those matrices that meet certain conditions are useful in finding the solution to the final absorption state rate of the system described by the Markov Chain [Cour77].
Initial value problems involving systems of equations may be solved using the Laplace transform. The advantage of this technique over traditional methods (Elimination, Eigenvalue solutions, and Fundamental Matrix [Pipe63], [Cour43]) is that satisfaction of initial values is automatically provided. No special techniques are needed to find particular solutions of the fundamental matrix, such as repeated eigenvalues [Lome88].
Chapman–Kolmogorov Equations
A set of differential equations describing the transitions between each state can be derived if the following conditions are met by the transition probability matrix [Bhar60], [Parz62], [Howa71]. These equations are the Chapman–Kolmogorov Equations and are defined as the transition probabilities of the Markov chain that satisfy Eq. (5.7) for all i and j, using Figure 3 as an example,
\[ p_{ij}(v, t) = \sum_{k} p_{ik}(v, u) \cdot p_{kj}(u, t). \tag{5.7} \]
A simplified notation for the matrix elements defined in Eq. (5.7) can be created where the elements of each matrix are given by,
\[ \mathbf{H}(v, t) = \mathbf{H}(v, u)\, \mathbf{H}(u, t), \quad v \le u \le t, \tag{5.8} \]
and where,
\[ \mathbf{H}(t, t) = \mathbf{I}, \tag{5.9} \]
in which $\mathbf{I}$ is the identity matrix.
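The product rule of Eq. (5.8) can be illustrated numerically for the special case of constant transition rates, where the transition matrix over an interval of length τ is the matrix exponential exp(Qτ) of a rate matrix Q whose rows sum to zero. The sketch below is a plain-Python illustration; the three-state rate matrix, the helper names, and the scaling-and-squaring series used for the exponential are all assumptions chosen for the example, not part of the original development.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transition_matrix(q, tau, terms=30, squarings=10):
    """H over an interval of length tau for a constant rate matrix q.

    Computes exp(q * tau) with a Taylor series on a scaled matrix,
    followed by repeated squaring.
    """
    n = len(q)
    scale = tau / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        term = mat_mul(term, a)
        term = [[x / k for x in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical rate matrix: row i holds the rates out of state i, and the
# last state is absorbing. The numerical values are illustrative only.
lam, mu = 0.2, 1.5
Q = [[-2 * lam, 2 * lam, 0.0],
     [mu, -(lam + mu), lam],
     [0.0, 0.0, 0.0]]

h_vu = transition_matrix(Q, 0.3)   # interval (v, u)
h_ut = transition_matrix(Q, 0.4)   # interval (u, t)
h_vt = transition_matrix(Q, 0.7)   # interval (v, t)
product = mat_mul(h_vu, h_ut)

max_gap = max(abs(h_vt[i][j] - product[i][j]) for i in range(3) for j in range(3))
print(max_gap)
```

The largest element-wise difference between H(v, t) and H(v, u)H(u, t) should be at the level of floating-point rounding, and an interval of zero length returns the identity matrix as in Eq. (5.9).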
The Forward Chapman–Kolmogorov Equation is now defined as,
\[ \frac{\partial}{\partial t} \mathbf{H}(v, t) = \mathbf{H}(v, t)\, \mathbf{Q}(t), \quad v \le t, \tag{5.10} \]
where the new matrix $\mathbf{Q}(t)$ is defined as,
\[ \mathbf{Q}(t) = \lim_{\Delta t \to 0} \frac{\mathbf{P}(t) - \mathbf{I}}{\Delta t}, \tag{5.11} \]
with,
\[ \Delta t = t - v. \tag{5.12} \]
The matrix $\mathbf{Q}(t)$ is now defined as the transition rate matrix [Papo65a]. The elements of $\mathbf{Q}(t)$ are $q_{ij}(t)$ and are defined by,
\[ q_{ii}(t) = \lim_{\Delta t \to 0} \frac{p_{ii}(t, t + \Delta t) - 1}{\Delta t}, \tag{5.13} \]
and
\[ q_{ij}(t) = \lim_{\Delta t \to 0} \frac{p_{ij}(t, t + \Delta t)}{\Delta t}, \quad i \ne j. \tag{5.14} \]
If the system at time t is in state i, then the probability that a transition occurs to any state other than state i during the time interval $[t, t + \Delta t]$ is given by,
\[ -q_{ii}(t)\, \Delta t + o(\Delta t), \tag{5.15} \]
where $o(h)$ is any function of h that approaches zero faster than h, that is
\[ \lim_{h \to 0} \frac{o(h)}{h} = 0. \]
The magnitude of $q_{ii}(t)$ in Eq. (5.13) is the rate at which the process departs state i, given that it is in state i.
Similarly, given that the system is in state i at time t, the conditional probability that it will make a transition from state i to state j in the time interval $[t, t + \Delta t]$ is given by,
\[ q_{ij}(t)\, \Delta t + o(\Delta t). \tag{5.16} \]
Eq. (5.14) is the rate at which the process moves from state i to state j given that the system is in state i. Since,
\[ \sum_{j} p_{ij}(v, t) = 1, \tag{5.17} \]
Eq. (5.13) and Eq. (5.14) imply,
\[ \sum_{j} q_{ij}(t) = 0, \quad \forall i \in S. \tag{5.18} \]
Using these developments, the Backward Chapman–Kolmogorov equation is given by,
\[ \frac{\partial}{\partial v} \mathbf{H}(v, t) = -\mathbf{Q}(v)\, \mathbf{H}(v, t), \quad v \le t. \tag{5.19} \]
The forward equation may be expressed in terms of its elements,
\[ \frac{\partial}{\partial t} p_{ij}(v, t) = q_{jj}(t)\, p_{ij}(v, t) + \sum_{k \ne j} q_{kj}(t)\, p_{ik}(v, t). \tag{5.20} \]
The initial state i at the initial time v affects the solution of this set of differential equations only through the following conditions,
\[ p_{ij}(v, v) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{5.21} \]
The backward matrix equation may be expressed in terms of its elements,
\[ \frac{\partial}{\partial v} p_{ij}(v, t) = -q_{ii}(v)\, p_{ij}(v, t) - \sum_{k \ne i} q_{ik}(v)\, p_{kj}(v, t), \tag{5.22} \]
with the initial conditions,
\[ p_{ij}(t, t) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{5.23} \]
Markov Matrix Notation
The expressions developed in the previous section can be represented by a transition probability matrix [Papo62] of the form,
\[ \mathbf{P} = \left[ p_{ij} \right] = \begin{bmatrix}
p_{00} & p_{01} & \cdots & p_{0n} \\
p_{10} & p_{11} & \cdots & \vdots \\
\vdots & \vdots & \ddots & \vdots \\
p_{m0} & \cdots & \cdots & p_{mn}
\end{bmatrix}. \]
The entries in this matrix satisfy two properties; $0 \le p_{ij} \le 1$ and $\sum_{j} p_{ij} = 1$, which is a restatement of Eq. (5.17). The Transition Probability Matrix can also be represented by a directed graph [Maye72], [Deo74]. A node labeled i in the directed graph represents state i of the Markov Chain and a branch labeled $p_{ij}$ from node i to node j implies that the conditional probability $P\{X_n = j \mid X_{n-1} = i\} = p_{ij}$ is met by the Markov Process represented by the directed graph.
The transition probabilities represent a set of differential equations describing the rate at which the transitions take place between each node in the directed graph. The differential equations are then represented by a matrix structure of,
\[ \begin{bmatrix} \dfrac{d}{dt} P_0 \\[1ex] \dfrac{d}{dt} P_1 \\ \vdots \\ \dfrac{d}{dt} P_n \end{bmatrix}
= \begin{bmatrix}
p_{00} & \cdots & \cdots & p_{0n} \\
p_{10} & \ddots & & p_{1n} \\
\vdots & & \ddots & \vdots \\
p_{m0} & \cdots & \cdots & p_{mn}
\end{bmatrix}
\begin{bmatrix} P_0 \\ P_1 \\ \vdots \\ P_n \end{bmatrix}. \]
The solution to this set of linear homogeneous differential equations can be derived by elimination using the Laplace transform method.
Laplace Transform Techniques
Given a set of differential equations in Eq. (5.20) and Eq. (5.22), the Laplace transform can be used to generate solutions to these equations [Lome88]. One advantage of using the Laplace transform method is its ability to handle initial conditions automatically, without having first to find a general solution and then having to evaluate the integration constants. The Laplace transform is defined as,
\[ F(s) = \int_0^\infty e^{-st} f(t)\, dt = \mathcal{L}\{f(t)\}. \tag{5.24} \]
The differential equation solution method depends on the following operational property of the Laplace transform [Krey72]. The Laplace transform of the derivative of a function is,
\[ \mathcal{L}\{f'(t)\} = \int_0^\infty e^{-st} f'(t)\, dt = \lim_{b \to \infty} \left[ e^{-st} f(t) \Big|_0^b + s \int_0^b e^{-st} f(t)\, dt \right]. \tag{5.25} \]
In the limit, the integral appearing on the right–hand side of Eq. (5.25) is $\mathcal{L}\{f(t)\}$, so that the first term in Eq. (5.25) can be evaluated in the following manner [McLac39],
\[ \lim_{b \to \infty} e^{-sb} f(b) - e^{0} f(0). \tag{5.26} \]
Using the property of absolute values and limits [Arfk70], Eq. (5.26) can be rewritten as,
\[ \left| \lim_{b \to \infty} e^{-sb} f(b) \right| \le \lim_{b \to \infty} e^{-sb} \left| f(b) \right|. \tag{5.27} \]
The term $f(b)$ is of the order $e^{\alpha b}$ as $b \to \infty$. For $b > T$, using the definition for exponential order, Eq. (5.27) can be reevaluated to the following,
\[ \lim_{b \to \infty} e^{-sb} \left| f(b) \right| \le \lim_{b \to \infty} e^{-sb} M e^{\alpha b} = \lim_{b \to \infty} M e^{-(s - \alpha) b}. \tag{5.28} \]
The function $f(b)$ is said to be of exponential order as $b \to \infty$ if there exists a constant $\alpha$ such that $e^{-\alpha b} \left| f(b) \right|$ is bounded for all b greater than some T. If this statement is true, there also exists a constant M, such that $\left| f(b) \right| < M e^{\alpha b}$, $b > T$.
Figure 4 – Definition of the exponential order of a function.
If $s > \alpha$, then $s - \alpha > 0$, giving,
\[ \lim_{b \to \infty} M e^{-(s - \alpha) b} = 0, \tag{5.29} \]
so that in the limit,
\[ \lim_{b \to \infty} e^{-sb} f(b) = 0, \tag{5.30} \]
giving the final form of the Laplace transform of a differential equation as,
\[ \mathcal{L}\{f'(t)\} = s\, \mathcal{L}\{f(t)\} - f(0). \tag{5.31} \]
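The derivative property of Eq. (5.31) can be checked numerically for a function of exponential order. The sketch below uses f(t) = e^(−at), the form taken by the simplex reliability function, and approximates both transforms with a truncated trapezoidal integral; the helper name, the truncation point, and the parameter values are assumptions made for illustration.

```python
import math


def laplace(f, s, upper=60.0, steps=200_000):
    """Truncated trapezoidal approximation of the Laplace transform of f at s."""
    h = upper / steps

    def integrand(t):
        return math.exp(-s * t) * f(t)

    total = 0.5 * (integrand(0.0) + integrand(upper))
    for k in range(1, steps):
        total += integrand(k * h)
    return total * h


a, s = 0.5, 2.0   # assumed decay rate and transform variable, with s > -a


def f(t):
    return math.exp(-a * t)


def f_prime(t):
    return -a * math.exp(-a * t)


lhs = laplace(f_prime, s)          # transform of the derivative
rhs = s * laplace(f, s) - f(0.0)   # right-hand side of Eq. (5.31)
exact = -a / (s + a)               # closed form for this particular f

print(lhs, rhs, exact)
```

Both sides agree with the closed form −a/(s + a), which is what Eq. (5.31) predicts for this choice of f.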
The notation for the Laplace transform for the differential equation for the rate of arrival at the transition state i is then given by,
\[ \mathcal{L}\{P_i(t)\} \Rightarrow P_i(s). \tag{5.32} \]
From this point on, this Laplace transform notation will be used in the solution of the Markov transition matrix differential equations. Using the expression $R(t) = 1 - F(t) = P\{T \ge t\}$ to define the system reliability, where $F(t)$ is the probability distribution function of the time to failure, a new random variable, Y, can be defined which represents the time to system failure. A notation can be defined such that
\[ f_Y(t) = -\frac{dR(t)}{dt} = \frac{dP_0(t)}{dt} \]
is the failure density of the random variable Y. The Laplace transform of this failure density is denoted by $\mathcal{L}\{f_Y(t)\} \Rightarrow L_Y(s) = f_Y(s) = sP_0(s)$. In this work $P_0(s)$ represents the absorbing state of the Markov model. By using the Laplace transform notation in the solution of differential equations, the inverse transform can be used to generate the failure density function for the random variable Y. Using Eq. (2.38) the failure density function can be integrated to produce the Mean Time to Failure,
\[ MTTF = E[Y] = \int_0^\infty t \left( -\frac{d}{dt} R(t) \right) dt. \]
The inversion of the Laplace transform may be straightforward in some cases and more complex in other cases.
MODELING A DUPLEX SYSTEM
Duplex systems or Parallel Redundant systems have been utilized in electronic central office switching systems and other high–reliability systems for the past 35 years [Toy78]. Parallel redundant systems depend on fault detection and recovery for their proper operation. In most dual redundant architectures both systems are monitored continuously, providing fault detection in the primary subsystem as well as the standby subsystem.
This section describes the detailed development of the Markov model for a parallel redundant system with perfect diagnostic coverage. The failure rate of both subsystems is assumed to be a constant $\lambda$ and the repair rate a constant $\mu$. The system is considered failed when both subsystems have failed. The number of properly functioning subsystems is described in the state space $S \Rightarrow \{2, 1, 0\}$, where $\{0\}$ is the failure state of the system. The state diagram for the system is shown in Figure 5.
[Figure 5 diagram: {2} → {1} at rate 2λ; {1} → {2} at rate µ; {1} → {0} at rate λ.]
Figure 5 – The state transition diagram for a Parallel Redundant system with repair. State {2} represents the fault free operation mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode, the absorption state.
The initial state of the system is {2} and the initial conditions for the transition equations are,
\[ P_2(0) = 1, \quad P_1(0) = P_0(0) = 0. \tag{5.33} \]
Using the initial conditions, the system of differential equations derived from the transition matrix,
\[ \begin{bmatrix} \dfrac{dP_2(t)}{dt} \\[1ex] \dfrac{dP_1(t)}{dt} \\[1ex] \dfrac{dP_0(t)}{dt} \end{bmatrix}
= \begin{bmatrix}
-2\lambda & \mu & 0 \\
2\lambda & -(\lambda + \mu) & 0 \\
0 & \lambda & 0
\end{bmatrix}
\begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix}, \]
are given by,
\[ \begin{aligned}
\frac{dP_2(t)}{dt} &= -2\lambda P_2(t) + \mu P_1(t), \\
\frac{dP_1(t)}{dt} &= 2\lambda P_2(t) - (\lambda + \mu) P_1(t), \\
\frac{dP_0(t)}{dt} &= \lambda P_1(t).
\end{aligned} \tag{5.34} \]
Using the Laplace transform solution technique described in the previous section and in detail in [Doet61], [Widd46], [Lome88], [Rea78], and [Lath65] gives the following set of equations in Laplace form,
\[ \begin{aligned}
sP_2(s) - 1 &= -2\lambda P_2(s) + \mu P_1(s), \\
sP_1(s) &= 2\lambda P_2(s) - (\lambda + \mu) P_1(s), \\
sP_0(s) &= \lambda P_1(s).
\end{aligned} \tag{5.35} \]
Solving Eq. (5.35)(a) for state {2} gives,
\[ \begin{aligned}
sP_2(s) + 2\lambda P_2(s) &= \mu P_1(s) + 1, \\
(s + 2\lambda) P_2(s) &= \mu P_1(s) + 1, \\
P_2(s) &= \frac{\mu P_1(s) + 1}{s + 2\lambda},
\end{aligned} \tag{5.36} \]
and solving Eq. (5.35)(b) for state {2} gives,
\[ \begin{aligned}
sP_1(s) &= 2\lambda P_2(s) - (\lambda + \mu) P_1(s), \\
(s + \lambda + \mu) P_1(s) &= 2\lambda P_2(s), \\
P_2(s) &= \frac{(s + \lambda + \mu) P_1(s)}{2\lambda}.
\end{aligned} \tag{5.37} \]
Equating Eq. (5.36) and Eq. (5.37) a solution representing state {1} can be derived, giving,
\[ \frac{(s + \lambda + \mu) P_1(s)}{2\lambda} = \frac{\mu P_1(s) + 1}{s + 2\lambda}. \tag{5.38} \]
Multiplying each side by $\dfrac{1}{P_1(s)}$ gives,
\[ \frac{s + \lambda + \mu}{2\lambda} = \frac{\mu + \dfrac{1}{P_1(s)}}{s + 2\lambda}, \]
which results in,
\[ (s + 2\lambda)(s + \lambda + \mu) = 2\lambda\mu + \frac{2\lambda}{P_1(s)}. \tag{5.39} \]
Solving Eq. (5.39) for state {1} gives,
\[ P_1(s) = \frac{2\lambda}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu}. \tag{5.40} \]
Expanding and simplifying Eq. (5.40) gives,
\[ P_1(s) = \frac{2\lambda}{s^2 + 3\lambda s + \mu s + 2\lambda^2}. \tag{5.41} \]
Substituting Eq. (5.41) into Eq. (5.35)(c) gives the solution to the final absorbing state {0} as,
\[ \begin{aligned}
sP_0(s) &= \lambda P_1(s), \\
sP_0(s) &= \lambda \left[ \frac{2\lambda}{s^2 + 3\lambda s + \mu s + 2\lambda^2} \right], \\
P_0(s) &= \frac{2\lambda^2}{s \left[ s^2 + 3\lambda s + \mu s + 2\lambda^2 \right]}.
\end{aligned} \tag{5.42} \]
After producing the inverse Laplace transform of Eq. (5.42)(c), the probability that no subsystems are operating at time $t > 0$ is the result. Let the random variable Y be the time to failure of the system and $P_0(t)$ be the probability that the system has failed at or before time t. The reliability of the system is then defined by,
\[ R(t) = 1 - P_0(t). \tag{5.43} \]
Using Eq. (2.37), the failure density function for the random variable Y is given by,
\[ f_Y(t) = -\frac{dR}{dt} = \frac{dP_0(t)}{dt}, \tag{5.44} \]
and using Eq. (5.31), its Laplace transform is given by,
\[ L_Y(s) = f_Y(s) = sP_0(s) - P_0(0) = \frac{2\lambda^2}{s^2 + (3\lambda + \mu)s + 2\lambda^2}. \tag{5.45} \]
Inverting Eq. (5.45) gives the failure density of Y as,
\[ f_Y(t) = \frac{2\lambda^2}{\alpha_1 - \alpha_2} \left( e^{-\alpha_2 t} - e^{-\alpha_1 t} \right), \tag{5.46} \]
where,
\[ \alpha_1, \alpha_2 = \frac{(3\lambda + \mu) \pm \sqrt{\lambda^2 + 6\lambda\mu + \mu^2}}{2}. \tag{5.47} \]
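Using Eq. (2.28), the MTTF of the Parallel Redundant system with repair is given by,
\[ \begin{aligned}
E[Y] &= \int_0^\infty y f_Y(y)\, dy, \\
&= \frac{2\lambda^2}{\alpha_1 - \alpha_2} \left[ \int_0^\infty y e^{-\alpha_2 y}\, dy - \int_0^\infty y e^{-\alpha_1 y}\, dy \right], \\
&= \frac{2\lambda^2}{\alpha_1 - \alpha_2} \left[ \frac{1}{\alpha_2^2} - \frac{1}{\alpha_1^2} \right], \\
&= \frac{2\lambda^2 (\alpha_1 + \alpha_2)}{\alpha_1^2 \alpha_2^2}, \\
&= \frac{2\lambda^2 (3\lambda + \mu)}{\left(2\lambda^2\right)^2}, \\
&= \frac{3}{2\lambda} + \frac{\mu}{2\lambda^2}.
\end{aligned} \tag{5.48} \]
The MTTF of a two element Parallel Redundant system without repair $(\mu = 0)$ would have been equal to the first term in the last line of Eq. (5.48). The effect of adding a repair facility to the system increases the mean life of the system by,
\[ MTTF_{\text{as a result of Repair}} = \frac{\mu}{2\lambda^2}, \tag{5.49} \]
or a factor of,
\[ \frac{\mu / 2\lambda^2}{3 / 2\lambda} = \frac{\mu}{3\lambda}, \tag{5.50} \]
over a system without repair facilities.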
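The result of Eq. (5.48) can be cross-checked without the Laplace transform by solving for the mean time to absorption directly. If τ2 and τ1 are the mean times to reach state {0} from states {2} and {1}, the transient part of the model gives two linear equations, 2λ·τ2 − 2λ·τ1 = 1 and −µ·τ2 + (λ + µ)·τ1 = 1. The sketch below solves this pair by Cramer's rule; the function name, the generic rate parameters, and the numerical values are illustrative assumptions.

```python
def mean_time_to_absorption(rate_21, rate_10, repair):
    """Mean times to reach state {0} from states {2} and {1}.

    rate_21: transition rate from state {2} to state {1}
    rate_10: transition rate from state {1} to state {0}
    repair:  repair rate from state {1} back to state {2}
    Solves  rate_21 * tau2 - rate_21 * tau1 = 1
           -repair  * tau2 + (rate_10 + repair) * tau1 = 1
    """
    det = rate_21 * (rate_10 + repair) - rate_21 * repair
    tau2 = ((rate_10 + repair) + rate_21) / det
    tau1 = (repair + rate_21) / det
    return tau2, tau1


lam, mu = 1.0e-3, 0.1   # assumed failure and repair rates (per hour)

tau2, tau1 = mean_time_to_absorption(2 * lam, lam, mu)
closed_form = 3.0 / (2.0 * lam) + mu / (2.0 * lam ** 2)   # Eq. (5.48)
no_repair, _ = mean_time_to_absorption(2 * lam, lam, 0.0)

print(tau2, closed_form)
print("improvement factor:", (tau2 - no_repair) / no_repair, mu / (3.0 * lam))
```

With these values the MTTF from the fault free state matches 3/(2λ) + µ/(2λ²), and the gain relative to the system without repair matches the factor µ/(3λ) of Eq. (5.50).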
MODELING A TRIPLE–REDUNDANT SYSTEM
A Triple Modular Redundant (TMR) system continues to operate correctly as long as two of the three subsystems are functioning properly. A second subsystem failure causes the system to fail. This model is referred to as 3–2–0. A second architecture (shown in Figure 7) is possible in which the system will continue to operate in the presence of two (2) subsystem failures. This system operates in simplex mode 3–2–1–0. The 3–2–0 model without coverage will be developed in this section. Figure 6 describes a TMR system with a constant failure rate $\lambda$ and a constant repair rate $\mu$.
The repair activity takes place with a constant response time whenever a subsystem fails, giving a Markov transition matrix of,
\[ \begin{bmatrix} \dfrac{dP_2(t)}{dt} \\[1ex] \dfrac{dP_1(t)}{dt} \\[1ex] \dfrac{dP_0(t)}{dt} \end{bmatrix}
= \begin{bmatrix}
-3\lambda & \mu & 0 \\
3\lambda & -(2\lambda + \mu) & 0 \\
0 & 2\lambda & 0
\end{bmatrix}
\begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix}. \tag{5.51} \]
The set of differential equations derived from the transition matrix is given by,
\[ \begin{aligned}
\frac{dP_2(t)}{dt} &= -3\lambda P_2(t) + \mu P_1(t), \\
\frac{dP_1(t)}{dt} &= 3\lambda P_2(t) - (2\lambda + \mu) P_1(t), \\
\frac{dP_0(t)}{dt} &= 2\lambda P_1(t).
\end{aligned} \tag{5.52} \]
Rewriting the differential equations in the Laplace transform format gives,
\[ \begin{aligned}
sP_2(s) - 1 &= -3\lambda P_2(s) + \mu P_1(s), \\
sP_1(s) &= 3\lambda P_2(s) - (2\lambda + \mu) P_1(s), \\
sP_0(s) &= 2\lambda P_1(s).
\end{aligned} \tag{5.53} \]
Using Eq. (5.53)(a) to solve for state {2} gives,
\[ \begin{aligned}
sP_2(s) + 3\lambda P_2(s) &= \mu P_1(s) + 1, \\
(s + 3\lambda) P_2(s) &= \mu P_1(s) + 1, \\
P_2(s) &= \frac{\mu P_1(s) + 1}{s + 3\lambda}.
\end{aligned} \tag{5.54} \]
[Figure 6 diagram: {2} → {1} at rate 3λ; {1} → {2} at rate µ; {1} → {0} at rate 2λ.]
Figure 6 – The transition diagram for a Triple Modular Redundant system with repair. State {2} represents the fault free (TMR) operation mode, State {1} represents a single fault (Duplex) operation mode with a return path to the fault free mode, and State {0} represents the system failure mode, the absorbing state.
Using Eq. (5.53)(b) to solve for state {2} gives,
\[ \begin{aligned}
sP_1(s) &= 3\lambda P_2(s) - (2\lambda + \mu) P_1(s), \\
(s + 2\lambda + \mu) P_1(s) &= 3\lambda P_2(s), \\
P_2(s) &= \frac{(s + 2\lambda + \mu) P_1(s)}{3\lambda}.
\end{aligned} \tag{5.55} \]
Equating Eq. (5.54) and Eq. (5.55) and solving for state {1} gives,
\[ \begin{aligned}
\frac{(s + 2\lambda + \mu) P_1(s)}{3\lambda} &= \frac{\mu P_1(s) + 1}{s + 3\lambda}, \\
P_1(s) &= \frac{3\lambda}{(s + 2\lambda + \mu)(s + 3\lambda) - 3\lambda\mu}.
\end{aligned} \tag{5.56} \]
Simplifying Eq. (5.56)(b) gives,
\[ P_1(s) = \frac{3\lambda}{s^2 + 5\lambda s + 6\lambda^2 + \mu s}. \tag{5.57} \]
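Substituting the solution for state {1}, Eq. (5.57), into Eq. (5.53)(c) gives the solution for the final absorbing state {0},
\[ \begin{aligned}
sP_0(s) &= 2\lambda P_1(s) = 2\lambda \left[ \frac{3\lambda}{s^2 + 5\lambda s + 6\lambda^2 + \mu s} \right], \\
P_0(s) &= \frac{6\lambda^2}{s \left[ s^2 + 5\lambda s + 6\lambda^2 + \mu s \right]}.
\end{aligned} \tag{5.58} \]
Expanding and factoring the denominator of Eq. (5.58)(b) gives the transform of the absorption state as,
\[ P_0(s) = \frac{6\lambda^2}{s \left[ s + \tfrac{1}{2}\left( 5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) \right] \left[ s + \tfrac{1}{2}\left( 5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) \right]}. \tag{5.59} \]
Expanding the partial fractions of Eq. (5.59) and taking the inverse Laplace transform results in the following reliability function,
\[ \begin{aligned}
R(t) = {} & \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}\; e^{-\frac{1}{2}\left( 5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) t} \\
& - \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}\; e^{-\frac{1}{2}\left( 5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) t}.
\end{aligned} \tag{5.60} \]
Integrating Eq. (5.60) using Eq. (2.24) produces the MTTF of,
\[ \begin{aligned}
MTTF = {} & \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{(5\lambda + \mu)\sqrt{\lambda^2 + 10\lambda\mu + \mu^2} - \lambda^2 - 10\lambda\mu - \mu^2} \\
& - \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{(5\lambda + \mu)\sqrt{\lambda^2 + 10\lambda\mu + \mu^2} + \lambda^2 + 10\lambda\mu + \mu^2}.
\end{aligned} \tag{5.61} \]
Simplifying Eq. (5.61) gives the MTTF for a TMR system with repair as,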
\[ MTTF = \frac{5\lambda + \mu}{6\lambda^2}. \tag{5.62} \]
Rearranging Eq. (5.62) and isolating the repair term from the failure term gives,
\[ MTTF = \frac{5}{6\lambda} + \frac{\mu}{6\lambda^2}. \tag{5.63} \]
MODELING A PARALLEL SYSTEM WITH IMPERFECT COVERAGE
A more realistic model of a Parallel Redundant System assumes that not all faults are recoverable and that the coverage factor c denotes the conditional probability that the system detects the fault and survives. The state diagram for this system is shown in Figure 7.
[Figure 7 diagram: {2} → {1} at rate 2λc; {1} → {2} at rate µ; {1} → {0} at rate λ; {2} → {0} at rate 2λ(1 − c).]
Figure 7 – The transition diagram for a Parallel Redundant system with repair and imperfect fault coverage. State {2} represents the fault free mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode. State {0} can be reached from State {2} through an uncovered fault, which causes the system to fail without the intermediate State {1} mode.
The transition matrix for Figure 7 is,
\[ \begin{bmatrix} \dfrac{dP_2(t)}{dt} \\[1ex] \dfrac{dP_1(t)}{dt} \\[1ex] \dfrac{dP_0(t)}{dt} \end{bmatrix}
= \begin{bmatrix}
-\left[ 2\lambda c + 2\lambda (1 - c) \right] & \mu & 0 \\
2\lambda c & -(\lambda + \mu) & 0 \\
2\lambda (1 - c) & \lambda & 0
\end{bmatrix}
\begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix}. \tag{5.64} \]
With an initial state of {2} producing a set of starting conditions, $P_2(0) = 1$, $P_1(0) = P_0(0) = 0$,
The system of equations describing the state transitions is,
$$
\begin{aligned}
\frac{dP_2(t)}{dt} &= -2\lambda c P_2(t) - 2\lambda(1-c)P_2(t) + \mu P_1(t), \\
\frac{dP_1(t)}{dt} &= 2\lambda c P_2(t) - (\lambda + \mu)P_1(t), \\
\frac{dP_0(t)}{dt} &= 2\lambda(1-c)P_2(t) + \lambda P_1(t).
\end{aligned}
\tag{5.65}
$$
Using the Laplace transform method, the above equations are reduced to,
$$
\begin{aligned}
sP_2(s) - 1 &= -2\lambda P_2(s) + \mu P_1(s), && \text{(a)} \\
sP_1(s) &= 2\lambda c P_2(s) - (\lambda + \mu)P_1(s), && \text{(b)} \\
sP_0(s) &= 2\lambda(1-c)P_2(s) + \lambda P_1(s). && \text{(c)}
\end{aligned}
\tag{5.66}
$$
Using Eq. (5.66)(a) and solving for state {2} gives,
$$
\begin{aligned}
sP_2(s) + 2\lambda P_2(s) &= \mu P_1(s) + 1, && \text{(a)} \\
(s + 2\lambda)P_2(s) &= \mu P_1(s) + 1, && \text{(b)} \\
P_2(s) &= \frac{\mu P_1(s) + 1}{s + 2\lambda}. && \text{(c)}
\end{aligned}
\tag{5.67}
$$
Using Eq. (5.66)(b) to solve for state {2} gives,
$$
\begin{aligned}
sP_1(s) &= 2\lambda c P_2(s) - (\lambda + \mu)P_1(s), && \text{(a)} \\
(s + \lambda + \mu)P_1(s) &= 2\lambda c P_2(s), && \text{(b)} \\
P_2(s) &= \frac{(s + \lambda + \mu)P_1(s)}{2\lambda c}. && \text{(c)}
\end{aligned}
\tag{5.68}
$$
Equating Eq. (5.67)(c) and Eq. (5.68)(c) and solving for state {1} gives,
$$
\frac{\mu P_1(s) + 1}{s + 2\lambda} = \frac{(s + \lambda + \mu)P_1(s)}{2\lambda c}.
\tag{5.69}
$$
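Eq. (5.66)(a) and (b) form a 2 × 2 linear system in P2(s) and P1(s), so the elimination steps of Eq. (5.67) through Eq. (5.69) can be cross-checked at any numerical test point. The sketch below assumes exact rational arithmetic with Python's `fractions` module and solves the pair by Cramer's rule. The function names and test values are illustrative assumptions. It also uses Eq. (5.66)(c) to confirm that the three transforms sum to 1/s, as conservation of probability requires.

```python
from fractions import Fraction


def solve_p2_p1(s, lam, mu, c):
    """Solve Eq. (5.66)(a) and (b) for P2(s) and P1(s) by Cramer's rule.

    Rearranged system:
        (s + 2*lam) * P2 -            mu * P1 = 1
        -2*lam*c    * P2 + (s + lam + mu) * P1 = 0
    """
    a11, a12 = s + 2 * lam, -mu
    a21, a22 = -2 * lam * c, s + lam + mu
    det = a11 * a22 - a12 * a21
    p2 = a22 / det    # right-hand side is (1, 0)
    p1 = -a21 / det
    return p2, p1


def s_times_p0(lam, c, p2, p1):
    """Right-hand side of Eq. (5.66)(c), which equals s * P0(s)."""
    return 2 * lam * (1 - c) * p2 + lam * p1


def check(s, lam, mu, c):
    """Return three booleans: Eq. (5.67)(c), Eq. (5.68)(c), conservation."""
    p2, p1 = solve_p2_p1(s, lam, mu, c)
    eq_5_67c = (mu * p1 + 1) / (s + 2 * lam)
    eq_5_68c = (s + lam + mu) * p1 / (2 * lam * c)
    conserved = s * (p2 + p1) + s_times_p0(lam, c, p2, p1)
    return eq_5_67c == p2, eq_5_68c == p2, conserved == 1


if __name__ == "__main__":
    # Assumed test point; any positive rationals with c > 0 would do.
    s, lam, mu, c = Fraction(1, 2), Fraction(1, 1000), Fraction(1, 10), Fraction(9, 10)
    print(check(s, lam, mu, c))
```

Evaluating the same system at s = 0 gives s·P0(s) = 1, which is consistent with state {0} being absorbing: the system eventually fails with probability one.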
Simplifying Eq. (5.69) and solving for state {1} gives,
$$
\begin{aligned}
2\lambda c\mu P_1(s) + 2\lambda c &= (s + 2\lambda)(s + \lambda + \mu)P_1(s), \\
P_1(s) &= \frac{2\lambda c}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda c\mu}.
\end{aligned}
\tag{5.70}
$$
Using Eq. (5.66)(a) and solving for state {1} gives,
$$
\begin{aligned}
sP_2(s) + 2\lambda P_2(s) - 1 &= \mu P_1(s), \\
(s + 2\lambda)P_2(s) - 1 &= \mu P_1(s), \\
P_1(s) &= \frac{(s + 2\lambda)P_2(s) - 1}{\mu}.
\end{aligned}
\tag{5.71}
$$
Using Eq. (5.66)(b) and solving for state {1} gives,
$$
\begin{aligned}
sP_1(s) &= 2\lambda c P_2(s) - (\lambda + \mu)P_1(s), \\
(s + \lambda + \mu)P_1(s) &= 2\lambda c P_2(s), \\
P_1(s) &= \frac{2\lambda c}{s + \lambda + \mu}P_2(s).
\end{aligned}
\tag{5.72}
$$
Equating Eq. (5.71) and Eq. (5.72) and solving for state {2} gives,
$$
\begin{aligned}
\frac{(s + 2\lambda)P_2(s) - 1}{\mu} &= \frac{2\lambda c P_2(s)}{s + \lambda + \mu}, \\
P_2(s) &= \frac{s + \lambda + \mu}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda c\mu}.
\end{aligned}
\tag{5.73}
$$
Substituting Eq. (5.70) and Eq. (5.73) into Eq. (5.66)(c) and solving for state {0}
gives,
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 

Fault-Tolerant System Reliability in the Presence of Imperfect Diagnostic Coverage

  • 1. FAULT–TOLERANT SYSTEM RELIABILITY IN THE PRESENCE OF IMPERFECT DIAGNOSTIC COVERAGE By Glen B. Alleman Irvine, California, Copyright © 1980 Submitted in Partial Fulfillment Of Masters in Systems Management (MSSM) University of Southern California Los Angeles, California June 1980 Revised and updated Niwot, Colorado, Copyright © 1996, 2000, 2014
  • 3. FAULT–TOLERANT SYSTEM RELIABILITY IN THE PRESENCE OF IMPERFECT DIAGNOSTIC COVERAGE Glen B. Alleman The deployment of computer systems for the control of mission critical processes has become the norm in many industrial and commercial markets. The reliability of these systems is usually analyzed in terms of the Mean Time to Failure. The design and analysis of high reliability systems is now a mature science. Starting with fault–tolerant central office switches (ESS4), dual redundant and n–way redundant systems are now available in a variety of application domains. The technologies of microprocessor based industrial controls and redundant central processor systems create the opportunity to build fault–tolerant computing systems on a much smaller scale than previously found in the commercial marketplace. The diagnostic facilities utilized in a modern Fault–Tolerant Computer System attempt to detect fault conditions present in the hardware and embedded software. Coverage is the figure of merit describing the effectiveness of the diagnostic system. This thesis examines the effects of less than perfect diagnostic coverage on system reliability. The mathematical background for analyzing the coverage factor of fault–tolerant systems is presented in detail, along with specific examples of practical systems and their relative reliability measures. “In a complex system, malfunction and even total nonfunction may not be detected for long periods, if ever.” (John Gall)
  • 4. i TABLE OF CONTENTS
    INTRODUCTION, p. 10
      Fault Tolerant System Definitions, p. 10
      Fault–Tolerant System Functions, p. 11
      Overview of This Thesis, p. 11
    RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS, p. 13
      Deterministic Models, p. 13
      Probabilistic Models, p. 14
      Exponential and Poisson Relationships, p. 15
      Reliability Availability and Failure Density Functions, p. 20
      Mean Time to Failure, p. 23
      Mean Time to Repair, p. 27
      Mean Time Between Failure, p. 27
      Mean Time to First Failure, p. 27
      General Availability Analysis, p. 31
      Instantaneous Availability, p. 33
      Limiting Availability, p. 34
    SYSTEM RELIABILITY, p. 37
      Series Systems, p. 37
      Parallel Systems, p. 39
      M–of–N Systems, p. 39
      Selecting the Proper Evaluation Parameters, p. 40
      Imperfect Fault Coverage And Reliability, p. 42
      Redundant System with Imperfect Coverage, p. 42
      Generalized Imperfect Coverage, p. 44
      Markov Models Of Fault–Tolerant Systems, p. 49
      Solving the Markov Matrix, p. 52
      Chapman–Kolmogorov Equations, p. 52
      Markov Matrix Notation, p. 55
      Laplace Transform Techniques, p. 56
      Modeling a Duplex System, p. 58
      Modeling a Triple–Redundant System, p. 64
      Modeling a Parallel System with Imperfect Coverage, p. 68
      Modeling A TMR System with Imperfect Coverage, p. 74
      Modeling A Generalized TMR System, p. 76
      Laplace Transform Solution to Systems of Equations, p. 77
      Specific Solution to the Generalized System, p. 78
    PRACTICAL EFFECTS OF PARTIAL COVERAGE, p. 85
      Determining Coverage Factors, p. 85
  • 5. ii
      Coverage Measurement Statistics, p. 86
      Coverage Factor Measurement Assumptions, p. 86
      Coverage Measurement Sampling Method, p. 87
      Normal Population Statistics, p. 87
      Sample Size Computation, p. 88
      General Confidence Intervals, p. 89
      Proportion Statistics, p. 90
      Confidence Interval Estimate of the Proportion, p. 91
      Unknown Population Proportion, p. 91
      Clopper–Pearson Estimation, p. 92
      Practical Sample Estimates, p. 93
      Time Dependent Aspects of Fault Coverage Measurement, p. 94
      Common Cause Failure Effects, p. 95
      Square Root Bounding Problem, p. 97
      Beta Factor Model, p. 97
      Multi–Nominal Failure Rate (Shock Model), p. 97
      Binomial Failure Rate Model, p. 98
      Multi–Dependent Failure Fraction Model, p. 98
      Basic Parameter Model, p. 99
      Multiple Greek Letter Model, p. 99
      Common Load Model, p. 100
      Nonidentical Components Model, p. 100
      Practical Example of Common Cause Failure Analysis, p. 100
      Common Cause Software Reliability, p. 102
      Software Reliability Concepts, p. 103
      Software Reliability and Fail–Safe Operations, p. 109
    PARTIAL FAULT COVERAGE SUMMARY, p. 111
      Effects of Coverage, p. 112
    REMAINING QUESTIONS, p. 113
      Realistic Probability Distributions, p. 113
      Multiple Failure Distributions, p. 114
      Weibull Distribution, p. 116
      Periodic Maintenance, p. 118
      Periodic Maintenance of Repairable Systems, p. 119
      Reliability Improvement for a TMR System, p. 122
    CONCLUSIONS, p. 124
    MARKOV CHAINS, p. 125
      Definition A.1, p. 125
      Definition A.2, p. 125
      Definition A.3, p. 126
      Theorem A.1, p. 126
      Proof of Theorem A.1, p. 126
  • 6. iii
      Lemma A.1, p. 128
      Theorem A.2, p. 128
      Proof of Theorem A.2, p. 128
      Theorem A.3, p. 130
      Proof of Theorem A.3, p. 130
    SOLUTIONS TO LINEAR SYSTEMS, p. 133
      Theorem B.1, p. 135
      Proof of Theorem B.1, p. 136
    PROBABILITY GENERATING FUNCTIONS, p. 139
      Definition C.1, p. 139
      Theorem C.1, p. 140
      Proof of Theorem C.1, p. 140
    POISSON PROCESSES, p. 142
      Definition D.1, p. 143
      Definition D.2, p. 145
      Definition D.3, p. 145
      Definition D.4, p. 148
      Definition D.5, p. 148
      Definition D.6, p. 149
      Theorem D.1, p. 151
    RENEWAL THEORY, p. 152
      Definition E.1, p. 153
      Theorem E.1, p. 154
      Proof of Theorem E.1, p. 154
      Theorem E.2, p. 155
      Proof of Theorem E.2, p. 155
    LAPLACE TRANSFORM GENERALIZED SOLUTION METHODS, p. 163
      Definition F.1, p. 164
      Definition F.2, p. 165
      Definition F.3, p. 165
      Definition F.4, p. 166
    LAPLACE TRANSFORM GENERALIZED SOLUTION METHODS, p. 168
      Definition F.1, p. 169
      Definition F.2, p. 170
      Definition F.3, p. 170
      Definition F.4, p. 171
  • 7. iv LIST OF FIGURES
    Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be used to develop a set of time dependent metrics used to evaluate various configurations. p. 13
    Figure 2 – Assumptions regarding the behavior of a random process that generates events following the Poisson probability distribution function. p. 16
    Figure 3 – State Transition probabilities as a function of time in the Continuous–Time Markov chain that is subject to the constraints of the Chapman–Kolmogorov equation. p. 51
    Figure 4 – Definition of the exponential order of a function. p. 57
    Figure 5 – The state transition diagram for a Parallel Redundant system with repair. State {2} represents the fault free operation mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode, the absorption state. p. 59
    Figure 6 – The transition diagram for a Triple Modular Redundant system with repair. State {2} represents the fault free (TMR) operation mode, State {1} represents a single fault (Duplex) operation mode with a return path to the fault free mode, and State {0} represents the system failure mode, the absorbing state. p. 66
    Figure 7 – The transition diagram for a Parallel Redundant system with repair and imperfect fault coverage. State {2} represents the fault free mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode. State {0} can be reached from State {2} through an uncovered fault, which causes the system to fail without the intermediate State {1} mode. p. 69
    Figure 8 – The state transition diagram for a Triple Modular Redundant system with repair and imperfect fault coverage. State {3} represents the fault free mode, State {2} represents the single fault (Duplex) mode, State {1} represents the two–fault (Simplex) mode, and State {0} represents the system failure mode. p. 74
  • 8. v
    Figure 9 – The state transition diagram for a Generalized Triple Modular Redundant system with repair and perfect fault detection coverage. The system initially operates in a fault free state {0}. A fault in any module results in the transition to state {1, …, N}. A second fault while in state {1, …, N} results in the system failure state {N+1}. p. 78
    Figure 10 – Sample size requirement for a specified estimate as tabulated by Clopper and Pearson. p. 93
    Figure 11 – Common Cause Failure modes guide figures for electronic programmable systems [HSE87]. These are ratios of non–CCF to CCF for various system configurations. CCFs are defined as non–random faults that are designed in or experienced through environmental damage to the system. Other sources [SINT88], [SINT89] provide different figures. p. 102
    Figure 12 – Four Software Growth Model expressions. The exponential and hyperexponential growth models represent software faults that are time independent. The S–Shaped growth models represent time delayed and time inflection software fault growth rates [Mats88]. p. 104
    Figure 13 – MTTF of Simplex, Parallel Redundant, and TMR Systems. p. 111
    Figure 14 – MTTF of Parallel Redundant and TMR Systems with varying degrees of coverage. p. 112
    Figure 15 – Mean Time to Failure increases for a Triple Modular Redundant system with periodic maintenance. This graph shows that maintenance intervals which are greater than one–half of the mean time to failure for one module have little effect on increasing reliability. But frequent maintenance, even low quality maintenance, improves the system reliability considerably. p. 123
  • 9. vi ACKNOWLEDGMENTS The author wishes to thank Dr. Wing Toy of AT&T Naperville Laboratories, Naperville, Illinois, for his consultation on the ESS4 Central Office Switch and his contributions to this work; Dr. Victor Lowe of Ford Aerospace, Newport Beach, California, for his consultation on the general forms of Markov model solutions; Mr. Henk Hinssen of Exxon Corporation, Antwerp, Belgium, for his discussion of the effects of partial diagnostic coverage in Triple Modular Redundant Systems at the Exxon Polystyrene Plant, Antwerp, Belgium; Dr. Phil Bennet of The Centre for Software Engineering, Flixborough, England, for his ideas regarding software reliability measurements in the presence of undetected faults; and Mr. Daniel Lelivre of Factory Systems, Paris, France, for his comments and review of this work and its applicability to safety critical systems at Total, Mobile, and NorSoLor chemical plants. Several institutions have contributed source material for this work, including The Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology (SINTF), Trondheim, Norway, and the United Kingdom Atomic Energy Authority, Systems Reliability Service, Culcheth, Warrington, England. This work is submitted as a Thesis in completion of a Master's Degree in Systems Management, University of Southern California, 1980. It was extended in support of the efforts that gained compliance of the Tricon with process safety standards in the United States, Europe, and United Kingdom.
  • 10. vii PREFACE This work was extended in support of the design and development of the Triple Modular Redundant (TMR) computer produced by Triconex Corporation of Irvine, California. In 1987, Triconex designed and manufactured its first digital TMR process control computer that was deployed in a variety of industrial environments, including: turbine controls, boiler controls, fire and gas systems, emergency shutdown systems, and general-purpose fault–tolerant real–time control systems. The Tricon (a classic 1980’s product name) was based on several innovative technologies. As the manager of software development for Triconex, I was intimately involved in the software and hardware of the Tricon. In 1987, TMR was not a completely new concept. Flight control systems and navigation computers were found in aerospace applications. The Space Shuttle used a TMR+1 computer system and was well understood by the public. What was new to the market was an affordable TMR computer that could be deployed in a rugged industrial environment. The heart of the Tricon was a hardware voting system that performed a 2–out–of–3 vote for all digital input signals presented to the control program. The contents of memory and the computed digital outputs were again voted 2–out–of–3 at the physical output devices. Once the digital command had been applied to the output device, its driven state was verified and the results reported to the control program. The Tricon contained 3 independent (but identical) 32–bit battery powered microprocessors, a 2–out–of–3 voting digital serial bus connecting the three processors, a dual redundant power system using DC–to–DC converters (state of the art for 1987), and three separate isolated serial I/O buses connecting the I/O subsystem to the three main processors. The I/O subsystem cards were
  • 11. viii themselves TMR, using onboard 8–bit processors and a quad output device to vote 2–out–of–3 the digital commands received from the control program. The Tricon executed a control program on a periodic basis. The architecture of the operating software was modeled after the programmable controllers of the day, which were programmed in a ladder logic representing mechanical relays and timers. Both digital and analog devices provided input and output to the control program. The control program accepted input states from the I/O subsystem, evaluated the decision logic, and produced output commands, which were sent to the I/O subsystem. This cycle was performed every 10 ms in a normally configured system. In the presence of faults, the key to the survivability of the Tricon was the combination of TMR hardware and fault diagnostic software. Diagnostic software was applied to each processor element and the digital I/O device. This diagnostic software was capable of detecting all single stuck–at faults, many multiple stuck–at faults, and many transient faults. A fault–injection and reliability evaluation technique developed by the author and described in this work was used to evaluate the coverage factor of the diagnostic software. Triconex no longer exists as an independent company, having been absorbed into a larger control systems vendor. The materials presented in this work were critical to Tricon’s TÜV and SINTF [SINTF89] certification for North Sea Norwegian Sector, German (then the Federal Republic), Belgian, and British Health and Safety Executive (HSE) industrial safety operations. The concept of fault–tolerant computing has become important again in the distributed computing marketplace. The Tandem Non–Stop processor, modern flight and navigation computers, and telecommunications computers all depend on some form of diagnostics to initiate the fault detection and recovery process. A recent systems architectural paper mentioned TMR but without
  • 12. ix sufficient attention to the underlying details. [1] The reissuing of this paper addresses several gaps in the literature:
    § The foundations of fault–tolerance and fault–tolerance modeling have faded from the computer science literature. The underlying mathematics of fault–tolerant systems presents a challenge for an industry focused on rapid software development and short time to market.
    § The fact that unreliable and untrustworthy software systems are created by latent faults in both the hardware and software is poorly understood in this age of Object–Oriented programming and plug and play systems development.
    § The Markov models presented in this work have general applicability to distributed computer systems analysis and need to be restated. The application of these models to distributed processing systems with symmetric multi–processor computers is a reemerging science. With the advent of high–availability computing systems, the foundations of these systems need to be understood once again.
    § The current crop of computer science practitioners has very little understanding of the complexities and subtleties of the underlying hardware and firmware that make up the diagnostic systems of modern computers, their reliability models, and the mathematics of system modeling.
    Glen B. Alleman, Niwot, Colorado 80503. Updated April 2000.
    [1] “Attribute Based Architectural Styles,” Mark Klein and Rick Kazman, CMU/SEI–99–TR–022, Software Engineering Institute, Carnegie Mellon University, October 1999.
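The 2–out–of–3 vote described in the preface can be illustrated with a short software sketch. This is only an illustration of the voting idea: the Tricon performed its vote in hardware, and the function names `vote_2oo3` and `disagreeing_channels` are hypothetical names chosen here, not taken from the original work.

```python
from typing import List


def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Return the majority value of three redundant digital signals.

    The result is True when at least two of the three channels are True,
    so a single faulty channel cannot change the voted value.
    """
    return (a and b) or (a and c) or (b and c)


def disagreeing_channels(a: bool, b: bool, c: bool) -> List[str]:
    """List the channels (named A, B, C here for illustration) whose value
    differs from the voted result. A diagnostic layer could use such a list
    as a starting point for fault location."""
    voted = vote_2oo3(a, b, c)
    return [name for name, value in zip("ABC", (a, b, c)) if value != voted]


if __name__ == "__main__":
    # Channel C reports a different value from channels A and B.
    print(vote_2oo3(True, True, False))
    print(disagreeing_channels(True, True, False))
```

The vote masks a single disagreeing channel, while the list of disagreeing channels is the kind of information a diagnostic subsystem needs in order to locate the fault.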
  • 13. 10/196 Chapter 1 INTRODUCTION Two approaches are available to increase the reliability of a digital computer system: fault avoidance (fault intolerance) and fault tolerance [Aviz75]. Fault avoidance results from conservative design techniques utilizing high–reliability components, system burn–in, and careful design and testing processes. The goal of fault avoidance is to reduce the possibility of a failure [Aviz84], [Rand75], [Kim86], [Ozak88]. The presence of faults, however, results in system failure, negating all prior efforts to increase system reliability [Litt75], [Low72]. Fault–tolerance provides the system with the ability to withstand a system fault, maintain a safe state in the presence of a fault, and possibly continue to operate in the presence of this fault. FAULT TOLERANT SYSTEM DEFINITIONS A set of consistent definitions is used here to avoid confusion with existing definitions. These definitions are provided by the IFIP Working Group 10.4, Reliable Computing and Fault–Tolerance [Aviz84], [Aviz82], [Ande82], [Robi82], [Lapr84], [TUV86]:
    § A Failure occurs when the system user perceives that a service resource has ceased to deliver the expected results.
    § An Error occurs when some part of a system resource assumes an undesired state. Such a state is contrary to the specification of the resource or to the expectation (requirement) of the user.
    § A Fault is detected when either a failure of the resource occurs, or an error is observed within the resource. The cause of the failure or error is said to be a fault.
  • 14. 11/196 FAULT–TOLERANT SYSTEM FUNCTIONS In fault–tolerant systems, hardware and software redundancy provides information needed to negate the effects of a fault [Aviz67]. The design of fault–tolerant systems involves the selection of a coordinated failure response mechanism that follows four steps [Siew84], [Mell77], [Toy86]:
    § Fault Detection
    § Fault Location and Identification
    § Fault Containment and Isolation
    § Fault Masking
    During the fault detection process, diagnostics are used to gather and analyze information generated by the fault detection hardware and software. These diagnostics determine the appropriate fault masking and fault recovery actions [Euri84], [Rouq86], [Ossf80], [Gluc86], [John85], [John86], [Kirr86], [Chan70]. It is the less than perfect operation of the Fault Detection, Location, and Identification processes of the system that is examined in this work. The reliability of the fault–tolerant system depends on the ability of the diagnostic subsystem to correctly detect and analyze faults [Kirr87], [Gall81], [Cook73], [Brue76], [Lamp82]. The measure of the correct operation of the diagnostic subsystem is called the Coverage Factor. It is assumed in most fault–tolerant product offerings that the diagnostic coverage factor is perfect, i.e. 100%. This work addresses the question: What is the reliability of the Fault–Tolerant system in the presence of less than perfect coverage? To answer this question, some background in the mathematics of reliability theory is necessary. Overview of This Thesis The development of a reliability model of a Triple Modular Redundant (TMR) system with imperfect diagnostic coverage is the goal of this work. Along the
  • 15. 12/196 way, the underlying mathematics for analyzing these models is developed. The Markov Chain method will be the primary technique used to model the failure and repair processes of the TMR system. The Laplace transform will be used to solve the differential equations representing the transition probabilities between the various states of the TMR system described by the Markov model. The models developed for a TMR system with partial coverage can be applied to actual systems. In order to make the models useful in the real world, a deeper understanding of diagnostic coverage and fault detection is presented. The appendices provide the background for the Markov models as well as the statistical processes. The mathematics of Markov Chains and the statistical processes that underlie system faults and their repair can be applied to a variety of other analytical problems, including system performance analysis. It is hoped the reader will gain some appreciation of the complexity and beauty of modern systems as well as the subtleties of their design and operation. If the reader is interested in skipping to the end, Chapter 7 provides a summary of the effects of partial coverage on various system configurations.
  • 16. 13/196 Chapter 2 RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS
When presented with the reliability figures for a computer system, the user must often accept the stated value as factual and relevant and construct a comparison matrix to determine the goodness of each product offering [Kraf81]. Difficulties often arise through the definition and interpretation of the term reliability. This chapter develops the necessary background for understanding the reliability criteria defined by the manufacturers of computer equipment. Figure 1 lists the criteria for defining system reliability [Siew82], [Ande72], [Ande79], [Ande81].
Deterministic Models
  Survival of at least k component failures
Probabilistic Models
  $z(t)$ – Hazard (failure rate) function
  $R(t)$ – Reliability function
  $\mu$ – Repair Rate
  $A(t)$ – Availability function
Single Parameter Models
  MTTF – Mean Time to Failure
  MTTR – Mean Time to Repair
  MTBF – Mean Time Between Failure
  c – Coverage
Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be used to develop a set of time dependent metrics used to evaluate various configurations.
DETERMINISTIC MODELS
The simplest reliability model is a deterministic one, in which the minimum number of component failures that can be tolerated without system failure is taken as the figure of merit for the system.
  • 17. 14/196 Probabilistic Models
The failure rate of electronic and mechanical devices varies as a function of time. This time dependent failure rate is defined by the hazard function, $z(t)$. The hazard function is also referred to as the hazard rate or mortality rate. For electronic components on the normal–life portion of their failure curve, the failure rate is assumed to be a constant, $\lambda$, rather than a function of time. The exponential probability distribution is the most common distribution encountered in reliability models, since it describes accurately most life testing aspects for electronic equipment [Kapu77]. The probability density function (pdf), Cumulative Distribution Function (CDF), reliability function ($R(t)$), and hazard (failure rate) function ($z(t)$) of the exponential distribution are expressed by the following [Kend77]:
$$\text{pdf} = f(t) = \lambda e^{-\lambda t} \tag{2.1}$$
$$\text{CDF} = F(t) = 1 - e^{-\lambda t} \tag{2.2}$$
$$\text{Reliability} = R(t) = e^{-\lambda t} \tag{2.3}$$
$$\text{Hazard Function} = z(t) = \lambda \tag{2.4}$$
The failure rate parameter $\lambda$ describes the rate at which failures occur over time [DoD82]. In the analysis that follows, the failure rate is assumed to be constant, and measured as failures per million hours. Although a time dependent failure rate could be used for un–aged electronic components, the aging of the electronic components can remove the traditional bathtub curve failure distribution. The constant failure rate assumption is also extended to the firmware controlling the diagnostics of the system [Bish86], [Knig86], [Kell88], [Ehre78], [Eckh75], [Gmei79], [RTCA85].
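The four expressions in Eq. (2.1) through Eq. (2.4) can be evaluated directly. The following is a minimal sketch, not part of the original thesis; the function names and the numeric failure rate (100 failures per million hours) are assumptions chosen only for illustration. The hazard is computed as the ratio $f(t)/R(t)$ so that its constancy can be observed rather than assumed.

```python
import math


def pdf(t, lam):
    """Eq. (2.1): f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)


def cdf(t, lam):
    """Eq. (2.2): F(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)


def reliability(t, lam):
    """Eq. (2.3): R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)


def hazard(t, lam):
    """Hazard as f(t) / R(t); for the exponential case this equals lam, Eq. (2.4)."""
    return pdf(t, lam) / reliability(t, lam)


if __name__ == "__main__":
    lam = 100e-6  # assumed value: 100 failures per million hours
    for t in (1_000.0, 10_000.0, 50_000.0):
        print(f"t={t:>8.0f} h  R={reliability(t, lam):.4f}  "
              f"F={cdf(t, lam):.4f}  z={hazard(t, lam):.2e}")
```

The printed hazard stays at the assumed $\lambda$ for every $t$, while $R(t)$ and $F(t)$ always sum to one.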
  • 18. 15/196 Exponential and Poisson Relationships
In modeling the reliability functions associated with actual equipment, several simplifying assumptions must be made to render the resulting mathematics tractable. These assumptions do not reduce the applicability of the resulting models to real–world phenomena. One simplifying assumption is that the random variables associated with the failure process have exponential probability distributions. The property of the exponential distribution that makes it easy to analyze is that it does not decay with time. If the lifetime of a component is exponentially distributed, then after some amount of time in use the item is assumed to be as good as new. Formally, this property states that the random variable $X$ is memoryless if the expression
$$P\{X > s + t \mid X > t\} = P\{X > s\}$$
is valid for all $s, t \ge 0$ [Cram66], [Ross83]. If the random variable $X$ is the lifetime of some item, then the probability that the item is functional at time $s + t$, given that it survived to time $t$, is the same as the initial probability that it was functional at time $s$. If the item is functional at time $t$, then the distribution of the remaining amount of time that it survives is the same as the original lifetime distribution. The item does not remember that it has already been in use for a time $t$. This property is equivalent to the expression
$$\frac{P\{X > s + t,\; X > t\}}{P\{X > t\}} = P\{X > s\}$$
or
$$P\{X > s + t\} = P\{X > s\}\,P\{X > t\}.$$
Since the form of this expression is satisfied when the random variable $X$ is exponentially distributed (since $e^{-\lambda(s+t)} = e^{-\lambda s}e^{-\lambda t}$), it follows that exponentially distributed random variables are memoryless. The recognition of this property is vital to the understanding of the models presented in this work. If the underlying failure process is not memoryless, then the exponential distribution model is not valid.
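The memoryless property can be checked numerically by comparing the conditional survival probability with the unconditional one. The sketch below is illustrative only: the Weibull survival function with shape 2 is an assumption introduced here purely as a contrasting distribution with an increasing hazard, and all numeric values are arbitrary.

```python
import math


def exp_survival(x, lam):
    """P{X > x} for an exponential lifetime with rate lam."""
    return math.exp(-lam * x)


def weibull_survival(x, scale, shape):
    """P{X > x} for a Weibull lifetime (assumed here only for contrast)."""
    return math.exp(-((x / scale) ** shape))


def conditional_survival(survival, s, t):
    """P{X > s + t | X > t} = P{X > s + t} / P{X > t}."""
    return survival(s + t) / survival(t)


if __name__ == "__main__":
    s, t = 2_000.0, 5_000.0  # hours, arbitrary illustration values

    def exp_s(x):
        return exp_survival(x, 1e-4)

    def wei_s(x):
        return weibull_survival(x, 10_000.0, 2.0)

    # Exponential: the two numbers agree, so the item is "as good as new".
    print(conditional_survival(exp_s, s, t), exp_s(s))
    # Weibull with shape 2: the conditional value is smaller, so age matters.
    print(conditional_survival(wei_s, s, t), wei_s(s))
```

For the exponential case the two printed values coincide. For the Weibull case they differ, which is the situation in which the exponential model would not be valid.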
  • 19. 16/196 The exponential probability distributions and the related Poisson processes used in the reliability models are formally based on the assumptions shown in Figure 2 [Cox 62], [Thor26].
§ Failures occur completely randomly and are independent of any previous failure. A single failure event does not provide any information regarding the time of the next failure event.
§ The probability of a failure during any interval of time $[0, t]$ is proportional to the length of the interval, with a constant of proportionality $\lambda$. The longer one waits the more likely it is a failure will occur.
Figure 2 – Assumptions regarding the behavior of a random process that generated events following the Poisson probability distribution function.
An expression describing the random processes in Figure 2 results from the Poisson Theorem which states that the probability of an event A occurring k times in n trials is approximately [Papo65], [Pois37],
$$\frac{n(n-1)\cdots(n-k+1)}{1 \cdot 2 \cdots k}\,p^{k}q^{\,n-k}, \tag{2.5}$$
where $p = P\{A\}$ is the probability of an event A occurring in a single trial and $q = 1 - p$. This approximation is valid when $n \to \infty$, $p \to 0$ and the product $n \cdot p$ remains finite. It should be noted that a large number of different trials of independent systems is needed for this condition to hold, rather than a large number of repeated trials on the same system. The Poisson Theorem can be simplified to the following approximation for the probability of an event occurring k times in n trials [Kend77],
  • 20. 17/196
$$\begin{aligned}
\frac{n!}{k!\,(n-k)!}\,p^{k}q^{\,n-k} &= \frac{n!}{k!\,(n-k)!}\left(\frac{np}{n}\right)^{k}\left(1-\frac{np}{n}\right)^{n-k},\\
&\approx \frac{\sqrt{2\pi}\,n^{\,n+1/2}\,e^{-n}}{k!\,\sqrt{2\pi}\,(n-k)^{\,n-k+1/2}\,e^{-(n-k)}}\,\frac{(np)^{k}}{n^{k}}\left(1-\frac{np}{n}\right)^{n-k},\\
&= \frac{(np)^{k}}{k!}\,e^{-k}\left(1-\frac{k}{n}\right)^{-(n-k+1/2)}\left(1-\frac{np}{n}\right)^{n-k},\\
&\approx \frac{(np)^{k}}{k!}\,e^{-np}.
\end{aligned} \tag{2.6}$$
The second step applies Stirling's approximation to $n!$ and $(n-k)!$. The exponential and Poisson expressions are directly related. A detailed understanding of this relationship will aid in the development of the analysis that follows. Using the Poisson assumptions described in Figure 2, the probability of n failures prior to time t is,
$$P\{N = n \mid T \le t\} = P_t(n). \tag{2.7}$$
From Eq. (2.7), the probability that no failures occur $(n = 0)$ between time $t$ and time $t + \Delta t$ is,
$$P_{t+\Delta t}(0) = P_t(0)\,[1 - \lambda\Delta t], \tag{2.8}$$
where the term $\lambda = np$ describing the total number of failures is of moderate magnitude [Fell67]. The probability that n failures occur between time $t$ and time $t + \Delta t$ is then,
$$P_{t+\Delta t}(n) = P_t(n)\,[1 - \lambda\Delta t] + P_t(n-1)\,[\lambda\Delta t], \quad n > 0. \tag{2.9}$$
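The quality of the approximation in Eq. (2.6) can be examined numerically by holding the product $n \cdot p$ fixed while $n$ grows. This is a hedged sketch rather than part of the original development; the helper names, the fixed mean of 2.0, and the range of $k$ are assumptions chosen for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Exact probability of k occurrences in n trials, as in Eq. (2.5)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Poisson approximation from Eq. (2.6), with mean = n * p."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def max_gap(n, mean, k_max=10):
    """Largest pointwise difference between the two expressions for k = 0..k_max."""
    p = mean / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean))
               for k in range(k_max + 1))


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, max_gap(n, mean=2.0))
```

The gap shrinks as $n$ grows and $p$ falls, which matches the stated conditions $n \to \infty$, $p \to 0$ with $n \cdot p$ finite.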
  • 21. 18/196 Using Eq. (2.9) and Eq. (2.8) and allowing $\Delta t \to 0$, a differential equation can be constructed describing the rate at which failures occur between time $t$ and time $t + \Delta t$,
$$\begin{aligned}
\frac{d}{dt}P_t(0) &= -\lambda P_t(0),\\
\frac{d}{dt}P_t(n) &= \lambda\left[P_t(n-1) - P_t(n)\right], \quad \text{for } n > 0,
\end{aligned} \tag{2.10}$$
with the initial conditions of,
$$P_0(0) = 1, \qquad P_0(n) = 0 \ \text{ for } n > 0. \tag{2.11}$$
The unique solution to the differential equation in Eq. (2.10) is [Klie75],
$$P_t(n) = \frac{(\lambda t)^{n} e^{-\lambda t}}{n!}, \quad n = 0, 1, 2, \ldots \tag{2.12}$$
which is the Poisson distribution defined in Eq. (2.6). Using Eq. (2.12) to define a function $F(t)$ representing the probability that no failures have occurred as of time t gives,
$$F(t) = P_t\{n = 0\} = e^{-\lambda t}. \tag{2.13}$$
The expression in Eq. (2.13) is used here as the distribution function of the Poisson failure process [Fell67]. Strictly, it is the probability of surviving to time t, and the Cumulative Distribution Function of the time to the first failure is its complement, $1 - e^{-\lambda t}$, in agreement with Eq. (2.2). By using Eq. (2.19), the probability distribution function, pdf, of the Poisson process can be given as,
$$f(t) = \lambda e^{-\lambda t}, \tag{2.14}$$
  • 22. 19/196 which is the exponential probability distribution. [2] The following statement describes the relationship between the Poisson and exponential expressions [Cox65],
If the number of failures occurring over an interval of time is Poisson distributed, then the time between failures is exponentially distributed.
An alternative method of relating the exponential and Poisson expressions is useful at this point. The functions defined in Eq. (2.1) and Eq. (2.2) are based on the interchangeability of the pdf and the CDF for any defined probability distribution. The Cumulative Distribution Function $F(x)$ of a random variable X is defined as a function obeying the following relationship [Papo65],
$$F(x) = P\{X \le x\}, \quad -\infty < x < \infty. \tag{2.15}$$
The probability density function $f(x)$ of a random variable X can be derived from the CDF using the following [Dave70],
$$f(x) = \frac{d}{dx}F(x). \tag{2.16}$$
The CDF can be obtained from the pdf by the following,
$$F(x) = P\{X \le x\} = \int_{-\infty}^{x} f(t)\,dt, \quad -\infty < x < \infty. \tag{2.17}$$
[2] This development of the pdf is very informal. Making use of the forward reference to construct an expression is circular logic and would not be permitted in more formal circumstances. For the purposes of this work, this type of behavior can be tolerated, since the purpose of this development is to get to the results rather than dwell on the analysis process. This is a fundamental difference between mathematics and engineering.
Using Eq. (2.16) and Eq. (2.17), the CDF and pdf expressions for an exponential distribution can be developed. If the time between failures is an exponentially distributed random variable, the CDF is,
  • 23. 20/196
$$F(t) = \begin{cases} 1 - e^{-\lambda t}, & 0 \le t < \infty,\\ 0, & \text{otherwise.} \end{cases} \tag{2.18}$$
The number of failures in the time interval $[0, t]$ is a Poisson distributed random variable, and the time between failures has a probability density function of,
$$f(t) = \frac{d}{dt}F(t) = \begin{cases} \lambda e^{-\lambda t}, & t > 0,\\ 0, & \text{otherwise,} \end{cases} \tag{2.19}$$
where t denotes the time between failures.
Reliability Availability and Failure Density Functions
An expression for the reliability of a system can be developed using the following technique. The probability of a failure as a function of time is defined as,
$$P\{T \le t\} = F(t), \quad t \ge 0, \tag{2.20}$$
where T is a random variable denoting the failure time. $F(t)$ is a function defining the probability that the system will fail by time t. $F(t)$ is also the Cumulative Distribution Function (CDF) of the random variable T [Papo65]. The probability that the system will perform as intended at a certain time t is defined as the Reliability function and is defined as,
$$R(t) = 1 - F(t) = P\{T \ge t\}. \tag{2.21}$$
If the random variable T describing the time to failure has a probability density function $f(t)$ then using Eq. (2.21) the Reliability function is,
$$R(t) = 1 - F(t) = 1 - \int_{-\infty}^{t} f(x)\,dx = \int_{t}^{\infty} f(x)\,dx. \tag{2.22}$$
Assuming the time to failure random variable T has an exponential distribution its failure density defined by Eq. (2.19) is,
  • 24. 21/196
$$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0,\ \lambda \ge 0. \tag{2.23}$$
The resulting reliability function is then,
$$R(t) = \int_{t}^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t}. \tag{2.24}$$
A function describing the rate at which a system fails as a function of time is referred to as the Hazard function (Eq. (2.4)). Let T be a random variable representing the service life remaining for a specified system. Let $F(x)$ be the distribution function of T and let $f(x)$ be its probability density function. A new function $z(x)$ termed the Hazard Function or the Conditional Failure Function of T is given by
$$z(x) = \frac{f(x)}{1 - F(x)}.$$
The function $z(x)\,dx$ is the conditional probability that the item will fail between $x$ and $x + dx$ given it has survived a time T greater than x. For a given hazard function $z(x)$ the corresponding distribution function is
$$1 - F(x) = \left(1 - F(x_0)\right)\exp\left[-\int_{x_0}^{x} z(y)\,dy\right],$$
where $x_0$ is an arbitrary value of x. In a continuous time reliability model the hazard function is defined as the instantaneous failure rate of the system [Kapu77],
  • 25. 22/196
$$\begin{aligned}
z(t) &= \lim_{\Delta t \to 0}\frac{R(t) - R(t + \Delta t)}{\Delta t \cdot R(t)},\\
&= -\frac{1}{R(t)}\left[\frac{d}{dt}R(t)\right],\\
&= \frac{f(t)}{R(t)},\\
&= \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda.
\end{aligned} \tag{2.25}$$
The quantity $z(t)\,dt$ represents the probability that a system of age t will fail in the small interval of time $[t, t + dt]$. The hazard function is an important indicator of the change in the failure rate over the life of the system. For a system with an exponential failure rate, the hazard function is constant as shown in Eq. (2.25) and it is the only distribution that exhibits this property [Barl85]. Other reliability distributions will be shown in later chapters that have variable hazard rates. If a system contains no redundancy (that is, every component must function properly for the system to continue operation) and if component failures are statistically independent, the system reliability function is the product of the component reliabilities and follows an exponential probability distribution. The failure rate of such a system is the sum of the failure rates of the individual components,
$$R_{sys}(t) = \prod_{i=1}^{n} R_i(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = \exp\left[-\sum_{i=1}^{n}\lambda_i t\right]. \tag{2.26}$$
In most cases it is possible to repair or replace failed components and accurate models of system reliability will consider this. As will be shown the repair activity is not as easily modeled as the failure mechanisms.
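Eq. (2.26) can be illustrated with a short calculation showing that the product of component reliabilities equals a single exponential whose rate is the sum of the component rates. The component failure rates below are assumed values, not data from the thesis, and the function names are illustrative.

```python
import math


def series_reliability(t, rates):
    """Product form of Eq. (2.26): every component must survive to time t."""
    return math.prod(math.exp(-lam * t) for lam in rates)


def series_failure_rate(rates):
    """Equivalent system failure rate for a non-redundant system."""
    return sum(rates)


if __name__ == "__main__":
    # Assumed component failure rates, in failures per hour.
    rates = [20e-6, 35e-6, 5e-6]
    t = 10_000.0
    lam_sys = series_failure_rate(rates)
    print("product form :", series_reliability(t, rates))
    print("single rate  :", math.exp(-lam_sys * t))
    print("system MTTF  :", 1.0 / lam_sys, "hours")
```

Both forms give the same reliability, and the system failure rate of 60 failures per million hours is larger than that of any single component.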
  • 26. 23/196 For systems that can be repaired, a new measure of reliability can be defined: the probability that the system is operational at time “t.” This new measure is the Availability and is expressed as $A(t)$. Availability $A(t)$ differs from reliability $R(t)$ in that any number of system failures can occur prior to time t but the system is considered available if those failures have been repaired prior to time t. For systems that can be repaired, it is assumed that the behavior of the repaired system and the original system are identical from a failure standpoint. In general, this is not true, as perfect renewal of the system configuration is not possible. The terms Mean Time to First Failure and Mean Time to Second Failure now become relevant. Assuming a constant failure rate $\lambda$, a constant repair rate $\mu$, and identical failure behaviors between the repaired system and the original system, the steady–state system availability can be expressed as,
$$A_{SS} = \frac{\mu}{\lambda + \mu}. \tag{2.27}$$
The expression in Eq. (2.27) is an approximation. The exact expression for the availability with repair requires the solution of the appropriate Markov model, which will be developed in a later chapter.
Mean Time to Failure
The Mean Time to Failure (MTTF) is the expected time to the first failure in a population of identical systems, given a successful system startup at time $t = 0$. The Cumulative Distribution function $F(x)$ in Eq. (2.15) and the probability density function $f(x)$ in Eq. (2.16) characterize the behavior of the probability distribution function of the underlying random failure process. These expressions
  • 27. 24/196 are in a continuous integral form and require the solution of integral equations to produce a useable result. A concise parameter that describes the expected value of the random process is useful for comparison of different reliability models. This parameter is the Mean or Expected Value of the random variable denoted by $E[X]$ and is defined by [Parz60], [Dave70],
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx. \tag{2.28}$$
The expression in Eq. (2.28) denotes the expected value of the continuous function $f(x)$. It is important to note that this definition assumes $x f(x)$ is integrable in the interval $(-\infty, \infty)$. For an exponential probability density function of,
$$f(x) = \lambda e^{-\lambda x}, \quad x > 0, \tag{2.29}$$
the mean or expected value of the exponential function is given by,
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{0}^{\infty} \lambda x e^{-\lambda x}\,dx. \tag{2.30}$$
The evaluation of Eq. (2.30) can be done in a straightforward manner using the Gamma function [Arfk70], which is defined as,
$$\Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha - 1} e^{-x}\,dx, \quad \alpha > 0, \tag{2.31}$$
or alternately,
$$\Gamma(\alpha) = \lambda^{\alpha}\int_{0}^{\infty} x^{\alpha - 1} e^{-\lambda x}\,dx. \tag{2.32}$$
Rewriting the expression in Eq. (2.30) for the expected values as,
  • 28. 25/196
$$E[X] = \frac{1}{\lambda}\int_{0}^{\infty} u e^{-u}\,du, \tag{2.33}$$
where substituting the variables,
$$u = \lambda x \quad \text{and} \quad du = \lambda\,dx, \tag{2.34}$$
results in,
$$\begin{aligned}
E[X] &= \frac{1}{\lambda}\int_{0}^{\infty} u e^{-u}\,du,\\
&= \frac{1}{\lambda}\,\Gamma(2),\\
&= \frac{1}{\lambda},
\end{aligned} \tag{2.35}$$
which is the MTTF for a simple system. Although this expression is useful for simple systems, a general–purpose expression representing the MTTF is needed. This function can be developed in the following manner. Let X denote the lifetime of a system so that the reliability function is,
$$R(t) = P\{X > t\}, \tag{2.36}$$
and the derivative of the reliability function which is also given in Eq. (2.21) and Eq. (2.22) is again defined as,
$$\frac{d}{dt}R(t) = -f(t). \tag{2.37}$$
The expression for the expected value or MTTF using Eq. (2.28) is given by:
$$E[X] = \int_{0}^{\infty} t f(t)\,dt = -\int_{0}^{\infty} t\left(\frac{d}{dt}R(t)\right)dt. \tag{2.38}$$
  • 29. 26/196 The technique of integration by parts [Smai49], [Arfk70], shown in Eq. (2.39),
$$\int_{a}^{b} f(x)\left(\frac{d}{dx}g(x)\right)dx = f(x)\,g(x)\Big|_{a}^{b} - \int_{a}^{b} g(x)\left(\frac{d}{dx}f(x)\right)dx, \tag{2.39}$$
is used to evaluate Eq. (2.38). Integrating by parts gives the expected value as,
$$E[X] = -t\,R(t)\Big|_{0}^{\infty} + \int_{0}^{\infty} R(t)\,dt. \tag{2.40}$$
Since $R(t)$ approaches zero faster than t approaches infinity, Eq. (2.40) can be reduced to,
$$E[X] = \int_{0}^{\infty} R(t)\,dt = MTTF, \tag{2.41}$$
which is the expression for the Mean Time to Failure for a general system configuration. This direct relationship between MTTF and the system failure rate is one reason the constant failure rate assumption is often made when the supporting reliability data is scanty [Barl75]. Appendix G describes the analysis of the variance for this distribution. Using an exponential failure distribution implies two important behaviors for the system,
§ Since a used subsystem is stochastically as good as a new subsystem, a policy of scheduled replacement of used subsystems which are known to still be functioning does not increase the lifetime of the system.
§ In estimating the mean system life and reliability, data can be collected consisting only of the number of hours of observed life and the number of observed failures; the ages of the subsystems under observation are of no concern.
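Eq. (2.41) lends itself to a numerical check: integrating any reliability function over time should return its MTTF. The sketch below uses a plain trapezoidal rule with a truncated upper limit, both of which are implementation choices made here for illustration; the failure rate is an assumed value.

```python
import math


def mttf(reliability, upper, steps=200_000):
    """Trapezoidal estimate of Eq. (2.41), truncating the integral at 'upper'."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


if __name__ == "__main__":
    lam = 1e-4  # assumed failure rate, failures per hour

    def r_exp(t):
        return math.exp(-lam * t)

    # Truncating at 40 / lam leaves a tail of about exp(-40), which is negligible.
    print("numerical MTTF:", mttf(r_exp, upper=40.0 / lam))
    print("1 / lambda    :", 1.0 / lam)
```

The same function accepts any reliability curve, so it can be reused for configurations whose $R(t)$ has no convenient closed-form integral.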
  • 30. 27/196 Mean Time to Repair
The Mean Time to Repair (MTTR) is the expected time for the repair of a failed system or subsystem. For exponential distributions this is $MTTF = 1/\lambda$ and $MTTR = 1/\mu$. The steady state availability $A_{SS}$ defined in Eq. (2.27) can be rewritten in terms of these parameters,
$$A_{SS} = \frac{MTTF}{MTTR + MTTF}. \tag{2.42}$$
Mean Time Between Failure
The Mean Time Between Failure (MTBF) is often mistakenly used in place of Mean Time to Failure (MTTF). The MTBF is the mean time between failures in a system with repair, and is derived from a combination of repair and failure processes. The simplest approximation for MTBF is:
$$MTBF = MTTF + MTTR. \tag{2.43}$$
In this work, it is assumed $MTTR \ll MTTF$ so that MTTF is used in place of MTBF. The Mean Time to Failure is considered since in fault–tolerant systems Failure occurs only when the redundancy features of the system fail to function properly. In the presence of perfect coverage and perfect repair the system should operate continuously. Therefore, failure of the system implies total loss of system capabilities.
Mean Time to First Failure
The Mean Time to Failure is defined as the expected time of the first failure in a population of identical systems. This development depends on the assumption that the failure rate is constant, Eq. (2.25), the time to failure is exponentially distributed, Eq. (2.14), and the repair rate $\mu$ is constant. In the general case, these assumptions may not
  • 31. 28/196 be valid and the Mean Time to Failure (MTTF) is not equivalent to the Mean Time to First Failure (MTFF). By removing the exponential probability failure distribution restriction in Eq. (2.29) a generalized expression for the first failure time can be derived. Given a population of n subsystems each with a random variable $X_i,\ i = 1, 2, \ldots, n$ and a continuous pdf of $f(x)$, the failure time for the $n^{th}$ subsystem is given by summing all the failure times prior to the $n^{th}$ failure,
$$S_n = X_1 + X_2 + \cdots + X_n = \sum_{i=1}^{n} X_i. \tag{2.44}$$
If the random variables $\{X_1, X_2, \ldots, X_n\}$ are independent and identically distributed, all with pdf’s of $f(x)$, the random process described by these variables is referred to as an Ordinary Renewal Process [Cox62], [Ross70]. The details of the Renewal Process are shown in Appendix E. Given the random process described by Eq. (2.44) the distribution function of $S_n$ is provided by convolving each individual distribution function $F(t)$. The convolution of two functions is defined as [Brac65], [Papo65]:
$$f(x) \otimes g(x) \equiv \int_{-\infty}^{\infty} f(u)\,g(x - u)\,du. \tag{2.45}$$
The resulting convolution function for the n+1 subsystem failure is given by:
$$F^{(n+1)}(t) = \int_{0}^{t} F^{(n)}(t - x)\,f(x)\,dx. \tag{2.46}$$
In renewal processes, the number of renewals $N(t)$ is itself a random function of time and can be substituted in the reliability computations when:
  • 32. 29/196
$$N(t) = n \iff S_n \le t < S_{n+1}. \tag{2.47}$$
When the conditions in Eq. (2.47) are met, the probability of n renewals in a time interval is given by,
$$\begin{aligned}
P\{N(t) = n\} &= P\{S_n \le t < S_{n+1}\},\\
&= P\{S_n \le t\} - P\{S_{n+1} \le t\},\\
&= F^{(n)}(t) - F^{(n+1)}(t).
\end{aligned} \tag{2.48}$$
The renewal function $H(t)$ can be defined as the average number of subsystem failures and repairs as a function of time, and is given as,
$$H(t) = E[N(t)]. \tag{2.49}$$
Using Eq. (2.48) in the evaluation of Eq. (2.49) and Eq. (2.30) as the definition of the expectation value, gives the following for the renewal function,
$$\begin{aligned}
H(t) &= \sum_{n=0}^{\infty} n\,P\{N(t) = n\},\\
&= \sum_{n=0}^{\infty} n F^{(n)}(t) - \sum_{n=0}^{\infty} n F^{(n+1)}(t),\\
&= \sum_{n=0}^{\infty} n F^{(n)}(t) - \sum_{n=1}^{\infty} (n - 1) F^{(n)}(t).
\end{aligned} \tag{2.50}$$
Simplifying Eq. (2.50) results in an expression for the renewal function of,
$$H(t) = F(t) + \sum_{n=1}^{\infty} F^{(n+1)}(t). \tag{2.51}$$
The term $F^{(n+1)}$ is the convolution of $F^{(n)}$ and F which gives,
$$F^{(n+1)}(t) = \int_{0}^{t} F^{(n)}(t - x)\,f(x)\,dx, \tag{2.52}$$
which results in the expression for the renewal function of,
  • 33. 30/196
$$H(t) = F(t) + \sum_{n=1}^{\infty}\int_{0}^{t} F^{(n)}(t - x)\,f(x)\,dx. \tag{2.53}$$
Rearranging the integral term in Eq. (2.53) gives,
$$H(t) = F(t) + \int_{0}^{t}\left[\sum_{n=1}^{\infty} F^{(n)}(t - x)\right] f(x)\,dx. \tag{2.54}$$
The summation term in Eq. (2.54) is the renewal function for the $n^{th}$ failure, $H(t - x)$, giving,
$$H(t) = F(t) + \int_{0}^{t} H(t - x)\,f(x)\,dx. \tag{2.55}$$
Using Eq. (2.16), the renewal density function $h(t)$ is the derivative of the distribution function, giving,
$$h(t) = \frac{d}{dt}H(t). \tag{2.56}$$
Using Eq. (2.50) to evaluate the derivative results in,
$$h(t) = \sum_{n=1}^{\infty} f^{(n)}(t), \tag{2.57}$$
and using Eq. (2.54) as a substitute for the right–hand side of Eq. (2.57) results in,
$$h(t) = f(t) + \int_{0}^{t} h(t - x)\,f(x)\,dx. \tag{2.58}$$
Eq. (2.58) is known as the Renewal Equation [Ross70]. To solve the renewal equation, the Laplace transform will be used. The transform of the probability density function is,
  • 34. 31/196
$$\mathcal{L}\{f(s)\} = \int_{0}^{\infty} e^{-sx} f(x)\,dx, \tag{2.59}$$
and the transform of the renewal density is,
$$\mathcal{L}\{h(s)\} = \int_{0}^{\infty} e^{-sx} h(x)\,dx. \tag{2.60}$$
Using the convolution property of the Laplace transform [Brac65], an equation for the renewal distribution can be generated,
$$\mathcal{L}\{h(s)\} = \mathcal{L}\{f(s)\} + \mathcal{L}\{h(s)\}\,\mathcal{L}\{f(s)\}, \tag{2.61}$$
and simplified to,
$$\mathcal{L}\{h(s)\} = \frac{\mathcal{L}\{f(s)\}}{1 - \mathcal{L}\{f(s)\}}. \tag{2.62}$$
Eq. (2.62) is now the generalized expression for the failure distribution for a random process with an arbitrary probability distribution.
General Availability Analysis
The steady state system availability defined in Eq. (2.42) assumes an exponential distribution for the failure rate of the system or subsystems. An important activity in the analysis of Fault–Tolerant systems is the development of a general–purpose availability expression, independent of the underlying failure distribution. In the analysis that follows, it will be assumed that when a subsystem fails it is repaired and the system restored to its functioning state. It will also be assumed that the restored system functions as if it were new, that is with the failure probability function restarted at $t = 0$.
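As a hedged check on Eq. (2.62), the exponential density of Eq. (2.29) can be pushed through the transform. This worked step is not part of the original text; it relies only on the standard Laplace transform of an exponential and keeps the $\mathcal{L}\{\cdot\}$ notation used above.

```latex
\begin{aligned}
\mathcal{L}\{f(s)\} &= \int_{0}^{\infty} e^{-sx}\,\lambda e^{-\lambda x}\,dx = \frac{\lambda}{s + \lambda},\\
\mathcal{L}\{h(s)\} &= \frac{\lambda/(s + \lambda)}{1 - \lambda/(s + \lambda)} = \frac{\lambda}{s},\\
h(t) &= \lambda, \qquad H(t) = \int_{0}^{t} h(x)\,dx = \lambda t .
\end{aligned}
```

The renewal density is the constant $\lambda$ and the expected number of renewals grows linearly in time, which agrees with the mean of the Poisson distribution in Eq. (2.12).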
  • 35. 32/196 Let $T_i$ be the duration of the ith functioning period and let $D_i$ be the system downtime because of the failure of the system while the ith repair takes place. These durations will form the basis of the renewal process. By combining the subsystem failure interval and the subsystem repair duration, a random variable sequence is constructed such that,
$$X_i = T_i + D_i; \quad i = 1, 2, \ldots \tag{2.63}$$
It must be assumed that the durations of the functioning periods are identically distributed with a common Cumulative Distribution Function $W(t)$ and a common probability density function $w(t)$ and that the repair periods are also identically distributed with $G(t)$ and $g(t)$. Using these assumptions the terms in Eq. (2.63) are also identically distributed such that,
$$\{X_i;\ i = 1, 2, \ldots\} \tag{2.64}$$
meets the definition of a Renewal process developed in Eq. (2.44). Using this development an expression for the convolution of the two independent random processes is given by,
$$\mathcal{L}\{f(s)\} = \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}. \tag{2.65}$$
Using Eq. (2.62) gives,
$$\mathcal{L}\{h(s)\} = \frac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}. \tag{2.66}$$
The average number of repairs $M(t)$ in the time interval $(0, t]$ has the Laplace transform:
  • 36. 33/196
$$\mathcal{L}\{M(s)\} = \frac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{s\left[1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}\right]}. \tag{2.67}$$
Instantaneous Availability
The steady state availability defined in Eq. (2.42) can now be replaced with the instantaneous availability $A(t)$. In the absence of a repair mechanism the availability $A(t)$ is equivalent to the reliability $R(t)$ of the subsystem. The subsystem may be functioning at time t because of two mutually exclusive reasons,
§ The subsystem has not failed from the beginning.
§ The last renewal occurred at a time $x$ within the time period $(0, t)$ and the subsystem continued to function since that time.
The probability associated with the second case is the convolution of the reliability function and the renewal density, giving,
$$\int_{0}^{t} R(t - x)\,h(x)\,dx, \tag{2.68}$$
which results in an expression for the instantaneous availability of,
$$A(t) = R(t) + \int_{0}^{t} R(t - x)\,h(x)\,dx. \tag{2.69}$$
Taking the Laplace transform of both sides of Eq. (2.69) gives,
$$\begin{aligned}
\mathcal{L}\{A(s)\} &= \mathcal{L}\{R(s)\} + \mathcal{L}\{R(s)\}\,\mathcal{L}\{h(s)\},\\
&= \mathcal{L}\{R(s)\}\left[1 + \mathcal{L}\{h(s)\}\right],\\
&= \mathcal{L}\{R(s)\}\left[1 + \frac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}\right].
\end{aligned} \tag{2.70}$$
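Eq. (2.69) can be evaluated numerically once $R(t)$ and $h(t)$ are known. The sketch below assumes exponential failure and repair times. Under that assumption, inverting Eq. (2.66) gives $h(t) = \frac{\lambda\mu}{\lambda+\mu}\left(1 - e^{-(\lambda+\mu)t}\right)$, a step worked out here rather than taken from the text. The closed form used for comparison is the standard two–state result for a single repairable unit, also not stated in this section, and the rates are arbitrary illustration values.

```python
import math


def availability(t, lam, mu, steps=2000):
    """Evaluate Eq. (2.69) by the trapezoidal rule for exponential failure and repair."""
    def r(x):
        return math.exp(-lam * x)  # reliability, Eq. (2.3)

    def h(x):
        # Renewal density for the exponential case (derived for this sketch).
        return lam * mu / (lam + mu) * (1.0 - math.exp(-(lam + mu) * x))

    if t == 0.0:
        return 1.0
    dx = t / steps
    integral = 0.5 * (r(t) * h(0.0) + r(0.0) * h(t))
    for i in range(1, steps):
        x = i * dx
        integral += r(t - x) * h(x)
    return r(t) + integral * dx


def availability_closed_form(t, lam, mu):
    """Standard two-state result, used here only as a cross-check."""
    return mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)


if __name__ == "__main__":
    lam, mu = 1e-3, 1e-1  # assumed rates: MTTF = 1000 h, MTTR = 10 h
    for t in (0.0, 10.0, 50.0, 500.0):
        print(t, availability(t, lam, mu), availability_closed_form(t, lam, mu))
```

The two columns agree, and both settle toward $\mu/(\lambda+\mu)$ as $t$ grows, which is the steady–state value of Eq. (2.27).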
  • 37. 34/196 Since the reliability of the system is given as $R(t) = 1 - W(t)$,
$$\begin{aligned}
\mathcal{L}\{R(s)\} &= \frac{1}{s} - \mathcal{L}\{W(s)\},\\
&= \frac{1}{s} - \frac{\mathcal{L}\{w(s)\}}{s} = \frac{1 - \mathcal{L}\{w(s)\}}{s}.
\end{aligned} \tag{2.71}$$
Substituting gives,
$$\mathcal{L}\{A(s)\} = \frac{1 - \mathcal{L}\{w(s)\}}{s\left[1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}\right]}. \tag{2.72}$$
Given the failure–rate distribution and the repair–time distribution, Eq. (2.72) can be used to compute the instantaneous availability as a function of time.
Limiting Availability
An important question to ask is: what is the availability of the system after some long period of time? The limiting availability $A(t)$ as $t \to \infty$ is defined as A or simply the Availability. To derive an expression for the limiting availability the Final Value Theorem of the Laplace transform can be used [Doet61], [Widd46], [Brac65], [Ogat70], [Gupt66]. This theorem states that the steady state behavior of $f(t)$ is the same as the behavior of $sF(s)$ in the neighborhood of $s = 0$. Thus it is possible to obtain the value of $f(t)$ as $t \to \infty$. Let,
$$F(t) = \int_{0}^{t} f(x)\,dx + F(0^{-}), \tag{2.73}$$
then using a table of Laplace transforms [Doet61], [Brac65],
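  • 38. 35/196
$$s\,\mathcal{L}\{F(s)\} - F(0^{-}) = \mathcal{L}\{f(s)\} = \int_{0}^{\infty} e^{-st} f(t)\,dt, \tag{2.74}$$
and by letting $s \to 0$,
$$\begin{aligned}
\lim_{s \to 0} s\,\mathcal{L}\{F(s)\} &= \int_{0}^{\infty} f(t)\,dt + F(0^{-}),\\
&= \lim_{t \to \infty}\left[\int_{0}^{t} f(x)\,dx + F(0^{-})\right],\\
&= \lim_{t \to \infty} F(t).
\end{aligned} \tag{2.75}$$
The Limiting availability is then given as,
$$A = \lim_{t \to \infty} A(t) = \lim_{s \to 0} s\,\mathcal{L}\{A(s)\}. \tag{2.76}$$
For small values of s the following approximations can be made [Apos74],
$$e^{-st} \cong 1 - st, \tag{2.77}$$
giving,
$$\begin{aligned}
\mathcal{L}\{w(s)\} &= \int_{0}^{\infty} e^{-st} w(t)\,dt,\\
&\cong \int_{0}^{\infty} w(t)\,dt - s\int_{0}^{\infty} t\,w(t)\,dt,\\
&\cong 1 - \frac{s}{\lambda},
\end{aligned} \tag{2.78}$$
where $MTTF = 1/\lambda$ and,
$$\mathcal{L}\{g(s)\} \cong 1 - \frac{s}{\mu}, \tag{2.79}$$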
  • 39. 36/196 and where $MTTR = 1/\mu$, giving the limiting availability as,
$$A = \lim_{s \to 0} s\left[\frac{1 - \left(1 - \dfrac{s}{\lambda}\right)}{s\left(1 - \left(1 - \dfrac{s}{\lambda}\right)\left(1 - \dfrac{s}{\mu}\right)\right)}\right] = \frac{\dfrac{1}{\lambda}}{\dfrac{1}{\lambda} + \dfrac{1}{\mu}} = \frac{MTTF}{MTTF + MTTR}. \tag{2.80}$$
Eq. (2.80) is an important result in the analysis of system reliability, because it shows that the limiting availability depends only on the Mean Time to Failure and the Mean Time to Repair and not on the underlying distributions of the failure and repair times.
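The distribution independence claimed for Eq. (2.80) can be illustrated with a small simulation of alternating up and down periods. This is a sketch under stated assumptions: the long–run fraction of time spent operating is used as an estimate of the limiting availability, the two cases (exponential times, and uniform up times with a fixed repair time) are arbitrary choices sharing the same means, and the seed and cycle count are illustrative.

```python
import random


def simulate_availability(draw_up, draw_down, cycles, seed=1):
    """Long-run fraction of time up over alternating up/down periods."""
    rng = random.Random(seed)
    up = 0.0
    down = 0.0
    for _ in range(cycles):
        up += draw_up(rng)      # one functioning period
        down += draw_down(rng)  # one repair period
    return up / (up + down)


if __name__ == "__main__":
    mttf, mttr = 1000.0, 10.0  # assumed means, in hours
    predicted = mttf / (mttf + mttr)  # Eq. (2.80)

    exponential_case = simulate_availability(
        lambda r: r.expovariate(1.0 / mttf),
        lambda r: r.expovariate(1.0 / mttr),
        cycles=200_000)

    # Uniform up times on [0, 2 * MTTF] and a fixed repair time: same means.
    uniform_case = simulate_availability(
        lambda r: r.uniform(0.0, 2.0 * mttf),
        lambda r: mttr,
        cycles=200_000)

    print(predicted, exponential_case, uniform_case)
```

Both simulated values land close to the predicted ratio even though the underlying distributions differ, as Eq. (2.80) suggests.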
  • 40. 37/196 Chapter 3 SYSTEM RELIABILITY
This chapter provides the basis for the computation of the overall system reliability given a redundant architecture with partial fault detection coverage. Redundant systems can be modeled under a variety of operational assumptions. Of most interest in this work are dual and triple redundant systems that contain repair facilities.
Series Systems
Creating a reliable system often involves a series or parallel combination of independent systems or subsystems. If $R_i(t)$ is the reliability of module i and all the modules are statistically independent, then the overall system reliability of modules connected in series is,
$$R_{series}(t) = \prod_{i} R_i(t). \tag{3.1}$$
For a series system the failure probability $F_{series}$ is given by,
$$\begin{aligned}
F_{series}(t) &= 1 - R_{series}(t) = 1 - \prod_{i=1}^{n} R_i(t),\\
&= 1 - \prod_{i=1}^{n}\left(1 - F_i(t)\right).
\end{aligned} \tag{3.2}$$
Expanding Eq. (3.1) will illustrate an aspect of the exponential distribution. For a system of n subsystems connected in series the reliability of the system is given by Eq. (3.1). If a general purpose hazard function is used for the failure rate [Shoo68] defined by,
$$h_i(t) = \lambda_i + c_i t^{k}, \tag{3.3}$$
  • 41. 38/196 where $\lambda_i$, $c_i$, and k are constants, then the reliability function for the individual subsystem is given by,
$$R_i(t) = \exp\left[-\left(\lambda_i t + c_i\frac{t^{k+1}}{k+1}\right)\right], \tag{3.4}$$
and the reliability function for the system is given by,
$$R_{series}(t) = \exp\left[-\left(\sum_{i=1}^{n}\lambda_i t + \sum_{i=1}^{n} c_i\frac{t^{k+1}}{k+1}\right)\right]. \tag{3.5}$$
Defining two new terms for the summation of the failure rate and a new term for the time constant adjustment gives, $\lambda^{*} = \sum_{i=1}^{n}\lambda_i$, $c^{*} = \sum_{i=1}^{n} c_i$, and $T = \lambda^{*} t$, which results in the series reliability expression of,
$$R_{series}(t) = \exp\left[-\left(T + \left(\frac{c^{*}}{\lambda^{*}(k+1)}\right)\left(\frac{T^{k+1}}{(\lambda^{*})^{k}}\right)\right)\right]. \tag{3.6}$$
As the number of subsystems grows large $(\lambda^{*} \to \infty)$, the term $c^{*}/\left((k+1)\lambda^{*}\right)$ is bounded and the expression for the system reliability becomes,
$$\lim_{n \to \infty} R_{series}(t) = e^{-T} = e^{-\lambda^{*} t}. \tag{3.7}$$
Eq. (3.7) defines the failure distribution of the system as the number of subsystems grows without bound. This implies that a large complex system will tend to follow exponential distribution failure models regardless of the internal organization of the subsystems.
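The limit in Eq. (3.7) can be observed numerically by evaluating Eq. (3.5) at a fixed normalized time $T = \lambda^{*} t$ while the number of subsystems grows. The sketch assumes n identical subsystems, so that $\lambda^{*} = n\lambda$ and $c^{*} = nc$; the values of $\lambda$, $c$, and $k$ are illustrative assumptions.

```python
import math


def series_reliability_normalized(T, n, lam, c, k):
    """Eq. (3.5) for n identical subsystems, evaluated at normalized time T."""
    lam_star = n * lam   # summed constant failure rates
    c_star = n * c       # summed time-dependent coefficients
    t = T / lam_star     # real time corresponding to normalized time T
    return math.exp(-(lam_star * t + c_star * t ** (k + 1) / (k + 1)))


if __name__ == "__main__":
    lam, c, k = 1e-4, 1e-8, 1  # assumed subsystem parameters
    T = 1.0
    for n in (1, 10, 100, 1000):
        print(n, series_reliability_normalized(T, n, lam, c, k))
    print("exp(-T):", math.exp(-T))
```

With a single subsystem the time–dependent term is significant, but it fades as n increases and the result approaches $e^{-T}$.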
  • 42. 39/196 Parallel Systems

In a parallel redundant configuration, the system fails only if all modules fail. The probability of a system failure in a parallel system is given by,

$$F_{parallel}(t) = \prod_{i=1}^{n} F_i(t). \tag{3.8}$$

The system reliability for a parallel system is given by,

$$\begin{aligned} R_{parallel}(t) &= 1 - F_{parallel}(t) = 1 - \prod_{i=1}^{n} F_i(t), \\ &= 1 - \prod_{i=1}^{n} \left(1 - R_i(t)\right). \end{aligned} \tag{3.9}$$

M–of–N Systems

An M–of–N system is a generalized form of the parallel system. Instead of requiring only one of the N modules of the system to remain functional, M modules are required. The system of interest in this work is a Triple Modular Redundant (TMR) configuration in which two of the three modules must function for the system to operate properly [Lyons 62], [Kuehn 69]. [3] For a given module reliability of $R_m$ the TMR reliability is given by,

$$R_{tmr} = R_m^{3} + \binom{3}{2} R_m^{2} \left(1 - R_m\right). \tag{3.10}$$

[3] In practical TMR systems, a simplex mode is allowed, which usually places the system in a shutdown mode, allowing the controlled process to be safely stopped.

In Eq. (3.10) all working states are enumerated. The term $R_m^{3}$ represents the state in which all three modules are functional. The term $\binom{3}{2} R_m^{2}(1 - R_m)$
  • 43. 40/196 represents the three states in which exactly one module has failed and the other two modules are functional.

Selecting the Proper Evaluation Parameters

In comparing different redundant system configurations, it is desirable to summarize their reliability by a single parameter. The reliability may be an arbitrarily complex function of time. The selection of the wrong summary parameter could lead to incorrect conclusions, as will be shown below. Consider a simplex system, with a reliability function of,

$$R_{simplex}(t) = e^{-\lambda t}, \tag{3.11}$$

and using Eq. (2.41) to derive the Mean Time to Failure results in,

$$MTTF_{simplex} = \frac{1}{\lambda}. \tag{3.12}$$

For a TMR system with an exponential reliability function,

$$\begin{aligned} R_{tmr}(t) &= e^{-3\lambda t} + \binom{3}{2} e^{-2\lambda t}\left(1 - e^{-\lambda t}\right), \\ &= 3e^{-2\lambda t} - 2e^{-3\lambda t}, \end{aligned} \tag{3.13}$$

and using Eq. (2.40) results in a Mean Time to Failure of,

$$MTTF_{tmr} = \frac{3}{2\lambda} - \frac{2}{3\lambda}. \tag{3.14}$$

Comparing the simplex and TMR reliability expressions gives,

$$MTTF_{tmr} = \frac{5}{6\lambda} \le \frac{1}{\lambda} = MTTF_{simplex}. \tag{3.15}$$

Judged by the MTTF figure of merit alone, the TMR system appears less reliable than the simplex system. The above equations do not include the facility
  • 44. 41/196 for module repair. Once the TMR system has exhausted its redundancy, there is more hardware left to fail than in the non–redundant system. This effect lowers the total system reliability. With online repair, the MTTF figure of merit for the TMR system becomes an important measure of the overall system reliability. These results illustrate why simplistic assumptions and calculations may result in erroneous information.
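The contrast between the two figures of merit can be made concrete with a short calculation. The sketch below uses the closed forms of Eqs. (3.11) through (3.15) with a hypothetical failure rate `lam`. Setting the two reliability functions equal gives a crossover at $\lambda t = \ln 2$, so the TMR system is the more reliable one for short missions even though its MTTF is lower.

```python
import math

def r_simplex(t, lam):
    """Eq. (3.11): reliability of a single module."""
    return math.exp(-lam * t)

def r_tmr(t, lam):
    """Eq. (3.13): 2-of-3 reliability without repair."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)

def mttf_simplex(lam):
    """Eq. (3.12)."""
    return 1 / lam

def mttf_tmr(lam):
    """Eq. (3.14)."""
    return 3 / (2 * lam) - 2 / (3 * lam)

lam = 1e-4                       # hypothetical failure rate, per hour
crossover = math.log(2) / lam    # R_tmr(t) = R_simplex(t) when lam * t = ln 2
short_mission = 0.1 / lam        # well before the crossover
long_mission = 2.0 / lam         # well after the crossover

print("MTTF simplex:", mttf_simplex(lam), " MTTF TMR:", mttf_tmr(lam))
print("short mission:", r_simplex(short_mission, lam), r_tmr(short_mission, lam))
print("long mission: ", r_simplex(long_mission, lam), r_tmr(long_mission, lam))
```

The mission lengths are arbitrary illustrations; only their position relative to the crossover matters.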
  • 45. 42/196 Chapter 4

IMPERFECT FAULT COVERAGE AND RELIABILITY

Reliability models of systems with dynamic redundancy usually depend on perfect fault detection [Arno73], [Stif80]. The ability of the system to detect faults that occur can be classified as [Geis84],

- Covered – faults that are detected. The probability that a fault belongs to this class is given by $c$.
- Uncovered – faults that are not detected. The probability that a fault belongs to this class is given by $(1 - c)$.

The underlying diagnostic firmware and hardware may not provide perfect coverage for many reasons, primarily due to the complexity of the system under diagnosis [Rous79], [Cona72], [Wood79], [Soma86]. Because of this built–in complexity, an exhaustively tested set of diagnostics may not be possible. Another factor affecting the diagnostic coverage is the presence of intermittent faults [Dahb82], [Mall78]. The detection and analysis of these intermittent or permanent faults is further complicated by the presence of transient faults, which behave as real faults but are only present in the system for a short time [Glas82], [Sosn86]. Modeling a fault–tolerant system in the presence of imperfect fault coverage becomes an important aspect in predicting the overall system reliability.

Redundant System with Imperfect Coverage

Before developing the Markov method of analyzing Fault–Tolerant systems, a conditional probability method will be used to derive the MTTF and MTBF for a redundant system with imperfect fault detection [Bour69]. Assume that each subsystem of the redundant system fails independently at a constant rate $\lambda$. Let X denote the lifetime of a system with two modules, one active and the other in standby mode. Assume that the module in the standby
  • 46. 43/196 mode does not experience a fault during the mission time interval. [4] Let Y be a random variable where, Y = 0 if a fault is not covered, and Y = 1 if a fault is covered, then,

$$P\{Y = 0\} = 1 - c$$

and

$$P\{Y = 1\} = c.$$

To compute the MTTF of this system, the conditional expectation value of the system lifetime X given the fault coverage state Y must be derived. If an uncovered fault occurs the MTTF of the system is the MTTF of the initially active module,

$$E[X \mid Y = 0] = \frac{1}{\lambda}. \tag{4.1}$$

If a covered fault occurs the MTTF of the system is the sum of the MTTF of the active module and the MTTF of the inactive module,

$$E[X \mid Y = 1] = \frac{2}{\lambda}. \tag{4.2}$$

The total expectation value of the system lifetime is then given by,

$$E[X] = MTTF = \frac{1 - c}{\lambda} + \frac{2c}{\lambda} = \frac{1 + c}{\lambda}. \tag{4.3}$$

The computation of the system reliability depends on the combination of the two independent exponential distribution functions when a covered fault occurs,

$$f(x = t \mid y = 1) = \lambda^{2} t e^{-\lambda t}, \tag{4.4}$$

and when an uncovered fault occurs

$$f(x = t \mid y = 0) = \lambda e^{-\lambda t}. \tag{4.5}$$

[4] This is an invalid assumption in a practical sense, but it greatly simplifies this example.

The joint density function for both conditions is given by,
  • 47. 44/196

$$\begin{aligned} f(t, y) &= f(X = t \mid y) \cdot P\{y\}, \\ f(t, y) &= \lambda (1 - c) e^{-\lambda t}; \quad t > 0,\ y = 0, \\ f(t, y) &= \lambda^{2} c\, t e^{-\lambda t}; \quad t > 0,\ y = 1, \end{aligned} \tag{4.6}$$

and the marginal density function of X is computed by summing over the joint density function,

$$f(t) = \lambda^{2} c\, t e^{-\lambda t} + \lambda (1 - c) e^{-\lambda t}. \tag{4.7}$$

The system reliability as a function of the coverage is then given by integrating the marginal density function in Eq. (4.7) to give,

$$\begin{aligned} R(t) &= 1 - \int_{0}^{t} f(x)\,dx = \int_{t}^{\infty} f(x)\,dx, \\ &= \int_{t}^{\infty} \left[\lambda^{2} c\, x e^{-\lambda x} + \lambda (1 - c) e^{-\lambda x}\right] dx, \\ &= (1 + c \lambda t)\, e^{-\lambda t}. \end{aligned} \tag{4.8}$$

Generalized Imperfect Coverage

In the previous example, the system consisted of two modules, one in the active state and one in the standby state. The conditional probability that a fault will go undetected (uncovered) was computed using the conditional probability that the system will survive for a specified period. Cox [Cox55] analyzed the general case of a stage–type conditional probability distribution.

The principle on which the method of stages is based is the memoryless property of the exponential distribution of Eq. (2.1) [Klie75]. The lack of memory is defined by the fact that the distribution of the time remaining for an exponentially distributed random variable is independent of the current age of the random variable, that is, the variable is memoryless. Appendix D develops further the memoryless property of random variables with exponential distributions.
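The reliability function and the MTTF of the two–module example can be cross-checked against each other, since the MTTF is the integral of $R(t)$ over all time. The sketch below is an illustration only: the failure rate and coverage values are hypothetical, and the trapezoidal rule and the truncation horizon are choices made for this example.

```python
import math

def reliability(t, lam, c):
    """Eq. (4.8): R(t) = (1 + c*lam*t) * exp(-lam*t)."""
    return (1 + c * lam * t) * math.exp(-lam * t)

def mttf_closed_form(lam, c):
    """Eq. (4.3): MTTF = (1 + c) / lam."""
    return (1 + c) / lam

def mttf_numeric(lam, c, steps=200_000, horizon_in_lifetimes=60.0):
    """Trapezoidal integral of R(t) from 0 to horizon_in_lifetimes / lam.

    The horizon is an assumption: beyond 60 mean module lifetimes the
    integrand is negligible in double precision.
    """
    upper = horizon_in_lifetimes / lam
    h = upper / steps
    total = 0.5 * (reliability(0.0, lam, c) + reliability(upper, lam, c))
    for i in range(1, steps):
        total += reliability(i * h, lam, c)
    return total * h

lam = 1e-3  # hypothetical failure rate, per hour
for c in (0.0, 0.9, 1.0):
    print(c, mttf_closed_form(lam, c), mttf_numeric(lam, c))
```

With $c = 0$ the standby module never contributes and the result reduces to the simplex value $1/\lambda$; with $c = 1$ it reaches the ideal standby value $2/\lambda$.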
  • 48. 45/196 In the generalized model, it is assumed that individual modules are always in one of two states – working or failed. It is also assumed that the modules are statistically independent and module repair can take place while the remainder of the system continues to function.

In the general case of N active and S standby modules, the lifetime of the system is defined by a stage–type distribution. An active module has an exponential failure distribution with a constant failure rate $\lambda$. Assume that the modules in the standby state can fail at a rate $\mu$ (presuming $0 \le \mu \le \lambda$). Let $X_i$ $(1 \le i \le N)$ be a random variable denoting the lifetime of the active modules and let $Y_j$ $(1 \le j \le S)$ be a random variable denoting the lifetime of the standby modules. The system lifetime L is then,

$$\begin{aligned} L(m, N, S) &= \min\{X_1, X_2, \ldots, X_N;\ Y_1, Y_2, \ldots, Y_S\} + L(m, N, S - 1), \\ &= W(N, S) + L(m, N, S - 1), \end{aligned} \tag{4.9}$$

where $W(N, S)$ is the time to first failure among the $N + S$ modules. After the removal of the failed module, the system has N active modules and $S - 1$ standby modules. By the memoryless property of the exponential distribution, the remaining $N + S - 1$ modules have not aged, and therefore the system lifetime is,

$$L(m, N, S) = L(m, N, 0) + \sum_{i=1}^{S} W(N, i). \tag{4.10}$$

Here $L(m, N, 0)$ is the lifetime of the m–out–of–N system and is therefore a $k^{th}$ order statistic with $k = N - m + 1$ [Kend77]. The distribution of $L(m, N, 0)$ is an $(N - m + 1)$–phase Hypoexponential distribution with parameters $N\lambda, (N - 1)\lambda, \ldots, m\lambda$. The time to first failure $W(N, i)$ has an exponential distribution with the parameter $N\lambda + i\mu$.
  • 49. 46/196 Using Theorem D.1 in Appendix D, the distribution $L(m, N, S)$ has a $(N + S - m + 1)$–stage Hypoexponential distribution [Koba78], [Cox55], [Ash70] with parameters $N\lambda + S\mu,\ N\lambda + (S - 1)\mu,\ \ldots,\ N\lambda + \mu,\ N\lambda,\ (N - 1)\lambda,\ \ldots,\ m\lambda$. Let $R_{[m,N,S]}(t)$ denote the reliability of such a system, then the reliability function is defined as,

$$R_{[m,N,S]}(t) = \sum_{i=1}^{S} a_i e^{-(N\lambda + i\mu)t} + \sum_{i=m}^{N} b_i e^{-i\lambda t}, \tag{4.11}$$

where,

$$a_i = \prod_{\substack{j=1 \\ j \ne i}}^{S} \frac{N\lambda + j\mu}{j\mu - i\mu} \prod_{j=m}^{N} \frac{j\lambda}{j\lambda - N\lambda - i\mu}, \tag{4.12}$$

and,

$$b_i = \prod_{j=1}^{S} \frac{N\lambda + j\mu}{(N - i)\lambda + j\mu} \prod_{\substack{j=m \\ j \ne i}}^{N} \frac{j\lambda}{j\lambda - i\lambda}. \tag{4.13}$$

Defining the constant $K = \lambda / \mu$ gives a new expression for the active and standby terms in the reliability equation Eq. (4.11) of,
  • 50. 47/196

$$\begin{aligned} a_i &= \prod_{\substack{j=1 \\ j \ne i}}^{S} \frac{NK + j}{j - i} \prod_{j=m}^{N} \frac{jK}{jK - NK - i}, \\ &= (-1)^{N-m+i}\, \frac{(NK + S)!}{(NK)!\,(NK + i)\,(i - 1)!\,(S - i)!} \cdot \frac{N!\,\left(\frac{i}{K} - 1\right)!}{(m - 1)!\,\left(\frac{i}{K} + N - m\right)!}, \\ &= (-1)^{N-m+i}\, \frac{mK \dbinom{NK + S}{S} \dbinom{S}{i} \dbinom{N}{m}}{(NK + i) \dbinom{\frac{i}{K} + N - m}{N - m}}. \end{aligned} \tag{4.14}$$

Here factorials and binomial coefficients of non–integer arguments are understood through the gamma function, $x! = \Gamma(x + 1)$. A similar expression can be developed for $b_i$,

$$\begin{aligned} b_i &= \prod_{j=1}^{S} \frac{NK + j}{(N - i)K + j} \prod_{\substack{j=m \\ j \ne i}}^{N} \frac{j}{j - i}, \\ &= (-1)^{i-m}\, \frac{(NK + S)!\,\left[(N - i)K\right]!}{(NK)!\,\left[(N - i)K + S\right]!} \cdot \frac{N!}{(m - 1)!\; i\; (i - m)!\,(N - i)!}, \\ &= (-1)^{i-m}\, \frac{m \dbinom{NK + S}{S} \dbinom{N}{i} \dbinom{i}{m}}{i \dbinom{(N - i)K + S}{S}}. \end{aligned} \tag{4.15}$$

An expectation value of the reliability function derived from a general stage–type distribution can be found using the Laplace transform [Cox55]. The Laplace transform of a stage–type random variable X is,
  • 51. 48/196

$$L_X(s) = \gamma_1 + \sum_{i=1}^{r} \beta_1 \beta_2 \cdots \beta_i\, \gamma_{i+1} \prod_{j=1}^{i} \frac{\mu_j}{s + \mu_j}, \tag{4.16}$$

where $\gamma_i + \beta_i = 1$ for $1 \le i \le r$ and $\gamma_{r+1} = 1$. Defining the Laplace transform of the system described in Eq. (4.9) gives,

$$\begin{aligned} L_X(s) ={}& \sum_{i=1}^{S} c^{\,i-1} (1 - c) \prod_{j=1}^{i} \frac{N\lambda + (S - j + 1)\mu}{s + N\lambda + (S - j + 1)\mu} \\ &+ c^{S} \prod_{j=1}^{S} \frac{N\lambda + j\mu}{s + N\lambda + j\mu} \prod_{j=1}^{N-m+1} \frac{(N - j + 1)\lambda}{s + (N - j + 1)\lambda}. \end{aligned} \tag{4.17}$$

From the transform in Eq. (4.17), using $E[X] = -\,dL_X(s)/ds$ evaluated at $s = 0$, an expression for the MTTF with imperfect coverage can be given as,

$$E[X] = (1 - c) \sum_{i=1}^{S} c^{\,i-1} \sum_{j=S-i+1}^{S} \frac{1}{N\lambda + j\mu} + c^{S} \left\{ \sum_{j=1}^{S} \frac{1}{N\lambda + j\mu} + \sum_{j=m}^{N} \frac{1}{j\lambda} \right\}. \tag{4.18}$$

The above development is described in more detail in [Ing76], [Chan72], [King69], [Saat65], [Math70], [Triv82].

In the example described above, the system does not provide for repair. When repairable systems are analyzed in this manner, the number of stages becomes infinite. To deal with the infinite number of conditional probabilities a different technique must be employed. The Markov Chain is just such a technique, capable of dealing with a system configuration of many modules, each with repairability.

An additional caution should be noted. The assumption of statistical independence is questionable in the case of stage–type failure distributions. In addition, the fixed probability distribution associated with each failure in the stage–type model should be removed in the detailed analysis [Rams76].
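The stage structure behind the MTTF expression can be evaluated directly. The sketch below rests on one reading of the model, which is an assumption rather than something confirmed by the thesis: each of the S failures that occur while spares remain is covered with probability `c`, an uncovered failure ends the system lifetime at once, and once the spares are exhausted the m–of–N group runs down without a coverage factor. The function name and arguments are hypothetical.

```python
def mttf_standby_coverage(m, n_active, spares, lam, mu, c):
    """Mean lifetime of an m-of-N system with S spares and coverage c.

    lam: failure rate of an active module
    mu:  failure rate of a standby module
    c:   probability that a failure in the spare phase is covered
    """
    n, s = n_active, spares
    total = 0.0
    # Lifetime ends at the i-th spare-phase failure because it is uncovered;
    # the first i - 1 failures were covered (probability c**(i - 1)).
    for i in range(1, s + 1):
        time_in_first_i_stages = sum(
            1.0 / (n * lam + j * mu) for j in range(s - i + 1, s + 1)
        )
        total += (1 - c) * c ** (i - 1) * time_in_first_i_stages
    # All S spare-phase failures were covered; the m-of-N group then runs down.
    spare_phase = sum(1.0 / (n * lam + j * mu) for j in range(1, s + 1))
    active_phase = sum(1.0 / (j * lam) for j in range(m, n + 1))
    total += c ** s * (spare_phase + active_phase)
    return total

lam = 1e-3  # hypothetical failure rate, per hour
print(mttf_standby_coverage(1, 1, 1, lam, 0.0, 0.9))  # one active, one cold spare
print(mttf_standby_coverage(2, 3, 0, lam, 0.0, 0.9))  # TMR, no spares
```

Two special cases serve as sanity checks: one active module with one non-failing spare reproduces $(1 + c)/\lambda$ from Eq. (4.3), and a 2–of–3 system with no spares reproduces $5/(6\lambda)$ from Eq. (3.15).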
  • 52. 49/196 Chapter 5

MARKOV MODELS OF FAULT–TOLERANT SYSTEMS

A generalized modeling technique is required to deal with an arbitrary number of modules, failure events, and repair events in the analysis of Fault–Tolerant systems [Boss82]. Several techniques are available, including Petri Nets [Duga84], [Duga85], Fault Tree Analysis [Fuss76], Failure Mode and Effects Analysis [Mil1629], [Jame74], Event Tree Analysis [Gree82], and Hazard and Operability Studies [Lee80], [Robi78], [Smit85].

When system components are not independent, a state based analysis technique is needed which includes redundancy and repair [Biro86], [Guid86]. A Continuous Parameter Markov Chain is a method used to analyze systems that have state transitions that include repair processes [Hoel72], [Kend50], [Kend53].

A Markov Process is a stochastic process whose dynamic behavior is such that the probability distributions for its future behavior depend only on the present state and not on how the process arrived in that state [Mark07], [Fell67], [Issa76], [Chun76], [Kulk84].

To illustrate the principles of a Markov process, consider a system S described in Figure 3, which is changing over time in such a way that its state at any instant in time can be described in terms of a finite dimensional vector $X(t)$ [Triv74], [Triv75a], [Triv75]. Assume that the state of the system at any time $t$, for $t > v$, can be described by a predetermined function of the state at the starting time $v$ and of the ending time $t$:

$$X(t) = G\left[X(v), t\right]. \tag{5.1}$$

Given a set of reasonable starting conditions and the continuity of the function G, a differential equation for $X(t)$ describing the rate at which transitions between
  • 53. 50/196 each state of the system take place can be derived by expanding both sides of Eq. (5.1) in powers of t to give,

$$\frac{dX}{dt} = H\left[X(t)\right]. \tag{5.2}$$

Finite–dimensional deterministic systems described by the set of state vectors are equivalent to systems described by sets of ordinary differential equations [Bell60], [Brau67], [Beiz78], [Brue80]. This property will serve as the basis for analysis of fault–tolerant systems that include repair.

It will be assumed that the system described by the set of differential equations in Eq. (5.2) can exist in only one of a finite number of states [Keme60], [Koba78]. The transition from state i to state j in this system takes place with some random probability defined by,

$$p_{ij}(v, t) = P\{X(t) = j \mid X(v) = i\}, \quad t \ge v;\ i, j \in S. \tag{5.3}$$

Eq. (5.3) is the conditional pdf of the system of state transitions and satisfies the relation,

$$\sum_{j \in S} p_{ij}(v, t) = 1, \quad \forall i \in S;\ 0 \le v \le t. \tag{5.4}$$

The unconditional pdf of the state transition vector $X(t)$ is given by,

$$p_j(t) = P\{X(t) = j\}, \quad j = 1, 2, 3, \ldots \tag{5.5}$$

with,

$$\sum_{j \in S} p_j(t) = 1, \quad \forall t > 0, \tag{5.6}$$

since the process at any time t must be in a unique state. An Absorbing Markov Process is one in which transitions have the following properties [Gave73],
  • 54. 51/196

- There is at least one absorbing state,
- From every state, it is possible to get to the absorbing state.

Figure 3 – State Transition probabilities as a function of time in the Continuous–Time Markov chain that is subject to the constraints of the Chapman–Kolmogorov equation.

The fundamental assumption of the Markov model is that the probability of a given state transition depends only on the current state of the system and not on any previous state. For continuous–time Markov processes, that is, those described by ordinary differential equations, the length of time already spent in the current state does not influence either the probability distribution of the next state or the probability distribution of the remaining time in the same state before the next transition.

The Markov model fits with the standard assumption of the reliability models developed so far in this work, that the failure rates are constant, leading to an exponentially distributed state transition time for failures and a Poisson distribution for the occurrence of these failures.
  • 55. 52/196 Solving the Markov Matrix

In order to describe a continuous–time Markov process using transition matrices, it is necessary to specify the entire family of stochastic matrices, $\{P(t)\}$. Only those matrices that meet certain conditions are useful in finding the solution to the final absorption state rate of the system described by the Markov Chain [Cour77].

Initial value problems involving systems of equations may be solved using the Laplace transform. The advantage of this technique over traditional methods (Elimination, Eigenvalue solutions, and Fundamental Matrix [Pipe63], [Cour43]) is that the initial values are satisfied automatically. No special techniques are needed to handle particular cases of the fundamental matrix, such as repeated eigenvalues [Lome88].

Chapman–Kolmogorov Equations

A set of differential equations describing the transitions between each state can be derived if the following conditions are met by the transition probability matrix [Bhar60], [Parz62], [Howa71]. These equations are the Chapman–Kolmogorov Equations and are defined as the transition probabilities of the Markov chain that satisfy Eq. (5.7) for all i and j, using Figure 3 as an example,

$$p_{ij}(v, t) = \sum_{k} p_{ik}(v, u) \cdot p_{kj}(u, t). \tag{5.7}$$

A simplified notation for the matrix elements defined in Eq. (5.7) can be created where the elements of each matrix are given by,

$$\mathbf{H}(v, t) = \mathbf{H}(v, u)\, \mathbf{H}(u, t), \quad v \le u \le t, \tag{5.8}$$

and where,

$$\mathbf{H}(t, t) = \mathbf{I}, \tag{5.9}$$
  • 56. 53/196 is the identity matrix. The Forward Chapman–Kolmogorov Equation is now defined as,

$$\frac{\partial}{\partial t} \mathbf{H}(v, t) = \mathbf{H}(v, t)\, \mathbf{Q}(t), \quad v \le t, \tag{5.10}$$

where the new matrix $\mathbf{Q}(t)$ is defined as,

$$\mathbf{Q}(t) = \lim_{\Delta t \to 0} \frac{\mathbf{P}(t) - \mathbf{I}}{\Delta t}, \tag{5.11}$$

with,

$$\Delta t = t - v. \tag{5.12}$$

The matrix $\mathbf{Q}(t)$ is now defined as the transition rate matrix [Papo65a]. The elements of $\mathbf{Q}(t)$ are $q_{ij}(t)$ and are defined by,

$$q_{ii}(t) = \lim_{\Delta t \to 0} \frac{p_{ii}(t, t + \Delta t) - 1}{\Delta t}, \tag{5.13}$$

and

$$q_{ij}(t) = \lim_{\Delta t \to 0} \frac{p_{ij}(t, t + \Delta t)}{\Delta t}, \quad i \ne j. \tag{5.14}$$

If the system at time t is in state i, then the probability that a transition occurs to any state other than state i during the time interval $[t, t + \Delta t]$ is given by,

$$-q_{ii}(t)\,\Delta t + o(\Delta t), \tag{5.15}$$

where $o(h)$ is any function of h that approaches zero faster than h, that is

$$\lim_{h \to 0} \frac{o(h)}{h} = 0.$$

The quantity $-q_{ii}(t)$ of Eq. (5.13) is the rate at which the process departs state i when it is in state i.
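The limits that define the transition rates can be illustrated on the smallest possible chain. The sketch below assumes a hypothetical time-homogeneous two–state process that is not one of the models in this work: state 0 is "up" and fails at rate `lam`, state 1 is "down" and is repaired at rate `mu`. Its transition probabilities have a well-known closed form, so the finite-difference quotients can be compared with the rates directly, and the Chapman–Kolmogorov product rule can be checked as well.

```python
import math

def transition_matrix(t, lam, mu):
    """Closed-form P(t) for a hypothetical 2-state chain: 0 = up, 1 = down."""
    total = lam + mu
    decay = math.exp(-total * t)
    p00 = mu / total + lam / total * decay
    p11 = lam / total + mu / total * decay
    return [[p00, 1 - p00], [1 - p11, p11]]

def rate_estimate(dt, lam, mu):
    """Finite-difference quotients (p_ij(dt) - delta_ij) / dt for small dt."""
    p = transition_matrix(dt, lam, mu)
    return [[(p[i][j] - (1 if i == j else 0)) / dt for j in range(2)]
            for i in range(2)]

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

lam, mu = 2.0, 5.0                 # illustrative rates
q = rate_estimate(1e-8, lam, mu)   # approaches [[-lam, lam], [mu, -mu]]
lhs = transition_matrix(0.7, lam, mu)
rhs = matmul(transition_matrix(0.3, lam, mu), transition_matrix(0.4, lam, mu))
print(q)
print(lhs, rhs)
```

Each row of the estimated rate matrix sums to zero, because each row of the transition probability matrix sums to one.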
  • 57. 54/196 Similarly, given that the system is in state i at time t, the conditional probability that it will make a transition from state i to state j in the time interval $[t, t + \Delta t]$ is given by,

$$q_{ij}(t)\,\Delta t + o(\Delta t). \tag{5.16}$$

Eq. (5.14) is the rate at which the process moves from state i to state j given that the system is in state i. Since,

$$\sum_{j} p_{ij}(v, t) = 1, \tag{5.17}$$

Eq. (5.13) and Eq. (5.14) imply,

$$\sum_{j} q_{ij}(t) = 0, \quad \forall i \in S. \tag{5.18}$$

Using these developments, the Backward Chapman–Kolmogorov equation is given by,

$$\frac{\partial}{\partial v} \mathbf{H}(v, t) = -\mathbf{Q}(v)\, \mathbf{H}(v, t), \quad v \le t. \tag{5.19}$$

The forward equation may be expressed in terms of its elements,

$$\frac{\partial}{\partial t} p_{ij}(v, t) = q_{jj}(t)\, p_{ij}(v, t) + \sum_{k \ne j} q_{kj}(t)\, p_{ik}(v, t). \tag{5.20}$$

The initial state i at the initial time v affects the solution of this set of differential equations only through the following conditions,

$$p_{ij}(v, v) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{5.21}$$

The backward matrix equation may be expressed in terms of its elements,

$$\frac{\partial}{\partial v} p_{ij}(v, t) = -q_{ii}(v)\, p_{ij}(v, t) - \sum_{k \ne i} q_{ik}(v)\, p_{kj}(v, t), \tag{5.22}$$
  • 58. 55/196 with the initial conditions,

$$p_{ij}(t, t) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{5.23}$$

Markov Matrix Notation

The expressions developed in the previous section can be represented by a transition probability matrix [Papo62] of the form,

$$\mathbf{P} = \left[p_{ij}\right] = \begin{bmatrix} p_{mn} & \cdots & \cdots & p_{m0} \\ \vdots & \ddots & & \vdots \\ \vdots & & p_{11} & p_{10} \\ p_{0n} & \cdots & p_{01} & p_{00} \end{bmatrix}.$$

The entries in this matrix satisfy two properties: $0 \le p_{ij} \le 1$ and $\sum_{j} p_{ij} = 1$, which is a restatement of Eq. (5.17).

The Transition Probability Matrix can also be represented by a directed graph [Maye72], [Deo74]. A node labeled i in the directed graph represents state i of the Markov Chain and a branch labeled $p_{ij}$ from node i to node j implies that the conditional probability $P\{X_n = j \mid X_{n-1} = i\} = p_{ij}$ is met by the Markov Process represented by the directed graph.

The transition probabilities represent a set of differential equations describing the rate at which the transitions take place between each node in the directed graph. The differential equations are then represented by a matrix structure of,
  • 59. 56/196

$$\begin{bmatrix} \dfrac{d}{dt} P_n \\ \vdots \\ \dfrac{d}{dt} P_1 \\ \dfrac{d}{dt} P_0 \end{bmatrix} = \begin{bmatrix} p_{mn} & \cdots & \cdots & p_{m0} \\ \vdots & \ddots & & \vdots \\ p_{1n} & \cdots & \cdots & p_{10} \\ p_{0n} & \cdots & \cdots & p_{00} \end{bmatrix} \begin{bmatrix} P_n \\ \vdots \\ P_1 \\ P_0 \end{bmatrix}.$$

The solution to this set of linear homogeneous differential equations can be derived by elimination using the Laplace transform method.

Laplace Transform Techniques

Given a set of differential equations in Eq. (5.20) and Eq. (5.22), the Laplace transform can be used to generate solutions to these equations [Lome88]. One advantage of using the Laplace transform method is its ability to handle initial conditions automatically, without having first to find a general solution and then having to evaluate the integration constants. The Laplace transform is defined as,

$$F(s) = \int_{0}^{\infty} e^{-st} f(t)\, dt = \mathcal{L}\{f(t)\}. \tag{5.24}$$

The differential equation solution method depends on the following operational property of the Laplace transform [Krey72]. The Laplace transform of the derivative of a function is,

$$\mathcal{L}\{f'(t)\} = \int_{0}^{\infty} e^{-st} f'(t)\, dt = \lim_{b \to \infty} \left[ e^{-st} f(t) \Big|_{0}^{b} + s \int_{0}^{b} e^{-st} f(t)\, dt \right]. \tag{5.25}$$

In the limit, the integral appearing on the right–hand side of Eq. (5.25) is $\mathcal{L}\{f(t)\}$, so that the first term in Eq. (5.25) can be evaluated in the following manner [McLac39],
  • 60. 57/196

$$\lim_{b \to \infty} \left[ e^{-sb} f(b) - e^{0} f(0) \right]. \tag{5.26}$$

Using the property of absolute values and limits [Arfk70], Eq. (5.26) can be rewritten as,

$$\lim_{b \to \infty} \left| e^{-sb} f(b) \right| \le \lim_{b \to \infty} e^{-sb} \left| f(b) \right|. \tag{5.27}$$

The term $f(b)$ is of the order $e^{\alpha b}$ as $b \to \infty$. For $b > T$, using the definition for exponential order, Eq. (5.27) can be reevaluated to the following,

$$\lim_{b \to \infty} e^{-sb} \left| f(b) \right| \le \lim_{b \to \infty} e^{-sb} M e^{\alpha b} = \lim_{b \to \infty} M e^{-(s - \alpha) b}. \tag{5.28}$$

The function $f(b)$ is said to be of exponential order as $b \to \infty$ if there exists a constant $\alpha$ such that $e^{-\alpha b} \left| f(b) \right|$ is bounded for all b greater than some T. If this statement is true, there also exists a constant M, such that $\left| f(b) \right| < M e^{\alpha b}$ for $b > T$.

Figure 4 – Definition of the exponential order of a function.

If $s > \alpha$, then $s - \alpha > 0$, giving,

$$\lim_{b \to \infty} M e^{-(s - \alpha) b} = 0, \tag{5.29}$$

so that in the limit,

$$\lim_{b \to \infty} e^{-sb} f(b) = 0, \tag{5.30}$$

giving the final form of the Laplace transform of a derivative as,

$$\mathcal{L}\{f'(t)\} = s\, \mathcal{L}\{f(t)\} - f(0). \tag{5.31}$$
  • 61. 58/196 The notation for the Laplace transform for the differential equation for the rate of arrival at the transition state i is then given by,

$$\mathcal{L}\{P_i(t)\} \Rightarrow P_i(s). \tag{5.32}$$

From this point on, this Laplace transform notation will be used in the solution of the Markov transition matrix differential equations.

Using the expression $R(t) = 1 - F(t) = P\{T \ge t\}$ to define the system reliability, where $F(t)$ is the probability distribution function of the time to failure, a new random variable, Y, can be defined which represents the time to system failure. A notation can be defined such that

$$f_Y(t) = -\frac{dR(t)}{dt} = \frac{dP_0(t)}{dt}$$

is the failure density of the random variable Y. The Laplace transform of this failure density is denoted by

$$\mathcal{L}\{f_Y(t)\} \Rightarrow L_Y(s) = f_Y(s) = s P_0(s).$$

In this work $P_0(s)$ represents the absorbing state of the Markov model. By using the Laplace transform notation in the solution of differential equations, the inverse transform can be used to generate the failure density function for the random variable Y.

Using Eq. (2.38), the failure density (the negative derivative of the reliability function) can be integrated to produce the Mean Time to Failure,

$$MTTF = E[Y] = \int_{0}^{\infty} t \left( -\frac{d}{dt} R(t) \right) dt.$$

The inversion of the Laplace transform may be straightforward in some cases and more complex in other cases.

MODELING A DUPLEX SYSTEM

Duplex systems or Parallel Redundant systems have been utilized in electronic central office switching systems and other high–reliability systems for the past 35 years [Toy78]. Parallel redundant systems depend on fault detection and recovery for their proper operation. In most dual redundant architectures both systems are
  • 62. 59/196 monitored continuously, providing fault detection in the primary subsystem as well as the standby subsystem.

This section describes the detailed development of the Markov model for a parallel redundant system with perfect diagnostic coverage. The failure rate of each subsystem is assumed to be a constant $\lambda$ and the repair rate a constant $\mu$. The system is considered failed when both subsystems have failed. The number of properly functioning subsystems is described in the state space $S \Rightarrow \{2, 1, 0\}$, where $\{0\}$ is the failure state of the system. The state diagram for the system is shown in Figure 5.

[Figure 5 diagram: states {2}, {1}, {0}; transitions {2} to {1} at rate $2\lambda$, {1} to {2} at rate $\mu$, {1} to {0} at rate $\lambda$.]

Figure 5 – The state transition diagram for a Parallel Redundant system with repair. State $\{2\}$ represents the fault free operation mode, State $\{1\}$ represents a single fault with a return path to the fault free mode by a repair operation, and State $\{0\}$ represents the system failure mode, the absorption state.

The initial state of the system is $\{2\}$ and the initial conditions for the transition equations are,

$$P_2(0) = 1, \quad P_1(0) = P_0(0) = 0. \tag{5.33}$$

Using the initial conditions, the system of differential equations derived from the transition matrix,
  • 63. 60/196

$$\begin{bmatrix} \dfrac{dP_2(t)}{dt} \\ \dfrac{dP_1(t)}{dt} \\ \dfrac{dP_0(t)}{dt} \end{bmatrix} = \begin{bmatrix} -2\lambda & \mu & 0 \\ 2\lambda & -(\lambda + \mu) & 0 \\ 0 & \lambda & 0 \end{bmatrix} \begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix},$$

are given by,

$$\begin{aligned} \frac{dP_2(t)}{dt} &= -2\lambda P_2(t) + \mu P_1(t), \\ \frac{dP_1(t)}{dt} &= 2\lambda P_2(t) - (\lambda + \mu) P_1(t), \\ \frac{dP_0(t)}{dt} &= \lambda P_1(t). \end{aligned} \tag{5.34}$$

Using the Laplace transform solution technique described in the previous section and in detail in [Doet61], [Widd46], [Lome88], [Rea78], and [Lath65] gives the following set of equations in Laplace form,

$$\begin{aligned} s P_2(s) - 1 &= -2\lambda P_2(s) + \mu P_1(s), \\ s P_1(s) &= 2\lambda P_2(s) - (\lambda + \mu) P_1(s), \\ s P_0(s) &= \lambda P_1(s). \end{aligned} \tag{5.35}$$

Solving Eq. (5.35)(a) for state $\{2\}$ gives,

$$\begin{aligned} s P_2(s) + 2\lambda P_2(s) &= \mu P_1(s) + 1, \\ (s + 2\lambda) P_2(s) &= \mu P_1(s) + 1, \\ P_2(s) &= \frac{\mu P_1(s) + 1}{s + 2\lambda}, \end{aligned} \tag{5.36}$$

and solving Eq. (5.35)(b) for state $\{2\}$ gives,
  • 64. 61/196

$$\begin{aligned} s P_1(s) &= 2\lambda P_2(s) - (\lambda + \mu) P_1(s), \\ s P_1(s) + (\lambda + \mu) P_1(s) &= 2\lambda P_2(s), \\ P_2(s) &= \frac{(s + \lambda + \mu) P_1(s)}{2\lambda}. \end{aligned} \tag{5.37}$$

Equating Eq. (5.36) and Eq. (5.37), a solution representing state $\{1\}$ can be derived, giving,

$$\frac{\mu P_1(s) + 1}{s + 2\lambda} = \frac{(s + \lambda + \mu) P_1(s)}{2\lambda}. \tag{5.38}$$

Multiplying each side by $1 / P_1(s)$ gives,

$$\frac{\mu + \dfrac{1}{P_1(s)}}{s + 2\lambda} = \frac{s + \lambda + \mu}{2\lambda},$$

which results in,

$$(s + 2\lambda)(s + \lambda + \mu) = \frac{2\lambda}{P_1(s)} + 2\lambda\mu. \tag{5.39}$$

Solving Eq. (5.39) for state $\{1\}$ gives,

$$P_1(s) = \frac{2\lambda}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu}. \tag{5.40}$$

Expanding and simplifying Eq. (5.40) gives,

$$P_1(s) = \frac{2\lambda}{s^{2} + (3\lambda + \mu)s + 2\lambda^{2}}. \tag{5.41}$$

Substituting Eq. (5.41) into Eq. (5.35)(c) gives the solution to the final absorbing state $\{0\}$ as,
  • 65. 62/196

$$\begin{aligned} s P_0(s) &= \lambda P_1(s), \\ s P_0(s) &= \lambda \left[ \frac{2\lambda}{s^{2} + (3\lambda + \mu)s + 2\lambda^{2}} \right], \\ P_0(s) &= \frac{2\lambda^{2}}{s \left[ s^{2} + (3\lambda + \mu)s + 2\lambda^{2} \right]}. \end{aligned} \tag{5.42}$$

Taking the inverse Laplace transform of Eq. (5.42)(c) yields the probability that no subsystems are operating at time $t > 0$. Let the random variable Y be the time to failure of the system and $P_0(t)$ be the probability that the system has failed at or before time t. The reliability of the system is then defined by,

$$R(t) = 1 - P_0(t). \tag{5.43}$$

Using Eq. (2.37), the failure density function for the random variable Y is given by,

$$f_Y(t) = -\frac{dR}{dt} = \frac{dP_0(t)}{dt}, \tag{5.44}$$

and using Eq. (5.31), its Laplace transform is given by,

$$L_Y(s) = f_Y(s) = s P_0(s) - P_0(0) = \frac{2\lambda^{2}}{s^{2} + (3\lambda + \mu)s + 2\lambda^{2}}. \tag{5.45}$$

Inverting Eq. (5.45) gives the failure density of Y as,

$$f_Y(t) = \frac{2\lambda^{2}}{\alpha_1 - \alpha_2} \left( e^{-\alpha_2 t} - e^{-\alpha_1 t} \right), \tag{5.46}$$

where,

$$\alpha_1, \alpha_2 = \frac{(3\lambda + \mu) \pm \sqrt{\lambda^{2} + 6\lambda\mu + \mu^{2}}}{2}. \tag{5.47}$$
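  • 66. 63/196 Using Eq. (2.28), the MTTF of the Parallel Redundant system with repair is given by,

$$\begin{aligned} E[Y] &= \int_{0}^{\infty} y f_Y(y)\, dy, \\ &= \frac{2\lambda^{2}}{\alpha_1 - \alpha_2} \left[ \int_{0}^{\infty} y e^{-\alpha_2 y}\, dy - \int_{0}^{\infty} y e^{-\alpha_1 y}\, dy \right], \\ &= \frac{2\lambda^{2}}{\alpha_1 - \alpha_2} \left[ \frac{1}{\alpha_2^{2}} - \frac{1}{\alpha_1^{2}} \right], \\ &= \frac{2\lambda^{2} (\alpha_1 + \alpha_2)}{\alpha_1^{2} \alpha_2^{2}}, \\ &= \frac{2\lambda^{2} (3\lambda + \mu)}{\left( 2\lambda^{2} \right)^{2}}, \\ &= \frac{3}{2\lambda} + \frac{\mu}{2\lambda^{2}}. \end{aligned} \tag{5.48}$$

The MTTF of a two element Parallel Redundant system without repair $(\mu = 0)$ would have been equal to the first term in the final line of Eq. (5.48). Adding a repair facility to the system increases the mean life of the system by,

$$MTTF_{\text{as a result of Repair}} = \frac{\mu}{2\lambda^{2}}, \tag{5.49}$$

or a factor of,

$$\frac{\mu / 2\lambda^{2}}{3 / 2\lambda} = \frac{\mu}{3\lambda}, \tag{5.50}$$

over a system without repair facilities.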
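The closed-form MTTF can be checked without any transform algebra by integrating the state equations of the duplex model numerically and accumulating the area under $R(t) = 1 - P_0(t)$. The sketch below is illustrative only: the rates, the step size, the end time, and the choice of a fourth-order Runge–Kutta step are assumptions made for this example, not the method used in the thesis.

```python
def duplex_derivatives(p, lam, mu):
    """Right-hand side of the duplex state equations for (P2, P1, P0)."""
    p2, p1, p0 = p
    return (-2 * lam * p2 + mu * p1,
            2 * lam * p2 - (lam + mu) * p1,
            lam * p1)

def rk4_step(p, h, lam, mu):
    """One classical Runge-Kutta step of size h."""
    k1 = duplex_derivatives(p, lam, mu)
    k2 = duplex_derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam, mu)
    k3 = duplex_derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam, mu)
    k4 = duplex_derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam, mu)
    return tuple(x + h / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(p, k1, k2, k3, k4))

def mttf_by_integration(lam, mu, h=0.01, t_end=200.0):
    """Integrate R(t) = 1 - P0(t) with the trapezoidal rule."""
    p = (1.0, 0.0, 0.0)  # initial conditions: system starts fault free
    mttf = 0.0
    for _ in range(int(t_end / h)):
        nxt = rk4_step(p, h, lam, mu)
        mttf += 0.5 * h * ((1 - p[2]) + (1 - nxt[2]))
        p = nxt
    return mttf

def mttf_closed_form(lam, mu):
    """Eq. (5.48): 3/(2*lam) + mu/(2*lam**2)."""
    return (3 * lam + mu) / (2 * lam ** 2)

lam, mu = 1.0, 10.0  # illustrative rates in arbitrary time units
print(mttf_by_integration(lam, mu), mttf_closed_form(lam, mu))
```

The end time must be long compared with the slow time constant $1/\alpha_2$, otherwise the tail of $R(t)$ is cut off and the numerical MTTF comes out low.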
  • 67. 64/196 MODELING A TRIPLE–REDUNDANT SYSTEM

A Triple Modular Redundant (TMR) system continues to operate correctly as long as two of the three subsystems are functioning properly. A second subsystem failure causes the system to fail. This model is referred to as 3–2–0. A second architecture (shown in Figure 7) is possible in which the system will continue to operate in the presence of two (2) subsystem failures. This system operates in simplex mode 3–2–1–0. The 3–2–0 model without coverage will be developed in this section.

Figure 6 describes a TMR system with a constant failure rate $\lambda$ and a constant repair rate $\mu$. The repair activity takes place with a constant response time whenever a subsystem fails, giving a Markov transition matrix of,

$$\begin{bmatrix} \dfrac{dP_2(t)}{dt} \\ \dfrac{dP_1(t)}{dt} \\ \dfrac{dP_0(t)}{dt} \end{bmatrix} = \begin{bmatrix} -3\lambda & \mu & 0 \\ 3\lambda & -(2\lambda + \mu) & 0 \\ 0 & 2\lambda & 0 \end{bmatrix} \begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix}. \tag{5.51}$$

The set of differential equations derived from the transition matrix is given by,

$$\begin{aligned} \frac{dP_2(t)}{dt} &= -3\lambda P_2(t) + \mu P_1(t), \\ \frac{dP_1(t)}{dt} &= 3\lambda P_2(t) - (2\lambda + \mu) P_1(t), \\ \frac{dP_0(t)}{dt} &= 2\lambda P_1(t). \end{aligned} \tag{5.52}$$

Rewriting the differential equations in the Laplace transform format gives,
  • 68. 65/196

$$\begin{aligned} s P_2(s) - 1 &= -3\lambda P_2(s) + \mu P_1(s), \\ s P_1(s) &= 3\lambda P_2(s) - (2\lambda + \mu) P_1(s), \\ s P_0(s) &= 2\lambda P_1(s). \end{aligned} \tag{5.53}$$

Using Eq. (5.53)(a) and Eq. (5.53)(b) to solve for state $\{2\}$ gives,

$$\begin{aligned} s P_2(s) + 3\lambda P_2(s) &= \mu P_1(s) + 1, \\ (s + 3\lambda) P_2(s) &= \mu P_1(s) + 1, \\ P_2(s) &= \frac{\mu P_1(s) + 1}{s + 3\lambda}. \end{aligned} \tag{5.54}$$
  • 69. 66/196 [Figure 6 diagram: states {2}, {1}, {0}; transitions {2} to {1} at rate $3\lambda$, {1} to {2} at rate $\mu$, {1} to {0} at rate $2\lambda$.]

Figure 6 – The transition diagram for a Triple Modular Redundant system with repair. State $\{2\}$ represents the fault free (TMR) operation mode, State $\{1\}$ represents a single fault (Duplex) operation mode with a return path to the fault free mode, and State $\{0\}$ represents the system failure mode, the absorbing state.

Using Eq. (5.53)(b) to solve again for state $\{2\}$ gives,

$$\begin{aligned} s P_1(s) &= 3\lambda P_2(s) - (2\lambda + \mu) P_1(s), \\ s P_1(s) + (2\lambda + \mu) P_1(s) &= 3\lambda P_2(s), \\ P_2(s) &= \frac{(s + 2\lambda + \mu) P_1(s)}{3\lambda}. \end{aligned} \tag{5.55}$$

Equating Eq. (5.54) and Eq. (5.55) and solving for state $\{1\}$ gives,

$$\begin{aligned} \frac{\mu P_1(s) + 1}{s + 3\lambda} &= \frac{(s + 2\lambda + \mu) P_1(s)}{3\lambda}, \\ P_1(s) &= \frac{3\lambda}{(s + 2\lambda + \mu)(s + 3\lambda) - 3\lambda\mu}. \end{aligned} \tag{5.56}$$

Simplifying Eq. (5.56)(b) gives,

$$P_1(s) = \frac{3\lambda}{s^{2} + (5\lambda + \mu)s + 6\lambda^{2}}. \tag{5.57}$$
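  • 70. 67/196 Substituting the solution for state $\{1\}$, Eq. (5.57), into Eq. (5.53)(c) gives the solution for the final absorbing state $\{0\}$,

$$\begin{aligned} sP_0(s) &= 2\lambda P_1(s) = 2\lambda \left[ \frac{3\lambda}{s^2 + 5\lambda s + 6\lambda^2 + \mu s} \right], \\ P_0(s) &= \frac{6\lambda^2}{s \left( s^2 + 5\lambda s + 6\lambda^2 + \mu s \right)}. \end{aligned} \tag{5.58}$$

Expanding and factoring the denominator of Eq. (5.58)(b) gives the Laplace transform of the absorbing state probability as,

$$P_0(s) = \frac{6\lambda^2}{s \left( s + \tfrac{1}{2}\left( 5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) \right) \left( s + \tfrac{1}{2}\left( 5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) \right)}. \tag{5.59}$$

Expanding the partial fractions of Eq. (5.59), taking the inverse Laplace transform, and using $R(t) = 1 - P_0(t)$ results in the following reliability function,

$$\begin{aligned} R(t) ={}& \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}} \, e^{-\frac{1}{2}\left( 5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) t} \\ &- \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}} \, e^{-\frac{1}{2}\left( 5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2} \right) t}. \end{aligned} \tag{5.60}$$

Integrating Eq. (5.60) using Eq. (2.24) produces the MTTF of,

$$\begin{aligned} MTTF ={}& \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{(5\lambda + \mu)\sqrt{\lambda^2 + 10\lambda\mu + \mu^2} - \lambda^2 - 10\lambda\mu - \mu^2} \\ &- \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{(5\lambda + \mu)\sqrt{\lambda^2 + 10\lambda\mu + \mu^2} + \lambda^2 + 10\lambda\mu + \mu^2}. \end{aligned} \tag{5.61}$$

Simplifying Eq. (5.61) gives the MTTF for a TMR system with repair as,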
  • 71. 68/196

$$MTTF = \frac{5\lambda + \mu}{6\lambda^2}. \tag{5.62}$$

Rearranging Eq. (5.62) and isolating the repair term from the failure term gives,

$$MTTF = \frac{5}{6\lambda} + \frac{\mu}{6\lambda^2}. \tag{5.63}$$

MODELING A PARALLEL SYSTEM WITH IMPERFECT COVERAGE

A more realistic model of a Parallel Redundant System assumes that not all faults are recoverable and that the coverage factor $c$ denotes the conditional probability that the system detects the fault and survives. The state diagram for this system is shown in Figure 7.
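To make the role of the coverage factor concrete, the short sketch below splits the total first-fault rate of the two-unit parallel system, $2\lambda$, into a covered part $2c\lambda$ (the system detects the fault and moves to the single-fault state) and an uncovered part $2(1-c)\lambda$ (the system fails outright), following the rate labels of Figure 7. The function name and the sample values of $\lambda$ and $c$ are hypothetical and chosen only for illustration.

```python
def first_fault_rates(lam, c):
    """Split the total first-fault rate 2*lam by the coverage factor c.

    Returns (covered, uncovered): the rate of the {2} -> {1} transition and
    the rate of the direct {2} -> {0} transition, as labeled in Figure 7.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("coverage factor c must lie in [0, 1]")
    covered = 2.0 * c * lam
    uncovered = 2.0 * (1.0 - c) * lam
    return covered, uncovered


# Hypothetical failure rate (per hour, per subsystem), for illustration only.
lam = 1.0e-3
for c in (1.0, 0.99, 0.90):
    covered, uncovered = first_fault_rates(lam, c)
    print(f"c={c:.2f}  covered={covered:.2e}/h  uncovered={uncovered:.2e}/h")
```

With perfect coverage ($c = 1$) the direct path to the failure state carries no rate at all. Any shortfall in coverage opens a path that bypasses repair entirely.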
  • 72. 69/196 Figure 7 – The transition diagram for a Parallel Redundant system with repair and imperfect fault coverage. State $\{2\}$ represents the fault free mode, State $\{1\}$ represents a single fault with a return path to the fault free mode by a repair operation, and State $\{0\}$ represents the system failure mode. State $\{0\}$ can be reached from State $\{2\}$ through an uncovered fault, which causes the system to fail without the intermediate State $\{1\}$ mode. [Diagram: $\{2\} \to \{1\}$ at rate $2c\lambda$, $\{1\} \to \{2\}$ at rate $\mu$, $\{1\} \to \{0\}$ at rate $\lambda$, $\{2\} \to \{0\}$ at rate $2(1-c)\lambda$.]

The transition matrix for Figure 7 is,

$$\begin{bmatrix} \dfrac{dP_2(t)}{dt} \\[4pt] \dfrac{dP_1(t)}{dt} \\[4pt] \dfrac{dP_0(t)}{dt} \end{bmatrix} = \begin{bmatrix} -2c\lambda - 2(1-c)\lambda & \mu & 0 \\ 2c\lambda & -(\lambda+\mu) & 0 \\ 2(1-c)\lambda & \lambda & 0 \end{bmatrix} \begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix}. \tag{5.64}$$

With an initial state of $\{2\}$ producing a set of starting conditions, $P_2(0) = 1,\ P_1(0) = P_0(0) = 0$,
  • 73. 70/196 the system of equations describing the state transitions is,

$$\begin{aligned} \frac{dP_2(t)}{dt} &= -2\lambda c P_2(t) - 2\lambda (1-c) P_2(t) + \mu P_1(t), \\ \frac{dP_1(t)}{dt} &= 2\lambda c P_2(t) - (\lambda+\mu) P_1(t), \\ \frac{dP_0(t)}{dt} &= 2\lambda (1-c) P_2(t) + \lambda P_1(t). \end{aligned} \tag{5.65}$$

Using the Laplace transform method, the above equations are reduced to,

$$\begin{aligned} sP_2(s) - 1 &= -2\lambda P_2(s) + \mu P_1(s), \\ sP_1(s) &= 2\lambda c P_2(s) - (\lambda+\mu) P_1(s), \\ sP_0(s) &= 2\lambda (1-c) P_2(s) + \lambda P_1(s). \end{aligned} \tag{5.66}$$

Using Eq. (5.66)(a) and solving for state $\{2\}$ gives,

$$\begin{aligned} sP_2(s) + 2\lambda P_2(s) &= \mu P_1(s) + 1, \\ (s + 2\lambda) P_2(s) &= \mu P_1(s) + 1, \\ P_2(s) &= \frac{\mu P_1(s) + 1}{s + 2\lambda}. \end{aligned} \tag{5.67}$$

Using Eq. (5.66)(b) to solve for state $\{2\}$ gives,

$$\begin{aligned} sP_1(s) &= 2\lambda c P_2(s) - (\lambda+\mu) P_1(s), \\ (s + \lambda + \mu) P_1(s) &= 2\lambda c P_2(s), \\ P_2(s) &= \frac{(s + \lambda + \mu) P_1(s)}{2\lambda c}. \end{aligned} \tag{5.68}$$

Equating Eq. (5.67)(c) and Eq. (5.68)(c) and solving for state $\{1\}$ gives,

$$\frac{\mu P_1(s) + 1}{s + 2\lambda} = \frac{(s + \lambda + \mu) P_1(s)}{2\lambda c}. \tag{5.69}$$
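A quick numeric sanity check on this algebra: Eq. (5.66)(a) and Eq. (5.66)(b) form a two-by-two linear system in $P_2(s)$ and $P_1(s)$, which can be solved directly by Cramer's rule at any sample value of $s$. The sketch below does that and confirms that the resulting pair satisfies Eq. (5.67)(c) and Eq. (5.68)(c) simultaneously, as Eq. (5.69) requires. The function name, the sample point, and the parameter values are hypothetical, and a single sample point is a spot check, not a proof.

```python
def transient_transforms(s, lam, mu, c):
    """Solve Eq. (5.66)(a),(b) for (P2(s), P1(s)) by Cramer's rule.

    Rearranged, the two equations read
        (s + 2*lam) * P2 - mu * P1           = 1
        -2*c*lam * P2 + (s + lam + mu) * P1  = 0
    """
    det = (s + 2.0 * lam) * (s + lam + mu) - 2.0 * c * lam * mu
    p2 = (s + lam + mu) / det
    p1 = 2.0 * c * lam / det
    return p2, p1


# Hypothetical sample point and parameters, chosen only for illustration.
s, lam, mu, c = 0.5, 0.01, 0.1, 0.95
p2, p1 = transient_transforms(s, lam, mu, c)

lhs = (mu * p1 + 1.0) / (s + 2.0 * lam)      # P2(s) according to Eq. (5.67)(c)
rhs = (s + lam + mu) * p1 / (2.0 * lam * c)  # P2(s) according to Eq. (5.68)(c)
print("Eq. (5.69) balanced:", abs(lhs - rhs) < 1e-12)
print("both sides equal P2(s):", abs(lhs - p2) < 1e-12)
```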
  • 74. 71/196 Simplifying Eq. (5.69) and solving for state $\{1\}$ gives,

$$\begin{aligned} 2c\lambda\mu P_1(s) + 2c\lambda &= (s + 2\lambda)(s + \lambda + \mu) P_1(s), \\ P_1(s) &= \frac{2c\lambda}{(s + 2\lambda)(s + \lambda + \mu) - 2c\lambda\mu}. \end{aligned} \tag{5.70}$$

Using Eq. (5.66)(a) and solving for state $\{1\}$ gives,

$$\begin{aligned} sP_2(s) + 2\lambda P_2(s) - 1 &= \mu P_1(s), \\ (s + 2\lambda) P_2(s) - 1 &= \mu P_1(s), \\ P_1(s) &= \frac{(s + 2\lambda) P_2(s) - 1}{\mu}. \end{aligned} \tag{5.71}$$

Using Eq. (5.66)(b) and solving for state $\{1\}$ gives,

$$\begin{aligned} sP_1(s) &= 2\lambda c P_2(s) - (\lambda+\mu) P_1(s), \\ (s + \lambda + \mu) P_1(s) &= 2\lambda c P_2(s), \\ P_1(s) &= \frac{2\lambda c P_2(s)}{s + \lambda + \mu}. \end{aligned} \tag{5.72}$$

Equating Eq. (5.71) and Eq. (5.72) and solving for state $\{2\}$ gives,

$$\begin{aligned} \frac{(s + 2\lambda) P_2(s) - 1}{\mu} &= \frac{2\lambda c P_2(s)}{s + \lambda + \mu}, \\ P_2(s) &= \frac{s + \lambda + \mu}{(s + 2\lambda)(s + \lambda + \mu) - 2c\lambda\mu}. \end{aligned} \tag{5.73}$$

Substituting Eq. (5.70) and Eq. (5.73) into Eq. (5.66)(c) and solving for state $\{0\}$ gives,