This document summarizes a thesis submitted in 1980 on fault-tolerant system reliability in the presence of imperfect diagnostic coverage. It examines how less than perfect diagnostics affect system reliability. It presents mathematical models for analyzing reliability of redundant systems like duplex and triple modular redundant configurations. It also discusses practical measurement of diagnostic coverage factors and effects of common cause failures. The document provides reliability analyses of specific redundant system examples and concludes with remaining questions around realistic failure distributions and effects of periodic maintenance.
FAULT–TOLERANT SYSTEM RELIABILITY IN THE PRESENCE OF IMPERFECT DIAGNOSTIC COVERAGE
Glen B. Alleman
The deployment of computer systems for the control of mission critical processes has become the norm in many industrial and commercial markets. The reliability of these systems is usually analyzed in terms of the Mean Time to Failure. The design and analysis of high reliability systems is now a mature science. Starting with fault–tolerant central office switches (ESS4), dual redundant and n–way redundant systems are now available in a variety of application domains. The technologies of microprocessor based industrial controls and redundant central processor systems create the opportunity to build fault–tolerant computing systems on a much smaller scale than previously found in the commercial marketplace.
The diagnostic facilities used in a modern Fault–Tolerant Computer System attempt to detect fault conditions present in the hardware and embedded software. Coverage is the figure of merit describing the effectiveness of the diagnostic system. This thesis examines the effects of less than perfect diagnostic coverage on system reliability. The mathematical background for analyzing the coverage factor of fault–tolerant systems is presented in detail, along with specific examples of practical systems and their relative reliability measures.
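To make the role of coverage concrete, the following is a minimal sketch of the textbook two-unit active parallel (duplex) model, not a reproduction of the thesis's own derivation. It assumes a constant failure rate for each unit, no repair, and a coverage factor c defined as the probability that a unit failure is detected and the system continues on the surviving unit. Under those assumptions R(t) = exp(-2λt) + 2c(exp(-λt) - exp(-2λt)) and MTTF = (1 + 2c) / (2λ). The function names and the numeric values are illustrative only.

```python
import math


def duplex_reliability(t, failure_rate, coverage):
    """Reliability at time t of two active units without repair.

    Assumed model (illustrative, not taken from the thesis):
    - each unit fails at the constant rate `failure_rate`;
    - a first failure is detected and isolated with probability `coverage`;
    - an uncovered first failure, or any second failure, fails the system.
    """
    both_up = math.exp(-2.0 * failure_rate * t)
    # Probability that exactly one unit survives after a covered first failure.
    one_up = 2.0 * coverage * (math.exp(-failure_rate * t) - both_up)
    return both_up + one_up


def duplex_mttf(failure_rate, coverage):
    """Mean time to failure: the integral of duplex_reliability over t >= 0."""
    return (1.0 + 2.0 * coverage) / (2.0 * failure_rate)


if __name__ == "__main__":
    lam = 1.0e-4  # hypothetical failure rate, failures per hour
    for c in (1.0, 0.99, 0.90, 0.0):
        print(f"c={c:4.2f}  R(1000 h)={duplex_reliability(1000.0, lam, c):.6f}"
              f"  MTTF={duplex_mttf(lam, c):.0f} h")
```

With perfect coverage (c = 1) the model reduces to the ideal parallel pair, and with no coverage (c = 0) the duplex behaves like a series system that fails on the first unit failure, so any value in between shows how much of the redundancy benefit the diagnostics actually deliver.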
In a complex system, malfunction and even total nonfunction may not be
detected for long periods, if ever.
— John Gall
4. i
TABLE OF CONTENTS
INTRODUCTION......................................................................................................10
Fault Tolerant System Definitions........................................................................10
Fault–Tolerant System Functions.........................................................................11
Overview of This Thesis ...................................................................................11
RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS ......................13
Deterministic Models ..............................................................................................13
Probabilistic Models...........................................................................................14
Exponential and Poisson Relationships .........................................................15
Reliability Availability and Failure Density Functions .................................20
Mean Time to Failure.........................................................................................23
Mean Time to Repair .........................................................................................27
Mean Time Between Failure.............................................................................27
Mean Time to First Failure ...............................................................................27
General Availability Analysis ............................................................................31
Instantaneous Availability ..........................................................................33
Limiting Availability ....................................................................................34
SYSTEM RELIABILITY ......................................................................................37
Series Systems......................................................................................................37
Parallel Systems ...................................................................................................39
M–of–N Systems................................................................................................39
Selecting the Proper Evaluation Parameters..................................................40
Imperfect Fault Coverage And Reliability...........................................................42
Redundant System with Imperfect Coverage................................................42
Generalized Imperfect Coverage.....................................................................44
Markov Models Of Fault–Tolerant Systems.......................................................49
Solving the Markov Matrix ...............................................................................52
Chapman–Kolmogorov Equations..........................................................52
Markov Matrix Notation...................................................................................55
Laplace Transform Techniques........................................................................56
Modeling a Duplex System.....................................................................................58
Modeling a Triple–Redundant System.................................................................64
Modeling a Parallel System with Imperfect Coverage.......................................68
Modeling A TMR System with Imperfect Coverage.........................................74
Modeling A Generalized TMR System................................................................76
Laplace Transform Solution to Systems of Equations................................77
Specific Solution to the Generalized System.................................................78
PRACTICAL EFFECTS OF PARTIAL COVERAGE......................................85
Determining Coverage Factors..............................................................................85
5. ii
Coverage Measurement Statistics .............................................................86
Coverage Factor Measurement Assumptions ........................................86
Coverage Measurement Sampling Method.............................................87
Normal Population Statistics.....................................................................87
Sample Size Computation..........................................................................88
General Confidence Intervals....................................................................89
Proportion Statistics....................................................................................90
Confidence Interval Estimate of the Proportion...................................91
Unknown Population Proportion.............................................................91
Clopper–Person Estimation......................................................................92
Practical Sample Estimates ........................................................................93
Time Dependent Aspects of Fault Coverage Measurement ...............94
Common Cause Failure Effects ............................................................................95
Square Root Bounding Problem......................................................................97
Beta Factor Model..............................................................................................97
Multi–Nominal Failure Rate (Shock Model) .................................................97
Binomial Failure Rate Model............................................................................98
Multi–Dependent Failure Fraction Model.....................................................98
Basic Parameter Model......................................................................................99
Multiple Greek Letter Model..........................................................................99
Common Load Model .....................................................................................100
Nonidentical Components Model.................................................................100
Practical Example of Common Cause Failure Analysis ............................100
Common Cause Software Reliability.............................................................102
Software Reliability Concepts..................................................................103
Software Reliability and Fail–Safe Operations.....................................109
PARTIAL FAULT COVERAGE SUMMARY...................................................111
Effects of Coverage...............................................................................................112
REMAINING QUESTIONS..................................................................................113
Realistic Probability Distributions.......................................................................113
Multiple Failure Distributions ........................................................................114
Weibull Distribution........................................................................................116
Periodic Maintenance............................................................................................118
Periodic Maintenance of Repairable Systems..............................................119
Reliability Improvement for a TMR System................................................122
CONCLUSIONS........................................................................................................124
MARKOV CHAINS..................................................................................................125
Definition A.1....................................................................................................125
Definition A.2....................................................................................................125
Definition A.3....................................................................................................126
Theorem A.1......................................................................................................126
Proof of Theorem A.1.....................................................................................126
LIST OF FIGURES
Number Page
Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be
used to develop a set of time dependent metrics used to evaluate
various configurations. ............................................................................................13
Figure 2 – Assumptions regarding the behavior of a random process that
generated events following the Poisson probability distribution
function......................................................................................................................16
Figure 3 – State Transition probabilities as a function of time in the Continuous–
Time Markov chain that is subject to the constraints of the Chapman–
Kolmogorov equation.............................................................................................51
Figure 4 – Definition of the exponential order of a function............................................57
Figure 5 – The state transition diagram for a Parallel Redundant system with
repair. State {2} represents the fault free operation mode, State {1}
represents a single fault with a return path to the fault free mode by a
repair operation, and State {0} represents the system failure mode, the
absorption state.........................................................................................................59
Figure 6 – The transition diagram for a Triple Modular Redundant system with
repair. State {2} represents the fault free (TMR) operation mode, State {1}
represents a single fault (Duplex) operation mode with a return
path to the fault free mode, and State {0} represents the system failure
mode, the absorbing state.......................................................................................66
Figure 7 – The transition diagram for a Parallel Redundant system with repair and
imperfect fault coverage. State {2} represents the fault free mode, State {1}
represents a single fault with a return path to the fault free mode by
a repair operation, and State {0} represents the system failure mode.
State {0} can be reached from State {2} through an uncovered fault,
which causes the system to fail without the intermediate State {1}
mode...........................................................................................................................69
Figure 8 – The state transition diagram for a Triple Modular Redundant system
with repair and imperfect fault coverage. State {3} represents the fault
free mode, State {2} represents the single fault (Duplex) mode, State {1}
represents the two–fault (Simplex) mode, and State {0} represents
the system failure mode...........................................................................................74
Figure 9 – The state transition diagram for a Generalized Triple Modular
Redundant system with repair and perfect fault detection coverage.
The system initially operates in a fault free state {0}. A fault in any
module results in the transition to state {1, …, N}. A second fault while
in state {1, …, N} results in the system failure state {N+1}.........................78
Figure 10 – Sample size requirement for a specified estimate as tabulated by
Clopper and Pearson. ..............................................................................................93
Figure 11 – Common Cause Failure modes guide figures for electronic
programmable system [HSE87]. These are ratios of non–CCF to CCF for
various system configurations. CCFs are defined as non–random faults
that are designed in or experienced through environmental damage to
the system. Other sources [SINT88], [SINT89] provide different
figures. ......................................................................................................................102
Figure 12 – Four Software Growth Model expressions. The exponential and
hyperexponential growth models represent software faults that are time
independent. The S–Shaped growth models represent time delayed and
time inflection software fault growth rates [Mats88].......................................104
Figure 13 – MTTF of Simplex, Parallel Redundant, and TMR Systems. ......................111
Figure 14 – MTTF of Parallel Redundant and TMR Systems with varying degrees
of coverage. .............................................................................................................112
Figure 15 – Mean Time to Failure increases for a Triple Modular Redundant
system with periodic maintenance. This graph shows that maintenance
intervals which are greater than one–half of the mean time to failure for
one module have little effect on increasing reliability. But frequent
maintenance, even low quality maintenance, improves the system
reliability considerably. ..........................................................................................123
ACKNOWLEDGMENTS
The author wishes to thank Dr. Wing Toy of AT&T Naperville Laboratories,
Naperville, Illinois, for his consultation on the ESS4 Central Office Switch and his
contributions to this work; Dr. Victor Lowe of Ford Aerospace, Newport Beach,
California, for his consultation on the general forms of Markov model solutions;
Mr. Henk Hinssen of Exxon Corporation, Antwerp, Belgium, for his discussion of
the effects of partial diagnostic coverage in Triple Modular Redundant Systems at
the Exxon Polystyrene Plant, Antwerp, Belgium; Dr. Phil Bennet of The Centre
for Software Engineering, Flixborough, England, for his ideas regarding software
reliability measurements in the presence of undetected faults; and Mr. Daniel Lelivre
of Factory Systems, Paris, France, for his comments and review of this work and
its applicability to safety critical systems at Total, Mobile, and NorSoLor chemical
plants.
Several institutions have contributed source material for this work including The
Foundation for Scientific and Industrial Research at the Norwegian Institute of
Technology (SINTF), Trondheim, Norway and the United Kingdom Atomic
Energy Authority, Systems Reliability Service, Culcheth, Warrington, England.
This work is submitted as a Thesis in completion of a Master's Degree in Systems
Management, University of Southern California, 1980. It was extended in support
of the efforts that gained compliance of the Tricon with process safety standards
in the United States, Europe, and United Kingdom.
PREFACE
This work was extended in support of the design and development of the Triple
Modular Redundant (TMR) computer produced by Triconex Corporation of
Irvine, California. In 1987, Triconex designed and manufactured its first digital
TMR process control computer that was deployed in a variety of industrial
environments, including: turbine controls, boiler controls, fire and gas systems,
emergency shutdown systems, and general-purpose fault–tolerant real–time
control systems.
The Tricon (a classic 1980’s product name) was based on several innovative
technologies. As the manager of software development for Triconex, I was
intimately involved in the software and hardware of the Tricon. In 1987, TMR
was not a completely new concept. Flight control systems and navigation
computers were found in aerospace applications. The Space Shuttle used a
TMR+1 computer system, a concept that was well understood by the public. What was new
to the market was an affordable TMR computer that could be deployed in a rugged
industrial environment. The heart of the Tricon was a hardware voting system that
performed a 2–out–of–3 vote for all digital input signals presented to the control
program. The contents of memory and the computed digital outputs were again
voted 2–out–of–3 at the physical output devices. Once the digital command had
been applied to the output device, its driven state was verified and the results
reported to the control program.
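The 2–out–of–3 vote described above can be illustrated with a minimal sketch. This is not the Tricon's implementation, which voted in hardware; the function names `vote_2oo3` and `disagreeing_channels`, and the use of Python booleans to stand in for digital signals, are assumptions made purely for illustration.

```python
from typing import List


def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Return the majority value of three redundant digital signals."""
    # The majority is true when at least two of the three inputs are true.
    return (a and b) or (a and c) or (b and c)


def disagreeing_channels(a: bool, b: bool, c: bool) -> List[int]:
    """Return the indices (0, 1, 2) of channels that differ from the voted value."""
    voted = vote_2oo3(a, b, c)
    return [i for i, value in enumerate((a, b, c)) if value != voted]


# Hypothetical case: the channel at index 1 reports a stuck-at-false input.
print(vote_2oo3(True, False, True), disagreeing_channels(True, False, True))
```

A single faulty channel is outvoted by the two healthy ones, and the list of disagreeing channels is the kind of information a diagnostic routine could use to locate the fault.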
The Tricon contained 3 independent (but identical) 32–bit battery powered
microprocessors, a 2–out–of–3 voting digital serial bus connecting the three
processors, a dual redundant power system using DC–to–DC converters (state of
the art for 1987), and three separate isolated serial I/O buses connecting the I/O
subsystem to the three main processors. The I/O subsystem cards were
themselves TMR, using onboard 8–bit processors and a quad output device to
vote 2–out–of–3 the digital commands received from the control program.
The Tricon executed a control program on a periodic basis. The architecture of
the operating software was modeled after the programmable controllers of the
day, which were programmed in a ladder logic representing mechanical relays and
timers. Both digital and analog devices provided input and output to the control
program. The control program accepted input states from the I/O subsystem,
evaluated the decision logic and produced output commands, which were sent to
the I/O subsystem. This cycle was performed every 10ms in a normally
configured system.
In the presence of faults, the key to the survivability of the Tricon was the
combination of TMR hardware and fault diagnostic software. Diagnostic
software was applied to each processor element and the digital I/O device. This
diagnostic software was capable of detecting all single stuck–at faults, many
multiple stuck–at faults as well as many transient faults. A fault–injection and
reliability evaluation technique developed by the author and described in this
work was used to evaluate the coverage factor of the diagnostic software.
Triconex no longer exists as an independent company, having been absorbed into
a larger control systems vendor. The materials presented in this work were critical
to Tricon’s TÜV and SINTF [SINTF89] certification for Norwegian North Sea
sector, German (then the Federal Republic), Belgian, and British Health and
Safety Executive (HSE) industrial safety operations.
The concept of fault–tolerant computing has become important again in the
distributed computing marketplace. The Tandem Non–Stop processor, modern
flight and navigation computers as well as telecommunications computers all
depend on some form of diagnostics to initiate the fault detection and recovery
process. A recent systems architectural paper mentioned TMR but without
sufficient attention to the underlying details. [1] The reissuing of this paper
addresses several gaps in the literature:
§ The foundations of fault–tolerance and fault–tolerance modeling have faded
from the computer science literature. The underlying mathematics of fault–
tolerant systems present a challenge for an industry focused on rapid
software development and short time to market pressures.
§ The fact that unreliable and untrustworthy software systems are
created by latent faults in both the hardware and software is poorly
understood in this age of Object–Oriented programming and plug and play
systems development.
§ The Markov models presented in this work have general applicability to
distributed computer systems analysis and need to be restated. The
application of these models to distributed processing systems, with
symmetric multi–processor computers is a reemerging science. With the
advent of high–availability computing systems, the foundations of these
systems need to be understood once again.
§ The current crop of computer science practitioners has very little
understanding of the complexities and subtleties of the underlying hardware
and firmware that make up the diagnostic systems of modern computers,
their reliability models and the mathematics of system modeling.
Glen B. Alleman
Niwot, Colorado 80503
Updated, April 2000
1 “Attribute Based Architectural Styles,” Mark Klein and Rick Kazman, CMU/SEI–99–TR–022, Software
Engineering Institute, Carnegie Mellon University, October 1999.
Chapter 1
INTRODUCTION
Two approaches are available to increase the reliability of a digital computer
system: fault avoidance (fault intolerance) and fault tolerance [Aviz75]. Fault
avoidance results from conservative design techniques utilizing high–reliability
components, system burn–in, and careful design and testing processes. The goal
of fault avoidance is to reduce the possibility of a failure [Aviz84], [Rand75],
[Kim86], [Ozak88]. Any fault that does occur, however, results in system failure,
negating all prior efforts to increase system reliability [Litt75], [Low72].
Fault–tolerance provides the system with the ability to withstand a system fault,
maintain a safe state in the presence of a fault, and possibly continue to operate in
the presence of this fault.
FAULT TOLERANT SYSTEM DEFINITIONS
A set of consistent definitions is used here to avoid confusion with existing
definitions. These definitions are provided by the IFIP Working Group 10.4,
Reliable Computing and Fault–Tolerance [Aviz84], [Aviz82], [Ande82], [Robi82],
[Lapr84], [TUV86]:
§ A Failure occurs when the system user perceives that a service resource has
ceased to deliver the expected results.
§ An Error occurs when some part of a system resource assumes an undesired
state. Such a state is contrary to the specification of the resource or to the
expectation (requirement) of the user.
§ A Fault is detected when either a failure of the resource occurs, or an error is
observed within the resource. The cause of the failure or error is said to be a
fault.
FAULT–TOLERANT SYSTEM FUNCTIONS
In fault–tolerant systems, hardware and software redundancy provides
information needed to negate the effects of a fault [Aviz67]. The design of fault–
tolerant systems involves the selection of a coordinated failure response
mechanism that follows four steps [Siew84], [Mell77], [Toy86]:
§ Fault Detection
§ Fault Location and Identification
§ Fault Containment and Isolation
§ Fault Masking
During the fault detection process, diagnostics are used to gather and analyze
information generated by the fault detection hardware and software. These
diagnostics determine the appropriate fault masking and fault recovery actions
[Euri84], [Rouq86], [Ossf80], [Gluc86], [John85], [John86], [Kirr86], [Chan70]. It
is the less than perfect operation of the Fault Detection, Location, and
Identification processes of the system that is examined in this work.
The reliability of the fault–tolerant system depends on the ability of the diagnostic
subsystem to correctly detect and analyze faults [Kirr87], [Gall81], [Cook73],
[Brue76], [Lamp82]. The measure of the correct operation of the diagnostic
subsystem is called the Coverage Factor. It is assumed in most fault–tolerant
product offerings that the diagnostic coverage factor is perfect, i.e. 100%. This
work addresses the question:
What is the reliability of the Fault–Tolerant system in the presence of less
than perfect coverage?
To answer this question, some background in the mathematics of reliability
theory is necessary.
Overview of This Thesis
The development of a reliability model of a Triple Modular Redundant (TMR)
system with imperfect diagnostic coverage is the goal of this work. Along the
way, the underlying mathematics for analyzing these models is developed. The
Markov Chain method will be the primary technique used to model the failure
and repair processes of the TMR system. The Laplace transform will be used to
solve the differential equations representing the transition probabilities between
the various states of the TMR system described by the Markov model.
The models developed for a TMR system with partial coverage can be applied to
actual systems. To make the models useful in the real world, a deeper
treatment of diagnostic coverage and fault detection is presented. The
appendices provide the background for the Markov models as well as the
statistical process.
The mathematics of Markov Chains and the statistical processes that underlie
system faults and their repair processes can be applied to a variety of other
analytical problems, including system performance analysis. It is hoped the reader
will gain some appreciation of the complexity and beauty of modern systems as
well as the subtleties of their design and operation.
If the reader is interested in skipping to the end, Chapter 7 provides a summary
of the effects of partial coverage on various system configurations.
Chapter 2
RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS
When presented with the reliability figures for a computer system, the user must
often accept the stated value as factual and relevant and construct a comparison
matrix to determine the goodness of each product offering [Kraf81]. Difficulties
often arise through the definition and interpretation of the term reliability.
This chapter develops the necessary background for understanding the reliability
criteria defined by the manufacturers of computer equipment. Figure 1 lists the
criteria for defining system reliability [Siew82], [Ande72], [Ande79], [Ande81].
Deterministic Models
    Survival of at least k component failures
Probabilistic Models
    $z(t)$ – Hazard (failure rate) function
    $R(t)$ – Reliability function
    $\mu$ – Repair Rate
    $A(t)$ – Availability function
Single Parameter Models
    MTTF – Mean Time to Failure
    MTTR – Mean Time to Repair
    MTBF – Mean Time Between Failures
    c – Coverage
Figure 1 – Evaluation Criteria defining System
Reliability. These criteria will be used to
develop a set of time dependent metrics used
to evaluate various configurations.
DETERMINISTIC MODELS
The simplest reliability model is a deterministic one, in which the minimum
number of component failures that can be tolerated without system failure is
taken as the figure of merit for the system.
Probabilistic Models
The failure rate of electronic and mechanical devices varies as a function of time.
This time dependent failure rate is defined by the hazard function, $z(t)$. The
hazard function is also referred to as the hazard rate or mortality rate. For
electronic components on the normal–life portion of their failure curve, the
failure rate is assumed to be a constant, $\lambda$, rather than a function of time.
The exponential probability distribution is the most common distribution
encountered in reliability models, since it describes accurately most life testing
aspects for electronic equipment [Kapu77]. The probability density function (pdf),
Cumulative Distribution Function (CDF), reliability function ($R(t)$), and hazard
(failure rate) function ($z(t)$) of the exponential distribution are expressed by the
following [Kend77]:
$$\mathrm{pdf} = f(t) = \lambda e^{-\lambda t} \tag{2.1}$$
$$\mathrm{CDF} = F(t) = 1 - e^{-\lambda t} \tag{2.2}$$
$$\mathrm{Reliability} = R(t) = e^{-\lambda t} \tag{2.3}$$
$$\mathrm{Hazard\ Function} = z(t) = \lambda \tag{2.4}$$
The failure rate parameter $\lambda$ describes the rate at which failures occur over time
[DoD82]. In the analysis that follows, the failure rate is assumed to be constant,
and measured as failures per million hours. Although a time dependent failure rate
could be used for un–aged electronic components, the aging of the electronic
components can remove the traditional bathtub curve failure distribution. The
constant failure rate assumption is also extended to the firmware controlling the
diagnostics of the system [Bish86], [Knig86], [Kell88], [Ehre78], [Eckh75],
[Gmei79], [RTCA85].
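The four expressions in Eqs. (2.1) through (2.4) can be evaluated directly. The sketch below is illustrative only: the function names and the numerical values (a failure rate of 100 failures per million hours and a 1000 hour mission) are assumptions, not figures taken from this work.

```python
import math


def exp_pdf(t: float, lam: float) -> float:
    """Eq. (2.1): f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)


def exp_cdf(t: float, lam: float) -> float:
    """Eq. (2.2): F(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)


def exp_reliability(t: float, lam: float) -> float:
    """Eq. (2.3): R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)


def exp_hazard(t: float, lam: float) -> float:
    """Eq. (2.4): z(t) = f(t) / R(t), which reduces to the constant lam."""
    return exp_pdf(t, lam) / exp_reliability(t, lam)


lam = 100e-6   # assumed: 100 failures per million hours, expressed per hour
t = 1000.0     # assumed mission time in hours

print("pdf        ", exp_pdf(t, lam))
print("CDF        ", exp_cdf(t, lam))
print("Reliability", exp_reliability(t, lam))
print("Hazard     ", exp_hazard(t, lam))
```

Computing the hazard as the ratio of the pdf to the reliability, rather than returning the failure rate directly, makes the constant hazard of the exponential distribution visible as a result rather than an input.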
Exponential and Poisson Relationships
In modeling the reliability functions associated with actual equipment, several
simplifying assumptions must be made to render the resulting mathematics
tractable. These assumptions do not reduce the applicability of the resulting
models to real–world phenomena. One simplifying assumption is that the
random variables associated with the failure process have exponential probability
distributions.
The property of the exponential distribution that makes it easy to analyze is that it
does not decay with time. If the lifetime of a component is exponentially
distributed, after some amount of time in use, the item is assumed to be as good as
new. Formally, this property states that the random variable $X$ is memoryless if the
expression $P\{X > s + t \mid X > t\} = P\{X > s\}$ is valid for all $s, t \ge 0$ [Cram66],
[Ross83]. If the random variable $X$ is the lifetime of some item, then the
probability that the item is functional at time $s + t$, given that it survived to time
t, is the same as the initial probability that it was functional at time s. If the item is
functional at time t, then the distribution of the remaining amount of time that it
survives is the same as the original lifetime distribution. The item does not
remember that it has already been in use for a time t.
This property is equivalent to the expression
$$\frac{P\{X > s + t,\; X > t\}}{P\{X > t\}} = P\{X > s\}$$
or
$$P\{X > s + t\} = P\{X > s\}\,P\{X > t\}.$$
Since the form of this expression is
satisfied when the random variable X is exponentially distributed (since
$e^{-\lambda (s + t)} = e^{-\lambda s} e^{-\lambda t}$), it follows that exponentially distributed random variables
are memoryless. The recognition of this property is vital to the understanding of the
models presented in this work. If the underlying failure process is not
memoryless, then the exponential distribution model is not valid.
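The memoryless property can be checked numerically. The sketch below compares the survival probability of a new item with that of an item that has already survived to time t. The Weibull lifetime is included only as a contrasting example of a distribution that is not memoryless; all names and parameter values are assumptions chosen for illustration.

```python
import math


def exp_survival(t: float, lam: float) -> float:
    """P{X > t} for an exponential lifetime with failure rate lam."""
    return math.exp(-lam * t)


def weibull_survival(t: float, scale: float, shape: float) -> float:
    """P{X > t} for a Weibull lifetime; shape != 1 gives a time dependent hazard."""
    return math.exp(-((t / scale) ** shape))


def conditional_survival(survival, s: float, t: float) -> float:
    """P{X > s + t | X > t} = P{X > s + t} / P{X > t}."""
    return survival(s + t) / survival(t)


lam = 2.0e-4           # assumed failure rate, failures per hour
s, t = 500.0, 1000.0   # assumed mission time s and prior age t, in hours

# Exponential lifetime: an aged item behaves like a new one.
exp_new = exp_survival(s, lam)
exp_aged = conditional_survival(lambda x: exp_survival(x, lam), s, t)

# Weibull lifetime with shape 2 (wear-out): an aged item is worse than a new one.
wb = lambda x: weibull_survival(x, scale=1000.0, shape=2.0)
wb_new = wb(s)
wb_aged = conditional_survival(wb, s, t)

print("exponential: new", exp_new, "aged", exp_aged)
print("Weibull    : new", wb_new, "aged", wb_aged)
```

For the exponential case the two probabilities agree to rounding error, while for the Weibull case the aged item has a visibly lower chance of surviving the same mission time.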
The exponential probability distributions and the related Poisson processes used
in the reliability models are formally based on the assumptions shown in Figure 2
[Cox62], [Thor26].
§ Failures occur completely randomly and are independent of any previous
failure. A single failure event does not provide any information regarding
the time of the next failure event.
§ The probability of a failure during any interval of time $[0, t]$ is proportional
to the length of the interval, with a constant of proportionality $\lambda$. The
longer one waits the more likely it is a failure will occur.
Figure 2 – Assumptions regarding the
behavior of a random process that generates
events following the Poisson probability
distribution function.
An expression describing the random processes in Figure 2 results from the
Poisson Theorem which states that the probability of an event A occurring k times
in n trials is approximately [Papo65], [Pois37],
$$\frac{n(n-1)\cdots(n-k+1)}{1 \cdot 2 \cdots k}\; p^{k} q^{n-k}, \tag{2.5}$$
where $p = P\{A\}$ is the probability of an event A occurring in a single trial and
$q = 1 - p$. This approximation is valid when $n \to \infty$, $p \to 0$ and the product
$n \cdot p$ remains finite. It should be noted that a large number of different trials of
independent systems is needed for this condition to hold, rather than a large
number of repeated trials on the same system.
The Poisson Theorem can be simplified to the following approximation for the
probability of an event occurring k times in n trials [Kend77],
$$\begin{aligned}
\frac{n!}{k!\,(n-k)!}\; p^{k} q^{n-k}
&= \frac{n!}{k!\,(n-k)!} \left(\frac{np}{n}\right)^{k} \left(1 - \frac{np}{n}\right)^{n-k}, \\
&\approx \frac{\sqrt{2\pi}\; n^{n+\frac{1}{2}}\, e^{-n}}{k!\,\sqrt{2\pi}\; (n-k)^{n-k+\frac{1}{2}}\, e^{-(n-k)}} \left(\frac{np}{n}\right)^{k} \left(1 - \frac{np}{n}\right)^{n-k}, \\
&= \frac{(np)^{k}}{k!}\; e^{-k}\; \frac{\left(1 - \frac{np}{n}\right)^{n-k}}{\left(1 - \frac{k}{n}\right)^{n-k+\frac{1}{2}}}, \\
&\approx \frac{(np)^{k}}{k!}\; e^{-np}.
\end{aligned} \tag{2.6}$$
The exponential and Poisson expressions are directly related. A detailed
understanding of this relationship will aid in the development of the analysis
that follows.
Using the Poisson assumptions described in Figure 2, the probability of n
failures prior to time t is,
$$P\{N = n \mid T \le t\} = P_t(n). \tag{2.7}$$
From Eq. (2.7), the probability that no failures ($n = 0$) occur between time t
and time $t + \Delta t$ is,
$$P_{t+\Delta t}(0) = P_t(0)\,[1 - \lambda \Delta t], \tag{2.8}$$
where the term $\lambda = np$ describing the total number of failures is of moderate
magnitude [Fell67]. The probability that n failures occur between time t and
time $t + \Delta t$ is then,
$$P_{t+\Delta t}(n) = P_t(n)\,[1 - \lambda \Delta t] + P_t(n-1)\,[\lambda \Delta t], \quad n > 0. \tag{2.9}$$
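The difference relations in Eqs. (2.8) and (2.9) can be iterated directly with a small time step. The sketch below does so and compares the result with the Poisson form of Eq. (2.6), with the mean $np$ replaced by $\lambda t$. The starting condition (no failures at time zero), the step size, the truncation at five failures, and all names are assumptions made for illustration.

```python
import math


def step_probabilities(lam: float, t_end: float, dt: float, n_max: int) -> list:
    """Iterate Eqs. (2.8) and (2.9) from t = 0 to t_end in steps of dt."""
    # Assumed starting condition: no failures have occurred at t = 0.
    p = [1.0] + [0.0] * n_max
    steps = int(round(t_end / dt))
    for _ in range(steps):
        # Eq. (2.8): no failure in the next interval of length dt.
        new_p = [p[0] * (1.0 - lam * dt)]
        for n in range(1, n_max + 1):
            # Eq. (2.9): stay at n failures, or move from n - 1 to n.
            new_p.append(p[n] * (1.0 - lam * dt) + p[n - 1] * lam * dt)
        p = new_p
    return p


def poisson_pmf(n: int, mean: float) -> float:
    """Poisson probability of n events with the given mean."""
    return mean ** n * math.exp(-mean) / math.factorial(n)


lam = 1.0e-3     # assumed failure rate, failures per hour
t_end = 2000.0   # assumed observation time, hours
p = step_probabilities(lam, t_end, dt=0.1, n_max=5)

for n, value in enumerate(p):
    print(n, round(value, 5), round(poisson_pmf(n, lam * t_end), 5))
```

Shrinking the step size brings the iterated values closer to the Poisson values, which is the numerical counterpart of letting the interval length tend to zero.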
Using Eq. (2.9) and Eq. (2.8) and allowing $\Delta t \to 0$, a differential equation can
be constructed describing the rate at which failures occur between time t and
time $t + \Delta t$,
$$\begin{aligned}
\frac{d}{dt} P_t(0) &= -\lambda P_t(0), \\
\frac{d}{dt} P_t(n) &= \lambda \left[ P_t(n-1) - P_t(n) \right], \quad \text{for } n > 0,
\end{aligned} \tag{2.10}$$
with the initial conditions of,
$$P_0(0) = 1, \qquad P_0(n) = 0 \quad \text{for } n > 0. \tag{2.11}$$
The unique solution to the differential equation in Eq. (2.10) is [Klie75],
$$P_t(n) = \frac{(\lambda t)^{n} e^{-\lambda t}}{n!}, \quad n = 0, 1, 2, \ldots \tag{2.12}$$
which is the Poisson distribution defined in Eq. (2.6). Using Eq. (2.12) to define
a function $F(t)$ representing the probability that no failures have occurred as
of time t gives,
$$F(t) = P_t\{n = 0\} = e^{-\lambda t}. \tag{2.13}$$
The expression in Eq. (2.13) is the probability of surviving to time t without a
failure; its complement, $1 - e^{-\lambda t}$, is the Cumulative Distribution Function,
CDF, of the time to the first failure of the Poisson failure process [Fell67]. By using
Eq. (2.19), the probability density function, pdf, of the time between failures can
be given as,
$$f(t) = \lambda e^{-\lambda t}, \tag{2.14}$$
which is the exponential probability distribution. [2] The following statement
describes the relationship between the Poisson and exponential expressions
[Cox65],
If the number of failures occurring over an interval of time is Poisson
distributed, then the time between failures is exponentially distributed.
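The relationship can be illustrated by simulation, approached from the opposite direction: times between failures are drawn from an exponential distribution, and the resulting number of failures in a fixed interval is checked against the Poisson property that the mean and variance both equal $\lambda t$. The seed, failure rate, interval length, and number of trials below are assumptions chosen for illustration.

```python
import random
import statistics


def count_failures(lam: float, t_end: float, rng: random.Random) -> int:
    """Count failures in [0, t_end] when times between failures are exponential."""
    elapsed = rng.expovariate(lam)   # time of the first failure
    count = 0
    while elapsed <= t_end:
        count += 1
        elapsed += rng.expovariate(lam)   # add the time to the next failure
    return count


rng = random.Random(1980)   # fixed seed so the run is repeatable
lam = 1.0e-3                # assumed failure rate, failures per hour
t_end = 5000.0              # assumed observation interval, hours

counts = [count_failures(lam, t_end, rng) for _ in range(20000)]
sample_mean = statistics.mean(counts)
sample_var = statistics.variance(counts)

print("expected mean and variance:", lam * t_end)
print("sample mean:", sample_mean, "sample variance:", sample_var)
```

A sample mean and variance that both sit close to $\lambda t$ are consistent with Poisson distributed counts, although a simulation of this kind illustrates the relationship rather than proving it.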
2 This development of the pdf is very informal. Making use of the forward reference to construct an
expression is circular logic and would not be permitted in more formal circumstances. For the purposes of
this work, this type of behavior can be tolerated, since the purpose of this development is to get to the
results rather than dwell on the analysis process. This is a fundamental difference between mathematics
and engineering.
An alternative method of relating the exponential and Poisson expressions is
useful at this point. The functions defined in Eq. (2.1) and Eq. (2.2) are based
on the interchangeability of the pdf and the CDF for any defined probability
distribution. The Cumulative Distribution Function $F(x)$ of a random variable
X is defined as a function obeying the following relationship [Papo65],
$$F(x) = P\{X \le x\}, \quad -\infty < x < \infty. \tag{2.15}$$
The probability density function $f(x)$ of a random variable X can be derived
from the CDF using the following [Dave70],
$$f(x) = \frac{d}{dx} F(x). \tag{2.16}$$
The CDF can be obtained from the pdf by the following,
$$F(x) = P\{X \le x\} = \int_{-\infty}^{x} f(t)\,dt, \quad -\infty < x < \infty. \tag{2.17}$$
Using Eq. (2.16) and Eq. (2.17), the CDF and pdf expressions for an exponential
distribution can be developed. If the time between failures is an
exponentially distributed random variable, the CDF is,
$$F(t) = \begin{cases} 1 - e^{-\lambda t}, & 0 \le t \le \infty, \\ 0, & \text{otherwise}. \end{cases} \tag{2.18}$$
The number of failures in the time interval $[0, t]$ is a Poisson distributed random
variable, and the time between failures has a probability density function of,
$$f(t) = \frac{d}{dt} F(t) = \begin{cases} \lambda e^{-\lambda t}, & t > 0, \\ 0, & \text{otherwise}, \end{cases} \tag{2.19}$$
where t is a random variable denoting the time between failures.
Reliability, Availability, and Failure Density Functions
An expression for the reliability of a system can be developed using the following
technique. The probability of a failure as a function of time is defined as,
$$P\{T \le t\} = F(t), \quad t \ge 0, \tag{2.20}$$
where T is a random variable denoting the failure time. $F(t)$ is a function
defining the probability that the system will fail by time t. $F(t)$ is also the
Cumulative Distribution Function (CDF) of the random variable T [Papo65]. The
probability that the system will perform as intended at a certain time t is
the Reliability function, defined as,
$$R(t) = 1 - F(t) = P\{T \ge t\}. \tag{2.21}$$
If the random variable describing the time to failure T has a probability density
function $f(t)$ then using Eq. (2.21) the Reliability function is,
$$R(t) = 1 - F(t) = 1 - \int_{-\infty}^{t} f(x)\,dx = \int_{t}^{\infty} f(x)\,dx. \tag{2.22}$$
Assuming the time to failure random variable T has an exponential distribution, its
failure density defined by Eq. (2.19) is,
$$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0,\; \lambda \ge 0. \tag{2.23}$$
The resulting reliability function is then,
$$R(t) = \int_{t}^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t}. \tag{2.24}$$
A function describing the rate at which a system fails as a function of time is
referred to as the Hazard function (Eq. (2.4)). Let T be a random variable
representing the service life remaining for a specified system. Let $F(x)$ be the
distribution function of T and let $f(x)$ be its probability density function. A
new function $z(x)$ termed the Hazard Function or the Conditional Failure Function
of T is given by
$$z(x) = \frac{f(x)}{1 - F(x)}.$$
The function $z(x)\,dx$ is the conditional
probability that the item will fail between x and $x + dx$ given it has survived a
time T greater than x.
For a given hazard function $z(x)$ the corresponding distribution function is
$$1 - F(x) = \left(1 - F(x_0)\right) \exp\left[ -\int_{x_0}^{x} z(y)\,dy \right],$$
where $x_0$ is an arbitrary value of x. In
a continuous time reliability model the hazard function is defined as the
instantaneous failure rate of the system [Kapu77],
(2.25)
The quantity represents the probability that a system of age t will fail in
the small interval of time . The hazard function is an important
indicator of the change in the failure rate over the life of the system. For a system
with an exponential failure rate, the hazard function is constant as shown in
Eq. (2.25) and it is the only distribution that exhibits this property [Barl85]. Other
reliability distributions will be shown in later chapters that have variable hazard
rates.
If a system contains no redundancy – this is, every component must function
properly for the system to continue operation – and if component failures are
statistically independent, the system reliability function is the product of the
component reliabilities and follows an exponential probability distribution. The
failure rate of such a system is the product of the failure rates of the individual
components,
(2.26)
In most cases it is possible to repair or replace failed components and accurate
models of system reliability will consider this. As will be shown the repair activity
is not as easily modeled as the failure mechanisms.
( )
( ) ( )
( )
( )
( )
( )
( )
0
lim ,
1
,
,
,
.
t
t
t
R t R t
z t
t R t
d
R t
R t dt
f t
R t
e
e
D ®
-l
-l
- + D
=
D ×
é ù
= -ê úë û
=
l
=
= l
( )z t dt
[ ]+,t t dt
( ) ( ) ( )
1 1
exp .i
n n
t
sys i i
i i
R t R t e t-l
= -
é ù= = = - lë ûåÕ Õ
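The limit in Eq. (2.25) and the series product in Eq. (2.26) can be checked numerically. The Python sketch below is illustrative only: the function names `hazard` and `series_reliability`, the step size `dt`, and the three failure rates are assumptions made for the example, not values from this work.

```python
import math

def hazard(R, t, dt=1e-6):
    """Finite-difference form of Eq. (2.25): [R(t) - R(t + dt)] / (dt * R(t))."""
    return (R(t) - R(t + dt)) / (dt * R(t))

def series_reliability(rates, t):
    """Product of exponential component reliabilities, as in Eq. (2.26)."""
    product = 1.0
    for lam in rates:
        product *= math.exp(-lam * t)
    return product

# Hypothetical failure rates (per hour) for three components in series.
rates = [1e-4, 2e-4, 5e-4]
lam_sys = sum(rates)  # system failure rate: the sum of the component rates

# Estimate the hazard of the series system at three different ages.
ages = (10.0, 1000.0, 5000.0)
z_values = [hazard(lambda t: series_reliability(rates, t), age) for age in ages]
```

Under these assumptions every entry of `z_values` should agree with `lam_sys` to within the finite-difference error, which is the constant-hazard property of the exponential distribution.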
For systems that can be repaired, a new measure of reliability can be defined,
The probability that the system is operational at time “t.”
This new measure is the Availability and is expressed as $A(t)$. Availability $A(t)$ differs from reliability $R(t)$ in that any number of system failures can occur prior to time t but the system is considered available if those failures have been repaired prior to time t.
For systems that can be repaired, it is assumed that the behavior of the repaired system and the original system are identical from a failure standpoint. In general, this is not true, as perfect renewal of the system configuration is not possible. The terms Mean Time to First Failure and Mean Time to Second Failure now become relevant.
Assuming a constant failure rate $\lambda$, a constant repair rate $\mu$, and identical failure behaviors between the repaired system and the original system, the steady–state system availability can be expressed as,
$$A_{SS} = \frac{\mu}{\lambda + \mu}. \tag{2.27}$$
The expression in Eq. (2.27) is an approximation. The exact expression for the availability with repair requires the solution of the appropriate Markov model, which will be developed in a later chapter.
Mean Time to Failure
The Mean Time to Failure (MTTF) is the expected time to the first failure in a population of identical systems, given a successful system startup at time $t = 0$. The Cumulative Distribution function $F(x)$ in Eq. (2.15) and the probability density function $f(x)$ in Eq. (2.16) characterize the behavior of the probability distribution function of the underlying random failure process. These expressions
are in a continuous integral form and require the solution of integral equations to produce a usable result. A concise parameter that describes the expected value of the random process is useful for comparison of different reliability models. This parameter is the Mean or Expected Value of the random variable denoted by $E[X]$ and is defined by [Parz60], [Dave70],
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx. \tag{2.28}$$
The expression in Eq. (2.28) denotes the expected value of the continuous function $f(x)$. It is important to note that this definition assumes $x f(x)$ is integrable in the interval $(-\infty, \infty)$.
For an exponential probability density function of,
$$f(x) = \lambda e^{-\lambda x}, \quad x > 0, \tag{2.29}$$
the mean or expected value of the exponential function is given by,
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \lambda \int_0^{\infty} x e^{-\lambda x}\,dx. \tag{2.30}$$
The evaluation of Eq. (2.30) can be done in a straightforward manner using the Gamma function [Arfk70], which is defined as,
$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx, \quad \alpha > 0, \tag{2.31}$$
or alternately,
$$\int_0^{\infty} x^{\alpha - 1} e^{-\lambda x}\,dx = \frac{\Gamma(\alpha)}{\lambda^{\alpha}}. \tag{2.32}$$
Rewriting the expression in Eq. (2.30) for the expected values as,
$$E[X] = \frac{1}{\lambda}\int_0^{\infty} u e^{-u}\,du, \tag{2.33}$$
where substituting the variables,
$$u = \lambda x \quad \text{and} \quad du = \lambda\,dx, \tag{2.34}$$
results in,
$$\begin{aligned} E[X] &= \frac{1}{\lambda}\int_0^{\infty} u e^{-u}\,du, \\ &= \frac{1}{\lambda}\Gamma(2), \\ &= \frac{1}{\lambda}, \end{aligned} \tag{2.35}$$
which is the MTTF for a simple system. Although this expression is useful for simple systems, a general–purpose expression representing the MTTF is needed. This function can be developed in the following manner.
Let X denote the lifetime of a system so that the reliability function is,
$$R(t) = P\{X > t\}, \tag{2.36}$$
and the derivative of the reliability function, which is also given in Eq. (2.21) and Eq. (2.22), is again defined as,
$$\frac{d}{dt}R(t) = -f(t). \tag{2.37}$$
The expression for the expected value or MTTF using Eq. (2.28) is given by:
$$E[X] = \int_0^{\infty} t f(t)\,dt = -\int_0^{\infty} t \left(\frac{d}{dt}R(t)\right) dt. \tag{2.38}$$
The technique of integration by parts [Smai49], [Arfk70], shown in Eq. (2.39),
$$\int_a^b f(x)\left(\frac{d}{dx}g(x)\right) dx = f(x)g(x)\Big|_a^b - \int_a^b g(x)\left(\frac{d}{dx}f(x)\right) dx, \tag{2.39}$$
can be used to evaluate Eq. (2.38). Integrating by parts gives the expected value as,
$$E[X] = -t R(t)\Big|_0^{\infty} + \int_0^{\infty} R(t)\,dt. \tag{2.40}$$
Since $R(t)$ approaches zero faster than t approaches infinity, Eq. (2.40) can be reduced to,
$$E[X] = \int_0^{\infty} R(t)\,dt = MTTF, \tag{2.41}$$
which is the expression for the Mean Time to Failure for a general system configuration. This direct relationship between MTTF and the system failure rate is one reason the constant failure rate assumption is often made when the supporting reliability data is scarce [Barl75]. Appendix G describes the analysis of the variance for this distribution.
Using an exponential failure distribution implies two important behaviors for the system,
§ Since a used subsystem is stochastically as good as a new subsystem, a policy of scheduled replacement of used subsystems that are known to still be functioning does not increase the lifetime of the system.
§ In estimating the mean system life and reliability, data can be collected consisting only of the number of hours of observed life and the number of observed failures; the ages of the subsystems under observation are of no concern.
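Eq. (2.41) lends itself to a direct numerical check. The following Python sketch integrates a reliability function with the trapezoidal rule; the helper name `mttf_from_reliability`, the truncation point `t_max`, the step count, and the failure rate are all assumptions chosen for illustration.

```python
import math

def mttf_from_reliability(R, t_max, steps=100_000):
    """Trapezoidal estimate of Eq. (2.41): the integral of R(t) from 0 to t_max."""
    h = t_max / steps
    total = 0.5 * (R(0.0) + R(t_max))
    for i in range(1, steps):
        total += R(i * h)
    return total * h

lam = 2e-3  # hypothetical constant failure rate, failures per hour

# Truncate the infinite upper limit where R(t) = exp(-40) is negligible.
mttf_exp = mttf_from_reliability(lambda t: math.exp(-lam * t), t_max=40.0 / lam)
```

For the exponential case the estimate should agree closely with 1/λ from Eq. (2.35). The same helper accepts any reliability function that decays fast enough for the truncation to be harmless.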
Mean Time to Repair
The Mean Time to Repair (MTTR) is the expected time for the repair of a failed system or subsystem. For exponential distributions this is $MTTF = \frac{1}{\lambda}$ and $MTTR = \frac{1}{\mu}$. The steady state availability $A_{SS}$ defined in Eq. (2.27) can be rewritten in terms of these parameters,
$$A_{SS} = \frac{MTTF}{MTTR + MTTF}. \tag{2.42}$$
Mean Time Between Failure
The Mean Time Between Failure (MTBF) is often mistakenly used in place of Mean Time to Failure (MTTF). The MTBF is the mean time between failures in a system with repair, and is derived from a combination of repair and failure processes. The simplest approximation for MTBF is:
$$MTBF = MTTF + MTTR. \tag{2.43}$$
In this work, it is assumed that $MTTR \ll MTTF$ so that MTTF is used in place of MTBF. The Mean Time to Failure is considered since in fault–tolerant systems failure occurs only when the redundancy features of the system fail to function properly. In the presence of perfect coverage and perfect repair the system should operate continuously. Therefore, failure of the system implies total loss of system capabilities.
Mean Time to First Failure
The Mean Time to Failure is defined as the expected time of the first failure in a population of identical systems. This development depends on the assumption that the failure rate is constant (Eq. (2.25)), that failure times are exponentially distributed (Eq. (2.14)), and that the repair rate $\mu$ is constant. In the general case, these assumptions may not
be valid and the Mean Time to Failure (MTTF) is not equivalent to the Mean Time to First Failure (MTFF).
By removing the exponential probability failure distribution restriction in Eq. (2.29) a generalized expression for the first failure time can be derived. Given a population of n subsystems each with a random variable $X_i,\ i = 1, 2, \ldots, n$ and a continuous pdf of $f(x)$, the failure time for the $n^{th}$ subsystem is given by summing all the failure times prior to the failure,
$$S_n = X_1 + X_2 + \cdots + X_n = \sum_{i=1}^{n} X_i. \tag{2.44}$$
If the random variables $\{X_1, X_2, \ldots, X_n\}$ are independent and identically distributed, all with pdf’s of $f(x)$, the random process described by these variables is referred to as an Ordinary Renewal Process [Cox62], [Ross70]. The details of the Renewal Process are shown in Appendix E.
Given the random process described by Eq. (2.44) the distribution function of $S_n$ is provided by convolving each individual distribution function $F(t)$. The convolution of two functions is defined as [Brac65], [Papo65]:
$$f(x) \otimes g(x) \equiv \int_{-\infty}^{\infty} f(u)\,g(x - u)\,du. \tag{2.45}$$
The resulting convolution function for the n+1 subsystem failure is given by:
$$F^{(n+1)}(t) = \int_0^t F^{(n)}(t - x)\,f(x)\,dx. \tag{2.46}$$
In renewal processes, the random variables are actually functions and can be substituted in the reliability computations when:
$$N(t) = n \iff S_n \le t \le S_{n+1}. \tag{2.47}$$
When the conditions in Eq. (2.47) are met, the probability of n renewals in a time interval is given by,
$$\begin{aligned} P\{N(t) = n\} &= P\{S_n \le t \le S_{n+1}\}, \\ &= P\{S_n \le t\} - P\{S_{n+1} \le t\}, \\ &= F^{(n)}(t) - F^{(n+1)}(t). \end{aligned} \tag{2.48}$$
The renewal function $H(t)$ can be defined as the average number of subsystem failures and repairs as a function of time, and is given as,
$$H(t) = E[N(t)]. \tag{2.49}$$
Using Eq. (2.48) in the evaluation of Eq. (2.49) and Eq. (2.30) as the definition of the expectation value, gives the following for the renewal function,
$$\begin{aligned} H(t) &= \sum_{n=0}^{\infty} n P\{N(t) = n\}, \\ &= \sum_{n=0}^{\infty} n F^{(n)}(t) - \sum_{n=0}^{\infty} n F^{(n+1)}(t), \\ &= \sum_{n=0}^{\infty} n F^{(n)}(t) - \sum_{n=1}^{\infty} (n - 1) F^{(n)}(t). \end{aligned} \tag{2.50}$$
Simplifying Eq. (2.50) results in an expression for the renewal function of,
$$H(t) = F(t) + \sum_{n=1}^{\infty} F^{(n+1)}(t). \tag{2.51}$$
The term $F^{(n+1)}$ is the convolution of $F^{(n)}$ and F which gives,
$$F^{(n+1)}(t) = \int_0^t F^{(n)}(t - x)\,f(x)\,dx, \tag{2.52}$$
which results in the expression for the renewal function of,
$$H(t) = F(t) + \sum_{n=1}^{\infty} \int_0^t F^{(n)}(t - x)\,f(x)\,dx. \tag{2.53}$$
Rearranging the integral term in Eq. (2.53) gives,
$$H(t) = F(t) + \int_0^t \left[\sum_{n=1}^{\infty} F^{(n)}(t - x)\right] f(x)\,dx. \tag{2.54}$$
The summation term in Eq. (2.54) is the renewal function for the $n^{th}$ failure, giving,
$$H(t) = F(t) + \int_0^t H(t - x)\,f(x)\,dx. \tag{2.55}$$
Using Eq. (2.16), the renewal density function $h(t)$ is the derivative of the distribution function, giving,
$$h(t) = \frac{d}{dt}H(t). \tag{2.56}$$
Using Eq. (2.50) to evaluate the derivative results in,
$$h(t) = \sum_{n=1}^{\infty} f^{(n)}(t), \tag{2.57}$$
and using Eq. (2.54) as a substitute for the right–hand side of Eq. (2.57) results in,
$$h(t) = f(t) + \int_0^t h(t - x)\,f(x)\,dx. \tag{2.58}$$
Eq. (2.58) is known as the Renewal Equation [Ross70]. To solve the renewal equation, the Laplace transform will be used. The transform of the probability density function is,
$$\mathcal{L}\{f(s)\} = \int_0^{\infty} e^{-sx} f(x)\,dx, \tag{2.59}$$
and the transform of the renewal density function is,
$$\mathcal{L}\{h(s)\} = \int_0^{\infty} e^{-sx} h(x)\,dx. \tag{2.60}$$
Using the convolution property of the Laplace transform [Brac65], an equation for the renewal distribution can be generated,
$$\mathcal{L}\{h(s)\} = \mathcal{L}\{f(s)\} + \mathcal{L}\{h(s)\}\,\mathcal{L}\{f(s)\}, \tag{2.61}$$
and simplified to,
$$\mathcal{L}\{h(s)\} = \frac{\mathcal{L}\{f(s)\}}{1 - \mathcal{L}\{f(s)\}}. \tag{2.62}$$
Eq. (2.62) is now the generalized expression for the failure distribution for a random process with an arbitrary probability distribution.
General Availability Analysis
The steady state system availability defined in Eq. (2.42) assumes an exponential distribution for the failure rate of the system or subsystems. An important activity in the analysis of Fault–Tolerant systems is the development of a general–purpose availability expression, independent of the underlying failure distribution.
In the analysis that follows, it will be assumed that when a subsystem fails it is repaired and the system restored to its functioning state. It will also be assumed that the restored system functions as if it were new, that is with the failure probability function restarted at $t = 0$.
Let $T_i$ be the duration of the ith functioning period and let $D_i$ be the system downtime because of the failure of the system while the ith repair takes place. These durations will form the basis of the renewal process.
By combining the subsystem failure interval and the subsystem repair duration, a random variable sequence is constructed such that,
$$X_i = T_i + D_i; \quad i = 1, 2, \ldots \tag{2.63}$$
It must be assumed that the durations of the functioning subsystems are identically distributed with a common Cumulative Distribution Function $W(t)$ and a common probability density function $w(t)$ and that the repair periods are also identically distributed with $G(t)$ and $g(t)$. Using these assumptions the terms in Eq. (2.63) are also identically distributed such that,
$$\{X_i,\ i = 1, 2, \ldots\} \tag{2.64}$$
meets the definition of a Renewal process developed in Eq. (2.44). Using this development an expression for the convolution of the two independent random processes is given by,
$$\mathcal{L}\{f(s)\} = \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}. \tag{2.65}$$
Using Eq. (2.62) gives,
$$\mathcal{L}\{h(s)\} = \frac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}. \tag{2.66}$$
The average number of repairs $M(t)$ in the time interval $(0, t]$ has the Laplace transform:
$$\mathcal{L}\{M(s)\} = \frac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{s\left[1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}\right]}. \tag{2.67}$$
Instantaneous Availability
The steady state availability defined in Eq. (2.42) can now be replaced with the instantaneous availability $A(t)$. In the absence of a repair mechanism the availability $A(t)$ is equivalent to the reliability, $A(t) = R(t)$, of the subsystem.
The subsystem may be functioning at time t because of two mutually exclusive reasons,
§ The subsystem has not failed from the beginning.
§ The last renewal occurred within the time period $(0, t]$ and the subsystem continued to function since that time.
The probability associated with the second case is the convolution of the reliability function and the renewal density, giving,
$$\int_0^t R(t - x)\,h(x)\,dx, \tag{2.68}$$
which results in an expression for the instantaneous availability of,
$$A(t) = R(t) + \int_0^t R(t - x)\,h(x)\,dx. \tag{2.69}$$
Taking the Laplace transform of both sides of Eq. (2.69) gives,
$$\begin{aligned} \mathcal{L}\{A(s)\} &= \mathcal{L}\{R(s)\} + \mathcal{L}\{R(s)\}\,\mathcal{L}\{h(s)\}, \\ &= \mathcal{L}\{R(s)\}\left[1 + \mathcal{L}\{h(s)\}\right], \\ &= \mathcal{L}\{R(s)\}\left[1 + \frac{\mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}{1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}}\right]. \end{aligned} \tag{2.70}$$
Since the reliability of the system is given as $R(t) = 1 - W(t)$,
$$\begin{aligned} \mathcal{L}\{R(s)\} &= \frac{1}{s} - \mathcal{L}\{W(s)\}, \\ &= \frac{1}{s} - \frac{\mathcal{L}\{w(s)\}}{s} = \frac{1 - \mathcal{L}\{w(s)\}}{s}. \end{aligned} \tag{2.71}$$
Substituting gives,
$$\mathcal{L}\{A(s)\} = \frac{1 - \mathcal{L}\{w(s)\}}{s\left[1 - \mathcal{L}\{w(s)\}\,\mathcal{L}\{g(s)\}\right]}. \tag{2.72}$$
Given the failure–rate distribution and the repair–time distribution, Eq. (2.72) can be used to compute the instantaneous availability as a function of time.
Limiting Availability
An important question to ask is – what is the availability of the system after some long period of time? The limiting availability $A(t)$ as $t \to \infty$ is defined as A or simply the Availability.
To derive an expression for the limiting availability the Final Value Theorem of the Laplace transform can be used [Doet61], [Widd46], [Brac65], [Ogat70], [Gupt66]. This theorem states that the steady state behavior of $f(t)$ is the same as the behavior of $sF(s)$ in the neighborhood of $s = 0$. Thus it is possible to obtain the value of $f(t)$ as $t \to \infty$.
Let,
$$F(t) = \int_0^t f(x)\,dx + F(0^-), \tag{2.73}$$
then using a table of Laplace transforms [Doet61], [Brac65],
$$s\,\mathcal{L}\{F(s)\} - F(0^-) = \mathcal{L}\{f(s)\} = \int_0^{\infty} e^{-st} f(t)\,dt, \tag{2.74}$$
and by letting $s \to 0$,
$$\begin{aligned} \lim_{s \to 0} s\,\mathcal{L}\{F(s)\} &= \int_0^{\infty} f(t)\,dt + F(0^-), \\ &= \lim_{t \to \infty}\left[\int_0^t f(x)\,dx + F(0^-)\right], \\ &= \lim_{t \to \infty} F(t). \end{aligned} \tag{2.75}$$
The Limiting availability is then given as,
$$A = \lim_{t \to \infty} A(t) = \lim_{s \to 0} s\,\mathcal{L}\{A(s)\}. \tag{2.76}$$
For small values of s the following approximations can be made [Apos74],
$$e^{-st} \cong 1 - st, \tag{2.77}$$
giving,
$$\begin{aligned} \mathcal{L}\{w(s)\} &= \int_0^{\infty} e^{-st} w(t)\,dt, \\ &\cong \int_0^{\infty} w(t)\,dt - s\int_0^{\infty} t\,w(t)\,dt, \\ &\cong 1 - \frac{s}{\lambda}, \end{aligned} \tag{2.78}$$
where $MTTF = \frac{1}{\lambda}$ and,
$$\mathcal{L}\{g(s)\} = 1 - \frac{s}{\mu}, \tag{2.79}$$
and where $MTTR = \frac{1}{\mu}$ giving the limiting availability as,
$$A = \lim_{s \to 0}\left[\frac{1 - \left(1 - \frac{s}{\lambda}\right)}{1 - \left(1 - \frac{s}{\lambda}\right)\left(1 - \frac{s}{\mu}\right)}\right] = \frac{\frac{1}{\lambda}}{\frac{1}{\lambda} + \frac{1}{\mu}} = \frac{MTTF}{MTTF + MTTR}. \tag{2.80}$$
Eq. (2.80) is an important result in the analysis of system reliability, because it shows that the limiting availability depends only on the Mean Time to Failure and the Mean Time to Repair and not on the underlying distributions of the failure and repair times.
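The distribution independence stated for Eq. (2.80) can be illustrated with a small simulation. The Python sketch below is a rough check, not part of the original analysis: the function names, the MTTF and MTTR values, the Weibull shape parameter, the cycle count, and the random seed are all assumptions chosen for the example. It estimates the long-run fraction of up time over many failure/repair cycles for two different pairs of distributions that share the same means.

```python
import math
import random

def limiting_availability(mttf, mttr):
    """Eq. (2.80): A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def simulated_availability(draw_up, draw_down, cycles, seed=1):
    """Fraction of time spent operating over many failure/repair cycles."""
    rng = random.Random(seed)
    up = down = 0.0
    for _ in range(cycles):
        up += draw_up(rng)      # duration of one functioning period
        down += draw_down(rng)  # duration of the repair that follows
    return up / (up + down)

mttf, mttr = 1000.0, 10.0  # hypothetical values, in hours

# Case 1: exponential up times and exponential repair times.
a_exp = simulated_availability(
    lambda rng: rng.expovariate(1.0 / mttf),
    lambda rng: rng.expovariate(1.0 / mttr),
    cycles=200_000,
)

# Case 2: Weibull up times (shape 2) with the same mean, and a fixed repair time.
shape = 2.0
scale = mttf / math.gamma(1.0 + 1.0 / shape)
a_weibull = simulated_availability(
    lambda rng: rng.weibullvariate(scale, shape),
    lambda rng: mttr,
    cycles=200_000,
)

a_formula = limiting_availability(mttf, mttr)
```

With these assumed inputs both estimates should land close to `a_formula`, even though the second case uses neither exponential failures nor exponential repairs.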
Chapter 3
SYSTEM RELIABILITY
This chapter provides the basis for the computation of the overall system reliability given a redundant architecture with partial fault detection coverage. Redundant systems can be modeled under a variety of operational assumptions. Of most interest in this work are dual and triple redundant systems that contain repair facilities.
Series Systems
Creating a reliable system often involves a series or parallel combination of independent systems or subsystems. If $R_i(t)$ is the reliability of module i and all the modules are statistically independent, then the overall system reliability of modules connected in series is,
$$R_{series}(t) = \prod_{i=1}^{n} R_i(t). \tag{3.1}$$
For a series system the failure probability $F_{series}$ is given by,
$$\begin{aligned} F_{series}(t) &= 1 - R_{series}(t) = 1 - \prod_{i=1}^{n} R_i(t), \\ &= 1 - \prod_{i=1}^{n}\left(1 - F_i(t)\right). \end{aligned} \tag{3.2}$$
Expanding Eq. (3.1) will illustrate an aspect of the exponential distribution. For a system of n subsystems connected in series the reliability of the system is given by Eq. (3.1). If a general purpose hazard function is used for the failure rate [Shoo68] defined by,
$$h_i(t) = \lambda_i + c_i t^k, \tag{3.3}$$
where $\lambda_i$, $c_i$, and k are constants, then the reliability function for the individual subsystem is given by,
$$R_i(t) = \exp\left[-\left(\lambda_i t + c_i \frac{t^{k+1}}{k+1}\right)\right], \tag{3.4}$$
and the reliability function for the system is given by,
$$R_{series}(t) = \exp\left[-\left(\sum_{i=1}^{n} \lambda_i t + \sum_{i=1}^{n} c_i \frac{t^{k+1}}{k+1}\right)\right]. \tag{3.5}$$
Defining new terms for the summed failure rate, the summed coefficients, and the normalized time, $\lambda^* = \sum_{i=1}^{n} \lambda_i$, $c^* = \sum_{i=1}^{n} c_i$, and $T = \lambda^* t$, results in the series reliability expression of,
$$R_{series}(t) = \exp\left[-\left(T + \left(\frac{c^*}{(k+1)\lambda^*}\right)\left(\frac{T^{k+1}}{(\lambda^*)^{k}}\right)\right)\right]. \tag{3.6}$$
As the number of subsystems grows large $(\lambda^* \to \infty)$, the term $\frac{c^*}{(k+1)\lambda^*}$ is bounded and the expression for the system reliability becomes,
$$\lim_{n \to \infty} R_{series}(t) = e^{-T} = e^{-\lambda^* t}. \tag{3.7}$$
Eq. (3.7) defines the failure distribution of the system as the number of subsystems grows without bound. This implies that a large complex system will tend to follow exponential distribution failure models regardless of the internal organization of the subsystems.
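The limiting behavior in Eq. (3.7) can be seen numerically by holding the normalized time T fixed while the number of subsystems grows. The Python sketch below assumes n identical subsystems with the hazard of Eq. (3.3); the function names and the values of `lam`, `c`, `k`, and `T` are hypothetical choices for the example.

```python
import math

def series_reliability(t, lams, cs, k):
    """Eq. (3.5): series system whose subsystems have hazard lam_i + c_i * t**k."""
    exponent = sum(lam * t + c * t ** (k + 1) / (k + 1) for lam, c in zip(lams, cs))
    return math.exp(-exponent)

def gap_to_exponential(n, lam=1e-3, c=1e-6, k=1, T=1.0):
    """|R_series - exp(-T)| for n identical subsystems at fixed T = lam_star * t."""
    lam_star = n * lam
    t = T / lam_star
    return abs(series_reliability(t, [lam] * n, [c] * n, k) - math.exp(-T))

# The gap between the series system and a pure exponential shrinks as n grows.
gaps = [gap_to_exponential(n) for n in (1, 10, 100, 1000)]
```

Under these assumptions the entries of `gaps` decrease steadily, which is the sense in which a system of many subsystems approaches the exponential model.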
Parallel Systems
In a parallel redundant configuration, the system fails only if all modules fail. The probability of a system failure in a parallel system is given by,
$$F_{parallel}(t) = \prod_{i=1}^{n} F_i(t). \tag{3.8}$$
The system reliability for a parallel system is given by,
$$\begin{aligned} R_{parallel}(t) &= 1 - F_{parallel}(t) = 1 - \prod_{i=1}^{n} F_i(t), \\ &= 1 - \prod_{i=1}^{n}\left(1 - R_i(t)\right). \end{aligned} \tag{3.9}$$
M–of–N Systems
An M–of–N system is a generalized form of the parallel system. Instead of requiring only one of the N modules of the system to remain functional, M modules are required. The system of interest in this work is a Triple Modular Redundant (TMR) configuration in which two of the three modules must function for the system to operate properly [Lyons 62], [Kuehn 69]. [3]
[3] In practical TMR systems, a simplex mode is allowed, which usually places the system in a shutdown mode, allowing the controlled process to be safely stopped.
For a given module reliability of $R_m$ the TMR reliability is given by,
$$R_{tmr} = R_m^3 + \binom{3}{2} R_m^2 \left(1 - R_m\right). \tag{3.10}$$
In Eq. (3.10) all working states are enumerated. The $R_m^3$ term represents that state in which all three modules are functional. The $\binom{3}{2} R_m^2 (1 - R_m)$ term
represents the three states in which any one module has failed and the other two modules are functional.
Selecting the Proper Evaluation Parameters
In comparing different redundant system configurations, it is desirable to summarize their reliability by a single parameter. The reliability may be an arbitrarily complex function of time. The selection of the wrong summary parameter could lead to incorrect conclusions, as will be shown below.
Consider a simplex system, with a reliability function of,
$$R_{simplex}(t) = e^{-\lambda t}, \tag{3.11}$$
and using Eq. (2.41) to derive the Mean Time to Failure results in,
$$MTTF_{simplex} = \frac{1}{\lambda}. \tag{3.12}$$
For a TMR system with an exponential reliability function,
$$\begin{aligned} R_{tmr}(t) &= \left(e^{-\lambda t}\right)^3 + \binom{3}{2}\left(e^{-\lambda t}\right)^2\left(1 - e^{-\lambda t}\right), \\ &= 3e^{-2\lambda t} - 2e^{-3\lambda t}, \end{aligned} \tag{3.13}$$
and using Eq. (2.41) results in a Mean Time to Failure of,
$$MTTF_{tmr} = \frac{3}{2\lambda} - \frac{2}{3\lambda}. \tag{3.14}$$
Comparing the simplex and TMR reliability expressions gives,
$$MTTF_{tmr} = \frac{5}{6\lambda} \le MTTF_{simplex} = \frac{1}{\lambda}. \tag{3.15}$$
By using the MTTF figure of merit, the TMR system can be shown to be less reliable than the Simplex system. The above equations do not include the facility for module repair. Once the TMR system has exhausted its redundancy, there is more hardware to fail than the remaining modules of the non–redundant system. This effect lowers the total system reliability. With online repair, the MTTF figure of merit for the TMR system becomes an important measure of the overall system reliability.
These results illustrate why simplistic assumptions and calculations may result in erroneous information.
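The comparison in Eqs. (3.11) through (3.15) can be made concrete with a short calculation. The Python sketch below uses a hypothetical module failure rate and two hypothetical mission lengths; the function names are likewise assumptions for the example. It evaluates both reliability functions, both MTTF values, and the time at which the two reliability curves cross.

```python
import math

def r_simplex(t, lam):
    """Eq. (3.11): reliability of a single module."""
    return math.exp(-lam * t)

def r_tmr(t, lam):
    """Eq. (3.13): 3 e^{-2 lam t} - 2 e^{-3 lam t}."""
    return 3.0 * math.exp(-2.0 * lam * t) - 2.0 * math.exp(-3.0 * lam * t)

lam = 1e-3  # hypothetical module failure rate, per hour
mttf_simplex = 1.0 / lam                          # Eq. (3.12)
mttf_tmr = 3.0 / (2.0 * lam) - 2.0 / (3.0 * lam)  # Eq. (3.14), equal to 5 / (6 lam)

# Setting r_tmr = r_simplex with y = e^{-lam t} gives 2y^2 - 3y + 1 = 0,
# so the curves cross at y = 1/2, that is at t = ln(2) / lam.
t_cross = math.log(2.0) / lam

short_mission = 0.1 * mttf_simplex  # well before the crossover
long_mission = 2.0 * mttf_simplex   # well after the crossover
```

For missions shorter than `t_cross` the TMR reliability exceeds the simplex reliability, and for longer missions the order reverses, which is why the MTTF alone gives a misleading ranking of the two configurations without repair.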
Chapter 4
IMPERFECT FAULT COVERAGE AND RELIABILITY
Reliability models of systems with dynamic redundancy usually depend on perfect fault detection [Arno73], [Stif80]. The ability of the system to detect faults that occur can be classified as [Geis84],
§ Covered – faults that are detected. The probability that a fault belongs to this class is given by c.
§ Uncovered – faults that are not detected. The probability that a fault belongs to this class is given by $(1 - c)$.
The underlying diagnostic firmware and hardware may not provide perfect coverage for many reasons, primarily due to the complexity of the system under diagnosis [Rous79], [Cona72], [Wood79], [Soma86]. Because of this built–in complexity, an exhaustively tested set of diagnostics may not be possible.
Another factor affecting the diagnostic coverage is the presence of intermittent faults [Dahb82], [Mall78]. The detection and analysis of these intermittent or permanent faults is further complicated by the presence of transient faults which behave as real faults but are only present in the system for a short time [Glas82], [Sosn86]. Modeling a fault–tolerant system in the presence of imperfect fault coverage becomes an important aspect in predicting the overall system reliability.
Redundant System with Imperfect Coverage
Before developing the Markov method of analyzing Fault–Tolerant systems, a conditional probability method will be used to derive the MTTF and MTBF for a redundant system with imperfect fault detection [Bour69]. Assume that each subsystem of the redundant system fails independently with a constant failure rate $\lambda$. Let X denote the lifetime of a system with two modules, one active and the other in standby mode. Assume that the module in the standby
mode does not experience a fault during the mission time interval. [4] Let Y be a random variable where, Y = 0 if a fault is not covered, and Y = 1 if a fault is covered, then, $P\{Y = 0\} = (1 - c)$ and $P\{Y = 1\} = c$.
[4] This is an invalid assumption in a practical sense, but it greatly simplifies this example.
To compute the MTTF of this system, the conditional expectation value of the system lifetime X given the fault coverage state Y must be derived.
If an uncovered fault occurs the MTTF of the system is the MTTF of the initially active module,
$$E[X \mid Y = 0] = \frac{1}{\lambda}. \tag{4.1}$$
If a covered fault occurs the MTTF of the system is the sum of the MTTF of the active module and the MTTF of the inactive module,
$$E[X \mid Y = 1] = \frac{2}{\lambda}. \tag{4.2}$$
The total expectation value of the system lifetime is then given by,
$$E[X] = \frac{(1 - c)}{\lambda} + \frac{2c}{\lambda} = \frac{(1 + c)}{\lambda} = MTTF. \tag{4.3}$$
The computation of the system reliability depends on the combination of the two independent exponential distribution functions when a covered fault occurs,
$$f(x = t \mid y = 1) = \lambda^2 t e^{-\lambda t}, \tag{4.4}$$
and when an uncovered fault occurs
$$f(x = t \mid y = 0) = \lambda e^{-\lambda t}. \tag{4.5}$$
The joint exponential distribution function for both conditions is given by,
$$\begin{aligned} f(t, y) &= f(X = t \mid y) \cdot P\{y\}, \\ f(t, y) &= \lambda (1 - c) e^{-\lambda t}; \quad t > 0,\ y = 0, \\ f(t, y) &= \lambda^2 c\,t e^{-\lambda t}; \quad t > 0,\ y = 1. \end{aligned} \tag{4.6}$$
and the marginal density function of X is computed by summing over the joint density function,
$$f(t) = \lambda^2 c\,t e^{-\lambda t} + \lambda (1 - c) e^{-\lambda t}. \tag{4.7}$$
The system reliability as a function of the coverage is then given by integrating the marginal density function in Eq. (4.7) to give,
$$\begin{aligned} R(t) &= 1 - \int_0^t f(x)\,dx, \\ &= 1 - \int_0^t \left[\lambda^2 c\,x e^{-\lambda x} + \lambda (1 - c) e^{-\lambda x}\right] dx, \\ &= \int_t^{\infty} \left[\lambda^2 c\,x e^{-\lambda x} + \lambda (1 - c) e^{-\lambda x}\right] dx, \\ &= \left(1 + c\lambda t\right) e^{-\lambda t}. \end{aligned} \tag{4.8}$$
Generalized Imperfect Coverage
In the previous example, the system consisted of two modules, one in the active state and one in the standby state. The conditional probability that a fault will go undetected (uncovered) was computed using the conditional probability that the system will survive for a specified period. Cox [Cox55] analyzed the general case of a stage–type conditional probability distribution. The principle on which the method of stages is based is the memoryless property of the exponential distribution of Eq. (2.1) [Klie75]. The lack of memory is defined by the fact that the distribution of the time remaining for an exponentially distributed random variable is independent of the current age of the random variable, that is, the variable is memoryless. Appendix D develops further the memoryless property of random variables with exponential distributions.
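The two-module standby example of Eqs. (4.3) and (4.8) can be cross-checked by simulation. The Python sketch below is an illustration under stated assumptions: the function names, the failure rate, the coverage factor, the sample size, the check time, and the random seed are all hypothetical. Each sample draws the life of the active module and, with probability c, adds the life of the standby module.

```python
import math
import random

def standby_lifetime(lam, c, rng):
    """One sample of the two-module standby system lifetime X."""
    life = rng.expovariate(lam)       # the active module fails
    if rng.random() < c:              # fault is covered: switch to the standby
        life += rng.expovariate(lam)  # standby module runs until it fails
    return life

def reliability(t, lam, c):
    """Eq. (4.8): R(t) = (1 + c * lam * t) * exp(-lam * t)."""
    return (1.0 + c * lam * t) * math.exp(-lam * t)

lam, c = 1e-3, 0.9  # hypothetical failure rate (per hour) and coverage factor
rng = random.Random(42)
samples = [standby_lifetime(lam, c, rng) for _ in range(200_000)]

mttf_sim = sum(samples) / len(samples)
mttf_formula = (1.0 + c) / lam        # Eq. (4.3)

t_check = 1500.0                      # hypothetical mission time, hours
r_sim = sum(1 for x in samples if x > t_check) / len(samples)
r_formula = reliability(t_check, lam, c)
```

With these assumed values the simulated mean and the simulated survival fraction should both sit close to the closed-form results, and lowering `c` toward zero moves both toward the simplex values.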
In the generalized model, it is assumed that individual modules are always in one of two states – working or failed. It is also assumed that the modules are statistically independent and module repair can take place while the remainder of the system continues to function.
In the general case of N active and S standby modules, the lifetime of the system is defined by a stage–type distribution. An active module has an exponential failure distribution with a constant failure rate $\lambda$. Assume that the modules in the standby state can fail at a rate $\mu$ (presuming $0 \le \mu \le \lambda$). Let $X_i$ $(1 \le i \le N)$ be a random variable denoting the lifetime of the active modules and let $Y_j$ $(1 \le j \le S)$ be a random variable denoting the lifetime of the standby modules. The system lifetime L is then,
$$\begin{aligned} L(m \mid N, S) &= \min\{X_1, X_2, \ldots, X_N; Y_1, Y_2, \ldots, Y_S\} + L(m \mid N, S - 1), \\ &= W(N, S) + L(m \mid N, S - 1). \end{aligned} \tag{4.9}$$
where $W(N, S)$ is the time to first failure among the $N + S$ modules. After the removal of the failed module, the system has N active modules and $S - 1$ standby modules. As a result $N + S - 1$ modules have not aged by the memoryless exponential assumption and therefore the system lifetime is,
$$L(m \mid N, S) = L(m \mid N, 0) + \sum_{i=1}^{S} W(N, i). \tag{4.10}$$
Here $L(m \mid N, 0)$, the case $S = 0$, is the lifetime of the m–out–of–N system and is therefore a $k^{th}$ order statistic with $k = N - m + 1$ [Kend77]. The distribution of $L(m \mid N, 0)$ is an $(N - m + 1)$–phase Hypoexponential distribution with parameters $N\lambda, (N - 1)\lambda, \ldots, m\lambda$. The distribution for the time to first failure $W(N, i)$ has an exponential distribution with the parameter $N\lambda + i\mu$.
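Because the lifetime in Eq. (4.10) is a sum of independent exponential stages, its mean is the sum of the stage means. The Python sketch below lists the stage rates and adds up the means. It assumes every fault is covered, so that each stage is always followed by the next; the function names and the rate values are hypothetical.

```python
def stage_rates(m, N, S, lam, mu):
    """Exit rates of the successive exponential stages of L(m | N, S)."""
    standby_stages = [N * lam + i * mu for i in range(S, 0, -1)]  # W(N, S), ..., W(N, 1)
    active_stages = [j * lam for j in range(N, m - 1, -1)]        # N*lam, ..., m*lam
    return standby_stages + active_stages

def mean_lifetime(m, N, S, lam, mu):
    """Mean of a hypoexponential lifetime: the sum of the stage means."""
    return sum(1.0 / rate for rate in stage_rates(m, N, S, lam, mu))

lam, mu = 1e-3, 5e-4  # hypothetical active and standby failure rates, per hour

# A 2-out-of-3 (TMR) system with no standby modules.
tmr_rates = stage_rates(2, 3, 0, lam, mu)
tmr_mttf = mean_lifetime(2, 3, 0, lam, mu)

# The same system with two standby modules added.
hybrid_mttf = mean_lifetime(2, 3, 2, lam, mu)
```

For the TMR case the result reproduces the 5/(6λ) value of Eq. (3.15), which gives a simple consistency check between the stage model and the earlier closed form.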
Using Theorem D.1 in Appendix D, the distribution $L(m \mid N, S)$ has a $(N + S - m + 1)$–stage Hypoexponential distribution [Koba78], [Cox55], [Ash70] with parameters $N\lambda + S\mu, N\lambda + (S - 1)\mu, \ldots, N\lambda + \mu, N\lambda, (N - 1)\lambda, \ldots, m\lambda$.
Let $R_{[m \mid N, S]}(t)$ denote the reliability of such a system, then the reliability function is defined as,
$$R_{[m \mid N, S]}(t) = \sum_{i=1}^{S} a_i e^{-(N\lambda + i\mu)t} + \sum_{i=m}^{N} b_i e^{-i\lambda t}, \tag{4.11}$$
where,
$$a_i = \prod_{\substack{j=1 \\ j \ne i}}^{S} \frac{N\lambda + j\mu}{j\mu - i\mu} \prod_{j=m}^{N} \frac{j\lambda}{j\lambda - N\lambda - i\mu}, \tag{4.12}$$
and,
$$b_i = \prod_{j=1}^{S} \frac{N\lambda + j\mu}{(N - i)\lambda + j\mu} \prod_{\substack{j=m \\ j \ne i}}^{N} \frac{j\lambda}{j\lambda - i\lambda}. \tag{4.13}$$
Defining the constant $K = \lambda / \mu$ gives a new expression for the active and standby terms in the reliability equation Eq. (4.11) of,
$$\begin{aligned} a_i &= \prod_{\substack{j=1 \\ j \ne i}}^{S} \frac{NK + j}{j - i} \prod_{j=m}^{N} \frac{jK}{jK - NK - i}, \\ &= \frac{(NK + S)!}{(NK)!\,(NK + i)} \cdot \frac{(-1)^{i-1}}{(i - 1)!\,(S - i)!} \cdot \frac{(-1)^{N-m+1}\,N!}{(m - 1)!} \cdot \frac{\left(\frac{i}{K} - 1\right)!}{\left(\frac{i}{K} + N - m\right)!}, \\ &= (-1)^{N-m+i}\,\frac{K m \binom{NK + S}{S}\binom{S}{i}\binom{N}{m}}{(NK + i)\binom{\frac{i}{K} + N - m}{N - m}}. \end{aligned} \tag{4.14}$$
A similar expression can be developed for,
$$\begin{aligned} b_i &= \prod_{j=1}^{S} \frac{NK + j}{(N - i)K + j} \prod_{\substack{j=m \\ j \ne i}}^{N} \frac{j}{j - i}, \\ &= \frac{(NK + S)!\,\left[(N - i)K\right]!}{(NK)!\,\left[(N - i)K + S\right]!} \cdot \frac{(-1)^{i-m}\,N!}{i\,(m - 1)!\,(i - m)!\,(N - i)!}, \\ &= (-1)^{i-m}\,\frac{m}{i}\,\frac{\binom{NK + S}{S}\binom{N}{i}\binom{i}{m}}{\binom{(N - i)K + S}{S}}. \end{aligned} \tag{4.15}$$
An expectation value of the reliability function derived from a general stage–type distribution can be found using the Laplace transform [Cox 55]. The Laplace transform of a stage–type random variable X is,
$$\mathcal{L}_X(s) = \gamma_1 + \sum_{i=1}^{r} \beta_1 \beta_2 \cdots \beta_i\,\gamma_{i+1} \prod_{j=1}^{i} \frac{\mu_j}{s + \mu_j}, \tag{4.16}$$
where $\gamma_i + \beta_i = 1$ for $1 \le i \le r$ and $\gamma_{r+1} = 1$. Defining the Laplace transform of the system described in Eq. (4.9) gives,
$$\begin{aligned} \mathcal{L}_X(s) &= \sum_{i=1}^{S} c^{i-1}(1 - c) \prod_{j=1}^{i} \frac{N\lambda + (S - j + 1)\mu}{s + N\lambda + (S - j + 1)\mu} \\ &\quad + c^{S} \prod_{j=1}^{S} \frac{N\lambda + j\mu}{s + N\lambda + j\mu} \prod_{j=1}^{N-m+1} \frac{(N - j + 1)\lambda}{s + (N - j + 1)\lambda}. \end{aligned} \tag{4.17}$$
Differentiating the transform in Eq. (4.17) with respect to s and evaluating at $s = 0$ gives an expression for the MTTF with imperfect coverage of,
$$E[X] = \sum_{i=1}^{S} c^{i-1}(1 - c) \sum_{j=S-i+1}^{S} \frac{1}{N\lambda + j\mu} + c^{S}\left\{\sum_{j=1}^{S} \frac{1}{N\lambda + j\mu} + \sum_{j=m}^{N} \frac{1}{j\lambda}\right\}. \tag{4.18}$$
The above development is described in more detail in [Ing76], [Chan72], [King69], [Saat65], [Math70], [Triv82]. In the example described above, the system does not provide for repair. When repairable systems are analyzed in this manner, the number of stages becomes infinite. To deal with the infinite number of conditional probabilities a different technique must be employed. The Markov Chain is just such a technique, capable of dealing with a system configuration of many modules, each with repairability.
An additional caution should be noted. The assumption of statistical independence is questionable in the case of stage–type failure distributions. In addition, the fixed probability distribution associated with each failure in the stage–type should be removed in the detailed analysis [Rams76].
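The MTTF with imperfect coverage can be evaluated directly once a reading of Eq. (4.18) is fixed. The Python sketch below assumes the following reading: an uncovered fault at the i-th standby-stage failure ends the mission after i stages, and if all S faults are covered the system then runs through the m–out–of–N stages. The function name and the parameter values in the checks are hypothetical.

```python
def mttf_with_coverage(m, N, S, lam, mu, c):
    """MTTF of an m-out-of-N system with S standby modules and coverage factor c."""
    def stage_mean(j):
        # Mean time in the stage with N active and j standby modules.
        return 1.0 / (N * lam + j * mu)

    total = 0.0
    # Uncovered fault at the i-th failure: the mission ends after i stages.
    for i in range(1, S + 1):
        stages = sum(stage_mean(j) for j in range(S - i + 1, S + 1))
        total += c ** (i - 1) * (1.0 - c) * stages
    # All S faults covered: every standby stage, then the m-out-of-N stages.
    all_standby = sum(stage_mean(j) for j in range(1, S + 1))
    active = sum(1.0 / (j * lam) for j in range(m, N + 1))
    total += c ** S * (all_standby + active)
    return total

lam = 1e-3  # hypothetical failure rate, per hour

# One active module, one standby that cannot fail while idle (mu = 0).
two_module = mttf_with_coverage(1, 1, 1, lam, 0.0, c=0.9)

# TMR with no standby modules.
tmr = mttf_with_coverage(2, 3, 0, lam, 0.0, c=0.9)
```

Under this reading the two-module case reduces to the (1 + c)/λ of Eq. (4.3) and the TMR case to the 5/(6λ) of Eq. (3.15), which is the main reason for preferring it.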
52. 49/196
C h a p t e r 5
MARKOV MODELS OF FAULT–TOLERANT SYSTEMS
A generalized modeling technique is required to deal with an arbitrary number of
modules, failure events, and repair events in the analysis of Fault–Tolerant
systems [Boss82]. Several techniques are available, including Petri Nets [Duga84],
[Duga85], Fault Tree Analysis [Fuss76], Failure Mode and Effects Analysis
[Mil1629], [Jame74], Event Tree Analysis [Gree82], and Hazard and Operability
Studies [Lee80], [Robi78], [Smit85]. When system components are not
independent, a state based analysis technique is needed which includes
redundancy and repair [Biro86], [Guid86].
A Continuous Parameter Markov Chain is a method used to analyze systems that have
state transitions that include repair processes [Hoel72], [Kend50], [Kend53]. A
Markov Process is a stochastic process whose dynamic behavior is such that the
probability distributions for its future behavior depend only on the present state
and not how the process arrived in that state [Mark07], [Fell67], [Issa76],
[Chun76], [Kulk84].
To illustrate the principles of a Markov process, consider a system S described in
Figure 3, which is changing over time in such a way that its state at any instant in
time can be described in terms of a finite dimensional vector $X(t)$ [Triv74],
[Triv75a], [Triv75]. Assume that the state of the system at any time $t$, for $t > v$,
can be described by a predetermined function of the starting state $X(v)$ and the
ending time t:
$$X(t) = G\left[X(v), t\right]. \tag{5.1}$$
Given a set of reasonable starting conditions and the continuity of the function G
a differential equation for describing the rate at which transitions between
each state of the system takes place can be derived by expanding both sides of
Eq. (5.1) in powers of t to give,
$$\frac{dx}{dt} = \mathbf{H}\left[X(t)\right]. \tag{5.2}$$
Finite–dimensional deterministic systems described by the set of state vectors are
equivalent to systems described by sets of ordinary differential equations [Bell60],
[Brau67], [Beiz78], [Brue80]. This property will serve as the basis for analysis of
fault–tolerant systems that include repair.
It will be assumed that the system described by the set of differential equations in
Eq. (5.2) can exist in only one of a finite number of states [Keme60], [Koba78].
The transition from state i to state j in this system takes place with some random
probability defined by,
$$p_{ij}(v,t) = P\{X(t) = j \mid X(v) = i\}, \quad t \ge v; \; i, j \in S. \tag{5.3}$$
Eq. (5.3) is the conditional pdf of the system of state transitions and satisfies the
relation,
$$\sum_{j \in S} p_{ij}(v,t) = 1, \quad 0 \le v \le t. \tag{5.4}$$
The unconditional pdf of the state transition vector $X(t)$ is given by,
$$p_j(t) = P\{X(t) = j\}, \quad j = 1, 2, 3, \ldots \tag{5.5}$$
with,
$$\sum_{\forall j \in S} p_j(t) = 1, \quad \forall t > 0, \tag{5.6}$$
since the process at any time t must be in a unique state. An Absorbing Markov
Process is one in which transitions have the following properties [Gave73],
• There is at least one absorbing state,
• From every state, it is possible to get to the absorbing state.
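The two properties of an absorbing process can be checked mechanically for a small model. The following is a minimal sketch, assuming the chain is given as a dictionary of transition rates; the function name `is_absorbing_chain` and the three–state example rates are hypothetical and are not taken from this work.

```python
from collections import deque

def is_absorbing_chain(rates):
    """Check the two properties of an absorbing Markov process.

    rates[i][j] is the transition rate from state i to state j (i != j).
    """
    states = set(rates)
    for row in rates.values():
        states.update(row)
    # A state is absorbing when it has no outgoing transition with positive rate.
    absorbing = {s for s in states
                 if not any(r > 0 for r in rates.get(s, {}).values())}
    if not absorbing:
        return False
    # Record, for every state, which states can move into it in one transition.
    predecessors = {s: set() for s in states}
    for i, row in rates.items():
        for j, r in row.items():
            if r > 0:
                predecessors[j].add(i)
    # Walk the transitions backwards from the absorbing states.
    reached = set(absorbing)
    queue = deque(absorbing)
    while queue:
        s = queue.popleft()
        for p in predecessors[s]:
            if p not in reached:
                reached.add(p)
                queue.append(p)
    # Every state must be able to reach some absorbing state.
    return reached == states

# Hypothetical three-state example in which state 0 is absorbing.
example = {2: {1: 0.002}, 1: {2: 0.1, 0: 0.001}, 0: {}}
print(is_absorbing_chain(example))
```

A chain that only cycles between two states, such as `{0: {1: 1.0}, 1: {0: 1.0}}`, fails the first property and the function returns `False`.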
Figure 3 – State Transition probabilities as a
function of time in the Continuous–Time
Markov chain that is subject to the constraints
of the Chapman–Kolmogorov equation.
The fundamental assumption of the Markov model is that the probability of a
given state transition depends only on the current state of the system and not on
any previous state. For continuous–time Markov processes, that is, those
described by ordinary differential equations, the length of time already spent in
the current state does not influence either the probability distribution of the next
state or the probability distribution of the remaining time in the same state before
the next transition. The Markov model fits with the standard assumption of the
reliability models developed so far in this work, that the failure rates are constant,
leading to an exponentially distributed state transition time for failures and a
Poisson distribution for the occurrence of these failures.
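The statement that the time already spent in a state does not influence the remaining time can be illustrated numerically for an exponentially distributed transition time. This is a small sketch only; the rate value and the function names are assumptions chosen for illustration.

```python
import math

def survival(rate, t):
    """P{T > t} for an exponentially distributed time T with a constant rate."""
    return math.exp(-rate * t)

def conditional_survival(rate, elapsed, t):
    """P{T > elapsed + t | T > elapsed}."""
    return survival(rate, elapsed + t) / survival(rate, elapsed)

rate = 1.0e-3  # hypothetical constant failure rate, per hour
for elapsed in (0.0, 100.0, 5000.0):
    # The conditional survival probability is the same for every elapsed time.
    print(elapsed, conditional_survival(rate, elapsed, 200.0))
```

Each printed value agrees with `survival(rate, 200.0)` to within rounding, which is the memoryless property that the constant failure rate assumption provides.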
Solving the Markov Matrix
In order to describe a continuous–time Markov process using transition matrices,
it is necessary to specify the entire family of stochastic matrices, $\{\mathbf{P}(t)\}$. Only
those matrices that meet certain conditions are useful in finding the solution to
the final absorption state rate of the system described by the Markov
Chain [Cour77].
Initial value problems involving systems of equations may be solved using the
Laplace transform. The advantage of this technique over traditional methods
(Elimination, Eigenvalue solutions, and Fundamental Matrix [Pipe63], [Cour43])
is that satisfaction of initial values is automatically provided. No special
techniques are needed to find particular solutions of the fundamental matrix, such
as repeated eigenvalues [Lome88].
Chapman–Kolmogorov Equations
A set of differential equations describing the transitions between each state can
be derived if the following conditions are met by the transition probability
matrix [Bhar60], [Parz62], [Howa71]. These conditions are the Chapman–Kolmogorov
Equations, which require that the transition probabilities of the Markov chain
satisfy Eq. (5.7) for all i and j, using Figure 3 as an example,
$$p_{ij}(v,t) = \sum_{k} p_{ik}(v,u)\, p_{kj}(u,t). \tag{5.7}$$
A simplified notation for the matrix elements defined in Eq. (5.7) can be created
where the elements of each matrix are given by,
$$\mathbf{H}(v,t) = \mathbf{H}(v,u)\,\mathbf{H}(u,t), \quad v \le u \le t, \tag{5.8}$$
and where,
$$\mathbf{H}(t,t) = \mathbf{I}, \tag{5.9}$$
with $\mathbf{I}$ the identity matrix.
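Eq. (5.8) can be checked numerically for a time–homogeneous chain, where the transition matrix depends only on the elapsed time and can be written as a matrix exponential of a constant rate matrix. The sketch below assumes a hypothetical three–state rate matrix and uses a plain Taylor series with scaling and squaring; the helper names are illustrative, not part of this work.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, terms=20, squarings=10):
    """exp(q * t) by a Taylor series combined with scaling and squaring."""
    n = len(q)
    scale = t / 2 ** squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term holds a^k / k! after this update.
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def transition_matrix(q, v, t):
    """H(v, t) for a homogeneous chain, which depends only on t - v."""
    return mat_exp(q, t - v)

# Hypothetical rate matrix; each row sums to zero and the last state is absorbing.
Q = [[-0.2, 0.2, 0.0],
     [1.0, -1.1, 0.1],
     [0.0, 0.0, 0.0]]
v, u, t = 0.0, 1.0, 3.0
left = transition_matrix(Q, v, t)
right = mat_mul(transition_matrix(Q, v, u), transition_matrix(Q, u, t))
max_gap = max(abs(left[i][j] - right[i][j]) for i in range(3) for j in range(3))
print(max_gap)
```

The printed gap between $\mathbf{H}(v,t)$ and $\mathbf{H}(v,u)\mathbf{H}(u,t)$ is at the level of rounding error, and each row of `left` sums to one as Eq. (5.4) requires.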
The Forward Chapman–Kolmogorov Equation is now defined as,
$$\frac{\partial}{\partial t}\mathbf{H}(v,t) = \mathbf{H}(v,t)\,\mathbf{Q}(t), \quad v \le t, \tag{5.10}$$
where the new matrix $\mathbf{Q}(t)$ is defined as,
$$\mathbf{Q}(t) = \lim_{\Delta t \to 0} \frac{\mathbf{P}(t) - \mathbf{I}}{\Delta t}, \tag{5.11}$$
with,
$$\Delta t = t - v. \tag{5.12}$$
The matrix $\mathbf{Q}(t)$ is now defined as the transition rate matrix [Papo65a]. The
elements of $\mathbf{Q}(t)$ are $q_{ij}(t)$ and are defined by,
$$q_{ii}(t) = \lim_{\Delta t \to 0} \frac{p_{ii}(t, t + \Delta t) - 1}{\Delta t}, \tag{5.13}$$
and
$$q_{ij}(t) = \lim_{\Delta t \to 0} \frac{p_{ij}(t, t + \Delta t)}{\Delta t}, \quad i \ne j. \tag{5.14}$$
If the system at time t is in state i, then the probability that a transition occurs to
any state other than state i during the time interval $[t, t + \Delta t]$ is given by,
$$-q_{ii}(t)\,\Delta t + o(\Delta t), \tag{5.15}$$
where $o(h)$ is any function of h that approaches zero faster than h, that is
$$\lim_{h \to 0} \frac{o(h)}{h} = 0.$$
The quantity $-q_{ii}(t)$ from Eq. (5.13) is the rate at which the process departs
state i, given that the system is in state i.
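The limits that define the elements of the transition rate matrix can be approximated with a small but finite interval. The sketch below assumes a two–state homogeneous chain, for which the transition probabilities have a well known closed form; the rates `a` and `b` and the function names are hypothetical.

```python
import math

def two_state_p(a, b, h):
    """Transition probabilities over an interval h for a two-state homogeneous
    chain with rate a from state 0 to state 1 and rate b from state 1 to state 0."""
    decay = math.exp(-(a + b) * h)
    p00 = (b + a * decay) / (a + b)
    p11 = (a + b * decay) / (a + b)
    return [[p00, 1.0 - p00], [1.0 - p11, p11]]

def estimate_rates(a, b, h):
    """Finite-difference estimate of the rate matrix: (p_ij - delta_ij) / h."""
    p = two_state_p(a, b, h)
    return [[(p[i][j] - (1.0 if i == j else 0.0)) / h for j in range(2)]
            for i in range(2)]

# With a small interval the estimates approach [[-a, a], [b, -b]].
q = estimate_rates(0.3, 2.0, 1.0e-6)
for row in q:
    print(row, sum(row))
```

The diagonal estimates are negative, the off–diagonal estimates are positive, and each row sums to zero.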
Similarly, given that the system is in state i at time t, the conditional probability
that it will make a transition from state i to state j in the time interval
$[t, t + \Delta t]$ is given by,
$$q_{ij}(t)\,\Delta t + o(\Delta t). \tag{5.16}$$
Eq. (5.14) is the rate at which the process moves from state i to state j given that
the system is in state i. Since,
$$\sum_{j} p_{ij}(v,t) = 1, \tag{5.17}$$
Eq. (5.13) and Eq. (5.14) imply,
$$\sum_{j} q_{ij}(t) = 0, \quad \forall i \in S. \tag{5.18}$$
Using these developments, the Backward Chapman–Kolmogorov equation is given by,
$$\frac{\partial}{\partial v}\mathbf{H}(v,t) = -\mathbf{Q}(v)\,\mathbf{H}(v,t), \quad v \le t. \tag{5.19}$$
The forward equation may be expressed in terms of its elements,
$$\frac{\partial}{\partial t} p_{ij}(v,t) = q_{jj}(t)\, p_{ij}(v,t) + \sum_{k \ne j} q_{kj}(t)\, p_{ik}(v,t). \tag{5.20}$$
The initial state i at the initial time v affects the solution of this set of differential
equations only through the following conditions,
$$p_{ij}(v,v) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{5.21}$$
The backward matrix equation may be expressed in terms of its elements,
$$\frac{\partial}{\partial v} p_{ij}(v,t) = -q_{ii}(v)\, p_{ij}(v,t) - \sum_{k \ne i} q_{ik}(v)\, p_{kj}(v,t), \tag{5.22}$$
with the initial conditions,
$$p_{ij}(t,t) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{5.23}$$
Markov Matrix Notation
The expressions developed in the previous section can be represented by a
transition probability matrix [Papo62] of the form,
$$P = \left[p_{ij}\right] = \begin{bmatrix} p_{00} & p_{01} & \cdots & p_{0n} \\ p_{10} & p_{11} & \cdots & p_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m0} & p_{m1} & \cdots & p_{mn} \end{bmatrix}.$$
The entries in this matrix satisfy two properties: $0 \le p_{ij} \le 1$ and
$\sum_{j} p_{ij} = 1$, which is a restatement of Eq. (5.17). The Transition Probability
Matrix can also be represented by a directed graph [Maye72], [Deo74]. A node labeled i
in the directed graph represents state i of the Markov Chain and a branch labeled
$p_{ij}$ from node i to node j implies that the conditional probability
$P\{X_n = j \mid X_{n-1} = i\} = p_{ij}$ is met by the Markov Process represented by the
directed graph.
The transition probabilities represent a set of differential equations describing the
rate at which the transitions take place between each node in the directed graph.
The differential equations are then represented by a matrix structure of,
$$\begin{bmatrix} \dfrac{d}{dt}P_0 \\ \dfrac{d}{dt}P_1 \\ \vdots \\ \dfrac{d}{dt}P_n \end{bmatrix} = \begin{bmatrix} p_{00} & \cdots & p_{0n} \\ p_{10} & \cdots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{m0} & \cdots & p_{mn} \end{bmatrix} \begin{bmatrix} P_0 \\ P_1 \\ \vdots \\ P_n \end{bmatrix}.$$
The solution to this set of linear homogeneous differential equations can be
derived by elimination using the Laplace transform method.
Laplace Transform Techniques
Given a set of differential equations in Eq. (5.20) and Eq. (5.22), the Laplace
transform can be used to generate solutions to these equations [Lome88]. One
advantage of using the Laplace transform method is its ability to handle initial
conditions automatically, without having first to find a general solution and then
having to evaluate the integration constants. The Laplace transform is defined as,
$$F(s) = \int_0^\infty e^{-st} f(t)\,dt = \mathcal{L}\{f(t)\}. \tag{5.24}$$
The differential equation solution method depends on the following operational
property of the Laplace transform [Krey72]. The Laplace transform of the
derivative of a function is,
$$\mathcal{L}\{f'(t)\} = \int_0^\infty e^{-st} f'(t)\,dt = \lim_{b \to \infty}\left[ e^{-st} f(t)\Big|_0^b + s\int_0^b e^{-st} f(t)\,dt \right]. \tag{5.25}$$
In the limit, the integral appearing on the right–hand side of Eq. (5.25) is
$\mathcal{L}\{f(t)\}$, so that the first term in Eq. (5.25) can be evaluated in the following
manner [McLac39],
$$\lim_{b \to \infty}\left[e^{-sb} f(b) - e^{0} f(0)\right]. \tag{5.26}$$
Using the property of absolute values and limits [Arfk70], Eq. (5.26) can be
rewritten as,
$$\left|\lim_{b \to \infty} e^{-sb} f(b)\right| \le \lim_{b \to \infty} e^{-sb}\left|f(b)\right|. \tag{5.27}$$
The term $f(b)$ is of the order $e^{\alpha b}$ as $b \to \infty$. For $b > T$, using the
definition for exponential order, Eq. (5.27) can be reevaluated to the following,
$$\lim_{b \to \infty} e^{-sb}\left|f(b)\right| \le \lim_{b \to \infty} e^{-sb} M e^{\alpha b} = \lim_{b \to \infty} M e^{-(s - \alpha)b}. \tag{5.28}$$
The function $f(t)$ is said to be of exponential order as $t \to \infty$ if there
exists a constant $\alpha$ such that $e^{-\alpha t}\left|f(t)\right|$ is bounded for all t
greater than some T. If this statement is true, there also exists a constant M, such that
$$\left|f(t)\right| < M e^{\alpha t}, \quad t > T.$$
Figure 4 – Definition of the exponential order
of a function.
If $s > \alpha$, then $s - \alpha > 0$, giving,
$$\lim_{b \to \infty} M e^{-(s - \alpha)b} = 0, \tag{5.29}$$
so that in the limit,
$$\lim_{b \to \infty} e^{-sb} f(b) = 0, \tag{5.30}$$
giving the final form of the Laplace transform of a derivative as,
$$\mathcal{L}\{f'(t)\} = s\,\mathcal{L}\{f(t)\} - f(0). \tag{5.31}$$
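The operational property in Eq. (5.31) can be confirmed numerically for a simple function of exponential order. The sketch below assumes the test function $f(t) = e^{-2t}$ and approximates the Laplace integral with composite Simpson's rule on a truncated interval; the function names, the truncation point, and the step count are illustrative choices.

```python
import math

def laplace(f, s, upper=60.0, steps=60000):
    """Approximate the Laplace integral of f at s with composite Simpson's
    rule on [0, upper]; steps must be even."""
    h = upper / steps
    total = f(0.0) + math.exp(-s * upper) * f(upper)
    for k in range(1, steps):
        t = k * h
        weight = 4.0 if k % 2 == 1 else 2.0
        total += weight * math.exp(-s * t) * f(t)
    return total * h / 3.0

def f(t):
    """Assumed test function of exponential order."""
    return math.exp(-2.0 * t)

def f_prime(t):
    """Derivative of the test function."""
    return -2.0 * math.exp(-2.0 * t)

s = 1.5
lhs = laplace(f_prime, s)           # transform of the derivative
rhs = s * laplace(f, s) - f(0.0)    # right-hand side of Eq. (5.31)
print(lhs, rhs)
```

Both values agree with the exact result $-2/(s + 2)$ for this test function, to within the accuracy of the numerical integration.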
The notation for the Laplace transform of the differential equation for the rate
of arrival at the transition state i is then given by,
$$\mathcal{L}\{P_i(t)\} \Rightarrow P_i(s). \tag{5.32}$$
From this point on, this Laplace transform notation will be used in the solution
of the Markov transition matrix differential equations. Using the expression
$R(t) = 1 - F(t) = P\{T \ge t\}$ to define the system reliability, where $F(t)$ is the
probability distribution function of the time to failure, a new random variable, Y,
can be defined which represents the time to system failure. A notation
can be defined such that
$$f_Y(t) = -\frac{dR(t)}{dt} = \frac{dP_0(t)}{dt}$$
is the failure density of the random variable Y. The Laplace transform of this
failure density is denoted by
$$\mathcal{L}\{f_Y(t)\} \Rightarrow L_Y(s) = f_Y(s) = sP_0(s).$$
In this work $P_0(s)$ represents the absorbing state of the Markov model. By using
the Laplace transform notation in the solution of differential equations, the inverse
transform can be used to generate the failure density function for the random
variable Y. Using Eq. (2.38) the failure density function can be integrated to
produce the Mean Time to Failure,
$$MTTF = E[Y] = \int_0^\infty -t\left(\frac{d}{dt}R(t)\right)dt.$$
The inversion of the Laplace transform may be straightforward in some cases and
more complex in other cases.
MODELING A DUPLEX SYSTEM
Duplex systems or Parallel Redundant systems have been utilized in electronic
central office switching systems and other high–reliability systems for the past 35
years [Toy78]. Parallel redundant systems depend on fault detection and recovery
for their proper operation. In most dual redundant architectures both systems are
monitored continuously, providing fault detection in the primary subsystem as
well as the standby subsystem.
This section describes the detailed development of the Markov model for a
parallel redundant system with perfect diagnostic coverage. The failure rate of
both subsystems is assumed to be a constant $\lambda$ and the repair rate a constant
$\mu$. The system is considered failed when both subsystems have failed. The
number of properly functioning subsystems is described in the state space
$S \Rightarrow \{2, 1, 0\}$, where $\{0\}$ is the failure state of the system. The state
diagram for the system is shown in Figure 5.
[Figure 5 diagram: State $\{2\}$ to State $\{1\}$ at rate $2\lambda$, State $\{1\}$ to State $\{2\}$ at rate $\mu$, State $\{1\}$ to State $\{0\}$ at rate $\lambda$.]
Figure 5 – The state transition diagram for a
Parallel Redundant system with repair. State $\{2\}$
represents the fault free operation mode,
State $\{1\}$ represents a single fault with a
return path to the fault free mode by a repair
operation, and State $\{0\}$ represents the
system failure mode, the absorption state.
The initial state of the system is $\{2\}$ and the initial conditions for the transition
equations are,
$$P_2(0) = 1, \quad P_1(0) = P_0(0) = 0. \tag{5.33}$$
Using the initial conditions, the system of differential equations derived from the
transition matrix,
$$\begin{bmatrix} \dfrac{dP_2(t)}{dt} \\ \dfrac{dP_1(t)}{dt} \\ \dfrac{dP_0(t)}{dt} \end{bmatrix} = \begin{bmatrix} -2\lambda & \mu & 0 \\ 2\lambda & -(\lambda + \mu) & 0 \\ 0 & \lambda & 0 \end{bmatrix} \begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix},$$
are given by,
$$\begin{aligned} \text{(a)}\quad & \frac{dP_2(t)}{dt} = -2\lambda P_2(t) + \mu P_1(t), \\ \text{(b)}\quad & \frac{dP_1(t)}{dt} = 2\lambda P_2(t) - (\lambda + \mu) P_1(t), \\ \text{(c)}\quad & \frac{dP_0(t)}{dt} = \lambda P_1(t). \end{aligned} \tag{5.34}$$
Using the Laplace transform solution technique described in the previous section
and in detail in [Doet61], [Widd46], [Lome88], [Rea78], and [Lath65] gives the
following set of equations in Laplace form,
$$\begin{aligned} \text{(a)}\quad & sP_2(s) - 1 = -2\lambda P_2(s) + \mu P_1(s), \\ \text{(b)}\quad & sP_1(s) = 2\lambda P_2(s) - (\lambda + \mu) P_1(s), \\ \text{(c)}\quad & sP_0(s) = \lambda P_1(s). \end{aligned} \tag{5.35}$$
Solving Eq. (5.35)(a) for state $\{2\}$ gives,
$$\begin{aligned} sP_2(s) + 2\lambda P_2(s) &= \mu P_1(s) + 1, \\ (s + 2\lambda) P_2(s) &= \mu P_1(s) + 1, \\ P_2(s) &= \frac{\mu P_1(s) + 1}{s + 2\lambda}, \end{aligned} \tag{5.36}$$
and solving Eq. (5.35)(b) for state $\{2\}$ gives,
$$\begin{aligned} sP_1(s) &= 2\lambda P_2(s) - (\lambda + \mu) P_1(s), \\ (s + \lambda + \mu) P_1(s) &= 2\lambda P_2(s), \\ P_2(s) &= \frac{(s + \lambda + \mu) P_1(s)}{2\lambda}. \end{aligned} \tag{5.37}$$
Equating Eq. (5.36) and Eq. (5.37) a solution representing state $\{1\}$ can be
derived, giving,
$$\frac{(s + \lambda + \mu) P_1(s)}{2\lambda} = \frac{\mu P_1(s) + 1}{s + 2\lambda}. \tag{5.38}$$
Multiplying each side by $1/P_1(s)$ gives,
$$\frac{s + \lambda + \mu}{2\lambda} = \frac{\mu + \dfrac{1}{P_1(s)}}{s + 2\lambda},$$
which results in,
$$(s + \lambda + \mu)(s + 2\lambda) = \frac{2\lambda}{P_1(s)} + 2\lambda\mu. \tag{5.39}$$
Solving Eq. (5.39) for state $\{1\}$ gives,
$$P_1(s) = \frac{2\lambda}{(s + \lambda + \mu)(s + 2\lambda) - 2\lambda\mu}. \tag{5.40}$$
Expanding and simplifying Eq. (5.40) gives,
$$P_1(s) = \frac{2\lambda}{s^2 + 3\lambda s + 2\lambda^2 + \mu s}. \tag{5.41}$$
Substituting Eq. (5.41) into Eq. (5.35)(c) gives the solution to the final absorbing
state $\{0\}$ as,
$$\begin{aligned} \text{(a)}\quad & sP_0(s) = \lambda P_1(s), \\ \text{(b)}\quad & sP_0(s) = \lambda\left[\frac{2\lambda}{s^2 + (3\lambda + \mu)s + 2\lambda^2}\right], \\ \text{(c)}\quad & P_0(s) = \frac{2\lambda^2}{s\left[s^2 + (3\lambda + \mu)s + 2\lambda^2\right]}. \end{aligned} \tag{5.42}$$
After producing the inverse Laplace transform of Eq. (5.42)(c), the probability
that no subsystems are operating at time $t > 0$, $P_0(t)$, is the result. Let the random
variable Y be the time to failure of the system and $P_0(t)$ be the probability that
the system has failed at or before time t. The reliability of the system is then
defined by,
$$R(t) = 1 - P_0(t). \tag{5.43}$$
Using Eq. (2.37), the failure density function for the random variable Y is given
by,
$$f_Y(t) = -\frac{dR}{dt} = \frac{dP_0(t)}{dt}, \tag{5.44}$$
and using Eq. (5.31), its Laplace transform is given by,
$$L_Y(s) = f_Y(s) = sP_0(s) - P_0(0) = \frac{2\lambda^2}{s^2 + (3\lambda + \mu)s + 2\lambda^2}. \tag{5.45}$$
Inverting Eq. (5.45) gives the failure density of Y as,
$$f_Y(t) = \frac{2\lambda^2}{\alpha_1 - \alpha_2}\left(e^{-\alpha_2 t} - e^{-\alpha_1 t}\right), \tag{5.46}$$
where,
$$\alpha_1, \alpha_2 = \frac{(3\lambda + \mu) \pm \sqrt{\lambda^2 + 6\lambda\mu + \mu^2}}{2}. \tag{5.47}$$
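The closed form can be cross–checked against a direct numerical integration of the state equations in Eq. (5.34). The sketch below assumes the reliability obtained by integrating Eq. (5.46) from t to infinity, which with $\alpha_1 \alpha_2 = 2\lambda^2$ reduces to $R(t) = (\alpha_1 e^{-\alpha_2 t} - \alpha_2 e^{-\alpha_1 t})/(\alpha_1 - \alpha_2)$. The function names, the fourth–order Runge–Kutta integrator, and the numerical rates are illustrative assumptions.

```python
import math

def reliability_closed_form(lam, mu, t):
    """R(t) for the duplex system with repair, from Eq. (5.46) and Eq. (5.47)."""
    root = math.sqrt(lam ** 2 + 6.0 * lam * mu + mu ** 2)
    a1 = (3.0 * lam + mu + root) / 2.0
    a2 = (3.0 * lam + mu - root) / 2.0
    return (a1 * math.exp(-a2 * t) - a2 * math.exp(-a1 * t)) / (a1 - a2)

def reliability_rk4(lam, mu, t, steps=20000):
    """R(t) = P2(t) + P1(t), integrating Eq. (5.34)(a) and (b) with classical RK4."""
    def deriv(p2, p1):
        return (-2.0 * lam * p2 + mu * p1,
                2.0 * lam * p2 - (lam + mu) * p1)
    h = t / steps
    p2, p1 = 1.0, 0.0  # initial conditions of Eq. (5.33)
    for _ in range(steps):
        k1 = deriv(p2, p1)
        k2 = deriv(p2 + 0.5 * h * k1[0], p1 + 0.5 * h * k1[1])
        k3 = deriv(p2 + 0.5 * h * k2[0], p1 + 0.5 * h * k2[1])
        k4 = deriv(p2 + h * k3[0], p1 + h * k3[1])
        p2 += h * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        p1 += h * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return p2 + p1

lam, mu = 1.0e-3, 0.1  # hypothetical failure and repair rates, per hour
for t in (10.0, 1000.0):
    print(t, reliability_closed_form(lam, mu, t), reliability_rk4(lam, mu, t))
```

The two columns agree closely for both times, which supports the signs and roots in Eq. (5.46) and Eq. (5.47).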
Using Eq. (2.28), the MTTF of the Parallel Redundant system with repair is given
by,
$$\begin{aligned} E[Y] &= \int_0^\infty y f_Y(y)\,dy \\ &= \frac{2\lambda^2}{\alpha_1 - \alpha_2}\left[\int_0^\infty y e^{-\alpha_2 y}\,dy - \int_0^\infty y e^{-\alpha_1 y}\,dy\right] \\ &= \frac{2\lambda^2}{\alpha_1 - \alpha_2}\left[\frac{1}{\alpha_2^2} - \frac{1}{\alpha_1^2}\right] \\ &= \frac{2\lambda^2(\alpha_1 + \alpha_2)}{\alpha_1^2 \alpha_2^2} \\ &= \frac{2\lambda^2(3\lambda + \mu)}{\left(2\lambda^2\right)^2} \\ &= \frac{3}{2\lambda} + \frac{\mu}{2\lambda^2}. \end{aligned} \tag{5.48}$$
The MTTF of a two element Parallel Redundant system without repair $(\mu = 0)$
would have been equal to the first term of the final line of Eq. (5.48). The effect of
adding a repair facility to the system increases the mean life of the system by,
$$MTTF_{\text{as a result of Repair}} = \frac{\mu}{2\lambda^2}, \tag{5.49}$$
or a factor of,
$$\frac{\mu / 2\lambda^2}{3 / 2\lambda} = \frac{\mu}{3\lambda}, \tag{5.50}$$
over a system without repair facilities.
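The MTTF in Eq. (5.48) can also be reached without inverting a transform, by writing first–step equations for the mean time to absorption from each operating state. This is a swapped–in technique used only as a cross–check, not the derivation followed in this work; the function names and the numerical rates are hypothetical.

```python
def duplex_mttf_formula(lam, mu):
    """Final line of Eq. (5.48)."""
    return 3.0 / (2.0 * lam) + mu / (2.0 * lam ** 2)

def duplex_mttf_first_step(lam, mu):
    """Mean time to absorption starting in state {2}.

    T2 and T1 are the mean times to failure from states {2} and {1}:
        2*lam*T2 - 2*lam*T1      = 1
        -mu*T2   + (lam + mu)*T1 = 1
    solved here with Cramer's rule.
    """
    a, b = 2.0 * lam, -2.0 * lam
    c, d = -mu, lam + mu
    det = a * d - b * c
    return (d - b) / det

lam, mu = 1.0e-3, 0.1  # hypothetical failure and repair rates, per hour
print(duplex_mttf_formula(lam, mu), duplex_mttf_first_step(lam, mu))
print("gain from repair relative to no repair:", mu / (3.0 * lam))
```

With these assumed rates both functions give roughly 51,500 hours, and setting `mu = 0.0` in the formula recovers the no–repair value $3/(2\lambda)$.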
MODELING A TRIPLE–REDUNDANT SYSTEM
A Triple Modular Redundant (TMR) system continues to operate correctly as
long as two of the three subsystems are functioning properly. A second
subsystem failure causes the system to fail. This model is referred to as 3–2–0. A
second architecture (shown in Figure 7) is possible in which the system will
continue to operate in the presence of two (2) subsystem failures. This system
operates in simplex mode 3–2–1–0. The 3–2–0 model without coverage will be
developed in this section. Figure 6 describes a TMR system with a constant
failure rate $\lambda$ and a constant repair rate $\mu$.
The repair activity takes place with a constant response time whenever a
subsystem fails, giving a Markov transition matrix of,
$$\begin{bmatrix} \dfrac{dP_2(t)}{dt} \\ \dfrac{dP_1(t)}{dt} \\ \dfrac{dP_0(t)}{dt} \end{bmatrix} = \begin{bmatrix} -3\lambda & \mu & 0 \\ 3\lambda & -(2\lambda + \mu) & 0 \\ 0 & 2\lambda & 0 \end{bmatrix} \begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix}. \tag{5.51}$$
The set of differential equations derived from the transition matrix is given by,
$$\begin{aligned} \text{(a)}\quad & \frac{dP_2(t)}{dt} = -3\lambda P_2(t) + \mu P_1(t), \\ \text{(b)}\quad & \frac{dP_1(t)}{dt} = 3\lambda P_2(t) - (2\lambda + \mu) P_1(t), \\ \text{(c)}\quad & \frac{dP_0(t)}{dt} = 2\lambda P_1(t). \end{aligned} \tag{5.52}$$
Rewriting the differential equations in the Laplace transform format gives,
$$\begin{aligned} \text{(a)}\quad & sP_2(s) - 1 = -3\lambda P_2(s) + \mu P_1(s), \\ \text{(b)}\quad & sP_1(s) = 3\lambda P_2(s) - (2\lambda + \mu) P_1(s), \\ \text{(c)}\quad & sP_0(s) = 2\lambda P_1(s). \end{aligned} \tag{5.53}$$
Using Eq. (5.53)(a) and Eq. (5.53)(b) to solve for state $\{2\}$ gives,
$$\begin{aligned} sP_2(s) + 3\lambda P_2(s) &= \mu P_1(s) + 1, \\ (s + 3\lambda) P_2(s) &= \mu P_1(s) + 1, \\ P_2(s) &= \frac{\mu P_1(s) + 1}{s + 3\lambda}. \end{aligned} \tag{5.54}$$
[Figure 6 diagram: State $\{2\}$ to State $\{1\}$ at rate $3\lambda$, State $\{1\}$ to State $\{2\}$ at rate $\mu$, State $\{1\}$ to State $\{0\}$ at rate $2\lambda$.]
Figure 6 – The transition diagram for a Triple
Modular Redundant system with repair. State $\{2\}$
represents the fault free (TMR) operation
mode, State $\{1\}$ represents a single fault
(Duplex) operation mode with a return path
to the fault free mode, and State $\{0\}$
represents the system failure mode, the
absorbing state.
Using Eq. (5.53)(b) again to solve for state $\{2\}$ gives,
$$\begin{aligned} sP_1(s) &= 3\lambda P_2(s) - (2\lambda + \mu) P_1(s), \\ (s + 2\lambda + \mu) P_1(s) &= 3\lambda P_2(s), \\ P_2(s) &= \frac{(s + 2\lambda + \mu) P_1(s)}{3\lambda}. \end{aligned} \tag{5.55}$$
Equating Eq. (5.54) and Eq. (5.55) and solving for state $\{1\}$ gives,
$$\begin{aligned} \text{(a)}\quad & \frac{(s + 2\lambda + \mu) P_1(s)}{3\lambda} = \frac{\mu P_1(s) + 1}{s + 3\lambda}, \\ \text{(b)}\quad & P_1(s) = \frac{3\lambda}{(s + 2\lambda + \mu)(s + 3\lambda) - 3\lambda\mu}. \end{aligned} \tag{5.56}$$
Simplifying Eq. (5.56)(b) gives,
$$P_1(s) = \frac{3\lambda}{s^2 + 5\lambda s + 6\lambda^2 + \mu s}. \tag{5.57}$$
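Substituting the solution for state $\{1\}$, Eq. (5.57), into Eq. (5.53)(c) gives the
solution for the final absorbing state $\{0\}$,
$$\begin{aligned} \text{(a)}\quad & sP_0(s) = 2\lambda P_1(s) = 2\lambda\left[\frac{3\lambda}{s^2 + 5\lambda s + 6\lambda^2 + \mu s}\right], \\ \text{(b)}\quad & P_0(s) = \frac{6\lambda^2}{s\left[s^2 + 5\lambda s + 6\lambda^2 + \mu s\right]}. \end{aligned} \tag{5.58}$$
Expanding and factoring the denominator of Eq. (5.58)(b) gives the Laplace
transform of the absorption state probability as,
$$P_0(s) = \frac{6\lambda^2}{s\left[s + \tfrac{1}{2}\left(5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)\right]\left[s + \tfrac{1}{2}\left(5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)\right]}. \tag{5.59}$$
Expanding the partial fractions of Eq. (5.59) and taking the inverse Laplace
transform, results in the following reliability function,
$$\begin{aligned} R(t) = {} & \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}\, e^{-\frac{1}{2}\left(5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)t} \\ & - \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{2\sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}\, e^{-\frac{1}{2}\left(5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right)t}. \end{aligned} \tag{5.60}$$
Integrating Eq. (5.60) using Eq. (2.24) produces the MTTF of,
$$\begin{aligned} MTTF = {} & \frac{5\lambda + \mu + \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{(5\lambda + \mu)\sqrt{\lambda^2 + 10\lambda\mu + \mu^2} - \lambda^2 - 10\lambda\mu - \mu^2} \\ & - \frac{5\lambda + \mu - \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}}{(5\lambda + \mu)\sqrt{\lambda^2 + 10\lambda\mu + \mu^2} + \lambda^2 + 10\lambda\mu + \mu^2}. \end{aligned} \tag{5.61}$$
Simplifying Eq. (5.61) gives the MTTF for a TMR system with repair as,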
$$MTTF = \frac{5\lambda + \mu}{6\lambda^2}. \tag{5.62}$$
Rearranging Eq. (5.62) and isolating the repair term from the failure term gives,
$$MTTF = \frac{5}{6\lambda} + \frac{\mu}{6\lambda^2}. \tag{5.63}$$
MODELING A PARALLEL SYSTEM WITH IMPERFECT COVERAGE
A more realistic model of a Parallel Redundant System assumes that not all faults
are recoverable and that the coverage factor c denotes the conditional probability
that the system detects the fault and survives. The state diagram for this system is
shown in Figure 7.
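The sensitivity of the mean life to the coverage factor can be previewed numerically before the transform solution is developed. The sketch below is not the derivation used in this work: it assumes that a first fault occurs at rate $2\lambda$ and is covered with probability c, so that the fault free state moves to the single fault state at rate $2\lambda c$ and directly to the failed state at rate $2\lambda(1 - c)$, and it solves first–step equations for the mean time to absorption. The function name and the numerical rates are hypothetical.

```python
def duplex_mttf_with_coverage(lam, mu, c):
    """Mean time to failure from the fault free state, assuming rates
    2*lam*c ({2} to {1}), 2*lam*(1 - c) ({2} to {0}), mu ({1} to {2})
    and lam ({1} to {0}).

    First-step equations for the mean times T2 and T1:
        2*lam*T2 - 2*lam*c*T1    = 1
        -mu*T2   + (lam + mu)*T1 = 1
    solved here with Cramer's rule.
    """
    det = 2.0 * lam * (lam + mu) - 2.0 * lam * c * mu
    return ((lam + mu) + 2.0 * lam * c) / det

lam, mu = 1.0e-3, 0.1  # hypothetical failure and repair rates, per hour
for c in (1.0, 0.99, 0.9, 0.0):
    print(c, duplex_mttf_with_coverage(lam, mu, c))
```

With c = 1 the value matches the perfect coverage result $3/(2\lambda) + \mu/(2\lambda^2)$, and with c = 0 it falls to $1/(2\lambda)$, the mean time to the first fault. Under these assumed rates, even a small loss of coverage removes a large share of the benefit of repair.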
[Figure 7 diagram: State $\{2\}$ to State $\{1\}$ at rate $2\lambda c$, State $\{1\}$ to State $\{2\}$ at rate $\mu$, State $\{1\}$ to State $\{0\}$ at rate $\lambda$, State $\{2\}$ to State $\{0\}$ at rate $2\lambda(1 - c)$.]
Figure 7 – The transition diagram for a
Parallel Redundant system with repair and
imperfect fault coverage. State $\{2\}$ represents
the fault free mode, State $\{1\}$ represents a
single fault with a return path to the fault free
mode by a repair operation, and State $\{0\}$
represents the system failure mode. State $\{0\}$
can be reached from State $\{2\}$ through an
uncovered fault, which causes the system to
fail without the intermediate State $\{1\}$ mode.
The transition matrix for Figure 7 is,
$$\begin{bmatrix} \dfrac{dP_2(t)}{dt} \\ \dfrac{dP_1(t)}{dt} \\ \dfrac{dP_0(t)}{dt} \end{bmatrix} = \begin{bmatrix} -\left[2\lambda c + 2\lambda(1 - c)\right] & \mu & 0 \\ 2\lambda c & -(\lambda + \mu) & 0 \\ 2\lambda(1 - c) & \lambda & 0 \end{bmatrix} \begin{bmatrix} P_2(t) \\ P_1(t) \\ P_0(t) \end{bmatrix}. \tag{5.64}$$
With an initial state of $\{2\}$ producing a set of starting conditions,
$P_2(0) = 1$, $P_1(0) = P_0(0) = 0$,
the system of equations describing the state transitions is,
$$\begin{aligned} \text{(a)}\quad & \frac{dP_2(t)}{dt} = -2\lambda c P_2(t) - 2\lambda(1 - c) P_2(t) + \mu P_1(t), \\ \text{(b)}\quad & \frac{dP_1(t)}{dt} = 2\lambda c P_2(t) - (\lambda + \mu) P_1(t), \\ \text{(c)}\quad & \frac{dP_0(t)}{dt} = 2\lambda(1 - c) P_2(t) + \lambda P_1(t). \end{aligned} \tag{5.65}$$
Using the Laplace transform method, the above equations are reduced to,
$$\begin{aligned} \text{(a)}\quad & sP_2(s) - 1 = -2\lambda P_2(s) + \mu P_1(s), \\ \text{(b)}\quad & sP_1(s) = 2\lambda c P_2(s) - (\lambda + \mu) P_1(s), \\ \text{(c)}\quad & sP_0(s) = 2\lambda(1 - c) P_2(s) + \lambda P_1(s). \end{aligned} \tag{5.66}$$
Using Eq. (5.66)(a) and solving for state $\{2\}$ gives,
$$\begin{aligned} \text{(a)}\quad & sP_2(s) + 2\lambda P_2(s) = \mu P_1(s) + 1, \\ \text{(b)}\quad & (s + 2\lambda) P_2(s) = \mu P_1(s) + 1, \\ \text{(c)}\quad & P_2(s) = \frac{\mu P_1(s) + 1}{s + 2\lambda}. \end{aligned} \tag{5.67}$$
Using Eq. (5.66)(b) to solve for state $\{2\}$ gives,
$$\begin{aligned} \text{(a)}\quad & sP_1(s) = 2\lambda c P_2(s) - (\lambda + \mu) P_1(s), \\ \text{(b)}\quad & (s + \lambda + \mu) P_1(s) = 2\lambda c P_2(s), \\ \text{(c)}\quad & P_2(s) = \frac{(s + \lambda + \mu) P_1(s)}{2\lambda c}. \end{aligned} \tag{5.68}$$
Equating Eq. (5.67)(c) and Eq. (5.68)(c) and solving for state $\{1\}$ gives,
$$\frac{\mu P_1(s) + 1}{s + 2\lambda} = \frac{(s + \lambda + \mu) P_1(s)}{2\lambda c}. \tag{5.69}$$
Simplifying Eq. (5.69) and solving for state $\{1\}$ gives,
$$\begin{aligned} 2\lambda\mu c P_1(s) + 2\lambda c &= (s + 2\lambda)(s + \lambda + \mu) P_1(s), \\ P_1(s) &= \frac{2\lambda c}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c}. \end{aligned} \tag{5.70}$$
Using Eq. (5.66)(a) and solving for state $\{1\}$ gives,
$$\begin{aligned} sP_2(s) + 2\lambda P_2(s) - 1 &= \mu P_1(s), \\ (s + 2\lambda) P_2(s) - 1 &= \mu P_1(s), \\ P_1(s) &= \frac{(s + 2\lambda) P_2(s) - 1}{\mu}. \end{aligned} \tag{5.71}$$
Using Eq. (5.66)(b) and solving for state $\{1\}$ gives,
$$\begin{aligned} sP_1(s) &= 2\lambda c P_2(s) - (\lambda + \mu) P_1(s), \\ (s + \lambda + \mu) P_1(s) &= 2\lambda c P_2(s), \\ P_1(s) &= \frac{2\lambda c}{s + \lambda + \mu} P_2(s). \end{aligned} \tag{5.72}$$
Equating Eq. (5.71) and Eq. (5.72) and solving for state $\{2\}$ gives,
$$\begin{aligned} \frac{(s + 2\lambda) P_2(s) - 1}{\mu} &= \frac{2\lambda c\, P_2(s)}{s + \lambda + \mu}, \\ P_2(s) &= \frac{s + \lambda + \mu}{(s + 2\lambda)(s + \lambda + \mu) - 2\lambda\mu c}. \end{aligned} \tag{5.73}$$
Substituting Eq. (5.70) and Eq. (5.73) into Eq. (5.66)(c) and solving for state
$\{0\}$ gives,