A system is a collection of components, subsystems and/or assemblies arranged to a specific design in order to achieve desired functions with acceptable performance and reliability. The types of components, their quantities, their qualities and the manner in which they are arranged within the system have a direct effect on the system's reliability.
Often, the relationship between a system and its components is misunderstood or oversimplified. For example, the following statement is not valid: All of the components in a system have a 90% reliability at a given time, thus the reliability of the system is 90% for that time. Unfortunately, poor understanding of the relationship between a system and its constituent components can result in statements like this being accepted as factual, when in reality they are false.
Block diagrams are widely used in engineering and science and exist in many different forms. They can also be used to describe the interrelation between the components and to define the system. When used in this fashion, the block diagram is then referred to as a reliability block diagram (RBD).
A reliability block diagram is a graphical representation of the components of the system and how they are reliability-wise related (connected). (Note: One can also think of an RBD as a logic diagram for the system based on its characteristics.) It should be noted that this may differ from how the components are physically connected. An RBD of a simplified computer system with a redundant fan configuration is shown in the following figure.
After defining the properties of each block in a system, the blocks can then be connected in a reliability-wise manner to create a reliability block diagram for the system. The RBD provides a visual representation of the way the blocks are reliability-wise arranged. This means that a diagram will be created that represents the functioning state ( i.e. success or failure) of the system in terms of the functioning states of its components. In other words, this diagram demonstrates the effect of the success or failure of a component on the success or failure of the system. For example, if all components in a system must succeed in order for the system to succeed, the components will be arranged reliability-wise in series . If one of two components must succeed in order for the system to succeed, those two components will be arranged reliability-wise in parallel.
The reliability-wise arrangement of components is directly related to the derived mathematical description of the system. The mathematical description of the system is the key to the determination of the reliability of the system. In fact, the system's reliability function is that mathematical description (obtained using probabilistic methods) and it defines the system reliability in terms of the component reliabilities. The result is an analytical expression that describes the reliability of the system as a function of time based on the reliability functions of its components.
Systems can be generally classified into non-repairable and repairable systems. Non-repairable systems are those that do not get repaired when they fail. Specifically, the components of the system are not repaired or replaced when they fail. Most household products, for example, are non-repairable. This does not necessarily mean that they cannot be repaired, rather that it does not make economic sense to do so. For example, repairing a four-year-old microwave oven is economically unreasonable, since the repair would cost approximately as much as purchasing a new unit.
On the other hand, repairable systems are those that get repaired when they fail. This is done by repairing or replacing the failed components in the system. An automobile is an example of a repairable system. If the automobile is rendered inoperative when a component or subsystem fails, that component is typically repaired or replaced rather than purchasing a new automobile. In repairable systems, two types of distributions are considered: failure distributions and repair distributions. A failure distribution describes the time it takes for a component to fail. A repair distribution describes the time it takes to repair a component (time-to-repair instead of time-to-failure). In the case of repairable systems, the failure distribution itself is not a sufficient measure of system performance, since it does not account for the repair distribution. A new performance criterion called availability can be calculated, which accounts for both the failure and repair distributions.
Three subsystems are reliability-wise in series and make up a system. Subsystem 1 has a reliability of 99.5%, subsystem 2 has a reliability of 98.7% and subsystem 3 has a reliability of 97.3% for a mission of 100 hours. What is the overall reliability of the system for a 100 hour mission?
Solution to RBDs and Analytical System Reliability Example
Since the reliabilities of the subsystems are specified for 100 hours, the reliability of the system for a 100 hour mission is simply:
Effect of Component Reliability in a Series System
In a series configuration, the component with the smallest reliability has the biggest effect on the system's reliability. There is a saying that "a chain is only as strong as its weakest link." This is a good example of the effect of a component in a series system. In a chain, all the rings are in series and if any of the rings break, the system fails. In addition, the weakest link in the chain is the one that will break first. The weakest link dictates the strength of the chain in the same way that the weakest component/subsystem dictates the reliability of a series system. As a result, the reliability of a series system is always less than the reliability of the least reliable component.
In a simple parallel system, at least one of the units must succeed for the system to succeed. Units in parallel are also referred to as redundant units. Redundancy is a very important aspect of system design and reliability in that adding redundancy is one of several methods of improving system reliability. It is widely used in the aerospace industry and generally used in mission critical systems. Other example applications include the RAID computer hard drive systems, brake systems and support cables in bridges.
The probability of failure, or unreliability, for a system with n statistically independent parallel components is the probability that unit 1 fails and unit 2 fails and all of the other units in the system fail. So in a parallel system, all n units must fail for the system to fail. Put another way, if unit 1 succeeds or unit 2 succeeds or any of the n units succeeds, then the system succeeds. The unreliability of the system is then given by:
Observe the contrast with the series system, in which the system reliability was the product of the component reliabilities; whereas the parallel system has the overall system unreliability as the product of the component unreliabilities.
The reliability of the parallel system is then given by:
Consider a system consisting of three subsystems arranged reliability-wise in parallel. Subsystem 1 has a reliability of 99.5%, Subsystem 2 has a reliability of 98.7% and Subsystem 3 has a reliability of 97.3% for a mission of 100 hours. What is the overall reliability of the system for a 100 hour mission?
Solution to RBDs and Analytical System Reliability Example
Since the reliabilities of the subsystems are specified for 100 hours, the reliability of the system for a 100 hour mission is:
Effect of Component Reliability in a Parallel Configuration
When we examined a system of components in series, we found that the least reliable component has the biggest effect on the reliability of the system. However, the component with the highest reliability in a parallel configuration has the biggest effect on the system's reliability, since the most reliable component is the one that will most likely fail last. This is a very important property of the parallel configuration, specifically in the design and improvement of systems.
To properly deal with repairable systems , we need to first understand how components in these systems are restored ( i.e. the maintenance actions that the components undergo). In general, maintenance is defined as any action that restores failed units to an operational condition or retains non-failed units in an operational state. For repairable systems, maintenance plays a vital role in the life of a system. It affects the system's overall reliability , availability , downtime , cost of operation , etc. Generally, maintenance actions can be divided into three types: corrective maintenance, preventive maintenance and inspections.
Corrective maintenance consists of the action(s) taken to restore a failed system to operational status. This usually involves replacing or repairing the component that is responsible for the failure of the overall system. Corrective maintenance is performed at unpredictable intervals because a component's failure time is not known a priori . The objective of corrective maintenance is to restore the system to satisfactory operation within the shortest possible time. Corrective maintenance is typically carried out in three steps:
Diagnosis of the problem.
The maintenance technician must take time to locate the failed parts or otherwise satisfactorily assess the cause of the system failure.
Repair and/or replacement of faulty component(s).
Once the cause of system failure has been determined, action must be taken to address the cause, usually by replacing or repairing the components that caused the system to fail.
Verification of the repair action.
Once the components in question have been repaired or replaced, the maintenance technician must verify that the system is again successfully operating.
Preventive maintenance, unlike corrective maintenance, is the practice of replacing components or subsystems before they fail in order to promote continuous system operation. The schedule for preventive maintenance is based on observation of past system behavior, component wear-out mechanisms and knowledge of which components are vital to continued system operation. Cost is always a factor in the scheduling of preventive maintenance. (Note: Reliability can also be a factor but cost is a more general term because reliability and risk can be expressed in terms of cost.) In many circumstances, it is financially more sensible to replace parts or components that have not failed at predetermined intervals rather than to wait for a system failure that may result in a costly disruption in operations.
Inspections are used in order to uncover hidden failures (also called dormant failures). In general, no maintenance action is performed on the component during an inspection unless the component is found failed, in which case a corrective maintenance action is initiated. However, there might be cases where a partial restoration of the inspected item would be performed during an inspection. For example, when checking the motor oil in a car between scheduled oil changes, one might occasionally add some oil in order to keep it at a constant level.
Maintenance actions (preventive or corrective) are not instantaneous. There is a time associated with each action, i.e. the amount of time it takes to complete the action. This time is usually referred to as downtime and it is defined as the length of time an item is not operational. There are a number of different factors that can affect the length of downtime, such as the physical characteristics of the system, spare part availability, repair crew availability, human factors, environmental factors, etc. Downtime can be divided into two categories based on these factors:
This is the time during which the equipment is inoperable but not yet undergoing repair. This could be due to the time it takes for replacement parts to be shipped, administrative processing time, etc.
This is the time during which the equipment is inoperable and actually undergoing repair. In other words, the active downtime is the time it takes repair personnel to perform a repair or replacement. The length of the active downtime is greatly dependent on human factors, as well as on the design of the equipment. For example, the ease of accessibility of components in a system has a direct effect on the active downtime.
These downtime definitions are subjective and not necessarily mutually exclusive nor all-inclusive. As an example, consider the time required to diagnose the problem. One may need to diagnose the problem before ordering parts and then wait for the parts to arrive.
The influence of a variety of different factors on downtime results in the fact that the time it takes to repair/restore a specific item is not generally constant. That is, the time-to-repair is a random variable, much like the time-to-failure. The statement that "it takes on average five hours to repair" implies an underlying probabilistic distribution. Distributions that describe the time-to-repair are called repair distributions (or downtime distributions ) in order to distinguish them from the failure distributions . However, the methods employed to quantify these distributions are not any different mathematically than the methods employed to quantify failure distributions. The difference is in how they are employed, i.e. the events they describe and metrics utilized. As an example, when using a life distribution with failure data ( i.e. the event modeled was time-to-failure), unreliability provides the probability that the event (failure) will occur by that time, while reliability provides the probability that the event (failure) will not occur. In the case of downtime distributions, the data set consists of times-to-repair, thus what we termed as unreliability now becomes the probability of the event occurring ( i.e. repairing the component). Using these definitions, the probability of repairing the component by a given time, t , is also called the component's maintainability .
Maintainability is defined as the probability of performing a successful repair action within a given time. In other words, maintainability measures the ease and speed with which a system can be restored to operational status after a failure occurs. For example, if it is said that a particular component has a 90% maintainability in one hour, this means that there is a 90% probability that the component will be repaired within an hour. In maintainability, the random variable is time-to-repair, in the same manner as time-to-failure is the random variable in reliability. As an example, consider the maintainability equation for a system in which the repair times are distributed exponentially. Its maintainability M ( t ) is given by:
Note the similarity between this equation and the equation for the reliability of a system with exponentially distributed failure times. However, since the maintainability represents the probability of an event occurring (repairing the system) while the reliability represents the probability of an event not occurring (failure), the maintainability expression is the equivalent of the unreliability expression, (1 - R ). Furthermore, the single model parameter, Mu, is now referred to as the repair rate, which is analogous to the failure rate ,Lambda, used in reliability for an exponential distribution.
Similarly, the mean of the distribution can be obtained by:
If one considers both reliability (probability that the item will not fail) and maintainability (the probability that the item is successfully restored after failure), then an additional metric is needed for the probability that the component/system is operational at a given time, t ( i.e. has not failed or it has been restored after failure). This metric is availability . Availability is a performance criterion for repairable systems that accounts for both the reliability and maintainability properties of a component or system. It is defined as the probability that the system is operating properly when it is requested for use. That is, availability is the probability that a system is not failed or undergoing a repair action when it needs to be used. For example, if a lamp has a 99.9% availability, there will be one time out of a thousand that someone needs to use the lamp and finds out that the lamp is not operational either because the lamp is burned out or the lamp is in the process of being replaced. (Note: Availability is always associated with time, much like reliability and maintainability. As we will see in later sections, there are different availability classifications and for some of which, the definition depends on the time under consideration. Since no discussion about these classifications has been made yet, the time variable has been left out of this 99.9% availability statement.)
This metric alone tells us nothing about how many times the lamp has been replaced. For all we know, the lamp may be replaced every day or it could have never been replaced at all. Other metrics are still important and needed, such as the lamp's reliability. The next table illustrates the relationship between reliability, maintainability and availability.
Achieved availability is very similar to inherent availability with the exception that preventive maintenance (PM) downtimes are also included. Specifically, it is the steady state availability when considering corrective and preventive downtime of the system. It can be computed by looking at the mean time between maintenance actions, MTBM and the mean maintenance downtime, or:
Where the operating cycle is the overall time period of operation being investigated and uptime is the total time the system was functioning during the operating cycle. (Note: The operational availability is a function of time, t, or operating cycle.)
When there is no specified logistic downtime or preventive maintenance, this returns the Mean Availability of the system.
The operational availability is the availability that the customer actually experiences. It is essentially the a posteriori availability based on actual events that happened to the system. The previous availability definitions are a priori estimations based on models of the system failure and downtime distributions. In many cases, operational availability cannot be controlled by the manufacturer due to variation in location, resources and other factors that are the sole province of the end user of the product.
As an example, consider the following scenario. A diesel power generator is supplying electricity at a research site in Antarctica. The personnel are not satisfied with the generator. They estimated that in the past six months, they were without electricity due to generator failure for an accumulated time of 1.5 months.
Therefore, the operational availability of the diesel generator experienced by the personnel of the station is:
A life cycle cost analysis involves the analysis of the costs of a system or a component over its entire life span. Typical costs for a system may include:
Acquisition costs (or design and development costs).
Cost of failures.
Cost of repairs.
Cost for spares.
Loss of production.
A complete life cycle cost (LCC) analysis may also include other costs, as well as other accounting/financial elements (such as discount rates, interest rates, depreciation, present value of money, etc.).
For the purpose of this reference, it is sufficient to say that if one has all the required cost values (inputs), then a complete LCC analysis can be performed easily in a spreadsheet, since it really involves summations of costs and perhaps some computations involving interest rates. With respect to the cost inputs for such an analysis, the costs involved are either deterministic (such as acquisition costs, disposal costs etc.) or probabilistic (such as cost of failures, repairs, spares, downtime, etc.). Most of the probabilistic costs are directly related to the reliability and maintainability characteristics of the system.
The series reliability block (logic model) diagram is shown here:
The parallel reliability block (logic model) diagram is shown here:
All elements, (A,B,C,…,N) must work for equipment T, to work. The reliability of T is: R T = R A • R B • R C • … • R N = R T = R A + R B — R A R B = 1- (1- R A ) ( 1- R B ) At least one of the elements, (A,B) must work for equipment T, to work. The reliability of T is: A B C N T R A R B R C R N R T A B C T R A R B R C R T