2. 12.0 Reliability
A reliable DDBMS is one that can continue to process user requests even when the underlying system is unreliable, i.e.,
when failures occur
Data replication + Easy scaling = Reliable system
Distribution enhances system reliability, but it is not enough by itself
◦ A number of protocols must be implemented to exploit distribution and replication
Reliability is closely related to the problem of how to maintain the atomicity and durability properties of transactions.
3. 12.1 Reliability Concepts and Measures
◦ 12.1.1 System, State and Failure
◦ Reliability refers to a system that consists of a set of components.
◦ The system has a state, which changes as the system operates.
◦ The behavior of the system is laid out in an authoritative specification, which indicates the valid behavior for each system state.
◦ Any deviation of a system from the behavior described in the specification is considered a failure.
◦ An internal state of a system such that there exist circumstances in which further processing, by the
normal algorithms of the system, will lead to a failure which is not attributed to a subsequent fault, is
called an erroneous state.
◦ The part of the state which is incorrect is an error.
◦ An error in the internal states of the components of a system or in the design of a system is a fault.
Fundamental Definitions
4. 12.1 Reliability Concepts and Measures
(contd..)
Faults to error (figure): Fault --causes--> Error --results in--> Failure
5. 12.1 Reliability Concepts and Measures
(contd..)
◦ Hard faults
◦ Permanent
◦ Resulting failures are called hard failures
◦ Soft faults
◦ Transient or intermittent
◦ Account for more than 90% of all failures
◦ Resulting failures are called soft failures
Types of faults
6. 12.1 Reliability Concepts and Measures
(contd..)
Classification of Faults
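(Figure: sources of system failure)
◦ Fault sources: permanent fault, incorrect design, unstable or marginal components, unstable environment, operator mistake
◦ Resulting errors: permanent error, intermittent error, transient error
◦ Every error type can lead to a system failure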
7. 12.1 Reliability Concepts and Measures
(contd..)
◦ 12.1.2 Reliability and Availability
◦ Reliability:
◦ A measure of success with which a system conforms to some authoritative specification of its
behavior.
◦ Probability that the system has not experienced any failures within a given time period.
◦ Typically used to describe systems that cannot be repaired or where the continuous operation of
the system is critical.
◦ Availability:
◦ The fraction of the time that a system meets its specification.
◦ The probability that the system is operational at a given time t.
Fault tolerant measures
8. 12.1 Reliability Concepts and Measures
(contd..)
The reliability of a system is R(t) = Pr {0 failures in time [0,t] | no failures at t=0}
If the occurrence of failures is Poisson, then
R(t) = Pr {0 failures in time [0,t]}
and
$$\Pr\{k \text{ failures in time } [0,t]\} = \frac{e^{-m(t)}\,[m(t)]^{k}}{k!}$$
where
$$m(t) = \int_{0}^{t} z(x)\,dx$$
z(x) is known as the hazard function, which gives the time-dependent failure rate of the component
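A minimal sketch of how R(t) could be evaluated numerically. It assumes the Poisson model, in which Pr{k failures in [0,t]} = e^{-m(t)} [m(t)]^k / k! with m(t) the integral of the hazard function z(x) over [0,t]. The function names, the trapezoidal integration, and the constant hazard of 0.001 failures per hour are illustrative assumptions, not part of the slides.

```python
import math

def expected_failures(z, t, steps=10_000):
    """Approximate m(t), the integral of z(x) over [0, t], with the trapezoidal rule."""
    h = t / steps
    total = 0.5 * (z(0.0) + z(t))
    for i in range(1, steps):
        total += z(i * h)
    return total * h

def prob_k_failures(z, t, k):
    """Poisson probability of exactly k failures in [0, t]."""
    m = expected_failures(z, t)
    return math.exp(-m) * m ** k / math.factorial(k)

def reliability(z, t):
    """R(t): probability of zero failures in [0, t]."""
    return prob_k_failures(z, t, 0)

# Hypothetical constant hazard: 0.001 failures per hour, so m(1000) = 1.
constant_hazard = lambda x: 0.001
r_1000 = reliability(constant_hazard, 1000.0)
print(f"R(1000 h) = {r_1000:.4f}")
```

With a constant hazard, m(t) grows linearly and R(t) decays exponentially. A time-varying z(x) can be passed in the same way.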
9. 12.1 Reliability Concepts and Measures
(contd..)
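The mean number of failures in time [0, t] can be computed as
$$E[k] = \sum_{k=0}^{\infty} k\,\frac{e^{-m(t)}\,[m(t)]^{k}}{k!} = m(t)$$
and the variance can be computed as
$$\mathrm{Var}[k] = E[k^{2}] - (E[k])^{2} = m(t)$$
where m(t) is the integral of the hazard function z(x) over [0, t]. Thus, the reliability of a single component is
$$R(t) = e^{-m(t)}$$
and that of a system consisting of n non-redundant components is
$$R_{sys}(t) = \prod_{i=1}^{n} R_{i}(t)$$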
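As a rough illustration of the non-redundant case, the sketch below multiplies component reliabilities, taking R_i(t) = e^{-m_i(t)} for each component. It assumes constant hazards, so that m_i(t) = rate_i * t. The function names and the three failure rates are hypothetical.

```python
import math

def component_reliability(failure_rate, t):
    """R_i(t) = exp(-rate * t), assuming a constant hazard for the component."""
    return math.exp(-failure_rate * t)

def series_reliability(failure_rates, t):
    """Reliability of n non-redundant components: every one of them must survive."""
    r = 1.0
    for rate in failure_rates:
        r *= component_reliability(rate, t)
    return r

# Hypothetical per-hour failure rates for three components.
rates = [0.001, 0.002, 0.003]
r_sys = series_reliability(rates, 100.0)
print(f"R_sys(100 h) = {r_sys:.4f}")
```

Because the factors are all at most 1, adding a non-redundant component can only lower the system reliability.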
10. 12.1 Reliability Concepts and Measures
(contd..)
◦ Availability, A(t), refers to the probability that the system is operational according to its specification at a
given point in time t. Several failures may have occurred prior to time t, but if they have all been repaired, the
system is available at time t.
Availability applies to systems that can be repaired
Assumptions:
◦ Poisson failures with rate λ
◦ Repair time is exponentially distributed with mean 1/μ
Then, the steady-state availability is
$$A = \lim_{t \to \infty} A(t) = \frac{\mu}{\lambda + \mu}$$
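A small sketch of the steady-state result under the assumptions listed above, taking the availability as μ / (λ + μ) for failure rate λ and repair rate μ. The function name and the example rates are hypothetical.

```python
def steady_state_availability(failure_rate, repair_rate):
    """A = mu / (lambda + mu) for Poisson failures and exponential repair times."""
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("failure_rate must be >= 0 and repair_rate must be > 0")
    return repair_rate / (failure_rate + repair_rate)

# Hypothetical rates: one failure per 1000 hours, mean repair time of 10 hours.
a = steady_state_availability(failure_rate=0.001, repair_rate=0.1)
print(f"steady-state availability = {a:.4f}")
```

A system that never fails (λ = 0) has availability 1, and availability falls as repairs become slower relative to failures.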
11. 12.1 Reliability Concepts and Measures
(contd..)
◦ 12.1.3 Mean Time between Failures/Mean Time to Repair
◦ MTBF: Mean Time Between Failures
◦ MTTR: Mean Time To Repair
◦ Using these two metrics, the steady-state availability of a system with exponential failure and repair rates can be specified as
$$A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$$
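One way these metrics might be estimated from an operations log is sketched below, using the availability MTBF / (MTBF + MTTR). It assumes each uptime entry is the operating interval between the end of one repair and the next failure. The function name and the sample durations (in hours) are made up for illustration.

```python
from statistics import mean

def availability_from_log(uptimes, repair_times):
    """Estimate MTBF and MTTR as sample means, then A = MTBF / (MTBF + MTTR)."""
    mtbf = mean(uptimes)       # average operating time between failures
    mttr = mean(repair_times)  # average time spent repairing
    return mtbf, mttr, mtbf / (mtbf + mttr)

# Hypothetical observations, in hours.
uptimes = [990.0, 1010.0, 1000.0]
repair_times = [8.0, 12.0, 10.0]
mtbf, mttr, availability = availability_from_log(uptimes, repair_times)
print(f"MTBF = {mtbf} h, MTTR = {mttr} h, A = {availability:.4f}")
```

With exponential failure and repair times, MTBF = 1/λ and MTTR = 1/μ, so this estimate agrees with μ / (λ + μ).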
12. 12.1 Reliability Concepts and Measures
(contd..)
◦ System failures may be latent, in that a failure is typically detected some time after its occurrence. This period is
called error latency, and the average error latency time over several identical systems is called mean time to
detect (MTTD).