Availability and Reliability

Availability and reliability, 2013

Slide 1
Principal dependability
properties

Availability and reliability, 2013

Slide 2
• Reliability
– The probability of failure-free system
operation over a specified time in a
given environment for a given purpose

Availability and reliability, 2013

Slide 3
• Availability
– The probability that a system, at a point in
time, will be operational and able to deliver
the requested services

Availability and reliability, 2013

Slide 4
Availability specification
• Both reliability and availability attributes
can be expressed as numbers:
– Availability of 0.999 means that the system
is up and running for 99.9% of the time;

Availability and reliability, 2013

Slide 5
Reliability specification
• Probability of failure on demand
(POFOD) of 0.0001 means that on
average 1 in 10, 000 demands for
service from a system will fail in some
way

Availability and reliability, 2013

Slide 6
Availability and reliability
• Availability and reliability are closely
related
– Obviously if a system is unavailable it is not
delivering the specified system services.

Availability and reliability, 2013

Slide 7
• However, it is possible to have systems
with low reliability that must be
available.
– So long as system failures can be repaired
quickly and does not damage data, some
system failures may not be a problem.

Availability and reliability, 2013

Slide 8
• Availability is therefore best considered
as a separate attribute reflecting
whether or not the system can deliver its
services.
• Availability takes repair time into
account, if the system has to be taken
out of service to repair faults.
Availability and reliability, 2013

Slide 9
Availability perception
• Availability is usually expressed as a
percentage of the time that the system
is available to deliver services e.g.
99.9%.

Availability and reliability, 2013

Slide 10
Availability and reliability, 2013

Slide 11
Subjective availability
• The number of users affected by the
service outage.
– Loss of service in the middle of the night is
less important for many systems than loss
of service during peak usage periods.

Availability and reliability, 2013

Slide 12
• The length of the outage.
– The longer the outage, the more the
disruption. Several short outages are less
likely to be disruptive than 1 long outage.
Long repair times are a particular problem.

Availability and reliability, 2013

Slide 13
Reliability metrics
• Probability of failure on demand
(POFOD)
– Probability that a system will not deliver a
service correctly when requested
– Used for systems where demands are
infrequent and intermittent
Availability and reliability, 2013

Slide 14
• Rate of occurrence of failure (ROCOF)
– Number of system failures in a given time
period
– Used for transaction processing systems
with frequent and regular transactions

Availability and reliability, 2013

Slide 15
•

Fault
–

•

Error
–

•

A characteristic of a software system that can lead to a
system error.

An erroneous system state that can lead to system behavior
that is unexpected by system users.

Failure
–

An event that occurs at some point in time when the system
does not deliver a service as expected by its users.

Availability and reliability, 2013

Slide 16
Faults-errors-failures
Fault
Error

Failure
Availability and reliability, 2013

Slide 17
Faults and failures
• Failures are a usually a result of system
errors.

• The incorrect state causes undesirable
system behaviour
• Incorrect state is a consequence of
executing faulty code
Availability and reliability, 2013

Slide 18
• However, faults do not necessarily
result in system errors
– The erroneous system state resulting from
the fault may be transient and ‘corrected’
before an error arises.
– The faulty code may never be executed.

Availability and reliability, 2013

Slide 19
• Errors do not necessarily lead to system
failures
– The error can be corrected by built-in error
detection and recovery
– The failure can be protected against by
built-in protection facilities. These may, for
example, protect system resources from
system errors
Availability and reliability, 2013

Slide 20
Reliability achievement
• Fault avoidance
– Development technique are used that
either minimise the possibility of
mistakes or trap mistakes before they
result in the introduction of system
faults.
Availability and reliability, 2013

Slide 21
• Fault detection and removal
– Verification and validation techniques
that increase the probability of
detecting and correcting errors before
the system goes into service are
used.
Availability and reliability, 2013

Slide 22
• Fault tolerance
– Run-time techniques are used to
ensure that system faults do not result
in system errors and/or that system
errors do not lead to system failures.

Availability and reliability, 2013

Slide 23
Summary
• Availability is the probability that a
system will be available when a service
request is made
• Reliability is the probablity that a system
will deliver a service as expected by
users
Availability and reliability, 2013

Slide 24
Summary
• Software faults lead to state errors lead
to operational failures
• Fault avoidance, detection and
tolerance are strategies for achieving
reliability
Availability and reliability, 2013

Slide 25

Availability and reliability

  • 1.
    Availability and Reliability Availabilityand reliability, 2013 Slide 1
  • 2.
  • 3.
    • Reliability – Theprobability of failure-free system operation over a specified time in a given environment for a given purpose Availability and reliability, 2013 Slide 3
  • 4.
    • Availability – Theprobability that a system, at a point in time, will be operational and able to deliver the requested services Availability and reliability, 2013 Slide 4
  • 5.
    Availability specification • Bothreliability and availability attributes can be expressed as numbers: – Availability of 0.999 means that the system is up and running for 99.9% of the time; Availability and reliability, 2013 Slide 5
  • 6.
    Reliability specification • Probabilityof failure on demand (POFOD) of 0.0001 means that on average 1 in 10, 000 demands for service from a system will fail in some way Availability and reliability, 2013 Slide 6
  • 7.
    Availability and reliability •Availability and reliability are closely related – Obviously if a system is unavailable it is not delivering the specified system services. Availability and reliability, 2013 Slide 7
  • 8.
    • However, itis possible to have systems with low reliability that must be available. – So long as system failures can be repaired quickly and does not damage data, some system failures may not be a problem. Availability and reliability, 2013 Slide 8
  • 9.
    • Availability istherefore best considered as a separate attribute reflecting whether or not the system can deliver its services. • Availability takes repair time into account, if the system has to be taken out of service to repair faults. Availability and reliability, 2013 Slide 9
  • 10.
    Availability perception • Availabilityis usually expressed as a percentage of the time that the system is available to deliver services e.g. 99.9%. Availability and reliability, 2013 Slide 10
  • 11.
  • 12.
    Subjective availability • Thenumber of users affected by the service outage. – Loss of service in the middle of the night is less important for many systems than loss of service during peak usage periods. Availability and reliability, 2013 Slide 12
  • 13.
    • The lengthof the outage. – The longer the outage, the more the disruption. Several short outages are less likely to be disruptive than 1 long outage. Long repair times are a particular problem. Availability and reliability, 2013 Slide 13
  • 14.
    Reliability metrics • Probabilityof failure on demand (POFOD) – Probability that a system will not deliver a service correctly when requested – Used for systems where demands are infrequent and intermittent Availability and reliability, 2013 Slide 14
  • 15.
    • Rate ofoccurrence of failure (ROCOF) – Number of system failures in a given time period – Used for transaction processing systems with frequent and regular transactions Availability and reliability, 2013 Slide 15
  • 16.
    • Fault – • Error – • A characteristic ofa software system that can lead to a system error. An erroneous system state that can lead to system behavior that is unexpected by system users. Failure – An event that occurs at some point in time when the system does not deliver a service as expected by its users. Availability and reliability, 2013 Slide 16
  • 17.
  • 18.
    Faults and failures •Failures are a usually a result of system errors. • The incorrect state causes undesirable system behaviour • Incorrect state is a consequence of executing faulty code Availability and reliability, 2013 Slide 18
  • 19.
    • However, faultsdo not necessarily result in system errors – The erroneous system state resulting from the fault may be transient and ‘corrected’ before an error arises. – The faulty code may never be executed. Availability and reliability, 2013 Slide 19
  • 20.
    • Errors donot necessarily lead to system failures – The error can be corrected by built-in error detection and recovery – The failure can be protected against by built-in protection facilities. These may, for example, protect system resources from system errors Availability and reliability, 2013 Slide 20
  • 21.
    Reliability achievement • Faultavoidance – Development technique are used that either minimise the possibility of mistakes or trap mistakes before they result in the introduction of system faults. Availability and reliability, 2013 Slide 21
  • 22.
    • Fault detectionand removal – Verification and validation techniques that increase the probability of detecting and correcting errors before the system goes into service are used. Availability and reliability, 2013 Slide 22
  • 23.
    • Fault tolerance –Run-time techniques are used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures. Availability and reliability, 2013 Slide 23
  • 24.
    Summary • Availability isthe probability that a system will be available when a service request is made • Reliability is the probablity that a system will deliver a service as expected by users Availability and reliability, 2013 Slide 24
  • 25.
    Summary • Software faultslead to state errors lead to operational failures • Fault avoidance, detection and tolerance are strategies for achieving reliability Availability and reliability, 2013 Slide 25