Availability and reliability

2,710 views

Published on

Accompanies video on my YouTube channel on system availability and reliability

Published in: Technology, Business
  • thank u so much !!!
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Availability and reliability

  1. 1. Availability and Reliability Availability and reliability, 2013 Slide 1
  2. 2. Principal dependability properties Availability and reliability, 2013 Slide 2
  3. 3. • Reliability – The probability of failure-free system operation over a specified time in a given environment for a given purpose Availability and reliability, 2013 Slide 3
  4. 4. • Availability – The probability that a system, at a point in time, will be operational and able to deliver the requested services Availability and reliability, 2013 Slide 4
  5. 5. Availability specification • Both reliability and availability attributes can be expressed as numbers: – Availability of 0.999 means that the system is up and running for 99.9% of the time; Availability and reliability, 2013 Slide 5
  6. 6. Reliability specification • Probability of failure on demand (POFOD) of 0.0001 means that on average 1 in 10, 000 demands for service from a system will fail in some way Availability and reliability, 2013 Slide 6
  7. 7. Availability and reliability • Availability and reliability are closely related – Obviously if a system is unavailable it is not delivering the specified system services. Availability and reliability, 2013 Slide 7
  8. 8. • However, it is possible to have systems with low reliability that must be available. – So long as system failures can be repaired quickly and does not damage data, some system failures may not be a problem. Availability and reliability, 2013 Slide 8
  9. 9. • Availability is therefore best considered as a separate attribute reflecting whether or not the system can deliver its services. • Availability takes repair time into account, if the system has to be taken out of service to repair faults. Availability and reliability, 2013 Slide 9
  10. 10. Availability perception • Availability is usually expressed as a percentage of the time that the system is available to deliver services e.g. 99.9%. Availability and reliability, 2013 Slide 10
  11. 11. Availability and reliability, 2013 Slide 11
  12. 12. Subjective availability • The number of users affected by the service outage. – Loss of service in the middle of the night is less important for many systems than loss of service during peak usage periods. Availability and reliability, 2013 Slide 12
  13. 13. • The length of the outage. – The longer the outage, the more the disruption. Several short outages are less likely to be disruptive than 1 long outage. Long repair times are a particular problem. Availability and reliability, 2013 Slide 13
  14. 14. Reliability metrics • Probability of failure on demand (POFOD) – Probability that a system will not deliver a service correctly when requested – Used for systems where demands are infrequent and intermittent Availability and reliability, 2013 Slide 14
  15. 15. • Rate of occurrence of failure (ROCOF) – Number of system failures in a given time period – Used for transaction processing systems with frequent and regular transactions Availability and reliability, 2013 Slide 15
  16. 16. • Fault – • Error – • A characteristic of a software system that can lead to a system error. An erroneous system state that can lead to system behavior that is unexpected by system users. Failure – An event that occurs at some point in time when the system does not deliver a service as expected by its users. Availability and reliability, 2013 Slide 16
  17. 17. Faults-errors-failures Fault Error Failure Availability and reliability, 2013 Slide 17
  18. 18. Faults and failures • Failures are a usually a result of system errors. • The incorrect state causes undesirable system behaviour • Incorrect state is a consequence of executing faulty code Availability and reliability, 2013 Slide 18
  19. 19. • However, faults do not necessarily result in system errors – The erroneous system state resulting from the fault may be transient and ‘corrected’ before an error arises. – The faulty code may never be executed. Availability and reliability, 2013 Slide 19
  20. 20. • Errors do not necessarily lead to system failures – The error can be corrected by built-in error detection and recovery – The failure can be protected against by built-in protection facilities. These may, for example, protect system resources from system errors Availability and reliability, 2013 Slide 20
  21. 21. Reliability achievement • Fault avoidance – Development technique are used that either minimise the possibility of mistakes or trap mistakes before they result in the introduction of system faults. Availability and reliability, 2013 Slide 21
  22. 22. • Fault detection and removal – Verification and validation techniques that increase the probability of detecting and correcting errors before the system goes into service are used. Availability and reliability, 2013 Slide 22
  23. 23. • Fault tolerance – Run-time techniques are used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures. Availability and reliability, 2013 Slide 23
  24. 24. Summary • Availability is the probability that a system will be available when a service request is made • Reliability is the probablity that a system will deliver a service as expected by users Availability and reliability, 2013 Slide 24
  25. 25. Summary • Software faults lead to state errors lead to operational failures • Fault avoidance, detection and tolerance are strategies for achieving reliability Availability and reliability, 2013 Slide 25

×