Software reliability engineering

Published in: Technology, Business

  1. 1. Software Reliability Engineering Mark Turner
  2. 2. Topics Covered in this Presentation What software reliability engineering is and why it is needed. Defining software reliability targets. Operational profiles. Reliability risk management. Code inspection. Software testing. Reliable system design. Reliability modeling. Reliability demonstration.
  3. 3. INTRODUCTION What Software Reliability Engineering is and why it is needed
  4. 4. Different Views of Reliability Product development teams: View reliability at the sub-domain level, addressing mechanical, electronic and software issues. Customers: View reliability at the system level, with minimal consideration placed on sub-domain distinction. The primary measure of reliability is defined by the customer. To develop a reliable product, engineering teams must consider both views (system and sub-domain). System = Mechanical Reliability + Electronic Reliability + Software Reliability. Although this presentation focuses on software reliability engineering, it should be viewed as a component part of an overall Design for Reliability process, not as a disparate activity, because hardware/software interactions may otherwise be missed. This presentation does not make any distinction between software and firmware; the same techniques apply equally to both.
  5. 5. System-Level Reliability Modeling (1 of 2) A system is made up of components/sub-systems; each has its own inherent reliability. Software R = 0.99; Computer Server R = 0.9665. A “traditional” reliability program may include modeling, evaluation and testing to prove that the hardware meets the reliability target, but software should not be forgotten as it is a system component. Individually the hardware and software may meet the reliability target, but they also have to meet it when they are combined. For components in series, System Reliability = H/W Reliability × S/W Reliability. E.g., H/W = 0.9665, S/W = 0.99, System = 0.9665 × 0.99 = 0.9568. System Reliability = 95.68%
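The series calculation on this slide can be sketched in a few lines. This is only an illustration, assuming the hardware and software fail independently; the function name `series_reliability` is hypothetical and not taken from the presentation.

```python
from math import prod

def series_reliability(component_reliabilities):
    """Reliability of a series system: every component must work."""
    for r in component_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must be in [0, 1], got {r}")
    # Assumption: component failures are independent, so reliabilities multiply.
    return prod(component_reliabilities)

hardware = 0.9665  # computer server reliability from the slide
software = 0.99    # software reliability from the slide
system = series_reliability([hardware, software])
print(f"System reliability = {system:.4f}")
```

Adding a further component to the list can only lower the result, which is why a software element that is left out of the model makes the system look more reliable than it is.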
  6. 6. System-Level Reliability Modeling (2 of 2) Therefore the software reliability should also be accounted for in the system-level reliability model. Software may consist of both the operating system (OS) and configurable (turnkey) software. It may not be possible to influence the OS design, but turnkey software can be focused on. This may consist of re-used software such as library functions and newly developed software. If the reliability of the library functions is already understood then library function re-use simplifies the software reliability engineering process.
  7. 7. What is Software Reliability Engineering (SRE)? The quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability. SRE has been adopted either as standard or as best practice by more than 50 organizations in their software projects including AT&T, Lucent, IBM, NASA and Microsoft, plus many others worldwide. This presentation will provide an introduction to software reliability engineering…..
  8. 8. Why is SRE Important? There are several key reasons a reliability engineering program should be implemented: So that it can be determined how satisfactorily products are functioning. Avoid over-designing – products could cost more than necessary and lower profit. If more features are added to meet customer demand then reliability should be monitored to ensure that defects are not designed in, which could impact reliability. If a customer’s product is not designed well, with reliability and quality in mind, then they may well turn to a COMPETITOR! Having a software reliability engineering process can make organizations more competitive as customers will always expect reliable software that is better and cheaper.
  9. 9. Why is SRE Beneficial? For Engineers: Managing customer demands: Enables software to be produced that is more reliable; built faster and cheaper. Makes engineers more successful in meeting customer demands. In turn this avoids conflicts – risk, pressure, schedule, functionality, cost etc. For the organization: Improves competitiveness. Reduces development costs. Provides customers with quantitative reliability metrics. Places less emphasis on tools and a greater emphasis on “designing in reliability.” Products can be developed that are delivered to the customer at the right time, at an acceptable cost, and with satisfactory reliability.
  10. 10. Common SRE Challenges Data is collected during test phases, so if problems are discovered it is too late for fundamental design changes to be made. Failure data collected during in-house testing may be limited, and may not represent failures that would be uncovered in the product’s actual operational environment. Reliability metrics derived from such restricted test data may be inaccurate. There are many possible models that can be used to predict the reliability of the software, which can be very confusing. Even if the correct model is selected there may be no way of validating it because there is insufficient field data.
  11. 11. Fault Lifecycle Techniques Prevent faults from being inserted. Avoids faults being designed into the software when it is being constructed. Remove faults that have been inserted. Detect and eliminate faults that have been inserted through inspection and test. Design the software so that it is fault tolerant. Provide redundant services so that the software continues to work even though faults have occurred or are occurring. Forecast faults and/or failures. Evaluate the code and estimate how many faults are present and the occurrences and consequences of software failures.
  12. 12. Preventing Faults From Being Inserted Initial approach for reliable software: A fault that is never created does not cost anything to fix. This should be the ultimate objective of software engineering. This requires: A formal requirement specification always being available that has been thoroughly reviewed and agreed to. Formal inspection and test methods being implemented and used. Early interaction with end-users (field trials) and requirement refinement if necessary. The correct analysis tools and disciplined tool use. Formal programming principles and environments that are enforced. Systematic techniques for software reuse. Formal software engineering processes and tools, if applied successfully, can be very effective in preventing faults (but are no guarantee!). However, software reuse without proper verification can result in disappointment.
  13. 13. Removing Faults When faults are injected into the software, the next method that can be used is fault removal. Approaches: Software inspection. Software testing. Both have become standard industry practices. This presentation will focus closely on these.
  14. 14. Fault Tolerance This is a survival attribute – the software has to continue to work even though a failure has occurred. Fault tolerance techniques enable a system to: Prevent dormant software faults from becoming active (e.g., defensive programming to check input and output conditions and forbid illegal operations). Contain software errors within a confined boundary to prevent them from propagating further (e.g., exception handling routines to treat unsuccessful operations). Recover software operations from erroneous conditions by using techniques such as checkpointing and rollback.
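The three techniques on this slide can be illustrated together in one small sketch. The `Account` class and `apply_batch` function are hypothetical examples invented for illustration, not part of the presentation: an input check stands in for defensive programming, an exception handler for containment, and a saved copy of the state for checkpointing and rollback.

```python
import copy

class Account:
    """Toy component used to illustrate the three techniques."""

    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        # Defensive programming: forbid illegal operations up front.
        if amount <= 0 or amount > self.balance:
            raise ValueError("illegal withdrawal amount")
        self.balance -= amount

def apply_batch(account, amounts):
    """Apply all withdrawals, or none of them."""
    checkpoint = copy.deepcopy(account.__dict__)  # checkpoint the state
    try:
        for amount in amounts:
            account.withdraw(amount)
        return True
    except ValueError:
        # Containment: the error is handled here and does not propagate.
        account.__dict__.update(checkpoint)       # roll back to the checkpoint
        return False

acct = Account(100)
ok = apply_batch(acct, [30, 20])      # both withdrawals are legal
failed = apply_batch(acct, [40, 40])  # the second one is illegal, so roll back
print(ok, failed, acct.balance)
```

After the failed batch the balance is the same as it was before the batch started, which is the point of the checkpoint.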
  15. 15. Fault/Failure Forecasting If software failures are likely to occur it is critical to estimate the number of failures and predict when each is likely to occur. This will help concentrate on failures that have the greatest probability of occurring, provide reliability improvement opportunities and improve customer satisfaction. Fault/failure forecasting requires: Defining a fault/failure relationship – why the failure occurs and its effect. Establishing a software reliability model. Developing procedures for measuring software reliability. Analyzing and evaluating the measurement results. Measuring software reliability provides: Useful metrics that can be used to plan further testing and debug efforts, to calculate warranty costs and plan further software releases. Determines when testing can be terminated.
  16. 16. SRE Process Overview This slide shows a general SRE process flow that has six major components: 1. Determine the reliability target. 2. Define a software operational profile. 3. Conduct code inspection. 4. Perform software testing. 5. Conduct reliability modelling to measure the software reliability, and continuously improve the software reliability until the target is reached. 6. Field reliability validation. [Flowchart boxes: Determine Reliability Objective; Define Operational Profile; Perform Code Inspection; Perform Software Testing; Collect Failure Data; Select Appropriate Software Model; Use Software Reliability Model(s) to Calculate Current Reliability; Reliability Objectives met?; Continue Testing; Software Release Acceptable from Reliability Perspective; Validate Field Reliability.]
  17. 17. SRE Terms Reliability objective: The product’s reliability goal from the customer’s viewpoint. Operational profile: A set of system operational scenarios with their associated probability of occurrence. This encourages testers to select test cases according to the system’s likely operational usage. Reliability modeling: This is an essential element of SRE that determines whether the product meets its reliability objective. One or more models can be used to calculate, from failure data collected during system testing, various estimates of a product’s reliability as a function of test time. It can also provide the following information: Product reliability at the end of various test phases. Amount of additional test time required to reach the product’s reliability objective. The reliability growth that is still required (ratio of initial to target reliability). Prediction of field reliability. Field Reliability Validation: Determination of whether the actual field reliability meets the customer’s target.
  18. 18. OBJECTIVES Defining software reliability targets
  19. 19. Software Reliability Objectives Reliability target(s) should be defined and used to: Manage customer expectations. Determine how reliability growth can and will be tracked throughout the program. Determine availability targets. Software reliability is commonly expressed as an availability metric rather than as a probabilistic reliability metric. Availability is defined as: $\text{Availability} = \frac{\text{Software uptime}}{\text{Software uptime} + \text{downtime}}$. A data collection and analysis methodology also has to be defined: How inspections will be conducted. How failure data will be collected. How the data will be analyzed, i.e., what model will be used? This helps project managers track metrics and plan resources.
  20. 20. Managing the Software Reliability Objective Defects are often inserted from the beginning of the project. The insertion rate is usually related to the intensity of the effort, i.e., the number of engineers working on the program, the project schedule and the various design decisions that are made. Defects are most often detected and addressed at a later date than the original design effort. If test efforts are relied on to discover most defects, this lag can have a negative impact on the program. The lag can be mitigated by using code inspection, conducted to IEEE 1028, but some testing will still be necessary, and there is still a lag between defect insertion and correction. The eventual defect rate represents the reliability target, and as defects are discovered and addressed the software reliability is increased, or grown; this is termed “Reliability Growth Management”.
  21. 21. Initial Reliability Growth Model - The Rayleigh Curve (1 of 3) The eventual goal should be to forecast the discovery rate of defects as a function of time throughout the software development program. This cannot be achieved until data from prior similar projects becomes available. This may take time but the effort provides value as it enables accurate forecasts to be achieved from the beginning of the project. Industry data is also available. This helps to manage customer expectations as it demonstrates a strategy for improving software reliability. To produce this curve, reliability data from prior software developments has to be available. Therefore this is a goal, not a technique that can be used immediately. To get to this stage, metrics need to be collected by using the methods discussed in this presentation. $f(t) = K \left(\frac{1}{\text{Peak}}\right)^{2} t \, e^{-\frac{1}{2\,\text{Peak}^{2}} t^{2}}$
  22. 22. The Rayleigh Curve (2 of 3) The model's cumulative distribution function (CDF) describes the total-to-date effort expended or defects found at each interval, and so returns the software reliability at various points in time. $F(t) = K \left[1 - e^{-\frac{1}{2\,\text{Peak}^{2}} t^{2}}\right]$
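The two Rayleigh expressions can be evaluated directly. The sketch below assumes that K is the total number of defects (or total effort) and that Peak is the time at which the discovery rate is highest, which is the usual reading of the Rayleigh model; the values of `K` and `PEAK` are made up for illustration and are not the data behind the presentation's example.

```python
import math

def rayleigh_pdf(t, k, peak):
    """Discovery rate f(t) = K * (1/peak)^2 * t * exp(-t^2 / (2 * peak^2))."""
    return k * (t / peak**2) * math.exp(-t**2 / (2 * peak**2))

def rayleigh_cdf(t, k, peak):
    """Cumulative total F(t) = K * (1 - exp(-t^2 / (2 * peak^2)))."""
    return k * (1.0 - math.exp(-t**2 / (2 * peak**2)))

K = 200.0    # hypothetical total number of defects
PEAK = 4.0   # hypothetical month in which defect discovery peaks

for month in (3, 6, 9, 12):
    found = rayleigh_cdf(month, K, PEAK)
    print(f"month {month:2d}: {found:6.1f} defects found ({found / K:.1%} of total)")
```

Comparing the fraction found at two candidate delivery dates gives the kind of schedule-versus-reliability tradeoff that the example on the next slide describes.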
  23. 23. The Rayleigh Curve (3 of 3) Example: A software project has a 12-month delivery schedule. Prior data is available to generate a reliability forecast. The customer wants to know the effect of pulling the delivery in to 9 months. What is the answer? It reduces the total containment effectiveness (TCE), otherwise expressed as reliability, from 89.6% to 61%. Tradeoff: This allows expectations to be managed by explaining that to achieve early delivery there will be a tradeoff in the reliability, which may require a later release. This type of management helps to avoid possible customer dissatisfaction.
  24. 24. Further Information Software reliability growth using the Rayleigh Curve is discussed in greater depth in Appendix A of How Reliable Is Your Product?: 50 Ways to Improve Product Reliability, by Mike Silverman. The text of Appendix A was provided by the author of this presentation. This book is highly recommended for anybody that is interested in improving product reliability, available from Amazon or directly from Ops A La Carte.
  25. 25. Software Availability and Failure Intensity (1 of 2) As mentioned earlier, instead of a reliability metric being provided, customers may ask for a certain ‘availability’. This is the average (over time) probability that a system or a capability of a system is currently functional in a specified environment. It depends on: The probability of software failure Length of downtime when failure occurs. It essentially describes the expected fraction of the operating time during which a software component or system is functioning acceptably. If the software is not being modified (if further development or further releases are not planned) then the failure rate will be constant and therefore the availability will be constant.
  26. 26. Software Availability and Failure Intensity (2 of 2) From earlier, availability is defined as: $\text{Availability} = \frac{\text{Software uptime}}{\text{Software uptime} + \text{downtime}}$. Downtime (per unit of uptime) can be expressed as: $\text{Downtime} = t_m \lambda$, where $t_m$ = downtime per failure and $\lambda$ = failure intensity. For software, the downtime per failure is the time to recover from the failure, not the time required to find and remove the fault. $\therefore \text{Availability} = \frac{1}{1 + t_m \lambda}$. If an availability specification for the software is specified, then the downtime per failure will determine a failure intensity objective: $\lambda = \frac{1 - \text{Availability}}{\text{Availability} \times t_m}$. Either an availability or a failure intensity objective has to be defined.
  27. 27. Example A product must be available 99% of the time. Required downtime per failure = 6 minutes (0.1 hr). The downtime per failure can be used to determine the failure intensity objective: $\lambda = \frac{1 - A}{A\, t_m}$ $\therefore \lambda = \frac{1 - 0.99}{0.99 \times 0.1}$ $\therefore \lambda \approx 0.1$ failures/hr, or 100 failures per 1,000 hours.
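The worked example can be checked with a short sketch of the two availability relationships. The function names are hypothetical; the formulas are the ones given on the preceding slide.

```python
def failure_intensity_objective(availability, downtime_per_failure_hr):
    """lambda = (1 - A) / (A * t_m), in failures per hour."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    if downtime_per_failure_hr <= 0.0:
        raise ValueError("downtime per failure must be positive")
    return (1.0 - availability) / (availability * downtime_per_failure_hr)

def availability_from_intensity(failure_intensity, downtime_per_failure_hr):
    """A = 1 / (1 + t_m * lambda)."""
    return 1.0 / (1.0 + downtime_per_failure_hr * failure_intensity)

# Values from the example: 99% availability, 6 minutes (0.1 hr) per failure.
lam = failure_intensity_objective(0.99, 0.1)
print(f"{lam:.3f} failures/hr")
print(f"availability check: {availability_from_intensity(lam, 0.1):.4f}")
```

The unrounded result is about 0.101 failures/hr, which the slide rounds to 0.1. Feeding it back through the availability formula returns the original 99% target.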
  28. 28. Availability, Failure Intensity, Reliability and MTBF This presentation will discuss reliability in terms of availability, probability and MTBF. These are the relationships between each of these three metrics. A customer specifies an availability target of 0.99999 and a maximum software downtime of 5 minutes, or 0.083 hours. The failure intensity is determined from: $\lambda = \frac{\text{Downtime}}{\text{Availability}} = \frac{0.00083}{0.0099999} = 0.083$ failures/hr. What is the reliability probability for a period of 2 years? $R(t) = e^{-\frac{\lambda T}{1 \times 10^{9}}} = e^{-\frac{0.99999 \times 17520}{1 \times 10^{9}}} = 0.99998$. What is the Mean Time Between Failures (MTBF)? $\text{MTBF} = \frac{1 \times 10^{9}}{\lambda} = \frac{1 \times 10^{9}}{0.083} = 1.2 \times 10^{10}$ hours.
  29. 29. THE OPERATIONAL PROFILE Defining a structured approach to inspection and test
  30. 30. Defining an Operational Profile An operational profile is a quantitative characterization of how a system will be used in the field by customers. Why is it useful? It provides information on how users will employ the product. It enables the most critical operations to be focused on during testing. This allows the efficiency of the reliability test effort to be improved. It allows more realistic test cases to be designed. To do this the individual software operations have to be identified, which are: Major system logical tasks that return control to the system when complete. Major = a task that is related to a functional requirement or feature rather than a subtask. The operation can be initiated by a user, another part of the system, or by the system’s own controller. For more information on operational profiles refer to Software Reliability Engineering: More Reliable Software Faster and Cheaper, by John D. Musa.
  31. 31. Developing an Operational Profile (1 of 5) Five steps are needed to develop an operational profile: 1. Identify operation initiators (users, other sub-systems, external systems, the product’s own controller, etc.). 2. Create an operations list – this is a list of operations that each initiator can execute. If all initiators can execute every operation then the initiators can be omitted, and the focus can instead be on producing a thorough operations list.
  32. 32. Developing an Operational Profile (2 of 5) A good way to generate an operations list for a menu-driven product is to produce a “walk tree” rather than use an initiators list. An example of a menu-driven system is provided below. This is based on a medical enteral pump, used for feeding patients.
  33. 33. Developing an Operational Profile (3 of 5) Step 3. Once the operations list is complete it should be reviewed to ensure that: All operations are of short duration in execution time (seconds at most). Each operation has substantially different processing from the others. All operations are well-formed, i.e., sending messages and displaying data are parts of the operation and not operations in themselves. The final list is complete with high probability. The total number of operations is reasonable, taking the test budget into account. This is because each operation will be focused on individually using a test case, so if the list is too long it may result in the project test phase being very lengthy.
  34. 34. Developing an Operational Profile (4 of 5) Step 4. Determine occurrence rates for each operation – this may need to be estimated to begin with, but can be revised later. $\text{Occurrence Rate} = \frac{\text{Number of operation occurrences}}{\text{Time the total set of operations is running}}$
  35. 35. Developing an Operational Profile (5 of 5) Step 5. Determine the occurrence probabilities. $\text{Occurrence Probability} = \frac{\text{Occurrence rate of each operation}}{\text{Total operation occurrence rate}}$ This table has been rearranged by sorting the operations in order of descending probabilities. This presents the operational profile in a form that is more convenient to use.
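Steps 4 and 5 amount to two divisions and a sort, as the sketch below shows. The operation names and counts are invented for illustration (loosely themed on a feeding pump) and are not the presentation's data; `operational_profile` is a hypothetical helper name.

```python
def operational_profile(occurrence_counts, total_hours):
    """Return {operation: occurrence probability}, highest probability first."""
    # Step 4: occurrence rate = occurrences / time the operations were running.
    rates = {op: n / total_hours for op, n in occurrence_counts.items()}
    # Step 5: occurrence probability = rate / total occurrence rate.
    total_rate = sum(rates.values())
    profile = {op: rate / total_rate for op, rate in rates.items()}
    return dict(sorted(profile.items(), key=lambda item: item[1], reverse=True))

# Hypothetical operation counts observed over 1,000 hours of use.
counts = {
    "Set feed rate": 300,
    "Start feed": 500,
    "Pause feed": 150,
    "Clear alarm": 50,
}
profile = operational_profile(counts, total_hours=1000.0)
for operation, probability in profile.items():
    print(f"{operation:15s} {probability:.2f}")
```

Test cases can then be allocated in proportion to these probabilities, so that the most frequently used operations receive the most test effort.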
  36. 36. Establish Failure Definitions What is critical to the customer? How does the customer define a failure? A failure is any departure of system behavior in execution from the user needs. A fault is a defect that causes the failure (e.g., missing code). A fault may not result in failure, but a failure can only occur if a fault exists. Faults have to be detected; how can this be done? Answer: by developing an operational profile. This enables resources to be focused on addressing issues in operations that have the highest probability of failure, which results in failures having a low failure intensity. Failure modes should be defined early in the project – this provides a specification for what the system should NOT be doing! Failure severity classes can be defined as shown below. The failures that have the highest severity should be focused on first.
  37. 37. SOFTWARE FMEA Software reliability risk management
  38. 38. Software FMEA and Risk Analysis A software Failure Mode and Effects Analysis (SFMEA) is a systematic method that: Recognizes, evaluates, and prioritizes potential failures and their effects. Identifies and prioritizes actions that could eliminate or reduce the likelihood of potential failures occurring. [Figure labels: Cause (material or process input); Failure Mode (defect); Process Step; Effect (software failure).] An FMEA aids in anticipating failure modes in order to determine and assess the risk to the customer or product. Risks then have to be reduced to acceptable levels.
  39. 39. Software FMEA and Risk Analysis (1 of 2) Fault trees provide a graphical and logical framework for system failure modes to be analyzed. These can then be used to assess the overall impact of software failures on a system, or to prove that certain failure modes cannot occur. Here is a simple example of how to use a fault tree to perform a Software FMEA. It is far better to begin an FMEA using a fault tree. Filling in a spreadsheet immediately can easily result in confusion and is rarely successful! [System block diagram: Sensor, Controller, Actuator.] Potential failure mode: unintended system function. This results in undesirable system behavior and could include potential controller or sensor failures. The first step is to produce a fault tree.
  40. 40. Software FMEA and Risk Analysis (2 of 2)
  41. 41. CODE INSPECTION A reliability improvement and risk management technique
  42. 42. Why Inspect Code? Formal inspections should be carried out on the: Requirements. Design. Code (approximately 18 man-hours plus rework are required per 300-400 lines of code). Test plans. “…formal design and code inspections rank as the most effective methods of defect removal yet discovered…(defect removal) can top 85%, about twice those of any form of testing.” - Capers Jones, Applied Software Measurement, 3rd Ed., McGraw Hill, 2008. Case study performed by the Data Analysis Center for Software (DACS): 85% Defect Containment: cost = $1,000,000, Duration = 12 months. 95% Defect Containment: cost = $750,000, Duration = 10.8 months.
  43. 43. Formal “Fagan Style” Inspections This is a defined process that is quantitatively managed. The objective is to do the thing right. There is no discussion of options: it is either right or wrong, or it requires investigation. Ideally 4 inspectors participate (it can be 3-5, but not fewer than 3). Participants have roles: Leader, Reader, Author and Tester. The review rate target is 150-200 lines of code per hour. What is found depends on how closely the inspectors look at the code. This is a 6-step process that is defined in IEEE 1028. Data is stored in a repository for future reference. The outcome should be that defects are found and fixed, and that data is collected and analyzed.
  44. 44. Relationship Between Inspection and Reliability (1 of 2) For a four-phase test process the reliability is likely to vary between 74% and 92% (based on industry data). Note that not all fixes address problems completely. Some fixes may not be totally effective, while others may also introduce further problems. This is where inspection can be of value. Adapted from a similar approach in : Capers Jones Applied Software Measurement, 3rd Ed. McGraw Hill 2008
  45. 45. Relationship Between Inspection and Reliability (2 of 2) Introducing inspection can increase the reliability to 93-99% (based on industry data). Inspection alone can enable the software to surpass the reliability that is obtained from a test-only process! This also increases the scope for reducing the emphasis on testing. Adapted from: Capers Jones, Applied Software Measurement, 3rd Ed., McGraw Hill, 2008.
  46. 46. SOFTWARE TESTING Further defect detection and elimination
  47. 47. Static Analysis (1 of 2) This should be performed after the code is developed. It is pattern based: it scans the code to check for patterns that are known to cause defects. This type of analysis uses coding standard rules and enforces internal coding guidelines. This is a simple task, easily automated, that reduces future debugging effort. It is also data flow based, in that it statically simulates execution paths, so it is able to automatically detect potential runtime errors such as: Resource leaks. NullPointerExceptions. SQL injections. Security vulnerabilities. The benefits of static analysis are: It can examine more execution paths than conventional testing. It can be applied early in the software design, providing significant time and cost savings.
  48. 48. Static Analysis (2 of 2) Examples of warning classes that can be obtained from static analysis are: Buffer overrun Buffer underrun Cast alters value Ignored return value Division by zero Missing return statement Null pointer dereference Redundant condition Shift amount exceeds bit width Type overrun Type underrun Uninitialized variable Unreachable code Unused value Useless assignment
  49. 49. Buffer Overflow Example Consider the code segment below: char arr[32]; for (int i = 0; i < 64; i++) { arr[i] = (char)i; } Here, memory that is beyond the range of the stack-based variable “arr” is being explicitly addressed. This results in memory being overwritten, which could include the stack frame information that is required for the function to successfully return to its caller. This coding pattern is typical of security vulnerabilities that exist in software. The specifics of the vulnerability may change from one instance to another, but the underlying problem remains the same: performing array copy operations that are incorrectly or insufficiently guarded against exploit. Static analysis can assist in detecting such coding patterns.
  50. 50. Types of Tests Functional tests: This is a single execution of operations with interactions between the various operations minimized. The focus is on whether the operation executes correctly. Load tests: These tests attempt to represent field use and the environment as accurately as possible, with operations executing simultaneously and interacting. Interactions can occur directly, through the data, or as a result of resource conflicts. This testing should use the operational profile. Regression tests: Functional tests that can be conducted after every build involving significant change. The focus during these tests is to reveal faults that may have been created during the change process. Endurance tests: Ad-hoc testing that is similar to load testing in that it should represent the field use and environment as accurately as possible. This will focus on how the product is to be used, and how it may be misused.
  51. 51. RELIABLE SYSTEM DESIGN A look at fault tolerance, an essential aspect of system design
  52. 52. Reliable System Design (1 of 7) To achieve reliable system design, software should be designed such that it is fault tolerant. Typical responses to system or software faults during operation include a sequence of stages: Fault confinement, Fault detection, Diagnosis, Reconfiguration, Recovery, Restart, Repair, Reintegration.
  53. 53. Reliable System Design (2 of 7) Fault Confinement. Limits the spread of fault effects to one area of the system – prevents contamination of other areas. Achieved through use of: - self-checking acceptance tests, - exception handling routines, - consistency checking mechanisms, - multiple requests/confirmations. Erroneous system behaviors due to software faults are typically undetectable. Reduction of dependencies can help.
  54. 54. Reliable System Design (3 of 7) Fault Detection. This stage recognizes that something unexpected has occurred in the system. Fault latency – period of time between fault occurrence and detection. The shorter the fault latency is, the better the system can recover. Two technique classes are off-line and on-line fault diagnosis: - Off-line techniques are diagnostic programs. System cannot perform useful work under test. - On-line techniques provide real-time detection capability. System can still perform useful work. Watchdog monitors and redundancy schemes.
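A watchdog monitor, one of the on-line techniques mentioned on this slide, can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names; timestamps are passed in explicitly so that the behavior is easy to follow, whereas a real watchdog would typically read a hardware timer or a monotonic clock.

```python
class Watchdog:
    """Flags a fault if the monitored task stops sending heartbeats."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_kick = 0.0

    def kick(self, now):
        """Called by the monitored task to show that it is still alive."""
        self.last_kick = now

    def expired(self, now):
        """True if no heartbeat has arrived within the timeout."""
        return (now - self.last_kick) > self.timeout_s

wd = Watchdog(timeout_s=2.0)
wd.kick(now=0.0)
healthy_at_1_5 = wd.expired(now=1.5)   # heartbeat is recent
wd.kick(now=1.5)
healthy_at_3_0 = wd.expired(now=3.0)   # 1.5 s since last heartbeat
hung_at_4_0 = wd.expired(now=4.0)      # 2.5 s since last heartbeat
print(healthy_at_1_5, healthy_at_3_0, hung_at_4_0)
```

The timeout sets an upper bound on the fault latency for a hung task: a shorter timeout detects the fault sooner, at the cost of more false alarms if the task is occasionally slow.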
  55. 55. Reliable System Design (4 of 7) Diagnosis. This is necessary if the fault detection technique does not provide information about the failure location and/or properties. This is often an off-line technique that may require a system reset. On-line techniques can also be used i.e., when a diagnosis indicates unhealthy system conditions (such as low available resources), low-priority resources can be released automatically in order to achieve in-time transient failure prevention. Reconfiguration. This occurs when a fault is detected and a permanent failure is located. The system may reconfigure its components either to replace the failed component or to isolate it from the rest of the system (i.e., redundant memory, error checking of memory in case of partial corruption etc). Successful reconfiguration requires robust and flexible software architecture and reconfiguration schemes.
  56. 56. Reliable System Design (5 of 7) Recovery. Uses techniques to eliminate the effects of faults. There are two approaches: - fault masking, - retry and rollback. Fault masking hides effects of failures by allowing redundant, correct information to outweigh the incorrect information. Retry makes a second try at an operation as many faults are transient in nature. Rollback makes use of backed up (checkpointed) operations at some point in its processing prior to fault detection, and operation recommences from this point. Fault latency is very important because the rollback must go back far enough to avoid the effects of undetected errors that occurred before the detected error.
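Two of the recovery techniques on this slide, retry and fault masking, can be sketched briefly. The function names are hypothetical, majority voting is used here as one simple form of masking, and `IOError` stands in for whatever the system treats as a transient fault.

```python
from collections import Counter

def retry(operation, attempts=3):
    """Retry an operation, on the assumption that the fault is transient."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except IOError as err:  # stand-in for a transient fault
            last_error = err
    raise last_error

def majority_vote(readings):
    """Fault masking: correct redundant values outweigh an incorrect one."""
    value, count = Counter(readings).most_common(1)[0]
    if count <= len(readings) // 2:
        raise RuntimeError("no majority, so the fault cannot be masked")
    return value

calls = {"n": 0}

def flaky_read():
    """Hypothetical operation that fails twice and then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient fault")
    return "data"

result = retry(flaky_read)
masked = majority_vote([42, 42, 7])  # one of three redundant units is wrong
print(result, calls["n"], masked)
```

Retry only helps with transient faults; a permanent fault exhausts the attempts and the error is raised to the caller, at which point rollback or reconfiguration would be needed.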
  57. 57. Reliable System Design (6 of 7) Restart. This occurs after the recovery of undamaged information. There are three approaches: - hot restart, - warm restart; - cold restart. Hot restart – resumption of all operations from the point of fault detection (this is only possible if no damage has occurred). Warm restart – only some of the processes can be resumed without loss. Cold restart – complete reload of the system is performed with no processes surviving.
  58. 58. Reliable System Design (7 of 7) Repair. Replacement of failed component – on or off-line. Off-line – system brought down to perform repair. System availability depends on how fast a fault can be located and removed. On-line – Component replaced immediately with a back up spare (similar to reconfiguration), or perhaps operation can continue without using the faulty component (i.e., masking redundancy or graceful degradation). On-line repair prevents system operation interruption. Reintegration. Repaired module must be reintegrated into the system. For on-line repair, reintegration must be performed without interrupting system operation. Non-redundant systems are fault intolerant and, to achieve reliability, fault avoidance is often the best approach. Redundant systems should use fault detection, masking redundancy (i.e., disabling 1 out of N units), and dynamic redundancy (i.e., temporarily disabling certain operations ) to automate one or more stages of fault handling.
  59. 59. RELIABILITY MODELING Determining what reliability has actually been achieved
  60. 60. Reliability Modeling (1 of 4) This is used to calculate what the current reliability is, and if the reliability target is not yet being achieved, determine how much testing and debug needs to be completed in order to achieve the reliability target. The questions that reliability modeling aims to answer are: How many failures are we likely to experience during a fixed time period? What is the probability of experiencing a failure in the next time period? What is the availability of the software system? Is the system ready for release (from a reliability perspective)? [Figure: timeline of software failures from T=0 to TE, marking the cumulative failure times T1 to T8 and the inter-arrival times between them.] $T_i$ is the cumulative time to failure; $t_i$ is the inter-arrival time, $t_i = T_i - T_{i-1}$.
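The relationship between cumulative failure times and inter-arrival times can be shown with a few lines of code. The failure times below are invented for illustration and do not come from the presentation; the helper name is hypothetical, and the first inter-arrival time is measured from T=0.

```python
def inter_arrival_times(cumulative_times):
    """Convert cumulative failure times T_i into gaps t_i = T_i - T_(i-1)."""
    previous = 0.0  # T_0 = 0, the start of observation
    gaps = []
    for t in cumulative_times:
        if t < previous:
            raise ValueError("cumulative times must be non-decreasing")
        gaps.append(t - previous)
        previous = t
    return gaps

# Hypothetical cumulative failure times in hours (T_1 ... T_8).
T = [120.0, 210.0, 280.0, 335.0, 380.0, 415.0, 440.0, 460.0]
t = inter_arrival_times(T)
print(t)
```

In this made-up data set the gaps shrink steadily, which means failures are arriving more and more frequently as time goes on.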
  61. 61. Reliability Modeling (2 of 4) In reliability engineering it is usual to identify a failure distribution, especially when modeling non-repairable products*. This approach can be used because it is assumed that hardware faults are statistically independent and identically distributed. Where software is concerned, events (failures) are not necessarily independent due to interactions with other system elements, so in most cases failures are not identically distributed. When a failure occurs in a software system the next failure may depend on the current operational time of the unit, and therefore each failure event in the system may be DEPENDENT. * Although it can be argued that a software system can be repaired by fixing the fault, in reliability terms it is still a non-repairable product because it is not wearing out. For instance, a car is a repairable device as parts can be changed when they wear out, but this does not necessarily make it as good as new. If a software fault is repaired it is actually as good as new again, and in fact the improvement may make it better than new.
  62. 62. Reliability Modeling (3 of 4) Therefore what is needed is to model the Rate of Occurrence of Failures and the Number of Failures within a given time. As an example, with reference to the failure timeline figure, a model is needed that will report the fact that 8 failures are expected by time TE and that the Rate of Occurrence of Failures is Increasing with Time. Figure: timeline of software failures from T=0 to TE, showing cumulative failure times T1 to T8 and inter-arrival times t1 to t7.
  63. 63. Reliability Modeling (4 of 4) If a Distribution Analysis is performed on the Time-Between-Failures, then this is equivalent to saying that there are 9 different systems, where System 1 failed after t1 hours of operation, System 2 after t2, etc. Figure: Systems 1 to 9 each starting at T=0, with failure times t1, t2, t3, ... and System 9 ending at T9 as a suspension*. This is the same as assuming that the system is failure free if the fault is addressed, which may not necessarily be true as further failures may occur. Example: Changing the brake pads on a car. This does not mean that the car is now failure free! * A unit that continues to work at the end of the analysis period or is removed from a test in working condition. I.e., it may fail at some point in the future.
  64. 64. An Example of an Incorrect Approach (1 of 4) This example has been included because it is a common approach to hardware reliability modeling but it CANNOT be used for modeling software reliability. This method is normally used to model a non-repairable hardware product. Unfortunately when used in analyzing software reliability it returns incorrect results…but it is an easy trap for a reliability engineer to fall into!!! Both firmware and hardware failure data is collected from three systems: A total of 6 different firmware and 4 different hardware failure modes are identified
  65. 65. An Example of an Incorrect Approach (2 of 4) The conventional reliability engineering approach is to take the Time-Between-Failures for each system and then fit a distribution. 319-152 Notice that hardware failures have been removed. The time between the last failure and the current age is a Suspension.
  66. 66. An Example of an Incorrect Approach (3 of 4) A Weibull (life data) Analysis is conducted, but with software this is not appropriate! This analysis assumes a sample of 20 systems, and one system failed after 152hrs, the other after 319hrs, etc.
  67. 67. An Example of an Incorrect Approach (4 of 4) This system will be used for a total of 250 hours. What will the software reliability be? 97.63%: GREAT RESULT BUT COMPLETELY WRONG!!! Distribution analysis is okay for non-repairable products containing only hardware, but not for anything containing software (or for repairable hardware-only products). In products that contain software, events are dependent, and therefore alternative analysis methods should be used. However, it is correct to fit a distribution on the First-Time-to-Failure of each system.
  68. 68. An Example of a Correct Approach Reliability = 68.36%. This is the probability that the unit will NOT fail in the first 250 hours. Notice that the confidence interval is very wide.
  69. 69. Three Possible SRE approaches… Are multiple systems being tested? Yes: Use NHPP model (this is the best option). This is the current state of the art in software reliability modeling, and is suitable for most projects. However, this approach is not suitable for testing a single unit (i.e., a large expensive system), or where not all faults are going to be fixed in between compiles. A better model is needed for this type of application. No: Can testing be stopped after each phase to fix failure modes? Yes: Use Crow-Extended Model. No: Use 3-Parameter Crow-Extended Model. It is hypothesized by the author that the two Crow-Extended models may be more suitable for developments where the NHPP model cannot be well applied. This essentially represents a future state of software reliability testing. However, before being readily accepted they should be validated, i.e., by comparing their predicted reliability with actual field data. Use of these models has been included in this presentation for completeness and possible future application.
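The decision flow on this slide can be written down as a small function. This is only a sketch of the flowchart as read from the slide; the function and parameter names are hypothetical.

```python
def choose_sre_model(multiple_systems: bool, can_stop_to_fix_each_phase: bool) -> str:
    """Pick a reliability model following the two questions on the slide."""
    if multiple_systems:
        # Multiple systems under test: the NHPP model is the preferred option.
        return "NHPP"
    if can_stop_to_fix_each_phase:
        # Single system, testing stops at the end of the phase to fix failure modes.
        return "Crow-Extended"
    # Single system, fixes are introduced across several test phases.
    return "3-Parameter Crow-Extended"


print(choose_sre_model(multiple_systems=False, can_stop_to_fix_each_phase=False))
```

Keeping the multiple-system check first reflects the warning repeated later in the deck that the Crow models should not be applied to a multiple-system test.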
  70. 70. A Better SRE Analysis Approach (1 of 4) A model is needed that will take into account the fact that when a failure occurs the system has a “Current Age,” or in other words there is a further failure that is likely to occur. For example, in System 1, the system has an age of 152 hours after the first firmware failure mode has been detected. In other words, all other operations that can result in a failure also have an age of 152 hours and the next failure event is based on this fact.
  71. 71. A Better SRE Analysis Approach (2 of 4) The NHPP (Non-Homogeneous Poisson Process) with a Power Law failure intensity is such a model, where: Pr[N(T)=n] is the probability that n failures will be observed by time T. Λ(T) is the Failure Intensity Function (Rate of Occurrence of Failures). Just because a model is used for hardware does not mean that it cannot be suitable for software as well, as models simply describe times-to-failure. Therefore a hardware model can also be used for software, provided that it is a dependent model (failures are dependent on the operational time, rather than being independent).
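For reference, a power-law NHPP is commonly written as follows. This is the standard textbook form, offered as a sketch; the parameter symbols $\lambda$ (scale) and $\beta$ (shape) are assumptions and may differ from the notation used on the original slide.

```latex
\Pr[N(T)=n] \;=\; \frac{\left(\int_0^T \Lambda(t)\,dt\right)^{n}}{n!}\,
\exp\!\left(-\int_0^T \Lambda(t)\,dt\right),
\qquad
\Lambda(T) = \lambda\beta T^{\beta-1},
\qquad
\int_0^T \Lambda(t)\,dt = \lambda T^{\beta}
```

With $\beta < 1$ the failure intensity falls with time (reliability growth), with $\beta = 1$ it is constant, and with $\beta > 1$ it rises.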
  72. 72. A Better SRE Analysis Approach (3 of 4) NHPP model parameters: Here the failure events of System 1 are analyzed between the period of 0 and 1380 hours. This folio also contains the failure events for Systems 2 and 3 (not shown). Of interest is the fact that Beta >1, which indicates that the inter-arrival times between unique failures are decreasing, so there may be little opportunity for reliability improvement.
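As a rough illustration of how power-law NHPP parameters can be estimated, the sketch below uses the standard closed-form maximum likelihood estimates for a single system observed up to a fixed end time. It is not the multi-system analysis performed in the folio on this slide, and the failure times and function names are hypothetical.

```python
import math


def fit_power_law_nhpp(failure_times, end_time):
    """Estimate (lam, beta) for one system in a time-terminated test.

    beta = n / sum(ln(end_time / T_i)), lam = n / end_time**beta
    """
    n = len(failure_times)
    if n == 0:
        raise ValueError("at least one failure time is required")
    if any(t <= 0 or t >= end_time for t in failure_times):
        raise ValueError("failure times must lie strictly between 0 and end_time")
    log_sum = sum(math.log(end_time / t) for t in failure_times)
    beta = n / log_sum
    lam = n / end_time ** beta
    return lam, beta


def expected_failures(lam, beta, t):
    """Expected cumulative number of failures by time t."""
    return lam * t ** beta


def failure_intensity(lam, beta, t):
    """Instantaneous rate of occurrence of failures at time t."""
    return lam * beta * t ** (beta - 1.0)


# Hypothetical cumulative failure times (hours) for a test ending at 1000 hours.
times = [100.0, 300.0, 600.0, 900.0]
lam, beta = fit_power_law_nhpp(times, end_time=1000.0)
print(lam, beta, failure_intensity(lam, beta, 1000.0))
```

A fitted beta above 1 would indicate shrinking inter-arrival times, which is the situation described on this slide.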
  73. 73. A Better SRE Analysis Approach (4 of 4) NHPP model results: The plot shows the cumulative number of failures vs. time, from which conclusions and further predictions can be obtained. The Weibull plot intersects the X-axis, so out-of-box failures should not be present. If it had intersected the Y-axis then this would indicate potential for out-of-box failures. The cumulative number of failures is 0.1352, or 13.52 failures per 25000 operational hours.
  74. 74. An Example Using the NHPP Model (1 of 8) Software is under development – the reliability requirement is to have no more than 1 fault in every 8 hours of software operation. Three Test Engineers provide a total of 24 hours of testing each day. One new compile is available for testing each week, when fixes are implemented. The failure rate goal is: $FR = 1/8 = 0.125$ failures per hour. In a testing day of 24 test hours, the failure intensity goal is 3 faults/day: $FRI = 0.125 \times 24 = 3$ faults per day.
  75. 75. An Example Using the NHPP Model (2 of 8) Failure data is obtained: The data is grouped by the number of days until a new compile is available, i.e., the first 45 failures are contained in one group and are fixed in compile #1. NHPP model parameters
  76. 76. An Example Using the NHPP Model (3 of 8) The instantaneous failure intensity after 28 days of testing is 4.4947 faults/day. If testing is continued with the same growth rate, when will the goal of no more than 3 faults/day be achieved? The answer is after an additional 149-28=121 days of testing and development (test-analyze-fix).
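The "when will the goal be reached" question can be sketched for a power-law intensity of the form lam * beta * T**(beta - 1). The parameter values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the fitted values from this example, and the function names are assumptions.

```python
import math


def faults_per_day_goal(hours_between_faults, test_hours_per_day):
    """Convert 'no more than 1 fault every N hours' into faults per test day."""
    return test_hours_per_day / hours_between_faults


def failure_intensity(lam, beta, t):
    """Power-law instantaneous failure intensity at time t."""
    return lam * beta * t ** (beta - 1.0)


def time_to_reach_intensity(lam, beta, goal):
    """Solve lam * beta * T**(beta - 1) = goal for T (needs 0 < beta < 1)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("intensity only decreases towards a goal when 0 < beta < 1")
    return (goal / (lam * beta)) ** (1.0 / (beta - 1.0))


goal = faults_per_day_goal(hours_between_faults=8, test_hours_per_day=24)

# Hypothetical model parameters, with time measured in days.
lam, beta = 20.0, 0.6
days_needed = time_to_reach_intensity(lam, beta, goal)
print(goal, days_needed)
```

Subtracting the days already tested from the result gives the additional test-analyze-fix time, which is how the 149-28 figure on this slide is formed.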
  77. 77. An Example Using the NHPP Model (4 of 8) An extra 121 days is longer than anticipated. Let’s take a closer look by generating a Failure Intensity vs. Time plot… Each of these lines indicates the failure intensity over a given interval (which in this case is 5 days). It can be seen that there was a jump in the failure intensity between 20 and 23 days. This is why it is estimated that more development time is required. The next step is to analyze the data set for the period up to 20 days of testing, before the failure intensity increased…
  78. 78. An Example Using the NHPP Model (5 of 8) The NHPP model data is limited to the first 20 days of testing and another Failure Intensity vs. Time plot is generated, but this time for the first 20 days: This plot shows the decrease in the failure intensity rate over the first 20 days of testing. This confirms that the failure intensity continuously reduced during the first 20 days.
  79. 79. An Example Using the NHPP Model (6 of 8) Based on the first 20 days of data the additional test and development duration can be recalculated, which results in there being an additional 55-28=27 days to achieve the goal of having no more than 3 faults/day, rather than 121! This generates questions: Why is there such a big difference in the test duration still required? What happened when the failure intensity jumped on the 23rd day of testing and development? Answer – New functionality was added. The jump in required test time is typical when new features are introduced, and applies to software and hardware alike. Because new functionality has been added it would be wise to reset the clock and track the reliability growth from the 20th day forward…
  80. 80. An Example Using the NHPP Model (7 of 8) Now the NHPP model parameters need to be obtained and plotted for the last 8 days of testing (8 days is an arbitrary number; enough data needs to be available to have confidence in any conclusions that are drawn). This provides better resolution. By taking a "macro" view it can be seen that the failure intensity is starting to increase, so the minimum failure intensity point has been determined. For improved accuracy, calculations should be based on this.
  81. 81. An Example Using the NHPP Model (8 of 8) Based on this data set 51-8=43 more days of developmental testing are required. It may be too early to make any predictions based on only 8 days of testing, but the result can be used to obtain a general idea of the remaining development time required and produce a test plan. To pull in the schedule 3 more Test Engineers could be added and the code recompiled every 2 days, which will complete the project within 1 month. There are also situations where some issues are fixed immediately, others are addressed later and more minor issues may not be addressed at all. In this type of situation the Crow Extended Model can be useful…
  82. 82. Crow-Extended Model Introduction (1 of 2) This is not a common SRE model but does have the benefit of supporting decision making by providing metrics such as: Failure intensity vs. time; Demonstrated Mean Time Between Failures (MTBF*); MTBF growth that can be achieved through implementation of corrective actions; Maximum potential MTBF that can be achieved through implementation of corrective actions; Maximum potential MTBF that can likely be achieved for the software; and estimates regarding latent failure modes that have not yet been uncovered through testing. This model utilizes A, BC and BD failure mode classifications to analyze growth data. A = A failure mode that will not be fixed. BC = A failure mode that will be fixed while the test is in progress. BD = A failure mode that will be corrected at the end of the test. * This model uses MTBF rather than failure intensity or reliability metrics. A conversion between these various metrics is provided in slide 28.
  83. 83. Crow-Extended Model Introduction (2 of 2) There is no reliability growth for A modes. The effectiveness of the corrective actions for BC modes is assumed to be demonstrated during the test. BD modes require a factor to be assigned that estimates the effectiveness of the correction that will be implemented after the test. Analysis using the Crow Extended model allows different management strategies to be considered by reviewing whether the reliability goal will be achieved. There is one constraint to this approach – the testing must be stopped at the end of the test phase and all BD modes must be fixed. The Crow Extended model will return misleading conclusions if it is used across multiple test phases. For those situations use the 3-Parameter Crow-Extended model (discussed next). DO NOT APPLY THIS MODEL TO A MULTIPLE SYSTEM TEST, USE THE NHPP MODEL INSTEAD.
  84. 84. Crow-Extended Model Example (1 of 8) A product underwent development testing, during which failure modes were observed. Some modes were corrected during the test (BC modes), some modes were corrected after the end of the test (delayed fixes, BD modes) and some modes were left in the system (A modes). The test was terminated after 400 hours; the times-to-failure are provided below:
  85. 85. Crow-Extended Model Example (2 of 8) An effectiveness factor has been assigned for each BD failure mode (delayed fixes). The effectiveness factor is based on engineering assessment and represents the fractional decrease in failure intensity of a failure mode after the implementation of a corrective action. The effectiveness factors for the BD failure modes are provided below: This is a metric that enables an assessment to be made of whether or not the corrective actions have been effective, and if they have, how effective they were. This is often a subjective judgment.
  86. 86. Crow-Extended Model Example (3 of 8) The times-to-failure data and effectiveness factors are entered: Note that this data sheet only displays 29 rows of data, but all data is entered even though it has not been shown. Effectiveness factor is expressed as 0-1 (0-100% of the failure intensity being removed by the corrective action).
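To make the meaning of the effectiveness factor concrete, the sketch below applies it to a single failure mode. It is a simplified illustration of the definition given on the previous slide (fractional decrease in failure intensity), not the full Crow-Extended projection; the numbers and function names are hypothetical.

```python
import math


def intensity_after_fix(mode_intensity, effectiveness_factor):
    """Failure intensity of one mode after a corrective action.

    effectiveness_factor is the fraction (0 to 1) of the mode's
    failure intensity removed by the fix.
    """
    if not 0.0 <= effectiveness_factor <= 1.0:
        raise ValueError("effectiveness factor must be between 0 and 1")
    return mode_intensity * (1.0 - effectiveness_factor)


def mtbf(failure_intensity):
    """MTBF is the reciprocal of a (non-zero) failure intensity."""
    return 1.0 / failure_intensity


# Hypothetical BD mode: 0.02 failures per hour before the fix, 75% effective fix.
before = 0.02
after = intensity_after_fix(before, 0.75)
print(mtbf(before), mtbf(after))
```

An effectiveness factor of 1.0 would remove the mode entirely, which is why the MTBF conversion is only defined for a non-zero remaining intensity.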
  87. 87. Crow-Extended Model Example (4 of 8) Model parameter calculation: Here the failure events are analyzed between the period of 0 and 400 hours.
  88. 88. Crow-Extended Model Example (5 of 8) Growth potential MTBF plot. Plot legend: Growth potential MTBF (maximum achievable MTBF based on current strategy); Projected MTBF (estimated MTBF after delayed corrective actions have been implemented); Demonstrated MTBF (MTBF at end of test without corrective actions); Instantaneous MTBF (demonstrated MTBF with time). The demonstrated MTBF (the result of fixing BC modes during the test) is about 7.76 hours. The projected MTBF (the result of fixing BD modes after the test) is about 11.13 hours. The growth potential MTBF (if testing continues with the current strategy, i.e. modes corrected vs. modes not corrected and with the current effectiveness of each corrective action) is estimated to be about 14.7 hours. This is the maximum attainable MTBF.
  89. 89. Crow-Extended Model Example (6 of 8) An Average Failure Mode Strategy plot is a pie chart that breaks down the average failure intensity of the software into the following categories: A modes – 9.546%. BC modes addressed – 14.211%. BC modes still undetected – 30.655%. BD modes removed – 8.846%. BD modes to be removed – 3.355% (because corrective actions were <100% effective). BD modes still undetected – 33.386%.
  90. 90. Crow-Extended Model Example (7 of 8) Individual Mode MTBF plot, which shows the MTBF of each individual failure mode. This enables the failure modes with the lowest MTBF to be identified. These are the failure modes that cause the majority of software failures, and should be addressed as the highest priority when reliability improvement activities are to be implemented. Blue = Failure mode MTBF before corrective action. Green = Failure mode MTBF after corrective action.
  91. 91. Crow-Extended Model Example (8 of 8) Failure Intensity vs. Time plot: This can be analyzed in exactly the same way as in the NHPP example.
  92. 92. 3-Parameter Crow-Extended Model Introduction (1 of 2) This is not a common SRE model either, but has the same benefits as the single parameter Crow-Extended model plus multiple test phases can also be taken into account. This model is ideal in situations where software is to be tested over multiple phases but where all bug fixes cannot be introduced as faults are discovered, i.e., all bugs will be addressed on an ad-hoc basis over an extended time period. The model provides the flexibility of not having to specify when the test will end, so it can be continuously updated with new test data. Therefore this model is optimized for continuous evaluation rather than fixed test periods. It can only be applied to an individual system, so it lends itself ideally to situations where an individual complex system is being tested. DO NOT APPLY ANY CROW MODEL TO A MULTIPLE SYSTEM TEST, USE THE NHPP MODEL INSTEAD.
  93. 93. 3-Parameter Crow-Extended Model Introduction (2 of 2) This model uses several event codes: F = Failure time. I = Time at which a certain BD failure mode has been corrected. BD modes that have not received a corrective action by time T will not have an associated I event in the data set. Q = A failure that was due to a quality issue, such as a build problem rather than a design problem. The reliability engineer can decide whether or not to include quality issues in the analysis. P = A failure that was due to a performance issue, such as an incorrect component being installed in a device where the embedded code is being tested. The reliability engineer can decide whether or not to include performance issues in the analysis. AP = An analysis point, used to track overall project progress, which can be compared to planned growth phases. PH = The end of a test phase. Test phases can be used to track overall project progress, which can be compared to planned growth phases. X = A data point that is to be excluded from the analysis.
  94. 94. 3-Parameter Crow-Extended Model Example (1 of 11) Software is under development. Testing is to be conducted in 3 phases. Phase 1 – 6 weeks of manual testing that is run 45 hours per week, total 270 hours. Phase 2 – 4 weeks of automated testing that is run 24/7, total 672 hours. Phase 3 – 8 weeks of field manual testing that is run 40 hours per week, total 320 hours. One hour of continuous testing equates to 7 hours of customer usage, so the testing includes a usage acceleration factor of 7. The average fix delay for the three phases are 90 hours, 90 hours and 180 hours respectively (fix time = time delay between discovering a failure mode to the time the corrective action is incorporated into the design). Taking usage acceleration into account, cumulative test times for the three phases is 1890 hours, 6594 hours and 8834 hours respectively.
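The cumulative test times quoted on this slide follow directly from the phase durations and the usage acceleration factor. The short sketch below reproduces that arithmetic; the function name is hypothetical.

```python
from itertools import accumulate


def accelerated_cumulative_times(phase_hours, acceleration_factor):
    """Scale each phase by the acceleration factor and accumulate."""
    return list(accumulate(h * acceleration_factor for h in phase_hours))


# Phase durations in test hours, as stated on the slide.
phase_hours = [6 * 45, 4 * 7 * 24, 8 * 40]   # 270, 672, 320
cumulative = accelerated_cumulative_times(phase_hours, 7)

# The average fix delays scale by the same factor.
fix_delays = [d * 7 for d in (90, 90, 180)]
print(cumulative, fix_delays)
```

The scaled fix delays of 630, 630 and 1260 hours are the values entered into the growth planning folio a few slides later.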
  95. 95. 3-Parameter Crow-Extended Model Example (2 of 11) Customer reliability target = 2 failures per year. Usage duty cycle = 0.1428. Therefore, for continuous usage, the reliability target is 2 failures every 1251 hrs (8760 hrs per year × 0.1428). Failure intensity: $2 / 1251 = 0.0016$ failures per hour. Equivalent test time = 8834 hrs. Required MTBF: $8834 / (0.0016 \times 8834) = 625$ hrs.
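The conversion from a yearly customer target to a required MTBF can be checked with a few lines of Python. This sketch assumes a 365-day year (8760 hours), which is consistent with the 1251-hour figure on the slide; the variable names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 8760 hours in a year

failures_per_year = 2
duty_cycle = 0.1428

# Hours of actual usage per year at the stated duty cycle.
usage_hours_per_year = HOURS_PER_YEAR * duty_cycle

# Failure intensity in failures per hour of continuous usage.
target_intensity = failures_per_year / usage_hours_per_year

# Required MTBF is the reciprocal of the target failure intensity.
required_mtbf = 1.0 / target_intensity

print(round(usage_hours_per_year), round(target_intensity, 4), round(required_mtbf))
```

The equivalent test time of 8834 hours cancels out of the slide's MTBF expression, so the required MTBF depends only on the target failure intensity.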
  96. 96. 3-Parameter Crow-Extended Model Example (3 of 11) The growth potential (GP) margin = 1.3. This is the amount by which the MTBF target should exceed the requirement for margin. This is an initial estimate based on prior experience; the higher the GP margin, the less risk is present in the program. Average effectiveness factor = 0.5 (1.0 = a perfect fix, 0 = inadequate fix). This is also an initial estimate based on prior experience. Management strategy – address at least 90% of all unique failure modes prior to formal release. Beta parameter = 0.7. This is the rate for discovering new, distinct failure modes during testing. Again, this is an estimate based on prior experience. A discovery beta of less than 1 indicates that the inter-arrival times between unique B modes are increasing. This is desirable, as it is assumed that most failures will be identified early on, and their inter-arrival times will become larger as the test progresses. The actual discovery beta can be obtained from the final results, allowing this parameter estimate to be refined with testing experience.
  97. 97. 3-Parameter Crow-Extended Model Example (4 of 11) Based on the assumptions on the previous slide, an overall growth planning model can be created that shows the nominal and actual idealized growth curve and the planned growth MTBF for each phase. A growth planning folio is created and 1890, 6594 and 8834 are entered for the Cumulative Phase Times and 630, 630 and 1260 for the Average phase delays. Note that inter-phase average fix delays have been multiplied by 7 to take the usage acceleration factor into account.
  98. 98. 3-Parameter Crow-Extended Model Example (5 of 11) The project parameters are input into the Planning Calculations window. Given the MTBF target and design margin that has been specified, along with other required inputs to describe the planned reliability growth management strategy, the final MTBF that can be achieved is calculated, along with other useful results. Here it is verified that 625 hours is achievable (if it was not achievable a figure of less than 625 hours would be calculated).
  99. 99. 3-Parameter Crow-Extended Model Example (6 of 11) Effectiveness Factors for all BD modes are specified, together with when they are to be implemented. A growth planning plot can then be obtained, annotated with the MTBF at the end of phase 1, phase 2 and phase 3. This plot displays the MTBF vs. Time values for the three phases that have been planned for the test.
  100. 100. 3-Parameter Crow-Extended Model Example (7 of 11) Test failure data is collected during the three phases: Actual discovery beta (original estimate was 0.7)
  101. 101. 3-Parameter Crow-Extended Model Example (8 of 11) A growth potential MTBF plot can now be obtained. Plot legend: Growth potential MTBF (maximum achievable MTBF based on current strategy); Demonstrated MTBF (MTBF at end of test without corrective actions); Projected MTBF (estimated MTBF after delayed corrective actions have been implemented); Instantaneous MTBF. If the MTBF goal is higher than the Growth Potential line then the current design cannot achieve the desired goal and a redesign or change of goals may be required. For this example, the goal MTBF of 650 hours is well within the growth potential and is expected to be achieved after the implementation of the delayed BD fixes.
  102. 102. 3-Parameter Crow-Extended Model Example (9 of 11) Average Failure Mode Strategy plot, breaking down the average failure intensity of the software into categories: A modes – 13.432%. BC modes addressed – 19.281%. BC modes still undetected – 13.76%. BD modes removed – 25.893%. BD modes remain (because corrective actions were <100% effective) – 5.813%. BD modes still undetected – 21.882%
  103. 103. 3-Parameter Crow-Extended Model Example (10 of 11) Individual Mode MTBF plot showing the MTBF of each individual failure mode, thus enabling the failure modes with the lowest MTBF to be identified. Blue = Failure mode MTBF before corrective action. Green = Failure mode MTBF after corrective action.
  104. 104. 3-Parameter Crow-Extended Model Example (11 of 11) The RGA Quick calculation pad indicates that the discovery rate of new unseen BD modes at 630 hours is 0.0006 per hour. The Beta bounds are less than 1, indicating that there is still growth in the system (think of this as the leading edge slope of the bathtub curve; when beta=1 there is no more growth potential)
  105. 105. RELIABILITY DEMONSTRATION Demonstration that a minimum software reliability has been achieved
  106. 106. Reliability Demonstration Testing (1 of 2) There can be occasions when the actual software reliability may have to be measured through practical demonstration rather than testing. However, this is more applicable where all known faults have been removed and the software is considered to be stable. If the reliability has already been discovered by conducting a reliability growth program then there may be little value in conducting this test. This is actually more suitable for situations where a reliability growth program has not been conducted. This can be achieved through sequential sampling theory.
  107. 107. Reliability Demonstration Testing (2 of 2) A project-specific chart depends on: Discrimination Ratio – this is the factor of error in the failure intensity estimation that is considered to be acceptable. Consumer Risk Level – this is the probability of falsely claiming the failure intensity objective has been met when it has not. Supplier Risk Level – this is the probability of falsely claiming the failure intensity objective has not been met when it has. Common values are: Discrimination Ratio 2. Consumer Risk Level 0.1 (10%). Supplier Risk Level 0.1 (10%).
  108. 108. Example Requirement: 4 Failures/million operations. Failure number: 1, 2, 3. Measure at failure (million operations): 0.4, 0.625, 1.2. Normalized measure (multiply by requirement target): 1.6, 2.5, 4.8. Software can be accepted after failure 3 with 90% confidence that it is within the reliability target and a 10% risk that it is not. The boundary has to be crossed though.
  109. 109. Reliability Demonstration Test Chart Design (1 of 2) What if the software is still in the Continue region at the end of the test? Assume that the end of test is reached just after failure 2. Option 1 – Calculate the Failure Intensity Objective: $\text{Factor} = F_{CURRENT} / F_{PREVIOUS} = 3.6 / 2.5 = 1.44$, ∴ $FIO = 1.44 \times 4 = 5.76$. Option 2 – Extend the test time by ≥ factor. Grouped data CANNOT be used; it has to be obtained from individual units.
  110. 110. Reliability Demonstration Test Chart Design (2 of 2) The following formulae are used to design RDT charts: $T_N = \frac{A - n\ln\gamma}{1-\gamma}$ (Accept-Continue Boundary) and $T_N = \frac{B - n\ln\gamma}{1-\gamma}$ (Reject-Continue Boundary). Where: $T_N$: Normalized measure of when failures occur (horizontal coordinate). $n$: Failure number. $\gamma$: Discrimination ratio (ratio of max acceptable failure intensity to the failure objective). A and B are defined from: $A = \ln\frac{\beta}{1-\alpha}$ and $B = \ln\frac{1-\beta}{\alpha}$. Where: $\alpha$: Supplier risk (probability of falsely claiming objective is not met when it is). $\beta$: Consumer risk (probability of falsely claiming objective is met when it is not).
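The boundary formulae can be turned into a small accept/continue/reject check. The sketch below assumes a discrimination ratio of 2 and 10% supplier and consumer risks, and uses the normalized measures 1.6, 2.5 and 4.8 from the earlier example; the function names are hypothetical.

```python
import math


def rdt_boundaries(n, gamma=2.0, alpha=0.1, beta=0.1):
    """Return (accept, reject) normalized-measure boundaries at failure n."""
    a = math.log(beta / (1.0 - alpha))
    b = math.log((1.0 - beta) / alpha)
    accept = (a - n * math.log(gamma)) / (1.0 - gamma)
    reject = (b - n * math.log(gamma)) / (1.0 - gamma)
    return accept, reject


def rdt_decision(n, normalized_measure, gamma=2.0, alpha=0.1, beta=0.1):
    """Classify a failure point as 'accept', 'reject' or 'continue'."""
    accept, reject = rdt_boundaries(n, gamma, alpha, beta)
    if normalized_measure >= accept:
        return "accept"
    if normalized_measure <= reject:
        return "reject"
    return "continue"


# Normalized measures at failures 1, 2 and 3 from the example requirement slide.
decisions = [rdt_decision(n, tn) for n, tn in [(1, 1.6), (2, 2.5), (3, 4.8)]]
print(decisions)
```

With these settings the first two failures fall in the Continue region and the third crosses the accept boundary, which matches the conclusion drawn in the example.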
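  111. 111. Reliability Demonstration Test Chart Design Example The boundary intersections with the x and y axes can be calculated using the following formulae: $\left(\frac{B - n\ln\gamma}{1-\gamma},\ n\right)$, $\left(\frac{A - n\ln\gamma}{1-\gamma},\ n\right)$, $\left(0,\ \frac{B}{\ln\gamma}\right)$ and $\left(\frac{A}{1-\gamma},\ 0\right)$. In this example n=16.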
  112. 112. SRE Review Enables defect discovery rates to be forecast and monitored – helps all staff – enables customer expectations to be managed. Enables reliability targets to be established and monitored. Software FMEA enables failure modes and risks to be identified. Establishes formal and thorough test and analysis methodologies. Provides a method for modeling and demonstrating software reliability. Defines code inspection processes. Guarantees customer satisfaction!
  113. 113. References Adamantios Mettas, "Repairable Systems: Data Analysis and Modeling", Applied Reliability Symposium 2008. Michael R. Lyu, "Software Reliability Engineering: A Roadmap". Dr Larry Crow, "An Extended Reliability Growth Model For Managing and Assessing Corrective Actions", Reliability And Maintainability Symposium 2004. John D. Musa, "Software Reliability Engineering: More Reliable Software Faster and Cheaper", Authorhouse 2004. Reliasoft RGA 7 Training Guide. Capers Jones, "Applied Software Measurement", 3rd Edition, McGraw Hill 2008.
