



We Provide You Confidence in Your Product Reliability™
   Ops A La Carte / (408) 654-0499 / askops@opsalacarte.com / www.opsalacarte.com
Software Design
for Reliability (DfR)
  ½-day Seminar



   Ops A La Carte LLC // www.opsalacarte.com
The following presentation materials are copyright-protected property of
Ops A La Carte LLC. Distribution of these materials is limited to your
company staff only.

These materials may not be distributed outside of your company or used
for any purpose other than training.
Software DfR ½-Day Seminar Agenda


  Agenda

  ◈ Introductions and Agenda Review
  ◈ Software Reliability Basic Concepts
  ◈ A “Best Practices” Approach to Developing Reliable Software
  ◈ Reliability Measurements and Metrics
  ◈ Wrap-up




Presenter’s
Biographical
   Sketch
Presenter’s Biographical Sketch – Bob Mueller
   ◈ Bob Mueller is a senior consultant/program manager with Ops A La Carte and the Marisan
         Group. He is a product development professional with 30+ years of technical and
         management experience in software-intensive product development, R/D process
         and quality systems development, including extensive consulting experience with
         cross-functional product development teams and senior management.
   ◈ After receiving his M.S. in Physics in 1973, Bob joined Hewlett-Packard in Cupertino, CA,
         in IC process development. Over the next three decades, before leaving HP in 2002, he
         held numerous positions in R/D, R/D management and consulting, including:
            IC process development, process engineering and IC production management.
            Lead developer of an automated IC in-process/test monitor, analysis and
             control system (HP internal).
            R/D project management for S/W-intensive products (including process analysis and
             control, work cell control and quality control systems).
            Numerous R/D management positions in computer, analytical and healthcare
             businesses, including FDA-regulated systems within ISO 9001 certified organizations.
            Numerous program management positions focused on internal/external process
             improvement and consulting.
            Practice area manager and consultant for PG -- Engineering consulting team
             (internal HP)

   ◈ Bob’s current consulting interests include: warranty process and quality system
         improvement, S/W reliability, agile S/W product development methodologies, and R/D
         product strategy and technology roadmap development.
   ◈ Bob has taught many internal HP classes and has taught at local junior colleges.

Software Reliability Integration Services
                   for the Product
Reliability Integration in the Concept Phase
• Software Reliability Goal Setting
• Software Reliability Program and Integration Plan

Reliability Integration in the Design Phase
• Facilitation of Team Design Template Reviews
• Facilitation of Team Design Reviews
• Software Failure Analysis
• Software Fault Tolerance

Reliability Integration in the Implementation Phase
• Facilitation of Code Reliability Reviews
• Software Robustness and Coverage Testing Techniques

Reliability Integration in the Testing Phase
• Software Reliability Measurements and Metrics
• Usage Profile-based Testing
• Software Reliability Estimation Techniques
• Software Reliability Demonstration Tests
Software Design For Reliability (DfR)




Software Reliability
  Basic Concepts

                                    George de la Fuente
                                 georged@opsalacarte.com
                                     (408) 828-1105
                                   www.opsalacarte.com
Software Quality
        vs.
Software Reliability
Software Quality vs. Reliability
ISO 9126 Software Quality Model: Factors and Criteria

• Functionality: suitability, accuracy, interoperability, security
• Usability: understandability, learnability, operability, attractiveness
• Reliability: maturity, fault tolerance, recoverability
• Efficiency: time behavior, resource utilization
• Maintainability: analysability, changeability, stability, testability
• Portability: adaptability, installability, co-existence, replaceability

Software Quality: the level to which the software characteristics conform to
all the specifications.
Defining
Software Reliability
Software Reliability Definitions

    “The probability of failure free software operation
        for a specified period of time in a specified
                        environment”
                                            ANSI/IEEE STD-729-1991




 ◈ Examine the key points

 ◈ Practical rewording of the definition:
       Software reliability is a measure of the software failures that are visible to a
       customer and that prevent a system from delivering essential functionality for a
       specified period of time.
Software Reliability Can Be Measured

◈ Measurements (quantitative) are a required foundation
          Differs from quality, which is not defined by measurements
          All measurements and metrics are based on run-time failures
◈ Only customer-visible failures are targeted
          Only defects that produce customer-visible failures affect reliability
          Corollaries
             ◘ Defects that do not trigger run-time failures do NOT affect reliability
                – badly formatted or commented code
                – defects in dead code
             ◘ Not all defects that are triggered at run-time produce customer-
                visible failures
                  – corruption of any unused region of memory

◈ S/W Reliability evolved from H/W Reliability
          Primary distinction: S/W Reliability focuses only on design reliability


Software Reliability Is Based On Usage

 ◈ S/W failure characteristics are derived from the usage profile of a
         particular customer (or set of customers)
          Each usage profile triggers a different set of run-time S/W faults and failures

 ◈ Example of reliability perspective from 3 users of the same S/W
          Customer A
              ◘ Usage Profile – Exercises sections of S/W that produce very few failures.
              ◘ Assessment – S/W reliability is high.
          Customer B
              ◘ Usage Profile – Overlaps with Customer A’s usage profile. However,
                Customer B also exercises other sections of S/W that produce many,
                frequent failures
              ◘ Assessment – S/W reliability is low.
          Customer C
              ◘ Usage Profile – Similar to Customer B’s usage profile. However, Customer
                C has implemented workarounds to mitigate most of the S/W failures that
                were encountered. The final result is that the S/W executes with few
                failures but requires additional off-nominal steps.
              ◘ Assessment      – S/W quality is low since many workarounds are required.
                However, for the final configuration that includes these workarounds, S/W
                reliability is acceptable.

Reliability ≠ Correctness or Completeness

 ◈ Correctness is a measure of how well the requirements model the intended
         customer base or industry functionality
           Correctness is validated by reviewing product requirements and
            functional specifications with key customers

 ◈ Completeness is a measure of the degree of intended
         functionality that is modeled by the S/W design
           Completeness is validated by performing requirements traceability at
            the design phase and design traceability at the coding phase

 ◈ Reliability is a measure of the behavior (i.e., failures) that
         prevents the S/W from delivering the designed
         functionality
           If the resulting S/W does not meet customer or market expectations,
            yet operates with very few failures based on its requirements and
            design, the S/W is still considered reliable



Terminology

 Defects,
Faults and
 Failures
Software Defects That Affect Reliability

Sources
• Documentation: User Manual, Installation Guide, Technical Specs
• Development: Requirements, System Architecture, Designs, Source Code
• Validation: Unit Test Plans/Cases, System-Level Test Plans/Cases, Design Review
  Scenarios or Checklists, Code Review Scenarios or Checklists, S/W Failure Analysis
  Categories

Categories
• Soft Maintenance: Commenting, Style, Consistency, Standards/Guidelines, “Dead Code”
• Run-time Impacts: System outage, Loss of functionality, Annoyance, Cosmetic
• Failures: System outage, Loss of critical functionality
Terminology - Defect

 ◈ A flaw in S/W requirements, design or source code that
         produces unintended or incomplete run-time behavior
           Defects of commission
               ◘   Incorrect requirements are specified
               ◘   Requirements are incorrectly translated into a design model
               ◘   The design is incorrectly translated into source code
               ◘   The source code logic is flawed
           Defects of omission (these are among the most difficult classes of defects to detect)
               ◘ Not all requirements were used in creating a design model
               ◘ The source code did not implement all of the design
               ◘ The source code has missing or incomplete logic

 ◈ Defects are static and can be detected and removed without
         executing the source code

 ◈ Defects that cannot trigger S/W failures are not counted for
         reliability purposes
           These are typically quality defects that affect other aspects of S/W quality, such
            as soft maintenance defects and defects in test cases or documentation
Terminology - Fault

 ◈ The result of triggering a S/W defect by
         executing the associated source code

          Faults are NOT customer-visible
              ◘ Example: a memory leak, or a packet corruption
                that requires retransmission by the higher-layer stack

          A fault may be the transitional state that results in
            a failure
              ◘ Trivially simple defects (e.g., display spelling
                errors) do not have intermediate fault states

 [Diagram: Defect → Fault]
Terminology - Failure

 ◈ A customer (or operational system)
         observation or detection that is perceived
         as an unacceptable departure of
         operation from the designed S/W
         behavior

          Failures are the visible, run-time symptoms of faults
             ◘ Failures MUST be observable by the customer or
               another operational system

          Not all failures result in system outages

 [Diagram: Defect → Fault → Failure]
Defect-to-Failure Transition

 ◈ Example
          A S/W function (or method) processes the data stored in a
           memory buffer and then frees the allocated memory buffer
           back to the memory pool

          A defect within this function (or method), when triggered, will
           fail to free the memory buffer before completion

 [Diagram: function flow from a single entry point, through 1 of many logic
  branch points (one of which contains the defect), to 4 possible exit points]
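The short Python sketch below (not from the seminar; the buffer pool and input values
are hypothetical) mirrors this example: one rarely taken branch returns without
releasing its buffer, so the defect only becomes a fault on that path, and the eventual
failure surfaces much later when the pool is exhausted.

    # Hypothetical sketch of the example above: a fixed-size buffer pool and a
    # processing routine with a defect on one rarely executed logic path.
    class BufferPool:
        def __init__(self, size):
            self.free = [bytearray(256) for _ in range(size)]

        def acquire(self):
            if not self.free:                  # the failure surfaces here, much later
                raise RuntimeError("buffer pool exhausted")
            return self.free.pop()

        def release(self, buf):
            self.free.append(buf)

    def process(pool, data):
        buf = pool.acquire()
        buf[:len(data)] = data
        if data.startswith(b"\x00"):           # rare branch containing the defect
            return len(data)                   # DEFECT: early return skips release()
        pool.release(buf)                      # all other paths free the buffer
        return len(data)

    pool = BufferPool(size=3)
    process(pool, b"normal input")             # defect not on this path: no fault
    process(pool, b"\x00unusual input")        # fault: one buffer is silently lost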
Defect-to-Failure Transition                      (continued)


 ◈ Most of the possible logic paths do not trigger the defect
          If these are the only logic paths traversed by a customer, this portion
           of the S/W will be considered very reliable




Defect-to-Failure Transition                               (continued)


 ◈ Fault transition
          Eventually a logic path is executed that triggers the defect, resulting in a fault
           being generated
              ◘ The function (or method) completes its execution
              ◘ The fault causes the system to lose track of a single memory buffer
              ◘ The system continues to operate without a visible impact
          Since the fault causes no visible impact, a failure does NOT occur




Defect-to-Failure Transition                      (continued)


 ◈ Failure scenario
          After sufficient memory buffers have been lost,
           the buffer pool reaches a critical condition where
           either:
              ◘ No buffers are available to satisfy another
                allocation request (there are still some
                buffers in use), or
              ◘ All buffers have been lost through leakage
                (no buffers will ever be freed for future
                allocation requests)

          Once the next buffer allocation is requested, a
           failure occurs
              ◘ The system cannot continue to operate
                normally

          Note the time lag between the triggering of the
           last fault and the occurrence of the associated
           failure

 [Timeline: faults are triggered at t1, t2, …, tN; the associated failure occurs later at tF]


Summary of Defects and Failures

 ◈ There are 3 types of run-time defects
          1. Defects that are never executed (so they don’t trigger
             faults)

          2. Defects that are executed and trigger faults that do
             NOT result in failures

          3. Defects that are executed and trigger faults that result
             in failures

 ◈ Practical S/W Reliability focuses on defects
         that have the potential to cause failures by:

          1. Detecting and removing defects that result in failures
             during development

          2. Designing and implementing fault tolerance techniques to
               ◘ prevent faults from producing failures, or
               ◘ mitigate the effects of the resulting failures

 [Diagram: the three defect paths: defect never executed; defect → fault;
  defect → fault → failure]



Failure Distributions,
    Failure Rates
         and
        MTTF
Reliability and Failure Distributions

         Restated, reliability is the probability that a system does not
              experience a failure during a time interval, [0,T].


 ◈ Reliability is a measure of statistical probability, not certainty
          Ex: A system has a 99% reliability over a period of 100 days
              ◘ Does this imply that only 1 failure will occur during the 100 day period?

 ◈ Reliability is based on failure distribution models
          Represent the time distribution of failure occurrences
          Various failure distribution models exist:
              ◘   Exponential (most commonly used in S/W reliability)
              ◘   Weibull
              ◘   Poisson
              ◘   Normal
              ◘   Rayleigh
              ◘   etc….


 ◈ Let’s examine an exponential failure distribution model
Failure Distributions - Exponential

 ◈ Exponential Reliability Function
          The most widely used failure distribution is the exponential reliability function:
             ◘ Models a random distribution of failure occurrences
          Defined by:

                              R(t) = e^(−λt)

          where
             ◘ t is mission time
                     – the system is assumed to be operational at t = 0
                     – the mission duration is represented by T
             ◘ λ is a constant, instantaneous failure rate (or failure intensity)
             ◘ MTTF = 1 / λ (for repairable systems)

          [Plot: R(t) decaying exponentially over time, shown for λ = 0.1 failures/hr.]
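As a quick illustrative check (Python; not part of the original slides, and the helper
name is arbitrary), the exponential model can be evaluated directly:

    import math

    # Minimal sketch of the exponential reliability model: R(t) = exp(-lambda * t).
    def reliability(failure_rate, t):
        """Probability of surviving to time t under a constant failure rate."""
        return math.exp(-failure_rate * t)

    lam = 0.1                         # failures per hour (the slide's example)
    print(1 / lam)                    # MTTF = 10 hours
    print(reliability(lam, 1))        # ~0.905 -> about 90% reliable at t = 1 hr
    print(reliability(lam, 1 / lam))  # ~0.368 -> always ~37% reliable at t = MTTF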
A Closer Look At The Exponential Distribution

Example parameters:
• Mission duration: T = 100 hours
• Failure rate: λ = 0.1 failures/hr. (or 1 failure every 10 hrs.)
• MTTF = 10 hrs.

[Plot: reliability R(t) vs. time (hrs) for the values above]

• At t = 1 hr., the reliability is 90%
• When t = MTTF, the reliability is always 37%:
     R = e^(−λt) = e^(−(1/MTTF) × MTTF) = e^(−1) ≈ 37%
A Closer Look at Reliability Values

 ◈ Based on an exponential failure distribution, what does it mean
         for S/W to have 99% reliability after one year of operation?
          For a single S/W product:
              ◘ There is a 99% probability that the S/W will still be operational after 1 year
                  – Conversely, there is a 1% chance of a failure during that period.
              ◘ Note that this value does NOT tell us when, during the 1-year period, a
                failure will occur.
                   – With the exponential distribution, the cumulative probability that a
                     failure has occurred increases as time progresses.


          For a group of software products (e.g., 100 products):
              ◘ 99% of the products will be operational after 1 year (e.g., 99 products)
              ◘ There is a 36.6% probability that all 100 products will be operational after
                1 year
                   – This is computed by multiplying the reliabilities of all the products:
                   R(t) = R1(t) × R2(t) × … × R100(t)
                        = 0.99 × 0.99 × … × 0.99
                        = 0.99^100
                        ≈ 0.366


Sample Reliability Calculations

 ◈ What is the failure rate (λ) and MTTF necessary to achieve
         this level of reliability?
              t = 1 yr. = 8,760 hrs.

              R(t)     = e^(−λt)
              0.99     = e^(−λ × 8760)
              ln(0.99) = −λ × 8760

              λ        = −ln(0.99) / 8760
                       = 1.1 × 10^−6 failures/hr. (1 failure every 99.5 years)

              MTTF     = 1 / λ
                       = 871,613 hrs. (99.5 yrs.)

 ◈ What is the reliability at the MTTF?
              t = MTTF = 871,613 hrs.
              R(MTTF)  = e^(−λ × MTTF)
                       = e^(−(1/MTTF) × MTTF)
                       = e^(−1)
                       = 0.368 (~37%)
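The same arithmetic as a small Python sketch (illustrative only, not part of the
seminar materials); it also reproduces the 0.99^100 ≈ 0.366 figure quoted on the
previous slide for a group of 100 products:

    import math

    # Required failure rate and MTTF for 99% reliability over one year
    # (exponential model assumed).
    R_target = 0.99
    t = 8760.0                     # one year in hours

    lam = -math.log(R_target) / t  # ~1.1e-6 failures/hr
    mttf = 1 / lam                 # ~871,600 hrs (~99.5 years)
    print(lam, mttf)

    print(math.exp(-lam * mttf))   # reliability at t = MTTF: ~0.368 (~37%)
    print(R_target ** 100)         # ~0.366: chance that all 100 units survive 1 year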
Software and Hardware Failure Rates

[Chart: S/W and H/W failure rate curves over the product life cycle]

• Software: the failure rate is driven by the effectiveness of S/W defect detection and
  repair processes over the span of many upgrades; its phases are Pre-release Testing,
  Useful Life (with upgrades), and Obsolescence.
• Hardware: the failure rate is driven by three very different physical failure domains;
  its phases are Burn-In, Useful Life, and Wearout.
• Initial system deployment (i.e., completion of the Pre-release Testing and Burn-In
  phases) establishes a baseline for both the S/W (λSW-B) and H/W (λHW-B) failure rates.
Software
Availability
System Availability

     Availability is the percentage of time that a system is operational,
               accounting for planned and unplanned outages.

 ◈ Example: 90% Availability (for a timeframe T)
          Logical representation
              ◘ The system is operational for the first 90% of the timeframe (0.9T) and down
                for the last 10% (0.1T)

          Actual (or possible) representation
              ◘ 3 failures, each followed by a system restoration, cause the system to be down
                for 10% of the timeframe

 [Timeline: timeframe T with alternating “failure occurs” / “system restored” intervals]
System Availability                     (continued)


 ◈ System availability, A(T), is the relationship between the
         timeframes when a system is operational vs. down due to a
         failure-induced outage, and is defined as:

                              A(T) = MTBF / (MTBF + MTTR)

         where,
          The system is assumed to be operational at time t = 0
          T = MTBF + MTTR and 0 ≤ t ≤ T
          MTBF (Mean Time Between Failures) is based on the failure rate
          MTTR (Mean Time To Repair) is the duration of the outage (i.e., the expected
            time to detect, repair and then restore the system to an operational state)
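A minimal Python illustration of this relationship (the failure and repair numbers
are assumed for the example, not taken from the slides):

    # A(T) = MTBF / (MTBF + MTTR); helper and example values are illustrative.
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    a = availability(mtbf_hours=1000.0, mttr_hours=1.0)  # fail ~every 1,000 hrs, 1-hr outages
    print(a)                   # ~0.999 -> roughly "three nines"
    print((1 - a) * 8760)      # ~8.75 hours of downtime per year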




Software Availability

 ◈ System outages that are caused by S/W can be attributed to:
    1. Recoverable S/W failures
    2. S/W upgrades
    3. Unrecoverable S/W failures
         NOTE: Recoverable S/W failures are the most frequent S/W cause of
               system outages

 ◈ For outages due to recoverable S/W failures, availability is
         defined as:

                              A(T) = MTTF / (MTTF + MTTR)

         where,
          MTTF is the Mean Time To [next] Failure
          MTTR (Mean Time To [operational] Restoration) is still the duration of the
            outage, but without the notion of a “repair time”. Instead, it is the time until the
            same system is restored to an operational state via a system reboot or some
            level of S/W restart.


Software Availability                       (continued)


 ◈ A(T) can be increased by either:
          Increasing MTTF (i.e., increasing reliability) using S/W reliability practices
          Reducing MTTR (i.e., reducing downtime) using S/W availability practices

 ◈ MTTR can be reduced by:
          Implementing H/W redundancy (sparingly) to mask most likely failures
          Increasing the speed of failure detection (the key step)
          S/W and system recovery speeds can be increased by implementing Fast Fail
            and S/W restart designs
              ◘ Modular design practices allow S/W restarts to occur at the smallest
                 possible scope, e.g., thread or process vs. system or subsystem
              ◘ Drastic reductions in MTTR are only possible when availability is part of the
                 initial system/software design (like redundancy)


 ◈ Customers generally perceive enhanced S/W availability as a S/W
         reliability improvement
          Even if the failure rate remains unchanged


System Availability Timeframes
   Availability Class                                 Availability                 Downtime per year            Downtime per 3 months
   (1) Unmanaged                                      90% (1 nine)                 36.5 days (52,560 mins)      9.13 days
   (2) Managed (good web servers)                     99% (2 nines)                3.65 days (5,256 mins)       21.9 hours
   (3) Well-managed                                   99.9% (3 nines)              8.8 hours (525.6 mins)       2.19 hours
   (4) Fault Tolerant (better commercial systems)     99.99% (4 nines)             52.6 mins                    13.14 minutes
   (5) High-Availability (high-reliability products)  99.999% (5 nines)            5.3 mins                     1.31 minutes
   (6) Very-High-Availability                         99.9999% (6 nines)           31.5 secs (2.6 mins/5 yrs)   7.88 seconds
   (7) Ultra-Availability                             99.99999% (7 nines) to       3.2 secs down to 31.5        0.79 seconds or less
                                                      99.9999999% (9 nines)        millisecs (15.8 secs/5 yrs or less)
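The per-year figures above follow directly from the availability percentages; a small
Python check (illustrative, not part of the slides):

    # Downtime per year implied by each availability level (525,600 minutes/year).
    for nines, avail in [(1, 0.9), (2, 0.99), (3, 0.999),
                         (4, 0.9999), (5, 0.99999), (6, 0.999999)]:
        downtime_min = (1 - avail) * 365 * 24 * 60
        print(f"{nines} nine(s): {downtime_min:,.1f} min/year")
    # 52,560.0 / 5,256.0 / 525.6 / 52.6 / 5.3 / 0.5 (about 31.5 seconds)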
Input
Robustness
Software Robustness

      Software Robustness is a measure of the software’s ability to
   handle exceptional input conditions so they do not become failures.


 ◈ Exceptional input conditions result from:
          Inputs that violate data value constraints
          Inputs that violate data relationships
          Inputs that violate the application’s timing requirements

 ◈ Robust S/W prevents exceptional inputs from:
         1. Causing a system outage
         2. Producing a silent failure by providing no indication that an exceptional input
            condition was detected, thus allowing the failure to propagate
         3. Generating an error condition or response that incorrectly characterizes the
           exceptional input condition

 ◈ S/W robustness becomes increasingly important as a system
         becomes more flexible and the product’s customer base increases
         in size and usage diversity
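A minimal Python sketch of these goals (the function and its limits are hypothetical,
not from the seminar): reject an exceptional input with an accurate, visible error
rather than crashing or failing silently.

    # Illustrative input-robustness sketch: validate data value constraints and
    # report exceptional inputs explicitly instead of letting them propagate silently.
    def set_target_temperature(celsius):
        if not isinstance(celsius, (int, float)):
            raise TypeError(f"temperature must be numeric, got {type(celsius).__name__}")
        if not -40.0 <= celsius <= 125.0:      # hypothetical data value constraint
            raise ValueError(f"temperature {celsius} is outside the supported range [-40, 125]")
        return celsius                         # accepted: pass the validated value onward

    try:
        set_target_temperature("hot")          # exceptional input is detected...
    except (TypeError, ValueError) as err:
        print(f"rejected input: {err}")        # ...and reported, not silently dropped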
Why Is Software Robustness Important ?

[Diagram: the input sets of Users #1, #2, #3, … #n feed the Program; a subset of those
 inputs (“inputs causing erroneous outputs”) maps to erroneous outputs within the
 program’s output set]


Software Robustness Studies

 ◈ 2 studies of S/W robustness
          Examined exceptional input condition testing of POSIX-compliant OSes and UNIX
           command line utilities
          Robustness testing was repeated on multiple releases containing fixes for the
           reported exceptional input failures

 ◈ Findings
          Failure rates associated with robustness testing were significant,
              ◘ Ranging from 10% - 33%
          After many significant, focused S/W fixes over multiple releases, failure rates
           still remained high

 ◈ Conclusions
          Traditional functional testing does not adequately test for exceptional input
           conditions
          Operational profiles testing also does not adequately test for exceptional input
           conditions
             ◘ (Reason) Operational profile testing prioritizes and sets limits on functional
                testing.
          Specific techniques are required to provide adequate test coverage and handling
           of exceptional input conditions



Software
  Fault
Tolerance
Software Fault Tolerance

         The ability of software to avoid executing a fault in a way that
                            results in a system failure.


 ◈ Despite the best development efforts, almost all systems are
         deployed with defects with the potential to produce critical
         failures
           A major study of S/W defects showed that 1% of customer-reported failures
            within the 1st year produce system outages


 ◈ Fault tolerance increases the fault-resistant quality of a system
         during run-time by
           Detecting faults at the earliest possible point of execution
           Containing the damaging effects of a fault to the smallest possible scope
           Performing the most reliable recovery action possible

 ◈ Fault tolerant designs focus on handling “complex” failures
           Address defects that are not likely to be triggered during testing
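An illustrative Python sketch (not from the seminar; the work queue and retry policy
are assumptions) of the three steps above: detect a fault where it occurs, contain it
to a single work item, and recover without a system-level failure.

    import logging

    # Detect early, contain the damage to one item, and recover with a simple policy.
    def process_item(item):
        if item is None:                       # detect at the earliest possible point
            raise ValueError("null work item")
        return len(item)

    def run(queue, max_retries=2):
        for item in queue:
            for attempt in range(max_retries + 1):
                try:
                    process_item(item)
                    break                      # success: move on to the next item
                except ValueError as err:      # contain: only this item is affected
                    logging.warning("fault on %r (attempt %d): %s", item, attempt + 1, err)
            # recovery policy: after the retries, skip the item rather than crash the run

    run(["ok", None, "also ok"])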
So,
  What Is
  Reliable
Software ??
Reliable Software Characteristics Summary

 ◈ Operates within the reliability specification that satisfies customer
         expectations
          Measured in terms of failure rate and availability level
          The goal is rarely “defect free” or “ultra-high reliability”



 ◈ “Gracefully” handles erroneous inputs from users, other systems,
         and transient hardware faults
          Attempts to prevent state or output data corruption from “erroneous” inputs



 ◈ Quickly detects, reports and recovers from S/W and transient
         H/W faults
          S/W provides the system behavior of continuously monitoring, “self-diagnosing”
           and “self-healing”
          Prevents as many run-time faults as possible from becoming system-level
           failures



Questions?




Software Design For Reliability (DfR) Seminar



   A “Best Practices”
      Approach to
      Developing
   Reliable Software
                                       George de la Fuente
                                    georged@opsalacarte.com
                                        (408) 828-1105
                                      www.opsalacarte.com
Most Common Paths to Reliable Software

 1. Rely on H/W redundancy to mask out all S/W faults
          The most attractive and expensive approach
          Provides increased system-level reliability using an availability
           technique
          Requires minimal S/W reliability

 2. “Testing In” reliability
          The most prevalent approach
          Limited and inefficient approach to defect detection and removal
             ◘ System testing will leave at least 30% of the code untested
             ◘ System testing will detect at best ~55% of all run-time failures
          Most companies don’t continue testing until their reliability targets are
           reached
             ◘ The testing phase is usually fixed in duration before the S/W is
                developed and is focused on defect removal not reliability testing
          S/W engineers will spend more than 1/2 of their time in the test phase
           using this approach
S/W Design for Reliability

 3. S/W Design for Reliability
         The least utilized and understood approach

          Common methodologies
            1) Formal methods
            2) Programs based on H/W reliability practices
            3) S/W process control
            4) Augmenting traditional S/W development with
               “best practices”



Formal Methods

 ◈ Formal Methods (not commonly used for commercial SW)
          Methodologies for system behavior analysis and proof of correctness
             ◘ Utilize mathematical modeling of a system’s requirements and/or
                design

          Primarily used in the development of safety-critical systems that
           require very high degrees of:
             ◘ Confidence in expected system performance
             ◘ Quality audit information
             ◘ Targets of low or near zero failure rates
          Formal methods are not applicable to most S/W projects
             ◘ Cannot be used for all aspects of system design (e.g., user
               interface design)
             ◘ Do not scale to handle large and complex system development
             ◘ Mathematical requirements exceed the background of most S/W
               engineers




Using Hardware Reliability Practices

 ◈ S/W and H/W development practices are still fundamentally
         different
           The H/W lifecycle primarily focuses on architecture and design modeling
           S/W design modeling tools are rarely used
               ◘ Design-level simulation verification is limited
                   – Especially if a real-time operating system is required
               ◘ S/W engineers still challenge the value of generating complete designs
                      – This is why S/W design tools support 2-way code generation
           Inherent S/W faults stem from the design process
               ◘ There is no aspect of faults from manufacturing or wear-out

 ◈ S/W is not built as an assembly of preexisting components
           True S/W component “reuse” is rare
               ◘ Most “reused” S/W components are at least “slightly” modified
               ◘ Modified “reused” S/W components are not certified before use
           S/W components are not developed to a specified set of reliability characteristics
           3rd party S/W components do not come with reliability characteristics
Hardware Reliability Practices

 ◈       …. assembly of preexisting components (continued)
          Acceleration mechanisms do not exist for S/W reliability testing


          Extending S/W designs after product deployment is commonplace
              ◘ H/W is designed to provide a stable, long-term platform
              ◘ S/W is designed with the knowledge that it will host frequent product
                customizations and extensions
              ◘ S/W updates provide fast development turnaround and have little or no
                manufacturing or distribution costs


          H/W failure analysis techniques (FMEAs and FTAs) are rarely successfully applied
           to S/W designs
              ◘ S/W engineers find it difficult to adapt these techniques below the system
                level




Software Process Control Methodologies

 ◈ S/W process control assumes a correlation between the maturity
         of the development process and the latent defect density in the
         final S/W
                    CMM Level   Defects/KLOC         Estimated Reliability
                        5             0.5                        99.95%

                        4          1.0 - 2.5              99.75% - 99.9%

                        3          2.5 – 3.5             99.65% - 99.75%

                        2          3.5 – 6.0              99.4% - 99.65%

                        1          6.0 – 60.0                  94% - 99.4%



 ◈ Process audits and more strict controls are implemented if
         the current process level does not yield the desired S/W
         reliability
          Process root cause analysis may not yield great improvement
             ◘ Practices within the processes must be fine tuned (but how??)
          Reliability improvement under this type of methodology is slow
             ◘ Process outcome cannot vary too much in either direction

“Best Practices”
          for
Software Development
Sources of Industry Data

 Data was derived from a large-scale international survey of S/W
 lifecycle quality spanning:
          18 years (1984-2002)
          12,000+ projects
          600+ companies
               ◘ 30+ government/military organizations
          8 classes of software applications:
               1. Systems S/W
               2. Embedded S/W
               3. Military S/W
               4. Commercial S/W
               5. Outsourced S/W
               6. Information Technology (IT) S/W
               7. End-User developed personal S/W
               8. Web-based S/W


Terminology
 ◈ Best Practice
          A key S/W quality practice that significantly contributes towards increasing S/W
           reliability

 ◈ Best in Class Companies
          Companies that have the following two characteristics:
              ◘ Recognized for producing S/W-based products with the lowest failure rate
                in their industry
              ◘ Consistently deploying software based on their initial schedule targets

 ◈ Formal practice
          A S/W quality development practice that is well-understood and consistently
           implemented throughout the software development organization.
              ◘ Note: Formal practices are rarely undocumented.

 ◈ Informal practice
          A S/W quality development practice that is either implemented with varying
           degrees of rigor or in an inconsistent manner throughout the software
           development organization.
              ◘ Note: Informal practices are usually accompanied by the absence of
                 documented guidelines or standards.


“Best in Class” Company Best Practices

 ◈ S/W Life Cycle Practices
          Consistent implementations of the entire S/W lifecycle phases
           (requirements, design, code, unit test, system test and maintenance)

 ◈ Requirements
          Involve test engineers in requirements reviews
          Define quality and reliability targets
          Define negative requirements (i.e., “shall nots”)

 ◈ Development phase defect removal
          Formal inspections (requirements, design, and code)
          Failure analysis

 ◈ Design
          Team or group-oriented approach to design for the system and S/W
              ◘ NOTE: System design team includes other disciplines (e.g., H/W & Mech)




“Best in Class” Company Best Practices (continued)
 ◈ Testing
          Robust Testing strategy to meet business / customer requirements
          Test plans completed and reviewed before the coding phase
          Mandatory developer unit testing
          Independently verify/test every software change (enhancements and fixes)
          Create formal test plans for all medium and large-sized projects
          Staff an independent and dedicated SQA team to at least 5% of size of the S/W
           development team
          Generate quality or reliability estimates
          Incorporate automated test tools into the test cycle


 ◈ S/W Quality Assurance
          Review and prioritize all changes after the development phase
          Record and track all changes to S/W artifacts throughout the life cycle
          Formalize unit testing reviews (test plans and results)
          Implement active quality assurance programs
          Root-cause analysis with resolution follow-up
          Gather and review customer product feedback
“Best in Class” Company Best Practices (continued)

 ◈ SCM and Defect Tracking
          Implement formal change management of artifact changes and S/W releases
          Incorporate automated defect tracking tools


 ◈ Metrics and Measurements
          Record and track all defects and failures
          Collect field data for root cause analysis on next project or release iteration
          Measure code test coverage
          Generate metrics based on code attributes (e.g., size and complexity)
          Generate defect removal efficiency measurements
          Track “bad fixes”




Weaknesses in S/W Development Practices

 ◈ Lack of engineer “ownership” for development and test practices
           Limited efficiency and effectiveness improvements made
           May lead to disjoint practices, resulting in no real “common” practices

 ◈ System design is “H/W-centric”
           Primary focus on H/W feasibility, functionality and performance
           Architectural reviews are not collaborative, team design sessions
           S/W requirements of the H/W platform are generally not entertained or
            implemented


 ◈ S/W defect removal relies mostly on system or subsystem-level
         testing
           Development phase defect removal is limited to cursory code reviews and sparse
            unit testing
               ◘ Designs and design reviews are satisfied using functional or interface
                   specifications
           No causal analysis is performed to improve future defect removal


Weaknesses in S/W Development Practices

 ◈ Limited system and S/W quality measurements and metrics
          Use of default defect tracking tool statistics as primary metrics/measurements
          Generally no data mining capability available for analysis


 ◈ Informal SQA processes and staffing leads to wasted efforts and
         incomplete coverage
          Too many trivial defects still present during system test phase
          Defect fixes that introduce additional defects are frequent
          S/W is shipped with many untested sections
          Significant, recurring, “real world” customer scenarios remain untested


 ◈ Limited or no tool support for:
          Unit testing
          Automated regression testing
          S/W analysis (static, dynamic, and coverage)

Application Behavior Patterns

  S/W Quality Methods          System S/W                                     Embedded S/W

  Summary                      Overall, best S/W quality results              Wide range of S/W quality results
  Defect Removal Efficiency    Usually > 96%                                  Up to > 94%
  Project Sizes                Best quality results found in                  Most projects are < 26.5 KLOCs
                               projects with > 550 KLOCs
  Inspections                  Formal design and code inspections             Usually do not implement design or
                                                                              code inspections (and not formally)
  Test Teams                   Independent SQA team                           Usually do not have separate SQA teams
  Measurement Control          Formal S/W quality measurement                 Informal S/W quality measurement
                               process and tools                              processes and tools
  Change Control               Formal change control process and tools        Informal change control process and tools
  Test Plans                   Formal test plans                              Usually do not implement formal test plans
  Unit Testing                 Performed by developers                        Performed by developers
  Testing Stages               6 to 10 test stages                            3 to 6 test stages
                               (performed by SQA team)                        (usually performed by developers)
  Governing Processes          CMM/CMMI and Six-Sigma methods                 No consistent pattern found
Application Behavior Patterns

  S/W Quality Methods          System S/W                                     Commercial S/W

  Summary                      Overall, best S/W quality results              Wide range of S/W quality results
  Defect Removal Efficiency    Usually > 96%                                  Up to > 90%
  Project Sizes                Best quality results found in                  Most projects are > 275 KLOCs
                               projects with > 550 KLOCs
  Inspections                  Formal design and code inspections             Inconsistent use of formal design or
                                                                              code inspections
  Test Teams                   Independent SQA team                           Inconsistent use of independent SQA teams
  Measurement Control          Formal S/W quality measurement                 Informal S/W quality measurement
                               process and tools                              processes and tools
  Change Control               Formal change control process and tools        Formal change control process and tools
  Test Plans                   Formal test plans                              Formal test plans
  Unit Testing                 Performed by developers                        Performed by developers
  Testing Stages               6 to 10 test stages                            3 to 8 test stages
                               (performed by SQA team)                        (extensive reliance on Beta trials)
  Governing Processes          CMM/CMMI and Six-Sigma methods                 No consistent pattern found
Software
Defect Removal
  Techniques
Defect Origin and Discovery

           Typical Behavior
           • Defect Origin: Requirements, Design, Coding, Testing, Maintenance
           • Defect Discovery: concentrated in Testing and Maintenance (“Surprise!”),
             long after most of the defects were introduced

           Goal of Best Practices on Defect Discovery
           • Defect Origin: Requirements, Design, Coding, Testing, Maintenance
           • Defect Discovery: shifted earlier, into the same phases in which the
             defects originate
Software Defect Removal Techniques

                 Defect Removal Technique                         Efficiency Range

                Design inspections                                   45% to 60%

                Code inspections                                     45% to 60%

                Unit testing                                         15% to 45%

                Regression test                                      15% to 30%

                Integration test                                     25% to 40%

                Performance test                                     20% to 40%

                System testing                                       25% to 55%

                Acceptance test (1 customer)                         25% to 35%


   ◈ Development organizations try to find and remove more defects by
         implementing more stages of system testing
            Since there is a wide range of overlap between test stages, this approach
             becomes less efficient as it scales (see the sketch below)
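A back-of-the-envelope Python sketch of why stacking stages has diminishing returns.
It assumes each stage acts independently on the defects still remaining (optimistic,
given the overlap noted above) and uses mid-range values from the table; it is
illustrative only.

    # Cumulative removal efficiency if each stage independently removes a fraction
    # of the defects that are still present (an optimistic simplification).
    def cumulative_efficiency(stage_efficiencies):
        remaining = 1.0
        for e in stage_efficiencies:
            remaining *= (1.0 - e)             # fraction of defects still escaping
        return 1.0 - remaining

    print(cumulative_efficiency([0.30, 0.22, 0.32, 0.40]))   # unit+regression+integration+system: ~0.78
    print(cumulative_efficiency([0.52, 0.52, 0.30, 0.40]))   # design+code inspections, unit, system: ~0.90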

Defect Removal Technique Impact

   Combinations of defect removal practices (n = not used, Y = used) vs.
   median defect removal efficiency:

   Design Inspections/Reviews     n     n     n     n     Y     n     Y     Y
   Code Inspections/Reviews       n     n     n     Y     n     n     Y     Y
   Formal SQA Processes           n     Y     n     n     n     Y     n     Y
   Formal Testing                 n     n     Y     n     n     Y     n     Y
   Median Defect Efficiency      40%   45%   53%   57%   60%   65%   85%   99%

   This large potential available from design and code inspections/reviews is why most
   development organizations see greater improvements in S/W reliability from investments
   in the development phase than from further investments in testing.
   NOTE: Design review results are based on low-level design reviews.
Case Study: Quantifying the Software Quality Investment

 Objective:
     Develop a value-based approach to determine the necessary
      S/W quality investment using dependability attributes.

 Methodology:
      Use an integrated approach of project cost and quality
       estimation models (COCOMO II and COQUALMO) and
       empirically based business value relationship factoring to
       analyze the data from a diverse set of 161 well-measured
       S/W projects.

  Findings:
      The methodology was able to correlate the optimal S/W
        project quality investment level and strategy to the required
        project reliability level, based on defect impact.
      The objectives were satisfied without using specific S/W
        reliability practices, focusing instead heavily on defect
        detection during system testing.
Reliability, Development Cost & Test Time Tradeoffs




[Chart: reliability vs. development cost and test-time tradeoffs, based on COCOMO II
 (Constructive Cost Model)]
Reliability, Development Cost & Test Time Tradeoffs




                                   The relative cost per source instruction to achieve a
                                   Very High RELY rating is less than the amount of
                                   additional testing time that is required (54%), since
                                   early defect prevention reduces the required rework
                                   effort and allows for additional testing time.

                                   (Based on COCOMO II, the Constructive Cost Model)
Delivered Defects Scale




  ◈ A Very Low rating delivers roughly the same number of defects as are introduced
  ◈ An Extra High rating reduces the delivered defects by a factor of 37

  Note: The assumed nominal defect introduction rate is 60 defects/KSLOC based
          on the following distribution:
         10 requirements defects/KSLOC
         20 design defects/KSLOC
         30 coding defects/KSLOC
(v0.1)                                     Ops A La Carte ©   Based on COQUALMO (Constructive Quality Model)   25
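As a quick worked check of the note above, the short sketch below applies the nominal introduction rate to a hypothetical project; the 100 KSLOC size is an arbitrary assumption for illustration.

# Sketch: delivered defects at the two ends of the RELY scale, using the
# nominal introduction rate quoted above (60 defects/KSLOC).
# The 100 KSLOC project size is an arbitrary, illustrative assumption.

INTRODUCED_PER_KSLOC = 10 + 20 + 30        # requirements + design + coding = 60
project_ksloc = 100

introduced = INTRODUCED_PER_KSLOC * project_ksloc
very_low_delivered = introduced            # ~same number delivered as introduced
extra_high_delivered = introduced / 37     # reduced by a factor of 37

print(f"Introduced:              {introduced} defects")
print(f"Delivered (Very Low):   ~{very_low_delivered} defects")
print(f"Delivered (Extra High): ~{extra_high_delivered:.0f} defects (~{extra_high_delivered / project_ksloc:.1f}/KSLOC)")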
Defect Removal Factors Scale

   Very Low
      Automated Analysis:            Compiler-based simple syntax checking
      Peer Reviews:                  No peer reviews
      Execution Testing and Tools:   No testing

   Low
      Automated Analysis:            Basic compiler capabilities
      Peer Reviews:                  Ad-hoc informal walk-throughs
      Execution Testing and Tools:   Ad-hoc testing and debugging

   Nominal
      Automated Analysis:            Compiler extensions. Basic requirements and design consistency.
      Peer Reviews:                  Well-defined sequence for preparation, review, and minimal follow-up.
      Execution Testing and Tools:   Basic test, test data management and problem-tracking support.
                                     Test criteria based on checklist.

   High
      Automated Analysis:            Intermediate-level module and intermodule. Simple requirements
                                     and design.
      Peer Reviews:                  Formal review roles and well-trained participants using basic
                                     checklists and follow-up procedures.
      Execution Testing and Tools:   Well-defined test sequences tailored to the organization. Basic
                                     test coverage tools and test support system. Basic test process
                                     management.

   Very High
      Automated Analysis:            More elaborate requirements and design. Basic distributed-processing
                                     and temporal analysis, model checking, and symbolic execution.
      Peer Reviews:                  Basic review checklists and root cause analysis. Formal follow-up
                                     using historical data on inspection rates, preparation rates, and
                                     fault density.
      Execution Testing and Tools:   More advanced test tools and test data preparation, basic test
                                     oracle support and distributed monitoring and analysis, and
                                     assertion checking. Metrics-based test process management.

   Extra High
      Automated Analysis:            Formalized specification and verification. Advanced
                                     distributed-processing.
      Peer Reviews:                  Formal review of roles and procedures. Extensive review checklists
                                     and root cause analysis. Continuous review-process improvement.
                                     Statistical process control.
      Execution Testing and Tools:   Highly advanced tools for test oracles, distributed monitoring and
                                     analysis, and assertion checking. Integration of automated analysis
                                     and test tools. Model-based test process management.
(v0.1)                                        Ops A La Carte ©   Derived from COQUALMO (Constructive Quality Model)   26
DfR Based on “Best Practices”
         Modified Existing Best Practices                          New Best Practices
                                        S/W Life Cycle Practices
   Consistent implementation of all S/W life cycle        Reliability testing as part of the overall
     phases (requirements, design, code, unit test,         testing strategy
     system test and maintenance)                          Define reliability goals as requirements
                                    Metrics and Measurements
  Record and track all defects and failures            Generate defect removal efficiency
  Collect field data for root cause analysis on         measurements
    next project or release iteration                   Track fix rationale such as “bad” fixes or
                                                         untested code
                                                        Collect failure data for analysis during the
                                                         system test phase
                                   Development Phase Practices
  Reviews of design and code                          Assess designs for availability
  Targeted developer unit testing                     Perform failure analysis
                                                Testing
  Independently verify/test every S/W                 Generate reliability estimates
    change (enhancements and fixes)
                                                  SQA
  Perform failure root-cause analysis
  Record and track all changes to S/W
    artifacts throughout the life cycle
(v0.1)                                       Ops A La Carte ©                                            27
Summary: DfR Based on “Best Practices”

 ◈ The strength of a “best practices” approach is its intuitiveness
          Incorporate considerations of essential functionality and failure behavior in order
            to understand failure modes and improve availability
          Perform design analysis to identify potential failure points and, where possible,
            redesign to remove failure points or reduce their impact
          Analyze S/W for critical failure trigger points and remove them or reduce their
            impact and frequency where possible
          Plan testing to maximize the overall S/W verification prior to field deployment
          Let measured data drive changes to reliability practices

 ◈ Focus on the removal of critical failures instead of all defects
          S/W with known defects and faults may still be perceived as reliable by users
              ◘ NASA studies identified projects that produced reliable S/W with only 70% of
                 the code tested
          Removing X% of the defects in a system will not necessarily improve the
            reliability by X%.
              ◘ One IBM study showed that removing 60% of the product’s defects resulted
                 in only a 3% reliability improvement
              ◘ S/W defects in rarely executed sections of code may never be encountered
                 by users and therefore may not improve reliability
                   – Exceptions for essential operations: boot, shutdown, data backup, etc.
(v0.1)                                    Ops A La Carte ©                                       28
Questions?




(v0.1)     Ops A La Carte ©   29
Software Design For Reliability (DfR) Seminar



        Reliability
       Measurements
           and
         Metrics
Metrics Supporting Reliability Strategies

   Common strategies and tactics used by teams developing
   highly reliable software products
         Explicit, robust reliability requirements during requirements phase

         Appropriate use of fault tolerant techniques in product design

         Robust design/operational requirements for maximizing product
           Availability

         Focused, targeted (data driven) defect inspection program

         Robust testing strategy and program:
           well defined focused mix of unit, regression, integration, system,
           exploratory and reliability demonstration testing

         Robust defect tracking/metrics program focused on the important few
             ◘ Defect tracking and analysis for all phases of a product’s life,
               including post-shipment defects/failures (FRACAS)
(v0.1)                                 Ops A La Carte ©                          2
Reliability Measurements and Metrics
 ◈ Definitions
          Measurements – data collected for tracking or to calculate meta-data (metrics)
               ◘ Ex: defect counts by phase, defect insertion phase, defect detection phase
          Metrics – information derived from measurements (meta-data)
               ◘ Ex: failure rate, defect removal efficiency, defect density


 ◈ Reliability measurements and metrics accomplish several goals
            Provide estimates of S/W reliability prior to customer deployment
            Track reliability growth throughout the life cycle of a release
            Identify defect clusters based on code sections with frequent fixes
            Determine where to focus improvements based on analysis of defect/failure data


 Note: S/W Configuration Management (SCM) and defect tracking
       tools should be updated to facilitate the automatic tracking
       of this information
               ◘ Allow for data entry in all phases, including development
               ◘ Distinguish code base updates for critical defect repair vs. any other
                  changes, (e.g., enhancements, minor defect repairs, coding standards
                  updates, etc.)
(v0.1)                                      Ops A La Carte ©                                  3
Critical Measurements To Collect


         Measurement                                          Description
                                   Number of critical defects found during each non-operational
 Critical Defects by Phase
                                   phase (i.e. requirements, design, and coding)

                                   Number of critical failures found during each operational phase
 Critical Failures by Phase
                                   (i.e. unit testing, system testing, and field)

                                   The phase where the critical defect (or critical failure) was
 Critical Defect Insertion Phase
                                   inserted (or originated)
                                   The phase where the critical defect (or critical failure) was
 Critical Defect Detection Phase
                                   detected (or reported)

                                   A high-level indicator of a critical defect’s location within the
 Critical Defect Major Location
                                   source code (e.g., a S/W component or file name)

                                   A low-level indicator of a critical defect’s location within the
 Critical Defect Minor Location
                                   source code (e.g., the name of a class, method or data object)

                                   The elapsed time from the start of a test run to the occurrence of a
 Critical Failure Time
                                   critical failure (typically measured in CPU or wall-clock time)

 Critical Failure Root Cause       The relevant failure category for a specified critical failure




(v0.1)                                     Ops A La Carte ©                                             4
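One way to make these measurements collectible is to record them as structured fields in the defect tracker. The sketch below shows a minimal record layout in Python; the field and enum names are hypothetical, chosen only to mirror the rows of the table above.

# Minimal sketch of a critical defect/failure record capturing the
# measurements listed above. Field and enum names are illustrative,
# not a prescribed schema.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Phase(Enum):
    REQUIREMENTS = "requirements"
    DESIGN = "design"
    CODING = "coding"
    UNIT_TEST = "unit test"
    SYSTEM_TEST = "system test"
    FIELD = "field"

@dataclass
class CriticalDefectRecord:
    defect_id: str
    insertion_phase: Phase          # where the defect originated
    detection_phase: Phase          # where it was found/reported
    major_location: str             # e.g., component or file name
    minor_location: str             # e.g., class, method or data object
    failure_time_hours: Optional[float] = None   # elapsed test time, if it surfaced as a failure
    root_cause_category: Optional[str] = None    # failure category for a critical failure

# Example entry (hypothetical)
leak = CriticalDefectRecord(
    defect_id="CR-1042",
    insertion_phase=Phase.CODING,
    detection_phase=Phase.SYSTEM_TEST,
    major_location="buffer_pool.c",
    minor_location="release_buffer()",
    failure_time_hours=36.5,
    root_cause_category="resource leak",
)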
Metrics To Track


              Metric                                            Description

 Critical Defect Density                  The number of critical defects per KLOC (1,000 lines of source code)

 Critical Defect Removal Efficiency       The percentage of critical defects identified within a given life
 (CDRE)                                   cycle period

 Critical Failure Rate                    The mean number of failures occurring within a reference period

 Current Defect Demographics              Open defect demographics of the current code base, including defects
                                          by severity, module, fix backlog, etc.

 Failure/Defect Arrival Rates             Trends (e.g., defects vs. test time interval) of newly detected
                                          failures and/or defects

 Bad Fixes                                The number of failures caused as side effects of fixes to previously
                                          logged defects

 Unverified Code Fixes                    The number of failures caused by code that was neither reviewed
                                          nor tested

 Failure Root Cause Distribution          The distribution of failure categories used for Pareto root cause
                                          analysis



(v0.1)                                        Ops A La Carte ©                                             5
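Given records like the above, the metrics in this table reduce to simple aggregations. The sketch below computes a few of them from plain dictionaries; the record fields, sample data, code-base size and test-hour figures are all illustrative assumptions.

# Sketch: deriving a few of the metrics above from raw defect records.
# Record fields and sample data are illustrative, not a fixed schema.

records = [
    {"severity": 1, "detected_in": "system test", "is_bad_fix": False, "failure_hours": 12.0},
    {"severity": 2, "detected_in": "system test", "is_bad_fix": True,  "failure_hours": 30.0},
    {"severity": 1, "detected_in": "field",       "is_bad_fix": False, "failure_hours": None},
    {"severity": 3, "detected_in": "unit test",   "is_bad_fix": False, "failure_hours": None},
]
KSLOC = 250.0            # size of the code base, in thousands of source lines (assumed)
TEST_HOURS = 400.0       # total system-test execution time (assumed)

critical = [r for r in records if r["severity"] <= 2]          # treat Sev 1-2 as critical
in_house = [r for r in critical if r["detected_in"] != "field"]

critical_defect_density = len(critical) / KSLOC                 # critical defects per KSLOC
cdre_in_house = len(in_house) / len(critical)                   # CDRE for the in-house period
failures = [r for r in critical if r["failure_hours"] is not None]
critical_failure_rate = len(failures) / TEST_HOURS              # failures per test hour
bad_fix_count = sum(r["is_bad_fix"] for r in critical)

print(f"Critical defect density: {critical_defect_density:.3f} /KSLOC")
print(f"In-house CDRE:           {cdre_in_house:.0%}")
print(f"Critical failure rate:   {critical_failure_rate:.4f} failures/hour")
print(f"Bad fixes:               {bad_fix_count}")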
Software Defect Distributions

   Average distribution of all types of S/W defects by lifecycle phase:
          20%     Requirements         50% of all S/W defects are introduced before coding
          30%     Design
          35%     Coding
          10%     Bad Defect Fixes (introduction of secondary defects)          1 in 10 defects fixed
                                                                                 during testing were
          5%      Customer Documentation                                        unintended side effects
                                                                                 of a previous defect “fix”


   Average distribution of S/W defects escalated from the field:
         (based on 1st year field defect report data)
          1%      Severity 1 (catastrophic)       Only ~20% of the customer-reported S/W
          20%     Severity 2 (serious)            defects are target for reliability improvements

          35%     Severity 3 (minor)
          44%     Severity 4 (annoyance or cosmetic)




(v0.1)                                    Ops A La Carte ©                                               6
Typical Defect Tracking (System Test)


                       Severity #1   Severity #2         Severity #3   Severity #4
         System Test                                                                 Total Defects
                        Defects       Defects             Defects       Defects
            Build                                                                       Found
                         Found         Found               Found         Found

         SysBuild-1        7             9                       16        22             54


         SysBuild-2        5             5                       14        26             50


         SysBuild-3        4             6                       8         16             34

              •             •             •                      •          •              •
              •             •             •                      •          •              •
              •             •             •                      •          •              •

         SysBuild-7        0             1                       4         6              11




(v0.1)                                        Ops A La Carte ©                                       7
Defect Removal Efficiency
 ◈Critical defect removal efficiency (CDRE) is a key reliability measure
                   Critical Defects Found
         CDRE =
                  Critical Defects Present

 ◈“Critical Defects Present” is the sum of the critical defects found in all
     phases as a result of reviews, testing and customer/field escalations
          System testing stages include integration, functional, loading, performance,
            acceptance, etc.
          Customer trials can be considered either a system testing stage, a preliminary,
            but separate, field deployment phase or a part of the field deployment phase
             ◘ Depending on the rationale for the trials
           The field deployment phase is measured as the first year following deployment
              ◘ This approximates the average life span of a S/W release, since most releases
                   are separated by no more than 1 year

                                                                     System & Subsystem      Field
         Requirements      Design     Coding      Unit Testing
                                                                        Testing Stages    Deployment

                                                                                              Field
                Review Efficiency                              Testing Efficiency          Efficiency
                                                                         System Testing       Field
                        Development Efficiency
                                                                           Efficiency      Efficiency

                                                                                              Field
                                         Internal Efficiency
                                                                                           Efficiency
(v0.1)                                            Ops A La Carte ©                                      8
CDRE Example
                          Critical
          Origin          Defects         Metrics #1
                           Found
 Requirements Reviews       20
 Design Reviews             30
 Code Reviews               40
                                               170
 Unit Testing               25

 System & Subsystem
                            55
 Testing

 Field Deployment           40                  40

          TOTAL             210                210


                                                                        System & Subsystem      Field
          Requirements   Design        Coding        Unit Testing
                                                                           Testing Stages    Deployment

                                                                                                 Field
                                         Internal Efficiency
                                                                                              Efficiency



                                      Metric                      Removal Efficiency

                              Internal Efficiency                   81% = (170 / 210)

                                  Field Efficiency                  19% = (40 / 210)




(v0.1)                                               Ops A La Carte ©                                      9
CDRE Example
                             Critical
          Origin             Defects         Metrics #2
                              Found
 Requirements Reviews           20
 Design Reviews                 30
                                                  115
 Code Reviews                   40

 Unit Testing                   25

 System & Subsystem
                                55                 55
 Testing

 Field Deployment               40                 40

          TOTAL                210                210


                                                                           System & Subsystem      Field
          Requirements     Design         Coding        Unit Testing
                                                                              Testing Stages    Deployment

                                                                             System Testing         Field
                         Development Efficiency
                                                                               Efficiency        Efficiency



                                         Metric                      Removal Efficiency

                               Development Efficiency                  55% = (115 / 210)

                              System Testing Efficiency                26% = (55 / 210)

                                     Field Efficiency                  19% = (40 / 210)



(v0.1)                                                  Ops A La Carte ©                                      10
CDRE Example
                               Critical
          Origin               Defects         Metrics #3
                                Found
 Requirements Reviews             20
 Design Reviews                   30                 90
 Code Reviews                     40

 Unit Testing                     25
                                                     80
 System & Subsystem
                                  55
 Testing

 Field Deployment                 40                 40

          TOTAL                  210                210


                                                                             System & Subsystem      Field
          Requirements       Design         Coding        Unit Testing
                                                                                Testing Stages    Deployment

                                                                                                      Field
                    Review Efficiency                               Testing Efficiency             Efficiency



                                           Metric                      Removal Efficiency

                                   Review Efficiency                     43% = (90 / 210)

                                   Testing Efficiency                    38% = (80 / 210)

                                       Field Efficiency                  19% = (40 / 210)



(v0.1)                                                    Ops A La Carte ©                                      11
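The three example groupings above differ only in how the same phase counts are partitioned. A minimal sketch of that arithmetic, reusing the example's numbers (the helper function is illustrative):

# Sketch: the CDRE groupings from the three examples above.
# The phase counts are the example's; the grouping logic is generic.

found_by_phase = {
    "requirements reviews": 20,
    "design reviews": 30,
    "code reviews": 40,
    "unit testing": 25,
    "system & subsystem testing": 55,
    "field deployment": 40,
}
total = sum(found_by_phase.values())        # "critical defects present" = 210

def cdre(phases):
    """Removal efficiency for the given set of phases."""
    return sum(found_by_phase[p] for p in phases) / total

review_phases = ["requirements reviews", "design reviews", "code reviews"]
testing_phases = ["unit testing", "system & subsystem testing"]

print(f"Internal efficiency:       {cdre(review_phases + testing_phases):.0%}")   # 81%
print(f"Development efficiency:    {cdre(review_phases + ['unit testing']):.0%}") # 55%
print(f"System testing efficiency: {cdre(['system & subsystem testing']):.0%}")   # 26%
print(f"Review efficiency:         {cdre(review_phases):.0%}")                    # 43%
print(f"Testing efficiency:        {cdre(testing_phases):.0%}")                   # 38%
print(f"Field efficiency:          {cdre(['field deployment']):.0%}")             # 19%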
Sample Project Reliability Measurement Tracking

     At the end of the project Unit Testing Phase

                            Defects Found                          Critical Defects/Failures Found

            Phase                                Reqmts     Design       Code      Unit Test    Test        Field
                             Total    Critical
                                                 Critical   Critical    Critical    Critical   Critical   Failures
                            Defects   Defects
                                                 Defects    Defects     Defects    Failures    Failures   Reported

     Requirements             75        12         12

     Design                  123        45          6         39

     Code                    158        62          4         12          46

     Unit Test                78        25          1          5          17          2

     Development Totals      434       144         23         56          63          2


     Integration Test

     System Test

     Testing Totals


     Field Reports Totals




(v0.1)                                             Ops A La Carte ©                                                  12
Sample Project Reliability Measurement Tracking

     1 year after the end of the System Testing phase

                            Defects Found                          Critical Defects/Failures Found

            Phase                                Reqmts     Design       Code      Unit Test    Test        Field
                             Total    Critical
                                                 Critical   Critical    Critical    Critical   Critical   Failures
                            Defects   Defects
                                                 Defects    Defects     Defects    Failures    Failures   Reported

     Requirements             75        12         12

     Design                  123        45          6         39

     Code                    158        62          4         12          46

     Unit Test                78        25          1          5          17          2

     Development Totals      434       144         23         56          63          2


     Integration Test         43        13          0          4           7          1           1
     System Test             183        47          2         13          28          0           4
     Testing Totals          226        60          2         17          35          1           5

     Field Reports Totals     70        35          1          8          22          0           3          1


     Release Summary         720       239         26         81         120          3           8          1


(v0.1)                                             Ops A La Carte ©                                                  13
Sample Project Reliability Measurement Tracking

     1 year after the end of the System Testing phase

                            Defects Found                          Critical Defects/Failures Found

            Phase                                Reqmts     Design       Code        Unit Test    Test        Field
                             Total    Critical
                                                 Critical   Critical    Critical      Critical   Critical   Failures
                            Defects   Defects
                                                 Defects    Defects     Defects      Failures    Failures   Reported

     Requirements             75        12         12

     Design                  123        45          6         39

     Code                    158        62          4         12          46

     Unit Test                78        25          1          5          17          2

     Development Totals      434       144         23         56          63          2


     Integration Test         45        13          0          4           7          1           1
     System Test             189        47          2         13          28          0           4
     Testing Totals          234        60          2         17          35          1           5


     Field Reports Totals     77        35          1          8          22          0           3          1

     Design DRE Measurements
     • 39 critical defects found in-phase
     • 56 critical defects found during development
     • 17 critical defects found during testing
     • 81 critical defects found overall

     Design DRE Metrics
     • 48% in-phase DRE (= 39/81)
     • 69% development DRE (= 56/81)
     • 90% in-house DRE (= (56 + 17)/81)


     Release Summary         745       239         26         81            120         3           8           1


(v0.1)                                             Ops A La Carte ©                                                    14
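The Design DRE callouts above are read straight off the Design column of the insertion-phase/detection-phase matrix. A small sketch of the calculation, using the sample project's Design-column counts (the helper function and phase names are illustrative):

# Sketch: per-origin DRE from an insertion-phase x detection-phase matrix.
# Counts are the sample project's Design column; the helper is illustrative.

DETECTION_PHASES = ["requirements", "design", "code", "unit test", "testing", "field"]

# Critical defects that originated in the Design phase, by detection phase
# (integration and system testing combined as "testing": 4 + 13 = 17).
design_detected = {"design": 39, "code": 12, "unit test": 5, "testing": 17, "field": 8}

def dre(detected_by_phase, through_phase):
    """Fraction of this origin's defects removed by the end of `through_phase`."""
    total = sum(detected_by_phase.values())
    cutoff = DETECTION_PHASES.index(through_phase)
    removed = sum(n for phase, n in detected_by_phase.items()
                  if DETECTION_PHASES.index(phase) <= cutoff)
    return removed / total

print(f"In-phase DRE:    {dre(design_detected, 'design'):.0%}")     # 39/81 = 48%
print(f"Development DRE: {dre(design_detected, 'unit test'):.0%}")  # 56/81 = 69%
print(f"In-house DRE:    {dre(design_detected, 'testing'):.0%}")    # 73/81 = 90%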
Sample Project Reliability Measurement Tracking

     1 year after the end of the System Testing phase

                              Defects Found                     Critical Defects/Failures Found

             Phase             Total     Critical    Reqmts     Design      Code     Unit Test     Test       Field
                              Defects    Defects    Critical   Critical   Critical    Critical   Critical   Failures
                                                    Defects    Defects    Defects    Failures    Failures   Reported

     Requirements               75         12          12

     Design                    123         45           6         39

     Code                      158         62           4         12         46

     Unit Test                  78         25           1          5         17           2

     Development Totals        434        144          23         56         63           2


     Integration Test           45         13           0          4          7           1          1
     System Test               189         47           2         13         28           0          4
     Testing Totals            234         60           2         17         35           1          5


     Field Reports Totals       77         35           1          8         22           0          3          1

     Bad Fixes Measurements
     • 1 critical defect inserted and found during field deployment
     • 5 critical defects inserted and found during system-level testing
     • 3 critical defects inserted during system-level testing and found during field deployment
     • 1 critical defect inserted during unit testing and found during system-level testing
     • 0 critical defects inserted during unit testing and found during field deployment
     • 60 total critical defects found during system-level testing
     • 35 total critical defects found in the field

     Bad Fixes Metrics
     • 10% of the test phase failures are from Bad Fixes (= (1 + 5)/60)
     • 11% of the field phase failures are from Bad Fixes (= (0 + 3 + 1)/35)
     • 11% of the test and field failures are from Bad Fixes (= (1 + 5 + 0 + 3 + 1)/(60 + 35))


     Release Summary         745       239           26         81        120          3           8         1


(v0.1)                                               Ops A La Carte ©                                                15
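The bad-fix percentages follow the same pattern: critical failures whose defect was inserted during unit testing or later are counted as fix side effects. A compact sketch using the callout's counts:

# Sketch: bad-fix metrics from the callout above. "Bad fixes" are critical
# failures whose defect was inserted during unit testing or later.

test_failures_total = 60               # critical failures found during system-level testing
field_failures_total = 35              # critical failures found in the field
bad_fixes_found_in_test = 1 + 5        # inserted in unit test / system test, found in test
bad_fixes_found_in_field = 0 + 3 + 1   # inserted in unit test / system test / field, found in field

print(f"Test-phase bad fixes:  {bad_fixes_found_in_test / test_failures_total:.0%}")    # 10%
print(f"Field-phase bad fixes: {bad_fixes_found_in_field / field_failures_total:.0%}")  # 11%
print(f"Combined:              {(bad_fixes_found_in_test + bad_fixes_found_in_field) / (test_failures_total + field_failures_total):.0%}")  # 11%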
Sample Project Reliability Metrics

                                       Metrics
           Phase              In-phase      Overall
                        CDD                                Bad Fixes
                                DRE          DRE
         Requirements   29%     46%

         Design         51%     48%
                                          58%
         Code           56%     38%                  84%
         Unit Test      18%     17%


         Testing                          26%              10%
                                                                 11%
         Field                            16%        16%   11%




(v0.1)                            Ops A La Carte ©                     16
Sample Project Reliability Metrics

                                            Metrics
           Phase                   In-phase      Overall
                          CDD                                   Bad Fixes
                                     DRE          DRE
         Requirements     29%        46%

         Design           51%        48%
                                               58%
         Code             56%        38%                  84%
         Unit Test        18%        17%


         Testing                               26%              10%
                                                                      11%
         Field                                 16%        16%   11%


         Sample Phase 1 Goals (50% improvement):
         • (reliability) increase in-house DRE to 92%
         • (efficiency) reduce field bad fixes to 5%




(v0.1)                                 Ops A La Carte ©                     17
Distribution of Defects Across Files

 ◈ There is a Pareto-like distribution of defects across files within a
         module
          Defects cluster in a relatively small number of files
          Conversely, more than half of the files have almost no critical defects




(v0.1)                                    Ops A La Carte ©                           18
Failure Density Analysis

 ◈ In general, there is a Pareto distribution (80/20) of defects across files

 ◈ Failure density analysis provides a mechanism for early detection
         of “problematic” sections of code (i.e., defect clusters)
           Improving the reliability of these “problematic modules” can consume as much
            as 4 times the effort of a redesign
                 ◘ Less time is required to re-inspect and either restructure or redesign these
                   modules than to effectively “beat them into submission” through testing
           The goal is to identify “problematic” code as early as possible
                 ◘ Perform failure density analysis early on during unit testing
                      – Since different sections of code may be problematic during system
                        testing, the analysis should be repeated near the middle of this phase
                 ◘ SCM and defect tracking tools can be modified to provide this information
                   without much effort
                      – Source code can be analyzed to display “hot spot” histograms using
                        number of changes and/or number of failures
                      – Heuristics for failure density thresholds must be developed to
                        determine when action should be taken

(v0.1)                                       Ops A La Carte ©                                     19
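A failure density "hot spot" report of the kind described above can be generated directly from defect-tracker data. The sketch below is a minimal illustration; the input pairs, file names and the 2x-median flagging heuristic are assumptions, not a prescribed method.

# Sketch: flag "hot spot" files by failure density. Input format, file
# names and the 2x-median threshold heuristic are illustrative assumptions.
from collections import Counter
from statistics import median

# (file, reported_failures) pairs as they might come from a defect tracker
reported = [
    ("buffer_pool.c", 17), ("scheduler.c", 11), ("parser.c", 9),
    ("config.c", 2), ("logging.c", 1), ("cli.c", 1), ("utils.c", 0),
]

failures = Counter(dict(reported))
total = sum(failures.values())
threshold = 2 * median(failures.values())   # heuristic: flag files well above the median

print(f"{'file':<15}{'failures':>10}{'% of total':>12}")
for name, count in failures.most_common():
    flag = "  <-- candidate for re-inspection/redesign" if count > threshold else ""
    print(f"{name:<15}{count:>10}{count / total:>11.0%}{flag}")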
Failure Density Distribution by File



                [Histogram: % of Total Reported Failures by Source Code File (F1 … F23)]

                The files at the head of the distribution contain the majority of the reported failures
                and should be proactively analyzed for possible redesign, restructuring or additional defects
(v0.1)                                        Ops A La Carte ©                                     20
Applying Causal Analysis to Defect Measurements

 ◈ Causal analysis (RCA) can be applied to defect
         measurements to improve defect removal effectiveness
         and efficiency
           Usually performed between life cycle iterations or S/W releases
           Upstream defect removal practices should be reviewed in light of the
            defects that were not detected in each phase
              ◘ Requires knowing the phases where a defect was introduced and
                detected
              ◘ Simple guidelines should be defined to determine the phase
                where the defect was introduced


           Defects are categorized by types
              ◘ Can determine if the problem is systemic to one or more
                categories, or
              ◘ Whether the problem is an issue of raising overall defect removal
                efficiency for a development phase

(v0.1)                                Ops A La Carte ©                              21
Causal Analysis Process
                                    (one shot analysis)

Objective: Create an initial, rough distribution of the defects and identify
           potential improvements to existing defect removal practices.

Process Outline (typically 6-8 hours over two days):
◈ Select a team from the senior engineers and MSEs of the development and test
         teams
◈ Analysis Process:
           Before the meeting, select a representative sample of approximately 50-75
            defects from each development and test phase
           Convene meeting and explain the objectives and process to the team
           Classify the defects
               ◘   Start by walking the team through the classification of one defect
               ◘   Divide the defects into small groups and assign each person 2 groups of
                   defects (to force analysis overlap)
               ◘   Upon completion, collect and process the data offline in preparation for
                   team analysis and review
           Analyze the defect types using a histogram to look for a Pareto distribution and
            select the most prevalent defect types
           Develop recommendations for improvements to specific defect removal practices
           Implement the recommendations at the next possible opportunity and gather
            measurement data
(v0.1)                                      Ops A La Carte ©                                   22
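The histogram/Pareto step is easy to automate once defects carry a type classification. A minimal sketch (the type labels, sample data and the 80% cutoff are illustrative assumptions):

# Sketch: Pareto analysis of classified defect types to pick the most
# prevalent categories. Type labels and the 80% cutoff are illustrative.
from collections import Counter

classified_defects = [
    "logic", "logic", "data handling", "interface", "logic", "error checking",
    "data handling", "logic", "interface", "logic", "computation", "logic",
]

counts = Counter(classified_defects)
total = sum(counts.values())

cumulative = 0
print("type            count  cum%")
for defect_type, count in counts.most_common():
    cumulative += count
    marker = "  <-- focus improvement here" if (cumulative - count) / total < 0.80 else ""
    print(f"{defect_type:<15}{count:>6}{cumulative / total:>6.0%}{marker}")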
The Orthogonal Defect Analysis Framework

          ORIGIN (Where?):   SPEC/RQMTS  |  DESIGN  |  CODE  |  ENV. SUPT.  |  DOCUMENTATION  |  OTHER

          TYPE (What?), by origin:
             Spec/Rqmts:     Requirements or Specifications; Functionality
             Design:         HW Interface; SW Interface; User Interface; Functional Description;
                             Proc. (Interproc.) Communications; Data Definition; Module Design;
                             Logic Description; Error Checking; Standards
             Code:           Logic; Computation; Data Handling; Module Interface/Implementation; Standards
             Env. Supt.:     Test SW; Test HW; Development Tools; Integration SW
             Other:          Other can also be a type classification for any origin

          MODE (Why?):       Missing  |  Unclear  |  Wrong  |  Changed  |  Better Way
(v0.1)                                         Ops A La Carte ©                                                   23
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar

More Related Content

Featured

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Featured (20)

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 

Ops A La Carte Software Design for Reliability (SDfR) Seminar

  • 1. & We Provide You Confidence in Your Product ReliabilityTM Ops A La Carte / (408) 654-0499 / askops@opsalacarte.com / www.opsalacarte.com
  • 2. Software Design for Reliability (DfR) ½-day Seminar Ops A La Carte LLC // www.opsalacarte.com
  • 3. The following presentation materials are copyright protected property of Ops A La Carte LLC. Distribution of these materials is limited to your company staff only. These materials may not be distributed outside of your company or used for any purpose other than training.
  • 4. Software DfR ½-Day Seminar Agenda Agenda ◈ Introductions and Agenda Review ◈ Software Reliability Basic Concepts ◈ A “Best Practices” Approach to Developing Reliable Software ◈ Reliability Measurements and Metrics ◈ Wrap-up (v0.1) Ops A La Carte © 3
  • 6. Presenter’s Biographical Sketch – Bob Mueller ◈ Bob Mueller a senior consultant/program manager with OPS A La Carte and the Marisan Group. He is a product development professional with 30+ years of technical and management experience in software intensive product development, R/D process and quality systems development including extensive consulting experience with cross-functional product development teams and senior management. ◈ After receiving his M.S. in Physics in 1973, Bob joined Hewlett-Packard in Cupertino, CA in IC process development. In the next three decades before leaving hp in 2002, he held numerous positions in R/D, R/D management and consulting including:  IC process development, process engineering and IC production management.  Lead developer of an automated IC in-process/test monitor, analysis and control system (hp internal).  R/D project management for sw intensive products (including process analysis and control, work cell control & quality control systems).  Numerous R/D management positions in computer, analytical and healthcare businesses including FDA regulated systems with ISO 9001 certified organizations.  Numerous program management positions focused on internal/external process improvement and consulting.  Practice area manager and consultant for PG -- Engineering consulting team (internal hp) ◈ Bob’s current consulting interests include: Warranty process and quality system improvement, SW Reliability, agile SW product development methodologies and R/D product strategy and technology roadmap development. ◈ Bob has taught many internal hp classes and at local junior colleges. (v0.1) Ops A La Carte © 5
  • 7. Software Reliability Integration Services for the Product Reliability Integration in the Concept Phase Reliability Integration in the Implementation Phase Software Reliability Goal Setting Facilitation of Code Reliability Reviews Software Reliability Program and Integration Plan Software Robustness and Coverage Testing Techniques Reliability Integration in the Design Phase Facilitation of Team Design Template Reviews Reliability Integration in the Testing Phase Facilitation of Team Design Reviews Software Reliability Measurements and Metrics Software Failure Analysis Usage Profile-based Testing Software Fault Tolerance Software Reliability Estimation Techniques Software Reliability Demonstration Tests
  • 8. Software Design For Reliability (DfR) Software Reliability Basic Concepts George de la Fuente georged@opsalacarte.com (408) 828-1105 www.opsalacarte.com
  • 9. Software Quality vs. Software Reliability
  • 10. Software Quality vs. Reliability FACTORS CRITERIA suitability Functionality accuracy interoperability security understandability Usability learnability operability attractiveness Software Software Quality maturity Quality Reliability fault tolerance The level to which the *ISO9126 Quality Model recoverability software characteristics conform to all the time behavior specifications. Efficiency resource utilization analysability changeability Portability stability testability adaptability installability Maintainability co-existence replaceability (v0.1) 3 Ops A La Carte ©
  • 12. Software Reliability Definitions “The probability of failure free software operation for a specified period of time in a specified environment” ANSI/IEEE STD-729-1991 ◈Examine the key points ◈Practical rewording of the definition Software reliability is a measure of the software failures that are visible to a customer and that prevents a system from delivering essential functionality for a specified period of time. (v0.1) Ops A La Carte © 5
  • 13. Software Reliability Can Be Measured ◈ Measurements (quantitative) are a required foundation  Differs from quality which is not defined by measurements  All measurements and metrics are based on run-time failures ◈ Only customer-visible failures are targeted  Only defects that produce customer-visible failures affect reliability  Corollaries ◘ Defects that do not trigger run-time failures do NOT affect reliability – badly formatted or commented code – defects in dead code ◘ Not all defects that are triggered at run-time produce customer- visible failures – corruption of any unused region of memory ◈ S/W Reliability evolved from H/W Reliability  Primary distinction: S/W Reliability focuses only on design reliability (v0.1) Ops A La Carte © 6
  • 14. Software Reliability Is Based On Usage ◈ S/W failure characteristics are derived from the usage profile of a particular customer (or set of customers)  Each usage profile triggers a different set of run-time S/W faults and failures ◈ Example of reliability perspective from 3 users of the same S/W  Customer A ◘ Usage Profile – Exercises sections of S/W that produce very few failures. ◘ Assessment – S/W reliability is high.  Customer B ◘ Usage Profile – Overlaps with Customer A’s usage profile. However, Customer B also exercises other sections of S/W that produce many, frequent failures ◘ Assessment – S/W reliability is low.  Customer C ◘ Usage Profile – Similar to Customer B’s usage profile. However, Customer C has implemented workarounds to mitigate most of the S/W failures that were encountered. The final result is that the S/W executes with few failures but requires additional off-nominal steps. ◘ Assessment – S/W quality is low since many workarounds are required. However, for the final configuration that includes these workarounds, S/W reliability is acceptable. (v0.1) Ops A La Carte © 7
  • 15. Reliability ≠ Correctness or Completeness ◈ Correctness is a measure with which the requirements model the intended customer base or industry functionality  Correctness is validated by reviewing product requirements and functional specifications with key customers ◈ Completeness is a measure of the degree of intended functionality that is modeled by the S/W design  Completeness is validated by performing requirements traceability at the design phase and design traceability at the coding phase ◈ Reliability is a measure of the behavior (i.e., failures) that prevents the S/W from delivering the designed functionality  If the resulting S/W does not meet customer or market expectations, yet operates with very few failures based on its requirements and design, the S/W is still considered reliable (v0.1) Ops A La Carte © 8
  • 17. Software Defects That Affect Reliability Sources Documentation Development Validation • User Manual • Requirements ••• • Unit Test Plans/Cases • Installation Guide • System Architecture • System-Level Test Plans/Cases • Technical Specs • Designs • Design Review Scenarios or Checklists • Source Code • Code Review Scenarios or Checklists • S/W Failure Analysis Categories Categories Soft Maintenance Run-time Impacts Run-time Impacts Failures • Commenting • System outage • System outage • System outage • Style ••• • Loss of functionality • Loss of functionality • Consistency • Annoyance • Loss of critical functionality • Standards/Guidelines • Cosmetic • “Dead Code” (v0.1) Ops A La Carte © 10
  • 18. Terminology - Defect ◈ A flaw in S/W requirements, design or source code that produces unintended or incomplete run-time behavior Defect  Defects of commission ◘ Incorrect requirements are specified ◘ Requirements are incorrectly translated into a design model ◘ The design is incorrectly translated into source code ◘ The source code logic is flawed  Defects of omission There are amongst the most difficult class of defects to detect ◘ Not all requirements were used in creating a design model ◘ The source code did not implement all the design ◘ The source code has missing or incomplete logic ◈ Defects are static and can be detected and removed without executing the source code ◈ Defects that cannot trigger S/W failures are not counted for reliability purposes  These are typically quality defects that affect other aspects of S/W quality such as soft maintenance defects and defects in test cases or documentation (v0.1) Ops A La Carte © 11
• 19. Terminology - Fault ◈ The result of triggering a S/W defect by executing the associated source code  Faults are NOT customer-visible ◘ Example: a memory leak, or a packet corruption that requires retransmission by the higher-layer stack  A fault may be the transitional state that results in a failure ◘ Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states (v0.1) Ops A La Carte © 12
• 20. Terminology - Failure ◈ A customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed S/W behavior  Failures are the visible, run-time symptoms of faults ◘ Failures MUST be observable by the customer or another operational system  Not all failures result in system outages (v0.1) Ops A La Carte © 13
• 21. Defect-to-Failure Transition ◈ Example  A S/W function (or method) processes the data stored in a memory buffer and then frees the allocated memory buffer back to the memory pool  A defect within this function (or method), when triggered, will fail to free the memory buffer before completion (Flowchart: an entry point, several logic branch points and 4 possible exit points, with the defect on 1 of many logic paths) (v0.1) Ops A La Carte © 14
  • 22. Defect-to-Failure Transition (continued) ◈ Most of the possible logic paths do not trigger the defect  If these are the only logic paths traversed by a customer, this portion of the S/W will be considered very reliable (v0.1) Ops A La Carte © 15
  • 23. Defect-to-Failure Transition (continued) ◈ Fault transition  Eventually a logic path is executed that triggers the defect, resulting in a fault being generated ◘ The function (or method) completes its execution ◘ The fault causes the system to lose track of a single memory buffer ◘ The system continues to operate without a visible impact  Since the fault causes no visible impact, a failure does NOT occur (v0.1) Ops A La Carte © 16
• 24. Defect-to-Failure Transition (continued) ◈ Failure scenario  After sufficient memory buffers have been lost, the buffer pool reaches a critical condition where either: ◘ No buffers are available to satisfy another allocation request (there are still some buffers in use), or ◘ All buffers have been lost through leakage (no buffers will ever be freed for future allocation requests)  Once the next buffer allocation is requested, a failure occurs ◘ The system cannot continue to operate normally  Note the time lag between the triggering of the last fault and the occurrence of the associated failure (Timeline: faults are triggered at t1, t2, …, tN; the failure occurs later at tF) (v0.1) Ops A La Carte © 17
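A minimal Python sketch of this defect-to-failure transition (illustrative only, not from the seminar materials; BufferPool and process_message are hypothetical names): a rarely taken logic path leaks one buffer each time it is traversed, each leak is a silent fault, and the visible failure only surfaces when a later allocation request cannot be satisfied.

    # Sketch only: simulates the leak scenario described on the slide above.
    class BufferPool:
        def __init__(self, size):
            self.free = size                     # buffers currently available

        def alloc(self):
            if self.free == 0:                   # the visible, customer-observable failure
                raise RuntimeError("failure: no buffers available for allocation")
            self.free -= 1

        def release(self):
            self.free += 1

    def process_message(pool, msg):
        pool.alloc()
        defect_path = msg.endswith("!")          # rarely taken branch containing the defect
        if not defect_path:
            pool.release()                       # normal paths free the buffer correctly
        # defective path returns without releasing -> silent fault (leaked buffer)

    pool = BufferPool(size=3)
    for msg in ["ok", "ok", "t1!", "ok", "t2!", "ok", "tN!", "ok"]:
        process_message(pool, msg)               # faults at t1, t2, tN; failure on the last "ok"

Running the loop shows the lag the slide describes: the three leaks pass unnoticed, and the RuntimeError (the failure) is raised only on the final, otherwise healthy request.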
• 25. Summary of Defects and Failures ◈ There are 3 types of run-time defects 1. Defects that are never executed (so they don’t trigger faults) 2. Defects that are executed and trigger faults that do NOT result in failures 3. Defects that are executed and trigger faults that result in failures ◈ Practical S/W Reliability focuses on defects that have the potential to cause failures by: 1. Detecting and removing defects that result in failures during development 2. Designing and implementing fault tolerance techniques to ◘ prevent faults from producing failures, or ◘ mitigate the effects of the resulting failures (v0.1) Ops A La Carte © 18
  • 26. Failure Distributions, Failure Rates and MTTF
  • 27. Reliability and Failure Distributions Restated, reliability is the probability that a system does not experience a failure during a time interval, [0,T]. ◈ Reliability is a measure of statistical probability, not certainty  Ex: A system has a 99% reliability over a period of 100 days ◘ Does this imply that only 1 failure will occur during the 100 day period? ◈ Reliability is based on failure distribution models  Represent the time distribution of failure occurrences  Various failure distribution models exist: ◘ Exponential (most commonly used in S/W reliability) ◘ Weibull ◘ Poisson ◘ Normal ◘ Rayleigh ◘ etc…. ◈ Let’s examine an exponential failure distribution model (v0.1) Ops A La Carte © 20
• 28. Failure Distributions - Exponential ◈ Exponential Reliability Function  The most widely used failure distribution is the exponential reliability function ◘ Models a random distribution of failure occurrences  Defined by: R(t) = e^(-λt), where ◘ t is mission time – the system is assumed to be operational at t=0 – the mission duration is represented by T ◘ λ is a constant, instantaneous failure rate (or failure intensity) ◘ MTTF = 1 / λ (for repairable systems) (Plot: R(t) for λ = 0.1 failures/hr.) (v0.1) Ops A La Carte © 21
• 29. A Closer Look At The Exponential Distribution ◈ Example (plot of R(t) vs. time in hours): ◘ Mission duration: T = 100 hours ◘ Failure rate: λ = 0.1 failures/hr. (or 1 failure every 10 hrs.) ◘ MTTF = 10 hrs. ◈ At t = 1 hr., the reliability is 90% ◈ When t = MTTF, the reliability is always 37%: R = e^(-λt) = e^(-(1/MTTF)·MTTF) = e^(-1) = 37% (v0.1) Ops A La Carte © 22
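A short Python sketch (not part of the original slides) that reproduces the two values quoted above for λ = 0.1 failures/hr.:

    import math

    def reliability(t_hours, failure_rate):
        """Exponential reliability function R(t) = exp(-lambda * t)."""
        return math.exp(-failure_rate * t_hours)

    lam = 0.1                            # failures/hr -> MTTF = 1/lam = 10 hrs
    print(reliability(1.0, lam))         # ~0.905 -> ~90% reliability at t = 1 hr
    print(reliability(1.0 / lam, lam))   # ~0.368 -> ~37% at t = MTTF, for any lambda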
• 30. A Closer Look at Reliability Values ◈ Based on an exponential failure distribution, what does it mean for S/W to have 99% reliability after one year of operation?  For a single S/W product: ◘ There is a 99% probability that the S/W will still be operational after 1 year – Conversely, there is a 1% chance of a failure during that period. ◘ Note that this value does NOT tell us when, during the 1 year period, a failure will occur. – With the exponential distribution, as time progresses, the likelihood (probability) of a failure increases.  For a group of software products (e.g., 100 products): ◘ 99% of the products will be operational after 1 year (e.g., 99 products) ◘ There is a 36.6% probability that all 100 products will be operational after 1 year – This is computed by multiplying the reliability of all the products: f(t) = R1(t) x R2(t) x … x R100(t) = 0.99 x 0.99 x … x 0.99 = 0.366 (v0.1) Ops A La Carte © 23
• 31. Sample Reliability Calculations ◈ What is the failure rate (λ) and MTTF necessary to achieve this level of reliability?  t = 1 yr. = 8760 hrs.  R(t) = e^(-λt) → 0.99 = e^(-λ·8760) → ln(0.99) = -λ·8760 → λ = -ln(0.99) / 8760 = 1.1 x 10^-6 failures/hr. (1 failure every 99.5 years)  MTTF = 1/λ = 871,613 hrs. (99.5 yrs.) ◈ What is the reliability at the MTTF?  t = MTTF = 871,613 hrs.  R(MTTF) = e^(-λ·MTTF) = e^(-(1/MTTF)·MTTF) = e^(-1) = 0.368 (~37%) (v0.1) Ops A La Carte © 24
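The same calculations in a small Python sketch (illustrative, not from the slides), including the fleet probability from the previous slide:

    import math

    def required_failure_rate(target_reliability, t_hours):
        """Invert R(t) = exp(-lambda*t): lambda = -ln(R) / t."""
        return -math.log(target_reliability) / t_hours

    t = 8760                                 # one year of operation, in hours
    lam = required_failure_rate(0.99, t)     # ~1.1e-6 failures/hr
    mttf = 1.0 / lam                         # ~871,600 hrs (~99.5 years)
    fleet = 0.99 ** 100                      # ~0.366: probability all 100 units survive the year
    print(lam, mttf, fleet)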
• 32. Software and Hardware Failure Rates ◈ Software – the failure rate is driven by the effectiveness of the S/W defect detection and repair processes over the span of many upgrades (curve phases: Pre-release Testing, Useful Life (w/upgrades), Obsolete) ◈ Hardware – the failure rate is driven by three very different physical failure domains (curve phases: Burn-In, Useful Life, Wearout) ◈ Initial system deployment (i.e., completion of the Pre-release Testing and Burn-In phases) establishes a baseline for both the S/W (λSW-B) and H/W (λHW-B) failure rates (v0.1) Ops A La Carte © 25
• 34. System Availability  Availability is the percentage of time that a system is operational, accounting for planned and unplanned outages. ◈ Example: 90% Availability (for a timeframe T)  Logical representation ◘ The system is operational for the first 90% of the timeframe and down for the last 10% of the timeframe (timeline: system operational for 0.9T, then non-operational for 0.1T)  Actual (or possible) representation ◘ 3 failures cause the system to be down for 10% of the timeframe (timeline: failure occurs / system restored, three times within T) (v0.1) Ops A La Carte © 27
• 35. System Availability (continued) ◈ System availability, A(T), is the relationship between the timeframes when a system is operational vs. down due to a failure-induced outage and is defined as: A(T) = MTBF / (MTBF + MTTR) where,  The system is assumed to be operational at time t=0  T = MTBF + MTTR and 0 ≤ t ≤ T  MTBF (Mean Time Between Failure) is based on the failure rate  MTTR (Mean Time To Repair) is the duration of the outage (i.e., the expected time to detect, repair and then restore the system to an operational state) (v0.1) Ops A La Carte © 28
• 36. Software Availability ◈ System outages that are caused by S/W can be attributed to: 1. Recoverable S/W failures 2. S/W upgrades 3. Unrecoverable S/W failures NOTE: Recoverable S/W failures are the most frequent S/W cause of system outages ◈ For outages due to recoverable S/W failures, availability is defined as: A(T) = MTTF / (MTTF + MTTR) where,  MTTF is Mean Time To [next] Failure  MTTR (Mean Time To [operational] Restoration) is still the duration of the outage, but without the notion of a “repair time”. Instead, it is the time until the same system is restored to an operational state via a system reboot or some level of S/W restart. (v0.1) Ops A La Carte © 29
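A minimal Python sketch of this availability relationship (the MTTF and MTTR values are hypothetical, chosen only to show the shape of the calculation):

    def availability(mttf_hours, mttr_hours):
        """A(T) = MTTF / (MTTF + MTTR) for recoverable S/W failures."""
        return mttf_hours / (mttf_hours + mttr_hours)

    # Assumed example: one recoverable failure every 1,000 hrs and a 6-minute
    # (0.1 hr) detect-restart-restore cycle.
    a = availability(1000.0, 0.1)
    downtime_min_per_year = (1.0 - a) * 8760 * 60
    print(a, downtime_min_per_year)      # ~0.9999 availability, ~53 minutes of downtime/year

Shrinking the assumed MTTR from 6 minutes to about 30 seconds (e.g., restarting a single process instead of rebooting) lifts the same system to roughly five nines, which is the effect described on the next slides.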
  • 37. Software Availability (continued) ◈ A(T) can be increased by either:  Increasing MTTF (i.e., increasing reliability) using S/W reliability practices  Reducing MTTR (i.e., reducing downtime) using S/W availability practices ◈ MTTR can be reduced by:  Implementing H/W redundancy (sparingly) to mask most likely failures  Increasing the speed of failure detection (the key step)  S/W and system recovery speeds can be increased by implementing Fast Fail and S/W restart designs ◘ Modular design practices allow S/W restarts to occur at the smallest possible scope, e.g., thread or process vs. system or subsystem ◘ Drastic reductions in MTTR are only possible when availability is part of the initial system/software design (like redundancy) ◈ Customers generally perceive enhanced S/W availability as a S/W reliability improvement  Even if the failure rate remains unchanged (v0.1) Ops A La Carte © 30
• 38. System Availability Timeframes
Availability Class | Availability (Unavailability Range) | Downtime, Timeframe = 1 year | Downtime, Timeframe = 3 months
(1) Unmanaged | 90% (1 nine) | 36.5 days/year (52,560 mins/year) | 9.13 days
(2) Managed (good web servers) | 99% (2 nines) | 3.65 days/year (5,256 mins/year) | 21.9 hours
(3) Well-managed | 99.9% (3 nines) | 8.8 hours/year (525.6 mins/year) | 2.19 hours
(4) Fault Tolerant (better commercial systems) | 99.99% (4 nines) | 52.6 mins/year | 13.14 minutes
(5) High-Availability (high-reliability products) | 99.999% (5 nines) | 5.3 mins/year | 1.31 minutes
(6) Very-High-Availability | 99.9999% (6 nines) | 31.5 secs/year (2.6 mins/5 years) | 7.88 seconds
(7) Ultra-Availability | 99.99999% (7 nines) to 99.9999999% (9 nines) | 3.2 secs/year to 31.5 millisecs/year (15.8 secs/5 years or less) | 0.79 seconds
(v0.1) Ops A La Carte © 31
  • 40. Software Robustness Software Robustness is a measure of the software’s ability to handle exceptional input conditions so they do not become failures. ◈ Exceptional input conditions result from:  Inputs that violate data value constraints  Inputs that violate data relationships  Inputs that violate the application’s timing requirements ◈ Robust S/W prevents exceptional inputs from: 1. Causing a system outage 2. Producing a silent failure by providing no indication that an exceptional input condition was detected, thus allowing for the failure to propagate 3. Generating an error condition or response that incorrectly characterizes the exceptional input condition ◈ S/W robustness becomes increasingly important as a system becomes more flexible and the product’s customer base increases in size and usage diversity (v0.1) Ops A La Carte © 33
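An illustrative Python sketch (not from the seminar materials; handle_request and its parameters are hypothetical) of guarding one entry point against the three classes of exceptional inputs listed above, while still reporting them rather than failing silently:

    import time

    class ExceptionalInput(Exception):
        """Raised so an exceptional input is reported, never silently dropped."""

    def do_work(start, end):
        return list(range(start, end))                           # stand-in for normal processing

    def handle_request(start_index, end_index, received_at, max_age_s=2.0):
        if start_index < 0 or end_index < 0:                     # data value constraint
            raise ExceptionalInput("index out of range")
        if end_index < start_index:                              # data relationship
            raise ExceptionalInput("end_index precedes start_index")
        if time.time() - received_at > max_age_s:                # timing requirement
            raise ExceptionalInput("stale request violates the timing requirement")
        return do_work(start_index, end_index)

    print(handle_request(0, 3, received_at=time.time()))         # [0, 1, 2]

The point is the error model, not the specific checks: each exceptional input is rejected with a specific, truthful error instead of causing an outage, vanishing silently, or being mischaracterized.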
• 41. Why Is Software Robustness Important ? (Diagram: users #1, #2, #3 … #n submit inputs to the program; within the input set, the subsets Ie and IErr are inputs causing erroneous outputs, which appear in the output set as erroneous outputs Oe and OErr) (v0.1) Ops A La Carte © 34
• 42. Software Robustness Studies ◈ 2 studies of S/W robustness  Examined exceptional input condition testing of POSIX-compliant OSes and UNIX command line utilities  Robustness testing was repeated on multiple releases containing fixes for the reported exceptional input failures ◈ Findings  Failure rates associated with robustness testing were significant, ranging from 10% to 33%  After many significant, focused S/W fixes over multiple releases, failure rates still remained high ◈ Conclusions  Traditional functional testing does not adequately test for exceptional input conditions  Operational profile testing also does not adequately test for exceptional input conditions ◘ (Reason) Operational profile testing prioritizes and sets limits on functional testing.  Specific techniques are required to provide adequate test coverage and handling of exceptional input conditions (v0.1) Ops A La Carte © 35
• 44. Software Fault Tolerance  The ability of software to avoid executing a fault in a way that results in a system failure. ◈ Despite the best development efforts, almost all systems are deployed with defects that have the potential to produce critical failures  A major study of S/W defects showed that 1% of the customer-reported failures in the 1st year of deployment produce system outages ◈ Fault tolerance increases the fault-resistant quality of a system during run-time by  Detecting faults at the earliest possible point of execution  Containing the damaging effects of a fault to the smallest possible scope  Performing the most reliable recovery action possible ◈ Fault tolerant designs focus on handling “complex” failures  Address defects that are not likely to be triggered during testing (v0.1) Ops A La Carte © 37
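A minimal Python sketch (illustrative only; run_worker and safe_call are hypothetical names) of the detect / contain / recover sequence just described:

    import logging

    def run_worker(job):
        assert job is not None, "fail fast: detect the fault at the earliest point"
        return job()

    def safe_call(job, retries=1, fallback=None):
        """Contain any fault to this call and recover locally, instead of letting
        it propagate into a system-level failure."""
        for attempt in range(retries + 1):
            try:
                return run_worker(job)
            except Exception as exc:                       # containment boundary
                logging.warning("fault detected on attempt %d: %s", attempt + 1, exc)
        return fallback                                    # the most reliable recovery action left

    print(safe_call(lambda: 1 / 0, fallback="degraded result"))   # recovers with the fallback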
  • 45. So, What Is Reliable Software ??
  • 46. Reliable Software Characteristics Summary ◈ Operates within the reliability specification that satisfies customer expectations  Measured in terms of failure rate and availability level  The goal is rarely “defect free” or “ultra-high reliability” ◈ “Gracefully” handles erroneous inputs from users, other systems, and transient hardware faults  Attempts to prevent state or output data corruption from “erroneous” inputs ◈ Quickly detects, reports and recovers from S/W and transient H/W faults  S/W provides the system behavior of continuously monitoring, “self-diagnosing” and “self-healing”  Prevents as many run-time faults as possible from becoming system-level failures (v0.1) Ops A La Carte © 39
  • 47. Questions? (v0.1) Ops A La Carte © 40
  • 48. Software Design For Reliability (DfR) Seminar A “Best Practices” Approach to Developing Reliable Software George de la Fuente georged@opsalacarte.com (408) 828-1105 www.opsalacarte.com
• 49. Most Common Paths to Reliable Software 1. Rely on H/W redundancy to mask out all S/W faults  The most attractive and expensive approach  Provides increased system-level reliability using an availability technique  Requires minimal S/W reliability 2. “Testing In” reliability  The most prevalent approach  Limited and inefficient approach to defect detection and removal ◘ System testing will leave at least 30% of the code untested ◘ System testing will detect at best ~55% of all run-time failures  Most companies don’t continue testing until their reliability targets are reached ◘ The testing phase is usually fixed in duration before the S/W is developed and is focused on defect removal, not reliability testing  S/W engineers will spend more than 1/2 of their time in the test phase using this approach (v0.1) Ops A La Carte © 2
• 50. S/W Design for Reliability 3. S/W Design for Reliability  The least utilized and understood approach  Common methodologies 1) Formal methods 2) Programs based on H/W reliability practices 3) S/W process control 4) Augmenting traditional S/W development with “best practices” (v0.1) Ops A La Carte © 3
  • 51. Formal Methods ◈ Formal Methods (not commonly used for commercial SW)  Methodologies for system behavior analysis and proof of correctness ◘ Utilize mathematical modeling of a system’s requirements and/or design  Primarily used in the development of safety-critical systems that require very high degrees of: ◘ Confidence in expected system performance ◘ Quality audit information ◘ Targets of low or near zero failure rates  Formal methods are not applicable to most S/W projects ◘ Cannot be used for all aspects of system design (e.g., user interface design) ◘ Do not scale to handle large and complex system development ◘ Mathematical requirements exceed the background of most S/W engineers (v0.1) Ops A La Carte © 4
• 52. Using Hardware Reliability Practices ◈ S/W and H/W development practices are still fundamentally different  The H/W lifecycle primarily focuses on architecture and design modeling  S/W design modeling tools are rarely used ◘ Design-level simulation verification is limited – Especially if a real-time operating system is required ◘ S/W engineers still challenge the value of generating complete designs – This is why S/W design tools support 2-way code generation  Inherent S/W faults stem from the design process ◘ There is no aspect of faults from manufacturing or wear-out ◈ S/W is not built as an assembly of preexisting components  True S/W component “reuse” is rare ◘ Most “reused” S/W components are at least “slightly” modified ◘ Modified “reused” S/W components are not certified before use  S/W components are not developed to a specified set of reliability characteristics  3rd party S/W components do not come with reliability characteristics (v0.1) Ops A La Carte © 5
  • 53. Hardware Reliability Practices ◈ …. assembly of preexisting components (continued)  Acceleration mechanisms do not exist for S/W reliability testing  Extending S/W designs after product deployment is commonplace ◘ H/W is designed to provide a stable, long-term platform ◘ S/W is designed with the knowledge that it will host frequent product customizations and extensions ◘ S/W updates provide fast development turnaround and have little or no manufacturing or distribution costs  H/W failure analysis techniques (FMEAs and FTAs) are rarely successfully applied to S/W designs ◘ S/W engineers find it difficult to adapt these techniques below the system level (v0.1) Ops A La Carte © 6
• 54. Software Process Control Methodologies ◈ S/W process control assumes a correlation between the maturity of the development process and the latent defect density in the final S/W
CMM Level | Defects/KLOC | Estimated Reliability
5 | 0.5 | 99.95%
4 | 1.0 – 2.5 | 99.75% – 99.9%
3 | 2.5 – 3.5 | 99.65% – 99.75%
2 | 3.5 – 6.0 | 99.4% – 99.65%
1 | 6.0 – 60.0 | 94% – 99.4%
◈ Process audits and stricter controls are implemented if the current process level does not yield the desired S/W reliability  Process root cause analysis may not yield great improvement ◘ Practices within the processes must be fine-tuned (but how??)  Reliability improvement under this type of methodology is slow ◘ Process outcome cannot vary too much in either direction (v0.1) Ops A La Carte © 7
  • 55. “Best Practices” for Software Development
  • 56. Sources of Industry Data Data was derived from a large-scale international survey of S/W lifecycle quality spanning:  18 years (1984-2002)  12,000+ projects  600+ companies ◘ 30+ government/military organizations  8 classes of software applications: 1. Systems S/W 2. Embedded S/W 3. Military S/W 4. Commercial S/W 5. Outsourced S/W 6. Information Technology (IT) S/W 7. End-User developed personal S/W 8. Web-based S/W (v0.1) Ops A La Carte © 9
  • 57. Terminology ◈ Best Practice  A key S/W quality practice that significantly contributes towards increasing S/W reliability ◈ Best in Class Companies  Companies that have the following two characteristics: ◘ Recognized for producing S/W-based products with the lowest failure rate in their industry ◘ Consistently deploying software based on their initial schedule targets ◈ Formal practice  A S/W quality development practice that is well-understood and consistently implemented throughout the software development organization. ◘ Note: Formal practices are rarely undocumented. ◈ Informal practice  A S/W quality development practice that is either implemented with varying degrees of rigor or in an inconsistent manner throughout the software development organization. ◘ Note: Informal practices are usually accompanied by the absence of documented guidelines or standards. (v0.1) Ops A La Carte © 10
  • 58. “Best in Class” Company Best Practices ◈ S/W Life Cycle Practices  Consistent implementations of the entire S/W lifecycle phases (requirements, design, code, unit test, system test and maintenance) ◈ Requirements  Involve test engineers in requirements reviews  Define quality and reliability targets  Define negative requirements (i.e., “shall nots”) ◈ Development phase defect removal  Formal inspections (requirements, design, and code)  Failure analysis ◈ Design  Team or group-oriented approach to design for the system and S/W ◘ NOTE: System design team includes other disciplines (e.g., H/W & Mech) (v0.1) Ops A La Carte © 11
  • 59. “Best in Class” Company Best Practices (continued) ◈ Testing  Robust Testing strategy to meet business / customer requirements  Test plans completed and reviewed before the coding phase  Mandatory developer unit testing  Independently verify/test every software change (enhancements and fixes)  Create formal test plans for all medium and large-sized projects  Staff an independent and dedicated SQA team to at least 5% of size of the S/W development team  Generate quality or reliability estimates  Incorporate automated test tools into the test cycle ◈ S/W Quality Assurance  Review and prioritize all changes after the development phase  Record and track all changes to S/W artifacts throughout the life cycle  Formalize unit testing reviews (test plans and results)  Implement active quality assurance programs  Root-cause analysis with resolution follow-up  Gather and review customer product feedback (v0.1) Ops A La Carte © 12
  • 60. “Best in Class” Company Best Practices (continued) ◈ SCM and Defect Tracking  Implement formal change management of artifact changes and S/W releases  Incorporate automated defect tracking tools ◈ Metrics and Measurements  Record and track all defects and failures  Collect field data for root cause analysis on next project or release iteration  Measure code test coverage  Generate metrics based on code attributes (e.g., size and complexity)  Generate defect removal efficiency measurements  Track “bad fixes” (v0.1) Ops A La Carte © 13
  • 61. Weaknesses in S/W Development Practices ◈ Lack of engineer “ownership” for development and test practices  Limited efficiency and effectiveness improvements made  May lead to disjoint practices, resulting in no real “common” practices ◈ System design is “H/W-centric”  Primary focus on H/W feasibility, functionality and performance  Architectural reviews are not collaborative, team design sessions  S/W requirements of the H/W platform are generally not entertained or implemented ◈ S/W defect removal relies mostly on system or subsystem-level testing  Development phase defect removal is limited to cursory code reviews and sparse unit testing ◘ Designs and design reviews are satisfied using functional or interface specifications  No causal analysis is performed to improve future defect removal (v0.1) Ops A La Carte © 14
  • 62. Weaknesses in S/W Development Practices ◈ Limited system and S/W quality measurements and metrics  Use of default defect tracking tool statistics as primary metrics/measurements  Generally no data mining capability available for analysis ◈ Informal SQA processes and staffing leads to wasted efforts and incomplete coverage  Too many trivial defects still present during system test phase  Defect fixes that introduce additional defects are frequent  S/W is shipped with many untested sections  Significant, recurring, “real world” customer scenarios remain untested ◈ Limited or no tool support for:  Unit testing  Automated regression testing  S/W analysis (static, dynamic, and coverage) (v0.1) Ops A La Carte © 15
• 63. Application Behavior Patterns
S/W Quality Methods | System S/W | Embedded S/W
Summary | Overall, best S/W quality results | Wide range of S/W quality results
Defect Removal Efficiency | Usually > 96% | Up to > 94%
Project Sizes | Best quality results found in projects > 550 KLOCS | Most projects are < 26.5 KLOCS
Inspections | Formal design and code inspections | Usually do not implement both design and code inspections (and not formally)
Test Teams | Independent SQA team | Usually do not have separate SQA teams
Measurement Control | Formal S/W quality measurement process and tools | Informal S/W quality measurement processes and tools
Change Control | Formal change control process and tools | Informal change control process and tools
Test Plans | Formal test plans | Usually do not implement formal test plans
Unit Testing | Performed by developers | Performed by developers
Testing Stages | 6 to 10 test stages (performed by SQA team) | 3 to 6 test stages (usually performed by developers)
Governing Processes | CMM/CMMI and Six-Sigma methods | No consistent pattern found
(v0.1) Ops A La Carte © 16
• 64. Application Behavior Patterns
S/W Quality Methods | System S/W | Commercial S/W
Summary | Overall, best S/W quality results | Wide range of S/W quality results
Defect Removal Efficiency | Usually > 96% | Up to > 90%
Project Sizes | Best quality results found in projects > 550 KLOCS | Most projects are > 275 KLOCS
Inspections | Formal design and code inspections | Inconsistent use of formal design or code inspections
Test Teams | Independent SQA team | Inconsistent use of independent SQA teams
Measurement Control | Formal S/W quality measurement process and tools | Informal S/W quality measurement processes and tools
Change Control | Formal change control process and tools | Formal change control process and tools
Test Plans | Formal test plans | Formal test plans
Unit Testing | Performed by developers | Performed by developers
Testing Stages | 6 to 10 test stages (performed by SQA team) | 3 to 8 test stages (extensive reliance on Beta trials)
Governing Processes | CMM/CMMI and Six-Sigma methods | No consistent pattern found
(v0.1) Ops A La Carte © 17
• 66. Defect Origin and Discovery ◈ Typical behavior (diagram): defects originate in the Requirements, Design, Coding, Testing and Maintenance phases, but most are not discovered until Testing and Maintenance (“Surprise!”) ◈ Goal of best practices on defect discovery (diagram): defects are discovered in the same phase in which they originate (v0.1) Ops A La Carte © 19
• 67. Software Defect Removal Techniques
Defect Removal Technique | Efficiency Range
Design inspections | 45% to 60%
Code inspections | 45% to 60%
Unit testing | 15% to 45%
Regression test | 15% to 30%
Integration test | 25% to 40%
Performance test | 20% to 40%
System testing | 25% to 55%
Acceptance test (1 customer) | 25% to 35%
◈ Development organizations try to find and remove more defects by implementing more stages of system testing  Since there is a wide range of overlap between test stages, this approach becomes less efficient as it scales (v0.1) Ops A La Carte © 20
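For a rough feel of how such stages stack, here is a small Python sketch (not from the seminar materials) that chains mid-range efficiencies from the table above under the optimistic assumption that each stage removes its stated fraction of whatever defects remain; because real test stages overlap, as the slide notes, the actual combined efficiency is lower.

    def combined_efficiency(stage_efficiencies):
        """Fraction of defects removed by a pipeline of removal stages, assuming
        each stage independently removes its fraction of the remaining defects."""
        remaining = 1.0
        for e in stage_efficiencies:
            remaining *= (1.0 - e)
        return 1.0 - remaining

    # design inspection, code inspection, unit test, system test (mid-range values)
    print(combined_efficiency([0.55, 0.55, 0.30, 0.40]))   # ~0.91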
• 68. Defect Removal Technique Impact
Design Inspections / Reviews | n | n | n | n | Y | n | Y | Y
Code Inspections / Reviews | n | n | n | Y | n | n | Y | Y
Formal SQA Processes | n | Y | n | n | n | Y | n | Y
Formal Testing | n | n | Y | n | n | Y | n | Y
Median Defect Efficiency | 40% | 45% | 53% | 57% | 60% | 65% | 85% | 99%
This large potential available from design and code inspections/reviews is why most development organizations see greater improvements in S/W reliability with investments in the development phase than with further investments in testing. NOTE: Design review results are based on low-level design reviews. (v0.1) Ops A La Carte © 21
• 69. Case Study: Quantifying the Software Quality Investment  Objective: ◘ Develop a value-based approach to determine the necessary S/W quality investment using dependability attributes.  Methodology: ◘ Use an integrated approach of project cost and quality estimation models (COCOMO II and COQUALMO) and empirically-based business value relationship factoring to analyze the data from a diverse set of 161 well-measured S/W projects.  Findings: ◘ The methodology was able to correlate the optimal S/W project quality investment level and strategy to the required project reliability level based on defect impact. ◘ The objectives were satisfied without using specific S/W reliability practices, focusing heavily on defect detection during system testing. (v0.1) Ops A La Carte © 22
  • 70. Reliability, Development Cost & Test Time Tradeoffs (v0.1) Ops A La Carte © Based on COCOMOII (Constructive Cost Model) 23
• 71. Reliability, Development Cost & Test Time Tradeoffs  The relative cost/source instruction to achieve a Very High RELY rating is less than the amount of additional testing time that is required (54%), since early defect prevention reduces the required rework effort and allows for additional testing time. (v0.1) Ops A La Carte © Based on COCOMOII (Constructive Cost Model) 24
• 72. Delivered Defects Scale ◈ A Very Low rating delivers roughly the same number of defects as are introduced ◈ An Extra High rating reduces the delivered defects by a factor of 37  Note: The assumed nominal defect introduction rate is 60 defects/KSLOC, based on the following distribution: ◘ 10 requirements defects/KSLOC ◘ 20 design defects/KSLOC ◘ 30 coding defects/KSLOC (v0.1) Ops A La Carte © Based on COQUALMO (Constructive Quality Model) 25
• 73. Defect Removal Factors Scale
Rating | Automated Analysis | Peer Reviews | Execution Testing and Tools
Very Low | Compiler-based simple syntax checking | No peer reviews | No testing
Low | Basic compiler capabilities | Ad-hoc informal walk-throughs | Ad-hoc testing and debugging
Nominal | Compiler extensions; basic requirements and design consistency | Well-defined sequence for preparation, review, and minimal follow-up | Basic test, test data management and problem-tracking support; test criteria based on checklist
High | Intermediate-level module and intermodule; simple requirements and design | Formal review roles and well-trained participants using basic checklists and follow-up procedures | Well-defined test sequences tailored to the organization; basic test coverage tools and test support system; basic test process management
Very High | More elaborate requirements and design; basic distributed-processing and temporal analysis, model checking, and symbolic execution | Basic review checklists and root cause analysis; formal follow-up using historical data on inspection rates, preparation rates, and fault density | More advanced test tools and test data preparation, basic test oracle support, distributed monitoring and analysis, and assertion checking; metrics-based test process management
Extra High | Formalized specification and verification; advanced distributed-processing | Formal review of roles and procedures; extensive review checklists and root cause analysis; continuous review-process improvement; statistical process control | Highly advanced tools for test oracles, distributed monitoring and analysis, and assertion checking; integration of automated analysis and test tools; model-based test process management
(v0.1) Ops A La Carte © Derived from COQUALMO (Constructive Quality Model) 26
• 74. DfR Based on “Best Practices”
S/W Life Cycle Practices
  Modified existing: Consistent implementations of the entire S/W lifecycle phases (requirements, design, code, unit test, system test and maintenance)
  New: Reliability testing as part of the overall testing strategy; define reliability goals as requirements
Metrics and Measurements
  Modified existing: Record and track all defects and failures; collect field data for root cause analysis on the next project or release iteration; generate defect removal efficiency measurements
  New: Track fix rationale such as “bad” fixes or untested code; collect failure data for analysis during the system test phase
Development Phase Practices
  Modified existing: Reviews of design and code; targeted developer unit testing
  New: Assess designs for availability; perform failure analysis
Testing
  Modified existing: Independently verify/test every S/W change (enhancements and fixes)
  New: Generate reliability estimates
SQA
  Modified existing: Perform failure root-cause analysis; record and track all changes to S/W artifacts throughout the life cycle
(v0.1) Ops A La Carte © 27
• 75. Summary: DfR Based on “Best Practices” ◈ The strength of a “best practices” approach is its intuitiveness  Incorporate considerations of essential functionality and failure behavior in order to understand failure modes and improve availability  Perform design analysis to identify potential failure points and, where possible, redesign to remove failure points or reduce their impact  Analyze S/W for critical failure trigger points and remove them or reduce their impact and frequency where possible  Plan testing to maximize the overall S/W verification prior to field deployment  Let measured data drive changes to reliability practices ◈ Focus on the removal of critical failures instead of all defects  S/W with known defects and faults may still be perceived as reliable by users ◘ NASA studies identified projects that produced reliable S/W with only 70% of the code tested  Removing X% of the defects in a system will not necessarily improve the reliability by X%. ◘ One IBM study showed that removing 60% of the product’s defects resulted in only a 3% reliability improvement ◘ S/W defects in rarely executed sections of code may never be encountered by users and therefore may not improve reliability – Exceptions for essential operations: boot, shutdown, data backup, etc. (v0.1) Ops A La Carte © 28
  • 76. Questions? (v0.1) Ops A La Carte © 29
  • 77. Software Design For Reliability (DfR) Seminar Reliability Measurements and Metrics
• 78. Metrics Supporting Reliability Strategies  Common strategies and tactics used by teams developing highly reliable software products  Explicit, robust reliability requirements during the requirements phase  Appropriate use of fault tolerant techniques in product design  Robust design/operational requirements for maximizing product availability  Focused, targeted (data driven) defect inspection program  Robust testing strategy and program: a well defined, focused mix of unit, regression, integration, system, exploratory and reliability demonstration testing  Robust defect tracking/metrics program focused on the important few ◘ Defect tracking and analysis for all phases of a product’s life, including post-shipment defects/failures (FRACAS) (v0.1) Ops A La Carte © 2
  • 79. Reliability Measurements and Metrics ◈ Definitions  Measurements – data collected for tracking or to calculate meta-data (metrics) ◘ Ex: defect counts by phase, defect insertion phase, defect detection phase  Metrics – information derived from measurements (meta-data) ◘ Ex: failure rate, defect removal efficiency, defect density ◈ Reliability measurements and metrics accomplish several goals  Provide estimates of S/W reliability prior to customer deployment  Track reliability growth throughout the life cycle of a release  Identify defect clusters based on code sections with frequent fixes  Determine where to focus improvements based on analysis of defect/failure data Note: S/W Configuration Management (SCM) and defect tracking tools should be updated to facilitate the automatic tracking of this information ◘ Allow for data entry in all phases, including development ◘ Distinguish code base updates for critical defect repair vs. any other changes, (e.g., enhancements, minor defect repairs, coding standards updates, etc.) (v0.1) Ops A La Carte © 3
• 80. Critical Measurements To Collect
Measurement | Description
Critical Defects by Phase | Number of critical defects found during each non-operational phase (i.e., requirements, design, and coding)
Critical Failures by Phase | Number of critical failures found during each operational phase (i.e., unit testing, system testing, and field)
Critical Defect Insertion Phase | The phase where the critical defect (or critical failure) was inserted (or originated)
Critical Defect Detection Phase | The phase where the critical defect (or critical failure) was detected (or reported)
Critical Defect Major Location | A high-level indicator of a critical defect’s location within the source code (e.g., a S/W component or file name)
Critical Defect Minor Location | A low-level indicator of a critical defect’s location within the source code (e.g., the name of a class, method or data object)
Critical Failure Time | The time when a critical failure occurred since the beginning of a test run (typically measured in CPU or wall time)
Critical Failure Root Cause | The relevant failure category for a specified critical failure
(v0.1) Ops A La Carte © 4
• 81. Metrics To Track
Metric | Description
Critical Defect Density | The number of defects per KLOC (1,000 lines of commented source code)
Critical Defect Removal Efficiency (CDRE) | The percentage of defects identified within a given life cycle period
Critical Failure Rate | The mean number of failures occurring within a reference period
Current Defect Demographics | Open defect demographics of the current code base, including defects by severity, module, fix backlog, etc.
Failure/Defect Arrival Rates | Trends (e.g., defects vs. test time interval) of newly detected failures and/or defects
Bad Fixes | The number of failures caused as side effects of fixes to logged defects
Unverified Code Fixes | The number of failures caused by code that was neither reviewed nor tested
Failure Root Cause Distribution | The distribution of failure categories for Pareto root-cause analysis
(v0.1) Ops A La Carte © 5
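One way to make these measurements collectible is a per-defect record like the Python sketch below (illustrative only; the field names mirror the measurements on the previous slide and are not prescribed by the seminar):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CriticalDefectRecord:
        """One record per critical defect/failure, as logged in the defect tracker."""
        insertion_phase: str                      # requirements, design, coding, ...
        detection_phase: str                      # unit test, system test, field, ...
        major_location: str                       # S/W component or file name
        minor_location: str                       # class, method or data object
        failure_time_h: Optional[float] = None    # CPU/wall time into the test run
        root_cause: Optional[str] = None          # failure category

    def critical_defect_density(records, kloc):
        """Critical defects per 1,000 lines of code."""
        return len(records) / kloc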
• 82. Software Defect Distributions ◈ Average distribution of all types of S/W defects by lifecycle phase:  20% Requirements  30% Design  35% Coding  10% Bad Defect Fixes (introduction of secondary defects)  5% Customer Documentation ◘ 50% of all S/W defects are introduced before coding ◘ 1 in 10 defects fixed during testing were unintended side effects of a previous defect “fix” ◈ Average distribution of S/W defects escalated from the field (based on 1st year field defect report data):  1% Severity 1 (catastrophic)  20% Severity 2 (serious)  35% Severity 3 (minor)  44% Severity 4 (annoyance or cosmetic) ◘ Only ~20% of the customer-reported S/W defects are targets for reliability improvements (v0.1) Ops A La Carte © 6
• 83. Typical Defect Tracking (System Test)
System Test Build | Severity #1 Defects Found | Severity #2 Defects Found | Severity #3 Defects Found | Severity #4 Defects Found | Total Defects Found
SysBuild-1 | 7 | 9 | 16 | 22 | 54
SysBuild-2 | 5 | 5 | 14 | 26 | 50
SysBuild-3 | 4 | 6 | 8 | 16 | 34
… | … | … | … | … | …
SysBuild-7 | 0 | 1 | 4 | 6 | 11
(v0.1) Ops A La Carte © 7
• 84. Defect Removal Efficiency ◈ Critical defect removal efficiency (CDRE) is a key reliability measure: CDRE = Critical Defects Found / Critical Defects Present ◈ “Critical Defects Present” is the sum of the critical defects found in all phases as a result of reviews, testing and customer/field escalations  System testing stages include integration, functional, loading, performance, acceptance, etc.  Customer trials can be considered either a system testing stage, a preliminary, but separate, field deployment phase or a part of the field deployment phase ◘ Depending on the rationale for the trials  The field deployment phase is measured as the first year following deployment ◘ The average life span of a S/W release, since most S/W releases are separated by increments no longer than 1 year (Diagram: phases Requirements, Design, Coding, Unit Testing, System & Subsystem Testing Stages, Field Deployment, grouped three ways into Review/Testing/Field efficiency, Development/System Testing/Field efficiency, and Internal/Field efficiency) (v0.1) Ops A La Carte © 8
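A tiny Python sketch of this ratio (illustrative only; the counts come from Example #1 on the next slide):

    def removal_efficiency(found_in_scope, found_all_phases):
        """CDRE = critical defects found within the scope of interest divided by
        all critical defects found across reviews, testing and the 1st field year."""
        return found_in_scope / found_all_phases

    total = 20 + 30 + 40 + 25 + 55 + 40          # 210 critical defects found in all phases
    internal = removal_efficiency(170, total)    # ~0.81 internal efficiency
    field = removal_efficiency(40, total)        # ~0.19 field efficiency
    print(internal, field)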
• 85. CDRE Example Metrics #1
Origin | Critical Defects Found
Requirements Reviews | 20
Design Reviews | 30
Code Reviews | 40
Unit Testing | 25
System & Subsystem Testing | 55
(Internal subtotal) | 170
Field Deployment | 40
TOTAL | 210
Metric | Removal Efficiency
Internal Efficiency | 81% (= 170 / 210)
Field Efficiency | 19% (= 40 / 210)
(v0.1) Ops A La Carte © 9
• 86. CDRE Example Metrics #2
Origin | Critical Defects Found
Requirements Reviews | 20
Design Reviews | 30
Code Reviews | 40
Unit Testing | 25
(Development subtotal) | 115
System & Subsystem Testing | 55
Field Deployment | 40
TOTAL | 210
Metric | Removal Efficiency
Development Efficiency | 55% (= 115 / 210)
System Testing Efficiency | 26% (= 55 / 210)
Field Efficiency | 19% (= 40 / 210)
(v0.1) Ops A La Carte © 10
• 87. CDRE Example Metrics #3
Origin | Critical Defects Found
Requirements Reviews | 20
Design Reviews | 30
Code Reviews | 40
(Review subtotal) | 90
Unit Testing | 25
System & Subsystem Testing | 55
(Testing subtotal) | 80
Field Deployment | 40
TOTAL | 210
Metric | Removal Efficiency
Review Efficiency | 43% (= 90 / 210)
Testing Efficiency | 38% (= 80 / 210)
Field Efficiency | 19% (= 40 / 210)
(v0.1) Ops A La Carte © 11
• 88. Sample Project Reliability Measurement Tracking – At the end of the project Unit Testing Phase
Phase (where found) | Total Defects Found | Total Critical | Critical by insertion phase: Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | | | | |
Design | 123 | 45 | 6 | 39 | | | |
Code | 158 | 62 | 4 | 12 | 46 | | |
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | |
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | |
Integration Test | (to be filled in)
System Test | (to be filled in)
Testing Totals | (to be filled in)
Field Reports Totals | (to be filled in)
(v0.1) Ops A La Carte © 12
• 89. Sample Project Reliability Measurement Tracking – 1 year after the end of the System Testing phase
Phase (where found) | Total Defects Found | Total Critical | Critical by insertion phase: Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | | | | |
Design | 123 | 45 | 6 | 39 | | | |
Code | 158 | 62 | 4 | 12 | 46 | | |
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | |
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | |
Integration Test | 43 | 13 | 0 | 4 | 7 | 1 | 1 |
System Test | 183 | 47 | 2 | 13 | 28 | 0 | 4 |
Testing Totals | 226 | 60 | 2 | 17 | 35 | 1 | 5 |
Field Reports Totals | 70 | 35 | 1 | 8 | 22 | 0 | 3 | 1
Release Summary | 720 | 239 | 26 | 81 | 120 | 3 | 8 | 1
(v0.1) Ops A La Carte © 13
• 90. Sample Project Reliability Measurement Tracking – 1 year after the end of the System Testing phase (the same table, annotated with Design-phase DRE)
Phase (where found) | Total Defects Found | Total Critical | Critical by insertion phase: Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | | | | |
Design | 123 | 45 | 6 | 39 | | | |
Code | 158 | 62 | 4 | 12 | 46 | | |
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | |
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | |
Integration Test | 45 | 13 | 0 | 4 | 7 | 1 | 1 |
System Test | 189 | 47 | 2 | 13 | 28 | 0 | 4 |
Testing Totals | 234 | 60 | 2 | 17 | 35 | 1 | 5 |
Field Reports Totals | 77 | 35 | 1 | 8 | 22 | 0 | 3 | 1
Release Summary | 745 | 239 | 26 | 81 | 120 | 3 | 8 | 1
Design DRE Measurements: 39 critical defects found in-phase; 56 critical defects found during development; 17 critical defects found during system-level testing; 81 critical defects found overall
Design DRE Metrics: 48% in-phase DRE (= 39/81); 69% development DRE (= 56/81); 90% in-house DRE (= (56+17)/81)
(v0.1) Ops A La Carte © 14
• 91. Sample Project Reliability Measurement Tracking – 1 year after the end of the System Testing phase (the same table, annotated with Bad Fixes data)
Phase (where found) | Total Defects Found | Total Critical | Critical by insertion phase: Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | | | | |
Design | 123 | 45 | 6 | 39 | | | |
Code | 158 | 62 | 4 | 12 | 46 | | |
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | |
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | |
Integration Test | 45 | 13 | 0 | 4 | 7 | 1 | 1 |
System Test | 189 | 47 | 2 | 13 | 28 | 0 | 4 |
Testing Totals | 234 | 60 | 2 | 17 | 35 | 1 | 5 |
Field Reports Totals | 77 | 35 | 1 | 8 | 22 | 0 | 3 | 1
Release Summary | 745 | 239 | 26 | 81 | 120 | 3 | 8 | 1
Bad Fixes Measurements: 1 critical defect inserted and found during field deployment; 5 critical defects inserted and found during system-level testing; 3 critical defects inserted during system-level testing and found during field deployment; 1 critical defect inserted during unit testing and found during system-level testing; 0 critical defects inserted during unit testing and found during field deployment; 60 total critical defects found during system-level testing; 35 total critical defects found in the field
Bad Fixes Metrics: 10% of the test phase failures are from Bad Fixes (= (1 + 5)/60); 11% of the field phase failures are from Bad Fixes (= (0 + 3 + 1)/35); 11% of the test and field failures are from Bad Fixes (= (1 + 5 + 0 + 3 + 1)/(60 + 35))
(v0.1) Ops A La Carte © 15
• 92. Sample Project Reliability Metrics (metrics tracked per phase: In-phase DRE, Overall DRE, CDD, Bad Fixes)
Requirements | 29% | 46%
Design | 51% | 48% | 58%
Code | 56% | 38% | 84%
Unit Test | 18% | 17%
Testing | 26% | 10% | 11%
Field | 16% | 16% | 11%
(v0.1) Ops A La Carte © 16
• 93. Sample Project Reliability Metrics (metrics tracked per phase: In-phase DRE, Overall DRE, CDD, Bad Fixes)
Requirements | 29% | 46%
Design | 51% | 48% | 58%
Code | 56% | 38% | 84%
Unit Test | 18% | 17%
Testing | 26% | 10% | 11%
Field | 16% | 16% | 11%
Sample Phase 1 Goals (50% improvement): • (reliability) increase in-house DRE to 92% • (efficiency) reduce field bad fixes to 5%
(v0.1) Ops A La Carte © 17
  • 94. Distribution of Defects Across Files ◈ There is a Pareto-like distribution of defects across files within a module  Defects cluster in a relatively small number of files  Conversely, more than half of the files have almost no critical defects (v0.1) Ops A La Carte © 18
  • 95. Failure Density Analysis ◈ In general, there is a pareto distribution (80/20) of defects across files ◈ Failure density analysis provides a mechanism for early detection of “problematic” sections of code (i.e., defect clusters)  Improving the reliability of these “problematic modules” can consume as much as 4 times the effort of a redesign ◘ Less time is required to re-inspect and either restructure or redesign these modules than to effectively “beat them into submission” through testing  The goal is to identify “problematic” code as early as possible ◘ Perform failure density analysis early on during unit testing – Since different sections of code may be problematic during system testing, the analysis should be repeated near the middle of this phase ◘ SCM and defect tracking tools can be modified to provide this information without much effort – Source code can be analyzed to display “hot spot” histograms using number of changes and/or number of failures – Heuristics for failure density thresholds must be developed to determine when action should be taken (v0.1) Ops A La Carte © 19
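A small Python sketch of the kind of hot-spot detection described above (illustrative only; the function name and threshold are assumptions, not the seminar's prescription): given one file name per reported failure, it returns the smallest set of files that accounts for most of the failures.

    from collections import Counter

    def hot_spots(failure_file_names, threshold=0.8):
        """Smallest set of files covering `threshold` of the reported failures."""
        counts = Counter(failure_file_names)
        total = sum(counts.values())
        spots, covered = [], 0
        for fname, n in counts.most_common():
            spots.append((fname, n))
            covered += n
            if covered / total >= threshold:
                break
        return spots

    print(hot_spots(["io.c", "io.c", "parse.c", "io.c", "ui.c"]))   # [('io.c', 3), ('parse.c', 1)]

In practice the input would be pulled from the SCM and defect tracking tools mentioned above, and the threshold heuristic tuned per organization.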
• 96. Failure Density Distribution by File (Histogram: % of total reported failures vs. source code files F1 through F23; the first few files contain the majority of the reported failures and should be proactively analyzed for possible redesign, restructuring or additional defects) (v0.1) Ops A La Carte © 20
• 97. Applying Causal Analysis to Defect Measurements ◈ Causal analysis (RCA) can be applied to defect measurements to improve defect removal effectiveness and efficiency  Usually performed between life cycle iterations or S/W releases  Upstream defect removal practices should be reviewed in light of the defects that were not detected in each phase ◘ Requires knowing the phases where a defect was introduced and detected ◘ Simple guidelines should be defined to determine the phase where the defect was introduced  Defects are categorized by types ◘ Can determine if the problem is systemic to one or more categories, or ◘ Whether the problem is an issue of raising overall defect removal efficiency for a development phase (v0.1) Ops A La Carte © 21
  • 98. Causal Analysis Process (one shot analysis) Objective: Create an initial, rough distribution of the defects and identify potential improvements to existing defect removal practices. Process Outline (typically 6-8 hours over two days): ◈ Select a team from the senior engineers and MSEs of the development and test teams. Analysis Process:  Before the meeting, select a representative sample of approximately 50-75 defects from each development and test phase  Convene meeting and explain the objectives and process to the team  Classify the defects ◘ Start by walking the team through the classification of one defect ◘ Divide the defects into small groups and assign each person 2 groups of defects (to force analysis overlap) ◘ Upon completion, collect and process the data offline in preparation for team analysis and review  Analyze the defect types using a histogram to look for a Pareto distribution and select the most prevalent defect types  Develop recommendations for improvements to specific defect removal practices  Implement the recommendations at the next possible opportunity and gather measurement data (v0.1) Ops A La Carte © 22
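A minimal Python sketch of the classification bookkeeping for such a session (illustrative; the names are hypothetical): each defect carries the type labels assigned by the reviewers who saw it, the first label feeds the Pareto histogram, and any disagreement from the deliberately overlapping assignments is flagged for the team review.

    from collections import Counter

    def classification_summary(labels_by_defect):
        """labels_by_defect: {defect_id: [type labels from each reviewer]}."""
        histogram = Counter()
        disagreements = []
        for defect_id, labels in labels_by_defect.items():
            histogram[labels[0]] += 1              # one vote per defect for the Pareto chart
            if len(set(labels)) > 1:
                disagreements.append(defect_id)    # resolve these during the team review
        return histogram.most_common(), disagreements

    sample = {"D-101": ["logic", "logic"], "D-102": ["data handling", "logic"]}
    print(classification_summary(sample))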
• 99. The Orthogonal Defect Analysis Framework ◈ ORIGIN (Where?): Spec/Rqmts, Design, Code, Environment Support, Documentation, Other ◈ TYPE (What?), by origin:  Spec/Rqmts – Requirements or Specifications, Functionality, HW Interface, SW Interface, User Interface, Functional Description  Design – Process (Interprocess) Communications, Data Definition, Module Design, Logic Description, Error Checking, Standards  Code – Logic, Computation, Data Handling, Module Interface/Implementation, Standards  Environment Support – Test SW, Test HW, Development Tools, Integration SW ◘ “Other” can also be a type classification for any origin ◈ MODE (Why?): Missing, Unclear, Wrong, Changed, Better Way (v0.1) Ops A La Carte © 23