



We Provide You Confidence in Your Product Reliability™
   Ops A La Carte / (408) 654-0499 / askops@opsalacarte.com / www.opsalacarte.com
Software Design
for Reliability (DfR)
  ½-day Seminar



   Ops A La Carte LLC // www.opsalacarte.com
The following presentation materials are copyright-protected property of
Ops A La Carte LLC. Distribution of these materials is limited to your
company staff only.

These materials may not be distributed outside of your company or used
for any purpose other than training.
Software DfR ½-Day Seminar Agenda


  Agenda

  ◈ Introductions and Agenda Review
  ◈ Software Reliability Basic Concepts
  ◈ A “Best Practices” Approach to Developing Reliable Software
  ◈ Reliability Measurements and Metrics
  ◈ Wrap-up




Presenter’s
Biographical
   Sketch
Presenter’s Biographical Sketch – Bob Mueller
   ◈ Bob Mueller is a senior consultant/program manager with Ops A La Carte and the Marisan
         Group. He is a product development professional with 30+ years of technical and
         management experience in software-intensive product development, R/D process
         and quality systems development, including extensive consulting experience with
         cross-functional product development teams and senior management.
   ◈ After receiving his M.S. in Physics in 1973, Bob joined Hewlett-Packard in Cupertino, CA,
         in IC process development. Over the next three decades, before leaving HP in 2002, he
         held numerous positions in R/D, R/D management and consulting, including:
            IC process development, process engineering and IC production management.
            Lead developer of an automated IC in-process/test monitor, analysis and
             control system (HP internal).
            R/D project management for S/W-intensive products (including process analysis and
             control, work cell control and quality control systems).
            Numerous R/D management positions in computer, analytical and healthcare
             businesses, including FDA-regulated systems within ISO 9001 certified organizations.
            Numerous program management positions focused on internal/external process
             improvement and consulting.
            Practice area manager and consultant for PG -- Engineering consulting team
             (internal HP)

   ◈ Bob’s current consulting interests include: warranty process and quality system
         improvement, S/W reliability, agile S/W product development methodologies, and R/D
         product strategy and technology roadmap development.
   ◈ Bob has taught many internal HP classes and has taught at local junior colleges.

Software Reliability Integration Services
                   for the Product
Reliability Integration in the Concept Phase
• Software Reliability Goal Setting
• Software Reliability Program and Integration Plan

Reliability Integration in the Design Phase
• Facilitation of Team Design Template Reviews
• Facilitation of Team Design Reviews
• Software Failure Analysis
• Software Fault Tolerance

Reliability Integration in the Implementation Phase
• Facilitation of Code Reliability Reviews
• Software Robustness and Coverage Testing Techniques

Reliability Integration in the Testing Phase
• Software Reliability Measurements and Metrics
• Usage Profile-based Testing
• Software Reliability Estimation Techniques
• Software Reliability Demonstration Tests
Software Design For Reliability (DfR)




Software Reliability
  Basic Concepts

                                    George de la Fuente
                                 georged@opsalacarte.com
                                     (408) 828-1105
                                   www.opsalacarte.com
Software Quality
        vs.
Software Reliability
Software Quality vs. Reliability
ISO 9126 Software Quality Model: Factors and Criteria

• Functionality: suitability, accuracy, interoperability, security
• Usability: understandability, learnability, operability, attractiveness
• Reliability: maturity, fault tolerance, recoverability
• Efficiency: time behavior, resource utilization
• Maintainability: analysability, changeability, stability, testability
• Portability: adaptability, installability, co-existence, replaceability

Software Quality: the level to which the software characteristics conform to
all the specifications.
Defining
Software Reliability
Software Reliability Definitions

    “The probability of failure free software operation
        for a specified period of time in a specified
                        environment”
                                            ANSI/IEEE STD-729-1991




 ◈ Examine the key points

 ◈ Practical rewording of the definition:
       Software reliability is a measure of the software failures that are visible to a
       customer and that prevent a system from delivering essential functionality for a
       specified period of time.
Software Reliability Can Be Measured

◈ Measurements (quantitative) are a required foundation
          Differs from quality, which is not defined by measurements
          All measurements and metrics are based on run-time failures
◈ Only customer-visible failures are targeted
          Only defects that produce customer-visible failures affect reliability
          Corollaries
             ◘ Defects that do not trigger run-time failures do NOT affect reliability
                – badly formatted or commented code
                – defects in dead code
             ◘ Not all defects that are triggered at run-time produce customer-
                visible failures
                  – corruption of any unused region of memory

◈ S/W Reliability evolved from H/W Reliability
          Primary distinction: S/W Reliability focuses only on design reliability


Software Reliability Is Based On Usage

 ◈ S/W failure characteristics are derived from the usage profile of a
         particular customer (or set of customers)
          Each usage profile triggers a different set of run-time S/W faults and failures

 ◈ Example of reliability perspective from 3 users of the same S/W
          Customer A
              ◘ Usage Profile – Exercises sections of S/W that produce very few failures.
              ◘ Assessment – S/W reliability is high.
          Customer B
              ◘ Usage Profile – Overlaps with Customer A’s usage profile. However,
                Customer B also exercises other sections of S/W that produce many,
                frequent failures
              ◘ Assessment – S/W reliability is low.
          Customer C
              ◘ Usage Profile – Similar to Customer B’s usage profile. However, Customer
                C has implemented workarounds to mitigate most of the S/W failures that
                were encountered. The final result is that the S/W executes with few
                failures but requires additional off-nominal steps.
              ◘ Assessment      – S/W quality is low since many workarounds are required.
                However, for the final configuration that includes these workarounds, S/W
                reliability is acceptable.

Reliability ≠ Correctness or Completeness

 ◈ Correctness is a measure of how well the requirements model the intended
         customer base or industry functionality
           Correctness is validated by reviewing product requirements and
            functional specifications with key customers

 ◈ Completeness is a measure of the degree of intended
         functionality that is modeled by the S/W design
           Completeness is validated by performing requirements traceability at
            the design phase and design traceability at the coding phase

 ◈ Reliability is a measure of the behavior (i.e., failures) that
         prevents the S/W from delivering the designed
         functionality
           If the resulting S/W does not meet customer or market expectations,
            yet operates with very few failures based on its requirements and
            design, the S/W is still considered reliable



Terminology

 Defects,
Faults and
 Failures
Software Defects That Affect Reliability

Sources
• Documentation: User Manual, Installation Guide, Technical Specs
• Development: Requirements, System Architecture, Designs, Source Code
• Validation: Unit Test Plans/Cases, System-Level Test Plans/Cases, Design Review
  Scenarios or Checklists, Code Review Scenarios or Checklists, S/W Failure Analysis
  Categories

Categories
• Soft Maintenance: Commenting, Style, Consistency, Standards/Guidelines, “Dead Code”
• Run-time Impacts: System outage, Loss of functionality, Annoyance, Cosmetic
• Failures: System outage, Loss of critical functionality
Terminology - Defect

 ◈ A flaw in S/W requirements, design or source code that
         produces unintended or incomplete run-time behavior
           Defects of commission
               ◘   Incorrect requirements are specified
               ◘   Requirements are incorrectly translated into a design model
               ◘   The design is incorrectly translated into source code
               ◘   The source code logic is flawed
           Defects of omission (these are among the most difficult classes of defects to detect)
               ◘ Not all requirements were used in creating a design model
               ◘ The source code did not implement all of the design
               ◘ The source code has missing or incomplete logic

 ◈ Defects are static and can be detected and removed without
         executing the source code

 ◈ Defects that cannot trigger S/W failures are not counted for
         reliability purposes
           These are typically quality defects that affect other aspects of S/W quality, such
            as soft maintenance defects and defects in test cases or documentation
Terminology - Fault

 ◈ The result of triggering a S/W defect by
         executing the associated source code

          Faults are NOT customer-visible
              ◘ Example: a memory leak, or a packet corruption
                that requires retransmission by the higher-layer stack

          A fault may be the transitional state that results in
            a failure
              ◘ Trivially simple defects (e.g., display spelling
                errors) do not have intermediate fault states

 [Diagram: Defect → Fault]
Terminology - Failure

 ◈ A customer (or operational system)
         observation or detection that is perceived
         as an unacceptable departure of
         operation from the designed S/W
         behavior

          Failures are the visible, run-time symptoms of faults
             ◘ Failures MUST be observable by the customer or
               another operational system

          Not all failures result in system outages

 [Diagram: Defect → Fault → Failure]
Defect-to-Failure Transition

 ◈ Example
          A S/W function (or method) processes the data stored in a
           memory buffer and then frees the allocated memory buffer
           back to the memory pool

          A defect within this function (or method), when triggered, will
           fail to free the memory buffer before completion

 [Diagram: function flow from a single entry point, through 1 of many logic
  branch points (one of which contains the defect), to 4 possible exit points]
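The short Python sketch below (not from the seminar; the buffer pool and input values
are hypothetical) mirrors this example: one rarely taken branch returns without
releasing its buffer, so the defect only becomes a fault on that path, and the eventual
failure surfaces much later when the pool is exhausted.

    # Hypothetical sketch of the example above: a fixed-size buffer pool and a
    # processing routine with a defect on one rarely executed logic path.
    class BufferPool:
        def __init__(self, size):
            self.free = [bytearray(256) for _ in range(size)]

        def acquire(self):
            if not self.free:                  # the failure surfaces here, much later
                raise RuntimeError("buffer pool exhausted")
            return self.free.pop()

        def release(self, buf):
            self.free.append(buf)

    def process(pool, data):
        buf = pool.acquire()
        buf[:len(data)] = data
        if data.startswith(b"\x00"):           # rare branch containing the defect
            return len(data)                   # DEFECT: early return skips release()
        pool.release(buf)                      # all other paths free the buffer
        return len(data)

    pool = BufferPool(size=3)
    process(pool, b"normal input")             # defect not on this path: no fault
    process(pool, b"\x00unusual input")        # fault: one buffer is silently lost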
Defect-to-Failure Transition                      (continued)


 ◈ Most of the possible logic paths do not trigger the defect
          If these are the only logic paths traversed by a customer, this portion
           of the S/W will be considered very reliable




Defect-to-Failure Transition                               (continued)


 ◈ Fault transition
          Eventually a logic path is executed that triggers the defect, resulting in a fault
           being generated
              ◘ The function (or method) completes its execution
              ◘ The fault causes the system to lose track of a single memory buffer
              ◘ The system continues to operate without a visible impact
          Since the fault causes no visible impact, a failure does NOT occur




Defect-to-Failure Transition                      (continued)


 ◈ Failure scenario
          After sufficient memory buffers have been lost,
           the buffer pool reaches a critical condition where
           either:
              ◘ No buffers are available to satisfy another
                allocation request (there are still some
                buffers in use), or
              ◘ All buffers have been lost through leakage
                (no buffers will ever be freed for future
                allocation requests)

          Once the next buffer allocation is requested, a
           failure occurs
              ◘ The system cannot continue to operate
                normally

          Note the time lag between the triggering of the
           last fault and the occurrence of the associated
           failure

 [Timeline: faults are triggered at t1, t2, …, tN; the associated failure occurs later at tF]


Summary of Defects and Failures

 ◈ There are 3 types of run-time defects
          1. Defects that are never executed (so they don’t trigger
             faults)

          2. Defects that are executed and trigger faults that do
             NOT result in failures

          3. Defects that are executed and trigger faults that result
             in failures

 ◈ Practical S/W Reliability focuses on defects
         that have the potential to cause failures by:

          1. Detecting and removing defects that result in failures
             during development

          2. Designing and implementing fault tolerance techniques to
               ◘ prevent faults from producing failures, or
               ◘ mitigate the effects of the resulting failures

 [Diagram: the three defect paths: defect never executed; defect → fault;
  defect → fault → failure]



Failure Distributions,
    Failure Rates
         and
        MTTF
Reliability and Failure Distributions

         Restated, reliability is the probability that a system does not
              experience a failure during a time interval, [0,T].


 ◈ Reliability is a measure of statistical probability, not certainty
          Ex: A system has a 99% reliability over a period of 100 days
              ◘ Does this imply that only 1 failure will occur during the 100 day period?

 ◈ Reliability is based on failure distribution models
          Represent the time distribution of failure occurrences
          Various failure distribution models exist:
              ◘   Exponential (most commonly used in S/W reliability)
              ◘   Weibull
              ◘   Poisson
              ◘   Normal
              ◘   Rayleigh
              ◘   etc….


 ◈ Let’s examine an exponential failure distribution model
Failure Distributions - Exponential

 ◈ Exponential Reliability Function
          The most widely used failure distribution is the exponential reliability function:
             ◘ Models a random distribution of failure occurrences
          Defined by:

                              R(t) = e^(−λt)

          where
             ◘ t is mission time
                     – the system is assumed to be operational at t = 0
                     – the mission duration is represented by T
             ◘ λ is a constant, instantaneous failure rate (or failure intensity)
             ◘ MTTF = 1 / λ (for repairable systems)

          [Plot: R(t) decaying exponentially over time, shown for λ = 0.1 failures/hr.]
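As a quick illustrative check (Python; not part of the original slides, and the helper
name is arbitrary), the exponential model can be evaluated directly:

    import math

    # Minimal sketch of the exponential reliability model: R(t) = exp(-lambda * t).
    def reliability(failure_rate, t):
        """Probability of surviving to time t under a constant failure rate."""
        return math.exp(-failure_rate * t)

    lam = 0.1                         # failures per hour (the slide's example)
    print(1 / lam)                    # MTTF = 10 hours
    print(reliability(lam, 1))        # ~0.905 -> about 90% reliable at t = 1 hr
    print(reliability(lam, 1 / lam))  # ~0.368 -> always ~37% reliable at t = MTTF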
A Closer Look At The Exponential Distribution

Example parameters:
• Mission duration: T = 100 hours
• Failure rate: λ = 0.1 failures/hr. (or 1 failure every 10 hrs.)
• MTTF = 10 hrs.

[Plot: reliability R(t) vs. time (hrs) for the values above]

• At t = 1 hr., the reliability is 90%
• When t = MTTF, the reliability is always 37%:
     R = e^(−λt) = e^(−(1/MTTF) × MTTF) = e^(−1) ≈ 37%
A Closer Look at Reliability Values

 ◈ Based on an exponential failure distribution, what does it mean
         for S/W to have 99% reliability after one year of operation?
          For a single S/W product:
              ◘ There is a 99% probability that the S/W will still be operational after 1 year
                  – Conversely, there is a 1% chance of a failure during that period.
              ◘ Note that this value does NOT tell us when, during the 1-year period, a
                failure will occur.
                   – With the exponential distribution, the cumulative probability that a
                     failure has occurred increases as time progresses.


          For a group of software products (e.g., 100 products):
              ◘ 99% of the products will be operational after 1 year (e.g., 99 products)
              ◘ There is a 36.6% probability that all 100 products will be operational after
                1 year
                   – This is computed by multiplying the reliabilities of all the products:
                   R(t) = R1(t) × R2(t) × … × R100(t)
                        = 0.99 × 0.99 × … × 0.99
                        = 0.99^100
                        ≈ 0.366


Sample Reliability Calculations

 ◈ What is the failure rate (λ) and MTTF necessary to achieve
         this level of reliability?
              t = 1 yr. = 8,760 hrs.

              R(t)     = e^(−λt)
              0.99     = e^(−λ × 8760)
              ln(0.99) = −λ × 8760

              λ        = −ln(0.99) / 8760
                       = 1.1 × 10^−6 failures/hr. (1 failure every 99.5 years)

              MTTF     = 1 / λ
                       = 871,613 hrs. (99.5 yrs.)

 ◈ What is the reliability at the MTTF?
              t = MTTF = 871,613 hrs.
              R(MTTF)  = e^(−λ × MTTF)
                       = e^(−(1/MTTF) × MTTF)
                       = e^(−1)
                       = 0.368 (~37%)
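The same arithmetic as a small Python sketch (illustrative only, not part of the
seminar materials); it also reproduces the 0.99^100 ≈ 0.366 figure quoted on the
previous slide for a group of 100 products:

    import math

    # Required failure rate and MTTF for 99% reliability over one year
    # (exponential model assumed).
    R_target = 0.99
    t = 8760.0                     # one year in hours

    lam = -math.log(R_target) / t  # ~1.1e-6 failures/hr
    mttf = 1 / lam                 # ~871,600 hrs (~99.5 years)
    print(lam, mttf)

    print(math.exp(-lam * mttf))   # reliability at t = MTTF: ~0.368 (~37%)
    print(R_target ** 100)         # ~0.366: chance that all 100 units survive 1 year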
Software and Hardware Failure Rates

[Chart: S/W and H/W failure rate curves over the product life cycle]

• Software: the failure rate is driven by the effectiveness of S/W defect detection and
  repair processes over the span of many upgrades; its phases are Pre-release Testing,
  Useful Life (with upgrades), and Obsolescence.
• Hardware: the failure rate is driven by three very different physical failure domains;
  its phases are Burn-In, Useful Life, and Wearout.
• Initial system deployment (i.e., completion of the Pre-release Testing and Burn-In
  phases) establishes a baseline for both the S/W (λSW-B) and H/W (λHW-B) failure rates.
Software
Availability
System Availability

     Availability is the percentage of time that a system is operational,
               accounting for planned and unplanned outages.

 ◈ Example: 90% Availability (for a timeframe T)
          Logical representation
              ◘ The system is operational for the first 90% of the timeframe (0.9T) and down
                for the last 10% (0.1T)

          Actual (or possible) representation
              ◘ 3 failures, each followed by a system restoration, cause the system to be down
                for 10% of the timeframe

 [Timeline: timeframe T with alternating “failure occurs” / “system restored” intervals]
System Availability                     (continued)


 ◈ System availability, A(T), is the relationship between the
         timeframes when a system is operational vs. down due to a
         failure-induced outage, and is defined as:

                              A(T) = MTBF / (MTBF + MTTR)

         where,
          The system is assumed to be operational at time t = 0
          T = MTBF + MTTR and 0 ≤ t ≤ T
          MTBF (Mean Time Between Failures) is based on the failure rate
          MTTR (Mean Time To Repair) is the duration of the outage (i.e., the expected
            time to detect, repair and then restore the system to an operational state)
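A minimal Python illustration of this relationship (the failure and repair numbers
are assumed for the example, not taken from the slides):

    # A(T) = MTBF / (MTBF + MTTR); helper and example values are illustrative.
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    a = availability(mtbf_hours=1000.0, mttr_hours=1.0)  # fail ~every 1,000 hrs, 1-hr outages
    print(a)                   # ~0.999 -> roughly "three nines"
    print((1 - a) * 8760)      # ~8.75 hours of downtime per year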




Software Availability

 ◈ System outages that are caused by S/W can be attributed to:
    1. Recoverable S/W failures
    2. S/W upgrades
    3. Unrecoverable S/W failures
         NOTE: Recoverable S/W failures are the most frequent S/W cause of
               system outages

 ◈ For outages due to recoverable S/W failures, availability is
         defined as:

                              A(T) = MTTF / (MTTF + MTTR)

         where,
          MTTF is the Mean Time To [next] Failure
          MTTR (Mean Time To [operational] Restoration) is still the duration of the
            outage, but without the notion of a “repair time”. Instead, it is the time until the
            same system is restored to an operational state via a system reboot or some
            level of S/W restart.


Software Availability                       (continued)


 ◈ A(T) can be increased by either:
          Increasing MTTF (i.e., increasing reliability) using S/W reliability practices
          Reducing MTTR (i.e., reducing downtime) using S/W availability practices

 ◈ MTTR can be reduced by:
          Implementing H/W redundancy (sparingly) to mask most likely failures
          Increasing the speed of failure detection (the key step)
          S/W and system recovery speeds can be increased by implementing Fast Fail
            and S/W restart designs
              ◘ Modular design practices allow S/W restarts to occur at the smallest
                 possible scope, e.g., thread or process vs. system or subsystem
              ◘ Drastic reductions in MTTR are only possible when availability is part of the
                 initial system/software design (like redundancy)


 ◈ Customers generally perceive enhanced S/W availability as a S/W
         reliability improvement
          Even if the failure rate remains unchanged


System Availability Timeframes
   Availability Class                                 Availability                 Downtime per year            Downtime per 3 months
   (1) Unmanaged                                      90% (1 nine)                 36.5 days (52,560 mins)      9.13 days
   (2) Managed (good web servers)                     99% (2 nines)                3.65 days (5,256 mins)       21.9 hours
   (3) Well-managed                                   99.9% (3 nines)              8.8 hours (525.6 mins)       2.19 hours
   (4) Fault Tolerant (better commercial systems)     99.99% (4 nines)             52.6 mins                    13.14 minutes
   (5) High-Availability (high-reliability products)  99.999% (5 nines)            5.3 mins                     1.31 minutes
   (6) Very-High-Availability                         99.9999% (6 nines)           31.5 secs (2.6 mins/5 yrs)   7.88 seconds
   (7) Ultra-Availability                             99.99999% (7 nines) to       3.2 secs down to 31.5        0.79 seconds or less
                                                      99.9999999% (9 nines)        millisecs (15.8 secs/5 yrs or less)
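The per-year figures above follow directly from the availability percentages; a small
Python check (illustrative, not part of the slides):

    # Downtime per year implied by each availability level (525,600 minutes/year).
    for nines, avail in [(1, 0.9), (2, 0.99), (3, 0.999),
                         (4, 0.9999), (5, 0.99999), (6, 0.999999)]:
        downtime_min = (1 - avail) * 365 * 24 * 60
        print(f"{nines} nine(s): {downtime_min:,.1f} min/year")
    # 52,560.0 / 5,256.0 / 525.6 / 52.6 / 5.3 / 0.5 (about 31.5 seconds)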
Input
Robustness
Software Robustness

      Software Robustness is a measure of the software’s ability to
   handle exceptional input conditions so they do not become failures.


 ◈ Exceptional input conditions result from:
          Inputs that violate data value constraints
          Inputs that violate data relationships
          Inputs that violate the application’s timing requirements

 ◈ Robust S/W prevents exceptional inputs from:
         1. Causing a system outage
         2. Producing a silent failure by providing no indication that an exceptional input
            condition was detected, thus allowing the failure to propagate
         3. Generating an error condition or response that incorrectly characterizes the
           exceptional input condition

 ◈ S/W robustness becomes increasingly important as a system
         becomes more flexible and the product’s customer base increases
         in size and usage diversity
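A minimal Python sketch of these goals (the function and its limits are hypothetical,
not from the seminar): reject an exceptional input with an accurate, visible error
rather than crashing or failing silently.

    # Illustrative input-robustness sketch: validate data value constraints and
    # report exceptional inputs explicitly instead of letting them propagate silently.
    def set_target_temperature(celsius):
        if not isinstance(celsius, (int, float)):
            raise TypeError(f"temperature must be numeric, got {type(celsius).__name__}")
        if not -40.0 <= celsius <= 125.0:      # hypothetical data value constraint
            raise ValueError(f"temperature {celsius} is outside the supported range [-40, 125]")
        return celsius                         # accepted: pass the validated value onward

    try:
        set_target_temperature("hot")          # exceptional input is detected...
    except (TypeError, ValueError) as err:
        print(f"rejected input: {err}")        # ...and reported, not silently dropped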
Why Is Software Robustness Important ?

[Diagram: the input sets of Users #1, #2, #3, … #n feed the Program; a subset of those
 inputs (“inputs causing erroneous outputs”) maps to erroneous outputs within the
 program’s output set]


Software Robustness Studies

 ◈ 2 studies of S/W robustness
          Examined exceptional input condition testing of POSIX-compliant OSes and UNIX
           command line utilities
          Robustness testing was repeated on multiple releases containing fixes for the
           reported exceptional input failures

 ◈ Findings
          Failure rates associated with robustness testing were significant,
              ◘ Ranging from 10% - 33%
          After many significant, focused S/W fixes over multiple releases, failure rates
           still remained high

 ◈ Conclusions
          Traditional functional testing does not adequately test for exceptional input
           conditions
          Operational profiles testing also does not adequately test for exceptional input
           conditions
             ◘ (Reason) Operational profile testing prioritizes and sets limits on functional
                testing.
          Specific techniques are required to provide adequate test coverage and handling
           of exceptional input conditions



Software
  Fault
Tolerance
Software Fault Tolerance

         The ability of software to avoid executing a fault in a way that
                            results in a system failure.


 ◈ Despite the best development efforts, almost all systems are
         deployed with defects with the potential to produce critical
         failures
           A major study of S/W defects showed that 1% of customer-reported failures
            within the 1st year produce system outages


 ◈ Fault tolerance increases the fault-resistant quality of a system
         during run-time by
           Detecting faults at the earliest possible point of execution
           Containing the damaging effects of a fault to the smallest possible scope
           Performing the most reliable recovery action possible

 ◈ Fault tolerant designs focus on handling “complex” failures
           Address defects that are not likely to be triggered during testing
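An illustrative Python sketch (not from the seminar; the work queue and retry policy
are assumptions) of the three steps above: detect a fault where it occurs, contain it
to a single work item, and recover without a system-level failure.

    import logging

    # Detect early, contain the damage to one item, and recover with a simple policy.
    def process_item(item):
        if item is None:                       # detect at the earliest possible point
            raise ValueError("null work item")
        return len(item)

    def run(queue, max_retries=2):
        for item in queue:
            for attempt in range(max_retries + 1):
                try:
                    process_item(item)
                    break                      # success: move on to the next item
                except ValueError as err:      # contain: only this item is affected
                    logging.warning("fault on %r (attempt %d): %s", item, attempt + 1, err)
            # recovery policy: after the retries, skip the item rather than crash the run

    run(["ok", None, "also ok"])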
So,
  What Is
  Reliable
Software ??
Reliable Software Characteristics Summary

 ◈ Operates within the reliability specification that satisfies customer
         expectations
          Measured in terms of failure rate and availability level
          The goal is rarely “defect free” or “ultra-high reliability”



 ◈ “Gracefully” handles erroneous inputs from users, other systems,
         and transient hardware faults
          Attempts to prevent state or output data corruption from “erroneous” inputs



 ◈ Quickly detects, reports and recovers from S/W and transient
         H/W faults
          S/W provides the system behavior of continuously monitoring, “self-diagnosing”
           and “self-healing”
          Prevents as many run-time faults as possible from becoming system-level
           failures



Questions?




Software Design For Reliability (DfR) Seminar



   A “Best Practices”
      Approach to
      Developing
   Reliable Software
                                       George de la Fuente
                                    georged@opsalacarte.com
                                        (408) 828-1105
                                      www.opsalacarte.com
Most Common Paths to Reliable Software

 1. Rely on H/W redundancy to mask out all S/W faults
          The most attractive and expensive approach
          Provides increased system-level reliability using an availability
           technique
          Requires minimal S/W reliability

 2. “Testing In” reliability
          The most prevalent approach
          Limited and inefficient approach to defect detection and removal
             ◘ System testing will leave at least 30% of the code untested
             ◘ System testing will detect at best ~55% of all run-time failures
          Most companies don’t continue testing until their reliability targets are
           reached
             ◘ The testing phase is usually fixed in duration before the S/W is
                developed and is focused on defect removal not reliability testing
          S/W engineers will spend more than 1/2 of their time in the test phase
           using this approach
S/W Design for Reliability

 3. S/W Design for Reliability
         The least utilized and understood approach

          Common methodologies
            1) Formal methods
            2) Programs based on H/W reliability practices
            3) S/W process control
            4) Augmenting traditional S/W development with
               “best practices”



Formal Methods

 ◈ Formal Methods (not commonly used for commercial SW)
          Methodologies for system behavior analysis and proof of correctness
             ◘ Utilize mathematical modeling of a system’s requirements and/or
                design

          Primarily used in the development of safety-critical systems that
           require very high degrees of:
             ◘ Confidence in expected system performance
             ◘ Quality audit information
             ◘ Targets of low or near zero failure rates
          Formal methods are not applicable to most S/W projects
             ◘ Cannot be used for all aspects of system design (e.g., user
               interface design)
             ◘ Do not scale to handle large and complex system development
             ◘ Mathematical requirements exceed the background of most S/W
               engineers




Using Hardware Reliability Practices

 ◈ S/W and H/W development practices are still fundamentally
         different
           The H/W lifecycle primarily focuses on architecture and design modeling
           S/W design modeling tools are rarely used
               ◘ Design-level simulation verification is limited
                   – Especially if a real-time operating system is required
               ◘ S/W engineers still challenge the value of generating complete designs
                      – This is why S/W design tools support 2-way code generation
           Inherent S/W faults stem from the design process
               ◘ There is no aspect of faults from manufacturing or wear-out

 ◈ S/W is not built as an assembly of preexisting components
           True S/W component “reuse” is rare
               ◘ Most “reused” S/W components are at least “slightly” modified
               ◘ Modified “reused” S/W components are not certified before use
           S/W components are not developed to a specified set of reliability characteristics
           3rd party S/W components do not come with reliability characteristics
Hardware Reliability Practices

 ◈       …. assembly of preexisting components (continued)
          Acceleration mechanisms do not exist for S/W reliability testing


          Extending S/W designs after product deployment is commonplace
              ◘ H/W is designed to provide a stable, long-term platform
              ◘ S/W is designed with the knowledge that it will host frequent product
                customizations and extensions
              ◘ S/W updates provide fast development turnaround and have little or no
                manufacturing or distribution costs


          H/W failure analysis techniques (FMEAs and FTAs) are rarely successfully applied
           to S/W designs
              ◘ S/W engineers find it difficult to adapt these techniques below the system
                level




Software Process Control Methodologies

 ◈ S/W process control assumes a correlation between the maturity
         of the development process and the latent defect density in the
         final S/W
                    CMM Level   Defects/KLOC         Estimated Reliability
                        5             0.5                        99.95%

                        4          1.0 - 2.5              99.75% - 99.9%

                        3          2.5 – 3.5             99.65% - 99.75%

                        2          3.5 – 6.0              99.4% - 99.65%

                        1          6.0 – 60.0                  94% - 99.4%



 ◈ Process audits and more strict controls are implemented if
         the current process level does not yield the desired S/W
         reliability
          Process root cause analysis may not yield great improvement
             ◘ Practices within the processes must be fine tuned (but how??)
          Reliability improvement under this type of methodology is slow
             ◘ Process outcome cannot vary too much in either direction

“Best Practices”
          for
Software Development
Sources of Industry Data

 Data was derived from a large-scale international survey of S/W
 lifecycle quality spanning:
          18 years (1984-2002)
          12,000+ projects
          600+ companies
               ◘ 30+ government/military organizations
          8 classes of software applications:
               1. Systems S/W
               2. Embedded S/W
               3. Military S/W
               4. Commercial S/W
               5. Outsourced S/W
               6. Information Technology (IT) S/W
               7. End-User developed personal S/W
               8. Web-based S/W


Terminology
 ◈ Best Practice
          A key S/W quality practice that significantly contributes towards increasing S/W
           reliability

 ◈ Best in Class Companies
          Companies that have the following two characteristics:
              ◘ Recognized for producing S/W-based products with the lowest failure rate
                in their industry
              ◘ Consistently deploying software based on their initial schedule targets

 ◈ Formal practice
          A S/W quality development practice that is well-understood and consistently
           implemented throughout the software development organization.
              ◘ Note: Formal practices are rarely undocumented.

 ◈ Informal practice
          A S/W quality development practice that is either implemented with varying
           degrees of rigor or in an inconsistent manner throughout the software
           development organization.
              ◘ Note: Informal practices are usually accompanied by the absence of
                 documented guidelines or standards.


“Best in Class” Company Best Practices

 ◈ S/W Life Cycle Practices
          Consistent implementations of the entire S/W lifecycle phases
           (requirements, design, code, unit test, system test and maintenance)

 ◈ Requirements
          Involve test engineers in requirements reviews
          Define quality and reliability targets
          Define negative requirements (i.e., “shall nots”)

 ◈ Development phase defect removal
          Formal inspections (requirements, design, and code)
          Failure analysis

 ◈ Design
          Team or group-oriented approach to design for the system and S/W
              ◘ NOTE: System design team includes other disciplines (e.g., H/W & Mech)




“Best in Class” Company Best Practices (continued)
 ◈ Testing
          Robust Testing strategy to meet business / customer requirements
          Test plans completed and reviewed before the coding phase
          Mandatory developer unit testing
          Independently verify/test every software change (enhancements and fixes)
          Create formal test plans for all medium and large-sized projects
          Staff an independent and dedicated SQA team to at least 5% of size of the S/W
           development team
          Generate quality or reliability estimates
          Incorporate automated test tools into the test cycle


 ◈ S/W Quality Assurance
          Review and prioritize all changes after the development phase
          Record and track all changes to S/W artifacts throughout the life cycle
          Formalize unit testing reviews (test plans and results)
          Implement active quality assurance programs
          Root-cause analysis with resolution follow-up
          Gather and review customer product feedback
“Best in Class” Company Best Practices (continued)

 ◈ SCM and Defect Tracking
          Implement formal change management of artifact changes and S/W releases
          Incorporate automated defect tracking tools


 ◈ Metrics and Measurements
          Record and track all defects and failures
          Collect field data for root cause analysis on next project or release iteration
          Measure code test coverage
          Generate metrics based on code attributes (e.g., size and complexity)
          Generate defect removal efficiency measurements
          Track “bad fixes”




Weaknesses in S/W Development Practices

 ◈ Lack of engineer “ownership” for development and test practices
           Limited efficiency and effectiveness improvements made
           May lead to disjoint practices, resulting in no real “common” practices

 ◈ System design is “H/W-centric”
           Primary focus on H/W feasibility, functionality and performance
           Architectural reviews are not collaborative, team design sessions
           S/W requirements of the H/W platform are generally not entertained or
            implemented


 ◈ S/W defect removal relies mostly on system or subsystem-level
         testing
           Development phase defect removal is limited to cursory code reviews and sparse
            unit testing
               ◘ Designs and design reviews are satisfied using functional or interface
                   specifications
           No causal analysis is performed to improve future defect removal


Weaknesses in S/W Development Practices

 ◈ Limited system and S/W quality measurements and metrics
          Use of default defect tracking tool statistics as primary metrics/measurements
          Generally no data mining capability available for analysis


 ◈ Informal SQA processes and staffing leads to wasted efforts and
         incomplete coverage
          Too many trivial defects still present during system test phase
          Defect fixes that introduce additional defects are frequent
          S/W is shipped with many untested sections
          Significant, recurring, “real world” customer scenarios remain untested


 ◈ Limited or no tool support for:
          Unit testing
          Automated regression testing
          S/W analysis (static, dynamic, and coverage)

Application Behavior Patterns

  S/W Quality Methods          System S/W                                     Embedded S/W

  Summary                      Overall, best S/W quality results              Wide range of S/W quality results
  Defect Removal Efficiency    Usually > 96%                                  Up to > 94%
  Project Sizes                Best quality results found in                  Most projects are < 26.5 KLOCs
                               projects with > 550 KLOCs
  Inspections                  Formal design and code inspections             Usually do not implement design or
                                                                              code inspections (and not formally)
  Test Teams                   Independent SQA team                           Usually do not have separate SQA teams
  Measurement Control          Formal S/W quality measurement                 Informal S/W quality measurement
                               process and tools                              processes and tools
  Change Control               Formal change control process and tools        Informal change control process and tools
  Test Plans                   Formal test plans                              Usually do not implement formal test plans
  Unit Testing                 Performed by developers                        Performed by developers
  Testing Stages               6 to 10 test stages                            3 to 6 test stages
                               (performed by SQA team)                        (usually performed by developers)
  Governing Processes          CMM/CMMI and Six-Sigma methods                 No consistent pattern found
Application Behavior Patterns

  S/W Quality Methods          System S/W                                     Commercial S/W

  Summary                      Overall, best S/W quality results              Wide range of S/W quality results
  Defect Removal Efficiency    Usually > 96%                                  Up to > 90%
  Project Sizes                Best quality results found in                  Most projects are > 275 KLOCs
                               projects with > 550 KLOCs
  Inspections                  Formal design and code inspections             Inconsistent use of formal design or
                                                                              code inspections
  Test Teams                   Independent SQA team                           Inconsistent use of independent SQA teams
  Measurement Control          Formal S/W quality measurement                 Informal S/W quality measurement
                               process and tools                              processes and tools
  Change Control               Formal change control process and tools        Formal change control process and tools
  Test Plans                   Formal test plans                              Formal test plans
  Unit Testing                 Performed by developers                        Performed by developers
  Testing Stages               6 to 10 test stages                            3 to 8 test stages
                               (performed by SQA team)                        (extensive reliance on Beta trials)
  Governing Processes          CMM/CMMI and Six-Sigma methods                 No consistent pattern found
Software
Defect Removal
  Techniques
Defect Origin and Discovery

           Typical Behavior
           • Defect Origin: Requirements, Design, Coding, Testing, Maintenance
           • Defect Discovery: concentrated in Testing and Maintenance (“Surprise!”),
             long after most of the defects were introduced

           Goal of Best Practices on Defect Discovery
           • Defect Origin: Requirements, Design, Coding, Testing, Maintenance
           • Defect Discovery: shifted earlier, into the same phases in which the
             defects originate
Software Defect Removal Techniques

                 Defect Removal Technique                         Efficiency Range

                Design inspections                                   45% to 60%

                Code inspections                                     45% to 60%

                Unit testing                                         15% to 45%

                Regression test                                      15% to 30%

                Integration test                                     25% to 40%

                Performance test                                     20% to 40%

                System testing                                       25% to 55%

                Acceptance test (1 customer)                         25% to 35%


   ◈ Development organizations try to find and remove more defects by
         implementing more stages of system testing
            Since there is a wide range of overlap between test stages, this approach
             becomes less efficient as it scales (see the sketch below)
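A back-of-the-envelope Python sketch of why stacking stages has diminishing returns.
It assumes each stage acts independently on the defects still remaining (optimistic,
given the overlap noted above) and uses mid-range values from the table; it is
illustrative only.

    # Cumulative removal efficiency if each stage independently removes a fraction
    # of the defects that are still present (an optimistic simplification).
    def cumulative_efficiency(stage_efficiencies):
        remaining = 1.0
        for e in stage_efficiencies:
            remaining *= (1.0 - e)             # fraction of defects still escaping
        return 1.0 - remaining

    print(cumulative_efficiency([0.30, 0.22, 0.32, 0.40]))   # unit+regression+integration+system: ~0.78
    print(cumulative_efficiency([0.52, 0.52, 0.30, 0.40]))   # design+code inspections, unit, system: ~0.90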

Defect Removal Technique Impact

   Combinations of defect removal practices (n = not used, Y = used) vs.
   median defect removal efficiency:

   Design Inspections/Reviews     n     n     n     n     Y     n     Y     Y
   Code Inspections/Reviews       n     n     n     Y     n     n     Y     Y
   Formal SQA Processes           n     Y     n     n     n     Y     n     Y
   Formal Testing                 n     n     Y     n     n     Y     n     Y
   Median Defect Efficiency      40%   45%   53%   57%   60%   65%   85%   99%

   This large potential available from design and code inspections/reviews is why most
   development organizations see greater improvements in S/W reliability from investments
   in the development phase than from further investments in testing.
   NOTE: Design review results are based on low-level design reviews.
Case Study: Quantifying the Software Quality Investment

 Objective:
     Develop a value-based approach to determine the necessary
      S/W quality investment using dependability attributes.

 Methodology:
      Use an integrated approach of project cost and quality
       estimation models (COCOMO II and COQUALMO) and
       empirically based business value relationship factoring to
       analyze the data from a diverse set of 161 well-measured
       S/W projects.

  Findings:
      The methodology was able to correlate the optimal S/W
        project quality investment level and strategy to the required
        project reliability level, based on defect impact.
      The objectives were satisfied without using specific S/W
        reliability practices, focusing instead heavily on defect
        detection during system testing.
Reliability, Development Cost & Test Time Tradeoffs




[Chart: reliability vs. development cost and test-time tradeoffs, based on COCOMO II
 (Constructive Cost Model)]
Reliability, Development Cost & Test Time Tradeoffs




                                   The relative cost per source instruction to achieve a
                                   Very High RELY rating is less than the amount of
                                   additional testing time that is required (54%), since
                                   early defect prevention reduces the required rework
                                   effort and allows for additional testing time.

                                   (Based on COCOMO II, the Constructive Cost Model)
Delivered Defects Scale




  ◈ A Very Low rating delivers roughly the same number of defects as are introduced
  ◈ An Extra High rating reduces the delivered defects by a factor of 37

  Note: The assumed nominal defect introduction rate is 60 defects/KSLOC based
          on the following distribution:
         10 requirements defects/KSLOC
         20 design defects/KSLOC
         30 coding defects/KSLOC
(v0.1)                                     Ops A La Carte ©   Based on COQUALMO (Constructive Quality Model)   25
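As a quick worked check of the note above, the short sketch below applies the nominal introduction rate to a hypothetical project; the 100 KSLOC size is an arbitrary assumption for illustration.

# Sketch: delivered defects at the two ends of the RELY scale, using the
# nominal introduction rate quoted above (60 defects/KSLOC).
# The 100 KSLOC project size is an arbitrary, illustrative assumption.

INTRODUCED_PER_KSLOC = 10 + 20 + 30        # requirements + design + coding = 60
project_ksloc = 100

introduced = INTRODUCED_PER_KSLOC * project_ksloc
very_low_delivered = introduced            # ~same number delivered as introduced
extra_high_delivered = introduced / 37     # reduced by a factor of 37

print(f"Introduced:              {introduced} defects")
print(f"Delivered (Very Low):   ~{very_low_delivered} defects")
print(f"Delivered (Extra High): ~{extra_high_delivered:.0f} defects (~{extra_high_delivered / project_ksloc:.1f}/KSLOC)")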
Defect Removal Factors Scale

   Very Low
      Automated Analysis:            Compiler-based simple syntax checking
      Peer Reviews:                  No peer reviews
      Execution Testing and Tools:   No testing

   Low
      Automated Analysis:            Basic compiler capabilities
      Peer Reviews:                  Ad-hoc informal walk-throughs
      Execution Testing and Tools:   Ad-hoc testing and debugging

   Nominal
      Automated Analysis:            Compiler extensions. Basic requirements and design consistency.
      Peer Reviews:                  Well-defined sequence for preparation, review, and minimal follow-up.
      Execution Testing and Tools:   Basic test, test data management and problem-tracking support.
                                     Test criteria based on checklist.

   High
      Automated Analysis:            Intermediate-level module and intermodule. Simple requirements
                                     and design.
      Peer Reviews:                  Formal review roles and well-trained participants using basic
                                     checklists and follow-up procedures.
      Execution Testing and Tools:   Well-defined test sequences tailored to the organization. Basic
                                     test coverage tools and test support system. Basic test process
                                     management.

   Very High
      Automated Analysis:            More elaborate requirements and design. Basic distributed-processing
                                     and temporal analysis, model checking, and symbolic execution.
      Peer Reviews:                  Basic review checklists and root cause analysis. Formal follow-up
                                     using historical data on inspection rates, preparation rates, and
                                     fault density.
      Execution Testing and Tools:   More advanced test tools and test data preparation, basic test
                                     oracle support and distributed monitoring and analysis, and
                                     assertion checking. Metrics-based test process management.

   Extra High
      Automated Analysis:            Formalized specification and verification. Advanced
                                     distributed-processing.
      Peer Reviews:                  Formal review of roles and procedures. Extensive review checklists
                                     and root cause analysis. Continuous review-process improvement.
                                     Statistical process control.
      Execution Testing and Tools:   Highly advanced tools for test oracles, distributed monitoring and
                                     analysis, and assertion checking. Integration of automated analysis
                                     and test tools. Model-based test process management.
(v0.1)                                        Ops A La Carte ©   Derived from COQUALMO (Constructive Quality Model)   26
DfR Based on “Best Practices”
         Modified Existing Best Practices                          New Best Practices
                                        S/W Life Cycle Practices
   Consistent implementation of all S/W life cycle        Reliability testing as part of the overall
     phases (requirements, design, code, unit test,         testing strategy
     system test and maintenance)                          Define reliability goals as requirements
                                    Metrics and Measurements
  Record and track all defects and failures            Generate defect removal efficiency
  Collect field data for root cause analysis on         measurements
    next project or release iteration                   Track fix rationale such as “bad” fixes or
                                                         untested code
                                                        Collect failure data for analysis during the
                                                         system test phase
                                   Development Phase Practices
  Reviews of design and code                          Assess designs for availability
  Targeted developer unit testing                     Perform failure analysis
                                                Testing
  Independently verify/test every S/W                 Generate reliability estimates
    change (enhancements and fixes)
                                                  SQA
  Perform failure root-cause analysis
  Record and track all changes to S/W
    artifacts throughout the life cycle
(v0.1)                                       Ops A La Carte ©                                            27
Summary: DfR Based on “Best Practices”

 ◈ The strength of a “best practices” approach is its intuitiveness
          Incorporate considerations of essential functionality and failure behavior in order
            to understand failure modes and improve availability
          Perform design analysis to identify potential failure points and, where possible,
            redesign to remove failure points or reduce their impact
          Analyze S/W for critical failure trigger points and remove them or reduce their
            impact and frequency where possible
          Plan testing to maximize the overall S/W verification prior to field deployment
          Let measured data drive changes to reliability practices

 ◈ Focus on the removal of critical failures instead of all defects
          S/W with known defects and faults may still be perceived as reliable by users
              ◘ NASA studies identified projects that produced reliable S/W with only 70% of
                 the code tested
          Removing X% of the defects in a system will not necessarily improve the
            reliability by X%.
              ◘ One IBM study showed that removing 60% of the product’s defects resulted
                 in only a 3% reliability improvement
              ◘ S/W defects in rarely executed sections of code may never be encountered
                 by users and therefore may not improve reliability
                   – Exceptions for essential operations: boot, shutdown, data backup, etc.
(v0.1)                                    Ops A La Carte ©                                       28
Questions?




(v0.1)     Ops A La Carte ©   29
Software Design For Reliability (DfR) Seminar



        Reliability
       Measurements
           and
         Metrics
Metrics Supporting Reliability Strategies

   Common strategies and tactics used by teams developing
   highly reliable software products
         Explicit, robust reliability requirements during requirements phase

         Appropriate use of fault tolerant techniques in product design

         Robust design/operational requirements for maximizing product
           Availability

         Focused, targeted (data driven) defect inspection program

         Robust testing strategy and program:
           well defined focused mix of unit, regression, integration, system,
           exploratory and reliability demonstration testing

         Robust defect tracking/metrics program focused on the important few
             ◘ Defect tracking and analysis for all phases of a product’s life,
               including post-shipment defects/failures (FRACAS)
(v0.1)                                 Ops A La Carte ©                          2
Reliability Measurements and Metrics
 ◈ Definitions
          Measurements – data collected for tracking or to calculate meta-data (metrics)
               ◘ Ex: defect counts by phase, defect insertion phase, defect detection phase
          Metrics – information derived from measurements (meta-data)
               ◘ Ex: failure rate, defect removal efficiency, defect density


 ◈ Reliability measurements and metrics accomplish several goals
            Provide estimates of S/W reliability prior to customer deployment
            Track reliability growth throughout the life cycle of a release
            Identify defect clusters based on code sections with frequent fixes
            Determine where to focus improvements based on analysis of defect/failure data


 Note: S/W Configuration Management (SCM) and defect tracking
       tools should be updated to facilitate the automatic tracking
       of this information
               ◘ Allow for data entry in all phases, including development
               ◘ Distinguish code base updates for critical defect repair vs. any other
                  changes, (e.g., enhancements, minor defect repairs, coding standards
                  updates, etc.)
(v0.1)                                      Ops A La Carte ©                                  3
Critical Measurements To Collect


         Measurement                                          Description
                                   Number of critical defects found during each non-operational
 Critical Defects by Phase
                                   phase (i.e. requirements, design, and coding)

                                   Number of critical failures found during each operational phase
 Critical Failures by Phase
                                   (i.e. unit testing, system testing, and field)

                                   The phase where the critical defect (or critical failure) was
 Critical Defect Insertion Phase
                                   inserted (or originated)
                                   The phase where the critical defect (or critical failure) was
 Critical Defect Detection Phase
                                   detected (or reported)

                                   A high-level indicator of a critical defect’s location within the
 Critical Defect Major Location
                                   source code (e.g., a S/W component or file name)

                                   A low-level indicator of a critical defect’s location within the
 Critical Defect Minor Location
                                   source code (e.g., the name of a class, method or data object)

                                   The elapsed time from the start of a test run to the occurrence of a
 Critical Failure Time
                                   critical failure (typically measured in CPU or wall-clock time)

 Critical Failure Root Cause       The relevant failure category for a specified critical failure




(v0.1)                                     Ops A La Carte ©                                             4
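One way to make these measurements collectible is to record them as structured fields in the defect tracker. The sketch below shows a minimal record layout in Python; the field and enum names are hypothetical, chosen only to mirror the rows of the table above.

# Minimal sketch of a critical defect/failure record capturing the
# measurements listed above. Field and enum names are illustrative,
# not a prescribed schema.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Phase(Enum):
    REQUIREMENTS = "requirements"
    DESIGN = "design"
    CODING = "coding"
    UNIT_TEST = "unit test"
    SYSTEM_TEST = "system test"
    FIELD = "field"

@dataclass
class CriticalDefectRecord:
    defect_id: str
    insertion_phase: Phase          # where the defect originated
    detection_phase: Phase          # where it was found/reported
    major_location: str             # e.g., component or file name
    minor_location: str             # e.g., class, method or data object
    failure_time_hours: Optional[float] = None   # elapsed test time, if it surfaced as a failure
    root_cause_category: Optional[str] = None    # failure category for a critical failure

# Example entry (hypothetical)
leak = CriticalDefectRecord(
    defect_id="CR-1042",
    insertion_phase=Phase.CODING,
    detection_phase=Phase.SYSTEM_TEST,
    major_location="buffer_pool.c",
    minor_location="release_buffer()",
    failure_time_hours=36.5,
    root_cause_category="resource leak",
)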
Metrics To Track


              Metric                                            Description

 Critical Defect Density                  The number of critical defects per KLOC (1,000 lines of source code)

 Critical Defect Removal Efficiency       The percentage of critical defects identified within a given life
 (CDRE)                                   cycle period

 Critical Failure Rate                    The mean number of failures occurring within a reference period

 Current Defect Demographics              Open defect demographics of the current code base, including defects
                                          by severity, module, fix backlog, etc.

 Failure/Defect Arrival Rates             Trends (e.g., defects vs. test time interval) of newly detected
                                          failures and/or defects

 Bad Fixes                                The number of failures caused as side effects of fixes to previously
                                          logged defects

 Unverified Code Fixes                    The number of failures caused by code that was neither reviewed
                                          nor tested

 Failure Root Cause Distribution          The distribution of failure categories used for Pareto root cause
                                          analysis



(v0.1)                                        Ops A La Carte ©                                             5
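Given records like the above, the metrics in this table reduce to simple aggregations. The sketch below computes a few of them from plain dictionaries; the record fields, sample data, code-base size and test-hour figures are all illustrative assumptions.

# Sketch: deriving a few of the metrics above from raw defect records.
# Record fields and sample data are illustrative, not a fixed schema.

records = [
    {"severity": 1, "detected_in": "system test", "is_bad_fix": False, "failure_hours": 12.0},
    {"severity": 2, "detected_in": "system test", "is_bad_fix": True,  "failure_hours": 30.0},
    {"severity": 1, "detected_in": "field",       "is_bad_fix": False, "failure_hours": None},
    {"severity": 3, "detected_in": "unit test",   "is_bad_fix": False, "failure_hours": None},
]
KSLOC = 250.0            # size of the code base, in thousands of source lines (assumed)
TEST_HOURS = 400.0       # total system-test execution time (assumed)

critical = [r for r in records if r["severity"] <= 2]          # treat Sev 1-2 as critical
in_house = [r for r in critical if r["detected_in"] != "field"]

critical_defect_density = len(critical) / KSLOC                 # critical defects per KSLOC
cdre_in_house = len(in_house) / len(critical)                   # CDRE for the in-house period
failures = [r for r in critical if r["failure_hours"] is not None]
critical_failure_rate = len(failures) / TEST_HOURS              # failures per test hour
bad_fix_count = sum(r["is_bad_fix"] for r in critical)

print(f"Critical defect density: {critical_defect_density:.3f} /KSLOC")
print(f"In-house CDRE:           {cdre_in_house:.0%}")
print(f"Critical failure rate:   {critical_failure_rate:.4f} failures/hour")
print(f"Bad fixes:               {bad_fix_count}")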
Software Defect Distributions

   Average distribution of all types of S/W defects by lifecycle phase:
          20%     Requirements         50% of all S/W defects are introduced before coding
          30%     Design
          35%     Coding
          10%     Bad Defect Fixes (introduction of secondary defects)          1 in 10 defects fixed
                                                                                 during testing were
          5%      Customer Documentation                                        unintended side effects
                                                                                 of a previous defect “fix”


   Average distribution of S/W defects escalated from the field:
         (based on 1st year field defect report data)
          1%      Severity 1 (catastrophic)       Only ~20% of the customer-reported S/W
          20%     Severity 2 (serious)            defects are target for reliability improvements

          35%     Severity 3 (minor)
          44%     Severity 4 (annoyance or cosmetic)




(v0.1)                                    Ops A La Carte ©                                               6
Typical Defect Tracking (System Test)


                       Severity #1   Severity #2         Severity #3   Severity #4
         System Test                                                                 Total Defects
                        Defects       Defects             Defects       Defects
            Build                                                                       Found
                         Found         Found               Found         Found

         SysBuild-1        7             9                       16        22             54


         SysBuild-2        5             5                       14        26             50


         SysBuild-3        4             6                       8         16             34

              •             •             •                      •          •              •
              •             •             •                      •          •              •
              •             •             •                      •          •              •

         SysBuild-7        0             1                       4         6              11




(v0.1)                                        Ops A La Carte ©                                       7
Defect Removal Efficiency
 ◈Critical defect removal efficiency (CDRE) is a key reliability measure
                   Critical Defects Found
         CDRE =
                  Critical Defects Present

 ◈“Critical Defects Present” is the sum of the critical defects found in all
     phases as a result of reviews, testing and customer/field escalations
          System testing stages include integration, functional, loading, performance,
            acceptance, etc.
          Customer trials can be considered either a system testing stage, a preliminary,
            but separate, field deployment phase or a part of the field deployment phase
             ◘ Depending on the rationale for the trials
           The field deployment phase is measured as the first year following deployment
              ◘ This approximates the average life span of a S/W release, since most releases
                   are separated by no more than 1 year

                                                                     System & Subsystem      Field
         Requirements      Design     Coding      Unit Testing
                                                                        Testing Stages    Deployment

                                                                                              Field
                Review Efficiency                              Testing Efficiency          Efficiency
                                                                         System Testing       Field
                        Development Efficiency
                                                                           Efficiency      Efficiency

                                                                                              Field
                                         Internal Efficiency
                                                                                           Efficiency
(v0.1)                                            Ops A La Carte ©                                      8
CDRE Example
                          Critical
          Origin          Defects         Metrics #1
                           Found
 Requirements Reviews       20
 Design Reviews             30
 Code Reviews               40
                                               170
 Unit Testing               25

 System & Subsystem
                            55
 Testing

 Field Deployment           40                  40

          TOTAL             210                210


                                                                        System & Subsystem      Field
          Requirements   Design        Coding        Unit Testing
                                                                           Testing Stages    Deployment

                                                                                                 Field
                                         Internal Efficiency
                                                                                              Efficiency



                                      Metric                      Removal Efficiency

                              Internal Efficiency                   81% = (170 / 210)

                                  Field Efficiency                  19% = (40 / 210)




(v0.1)                                               Ops A La Carte ©                                      9
CDRE Example
                             Critical
          Origin             Defects         Metrics #2
                              Found
 Requirements Reviews           20
 Design Reviews                 30
                                                  115
 Code Reviews                   40

 Unit Testing                   25

 System & Subsystem
                                55                 55
 Testing

 Field Deployment               40                 40

          TOTAL                210                210


                                                                           System & Subsystem      Field
          Requirements     Design         Coding        Unit Testing
                                                                              Testing Stages    Deployment

                                                                             System Testing         Field
                         Development Efficiency
                                                                               Efficiency        Efficiency



                                         Metric                      Removal Efficiency

                               Development Efficiency                  55% = (115 / 210)

                              System Testing Efficiency                26% = (55 / 210)

                                     Field Efficiency                  19% = (40 / 210)



(v0.1)                                                  Ops A La Carte ©                                      10
CDRE Example
                               Critical
          Origin               Defects         Metrics #3
                                Found
 Requirements Reviews             20
 Design Reviews                   30                 90
 Code Reviews                     40

 Unit Testing                     25
                                                     80
 System & Subsystem
                                  55
 Testing

 Field Deployment                 40                 40

          TOTAL                  210                210


                                                                             System & Subsystem      Field
          Requirements       Design         Coding        Unit Testing
                                                                                Testing Stages    Deployment

                                                                                                      Field
                    Review Efficiency                               Testing Efficiency             Efficiency



                                           Metric                      Removal Efficiency

                                   Review Efficiency                     43% = (90 / 210)

                                   Testing Efficiency                    38% = (80 / 210)

                                       Field Efficiency                  19% = (40 / 210)



(v0.1)                                                    Ops A La Carte ©                                      11
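The three example groupings above differ only in how the same phase counts are partitioned. A minimal sketch of that arithmetic, reusing the example's numbers (the helper function is illustrative):

# Sketch: the CDRE groupings from the three examples above.
# The phase counts are the example's; the grouping logic is generic.

found_by_phase = {
    "requirements reviews": 20,
    "design reviews": 30,
    "code reviews": 40,
    "unit testing": 25,
    "system & subsystem testing": 55,
    "field deployment": 40,
}
total = sum(found_by_phase.values())        # "critical defects present" = 210

def cdre(phases):
    """Removal efficiency for the given set of phases."""
    return sum(found_by_phase[p] for p in phases) / total

review_phases = ["requirements reviews", "design reviews", "code reviews"]
testing_phases = ["unit testing", "system & subsystem testing"]

print(f"Internal efficiency:       {cdre(review_phases + testing_phases):.0%}")   # 81%
print(f"Development efficiency:    {cdre(review_phases + ['unit testing']):.0%}") # 55%
print(f"System testing efficiency: {cdre(['system & subsystem testing']):.0%}")   # 26%
print(f"Review efficiency:         {cdre(review_phases):.0%}")                    # 43%
print(f"Testing efficiency:        {cdre(testing_phases):.0%}")                   # 38%
print(f"Field efficiency:          {cdre(['field deployment']):.0%}")             # 19%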
Sample Project Reliability Measurement Tracking

     At the end of the project Unit Testing Phase

                            Defects Found                          Critical Defects/Failures Found

            Phase                                Reqmts     Design       Code      Unit Test    Test        Field
                             Total    Critical
                                                 Critical   Critical    Critical    Critical   Critical   Failures
                            Defects   Defects
                                                 Defects    Defects     Defects    Failures    Failures   Reported

     Requirements             75        12         12

     Design                  123        45          6         39

     Code                    158        62          4         12          46

     Unit Test                78        25          1          5          17          2

     Development Totals      434       144         23         56          63          2


     Integration Test

     System Test

     Testing Totals


     Field Reports Totals




(v0.1)                                             Ops A La Carte ©                                                  12
Sample Project Reliability Measurement Tracking

     1 year after the end of the System Testing phase

                            Defects Found                          Critical Defects/Failures Found

            Phase                                Reqmts     Design       Code      Unit Test    Test        Field
                             Total    Critical
                                                 Critical   Critical    Critical    Critical   Critical   Failures
                            Defects   Defects
                                                 Defects    Defects     Defects    Failures    Failures   Reported

     Requirements             75        12         12

     Design                  123        45          6         39

     Code                    158        62          4         12          46

     Unit Test                78        25          1          5          17          2

     Development Totals      434       144         23         56          63          2


     Integration Test         43        13          0          4           7          1           1
     System Test             183        47          2         13          28          0           4
     Testing Totals          226        60          2         17          35          1           5

     Field Reports Totals     70        35          1          8          22          0           3          1


     Release Summary         720       239         26         81         120          3           8          1


(v0.1)                                             Ops A La Carte ©                                                  13
Sample Project Reliability Measurement Tracking

     1 year after the end of the System Testing phase

                            Defects Found                          Critical Defects/Failures Found

            Phase                                Reqmts     Design       Code        Unit Test    Test        Field
                             Total    Critical
                                                 Critical   Critical    Critical      Critical   Critical   Failures
                            Defects   Defects
                                                 Defects    Defects     Defects      Failures    Failures   Reported

     Requirements             75        12         12

     Design                  123        45          6         39

     Code                    158        62          4         12          46

     Unit Test                78        25          1          5          17          2

     Development Totals      434       144         23         56          63          2


     Integration Test         45        13          0          4           7          1           1
     System Test             189        47          2         13          28          0           4
     Testing Totals          234        60          2         17          35          1           5


     Field Reports Totals     77        35          1          8          22          0           3          1

     Design DRE Measurements
     • 39 critical defects found in-phase
     • 56 critical defects found during development
     • 17 critical defects found during testing
     • 81 critical defects found overall

     Design DRE Metrics
     • 48% in-phase DRE (= 39/81)
     • 69% development DRE (= 56/81)
     • 90% in-house DRE (= (56 + 17)/81)


     Release Summary         745       239         26         81            120         3           8           1


(v0.1)                                             Ops A La Carte ©                                                    14
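The Design DRE callouts above are read straight off the Design column of the insertion-phase/detection-phase matrix. A small sketch of the calculation, using the sample project's Design-column counts (the helper function and phase names are illustrative):

# Sketch: per-origin DRE from an insertion-phase x detection-phase matrix.
# Counts are the sample project's Design column; the helper is illustrative.

DETECTION_PHASES = ["requirements", "design", "code", "unit test", "testing", "field"]

# Critical defects that originated in the Design phase, by detection phase
# (integration and system testing combined as "testing": 4 + 13 = 17).
design_detected = {"design": 39, "code": 12, "unit test": 5, "testing": 17, "field": 8}

def dre(detected_by_phase, through_phase):
    """Fraction of this origin's defects removed by the end of `through_phase`."""
    total = sum(detected_by_phase.values())
    cutoff = DETECTION_PHASES.index(through_phase)
    removed = sum(n for phase, n in detected_by_phase.items()
                  if DETECTION_PHASES.index(phase) <= cutoff)
    return removed / total

print(f"In-phase DRE:    {dre(design_detected, 'design'):.0%}")     # 39/81 = 48%
print(f"Development DRE: {dre(design_detected, 'unit test'):.0%}")  # 56/81 = 69%
print(f"In-house DRE:    {dre(design_detected, 'testing'):.0%}")    # 73/81 = 90%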
Sample Project Reliability Measurement Tracking

     1 year after the end of the System Testing phase

                              Defects Found                     Critical Defects/Failures Found

             Phase             Total     Critical    Reqmts     Design      Code     Unit Test     Test       Field
                              Defects    Defects    Critical   Critical   Critical    Critical   Critical   Failures
                                                    Defects    Defects    Defects    Failures    Failures   Reported

     Requirements               75         12          12

     Design                    123         45           6         39

     Code                      158         62           4         12         46

     Unit Test                  78         25           1          5         17           2

     Development Totals        434        144          23         56         63           2


     Integration Test           45         13           0          4          7           1          1
     System Test               189         47           2         13         28           0          4
     Testing Totals            234         60           2         17         35           1          5


     Field Reports Totals       77         35           1          8         22           0          3          1

     Bad Fixes Measurements
     • 1 critical defect inserted and found during field deployment
     • 5 critical defects inserted and found during system-level testing
     • 3 critical defects inserted during system-level testing and found during field deployment
     • 1 critical defect inserted during unit testing and found during system-level testing
     • 0 critical defects inserted during unit testing and found during field deployment
     • 60 total critical defects found during system-level testing
     • 35 total critical defects found in the field

     Bad Fixes Metrics
     • 10% of the test phase failures are from Bad Fixes (= (1 + 5)/60)
     • 11% of the field phase failures are from Bad Fixes (= (0 + 3 + 1)/35)
     • 11% of the test and field failures are from Bad Fixes (= (1 + 5 + 0 + 3 + 1)/(60 + 35))


     Release Summary         745       239           26         81        120          3           8         1


(v0.1)                                               Ops A La Carte ©                                                15
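The bad-fix percentages follow the same pattern: critical failures whose defect was inserted during unit testing or later are counted as fix side effects. A compact sketch using the callout's counts:

# Sketch: bad-fix metrics from the callout above. "Bad fixes" are critical
# failures whose defect was inserted during unit testing or later.

test_failures_total = 60               # critical failures found during system-level testing
field_failures_total = 35              # critical failures found in the field
bad_fixes_found_in_test = 1 + 5        # inserted in unit test / system test, found in test
bad_fixes_found_in_field = 0 + 3 + 1   # inserted in unit test / system test / field, found in field

print(f"Test-phase bad fixes:  {bad_fixes_found_in_test / test_failures_total:.0%}")    # 10%
print(f"Field-phase bad fixes: {bad_fixes_found_in_field / field_failures_total:.0%}")  # 11%
print(f"Combined:              {(bad_fixes_found_in_test + bad_fixes_found_in_field) / (test_failures_total + field_failures_total):.0%}")  # 11%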
Sample Project Reliability Metrics

                                       Metrics
           Phase              In-phase      Overall
                        CDD                                Bad Fixes
                                DRE          DRE
         Requirements   29%     46%

         Design         51%     48%
                                          58%
         Code           56%     38%                  84%
         Unit Test      18%     17%


         Testing                          26%              10%
                                                                 11%
         Field                            16%        16%   11%




(v0.1)                            Ops A La Carte ©                     16
Sample Project Reliability Metrics

                                            Metrics
           Phase                   In-phase      Overall
                          CDD                                   Bad Fixes
                                     DRE          DRE
         Requirements     29%        46%

         Design           51%        48%
                                               58%
         Code             56%        38%                  84%
         Unit Test        18%        17%


         Testing                               26%              10%
                                                                      11%
         Field                                 16%        16%   11%


         Sample Phase 1 Goals (50% improvement):
         • (reliability) increase in-house DRE to 92%
         • (efficiency) reduce field bad fixes to 5%




(v0.1)                                 Ops A La Carte ©                     17
Distribution of Defects Across Files

 ◈ There is a Pareto-like distribution of defects across files within a
         module
          Defects cluster in a relatively small number of files
          Conversely, more than half of the files have almost no critical defects




(v0.1)                                    Ops A La Carte ©                           18
Failure Density Analysis

 ◈ In general, there is a Pareto distribution (80/20) of defects across files

 ◈ Failure density analysis provides a mechanism for early detection
         of “problematic” sections of code (i.e., defect clusters)
           Improving the reliability of these “problematic modules” can consume as much
            as 4 times the effort of a redesign
                 ◘ Less time is required to re-inspect and either restructure or redesign these
                   modules than to effectively “beat them into submission” through testing
           The goal is to identify “problematic” code as early as possible
                 ◘ Perform failure density analysis early on during unit testing
                      – Since different sections of code may be problematic during system
                        testing, the analysis should be repeated near the middle of this phase
                 ◘ SCM and defect tracking tools can be modified to provide this information
                   without much effort
                      – Source code can be analyzed to display “hot spot” histograms using
                        number of changes and/or number of failures
                      – Heuristics for failure density thresholds must be developed to
                        determine when action should be taken

(v0.1)                                       Ops A La Carte ©                                     19
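A failure density "hot spot" report of the kind described above can be generated directly from defect-tracker data. The sketch below is a minimal illustration; the input pairs, file names and the 2x-median flagging heuristic are assumptions, not a prescribed method.

# Sketch: flag "hot spot" files by failure density. Input format, file
# names and the 2x-median threshold heuristic are illustrative assumptions.
from collections import Counter
from statistics import median

# (file, reported_failures) pairs as they might come from a defect tracker
reported = [
    ("buffer_pool.c", 17), ("scheduler.c", 11), ("parser.c", 9),
    ("config.c", 2), ("logging.c", 1), ("cli.c", 1), ("utils.c", 0),
]

failures = Counter(dict(reported))
total = sum(failures.values())
threshold = 2 * median(failures.values())   # heuristic: flag files well above the median

print(f"{'file':<15}{'failures':>10}{'% of total':>12}")
for name, count in failures.most_common():
    flag = "  <-- candidate for re-inspection/redesign" if count > threshold else ""
    print(f"{name:<15}{count:>10}{count / total:>11.0%}{flag}")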
Failure Density Distribution by File



                [Histogram: % of Total Reported Failures by Source Code File (F1 … F23)]

                The files at the head of the distribution contain the majority of the reported failures
                and should be proactively analyzed for possible redesign, restructuring or additional defects
(v0.1)                                        Ops A La Carte ©                                     20
Applying Causal Analysis to Defect Measurements

 ◈ Causal analysis (RCA) can be applied to defect
         measurements to improve defect removal effectiveness
         and efficiency
           Usually performed between life cycle iterations or S/W releases
           Upstream defect removal practices should be reviewed in light of the
            defects that were not detected in each phase
              ◘ Requires knowing the phases where a defect was introduced and
                detected
              ◘ Simple guidelines should be defined to determine the phase
                where the defect was introduced


           Defects are categorized by types
              ◘ Can determine if the problem is systemic to one or more
                categories, or
              ◘ Whether the problem is an issue of raising overall defect removal
                efficiency for a development phase

(v0.1)                                Ops A La Carte ©                              21
Causal Analysis Process
                                    (one shot analysis)

Objective: Create an initial, rough distribution of the defects and identify
           potential improvements to existing defect removal practices.

Process Outline (typically 6-8 hours over two days):
◈ Select a team from the senior engineers and MSEs of the development and test
         teams
◈ Analysis Process:
           Before the meeting, select a representative sample of approximately 50-75
            defects from each development and test phase
           Convene meeting and explain the objectives and process to the team
           Classify the defects
               ◘   Start by walking the team through the classification of one defect
               ◘   Divide the defects into small groups and assign each person 2 groups of
                   defects (to force analysis overlap)
               ◘   Upon completion, collect and process the data offline in preparation for
                   team analysis and review
           Analyze the defect types using a histogram to look for a Pareto distribution and
            select the most prevalent defect types
           Develop recommendations for improvements to specific defect removal practices
           Implement the recommendations at the next possible opportunity and gather
            measurement data
(v0.1)                                      Ops A La Carte ©                                   22
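The histogram/Pareto step is easy to automate once defects carry a type classification. A minimal sketch (the type labels, sample data and the 80% cutoff are illustrative assumptions):

# Sketch: Pareto analysis of classified defect types to pick the most
# prevalent categories. Type labels and the 80% cutoff are illustrative.
from collections import Counter

classified_defects = [
    "logic", "logic", "data handling", "interface", "logic", "error checking",
    "data handling", "logic", "interface", "logic", "computation", "logic",
]

counts = Counter(classified_defects)
total = sum(counts.values())

cumulative = 0
print("type            count  cum%")
for defect_type, count in counts.most_common():
    cumulative += count
    marker = "  <-- focus improvement here" if (cumulative - count) / total < 0.80 else ""
    print(f"{defect_type:<15}{count:>6}{cumulative / total:>6.0%}{marker}")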
The Orthogonal Defect Analysis Framework

          ORIGIN (Where?):   SPEC/RQMTS  |  DESIGN  |  CODE  |  ENV. SUPT.  |  DOCUMENTATION  |  OTHER

          TYPE (What?), by origin:
             Spec/Rqmts:     Requirements or Specifications; Functionality
             Design:         HW Interface; SW Interface; User Interface; Functional Description;
                             Proc. (Interproc.) Communications; Data Definition; Module Design;
                             Logic Description; Error Checking; Standards
             Code:           Logic; Computation; Data Handling; Module Interface/Implementation; Standards
             Env. Supt.:     Test SW; Test HW; Development Tools; Integration SW
             Other:          Other can also be a type classification for any origin

          MODE (Why?):       Missing  |  Unclear  |  Wrong  |  Changed  |  Better Way
(v0.1)                                         Ops A La Carte ©                                                   23
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar
Ops A La Carte Software Design for Reliability (SDfR) Seminar

More Related Content

Featured

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Featured (20)

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 

Ops A La Carte Software Design for Reliability (SDfR) Seminar

  • 1. & We Provide You Confidence in Your Product ReliabilityTM Ops A La Carte / (408) 654-0499 / askops@opsalacarte.com / www.opsalacarte.com
  • 2. Software Design for Reliability (DfR) ½-day Seminar Ops A La Carte LLC // www.opsalacarte.com
  • 3. The following presentation materials are copyright protected property of Ops A La Carte LLC. Distribution of these materials is limited to your company staff only. These materials may not be distributed outside of your company or used for any purpose other than training.
  • 4. Software DfR ½-Day Seminar Agenda Agenda ◈ Introductions and Agenda Review ◈ Software Reliability Basic Concepts ◈ A “Best Practices” Approach to Developing Reliable Software ◈ Reliability Measurements and Metrics ◈ Wrap-up (v0.1) Ops A La Carte © 3
  • 6. Presenter’s Biographical Sketch – Bob Mueller ◈ Bob Mueller a senior consultant/program manager with OPS A La Carte and the Marisan Group. He is a product development professional with 30+ years of technical and management experience in software intensive product development, R/D process and quality systems development including extensive consulting experience with cross-functional product development teams and senior management. ◈ After receiving his M.S. in Physics in 1973, Bob joined Hewlett-Packard in Cupertino, CA in IC process development. In the next three decades before leaving hp in 2002, he held numerous positions in R/D, R/D management and consulting including:  IC process development, process engineering and IC production management.  Lead developer of an automated IC in-process/test monitor, analysis and control system (hp internal).  R/D project management for sw intensive products (including process analysis and control, work cell control & quality control systems).  Numerous R/D management positions in computer, analytical and healthcare businesses including FDA regulated systems with ISO 9001 certified organizations.  Numerous program management positions focused on internal/external process improvement and consulting.  Practice area manager and consultant for PG -- Engineering consulting team (internal hp) ◈ Bob’s current consulting interests include: Warranty process and quality system improvement, SW Reliability, agile SW product development methodologies and R/D product strategy and technology roadmap development. ◈ Bob has taught many internal hp classes and at local junior colleges. (v0.1) Ops A La Carte © 5
  • 7. Software Reliability Integration Services for the Product Reliability Integration in the Concept Phase Reliability Integration in the Implementation Phase Software Reliability Goal Setting Facilitation of Code Reliability Reviews Software Reliability Program and Integration Plan Software Robustness and Coverage Testing Techniques Reliability Integration in the Design Phase Facilitation of Team Design Template Reviews Reliability Integration in the Testing Phase Facilitation of Team Design Reviews Software Reliability Measurements and Metrics Software Failure Analysis Usage Profile-based Testing Software Fault Tolerance Software Reliability Estimation Techniques Software Reliability Demonstration Tests
  • 8. Software Design For Reliability (DfR) Software Reliability Basic Concepts George de la Fuente georged@opsalacarte.com (408) 828-1105 www.opsalacarte.com
  • 9. Software Quality vs. Software Reliability
  • 10. Software Quality vs. Reliability FACTORS CRITERIA suitability Functionality accuracy interoperability security understandability Usability learnability operability attractiveness Software Software Quality maturity Quality Reliability fault tolerance The level to which the *ISO9126 Quality Model recoverability software characteristics conform to all the time behavior specifications. Efficiency resource utilization analysability changeability Portability stability testability adaptability installability Maintainability co-existence replaceability (v0.1) 3 Ops A La Carte ©
  • 12. Software Reliability Definitions “The probability of failure free software operation for a specified period of time in a specified environment” ANSI/IEEE STD-729-1991 ◈Examine the key points ◈Practical rewording of the definition Software reliability is a measure of the software failures that are visible to a customer and that prevents a system from delivering essential functionality for a specified period of time. (v0.1) Ops A La Carte © 5
  • 13. Software Reliability Can Be Measured ◈ Measurements (quantitative) are a required foundation  Differs from quality which is not defined by measurements  All measurements and metrics are based on run-time failures ◈ Only customer-visible failures are targeted  Only defects that produce customer-visible failures affect reliability  Corollaries ◘ Defects that do not trigger run-time failures do NOT affect reliability – badly formatted or commented code – defects in dead code ◘ Not all defects that are triggered at run-time produce customer- visible failures – corruption of any unused region of memory ◈ S/W Reliability evolved from H/W Reliability  Primary distinction: S/W Reliability focuses only on design reliability (v0.1) Ops A La Carte © 6
  • 14. Software Reliability Is Based On Usage ◈ S/W failure characteristics are derived from the usage profile of a particular customer (or set of customers)  Each usage profile triggers a different set of run-time S/W faults and failures ◈ Example of reliability perspective from 3 users of the same S/W  Customer A ◘ Usage Profile – Exercises sections of S/W that produce very few failures. ◘ Assessment – S/W reliability is high.  Customer B ◘ Usage Profile – Overlaps with Customer A’s usage profile. However, Customer B also exercises other sections of S/W that produce many, frequent failures ◘ Assessment – S/W reliability is low.  Customer C ◘ Usage Profile – Similar to Customer B’s usage profile. However, Customer C has implemented workarounds to mitigate most of the S/W failures that were encountered. The final result is that the S/W executes with few failures but requires additional off-nominal steps. ◘ Assessment – S/W quality is low since many workarounds are required. However, for the final configuration that includes these workarounds, S/W reliability is acceptable. (v0.1) Ops A La Carte © 7
  • 15. Reliability ≠ Correctness or Completeness ◈ Correctness is a measure with which the requirements model the intended customer base or industry functionality  Correctness is validated by reviewing product requirements and functional specifications with key customers ◈ Completeness is a measure of the degree of intended functionality that is modeled by the S/W design  Completeness is validated by performing requirements traceability at the design phase and design traceability at the coding phase ◈ Reliability is a measure of the behavior (i.e., failures) that prevents the S/W from delivering the designed functionality  If the resulting S/W does not meet customer or market expectations, yet operates with very few failures based on its requirements and design, the S/W is still considered reliable (v0.1) Ops A La Carte © 8
  • 17. Software Defects That Affect Reliability Sources Documentation Development Validation • User Manual • Requirements ••• • Unit Test Plans/Cases • Installation Guide • System Architecture • System-Level Test Plans/Cases • Technical Specs • Designs • Design Review Scenarios or Checklists • Source Code • Code Review Scenarios or Checklists • S/W Failure Analysis Categories Categories Soft Maintenance Run-time Impacts Run-time Impacts Failures • Commenting • System outage • System outage • System outage • Style ••• • Loss of functionality • Loss of functionality • Consistency • Annoyance • Loss of critical functionality • Standards/Guidelines • Cosmetic • “Dead Code” (v0.1) Ops A La Carte © 10
  • 18. Terminology - Defect ◈ A flaw in S/W requirements, design or source code that produces unintended or incomplete run-time behavior Defect  Defects of commission ◘ Incorrect requirements are specified ◘ Requirements are incorrectly translated into a design model ◘ The design is incorrectly translated into source code ◘ The source code logic is flawed  Defects of omission There are amongst the most difficult class of defects to detect ◘ Not all requirements were used in creating a design model ◘ The source code did not implement all the design ◘ The source code has missing or incomplete logic ◈ Defects are static and can be detected and removed without executing the source code ◈ Defects that cannot trigger S/W failures are not counted for reliability purposes  These are typically quality defects that affect other aspects of S/W quality such as soft maintenance defects and defects in test cases or documentation (v0.1) Ops A La Carte © 11
• 19. Terminology - Fault ◈ The result of triggering a S/W defect by executing the associated source code  Faults are NOT customer-visible ◘ Example: a memory leak, or a packet corruption that requires retransmission by the higher-layer stack  A fault may be the transitional state that results in a failure ◘ Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states (v0.1) Ops A La Carte © 12
• 20. Terminology - Failure ◈ A customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed S/W behavior  Failures are the visible, run-time symptoms of faults ◘ Failures MUST be observable by the customer or another operational system  Not all failures result in system outages (v0.1) Ops A La Carte © 13
• 21. Defect-to-Failure Transition ◈ Example  A S/W function (or method) processes the data stored in a memory buffer and then frees the allocated memory buffer back to the memory pool  A defect within this function (or method), when triggered, will fail to free the memory buffer before completion (Flowchart: an entry point, several logic branch points and 4 possible exit points, with the defect on 1 of many logic paths) (v0.1) Ops A La Carte © 14
  • 22. Defect-to-Failure Transition (continued) ◈ Most of the possible logic paths do not trigger the defect  If these are the only logic paths traversed by a customer, this portion of the S/W will be considered very reliable (v0.1) Ops A La Carte © 15
  • 23. Defect-to-Failure Transition (continued) ◈ Fault transition  Eventually a logic path is executed that triggers the defect, resulting in a fault being generated ◘ The function (or method) completes its execution ◘ The fault causes the system to lose track of a single memory buffer ◘ The system continues to operate without a visible impact  Since the fault causes no visible impact, a failure does NOT occur (v0.1) Ops A La Carte © 16
• 24. Defect-to-Failure Transition (continued) ◈ Failure scenario  After sufficient memory buffers have been lost, the buffer pool reaches a critical condition where either: ◘ No buffers are available to satisfy another allocation request (there are still some buffers in use), or ◘ All buffers have been lost through leakage (no buffers will ever be freed for future allocation requests)  Once the next buffer allocation is requested, a failure occurs ◘ The system cannot continue to operate normally  Note the time lag between the triggering of the last fault and the occurrence of the associated failure (Timeline: faults are triggered at t1, t2, …, tN; the failure occurs later at tF) (v0.1) Ops A La Carte © 17
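A minimal Python sketch of this defect-to-failure transition (illustrative only, not from the seminar materials; BufferPool and process_message are hypothetical names): a rarely taken logic path leaks one buffer each time it is traversed, each leak is a silent fault, and the visible failure only surfaces when a later allocation request cannot be satisfied.

    # Sketch only: simulates the leak scenario described on the slide above.
    class BufferPool:
        def __init__(self, size):
            self.free = size                     # buffers currently available

        def alloc(self):
            if self.free == 0:                   # the visible, customer-observable failure
                raise RuntimeError("failure: no buffers available for allocation")
            self.free -= 1

        def release(self):
            self.free += 1

    def process_message(pool, msg):
        pool.alloc()
        defect_path = msg.endswith("!")          # rarely taken branch containing the defect
        if not defect_path:
            pool.release()                       # normal paths free the buffer correctly
        # defective path returns without releasing -> silent fault (leaked buffer)

    pool = BufferPool(size=3)
    for msg in ["ok", "ok", "t1!", "ok", "t2!", "ok", "tN!", "ok"]:
        process_message(pool, msg)               # faults at t1, t2, tN; failure on the last "ok"

Running the loop shows the lag the slide describes: the three leaks pass unnoticed, and the RuntimeError (the failure) is raised only on the final, otherwise healthy request.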
• 25. Summary of Defects and Failures ◈ There are 3 types of run-time defects 1. Defects that are never executed (so they don’t trigger faults) 2. Defects that are executed and trigger faults that do NOT result in failures 3. Defects that are executed and trigger faults that result in failures ◈ Practical S/W Reliability focuses on defects that have the potential to cause failures by: 1. Detecting and removing defects that result in failures during development 2. Designing and implementing fault tolerance techniques to ◘ prevent faults from producing failures, or ◘ mitigate the effects of the resulting failures (v0.1) Ops A La Carte © 18
  • 26. Failure Distributions, Failure Rates and MTTF
  • 27. Reliability and Failure Distributions Restated, reliability is the probability that a system does not experience a failure during a time interval, [0,T]. ◈ Reliability is a measure of statistical probability, not certainty  Ex: A system has a 99% reliability over a period of 100 days ◘ Does this imply that only 1 failure will occur during the 100 day period? ◈ Reliability is based on failure distribution models  Represent the time distribution of failure occurrences  Various failure distribution models exist: ◘ Exponential (most commonly used in S/W reliability) ◘ Weibull ◘ Poisson ◘ Normal ◘ Rayleigh ◘ etc…. ◈ Let’s examine an exponential failure distribution model (v0.1) Ops A La Carte © 20
• 28. Failure Distributions - Exponential ◈ Exponential Reliability Function  The most widely used failure distribution is the exponential reliability function ◘ Models a random distribution of failure occurrences  Defined by: R(t) = e^(-λt), where ◘ t is mission time – the system is assumed to be operational at t=0 – the mission duration is represented by T ◘ λ is a constant, instantaneous failure rate (or failure intensity) ◘ MTTF = 1 / λ (for repairable systems) (Plot: R(t) for λ = 0.1 failures/hr.) (v0.1) Ops A La Carte © 21
• 29. A Closer Look At The Exponential Distribution ◈ Example (plot of R(t) vs. time in hours): ◘ Mission duration: T = 100 hours ◘ Failure rate: λ = 0.1 failures/hr. (or 1 failure every 10 hrs.) ◘ MTTF = 10 hrs. ◈ At t = 1 hr., the reliability is 90% ◈ When t = MTTF, the reliability is always 37%: R = e^(-λt) = e^(-(1/MTTF)·MTTF) = e^(-1) = 37% (v0.1) Ops A La Carte © 22
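A short Python sketch (not part of the original slides) that reproduces the two values quoted above for λ = 0.1 failures/hr.:

    import math

    def reliability(t_hours, failure_rate):
        """Exponential reliability function R(t) = exp(-lambda * t)."""
        return math.exp(-failure_rate * t_hours)

    lam = 0.1                            # failures/hr -> MTTF = 1/lam = 10 hrs
    print(reliability(1.0, lam))         # ~0.905 -> ~90% reliability at t = 1 hr
    print(reliability(1.0 / lam, lam))   # ~0.368 -> ~37% at t = MTTF, for any lambda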
• 30. A Closer Look at Reliability Values ◈ Based on an exponential failure distribution, what does it mean for S/W to have 99% reliability after one year of operation?  For a single S/W product: ◘ There is a 99% probability that the S/W will still be operational after 1 year – Conversely, there is a 1% chance of a failure during that period. ◘ Note that this value does NOT tell us when, during the 1 year period, a failure will occur. – With the exponential distribution, as time progresses, the likelihood (probability) of a failure increases.  For a group of software products (e.g., 100 products): ◘ 99% of the products will be operational after 1 year (e.g., 99 products) ◘ There is a 36.6% probability that all 100 products will be operational after 1 year – This is computed by multiplying the reliability of all the products: f(t) = R1(t) x R2(t) x … x R100(t) = 0.99 x 0.99 x … x 0.99 = 0.366 (v0.1) Ops A La Carte © 23
• 31. Sample Reliability Calculations ◈ What is the failure rate (λ) and MTTF necessary to achieve this level of reliability?  t = 1 yr. = 8760 hrs.  R(t) = e^(-λt) → 0.99 = e^(-λ·8760) → ln(0.99) = -λ·8760 → λ = -ln(0.99) / 8760 = 1.1 x 10^-6 failures/hr. (1 failure every 99.5 years)  MTTF = 1/λ = 871,613 hrs. (99.5 yrs.) ◈ What is the reliability at the MTTF?  t = MTTF = 871,613 hrs.  R(MTTF) = e^(-λ·MTTF) = e^(-(1/MTTF)·MTTF) = e^(-1) = 0.368 (~37%) (v0.1) Ops A La Carte © 24
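The same calculations in a small Python sketch (illustrative, not from the slides), including the fleet probability from the previous slide:

    import math

    def required_failure_rate(target_reliability, t_hours):
        """Invert R(t) = exp(-lambda*t): lambda = -ln(R) / t."""
        return -math.log(target_reliability) / t_hours

    t = 8760                                 # one year of operation, in hours
    lam = required_failure_rate(0.99, t)     # ~1.1e-6 failures/hr
    mttf = 1.0 / lam                         # ~871,600 hrs (~99.5 years)
    fleet = 0.99 ** 100                      # ~0.366: probability all 100 units survive the year
    print(lam, mttf, fleet)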
• 32. Software and Hardware Failure Rates ◈ Software – the failure rate is driven by the effectiveness of the S/W defect detection and repair processes over the span of many upgrades (curve phases: Pre-release Testing, Useful Life (w/upgrades), Obsolete) ◈ Hardware – the failure rate is driven by three very different physical failure domains (curve phases: Burn-In, Useful Life, Wearout) ◈ Initial system deployment (i.e., completion of the Pre-release Testing and Burn-In phases) establishes a baseline for both the S/W (λSW-B) and H/W (λHW-B) failure rates (v0.1) Ops A La Carte © 25
• 34. System Availability  Availability is the percentage of time that a system is operational, accounting for planned and unplanned outages. ◈ Example: 90% Availability (for a timeframe T)  Logical representation ◘ The system is operational for the first 90% of the timeframe and down for the last 10% of the timeframe (timeline: system operational for 0.9T, then non-operational for 0.1T)  Actual (or possible) representation ◘ 3 failures cause the system to be down for 10% of the timeframe (timeline: failure occurs / system restored, three times within T) (v0.1) Ops A La Carte © 27
• 35. System Availability (continued) ◈ System availability, A(T), is the relationship between the timeframes when a system is operational vs. down due to a failure-induced outage and is defined as: A(T) = MTBF / (MTBF + MTTR) where,  The system is assumed to be operational at time t=0  T = MTBF + MTTR and 0 ≤ t ≤ T  MTBF (Mean Time Between Failure) is based on the failure rate  MTTR (Mean Time To Repair) is the duration of the outage (i.e., the expected time to detect, repair and then restore the system to an operational state) (v0.1) Ops A La Carte © 28
• 36. Software Availability ◈ System outages that are caused by S/W can be attributed to: 1. Recoverable S/W failures 2. S/W upgrades 3. Unrecoverable S/W failures NOTE: Recoverable S/W failures are the most frequent S/W cause of system outages ◈ For outages due to recoverable S/W failures, availability is defined as: A(T) = MTTF / (MTTF + MTTR) where,  MTTF is Mean Time To [next] Failure  MTTR (Mean Time To [operational] Restoration) is still the duration of the outage, but without the notion of a “repair time”. Instead, it is the time until the same system is restored to an operational state via a system reboot or some level of S/W restart. (v0.1) Ops A La Carte © 29
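A minimal Python sketch of this availability relationship (the MTTF and MTTR values are hypothetical, chosen only to show the shape of the calculation):

    def availability(mttf_hours, mttr_hours):
        """A(T) = MTTF / (MTTF + MTTR) for recoverable S/W failures."""
        return mttf_hours / (mttf_hours + mttr_hours)

    # Assumed example: one recoverable failure every 1,000 hrs and a 6-minute
    # (0.1 hr) detect-restart-restore cycle.
    a = availability(1000.0, 0.1)
    downtime_min_per_year = (1.0 - a) * 8760 * 60
    print(a, downtime_min_per_year)      # ~0.9999 availability, ~53 minutes of downtime/year

Shrinking the assumed MTTR from 6 minutes to about 30 seconds (e.g., restarting a single process instead of rebooting) lifts the same system to roughly five nines, which is the effect described on the next slides.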
  • 37. Software Availability (continued) ◈ A(T) can be increased by either:  Increasing MTTF (i.e., increasing reliability) using S/W reliability practices  Reducing MTTR (i.e., reducing downtime) using S/W availability practices ◈ MTTR can be reduced by:  Implementing H/W redundancy (sparingly) to mask most likely failures  Increasing the speed of failure detection (the key step)  S/W and system recovery speeds can be increased by implementing Fast Fail and S/W restart designs ◘ Modular design practices allow S/W restarts to occur at the smallest possible scope, e.g., thread or process vs. system or subsystem ◘ Drastic reductions in MTTR are only possible when availability is part of the initial system/software design (like redundancy) ◈ Customers generally perceive enhanced S/W availability as a S/W reliability improvement  Even if the failure rate remains unchanged (v0.1) Ops A La Carte © 30
• 38. System Availability Timeframes
Availability Class | Availability (Unavailability Range) | Downtime, Timeframe = 1 year | Downtime, Timeframe = 3 months
(1) Unmanaged | 90% (1 nine) | 36.5 days/year (52,560 mins/year) | 9.13 days
(2) Managed (good web servers) | 99% (2 nines) | 3.65 days/year (5,256 mins/year) | 21.9 hours
(3) Well-managed | 99.9% (3 nines) | 8.8 hours/year (525.6 mins/year) | 2.19 hours
(4) Fault Tolerant (better commercial systems) | 99.99% (4 nines) | 52.6 mins/year | 13.14 minutes
(5) High-Availability (high-reliability products) | 99.999% (5 nines) | 5.3 mins/year | 1.31 minutes
(6) Very-High-Availability | 99.9999% (6 nines) | 31.5 secs/year (2.6 mins/5 years) | 7.88 seconds
(7) Ultra-Availability | 99.99999% (7 nines) to 99.9999999% (9 nines) | 3.2 secs/year to 31.5 millisecs/year (15.8 secs/5 years or less) | 0.79 seconds
(v0.1) Ops A La Carte © 31
  • 40. Software Robustness Software Robustness is a measure of the software’s ability to handle exceptional input conditions so they do not become failures. ◈ Exceptional input conditions result from:  Inputs that violate data value constraints  Inputs that violate data relationships  Inputs that violate the application’s timing requirements ◈ Robust S/W prevents exceptional inputs from: 1. Causing a system outage 2. Producing a silent failure by providing no indication that an exceptional input condition was detected, thus allowing for the failure to propagate 3. Generating an error condition or response that incorrectly characterizes the exceptional input condition ◈ S/W robustness becomes increasingly important as a system becomes more flexible and the product’s customer base increases in size and usage diversity (v0.1) Ops A La Carte © 33
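An illustrative Python sketch (not from the seminar materials; handle_request and its parameters are hypothetical) of guarding one entry point against the three classes of exceptional inputs listed above, while still reporting them rather than failing silently:

    import time

    class ExceptionalInput(Exception):
        """Raised so an exceptional input is reported, never silently dropped."""

    def do_work(start, end):
        return list(range(start, end))                           # stand-in for normal processing

    def handle_request(start_index, end_index, received_at, max_age_s=2.0):
        if start_index < 0 or end_index < 0:                     # data value constraint
            raise ExceptionalInput("index out of range")
        if end_index < start_index:                              # data relationship
            raise ExceptionalInput("end_index precedes start_index")
        if time.time() - received_at > max_age_s:                # timing requirement
            raise ExceptionalInput("stale request violates the timing requirement")
        return do_work(start_index, end_index)

    print(handle_request(0, 3, received_at=time.time()))         # [0, 1, 2]

The point is the error model, not the specific checks: each exceptional input is rejected with a specific, truthful error instead of causing an outage, vanishing silently, or being mischaracterized.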
• 41. Why Is Software Robustness Important ? (Diagram: users #1, #2, #3 … #n submit inputs to the program; within the input set, the subsets Ie and IErr are inputs causing erroneous outputs, which appear in the output set as erroneous outputs Oe and OErr) (v0.1) Ops A La Carte © 34
• 42. Software Robustness Studies ◈ 2 studies of S/W robustness  Examined exceptional input condition testing of POSIX-compliant OSes and UNIX command line utilities  Robustness testing was repeated on multiple releases containing fixes for the reported exceptional input failures ◈ Findings  Failure rates associated with robustness testing were significant, ranging from 10% to 33%  After many significant, focused S/W fixes over multiple releases, failure rates still remained high ◈ Conclusions  Traditional functional testing does not adequately test for exceptional input conditions  Operational profile testing also does not adequately test for exceptional input conditions ◘ (Reason) Operational profile testing prioritizes and sets limits on functional testing.  Specific techniques are required to provide adequate test coverage and handling of exceptional input conditions (v0.1) Ops A La Carte © 35
• 44. Software Fault Tolerance  The ability of software to avoid executing a fault in a way that results in a system failure. ◈ Despite the best development efforts, almost all systems are deployed with defects that have the potential to produce critical failures  A major study of S/W defects showed that 1% of the customer-reported failures in the 1st year of deployment produce system outages ◈ Fault tolerance increases the fault-resistant quality of a system during run-time by  Detecting faults at the earliest possible point of execution  Containing the damaging effects of a fault to the smallest possible scope  Performing the most reliable recovery action possible ◈ Fault tolerant designs focus on handling “complex” failures  Address defects that are not likely to be triggered during testing (v0.1) Ops A La Carte © 37
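A minimal Python sketch (illustrative only; run_worker and safe_call are hypothetical names) of the detect / contain / recover sequence just described:

    import logging

    def run_worker(job):
        assert job is not None, "fail fast: detect the fault at the earliest point"
        return job()

    def safe_call(job, retries=1, fallback=None):
        """Contain any fault to this call and recover locally, instead of letting
        it propagate into a system-level failure."""
        for attempt in range(retries + 1):
            try:
                return run_worker(job)
            except Exception as exc:                       # containment boundary
                logging.warning("fault detected on attempt %d: %s", attempt + 1, exc)
        return fallback                                    # the most reliable recovery action left

    print(safe_call(lambda: 1 / 0, fallback="degraded result"))   # recovers with the fallback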
  • 45. So, What Is Reliable Software ??
  • 46. Reliable Software Characteristics Summary ◈ Operates within the reliability specification that satisfies customer expectations  Measured in terms of failure rate and availability level  The goal is rarely “defect free” or “ultra-high reliability” ◈ “Gracefully” handles erroneous inputs from users, other systems, and transient hardware faults  Attempts to prevent state or output data corruption from “erroneous” inputs ◈ Quickly detects, reports and recovers from S/W and transient H/W faults  S/W provides the system behavior of continuously monitoring, “self-diagnosing” and “self-healing”  Prevents as many run-time faults as possible from becoming system-level failures (v0.1) Ops A La Carte © 39
  • 47. Questions? (v0.1) Ops A La Carte © 40
  • 48. Software Design For Reliability (DfR) Seminar A “Best Practices” Approach to Developing Reliable Software George de la Fuente georged@opsalacarte.com (408) 828-1105 www.opsalacarte.com
• 49. Most Common Paths to Reliable Software 1. Rely on H/W redundancy to mask out all S/W faults  The most attractive and expensive approach  Provides increased system-level reliability using an availability technique  Requires minimal S/W reliability 2. “Testing In” reliability  The most prevalent approach  Limited and inefficient approach to defect detection and removal ◘ System testing will leave at least 30% of the code untested ◘ System testing will detect at best ~55% of all run-time failures  Most companies don’t continue testing until their reliability targets are reached ◘ The testing phase is usually fixed in duration before the S/W is developed and is focused on defect removal, not reliability testing  S/W engineers will spend more than 1/2 of their time in the test phase using this approach (v0.1) Ops A La Carte © 2
• 50. S/W Design for Reliability 3. S/W Design for Reliability  The least utilized and understood approach  Common methodologies 1) Formal methods 2) Programs based on H/W reliability practices 3) S/W process control 4) Augmenting traditional S/W development with “best practices” (v0.1) Ops A La Carte © 3
  • 51. Formal Methods ◈ Formal Methods (not commonly used for commercial SW)  Methodologies for system behavior analysis and proof of correctness ◘ Utilize mathematical modeling of a system’s requirements and/or design  Primarily used in the development of safety-critical systems that require very high degrees of: ◘ Confidence in expected system performance ◘ Quality audit information ◘ Targets of low or near zero failure rates  Formal methods are not applicable to most S/W projects ◘ Cannot be used for all aspects of system design (e.g., user interface design) ◘ Do not scale to handle large and complex system development ◘ Mathematical requirements exceed the background of most S/W engineers (v0.1) Ops A La Carte © 4
• 52. Using Hardware Reliability Practices ◈ S/W and H/W development practices are still fundamentally different  The H/W lifecycle primarily focuses on architecture and design modeling  S/W design modeling tools are rarely used ◘ Design-level simulation verification is limited – Especially if a real-time operating system is required ◘ S/W engineers still challenge the value of generating complete designs – This is why S/W design tools support 2-way code generation  Inherent S/W faults stem from the design process ◘ There is no aspect of faults from manufacturing or wear-out ◈ S/W is not built as an assembly of preexisting components  True S/W component “reuse” is rare ◘ Most “reused” S/W components are at least “slightly” modified ◘ Modified “reused” S/W components are not certified before use  S/W components are not developed to a specified set of reliability characteristics  3rd party S/W components do not come with reliability characteristics (v0.1) Ops A La Carte © 5
  • 53. Hardware Reliability Practices ◈ …. assembly of preexisting components (continued)  Acceleration mechanisms do not exist for S/W reliability testing  Extending S/W designs after product deployment is commonplace ◘ H/W is designed to provide a stable, long-term platform ◘ S/W is designed with the knowledge that it will host frequent product customizations and extensions ◘ S/W updates provide fast development turnaround and have little or no manufacturing or distribution costs  H/W failure analysis techniques (FMEAs and FTAs) are rarely successfully applied to S/W designs ◘ S/W engineers find it difficult to adapt these techniques below the system level (v0.1) Ops A La Carte © 6
• 54. Software Process Control Methodologies ◈ S/W process control assumes a correlation between the maturity of the development process and the latent defect density in the final S/W
CMM Level | Defects/KLOC | Estimated Reliability
5 | 0.5 | 99.95%
4 | 1.0 – 2.5 | 99.75% – 99.9%
3 | 2.5 – 3.5 | 99.65% – 99.75%
2 | 3.5 – 6.0 | 99.4% – 99.65%
1 | 6.0 – 60.0 | 94% – 99.4%
◈ Process audits and stricter controls are implemented if the current process level does not yield the desired S/W reliability  Process root cause analysis may not yield great improvement ◘ Practices within the processes must be fine-tuned (but how??)  Reliability improvement under this type of methodology is slow ◘ Process outcome cannot vary too much in either direction (v0.1) Ops A La Carte © 7
  • 55. “Best Practices” for Software Development
  • 56. Sources of Industry Data Data was derived from a large-scale international survey of S/W lifecycle quality spanning:  18 years (1984-2002)  12,000+ projects  600+ companies ◘ 30+ government/military organizations  8 classes of software applications: 1. Systems S/W 2. Embedded S/W 3. Military S/W 4. Commercial S/W 5. Outsourced S/W 6. Information Technology (IT) S/W 7. End-User developed personal S/W 8. Web-based S/W (v0.1) Ops A La Carte © 9
  • 57. Terminology ◈ Best Practice  A key S/W quality practice that significantly contributes towards increasing S/W reliability ◈ Best in Class Companies  Companies that have the following two characteristics: ◘ Recognized for producing S/W-based products with the lowest failure rate in their industry ◘ Consistently deploying software based on their initial schedule targets ◈ Formal practice  A S/W quality development practice that is well-understood and consistently implemented throughout the software development organization. ◘ Note: Formal practices are rarely undocumented. ◈ Informal practice  A S/W quality development practice that is either implemented with varying degrees of rigor or in an inconsistent manner throughout the software development organization. ◘ Note: Informal practices are usually accompanied by the absence of documented guidelines or standards. (v0.1) Ops A La Carte © 10
  • 58. “Best in Class” Company Best Practices ◈ S/W Life Cycle Practices  Consistent implementations of the entire S/W lifecycle phases (requirements, design, code, unit test, system test and maintenance) ◈ Requirements  Involve test engineers in requirements reviews  Define quality and reliability targets  Define negative requirements (i.e., “shall nots”) ◈ Development phase defect removal  Formal inspections (requirements, design, and code)  Failure analysis ◈ Design  Team or group-oriented approach to design for the system and S/W ◘ NOTE: System design team includes other disciplines (e.g., H/W & Mech) (v0.1) Ops A La Carte © 11
  • 59. “Best in Class” Company Best Practices (continued) ◈ Testing  Robust Testing strategy to meet business / customer requirements  Test plans completed and reviewed before the coding phase  Mandatory developer unit testing  Independently verify/test every software change (enhancements and fixes)  Create formal test plans for all medium and large-sized projects  Staff an independent and dedicated SQA team to at least 5% of size of the S/W development team  Generate quality or reliability estimates  Incorporate automated test tools into the test cycle ◈ S/W Quality Assurance  Review and prioritize all changes after the development phase  Record and track all changes to S/W artifacts throughout the life cycle  Formalize unit testing reviews (test plans and results)  Implement active quality assurance programs  Root-cause analysis with resolution follow-up  Gather and review customer product feedback (v0.1) Ops A La Carte © 12
  • 60. “Best in Class” Company Best Practices (continued) ◈ SCM and Defect Tracking  Implement formal change management of artifact changes and S/W releases  Incorporate automated defect tracking tools ◈ Metrics and Measurements  Record and track all defects and failures  Collect field data for root cause analysis on next project or release iteration  Measure code test coverage  Generate metrics based on code attributes (e.g., size and complexity)  Generate defect removal efficiency measurements  Track “bad fixes” (v0.1) Ops A La Carte © 13
  • 61. Weaknesses in S/W Development Practices ◈ Lack of engineer “ownership” for development and test practices  Limited efficiency and effectiveness improvements made  May lead to disjoint practices, resulting in no real “common” practices ◈ System design is “H/W-centric”  Primary focus on H/W feasibility, functionality and performance  Architectural reviews are not collaborative, team design sessions  S/W requirements of the H/W platform are generally not entertained or implemented ◈ S/W defect removal relies mostly on system or subsystem-level testing  Development phase defect removal is limited to cursory code reviews and sparse unit testing ◘ Designs and design reviews are satisfied using functional or interface specifications  No causal analysis is performed to improve future defect removal (v0.1) Ops A La Carte © 14
  • 62. Weaknesses in S/W Development Practices ◈ Limited system and S/W quality measurements and metrics  Use of default defect tracking tool statistics as primary metrics/measurements  Generally no data mining capability available for analysis ◈ Informal SQA processes and staffing leads to wasted efforts and incomplete coverage  Too many trivial defects still present during system test phase  Defect fixes that introduce additional defects are frequent  S/W is shipped with many untested sections  Significant, recurring, “real world” customer scenarios remain untested ◈ Limited or no tool support for:  Unit testing  Automated regression testing  S/W analysis (static, dynamic, and coverage) (v0.1) Ops A La Carte © 15
• 63. Application Behavior Patterns
S/W Quality Methods | System S/W | Embedded S/W
Summary | Overall, best S/W quality results | Wide range of S/W quality results
Defect Removal Efficiency | Usually > 96% | Up to > 94%
Project Sizes | Best quality results found in projects > 550 KLOCS | Most projects are < 26.5 KLOCS
Inspections | Formal design and code inspections | Usually do not implement both design and code inspections (and not formally)
Test Teams | Independent SQA team | Usually do not have separate SQA teams
Measurement Control | Formal S/W quality measurement process and tools | Informal S/W quality measurement processes and tools
Change Control | Formal change control process and tools | Informal change control process and tools
Test Plans | Formal test plans | Usually do not implement formal test plans
Unit Testing | Performed by developers | Performed by developers
Testing Stages | 6 to 10 test stages (performed by SQA team) | 3 to 6 test stages (usually performed by developers)
Governing Processes | CMM/CMMI and Six-Sigma methods | No consistent pattern found
(v0.1) Ops A La Carte © 16
• 64. Application Behavior Patterns
S/W Quality Methods | System S/W | Commercial S/W
Summary | Overall, best S/W quality results | Wide range of S/W quality results
Defect Removal Efficiency | Usually > 96% | Up to > 90%
Project Sizes | Best quality results found in projects > 550 KLOCS | Most projects are > 275 KLOCS
Inspections | Formal design and code inspections | Inconsistent use of formal design or code inspections
Test Teams | Independent SQA team | Inconsistent use of independent SQA teams
Measurement Control | Formal S/W quality measurement process and tools | Informal S/W quality measurement processes and tools
Change Control | Formal change control process and tools | Formal change control process and tools
Test Plans | Formal test plans | Formal test plans
Unit Testing | Performed by developers | Performed by developers
Testing Stages | 6 to 10 test stages (performed by SQA team) | 3 to 8 test stages (extensive reliance on Beta trials)
Governing Processes | CMM/CMMI and Six-Sigma methods | No consistent pattern found
(v0.1) Ops A La Carte © 17
• 66. Defect Origin and Discovery ◈ Typical behavior (diagram): defects originate in the Requirements, Design, Coding, Testing and Maintenance phases, but most are not discovered until Testing and Maintenance (“Surprise!”) ◈ Goal of best practices on defect discovery (diagram): defects are discovered in the same phase in which they originate (v0.1) Ops A La Carte © 19
• 67. Software Defect Removal Techniques
Defect Removal Technique | Efficiency Range
Design inspections | 45% to 60%
Code inspections | 45% to 60%
Unit testing | 15% to 45%
Regression test | 15% to 30%
Integration test | 25% to 40%
Performance test | 20% to 40%
System testing | 25% to 55%
Acceptance test (1 customer) | 25% to 35%
◈ Development organizations try to find and remove more defects by implementing more stages of system testing  Since there is a wide range of overlap between test stages, this approach becomes less efficient as it scales (v0.1) Ops A La Carte © 20
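For a rough feel of how such stages stack, here is a small Python sketch (not from the seminar materials) that chains mid-range efficiencies from the table above under the optimistic assumption that each stage removes its stated fraction of whatever defects remain; because real test stages overlap, as the slide notes, the actual combined efficiency is lower.

    def combined_efficiency(stage_efficiencies):
        """Fraction of defects removed by a pipeline of removal stages, assuming
        each stage independently removes its fraction of the remaining defects."""
        remaining = 1.0
        for e in stage_efficiencies:
            remaining *= (1.0 - e)
        return 1.0 - remaining

    # design inspection, code inspection, unit test, system test (mid-range values)
    print(combined_efficiency([0.55, 0.55, 0.30, 0.40]))   # ~0.91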
• 68. Defect Removal Technique Impact
Design Inspections / Reviews | n | n | n | n | Y | n | Y | Y
Code Inspections / Reviews | n | n | n | Y | n | n | Y | Y
Formal SQA Processes | n | Y | n | n | n | Y | n | Y
Formal Testing | n | n | Y | n | n | Y | n | Y
Median Defect Efficiency | 40% | 45% | 53% | 57% | 60% | 65% | 85% | 99%
This large potential available from design and code inspections/reviews is why most development organizations see greater improvements in S/W reliability with investments in the development phase than with further investments in testing. NOTE: Design review results are based on low-level design reviews. (v0.1) Ops A La Carte © 21
• 69. Case Study: Quantifying the Software Quality Investment  Objective: ◘ Develop a value-based approach to determine the necessary S/W quality investment using dependability attributes.  Methodology: ◘ Use an integrated approach of project cost and quality estimation models (COCOMO II and COQUALMO) and empirically-based business value relationship factoring to analyze the data from a diverse set of 161 well-measured S/W projects.  Findings: ◘ The methodology was able to correlate the optimal S/W project quality investment level and strategy to the required project reliability level based on defect impact. ◘ The objectives were satisfied without using specific S/W reliability practices, focusing heavily on defect detection during system testing. (v0.1) Ops A La Carte © 22
  • 70. Reliability, Development Cost & Test Time Tradeoffs (v0.1) Ops A La Carte © Based on COCOMOII (Constructive Cost Model) 23
• 71. Reliability, Development Cost & Test Time Tradeoffs  The relative cost/source instruction to achieve a Very High RELY rating is less than the amount of additional testing time that is required (54%), since early defect prevention reduces the required rework effort and allows for additional testing time. (v0.1) Ops A La Carte © Based on COCOMOII (Constructive Cost Model) 24
• 72. Delivered Defects Scale ◈ A Very Low rating delivers roughly the same number of defects as are introduced ◈ An Extra High rating reduces the delivered defects by a factor of 37  Note: The assumed nominal defect introduction rate is 60 defects/KSLOC, based on the following distribution: ◘ 10 requirements defects/KSLOC ◘ 20 design defects/KSLOC ◘ 30 coding defects/KSLOC (v0.1) Ops A La Carte © Based on COQUALMO (Constructive Quality Model) 25
• 73. Defect Removal Factors Scale
Rating | Automated Analysis | Peer Reviews | Execution Testing and Tools
Very Low | Compiler-based simple syntax checking | No peer reviews | No testing
Low | Basic compiler capabilities | Ad-hoc informal walk-throughs | Ad-hoc testing and debugging
Nominal | Compiler extensions; basic requirements and design consistency | Well-defined sequence for preparation, review, and minimal follow-up | Basic test, test data management and problem-tracking support; test criteria based on checklist
High | Intermediate-level module and intermodule; simple requirements and design | Formal review roles and well-trained participants using basic checklists and follow-up procedures | Well-defined test sequences tailored to the organization; basic test coverage tools and test support system; basic test process management
Very High | More elaborate requirements and design; basic distributed-processing and temporal analysis, model checking, and symbolic execution | Basic review checklists and root cause analysis; formal follow-up using historical data on inspection rates, preparation rates, and fault density | More advanced test tools and test data preparation, basic test oracle support, distributed monitoring and analysis, and assertion checking; metrics-based test process management
Extra High | Formalized specification and verification; advanced distributed-processing | Formal review of roles and procedures; extensive review checklists and root cause analysis; continuous review-process improvement; statistical process control | Highly advanced tools for test oracles, distributed monitoring and analysis, and assertion checking; integration of automated analysis and test tools; model-based test process management
(v0.1) Ops A La Carte © Derived from COQUALMO (Constructive Quality Model) 26
• 74. DfR Based on “Best Practices”
S/W Life Cycle Practices
  Modified existing: Consistent implementations of the entire S/W lifecycle phases (requirements, design, code, unit test, system test and maintenance)
  New: Reliability testing as part of the overall testing strategy; define reliability goals as requirements
Metrics and Measurements
  Modified existing: Record and track all defects and failures; collect field data for root cause analysis on the next project or release iteration; generate defect removal efficiency measurements
  New: Track fix rationale such as “bad” fixes or untested code; collect failure data for analysis during the system test phase
Development Phase Practices
  Modified existing: Reviews of design and code; targeted developer unit testing
  New: Assess designs for availability; perform failure analysis
Testing
  Modified existing: Independently verify/test every S/W change (enhancements and fixes)
  New: Generate reliability estimates
SQA
  Modified existing: Perform failure root-cause analysis; record and track all changes to S/W artifacts throughout the life cycle
(v0.1) Ops A La Carte © 27
• 75. Summary: DfR Based on “Best Practices” ◈ The strength of a “best practices” approach is its intuitiveness  Incorporate considerations of essential functionality and failure behavior in order to understand failure modes and improve availability  Perform design analysis to identify potential failure points and, where possible, redesign to remove failure points or reduce their impact  Analyze S/W for critical failure trigger points and remove them or reduce their impact and frequency where possible  Plan testing to maximize the overall S/W verification prior to field deployment  Let measured data drive changes to reliability practices ◈ Focus on the removal of critical failures instead of all defects  S/W with known defects and faults may still be perceived as reliable by users ◘ NASA studies identified projects that produced reliable S/W with only 70% of the code tested  Removing X% of the defects in a system will not necessarily improve the reliability by X%. ◘ One IBM study showed that removing 60% of the product’s defects resulted in only a 3% reliability improvement ◘ S/W defects in rarely executed sections of code may never be encountered by users and therefore may not improve reliability – Exceptions for essential operations: boot, shutdown, data backup, etc. (v0.1) Ops A La Carte © 28
  • 76. Questions? (v0.1) Ops A La Carte © 29
  • 77. Software Design For Reliability (DfR) Seminar Reliability Measurements and Metrics
• 78. Metrics Supporting Reliability Strategies  Common strategies and tactics used by teams developing highly reliable software products  Explicit, robust reliability requirements during the requirements phase  Appropriate use of fault tolerant techniques in product design  Robust design/operational requirements for maximizing product availability  Focused, targeted (data driven) defect inspection program  Robust testing strategy and program: a well defined, focused mix of unit, regression, integration, system, exploratory and reliability demonstration testing  Robust defect tracking/metrics program focused on the important few ◘ Defect tracking and analysis for all phases of a product’s life, including post-shipment defects/failures (FRACAS) (v0.1) Ops A La Carte © 2
  • 79. Reliability Measurements and Metrics ◈ Definitions  Measurements – data collected for tracking or to calculate meta-data (metrics) ◘ Ex: defect counts by phase, defect insertion phase, defect detection phase  Metrics – information derived from measurements (meta-data) ◘ Ex: failure rate, defect removal efficiency, defect density ◈ Reliability measurements and metrics accomplish several goals  Provide estimates of S/W reliability prior to customer deployment  Track reliability growth throughout the life cycle of a release  Identify defect clusters based on code sections with frequent fixes  Determine where to focus improvements based on analysis of defect/failure data Note: S/W Configuration Management (SCM) and defect tracking tools should be updated to facilitate the automatic tracking of this information ◘ Allow for data entry in all phases, including development ◘ Distinguish code base updates for critical defect repair vs. any other changes, (e.g., enhancements, minor defect repairs, coding standards updates, etc.) (v0.1) Ops A La Carte © 3
• 80. Critical Measurements To Collect
Measurement | Description
Critical Defects by Phase | Number of critical defects found during each non-operational phase (i.e., requirements, design, and coding)
Critical Failures by Phase | Number of critical failures found during each operational phase (i.e., unit testing, system testing, and field)
Critical Defect Insertion Phase | The phase where the critical defect (or critical failure) was inserted (or originated)
Critical Defect Detection Phase | The phase where the critical defect (or critical failure) was detected (or reported)
Critical Defect Major Location | A high-level indicator of a critical defect’s location within the source code (e.g., a S/W component or file name)
Critical Defect Minor Location | A low-level indicator of a critical defect’s location within the source code (e.g., the name of a class, method or data object)
Critical Failure Time | The time when a critical failure occurred since the beginning of a test run (typically measured in CPU or wall time)
Critical Failure Root Cause | The relevant failure category for a specified critical failure
(v0.1) Ops A La Carte © 4
• 81. Metrics To Track
Metric | Description
Critical Defect Density | The number of defects per KLOC (1,000 lines of commented source code)
Critical Defect Removal Efficiency (CDRE) | The percentage of defects identified within a given life cycle period
Critical Failure Rate | The mean number of failures occurring within a reference period
Current Defect Demographics | Open defect demographics of the current code base, including defects by severity, module, fix backlog, etc.
Failure/Defect Arrival Rates | Trends (e.g., defects vs. test time interval) of newly detected failures and/or defects
Bad Fixes | The number of failures caused as side effects of fixes to logged defects
Unverified Code Fixes | The number of failures caused by code that was neither reviewed nor tested
Failure Root Cause Distribution | The distribution of failure categories for Pareto root-cause analysis
(v0.1) Ops A La Carte © 5
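One way to make these measurements collectible is a per-defect record like the Python sketch below (illustrative only; the field names mirror the measurements on the previous slide and are not prescribed by the seminar):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CriticalDefectRecord:
        """One record per critical defect/failure, as logged in the defect tracker."""
        insertion_phase: str                      # requirements, design, coding, ...
        detection_phase: str                      # unit test, system test, field, ...
        major_location: str                       # S/W component or file name
        minor_location: str                       # class, method or data object
        failure_time_h: Optional[float] = None    # CPU/wall time into the test run
        root_cause: Optional[str] = None          # failure category

    def critical_defect_density(records, kloc):
        """Critical defects per 1,000 lines of code."""
        return len(records) / kloc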
• 82. Software Defect Distributions ◈ Average distribution of all types of S/W defects by lifecycle phase:  20% Requirements  30% Design  35% Coding  10% Bad Defect Fixes (introduction of secondary defects)  5% Customer Documentation ◘ 50% of all S/W defects are introduced before coding ◘ 1 in 10 defects fixed during testing were unintended side effects of a previous defect “fix” ◈ Average distribution of S/W defects escalated from the field (based on 1st year field defect report data):  1% Severity 1 (catastrophic)  20% Severity 2 (serious)  35% Severity 3 (minor)  44% Severity 4 (annoyance or cosmetic) ◘ Only ~20% of the customer-reported S/W defects are targets for reliability improvements (v0.1) Ops A La Carte © 6
• 83. Typical Defect Tracking (System Test)
System Test Build | Severity #1 Defects Found | Severity #2 Defects Found | Severity #3 Defects Found | Severity #4 Defects Found | Total Defects Found
SysBuild-1 | 7 | 9 | 16 | 22 | 54
SysBuild-2 | 5 | 5 | 14 | 26 | 50
SysBuild-3 | 4 | 6 | 8 | 16 | 34
… | … | … | … | … | …
SysBuild-7 | 0 | 1 | 4 | 6 | 11
(v0.1) Ops A La Carte © 7
• 84. Defect Removal Efficiency ◈ Critical defect removal efficiency (CDRE) is a key reliability measure: CDRE = Critical Defects Found / Critical Defects Present ◈ “Critical Defects Present” is the sum of the critical defects found in all phases as a result of reviews, testing and customer/field escalations  System testing stages include integration, functional, loading, performance, acceptance, etc.  Customer trials can be considered either a system testing stage, a preliminary, but separate, field deployment phase or a part of the field deployment phase ◘ Depending on the rationale for the trials  The field deployment phase is measured as the first year following deployment ◘ The average life span of a S/W release, since most S/W releases are separated by increments no longer than 1 year (Diagram: phases Requirements, Design, Coding, Unit Testing, System & Subsystem Testing Stages, Field Deployment, grouped three ways into Review/Testing/Field efficiency, Development/System Testing/Field efficiency, and Internal/Field efficiency) (v0.1) Ops A La Carte © 8
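A tiny Python sketch of this ratio (illustrative only; the counts come from Example #1 on the next slide):

    def removal_efficiency(found_in_scope, found_all_phases):
        """CDRE = critical defects found within the scope of interest divided by
        all critical defects found across reviews, testing and the 1st field year."""
        return found_in_scope / found_all_phases

    total = 20 + 30 + 40 + 25 + 55 + 40          # 210 critical defects found in all phases
    internal = removal_efficiency(170, total)    # ~0.81 internal efficiency
    field = removal_efficiency(40, total)        # ~0.19 field efficiency
    print(internal, field)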
• 85. CDRE Example Metrics #1
Origin | Critical Defects Found
Requirements Reviews | 20
Design Reviews | 30
Code Reviews | 40
Unit Testing | 25
System & Subsystem Testing | 55
(Internal subtotal) | 170
Field Deployment | 40
TOTAL | 210
Metric | Removal Efficiency
Internal Efficiency | 81% (= 170 / 210)
Field Efficiency | 19% (= 40 / 210)
(v0.1) Ops A La Carte © 9
• 86. CDRE Example Metrics #2
Origin | Critical Defects Found
Requirements Reviews | 20
Design Reviews | 30
Code Reviews | 40
Unit Testing | 25
(Development subtotal) | 115
System & Subsystem Testing | 55
Field Deployment | 40
TOTAL | 210
Metric | Removal Efficiency
Development Efficiency | 55% (= 115 / 210)
System Testing Efficiency | 26% (= 55 / 210)
Field Efficiency | 19% (= 40 / 210)
(v0.1) Ops A La Carte © 10
• 87. CDRE Example Metrics #3
Origin | Critical Defects Found
Requirements Reviews | 20
Design Reviews | 30
Code Reviews | 40
(Review subtotal) | 90
Unit Testing | 25
System & Subsystem Testing | 55
(Testing subtotal) | 80
Field Deployment | 40
TOTAL | 210
Metric | Removal Efficiency
Review Efficiency | 43% (= 90 / 210)
Testing Efficiency | 38% (= 80 / 210)
Field Efficiency | 19% (= 40 / 210)
(v0.1) Ops A La Carte © 11
• 88. Sample Project Reliability Measurement Tracking – At the end of the project Unit Testing Phase
Phase (where found) | Total Defects Found | Total Critical | Critical by insertion phase: Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | | | | |
Design | 123 | 45 | 6 | 39 | | | |
Code | 158 | 62 | 4 | 12 | 46 | | |
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | |
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | |
Integration Test | (to be filled in)
System Test | (to be filled in)
Testing Totals | (to be filled in)
Field Reports Totals | (to be filled in)
(v0.1) Ops A La Carte © 12
• 89. Sample Project Reliability Measurement Tracking – 1 year after the end of the System Testing phase
Phase (where found) | Total Defects Found | Total Critical | Critical by insertion phase: Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | | | | |
Design | 123 | 45 | 6 | 39 | | | |
Code | 158 | 62 | 4 | 12 | 46 | | |
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | |
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | |
Integration Test | 43 | 13 | 0 | 4 | 7 | 1 | 1 |
System Test | 183 | 47 | 2 | 13 | 28 | 0 | 4 |
Testing Totals | 226 | 60 | 2 | 17 | 35 | 1 | 5 |
Field Reports Totals | 70 | 35 | 1 | 8 | 22 | 0 | 3 | 1
Release Summary | 720 | 239 | 26 | 81 | 120 | 3 | 8 | 1
(v0.1) Ops A La Carte © 13
• 90. Sample Project Reliability Measurement Tracking – 1 year after the end of the System Testing phase (the same table, annotated with Design-phase DRE)
Phase (where found) | Total Defects Found | Total Critical | Critical by insertion phase: Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | | | | |
Design | 123 | 45 | 6 | 39 | | | |
Code | 158 | 62 | 4 | 12 | 46 | | |
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | |
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | |
Integration Test | 45 | 13 | 0 | 4 | 7 | 1 | 1 |
System Test | 189 | 47 | 2 | 13 | 28 | 0 | 4 |
Testing Totals | 234 | 60 | 2 | 17 | 35 | 1 | 5 |
Field Reports Totals | 77 | 35 | 1 | 8 | 22 | 0 | 3 | 1
Release Summary | 745 | 239 | 26 | 81 | 120 | 3 | 8 | 1
Design DRE Measurements: 39 critical defects found in-phase; 56 critical defects found during development; 17 critical defects found during system-level testing; 81 critical defects found overall
Design DRE Metrics: 48% in-phase DRE (= 39/81); 69% development DRE (= 56/81); 90% in-house DRE (= (56+17)/81)
(v0.1) Ops A La Carte © 14
• 91. Sample Project Reliability Measurement Tracking – 1 year after the end of the System Testing phase (the same table, annotated with Bad Fixes data)
Phase (where found) | Total Defects Found | Total Critical | Critical by insertion phase: Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | | | | |
Design | 123 | 45 | 6 | 39 | | | |
Code | 158 | 62 | 4 | 12 | 46 | | |
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | |
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | |
Integration Test | 45 | 13 | 0 | 4 | 7 | 1 | 1 |
System Test | 189 | 47 | 2 | 13 | 28 | 0 | 4 |
Testing Totals | 234 | 60 | 2 | 17 | 35 | 1 | 5 |
Field Reports Totals | 77 | 35 | 1 | 8 | 22 | 0 | 3 | 1
Release Summary | 745 | 239 | 26 | 81 | 120 | 3 | 8 | 1
Bad Fixes Measurements: 1 critical defect inserted and found during field deployment; 5 critical defects inserted and found during system-level testing; 3 critical defects inserted during system-level testing and found during field deployment; 1 critical defect inserted during unit testing and found during system-level testing; 0 critical defects inserted during unit testing and found during field deployment; 60 total critical defects found during system-level testing; 35 total critical defects found in the field
Bad Fixes Metrics: 10% of the test phase failures are from Bad Fixes (= (1 + 5)/60); 11% of the field phase failures are from Bad Fixes (= (0 + 3 + 1)/35); 11% of the test and field failures are from Bad Fixes (= (1 + 5 + 0 + 3 + 1)/(60 + 35))
(v0.1) Ops A La Carte © 15
• 92. Sample Project Reliability Metrics (metrics tracked per phase: In-phase DRE, Overall DRE, CDD, Bad Fixes)
Requirements | 29% | 46%
Design | 51% | 48% | 58%
Code | 56% | 38% | 84%
Unit Test | 18% | 17%
Testing | 26% | 10% | 11%
Field | 16% | 16% | 11%
(v0.1) Ops A La Carte © 16
• 93. Sample Project Reliability Metrics (metrics tracked per phase: In-phase DRE, Overall DRE, CDD, Bad Fixes)
Requirements | 29% | 46%
Design | 51% | 48% | 58%
Code | 56% | 38% | 84%
Unit Test | 18% | 17%
Testing | 26% | 10% | 11%
Field | 16% | 16% | 11%
Sample Phase 1 Goals (50% improvement): • (reliability) increase in-house DRE to 92% • (efficiency) reduce field bad fixes to 5%
(v0.1) Ops A La Carte © 17
  • 94. Distribution of Defects Across Files ◈ There is a Pareto-like distribution of defects across files within a module  Defects cluster in a relatively small number of files  Conversely, more than half of the files have almost no critical defects (v0.1) Ops A La Carte © 18
  • 95. Failure Density Analysis ◈ In general, there is a pareto distribution (80/20) of defects across files ◈ Failure density analysis provides a mechanism for early detection of “problematic” sections of code (i.e., defect clusters)  Improving the reliability of these “problematic modules” can consume as much as 4 times the effort of a redesign ◘ Less time is required to re-inspect and either restructure or redesign these modules than to effectively “beat them into submission” through testing  The goal is to identify “problematic” code as early as possible ◘ Perform failure density analysis early on during unit testing – Since different sections of code may be problematic during system testing, the analysis should be repeated near the middle of this phase ◘ SCM and defect tracking tools can be modified to provide this information without much effort – Source code can be analyzed to display “hot spot” histograms using number of changes and/or number of failures – Heuristics for failure density thresholds must be developed to determine when action should be taken (v0.1) Ops A La Carte © 19
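A small Python sketch of the kind of hot-spot detection described above (illustrative only; the function name and threshold are assumptions, not the seminar's prescription): given one file name per reported failure, it returns the smallest set of files that accounts for most of the failures.

    from collections import Counter

    def hot_spots(failure_file_names, threshold=0.8):
        """Smallest set of files covering `threshold` of the reported failures."""
        counts = Counter(failure_file_names)
        total = sum(counts.values())
        spots, covered = [], 0
        for fname, n in counts.most_common():
            spots.append((fname, n))
            covered += n
            if covered / total >= threshold:
                break
        return spots

    print(hot_spots(["io.c", "io.c", "parse.c", "io.c", "ui.c"]))   # [('io.c', 3), ('parse.c', 1)]

In practice the input would be pulled from the SCM and defect tracking tools mentioned above, and the threshold heuristic tuned per organization.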
• 96. Failure Density Distribution by File (Histogram: % of total reported failures vs. source code files F1 through F23; the first few files contain the majority of the reported failures and should be proactively analyzed for possible redesign, restructuring or additional defects) (v0.1) Ops A La Carte © 20
• 97. Applying Causal Analysis to Defect Measurements ◈ Causal analysis (RCA) can be applied to defect measurements to improve defect removal effectiveness and efficiency  Usually performed between life cycle iterations or S/W releases  Upstream defect removal practices should be reviewed in light of the defects that were not detected in each phase ◘ Requires knowing the phases where a defect was introduced and detected ◘ Simple guidelines should be defined to determine the phase where the defect was introduced  Defects are categorized by types ◘ Can determine if the problem is systemic to one or more categories, or ◘ Whether the problem is an issue of raising overall defect removal efficiency for a development phase (v0.1) Ops A La Carte © 21
  • 98. Causal Analysis Process (one shot analysis) Objective: Create an initial, rough distribution of the defects and identify potential improvements to existing defect removal practices. Process Outline (typically 6-8 hours over two days): ◈ Select a team from the senior engineers and MSEs of the development and test teams. Analysis Process:  Before the meeting, select a representative sample of approximately 50-75 defects from each development and test phase  Convene meeting and explain the objectives and process to the team  Classify the defects ◘ Start by walking the team through the classification of one defect ◘ Divide the defects into small groups and assign each person 2 groups of defects (to force analysis overlap) ◘ Upon completion, collect and process the data offline in preparation for team analysis and review  Analyze the defect types using a histogram to look for a Pareto distribution and select the most prevalent defect types  Develop recommendations for improvements to specific defect removal practices  Implement the recommendations at the next possible opportunity and gather measurement data (v0.1) Ops A La Carte © 22
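A minimal Python sketch of the classification bookkeeping for such a session (illustrative; the names are hypothetical): each defect carries the type labels assigned by the reviewers who saw it, the first label feeds the Pareto histogram, and any disagreement from the deliberately overlapping assignments is flagged for the team review.

    from collections import Counter

    def classification_summary(labels_by_defect):
        """labels_by_defect: {defect_id: [type labels from each reviewer]}."""
        histogram = Counter()
        disagreements = []
        for defect_id, labels in labels_by_defect.items():
            histogram[labels[0]] += 1              # one vote per defect for the Pareto chart
            if len(set(labels)) > 1:
                disagreements.append(defect_id)    # resolve these during the team review
        return histogram.most_common(), disagreements

    sample = {"D-101": ["logic", "logic"], "D-102": ["data handling", "logic"]}
    print(classification_summary(sample))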
• 99. The Orthogonal Defect Analysis Framework ◈ ORIGIN (Where?): Spec/Rqmts, Design, Code, Environment Support, Documentation, Other ◈ TYPE (What?), by origin:  Spec/Rqmts – Requirements or Specifications, Functionality, HW Interface, SW Interface, User Interface, Functional Description  Design – Process (Interprocess) Communications, Data Definition, Module Design, Logic Description, Error Checking, Standards  Code – Logic, Computation, Data Handling, Module Interface/Implementation, Standards  Environment Support – Test SW, Test HW, Development Tools, Integration SW ◘ “Other” can also be a type classification for any origin ◈ MODE (Why?): Missing, Unclear, Wrong, Changed, Better Way (v0.1) Ops A La Carte © 23