1. Ops A La Carte Software Design for Reliability (SDfR) Seminar
2. We Provide You Confidence in Your Product Reliability™
Ops A La Carte / (408) 654-0499 / askops@opsalacarte.com / www.opsalacarte.com
3. The following presentation materials
are copyright protected property of
Ops A La Carte LLC.
Distribution of these materials is limited to
your company staff only.
These materials may not be distributed
outside of your company or used for any
purpose other than training.
4. Software DfR ½-Day Seminar Agenda
Agenda
◈ Introductions and Agenda Review
◈ Software Reliability Basic Concepts
◈ A “Best Practices” Approach to Developing Reliable Software
◈ Reliability Measurements and Metrics
◈ Wrap-up
(v0.1) Ops A La Carte © 3
6. Presenter’s Biographical Sketch – Bob Mueller
◈ Bob Mueller is a senior consultant/program manager with Ops A La Carte and the Marisan
Group. He is a product development professional with 30+ years of technical and
management experience in software-intensive product development, R/D process
and quality systems development, including extensive consulting experience with
cross-functional product development teams and senior management.
◈ After receiving his M.S. in Physics in 1973, Bob joined Hewlett-Packard in Cupertino, CA
in IC process development. In the next three decades before leaving hp in 2002, he
held numerous positions in R/D, R/D management and consulting including:
IC process development, process engineering and IC production management.
Lead developer of an automated IC in-process/test monitor, analysis and
control system (hp internal).
R/D project management for sw intensive products (including process analysis and
control, work cell control & quality control systems).
Numerous R/D management positions in computer, analytical and healthcare
businesses
including FDA regulated systems with ISO 9001 certified organizations.
Numerous program management positions focused on internal/external process
improvement and consulting.
Practice area manager and consultant for PG -- Engineering consulting team
(internal hp)
◈ Bob’s current consulting interests include: Warranty process and quality system
improvement, SW Reliability, agile SW product development methodologies and R/D
product strategy and technology roadmap development.
◈ Bob has taught many internal hp classes and at local junior colleges.
(v0.1) Ops A La Carte © 5
7. Software Reliability Integration Services for the Product
◈ Reliability Integration in the Concept Phase
   Software Reliability Goal Setting
   Software Reliability Program and Integration Plan
◈ Reliability Integration in the Design Phase
   Facilitation of Team Design Template Reviews
   Facilitation of Team Design Reviews
   Software Failure Analysis
   Software Fault Tolerance
◈ Reliability Integration in the Implementation Phase
   Facilitation of Code Reliability Reviews
   Software Robustness and Coverage Testing Techniques
◈ Reliability Integration in the Testing Phase
   Software Reliability Measurements and Metrics
   Usage Profile-based Testing
   Software Reliability Estimation Techniques
   Software Reliability Demonstration Tests
8. Software Design For Reliability (DfR)
Software Reliability
Basic Concepts
George de la Fuente
georged@opsalacarte.com
(408) 828-1105
www.opsalacarte.com
10. Software Quality vs. Reliability
Software Quality: the level to which the software characteristics conform to all the specifications.
Factors and their criteria (ISO 9126 Quality Model):
◈ Functionality: suitability, accuracy, interoperability, security
◈ Usability: understandability, learnability, operability, attractiveness
◈ Reliability: maturity, fault tolerance, recoverability
◈ Efficiency: time behavior, resource utilization
◈ Maintainability: analysability, changeability, stability, testability
◈ Portability: adaptability, installability, co-existence, replaceability
(v0.1) Ops A La Carte © 3
12. Software Reliability Definitions
“The probability of failure free software operation
for a specified period of time in a specified
environment”
ANSI/IEEE STD-729-1991
◈Examine the key points
◈Practical rewording of the definition
Software reliability is a measure of the software failures that are visible to a
customer and that prevent a system from delivering essential functionality
for a specified period of time.
(v0.1) Ops A La Carte © 5
13. Software Reliability Can Be Measured
◈ Measurements (quantitative) are a required foundation
Differs from quality which is not defined by measurements
All measurements and metrics are based on run-time failures
◈ Only customer-visible failures are targeted
Only defects that produce customer-visible failures affect reliability
Corollaries
◘ Defects that do not trigger run-time failures do NOT affect reliability
– badly formatted or commented code
– defects in dead code
◘ Not all defects that are triggered at run-time produce customer-
visible failures
– corruption of any unused region of memory
◈ S/W Reliability evolved from H/W Reliability
Primary distinction: S/W Reliability focuses only on design reliability
(v0.1) Ops A La Carte © 6
14. Software Reliability Is Based On Usage
◈ S/W failure characteristics are derived from the usage profile of a
particular customer (or set of customers)
Each usage profile triggers a different set of run-time S/W faults and failures
◈ Example of reliability perspective from 3 users of the same S/W
Customer A
◘ Usage Profile – Exercises sections of S/W that produce very few failures.
◘ Assessment – S/W reliability is high.
Customer B
◘ Usage Profile – Overlaps with Customer A’s usage profile. However,
Customer B also exercises other sections of S/W that produce many,
frequent failures
◘ Assessment – S/W reliability is low.
Customer C
◘ Usage Profile – Similar to Customer B’s usage profile. However, Customer
C has implemented workarounds to mitigate most of the S/W failures that
were encountered. The final result is that the S/W executes with few
failures but requires additional off-nominal steps.
◘ Assessment – S/W quality is low since many workarounds are required.
However, for the final configuration that includes these workarounds, S/W
reliability is acceptable.
(v0.1) Ops A La Carte © 7
15. Reliability ≠ Correctness or Completeness
◈ Correctness is a measure of how well the requirements
model the intended customer base or industry functionality
Correctness is validated by reviewing product requirements and
functional specifications with key customers
◈ Completeness is a measure of the degree of intended
functionality that is modeled by the S/W design
Completeness is validated by performing requirements traceability at
the design phase and design traceability at the coding phase
◈ Reliability is a measure of the behavior (i.e., failures) that
prevents the S/W from delivering the designed
functionality
If the resulting S/W does not meet customer or market expectations,
yet operates with very few failures based on its requirements and
design, the S/W is still considered reliable
(v0.1) Ops A La Carte © 8
17. Software Defects That Affect Reliability
Sources
◈ Documentation: User Manual, Installation Guide, Technical Specs
◈ Development: Requirements, System Architecture, Designs, Source Code, …
◈ Validation: Unit Test Plans/Cases, System-Level Test Plans/Cases, Design Review Scenarios or Checklists, Code Review Scenarios or Checklists, S/W Failure Analysis Categories
Categories
◈ Soft Maintenance: Commenting, Style, Consistency, Standards/Guidelines, “Dead Code”, …
◈ Run-time Impacts: System outage, Loss of functionality, Annoyance, Cosmetic
◈ Run-time Failures: System outage, Loss of functionality, Loss of critical functionality
(v0.1) Ops A La Carte © 10
18. Terminology - Defect
◈ A flaw in S/W requirements, design or source code that
produces unintended or incomplete run-time behavior
Defects of commission
◘ Incorrect requirements are specified
◘ Requirements are incorrectly translated into a design model
◘ The design is incorrectly translated into source code
◘ The source code logic is flawed
Defects of omission (these are amongst the most difficult classes of defects to detect)
◘ Not all requirements were used in creating a design model
◘ The source code did not implement all of the design
◘ The source code has missing or incomplete logic
◈ Defects are static and can be detected and removed without
executing the source code
◈ Defects that cannot trigger S/W failures are not counted for
reliability purposes
These are typically quality defects that affect other aspects of S/W quality, such
as soft maintenance defects and defects in test cases or documentation
(v0.1) Ops A La Carte © 11
19. Terminology - Fault
◈ The result of triggering a S/W defect by executing the associated source code
Faults are NOT customer-visible
◘ Example: a memory leak, or a packet corruption that requires retransmission by the higher-layer stack
A fault may be the transitional state that results in a failure
◘ Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states
[Diagram: Defect → Fault]
(v0.1) Ops A La Carte © 12
20. Terminology - Failure
◈ A customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed S/W behavior
Failures are the visible, run-time symptoms of faults
◘ Failures MUST be observable by the customer or another operational system
Not all failures result in system outages
[Diagram: Defect → Fault → Failure]
(v0.1) Ops A La Carte © 13
21. Defect-to-Failure Transition
◈ Example
A S/W function (or method) processes the data stored in a memory buffer and then frees the allocated memory buffer back to the memory pool
A defect within this function (or method), when triggered, will fail to free the memory buffer before completion
[Diagram: one entry point, several logic branch points and 4 possible exit points; the defect lies on 1 of the many logic paths]
(v0.1) Ops A La Carte © 14
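The defect described above can be made concrete with a short sketch. This is illustrative Python only (the seminar presents no code); BufferPool, process_record and the empty-record check are hypothetical names, and the early return that skips release() plays the role of the defect on one of the many logic paths.

```python
# Hypothetical sketch of the slide's memory-buffer defect.
class BufferPool:
    def __init__(self, size):
        self.free = [bytearray(1024) for _ in range(size)]

    def acquire(self):
        if not self.free:
            raise RuntimeError("no buffers available")  # the eventual failure
        return self.free.pop()

    def release(self, buf):
        self.free.append(buf)


def process_record(pool, record):
    buf = pool.acquire()
    if not record:              # defect: this early exit skips release()
        return None             # -> fault: one buffer is silently lost
    buf[:len(record)] = record  # the other logic paths below free the buffer
    result = bytes(buf[:len(record)])
    pool.release(buf)
    return result
```

Repeatedly hitting the empty-record path reproduces the fault-to-failure lag shown on the next slides: each such call leaks one buffer, and the failure only surfaces later, when acquire() finds the pool empty.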
22. Defect-to-Failure Transition (continued)
◈ Most of the possible logic paths do not trigger the defect
If these are the only logic paths traversed by a customer, this portion
of the S/W will be considered very reliable
(v0.1) Ops A La Carte © 15
23. Defect-to-Failure Transition (continued)
◈ Fault transition
Eventually a logic path is executed that triggers the defect, resulting in a fault
being generated
◘ The function (or method) completes its execution
◘ The fault causes the system to lose track of a single memory buffer
◘ The system continues to operate without a visible impact
Since the fault causes no visible impact, a failure does NOT occur
(v0.1) Ops A La Carte © 16
24. Defect-to-Failure Transition (continued)
◈ Failure scenario
After sufficient memory buffers have been lost, the buffer pool reaches a critical condition where either:
◘ No buffers are available to satisfy another allocation request (there are still some buffers in use), or
◘ All buffers have been lost through leakage (no buffers will ever be freed for future allocation requests)
Once the next buffer allocation is requested, a failure occurs
◘ The system cannot continue to operate normally
Note the time lag between the triggering of the last fault and the occurrence of the associated failure
[Timeline: the fault is triggered at t1, t2, …, tN; the failure occurs later at tF]
(v0.1) Ops A La Carte © 17
25. Summary of Defects and Failures
◈ There are 3 types of run-time defects
1. Defects that are never executed (so they don’t trigger faults)
2. Defects that are executed and trigger faults that do NOT result in failures
3. Defects that are executed and trigger faults that result in failures
◈ Practical S/W Reliability focuses on defects that have the potential to cause failures by:
1. Detecting and removing defects that result in failures during development
2. Designing and implementing fault tolerance techniques to
◘ prevent faults from producing failures, or
◘ mitigate the effects of the resulting failures
(v0.1) Ops A La Carte © 18
27. Reliability and Failure Distributions
Restated, reliability is the probability that a system does not
experience a failure during a time interval, [0,T].
◈ Reliability is a measure of statistical probability, not certainty
Ex: A system has a 99% reliability over a period of 100 days
◘ Does this imply that only 1 failure will occur during the 100 day period?
◈ Reliability is based on failure distribution models
Represent the time distribution of failure occurrences
Various failure distribution models exist:
◘ Exponential (most commonly used in S/W reliability)
◘ Weibull
◘ Poisson
◘ Normal
◘ Rayleigh
◘ etc….
◈ Let’s examine an exponential failure distribution model
(v0.1) Ops A La Carte © 20
28. Failure Distributions - Exponential
◈ Exponential Reliability Function
The most widely used failure distribution is the exponential reliability function:
◘ Models a random distribution of failure occurrences
Defined by:
R(t) = e^(-λt)
[Plot: R(t) vs. t for λ = 0.1 failures/hr.]
where
◘ t is mission time
– the system is assumed to be operational at t = 0
– the mission duration is represented by T
◘ λ is a constant, instantaneous failure rate (or failure intensity)
◘ MTTF = 1 / λ (for repairable systems)
(v0.1) Ops A La Carte © 21
29. A Closer Look At The Exponential Distribution
[Plot: Reliability R(t) vs. time (hrs.)]
• Mission duration: T = 100 hours
• Failure rate: λ = 0.1 failures/hr. (or 1 failure every 10 hrs.)
• MTTF = 10 hrs.
At t = 1 hr., the reliability is 90%
When t = MTTF, the reliability is always 37%:
R = e^(-λt) = e^(-(1/MTTF)·MTTF) = e^(-1) = 37%
(v0.1) Ops A La Carte © 22
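A quick numerical check of the two values above; this is a minimal Python sketch (the seminar presents no code), assuming the exponential model R(t) = e^(-λt):

```python
import math

def reliability(t_hours, failure_rate):
    """Exponential reliability R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * t_hours)

lam = 0.1                          # failures/hr., so MTTF = 1/lam = 10 hrs.
print(reliability(1.0, lam))       # ~0.905 -> ~90% at t = 1 hr.
print(reliability(1.0 / lam, lam)) # ~0.368 -> ~37% at t = MTTF, for any lam
```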
30. A Closer Look at Reliability Values
◈ Based on an exponential failure distribution, what does it mean
for S/W to have 99% reliability after one year of operation?
For a single S/W product:
◘ There is a 99% probability that the S/W will still be operational after 1 year
– Conversely, there is a 1% chance of a failure during that period.
◘ Note that this value does NOT tell us when, during the 1 year period, that a
failure will occur.
– With the exponential distribution, as time progresses, the likelihood
(probability) of a failure increases.
For a group of software products (e.g., 100 products):
◘ 99% of the products will be operational after 1 year (e.g., 99 products)
◘ There is a 36.6% probability that all 100 products will be operational after
1 year
– This is computed by multiplying the reliabilities of all the products:
R_system(t) = R1(t) x R2(t) x … x R100(t)
= 0.99 x 0.99 x … x 0.99
= 0.366
(v0.1) Ops A La Carte © 23
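A one-line check of the 36.6% figure (illustrative Python):

```python
print(0.99 ** 100)   # 0.366..., the probability that all 100 products survive the year
```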
31. Sample Reliability Calculations
◈ What failure rate (λ) and MTTF are necessary to achieve this level of reliability (99% after 1 year)?
t = 1 yr. = 8,760 hrs.
R(t) = e^(-λt)
0.99 = e^(-λ · 8760)
ln(0.99) = -λ · 8760
λ = -ln(0.99) / 8760
= 1.1 x 10^-6 failures/hr. (1 failure every 99.5 years)
MTTF = 1 / λ = 871,613 hrs. (99.5 yrs.)
◈ What is the reliability at the MTTF?
t = MTTF = 871,613 hrs.
R(MTTF) = e^(-λ · MTTF) = e^(-(1/MTTF) · MTTF) = e^(-1) = 0.368 (~37%)
(v0.1) Ops A La Carte © 24
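The same calculation in a few lines of Python (an illustrative sketch, not part of the original material):

```python
import math

target_R = 0.99
t = 8760.0                      # one year of operation, in hours

lam = -math.log(target_R) / t   # required failure rate (failures/hr.)
mttf = 1.0 / lam
print(lam)                      # ~1.15e-06 failures/hr.
print(mttf / 8760)              # ~99.5 years
print(math.exp(-lam * mttf))    # ~0.368 -> reliability at the MTTF
```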
32. Software and Hardware Failure Rates
[Figure: failure rate vs. time curves for S/W and H/W]
◈ Software: the failure rate is driven by the effectiveness of S/W defect detection and repair processes over the span of many upgrades; its phases are Pre-release Testing, Useful Life (w/upgrades) and Obsolescence
◈ Hardware: the failure rate is driven by three very different physical failure domains; its phases are Burn-In, Useful Life and Wearout
Initial system deployment (i.e., completion of the Pre-release Testing and Burn-In phases) establishes a baseline for both the S/W (λSW-B) and H/W (λHW-B) failure rates
(v0.1) Ops A La Carte © 25
34. System Availability
Availability is the percentage of time that a system is operational,
accounting for planned and unplanned outages.
◈ Example: 90% Availability (for a timeframe T)
Logical Representation
◘ The system is operational for the first 90% of the timeframe and down for the last 10% of the timeframe
[Timeline: system operational (0.9T), then system non-operational (0.1T)]
Actual (or Possible) Representation
◘ 3 failures cause the system to be down for a total of 10% of the timeframe
[Timeline: three failure-occurs / system-restored cycles spread across the timeframe T]
(v0.1) Ops A La Carte © 27
35. System Availability (continued)
◈ System availability, A(T), is the relationship between the timeframes when a system is operational vs. down due to a failure-induced outage and is defined as:
A(T) = MTBF / (MTBF + MTTR)
where,
The system is assumed to be operational at time t = 0
T = MTBF + MTTR and 0 ≤ t ≤ T
MTBF (Mean Time Between Failures) is based on the failure rate
MTTR (Mean Time To Repair) is the duration of the outage (i.e., the expected time to detect, repair and then restore the system to an operational state)
(v0.1) Ops A La Carte © 28
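A small worked example of this relationship (illustrative Python; the MTBF/MTTR values are invented for the example):

```python
def availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

A = availability(mtbf_hours=1000.0, mttr_hours=2.0)
print(A)                          # ~0.998
print((1 - A) * 365 * 24 * 60)    # expected downtime in minutes/year (~1,049)
```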
36. Software Availability
◈ System outages that are caused by S/W can be attributed to:
1. Recoverable S/W failures
2. S/W upgrades
3. Unrecoverable S/W failures
NOTE: Recoverable S/W failures are the most frequent S/W cause of system outages
◈ For outages due to recoverable S/W failures, availability is defined as:
A(T) = MTTF / (MTTF + MTTR)
where,
MTTF is Mean Time To [next] Failure
MTTR (Mean Time To [operational] Restoration) is still the duration of the outage, but without the notion of a “repair time”. Instead, it is the time until the same system is restored to an operational state via a system reboot or some level of S/W restart.
(v0.1) Ops A La Carte © 29
37. Software Availability (continued)
◈ A(T) can be increased by either:
Increasing MTTF (i.e., increasing reliability) using S/W reliability practices
Reducing MTTR (i.e., reducing downtime) using S/W availability practices
◈ MTTR can be reduced by:
Implementing H/W redundancy (sparingly) to mask most likely failures
Increasing the speed of failure detection (the key step)
S/W and system recovery speeds can be increased by implementing Fast Fail
and S/W restart designs
◘ Modular design practices allow S/W restarts to occur at the smallest
possible scope, e.g., thread or process vs. system or subsystem
◘ Drastic reductions in MTTR are only possible when availability is part of the
initial system/software design (like redundancy)
◈ Customers generally perceive enhanced S/W availability as a S/W
reliability improvement
Even if the failure rate remains unchanged
(v0.1) Ops A La Carte © 30
38. System Availability Timeframes
Availability Class | Availability (unavailability range) | Downtime, timeframe = 1 year | Downtime, timeframe = 3 months
(1) Unmanaged | 90% (1 nine) | 36.5 days/year (52,560 mins/year) | 9.13 days
(2) Managed (good web servers) | 99% (2 nines) | 3.65 days/year (5,256 mins/year) | 21.9 hours
(3) Well-managed | 99.9% (3 nines) | 8.8 hours/year (525.6 mins/year) | 2.19 hours
(4) Fault Tolerant (better commercial systems) | 99.99% (4 nines) | 52.6 mins/year | 13.14 minutes
(5) High-Availability (high-reliability products) | 99.999% (5 nines) | 5.3 mins/year | 1.31 minutes
(6) Very-High-Availability | 99.9999% (6 nines) | 31.5 secs/year (2.6 mins/5 years) | 7.88 seconds
(7) Ultra-Availability | 99.99999% (7 nines) to 99.9999999% (9 nines) | 3.2 secs/year to 31.5 millisecs/year (15.8 secs/5 years or less) | 0.79 seconds or less
(v0.1) Ops A La Carte © 31
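The downtime figures follow directly from (1 − A) × timeframe; a minimal Python check (illustrative, not part of the original slides):

```python
MINUTES_PER_YEAR = 365 * 24 * 60            # 525,600

for nines in range(1, 8):
    availability = 1 - 10 ** -nines         # 0.9, 0.99, ..., 0.9999999
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{nines} nines: {downtime_min:,.1f} min/year")
# 1 nine ~ 52,560 min/year; 3 nines ~ 525.6 min/year; 5 nines ~ 5.3 min/year
```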
40. Software Robustness
Software Robustness is a measure of the software’s ability to
handle exceptional input conditions so they do not become failures.
◈ Exceptional input conditions result from:
Inputs that violate data value constraints
Inputs that violate data relationships
Inputs that violate the application’s timing requirements
◈ Robust S/W prevents exceptional inputs from:
1. Causing a system outage
2. Producing a silent failure by providing no indication that an exceptional input
condition was detected, thus allowing for the failure to propagate
3. Generating an error condition or response that incorrectly characterizes the
exceptional input condition
◈ S/W robustness becomes increasingly important as a system
becomes more flexible and the product’s customer base increases
in size and usage diversity
(v0.1) Ops A La Carte © 33
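As an illustration of the three points above (no outage, no silent failure, no misleading error), here is a minimal input-validation sketch in Python; the field names, limits and exception class are invented for the example:

```python
class InvalidInputError(ValueError):
    """Raised (and reported) when an input violates its constraints."""

def apply_setpoint(raw: dict) -> float:
    # Data value constraint: temperature must be numeric and in range.
    try:
        temp = float(raw["temperature_c"])
    except (KeyError, TypeError, ValueError) as exc:
        raise InvalidInputError(f"missing or non-numeric temperature: {raw!r}") from exc
    if not -40.0 <= temp <= 125.0:
        raise InvalidInputError(f"temperature {temp} out of range [-40, 125]")

    # Data relationship constraint: the low limit must not exceed the high limit.
    if raw.get("low_limit", 0) > raw.get("high_limit", 0):
        raise InvalidInputError("low_limit exceeds high_limit")

    return temp
```

The point is not the specific checks but that every exceptional input is detected, characterized accurately and reported, so it cannot propagate as a silent failure.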
41. Why Is Software Robustness Important ?
[Figure: users #1 through #n supply the program’s input set; the exceptional inputs (Ie, IErr) are mapped by the program to erroneous outputs (Oe, OErr) in the output set]
(v0.1) Ops A La Carte © 34
42. Software Robustness Studies
◈ 2 studies of S/W robustness
Examined exceptional input condition testing of POSIX-compliant OSes and UNIX
command line utilities
Robustness testing was repeated on multiple releases containing fixes for the
reported exceptional input failures
◈ Findings
Failure rates associated with robustness testing were significant,
◘ Ranging from 10% - 33%
After many significant, focused S/W fixes over multiple releases, failure rates
still remained high
◈ Conclusions
Traditional functional testing does not adequately test for exceptional input
conditions
Operational profiles testing also does not adequately test for exceptional input
conditions
◘ (Reason) Operational profile testing prioritizes and sets limits on functional
testing.
Specific techniques are required to provide adequate test coverage and handling
of exceptional input conditions
(v0.1) Ops A La Carte © 35
44. Software Fault Tolerance
The ability of software to avoid executing a fault in a way that
results in a system failure.
◈ Despite the best development efforts, almost all systems are
deployed with defects with the potential to produce critical
failures
A major study of S/W defects showed that 1% of the customer-reported failures
in the 1st year produced system outages
◈ Fault tolerance increases the fault-resistant quality of a system
during run-time by
Detecting faults at the earliest possible point of execution
Containing the damaging effects of a fault to the smallest possible scope
Performing the most reliable recovery action possible
◈ Fault tolerant designs focus on handling “complex” failures
Address defects that are not likely to be triggered during testing
(v0.1) Ops A La Carte © 37
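A small illustrative sketch of those three steps (detect early, contain the scope, recover); this is generic Python and not a technique prescribed by the seminar:

```python
import logging

def fault_tolerant_call(worker, request, retries=1, fallback=None):
    """Detect a fault early, contain it to this request, then recover."""
    for attempt in range(retries + 1):
        try:
            result = worker(request)
            if result is None:               # detect a bad result as early as possible
                raise RuntimeError("worker returned no result")
            return result
        except Exception as exc:             # contain: only this request is affected
            logging.warning("fault on %r (attempt %d): %s", request, attempt + 1, exc)
    return fallback                          # recover: degrade gracefully, no outage
```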
46. Reliable Software Characteristics Summary
◈ Operates within the reliability specification that satisfies customer
expectations
Measured in terms of failure rate and availability level
The goal is rarely “defect free” or “ultra-high reliability”
◈ “Gracefully” handles erroneous inputs from users, other systems,
and transient hardware faults
Attempts to prevent state or output data corruption from “erroneous” inputs
◈ Quickly detects, reports and recovers from S/W and transient
H/W faults
S/W provides the system behavior of continuously monitoring, “self-diagnosing”
and “self-healing”
Prevents as many run-time faults as possible from becoming system-level
failures
(v0.1) Ops A La Carte © 39
48. Software Design For Reliability (DfR) Seminar
A “Best Practices”
Approach to
Developing
Reliable Software
George de la Fuente
georged@opsalacarte.com
(408) 828-1105
www.opsalacarte.com
49. Most Common Paths to Reliable Software
1. Rely on H/W redundancy to mask out all S/W faults
The most attractive and expensive approach
Provides increased system-level reliability using an availability
technique
Requires minimal S/W reliability
2. “Testing In” reliability
The most prevalent approach
Limited and inefficient approach to defect detection and removal
◘ System testing will leave at least 30% of the code untested
◘ System testing will detect at best ~55% of all run-time failures
Most companies don’t continue testing until their reliability targets are
reached
◘ The testing phase is usually fixed in duration before the S/W is
developed and is focused on defect removal not reliability testing
S/W engineers will spend more than 1/2 of their time in the test phase
using this approach
(v0.1) Ops A La Carte © 2
50. S/W Design for Reliability
3. S/W Design for Reliability
The least utilized and understood approach
Common methodologies:
1) Formal methods
2) Programs based on H/W reliability practices
3) S/W process control
4) Augmenting traditional S/W development with “best practices”
(v0.1) Ops A La Carte © 3
51. Formal Methods
◈ Formal Methods (not commonly used for commercial SW)
Methodologies for system behavior analysis and proof of correctness
◘ Utilize mathematical modeling of a system’s requirements and/or
design
Primarily used in the development of safety-critical systems that
require very high degrees of:
◘ Confidence in expected system performance
◘ Quality audit information
◘ Targets of low or near zero failure rates
Formal methods are not applicable to most S/W projects
◘ Cannot be used for all aspects of system design (e.g., user
interface design)
◘ Do not scale to handle large and complex system development
◘ Mathematical requirements exceed the background of most S/W
engineers
(v0.1) Ops A La Carte © 4
52. Using Hardware Reliability Practices
◈ S/W and H/W development practices are still fundamentally
different
The H/W lifecycle primarily focuses on architecture and design modeling
S/W design modeling tools are rarely used
◘ Design-level simulation verification is limited
– Especially if a real-time operating system is required
◘ S/W engineers still challenge the value of generating complete designs
– This is why S/W design tools support 2-way code generation
Inherent S/W faults stem from the design process
◘ There is no aspect of faults from manufacturing or wear-out
◈ S/W is not built as an assembly of preexisting components
True S/W component “reuse” is rare
◘ Most “reused” S/W components are at least “slightly” modified
◘ Modified “reused” S/W components are not certified before use
S/W components are not developed to a specified set of reliability characteristics
3rd party S/W components do not come with reliability characteristics
(v0.1) Ops A La Carte © 5
53. Hardware Reliability Practices
◈ …. assembly of preexisting components (continued)
Acceleration mechanisms do not exist for S/W reliability testing
Extending S/W designs after product deployment is commonplace
◘ H/W is designed to provide a stable, long-term platform
◘ S/W is designed with the knowledge that it will host frequent product
customizations and extensions
◘ S/W updates provide fast development turnaround and have little or no
manufacturing or distribution costs
H/W failure analysis techniques (FMEAs and FTAs) are rarely successfully applied
to S/W designs
◘ S/W engineers find it difficult to adapt these techniques below the system
level
(v0.1) Ops A La Carte © 6
54. Software Process Control Methodologies
◈ S/W process control assumes a correlation between the maturity of the development process and the latent defect density in the final S/W
CMM Level | Defects/KLOC | Estimated Reliability
5 | 0.5 | 99.95%
4 | 1.0 – 2.5 | 99.75% – 99.9%
3 | 2.5 – 3.5 | 99.65% – 99.75%
2 | 3.5 – 6.0 | 99.4% – 99.65%
1 | 6.0 – 60.0 | 94% – 99.4%
◈ Process audits and stricter controls are implemented if the current process level does not yield the desired S/W reliability
Process root cause analysis may not yield great improvement
◘ Practices within the processes must be fine-tuned (but how??)
Reliability improvement under this type of methodology is slow
◘ Process outcome cannot vary too much in either direction
(v0.1) Ops A La Carte © 7
56. Sources of Industry Data
Data was derived from a large-scale international survey of S/W
lifecycle quality spanning:
18 years (1984-2002)
12,000+ projects
600+ companies
◘ 30+ government/military organizations
8 classes of software applications:
1. Systems S/W
2. Embedded S/W
3. Military S/W
4. Commercial S/W
5. Outsourced S/W
6. Information Technology (IT) S/W
7. End-User developed personal S/W
8. Web-based S/W
(v0.1) Ops A La Carte © 9
57. Terminology
◈ Best Practice
A key S/W quality practice that significantly contributes towards increasing S/W
reliability
◈ Best in Class Companies
Companies that have the following two characteristics:
◘ Recognized for producing S/W-based products with the lowest failure rate
in their industry
◘ Consistently deploying software based on their initial schedule targets
◈ Formal practice
A S/W quality development practice that is well-understood and consistently
implemented throughout the software development organization.
◘ Note: Formal practices are rarely undocumented.
◈ Informal practice
A S/W quality development practice that is either implemented with varying
degrees of rigor or in an inconsistent manner throughout the software
development organization.
◘ Note: Informal practices are usually accompanied by the absence of
documented guidelines or standards.
(v0.1) Ops A La Carte © 10
58. “Best in Class” Company Best Practices
◈ S/W Life Cycle Practices
Consistent implementations of the entire S/W lifecycle phases
(requirements, design, code, unit test, system test and maintenance)
◈ Requirements
Involve test engineers in requirements reviews
Define quality and reliability targets
Define negative requirements (i.e., “shall nots”)
◈ Development phase defect removal
Formal inspections (requirements, design, and code)
Failure analysis
◈ Design
Team or group-oriented approach to design for the system and S/W
◘ NOTE: System design team includes other disciplines (e.g., H/W & Mech)
(v0.1) Ops A La Carte © 11
59. “Best in Class” Company Best Practices (continued)
◈ Testing
Robust Testing strategy to meet business / customer requirements
Test plans completed and reviewed before the coding phase
Mandatory developer unit testing
Independently verify/test every software change (enhancements and fixes)
Create formal test plans for all medium and large-sized projects
Staff an independent and dedicated SQA team to at least 5% of the size of the S/W
development team
Generate quality or reliability estimates
Incorporate automated test tools into the test cycle
◈ S/W Quality Assurance
Review and prioritize all changes after the development phase
Record and track all changes to S/W artifacts throughout the life cycle
Formalize unit testing reviews (test plans and results)
Implement active quality assurance programs
Root-cause analysis with resolution follow-up
Gather and review customer product feedback
(v0.1) Ops A La Carte © 12
60. “Best in Class” Company Best Practices (continued)
◈ SCM and Defect Tracking
Implement formal change management of artifact changes and S/W releases
Incorporate automated defect tracking tools
◈ Metrics and Measurements
Record and track all defects and failures
Collect field data for root cause analysis on next project or release iteration
Measure code test coverage
Generate metrics based on code attributes (e.g., size and complexity)
Generate defect removal efficiency measurements
Track “bad fixes”
(v0.1) Ops A La Carte © 13
61. Weaknesses in S/W Development Practices
◈ Lack of engineer “ownership” for development and test practices
Limited efficiency and effectiveness improvements made
May lead to disjoint practices, resulting in no real “common” practices
◈ System design is “H/W-centric”
Primary focus on H/W feasibility, functionality and performance
Architectural reviews are not collaborative, team design sessions
S/W requirements of the H/W platform are generally not entertained or
implemented
◈ S/W defect removal relies mostly on system or subsystem-level
testing
Development phase defect removal is limited to cursory code reviews and sparse
unit testing
◘ Designs and design reviews are satisfied using functional or interface
specifications
No causal analysis is performed to improve future defect removal
(v0.1) Ops A La Carte © 14
62. Weaknesses in S/W Development Practices
◈ Limited system and S/W quality measurements and metrics
Use of default defect tracking tool statistics as primary metrics/measurements
Generally no data mining capability available for analysis
◈ Informal SQA processes and staffing leads to wasted efforts and
incomplete coverage
Too many trivial defects still present during system test phase
Defect fixes that introduce additional defects are frequent
S/W is shipped with many untested sections
Significant, recurring, “real world” customer scenarios remain untested
◈ Limited or no tool support for:
Unit testing
Automated regression testing
S/W analysis (static, dynamic, and coverage)
(v0.1) Ops A La Carte © 15
63. Application Behavior Patterns
S/W Quality Methods | System S/W | Embedded S/W
Summary | Overall, best S/W quality results | Wide range of S/W quality results
Defect Removal Efficiency | Usually > 96% | Up to > 94%
Project Sizes | Best quality results found in projects with > 550 KLOCS | Most projects are < 26.5 KLOCS
Inspections | Formal design and code inspections | Usually do not implement both design and code inspections (and not formally)
Test Teams | Independent SQA team | Usually do not have separate SQA teams
Measurement Control | Formal S/W quality measurement process and tools | Informal S/W quality measurement processes and tools
Change Control | Formal change control process and tools | Informal change control process and tools
Test plans | Formal test plans | Usually do not implement formal test plans
Unit Testing | Performed by developers | Performed by developers
Testing Stages | 6 to 10 test stages (performed by SQA team) | 3 to 6 test stages (usually performed by developers)
Governing Processes | CMM/CMMI and Six-Sigma methods | No consistent pattern found
(v0.1) Ops A La Carte © 16
64. Application Behavior Patterns
S/W Quality Methods | System S/W | Commercial S/W
Summary | Overall, best S/W quality results | Wide range of S/W quality results
Defect Removal Efficiency | Usually > 96% | Up to > 90%
Project Sizes | Best quality results found in projects with > 550 KLOCS | Most projects are > 275 KLOCS
Inspections | Formal design and code inspections | Inconsistent use of formal design or code inspections
Test Teams | Independent SQA team | Inconsistent use of independent SQA teams
Measurement Control | Formal S/W quality measurement process and tools | Informal S/W quality measurement processes and tools
Change Control | Formal change control process and tools | Formal change control process and tools
Test plans | Formal test plans | Formal test plans
Unit Testing | Performed by developers | Performed by developers
Testing Stages | 6 to 10 test stages (performed by SQA team) | 3 to 8 test stages (extensive reliance on Beta trials)
Governing Processes | CMM/CMMI and Six-Sigma methods | No consistent pattern found
(v0.1) Ops A La Carte © 17
66. Defect Origin and Discovery
Typical Behavior
[Figure: defects originate across Requirements, Design, Coding, Testing and Maintenance, but most are discovered only during Testing and Maintenance. Surprise!]
Goal of Best Practices on Defect Discovery
[Figure: defects are discovered in the same phase in which they originate (Requirements, Design, Coding, Testing, Maintenance)]
(v0.1) Ops A La Carte © 19
67. Software Defect Removal Techniques
Defect Removal Technique Efficiency Range
Design inspections 45% to 60%
Code inspections 45% to 60%
Unit testing 15% to 45%
Regression test 15% to 30%
Integration test 25% to 40%
Performance test 20% to 40%
System testing 25% to 55%
Acceptance test (1 customer) 25% to 35%
◈Development organizations try to find and remove more defects by
implementing more stages of system testing
Since there is a wide range of overlap between test stages, this approach
becomes less efficient as it scales
(v0.1) Ops A La Carte © 20
68. Defect Removal Technique Impact
Each column is a different combination of defect removal techniques (Y = used, n = not used):
Design Inspections / Reviews | n | n | n | n | Y | n | Y | Y
Code Inspections / Reviews | n | n | n | Y | n | n | Y | Y
Formal SQA Processes | n | Y | n | n | n | Y | n | Y
Formal Testing | n | n | Y | n | n | Y | n | Y
Median Defect Efficiency | 40% | 45% | 53% | 57% | 60% | 65% | 85% | 99%
This large potential available from design and code inspections/reviews is why most
development organizations see greater improvements in S/W reliability from investments
in the development phase than from further investments in testing.
NOTE: Design review results are based on low-level design reviews.
(v0.1) Ops A La Carte © 21
69. Case Study: Quantifying the Software Quality Investment
Objective:
Develop a value-based approach to determine the necessary
S/W quality investment using dependability attributes.
Methodology:
Use an integrated approach of project cost and quality
estimation models (COCOMO II and COQUALMO) and
empirically-based business value relationship factoring to
analyze the data from a diverse set of 161 well-measured,
S/W projects.
Findings:
The methodology was able to correlate the optimal S/W
project quality investment level and strategy to the required
project reliability level based on defect impact.
The objectives were satisfied without using specific S/W
reliability practices, relying instead heavily on defect detection
during system testing.
(v0.1) Ops A La Carte © 22
71. Reliability, Development Cost & Test Time Tradeoffs
The relative cost per source instruction to achieve a Very High RELY rating is less
than the amount of additional testing time that is required (54%), since early
defect prevention reduces the required rework effort and allows for additional
testing time.
(v0.1) Ops A La Carte © Based on COCOMO II (Constructive Cost Model) 24
72. Delivered Defects Scale
◈ Very Low rating delivers roughly the same number of defects as are introduced
◈ Extra High rating reduces the delivered defects by a factor of 37
Note: The assumed nominal defect introduction rate is 60 defects/KSLOC based
on the following distribution:
10 requirements defects/KSLOC
20 design defects/KSLOC
30 coding defects/KSLOC
(v0.1) Ops A La Carte © Based on COQUALMO (Constructive Quality Model) 25
73. Defect Removal Factors Scale
Rating levels for the three defect removal factors (Automated Analysis; Peer Reviews; Execution Testing and Tools):
◈ Very Low
   Automated Analysis: Compiler-based simple syntax checking
   Peer Reviews: No peer reviews
   Execution Testing and Tools: No testing
◈ Low
   Automated Analysis: Basic compiler capabilities
   Peer Reviews: Ad-hoc informal walk-throughs
   Execution Testing and Tools: Ad-hoc testing and debugging
◈ Nominal
   Automated Analysis: Compiler extensions; basic requirements and design consistency
   Peer Reviews: Well-defined sequence for preparation, review, and minimal follow-up
   Execution Testing and Tools: Basic test, test data management and problem-tracking support; test criteria based on checklists
◈ High
   Automated Analysis: Intermediate-level module and intermodule analysis; simple requirements and design
   Peer Reviews: Formal review roles and well-trained participants using basic checklists and follow-up procedures
   Execution Testing and Tools: Well-defined test sequences tailored to the organization; basic test coverage tools and test support system; basic test process management
◈ Very High
   Automated Analysis: More elaborate requirements and design; basic distributed-processing and temporal analysis, model checking, and symbolic execution
   Peer Reviews: Basic review checklists and root cause analysis; formal follow-up using historical data on inspection rates, preparation rates, and fault density
   Execution Testing and Tools: More advanced test tools and test data preparation, basic test oracle support, distributed monitoring and analysis, and assertion checking; metrics-based test process management
◈ Extra High
   Automated Analysis: Formalized specification and verification; advanced distributed-processing analysis
   Peer Reviews: Formal review roles and procedures; extensive review checklists and root cause analysis; continuous review-process improvement
   Execution Testing and Tools: Highly advanced tools for test oracles, distributed monitoring and analysis, and assertion checking; integration of automated analysis and test tools; model-based test process management; statistical process control
(v0.1) Ops A La Carte © Derived from COQUALMO (Constructive Quality Model) 26
74. DfR Based on “Best Practices”
◈ S/W Life Cycle Practices
   Modified existing best practices: Consistent implementations of the entire S/W lifecycle phases (requirements, design, code, unit test, system test and maintenance)
   New best practices: Reliability testing as part of the overall testing strategy; define reliability goals as requirements
◈ Metrics and Measurements
   Modified existing best practices: Record and track all defects and failures; collect field data for root cause analysis on the next project or release iteration
   New best practices: Generate defect removal efficiency measurements; track fix rationale such as “bad” fixes or untested code; collect failure data for analysis during the system test phase
◈ Development Phase Practices
   Modified existing best practices: Reviews of design and code; targeted developer unit testing
   New best practices: Assess designs for availability; perform failure analysis
◈ Testing
   Modified existing best practices: Independently verify/test every S/W change (enhancements and fixes)
   New best practices: Generate reliability estimates
◈ SQA
   Modified existing best practices: Record and track all changes to S/W artifacts throughout the life cycle
   New best practices: Perform failure root-cause analysis
(v0.1) Ops A La Carte © 27
75. Summary: DfR Based on “Best Practices”
◈ The strength of a “best practices” approach is its intuitiveness
Incorporate considerations of essential functionality and failure behavior in order
to understand failure modes and improve availability
Perform design analysis to identify potential failure points and, where possible,
redesign to remove failure points or reduce their impact
Analyze S/W for critical failure trigger points and remove them or reduce their
impact and frequency where possible
Plan testing to maximize the overall S/W verification prior to field deployment
Let measured data drive changes to reliability practices
◈ Focus on the removal of critical failures instead of all defects
S/W with known defects and faults may still be perceived as reliable by users
◘ NASA studies identified projects that produced reliable S/W with only 70% of
the code tested
Removing X% of the defects in a system will not necessarily improve the
reliability by X%.
◘ One IBM study showed that removing 60% of the product’s defects resulted
in only a 3% reliability improvement
◘ S/W defects in rarely executed sections of code may never be encountered
by users and therefore may not improve reliability
– Exceptions for essential operations: boot, shutdown, data backup, etc.
(v0.1) Ops A La Carte © 28
78. Metrics Supporting Reliability Strategies
Common strategies and tactics used by teams developing
highly reliable software products
Explicit, robust reliability requirements during requirements phase
Appropriate use of fault tolerant techniques in product design
Robust design/operational requirements for maximizing product
Availability
Focused, targeted (data driven) defect inspection program
Robust testing strategy and program:
well defined focused mix of unit, regression, integration, system,
exploratory and reliability demonstration testing
Robust defect tracking/metrics program focused on the important few
◘ Defect tracking and analysis for all phases of a product’s life,
including post-shipment defects/failures (FRACAS)
(v0.1) Ops A La Carte © 2
79. Reliability Measurements and Metrics
◈ Definitions
Measurements – data collected for tracking or to calculate meta-data (metrics)
◘ Ex: defect counts by phase, defect insertion phase, defect detection phase
Metrics – information derived from measurements (meta-data)
◘ Ex: failure rate, defect removal efficiency, defect density
◈ Reliability measurements and metrics accomplish several goals
Provide estimates of S/W reliability prior to customer deployment
Track reliability growth throughout the life cycle of a release
Identify defect clusters based on code sections with frequent fixes
Determine where to focus improvements based on analysis of defect/failure data
Note: S/W Configuration Management (SCM) and defect tracking
tools should be updated to facilitate the automatic tracking
of this information
◘ Allow for data entry in all phases, including development
◘ Distinguish code base updates for critical defect repair vs. any other
changes, (e.g., enhancements, minor defect repairs, coding standards
updates, etc.)
(v0.1) Ops A La Carte © 3
80. Critical Measurements To Collect
Measurement | Description
Critical Defects by Phase | Number of critical defects found during each non-operational phase (i.e. requirements, design, and coding)
Critical Failures by Phase | Number of critical failures found during each operational phase (i.e. unit testing, system testing, and field)
Critical Defect Insertion Phase | The phase where the critical defect (or critical failure) was inserted (or originated)
Critical Defect Detection Phase | The phase where the critical defect (or critical failure) was detected (or reported)
Critical Defect Major Location | A high-level indicator of a critical defect’s location within the source code (e.g., a S/W component or file name)
Critical Defect Minor Location | A low-level indicator of a critical defect’s location within the source code (e.g., the name of a class, method or data object)
Critical Failure Time | The time when a critical failure occurred since the beginning of a test run (typically measured in CPU or wall time)
Critical Failure Root Cause | The relevant failure category for a specified critical failure
(v0.1) Ops A La Carte © 4
81. Metrics To Track
Metric | Description
Critical Defect Density | The number of defects per KLOC (1,000 lines of commented source code)
Critical Defect Removal Efficiency (CDRE) | The percentage of defects identified within a given life cycle period
Critical Failure Rate | The mean number of failures occurring within a reference period
Current Defect Demographics | Current open defect demographics of the current code base, including defects by severity, module, fix backlog, etc.
Failure/Defect Arrival Rates | Trends (e.g., defects vs. test time interval) of newly detected failures and/or defects
Bad Fixes | The number of failures caused as side effects of fixes to logged defects
Unverified Code Fixes | The number of failures caused by code that was neither reviewed nor tested
Failure Root Cause Distribution | The summary of failure category distributions for Pareto root cause analysis
(v0.1) Ops A La Carte © 5
82. Software Defect Distributions
Average distribution of all types of S/W defects by lifecycle phase:
20% Requirements
30% Design
35% Coding
10% Bad Defect Fixes (introduction of secondary defects)
5% Customer Documentation
◘ 50% of all S/W defects are introduced before coding
◘ 1 in 10 defects fixed during testing were unintended side effects of a previous defect “fix”
Average distribution of S/W defects escalated from the field (based on 1st year field defect report data):
1% Severity 1 (catastrophic)
20% Severity 2 (serious)
35% Severity 3 (minor)
44% Severity 4 (annoyance or cosmetic)
◘ Only ~20% of the customer-reported S/W defects are targets for reliability improvements
(v0.1) Ops A La Carte © 6
83. Typical Defect Tracking (System Test)
System Test Build | Severity #1 Defects Found | Severity #2 Defects Found | Severity #3 Defects Found | Severity #4 Defects Found | Total Defects Found
SysBuild-1 | 7 | 9 | 16 | 22 | 54
SysBuild-2 | 5 | 5 | 14 | 26 | 50
SysBuild-3 | 4 | 6 | 8 | 16 | 34
… | … | … | … | … | …
SysBuild-7 | 0 | 1 | 4 | 6 | 11
(v0.1) Ops A La Carte © 7
84. Defect Removal Efficiency
◈ Critical defect removal efficiency (CDRE) is a key reliability measure
CDRE = Critical Defects Found / Critical Defects Present
◈ “Critical Defects Present” is the sum of the critical defects found in all phases as a result of reviews, testing and customer/field escalations
System testing stages include integration, functional, loading, performance, acceptance, etc.
Customer trials can be considered either a system testing stage, a preliminary (but separate) field deployment phase, or a part of the field deployment phase
◘ Depending on the rationale for the trials
The field deployment phase is measured as the first year following deployment
◘ The average life span of a S/W release, since most S/W releases are separated by increments no longer than 1 year
[Figure: the lifecycle phases (Requirements, Design, Coding, Unit Testing, System & Subsystem Testing Stages, Field Deployment) can be grouped into Review / Testing / Field efficiencies, Development / System Testing / Field efficiencies, or Internal / Field efficiencies]
(v0.1) Ops A La Carte © 8
85. CDRE Example
Critical defects found, by origin (Metrics #1 grouping: Internal vs. Field):
Requirements Reviews | 20
Design Reviews | 30
Code Reviews | 40
Unit Testing | 25
System & Subsystem Testing | 55
   (Internal subtotal: 170)
Field Deployment | 40
   (Field subtotal: 40)
TOTAL | 210
Metric | Removal Efficiency
Internal Efficiency | 81% = (170 / 210)
Field Efficiency | 19% = (40 / 210)
(v0.1) Ops A La Carte © 9
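The groupings on this and the following two slides can be reproduced with a few lines of Python (a minimal illustrative sketch; the phase names simply mirror the example):

```python
found = {
    "Requirements Reviews": 20,
    "Design Reviews": 30,
    "Code Reviews": 40,
    "Unit Testing": 25,
    "System & Subsystem Testing": 55,
    "Field Deployment": 40,
}
total = sum(found.values())                        # 210 critical defects present

def cdre(phases):
    """Critical Defect Removal Efficiency for a group of phases."""
    return sum(found[p] for p in phases) / total

internal = [p for p in found if p != "Field Deployment"]
print(f"Internal efficiency: {cdre(internal):.0%}")              # 81%
print(f"Field efficiency:    {cdre(['Field Deployment']):.0%}")  # 19%
```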
86. CDRE Example
Critical defects found, by origin (Metrics #2 grouping: Development vs. System Testing vs. Field):
Requirements Reviews | 20
Design Reviews | 30
Code Reviews | 40
Unit Testing | 25
   (Development subtotal: 115)
System & Subsystem Testing | 55
   (System Testing subtotal: 55)
Field Deployment | 40
   (Field subtotal: 40)
TOTAL | 210
Metric | Removal Efficiency
Development Efficiency | 55% = (115 / 210)
System Testing Efficiency | 26% = (55 / 210)
Field Efficiency | 19% = (40 / 210)
(v0.1) Ops A La Carte © 10
87. CDRE Example
Critical defects found, by origin (Metrics #3 grouping: Review vs. Testing vs. Field):
Requirements Reviews | 20
Design Reviews | 30
Code Reviews | 40
   (Review subtotal: 90)
Unit Testing | 25
System & Subsystem Testing | 55
   (Testing subtotal: 80)
Field Deployment | 40
   (Field subtotal: 40)
TOTAL | 210
Metric | Removal Efficiency
Review Efficiency | 43% = (90 / 210)
Testing Efficiency | 38% = (80 / 210)
Field Efficiency | 19% = (40 / 210)
(v0.1) Ops A La Carte © 11
88. Sample Project Reliability Measurement Tracking
At the end of the project Unit Testing phase
Rows give the phase in which defects/failures were found; the right-hand columns break the critical defects/failures out by the phase in which they were inserted.
Phase Found | Total Defects | Critical Defects | Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | – | – | – | – | –
Design | 123 | 45 | 6 | 39 | – | – | – | –
Code | 158 | 62 | 4 | 12 | 46 | – | – | –
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | – | –
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | – | –
Integration Test, System Test, Testing Totals and Field Reports Totals: not yet populated at this point in the project
(v0.1) Ops A La Carte © 12
89. Sample Project Reliability Measurement Tracking
1 year after the end of the System Testing phase
Rows give the phase in which defects/failures were found; the right-hand columns break the critical defects/failures out by the phase in which they were inserted.
Phase Found | Total Defects | Critical Defects | Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | – | – | – | – | –
Design | 123 | 45 | 6 | 39 | – | – | – | –
Code | 158 | 62 | 4 | 12 | 46 | – | – | –
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | – | –
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | – | –
Integration Test | 43 | 13 | 0 | 4 | 7 | 1 | 1 | –
System Test | 183 | 47 | 2 | 13 | 28 | 0 | 4 | –
Testing Totals | 226 | 60 | 2 | 17 | 35 | 1 | 5 | –
Field Reports Totals | 70 | 35 | 1 | 8 | 22 | 0 | 3 | 1
Release Summary | 720 | 239 | 26 | 81 | 120 | 3 | 8 | 1
(v0.1) Ops A La Carte © 13
90. Sample Project Reliability Measurement Tracking
1 year after the end of the System Testing phase
Phase Found | Total Defects | Critical Defects | Reqmts | Design | Code | Unit Test | Test | Field
Requirements | 75 | 12 | 12 | – | – | – | – | –
Design | 123 | 45 | 6 | 39 | – | – | – | –
Code | 158 | 62 | 4 | 12 | 46 | – | – | –
Unit Test | 78 | 25 | 1 | 5 | 17 | 2 | – | –
Development Totals | 434 | 144 | 23 | 56 | 63 | 2 | – | –
Integration Test | 45 | 13 | 0 | 4 | 7 | 1 | 1 | –
System Test | 189 | 47 | 2 | 13 | 28 | 0 | 4 | –
Testing Totals | 234 | 60 | 2 | 17 | 35 | 1 | 5 | –
Field Reports Totals | 77 | 35 | 1 | 8 | 22 | 0 | 3 | 1
Release Summary | 745 | 239 | 26 | 81 | 120 | 3 | 8 | 1
Design DRE Measurements (Design insertion column):
• 39 critical defects found in-phase
• 56 critical defects found during development
• 17 critical defects found during testing
• 81 critical defects found overall
Design DRE Metrics:
• 48% in-phase DRE (= 39/81)
• 69% development DRE (= 56/81)
• 90% in-house DRE (= (56 + 17)/81)
(v0.1) Ops A La Carte © 14
91. Sample Project Reliability Measurement Tracking
1 year after the end of the System Testing phase
[Table repeated from the previous slide, with the Unit Test, Test and Field insertion columns highlighted]
Bad Fixes Measurements:
• 1 critical defect inserted during unit testing and found during system-level testing
• 5 critical defects inserted and found during system-level testing
• 0 critical defects inserted during unit testing and found during field deployment
• 3 critical defects inserted during system-level testing and found during field deployment
• 1 critical defect inserted and found during field deployment
• 60 total critical defects/failures found during system-level testing
• 35 total critical defects/failures found in the field
Bad Fixes Metrics:
• 10% of the test phase failures are from Bad Fixes (= (1 + 5)/60)
• 11% of the field phase failures are from Bad Fixes (= (0 + 3 + 1)/35)
• 11% of the test and field failures are from Bad Fixes (= (1 + 5 + 0 + 3 + 1)/(60 + 35))
(v0.1) Ops A La Carte © 15
92. Sample Project Reliability Metrics
Metrics
Phase In-phase Overall
CDD Bad Fixes
DRE DRE
Requirements 29% 46%
Design 51% 48%
58%
Code 56% 38% 84%
Unit Test 18% 17%
Testing 26% 10%
11%
Field 16% 16% 11%
(v0.1) Ops A La Carte © 16
93. Sample Project Reliability Metrics
Metrics
Phase In-phase Overall
CDD Bad Fixes
DRE DRE
Requirements 29% 46%
Design 51% 48%
58%
Code 56% 38% 84%
Unit Test 18% 17%
Testing 26% 10%
11%
Field 16% 16% 11%
Sample Phase 1 Goals (50% improvement):
• (reliability) increase in-house DRE to 92%
• (efficiency) reduce field bad fixes to 5%
(v0.1) Ops A La Carte © 17
94. Distribution of Defects Across Files
◈ There is a Pareto-like distribution of defects across files within a
module
Defects cluster in a relatively small number of files
Conversely, more than half of the files have almost no critical defects
(v0.1) Ops A La Carte © 18
95. Failure Density Analysis
◈ In general, there is a Pareto distribution (80/20) of defects across files
◈ Failure density analysis provides a mechanism for early detection of “problematic” sections of code (i.e., defect clusters)
Improving the reliability of these “problematic” modules through testing alone can consume as much as 4 times the effort of a redesign
◘ Less time is required to re-inspect and either restructure or redesign these modules than to effectively “beat them into submission” through testing
The goal is to identify “problematic” code as early as possible
◘ Perform failure density analysis early on during unit testing
– Since different sections of code may be problematic during system
testing, the analysis should be repeated near the middle of this phase
◘ SCM and defect tracking tools can be modified to provide this information
without much effort
– Source code can be analyzed to display “hot spot” histograms using
number of changes and/or number of failures
– Heuristics for failure density thresholds must be developed to
determine when action should be taken
(v0.1) Ops A La Carte © 19
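A minimal sketch of the “hot spot” idea described above (illustrative Python; the failure records, field names and threshold are assumptions, not part of the seminar material):

```python
from collections import Counter

# Hypothetical failure records exported from a defect tracking tool.
failures = [
    {"id": 101, "file": "buffer_pool.c"},
    {"id": 102, "file": "buffer_pool.c"},
    {"id": 103, "file": "parser.c"},
    {"id": 104, "file": "buffer_pool.c"},
]

per_file = Counter(rec["file"] for rec in failures)
total = sum(per_file.values())

# Flag files holding a disproportionate share of failures (threshold is a heuristic).
THRESHOLD = 0.30
hot_spots = [(f, n) for f, n in per_file.most_common() if n / total >= THRESHOLD]
print(hot_spots)   # [('buffer_pool.c', 3)] -> candidates for re-inspection or redesign
```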
96. Failure Density Distribution by File
[Histogram: % of total reported failures per source code file (F1 … F23); a small number of files contain the majority of the reported failures and should be proactively analyzed for possible redesign, restructuring or additional defects]
(v0.1) Ops A La Carte © 20
97. Applying Causal Analysis to Defect Measurements
◈ Causal analysis (RCA) can be applied to defect
measurements to improve defect removal effectiveness
and efficiency
Usually performed between life cycle iterations or S/W releases
Upstream defect removal practices should be reviewed in light of the
defects that were not detected in each phase
◘ Requires knowing the phases where a defect was introduced and
detected
◘ Simple guidelines should be defined to determine the phase
where the defect was introduced
Defects are categorized by types
◘ Can determine if the problem is systemic to one or more
categories, or
◘ Whether the problem is an issue of raising overall defect removal
efficiency for a development phase
(v0.1) Ops A La Carte © 21
98. Causal Analysis Process
(one-shot analysis)
Objective: Create an initial, rough distribution of the defects and identify
potential improvements to existing defect removal practices.
Process Outline (typically 6-8 hours over two days):
◈ Select a team from the senior engineers and MSEs of the development and test teams
◈ Analysis process:
Before the meeting, select a representative sample of approximately 50-75
defects from each development and test phase
Convene the meeting and explain the objectives and process to the team
Classify the defects
◘ Start by walking the team through the classification of one defect
◘ Divide the defects into small groups and assign each person 2 groups of
defects (to force analysis overlap)
◘ Upon completion, collect and process the data offline in preparation for
team analysis and review
Analyze the defect types using a histogram, look for a Pareto distribution, and
select the most prevalent defect types
Develop recommendations for improvements to specific defect removal practices
Implement the recommendations at the next possible opportunity and gather
measurement data
(v0.1) Ops A La Carte © 22
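A small illustrative sketch of the histogram/Pareto step (Python; the sampled defect types are invented for the example):

```python
from collections import Counter

# Hypothetical classified sample from the two-day analysis session.
sampled_types = ["data handling", "logic", "interface", "logic",
                 "data handling", "logic", "error checking", "logic"]

counts = Counter(sampled_types)
total = len(sampled_types)

cumulative = 0.0
for defect_type, n in counts.most_common():   # Pareto ordering, most frequent first
    cumulative += n / total
    print(f"{defect_type:15s} {n:2d}  cumulative {cumulative:.0%}")
# The few types that dominate the cumulative share become the focus of
# the recommended defect removal practice improvements.
```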
99. The Orthogonal Defect Analysis Framework
ORIGIN (Where?): SPEC/RQMTS, DESIGN, CODE, ENV. SUPT., DOCUMENTATION, OTHER
TYPE (What?), by origin:
◈ Spec/Rqmts: Requirements or Specifications, Functionality, HW Interface, SW Interface, User Interface, Functional Description
◈ Design: Proc. (Interproc.) Communications, Data Definition, Module Design, Logic Description, Error Checking, Standards
◈ Code: Logic, Computation, Data Handling, Module Interface/Implementation, Standards
◈ Env. Supt.: Test SW, Test HW, Development Tools, Integration SW
• Other can also be a type classification for any origin
MODE (Why?): MISSING, UNCLEAR, WRONG, CHANGED, BETTER WAY
(v0.1) Ops A La Carte © 23