Software Reliability Overview
- 1. Software Design For Reliability (DfR) Seminar
An Overview of Software Reliability
Bob Mueller
bobm@opsalacarte.com
www.opsalacarte.com
- 2. Software Quality and Software Reliability
Related Disciplines, Yet Very Different
- 3. Definition of Software Quality
Software quality: the level to which the software characteristics conform to all of the specifications.
ISO 9126 Quality Model - factors and their criteria:
Functionality: suitability, accuracy, interoperability, security
Usability: understandability, learnability, operability, attractiveness
Reliability: maturity, fault tolerance, recoverability
Efficiency: time behavior, resource utilization
Maintainability: analysability, changeability, stability, testability
Portability: adaptability, installability, co-existence, replaceability
- 4. Most Common Misconception
[ISO 9126 Quality Model figure, as on the previous slide]
What organizations believe they are doing:
"We have a strong SW quality program. We don't need to add SW reliability practices."
What the organizations are really doing:
Implementing only a sparse set of SW quality practices.
What is missing:
Implementing sufficient SW reliability practices to satisfy customer expectations.
- 6. Software Reliability Can Be Measured
Software reliability is roughly 20 years behind HW reliability. Contributing factors:
◈ Ramifications of failure
◈ Education on the consumer side
Many consumers simply expect unreliable SW
◈ Education on the manufacturer's side
Mfgs aren't aware of new, innovative methods
Mfgs don't determine how users will actually use the product
◈ Software engineers are more free-spirited than HW engineers
◈ Entry cost for a SW development team is lower than for HW
- 7. Reliability vs. Cost
[Chart: cost vs. reliability. Reliability program costs rise with increasing reliability while HW warranty costs fall; the total cost curve reaches an optimum cost point between the two.]
Note: the SW impact on HW warranty costs is minimal at best.
- 8. Reliability vs. Cost, continued
◈ SW has no associated manufacturing costs, so warranty costs and savings are almost entirely allocated to HW
◈ If there are no cost savings associated with improving software reliability, why not leave it as is and focus on improving HW reliability to save money?
One study found that the root causes of typical embedded system failures were SW, not HW, by a ratio of 10:1.
Customers buy systems, not just HW.
◈ The benefits of a SW Reliability Program are not direct cost savings, but rather:
Increased SW/FW staff availability and reduced operational schedules, resulting from less corrective maintenance work
Increased customer goodwill based on improved customer satisfaction
- 10. Software Reliability Definitions
The customer perception of the software's ability to deliver the expected functionality in the target environment without failing.
◈ Examine the key points
◈ Practical rewording of the definition
Software reliability is a measure of the software failures that are visible to a customer and prevent a system from delivering essential functionality.
- 11. Software Reliability Can Be Measured
◈ Measurements are a required foundation
Differs from quality which is not defined by measurements
All measurements and metrics are based on run-time failures
◈ Only customer-visible failures are targeted
Only defects that produce customer-visible failures affect reliability
Corollaries
◘ Defects that do not trigger run-time failures do NOT affect reliability
– badly formatted or commented code
– defects in dead code
◘ Not all defects that are triggered at run-time produce customer-visible
failures
– corruption of any unused region of memory
◈ SW Reliability evolved from HW Reliability
SW Reliability focuses only on design reliability
HW Reliability has no counterpart to this
- 12. Software Reliability Is Based On Usage
◈ SW failure characteristics are derived from the usage profile of a
particular customer or set of customers
Each usage profile triggers a different set of run-time SW faults and failures
◈ Example
Examine product usage by 2 different customers
◘ Customer A’s usage profile only exercises the sections of SW that produce
very few failures.
◘ Customer B’s usage profile overlaps with Customer A’s usage profile, but
additionally exercises other sections of SW that produce many, frequent
failures.
Customer assessment of the product’s software reliability
◘ Customer A’s assessment - the SW reliability is high
◘ Customer B’s assessment - the SW reliability is low
- 13. Reliability ≠ Correctness
◈ Correctness is a measure of the degree of intended functionality
implemented by the SW
Correctness measures the completeness of requirements and the accuracy of
defining a SW model based on these requirements
◈ Reliability is a measure of the behavior (i.e., failures) that
prevents the software from delivering the implemented
functionality
- 15. Terminology
◈ Defect
A flaw in the requirements, design or source code that produces implementation logic that will trigger a fault
◘ Defects of omission
– Not all requirements were used in creating a design model
– The design satisfies all requirements but is incomplete
– The source code did not implement all the design
– The source code has missing or incomplete logic
◘ Defects of commission
– Incorrect requirements are specified
– Requirements are incorrectly translated into a design model
– The design is incorrectly translated into source code
– The source code logic is flawed
Defects are static and can be detected and removed without
executing the source code
Defects that cannot trigger a SW failure are not tracked or measured
◘ Ex: quality defects, such as test case and soft maintenance
defects, and defects in “dead code”
- 16. Terminology (continued)
◈ Fault
The result of triggering a SW defect by executing the associated implementation logic
◘ Faults are NOT always visible to the customer
◘ A fault can be the transitional state that results in a failure
◘ Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states
◈ Failure
A customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed SW behavior
◘ Failures MUST be observable by the customer or an operational system
◘ Failures are the visible, run-time symptoms of faults
◘ Not all failures result in system outages
[Sidebar diagram: Defect → Fault → Failure]
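To make the defect / fault / failure distinction concrete, here is a minimal, hypothetical C sketch (the function and data are invented for illustration and are not from the seminar):

```c
#include <stdio.h>

/* Hypothetical illustration of the terminology above.
 * DEFECT: the "<=" loop bound is an off-by-one flaw in the source code.
 * It is static; an inspection could find it without ever running the code. */
static void copy_readings(const int *src, int count, int *dst)
{
    for (int i = 0; i <= count; i++) {   /* DEFECT: should be i < count */
        dst[i] = src[i];                 /* FAULT at run time when i == count:
                                            writes one slot past the intended
                                            region of dst */
    }
}

int main(void)
{
    int samples[8] = {10, 20, 30, 40, 0, 0, 0, 0};
    int buffer[8]  = {0};   /* slots 4..7 are allocated but unused */

    /* The fault lands in buffer[4], an unused slot, so there is no
     * customer-visible FAILURE and reliability is unaffected, even though
     * the defect was triggered. */
    copy_readings(samples, 4, buffer);
    printf("first reading: %d\n", buffer[0]);

    /* If buffer were sized exactly to the data, the same fault would corrupt
     * adjacent state and could surface as a visible FAILURE (wrong output or
     * a crash), which is what reliability measures. */
    return 0;
}
```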
- 17. Basic Failure Classification
◈ High-level SW failure classification based on complexity
and time-sensitivity of triggering the associated defect:
Bohr Bugs
Heisen Bugs
Aging Bugs
◈ Bohr Bugs
Named after the “Bohr” atom
◘ Connotation: Deterministic failures that are straightforward to isolate
Failures are easily reproducible, even after a system restart/reboot
Most frequent failure category detected during development, testing and early
deployment
These are considered “trivial” defects since every execution of the associated
logic results in a failure
- 18. Basic Failure Classification (continued)
◈ Heisen Bugs
Named after the Heisenberg uncertainty principle
◘ Connotation: Failures that are difficult to isolate to a root cause
Intermittent failures that are rarely triggered and difficult to reproduce.
Unlikely to reoccur following a system restart/reboot
Common root causes:
◘ Synchronization boundaries between SW components
◘ Improper or insufficient exception handling
◘ Interdependent timing of multiple events
Rarely detected when the SW is not mature (i.e., during early development and
testing phases)
The best methods for dealing with these “tough” defects are
◘ Identification using SW failure analysis
◘ Impact mitigation using fault tolerant code
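A minimal, hypothetical C sketch of a Heisen bug (the counter and iteration counts are illustrative): the missing synchronization only loses updates under particular thread interleavings, so the failure is intermittent and typically vanishes when timing changes.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared state updated by two threads. No mutex protects it; that missing
 * synchronization is the DEFECT. */
static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        /* DEFECT: unsynchronized read-modify-write. The FAULT (a lost update)
         * occurs only on an unlucky interleaving; guarding the increment with
         * a pthread mutex would remove it. */
        counter++;
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 200000. A smaller value is the customer-visible FAILURE, and
     * whether it appears at all depends on scheduling: classic Heisen bug
     * behavior that often disappears under a debugger or after a restart. */
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}
```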
- 19. Basic Failure Classification (continued)
◈ Aging Bugs
Attributed to the results of continuous, long-term operations or use
◘ Connotation: Failures resulting from accumulation of erroneous conditions
Transient failures occur after extended run-time or functional cycles where the
contributing faults have occurred numerous times
Preceding faults may lead to system performance degradation before a failure
occurs
Extremely unlikely to reoccur following a system restart/reboot due to the
longevity requirement
Common root causes:
◘ Deterioration in the availability of OS resources (e.g., depletion of device
handles, memory leaks, heap fragmentation)
◘ Data corruption
◘ Application race conditions
◘ Accumulation of numerical round-off errors
◘ Gradual data accumulation for sampling or queue build-up
The best methods for dealing with these “tough” defects are
◘ Identification using SW failure analysis
◘ Impact mitigation using fault tolerant code
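A minimal, hypothetical C sketch of an aging bug (the request handler and sizes are invented): each cycle that takes the early-return path leaks a little heap memory, so no single execution fails, but after enough accumulated cycles an allocation finally fails and a restart “cures” the symptom by resetting the accumulated state.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical request handler. The DEFECT is the early return that skips
 * free(), so a small amount of heap is lost on every request that takes that
 * path. Each individual FAULT is invisible; the FAILURE only appears after
 * long-term continuous operation. */
static int handle_request(size_t payload_size, unsigned long cycle)
{
    char *scratch = malloc(payload_size);
    if (scratch == NULL) {
        return -1;                      /* resources finally exhausted */
    }

    /* ... process the request using scratch ... */

    if (cycle % 3 == 0) {
        return 0;                       /* DEFECT: scratch is never freed */
    }

    free(scratch);
    return 0;
}

int main(void)
{
    /* Simulate long-term operation; in a deployed system this accumulation
     * could take weeks before the degradation becomes visible. */
    for (unsigned long cycle = 0; cycle < 100000000UL; cycle++) {
        if (handle_request(4096, cycle) != 0) {
            fprintf(stderr, "allocation failed after %lu cycles\n", cycle);
            return 1;
        }
    }
    printf("no failure observed in this run\n");
    return 0;
}
```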
- 21. Reliable Software Characteristics
◈ Operates within the reliability specification that satisfies customer
expectations
Measured in terms of failure rate and availability level
The goal is rarely “defect free” or “ultra-high reliability”
◈ “Gracefully” handles erroneous inputs from users, other systems,
and transient hardware faults
Attempts to prevent state or output data corruption from “erroneous” inputs
◈ Quickly detects, reports and recovers from SW and transient HW
faults
SW makes the system behave as if it is “continuously monitoring, self-diagnosing and self-healing”
Prevents as many run-time faults as possible from becoming system-level
failures
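As a rough, hypothetical C sketch of these characteristics (the sensor, limits and thresholds are invented): erroneous inputs are rejected rather than propagated, and a simple monitor detects repeated faults and recovers the component before they grow into a system-level failure.

```c
#include <stdio.h>
#include <stdbool.h>

#define TEMP_MIN (-40)   /* assumed valid sensor range, for illustration */
#define TEMP_MAX (125)

static int last_good_temp = 25;

/* "Graceful" handling of an erroneous input: reject it, report it, and fall
 * back to the last known-good value instead of corrupting downstream state. */
static int read_temperature(int raw)
{
    if (raw < TEMP_MIN || raw > TEMP_MAX) {
        fprintf(stderr, "fault: temperature %d out of range, using %d\n",
                raw, last_good_temp);
        return last_good_temp;
    }
    last_good_temp = raw;
    return raw;
}

/* Self-monitoring / self-healing: after several consecutive failed health
 * checks, reinitialize the component instead of waiting for an outage. */
static void monitor_component(bool (*healthy)(void), void (*restart)(void))
{
    static int consecutive_failures = 0;

    if (healthy()) {
        consecutive_failures = 0;
        return;
    }
    if (++consecutive_failures >= 3) {
        fprintf(stderr, "fault detected: restarting component\n");
        restart();
        consecutive_failures = 0;
    }
}

/* Stubs standing in for a real device driver. */
static bool sensor_ok(void)      { return false; }
static void sensor_restart(void) { fprintf(stderr, "sensor reinitialized\n"); }

int main(void)
{
    printf("reading = %d\n", read_temperature(200));  /* erroneous input */
    for (int i = 0; i < 4; i++) {
        monitor_component(sensor_ok, sensor_restart);
    }
    return 0;
}
```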
- 22. Common Paths to Software Reliability
◈ Traditional SW Reliability Programs - Predictions
Program directed by a separate team of reliability engineers
Development process viewed as a SW-generating, black box
◘ Develop prediction models to estimate the number of faults in the SW
Reliability techniques used to identify defects and produce SW reliability metrics
◘ Traditional HW failure analysis techniques, e.g., FMEAs or FTAs
◘ Defect estimation and tracking
◈ SW Process Control
Based on the assumption of a correlation between development process
maturity and latent defect density in the final SW
◘ Ex: CMM Level 3 organizations can develop SW with 3.5 defects/KSLOC
If the current process level does not yield the desired SW reliability, audits and
stricter process controls are implemented
◈ Quality Through SW Testing
Most prevalent approach for implementing SW reliability
Assumes reliability is increased by expanding the types of system tests (e.g.,
integration, performance and loading) and increasing the duration of testing
Measured by counting and classifying defects
- 23. Common Paths to Software Reliability (continued)
◈ These approaches generally do not provide a complete solution
Reliability prediction models are not well-understood
SW engineers find it difficult to apply HW failure analysis techniques to detailed
SW designs
Only 20% of the SW defects identified by quality processes during development
(e.g., code inspections) affect reliability
System testing is an inefficient mechanism for finding run-time failures
◘ Generally identifies no more than 50% of run-time failures
Quality processes for tracking defects do not produce SW reliability information
such as defect density and failure rates
◈ Net Effect:
SW engineers still end up spending more than 50% of their time debugging,
instead of focusing on designing or implementing source code
- 25. Software Defect Distributions
Average distribution of SW defects by lifecycle phase:
20% Requirements
30% Design
35% Coding
10% Bad Defect Fixes (introduction of secondary defects)
5% Customer Documentation
Average distribution of SW defects at the time of field deployment:
(based on 1st year field defect report data)
1% Severity 1 (catastrophic)
20% Severity 2 (major)
35% Severity 3 (minor)
44% Severity 4 (annoyance)
- 26. Typical Defect Tracking (System Test)
System Test Build | Severity #1 Defects Found | Severity #2 Defects Found | Severity #3 Defects Found | Severity #4 Defects Found | Total Defects Found
SysBuild-01 | 7 | 9 | 16 | 22 | 54
SysBuild-02 | 5 | 5 | 14 | 26 | 50
SysBuild-03 | 4 | 6 | 8 | 16 | 34
... | ... | ... | ... | ... | ...
SysBuild-07 | 0 | 1 | 4 | 6 | 11
- 27. Defect Origin and Discovery
Typical behavior: defects originate in Requirements, Design and Coding, but most are not discovered until Testing and Maintenance. Surprise!
Goal of best practices: defects are discovered in the same phase in which they originate (Requirements, Design, Coding, Testing, Maintenance).
- 28. Defect Removal Efficiencies
◈ Defect removal efficiency is a key reliability measure
Removal efficiency = (defects found) / (defects present)
◈ “Defects present” is the critical parameter that is based on inspections, testing and
field data
System & Subsystem Field
Requirements Design Coding Unit Testing
Testing Stages Deployment
Inspection Efficiency Overall
Testing Efficiency
Efficiency
Example:
Origin | Defects Found | Metric | Removal Efficiency
Inspections | 90 | Inspection Efficiency | 43% = (90 / 210)
Unit Testing | 25 | Testing Efficiency | 38% = (80 / 210)
System & Subsystem Testing | 55 | Overall Efficiency | 81% = (170 / 210)
Field Deployment | 40 | |
TOTAL | 210 | |
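The example above can be restated as a small C sketch (the defect counts are copied from the slide; the function name is ours):

```c
#include <stdio.h>

/* Removal efficiency = defects found / defects present, expressed in %. */
static double removal_efficiency(int found, int present)
{
    return 100.0 * (double)found / (double)present;
}

int main(void)
{
    /* Defect counts from the example above. */
    const int inspections = 90, unit_test = 25, system_test = 55, field = 40;
    const int total = inspections + unit_test + system_test + field;  /* 210 */

    printf("Inspection efficiency: %.0f%%\n",
           removal_efficiency(inspections, total));                   /* 43% */
    printf("Testing efficiency:    %.0f%%\n",
           removal_efficiency(unit_test + system_test, total));       /* 38% */
    printf("Overall efficiency:    %.0f%%\n",
           removal_efficiency(inspections + unit_test + system_test,
                              total));                                /* 81% */
    return 0;
}
```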
- 29. Reliability Defect Tracking (All Phases)
Activity | Total Failures Found | Total Critical Failures Found | Critical Defect Density | Reqmts Critical Defects Found | Design Critical Defects Found | Code Critical Defects Found | Unit Test Critical Failures Found | System Test Critical Failures Found
Reqmts | 75 | 12 | 16% | 12 | | | |
Design | 123 | 45 | 37% | 4 | 41 | | |
Code | 158 | 72 | 46% | 4 | 6 | 62 | |
Unit Test | 78 | 25 | 35% | 1 | 4 | 18 | 2 |
Development Totals | 434 | 154 | | 21 | 51 | 80 | 2 |
DRE (development) | | | | 57% | 80% | 78% | 100% |
System Test | 189 | 53 | 68% | 1 | 11 | 31 | 6 | 2
DRE (after system testing) | | | | 55% | 66% | 56% | 25% | 100%

(Each DRE row shows, for each origin column, the fraction of that origin's defects that were removed in the phase where they originated, out of all defects of that origin found so far. For example, 41 of the 51 design-origin defects known at the end of development were found during design, giving 80%.)
- 30. Defect Removal Technique Impact
Design Inspections / Reviews | n | n | n | n | Y | n | Y | Y
Code Inspections / Reviews | n | n | n | Y | n | n | Y | Y
Formal SQA Processes | n | Y | n | n | n | Y | n | Y
Formal Testing | n | n | Y | n | n | Y | n | Y
Median Defect Efficiency | 40% | 45% | 53% | 57% | 60% | 65% | 85% | 99%
- 31. Typical Defect Reduction Goals
[Chart: total defects found per system test build (SysBld #1 through SysBld #5), showing the targeted decline across successive builds; y-axis 0-200 defects.]
- 32. Design for Reliability
Goal: predict the defect totals for the next phase.
[Chart: defect totals across the lifecycle - Requirements, Design, Code, Unit Test (Development); SysBld #1 through SysBld #5 (System Test); Field Failures (Deployment); y-axis 0-200 defects.]
- 34. Goals of Reliability Practices
Reliability practices split the development lifecycle into 2 opposing phases:

Pre-deployment - focus on fault intolerance
Fault Avoidance Techniques: prevent defects from being introduced
Fault Removal Techniques: detect and repair faults
Goal: increase reliability by eliminating critical defects, thereby reducing the failure rate

Post-deployment - focus on fault tolerance
Fault Tolerance Techniques: allow a system to operate predictably in the presence of faults
System Restoration Techniques: quickly restore the operational state of a system in the simplest manner possible
Goal: increase availability by reducing or avoiding the effects of faults
- 35. Software Reliability Practices
Analysis:
Formal Scenario/Checklist Analysis
FRACAS
FMECA
FTA
Petri Nets
Change Impact Analysis
Common Cause Failure Analysis
Sneak Analysis

Design:
Formal Interface Specification
Defensive Programming
Fault Tolerance
Modular Design
Error Detection and Correction
Critical Functionality Isolation
Design by Contract
Cleanroom
Reliability Allocation
Design Diversity

Verification:
Boundary Value Analysis
Equivalence Class Partitioning
Reliability Growth Testing
Fault Injection Testing
Static/Dynamic Code Analysis
Coverage Testing
Usage Profile Testing
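To show what two of the design-column practices can look like in code, here is a small, hypothetical C sketch of Defensive Programming and Design by Contract (the queue type and capacity are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_CAPACITY 8

typedef struct {
    int items[QUEUE_CAPACITY];
    int count;
} queue_t;

/* Design by Contract: the caller promises q is valid and not full; the
 * asserts document and enforce that contract during development builds. */
static void queue_push(queue_t *q, int value)
{
    assert(q != NULL);                       /* precondition  */
    assert(q->count < QUEUE_CAPACITY);       /* precondition  */
    q->items[q->count++] = value;
    assert(q->count <= QUEUE_CAPACITY);      /* postcondition */
}

/* Defensive Programming: the same operation written to tolerate a bad call
 * at run time; it validates its inputs and reports failure instead of
 * corrupting state when the caller breaks the rules. */
static bool queue_push_checked(queue_t *q, int value)
{
    if (q == NULL || q->count >= QUEUE_CAPACITY) {
        return false;                        /* reject the erroneous request */
    }
    q->items[q->count++] = value;
    return true;
}

int main(void)
{
    queue_t q = { .items = {0}, .count = 0 };

    queue_push(&q, 1);                       /* contract honored */
    for (int i = 0; i < 20; i++) {           /* deliberately overfill */
        if (!queue_push_checked(&q, i)) {
            printf("push rejected at i=%d (queue full)\n", i);
            break;
        }
    }
    printf("queue holds %d items\n", q.count);
    return 0;
}
```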
- 36. Design and Code Inspections
◈ The original rationale for inspections (current payback):
“Inspections require less time and resources to detect and repair defects than
traditional testing and debugging”
Work done at Nortel Technologies in 1991 demonstrated that 65% to 90% of
operational defects were detected by inspections at 1/4 to 2/3 the cost of testing
◈ Soft maintenance rationale (future payback):
Data collected from 130 inspection sessions shows the long-term software maintenance benefits of inspections; findings were classified as follows:
True Defects - the code behavior was wrong, and an execution-affecting change was made to resolve it.
False Positives - any issue not requiring a code or document change.
Soft Maintenance Changes - any other issue that resulted in a code or document change, e.g., code restructuring or the addition of code comments.
- 37. Spectrum of Inspection Methodologies
Method / Originator | Team Size | # of Sessions | Detection Method | Collection Meeting | Post Process Feedback
Fagan | Large | 1 | Ad hoc | Yes (group oriented) | None
Bisant | Small | 1 | Ad hoc | Yes (group oriented) | None
Gilb | Large | 1 | Checklist | Yes (group oriented) | Root Cause Analysis
Meetingless Inspection | Large | 1 | Unspecified | No (individual oriented) | None
ADR | Small | >1 | Scenario | Yes (group oriented) | None
Britcher | Unspecified | 4, parallel | Scenario | Yes (group oriented) | None
Phased Inspection | Small | >1, sequential | Checklist (comp.) | No (meeting only to reconcile data) | None
N-fold | Small | >1, parallel | Ad hoc | Yes (group oriented) | None
Code Reading | Small | 1 | Ad hoc | No (meeting optional) | None
WOW! No wonder inspections are not well understood; there are too many methodologies.
AND, THERE ARE MORE OPTIONS…
- 38. Spectrum of Technical Review Methodologies
Inspections are just one of the many classes of Technical Review Methodologies.
• Informal ↔ Formal
• Individual initiative ↔ Team-oriented
• Small time commitment ↔ Multiple meetings and pre-meeting preparation
• General feedback ↔ Compliance with standards
• Defect detection ↔ Satisfies specifications

From least to most formal:
Adhoc Review → Peer Desk Check (Passaround Check) → Pairs Programming → Walkthrough → Team Review → Inspection
- 39. Why Isn’t Software Reliability Prevalent ??
“Those are very good ideas. We would like to implement them and
we know we should try. However, there just isn’t enough time.”
◈ The erroneous arguments all assume testing is the most effective
defect detection methodology
Results from inspections/reviews are generally poor
Engineers believe that testers will do a more thorough and efficient job of
testing than any effort they implement (inspections and unit testing)
Managers believe progress can be demonstrated faster and better once the SW
is in the system test phase
◈ Remember, just like the story of the lumberjack and his ax,
“If you don’t have time to do it correctly the first time,
then you must have time to do it over later!”
- 40. Software DfR Tools by Phase
Phase | Activities | Tools
Concept | Define SW reliability requirements | Benchmarking; Internal Goal Setting; Gap Analysis
Architecture & High Level Design | Modeling & predictions | SW Failure Analysis; SW Fault Tolerance Design; Human Factors Analysis
Low Level Design | Identify core, critical and vulnerable sections of the design; static detection of design defects | Human Factors Analysis; Derating Analysis; Worst Case Analysis
Coding | Static detection of coding defects | FRACAS; RCA
Unit Testing | Dynamic detection of design and coding defects | FRACAS; RCA
Integration and System Testing | SW statistical testing | FRACAS; RCA; SW Reliability Testing
Operations and Maintenance | Continuous assessment of product reliability | FRACAS; RCA