Software Reliability Overview
- 1. Software Design For Reliability (DfR) Seminar
An Overview of Software Reliability
Bob Mueller
bobm@opsalacarte.com
www.opsalacarte.com
- 2. Software Quality and Software Reliability
Related Disciplines, Yet Very Different
- 3. Definition of Software Quality
Software quality: the level to which the software characteristics conform to all of the specifications.
ISO 9126 Quality Model - factors and their criteria:
Functionality: suitability, accuracy, interoperability, security
Usability: understandability, learnability, operability, attractiveness
Reliability: maturity, fault tolerance, recoverability
Efficiency: time behavior, resource utilization
Maintainability: analysability, changeability, stability, testability
Portability: adaptability, installability, co-existence, replaceability
- 4. Most Common Misconception
[ISO 9126 Quality Model figure, as on the previous slide]
What organizations believe they are doing:
"We have a strong SW quality program. We don't need to add SW reliability practices."
What the organizations are really doing:
Implementing only a sparse set of SW quality practices.
What is missing:
Implementing sufficient SW reliability practices to satisfy customer expectations.
- 6. Software Reliability Can Be Measured
Software reliability is roughly 20 years behind HW reliability. Contributing factors:
◈ Ramifications of failure
◈ Education on the consumer side
Many consumers simply expect unreliable SW
◈ Education on the manufacturer's side
Mfgs aren't aware of new, innovative methods
Mfgs don't determine how users will actually use the product
◈ Software engineers are more free-spirited than HW engineers
◈ Entry cost for a SW development team is lower than for HW
- 7. Reliability vs. Cost
[Chart: cost vs. reliability. Reliability program costs rise with increasing reliability while HW warranty costs fall; the total cost curve reaches an optimum cost point between the two.]
Note: the SW impact on HW warranty costs is minimal at best.
- 8. Reliability vs. Cost, continued
◈ SW has no associated manufacturing costs, so warranty costs and savings are almost entirely allocated to HW
◈ If there are no cost savings associated with improving software reliability, why not leave it as is and focus on improving HW reliability to save money?
One study found that the root causes of typical embedded system failures were SW, not HW, by a ratio of 10:1.
Customers buy systems, not just HW.
◈ The benefits of a SW Reliability Program are not direct cost savings, but rather:
Increased SW/FW staff availability and reduced operational schedules, resulting from less corrective maintenance work
Increased customer goodwill based on improved customer satisfaction
- 10. Software Reliability Definitions
The customer perception of the software's ability to deliver the expected functionality in the target environment without failing.
◈ Examine the key points
◈ Practical rewording of the definition
Software reliability is a measure of the software failures that are visible to a customer and prevent a system from delivering essential functionality.
- 11. Software Reliability Can Be Measured
◈ Measurements are a required foundation
Differs from quality which is not defined by measurements
All measurements and metrics are based on run-time failures
◈ Only customer-visible failures are targeted
Only defects that produce customer-visible failures affect reliability
Corollaries
◘ Defects that do not trigger run-time failures do NOT affect reliability
– badly formatted or commented code
– defects in dead code
◘ Not all defects that are triggered at run-time produce customer-visible
failures
– corruption of any unused region of memory
◈ SW Reliability evolved from HW Reliability
SW Reliability focuses only on design reliability
HW Reliability has no counterpart to this
- 12. Software Reliability Is Based On Usage
◈ SW failure characteristics are derived from the usage profile of a
particular customer or set of customers
Each usage profile triggers a different set of run-time SW faults and failures
◈ Example
Examine product usage by 2 different customers
◘ Customer A’s usage profile only exercises the sections of SW that produce
very few failures.
◘ Customer B’s usage profile overlaps with Customer A’s usage profile, but
additionally exercises other sections of SW that produce many, frequent
failures.
Customer assessment of the product’s software reliability
◘ Customer A’s assessment - the SW reliability is high
◘ Customer B’s assessment - the SW reliability is low
- 13. Reliability ≠ Correctness
◈ Correctness is a measure of the degree of intended functionality
implemented by the SW
Correctness measures the completeness of requirements and the accuracy of
defining a SW model based on these requirements
◈ Reliability is a measure of the behavior (i.e., failures) that
prevents the software from delivering the implemented
functionality
- 15. Terminology
◈ Defect
A flaw in the requirements, design or source code that produces implementation logic that will trigger a fault
◘ Defects of omission
– Not all requirements were used in creating a design model
– The design satisfies all requirements but is incomplete
– The source code did not implement all the design
– The source code has missing or incomplete logic
◘ Defects of commission
– Incorrect requirements are specified
– Requirements are incorrectly translated into a design model
– The design is incorrectly translated into source code
– The source code logic is flawed
Defects are static and can be detected and removed without
executing the source code
Defects that cannot trigger a SW failure are not tracked or measured
◘ Ex: quality defects, such as test case and soft maintenance
defects, and defects in “dead code”
- 16. Terminology (continued)
◈ Fault
The result of triggering a SW defect by executing the associated implementation logic
◘ Faults are NOT always visible to the customer
◘ A fault can be the transitional state that results in a failure
◘ Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states
◈ Failure
A customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed SW behavior
◘ Failures MUST be observable by the customer or an operational system
◘ Failures are the visible, run-time symptoms of faults
◘ Not all failures result in system outages
[Sidebar diagram: Defect → Fault → Failure]
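To make the defect / fault / failure distinction concrete, here is a minimal, hypothetical C sketch (the function and data are invented for illustration and are not from the seminar):

```c
#include <stdio.h>

/* Hypothetical illustration of the terminology above.
 * DEFECT: the "<=" loop bound is an off-by-one flaw in the source code.
 * It is static; an inspection could find it without ever running the code. */
static void copy_readings(const int *src, int count, int *dst)
{
    for (int i = 0; i <= count; i++) {   /* DEFECT: should be i < count */
        dst[i] = src[i];                 /* FAULT at run time when i == count:
                                            writes one slot past the intended
                                            region of dst */
    }
}

int main(void)
{
    int samples[8] = {10, 20, 30, 40, 0, 0, 0, 0};
    int buffer[8]  = {0};   /* slots 4..7 are allocated but unused */

    /* The fault lands in buffer[4], an unused slot, so there is no
     * customer-visible FAILURE and reliability is unaffected, even though
     * the defect was triggered. */
    copy_readings(samples, 4, buffer);
    printf("first reading: %d\n", buffer[0]);

    /* If buffer were sized exactly to the data, the same fault would corrupt
     * adjacent state and could surface as a visible FAILURE (wrong output or
     * a crash), which is what reliability measures. */
    return 0;
}
```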
- 17. Basic Failure Classification
◈ High-level SW failure classification based on complexity
and time-sensitivity of triggering the associated defect:
Bohr Bugs
Heisen Bugs
Aging Bugs
◈ Bohr Bugs
Named after the “Bohr” atom
◘ Connotation: Deterministic failures that are straightforward to isolate
Failures are easily reproducible, even after a system restart/reboot
Most frequent failure category detected during development, testing and early
deployment
These are considered “trivial” defects since every execution of the associated
logic results in a failure
- 18. Basic Failure Classification (continued)
◈ Heisen Bugs
Named after the Heisenberg uncertainty principle
◘ Connotation: Failures that are difficult to isolate to a root cause
Intermittent failures that are rarely triggered and difficult to reproduce.
Unlikely to reoccur following a system restart/reboot
Common root causes:
◘ Synchronization boundaries between SW components
◘ Improper or insufficient exception handling
◘ Interdependent timing of multiple events
Rarely detected when the SW is not mature (i.e., during early development and
testing phases)
The best methods for dealing with these “tough” defects are
◘ Identification using SW failure analysis
◘ Impact mitigation using fault tolerant code
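A minimal, hypothetical C sketch of a Heisen bug (the counter and iteration counts are illustrative): the missing synchronization only loses updates under particular thread interleavings, so the failure is intermittent and typically vanishes when timing changes.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared state updated by two threads. No mutex protects it; that missing
 * synchronization is the DEFECT. */
static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        /* DEFECT: unsynchronized read-modify-write. The FAULT (a lost update)
         * occurs only on an unlucky interleaving; guarding the increment with
         * a pthread mutex would remove it. */
        counter++;
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 200000. A smaller value is the customer-visible FAILURE, and
     * whether it appears at all depends on scheduling: classic Heisen bug
     * behavior that often disappears under a debugger or after a restart. */
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}
```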
- 19. Basic Failure Classification (continued)
◈ Aging Bugs
Attributed to the results of continuous, long-term operations or use
◘ Connotation: Failures resulting from accumulation of erroneous conditions
Transient failures occur after extended run-time or functional cycles where the
contributing faults have occurred numerous times
Preceding faults may lead to system performance degradation before a failure
occurs
Extremely unlikely to reoccur following a system restart/reboot due to the
longevity requirement
Common root causes:
◘ Deterioration in the availability of OS resources (e.g., depletion of device
handles, memory leaks, heap fragmentation)
◘ Data corruption
◘ Application race conditions
◘ Accumulation of numerical round-off errors
◘ Gradual data accumulation for sampling or queue build-up
The best methods for dealing with these “tough” defects are
◘ Identification using SW failure analysis
◘ Impact mitigation using fault tolerant code
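A minimal, hypothetical C sketch of an aging bug (the request handler and sizes are invented): each cycle that takes the early-return path leaks a little heap memory, so no single execution fails, but after enough accumulated cycles an allocation finally fails and a restart “cures” the symptom by resetting the accumulated state.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical request handler. The DEFECT is the early return that skips
 * free(), so a small amount of heap is lost on every request that takes that
 * path. Each individual FAULT is invisible; the FAILURE only appears after
 * long-term continuous operation. */
static int handle_request(size_t payload_size, unsigned long cycle)
{
    char *scratch = malloc(payload_size);
    if (scratch == NULL) {
        return -1;                      /* resources finally exhausted */
    }

    /* ... process the request using scratch ... */

    if (cycle % 3 == 0) {
        return 0;                       /* DEFECT: scratch is never freed */
    }

    free(scratch);
    return 0;
}

int main(void)
{
    /* Simulate long-term operation; in a deployed system this accumulation
     * could take weeks before the degradation becomes visible. */
    for (unsigned long cycle = 0; cycle < 100000000UL; cycle++) {
        if (handle_request(4096, cycle) != 0) {
            fprintf(stderr, "allocation failed after %lu cycles\n", cycle);
            return 1;
        }
    }
    printf("no failure observed in this run\n");
    return 0;
}
```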
- 21. Reliable Software Characteristics
◈ Operates within the reliability specification that satisfies customer
expectations
Measured in terms of failure rate and availability level
The goal is rarely “defect free” or “ultra-high reliability”
◈ “Gracefully” handles erroneous inputs from users, other systems,
and transient hardware faults
Attempts to prevent state or output data corruption from “erroneous” inputs
◈ Quickly detects, reports and recovers from SW and transient HW
faults
SW makes the system behave as if it is “continuously monitoring, self-diagnosing and self-healing”
Prevents as many run-time faults as possible from becoming system-level
failures
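As a rough, hypothetical C sketch of these characteristics (the sensor, limits and thresholds are invented): erroneous inputs are rejected rather than propagated, and a simple monitor detects repeated faults and recovers the component before they grow into a system-level failure.

```c
#include <stdio.h>
#include <stdbool.h>

#define TEMP_MIN (-40)   /* assumed valid sensor range, for illustration */
#define TEMP_MAX (125)

static int last_good_temp = 25;

/* "Graceful" handling of an erroneous input: reject it, report it, and fall
 * back to the last known-good value instead of corrupting downstream state. */
static int read_temperature(int raw)
{
    if (raw < TEMP_MIN || raw > TEMP_MAX) {
        fprintf(stderr, "fault: temperature %d out of range, using %d\n",
                raw, last_good_temp);
        return last_good_temp;
    }
    last_good_temp = raw;
    return raw;
}

/* Self-monitoring / self-healing: after several consecutive failed health
 * checks, reinitialize the component instead of waiting for an outage. */
static void monitor_component(bool (*healthy)(void), void (*restart)(void))
{
    static int consecutive_failures = 0;

    if (healthy()) {
        consecutive_failures = 0;
        return;
    }
    if (++consecutive_failures >= 3) {
        fprintf(stderr, "fault detected: restarting component\n");
        restart();
        consecutive_failures = 0;
    }
}

/* Stubs standing in for a real device driver. */
static bool sensor_ok(void)      { return false; }
static void sensor_restart(void) { fprintf(stderr, "sensor reinitialized\n"); }

int main(void)
{
    printf("reading = %d\n", read_temperature(200));  /* erroneous input */
    for (int i = 0; i < 4; i++) {
        monitor_component(sensor_ok, sensor_restart);
    }
    return 0;
}
```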
- 22. Common Paths to Software Reliability
◈ Traditional SW Reliability Programs - Predictions
Program directed by a separate team of reliability engineers
Development process viewed as a SW-generating, black box
◘ Develop prediction models to estimate the number of faults in the SW
Reliability techniques used to identify defects and produce SW reliability metrics
◘ Traditional HW failure analysis techniques, e.g., FMEAs or FTAs
◘ Defect estimation and tracking
◈ SW Process Control
Based on the assumption of a correlation between development process
maturity and latent defect density in the final SW
◘ Ex: CMM Level 3 organizations can develop SW with 3.5 defects/KSLOC
If the current process level does not yield the desired SW reliability, audits and
stricter process controls are implemented
◈ Quality Through SW Testing
Most prevalent approach for implementing SW reliability
Assumes reliability is increased by expanding the types of system tests (e.g.,
integration, performance and loading) and increasing the duration of testing
Measured by counting and classifying defects
- 23. Common Paths to Software Reliability (continued)
◈ These approaches generally do not provide a complete solution
Reliability prediction models are not well-understood
SW engineers find it difficult to apply HW failure analysis techniques to detailed
SW designs
Only 20% of the SW defects identified by quality processes during development
(e.g., code inspections) affect reliability
System testing is an inefficient mechanism for finding run-time failures
◘ Generally identifies no more than 50% of run-time failures
Quality processes for tracking defects do not produce SW reliability information
such as defect density and failure rates
◈ Net Effect:
SW engineers still end up spending more than 50% of their time debugging,
instead of focusing on designing or implementing source code
- 25. Software Defect Distributions
Average distribution of SW defects by lifecycle phase:
20% Requirements
30% Design
35% Coding
10% Bad Defect Fixes (introduction of secondary defects)
5% Customer Documentation
Average distribution of SW defects at the time of field deployment:
(based on 1st year field defect report data)
1% Severity 1 (catastrophic)
20% Severity 2 (major)
35% Severity 3 (minor)
44% Severity 4 (annoyance)
- 26. Typical Defect Tracking (System Test)
System Test Build | Severity #1 Defects Found | Severity #2 Defects Found | Severity #3 Defects Found | Severity #4 Defects Found | Total Defects Found
SysBuild-01 | 7 | 9 | 16 | 22 | 54
SysBuild-02 | 5 | 5 | 14 | 26 | 50
SysBuild-03 | 4 | 6 | 8 | 16 | 34
... | ... | ... | ... | ... | ...
SysBuild-07 | 0 | 1 | 4 | 6 | 11
- 27. Defect Origin and Discovery
Typical behavior: defects originate in Requirements, Design and Coding, but most are not discovered until Testing and Maintenance. Surprise!
Goal of best practices: defects are discovered in the same phase in which they originate (Requirements, Design, Coding, Testing, Maintenance).
- 28. Defect Removal Efficiencies
◈ Defect removal efficiency is a key reliability measure
Removal efficiency = (defects found) / (defects present)
◈ “Defects present” is the critical parameter that is based on inspections, testing and
field data
System & Subsystem Field
Requirements Design Coding Unit Testing
Testing Stages Deployment
Inspection Efficiency Overall
Testing Efficiency
Efficiency
Example:
Origin | Defects Found | Metric | Removal Efficiency
Inspections | 90 | Inspection Efficiency | 43% = (90 / 210)
Unit Testing | 25 | Testing Efficiency | 38% = (80 / 210)
System & Subsystem Testing | 55 | Overall Efficiency | 81% = (170 / 210)
Field Deployment | 40 | |
TOTAL | 210 | |
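The example above can be restated as a small C sketch (the defect counts are copied from the slide; the function name is ours):

```c
#include <stdio.h>

/* Removal efficiency = defects found / defects present, expressed in %. */
static double removal_efficiency(int found, int present)
{
    return 100.0 * (double)found / (double)present;
}

int main(void)
{
    /* Defect counts from the example above. */
    const int inspections = 90, unit_test = 25, system_test = 55, field = 40;
    const int total = inspections + unit_test + system_test + field;  /* 210 */

    printf("Inspection efficiency: %.0f%%\n",
           removal_efficiency(inspections, total));                   /* 43% */
    printf("Testing efficiency:    %.0f%%\n",
           removal_efficiency(unit_test + system_test, total));       /* 38% */
    printf("Overall efficiency:    %.0f%%\n",
           removal_efficiency(inspections + unit_test + system_test,
                              total));                                /* 81% */
    return 0;
}
```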
- 29. Reliability Defect Tracking (All Phases)
Activity | Total Failures Found | Total Critical Failures Found | Critical Defect Density | Reqmts Critical Defects Found | Design Critical Defects Found | Code Critical Defects Found | Unit Test Critical Failures Found | System Test Critical Failures Found
Reqmts | 75 | 12 | 16% | 12 | | | |
Design | 123 | 45 | 37% | 4 | 41 | | |
Code | 158 | 72 | 46% | 4 | 6 | 62 | |
Unit Test | 78 | 25 | 35% | 1 | 4 | 18 | 2 |
Development Totals | 434 | 154 | | 21 | 51 | 80 | 2 |
DRE (development) | | | | 57% | 80% | 78% | 100% |
System Test | 189 | 53 | 68% | 1 | 11 | 31 | 6 | 2
DRE (after system testing) | | | | 55% | 66% | 56% | 25% | 100%

(Each DRE row shows, for each origin column, the fraction of that origin's defects that were removed in the phase where they originated, out of all defects of that origin found so far. For example, 41 of the 51 design-origin defects known at the end of development were found during design, giving 80%.)
- 30. Defect Removal Technique Impact
Design Inspections / Reviews | n | n | n | n | Y | n | Y | Y
Code Inspections / Reviews | n | n | n | Y | n | n | Y | Y
Formal SQA Processes | n | Y | n | n | n | Y | n | Y
Formal Testing | n | n | Y | n | n | Y | n | Y
Median Defect Efficiency | 40% | 45% | 53% | 57% | 60% | 65% | 85% | 99%
- 31. Typical Defect Reduction Goals
[Chart: total defects found per system test build (SysBld #1 through SysBld #5), showing the targeted decline across successive builds; y-axis 0-200 defects.]
- 32. Design for Reliability
Goal: predict the defect totals for the next phase.
[Chart: defect totals across the lifecycle - Requirements, Design, Code, Unit Test (Development); SysBld #1 through SysBld #5 (System Test); Field Failures (Deployment); y-axis 0-200 defects.]
- 34. Goals of Reliability Practices
Reliability practices split the development lifecycle into 2 opposing phases:

Pre-deployment - focus on fault intolerance
Fault Avoidance Techniques: prevent defects from being introduced
Fault Removal Techniques: detect and repair faults
Goal: increase reliability by eliminating critical defects, thereby reducing the failure rate

Post-deployment - focus on fault tolerance
Fault Tolerance Techniques: allow a system to operate predictably in the presence of faults
System Restoration Techniques: quickly restore the operational state of a system in the simplest manner possible
Goal: increase availability by reducing or avoiding the effects of faults
- 35. Software Reliability Practices
Analysis:
Formal Scenario/Checklist Analysis
FRACAS
FMECA
FTA
Petri Nets
Change Impact Analysis
Common Cause Failure Analysis
Sneak Analysis

Design:
Formal Interface Specification
Defensive Programming
Fault Tolerance
Modular Design
Error Detection and Correction
Critical Functionality Isolation
Design by Contract
Cleanroom
Reliability Allocation
Design Diversity

Verification:
Boundary Value Analysis
Equivalence Class Partitioning
Reliability Growth Testing
Fault Injection Testing
Static/Dynamic Code Analysis
Coverage Testing
Usage Profile Testing
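To show what two of the design-column practices can look like in code, here is a small, hypothetical C sketch of Defensive Programming and Design by Contract (the queue type and capacity are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_CAPACITY 8

typedef struct {
    int items[QUEUE_CAPACITY];
    int count;
} queue_t;

/* Design by Contract: the caller promises q is valid and not full; the
 * asserts document and enforce that contract during development builds. */
static void queue_push(queue_t *q, int value)
{
    assert(q != NULL);                       /* precondition  */
    assert(q->count < QUEUE_CAPACITY);       /* precondition  */
    q->items[q->count++] = value;
    assert(q->count <= QUEUE_CAPACITY);      /* postcondition */
}

/* Defensive Programming: the same operation written to tolerate a bad call
 * at run time; it validates its inputs and reports failure instead of
 * corrupting state when the caller breaks the rules. */
static bool queue_push_checked(queue_t *q, int value)
{
    if (q == NULL || q->count >= QUEUE_CAPACITY) {
        return false;                        /* reject the erroneous request */
    }
    q->items[q->count++] = value;
    return true;
}

int main(void)
{
    queue_t q = { .items = {0}, .count = 0 };

    queue_push(&q, 1);                       /* contract honored */
    for (int i = 0; i < 20; i++) {           /* deliberately overfill */
        if (!queue_push_checked(&q, i)) {
            printf("push rejected at i=%d (queue full)\n", i);
            break;
        }
    }
    printf("queue holds %d items\n", q.count);
    return 0;
}
```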
- 36. Design and Code Inspections
◈ The original rationale for inspections (current payback):
“Inspections require less time and resources to detect and repair defects than
traditional testing and debugging”
Work done at Nortel Technologies in 1991 demonstrated that 65% to 90% of
operational defects were detected by inspections at 1/4 to 2/3 the cost of testing
◈ Soft maintenance rationale (future payback):
Data collected from 130 inspection sessions shows the long-term software maintenance benefits of inspections; findings were classified as follows:
True Defects - the code behavior was wrong, and an execution-affecting change was made to resolve it.
False Positives - any issue not requiring a code or document change.
Soft Maintenance Changes - any other issue that resulted in a code or document change, e.g., code restructuring or the addition of code comments.
- 37. Spectrum of Inspection Methodologies
Method / Originator | Team Size | # of Sessions | Detection Method | Collection Meeting | Post Process Feedback
Fagan | Large | 1 | Ad hoc | Yes (group oriented) | None
Bisant | Small | 1 | Ad hoc | Yes (group oriented) | None
Gilb | Large | 1 | Checklist | Yes (group oriented) | Root Cause Analysis
Meetingless Inspection | Large | 1 | Unspecified | No (individual oriented) | None
ADR | Small | >1 | Scenario | Yes (group oriented) | None
Britcher | Unspecified | 4, parallel | Scenario | Yes (group oriented) | None
Phased Inspection | Small | >1, sequential | Checklist (comp.) | No (meeting only to reconcile data) | None
N-fold | Small | >1, parallel | Ad hoc | Yes (group oriented) | None
Code Reading | Small | 1 | Ad hoc | No (meeting optional) | None
WOW! No wonder inspections are not well understood; there are too many methodologies.
AND, THERE ARE MORE OPTIONS…
- 38. Spectrum of Technical Review Methodologies
Inspections are just one of the many classes of Technical Review Methodologies.
• Informal ↔ Formal
• Individual initiative ↔ Team-oriented
• Small time commitment ↔ Multiple meetings and pre-meeting preparation
• General feedback ↔ Compliance with standards
• Defect detection ↔ Satisfies specifications

From least to most formal:
Adhoc Review → Peer Desk Check (Passaround Check) → Pairs Programming → Walkthrough → Team Review → Inspection
- 39. Why Isn’t Software Reliability Prevalent ??
“Those are very good ideas. We would like to implement them and
we know we should try. However, there just isn’t enough time.”
◈ The erroneous arguments all assume testing is the most effective
defect detection methodology
Results from inspections/reviews are generally poor
Engineers believe that testers will do a more thorough and efficient job of
testing than any effort they implement (inspections and unit testing)
Managers believe progress can be demonstrated faster and better once the SW
is in the system test phase
◈ Remember, just like the story of the lumberjack and his ax,
“If you don’t have time to do it correctly the first time,
then you must have time to do it over later!”
- 40. Software DfR Tools by Phase
Phase | Activities | Tools
Concept | Define SW reliability requirements | Benchmarking; Internal Goal Setting; Gap Analysis
Architecture & High Level Design | Modeling & predictions | SW Failure Analysis; SW Fault Tolerance Design; Human Factors Analysis
Low Level Design | Identify core, critical and vulnerable sections of the design; static detection of design defects | Human Factors Analysis; Derating Analysis; Worst Case Analysis
Coding | Static detection of coding defects | FRACAS; RCA
Unit Testing | Dynamic detection of design and coding defects | FRACAS; RCA
Integration and System Testing | SW statistical testing | FRACAS; RCA; SW Reliability Testing
Operations and Maintenance | Continuous assessment of product reliability | FRACAS; RCA