Software Reliability For Engineers - J.K.Orr 2015-09-23
1. Software Reliability For Engineers
James K. Orr
Independent Consultant
jkorr@gatech.edu
Copyright 2015 By James K. Orr, 9/23/2015
2. Introduction
• This presentation describes a very simplified approach
to computing software reliability: an engineering
approach, as opposed to a complex statistical approach.
• This approach evolved from analysis of the Space
Shuttle Primary Avionics Software System (PASS), the
software that controlled the Space Shuttle from pre-
launch through ascent, on-orbit, entry, and landing.
• The approach may be limited to similar systems (large-scale
critical software with relatively few users).
• If you would like assistance in applying this method,
please contact me at jkorr@gatech.edu.
3. Contents
• Introduction
• Contents
• Evolution of Space Shuttle PASS Alternate
Reliability Model in 1989.
• Generalized Approach For Software Reliability
With Examples, Equations, and Simulation
• Sample Results From Space Shuttle PASS
Reliability Analysis
• References
4. Evolution of
Space Shuttle PASS
Alternate Reliability Model
in 1989
5. Requirement Reliability Prediction
• Following the loss of the Space Shuttle Challenger and
crew in 1986, IBM Federal Systems Division – Houston
as the Space Shuttle Primary Avionics Software System
developer was assigned a “return to flight” action to
model the software reliability of “loss of vehicle and
crew” latent errors (defects).
• This was a two-step process. First, compute software
reliability (time to next failure). Second, model the
probability that a failure occurring during flight would
be a "loss of vehicle and crew" latent error (defect).
• Discussion in this paper focuses on the first activity,
computing software reliability (time to next failure).
6. “Professional” Approach
• IBM Federal Systems Division – Houston
contacted multiple experts in software reliability.
Ultimately, N. F. Schneidewind and his "SMERFS"
software reliability estimation tool were selected
to model the Space Shuttle PASS reliability.
• See Reference 1 for one paper that documents
the results of this work. The link with Reference 1
also connects to a full list of papers, etc., by Dr.
Schneidewind.
7. Motivating An “Engineering” Approach
• During this time (1986 – 1989), I was working as senior technical staff at
IBM Federal Systems Division – Houston. Roles included:
– Project Coordination and Technical Leadership, 1984-1988. Led initiatives to support a
high flight rate in the period leading to the loss of Space Shuttle Challenger in January 1986.
Oversaw initiatives to implement mandatory changes to the On-Board Shuttle Software
(PASS) prior to return to flight in September 1988. Earned IBM's highest technical
achievement award, an outstanding achievement award for shuttle software engineering,
development, and verification technical leadership.
– Member of the IBM/NASA Shuttle Flight Software (PASS) Discrepancy Review Board, 1981-
1992. Maintained the rigor of the Discrepancy Review Board process, ensuring identification
and correction of process escapes, and identification and correction of similar errors due
to prior process deficiencies found by audits.
• In these roles, I reviewed the results produced by IBM and the "SMERFS"
software reliability estimation tool. A key part of the process was to
separate software into "layers" based on the development cycle for each
release of PASS flight software. In comparing the failures (Flight Software
Discrepancies) by development cycle to data being processed by the
IBM/NASA Shuttle Flight Software (PASS) Discrepancy Review Board, I
observed significant differences in time to failure by release.
8. “Engineering” Approach
• FROM NOTES DATED 04/16/1990 (WITH
ADDED HISTORICAL INSIGHT)
– Analysis was done by "eyeballing" the time between
failures for recent releases. An "engineering
judgment" rough estimate of the time to next failure
was made by release. This was compared to the
values produced by "SMERFS" as well as by the
prototype for the Alternate Reliability Model.
• Data on next page has been updated with
actual time to next failure as of 03/14/2007.
9. Evaluation Of “SMERFS”
Operational   Engineering     "SMERFS"      Alternate        Actual MTTF (Days),
Increment     Judgment Time   09/30/89      Reliability      Next Failure After
              to Next         MTTF (Days)   Model 12/01/89   12/89 (Added
              Failure (Days)                MTTF (Days)      03/14/2007)
4                 1500            167            970             2262
5                  700            164            729             2746
6                  600            146            539              203
7                  800            291            864              327
7C                1000            466           1458             4484
8A                 700            455           1461             2958
8B                 350            256            351             6393
8C                 180            420            143              185
Composite           63             30             60               67
(combine all above)
See Reference 2, page 16 to identify time frame for Operational Increments.
See pages 43 – 53 for significant process improvements applied for
Operational Increments OI-8A, OI-8B, and OI-8C.
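As a sanity check on the "Composite" row, the composite values are consistent with treating each Operational Increment as an independent failure source whose failure rates (1/MTTF) add. A short sketch (this combination rule is my reading of the table, not stated in the original notes):

```python
# Sketch: the "Composite" row is consistent with treating each OI as an
# independent failure source. Failure rates (1/MTTF) add, so the composite
# MTTF is 1 / sum(1/MTTF_i). Input values are taken from the table above.
def composite_mttf(mttfs):
    """Combine per-layer MTTFs assuming independent, additive failure rates."""
    return 1.0 / sum(1.0 / m for m in mttfs)

engineering_judgment = [1500, 700, 600, 800, 1000, 700, 350, 180]
actual_mttf          = [2262, 2746, 203, 327, 4484, 2958, 6393, 185]

print(round(composite_mttf(engineering_judgment)))  # 63, matching the table
print(round(composite_mttf(actual_mttf)))           # 67, matching the table
```

The same rule reproduces the 30-day and 60-day composite values for the "SMERFS" and Alternate Reliability Model columns.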
10. Evaluation Of “SMERFS”
• The Alternate model was developed to better match the engineering
judgment values. In hindsight, looking back after 17-plus years (in
2007), the engineering judgment was most accurate (6 %
conservative), followed by the Alternate Reliability Model (10 %
conservative). "SMERFS" was conservative, but in error by 55 %
relative to the actual results.
• RATIONALE FOR “ALTERNATE RELIABILITY MODEL” From Notes
dated 04/16/1990
– First, subtle differences existed between Predicted Time Between Failures
using “SMERFS” (Statistical Modeling and Estimating of Reliability Functions
for Software) and the actual data. The key difference that was unacceptable
was the skew in probability of the next error occurring on older OI’s (for
example, OI 4) rather than on recent OI’s (for example, OI-8C). Actual data
showed an opposite trend.
– Second, "SMERFS" required significant historical data before producing
accurate results, making it inappropriate for predicting in advance the
reliability of unreleased systems.
11. Effects Of Process Improvement
• Candidate reasons exist for the mismatch between "SMERFS" and reality. See
Reference 3, page 9, which shows a very large spike in product error rate for
Operational Increments OI-1 and OI-2. See page 16 for tabular data.
• Continual process improvements through OI-8C may have accounted for the
error in "SMERFS" predictions.
[Chart from Reference 3: product error rate by release, with a large spike at OI-1 and OI-2 declining through OI-8C; flights STS-1, STS-2, and STS-5 marked; annotation notes OI-25 process issues during the transition from IBM to Loral.]
12. Summary of the method
• The Space Shuttle "Alternate Reliability Model" program was developed
for the Space Shuttle Primary Avionics Software System, a human-rated
flight system of approximately 450,000 source lines of code
(excluding comment lines). Operational Increment release development
over 15 years has demonstrated that the reliability characteristics per unit
of changed code for each release are very consistent, with variations
explainable by process deviations or other special causes of variation.
13. Summary of the method
• The Space Shuttle “Alternate Reliability Model” program computes
software reliability estimates in complex software systems even as the
reliability characteristics change over time.
• The method and tool work in two independent modes.
– First, when failure data is available, it will estimate two model
coefficients for each grouping of software being analyzed. These two
model coefficients can then be used to calculate the software
reliability characteristics of each grouping of software, and also total
software reliability for all groupings combined.
– Second, the two model coefficients are also normalized based on
relative size. For appropriate circumstances (e.g., the software is
produced with essentially the same equivalent quality process),
estimates of software reliability can be made prior to any failures
occurring based on relative size of the software.
• Once the two model coefficients are determined, reliability and failure
information over a user-defined time interval can be computed.
14. Required Inputs Mode # 1
Mode # 1 (use actual failure data to compute reliability)
• Define software as "uniform layers." These layers represent whatever
functional characteristic is desired to be modeled. In the Space Shuttle Primary
Avionics Software System context, each layer represents all new/changed
software delivered by each release. In the Constellation context, layers could
be broken down by function and criticality of the software.
• Relative size measure of each layer. In the Space Shuttle Primary Avionics
Software System context, relative size is defined by new/changed source lines
of code (SLOCs). In the Constellation context, relative size could be based on
number of requirements, function points, or any other measure desired.
Correlation between the Space Shuttle Primary Avionics Software System and
Constellation could be performed by comparing the relative functional size of
each software function to the PASS SLOCs and the Constellation size
parameter of choice.
• Data on each failure:
– Date of failure
– "Layer" of software that was the source of the failure
15. Required Inputs Mode # 2
Mode # 2 (use relative size and historical data to compute reliability)
• Define software as "uniform layers." These layers can represent
whatever characteristic is desired to be modeled.
• Relative size measure of each layer.
• Expected relative quality level compared to historical data (could be
subjective).
All (produce reliability calculations)
• Date or date range. Typically, this would correspond to (a) the date
of flight at which you wanted a Mean Time To Failure, or (b) any
range of dates over which you wanted to determine the expected
number of failures (expressed as a scalar, which in some contexts
would represent the likelihood of a failure in that interval).
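The Mode #2 inputs above can be sketched in a few lines. This is an illustrative reconstruction, not the actual tool: `estimate_k` stands in for Mode #2 (relative size times a historical K-per-KSLOC value, scaled by an optional relative quality factor), and the expected-failure count over a date range follows the model equation X = K * (ln(t2) - ln(t1)). All names and numbers here are hypothetical.

```python
import math

# Illustrative reconstruction of Mode #2 (not the actual tool). K for an
# unreleased layer is estimated from relative size times a historical
# K-per-KSLOC value, optionally scaled by an expected relative quality
# factor; expected failures over [t1, t2] days after release then follow
# X = K * (ln(t2) - ln(t1)). All names and numbers here are hypothetical.
def estimate_k(size_ksloc, historical_k_per_ksloc, quality_factor=1.0):
    """Mode #2 coefficient estimate from size and historical data."""
    return size_ksloc * historical_k_per_ksloc * quality_factor

def expected_failures(k, t1_days, t2_days):
    """Expected failure count (a scalar) over a user-defined date range."""
    return k * (math.log(t2_days) - math.log(t1_days))

k = estimate_k(size_ksloc=25.0, historical_k_per_ksloc=0.08)   # K = 2.0
x = expected_failures(k, t1_days=90, t2_days=365)              # first year of use
print(round(x, 2))  # 2.8
```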
16. MATHEMATICAL BASIS
“ALTERNATE RELIABILITY MODEL”
The expected number of failures at any time:
• X = K_layer * (ln(t) - ln(tref)), for t > tref
• Where X = number of software failures.
• K_layer = a single constant that characterizes the grouping of software.
• tref = a reference time in days shortly after release (Configuration Inspection date), typically on the order of 90 days.
Ninety days was selected as the time normally needed to reconfigure a system for flight and begin its use.
– The 90 days makes operational sense in the Space Shuttle Program Primary Avionics Software System (PASS)
environment.
– The 90 days avoids a lot of mathematical issues as t takes on small values, ultimately approaching 0.
• t = time in days after release (Configuration Inspection date). t varies from approximately 90 days up to approximately
10,000 days for Space Shuttle Primary Avionics Software System (PASS) data.
• For every pair of successive failures, a value of K can be computed. Values computed for each pair of successive failures may
vary by a factor on the order of 100.
– K (failures N to N+1) = 1 / (ln(t at failure N+1) - ln(t at failure N))
– In the Space Shuttle Primary Avionics Software System data, failures are sometimes reported on the same day.
Mathematically, the above equation does not work for this situation. The approach adopted was to treat all failures
occurring within 12 days (a window evolved through multiple iterations) in one K calculation.
– If two failures fall within one 12-day interval:
• K (failures N to N+2) = 2 / (ln(t at failure N+2) - ln(t at failure N))
– If three failures fall within one 12-day interval:
• K (failures N to N+3) = 3 / (ln(t at failure N+3) - ln(t at failure N))
– Etc.
• The above calculations give a series of K terms, each associated with a time interval. A single value of K_layer for each
"layer" (set of released changes) is calculated by weighting each K term by its associated delta time interval. Note the
equation below is simplified to the case where all failures occur more than 12 days apart. Note also that the method
assumes a failure at the current date for each layer as the calculations are performed, to ensure a conservative estimate
is produced.
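The K_layer procedure can be sketched as follows. This is a minimal reconstruction under stated assumptions: failure times are days since the reference date, the 12-day window folds near-simultaneous failures into one K term, and weighting is by delta time. The conservative assumed failure at the analysis date is left to the caller to append.

```python
import math

# Minimal sketch of the K_layer calculation described above, under stated
# assumptions: `times` are failure times in days since the reference date
# (all > tref), sorted ascending; failures closer together than `group_days`
# are folded into a single K term, per the 12-day PASS rule; K_layer is the
# delta-time-weighted average of the K terms. The real tool also appends an
# assumed failure at the analysis date for conservatism; append that time to
# `times` yourself if desired.
def k_layer(times, group_days=12):
    terms = []                       # (K value, delta-t weight) pairs
    n_times = len(times)
    i = 0
    while i < n_times - 1:
        j = i + 1
        # extend the interval while failures fall within the grouping window
        while j < n_times - 1 and times[j] - times[i] <= group_days:
            j += 1
        n = j - i                    # number of failures in this interval
        k = n / (math.log(times[j]) - math.log(times[i]))
        terms.append((k, times[j] - times[i]))
        i = j
    total_dt = sum(dt for _, dt in terms)
    return sum(k * dt for k, dt in terms) / total_dt

# Two failures 5 days apart collapse into one K term: K = 2 / ln(200/100).
print(round(k_layer([100, 105, 200]), 3))       # 2.885
print(round(k_layer([100, 180, 330, 900]), 2))  # 1.19
```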
17. MATHEMATICAL BASIS
“ALTERNATE RELIABILITY MODEL”
• The standard deviation is computed directly from
all of the computed F_factor values.
• The normalized standard deviation (SD_factor) is
computed by dividing the standard deviation
by the composite final F_factor.
22. Summary Noise Calculations
• Ideal Data uses integer dates as close as possible
to producing exactly even integer failures using
the model equations.
• Noise 1 Data uses random variance in dates to
produce a standard deviation in F_factor of
about 17 %.
• Noise 2 Data uses random variance in dates to
produce a standard deviation in F_factor of
about 28 %.
• The above are simply samples; they have no
significance beyond illustration.
24. Key Issues
• Failures during development must be separated from failures in post-release operations.
• Ideally, separate failures from post-release operations by the completion date of each
release's content.
• Selection of tref is critical in that it must not be near 0. Zero time is normally when
verification testing is completed.
– Based on Space Shuttle PASS experience, a value of 90 days is recommended.
This was the time from when verification on an Operational Increment was
completed until a flight-specific reconfigured release was available to field
users (crew training, Shuttle Avionics Integration Laboratory testing).
– Alternately, the time at which the first failure occurs could also be used (if
significantly greater than 0).
• The method does not work when treating single failures occurring on the same day
or very close together.
– The Space Shuttle PASS engineering solution was to group all failures occurring
within 12 days into a single calculation, with N as the number of failures between
the two time points.
25. Test If This Approach Is Valid
• This model may or may not work for any
specific system and set of failure data.
• The most direct test is to plot failures versus
time from release verification completion, with
time on a logarithmic scale.
– If this plot is approximately linear, then this
approach (the PASS "Alternate Reliability Model") is
valid.
– If there are failures at delta times very near 0,
they should be ignored for modeling purposes.
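The linearity check can be made quantitative with an ordinary least-squares fit of cumulative failures against ln(t). This is my own formulation of the test, not from the original notes: a correlation coefficient near 1 supports the log-linear model, and the fitted slope then estimates K_layer.

```python
import math

# A quantitative version of the suggested validity check (my own
# formulation): fit cumulative failures against ln(t) by ordinary least
# squares. A correlation coefficient r near 1 supports the log-linear
# model X = K * ln(t/tref), and the fitted slope estimates K_layer.
def log_linearity(times_days):
    xs = [math.log(t) for t in times_days]     # ln(t) at each failure
    ys = list(range(1, len(times_days) + 1))   # cumulative failure count
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)   # (slope ~ K_layer, r)

# Failures generated at exactly t = tref * exp(n / K) lie on a perfect line.
times = [90 * math.exp(n / 2.0) for n in range(1, 6)]
slope, r = log_linearity(times)    # slope -> 2.0, r -> 1.0
```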
26. Effect Of Not Isolating Each Release
27. Extreme Samples Of Model Equations
• Sample 1 has random variation in dates for failures,
plus assumes two failures at second failure point to
demonstrate multiple failures within a short time
period (typically within 12 days). Uses same dates as
Noise 2 Sample for first four failures only.
• Sample 2 has random variation in dates for failures.
Uses same dates as Noise 2 Sample for first four
failures only.
• Sample 3 has random variation in dates for failures.
Uses same dates as Noise 1 Sample for first four
failures only.
38. Sample Results From
Space Shuttle PASS
Reliability Analysis
39. Estimate K_factor
• The following four charts illustrate how K_factor can be predicted from other
software metrics, such as Product Error Rate (product errors per 1000
new/changed source lines of code).
• Data is shown from OI-20 (released in 1990) to OI-30 (released in 2003).
These were large releases with 7 to 20 years of service life. Relatively stable
software development and verification process was used except for OI-25 (see
Reference 2 for more information).
• Page 40 shows Product Error Rate data from Reference 3. Page 41 shows PASS
K_factor per 1000 new/changed source lines of code from my personal notes.
• Page 42 tabulates key information. Page 43 plots the relationship between
K_factor per KSLOC versus Product Error Rate (Product Errors per KSLOC).
• This relationship could be used to estimate reliability of a future system if an
estimate of Product Error Rate is known based on prior process performance.
– K_factor = (K_factor per KSLOC as a function of Product Error Rate) *
KSLOC of system
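The estimation step above can be sketched as follows, with a deliberately hypothetical mapping from Product Error Rate to K_factor per KSLOC. The real relationship would come from fitting the page 43 plot; the linear coefficients below are invented for illustration only.

```python
# Hypothetical sketch of the estimation step above. The mapping from
# Product Error Rate to K_factor per KSLOC would in practice be fitted to
# the page 43 plot; the linear coefficients a and b below are invented for
# illustration and are NOT values from the PASS data.
def k_factor_per_ksloc(product_error_rate, a=0.05, b=0.01):
    # assumed linear trend: higher error rate -> higher K_factor per KSLOC
    return a * product_error_rate + b

def k_factor(product_error_rate, size_ksloc):
    """K_factor = (K_factor per KSLOC as a function of error rate) * KSLOC."""
    return k_factor_per_ksloc(product_error_rate) * size_ksloc

print(round(k_factor(product_error_rate=0.5, size_ksloc=40.0), 3))  # 1.4 (illustrative)
```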
40. Reference 3, Page 16
[Chart reproduced from Reference 3, page 16. Focus on AP-101S (upgraded General Purpose Computer) major releases.]
41. PASS “Alternate Reliability Model” K_factor
[Chart: PASS K_factor per 1000 new/changed source lines of code by release, OI-20 through OI-30.]
44. Discussion Of Results For PASS
• Alternate Reliability Model coefficients were computed for OI-3 and later systems using the
post-flight failures. For OI-30, OI-32, OI-33, and OI-34, the calculated values were adjusted
because the assumption of an additional failure on the day of the analysis gave unrealistically
high values. Alternate Reliability Model coefficients were adjusted to a value per unit of size
(1000 uncommented new/changed source lines of HAL/S code, or KSLOC) that was consistent
with other similar recent OI's.
• For Release 16 (STS-1) to OI-2, failure data exists only for the combined releases, not separately.
Computation of Alternate Reliability Model coefficients was done by comparing failures per
year for the combined releases to Alternate Model output for assumed values of the Alternate
Reliability Model coefficients. The coefficients were derived based on a constant value per unit
of size (KSLOC). Additional unique analysis produced the Alternate Reliability Model
coefficient showing the variability of the predicted failures per year.
• Analysis focused on flown systems. Data from Operational Increments not flown was combined
with a later flown increment. As an example, failures and KSLOCs from OI-7C and OI-8A are
included in the calculation of Alternate Reliability Model coefficients for OI-8B. For simplicity,
data from OI-8F was combined under OI-20, even though OI-8F supported flights, due to the
small size and unique nature of OI-8F. OI-8F made operating system changes to support the
AP-101S General Purpose Computer upgrade.
48. References
1. Schneidewind, N. F. and Keller, T. W., "Application of Reliability Models to the
Space Shuttle," IEEE Software, July 1992, pp. 28-33.
– See the list of papers by N. F. Schneidewind at
• http://faculty.nps.edu/vitae/cgi-bin/vita.cgi?p=display_more&id=1023567911&field=pubs
2. James K. Orr, Daryl Peltier, "Space Shuttle Program Primary Avionics Software System
(PASS) Success Legacy - Major Accomplishments and Lessons Learned Detail Historical
Timeline Analysis," August 24, 2010, NASA JSC-CN-21350, presented at the NASA-
Contractors Chief Engineers Council 3-day meeting, August 24-26, 2010, Montreal,
Canada.
– Free at
• http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20100028293.pdf
3. James K. Orr, Daryl Peltier, "Space Shuttle Program Primary Avionics Software System
(PASS) Success Legacy - Quality & Reliability Data," August 24, 2010, NASA JSC-CN-
21317, presented at the NASA-Contractors Chief Engineers Council 3-day meeting,
August 24-26, 2010, Montreal, Canada.
– Free at
• http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20100029536.pdf