Communications of the ACM | November 2009 | Vol. 52 | No. 11

practice
You Don't Know Jack about Software Maintenance

DOI: 10.1145/1592761.1592777
Article development led by queue.acm.org

Long considered an afterthought, software maintenance is easiest and most effective when built into a system from the ground up.

BY PAUL STACHOUR AND DAVID COLLIER-BROWN

Everyone knows maintenance is difficult and boring, and therefore avoids doing it. It doesn't help that many pointy-haired bosses (PHBs) say things like:

"No one needs to do maintenance—that's a waste of time."

"Get the software out now; we can decide what its real function is later."

"Do the hardware first, without thinking about the software."

"Don't allow any room or facility for expansion. You can decide later how to sandwich the changes in."

These statements are a fair description of development during the last boom, and not too far from what many of us are doing today. This is not a good thing: when you hit the first bug, all the time you may have "saved" by ignoring the need to do maintenance will be gone.

During a previous boom, General Electric designed a mainframe that it claimed would be sufficient for all the computer uses in Boston, and would never need to be shut down for repair or for software tweaks. The machine it eventually built wasn't nearly big enough, but it did succeed at running continuously without need for hardware or software changes.

Today we have a distributed network of computers provided by thousands of businesses, sufficient for everyone in at least North America, if not the world. Still, we must keep shutting down individual parts of the network to repair or change the software. We do so because we've forgotten how to do software maintenance.

What is software maintenance? Software maintenance is not like hardware maintenance, which is the return of the item to its original state. Software maintenance involves moving an item away from its original state. It encompasses all activities associated with the process of changing software. That includes everything associated with "bug fixes," functional and performance enhancements, providing backward compatibility, updating its algorithm, covering up hardware errors, creating user-interface access methods, and other cosmetic changes.

In software, adding a six-lane automobile expressway to a railroad bridge is considered maintenance—and it would be particularly valuable if you could do it without stopping the train traffic.

Is it possible to design software so it can be maintained in this way? Yes, it is. So, why don't we?

the four horsemen of the apocalypse

There are four approaches to software maintenance: traditional, never, discrete, and continuous—or, perhaps, war, famine, plague, and death. In any case, 3.5 of them are terrible ideas.

Traditional (or "everyone's first project"). This one is easy: don't even think about the possibility of maintenance. Hard-code constants, avoid subroutines, use all global variables, use short and non-meaningful variable names. In other words, make it difficult to change any one thing without changing everything. Everyone knows examples of this approach—and the PHBs who thoughtlessly push you into it, usually because of schedule pressures.

Trying to maintain this kind of software is like fighting a war. The enemy fights back! It particularly fights back when you have to change interfaces, and you find you've changed only some of the copies.

Never. The second approach is to decide upfront that maintenance will never occur. You simply write wonderful programs right from the start. This is actually credible in some embedded systems, which will be burned to ROM and never changed. Toasters, video games, and cruise missiles come to mind.

All you have to do is design perfect specifications and interfaces, and never change them. Change only the implementation, and then only for bug fixes before the product is released. The code quality is wildly better than it is for the traditional approach, but never quite good enough to avoid change completely.

Even for very simple embedded systems, the specifications and designs aren't quite good enough, so in practice the specification is frozen while it's still faulty. This is often because it cannot be validated, so you can't tell if it's faulty until too late. Then the specification is not adhered to when code is written, so you can't prove the program follows the specification, much less prove it's correct. So, you test until the program is late, and then ship. Some months later you replace it as a complete entity, by sending out new ROMs. This is the typical history of video games, washing machines, and embedded systems from the U.S. Department of Defense.

Discrete. The discrete change approach is the current state of practice: define hard-and-fast, highly configuration-controlled interfaces to elements of software, and regularly carry out massive all-at-once changes. Next, ship an entire new copy of the program, or a "patch" that silently replaces entire executables and libraries. (As we write this, a new copy of OpenOffice is asking us please to download it.)

In theory, the process accepts (reluctantly) the fact of change, keeps a parts list and tools list for every item, allows only preauthorized changes under strict configuration control, and forces all servers'/users' changes to take place in one discrete step. In practice, the program is running in multiple places, and each must kick off its users, do the upgrade, and then let them back on again. Change happens more often and in more places than predicted, all the components of an item are not recorded, and patching is alive (and, unfortunately, thriving) because of the time lag for authorization and the rebuild time for the system.

Furthermore, while official interfaces are controlled, unofficial interfaces proliferate; and with C and older languages, data structures are so available that even when change is desired, too many functions "know" that the structure has a particular layout. When you change the data structure, some program or library that you didn't even know existed starts to crash or return ENOTSUP. A mismatch between an older Linux kernel and newer glibc once had getuid returning "Operation not supported," much to the surprise of the recipients.

Experience shows that it is completely unrealistic to expect that all users to whom an interface is visible will be able to change at the same time. The result is that single-step changes cannot happen: multiple change interrelationships conflict, networks mean multiple versions are simultaneously current, and owners/users want to control change dates.

Vendors try to force discrete changes, but the changes actually spread through a population of computers in a wave over time. This is often likened to a plague, and is every bit as popular.

Customers use a variant of the "never" approach to software maintenance against the vendors of these plagues: they build a known working configuration, then "freeze and forget." When an update is required, they build a completely new system from the ground up and freeze it. This works unless you get an urgent security patch, at which time you either ignore it or start a large unscheduled rebuild project.

Continuous change. At first, this approach to maintenance sounds like just running new code willy-nilly and watching what happens. We know at least one company that does just that: a newly logged-on user will unknowingly be running different code from everyone else. If it doesn't work, the user's system will either crash or be kicked off by the sysadmin, then will have to log back on and repeat the work using the previous version.

Real-world structure for managing interface changes:

    struct item_loc_t {
        struct {
            unsigned short major;  /* = 1 */
            unsigned short minor;  /* = 0 */
        } version;
        unsigned part_no;
        unsigned quantity;
        struct location_t {
            char state[4];
            char city[8];
            unsigned warehouse;
            short area;
            short pigeonhole;
        } location;
        /* ... */
    };
However, that is not the real meaning of continuous. The real continuous approach comes from Multics, the machine that was never supposed to shut down and that used controlled, transparent change. The developers understood that the only constant is change and that migration for hardware, software, and function during system operation is necessary. Therefore, the ability to change was designed in from the very beginning.

Software in particular must be written to evolve as changes happen, using a weakly typed high-level language and, in older programs, a good macro assembler. No direct references are allowed to anything if they can be avoided. Every data structure is designed for expansion and is self-identifying as to version. Every code segment is made self-identifying by the compiler or other construction procedure. Code and data are changeable on a per-command/process/system basis, and as few copies of anything as possible are kept, so single copies can be dynamically updated as necessary.

The most important thing is to manage interface changes. Even in the Multics days, it was easy to forget to change every single instance of an interface. Today, with distributed programs, changing all possible copies of an interface at once is going to be insanely difficult, if not flat-out impossible.
Who Does it Right?
BBN Technologies was the first company to perform continuous controlled change, when it built the ARPANET backbone in 1969. It placed a 1-bit version number in every packet. If the bit changed from 0 to 1, it meant that the IMP (router) was to switch to a new version of its software and set the bit to 1 on every outgoing packet. This allowed the entire ARPANET to switch easily to new versions of the software without interrupting its operation. That was very important to the pre-TCP Internet, as it was quite experimental and suffered a considerable amount of change.
With Multics, the developers did all of these good things, the most important of which was the discipline used with data structures: if an interface took more than one parameter, all the parameters were versioned by placing them in a structure with a version number. The caller set the version, and the recipient checked it. If it was completely obsolete, it was flatly rejected. If it was not quite current, it was processed differently, by being upgraded on input and probably downgraded on return.

This meant that many different versions of a program or kernel module could exist simultaneously, while upgrades took place at the user's convenience. It also meant that upgrades could happen automatically and that multiple sites, multiple suppliers, and networks didn't cause problems.
An example of a structure used by a U.S.-based warehousing company (translated to C from Multics PL/1) is illustrated in the accompanying box. The company bought a Canadian competitor and needed to add inter-country transfers, initially from three of its warehouses in border cities. This, in turn, required the state field to split into two parts:

    char country_code[4];
    char state_province[4];

To identify this, the company incremented the version number from 1.0 to 2.0 and arranged for the server to support both types. New clients used version 2.0 structures and were able to ship to Canada. Old ones continued to use version 1.0 structures. When the server received a type 1 structure, it used an "updater" subroutine that copied the data into a type 2 structure and set the country code to U.S.

In a more modern language, you would add a new subclass with a constructor that supports a country code, and update your new clients to use it.
The process is this:

1. Update the server.
2. Change the clients that run in the three border-state warehouses. Now they can move items from U.S. to Canadian warehouses.
3. Deploy updated clients to those Canadian locations needing to move stock.
4. Update all of the U.S.-based clients at their leisure.
Using this approach, there is never a need to stop the whole system, only the individual copies, and that can be
scheduled around a business's convenience. The change can be immediate, or it can wait for a suitable time.

Once the client updates have occurred, we simultaneously add a check to produce a server error message for anyone who accidentally uses an outdated U.S.-only version of the client. This check is a bit like the "can't happen" case in an else-if: it's done to identify impossibly out-of-date calls. It fails conspicuously, and the system administrators can then hunt down and replace the ancient version of the program. This also discourages the unwise from permanently deferring fixes to their programs, much like the coarse version numbers on entire programs in present practice.
modern examples
This kind of fine-grained versioning is sometimes seen in more recent programs. Linkers are an example, as they read files containing numbered records, each of which identifies a particular kind of code or data. For example, a record number 7 might contain the information needed to link a subroutine call, such as the name of the function to call and a space for an address. If the linker uses record types 1 through 34, and type 7 later needs to be extended for a new compiler, you create a type 35, use it for the new compiler, and schedule changes from type 7 to type 35 in all the other compilers, typically by announcing the date on which type 7 records will no longer be accepted.
Another example is in networking protocols such as IBM SMB (Server Message Block), used for Windows networking. It has both protocol versions and packet types that can be used exactly the same way as the record types of a linker.

Object languages can also support controlled maintenance by creating new versions as subclasses of the same parent. This is a slightly odd use of a subclass, as the variations you create aren't necessarily meant to persist, but you can go back and clean out unneeded variants later, after they're no longer in use.

With AJAX, a reasonably small client can be downloaded every time the program is run, thus allowing change without versioning. A larger client would need only a simple versioning scheme, enough to allow it to be downloaded whenever it was out of date.
An elegant modern form of continuous maintenance exists in relational databases: one can always add columns to a relation, and there is a well-known value called null that stands for "no data." If the programs that use the database understand that any calculation with a null yields a null, then a new column can be added, programs changed to use it over some period of time, and the old column(s) filled with nulls. Once all the users of the old column are no more, as indicated by the column being null for some time, the old column can be dropped.
Another elegant mechanism is a markup language such as SGML or XML, which can add or subtract attributes of a type at will. If you're careful to change the attribute name when the type changes, and if your XML processor understands that adding 3 to a null value is still null, you have an easy way to transfer and store mutating data.
maintenance isn’t hard, it’s easy
During the last boom, (author) Collier-Brown's team needed to create a single front end to multiple back ends, under the usual insane time pressures. The front end passed a few parameters and a C structure to the back ends, and the structure repeatedly needed to be changed for one or another of the back ends as they were developed.

Even when all the programs were on the same machine, the team couldn't change them simultaneously, because they would have been forced to stop everything they were doing and apply a structure change. Therefore, the team started using version numbers. If a back end needed version 2.6 of the structure, it told the front end, which handed it the new one. If it could use only version 2.5, that's what it asked for. The team never had a "flag day" when all work stopped to apply an interface change. They could make those changes when they could schedule them.
Of course, the team did have to make the changes eventually, and their management had to manage that, but they were able to make the changes when it wouldn't destroy their schedule. In an early precursor to test-directed design, they had a regression test that checked whether all the version numbers were up to date and warned them if updates were needed.

The first time the team avoided a flag day, they gained the few hours expended preparing for change. By the 12th time, they were winning big.

Maintenance really is easy. More importantly, investing time to prepare for it can save you and your management time in the most frantic of projects.
Related articles on queue.acm.org

The Meaning of Maintenance
Kode Vicious
http://queue.acm.org/detail.cfm?id=1594861

The Long Road to 64 Bits
John Mashey
http://queue.acm.org/detail.cfm?id=1165766

A Conversation with David Brown
http://queue.acm.org/detail.cfm?id=1165764
Paul Stachour is a software engineer equally at home in development, quality assurance, and process. One of his focal areas is how to create correct, reliable, functional software in effective and efficient ways in many programming languages. Most of his work has been with life-, safety-, and security-critical applications from his home base in the Twin Cities of Minnesota.

David Collier-Brown is an author and systems programmer, formerly with Sun Microsystems, who mostly does performance and capacity work from his home in Toronto.

© 2009 ACM 0001-0782/09/1100 $10.00
contributed articles

Communications of the ACM | January 2010 | Vol. 53 | No. 1

Think Big for Reuse

DOI: 10.1145/1629175.1629209

BY PAUL D. WITMAN AND TERRY RYAN

Many organizations are successful with software reuse at fine to medium granularities – ranging from objects, subroutines, and components through software product lines. However, relatively little has been published on very large-grained reuse. One example of this type of large-grained reuse might be that of an entire Internet banking system (applications and infrastructure) reused in business units all over the world. In contrast, "large scale" software reuse in current research generally refers to systems that reuse a large number of smaller components, or that perhaps reuse subsystems.9 In this article, we explore a case of an organization with an internal development group that has been very successful with large-grained software reuse.

BigFinancial, and the BigFinancial Technology Center (BTC) in particular, have created a number of software systems that have been reused in multiple businesses and in multiple countries. BigFinancial and BTC thus provided a rich source of data for case studies to look at the characteristics of those projects and why they have been successful, as well as to look at projects that have been less successful and to understand what has caused those results and what might be done differently to prevent issues in the future. The research is focused on technology, process, and organizational elements of the development process, rather than on specific product features and functions.

Supporting reuse at a large-grained level may help to alleviate some of the issues that occur in more traditional reuse programs, which tend to be finer-grained. In particular, because BigFinancial was trying to gain commonality in business processes and operating models, reuse of large-grained components was more closely aligned with its business goals. This same effect may well not have happened with finer-grained reuse, due to the continued ability of business units to more readily pick and choose components for reuse.

BTC is a technology development unit of BigFinancial, with operations in both the eastern and western U.S. Approximately 500 people are employed by BTC, reporting ultimately through a single line manager responsible to the Global Retail Business unit head of BigFinancial. BTC is organized to deliver both products and infrastructure components to BigFinancial, and its product line has through the years included consumer Internet banking services, teller systems, ATM software, and network management tools. BigFinancial has its U.S. operations headquartered in the eastern U.S., and employs more than 8,000 technologists worldwide.

In cooperation with BTC, we selected three cases for further study from a pool of about 25. These cases were the Java Banking Toolkit (JBT) and its related application systems, the Worldwide Single Signon (WSSO) subsystem, and the BigFinancial Message Switch (BMS).

background – software reuse and bigfinancial

Various definitions appear in the literature for software reuse. Karlsson defines software reuse as "the process of creating software systems from existing software assets, rather than building software systems from scratch." One taxonomy of the approaches to software reuse includes notions of the scope of reuse, the target of the reuse, and the granularity of the reuse.5 The notion of granularity is a key differentiator of the type of software reuse practiced at BigFinancial, as BigFinancial has demonstrated success in large-grained reuse programs – building a system once and reusing it in multiple businesses.

Product-line technology models, such as that proposed by Griss4 and further expanded upon by Clements and Northrop2 and by Krueger,6 suggest that software components can be treated similarly to the notions used in manufacturing – reusable parts that contribute to consistency across a product line as well as to improved efficiencies in manufacturing. Benefits of such reuse include the high levels of commonality of such features as user interfaces,7 which increases switching costs and customer loyalty in some domains. This could logically extend to banking systems in the form of common functionality and user interfaces across systems within a business, and across business units.

BigFinancial has had several instances of successful, large-grained reuse projects. We identified projects that have been successfully reused across a wide range of business environments or business domains, resulting in significant benefit to BigFinancial. These included the JBT platform and its related application packages, as well as the Worldwide SSO product. These projects demonstrated broad success, and the authors evaluated them for evidence to identify what contributed to, and what may have worked against, the success of each project.

The authors also identified another project that has been successfully reused across a relatively narrow range of business environments. This project, the BigFinancial Message Switch (BMS), was designed for a region-wide level of reuse, and had succeeded at that level. As such, it appears to have invested appropriately in features and capabilities needed for its client base, and did not appear to have over-invested.

online banking and related services

We focused on BTC's multi-use Java Banking Toolkit (JBT) as a model of a successful project. The Toolkit is in wide use across multiple business units, and represents reuse both at the largest-grained levels as well as reuse of large-scale infrastructure components. JBT supports three application sets today, including online banking, portal services, and alerts capabilities, and thus the JBT infrastructure is already reused for multiple applications. To some extent, these multiple applications could be studied as subcases, though they have thus far tended to be deployed as a group. In addition, the online banking, portal services, and alerts functions are themselves reused at the application level across multiple business units globally.

Initial findings indicated that several current and recent projects showed significant reuse across independent business units that could have made alternative technology development decisions. The results are summarized in Table 1.

Table 1. Selected reuse results.

    Project                                          Reused in business units
    System Infrastructure (consumer Internet         all users of BTC's legacy Internet banking
      banking; automated teller machines)              components – >35 businesses worldwide
    System Infrastructure (Internet banking –        approximately 4 business units worldwide
      Small Business)
    Internet banking – Europe                        >15 business units
    Internet banking – Asia                          >10 business units
    Internet banking – Latin America                 >6 business units
    Internet banking – North America                 >4 business units

While significant effort is required to support multiple languages and business-specific functional variability, BTC found that it was able to accommodate these requirements by designing its products to be rule-based, and by designing its user interface to separate content from language. In this manner, business rules drove the behavior of the Internet banking applications, and language- and format-definition tools drove the details of application behavior, while maintaining a consistent set of underlying application code.

In the late 1990s, BTC was responsible for creation of system infrastructure components, built on top of industry-standard commercial operating systems and components, to support the banking functionality required by its customers within BigFinancial. The functions of these infrastructure components included systems management, high-reliability logging processes, high-availability mechanisms, and other features not readily available in commercial products at the time the components were created. The same infrastructure was used to support consumer Internet banking as well as automated teller machines. The Internet banking services will be identified here as the Legacy Internet Banking product (LIB).

BigFinancial's initial forays into Internet transaction services were accomplished via another instance of reuse. Taking its pre-Internet banking components, BTC was able to "scrape" the content from the pages displayed in that product, and wrap HTML code around it for display in a Web browser. Other components were responsible for modifying the input and menuing functions for the Internet.

The purpose of this approach to Internet delivery was to deliver a product to the Internet more rapidly, without modification of the legacy business logic, thereby reducing risk as well. In what amounted to an early separation of business and presentation logic, the pre-Internet business logic remained in place, and the presentation layer re-mapped its content for the browser environment.

In 2002, BigFinancial and BTC recognized two key issues that needed to be addressed. The platform for their Legacy Internet Banking application was nearing end of life (having been first deployed in 1996), and there were too many disparate platforms for its consumer Internet offerings. BTC's Internet banking, alerts, and portal functions each required separate hardware and operating environments. BTC planned its activities such that the costs of the new development could fit within the existing annual maintenance and new-development costs already being paid by its clients.

BTC and business executives cited trust in BTC's organization as a key to allowing BTC the opportunity to develop the JBT product. In addition, BTC's prior success with reusing software components at fine and medium granularities led to a culture that promoted reuse as a best practice.
Starting in late 2002, BTC developed an integrated platform and application set for a range of consumer Internet functions. The infrastructure package, named the Java Banking Toolkit (JBT), was based on Java 2 Enterprise Edition (J2EE) standards and was intended to allow BigFinancial to centralize its server infrastructure for consumer Internet functions. The authors conducted detailed interviews with several BTC managers and architects, and reviewed several hundred documents. Current deployment statistics for JBT are shown in Table 2.
The JBT infrastructure and applications were designed and built by BTC and its regional partners, with input from its clients around the world. BTC's experience had shown that consumer banking applications were not fundamentally different from one another across the business units, and BTC proposed and received funding for creation of a consolidated application set for Internet banking. A market evaluation determined that there were no suitable, globally reusable, complete applications on the market, nor any other organization with the track record of success required for confidence in the delivery. Final funding approval came from BigFinancial technology and business executives.
The requirements for JBT called for several major functional elements. The requirements were broken out among the infrastructural elements supporting the various planned application packages, and the applications themselves. The applications delivered with the initial release of JBT included a consumer Internet banking application set, an account activity and balance alerting function, and a portal content toolset.

Each of these components was designed to be reused intact in each business unit around the world, requiring only changes to business rules and language phrases that may be unique to a business. One of the fundamental requirements for each of the JBT applications was to include capabilities that were designed to be common to and shared by as many business units as possible, while allowing for all necessary business-specific variability.

Such variability was planned for in the requirements process, building on the LIB infrastructure and applications, as well as the legacy portal and alerts services that were already in production. Examples of the region- and business-specific variability include language variations, compliance with local regulatory requirements, and functionality based on local and regional competitive requirements.
JBT's initial high-level requirements documents included requirements across a range of categories. These categories included technology, operations, deployment, development, and tools. These requirements were intended to form the foundation for initial discussion and agreement with the stakeholders, and to support division of the upcoming tasks to define the architecture. Nine additional, more detailed requirements documents were created to flesh out the details referenced in the top-level requirements. Additional topics addressed by the detailed documents included language, business rules, host messaging, logging, portal services, and system management.

One of BigFinancial's regional technology leaders reported that JBT has been much easier to integrate than the legacy product, given its larger application base and ability to readily add applications to it. Notably, he indicated that JBT's design had taken into account the lessons learned from prior products, including improvements in performance, stability, and total cost of ownership. This resulted in a "win/win/win for businesses, technology groups, and customers."
From an economic viewpoint, BigFinancial indicates that the cost savings for first-time business unit implementations of products already deployed to other business units averaged between 20% and 40%, relative to the cost of new development. Further, subsequent deployments of updated releases to a group of business units produced cost savings of 50%–75% relative to the cost of maintaining the software for each business unit independently.
All core banking functionality is
supported by a single global applica-
tion set. There remain, in some cases,
functions required only by a specific
business or region. The JBT architec-
ture allows for those region-specific
applications to be developed by the
regional technology unit as required.
An overview of the JBT architecture is
shown in Figure 1.
BTC implemented JBT on principles
of a layered architecture,12 focusing on
interoperability and modularity. For
example, the application components
interact only with the application body
section of the page; all other elements
of navigation and branding are handled
by the common and portal services
Figure 1. Java Banking Toolkit architecture overview.

Table 2. JBT reuse results.

Region          Business units
Europe          > 18 business units
Asia            > 14 business units
Latin America   > 9 business units
North America   > 5 business units
January 2010 | Vol. 53 | No. 1 | Communications of the ACM

contributed articles
ments to global product capabilities,
along with the cost of training, devel-
opment and testing of business rules,
and ramp-up of operational processes.
In contrast, ongoing maintenance sav-
ings are generally larger, due to the
commonality across the code base for
numerous business units. This com-
monality enables bug fixes, security
patches, and other maintenance activi-
ties to be performed on one code base,
rather than one for each business unit.
BigFinancial has demonstrated that
it is possible for a large organization,
building software for its own internal
use, to move beyond the more common
models of software reuse. In so doing,
BigFinancial has achieved significant
economies of scale across its many
business units, and has shortened the
time to market for new deployments of
its products.
Numerous factors were critical to
the success of the reuse projects. These
included elements expected from the
more traditional reuse literature, in-
cluding organizational structure, tech-
nological foundations, and economic
factors. In addition, several new ele-
ments have been identified. These in-
clude the notions of trust and culture,
the concepts of a track record of large-
and fine-grained reuse success, and the
virtuous (and potentially vicious) cycle
of corporate mandates. Conversely,
organizational barriers prove to be the
greatest inhibitor to successful reuse.13
BTC took specific steps, over a period
of many years, to create and strengthen
its culture of reuse. Across numerous
product lines, reuse of components and
infrastructure packages was strongly
encouraged. Reuse of large-grained
elements was the next logical step,
working with a group of business units
within a single regional organization.
This supported the necessary business
alignment to enable large-grained re-
use. In addition, due to its position as a global technology provider to BigFinancial, BTC was able to leverage its knowledge of requirements across business units, explicitly design products to be readily reusable, and drive commonality of requirements to support that reuse.
On the technical factors related
to reuse, BTC’s results have provided
empirical evidence regarding the use
of various technologies and patterns
elements. In addition, transactional
messaging is isolated from the applica-
tion via a message abstraction layer, so
that unique messaging models can be
used in each region, if necessary.
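A message abstraction layer of the kind just described can be sketched as follows. The adapter names and record formats are hypothetical, not JBT's actual interfaces:

```python
# Sketch: the application speaks one neutral messaging interface,
# while per-region adapters handle each region's unique host
# messaging model. All names are illustrative.

class MessageAdapter:
    def send(self, operation, payload):
        raise NotImplementedError

class Iso8583Adapter(MessageAdapter):
    # One region's host might use an ISO 8583-style format...
    def send(self, operation, payload):
        return {"fmt": "iso8583", "op": operation, "data": payload}

class XmlHostAdapter(MessageAdapter):
    # ...while another region fronts an XML gateway.
    def send(self, operation, payload):
        return {"fmt": "xml", "op": operation, "data": payload}

class Application:
    def __init__(self, adapter: MessageAdapter):
        self.adapter = adapter  # injected per deployment region

    def get_balance(self, account):
        # Application code never sees the regional wire format.
        return self.adapter.send("balance_inquiry", {"account": account})

eu = Application(Iso8583Adapter())
asia = Application(XmlHostAdapter())
```

Swapping the adapter is the only per-region change; the application component itself is deployed unmodified.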
JBT includes both the infrastruc-
ture and applications components for
a range of banking functionality. The
infrastructure and applications com-
ponents are defined as independently
changeable releases, but are currently
packaged as a group to simplify the de-
ployment process.
Funding and governance of the
projects are coordinated through BTC,
with significant participation from the
business units. Business units have the
opportunity to choose other vendors
for their technology needs, though the
corporate technology strategy limited
that option as the JBT project gained
wider rollout status. Business units
participate in a semi-annual in-person
planning exercise to evaluate enhance-
ment requests and prioritize new busi-
ness deployments.
results
The authors examined a total of six dif-
ferent cases of software reuse. Three of
these were subcases of the Java Banking
Toolkit (JBT) – Internet banking, portal
services, and alerts, along with the re-
use of the JBT platform itself. The oth-
ers were the Worldwide SSO product,
and the BigFinancial Message Switch.
There were a variety of reuse success
levels, and a variety of levels of evidence
of anticipated supports and barriers to
reuse. The range of outcomes is represented as a two-dimensional graph, as shown in Figure 2.
BigFinancial measures its reuse
success in a very pragmatic, straight-
forward fashion. Rather than measur-
ing reused modules, lines of code, or
function points, BigFinancial instead
simply measures total deployments
of compatible code sets. Due to on-
going enhancements, the code base
continues to evolve over time, but in a
backwards-compatible fashion, so that
older versions can be and are readily
upgraded to the latest version as busi-
ness needs dictate.
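That metric, counting total deployments of compatible code sets rather than reused modules or lines, is simple to compute. A toy sketch with invented data:

```python
# Toy illustration of the deployment-counting reuse metric described
# in the text: every business unit running a backward-compatible
# code set counts as one reuse, regardless of version. Data invented.

deployments = [
    {"unit": "DE", "codeset": "JBT", "version": "3.1"},
    {"unit": "FR", "codeset": "JBT", "version": "3.1"},
    {"unit": "SG", "codeset": "JBT", "version": "3.0"},  # older but upgradeable
    {"unit": "US", "codeset": "BMS", "version": "1.2"},
]

def reuse_count(deployments, codeset):
    # Backward compatibility means version differences don't matter
    # to the metric; only membership in the compatible code set does.
    return sum(1 for d in deployments if d["codeset"] == codeset)
```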
BTC did not explicitly capture hard
economic measures of cost savings.
However, their estimates of the range
of cost savings are shown in Figure 3.
Cost savings are smaller for new de-
ployments due to the significant effort
required to map business unit require-
Figure 2. Reuse expectations and outcomes.
in actual reuse environments. Some
of these technologies and patterns are
platform-independent interfaces, busi-
ness rule structures, rigorous isolation
of concerns across software layers, and
versioning of interfaces to allow phased
migration of components to updated
interfaces. These techniques, among
others, are commonly recognized as
good architectural approaches for de-
signing systems, and have been exam-
ined more closely for their contribution
to the success of the reuse activities. In
this examination, they have been found
to contribute highly to the technologi-
cal elements required for success of
large-grained reuse projects.
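One of the patterns listed above, versioned interfaces enabling phased migration, might look roughly like this. The service names and version registry are hypothetical:

```python
# Sketch of interface versioning for phased migration: old and new
# interface versions coexist, so each consuming component can move
# to the new interface on its own schedule. Names are illustrative.

class ServiceV1:
    def balance(self, account):
        return {"account": account, "balance": 100}

class ServiceV2(ServiceV1):
    # v2 adds a currency parameter; v1 callers are untouched because
    # the new parameter has a default.
    def balance(self, account, currency="USD"):
        result = super().balance(account)
        result["currency"] = currency
        return result

# Both versions stay registered until the last v1 client migrates.
registry = {"v1": ServiceV1(), "v2": ServiceV2()}

def call(version, account, **kwargs):
    # Clients pin an interface version and migrate when ready.
    return registry[version].balance(account, **kwargs)
```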
Product vendors, and particularly
application service providers, routinely
conduct this type of development and
reuse, though with different motiva-
tions. (Application service providers
are now often referred to as providers
of Software as a Service.) As commer-
cial providers, they are more likely to be
market-driven, often with sales of Pro-
fessional Services for customization. In
contrast, the motivations in evidence
at BigFinancial seemed more aimed
at achieving the best combinations of
functionality, time to market, and cost.
The research provided an opportu-
nity to examine, in-depth, the various
forms of reuse practiced on three proj-
ects, and three subprojects, inside Big-
Financial. Some of those forms include
design reuse, code reuse, pattern reuse,
and test case reuse. The authors have found, based on documents and reports from participants, that the active
practice of systematic, finer-grained re-
use contributed to successful reuse of
systems at larger levels of granularity.
This study has provided a view of
management structures and leader-
ship styles, and an opportunity to ex-
amine how those contribute to, or work
against, successful reuse. Much has
been captured about IT governance in
general, and about organizational con-
structs to support reuse in various situ-
ations at BigFinancial/BTC. Leadership
of both BTC and BigFinancial was cited
as contributing to the success of the re-
use efforts, and indeed also was cited
as a prerequisite for even launching
a project that intends to accomplish
such large-grained reuse.
Sabherwal11 notes the criticality of
trust in outsourced IS relationships,
where the participants in projects may
not know one another before a project,
and may only work together on the one
project. As such, the establishment
and maintenance of trust is critical in
that environment. This is not entirely
applicable to BTC, as it is a peer organi-
zation to its client’s technology groups,
and its members often have long-stand-
ing relationships with their peers. Ring
and Van de Ven examine the broader
notions of cooperative inter-organizational relationships (IORs), and note that trust is a fundamental part of an IOR. Trust serves to mitigate the risks inherent in a relationship, and at both the personal and organizational levels is itself tempered by the potentially overriding forces of legal or organizational systems.10 This element
does seem to be applicable to BTC’s en-
vironment, in that trust is reported to
have been foundational to the assign-
ment of the creation of JBT to BTC.
Griss notes that culture is one ele-
ment of the organizational structure
that can impede reuse. A culture that
fears loss of creativity, lacks trust, or
doesn’t know how to effectively reuse
software will not be as successful as an
organization that doesn’t have these
impediments.4 The converse is then also likely: a culture that focuses on and implicitly welcomes reuse will be more successful. BTC’s long history of reuse, its lack of explicit incentives and metrics around more traditional reuse, and its position as a global provider of technology to its business partners make it likely that its culture is, indeed, a strong supporter of its reuse success.
Several other researchers have com-
mented on the impact of organizational
culture on reuse. Morisio et al8 refer in
passing to cultural factors, primarily as
potential inhibitors to reuse. Card and
Comer1 examine four cultural aspects
that can contribute to reuse adoption:
training, incentives, measurement,
and management. In addition, Card
and Comer’s work focuses generally on
cultural barriers, and how to overcome
them. In BTC’s case, however, there is
a solid cultural bias for reuse, and one
that, for example, no longer requires
incentives to promote reuse.
One key participant in the study had
a strong opinion to offer in relation to
fine- vs. coarse-grained reuse. The lead
architect for JBT was explicitly and vig-
orously opposed to a definition of reuse
Figure 3. Reuse cost savings ranges.
Paul D. Witman ([email protected]) is an Assistant Professor of Information Technology at California Lutheran University.

Terry Ryan ([email protected]) is an Associate Professor and Dean of the School of Information Systems at Claremont Graduate University.

© 2010 ACM 0001-0782/10/0100 $10.00
that slanted toward reuse of objects and components at a fine-grained level. This person’s opinion
was that while reuse at this granularity
was possible (indeed, BTC demonstrat-
ed success at this level), fine-grained
reuse was very difficult to achieve in a
distributed development project. The
lead architect further believed that
the leverage it provides was not nearly
as great as the leverage from a large-
grained reuse program. The integrators
of such larger-grained components can
then have more confidence that the
component has been used in a similar
environment, tested under appropri-
ate loads, and so on – relieving the risk
that a fine-grained component built for
one domain may get misused in a new
domain or at a new scale, and be unsuc-
cessful in that environment.
While BTC’s JBT product does, to
some extent, work as part of a software
product line (supporting its three ma-
jor applications), JBT’s real reuse does
not come in the form of developing
more instances from a common set of
core assets. Rather, it appears that JBT
is itself reused, intact, to support the
needs of each of the various businesses
in a highly configurable fashion.
Organizational barriers appeared,
at least in part, to contribute to the lack
of broad deployment of the BigFinan-
cial Message Switch. Gallivan3 defined
a model for technology innovation as-
similation and adoption, which includ-
ed the notion that even in the face of
management directive, some employ-
ees and organizations might not adopt
and assimilate a particular technology
or innovation. This concept might partly explain the results with BMS: it
was possible for some business units
and technology groups to resist its in-
troduction on a variety of grounds, in-
cluding business case, even with a de-
cision by a global steering committee
to proceed with deployment.
We noted previously the negative
impact of inter-organizational barriers
on reuse adoption, particularly in the
BMS case. This was particularly evident
in that the organization that created
BMS, and was in large part responsible
for “selling” it to other business units,
was positioned at a regional rather than
global technology level. This organiza-
tional location, along with the organi-
zation’s more limited experience with
globally reusable products, may have
contributed to the difficulty in accom-
plishing broader reuse of that product.
conclusion
While BTC’s results and BigFinancial’s
specific business needs may be some-
what unusual, it is likely that the busi-
ness and technology practices support-
ing reuse may be generalizable to other
banks and other technology users. Good system architecture supporting reuse, and an established business case that identifies the business value of the reuse, were fundamental to establishing the global reuse accomplished by BTC, and should be readily scalable to smaller and less global environments.
Key factors contributing to a suc-
cessful project will be a solid technolo-
gy foundation, experience building and
maintaining reusable software, and a
financial and organizational structure
that supports and promotes reuse. In
addition, the organization will need to
actively build a culture of large-grained
reuse, and establish trust with its busi-
ness partners. Establishing that trust
will be vital to even having the oppor-
tunity to propose a large-grained reus-
able project.
References
1. Card, D. and Comer, E. Why do so many reuse programs fail? IEEE Software 11, 5, 114–115.
2. Clements, P. and Northrop, L.M. Software Product Lines: Practices and Patterns. Addison-Wesley Professional, 2002.
3. Gallivan, M.J. Organizational adoption and assimilation of complex technological innovations: Development and application of a new framework. The DATA BASE for Advances in Information Systems 32, 3, 51–85.
4. Griss, M.L. Software reuse: From library to factory. IBM Systems Journal 32, 4, 548–566.
5. Karlsson, E.-A. Software Reuse: A Holistic Approach. John Wiley & Sons, West Sussex, England, 1995.
6. Krueger, C.W. New methods in software product line practice. Comm. ACM 49, 12 (Dec. 2006), 37–40.
7. Malan, R. and Wentzel, K. Economics of Software Reuse Revisited. Hewlett-Packard Software Technology Laboratory, Irvine, CA, 1993, 19.
8. Morisio, M., Ezran, M. and Tully, C. Success and failure factors in software reuse. IEEE Transactions on Software Engineering 28, 4, 340–357.
9. Ramachandran, M. and Fleischer, W. Design for large scale software reuse: An industrial case study. In Proceedings of the International Conference on Software Reuse (Orlando, FL, 1996), 104–111.
10. Ring, P.S. and Van de Ven, A.H. Developmental processes of cooperative interorganizational relationships. Academy of Management Review 19, 1, 90–118.
11. Sabherwal, R. The role of trust in outsourced IS development projects. Comm. ACM 42, 2 (Feb. 1999), 80–86.
12. Szyperski, C., Gruntz, D. and Murer, S. Component Software: Beyond Object-Oriented Programming. ACM Press, New York, 2002.
13. Witman, P. and Ryan, T. Innovation in large-grained software reuse: A case from banking. In Proceedings of the Hawaii International Conference on System Sciences (Waikoloa, HI, 2007), IEEE Computer Society.
Communications of the ACM | January 2018 | Vol. 61 | No. 1

practice
Image by Vitezslav Valka.
THE HETEROGENEITY, COMPLEXITY, and scale of cloud
applications make verification of their fault tolerance
properties challenging. Companies are moving away
from formal methods and toward large-scale testing
in which components are deliberately compromised
to identify weaknesses in the software. For example,
techniques such as Jepsen apply fault-injection testing
to distributed data stores, and Chaos Engineering
performs fault injection experiments on production
systems, often on live traffic. Both approaches have
captured the attention of industry and academia alike.
Unfortunately, the search space of distinct fault
combinations that an infrastructure can test is
intractable. Existing failure-testing solutions require
skilled and intelligent users who can supply the faults
to inject. These superusers, known as Chaos Engineers
and Jepsen experts, must study the sys-
tems under test, observe system execu-
tions, and then formulate hypotheses
about which faults are most likely to
expose real system-design flaws. This
approach is fundamentally unscal-
able and unprincipled. It relies on the
superuser’s ability to interpret how
a distributed system employs redun-
dancy to mask or ameliorate faults
and, moreover, the ability to recognize
the insufficiencies in those redundan-
cies—in other words, human genius.
This article presents a call to arms
for the distributed systems research
community to improve the state of
the art in fault tolerance testing.
Ordinary users need tools that au-
tomate the selection of custom-tai-
lored faults to inject. We conjecture
that the process by which superusers
select experiments—observing execu-
tions, constructing models of system
redundancy, and identifying weak-
nesses in the models—can be effec-
tively modeled in software. The ar-
ticle describes a prototype validating
this conjecture, presents early results
from the lab and the field, and identi-
fies new research directions that can
make this vision a reality.
The Future Is Disorder
Providing an “always-on” experience
for users and customers means that
distributed software must be fault tol-
erant—that is to say, it must be writ-
ten to anticipate, detect, and either
mask or gracefully handle the effects
of fault events such as hardware fail-
ures and network partitions. Writing
fault-tolerant software—whether for
distributed data management systems
involving the interaction of a handful
of physical machines, or for Web ap-
plications involving the cooperation of
tens of thousands—remains extremely
difficult. While the state of the art in
verification and program analysis con-
tinues to evolve in the academic world,
the industry is moving very much in
the opposite direction: away from formal methods (with some noteworthy exceptions41) and toward
Abstracting the Geniuses Away from Failure Testing

DOI: 10.1145/3152483
Article development led by queue.acm.org

Ordinary users need tools that automate the selection of custom-tailored faults to inject.

BY PETER ALVARO AND SEVERINE TYMON
up the stack and frustrate any attempts
at abstraction.
The Old Guard. The modern myth:
Formally verified distributed compo-
nents. If we cannot rely on geniuses to
hide the specter of partial failure, the
next best hope is to face it head on,
armed with tools. Until quite recently,
many of us (academics in particular)
looked to formal methods such as
model checking16,20,29,39,40,53,54 to assist
“mere mortal” programmers in writ-
ing distributed code that upholds its
guarantees despite pervasive uncer-
tainty in distributed executions. It is
not reasonable to exhaustively search
the state space of large-scale systems
(one cannot, for example, model
check Netflix), but the hope is that
modularity and composition (the next
best tools for conquering complexity)
can be brought to bear. If individual
distributed components could be
formally verified and combined into
systems in a way that preserved their
guarantees, then global fault toler-
ance could be obtained via composi-
tion of local fault tolerance.
Unfortunately, this, too, is a pipe
dream. Most model checkers require
a formal specification; most real-world
systems have none (or have not had one
since the design phase, many versions
ago). Software model checkers and oth-
er program-analysis tools require the
source code of the system under study.
The accessibility of source code is also
an increasingly tenuous assumption.
Many of the data stores targeted by
tools such as Jepsen are closed source;
large-scale architectures, while typical-
ly built from open source components,
are increasingly polyglot (written in a
wide variety of languages).
Finally, even if you assume that spec-
ifications or source code are available,
techniques such as model checking are
not a viable strategy for ensuring that
applications are fault tolerant because,
as mentioned, in the context of time-
outs, fault tolerance itself is an end-to-
end property that does not necessarily
hold under composition. Even if you
are lucky enough to build a system out
of individually verified components, it
does not follow that the system is fault tolerant: you may have made a critical error in the glue that binds them.
The Vanguard. The emerging ethos:
YOLO. Modern distributed systems
approaches that combine testing with
fault injection.
Here, we describe the underlying
causes of this trend, why it has been
successful so far, and why it is doomed
to fail in its current practice.
The Old Gods. The ancient myth:
Leave it to the experts. Once upon a
time, distributed systems researchers
and practitioners were confident that
the responsibility for addressing the
problem of fault tolerance could be
relegated to a small priesthood of ex-
perts. Protocols for failure detection,
recovery, reliable communication,
consensus, and replication could be
implemented once and hidden away
in libraries, ready for use by the layfolk.
This has been a reasonable dream.
After all, abstraction is the best tool
for overcoming complexity in com-
puter science, and composing reliable
systems from unreliable components
is fundamental to classical system
design.33 Reliability techniques such
as process pairs18 and RAID45 dem-
onstrate that partial failure can, in
certain cases, be handled at the low-
est levels of a system and successfully
masked from applications.
Unfortunately, these approaches
rely on failure detection. Perfect failure
detectors are impossible to implement
in a distributed system,9,15 in which it
is impossible to distinguish between
delay and failure. Attempts to mask
the fundamental uncertainty arising
from partial failure in a distributed
system—for example, RPC (remote
procedure calls8) and NFS (network file
system49)—have met (famously) with
difficulties. Despite the broad consen-
sus that these attempts are failed ab-
stractions,28 in the absence of better
abstractions, people continue to rely
on them to the consternation of devel-
opers, operators, and users.
In a distributed system—that is, a
system of loosely coupled components
interacting via messages—the failure
of a component is only ever manifested
as the absence of a message. The only
way to detect the absence of a message
is via a timeout, an ambiguous signal
that means either the message will nev-
er come or that it merely has not come
yet. Timeouts are an end-to-end con-
cern28,48 that must ultimately be man-
aged by the application. Hence, partial
failures in distributed systems bubble
are simply too large, too heteroge-
neous, and too dynamic for these
classic approaches to software qual-
ity to take root. In reaction, practitio-
ners increasingly rely on resiliency
techniques based on testing and fault
injection.6,14,19,23,27,35 These “black box”
approaches (which perturb and ob-
serve the complete system, rather
than its components) are (arguably)
better suited for testing an end-to-
end property such as fault tolerance.
Instead of deriving guarantees from
understanding how a system works
on the inside, testers of the system
observe its behavior from the outside,
building confidence that it functions
correctly under stress.
Two giants have recently emerged
in this space: Chaos Engineering6 and
Jepsen testing.24 Chaos Engineering,
the practice of actively perturbing pro-
duction systems to increase overall site
resiliency, was pioneered by Netflix,6
but since then LinkedIn,52 Microsoft,38
Uber,47 and PagerDuty5 have developed
Chaos-based infrastructures. Jepsen
performs black box testing and fault
injection on unmodified distributed
data management systems, in search
of correctness violations (for example,
counterexamples that show an execu-
tion was not linearizable).
Both approaches are pragmatic and
empirical. Each builds an understand-
ing of how a system operates under
faults by running the system and observ-
ing its behavior. Both approaches offer
a pay-as-you-go method to resiliency:
the initial cost of integration is low,
and the more experiments that are
performed, the higher the confidence
that the system under test is robust.
Because these approaches represent
a straightforward enrichment of exist-
ing best practices in testing with well-
understood fault injection techniques,
they are easy to adopt. Finally, and
perhaps most importantly, both ap-
proaches have been shown to be effec-
tive at identifying bugs.
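The black-box method both giants share, perturbing a running system with faults and observing only its externally visible behavior, can be illustrated with a toy replicated register. Nothing here reflects Jepsen's or Netflix's actual implementations; it is a minimal sketch of the testing loop:

```python
import random

# Minimal sketch of black-box fault-injection testing: run the system,
# inject a fault, and judge correctness purely from outside behavior.
# The "system" is a toy replicated register; everything is illustrative.

class Replica:
    def __init__(self):
        self.value = 0
        self.up = True

class ReplicatedRegister:
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, v):
        for r in self.replicas:
            if r.up:
                r.value = v

    def read(self):
        live = [r for r in self.replicas if r.up]
        return live[0].value if live else None

def fault_injection_test(trials=100, seed=0):
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        sys = ReplicatedRegister()
        sys.write(1)
        # Inject a fault: crash one replica at random.
        rng.choice(sys.replicas).up = False
        # Observe from outside: with one crash, redundancy should
        # still let a read return the written value.
        if sys.read() != 1:
            violations += 1
    return violations
```

The pay-as-you-go quality the text notes is visible here: each additional trial is cheap, and confidence grows with the number of experiments run.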
Unfortunately, both techniques
also have a fatal flaw: they are manual
processes that require an extremely
sophisticated operator. Chaos Engi-
neers are a highly specialized subclass
of site reliability engineers. To devise
a custom fault injection strategy, a
Chaos Engineer typically meets with
different service teams to build an
understanding of the idiosyncrasies
of various components and their in-
teractions. The Chaos Engineer then
targets those services and interactions
that seem likely to have latent fault tol-
erance weaknesses. Not only is this ap-
proach difficult to scale since it must
be repeated for every new composition
of services, but its critical currency—
a mental model of the system under
study—is hidden away in a person’s
brain. These points are reminiscent
of a bigger (and more worrying) trend
in industry toward reliability priest-
hoods,7 complete with icons (dash-
boards) and rituals (playbooks).
Jepsen is in principle a framework
that anyone can use, but to the best of
our knowledge all of the reported bugs
discovered by Jepsen to date were dis-
covered by its inventor, Kyle Kingsbury,
who currently operates a “distributed
systems safety research” consultancy.24
Applying Jepsen to a storage system
requires that the superuser carefully read
the system documentation, generate
workloads, and observe the externally
visible behaviors of the system under
test. It is then up to the operator to
choose—from the massive combina-
torial space of “nemeses,” including
machine crashes and network parti-
tions—those fault schedules that are
likely to drive the system into returning
incorrect responses.
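The size of that combinatorial space is easy to appreciate with a small count. The menu of nemeses and targets below is invented, and the count ignores ordering and timing, which make the real space far larger still:

```python
from itertools import combinations

# Illustrating the combinatorial explosion of fault schedules: even a
# tiny menu of nemeses applied to a handful of targets yields hundreds
# of candidate experiments. All names and numbers are illustrative.

nemeses = ["crash", "partition", "clock-skew"]
targets = ["node1", "node2", "node3", "node4", "node5"]

faults = [(n, t) for n in nemeses for t in targets]  # 15 distinct faults

def schedules(max_faults):
    # Count all ways to pick up to max_faults distinct faults,
    # ignoring order and injection timing.
    return sum(1 for k in range(1, max_faults + 1)
               for _ in combinations(faults, k))
```

With just three nemeses, five targets, and at most three simultaneous faults, there are already 575 candidate schedules to choose from, which is why superuser intuition currently matters so much.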
A human in the loop is the kiss of
death for systems that need to keep up
with software evolution. Human atten-
tion should always be targeted at tasks
that computers cannot do! Moreover,
the specialists that Chaos and Jepsen
testing require are expensive and rare.
Here, we show how geniuses can be ab-
stracted away from the process of fail-
ure testing.
We Don’t Need Another Hero
Rapidly changing assumptions about
our visibility into distributed system
internals have made obsolete many
if not all of the classic approaches to
software quality, while emerging “cha-
os-based” approaches are fragile and
unscalable because of their genius-in-
the-loop requirement.
We present our vision of automated
failure testing by looking at how the
same changing environments that has-
tened the demise of time-tested resil-
iency techniques can enable new ones.
We argue the best way to automate the
experts out of the failure-testing loop is
to imitate their best practices in soft-
ware and show how the emergence of
sophisticated observability infrastruc-
ture makes this possible.
The order is rapidly fadin’. For large-
scale distributed systems, the three
fundamental assumptions of tradi-
tional approaches to software quality
are quickly fading in the rearview mir-
ror. The first to go was the belief that
you could rely on experts to solve the
hardest problems in the domain. Sec-
ond was the assumption that a formal
specification of the system is available.
Finally, any program analysis (broadly
defined) that requires that source code
is available must be taken off the ta-
ble. The erosion of these assumptions
helps explain the move away from clas-
sic academic approaches to resiliency
in favor of the black box approaches
described earlier.
What hope is there of understand-
ing the behavior of complex systems
in this new reality? Luckily, the fact
that it is more difficult than ever to
understand distributed systems from
the inside has led to the rapid evolu-
tion of tools that allow us to under-
stand them from the outside. Call-
graph logging was first described by
Google;51 similar systems are in use
at Twitter,4 Netflix,1 and Uber,50 and
the technique has since been stan-
dardized.43 It is reasonable to assume
that a modern microservice-based
Internet enterprise will already have
instrumented its systems to collect
call-graph traces. A number of start-
ups that focus on observability have
recently emerged.21,34 Meanwhile,
provenance collection techniques
for data processing systems11,22,42 are
becoming mature, as are operating
system-level provenance tools.44 Re-
cent work12,55 has attempted to infer
causal and communication structure
of distributed computations from
raw logs, bringing high-level explana-
tions of outcomes within reach even
for uninstrumented systems.
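Recovering communication structure from collected trace records can be sketched as follows. The record format is invented, loosely modeled on the parent/child span relationships used by the call-graph tracing systems cited above:

```python
# Sketch: derive a service-level call graph from trace records.
# The record format is hypothetical; real tracing systems emit
# richer spans, but the parent/child structure is the key idea.

traces = [
    {"trace": "t1", "span": "a", "parent": None, "service": "frontend"},
    {"trace": "t1", "span": "b", "parent": "a", "service": "auth"},
    {"trace": "t1", "span": "c", "parent": "a", "service": "orders"},
    {"trace": "t1", "span": "d", "parent": "c", "service": "db"},
]

def call_graph(records):
    # Edges are (caller service, callee service) pairs derived from
    # each span's link to its parent span.
    by_span = {r["span"]: r for r in records}
    edges = set()
    for r in records:
        if r["parent"] is not None:
            edges.add((by_span[r["parent"]]["service"], r["service"]))
    return edges
```

A graph like this is exactly the raw material from which a model of system redundancy can be built, without any access to source code.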
Regarding testing distributed systems.
Chaos Monkey, like they mention, is awe-
some, and I also highly recommend get-
ting Kyle to run Jepsen tests.
—Commentator on HackerRumor
of properties that are either maintained
throughout the system’s execution (for
example, system invariants or safety
properties) or established during execu-
tion (for example, liveness properties).
Most distributed systems with which
we interact, though their executions
may be unbounded, nevertheless pro-
vide finite, bounded interactions that
have outcomes. For example, a broad-
cast protocol may run “forever” in a re-
active system, but each broadcast deliv-
ered to all group members constitutes
a successful execution.
By viewing distributed systems in
this way, we can revise the definition:
A system is fault tolerant if it provides
sufficient mechanisms to achieve its
successful outcomes despite the given
class of faults.
Step 3: Formulate experiments that
target weaknesses in the façade. If we
could understand all of the ways in
which a system can obtain its good
outcomes, we could understand which
faults it can tolerate (or which faults it
could be sensitive to). We assert that
(whether they realize it or not!) the
process by which Chaos Engineers
and Jepsen superusers determine, on
a system-by-system basis, which faults
to inject uses precisely this kind of rea-
soning. A target experiment should
exercise a combination of faults that
knocks out all of the supports for an ex-
pected outcome.
Carrying out the experiments turns
out to be the easy part. Fault injection
infrastructure, much like observability
infrastructure, has evolved rapidly in
recent years. In contrast to random,
coarse-grained approaches to distrib-
uted fault injection such as Chaos
Monkey,23 approaches such as FIT
(failure injection testing)17 and Grem-
lin32 allow faults to be injected at the
granularity of individual requests with
high precision.
Step 4. Profit! This process can be ef-
fectively automated. The emergence of
sophisticated tracing tools described
earlier makes it easier than ever to
build redundancy models even from
the executions of black box systems.
The rapid evolution of fault injection
infrastructure makes it easier than
ever to test fault hypotheses on large-
scale systems. Figure 1 illustrates how
the automation described in this here
fits neatly between existing observ-
Away from the experts. While this
quote is anecdotal, it is difficult to
imagine a better example of the fun-
damental unscalability of the current
state of the art. A single person can-
not possibly keep pace with the ex-
plosion of distributed system imple-
mentations. If we can take the human
out of this critical loop, we must; if we
cannot, we should probably throw in
the towel.
The first step to understanding how
to automate any process is to compre-
hend the human component that we
would like to abstract away. How do
Chaos Engineers and Jepsen superus-
ers apply their unique genius in prac-
tice? Here is the three-step recipe com-
mon to both approaches.
Step 1: Observe the system in action.
The human element of the Chaos and
Jepsen processes begins with princi-
pled observation, broadly defined.
A Chaos Engineer will, after study-
ing the external API of services rel-
evant to a given class of interactions,
meet with the engineering teams to
better understand the details of the
implementations of the individual
services.25 To understand the high-
level interactions among services, the
engineer will then peruse call-graph
traces in a trace repository.3
A Jepsen superuser typically begins
by reviewing the product documenta-
tion, both to determine the guarantees
that the system should uphold and to
learn something about the mecha-
nisms by which it does so. From there,
the superuser builds a model of the
behavior of the system based on inter-
action with the system’s external API.
Since the systems under study are typ-
ically data management and storage,
these interactions involve generating
histories of reads and writes.31
The first step to understanding what
can go wrong in a distributed system is
watching things go right: observing the
system in the common case.
Step 2. Build a mental model of how
the system tolerates faults. The com-
mon next step in both approaches is
the most subtle and subjective. Once
there is a mental model of how a dis-
tributed system behaves (at least in the
common case), how is it used to help
choose the appropriate faults to inject?
At this point we are forced to dabble in
conjecture: bear with us.
Fault tolerance is redundancy. Giv-
en some fixed set of faults, we say that
a system is “fault tolerant” exactly if it
operates correctly in all executions in
which those faults occur. What does it
mean to “operate correctly”? Correct-
ness is a system-specific notion, but,
broadly speaking, is expressed in terms
Figure 1. Our vision of automated failure
testing.
explanations
models
of
redundancy
fault
injection
Figure 2. Fault injection and fault-tolerant code.
APP1 APP1 APP2 APP2
caller
fault
callee
API API API API API
J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M
M U N I C AT I O N S O F T H E A C M 59
practice
ability infrastructure and fault injec-
tion infrastructure, consuming the
former, maintaining a model of system
redundancy, and using it to param-
eterize the latter. Explanations of sys-
tem outcomes and fault injection in-
frastructures are already available. In
the current state of the art, the puzzle
piece that fits them together (models of
redundancy) is a manual process. LDFI
(as we will explain) shows that automa-
tion of this component is possible.
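The loop just described (observability feeds a model of redundancy, which parameterizes fault injection, whose results refine the model) can be sketched in a few lines. Everything below is invented for illustration: supports observed in traces are modeled as sets of components, and the next experiment is a smallest fault set hitting every known support.

```python
from itertools import chain, combinations

def next_experiment(model):
    """Pick a smallest fault set that intersects every known support set."""
    components = sorted(set(chain.from_iterable(model)))
    for r in range(1, len(components) + 1):
        for faults in combinations(components, r):
            if all(set(faults) & support for support in model):
                return set(faults)
    return None

def run_once(system, model):
    """One turn: hypothesize an experiment, inject, then refine the model."""
    experiment = next_experiment(model)
    ok, trace = system(experiment)
    if not ok:
        return "bug", experiment            # expected outcome failed to occur
    model.add(frozenset(trace))             # learned a new redundant support
    return "refined", experiment

def replicated_store(faults):
    """Toy system: the outcome succeeds if any replica survives the faults."""
    for replica in ("primary", "backup"):
        if replica not in (faults or set()):
            return True, [replica]
    return False, []

model = {frozenset(["primary"])}            # support seen in a fault-free run
first = run_once(replicated_store, model)   # kills primary, discovers backup
second = run_once(replicated_store, model)  # then targets both replicas
```

In the toy run, the first experiment reveals the backup (enriching the model), and the second knocks out both supports at once, which is exactly the kind of targeted experiment the text describes.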
A Blast from the Past
In previous work, we introduced a bug-
finding tool called LDFI (lineage-driven
fault injection).2 LDFI uses data prove-
nance collected during simulations of
distributed executions to build deriva-
tion graphs for system outcomes. These
graphs function much like the models
of system redundancy described ear-
lier. LDFI then converts the derivation
graphs into a Boolean formula whose
satisfying assignments correspond to
combinations of faults that invalidate
all derivations of the outcome. An ex-
periment targeting those faults will
then either expose a bug (that is, the ex-
pected outcome fails to occur) or reveal
additional derivations (for example, af-
ter a timeout, the system fails over to a
backup) that can be used to enrich the
model and constrain future solutions.
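The derivation-graph-to-formula step can be illustrated concretely. Treating each derivation of an outcome as the set of base facts it depends on, the formula is a conjunction over derivations of "at least one supporting fact fails," and its minimal satisfying assignments are the fault sets worth injecting. The two derivations below are an invented example; LDFI itself hands the formula to a solver rather than enumerating, which this brute-force sketch does only for clarity.

```python
from itertools import chain, combinations

# Each derivation of the outcome is the set of base facts supporting it.
derivations = [
    {"A_up_t"},                  # replica A delivers the outcome directly
    {"conn_XY", "backup_B"},     # or a backup delivers it over link X-Y
]

def invalidating_fault_sets(derivations):
    """Enumerate minimal fault sets that break every derivation.

    A fault set S invalidates derivation D if it removes some fact in D,
    so S must intersect every derivation (a minimal hitting set)."""
    facts = sorted(set(chain.from_iterable(derivations)))
    minimal = []
    for r in range(1, len(facts) + 1):
        for candidate in combinations(facts, r):
            candidate = set(candidate)
            if all(candidate & d for d in derivations) and \
               not any(m <= candidate for m in minimal):
                minimal.append(candidate)
    return minimal

experiments = invalidating_fault_sets(derivations)
```

Each returned set is one experiment; if the outcome still occurs under it, the trace of that run exposes a new derivation to add to the model, exactly the enrichment loop the text describes.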
At its heart, LDFI reapplies well-
understood techniques from data
management systems, treating fault
tolerance as a materialized view main-
tenance problem.2,13 It models a dis-
tributed system as a query, its expect-
ed outcomes as query outcomes, and
critical facts such as “replica A is up at
time t” and “there is connectivity be-
tween nodes X and Y during the inter-
val i . . . j” as base facts. It can then ask
a how-to query:37 What changes to base
data will cause changes to the derived
data in the view? The answers to this
query are the faults that could, accord-
ing to the current model, invalidate the
expected outcomes.
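The how-to query itself can be phrased directly over base facts: evaluate the "view" (the expected outcome) against the fact database, and search for the smallest deletions of base facts that change the derived data. A toy sketch in that spirit (the fact names and the view definition are invented):

```python
from itertools import combinations

base_facts = {"replicaA_up", "replicaB_up", "conn_XY"}

def view(facts):
    """Derived outcome: durable if the link is up and some replica is up."""
    return "conn_XY" in facts and \
           ("replicaA_up" in facts or "replicaB_up" in facts)

def how_to_break(facts):
    """Smallest sets of base-fact deletions that flip the view to false."""
    for r in range(1, len(facts) + 1):
        answers = [set(c) for c in combinations(sorted(facts), r)
                   if not view(facts - set(c))]
        if answers:
            return answers
    return []

answers = how_to_break(base_facts)   # the single link is the weakest support
```

Here the query immediately surfaces the unreplicated link as the single point of failure, while either replica alone can be lost safely.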
The idea seems far-fetched, but the
LDFI approach shows a great deal of
promise. The initial prototype demon-
strated the efficacy of the approach at
the level of protocols, identifying bugs
in replication, broadcast, and commit
protocols.2,46 Notably, LDFI reproduced
a bug in the replication protocol used by
the Kafka distributed log26 that was first
(manually) identified by Kingsbury.30
A later iteration of LDFI is deployed at
Netflix,1 where (much like the illustra-
tion in Figure 1) it was implemented
as a microservice that consumes traces
from a call-graph repository service and
provides inputs for a fault injection ser-
vice. Since its deployment, LDFI has
identified 11 critical bugs in user-fac-
ing applications at Netflix.1
Rumors from the Future
The prior research presented earlier is
only the tip of the iceberg. Much work
still needs to be undertaken to realize
the vision of fully automated failure
testing for distributed systems. Here,
we highlight nascent research that
shows promise and identifies new di-
rections that will help realize our vision.
Don’t overthink fault injection. In the
context of resiliency testing for distribut-
ed systems, attempting to enumerate
and faithfully simulate every possible
kind of fault is a tempting but dis-
tracting path. The problem of under-
standing all the causes of faults is not
directly relevant to the target, which
is to ensure that code (along with its
configuration) intended to detect and
mitigate faults performs as expected.
Consider Figure 2: The diagram on
the left shows a microservice-based
architecture; arrows represent calls
generated by a client request. The
right-hand side zooms in on a pair of
interacting services. The shaded box
in the caller service represents the
fault tolerance logic that is intended
to detect and handle faults of the cal-
lee. Failure testing targets bugs in this
logic. The fault tolerance logic targeted
in this bug search is represented as the
shaded box in the caller service, while
the injected faults affect the callee.
The common effect of all faults, from
the perspective of the caller, is explicit
error returns, corrupted responses,
and (possibly infinite) delay. Of these
manifestations, the first two can be ad-
equately tested with unit tests. The last
is difficult to test, leading to branches
of code that are infrequently executed.
If we inject only delay, and only at com-
ponent boundaries, we conjecture that
we can address the majority of bugs re-
lated to fault tolerance.
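The delay-only conjecture above can be made concrete: a caller's timeout/fallback branch is precisely the code that unit tests rarely reach, and injecting nothing but delay at the call boundary exercises it. A minimal sketch, with invented services and deadlines (a real caller would preempt the slow call; this one just measures it):

```python
import time

def call_with_deadline(fn, deadline, fallback):
    """Caller-side fault tolerance: treat a too-slow callee as failed."""
    start = time.monotonic()
    result = fn()
    if time.monotonic() - start > deadline:
        return fallback()       # the branch unit tests rarely reach
    return result

def inject_delay(fn, seconds):
    """Fault injector: add delay at the component boundary, nothing else."""
    def delayed():
        time.sleep(seconds)
        return fn()
    return delayed

callee = lambda: "fresh value"
fallback = lambda: "cached value"

normal = call_with_deadline(callee, deadline=0.1, fallback=fallback)
faulted = call_with_deadline(inject_delay(callee, 0.25), 0.1, fallback)
```

Explicit errors and corrupt responses, by contrast, can be produced trivially in unit tests, which is why delay is the manifestation worth injecting in situ.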
Explanations everywhere. If we can provide better explanations of system outcomes, we can build better models of redundancy. Unfortunately, a barrier to entry for systems such as LDFI is the unwillingness of software developers and operators to instrument their systems for tracing or provenance collection. Fortunately, operating system-level provenance-collection techniques are mature and can be applied to uninstrumented systems.

Moreover, the container revolution makes simulating distributed executions of black box software within a single hypervisor easier than ever. We are actively exploring the collection of system call-level provenance from unmodified distributed software in order to select a custom-tailored fault injection schedule. Doing so requires extrapolating application-level causal structure from low-level traces, identifying appropriate cut points in an observed execution, and finally synchronizing the execution with fault injection actions.

We are also interested in the possibility of inferring high-level explanations from even noisier signals, such as raw logs. This would allow us to relax the assumption that the systems under study have been instrumented to collect execution traces. While this is a difficult problem, work such as the Mystery Machine12 developed at Facebook shows great promise.

Toward better models. The LDFI system represents system redundancy using derivation graphs and treats the task of identifying possible bugs as a materialized-view maintenance problem. LDFI was hence able to exploit well-understood theory and mechanisms from the history of data management systems research. But this is just one of many ways to represent how a system provides alternative computations to achieve its expected outcomes.

A shortcoming of the LDFI approach is its reliance on assumptions of determinism. In particular, it assumes that if it has witnessed a computation that, under a particular contingency (that is, given certain inputs and in the presence of certain faults), produces a successful outcome, then any future computation under that contingency will produce the same outcome. That is to say, it ignores the uncertainty in timing that is fundamental to distributed systems. A more appropriate way to model system redundancy would be to embrace (rather than abstracting away) this uncertainty.

Distributed systems are probabilistic by nature and are arguably better modeled probabilistically. Future directions of work include the probabilistic representation of system redundancy and an exploration of how this representation can be exploited to guide the search of fault experiments. We encourage the research community to join in exploring alternative internal representations of system redundancy.

Turning the explanations inside out. Most of the classic work on data provenance in database research has focused on aspects related to human-computer interaction. Explanations of why a query returned a particular result can be used to debug both the query and the initial database—given an unexpected result, what changes could be made to the query or the database to fix it? By contrast, in the class of systems we envision (and for LDFI concretely), explanations are part of the internal language of the reasoner, used to construct models of redundancy in order to drive the search through faults.

Ideally, explanations should play a role in both worlds. After all, when a bug-finding tool such as LDFI identifies a counterexample to a correctness property, the job of the programmers has only just begun—now they must undertake the onerous job of distributed debugging. Tooling around debugging has not kept up with the explosive pace of distributed systems development. We continue to use tools that were designed for a single site, a uniform memory, and a single clock. While we are not certain what an ideal distributed debugger should look like, we are quite certain that it does not look like GDB (GNU Project debugger).36 The derivation graphs used by LDFI show how provenance can also serve a role in debugging by providing a concise, visual explanation of how the system reached a bad state.

This line of research can be pushed further. To understand the root causes of a bug in LDFI, a human operator must review the provenance graphs of the good and bad executions and then examine the ways in which they differ. Intuitively, if you could abstractly subtract the (incomplete by assumption) explanations of the bad outcomes from the explanations of the good out-
comes,10 then the root cause of the discrepancy would likely be near the "frontier" of the difference.

Conclusion
A sea change is occurring in the techniques used to determine whether distributed systems are fault tolerant. The emergence of fault injection approaches such as Chaos Engineering and Jepsen is a reaction to the erosion of the availability of expert programmers, formal specifications, and uniform source code. For all of their promise, these new approaches are crippled by their reliance on superusers who decide which faults to inject.

To address this critical shortcoming, we propose a way of modeling and ultimately automating the process carried out by these superusers. The enabling technologies for this vision are the rapidly improving observability and fault injection infrastructures that are becoming commonplace in the industry. While LDFI provides constructive proof that this approach is possible and profitable, it is only the beginning. Much work remains to be done in targeting faults at a finer grain, constructing more accurate models of system redundancy, and providing better explanations to end users of exactly what went wrong when bugs are identified. The distributed systems research community is invited to join in exploring this new and promising domain.

Related articles on queue.acm.org
Fault Injection in Production
John Allspaw
http://queue.acm.org/detail.cfm?id=2353017
The Verification of a Distributed System
Caitie McCaffrey
http://queue.acm.org/detail.cfm?id=2889274
Injecting Errors for Fun and Profit
Steve Chessin
http://queue.acm.org/detail.cfm?id=1839574

References
1. Alvaro, P., et al. Automating failure-testing research at Internet scale. In Proceedings of the 7th ACM Symposium on Cloud Computing (2016), 17–28.
2. Alvaro, P., Rosen, J., Hellerstein, J.M. Lineage-driven fault injection. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2015), 331–346.
3. Andrus, K. Personal communication, 2016.
4. Aniszczyk, C. Distributed systems tracing with Zipkin. Twitter Engineering; https://blog.twitter.com/2012/distributed-systems-tracing-with-zipkin.
5. Barth, D. Inject failure to make your systems more reliable. DevOps.com; http://devops.com/2014/06/03/inject-failure/.
6. Basiri, A., et al. Chaos Engineering. IEEE Software 33, 3 (2016), 35–41.
7. Beyer, B., Jones, C., Petoff, J., Murphy, N.R. Site Reliability Engineering. O'Reilly, 2016.
8. Birrell, A.D., Nelson, B.J. Implementing remote procedure calls. ACM Transactions on Computer Systems 2, 1 (1984), 39–59.
9. Chandra, T.D., Hadzilacos, V., Toueg, S. The weakest failure detector for solving consensus. Journal of the ACM 43, 4 (1996), 685–722.
10. Chen, A., et al. The good, the bad, and the differences: better network diagnostics with differential provenance. In Proceedings of the ACM SIGCOMM Conference (2016), 115–128.
11. Chothia, Z., Liagouris, J., McSherry, F., Roscoe, T. Explaining outputs in modern data analytics. Proceedings of the VLDB Endowment 9, 12 (2016), 1137–1148.
12. Chow, M., et al. The Mystery Machine: end-to-end performance analysis of large-scale Internet services. In Proceedings of the 11th Usenix Conference on Operating Systems Design and Implementation (2014), 217–231.
13. Cui, Y., Widom, J., Wiener, J.L. Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems 25, 2 (2000), 179–227.
14. Dawson, S., Jahanian, F., Mitton, T. ORCHESTRA: a fault injection environment for distributed systems. In Proceedings of the 26th International Symposium on Fault-Tolerant Computing (1996).
15. Fischer, M.J., Lynch, N.A., Paterson, M.S. Impossibility of distributed consensus with one faulty process. Journal of the ACM 32, 2 (1985), 374–382; https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf.
16. Fisman, D., Kupferman, O., Lustig, Y. On verifying fault tolerance of distributed protocols. In Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science 4963, Springer (2008), 315–331.
17. Gopalani, N., Andrus, K., Schmaus, B. FIT: failure injection testing. Netflix Technology Blog; http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html.
18. Gray, J. Why do computers stop and what can be done about it? Tandem Technical Report 85.7 (1985); http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf.
19. Gunawi, H.S., et al. FATE and DESTINI: a framework for cloud recovery testing. In Proceedings of the 8th Usenix Conference on Networked Systems Design and Implementation (2011), 238–252; http://db.cs.berkeley.edu/papers/nsdi11-fate-destini.pdf.
20. Holzmann, G. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003.
21. Honeycomb, 2016; https://honeycomb.io/.
22. Interlandi, M., et al. Titian: data provenance support in Spark. Proceedings of the VLDB Endowment 9, 3 (2015), 216–227.
23. Izrailevsky, Y., Tseitlin, A. The Netflix Simian Army. Netflix Technology Blog; http://techblog.netflix.com/2011/07/netflix-simian-army.html.
24. Jepsen. Distributed systems safety research, 2016; http://jepsen.io/.
25. Jones, N. Personal communication, 2016.
26. Kafka 0.8.0. Apache, 2013; https://kafka.apache.org/08/documentation.html.
27. Kanawati, G.A., Kanawati, N.A., Abraham, J.A. FERRARI: a flexible software-based fault and error injection system. IEEE Transactions on Computers 44, 2 (1995), 248–260.
28. Kendall, S.C., Waldo, J., Wollrath, A., Wyant, G. A note on distributed computing. Technical Report, Sun Microsystems Laboratories, 1994.
29. Killian, C.E., Anderson, J.W., Jhala, R., Vahdat, A. Life, death, and the critical transition: finding liveness bugs in systems code. In Proceedings of Networked Systems Design and Implementation (2007), 243–256.
30. Kingsbury, K. Call me maybe: Kafka, 2013; http://aphyr.com/posts/293-call-me-maybe-kafka.
31. Kingsbury, K. Personal communication, 2016.
32. Lafeldt, M. The discipline of Chaos Engineering. Gremlin Inc., 2017; https://blog.gremlininc.com/the-discipline-of-chaos-engineering-e39d2383c459.
33. Lampson, B.W. Atomic transactions. In Distributed Systems—Architecture and Implementation, An Advanced Course (1980), 246–265; https://link.springer.com/chapter/10.1007%2F3-540-10571-9_11.
34. LightStep, 2016; http://lightstep.com/.
35. Marinescu, P.D., Candea, G. LFI: a practical and general library-level fault injector. In IEEE/IFIP International Conference on Dependable Systems and Networks (2009).
36. Matloff, N., Salzman, P.J. The Art of Debugging with GDB, DDD, and Eclipse. No Starch Press, 2008.
37. Meliou, A., Suciu, D. Tiresias: the database oracle for how-to queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2012), 337–348.
38. Microsoft Azure documentation. Introduction to the fault analysis service, 2016; https://azure.microsoft.com/en-us/documentation/articles/service-fabric-testability-overview/.
39. Musuvathi, M., et al. CMC: a pragmatic approach to model checking real code. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, ACM SIGOPS Operating Systems Review 36 (2002), 75–88.
40. Musuvathi, M., et al. Finding and reproducing Heisenbugs in concurrent programs. In Proceedings of the 8th Usenix Conference on Operating Systems Design and Implementation (2008), 267–280.
41. Newcombe, C., et al. Use of formal methods at Amazon Web Services. Technical Report, 2014; http://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf.
42. Olston, C., Reed, B. Inspector Gadget: a framework for custom monitoring and debugging of distributed data flows. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2011), 1221–1224.
43. OpenTracing, 2016; http://opentracing.io/.
44. Pasquier, T.F.J.-M., Singh, J., Eyers, D.M., Bacon, J. CamFlow: managed data-sharing for cloud services, 2015; https://arxiv.org/pdf/1506.04391.pdf.
45. Patterson, D.A., Gibson, G., Katz, R.H. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, 109–116; http://web.mit.edu/6.033/2015/wwwdocs/papers/Patterson88.pdf.
46. Ramasubramanian, K., et al. Growing a protocol. In Proceedings of the 9th Usenix Workshop on Hot Topics in Cloud Computing (2017).
47. Reinhold, E. Rewriting Uber engineering: the opportunities microservices provide. Uber Engineering, 2016; https://eng.uber.com/building-tincup/.
48. Saltzer, J.H., Reed, D.P., Clark, D.D. End-to-end arguments in system design. ACM Transactions on Computer Systems 2, 4 (1984), 277–288.
49. Sandberg, R. The Sun Network File System: design, implementation and experience. Technical Report, Sun Microsystems; in Proceedings of the Summer 1986 Usenix Technical Conference and Exhibition.
50. Shkuro, Y. Jaeger: Uber's distributed tracing system. Uber Engineering, 2017; https://uber.github.io/jaeger/.
51. Sigelman, B.H., et al. Dapper, a large-scale distributed systems tracing infrastructure. Technical Report, Google Research, 2010; https://research.google.com/pubs/pub36356.html.
52. Shenoy, A. A deep dive into Simoorg: our open source failure induction framework. LinkedIn Engineering, 2016; https://engineering.linkedin.com/blog/2016/03/deep-dive-Simoorg-open-source-failure-induction-framework.
53. Yang, J., et al. MODIST: transparent model checking of unmodified distributed systems. In Proceedings of the 6th Usenix Symposium on Networked Systems Design and Implementation (2009), 213–228.
54. Yu, Y., Manolios, P., Lamport, L. Model checking TLA+ specifications. In Proceedings of the 10th IFIP WG 10.5 Advanced Research Working Conference on Correct Hardware Design and Verification Methods (1999), 54–66.
55. Zhao, X., et al. lprof: a non-intrusive request flow profiler for distributed systems. In Proceedings of the 11th Usenix Conference on Operating Systems Design and Implementation (2014), 629–644.

Peter Alvaro is an assistant professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io).

Severine Tymon is a technical writer who has written documentation for both internal and external users of enterprise and open source software, including for Microsoft, CNET, VMware, and Oracle.

Copyright held by owners/authors. Publication rights licensed to ACM. $15.00.
International Journal of Performability Engineering Vol. 6, No. 6, November 2010, pp. 531-546.
© RAMS Consultants, Printed in India
*Corresponding author's email: [email protected]

Successful Application of Software Reliability: A Case Study

NORMAN F. SCHNEIDEWIND
Fellow of the IEEE
2822 Raccoon Trail, Pebble Beach, California 93953 USA

(Received on July 30, 2009, revised on May 3, 2010)
Abstract: The purpose of this case study is to help readers implement or improve a software reliability program in their organizations, using a step-by-step approach based on the Institute of Electrical and Electronics Engineers (IEEE) and American Institute of Aeronautics and Astronautics (AIAA) Recommended Practice for Software Reliability, released in June 2008, supported by a case study from the NASA Space Shuttle. This case study covers the major phases that the software engineering practitioner needs in planning and executing a software reliability engineering program. These phases require a number of steps for their implementation, and these steps provide a structured approach to the software reliability process. Each step will be discussed to provide a good understanding of the entire software reliability process. Major topics covered are: data collection, reliability risk assessment, reliability prediction, reliability prediction interpretation, testing, reliability decisions, and lessons learned from the NASA Space Shuttle software reliability engineering program.

Keywords: software reliability program, IEEE/AIAA Recommended Practice for Software Reliability, NASA Space Shuttle application
1. Introduction
The IEEE/AIAA recommended practice provides a foundation on which practitioners and researchers can build consistent methods [1]. This case study will describe the SRE process and show that it is important for an organization to have a disciplined process if it is to produce high-reliability software. To accomplish this purpose, an overview is presented of existing practice in software reliability, as represented by the recommended practice [1]. This will provide the reader with the foundation to understand the basic process of software reliability engineering (SRE). The Space Shuttle Primary Avionics Software Subsystem will be used to illustrate the SRE process.
The reliability prediction models that will be used are based on some key definitions and assumptions, as follows:

Definitions
Interval: an integer time unit t of constant or variable length defined by t-1 < t < t+1, where t > 0; failures are counted in intervals.
Number of intervals: the number of contiguous integer time units t of constant or variable length, represented by a positive real number.
Operational Increment (OI): a software system comprised of modules and configured from a series of builds to meet Shuttle mission functional requirements.
Time: continuous CPU execution time over an interval range.

Assumptions
1. Faults that cause failures are removed.
2. As more failures occur and more faults are corrected, remaining failures will be reduced.
3. The remaining failures are "zero" for those OIs that were executed for extremely long times (years) with no additional failure reports; correspondingly, for these OIs, maximum failures equals total observed failures.
1.1 Space Shuttle Flight Software Application
The Shuttle software represents a successful integration of
many of the computer
industry's most advanced software engineering practices and
approaches. Beginning in the
late 1970's, this software development and maintenance project
has evolved one of the
world's most mature software processes applying the principles
of the highest levels of the
Software Engineering Institute's (SEI) Capability Maturity
Model (the software is rated
Level 5 on the SEI scale) and ISO 9001 Standards [2]. This
software process includes
state-of-the-practice software reliability engineering (SRE)
methodologies.
The goals of the recommended practice are to: interpret
software reliability
predictions, support verification and validation of the software,
assess the risk of
deploying the software, predict the reliability of the software,
develop test strategies to
bring the software into conformance with reliability
specifications, and make reliability
decisions regarding deployment of the software.
Reliability predictions are used by the developer to add
confidence to a formal
software certification process comprised of requirements risk
analysis, design and code
inspections, testing, and independent verification and
validation. This case study uses the
experience obtained from the application of SRE on the Shuttle
project, because this
application is judged by NASA and the developer to be a
successful application of SRE
[6]. These SRE techniques and concepts should be of value for other software systems.
1.2 Reliability Measurements and Predictions
There are a number of measurements and predictions that can
be made of reliability
to verify and validate the software. Among these are remaining
failures, maximum
failures, total test time required to attain a given fraction of
remaining failures, and time to
next failure. These have been shown to be useful measurements
and predictions for: 1)
providing confidence that the software has achieved reliability
goals; 2) rationalizing how
long to test a software component (e.g., testing sufficiently long
to verify that the measured
reliability conforms to design specifications); and 3) analyzing
the risk of not achieving
remaining failures and time to next failure goals [6]. Having predictions of the extent to which the software is not fault free (remaining failures) and whether a failure is likely to occur during a mission (time to next failure) provides criteria for assessing the risk of deploying the software. Furthermore, fraction of remaining failures can be used as both an operational quality goal in predicting total test time requirements and, conversely, as an indicator of operational quality as a function of total test time expended [6].
The various software reliability measurements and predictions
can be divided into the
following two categories to use in combination to assist in
assuring the desired level of
reliability of the software in mission critical systems like the
Shuttle. The two categories
are: 1) measurements and predictions that are associated with
residual software faults and
failures, and 2) measurements and predictions that are
associated with the ability of the
software to complete a mission without experiencing a failure of
a specified severity. In
the first category are: remaining failures, maximum failures,
fraction of remaining failures,
and total test time required to attain a given fraction of remaining failures. In
the second category are: time to next failure and total test time
required to attain a given
time to next failure. In addition, there is the risk associated with
not attaining the required
remaining failures and time to next failure goals. Lastly, there
is operational quality that is
derived from fraction of remaining failures. With this type of
information, a software
manager can determine whether more testing is warranted or
whether the software is
sufficiently tested to allow its release or unrestricted use. These
predictions provide a
quantitative basis for achieving reliability goals [2].
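The release decision described above combines one prediction from each category against its goal. A minimal sketch of such a decision rule, with hypothetical parameter names and thresholds (real criteria, per [6], would also weigh the risk of not meeting each goal):

```python
def release_decision(remaining_pred, ttnf_pred, remaining_goal, mission_duration):
    """Combine the two categories of predictions into a release criterion.

    remaining_pred   -- predicted remaining failures (category 1)
    ttnf_pred        -- predicted time to next failure (category 2)
    remaining_goal   -- maximum acceptable remaining failures
    mission_duration -- the mission should complete before the next failure
    """
    meets_fault_goal = remaining_pred <= remaining_goal
    meets_mission_goal = ttnf_pred > mission_duration
    if meets_fault_goal and meets_mission_goal:
        return "sufficiently tested: release"
    return "more testing warranted"
```

Requiring both goals to hold reflects the point that the two categories are used in combination: low residual faults alone does not guarantee the software will survive a mission of a given duration, and vice versa.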
1.3 Interpretations and Credibility
The two most critical factors in establishing credibility in software reliability predictions are the validation method and the way the predictions are interpreted. For example, a "conservative" prediction can be interpreted as providing an "additional margin of confidence" in the software reliability if that predicted reliability already exceeds an established "acceptable level" or requirement. It may not be possible to validate predictions of the reliability of software precisely, but it is possible with "high confidence" to predict a lower bound on the reliability of that software within a specified environment.
If historical failure data were available for a series of previous dates (and actual data exist for the failure history following those dates), it would be possible to compare the predictions with the actual reliability and evaluate the performance of the model. Taking this approach significantly enhances the credibility of predictions among those who must make software deployment decisions based on them [9].
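This retrospective validation can be sketched as follows. At each historical cut-off date, only the data up to that date is assumed known; the model fitted on that data predicts the cumulative failures by a later evaluation date, and the prediction is compared with what actually happened. The model form (exponential NHPP) and all numbers here are hypothetical placeholders:

```python
import math

def predict_cumulative(a, b, t):
    """Expected cumulative failures by time t for mu(t) = a*(1 - exp(-b*t))."""
    return a * (1 - math.exp(-b * t))

# (cut-off time, parameters fitted on data up to that cut-off,
#  actual cumulative failures later observed at the evaluation horizon)
history = [
    (20.0, (18.0, 0.06), 14),
    (40.0, (20.0, 0.05), 16),
]
horizon = 80.0  # evaluation date for which actuals are known

for cutoff, (a, b), actual in history:
    predicted = predict_cumulative(a, b, horizon)
    rel_error = abs(predicted - actual) / actual
    print(f"cut-off {cutoff:5.1f}: predicted {predicted:5.1f}, "
          f"actual {actual:3d}, relative error {rel_error:.0%}")
```

Reporting the relative error of past predictions against subsequent actuals is exactly the kind of evidence that gives deployment decision-makers confidence in future predictions.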
1.4 Verification and Validation
Software reliability measurement and prediction are useful approaches for verifying and validating software. Measurement refers to collecting and analyzing data about the observed reliability of software, for example the occurrence of failures during test. Prediction refers to using a model to forecast future software reliability, for example the failure rate during operation. Measurement also provides the failure data used to estimate the parameters of reliability models (i.e., to make the best fit of the model to the observed failure data). Once the parameters have been estimated, the model is used to predict the future reliability of the software. Verification ensures that the software product, as it exists in a given project phase, satisfies the conditions imposed in the
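The measure-fit-predict loop described above can be sketched end to end. An exponential NHPP model mu(t) = a*(1 - exp(-b*t)) again stands in for the model actually used, the observed failure counts are invented for illustration, and the parameters are estimated by a crude grid search minimizing squared error (a real analysis would use maximum likelihood or a proper least-squares fit):

```python
import math

# Step 1 (measurement): observed (test time, cumulative failures) pairs.
observed = [(10, 5), (20, 9), (30, 12), (40, 14)]

# Step 2 (estimation): fit parameters a, b by minimizing squared error.
def sse(a, b):
    """Sum of squared errors of the model against the observed data."""
    return sum((n - a * (1 - math.exp(-b * t))) ** 2 for t, n in observed)

candidates = ((a / 2, b / 100) for a in range(10, 61) for b in range(1, 31))
a_hat, b_hat = min(candidates, key=lambda p: sse(*p))

# Step 3 (prediction): with parameters estimated, forecast the failure
# rate during operation (the model's intensity function) at a future time.
t_future = 60.0
failure_rate = a_hat * b_hat * math.exp(-b_hat * t_future)
```

The point of the sketch is the division of labor the text describes: measurement supplies the data, estimation fits the model to it, and only then does the model yield forward-looking predictions.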

Earth Day Presentation wow hello nice greatYousafMalik24
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxabhijeetpadhi001
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 

Recently uploaded (20)

Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 

54 c o m m u n i c at i o n s o f t h e a c m n o.docx

covering up hardware errors, creating user-interface access methods, and other cosmetic changes.
In software, adding a six-lane automobile expressway to a railroad bridge is considered maintenance—and it would be particularly valuable if you could do it without stopping the train traffic. Is it possible to design software so it can be maintained in this way? Yes, it is. So, why don't we?

You Don't Know Jack About Software Maintenance
DOI: 10.1145/1592761.1592777
Article development led by queue.acm.org
Long considered an afterthought, software maintenance is easiest and most effective when built into a system from the ground up.
By Paul Stachour and David Collier-Brown

The Four Horsemen of the Apocalypse
There are four approaches to software maintenance: traditional, never, discrete, and continuous—or, perhaps, war, famine, plague, and death. In any case, 3.5 of them are terrible ideas.

Traditional (or "everyone's first project"). This one is easy: don't even think about the possibility of maintenance. Hard-code constants, avoid subroutines, use all global variables, use short and non-meaningful variable names. In other words, make it difficult to change any one thing without changing everything. Everyone knows examples of this approach—and the PHBs who thoughtlessly push you into it, usually because of schedule pressures.

Trying to maintain this kind of software is like fighting a war. The enemy fights back! It particularly fights back when you have to change interfaces, and you find you've only changed some of the copies.

Never. The second approach is to decide upfront that maintenance will never occur. You simply write wonderful programs right from the start. This is actually credible in some embedded systems, which will be burned to ROM and never changed. Toasters, video games, and cruise missiles come to mind. All you have to do is design perfect specifications and interfaces, and never change them. Change only the implementation, and then only for bug fixes before the product is released. The code quality is wildly better than it is for the traditional approach, but never quite good enough to avoid change completely.

Even for very simple embedded systems, the specification and designs aren't quite good enough, so in practice the specification is frozen while it's still faulty. This is often because it cannot be validated, so you can't tell if it's faulty until too late. Then the specification is not adhered to when code is written, so you can't prove the program follows the specification, much less prove it's correct. So, you test until the program is late, and then ship. Some months later you replace it as a complete entity, by sending out new ROMs. This is the typical history of video games, washing machines, and embedded systems from the U.S. Department of Defense.

Discrete. The discrete change approach is the current state of practice: define hard-and-fast, highly configuration-controlled interfaces to elements of software, and regularly carry out massive all-at-once changes. Next, ship an entire new copy of the program, or a "patch" that silently replaces entire executables and libraries. (As we write this, a new copy of Open Office is asking us please to download it.)

In theory, the process accepts (reluctantly) the fact of change, keeps a parts list and tools list on every item, allows only preauthorized changes under strict configuration control, and forces all servers'/users' changes to take place in one discrete step. In practice, the program is running multiple places, and each must kick off its users, do the upgrade, and then let them back on again. Change happens more often and in more places than predicted, all the components of an item are not recorded, and patching is alive (and, unfortunately, thriving) because of the time lag for authorization and the rebuild time for the system.
Furthermore, while official interfaces are controlled, unofficial interfaces proliferate; and with C and older languages, data structures are so available that even when change is desired, too many functions "know" that the structure has a particular layout. When you change the data structure, some program or library that you didn't even know existed starts to crash or return ENOTSUP. A mismatch between an older Linux kernel and newer glibc once had getuid returning "Operation not supported," much to the surprise of the recipients.

Experience shows that it is completely unrealistic to expect that all users to whom an interface is visible will be able to change at the same time. The result is that single-step changes cannot happen: multiple change interrelationships conflict, networks mean multiple versions are simultaneously current, and owners/users want to control change dates.

Vendors try to force discrete changes, but the changes actually spread through a population of computers in a wave over time. This is often likened to a plague, and is every bit as popular. Customers use a variant of the
"never" approach to software maintenance against the vendors of these plagues: they build a known working configuration, then "freeze and forget." When an update is required, they build a completely new system from the ground up and freeze it. This works unless you get an urgent security patch, at which time you either ignore it or start a large unscheduled rebuild project.

Continuous change. At first, this approach to maintenance sounds like just running new code willy-nilly and watching what happens. We know at least one company that does just that: a newly logged-on user will unknowingly be running different code from everyone else. If it doesn't work, the user's system will either crash or be kicked off by the sysadmin, then will have to log back on and repeat the work using the previous version.

Real-world structure for managing interface changes:

    struct item_loc_t {
        struct {
            unsigned short major;  /* = 1 */
            unsigned short minor;  /* = 0 */
        } version;
        unsigned part_no;
        unsigned quantity;
        struct location_t {
            char state[4];
            char city[8];
            unsigned warehouse;
            short area;
            short pigeonhole;
        } location;
        ...

However, that is not the real meaning of continuous. The real continuous approach comes from Multics, the machine that was never supposed to shut down and that used controlled, transparent change. The developers understood the only constant is change and that migration for hardware, software, and function during system operation is necessary. Therefore, the ability to change was designed from the very beginning. Software in particular must be written to evolve as changes happen, using a weakly typed high-level language and, in older programs, a good macro assembler. No direct references are allowed to anything if they can be avoided. Every data structure is designed for expansion and self-identifying as to version. Every code segment is
made self-identifying by the compiler or other construction procedure. Code and data are changeable on a per-command/process/system basis, and as few as possible copies of anything are kept, so single copies could be dynamically updated as necessary.

The most important thing is to manage interface changes. Even in the Multics days, it was easy to forget to change every single instance of an interface. Today, with distributed programs, changing all possible copies of an interface at once is going to be insanely difficult, if not flat-out impossible.

Who Does It Right?
BBN Technologies was the first company to perform continuous controlled change when it built the ARPANET backbone in 1969. It placed a 1-bit version number in every packet. If it changed from 0 to 1, it meant that the IMP (router) was to switch to a new version of its software and set the bit to 1 on every outgoing packet. This allowed the entire ARPANET to switch easily to new versions of the software without interrupting its operation. That was very important to the pre-TCP Internet, as it was quite experimental and suffered a considerable amount of change.
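The IMP cutover described above can be sketched roughly as follows. This is our illustration, not BBN's actual code: the packet layout and the names imp_receive and imp_send are invented.

```c
/* Hypothetical sketch of BBN's 1-bit version switch. */
struct packet {
    unsigned version;   /* the 1-bit version number: 0 = old, 1 = new */
    /* ... routing and payload fields ... */
};

static int running_new = 0;   /* does this IMP run the new software yet? */

/* Receiving a packet with the bit set tells the IMP to switch to the
 * new version of its software before handling the packet. */
void imp_receive(const struct packet *in)
{
    if (in->version == 1)
        running_new = 1;
    /* ... process the packet with whichever code path is current ... */
}

/* Every outgoing packet carries the bit, so the upgrade propagates to
 * neighboring IMPs hop by hop, without interrupting traffic. */
void imp_send(struct packet *out)
{
    out->version = running_new ? 1 : 0;
}
```

A single bit suffices only because exactly two versions can coexist at any moment; the structure versioning described next generalizes the idea.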
With Multics, the developers did all of these good things, the most important of which was the discipline used with data structures: if an interface took more than one parameter, all the parameters were versioned by placing them in a structure with a version number. The caller set the version, and the recipient checked it. If it was completely obsolete, it was flatly rejected. If it was not quite current, it was processed differently, by being upgraded on input and probably downgraded on return.

This meant that many different versions of a program or kernel module could exist simultaneously, while upgrades took place at the user's convenience. It also meant that upgrades could happen automatically and that multiple sites, multiple suppliers, and networks didn't cause problems.

An example of a structure used by a U.S.-based warehousing company (translated to C from Multics PL/1) is illustrated in the accompanying box. The company bought a Canadian competitor and needed to add inter-country transfers, initially from three of its warehouses in border cities. This, in turn, required the state field to split into two parts:
    char country_code[4];
    char state_province[4];

To identify this, the company incremented the version number from 1.0 to 2.0 and arranged for the server to support both types. New clients used version 2.0 structures and were able to ship to Canada. Old ones continued to use version 1.0 structures. When the server received a type 1 structure, it used an "updater" subroutine that copied the data into a type 2 structure and set the country code to U.S.

In a more modern language, you would add a new subclass with a constructor that supports a country code, and update your new clients to use it.

The process is this:
1. Update the server.
2. Change the clients that run in the three border-state warehouses. Now they can move items from U.S. to Canadian warehouses.
3. Deploy updated clients to those Canadian locations needing to move stock.
4. Update all of the U.S.-based clients at their leisure.

Using this approach, there is never
a need to stop the whole system, only the individual copies, and that can be scheduled around a business's convenience. The change can be immediate, or can wait for a suitable time.

Once the client updates have occurred, we simultaneously add a check to produce a server error message for anyone who accidentally uses an outdated U.S.-only version of the client. This check is a bit like the "can't happen" case in an else-if: it's done to identify impossibly out-of-date calls. It fails conspicuously, and the system
administrators can then hunt down and replace the ancient version of the program. This also discourages the unwise from permanently deferring fixes to their programs, much like the coarse version numbers on entire programs in present practice.

Modern Examples
This kind of fine-grain versioning is sometimes seen in more recent programs. Linkers are an example, as they read files containing numbered records, each of which identifies a particular kind of code or data. For example, a record number 7 might contain the information needed to link a subroutine call, containing items such as the name of the function to call and a space for an address. If the linker uses record types 1 through 34, and later needs to extend 7 for a new compiler, then create a type 35, use it for the new compiler, and schedule changes from type 7 to type 35 in all the other compilers, typically by announcing the date on which type 7 records would no longer be accepted.

Another example is in networking protocols such as IBM SMB (Server Message Block), used for Windows networking. It has both protocol versions and packet types that can be used exactly the same way as the record types of a linker.
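The check-and-upgrade discipline behind the warehouse example might look like this in C. The struct layouts follow the article's box (abridged); the function names, return codes, and upgrade logic are our own sketch, not the company's code.

```c
#include <string.h>

/* Version 1.0 of the structure, as in the article's box (abridged). */
struct item_loc_v1 {
    struct { unsigned short major, minor; } version;   /* = 1.0 */
    unsigned part_no;
    unsigned quantity;
    struct { char state[4]; char city[8]; unsigned warehouse; } location;
};

/* Version 2.0: the state field split into country code + state/province. */
struct item_loc_v2 {
    struct { unsigned short major, minor; } version;   /* = 2.0 */
    unsigned part_no;
    unsigned quantity;
    struct {
        char country_code[4];
        char state_province[4];
        char city[8];
        unsigned warehouse;
    } location;
};

/* The "updater" subroutine: copy a type 1 structure into a type 2
 * structure and default the country code to U.S. */
static void upgrade_v1_to_v2(const struct item_loc_v1 *in,
                             struct item_loc_v2 *out)
{
    out->version.major = 2;
    out->version.minor = 0;
    out->part_no  = in->part_no;
    out->quantity = in->quantity;
    strcpy(out->location.country_code, "US");
    strcpy(out->location.state_province, in->location.state);
    strcpy(out->location.city, in->location.city);
    out->location.warehouse = in->location.warehouse;
}

/* Server entry point: the caller set the version, so the server checks
 * it first (the version header leads every variant of the structure).
 * Current requests pass through, near-current ones are upgraded on
 * input, and completely obsolete ones are flatly rejected. */
int server_accept(const void *req, struct item_loc_v2 *out)
{
    const struct item_loc_v1 *hdr = req;
    if (hdr->version.major == 2) {
        memcpy(out, req, sizeof *out);   /* already current */
        return 0;
    }
    if (hdr->version.major == 1) {
        upgrade_v1_to_v2(hdr, out);      /* upgrade on input */
        return 0;
    }
    return -1;   /* impossibly out of date: fail conspicuously */
}
```

Because the caller sets the version and the recipient checks it, version 1.0 and 2.0 clients can coexist throughout the rollout; the reject path plays the role of the conspicuous server error message described above.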
Object languages can also support controlled maintenance by creating new versions as subclasses of the same parent. This is a slightly odd use of a subclass, as the variations you create aren't necessarily meant to persist, but you can go back and clean out unneeded variants later, after they're no longer in use.

With AJAX, a reasonably small client can be downloaded every time the program is run, thus allowing change without versioning. A larger client would need only a simple versioning scheme, enough to allow it to be downloaded whenever it was out of date.

An elegant modern form of continuous maintenance exists in relational databases: one can always add columns to a relation, and there is a well-known value called null that stands for "no data." If the programs that use the database understand that any calculation with a null yields a null, then a new column can be added, programs changed to use it over some period of time, and the old column(s) filled with nulls. Once all the users of the old column are no more, as indicated by the column being null for some time, then the old column can
be dropped.

Another elegant mechanism is a markup language such as SGML or XML, which can add or subtract attributes of a type at will. If you're careful to change the attribute name when the type changes, and if your XML processor understands that adding 3 to a null value is still null, you've an easy way to transfer and store mutating data.

Maintenance Isn't Hard, It's Easy
During the last boom, (author) Collier-Brown's team needed to create a single front end to multiple back ends, under the usual insane time pressures. The front end passed a few parameters and a C structure to the back ends, and the structure repeatedly needed to be changed for one or another of the back ends as they were developed.

Even when all the programs were on the same machine, the team couldn't change them simultaneously because they would have been forced to stop everything they were doing and apply a structure change. Therefore, the team started using version numbers. If a back end needed version 2.6 of the structure, it told the front end, which handed it the new one. If it could use only version 2.5, that's what it asked
for. The team never had a "flag day" when all work stopped to apply an interface change. They could make those changes when they could schedule them. Of course, the team did have to make the changes eventually, and their management had to manage that, but they were able to make the changes when it wouldn't destroy their schedule. In an early precursor to test-directed design, they had a regression test that checked whether all the version numbers were up to date and warned them if updates were needed.

The first time the team avoided a flag day, they gained the few hours expended preparing for change. By the 12th time, they were winning big. Maintenance really is easy. More importantly, investing time to prepare for it can save you and your management time in the most frantic of projects.

Related articles on queue.acm.org

The Meaning of Maintenance
Kode Vicious
http://queue.acm.org/detail.cfm?id=1594861
The Long Road to 64 Bits
John Mashey
http://queue.acm.org/detail.cfm?id=1165766

A Conversation with David Brown
http://queue.acm.org/detail.cfm?id=1165764

Paul Stachour is a software engineer equally at home in development, quality assurance, and process. One of his focal areas is how to create correct, reliable, functional software in effective and efficient ways in many programming languages. Most of his work has been with life-, safety-, and security-critical applications from his home base in the Twin Cities of Minnesota.

David Collier-Brown is an author and systems programmer, formerly with Sun Microsystems, who mostly does performance and capacity work from his home in Toronto.

© 2009 ACM 0001-0782/09/1100 $10.00

Contributed Articles
Communications of the ACM, January 2010, Vol. 53, No. 1
DOI: 10.1145/1629175.1629209
Think Big for Reuse
By Paul D. Witman and Terry Ryan
Many organizations are successful with software reuse at fine to medium granularities – ranging from objects, subroutines, and components through software product lines. However, relatively little has been published on very large-grained reuse. One example of this type of large-grained reuse might be that of an entire Internet banking system (applications and infrastructure) reused in business units all over the world. In contrast, "large scale" software reuse in current research generally refers to systems that reuse a large number of smaller components, or that perhaps reuse subsystems.9

In this article, we explore a case of an organization with an internal development group that has been very successful with large-grained software reuse. BigFinancial, and the BigFinancial Technology Center (BTC) in particular, have created a number of software systems that have been reused in multiple businesses and in multiple countries. BigFinancial and BTC thus provided a rich source of data for case studies to look at the characteristics of those projects and why they have been successful, as well as to look at projects that have been less successful and to understand what has caused those results and what might be done differently to prevent issues in the future. The research is focused on technology, process, and organizational elements of the development process, rather than on specific product features and functions.

Supporting reuse at a large-grained level may help to alleviate some of the
issues that occur in more traditional reuse programs, which tend to be finer-grained. In particular, because BigFinancial was trying to gain commonality in business processes and operating models, reuse of large-grained components was more closely aligned with its business goals. This same effect may well not have happened with finer-grained reuse, due to the continued ability of business units to more readily pick and choose components for reuse.

BTC is a technology development unit of BigFinancial, with operations in both the eastern and western U.S. Approximately 500 people are employed by BTC, reporting ultimately through a single line manager responsible to the Global Retail Business unit head of BigFinancial. BTC is organized to deliver both products and infrastructure components to BigFinancial, and its product line has through the years included consumer Internet banking services, teller systems, ATM software, and network management tools. BigFinancial has its U.S. operations headquartered in the eastern U.S., and employs more than 8,000 technologists worldwide.

In cooperation with BTC, we selected three cases for further study from a pool of about 25. These cases were the Java Banking Toolkit (JBT) and its related application systems, the Worldwide Single
  • 22. Signon (WSSO) subsystem, and the Big- Financial Message Switch (BMS). background – software reuse and bigfinancial Various definitions appear in the lit- erature for software reuse. Karlsson de- fines software reuse as “the process of creating software systems from existing software assets, rather than building software systems from scratch.” One taxonomy of the approaches to software reuse includes notions of the scope of reuse, the target of the reuse, and the granularity of the reuse.5 The notion of granularity is a key differentiator of the type of software reuse practiced at Big- Financial, as BigFinancial has demon- think big for reuse J a n u a r y 2 0 1 0 | v o l . 5 3 | n o . 1 | c o m m u n i c at i o n s o f t h e a c m 143 contributed articles portal services, and alerts capabilities, and thus the JBT infrastructure is al- ready reused for multiple applications. To some extent, these multiple appli- cations could be studied as subcases, though they have thus far tended to be deployed as a group. In addition, the
  • 23. online banking, portal services, and alerts functions are themselves reused at the application level across multiple business units globally. Initial findings indicated that sever- al current and recent projects showed significant reuse across independent business units that could have made alternative technology development decisions. The results are summarized in Table 1. While significant effort is required to support multiple languages and business-specific functional variabili- ty, BTC found that it was able to accom- modate these requirements by design- ing its products to be rule-based, and by designing its user interface to separate content from language. In this manner, business rules drove the behavior of the Internet banking applications, and language- and format-definition tools drove the details of application behav- ior, while maintaining a consistent set of underlying application code. In the late 1990s, BTC was respon- sible for creation of system infrastruc- ture components, built on top of in- dustry-standard commercial operating systems and components, to support the banking functionality required by its customers within BigFinancial. The functions of these infrastructure
components included systems management, high-reliability logging processes, high-availability mechanisms, and other features not readily available in commercial products at the time that the components were created. The same infrastructure was used to support consumer Internet banking as well as automated teller machines. The Internet banking services will be identified here as the Legacy Internet Banking product (LIB).

Organizations have demonstrated success in large-grained reuse programs: building a system once and reusing it in multiple businesses. Product Line Technology models, such as that proposed by Griss [4] and further expanded upon by Clements and Northrop [2] and by Krueger [6], suggest that software components can be treated similarly to the notions used in manufacturing: reusable parts that contribute to consistency across a product line as well as to improved efficiencies in manufacturing. Benefits of such reuse include the high levels of commonality of such features as user interfaces [7], which increases switching costs and customer loyalty in some domains. This could logically extend to banking systems in the form of common functionality and user interfaces across systems within a business, and across business units.

BigFinancial has had several instances of successful, large-grained reuse projects. We identified projects that have been successfully reused across a wide range of business environments or business domains, resulting in significant benefit to BigFinancial. These included the JBT platform and its related application packages, as well as the Worldwide SSO product. These projects demonstrated broad success, and the authors evaluated these for evidence to identify what contributed to, and what may have worked against, the success of each project.

The authors also identified another project that has been successfully reused across a relatively narrow range of business environments. This project, the BigFinancial Message Switch (BMS), was designed for a region-wide level of reuse, and had succeeded at that level. As such, it appears to have invested appropriately in features and capabilities needed for its client base, and did not appear to have over-invested.

Online banking and related services
We focused on BTC's multi-use Java Banking Toolkit (JBT) as a model of a successful project. The Toolkit is in wide use across multiple business units, and represents reuse both at the largest-grained levels as well as reuse of large-scale infrastructure components. JBT supports three application sets today, including online banking, alerts, and portal services.

BigFinancial's initial forays into Internet transaction services were accomplished via another instance of reuse. Taking its pre-Internet banking components, BTC was able to "scrape" the content from the pages displayed in that product, and wrap HTML code around them for display on a Web browser. Other components were responsible for modifying the input and menuing functions for the Internet. The purpose for this approach to Internet delivery was to more rapidly deliver a product to the Internet, without modification of the legacy business logic, thereby reducing risk as well. In what amounted to an early separation of business and presentation logic, the pre-Internet business logic remained in place, and the presentation layer re-mapped its content for the browser environment.

In 2002, BigFinancial and BTC recognized two key issues that needed to be addressed. The platform for their legacy Internet Banking application was nearing end of life (having been first deployed in 1996), and there were
too many disparate platforms for its consumer Internet offerings. BTC's Internet banking, alerts, and portal functions each required separate hardware and operating environments. BTC planned its activities such that the costs of the new development could fit within the existing annual maintenance and new development costs already being paid by its clients.

BTC and business executives cited trust in BTC's organization as a key to allowing BTC the opportunity to develop the JBT product. In addition, BTC's prior success with reusing software components at fine and medium granularities led to a culture that promoted reuse as a best practice.

Table 1. Selected reuse results

    Project                Reused in                          Business units
    System Infrastructure  Consumer Internet banking;         All users of BTC's legacy Internet banking
                           Automated Teller Machines          components: >35 businesses worldwide
    System Infrastructure  Internet banking (Small Business)  Approximately 4 business units worldwide
    Internet banking       Europe                             >15 business units
    Internet banking       Asia                               >10 business units
    Internet banking       Latin America                      >6 business units
    Internet banking       North America                      >4 business units

contributed articles | 144 Communications of the ACM | January 2010 | Vol. 53 | No. 1

Starting in late 2002, BTC developed an integrated platform and application set for a range of consumer Internet functions. The infrastructure package, named the Java Banking Toolkit (JBT), was based on Java 2 Enterprise Edition (J2EE) standards and was intended to allow BigFinancial to centralize its server infrastructure for consumer Internet functions. The authors conducted detailed interviews with several BTC managers and architects, and reviewed several hundred documents. Current deployment statistics for JBT are shown in Table 2.

The JBT infrastructure and applications were designed and built by BTC and its regional partners, with input from its clients around the world. BTC's experience had shown that consumer banking applications were not fundamentally different from one another across the business units, and BTC proposed and received funding for creation of a consolidated application set for Internet banking. A market evaluation determined that there were no suitable, globally reusable, complete applications on the market, nor any other organization with the track record of success required for confidence in the delivery. Final funding approval came from BigFinancial technology and business executives.

The requirements for JBT called for several major functional elements. The requirements were broken out among the infrastructural elements supporting the various planned application packages, and the applications themselves. The applications delivered with the initial release of JBT included a consumer Internet banking application set, an account activity and balance alerting function, and a portal content toolset.

Each of these components was designed to be reused intact in each business unit around the world, requiring only changes to business rules and language phrases that may be unique to a business. One of the fundamental requirements for each of the JBT applications was to include capabilities that were designed to be common to and
shared by as many business units as possible, while allowing for all necessary business-specific variability.

Such variability was planned for in the requirements process, building on the LIB infrastructure and applications, as well as the legacy portal and alerts services that were already in production. Examples of the region- and business-specific variability include language variations, compliance with local regulatory requirements, and functionality based on local and regional competitive requirements.

JBT's initial high-level requirements documents included requirements across a range of categories. These categories included technology, operations, deployment, development, and tools. These requirements were intended to form the foundation for initial discussion and agreement with the stakeholders, and to support division of the upcoming tasks to define the architecture. Nine additional, more detailed, requirements documents were created to flesh out the details referenced in the top-level requirements. Additional topics addressed by the detailed documents included language, business rules, host messaging, logging, portal services, and system management.
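The rule-and-phrase separation described above can be sketched roughly as follows. This is an illustrative toy, not BTC's actual design: every class name, rule key, and default value here is invented. It shows how one shared code path can serve business units that differ only in configuration:

```java
import java.util.Locale;
import java.util.Map;

// Toy sketch of rule-driven variability: the core logic below is shared by
// every business unit; only the rule and phrase tables differ per unit.
public class RuleDrivenBanking {

    // A business rule that varies by unit (for example, a local regulatory limit).
    static double dailyTransferLimit(Map<String, String> rules) {
        return Double.parseDouble(rules.getOrDefault("transfer.dailyLimit", "5000"));
    }

    // Language phrases live outside the application code, so the same
    // code base can present any language a unit requires.
    static String confirmation(Map<String, String> phrases, double amount) {
        String template = phrases.getOrDefault("transfer.confirm", "Transfer of %.2f accepted");
        return String.format(Locale.ROOT, template, amount);
    }

    // Shared core logic: identical code deployed to every unit.
    static boolean approveTransfer(Map<String, String> rules, double amount) {
        return amount > 0 && amount <= dailyTransferLimit(rules);
    }
}
```

A unit in a stricter regulatory regime would ship only a different rules table (say, a lower `transfer.dailyLimit`) and its own phrase file, while the application code itself, as with JBT, remains a single consistent code base.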
One of BigFinancial's regional technology leaders reported that JBT has been much easier to integrate than the legacy product, given its larger application base and ability to readily add applications to it. Notably, he indicated that JBT's design had taken into account the lessons learned from prior products, including improvements in performance, stability, and total cost of ownership. This resulted in a "win/win/win for businesses, technology groups, and customers."

From an economic viewpoint, BigFinancial indicates that the cost savings for first-time business unit implementations of products already deployed to other business units averaged between 20% and 40%, relative to the cost of new development. Further, the cost savings for subsequent deployments of updated releases to a group of business units resulted in cost savings of 50% to 75% relative to the cost of maintaining the software for each business unit independently.

All core banking functionality is supported by a single global application set. There remain, in some cases, functions required only by a specific business or region. The JBT architecture allows for those region-specific applications to be developed by the regional technology unit as required. An overview of the JBT architecture is
shown in Figure 1.

Figure 1. Java Banking Toolkit architecture overview.

BTC implemented JBT on principles of a layered architecture [12], focusing on interoperability and modularity. For example, the application components interact only with the application body section of the page; all other elements of navigation and branding are handled by the common and portal services elements. In addition, transactional messaging is isolated from the application via a message abstraction layer, so that unique messaging models can be used in each region, if necessary.

JBT includes both the infrastructure and applications components for a range of banking functionality. The infrastructure and applications components are defined as independently changeable releases, but are currently packaged as a group to simplify the deployment process.

Funding and governance of the projects are coordinated through BTC, with significant participation from the business units. Business units have the opportunity to choose other vendors for their technology needs, though the corporate technology strategy limited that option as the JBT project gained wider rollout status. Business units participate in a semi-annual in-person planning exercise to evaluate enhancement requests and prioritize new business deployments.

Table 2. JBT reuse results

    Region         Business units
    Europe         >18 business units
    Asia           >14 business units
    Latin America  >9 business units
    North America  >5 business units

Results
The authors examined a total of six different cases of software reuse. Three of these were subcases of the Java Banking Toolkit (JBT): Internet banking, portal services, and alerts, along with the reuse of the JBT platform itself. The others were the Worldwide SSO product, and the BigFinancial Message Switch. There were a variety of reuse success levels, and a variety of levels of evidence of anticipated supports and barriers to reuse. The range of outcomes is represented as a two-dimensional graph, as shown in Figure 2.

Figure 2. Reuse expectations and outcomes.

BigFinancial measures its reuse success in a very pragmatic, straightforward fashion. Rather than measuring reused modules, lines of code, or function points, BigFinancial instead simply measures total deployments of compatible code sets. Due to ongoing enhancements, the code base continues to evolve over time, but in a backwards-compatible fashion, so that older versions can be and are readily upgraded to the latest version as business needs dictate.

BTC did not explicitly capture hard economic measures of cost savings. However, their estimates of the range of cost savings are shown in Figure 3. Cost savings are smaller for new deployments due to the significant effort required to map business unit requirements to global product capabilities, along with the cost of training, development and testing of business rules, and ramp-up of operational processes. In contrast, ongoing maintenance savings are generally larger, due to the commonality across the code base for numerous business units. This commonality enables bug fixes, security patches, and other maintenance activities to be performed on one code base, rather than one for each business unit.

BigFinancial has demonstrated that it is possible for a large organization, building software for its own internal use, to move beyond the more common models of software reuse. In so doing, BigFinancial has achieved significant economies of scale across its many business units, and has shortened the time to market for new deployments of its products.

Numerous factors were critical to the success of the reuse projects. These included elements expected from the more traditional reuse literature, including organizational structure, technological foundations, and economic factors. In addition, several new elements have been identified. These include the notions of trust and culture, the concepts of a track record of large- and fine-grained reuse success, and the virtuous (and potentially vicious) cycle of corporate mandates. Conversely, organizational barriers prove to be the greatest inhibitor to successful reuse [13].

BTC took specific steps, over a period of many years, to create and strengthen its culture of reuse. Across numerous product lines, reuse of components and infrastructure packages was strongly encouraged. Reuse of large-grained elements was the next logical step, working with a group of business units within a single regional organization. This supported the necessary business alignment to enable large-grained reuse. In addition, due to its position as a global technology provider to BigFinancial, BTC was able to leverage its knowledge of requirements across business units, and explicitly design products to be readily reusable, as well as to drive commonality of requirements to support that reuse as well.

On the technical factors related to reuse, BTC's results have provided empirical evidence regarding the use of various technologies and patterns in actual reuse environments. Some of these technologies and patterns are platform-independent interfaces, business rule structures, rigorous isolation of concerns across software layers, and versioning of interfaces to allow phased
migration of components to updated interfaces. These techniques, among others, are commonly recognized as good architectural approaches for designing systems, and have been examined more closely for their contribution to the success of the reuse activities. In this examination, they have been found to contribute highly to the technological elements required for success of large-grained reuse projects.

Product vendors, and particularly application service providers, routinely conduct this type of development and reuse, though with different motivations. (Application service providers are now often referred to as providers of Software as a Service.) As commercial providers, they are more likely to be market-driven, often with sales of Professional Services for customization. In contrast, the motivations in evidence at BigFinancial seemed more aimed at achieving the best combinations of functionality, time to market, and cost.

The research provided an opportunity to examine, in-depth, the various forms of reuse practiced on three projects, and three subprojects, inside BigFinancial. Some of those forms include design reuse, code reuse, pattern reuse, and test case reuse. The authors have found, based on documents and reports from participants, that the active
practice of systematic, finer-grained reuse contributed to successful reuse of systems at larger levels of granularity.

This study has provided a view of management structures and leadership styles, and an opportunity to examine how those contribute to, or work against, successful reuse. Much has been captured about IT governance in general, and about organizational constructs to support reuse in various situations at BigFinancial/BTC. Leadership of both BTC and BigFinancial was cited as contributing to the success of the reuse efforts, and indeed also was cited as a prerequisite for even launching a project that intends to accomplish such large-grained reuse.

Sabherwal [11] notes the criticality of trust in outsourced IS relationships, where the participants in projects may not know one another before a project, and may only work together on the one project. As such, the establishment and maintenance of trust is critical in that environment. This is not entirely applicable to BTC, as it is a peer organization to its client's technology groups, and its members often have long-standing relationships with their peers. Ring and Van de Ven examine the broader notions of cooperative inter-organizational relationships (IORs), and note
that trust is a fundamental part of an IOR. Trust serves to mitigate the risks inherent in a relationship, and at both a personal and organizational level is itself mitigated by the potential overriding forces of the legal or organizational systems [10]. This element does seem to be applicable to BTC's environment, in that trust is reported to have been foundational to the assignment of the creation of JBT to BTC.

Griss notes that culture is one element of the organizational structure that can impede reuse. A culture that fears loss of creativity, lacks trust, or doesn't know how to effectively reuse software will not be as successful as an organization that doesn't have these impediments [4]. The converse is likely then also reasonable: a culture that focuses on and implicitly welcomes reuse will likely be more successful. BTC's long history of reuse, its lack of explicit incentives and metrics around more traditional reuse, and its position as a global provider of technology to its business partners make it likely that its culture is, indeed, a strong supporter of its reuse success.

Several other researchers have commented on the impact of organizational culture on reuse. Morisio et al. [8] refer in passing to cultural factors, primarily as potential inhibitors to reuse. Card and
Comer [1] examine four cultural aspects that can contribute to reuse adoption: training, incentives, measurement, and management. In addition, Card and Comer's work focuses generally on cultural barriers, and how to overcome them. In BTC's case, however, there is a solid cultural bias for reuse, and one that, for example, no longer requires incentives to promote reuse.

Figure 3. Reuse cost savings ranges.

Paul D. Witman ([email protected]) is an Assistant Professor of Information Technology at California Lutheran University. Terry Ryan ([email protected]) is an Associate Professor and Dean of the School of Information Systems at Claremont Graduate University. © 2010 ACM 0001-0782/10/0100 $10.00

One key participant in the study had a strong opinion to offer in relation to fine- vs. coarse-grained reuse. The lead architect for JBT was explicitly and vigorously opposed to a definition of reuse
that slanted toward fine-grained reuse, of objects and components at a fine-grained level. This person's opinion was that while reuse at this granularity was possible (indeed, BTC demonstrated success at this level), fine-grained reuse was very difficult to achieve in a distributed development project. The lead architect further believed that the leverage it provides was not nearly as great as the leverage from a large-grained reuse program. The integrators of such larger-grained components can then have more confidence that the component has been used in a similar environment, tested under appropriate loads, and so on, relieving the risk that a fine-grained component built for one domain may get misused in a new domain or at a new scale, and be unsuccessful in that environment.

While BTC's JBT product does, to some extent, work as part of a software product line (supporting its three major applications), JBT's real reuse does not come in the form of developing more instances from a common set of core assets. Rather, it appears that JBT is itself reused, intact, to support the needs of each of the various businesses in a highly configurable fashion.

Organizational barriers appeared, at least in part, to contribute to the lack of broad deployment of the BigFinancial Message Switch. Gallivan [3] defined a model for technology innovation assimilation and adoption, which included the notion that even in the face of management directive, some employees and organizations might not adopt and assimilate a particular technology or innovation. This concept might partly explain the results with BMS: it was possible for some business units and technology groups to resist its introduction on a variety of grounds, including business case, even with a decision by a global steering committee to proceed with deployment.

We noted previously the negative impact of inter-organizational barriers on reuse adoption, particularly in the BMS case. This was particularly evident in that the organization that created BMS, and was in large part responsible for "selling" it to other business units, was positioned at a regional rather than global technology level. This organizational location, along with the organization's more limited experience with globally reusable products, may have contributed to the difficulty in accomplishing broader reuse of that product.

Conclusion
While BTC's results and BigFinancial's specific business needs may be somewhat unusual, it is likely that the business and technology practices supporting reuse may be generalizable to other banks and other technology users. Good system architecture, supporting reuse, and an established business case that identifies the business value of the reuse were fundamental to establishing the global reuse accomplished by BTC, and should be readily scalable to smaller and less global environments.

Key factors contributing to a successful project will be a solid technology foundation, experience building and maintaining reusable software, and a financial and organizational structure that supports and promotes reuse. In addition, the organization will need to actively build a culture of large-grained reuse, and establish trust with its business partners. Establishing that trust will be vital to even having the opportunity to propose a large-grained reusable project.

References
1. Card, D. and Comer, E. Why do so many reuse programs fail? IEEE Software 11, 5, 114-115.
2. Clements, P. and Northrop, L.M. Software Product Lines: Practices and Patterns. Addison-Wesley Professional, 2002.
3. Gallivan, M.J. Organizational adoption and assimilation of complex technological innovations: Development
and application of a new framework. The DATA BASE for Advances in Information Systems 32, 3, 51-85.
4. Griss, M.L. Software reuse: From library to factory. IBM Systems Journal 32, 4, 548-566.
5. Karlsson, E.-A. Software Reuse: A Holistic Approach. John Wiley & Sons, West Sussex, England, 1995.
6. Krueger, C.W. New methods in software product line practice. Comm. ACM 49, 12 (Dec. 2006), 37-40.
7. Malan, R. and Wentzel, K. Economics of Software Reuse Revisited. Hewlett-Packard Software Technology Laboratory, Irvine, CA, 1993, 19.
8. Morisio, M., Ezran, M. and Tully, C. Success and failure factors in software reuse. IEEE Transactions on Software Engineering 28, 4, 340-357.
9. Ramachandran, M. and Fleischer, W. Design for large scale software reuse: An industrial case study. In Proceedings of the International Conference on Software Reuse (Orlando, FL, 1996), 104-111.
10. Ring, P.S. and Van de Ven, A.H. Developmental processes of cooperative interorganizational relationships. Academy of Management Review 19, 1, 90-118.
11. Sabherwal, R. The role of trust in outsourced IS development projects. Comm. ACM 42, 2 (Feb. 1999), 80-86.
12. Szyperski, C., Gruntz, D. and Murer, S. Component Software: Beyond Object-Oriented Programming. ACM
Press, New York, 2002.
13. Witman, P. and Ryan, T. Innovation in large-grained software reuse: A case from banking. In Proceedings of the Hawaii International Conference on System Sciences (Waikoloa, HI, 2007), IEEE Computer Society.

practice | 54 Communications of the ACM | January 2018 | Vol. 61 | No. 1

The heterogeneity, complexity, and scale of cloud applications make verification of their fault tolerance properties challenging. Companies are moving away from formal methods and toward large-scale testing in which components are deliberately compromised to identify weaknesses in the software. For example, techniques such as Jepsen apply fault-injection testing to distributed data stores, and Chaos Engineering performs fault injection experiments on production systems, often on live traffic. Both approaches have captured the attention of industry and academia alike.
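To make "deliberately compromising components" concrete, here is a minimal sketch of the idea; the names are invented, and this is not Jepsen's or Netflix's actual tooling. A fault injector wraps a dependency so an experiment can force it to fail, and the test then checks whether the caller's redundancy masks the failure:

```java
import java.util.function.Supplier;

// Minimal fault-injection sketch: wrap a dependency so an experiment can
// force it to fail a chosen number of times, then observe whether the
// system under test still produces a correct answer.
public class FaultInjection {

    // Returns a version of `call` that throws for the first `faults`
    // invocations, simulating a crashed or partitioned dependency.
    static <T> Supplier<T> failFirst(Supplier<T> call, int faults) {
        int[] remaining = {faults};  // mutable counter captured by the lambda
        return () -> {
            if (remaining[0] > 0) {
                remaining[0]--;
                throw new RuntimeException("injected fault");
            }
            return call.get();
        };
    }

    // A caller that masks faults with bounded retries; this is the kind of
    // redundancy a fault-injection experiment is meant to exercise.
    static <T> T withRetries(Supplier<T> call, int attempts) {
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }
}
```

In this toy, two injected faults are masked by three retries, while five are not; an experiment is essentially a search over which faults to inject, and choosing those faults well is the hard problem the rest of this article addresses.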
  • 47. Unfortunately, the search space of distinct fault combinations that an infrastructure can test is intractable. Existing failure-testing solutions require skilled and intelligent users who can supply the faults to inject. These superusers, known as Chaos Engineers and Jepsen experts, must study the sys- tems under test, observe system execu- tions, and then formulate hypotheses about which faults are most likely to expose real system-design flaws. This approach is fundamentally unscal- able and unprincipled. It relies on the superuser’s ability to interpret how a distributed system employs redun- dancy to mask or ameliorate faults and, moreover, the ability to recognize the insufficiencies in those redundan- cies—in other words, human genius. This article presents a call to arms for the distributed systems research community to improve the state of the art in fault tolerance testing. Ordinary users need tools that au- tomate the selection of custom-tai- lored faults to inject. We conjecture that the process by which superusers select experiments—observing execu- tions, constructing models of system redundancy, and identifying weak- nesses in the models—can be effec- tively modeled in software. The ar- ticle describes a prototype validating this conjecture, presents early results from the lab and the field, and identi-
  • 48. fies new research directions that can make this vision a reality. The Future Is Disorder Providing an “always-on” experience for users and customers means that distributed software must be fault tol- erant—that is to say, it must be writ- ten to anticipate, detect, and either mask or gracefully handle the effects of fault events such as hardware fail- ures and network partitions. Writing fault-tolerant software—whether for distributed data management systems involving the interaction of a handful of physical machines, or for Web ap- plications involving the cooperation of tens of thousands—remains extremely difficult. While the state of the art in verification and program analysis con- tinues to evolve in the academic world, the industry is moving very much in the opposite direction: away from for- mal methods (however, with some noteworthy exceptions,41) and toward Abstracting the Geniuses Away from Failure Testing D O I : 1 0 . 1 1 4 5 / 3 1 5 2 4 8 3 Article development led by queue.acm.org
  • 49. Ordinary users need tools that automate the selection of custom-tailored faults to inject. BY PETER ALVARO AND SEVERINE TYMON http://dx.doi.org/10.1145/3152483 J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 55 56 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 practice up the stack and frustrate any attempts at abstraction. The Old Guard. The modern myth: Formally verified distributed compo- nents. If we cannot rely on geniuses to hide the specter of partial failure, the next best hope is to face it head on, armed with tools. Until quite recently, many of us (academics in particular) looked to formal methods such as model checking16,20,29,39,40,53,54 to assist “mere mortal” programmers in writ- ing distributed code that upholds its guarantees despite pervasive uncer- tainty in distributed executions. It is not reasonable to exhaustively search the state space of large-scale systems
  • 50. (one cannot, for example, model check Netflix), but the hope is that modularity and composition (the next best tools for conquering complexity) can be brought to bear. If individual distributed components could be formally verified and combined into systems in a way that preserved their guarantees, then global fault toler- ance could be obtained via composi- tion of local fault tolerance. Unfortunately, this, too, is a pipe dream. Most model checkers require a formal specification; most real-world systems have none (or have not had one since the design phase, many versions ago). Software model checkers and oth- er program-analysis tools require the source code of the system under study. The accessibility of source code is also an increasingly tenuous assumption. Many of the data stores targeted by tools such as Jepsen are closed source; large-scale architectures, while typical- ly built from open source components, are increasingly polyglot (written in a wide variety of languages). Finally, even if you assume that spec- ifications or source code are available, techniques such as model checking are not a viable strategy for ensuring that applications are fault tolerant because, as mentioned, in the context of time- outs, fault tolerance itself is an end-to-
  • 51. end property that does not necessarily hold under composition. Even if you are lucky enough to build a system out of individually verified components, it does not follow the system is fault toler- ant—you may have made a critical error in the glue that binds them. The Vanguard. The emerging ethos: YOLO. Modern distributed systems approaches that combine testing with fault injection. Here, we describe the underlying causes of this trend, why it has been successful so far, and why it is doomed to fail in its current practice. The Old Gods. The ancient myth: Leave it to the experts. Once upon a time, distributed systems researchers and practitioners were confident that the responsibility for addressing the problem of fault tolerance could be relegated to a small priesthood of ex- perts. Protocols for failure detection, recovery, reliable communication, consensus, and replication could be implemented once and hidden away in libraries, ready for use by the layfolk. This has been a reasonable dream. After all, abstraction is the best tool for overcoming complexity in com- puter science, and composing reliable
  • 52. systems from unreliable components is fundamental to classical system design.33 Reliability techniques such as process pairs18 and RAID45 dem- onstrate that partial failure can, in certain cases, be handled at the low- est levels of a system and successfully masked from applications. Unfortunately, these approaches rely on failure detection. Perfect failure detectors are impossible to implement in a distributed system,9,15 in which it is impossible to distinguish between delay and failure. Attempts to mask the fundamental uncertainty arising from partial failure in a distributed system—for example, RPC (remote procedure calls8) and NFS (network file system49)—have met (famously) with difficulties. Despite the broad consen- sus that these attempts are failed ab- stractions,28 in the absence of better abstractions, people continue to rely on them to the consternation of devel- opers, operators, and users. In a distributed system—that is, a system of loosely coupled components interacting via messages—the failure of a component is only ever manifested as the absence of a message. The only way to detect the absence of a message is via a timeout, an ambiguous signal that means either the message will nev- er come or that it merely has not come
yet. Timeouts are an end-to-end concern28,48 that must ultimately be managed by the application. Hence, partial failures in distributed systems bubble

While the state of the art in verification and program analysis continues to evolve in the academic world, the industry is moving in the opposite direction: away from formal methods and toward approaches that combine testing with fault injection.

JANUARY 2018 | VOL. 61 | NO. 1 | COMMUNICATIONS OF THE ACM

are simply too large, too heterogeneous, and too dynamic for these classic approaches to software quality to take root. In reaction, practitioners increasingly rely on resiliency techniques based on testing and fault injection.6,14,19,23,27,35 These "black box" approaches (which perturb and observe the complete system, rather than its components) are (arguably) better suited for testing an end-to-end property such as fault tolerance. Instead of deriving guarantees from understanding how a system works on the inside, testers of the system observe its behavior from the outside, building confidence that it functions correctly under stress.

Two giants have recently emerged in this space: Chaos Engineering6 and Jepsen testing.24 Chaos Engineering, the practice of actively perturbing production systems to increase overall site resiliency, was pioneered by Netflix,6 but since then LinkedIn,52 Microsoft,38 Uber,47 and PagerDuty5 have developed Chaos-based infrastructures. Jepsen performs black box testing and fault injection on unmodified distributed data management systems, in search of correctness violations (for example, counterexamples that show an execution was not linearizable).

Both approaches are pragmatic and empirical. Each builds an understanding of how a system operates under faults by running the system and observing its behavior. Both approaches offer a pay-as-you-go method to resiliency: the initial cost of integration is low, and the more experiments that are performed, the higher the confidence
that the system under test is robust. Because these approaches represent a straightforward enrichment of existing best practices in testing with well-understood fault injection techniques, they are easy to adopt. Finally, and perhaps most importantly, both approaches have been shown to be effective at identifying bugs.

Unfortunately, both techniques also have a fatal flaw: they are manual processes that require an extremely sophisticated operator. Chaos Engineers are a highly specialized subclass of site reliability engineers. To devise a custom fault injection strategy, a Chaos Engineer typically meets with different service teams to build an understanding of the idiosyncrasies of various components and their interactions. The Chaos Engineer then targets those services and interactions that seem likely to have latent fault tolerance weaknesses. Not only is this approach difficult to scale, since it must be repeated for every new composition of services, but its critical currency—a mental model of the system under study—is hidden away in a person's brain. These points are reminiscent of a bigger (and more worrying) trend in industry toward reliability priesthoods,7 complete with icons (dashboards) and rituals (playbooks).
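The experiments such engineers run nevertheless follow a small, repeatable pattern. Below is a minimal, self-contained sketch (illustrative only: the `lookup_recommendations` service, the timeout value, and the fallback are invented, and real Chaos tooling perturbs production traffic rather than toy functions). A caller wraps a dependency in timeout-plus-fallback fault-tolerance logic, and the "experiment" checks a steady-state hypothesis—the caller always returns a non-empty response—both with and without an injected delay.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def lookup_recommendations(inject_delay=False):
    """Hypothetical 'callee' service. The only fault we model is delay,
    since from the caller's perspective a crashed or partitioned callee
    is indistinguishable from a very slow one."""
    if inject_delay:
        time.sleep(1.0)  # simulated network delay / partition
    return ["movie-a", "movie-b"]

def recommendations_with_fallback(inject_delay=False, timeout_s=0.2):
    """'Caller' fault-tolerance logic: a timeout plus a degraded fallback,
    so the steady state degrades rather than fails."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(lookup_recommendations, inject_delay)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return ["popular-default"]  # fallback response

def experiment():
    """A chaos-style experiment: observe the caller's behavior with and
    without the injected fault; the hypothesis is a non-empty response
    in both cases."""
    healthy = recommendations_with_fallback(inject_delay=False)
    faulted = recommendations_with_fallback(inject_delay=True)
    return healthy, faulted
```

The point of the sketch is the shape of the loop—hypothesize a steady state, inject a fault, observe—not the specific mechanism; a real deployment would inject the delay at a proxy or network layer rather than inside the callee.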
Jepsen is in principle a framework that anyone can use, but to the best of our knowledge all of the reported bugs discovered by Jepsen to date were discovered by its inventor, Kyle Kingsbury, who currently operates a "distributed systems safety research" consultancy.24 Applying Jepsen to a storage system requires that the superuser carefully read the system documentation, generate workloads, and observe the externally visible behaviors of the system under test. It is then up to the operator to choose—from the massive combinatorial space of "nemeses," including machine crashes and network partitions—those fault schedules that are likely to drive the system into returning incorrect responses.

A human in the loop is the kiss of death for systems that need to keep up with software evolution. Human attention should always be targeted at tasks that computers cannot do! Moreover, the specialists that Chaos and Jepsen testing require are expensive and rare. Here, we show how geniuses can be abstracted away from the process of failure testing.

We Don't Need Another Hero

Rapidly changing assumptions about our visibility into distributed system internals have made obsolete many if not all of the classic approaches to software quality, while emerging "chaos-based" approaches are fragile and unscalable because of their genius-in-the-loop requirement. We present our vision of automated failure testing by looking at how the same changing environments that hastened the demise of time-tested resiliency techniques can enable new ones. We argue the best way to automate the experts out of the failure-testing loop is to imitate their best practices in software, and show how the emergence of sophisticated observability infrastructure makes this possible.

The order is rapidly fadin'. For large-scale distributed systems, the three fundamental assumptions of traditional approaches to software quality are quickly fading in the rearview mirror. The first to go was the belief that you could rely on experts to solve the hardest problems in the domain. Second was the assumption that a formal specification of the system is available. Finally, any program analysis (broadly defined) that requires that source code is available must be taken off the table. The erosion of these assumptions helps explain the move away from classic academic approaches to resiliency in favor of the black box approaches described earlier.

What hope is there of understanding the behavior of complex systems in this new reality? Luckily, the fact that it is more difficult than ever to understand distributed systems from the inside has led to the rapid evolution of tools that allow us to understand them from the outside. Call-graph logging was first described by Google;51 similar systems are in use at Twitter,4 Netflix,1 and Uber,50 and the technique has since been standardized.43 It is reasonable to assume that a modern microservice-based Internet enterprise will already have instrumented its systems to collect call-graph traces. A number of startups that focus on observability have recently emerged.21,34 Meanwhile, provenance collection techniques for data processing systems11,22,42 are becoming mature, as are operating system-level provenance tools.44 Recent work12,55 has attempted to infer causal and communication structure of distributed computations from raw logs, bringing high-level explanations of outcomes within reach even for uninstrumented systems.

"Regarding testing distributed systems: Chaos Monkey, like they mention, is awesome, and I also highly recommend getting Kyle to run Jepsen tests."
—Commentator on HackerRumor

Away from the experts. While this quote is anecdotal, it is difficult to imagine a better example of the fundamental unscalability of the current state of the art. A single person cannot possibly keep pace with the explosion of distributed system implementations. If we can take the human out of this critical loop, we must; if we cannot, we should probably throw in the towel.

The first step to understanding how to automate any process is to comprehend the human component that we would like to abstract away. How do Chaos Engineers and Jepsen superusers apply their unique genius in practice? Here is the three-step recipe common to both approaches.

Step 1: Observe the system in action. The human element of the Chaos and Jepsen processes begins with principled observation, broadly defined. A Chaos Engineer will, after studying the external API of services relevant to a given class of interactions, meet with the engineering teams to better understand the details of the implementations of the individual services.25 To understand the high-level interactions among services, the engineer will then peruse call-graph traces in a trace repository.3 A Jepsen superuser typically begins by reviewing the product documentation, both to determine the guarantees that the system should uphold and to learn something about the mechanisms by which it does so. From there, the superuser builds a model of the behavior of the system based on interaction with the system's external API. Since the systems under study are typically data management and storage, these interactions involve generating histories of reads and writes.31 The first step to understanding what can go wrong in a distributed system is watching things go right: observing the system in the common case.

Step 2: Build a mental model of how the system tolerates faults. The common next step in both approaches is the most subtle and subjective. Once there is a mental model of how a distributed system behaves (at least in the common case), how is it used to help choose the appropriate faults to inject? At this point we are forced to dabble in conjecture: bear with us.

Fault tolerance is redundancy. Given some fixed set of faults, we say that a system is "fault tolerant" exactly if it operates correctly in all executions in which those faults occur. What does it mean to "operate correctly"? Correctness is a system-specific notion but, broadly speaking, is expressed in terms of properties that are either maintained throughout the system's execution (for example, system invariants or safety properties) or established during execution (for example, liveness properties). Most distributed systems with which we interact, though their executions may be unbounded, nevertheless provide finite, bounded interactions that have outcomes. For example, a broadcast protocol may run "forever" in a reactive system, but each broadcast delivered to all group members constitutes a successful execution. By viewing distributed systems in this way, we can revise the definition: A system is fault tolerant if it provides sufficient mechanisms to achieve its successful outcomes despite the given class of faults.

Step 3: Formulate experiments that target weaknesses in the façade. If we could understand all of the ways in which a system can obtain its good outcomes, we could understand which faults it can tolerate (or which faults it could be sensitive to). We assert that (whether they realize it or not!) the process by which Chaos Engineers and Jepsen superusers determine, on a system-by-system basis, which faults to inject uses precisely this kind of reasoning. A target experiment should exercise a combination of faults that knocks out all of the supports for an expected outcome.

Carrying out the experiments turns out to be the easy part. Fault injection infrastructure, much like observability infrastructure, has evolved rapidly in recent years. In contrast to random, coarse-grained approaches to distributed fault injection such as Chaos Monkey,23 approaches such as FIT (failure injection testing)17 and Gremlin32 allow faults to be injected at the granularity of individual requests with high precision.

Step 4: Profit! This process can be effectively automated. The emergence of sophisticated tracing tools described earlier makes it easier than ever to build redundancy models even from the executions of black box systems. The rapid evolution of fault injection infrastructure makes it easier than ever to test fault hypotheses on large-scale systems. Figure 1 illustrates how the automation described here fits neatly between existing observability infrastructure and fault injection infrastructure, consuming the former, maintaining a model of system redundancy, and using it to parameterize the latter. Explanations of system outcomes and fault injection infrastructures are already available. In the current state of the art, the puzzle piece that fits them together (models of redundancy) is a manual process. LDFI (as we will explain) shows that automation of this component is possible.

Figure 1. Our vision of automated failure testing.
Figure 2. Fault injection and fault-tolerant code.

A Blast from the Past

In previous work, we introduced a bug-finding tool called LDFI (lineage-driven fault injection).2 LDFI uses data provenance collected during simulations of distributed executions to build derivation graphs for system outcomes. These graphs function much like the models of system redundancy described earlier. LDFI then converts the derivation graphs into a Boolean formula whose satisfying assignments correspond to combinations of faults that invalidate all derivations of the outcome. An experiment targeting those faults will then either expose a bug (that is, the expected outcome fails to occur) or reveal additional derivations (for example, after a timeout, the system fails over to a backup) that can be used to enrich the model and constrain future solutions.
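The core of that encoding can be shown in miniature. The sketch below is a toy reconstruction, not LDFI's actual implementation: the fact names and the brute-force search are invented for illustration. It treats each derivation of an outcome as a set of supporting facts and searches for the smallest fault sets that "cut" every derivation—which is what the satisfying assignments of LDFI's Boolean formula represent.

```python
from itertools import combinations

# Toy redundancy model: the outcome "client got data" has two
# derivations, each a set of facts that must all hold.
derivations = [
    {"replicaA_up", "link_client_A"},   # read served by replica A
    {"replicaB_up", "link_client_B"},   # read served by replica B
]

# Facts the fault injector is allowed to falsify (crash a replica,
# partition a link).
injectable = {"replicaA_up", "replicaB_up", "link_client_A", "link_client_B"}

def invalidates_all(faults, derivations):
    # A fault set cuts a derivation if it falsifies at least one support.
    return all(d & faults for d in derivations)

def candidate_experiments(derivations, injectable, max_faults=2):
    """Smallest fault sets that invalidate every known derivation.

    Each returned set is a fault-injection experiment that, according to
    the current model, should make the good outcome impossible: running
    it either exposes a bug or reveals a new derivation (fallback path)
    that enriches the model."""
    for k in range(1, max_faults + 1):
        hits = [set(c) for c in combinations(sorted(injectable), k)
                if invalidates_all(set(c), derivations)]
        if hits:
            return hits
    return []
```

Here no single fault cuts both derivations, but any pair that hits both supports (for example, crashing both replicas) is a candidate experiment. A real implementation replaces the brute-force enumeration with a SAT or ILP solver over the Boolean encoding.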
  • 65. At its heart, LDFI reapplies well- understood techniques from data management systems, treating fault tolerance as a materialized view main- tenance problem.2,13 It models a dis- tributed system as a query, its expect- ed outcomes as query outcomes, and critical facts such as “replica A is up at time t” and “there is connectivity be- tween nodes X and Y during the inter- val i . . . j” as base facts. It can then ask a how-to query:37 What changes to base data will cause changes to the derived data in the view? The answers to this query are the faults that could, accord- ing to the current model, invalidate the expected outcomes. The idea seems far-fetched, but the LDFI approach shows a great deal of promise. The initial prototype demon- strated the efficacy of the approach at the level of protocols, identifying bugs in replication, broadcast, and commit protocols.2,46 Notably, LDFI reproduced a bug in the replication protocol used by the Kafka distributed log26 that was first (manually) identified by Kingsbury.30 A later iteration of LDFI is deployed at Netflix,1 where (much like the illustra- tion in Figure 1) it was implemented as a microservice that consumes traces from a call-graph repository service and provides inputs for a fault injection ser-
  • 66. vice. Since its deployment, LDFI has identified 11 critical bugs in user-fac- ing applications at Netflix.1 Rumors from the Future The prior research presented earlier is only the tip of the iceberg. Much work still needs to be undertaken to realize the vision of fully automated failure testing for distributed systems. Here, we highlight nascent research that shows promise and identifies new di- rections that will help realize our vision. Don’t overthink fault injection. In the context of resiliency testing for distribut- ed systems, attempting to enumerate and faithfully simulate every possible kind of fault is a tempting but dis- tracting path. The problem of under- standing all the causes of faults is not directly relevant to the target, which is to ensure that code (along with its configuration) intended to detect and mitigate faults performs as expected. Consider Figure 2: The diagram on the left shows a microservice-based architecture; arrows represent calls generated by a client request. The right-hand side zooms in on a pair of interacting services. The shaded box in the caller service represents the fault tolerance logic that is intended to detect and handle faults of the cal- lee. Failure testing targets bugs in this
  • 67. logic. The fault tolerance logic targeted in this bug search is represented as the shaded box in the caller service, while the injected faults affect the callee. The common effect of all faults, from the perspective of the caller, is explicit error returns, corrupted responses, and (possibly infinite) delay. Of these manifestations, the first two can be ad- equately tested with unit tests. The last is difficult to test, leading to branches of code that are infrequently executed. If we inject only delay, and only at com- ponent boundaries, we conjecture that we can address the majority of bugs re- lated to fault tolerance. Explanations everywhere. If we can provide better explanations of system outcomes, we can build better models The rapid evolution of fault injection infrastructure makes it easier than ever to test fault hypotheses on large-scale systems. 60 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1
  • 68. practice to embrace (rather than abstracting away) this uncertainty. Distributed systems are probabi- listic by nature and are arguably bet- ter modeled probabilistically. Future directions of work include the proba- bilistic representation of system re- dundancy and an exploration of how this representation can be exploited to guide the search of fault experiments. We encourage the research community to join in exploring alternative internal representations of system redundancy. Turning the explanations inside out. Most of the classic work on data provenance in database research has focused on aspects related to human- computer interaction. Explanations of why a query returned a particular result can be used to debug both the query and the initial database—given an un- expected result, what changes could be made to the query or the database to fix it? By contrast, in the class of systems we envision (and for LDFI concretely), explanations are part of the internal language of the reasoner, used to con- struct models of redundancy in order to drive the search through faults. Ideally, explanations should play a role in both worlds. After all, when a
  • 69. bug-finding tool such as LDFI identi- fies a counterexample to a correctness property, the job of the programmers has only just begun—now they must un- dertake the onerous job of distributed debugging. Tooling around debugging has not kept up with the explosive pace of distributed systems development. We continue to use tools that were de- signed for a single site, a uniform mem- ory, and a single clock. While we are not certain what an ideal distributed debug- ger should look like, we are quite certain that it does not look like GDB (GNU Proj- ect debugger).36 The derivation graphs used by LDFI show how provenance can also serve a role in debugging by provid- ing a concise, visual explanation of how the system reached a bad state. This line of research can be pushed further. To understand the root causes of a bug in LDFI, a human operator must review the provenance graphs of the good and bad executions and then examine the ways in which they differ. Intuitively, if you could abstractly subtract the (incomplete by assump- tion) explanations of the bad outcomes from the explanations of the good out- of redundancy. Unfortunately, a bar- rier to entry for systems such as LDFI is the unwillingness of software de- velopers and operators to instrument their systems for tracing or provenance
  • 70. collection. Fortunately, operating sys- tem-level provenance-collection tech- niques are mature and can be applied to uninstrumented systems. Moreover, the container revolution makes simulating distributed execu- tions of black box software within a single hypervisor easier than ever. We are actively exploring the collection of system call-level provenance from unmodified distributed software in order to select a custom-tailored fault injection schedule. Doing so requires extrapolating application-level causal structure from low-level traces, iden- tifying appropriate cut points in an observed execution, and finally syn- chronizing the execution with fault injection actions. We are also interested in the pos- sibility of inferring high-level explana- tions from even noisier signals, such as raw logs. This would allow us to relax the assumption that the systems un- der study have been instrumented to collect execution traces. While this is a difficult problem, work such as the Mystery Machine12 developed at Face- book shows great promise. Toward better models. The LDFI system represents system redundancy using derivation graphs and treats the task of identifying possible bugs as a
  • 71. materialized-view maintenance prob- lem. LDFI was hence able to exploit well-understood theory and mecha- nisms from the history of data man- agement systems research. But this is just one of many ways to represent how a system provides alternative computa- tions to achieve its expected outcomes. A shortcoming of the LDFI approach is its reliance on assumptions of de- terminism. In particular, it assumes that if it has witnessed a computation that, under a particular contingency (that is, given certain inputs and in the presence of certain faults), produces a successful outcome, then any future computation under that contingency will produce the same outcome. That is to say, it ignores the uncertainty in timing that is fundamental to distrib- uted systems. A more appropriate way to model system redundancy would be The container revolution makes simulating distributed executions of black-box software within a single hypervisor easier than ever.
  • 72. J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 61 practice 36. Matloff, N., Salzman, P.J. The Art of Debugging with GDB, DDD, and Eclipse. No Starch Press, 2008. 37. Meliou, A., Suciu, D. Tiresias: The database oracle for how-to queries. Proceedings of the ACM SIGMOD International Conference on the Management of Data (2012), 337-348. 38. Microsoft Azure Documentation. Introduction to the fault analysis service, 2016; https://azure.microsoft. com/en-us/documentation/articles/ service-fabric- testability-overview/. 39. Musuvathi, M. et al. CMC: A pragmatic approach to model checking real code. ACM SIGOPS Operating Systems Review. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation 36 (2002), 75–88. 40. Musuvathi, M. et al. Finding and reproducing Heisenbugs in concurrent programs. In Proceedings of the 8th Usenix Conference on Operating Systems Design and Implementation (2008), 267–280. 41. Newcombe, C. et al. Use of formal methods at Amazon Web Services. Technical Report, 2014; http:// lamport.azurewebsites.net/tla/formal-methods- amazon.pdf. 42. Olston, C., Reed, B. Inspector Gadget: A framework for custom monitoring and debugging of distributed
  • 73. data flows. In Proceedings of the ACM SIGMOD International Conference on the Management of Data (2011), 1221–1224. 43. OpenTracing. 2016; http://opentracing.io/. 44. Pasquier, T.F. J.-M., Singh, J., Eyers, D.M., Bacon, J. CamFlow: Managed data-sharing for cloud services, 2015; https://arxiv.org/pdf/1506.04391.pdf. 45. Patterson, D.A., Gibson, G., Katz, R.H. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, 109–116; http://web.mit.edu/6.033/2015/wwwdocs/papers/ Patterson88.pdf. 46. Ramasubramanian, K. et al. Growing a protocol. In Proceedings of the 9th Usenix Workshop on Hot Topics in Cloud Computing (2017). 47. Reinhold, E. Rewriting Uber engineering: The opportunities microservices provide. Uber Engineering, 2016; https: //eng.uber.com/building-tincup/. 48. Saltzer, J. H., Reed, D.P., Clark, D.D. End-to-end arguments in system design. ACM Trans. Computing Systems 2, 4 (1984): 277–288. 49. Sandberg, R. The Sun network file system: design, implementation and experience. Technical report, Sun Microsystems. In Proceedings of the Summer 1986 Usenix Technical Conference and Exhibition. 50. Shkuro, Y. Jaeger: Uber’s distributed tracing system. Uber Engineering, 2017; https://uber.github.io/jaeger/.
  • 74. 51. Sigelman, B.H. et al. Dapper, a large-scale distributed systems tracing infrastructure. Technical report. Research at Google, 2010; https://research.google. com/pubs/pub36356.html. 52. Shenoy, A. A deep dive into Simoorg: Our open source failure induction framework. Linkedin Engineering, 2016; https://engineering.linkedin.com/blog/2016/03/ deep-dive-Simoorg-open-source-failure-induction- framework. 53. Yang, J. et al.L., Zhou, L. MODIST: Transparent model checking of unmodifed distributed systems. In Proceedings of the 6th Usenix Symposium on Networked Systems Design and Implementation (2009), 213–228. 54. Yu, Y., Manolios, P., Lamport, L. Model checking TLA+ specifications. In Proceedings of the 10th IFIP WG 10.5 Advanced Research Working Conference on Correct Hardware Design and Verification Methods (1999), 54–66. 55. Zhao, X. et al. Lprof: A non-intrusive request flow profiler for distributed systems. In Proceedings of the 11th Usenix Conference on Operating Systems Design and Implementation (2014), 629–644. Peter Alvaro is an assistant professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io). Severine Tymon is a technical writer who has written documentation for both internal and external users
  • 75. of enterprise and open source software, including for Microsoft, CNET, VMware, and Oracle. Copyright held by owners/authors. Publication rights licensed to ACM. $15.00. comes,10 then the root cause of the dis- crepancy would be likely to be near the “frontier” of the difference. Conclusion A sea change is occurring in the tech- niques used to determine whether distributed systems are fault tolerant. The emergence of fault injection ap- proaches such as Chaos Engineering and Jepsen is a reaction to the erosion of the availability of expert program- mers, formal specifications, and uni- form source code. For all of their prom- ise, these new approaches are crippled by their reliance on superusers who decide which faults to inject. To address this critical shortcom- ing, we propose a way of modeling and ultimately automating the process carried out by these superusers. The enabling technologies for this vision are the rapidly improving observabil- ity and fault injection infrastructures that are becoming commonplace in the industry. While LDFI provides con- structive proof that this approach is possible and profitable, it is only the beginning. Much work remains to be
  • 76. done in targeting faults at a finer grain, constructing more accurate models of system redundancy, and providing bet- ter explanations to end users of exactly what went wrong when bugs are identi- fied. The distributed systems research community is invited to join in explor- ing this new and promising domain. Related articles on queue.acm.org Fault Injection in Production John Allspaw http://queue.acm.org/detail.cfm?id=2353017 The Verification of a Distributed System Caitie McCaffrey http://queue.acm.org/detail.cfm?id=2889274 Injecting Errors for Fun and Profit Steve Chessin http://queue.acm.org/detail.cfm?id=1839574 References 1. Alvaro, P. et al. Automating failure-testing research at Internet scale. In Proceedings of the 7th ACM Symposium on Cloud Computing (2016), 17–28. 2. Alvaro, P., Rosen, J., Hellerstein, J.M. Lineage-driven fault injection. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2015), 331–346. 3. Andrus, K. Personal communication, 2016.
  • 77. 4. Aniszczyk, C. Distributed systems tracing with Zipkin. Twitter Engineering; https://blog.twitter.com/2012/ distributed-systems-tracing-with-zipkin. 5. Barth, D. Inject failure to make your systems more reliable. DevOps.com; http://devops.com/2014/06/03/ inject-failure/. 6. Basiri, A. et al. Chaos Engineering. IEEE Software 33, 3 (2016), 35–41. 7. Beyer, B., Jones, C., Petoff, J., Murphy, N.R. Site Reliability Engineering. O’Reilly, 2016. 8. Birrell, A.D., Nelson, B.J. Implementing remote procedure calls. ACM Trans. Computer Systems 2, 1 (1984), 39–59. 9. Chandra, T.D., Hadzilacos, V., Toueg, S. The weakest failure detector for solving consensus. J.ACM 43, 4 (1996), 685–722. 10. Chen, A. et al. The good, the bad, and the differences: better network diagnostics with differential provenance. In Proceedings of the ACM SIGCOMM Conference (2016), 115–128. 11. Chothia, Z., Liagouris, J., McSherry, F., Roscoe, T. Explaining outputs in modern data analytics. In Proceedings of the VLDB Endowment 9, 12 (2016): 1137–1148. 12. Chow, M. et al. The Mystery Machine: End-to-end performance analysis of large-scale Internet services. In Proceedings of the 11th Usenix Conference on
  • 78. Operating Systems Design and Implementation (2014), 217–231. 13. Cui, Y., Widom, J., Wiener, J.L. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Systems 25, 2 (2000), 179–227. 14. Dawson, S., Jahanian, F., Mitton, T. ORCHESTRA: A Fault Injection Environment for Distributed Systems. In Proceedings of the 26th International Symposium on Fault-tolerant Computing, (1996). 15. Fischer, M.J., Lynch, N.A., Paterson, M.S. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (1985): 374–382; https://groups.csail.mit. edu/tds/papers/Lynch/jacm85.pdf. 16. Fisman, D., Kupferman, O., Lustig, Y. On verifying fault tolerance of distributed protocols. In Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science 4963, Springer Verlag (2008). 315–331. 17. Gopalani, N., Andrus, K., Schmaus, B. FIT: Failure injection testing. Netflix Technology Blog; http:// techblog.netflix.com/2014/10/fit-failure-injection- testing.html. 18. Gray, J. Why do computers stop and what can be done about it? Tandem Technical Report 85.7 (1985); http://www.hpl.hp.com/techreports/ tandem/TR-85.7.pdf. 19. Gunawi, H.S. et al. FATE and DESTINI: A framework for cloud recovery testing. In Proceedings of the 8th Usenix Conference on Networked Systems Design
  • 79. and Implementation (2011), 238–252; http://db.cs. berkeley.edu/papers/nsdi11-fate-destini.pdf. 20. Holzmann, G. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003. 21. Honeycomb. 2016; https://honeycomb.io/. 22. Interlandi, M. et al. Titian: Data provenance support in Spark. In Proceedings of the VLDB Endowment 9, 33 (2015), 216–227. 23. Izrailevsky, Y., Tseitlin, A. The Netflix Simian Army. Netflix Technology Blog; http: //techblog.netflix. com/2011/07/ netflix-simian-army.html. 24. Jepsen. Distributed systems safety research, 2016; http://jepsen.io/. 25. Jones, N. Personal communication, 2016. 26. Kafka 0.8.0. Apache, 2013; https://kafka.apache. org/08/documentation.html. 27. Kanawati, G.A., Kanawati, N.A., Abraham, J.A. Ferrari: A flexible software-based fault and error injection system. IEEE Trans. Computers 44, 2 (1995): 248–260. 28. Kendall, S.C., Waldo, J., Wollrath, A., Wyant, G. A note on distributed computing. Technical Report, 1994. Sun Microsystems Laboratories. 29. Killian, C.E., Anderson, J.W., Jhala, R., Vahdat, A. Life, death, and the critical transition: Finding liveness bugs in systems code. Networked System Design and Implementation, (2007); 243–256.
  • 80. 30. Kingsbury, K. Call me maybe: Kafka, 2013; http:// aphyr.com/posts/293-call-me-maybe-kafka. 31. Kingsbury, K. Personal communication, 2016. 32. Lafeldt, M. The discipline of Chaos Engineering. Gremlin Inc., 2017; https://blog.gremlininc.com/the- discipline-of-chaos-engineering-e39d2383c459. 33. Lampson, B.W. Atomic transactions. In Distributed Systems—Architecture and Implementation, An Advanced Cours: (1980), 246–265; https://link. springer.com/chapter/10.1007%2F3-540-10571-9_11. 34. LightStep. 2016; http://lightstep.com/. 35. Marinescu, P.D., Candea, G. LFI: A practical and general library-level fault injector. In IEEE/IFIP International Conference on Dependable Systems and Networks (2009). Copyright of Communications of the ACM is the property of Association for Computing Machinery and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. International Journal of Performability Engineering Vol. 6, No.
6, November 2010, pp. 531-546. © RAMS Consultants, Printed in India

Successful Application of Software Reliability: A Case Study

NORMAN F. SCHNEIDEWIND, Fellow of the IEEE
2822 Raccoon Trail, Pebble Beach, California 93953, USA
(Received on July 30, 2009; revised on May 3, 2010)

Abstract: The purpose of this case study is to help readers implement or improve a software reliability program in their organizations, using a step-by-step approach based on the Institute of Electrical and Electronics Engineers (IEEE) and the American Institute of Aeronautics and Astronautics (AIAA) Recommended Practice for Software Reliability, released in June 2008, supported by a case study from the NASA Space Shuttle.
This case study covers the major phases that the software engineering practitioner needs in planning and executing a software reliability engineering program. These phases require a number of steps for their implementation. These steps provide a structured approach to the software reliability process. Each step will be discussed to provide a good understanding of the entire software reliability process. Major topics covered are: data collection, reliability risk assessment, reliability prediction, reliability prediction interpretation, testing, reliability decisions, and lessons learned from the NASA Space Shuttle software reliability engineering program.

Keywords: software reliability program, Institute of Electrical and Electronics Engineers and the American Institute of Aeronautics and Astronautics Recommended Practice for Software Reliability, NASA Space Shuttle application

1. Introduction

The IEEE/AIAA recommended practice provides a foundation on which
practitioners and researchers can build consistent methods [1]. This case study will describe the SRE process and show that it is important for an organization to have a disciplined process if it is to produce high-reliability software. To accomplish this purpose, an overview is presented of existing practice in software reliability, as represented by the recommended practice [1]. This will provide the reader with the foundation to understand the basic process of software reliability engineering (SRE). The Space Shuttle Primary Avionics Software Subsystem will be used to illustrate the SRE process. The reliability prediction models that will be used are based on some key definitions and assumptions, as follows:

Definitions

Interval: an integer time unit t of constant or variable length, defined by t-1 < t < t+1, where t > 0; failures are counted in intervals.

Number of Intervals: the number of contiguous integer time units t of constant or variable
length, represented by a positive real number.

Operational Increment (OI): a software system comprised of modules and configured from a series of builds to meet Shuttle mission functional requirements.

Time: continuous CPU execution time over an interval range.

Assumptions

1. Faults that cause failures are removed.
2. As more failures occur and more faults are corrected, remaining failures will be reduced.
3. The remaining failures are "zero" for those OIs that were executed for extremely long times (years) with no additional failure reports;
correspondingly, for these OIs, maximum failures equals total observed failures.

1.1 Space Shuttle Flight Software Application

The Shuttle software represents a successful integration of many of the computer industry's most advanced software engineering practices and approaches. Beginning in the late 1970s, this software development and maintenance project has evolved one of the world's most mature software processes, applying the principles of the highest levels of the Software Engineering Institute's (SEI) Capability Maturity Model (the software is rated Level 5 on the SEI scale) and ISO 9001 standards [2]. This software process includes state-of-the-practice software reliability engineering (SRE) methodologies. The goals of the recommended practice are to: interpret software reliability predictions, support verification and validation of the software, assess the risk of deploying the software, predict the reliability of the software, develop test strategies to
bring the software into conformance with reliability specifications, and make reliability decisions regarding deployment of the software. Reliability predictions are used by the developer to add confidence to a formal software certification process comprised of requirements risk analysis, design and code inspections, testing, and independent verification and validation. This case study uses the experience obtained from the application of SRE on the Shuttle project, because this application is judged by NASA and the developer to be a successful application of SRE [6]. These SRE techniques and concepts should be of value for other software systems.

1.2 Reliability Measurements and Predictions

There are a number of measurements and predictions that can be made of reliability to verify and validate the software. Among these are remaining failures, maximum failures, total test time required to attain a given fraction of remaining failures, and time to next failure. These have been shown to be useful measurements and predictions for: 1)
providing confidence that the software has achieved reliability goals; 2) rationalizing how long to test a software component (e.g., testing sufficiently long to verify that the measured reliability conforms to design specifications); and 3) analyzing the risk of not achieving remaining failures and time to next failure goals [6]. Having predictions of the extent to which the software is not fault free (remaining failures) and whether a failure is likely to occur during a mission (time to next failure) provides criteria for assessing the risk of deploying the software. Furthermore, fraction of remaining failures can be used as both an
can be divided into the following two categories, which are used in combination to help assure the desired level of reliability of the software in mission-critical systems like the Shuttle. The two categories are: 1) measurements and predictions that are associated with residual software faults and failures, and 2) measurements and predictions that are associated with the ability of the software to complete a mission without experiencing a failure of a specified severity. In the first category are: remaining failures, maximum failures, fraction of remaining failures, and total test time required to attain a given fraction of remaining failures. In the second category are: time to next failure and total test time required to attain a given time to next failure. In addition, there is the risk associated with not attaining the required remaining failures and time to next failure goals. Lastly, there is operational quality, which is derived from fraction of remaining failures. With this type of information, a software manager can determine whether more testing is warranted or
whether the software is sufficiently tested to allow its release or unrestricted use. These predictions provide a quantitative basis for achieving reliability goals [2].

1.3 Interpretations and Credibility

The two most critical factors in establishing credibility in software reliability predictions are the validation method and the way the predictions are interpreted. For example, a "conservative" prediction can be interpreted as providing an "additional margin of confidence" in the software reliability, if that predicted reliability already exceeds an established "acceptable level" or requirement. It may not be possible to validate predictions of the reliability of software precisely, but it is possible with "high confidence" to predict a lower bound on the reliability of that software within a specified environment. If historical failure data were available for a series of previous dates (and there is actual data for the failure history following those dates), it would be possible to compare
the predictions to the actual reliability and evaluate the performance of the model. Taking this approach will significantly enhance the credibility of predictions among those who must make software deployment decisions based on the predictions [9].

1.4 Verification and Validation

Software reliability measurement and prediction are useful approaches to verify and validate software. Measurement refers to collecting and analyzing data about the observed reliability of software, for example, the occurrence of failures during test. Prediction refers to using a model to forecast future software reliability, for example, failure rate during operation. Measurement also provides the failure data that is used to estimate the parameters of reliability models (i.e., make the best fit of the model to the observed failure data). Once the parameters have been estimated, the model is used to predict the future reliability of the software. Verification ensures that the software product, as it exists in a given project phase, satisfies the conditions imposed in the