Communications of the ACM | November 2009 | Vol. 52 | No. 11

practice
You Don't Know Jack about Software Maintenance

DOI: 10.1145/1592761.1592777
Article development led by queue.acm.org

Long considered an afterthought, software maintenance is easiest and most effective when built into a system from the ground up.

BY PAUL STACHOUR AND DAVID COLLIER-BROWN

Everyone knows maintenance is difficult and boring, and therefore avoids doing it. It doesn't help that many pointy-haired bosses (PHBs) say things like:

"No one needs to do maintenance—that's a waste of time."

"Get the software out now; we can decide what its real function is later."

"Do the hardware first, without thinking about the software."

"Don't allow any room or facility for expansion. You can decide later how to sandwich the changes in."

These statements are a fair description of development during the last boom, and not too far from what many of us are doing today. This is not a good thing: when you hit the first bug, all the time you may have "saved" by ignoring the need to do maintenance will be gone.

During a previous boom, General Electric designed a mainframe that it claimed would be sufficient for all the computer uses in Boston, and would never need to be shut down for repair or for software tweaks. The machine it eventually built wasn't nearly big enough, but it did succeed at running continuously without need for hardware or software changes.

Today we have a distributed network of computers provided by thousands of businesses, sufficient for everyone in at least North America, if not the world. Still, we must keep shutting down individual parts of the network to repair or change the software. We do so because we've forgotten how to do software maintenance.

What is software maintenance? Software maintenance is not like hardware maintenance, which is the return of the item to its original state. Software maintenance involves moving an item away from its original state. It encompasses all activities associated with the process of changing software. That includes everything associated with "bug fixes," functional and performance enhancements, providing backward compatibility, updating its algorithm, covering up hardware errors, creating user-interface access methods, and other cosmetic changes.

In software, adding a six-lane automobile expressway to a railroad bridge is considered maintenance—and it would be particularly valuable if you could do it without stopping the train traffic.

Is it possible to design software so it can be maintained in this way? Yes, it is. So, why don't we?

the four horsemen of the apocalypse

There are four approaches to software maintenance: traditional, never, discrete, and continuous—or, perhaps, war, famine, plague, and death. In any case, 3.5 of them are terrible ideas.

Traditional (or "everyone's first project"). This one is easy: don't even think about the possibility of maintenance. Hard-code constants, avoid subroutines, use all global variables, use short and non-meaningful variable names. In other words, make it difficult to change any one thing without changing everything. Everyone knows examples of this approach—and the PHBs who thoughtlessly push you into it, usually because of schedule pressures.

Trying to maintain this kind of software is like fighting a war. The enemy fights back! It particularly fights back when you have to change interfaces, and you find you've changed only some of the copies.

Never. The second approach is to decide upfront that maintenance will never occur. You simply write wonderful programs right from the start. This is actually credible in some embedded systems, which will be burned to ROM and never changed. Toasters, video games, and cruise missiles come to mind.

All you have to do is design perfect specifications and interfaces, and never change them. Change only the implementation, and then only for bug fixes before the product is released. The code quality is wildly better than it is for the traditional approach, but never quite good enough to avoid change completely.

Even for very simple embedded systems, the specifications and designs aren't quite good enough, so in practice the specification is frozen while it's still faulty. This is often because it cannot be validated, so you can't tell if it's faulty until too late. Then the specification is not adhered to when code is written, so you can't prove the program follows the specification, much less prove it's correct. So, you test until the program is late, and then ship. Some months later you replace it as a complete entity, by sending out new ROMs. This is the typical history of video games, washing machines, and embedded systems from the U.S. Department of Defense.

Discrete. The discrete change approach is the current state of practice: define hard-and-fast, highly configuration-controlled interfaces to elements of software, and regularly carry out massive all-at-once changes. Next, ship an entire new copy of the program, or a "patch" that silently replaces entire executables and libraries. (As we write this, a new copy of OpenOffice is asking us please to download it.)

In theory, the process accepts (reluctantly) the fact of change, keeps a parts list and tools list for every item, allows only preauthorized changes under strict configuration control, and forces all servers'/users' changes to take place in one discrete step. In practice, the program is running in multiple places, and each must kick off its users, do the upgrade, and then let them back on again. Change happens more often and in more places than predicted, all the components of an item are not recorded, and patching is alive (and, unfortunately, thriving) because of the time lag for authorization and the rebuild time for the system.

Furthermore, while official interfaces are controlled, unofficial interfaces proliferate; and with C and older languages, data structures are so available that even when change is desired, too many functions "know" that the structure has a particular layout. When you change the data structure, some program or library that you didn't even know existed starts to crash or return ENOTSUP. A mismatch between an older Linux kernel and newer glibc once had getuid returning "Operation not supported," much to the surprise of the recipients.

Experience shows that it is completely unrealistic to expect that all users to whom an interface is visible will be able to change at the same time. The result is that single-step changes cannot happen: multiple change interrelationships conflict, networks mean multiple versions are simultaneously current, and owners/users want to control change dates.

Vendors try to force discrete changes, but the changes actually spread through a population of computers in a wave over time. This is often likened to a plague, and is every bit as popular.

Customers use a variant of the "never" approach to software maintenance against the vendors of these plagues: they build a known working configuration, then "freeze and forget." When an update is required, they build a completely new system from the ground up and freeze it. This works unless you get an urgent security patch, at which time you either ignore it or start a large unscheduled rebuild project.

Continuous change. At first, this approach to maintenance sounds like just running new code willy-nilly and watching what happens. We know at least one company that does just that: a newly logged-on user will unknowingly be running different code from everyone else. If it doesn't work, the user's system will either crash or be kicked off by the sysadmin, then will have to log back on and repeat the work using the previous version.

Real-world structure for managing interface changes:

    struct item_loc_t {
        struct {
            unsigned short major;  /* = 1 */
            unsigned short minor;  /* = 0 */
        } version;
        unsigned part_no;
        unsigned quantity;
        struct location_t {
            char state[4];
            char city[8];
            unsigned warehouse;
            short area;
            short pigeonhole;
        } location;
        /* ... */
    };
However, that is not the real meaning of continuous. The real continuous approach comes from Multics, the machine that was never supposed to shut down and that used controlled, transparent change. The developers understood that the only constant is change and that migration for hardware, software, and function during system operation is necessary. Therefore, the ability to change was designed in from the very beginning.

Software in particular must be written to evolve as changes happen, using a weakly typed high-level language and, in older programs, a good macro assembler. No direct references are allowed to anything if they can be avoided. Every data structure is designed for expansion and is self-identifying as to version. Every code segment is made self-identifying by the compiler or other construction procedure. Code and data are changeable on a per-command/process/system basis, and as few copies of anything as possible are kept, so single copies can be dynamically updated as necessary.

The most important thing is to manage interface changes. Even in the Multics days, it was easy to forget to change every single instance of an interface. Today, with distributed programs, changing all possible copies of an interface at once is going to be insanely difficult, if not flat-out impossible.
Who Does it Right?
BBN Technologies was the first company to perform continuous controlled change, when it built the ARPANET backbone in 1969. It placed a 1-bit version number in every packet. If the bit changed from 0 to 1, it meant that the IMP (router) was to switch to a new version of its software and set the bit to 1 on every outgoing packet. This allowed the entire ARPANET to switch easily to new versions of the software without interrupting its operation. That was very important to the pre-TCP Internet, as it was quite experimental and suffered a considerable amount of change.
With Multics, the developers did all of these good things, the most important of which was the discipline used with data structures: if an interface took more than one parameter, all the parameters were versioned by placing them in a structure with a version number. The caller set the version, and the recipient checked it. If it was completely obsolete, it was flatly rejected. If it was not quite current, it was processed differently, by being upgraded on input and probably downgraded on return.

This meant that many different versions of a program or kernel module could exist simultaneously, while upgrades took place at the user's convenience. It also meant that upgrades could happen automatically and that multiple sites, multiple suppliers, and networks didn't cause problems.
An example of a structure used by a U.S.-based warehousing company (translated to C from Multics PL/1) is illustrated in the accompanying box. The company bought a Canadian competitor and needed to add inter-country transfers, initially from three of its warehouses in border cities. This, in turn, required the state field to split into two parts:

    char country_code[4];
    char state_province[4];

To identify this, the company incremented the version number from 1.0 to 2.0 and arranged for the server to support both types. New clients used version 2.0 structures and were able to ship to Canada. Old ones continued to use version 1.0 structures. When the server received a type 1 structure, it used an "updater" subroutine that copied the data into a type 2 structure and set the country code to U.S.

In a more modern language, you would add a new subclass with a constructor that supports a country code, and update your new clients to use it.
The process is this:

1. Update the server.
2. Change the clients that run in the three border-state warehouses. Now they can move items from U.S. to Canadian warehouses.
3. Deploy updated clients to those Canadian locations needing to move stock.
4. Update all of the U.S.-based clients at their leisure.
Using this approach, there is never a need to stop the whole system, only the individual copies, and that can be
scheduled around a business's convenience. The change can be immediate, or it can wait for a suitable time.

Once the client updates have occurred, we simultaneously add a check to produce a server error message for anyone who accidentally uses an outdated U.S.-only version of the client. This check is a bit like the "can't happen" case in an else-if: it's done to identify impossibly out-of-date calls. It fails conspicuously, and the system administrators can then hunt down and replace the ancient version of the program. This also discourages the unwise from permanently deferring fixes to their programs, much like the coarse version numbers on entire programs in present practice.
modern examples
This kind of fine-grained versioning is sometimes seen in more recent programs. Linkers are an example, as they read files containing numbered records, each of which identifies a particular kind of code or data. For example, a record number 7 might contain the information needed to link a subroutine call, such as the name of the function to call and a space for an address. If the linker uses record types 1 through 34, and type 7 later needs to be extended for a new compiler, you create a type 35, use it for the new compiler, and schedule changes from type 7 to type 35 in all the other compilers, typically by announcing the date on which type 7 records will no longer be accepted.
Another example is in networking protocols such as IBM SMB (Server Message Block), used for Windows networking. It has both protocol versions and packet types that can be used exactly the same way as the record types of a linker.

Object languages can also support controlled maintenance by creating new versions as subclasses of the same parent. This is a slightly odd use of a subclass, as the variations you create aren't necessarily meant to persist, but you can go back and clean out unneeded variants later, after they're no longer in use.

With AJAX, a reasonably small client can be downloaded every time the program is run, thus allowing change without versioning. A larger client would need only a simple versioning scheme, enough to allow it to be downloaded whenever it was out of date.
An elegant modern form of continuous maintenance exists in relational databases: one can always add columns to a relation, and there is a well-known value called null that stands for "no data." If the programs that use the database understand that any calculation with a null yields a null, then a new column can be added, programs changed to use it over some period of time, and the old column(s) filled with nulls. Once all the users of the old column are no more, as indicated by the column being null for some time, the old column can be dropped.
Another elegant mechanism is a markup language such as SGML or XML, which can add or subtract attributes of a type at will. If you're careful to change the attribute name when the type changes, and if your XML processor understands that adding 3 to a null value is still null, you have an easy way to transfer and store mutating data.
maintenance isn’t hard, it’s easy
During the last boom, (author) Collier-Brown's team needed to create a single front end to multiple back ends, under the usual insane time pressures. The front end passed a few parameters and a C structure to the back ends, and the structure repeatedly needed to be changed for one or another of the back ends as they were developed.

Even when all the programs were on the same machine, the team couldn't change them simultaneously, because they would have been forced to stop everything they were doing and apply a structure change. Therefore, the team started using version numbers. If a back end needed version 2.6 of the structure, it told the front end, which handed it the new one. If it could use only version 2.5, that's what it asked for. The team never had a "flag day" when all work stopped to apply an interface change. They could make those changes when they could schedule them.
Of course, the team did have to make the changes eventually, and their management had to manage that, but they were able to make the changes when it wouldn't destroy their schedule. In an early precursor to test-directed design, they had a regression test that checked whether all the version numbers were up to date and warned them if updates were needed.

The first time the team avoided a flag day, they gained the few hours expended preparing for change. By the 12th time, they were winning big.

Maintenance really is easy. More importantly, investing time to prepare for it can save you and your management time in the most frantic of projects.
Related articles on queue.acm.org

The Meaning of Maintenance
Kode Vicious
http://queue.acm.org/detail.cfm?id=1594861

The Long Road to 64 Bits
John Mashey
http://queue.acm.org/detail.cfm?id=1165766

A Conversation with David Brown
http://queue.acm.org/detail.cfm?id=1165764
Paul Stachour is a software engineer equally at home in development, quality assurance, and process. One of his focal areas is how to create correct, reliable, functional software in effective and efficient ways in many programming languages. Most of his work has been with life-, safety-, and security-critical applications from his home base in the Twin Cities of Minnesota.

David Collier-Brown is an author and systems programmer, formerly with Sun Microsystems, who mostly does performance and capacity work from his home in Toronto.

© 2009 ACM 0001-0782/09/1100 $10.00
contributed articles

Communications of the ACM | January 2010 | Vol. 53 | No. 1

Think Big for Reuse

DOI: 10.1145/1629175.1629209

BY PAUL D. WITMAN AND TERRY RYAN

Many organizations are successful with software reuse at fine to medium granularities – ranging from objects, subroutines, and components through software product lines. However, relatively little has been published on very large-grained reuse. One example of this type of large-grained reuse might be that of an entire Internet banking system (applications and infrastructure) reused in business units all over the world. In contrast, "large scale" software reuse in current research generally refers to systems that reuse a large number of smaller components, or that perhaps reuse subsystems.9 In this article, we explore a case of an organization with an internal development group that has been very successful with large-grained software reuse.

BigFinancial, and the BigFinancial Technology Center (BTC) in particular, have created a number of software systems that have been reused in multiple businesses and in multiple countries. BigFinancial and BTC thus provided a rich source of data for case studies to look at the characteristics of those projects and why they have been successful, as well as to look at projects that have been less successful and to understand what has caused those results and what might be done differently to prevent issues in the future. The research is focused on technology, process, and organizational elements of the development process, rather than on specific product features and functions.

Supporting reuse at a large-grained level may help to alleviate some of the issues that occur in more traditional reuse programs, which tend to be finer-grained. In particular, because BigFinancial was trying to gain commonality in business processes and operating models, reuse of large-grained components was more closely aligned with its business goals. This same effect may well not have happened with finer-grained reuse, due to the continued ability of business units to more readily pick and choose components for reuse.

BTC is a technology development unit of BigFinancial, with operations in both the eastern and western U.S. Approximately 500 people are employed by BTC, reporting ultimately through a single line manager responsible to the Global Retail Business unit head of BigFinancial. BTC is organized to deliver both products and infrastructure components to BigFinancial, and its product line has through the years included consumer Internet banking services, teller systems, ATM software, and network management tools. BigFinancial has its U.S. operations headquartered in the eastern U.S., and employs more than 8,000 technologists worldwide.

In cooperation with BTC, we selected three cases for further study from a pool of about 25. These cases were the Java Banking Toolkit (JBT) and its related application systems, the Worldwide Single Signon (WSSO) subsystem, and the BigFinancial Message Switch (BMS).

background – software reuse and bigfinancial

Various definitions appear in the literature for software reuse. Karlsson defines software reuse as "the process of creating software systems from existing software assets, rather than building software systems from scratch." One taxonomy of the approaches to software reuse includes notions of the scope of reuse, the target of the reuse, and the granularity of the reuse.5 The notion of granularity is a key differentiator of the type of software reuse practiced at BigFinancial, as BigFinancial has demonstrated success in large-grained reuse programs – building a system once and reusing it in multiple businesses.

Product-line technology models, such as that proposed by Griss4 and further expanded upon by Clements and Northrop2 and by Krueger,6 suggest that software components can be treated similarly to the notions used in manufacturing – reusable parts that contribute to consistency across a product line as well as to improved efficiencies in manufacturing. Benefits of such reuse include the high levels of commonality of such features as user interfaces,7 which increases switching costs and customer loyalty in some domains. This could logically extend to banking systems in the form of common functionality and user interfaces across systems within a business, and across business units.

BigFinancial has had several instances of successful, large-grained reuse projects. We identified projects that have been successfully reused across a wide range of business environments or business domains, resulting in significant benefit to BigFinancial. These included the JBT platform and its related application packages, as well as the Worldwide SSO product. These projects demonstrated broad success, and the authors evaluated them for evidence to identify what contributed to, and what may have worked against, the success of each project.

The authors also identified another project that has been successfully reused across a relatively narrow range of business environments. This project, the BigFinancial Message Switch (BMS), was designed for a region-wide level of reuse, and had succeeded at that level. As such, it appears to have invested appropriately in features and capabilities needed for its client base, and did not appear to have over-invested.

online banking and related services

We focused on BTC's multi-use Java Banking Toolkit (JBT) as a model of a successful project. The Toolkit is in wide use across multiple business units, and represents reuse both at the largest-grained levels as well as reuse of large-scale infrastructure components. JBT supports three application sets today, including online banking, portal services, and alerts capabilities, and thus the JBT infrastructure is already reused for multiple applications. To some extent, these multiple applications could be studied as subcases, though they have thus far tended to be deployed as a group. In addition, the online banking, portal services, and alerts functions are themselves reused at the application level across multiple business units globally.

Initial findings indicated that several current and recent projects showed significant reuse across independent business units that could have made alternative technology development decisions. The results are summarized in Table 1.

Table 1. Selected reuse results.

    Project                                          Reused in business units
    System Infrastructure (consumer Internet         all users of BTC's legacy Internet banking
      banking; automated teller machines)              components – >35 businesses worldwide
    System Infrastructure (Internet banking –        approximately 4 business units worldwide
      Small Business)
    Internet banking – Europe                        >15 business units
    Internet banking – Asia                          >10 business units
    Internet banking – Latin America                 >6 business units
    Internet banking – North America                 >4 business units

While significant effort is required to support multiple languages and business-specific functional variability, BTC found that it was able to accommodate these requirements by designing its products to be rule-based, and by designing its user interface to separate content from language. In this manner, business rules drove the behavior of the Internet banking applications, and language- and format-definition tools drove the details of application behavior, while maintaining a consistent set of underlying application code.

In the late 1990s, BTC was responsible for creation of system infrastructure components, built on top of industry-standard commercial operating systems and components, to support the banking functionality required by its customers within BigFinancial. The functions of these infrastructure components included systems management, high-reliability logging processes, high-availability mechanisms, and other features not readily available in commercial products at the time the components were created. The same infrastructure was used to support consumer Internet banking as well as automated teller machines. The Internet banking services will be identified here as the Legacy Internet Banking product (LIB).

BigFinancial's initial forays into Internet transaction services were accomplished via another instance of reuse. Taking its pre-Internet banking components, BTC was able to "scrape" the content from the pages displayed in that product, and wrap HTML code around it for display in a Web browser. Other components were responsible for modifying the input and menuing functions for the Internet.

The purpose of this approach to Internet delivery was to deliver a product to the Internet more rapidly, without modification of the legacy business logic, thereby reducing risk as well. In what amounted to an early separation of business and presentation logic, the pre-Internet business logic remained in place, and the presentation layer re-mapped its content for the browser environment.

In 2002, BigFinancial and BTC recognized two key issues that needed to be addressed. The platform for their Legacy Internet Banking application was nearing end of life (having been first deployed in 1996), and there were too many disparate platforms for its consumer Internet offerings. BTC's Internet banking, alerts, and portal functions each required separate hardware and operating environments. BTC planned its activities such that the costs of the new development could fit within the existing annual maintenance and new-development costs already being paid by its clients.

BTC and business executives cited trust in BTC's organization as a key to allowing BTC the opportunity to develop the JBT product. In addition, BTC's prior success with reusing software components at fine and medium granularities led to a culture that promoted reuse as a best practice.
Starting in late 2002, BTC developed an integrated platform and application set for a range of consumer Internet functions. The infrastructure package, named the Java Banking Toolkit (JBT), was based on Java 2 Enterprise Edition (J2EE) standards and was intended to allow BigFinancial to centralize its server infrastructure for consumer Internet functions. The authors conducted detailed interviews with several BTC managers and architects, and reviewed several hundred documents. Current deployment statistics for JBT are shown in Table 2.
The JBT infrastructure and applications were designed and built by BTC and its regional partners, with input from its clients around the world. BTC's experience had shown that consumer banking applications were not fundamentally different from one another across the business units, and BTC proposed and received funding for creation of a consolidated application set for Internet banking. A market evaluation determined that there were no suitable, globally reusable, complete applications on the market, nor any other organization with the track record of success required for confidence in the delivery. Final funding approval came from BigFinancial technology and business executives.
The requirements for JBT called for several major functional elements. The requirements were broken out among the infrastructural elements supporting the various planned application packages, and the applications themselves. The applications delivered with the initial release of JBT included a consumer Internet banking application set, an account activity and balance alerting function, and a portal content toolset.

Each of these components was designed to be reused intact in each business unit around the world, requiring only changes to business rules and language phrases that may be unique to a business. One of the fundamental requirements for each of the JBT applications was to include capabilities that were designed to be common to and shared by as many business units as possible, while allowing for all necessary business-specific variability.

Such variability was planned for in the requirements process, building on the LIB infrastructure and applications, as well as the legacy portal and alerts services that were already in production. Examples of the region- and business-specific variability include language variations, compliance with local regulatory requirements, and functionality based on local and regional competitive requirements.
JBT's initial high-level requirements documents included requirements across a range of categories. These categories included technology, operations, deployment, development, and tools. These requirements were intended to form the foundation for initial discussion and agreement with the stakeholders, and to support division of the upcoming tasks to define the architecture. Nine additional, more detailed requirements documents were created to flesh out the details referenced in the top-level requirements. Additional topics addressed by the detailed documents included language, business rules, host messaging, logging, portal services, and system management.

One of BigFinancial's regional technology leaders reported that JBT has been much easier to integrate than the legacy product, given its larger application base and ability to readily add applications to it. Notably, he indicated that JBT's design had taken into account the lessons learned from prior products, including improvements in performance, stability, and total cost of ownership. This resulted in a "win/win/win for businesses, technology groups, and customers."
From an economic viewpoint, BigFinancial indicates that the cost savings for first-time business unit implementations of products already deployed to other business units averaged between 20% and 40%, relative to the cost of new development. Further, subsequent deployments of updated releases to a group of business units produced cost savings of 50%–75% relative to the cost of maintaining the software for each business unit independently.
All core banking functionality is
supported by a single global applica-
tion set. There remain, in some cases,
functions required only by a specific
business or region. The JBT architec-
ture allows for those region-specific
applications to be developed by the
regional technology unit as required.
An overview of the JBT architecture is
shown in Figure 1.
BTC implemented JBT on principles
of a layered architecture,12 focusing on
interoperability and modularity. For
example, the application components
interact only with the application body
section of the page; all other elements
of navigation and branding are handled
by the common and portal services
Figure 1. Java Banking Toolkit architecture overview.

Table 2. JBT reuse results.

Region          Business units
Europe          > 18 business units
Asia            > 14 business units
Latin America   > 9 business units
North America   > 5 business units
January 2010 | Vol. 53 | No. 1 | Communications of the ACM

contributed articles
ments to global product capabilities,
along with the cost of training, devel-
opment and testing of business rules,
and ramp-up of operational processes.
In contrast, ongoing maintenance sav-
ings are generally larger, due to the
commonality across the code base for
numerous business units. This com-
monality enables bug fixes, security
patches, and other maintenance activi-
ties to be performed on one code base,
rather than one for each business unit.
BigFinancial has demonstrated that
it is possible for a large organization,
building software for its own internal
use, to move beyond the more common
models of software reuse. In so doing,
BigFinancial has achieved significant
economies of scale across its many
business units, and has shortened the
time to market for new deployments of
its products.
Numerous factors were critical to
the success of the reuse projects. These
included elements expected from the
more traditional reuse literature, in-
cluding organizational structure, tech-
nological foundations, and economic
factors. In addition, several new ele-
ments have been identified. These in-
clude the notions of trust and culture,
the concepts of a track record of large-
and fine-grained reuse success, and the
virtuous (and potentially vicious) cycle
of corporate mandates. Conversely,
organizational barriers prove to be the
greatest inhibitor to successful reuse.13
BTC took specific steps, over a period
of many years, to create and strengthen
its culture of reuse. Across numerous
product lines, reuse of components and
infrastructure packages was strongly
encouraged. Reuse of large-grained
elements was the next logical step,
working with a group of business units
within a single regional organization.
This supported the necessary business
alignment to enable large-grained re-
use. In addition, due to its position as a global technology provider to BigFinancial, BTC was able to leverage its knowledge of requirements across business units, explicitly design products to be readily reusable, and drive commonality of requirements to support that reuse.
On the technical factors related
to reuse, BTC’s results have provided
empirical evidence regarding the use
of various technologies and patterns
elements. In addition, transactional
messaging is isolated from the applica-
tion via a message abstraction layer, so
that unique messaging models can be
used in each region, if necessary.
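A message abstraction layer of the kind just described can be sketched as follows. The adapter names and record formats are hypothetical, not JBT's actual interfaces:

```python
# Sketch: the application speaks one neutral messaging interface,
# while per-region adapters handle each region's unique host
# messaging model. All names are illustrative.

class MessageAdapter:
    def send(self, operation, payload):
        raise NotImplementedError

class Iso8583Adapter(MessageAdapter):
    # One region's host might use an ISO 8583-style format...
    def send(self, operation, payload):
        return {"fmt": "iso8583", "op": operation, "data": payload}

class XmlHostAdapter(MessageAdapter):
    # ...while another region fronts an XML gateway.
    def send(self, operation, payload):
        return {"fmt": "xml", "op": operation, "data": payload}

class Application:
    def __init__(self, adapter: MessageAdapter):
        self.adapter = adapter  # injected per deployment region

    def get_balance(self, account):
        # Application code never sees the regional wire format.
        return self.adapter.send("balance_inquiry", {"account": account})

eu = Application(Iso8583Adapter())
asia = Application(XmlHostAdapter())
```

Swapping the adapter is the only per-region change; the application component itself is deployed unmodified.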
JBT includes both the infrastruc-
ture and applications components for
a range of banking functionality. The
infrastructure and applications com-
ponents are defined as independently
changeable releases, but are currently
packaged as a group to simplify the de-
ployment process.
Funding and governance of the
projects are coordinated through BTC,
with significant participation from the
business units. Business units have the
opportunity to choose other vendors
for their technology needs, though the
corporate technology strategy limited
that option as the JBT project gained
wider rollout status. Business units
participate in a semi-annual in-person
planning exercise to evaluate enhance-
ment requests and prioritize new busi-
ness deployments.
results
The authors examined a total of six dif-
ferent cases of software reuse. Three of
these were subcases of the Java Banking
Toolkit (JBT) – Internet banking, portal
services, and alerts, along with the re-
use of the JBT platform itself. The oth-
ers were the Worldwide SSO product,
and the BigFinancial Message Switch.
There were a variety of reuse success
levels, and a variety of levels of evidence
of anticipated supports and barriers to
reuse. The range of outcomes is represented as a two-dimensional graph, as shown in Figure 2.
BigFinancial measures its reuse
success in a very pragmatic, straight-
forward fashion. Rather than measur-
ing reused modules, lines of code, or
function points, BigFinancial instead
simply measures total deployments
of compatible code sets. Due to on-
going enhancements, the code base
continues to evolve over time, but in a
backwards-compatible fashion, so that
older versions can be and are readily
upgraded to the latest version as busi-
ness needs dictate.
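That metric, counting total deployments of compatible code sets rather than reused modules or lines, is simple to compute. A toy sketch with invented data:

```python
# Toy illustration of the deployment-counting reuse metric described
# in the text: every business unit running a backward-compatible
# code set counts as one reuse, regardless of version. Data invented.

deployments = [
    {"unit": "DE", "codeset": "JBT", "version": "3.1"},
    {"unit": "FR", "codeset": "JBT", "version": "3.1"},
    {"unit": "SG", "codeset": "JBT", "version": "3.0"},  # older but upgradeable
    {"unit": "US", "codeset": "BMS", "version": "1.2"},
]

def reuse_count(deployments, codeset):
    # Backward compatibility means version differences don't matter
    # to the metric; only membership in the compatible code set does.
    return sum(1 for d in deployments if d["codeset"] == codeset)
```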
BTC did not explicitly capture hard
economic measures of cost savings.
However, their estimates of the range
of cost savings are shown in Figure 3.
Cost savings are smaller for new de-
ployments due to the significant effort
required to map business unit require-
Figure 2. Reuse expectations and outcomes.
in actual reuse environments. Some
of these technologies and patterns are
platform-independent interfaces, busi-
ness rule structures, rigorous isolation
of concerns across software layers, and
versioning of interfaces to allow phased
migration of components to updated
interfaces. These techniques, among
others, are commonly recognized as
good architectural approaches for de-
signing systems, and have been exam-
ined more closely for their contribution
to the success of the reuse activities. In
this examination, they have been found
to contribute highly to the technologi-
cal elements required for success of
large-grained reuse projects.
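One of the patterns listed above, versioned interfaces enabling phased migration, might look roughly like this. The service names and version registry are hypothetical:

```python
# Sketch of interface versioning for phased migration: old and new
# interface versions coexist, so each consuming component can move
# to the new interface on its own schedule. Names are illustrative.

class ServiceV1:
    def balance(self, account):
        return {"account": account, "balance": 100}

class ServiceV2(ServiceV1):
    # v2 adds a currency parameter; v1 callers are untouched because
    # the new parameter has a default.
    def balance(self, account, currency="USD"):
        result = super().balance(account)
        result["currency"] = currency
        return result

# Both versions stay registered until the last v1 client migrates.
registry = {"v1": ServiceV1(), "v2": ServiceV2()}

def call(version, account, **kwargs):
    # Clients pin an interface version and migrate when ready.
    return registry[version].balance(account, **kwargs)
```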
Product vendors, and particularly
application service providers, routinely
conduct this type of development and
reuse, though with different motiva-
tions. (Application service providers
are now often referred to as providers
of Software as a Service.) As commer-
cial providers, they are more likely to be
market-driven, often with sales of Pro-
fessional Services for customization. In
contrast, the motivations in evidence
at BigFinancial seemed more aimed
at achieving the best combinations of
functionality, time to market, and cost.
The research provided an opportu-
nity to examine, in-depth, the various
forms of reuse practiced on three proj-
ects, and three subprojects, inside Big-
Financial. Some of those forms include
design reuse, code reuse, pattern reuse,
and test case reuse. The authors have found, based on documents and reports from participants, that the active
practice of systematic, finer-grained re-
use contributed to successful reuse of
systems at larger levels of granularity.
This study has provided a view of
management structures and leader-
ship styles, and an opportunity to ex-
amine how those contribute to, or work
against, successful reuse. Much has
been captured about IT governance in
general, and about organizational con-
structs to support reuse in various situ-
ations at BigFinancial/BTC. Leadership
of both BTC and BigFinancial was cited
as contributing to the success of the re-
use efforts, and indeed also was cited
as a prerequisite for even launching
a project that intends to accomplish
such large-grained reuse.
Sabherwal11 notes the criticality of
trust in outsourced IS relationships,
where the participants in projects may
not know one another before a project,
and may only work together on the one
project. As such, the establishment
and maintenance of trust is critical in
that environment. This is not entirely
applicable to BTC, as it is a peer organi-
zation to its client’s technology groups,
and its members often have long-stand-
ing relationships with their peers. Ring
and Van de Ven examine the broader
notions of cooperative inter-organizational relationships (IORs), and note that trust is a fundamental part of an IOR. Trust serves to mitigate the risks inherent in a relationship, and at both the personal and organizational levels is itself tempered by the potentially overriding forces of legal or organizational systems.10 This element
does seem to be applicable to BTC’s en-
vironment, in that trust is reported to
have been foundational to the assign-
ment of the creation of JBT to BTC.
Griss notes that culture is one ele-
ment of the organizational structure
that can impede reuse. A culture that
fears loss of creativity, lacks trust, or
doesn’t know how to effectively reuse
software will not be as successful as an
organization that doesn’t have these
impediments.4 The converse is then also likely: a culture that focuses on and implicitly welcomes reuse will be more successful. BTC’s long history of reuse, its lack of explicit incentives and metrics around more traditional reuse, and its position as a global provider of technology to its business partners make it likely that its culture is, indeed, a strong supporter of its reuse success.
Several other researchers have com-
mented on the impact of organizational
culture on reuse. Morisio et al8 refer in
passing to cultural factors, primarily as
potential inhibitors to reuse. Card and
Comer1 examine four cultural aspects
that can contribute to reuse adoption:
training, incentives, measurement,
and management. In addition, Card
and Comer’s work focuses generally on
cultural barriers, and how to overcome
them. In BTC’s case, however, there is
a solid cultural bias for reuse, and one
that, for example, no longer requires
incentives to promote reuse.
One key participant in the study had
a strong opinion to offer in relation to
fine- vs. coarse-grained reuse. The lead
architect for JBT was explicitly and vig-
orously opposed to a definition of reuse
Figure 3. Reuse cost savings ranges.
Paul D. Witman ([email protected]) is an Assistant Professor of Information Technology at California Lutheran University.

Terry Ryan ([email protected]) is an Associate Professor and Dean of the School of Information Systems at Claremont Graduate University.

© 2010 ACM 0001-0782/10/0100 $10.00
that slanted toward reuse of objects and components at a fine-grained level. This person’s opinion
was that while reuse at this granularity
was possible (indeed, BTC demonstrat-
ed success at this level), fine-grained
reuse was very difficult to achieve in a
distributed development project. The
lead architect further believed that
the leverage it provides was not nearly
as great as the leverage from a large-
grained reuse program. The integrators
of such larger-grained components can
then have more confidence that the
component has been used in a similar
environment, tested under appropri-
ate loads, and so on – relieving the risk
that a fine-grained component built for
one domain may get misused in a new
domain or at a new scale, and be unsuc-
cessful in that environment.
While BTC’s JBT product does, to
some extent, work as part of a software
product line (supporting its three ma-
jor applications), JBT’s real reuse does
not come in the form of developing
more instances from a common set of
core assets. Rather, it appears that JBT
is itself reused, intact, to support the
needs of each of the various businesses
in a highly configurable fashion.
Organizational barriers appeared,
at least in part, to contribute to the lack
of broad deployment of the BigFinan-
cial Message Switch. Gallivan3 defined
a model for technology innovation as-
similation and adoption, which includ-
ed the notion that even in the face of
management directive, some employ-
ees and organizations might not adopt
and assimilate a particular technology
or innovation. This concept might partly explain the results with BMS: it
was possible for some business units
and technology groups to resist its in-
troduction on a variety of grounds, in-
cluding business case, even with a de-
cision by a global steering committee
to proceed with deployment.
We noted previously the negative
impact of inter-organizational barriers
on reuse adoption, particularly in the
BMS case. This was particularly evident
in that the organization that created
BMS, and was in large part responsible
for “selling” it to other business units,
was positioned at a regional rather than
global technology level. This organiza-
tional location, along with the organi-
zation’s more limited experience with
globally reusable products, may have
contributed to the difficulty in accom-
plishing broader reuse of that product.
conclusion
While BTC’s results and BigFinancial’s
specific business needs may be some-
what unusual, it is likely that the busi-
ness and technology practices support-
ing reuse may be generalizable to other
banks and other technology users. Good system architecture supporting reuse, and an established business case that identifies the business value of the reuse, were fundamental to establishing the global reuse accomplished by BTC, and should be readily scalable to smaller and less global environments.
Key factors contributing to a suc-
cessful project will be a solid technolo-
gy foundation, experience building and
maintaining reusable software, and a
financial and organizational structure
that supports and promotes reuse. In
addition, the organization will need to
actively build a culture of large-grained
reuse, and establish trust with its busi-
ness partners. Establishing that trust
will be vital to even having the oppor-
tunity to propose a large-grained reus-
able project.
References
1. Card, D. and Comer, E. Why do so many reuse programs fail? IEEE Software 11, 5, 114–115.
2. Clements, P. and Northrop, L.M. Software Product Lines: Practices and Patterns. Addison-Wesley Professional, 2002.
3. Gallivan, M.J. Organizational adoption and assimilation of complex technological innovations: Development and application of a new framework. The DATA BASE for Advances in Information Systems 32, 3, 51–85.
4. Griss, M.L. Software reuse: From library to factory. IBM Systems Journal 32, 4, 548–566.
5. Karlsson, E.-A. Software Reuse: A Holistic Approach. John Wiley & Sons, West Sussex, England, 1995.
6. Krueger, C.W. New methods in software product line practice. Comm. ACM 49, 12 (Dec. 2006), 37–40.
7. Malan, R. and Wentzel, K. Economics of Software Reuse Revisited. Hewlett-Packard Software Technology Laboratory, Irvine, CA, 1993, 19.
8. Morisio, M., Ezran, M. and Tully, C. Success and failure factors in software reuse. IEEE Transactions on Software Engineering 28, 4, 340–357.
9. Ramachandran, M. and Fleischer, W. Design for large scale software reuse: An industrial case study. In Proceedings of the International Conference on Software Reuse (Orlando, FL, 1996), 104–111.
10. Ring, P.S. and Van de Ven, A.H. Developmental processes of cooperative interorganizational relationships. Academy of Management Review 19, 1, 90–118.
11. Sabherwal, R. The role of trust in outsourced IS development projects. Comm. ACM 42, 2 (Feb. 1999), 80–86.
12. Szyperski, C., Gruntz, D. and Murer, S. Component Software: Beyond Object-Oriented Programming. ACM Press, New York, 2002.
13. Witman, P. and Ryan, T. Innovation in large-grained software reuse: A case from banking. In Proceedings of the Hawaii International Conference on System Sciences (Waikoloa, HI, 2007), IEEE Computer Society.
Communications of the ACM | January 2018 | Vol. 61 | No. 1

practice
Image by Vitezslav Valka.
THE HETEROGENEITY, COMPLEXITY, and scale of cloud
applications make verification of their fault tolerance
properties challenging. Companies are moving away
from formal methods and toward large-scale testing
in which components are deliberately compromised
to identify weaknesses in the software. For example,
techniques such as Jepsen apply fault-injection testing
to distributed data stores, and Chaos Engineering
performs fault injection experiments on production
systems, often on live traffic. Both approaches have
captured the attention of industry and academia alike.
Unfortunately, the search space of distinct fault
combinations that an infrastructure can test is
intractable. Existing failure-testing solutions require
skilled and intelligent users who can supply the faults
to inject. These superusers, known as Chaos Engineers
and Jepsen experts, must study the sys-
tems under test, observe system execu-
tions, and then formulate hypotheses
about which faults are most likely to
expose real system-design flaws. This
approach is fundamentally unscal-
able and unprincipled. It relies on the
superuser’s ability to interpret how
a distributed system employs redun-
dancy to mask or ameliorate faults
and, moreover, the ability to recognize
the insufficiencies in those redundan-
cies—in other words, human genius.
This article presents a call to arms
for the distributed systems research
community to improve the state of
the art in fault tolerance testing.
Ordinary users need tools that au-
tomate the selection of custom-tai-
lored faults to inject. We conjecture
that the process by which superusers
select experiments—observing execu-
tions, constructing models of system
redundancy, and identifying weak-
nesses in the models—can be effec-
tively modeled in software. The ar-
ticle describes a prototype validating
this conjecture, presents early results
from the lab and the field, and identi-
fies new research directions that can
make this vision a reality.
The Future Is Disorder
Providing an “always-on” experience
for users and customers means that
distributed software must be fault tol-
erant—that is to say, it must be writ-
ten to anticipate, detect, and either
mask or gracefully handle the effects
of fault events such as hardware fail-
ures and network partitions. Writing
fault-tolerant software—whether for
distributed data management systems
involving the interaction of a handful
of physical machines, or for Web ap-
plications involving the cooperation of
tens of thousands—remains extremely
difficult. While the state of the art in
verification and program analysis con-
tinues to evolve in the academic world,
the industry is moving very much in
the opposite direction: away from formal methods (with some noteworthy exceptions41) and toward
Abstracting the Geniuses Away from Failure Testing

DOI: 10.1145/3152483
Article development led by queue.acm.org

Ordinary users need tools that automate the selection of custom-tailored faults to inject.

BY PETER ALVARO AND SEVERINE TYMON
up the stack and frustrate any attempts
at abstraction.
The Old Guard. The modern myth:
Formally verified distributed compo-
nents. If we cannot rely on geniuses to
hide the specter of partial failure, the
next best hope is to face it head on,
armed with tools. Until quite recently,
many of us (academics in particular)
looked to formal methods such as
model checking16,20,29,39,40,53,54 to assist
“mere mortal” programmers in writ-
ing distributed code that upholds its
guarantees despite pervasive uncer-
tainty in distributed executions. It is
not reasonable to exhaustively search
the state space of large-scale systems
(one cannot, for example, model
check Netflix), but the hope is that
modularity and composition (the next
best tools for conquering complexity)
can be brought to bear. If individual
distributed components could be
formally verified and combined into
systems in a way that preserved their
guarantees, then global fault toler-
ance could be obtained via composi-
tion of local fault tolerance.
Unfortunately, this, too, is a pipe
dream. Most model checkers require
a formal specification; most real-world
systems have none (or have not had one
since the design phase, many versions
ago). Software model checkers and oth-
er program-analysis tools require the
source code of the system under study.
The accessibility of source code is also
an increasingly tenuous assumption.
Many of the data stores targeted by
tools such as Jepsen are closed source;
large-scale architectures, while typical-
ly built from open source components,
are increasingly polyglot (written in a
wide variety of languages).
Finally, even if you assume that spec-
ifications or source code are available,
techniques such as model checking are
not a viable strategy for ensuring that
applications are fault tolerant because,
as mentioned, in the context of time-
outs, fault tolerance itself is an end-to-
end property that does not necessarily
hold under composition. Even if you
are lucky enough to build a system out
of individually verified components, it
does not follow that the system is fault tolerant: you may have made a critical error in the glue that binds them.
The Vanguard. The emerging ethos:
YOLO. Modern distributed systems
approaches that combine testing with
fault injection.
Here, we describe the underlying
causes of this trend, why it has been
successful so far, and why it is doomed
to fail in its current practice.
The Old Gods. The ancient myth:
Leave it to the experts. Once upon a
time, distributed systems researchers
and practitioners were confident that
the responsibility for addressing the
problem of fault tolerance could be
relegated to a small priesthood of ex-
perts. Protocols for failure detection,
recovery, reliable communication,
consensus, and replication could be
implemented once and hidden away
in libraries, ready for use by the layfolk.
This has been a reasonable dream.
After all, abstraction is the best tool
for overcoming complexity in com-
puter science, and composing reliable
systems from unreliable components
is fundamental to classical system
design.33 Reliability techniques such
as process pairs18 and RAID45 dem-
onstrate that partial failure can, in
certain cases, be handled at the low-
est levels of a system and successfully
masked from applications.
Unfortunately, these approaches
rely on failure detection. Perfect failure
detectors are impossible to implement
in a distributed system,9,15 in which it
is impossible to distinguish between
delay and failure. Attempts to mask
the fundamental uncertainty arising
from partial failure in a distributed
system—for example, RPC (remote
procedure calls8) and NFS (network file
system49)—have met (famously) with
difficulties. Despite the broad consen-
sus that these attempts are failed ab-
stractions,28 in the absence of better
abstractions, people continue to rely
on them to the consternation of devel-
opers, operators, and users.
In a distributed system—that is, a
system of loosely coupled components
interacting via messages—the failure
of a component is only ever manifested
as the absence of a message. The only
way to detect the absence of a message
is via a timeout, an ambiguous signal
that means either the message will nev-
er come or that it merely has not come
yet. Timeouts are an end-to-end con-
cern28,48 that must ultimately be man-
aged by the application. Hence, partial
failures in distributed systems bubble
are simply too large, too heteroge-
neous, and too dynamic for these
classic approaches to software qual-
ity to take root. In reaction, practitio-
ners increasingly rely on resiliency
techniques based on testing and fault
injection.6,14,19,23,27,35 These “black box”
approaches (which perturb and ob-
serve the complete system, rather
than its components) are (arguably)
better suited for testing an end-to-
end property such as fault tolerance.
Instead of deriving guarantees from
understanding how a system works
on the inside, testers of the system
observe its behavior from the outside,
building confidence that it functions
correctly under stress.
Two giants have recently emerged
in this space: Chaos Engineering6 and
Jepsen testing.24 Chaos Engineering,
the practice of actively perturbing pro-
duction systems to increase overall site
resiliency, was pioneered by Netflix,6
but since then LinkedIn,52 Microsoft,38
Uber,47 and PagerDuty5 have developed
Chaos-based infrastructures. Jepsen
performs black box testing and fault
injection on unmodified distributed
data management systems, in search
of correctness violations (for example,
counterexamples that show an execu-
tion was not linearizable).
Both approaches are pragmatic and
empirical. Each builds an understand-
ing of how a system operates under
faults by running the system and observ-
ing its behavior. Both approaches offer
a pay-as-you-go method to resiliency:
the initial cost of integration is low,
and the more experiments that are
performed, the higher the confidence
that the system under test is robust.
Because these approaches represent
a straightforward enrichment of exist-
ing best practices in testing with well-
understood fault injection techniques,
they are easy to adopt. Finally, and
perhaps most importantly, both ap-
proaches have been shown to be effec-
tive at identifying bugs.
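The black-box method both giants share, perturbing a running system with faults and observing only its externally visible behavior, can be illustrated with a toy replicated register. Nothing here reflects Jepsen's or Netflix's actual implementations; it is a minimal sketch of the testing loop:

```python
import random

# Minimal sketch of black-box fault-injection testing: run the system,
# inject a fault, and judge correctness purely from outside behavior.
# The "system" is a toy replicated register; everything is illustrative.

class Replica:
    def __init__(self):
        self.value = 0
        self.up = True

class ReplicatedRegister:
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, v):
        for r in self.replicas:
            if r.up:
                r.value = v

    def read(self):
        live = [r for r in self.replicas if r.up]
        return live[0].value if live else None

def fault_injection_test(trials=100, seed=0):
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        sys = ReplicatedRegister()
        sys.write(1)
        # Inject a fault: crash one replica at random.
        rng.choice(sys.replicas).up = False
        # Observe from outside: with one crash, redundancy should
        # still let a read return the written value.
        if sys.read() != 1:
            violations += 1
    return violations
```

The pay-as-you-go quality the text notes is visible here: each additional trial is cheap, and confidence grows with the number of experiments run.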
Unfortunately, both techniques
also have a fatal flaw: they are manual
processes that require an extremely
sophisticated operator. Chaos Engi-
neers are a highly specialized subclass
of site reliability engineers. To devise
a custom fault injection strategy, a
Chaos Engineer typically meets with
different service teams to build an
understanding of the idiosyncrasies
of various components and their in-
teractions. The Chaos Engineer then
targets those services and interactions
that seem likely to have latent fault tol-
erance weaknesses. Not only is this ap-
proach difficult to scale since it must
be repeated for every new composition
of services, but its critical currency—
a mental model of the system under
study—is hidden away in a person’s
brain. These points are reminiscent
of a bigger (and more worrying) trend
in industry toward reliability priest-
hoods,7 complete with icons (dash-
boards) and rituals (playbooks).
Jepsen is in principle a framework
that anyone can use, but to the best of
our knowledge all of the reported bugs
discovered by Jepsen to date were dis-
covered by its inventor, Kyle Kingsbury,
who currently operates a “distributed
systems safety research” consultancy.24
Applying Jepsen to a storage system
requires that the superuser carefully read
the system documentation, generate
workloads, and observe the externally
visible behaviors of the system under
test. It is then up to the operator to
choose—from the massive combina-
torial space of “nemeses,” including
machine crashes and network parti-
tions—those fault schedules that are
likely to drive the system into returning
incorrect responses.
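The size of that combinatorial space is easy to appreciate with a small count. The menu of nemeses and targets below is invented, and the count ignores ordering and timing, which make the real space far larger still:

```python
from itertools import combinations

# Illustrating the combinatorial explosion of fault schedules: even a
# tiny menu of nemeses applied to a handful of targets yields hundreds
# of candidate experiments. All names and numbers are illustrative.

nemeses = ["crash", "partition", "clock-skew"]
targets = ["node1", "node2", "node3", "node4", "node5"]

faults = [(n, t) for n in nemeses for t in targets]  # 15 distinct faults

def schedules(max_faults):
    # Count all ways to pick up to max_faults distinct faults,
    # ignoring order and injection timing.
    return sum(1 for k in range(1, max_faults + 1)
               for _ in combinations(faults, k))
```

With just three nemeses, five targets, and at most three simultaneous faults, there are already 575 candidate schedules to choose from, which is why superuser intuition currently matters so much.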
A human in the loop is the kiss of
death for systems that need to keep up
with software evolution. Human atten-
tion should always be targeted at tasks
that computers cannot do! Moreover,
the specialists that Chaos and Jepsen
testing require are expensive and rare.
Here, we show how geniuses can be ab-
stracted away from the process of fail-
ure testing.
We Don’t Need Another Hero
Rapidly changing assumptions about
our visibility into distributed system
internals have made obsolete many
if not all of the classic approaches to
software quality, while emerging “cha-
os-based” approaches are fragile and
unscalable because of their genius-in-
the-loop requirement.
We present our vision of automated
failure testing by looking at how the
same changing environments that has-
tened the demise of time-tested resil-
iency techniques can enable new ones.
We argue the best way to automate the
experts out of the failure-testing loop is
to imitate their best practices in soft-
ware and show how the emergence of
sophisticated observability infrastruc-
ture makes this possible.
The order is rapidly fadin’. For large-
scale distributed systems, the three
fundamental assumptions of tradi-
tional approaches to software quality
are quickly fading in the rearview mir-
ror. The first to go was the belief that
you could rely on experts to solve the
hardest problems in the domain. Sec-
ond was the assumption that a formal
specification of the system is available.
Finally, any program analysis (broadly
defined) that requires that source code
is available must be taken off the ta-
ble. The erosion of these assumptions
helps explain the move away from clas-
sic academic approaches to resiliency
in favor of the black box approaches
described earlier.
What hope is there of understand-
ing the behavior of complex systems
in this new reality? Luckily, the fact
that it is more difficult than ever to
understand distributed systems from
the inside has led to the rapid evolu-
tion of tools that allow us to under-
stand them from the outside. Call-
graph logging was first described by
Google;51 similar systems are in use
at Twitter,4 Netflix,1 and Uber,50 and
the technique has since been stan-
dardized.43 It is reasonable to assume
that a modern microservice-based
Internet enterprise will already have
instrumented its systems to collect
call-graph traces. A number of start-
ups that focus on observability have
recently emerged.21,34 Meanwhile,
provenance collection techniques
for data processing systems11,22,42 are
becoming mature, as are operating
system-level provenance tools.44 Re-
cent work12,55 has attempted to infer
causal and communication structure
of distributed computations from
raw logs, bringing high-level explana-
tions of outcomes within reach even
for uninstrumented systems.
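Recovering communication structure from collected trace records can be sketched as follows. The record format is invented, loosely modeled on the parent/child span relationships used by the call-graph tracing systems cited above:

```python
# Sketch: derive a service-level call graph from trace records.
# The record format is hypothetical; real tracing systems emit
# richer spans, but the parent/child structure is the key idea.

traces = [
    {"trace": "t1", "span": "a", "parent": None, "service": "frontend"},
    {"trace": "t1", "span": "b", "parent": "a", "service": "auth"},
    {"trace": "t1", "span": "c", "parent": "a", "service": "orders"},
    {"trace": "t1", "span": "d", "parent": "c", "service": "db"},
]

def call_graph(records):
    # Edges are (caller service, callee service) pairs derived from
    # each span's link to its parent span.
    by_span = {r["span"]: r for r in records}
    edges = set()
    for r in records:
        if r["parent"] is not None:
            edges.add((by_span[r["parent"]]["service"], r["service"]))
    return edges
```

A graph like this is exactly the raw material from which a model of system redundancy can be built, without any access to source code.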
Regarding testing distributed systems.
Chaos Monkey, like they mention, is awe-
some, and I also highly recommend get-
ting Kyle to run Jepsen tests.
—Commentator on HackerRumor
of properties that are either maintained
throughout the system’s execution (for
example, system invariants or safety
properties) or established during execu-
tion (for example, liveness properties).
Most distributed systems with which
we interact, though their executions
may be unbounded, nevertheless pro-
vide finite, bounded interactions that
have outcomes. For example, a broad-
cast protocol may run “forever” in a re-
active system, but each broadcast deliv-
ered to all group members constitutes
a successful execution.
By viewing distributed systems in
this way, we can revise the definition:
A system is fault tolerant if it provides
sufficient mechanisms to achieve its
successful outcomes despite the given
class of faults.
Step 3: Formulate experiments that
target weaknesses in the façade. If we
could understand all of the ways in
which a system can obtain its good
outcomes, we could understand which
faults it can tolerate (or which faults it
could be sensitive to). We assert that
(whether they realize it or not!) the
process by which Chaos Engineers
and Jepsen superusers determine, on
a system-by-system basis, which faults
to inject uses precisely this kind of rea-
soning. A target experiment should
exercise a combination of faults that
knocks out all of the supports for an ex-
pected outcome.
Carrying out the experiments turns
out to be the easy part. Fault injection
infrastructure, much like observability
infrastructure, has evolved rapidly in
recent years. In contrast to random,
coarse-grained approaches to distrib-
uted fault injection such as Chaos
Monkey,23 approaches such as FIT
(failure injection testing)17 and Grem-
lin32 allow faults to be injected at the
granularity of individual requests with
high precision.
Step 4. Profit! This process can be ef-
fectively automated. The emergence of
sophisticated tracing tools described
earlier makes it easier than ever to
build redundancy models even from
the executions of black box systems.
The rapid evolution of fault injection
infrastructure makes it easier than
ever to test fault hypotheses on large-
scale systems. Figure 1 illustrates how
the automation described in this here
fits neatly between existing observ-
Away from the experts. While this
quote is anecdotal, it is difficult to
imagine a better example of the fun-
damental unscalability of the current
state of the art. A single person can-
not possibly keep pace with the ex-
plosion of distributed system imple-
mentations. If we can take the human
out of this critical loop, we must; if we
cannot, we should probably throw in
the towel.
The first step to understanding how
to automate any process is to compre-
hend the human component that we
would like to abstract away. How do
Chaos Engineers and Jepsen superus-
ers apply their unique genius in prac-
tice? Here is the three-step recipe com-
mon to both approaches.
Step 1: Observe the system in action.
The human element of the Chaos and
Jepsen processes begins with princi-
pled observation, broadly defined.
A Chaos Engineer will, after study-
ing the external API of services rel-
evant to a given class of interactions,
meet with the engineering teams to
better understand the details of the
implementations of the individual
services.25 To understand the high-
level interactions among services, the
engineer will then peruse call-graph
traces in a trace repository.3
A Jepsen superuser typically begins
by reviewing the product documenta-
tion, both to determine the guarantees
that the system should uphold and to
learn something about the mecha-
nisms by which it does so. From there,
the superuser builds a model of the
behavior of the system based on inter-
action with the system’s external API.
Since the systems under study are typ-
ically data management and storage,
these interactions involve generating
histories of reads and writes.31
The first step to understanding what
can go wrong in a distributed system is
watching things go right: observing the
system in the common case.
Step 2. Build a mental model of how
the system tolerates faults. The com-
mon next step in both approaches is
the most subtle and subjective. Once
there is a mental model of how a dis-
tributed system behaves (at least in the
common case), how is it used to help
choose the appropriate faults to inject?
At this point we are forced to dabble in
conjecture: bear with us.
Fault tolerance is redundancy. Giv-
en some fixed set of faults, we say that
a system is “fault tolerant” exactly if it
operates correctly in all executions in
which those faults occur. What does it
mean to “operate correctly”? Correct-
ness is a system-specific notion, but,
broadly speaking, is expressed in terms
Figure 1. Our vision of automated failure
testing.
explanations
models
of
redundancy
fault
injection
Figure 2. Fault injection and fault-tolerant code.
APP1 APP1 APP2 APP2
caller
fault
callee
API API API API API
J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M
M U N I C AT I O N S O F T H E A C M 59
practice
ability infrastructure and fault injec-
tion infrastructure, consuming the
former, maintaining a model of system
redundancy, and using it to param-
eterize the latter. Explanations of sys-
tem outcomes and fault injection in-
frastructures are already available. In
the current state of the art, the puzzle
piece that fits them together (models of
redundancy) is a manual process. LDFI
(as we will explain) shows that automa-
tion of this component is possible.
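The loop just described (observability feeds a model of redundancy, which parameterizes fault injection, whose results refine the model) can be sketched in a few lines. Everything below is invented for illustration: supports observed in traces are modeled as sets of components, and the next experiment is a smallest fault set hitting every known support.

```python
from itertools import chain, combinations

def next_experiment(model):
    """Pick a smallest fault set that intersects every known support set."""
    components = sorted(set(chain.from_iterable(model)))
    for r in range(1, len(components) + 1):
        for faults in combinations(components, r):
            if all(set(faults) & support for support in model):
                return set(faults)
    return None

def run_once(system, model):
    """One turn: hypothesize an experiment, inject, then refine the model."""
    experiment = next_experiment(model)
    ok, trace = system(experiment)
    if not ok:
        return "bug", experiment            # expected outcome failed to occur
    model.add(frozenset(trace))             # learned a new redundant support
    return "refined", experiment

def replicated_store(faults):
    """Toy system: the outcome succeeds if any replica survives the faults."""
    for replica in ("primary", "backup"):
        if replica not in (faults or set()):
            return True, [replica]
    return False, []

model = {frozenset(["primary"])}            # support seen in a fault-free run
first = run_once(replicated_store, model)   # kills primary, discovers backup
second = run_once(replicated_store, model)  # then targets both replicas
```

In the toy run, the first experiment reveals the backup (enriching the model), and the second knocks out both supports at once, which is exactly the kind of targeted experiment the text describes.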
A Blast from the Past
In previous work, we introduced a bug-
finding tool called LDFI (lineage-driven
fault injection).2 LDFI uses data prove-
nance collected during simulations of
distributed executions to build deriva-
tion graphs for system outcomes. These
graphs function much like the models
of system redundancy described ear-
lier. LDFI then converts the derivation
graphs into a Boolean formula whose
satisfying assignments correspond to
combinations of faults that invalidate
all derivations of the outcome. An ex-
periment targeting those faults will
then either expose a bug (that is, the ex-
pected outcome fails to occur) or reveal
additional derivations (for example, af-
ter a timeout, the system fails over to a
backup) that can be used to enrich the
model and constrain future solutions.
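The derivation-graph-to-formula step can be illustrated concretely. Treating each derivation of an outcome as the set of base facts it depends on, the formula is a conjunction over derivations of "at least one supporting fact fails," and its minimal satisfying assignments are the fault sets worth injecting. The two derivations below are an invented example; LDFI itself hands the formula to a solver rather than enumerating, which this brute-force sketch does only for clarity.

```python
from itertools import chain, combinations

# Each derivation of the outcome is the set of base facts supporting it.
derivations = [
    {"A_up_t"},                  # replica A delivers the outcome directly
    {"conn_XY", "backup_B"},     # or a backup delivers it over link X-Y
]

def invalidating_fault_sets(derivations):
    """Enumerate minimal fault sets that break every derivation.

    A fault set S invalidates derivation D if it removes some fact in D,
    so S must intersect every derivation (a minimal hitting set)."""
    facts = sorted(set(chain.from_iterable(derivations)))
    minimal = []
    for r in range(1, len(facts) + 1):
        for candidate in combinations(facts, r):
            candidate = set(candidate)
            if all(candidate & d for d in derivations) and \
               not any(m <= candidate for m in minimal):
                minimal.append(candidate)
    return minimal

experiments = invalidating_fault_sets(derivations)
```

Each returned set is one experiment; if the outcome still occurs under it, the trace of that run exposes a new derivation to add to the model, exactly the enrichment loop the text describes.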
At its heart, LDFI reapplies well-
understood techniques from data
management systems, treating fault
tolerance as a materialized view main-
tenance problem.2,13 It models a dis-
tributed system as a query, its expect-
ed outcomes as query outcomes, and
critical facts such as “replica A is up at
time t” and “there is connectivity be-
tween nodes X and Y during the inter-
val i . . . j” as base facts. It can then ask
a how-to query:37 What changes to base
data will cause changes to the derived
data in the view? The answers to this
query are the faults that could, accord-
ing to the current model, invalidate the
expected outcomes.
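The how-to query itself can be phrased directly over base facts: evaluate the "view" (the expected outcome) against the fact database, and search for the smallest deletions of base facts that change the derived data. A toy sketch in that spirit (the fact names and the view definition are invented):

```python
from itertools import combinations

base_facts = {"replicaA_up", "replicaB_up", "conn_XY"}

def view(facts):
    """Derived outcome: durable if the link is up and some replica is up."""
    return "conn_XY" in facts and \
           ("replicaA_up" in facts or "replicaB_up" in facts)

def how_to_break(facts):
    """Smallest sets of base-fact deletions that flip the view to false."""
    for r in range(1, len(facts) + 1):
        answers = [set(c) for c in combinations(sorted(facts), r)
                   if not view(facts - set(c))]
        if answers:
            return answers
    return []

answers = how_to_break(base_facts)   # the single link is the weakest support
```

Here the query immediately surfaces the unreplicated link as the single point of failure, while either replica alone can be lost safely.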
The idea seems far-fetched, but the
LDFI approach shows a great deal of
promise. The initial prototype demon-
strated the efficacy of the approach at
the level of protocols, identifying bugs
in replication, broadcast, and commit
protocols.2,46 Notably, LDFI reproduced
a bug in the replication protocol used by
the Kafka distributed log26 that was first
(manually) identified by Kingsbury.30
A later iteration of LDFI is deployed at
Netflix,1 where (much like the illustra-
tion in Figure 1) it was implemented
as a microservice that consumes traces
from a call-graph repository service and
provides inputs for a fault injection ser-
vice. Since its deployment, LDFI has
identified 11 critical bugs in user-fac-
ing applications at Netflix.1
Rumors from the Future
The prior research presented earlier is
only the tip of the iceberg. Much work
still needs to be undertaken to realize
the vision of fully automated failure
testing for distributed systems. Here,
we highlight nascent research that
shows promise and identifies new di-
rections that will help realize our vision.
Don’t overthink fault injection. In the
context of resiliency testing for distribut-
ed systems, attempting to enumerate
and faithfully simulate every possible
kind of fault is a tempting but dis-
tracting path. The problem of under-
standing all the causes of faults is not
directly relevant to the target, which
is to ensure that code (along with its
configuration) intended to detect and
mitigate faults performs as expected.
Consider Figure 2: The diagram on
the left shows a microservice-based
architecture; arrows represent calls
generated by a client request. The
right-hand side zooms in on a pair of
interacting services. The shaded box
in the caller service represents the
fault tolerance logic that is intended
to detect and handle faults of the cal-
lee. Failure testing targets bugs in this
logic. The fault tolerance logic targeted
in this bug search is represented as the
shaded box in the caller service, while
the injected faults affect the callee.
The common effect of all faults, from
the perspective of the caller, is explicit
error returns, corrupted responses,
and (possibly infinite) delay. Of these
manifestations, the first two can be ad-
equately tested with unit tests. The last
is difficult to test, leading to branches
of code that are infrequently executed.
If we inject only delay, and only at com-
ponent boundaries, we conjecture that
we can address the majority of bugs re-
lated to fault tolerance.
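The delay-only conjecture above can be made concrete: a caller's timeout/fallback branch is precisely the code that unit tests rarely reach, and injecting nothing but delay at the call boundary exercises it. A minimal sketch, with invented services and deadlines (a real caller would preempt the slow call; this one just measures it):

```python
import time

def call_with_deadline(fn, deadline, fallback):
    """Caller-side fault tolerance: treat a too-slow callee as failed."""
    start = time.monotonic()
    result = fn()
    if time.monotonic() - start > deadline:
        return fallback()       # the branch unit tests rarely reach
    return result

def inject_delay(fn, seconds):
    """Fault injector: add delay at the component boundary, nothing else."""
    def delayed():
        time.sleep(seconds)
        return fn()
    return delayed

callee = lambda: "fresh value"
fallback = lambda: "cached value"

normal = call_with_deadline(callee, deadline=0.1, fallback=fallback)
faulted = call_with_deadline(inject_delay(callee, 0.25), 0.1, fallback)
```

Explicit errors and corrupt responses, by contrast, can be produced trivially in unit tests, which is why delay is the manifestation worth injecting in situ.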
Explanations everywhere. If we can provide better explanations of system outcomes, we can build better models of redundancy. Unfortunately, a barrier to entry for systems such as LDFI is the unwillingness of software developers and operators to instrument their systems for tracing or provenance collection. Fortunately, operating system-level provenance-collection techniques are mature and can be applied to uninstrumented systems.

Moreover, the container revolution makes simulating distributed executions of black box software within a single hypervisor easier than ever. We are actively exploring the collection of system call-level provenance from unmodified distributed software in order to select a custom-tailored fault injection schedule. Doing so requires extrapolating application-level causal structure from low-level traces, identifying appropriate cut points in an observed execution, and finally synchronizing the execution with fault injection actions.

We are also interested in the possibility of inferring high-level explanations from even noisier signals, such as raw logs. This would allow us to relax the assumption that the systems under study have been instrumented to collect execution traces. While this is a difficult problem, work such as the Mystery Machine12 developed at Facebook shows great promise.

Toward better models. The LDFI system represents system redundancy using derivation graphs and treats the task of identifying possible bugs as a materialized-view maintenance problem. LDFI was hence able to exploit well-understood theory and mechanisms from the history of data management systems research. But this is just one of many ways to represent how a system provides alternative computations to achieve its expected outcomes.

A shortcoming of the LDFI approach is its reliance on assumptions of determinism. In particular, it assumes that if it has witnessed a computation that, under a particular contingency (that is, given certain inputs and in the presence of certain faults), produces a successful outcome, then any future computation under that contingency will produce the same outcome. That is to say, it ignores the uncertainty in timing that is fundamental to distributed systems. A more appropriate way to model system redundancy would be to embrace (rather than abstracting away) this uncertainty.

Distributed systems are probabilistic by nature and are arguably better modeled probabilistically. Future directions of work include the probabilistic representation of system redundancy and an exploration of how this representation can be exploited to guide the search of fault experiments. We encourage the research community to join in exploring alternative internal representations of system redundancy.

Turning the explanations inside out. Most of the classic work on data provenance in database research has focused on aspects related to human-computer interaction. Explanations of why a query returned a particular result can be used to debug both the query and the initial database—given an unexpected result, what changes could be made to the query or the database to fix it? By contrast, in the class of systems we envision (and for LDFI concretely), explanations are part of the internal language of the reasoner, used to construct models of redundancy in order to drive the search through faults.

Ideally, explanations should play a role in both worlds. After all, when a bug-finding tool such as LDFI identifies a counterexample to a correctness property, the job of the programmers has only just begun—now they must undertake the onerous job of distributed debugging. Tooling around debugging has not kept up with the explosive pace of distributed systems development. We continue to use tools that were designed for a single site, a uniform memory, and a single clock. While we are not certain what an ideal distributed debugger should look like, we are quite certain that it does not look like GDB (GNU Project debugger).36 The derivation graphs used by LDFI show how provenance can also serve a role in debugging by providing a concise, visual explanation of how the system reached a bad state.

This line of research can be pushed further. To understand the root causes of a bug in LDFI, a human operator must review the provenance graphs of the good and bad executions and then examine the ways in which they differ. Intuitively, if you could abstractly subtract the (incomplete by assumption) explanations of the bad outcomes from the explanations of the good out-
comes,10 then the root cause of the discrepancy would likely be near the "frontier" of the difference.

Conclusion
A sea change is occurring in the techniques used to determine whether distributed systems are fault tolerant. The emergence of fault injection approaches such as Chaos Engineering and Jepsen is a reaction to the erosion of the availability of expert programmers, formal specifications, and uniform source code. For all of their promise, these new approaches are crippled by their reliance on superusers who decide which faults to inject.

To address this critical shortcoming, we propose a way of modeling and ultimately automating the process carried out by these superusers. The enabling technologies for this vision are the rapidly improving observability and fault injection infrastructures that are becoming commonplace in the industry. While LDFI provides constructive proof that this approach is possible and profitable, it is only the beginning. Much work remains to be done in targeting faults at a finer grain, constructing more accurate models of system redundancy, and providing better explanations to end users of exactly what went wrong when bugs are identified. The distributed systems research community is invited to join in exploring this new and promising domain.

Related articles on queue.acm.org
Fault Injection in Production
John Allspaw
http://queue.acm.org/detail.cfm?id=2353017
The Verification of a Distributed System
Caitie McCaffrey
http://queue.acm.org/detail.cfm?id=2889274
Injecting Errors for Fun and Profit
Steve Chessin
http://queue.acm.org/detail.cfm?id=1839574

References
1. Alvaro, P., et al. Automating failure-testing research at Internet scale. In Proceedings of the 7th ACM Symposium on Cloud Computing (2016), 17–28.
2. Alvaro, P., Rosen, J., Hellerstein, J.M. Lineage-driven fault injection. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2015), 331–346.
3. Andrus, K. Personal communication, 2016.
4. Aniszczyk, C. Distributed systems tracing with Zipkin. Twitter Engineering; https://blog.twitter.com/2012/distributed-systems-tracing-with-zipkin.
5. Barth, D. Inject failure to make your systems more reliable. DevOps.com; http://devops.com/2014/06/03/inject-failure/.
6. Basiri, A., et al. Chaos Engineering. IEEE Software 33, 3 (2016), 35–41.
7. Beyer, B., Jones, C., Petoff, J., Murphy, N.R. Site Reliability Engineering. O'Reilly, 2016.
8. Birrell, A.D., Nelson, B.J. Implementing remote procedure calls. ACM Transactions on Computer Systems 2, 1 (1984), 39–59.
9. Chandra, T.D., Hadzilacos, V., Toueg, S. The weakest failure detector for solving consensus. Journal of the ACM 43, 4 (1996), 685–722.
10. Chen, A., et al. The good, the bad, and the differences: better network diagnostics with differential provenance. In Proceedings of the ACM SIGCOMM Conference (2016), 115–128.
11. Chothia, Z., Liagouris, J., McSherry, F., Roscoe, T. Explaining outputs in modern data analytics. Proceedings of the VLDB Endowment 9, 12 (2016), 1137–1148.
12. Chow, M., et al. The Mystery Machine: end-to-end performance analysis of large-scale Internet services. In Proceedings of the 11th Usenix Conference on Operating Systems Design and Implementation (2014), 217–231.
13. Cui, Y., Widom, J., Wiener, J.L. Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems 25, 2 (2000), 179–227.
14. Dawson, S., Jahanian, F., Mitton, T. ORCHESTRA: a fault injection environment for distributed systems. In Proceedings of the 26th International Symposium on Fault-Tolerant Computing (1996).
15. Fischer, M.J., Lynch, N.A., Paterson, M.S. Impossibility of distributed consensus with one faulty process. Journal of the ACM 32, 2 (1985), 374–382; https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf.
16. Fisman, D., Kupferman, O., Lustig, Y. On verifying fault tolerance of distributed protocols. In Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science 4963, Springer (2008), 315–331.
17. Gopalani, N., Andrus, K., Schmaus, B. FIT: failure injection testing. Netflix Technology Blog; http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html.
18. Gray, J. Why do computers stop and what can be done about it? Tandem Technical Report 85.7 (1985); http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf.
19. Gunawi, H.S., et al. FATE and DESTINI: a framework for cloud recovery testing. In Proceedings of the 8th Usenix Conference on Networked Systems Design and Implementation (2011), 238–252; http://db.cs.berkeley.edu/papers/nsdi11-fate-destini.pdf.
20. Holzmann, G. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003.
21. Honeycomb, 2016; https://honeycomb.io/.
22. Interlandi, M., et al. Titian: data provenance support in Spark. Proceedings of the VLDB Endowment 9, 3 (2015), 216–227.
23. Izrailevsky, Y., Tseitlin, A. The Netflix Simian Army. Netflix Technology Blog; http://techblog.netflix.com/2011/07/netflix-simian-army.html.
24. Jepsen. Distributed systems safety research, 2016; http://jepsen.io/.
25. Jones, N. Personal communication, 2016.
26. Kafka 0.8.0. Apache, 2013; https://kafka.apache.org/08/documentation.html.
27. Kanawati, G.A., Kanawati, N.A., Abraham, J.A. FERRARI: a flexible software-based fault and error injection system. IEEE Transactions on Computers 44, 2 (1995), 248–260.
28. Kendall, S.C., Waldo, J., Wollrath, A., Wyant, G. A note on distributed computing. Technical Report, Sun Microsystems Laboratories, 1994.
29. Killian, C.E., Anderson, J.W., Jhala, R., Vahdat, A. Life, death, and the critical transition: finding liveness bugs in systems code. In Proceedings of Networked Systems Design and Implementation (2007), 243–256.
30. Kingsbury, K. Call me maybe: Kafka, 2013; http://aphyr.com/posts/293-call-me-maybe-kafka.
31. Kingsbury, K. Personal communication, 2016.
32. Lafeldt, M. The discipline of Chaos Engineering. Gremlin Inc., 2017; https://blog.gremlininc.com/the-discipline-of-chaos-engineering-e39d2383c459.
33. Lampson, B.W. Atomic transactions. In Distributed Systems—Architecture and Implementation, An Advanced Course (1980), 246–265; https://link.springer.com/chapter/10.1007%2F3-540-10571-9_11.
34. LightStep, 2016; http://lightstep.com/.
35. Marinescu, P.D., Candea, G. LFI: a practical and general library-level fault injector. In IEEE/IFIP International Conference on Dependable Systems and Networks (2009).
36. Matloff, N., Salzman, P.J. The Art of Debugging with GDB, DDD, and Eclipse. No Starch Press, 2008.
37. Meliou, A., Suciu, D. Tiresias: the database oracle for how-to queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2012), 337–348.
38. Microsoft Azure documentation. Introduction to the fault analysis service, 2016; https://azure.microsoft.com/en-us/documentation/articles/service-fabric-testability-overview/.
39. Musuvathi, M., et al. CMC: a pragmatic approach to model checking real code. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, ACM SIGOPS Operating Systems Review 36 (2002), 75–88.
40. Musuvathi, M., et al. Finding and reproducing Heisenbugs in concurrent programs. In Proceedings of the 8th Usenix Conference on Operating Systems Design and Implementation (2008), 267–280.
41. Newcombe, C., et al. Use of formal methods at Amazon Web Services. Technical Report, 2014; http://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf.
42. Olston, C., Reed, B. Inspector Gadget: a framework for custom monitoring and debugging of distributed data flows. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2011), 1221–1224.
43. OpenTracing, 2016; http://opentracing.io/.
44. Pasquier, T.F.J.-M., Singh, J., Eyers, D.M., Bacon, J. CamFlow: managed data-sharing for cloud services, 2015; https://arxiv.org/pdf/1506.04391.pdf.
45. Patterson, D.A., Gibson, G., Katz, R.H. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, 109–116; http://web.mit.edu/6.033/2015/wwwdocs/papers/Patterson88.pdf.
46. Ramasubramanian, K., et al. Growing a protocol. In Proceedings of the 9th Usenix Workshop on Hot Topics in Cloud Computing (2017).
47. Reinhold, E. Rewriting Uber engineering: the opportunities microservices provide. Uber Engineering, 2016; https://eng.uber.com/building-tincup/.
48. Saltzer, J.H., Reed, D.P., Clark, D.D. End-to-end arguments in system design. ACM Transactions on Computer Systems 2, 4 (1984), 277–288.
49. Sandberg, R. The Sun Network File System: design, implementation and experience. Technical Report, Sun Microsystems; in Proceedings of the Summer 1986 Usenix Technical Conference and Exhibition.
50. Shkuro, Y. Jaeger: Uber's distributed tracing system. Uber Engineering, 2017; https://uber.github.io/jaeger/.
51. Sigelman, B.H., et al. Dapper, a large-scale distributed systems tracing infrastructure. Technical Report, Google Research, 2010; https://research.google.com/pubs/pub36356.html.
52. Shenoy, A. A deep dive into Simoorg: our open source failure induction framework. LinkedIn Engineering, 2016; https://engineering.linkedin.com/blog/2016/03/deep-dive-Simoorg-open-source-failure-induction-framework.
53. Yang, J., et al. MODIST: transparent model checking of unmodified distributed systems. In Proceedings of the 6th Usenix Symposium on Networked Systems Design and Implementation (2009), 213–228.
54. Yu, Y., Manolios, P., Lamport, L. Model checking TLA+ specifications. In Proceedings of the 10th IFIP WG 10.5 Advanced Research Working Conference on Correct Hardware Design and Verification Methods (1999), 54–66.
55. Zhao, X., et al. lprof: a non-intrusive request flow profiler for distributed systems. In Proceedings of the 11th Usenix Conference on Operating Systems Design and Implementation (2014), 629–644.

Peter Alvaro is an assistant professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io).

Severine Tymon is a technical writer who has written documentation for both internal and external users of enterprise and open source software, including for Microsoft, CNET, VMware, and Oracle.

Copyright held by owners/authors. Publication rights licensed to ACM. $15.00.
International Journal of Performability Engineering Vol. 6, No. 6, November 2010, pp. 531-546.
© RAMS Consultants, Printed in India
*Corresponding author's email: [email protected]

Successful Application of Software Reliability: A Case Study

NORMAN F. SCHNEIDEWIND
Fellow of the IEEE
2822 Raccoon Trail, Pebble Beach, California 93953 USA

(Received on July 30, 2009, revised on May 3, 2010)
Abstract: The purpose of this case study is to help readers implement or improve a software reliability program in their organizations, using a step-by-step approach based on the Institute of Electrical and Electronics Engineers (IEEE) and American Institute of Aeronautics and Astronautics (AIAA) Recommended Practice for Software Reliability, released in June 2008, supported by a case study from the NASA Space Shuttle. This case study covers the major phases that the software engineering practitioner needs in planning and executing a software reliability engineering program. These phases require a number of steps for their implementation, and these steps provide a structured approach to the software reliability process. Each step will be discussed to provide a good understanding of the entire software reliability process. Major topics covered are: data collection, reliability risk assessment, reliability prediction, reliability prediction interpretation, testing, reliability decisions, and lessons learned from the NASA Space Shuttle software reliability engineering program.

Keywords: software reliability program, IEEE/AIAA Recommended Practice for Software Reliability, NASA Space Shuttle application
1. Introduction
The IEEE/AIAA recommended practice provides a foundation on which practitioners and researchers can build consistent methods [1]. This case study will describe the SRE process and show that it is important for an organization to have a disciplined process if it is to produce high-reliability software. To accomplish this purpose, an overview is presented of existing practice in software reliability, as represented by the recommended practice [1]. This will provide the reader with the foundation to understand the basic process of software reliability engineering (SRE). The Space Shuttle Primary Avionics Software Subsystem will be used to illustrate the SRE process.
The reliability prediction models that will be used are based on some key definitions and assumptions, as follows:

Definitions
Interval: an integer time unit t of constant or variable length defined by t-1 < t < t+1, where t > 0; failures are counted in intervals.
Number of intervals: the number of contiguous integer time units t of constant or variable length, represented by a positive real number.
Operational Increment (OI): a software system comprised of modules and configured from a series of builds to meet Shuttle mission functional requirements.
Time: continuous CPU execution time over an interval range.

Assumptions
1. Faults that cause failures are removed.
2. As more failures occur and more faults are corrected, remaining failures will be reduced.
3. The remaining failures are "zero" for those OIs that were executed for extremely long times (years) with no additional failure reports; correspondingly, for these OIs, maximum failures equals total observed failures.
1.1 Space Shuttle Flight Software Application
The Shuttle software represents a successful integration of
many of the computer
industry's most advanced software engineering practices and
approaches. Beginning in the
late 1970's, this software development and maintenance project
has evolved one of the
world's most mature software processes applying the principles
of the highest levels of the
Software Engineering Institute's (SEI) Capability Maturity
Model (the software is rated
Level 5 on the SEI scale) and ISO 9001 Standards [2]. This
software process includes
state-of-the-practice software reliability engineering (SRE)
methodologies.
The goals of the recommended practice are to: interpret
software reliability
predictions, support verification and validation of the software,
assess the risk of
deploying the software, predict the reliability of the software,
develop test strategies to
bring the software into conformance with reliability
specifications, and make reliability
decisions regarding deployment of the software.
Reliability predictions are used by the developer to add
confidence to a formal
software certification process comprised of requirements risk
analysis, design and code
inspections, testing, and independent verification and
validation. This case study uses the
experience obtained from the application of SRE on the Shuttle
project, because this
application is judged by NASA and the developer to be a
successful application of SRE
[6]. These SRE techniques and concepts should be of value for other software systems.
1.2 Reliability Measurements and Predictions
There are a number of measurements and predictions that can
be made of reliability
to verify and validate the software. Among these are remaining
failures, maximum
failures, total test time required to attain a given fraction of
remaining failures, and time to
next failure. These have been shown to be useful measurements
and predictions for: 1)
providing confidence that the software has achieved reliability
goals; 2) rationalizing how
long to test a software component (e.g., testing sufficiently long
to verify that the measured
reliability conforms to design specifications); and 3) analyzing
the risk of not achieving
remaining failures and time to next failure goals [6]. Having predictions of the extent to which the software is not fault free (remaining failures) and whether a failure is likely to occur during a mission (time to next failure) provides criteria for assessing the risk of deploying the software. Furthermore, fraction of remaining failures can be used as both an operational quality goal in predicting total test time requirements and, conversely, as an indicator of operational quality as a function of total test time expended [6].
The various software reliability measurements and predictions
can be divided into the
following two categories to use in combination to assist in
assuring the desired level of
reliability of the software in mission critical systems like the
Shuttle. The two categories
are: 1) measurements and predictions that are associated with
residual software faults and
failures, and 2) measurements and predictions that are
associated with the ability of the
software to complete a mission without experiencing a failure of
a specified severity. In
the first category are: remaining failures, maximum failures,
fraction of remaining failures,
and total test time required to attain a given fraction of remaining failures. In
the second category are: time to next failure and total test time
required to attain a given
time to next failure. In addition, there is the risk associated with
not attaining the required
remaining failures and time to next failure goals. Lastly, there
is operational quality that is
derived from fraction of remaining failures. With this type of
information, a software
manager can determine whether more testing is warranted or
whether the software is
sufficiently tested to allow its release or unrestricted use. These
predictions provide a
quantitative basis for achieving reliability goals [2].
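The release decision described above combines one prediction from each category against its goal. A minimal sketch of such a decision rule, with hypothetical parameter names and thresholds (real criteria, per [6], would also weigh the risk of not meeting each goal):

```python
def release_decision(remaining_pred, ttnf_pred, remaining_goal, mission_duration):
    """Combine the two categories of predictions into a release criterion.

    remaining_pred   -- predicted remaining failures (category 1)
    ttnf_pred        -- predicted time to next failure (category 2)
    remaining_goal   -- maximum acceptable remaining failures
    mission_duration -- the mission should complete before the next failure
    """
    meets_fault_goal = remaining_pred <= remaining_goal
    meets_mission_goal = ttnf_pred > mission_duration
    if meets_fault_goal and meets_mission_goal:
        return "sufficiently tested: release"
    return "more testing warranted"
```

Requiring both goals to hold reflects the point that the two categories are used in combination: low residual faults alone does not guarantee the software will survive a mission of a given duration, and vice versa.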
1.3 Interpretations and Credibility
The two most critical factors in establishing credibility in software reliability predictions are the validation method and the way the predictions are interpreted. For example, a "conservative" prediction can be interpreted as providing an "additional margin of confidence" in the software reliability if that predicted reliability already exceeds an established "acceptable level" or requirement. It may not be possible to validate predictions of the reliability of software precisely, but it is possible with "high confidence" to predict a lower bound on the reliability of that software within a specified environment.
If historical failure data were available for a series of previous dates (and actual data exist for the failure history following those dates), it would be possible to compare the predictions with the actual reliability and evaluate the performance of the model. Taking this approach significantly enhances the credibility of predictions among those who must make software deployment decisions based on them [9].
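This retrospective validation can be sketched as follows. At each historical cut-off date, only the data up to that date is assumed known; the model fitted on that data predicts the cumulative failures by a later evaluation date, and the prediction is compared with what actually happened. The model form (exponential NHPP) and all numbers here are hypothetical placeholders:

```python
import math

def predict_cumulative(a, b, t):
    """Expected cumulative failures by time t for mu(t) = a*(1 - exp(-b*t))."""
    return a * (1 - math.exp(-b * t))

# (cut-off time, parameters fitted on data up to that cut-off,
#  actual cumulative failures later observed at the evaluation horizon)
history = [
    (20.0, (18.0, 0.06), 14),
    (40.0, (20.0, 0.05), 16),
]
horizon = 80.0  # evaluation date for which actuals are known

for cutoff, (a, b), actual in history:
    predicted = predict_cumulative(a, b, horizon)
    rel_error = abs(predicted - actual) / actual
    print(f"cut-off {cutoff:5.1f}: predicted {predicted:5.1f}, "
          f"actual {actual:3d}, relative error {rel_error:.0%}")
```

Reporting the relative error of past predictions against subsequent actuals is exactly the kind of evidence that gives deployment decision-makers confidence in future predictions.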
1.4 Verification and Validation
Software reliability measurement and prediction are useful approaches for verifying and validating software. Measurement refers to collecting and analyzing data about the observed reliability of software, for example the occurrence of failures during test. Prediction refers to using a model to forecast future software reliability, for example the failure rate during operation. Measurement also provides the failure data used to estimate the parameters of reliability models (i.e., to make the best fit of the model to the observed failure data). Once the parameters have been estimated, the model is used to predict the future reliability of the software. Verification ensures that the software product, as it exists in a given project phase, satisfies the conditions imposed in the
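The measure-fit-predict loop described above can be sketched end to end. An exponential NHPP model mu(t) = a*(1 - exp(-b*t)) again stands in for the model actually used, the observed failure counts are invented for illustration, and the parameters are estimated by a crude grid search minimizing squared error (a real analysis would use maximum likelihood or a proper least-squares fit):

```python
import math

# Step 1 (measurement): observed (test time, cumulative failures) pairs.
observed = [(10, 5), (20, 9), (30, 12), (40, 14)]

# Step 2 (estimation): fit parameters a, b by minimizing squared error.
def sse(a, b):
    """Sum of squared errors of the model against the observed data."""
    return sum((n - a * (1 - math.exp(-b * t))) ** 2 for t, n in observed)

candidates = ((a / 2, b / 100) for a in range(10, 61) for b in range(1, 31))
a_hat, b_hat = min(candidates, key=lambda p: sse(*p))

# Step 3 (prediction): with parameters estimated, forecast the failure
# rate during operation (the model's intensity function) at a future time.
t_future = 60.0
failure_rate = a_hat * b_hat * math.exp(-b_hat * t_future)
```

The point of the sketch is the division of labor the text describes: measurement supplies the data, estimation fits the model to it, and only then does the model yield forward-looking predictions.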

Earth Day Presentation wow hello nice greatYousafMalik24
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxabhijeetpadhi001
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 

Recently uploaded (20)

Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 

54 c o m m u n i c at i o n s o f t h e a c m n o.docx

covering up hardware errors, creating user-interface access methods, and other cosmetic changes.
In software, adding a six-lane automobile expressway to a railroad bridge is considered maintenance—and it would be particularly valuable if you could do it without stopping the train traffic. Is it possible to design software so it can be maintained in this way? Yes, it is. So, why don't we?

You Don't Know Jack About Software Maintenance
DOI: 10.1145/1592761.1592777
Article development led by queue.acm.org
Long considered an afterthought, software maintenance is easiest and most effective when built into a system from the ground up.
By Paul Stachour and David Collier-Brown

The Four Horsemen of the Apocalypse
There are four approaches to software maintenance: traditional, never, discrete, and continuous—or, perhaps, war, famine, plague, and death. In any case, 3.5 of them are terrible ideas.

Traditional (or "everyone's first project"). This one is easy: don't even think about the possibility of maintenance. Hard-code constants, avoid subroutines, use all global variables, use short and non-meaningful variable names. In other words, make it difficult to change any one thing without changing everything. Everyone knows examples of this approach—and the PHBs who thoughtlessly push you into it, usually because of schedule pressures.

Trying to maintain this kind of software is like fighting a war. The enemy fights back! It particularly fights back when you have to change interfaces, and you find you've only changed some of the copies.

Never. The second approach is to decide upfront that maintenance will never occur. You simply write wonderful programs right from the start. This is actually credible in some embedded systems, which will be burned to ROM and never changed. Toasters, video games, and cruise missiles come to mind. All you have to do is design perfect specifications and interfaces, and never change them. Change only the implementation, and then only for bug fixes before the product is released. The code quality is wildly better than it is for the traditional approach, but never quite good enough to avoid change completely.

Even for very simple embedded systems, the specification and designs aren't quite good enough, so in practice the specification is frozen while it's still faulty. This is often because it cannot be validated, so you can't tell if it's faulty until too late. Then the specification is not adhered to when code is written, so you can't prove the program follows the specification, much less prove it's correct. So, you test until the program is late, and then ship. Some months later you replace it as a complete entity, by sending out new ROMs. This is the typical history of video games, washing machines, and embedded systems from the U.S. Department of Defense.

Discrete. The discrete change approach is the current state of practice: define hard-and-fast, highly configuration-controlled interfaces to elements of software, and regularly carry out massive all-at-once changes. Next, ship an entire new copy of the program, or a "patch" that silently replaces entire executables and libraries. (As we write this, a new copy of Open Office is asking us please to download it.)

In theory, the process accepts (reluctantly) the fact of change, keeps a parts list and tools list on every item, allows only preauthorized changes under strict configuration control, and forces all servers'/users' changes to take place in one discrete step. In practice, the program is running multiple places, and each must kick off its users, do the upgrade, and then let them back on again. Change happens more often and in more places than predicted, all the components of an item are not recorded, and patching is alive (and, unfortunately, thriving) because of the time lag for authorization and the rebuild time for the system.
Furthermore, while official interfaces are controlled, unofficial interfaces proliferate; and with C and older languages, data structures are so available that even when change is desired, too many functions "know" that the structure has a particular layout. When you change the data structure, some program or library that you didn't even know existed starts to crash or return ENOTSUP. A mismatch between an older Linux kernel and newer glibc once had getuid returning "Operation not supported," much to the surprise of the recipients.

Experience shows that it is completely unrealistic to expect that all users to whom an interface is visible will be able to change at the same time. The result is that single-step changes cannot happen: multiple change interrelationships conflict, networks mean multiple versions are simultaneously current, and owners/users want to control change dates.

Vendors try to force discrete changes, but the changes actually spread through a population of computers in a wave over time. This is often likened to a plague, and is every bit as popular. Customers use a variant of the
"never" approach to software maintenance against the vendors of these plagues: they build a known working configuration, then "freeze and forget." When an update is required, they build a completely new system from the ground up and freeze it. This works unless you get an urgent security patch, at which time you either ignore it or start a large unscheduled rebuild project.

Continuous change. At first, this approach to maintenance sounds like just running new code willy-nilly and watching what happens. We know at least one company that does just that: a newly logged-on user will unknowingly be running different code from everyone else. If it doesn't work, the user's system will either crash or be kicked off by the sysadmin, then will have to log back on and repeat the work using the previous version.

Real-world structure for managing interface changes:

    struct item_loc_t {
        struct {
            unsigned short major;  /* = 1 */
            unsigned short minor;  /* = 0 */
        } version;
        unsigned part_no;
        unsigned quantity;
        struct location_t {
            char state[4];
            char city[8];
            unsigned warehouse;
            short area;
            short pigeonhole;
        } location;
        ...

However, that is not the real meaning of continuous. The real continuous approach comes from Multics, the machine that was never supposed to shut down and that used controlled, transparent change. The developers understood the only constant is change and that migration for hardware, software, and function during system operation is necessary. Therefore, the ability to change was designed from the very beginning. Software in particular must be written to evolve as changes happen, using a weakly typed high-level language and, in older programs, a good macro assembler. No direct references are allowed to anything if they can be avoided. Every data structure is designed for expansion and self-identifying as to version. Every code segment is
made self-identifying by the compiler or other construction procedure. Code and data are changeable on a per-command/process/system basis, and as few as possible copies of anything are kept, so single copies could be dynamically updated as necessary.

The most important thing is to manage interface changes. Even in the Multics days, it was easy to forget to change every single instance of an interface. Today, with distributed programs, changing all possible copies of an interface at once is going to be insanely difficult, if not flat-out impossible.

Who Does It Right?
BBN Technologies was the first company to perform continuous controlled change when it built the ARPANET backbone in 1969. It placed a 1-bit version number in every packet. If it changed from 0 to 1, it meant that the IMP (router) was to switch to a new version of its software and set the bit to 1 on every outgoing packet. This allowed the entire ARPANET to switch easily to new versions of the software without interrupting its operation. That was very important to the pre-TCP Internet, as it was quite experimental and suffered a considerable amount of change.
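The IMP cutover described above can be sketched roughly as follows. This is our illustration, not BBN's actual code: the packet layout and the names imp_receive and imp_send are invented.

```c
/* Hypothetical sketch of BBN's 1-bit version switch. */
struct packet {
    unsigned version;   /* the 1-bit version number: 0 = old, 1 = new */
    /* ... routing and payload fields ... */
};

static int running_new = 0;   /* does this IMP run the new software yet? */

/* Receiving a packet with the bit set tells the IMP to switch to the
 * new version of its software before handling the packet. */
void imp_receive(const struct packet *in)
{
    if (in->version == 1)
        running_new = 1;
    /* ... process the packet with whichever code path is current ... */
}

/* Every outgoing packet carries the bit, so the upgrade propagates to
 * neighboring IMPs hop by hop, without interrupting traffic. */
void imp_send(struct packet *out)
{
    out->version = running_new ? 1 : 0;
}
```

A single bit suffices only because exactly two versions can coexist at any moment; the structure versioning described next generalizes the idea.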
With Multics, the developers did all of these good things, the most important of which was the discipline used with data structures: if an interface took more than one parameter, all the parameters were versioned by placing them in a structure with a version number. The caller set the version, and the recipient checked it. If it was completely obsolete, it was flatly rejected. If it was not quite current, it was processed differently, by being upgraded on input and probably downgraded on return.

This meant that many different versions of a program or kernel module could exist simultaneously, while upgrades took place at the user's convenience. It also meant that upgrades could happen automatically and that multiple sites, multiple suppliers, and networks didn't cause problems.

An example of a structure used by a U.S.-based warehousing company (translated to C from Multics PL/1) is illustrated in the accompanying box. The company bought a Canadian competitor and needed to add inter-country transfers, initially from three of its warehouses in border cities. This, in turn, required the state field to split into two parts:
    char country_code[4];
    char state_province[4];

To identify this, the company incremented the version number from 1.0 to 2.0 and arranged for the server to support both types. New clients used version 2.0 structures and were able to ship to Canada. Old ones continued to use version 1.0 structures. When the server received a type 1 structure, it used an "updater" subroutine that copied the data into a type 2 structure and set the country code to U.S.

In a more modern language, you would add a new subclass with a constructor that supports a country code, and update your new clients to use it.

The process is this:
1. Update the server.
2. Change the clients that run in the three border-state warehouses. Now they can move items from U.S. to Canadian warehouses.
3. Deploy updated clients to those Canadian locations needing to move stock.
4. Update all of the U.S.-based clients at their leisure.

Using this approach, there is never
a need to stop the whole system, only the individual copies, and that can be scheduled around a business's convenience. The change can be immediate, or can wait for a suitable time.

Once the client updates have occurred, we simultaneously add a check to produce a server error message for anyone who accidentally uses an outdated U.S.-only version of the client. This check is a bit like the "can't happen" case in an else-if: it's done to identify impossibly out-of-date calls. It fails conspicuously, and the system
administrators can then hunt down and replace the ancient version of the program. This also discourages the unwise from permanently deferring fixes to their programs, much like the coarse version numbers on entire programs in present practice.

Modern Examples
This kind of fine-grain versioning is sometimes seen in more recent programs. Linkers are an example, as they read files containing numbered records, each of which identifies a particular kind of code or data. For example, a record number 7 might contain the information needed to link a subroutine call, containing items such as the name of the function to call and a space for an address. If the linker uses record types 1 through 34, and later needs to extend 7 for a new compiler, then create a type 35, use it for the new compiler, and schedule changes from type 7 to type 35 in all the other compilers, typically by announcing the date on which type 7 records would no longer be accepted.

Another example is in networking protocols such as IBM SMB (Server Message Block), used for Windows networking. It has both protocol versions and packet types that can be used exactly the same way as the record types of a linker.
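The check-and-upgrade discipline behind the warehouse example might look like this in C. The struct layouts follow the article's box (abridged); the function names, return codes, and upgrade logic are our own sketch, not the company's code.

```c
#include <string.h>

/* Version 1.0 of the structure, as in the article's box (abridged). */
struct item_loc_v1 {
    struct { unsigned short major, minor; } version;   /* = 1.0 */
    unsigned part_no;
    unsigned quantity;
    struct { char state[4]; char city[8]; unsigned warehouse; } location;
};

/* Version 2.0: the state field split into country code + state/province. */
struct item_loc_v2 {
    struct { unsigned short major, minor; } version;   /* = 2.0 */
    unsigned part_no;
    unsigned quantity;
    struct {
        char country_code[4];
        char state_province[4];
        char city[8];
        unsigned warehouse;
    } location;
};

/* The "updater" subroutine: copy a type 1 structure into a type 2
 * structure and default the country code to U.S. */
static void upgrade_v1_to_v2(const struct item_loc_v1 *in,
                             struct item_loc_v2 *out)
{
    out->version.major = 2;
    out->version.minor = 0;
    out->part_no  = in->part_no;
    out->quantity = in->quantity;
    strcpy(out->location.country_code, "US");
    strcpy(out->location.state_province, in->location.state);
    strcpy(out->location.city, in->location.city);
    out->location.warehouse = in->location.warehouse;
}

/* Server entry point: the caller set the version, so the server checks
 * it first (the version header leads every variant of the structure).
 * Current requests pass through, near-current ones are upgraded on
 * input, and completely obsolete ones are flatly rejected. */
int server_accept(const void *req, struct item_loc_v2 *out)
{
    const struct item_loc_v1 *hdr = req;
    if (hdr->version.major == 2) {
        memcpy(out, req, sizeof *out);   /* already current */
        return 0;
    }
    if (hdr->version.major == 1) {
        upgrade_v1_to_v2(hdr, out);      /* upgrade on input */
        return 0;
    }
    return -1;   /* impossibly out of date: fail conspicuously */
}
```

Because the caller sets the version and the recipient checks it, version 1.0 and 2.0 clients can coexist throughout the rollout; the reject path plays the role of the conspicuous server error message described above.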
Object languages can also support controlled maintenance by creating new versions as subclasses of the same parent. This is a slightly odd use of a subclass, as the variations you create aren't necessarily meant to persist, but you can go back and clean out unneeded variants later, after they're no longer in use.

With AJAX, a reasonably small client can be downloaded every time the program is run, thus allowing change without versioning. A larger client would need only a simple versioning scheme, enough to allow it to be downloaded whenever it was out of date.

An elegant modern form of continuous maintenance exists in relational databases: one can always add columns to a relation, and there is a well-known value called null that stands for "no data." If the programs that use the database understand that any calculation with a null yields a null, then a new column can be added, programs changed to use it over some period of time, and the old column(s) filled with nulls. Once all the users of the old column are no more, as indicated by the column being null for some time, then the old column can
be dropped.

Another elegant mechanism is a markup language such as SGML or XML, which can add or subtract attributes of a type at will. If you're careful to change the attribute name when the type changes, and if your XML processor understands that adding 3 to a null value is still null, you've an easy way to transfer and store mutating data.

Maintenance Isn't Hard, It's Easy
During the last boom, (author) Collier-Brown's team needed to create a single front end to multiple back ends, under the usual insane time pressures. The front end passed a few parameters and a C structure to the back ends, and the structure repeatedly needed to be changed for one or another of the back ends as they were developed.

Even when all the programs were on the same machine, the team couldn't change them simultaneously because they would have been forced to stop everything they were doing and apply a structure change. Therefore, the team started using version numbers. If a back end needed version 2.6 of the structure, it told the front end, which handed it the new one. If it could use only version 2.5, that's what it asked
for. The team never had a "flag day" when all work stopped to apply an interface change. They could make those changes when they could schedule them. Of course, the team did have to make the changes eventually, and their management had to manage that, but they were able to make the changes when it wouldn't destroy their schedule. In an early precursor to test-directed design, they had a regression test that checked whether all the version numbers were up to date and warned them if updates were needed.

The first time the team avoided a flag day, they gained the few hours expended preparing for change. By the 12th time, they were winning big. Maintenance really is easy. More importantly, investing time to prepare for it can save you and your management time in the most frantic of projects.

Related articles on queue.acm.org

The Meaning of Maintenance
Kode Vicious
http://queue.acm.org/detail.cfm?id=1594861
The Long Road to 64 Bits
John Mashey
http://queue.acm.org/detail.cfm?id=1165766

A Conversation with David Brown
http://queue.acm.org/detail.cfm?id=1165764

Paul Stachour is a software engineer equally at home in development, quality assurance, and process. One of his focal areas is how to create correct, reliable, functional software in effective and efficient ways in many programming languages. Most of his work has been with life-, safety-, and security-critical applications from his home base in the Twin Cities of Minnesota.

David Collier-Brown is an author and systems programmer, formerly with Sun Microsystems, who mostly does performance and capacity work from his home in Toronto.

© 2009 ACM 0001-0782/09/1100 $10.00

Contributed Articles
Communications of the ACM, January 2010, Vol. 53, No. 1
DOI: 10.1145/1629175.1629209
Think Big for Reuse
By Paul D. Witman and Terry Ryan
Many organizations are successful with software reuse at fine to medium granularities – ranging from objects, subroutines, and components through software product lines. However, relatively little has been published on very large-grained reuse. One example of this type of large-grained reuse might be that of an entire Internet banking system (applications and infrastructure) reused in business units all over the world. In contrast, "large scale" software reuse in current research generally refers to systems that reuse a large number of smaller components, or that perhaps reuse subsystems.9

In this article, we explore a case of an organization with an internal development group that has been very successful with large-grained software reuse. BigFinancial, and the BigFinancial Technology Center (BTC) in particular, have created a number of software systems that have been reused in multiple businesses and in multiple countries. BigFinancial and BTC thus provided a rich source of data for case studies to look at the characteristics of those projects and why they have been successful, as well as to look at projects that have been less successful and to understand what has caused those results and what might be done differently to prevent issues in the future. The research is focused on technology, process, and organizational elements of the development process, rather than on specific product features and functions.

Supporting reuse at a large-grained level may help to alleviate some of the
issues that occur in more traditional reuse programs, which tend to be finer-grained. In particular, because BigFinancial was trying to gain commonality in business processes and operating models, reuse of large-grained components was more closely aligned with its business goals. This same effect may well not have happened with finer-grained reuse, due to the continued ability of business units to more readily pick and choose components for reuse.

BTC is a technology development unit of BigFinancial, with operations in both the eastern and western U.S. Approximately 500 people are employed by BTC, reporting ultimately through a single line manager responsible to the Global Retail Business unit head of BigFinancial. BTC is organized to deliver both products and infrastructure components to BigFinancial, and its product line has through the years included consumer Internet banking services, teller systems, ATM software, and network management tools. BigFinancial has its U.S. operations headquartered in the eastern U.S., and employs more than 8,000 technologists worldwide.

In cooperation with BTC, we selected three cases for further study from a pool of about 25. These cases were the Java Banking Toolkit (JBT) and its related application systems, the Worldwide Single
  • 22. Signon (WSSO) subsystem, and the Big- Financial Message Switch (BMS). background – software reuse and bigfinancial Various definitions appear in the lit- erature for software reuse. Karlsson de- fines software reuse as “the process of creating software systems from existing software assets, rather than building software systems from scratch.” One taxonomy of the approaches to software reuse includes notions of the scope of reuse, the target of the reuse, and the granularity of the reuse.5 The notion of granularity is a key differentiator of the type of software reuse practiced at Big- Financial, as BigFinancial has demon- think big for reuse J a n u a r y 2 0 1 0 | v o l . 5 3 | n o . 1 | c o m m u n i c at i o n s o f t h e a c m 143 contributed articles portal services, and alerts capabilities, and thus the JBT infrastructure is al- ready reused for multiple applications. To some extent, these multiple appli- cations could be studied as subcases, though they have thus far tended to be deployed as a group. In addition, the
  • 23. online banking, portal services, and alerts functions are themselves reused at the application level across multiple business units globally. Initial findings indicated that sever- al current and recent projects showed significant reuse across independent business units that could have made alternative technology development decisions. The results are summarized in Table 1. While significant effort is required to support multiple languages and business-specific functional variabili- ty, BTC found that it was able to accom- modate these requirements by design- ing its products to be rule-based, and by designing its user interface to separate content from language. In this manner, business rules drove the behavior of the Internet banking applications, and language- and format-definition tools drove the details of application behav- ior, while maintaining a consistent set of underlying application code. In the late 1990s, BTC was respon- sible for creation of system infrastruc- ture components, built on top of in- dustry-standard commercial operating systems and components, to support the banking functionality required by its customers within BigFinancial. The functions of these infrastructure
components included systems management, high-reliability logging processes, high-availability mechanisms, and other features not readily available in commercial products at the time that the components were created. The same infrastructure was used to support consumer Internet banking as well as automated teller machines. The Internet banking services will be identified here as the Legacy Internet Banking product (LIB).

Organizations have demonstrated success in large-grained reuse programs: building a system once and reusing it in multiple businesses. Product Line Technology models, such as that proposed by Griss [4] and further expanded upon by Clements and Northrop [2] and by Krueger [6], suggest that software components can be treated similarly to the notions used in manufacturing: reusable parts that contribute to consistency across a product line as well as to improved efficiencies in manufacturing. Benefits of such reuse include the high levels of commonality of such features as user interfaces [7], which increases switching costs and customer loyalty in some domains. This could logically extend to banking systems in the form of common functionality and user interfaces across systems within a business, and across business units.

BigFinancial has had several instances of successful, large-grained reuse projects. We identified projects that have been successfully reused across a wide range of business environments or business domains, resulting in significant benefit to BigFinancial. These included the JBT platform and its related application packages, as well as the Worldwide SSO product. These projects demonstrated broad success, and the authors evaluated these for evidence to identify what contributed to, and what may have worked against, the success of each project.

The authors also identified another project that has been successfully reused across a relatively narrow range of business environments. This project, the BigFinancial Message Switch (BMS), was designed for a region-wide level of reuse, and had succeeded at that level. As such, it appears to have invested appropriately in features and capabilities needed for its client base, and did not appear to have over-invested.

Online banking and related services
We focused on BTC's multi-use Java Banking Toolkit (JBT) as a model of a successful project. The Toolkit is in wide use across multiple business units, and represents reuse both at the largest-grained levels as well as reuse of large-scale infrastructure components. JBT supports three application sets today, including online banking, alerts, and portal services.

BigFinancial's initial forays into Internet transaction services were accomplished via another instance of reuse. Taking its pre-Internet banking components, BTC was able to "scrape" the content from the pages displayed in that product, and wrap HTML code around them for display on a Web browser. Other components were responsible for modifying the input and menuing functions for the Internet. The purpose for this approach to Internet delivery was to more rapidly deliver a product to the Internet, without modification of the legacy business logic, thereby reducing risk as well. In what amounted to an early separation of business and presentation logic, the pre-Internet business logic remained in place, and the presentation layer re-mapped its content for the browser environment.

In 2002, BigFinancial and BTC recognized two key issues that needed to be addressed. The platform for their legacy Internet Banking application was nearing end of life (having been first deployed in 1996), and there were
too many disparate platforms for its consumer Internet offerings. BTC's Internet banking, alerts, and portal functions each required separate hardware and operating environments. BTC planned its activities such that the costs of the new development could fit within the existing annual maintenance and new development costs already being paid by its clients.

BTC and business executives cited trust in BTC's organization as a key to allowing BTC the opportunity to develop the JBT product. In addition, BTC's prior success with reusing software components at fine and medium granularities led to a culture that promoted reuse as a best practice.

Table 1. Selected reuse results

    Project                Reused in                          Business units
    System Infrastructure  Consumer Internet banking;         All users of BTC's legacy Internet banking
                           Automated Teller Machines          components: >35 businesses worldwide
    System Infrastructure  Internet banking (Small Business)  Approximately 4 business units worldwide
    Internet banking       Europe                             >15 business units
    Internet banking       Asia                               >10 business units
    Internet banking       Latin America                      >6 business units
    Internet banking       North America                      >4 business units

contributed articles | 144 Communications of the ACM | January 2010 | Vol. 53 | No. 1

Starting in late 2002, BTC developed an integrated platform and application set for a range of consumer Internet functions. The infrastructure package, named the Java Banking Toolkit (JBT), was based on Java 2 Enterprise Edition (J2EE) standards and was intended to allow BigFinancial to centralize its server infrastructure for consumer Internet functions. The authors conducted detailed interviews with several BTC managers and architects, and reviewed several hundred documents. Current deployment statistics for JBT are shown in Table 2.

The JBT infrastructure and applications were designed and built by BTC and its regional partners, with input from its clients around the world. BTC's experience had shown that consumer banking applications were not fundamentally different from one another across the business units, and BTC proposed and received funding for creation of a consolidated application set for Internet banking. A market evaluation determined that there were no suitable, globally reusable, complete applications on the market, nor any other organization with the track record of success required for confidence in the delivery. Final funding approval came from BigFinancial technology and business executives.

The requirements for JBT called for several major functional elements. The requirements were broken out among the infrastructural elements supporting the various planned application packages, and the applications themselves. The applications delivered with the initial release of JBT included a consumer Internet banking application set, an account activity and balance alerting function, and a portal content toolset.

Each of these components was designed to be reused intact in each business unit around the world, requiring only changes to business rules and language phrases that may be unique to a business. One of the fundamental requirements for each of the JBT applications was to include capabilities that were designed to be common to and
shared by as many business units as possible, while allowing for all necessary business-specific variability.

Such variability was planned for in the requirements process, building on the LIB infrastructure and applications, as well as the legacy portal and alerts services that were already in production. Examples of the region- and business-specific variability include language variations, compliance with local regulatory requirements, and functionality based on local and regional competitive requirements.

JBT's initial high-level requirements documents included requirements across a range of categories. These categories included technology, operations, deployment, development, and tools. These requirements were intended to form the foundation for initial discussion and agreement with the stakeholders, and to support division of the upcoming tasks to define the architecture. Nine additional, more detailed, requirements documents were created to flesh out the details referenced in the top-level requirements. Additional topics addressed by the detailed documents included language, business rules, host messaging, logging, portal services, and system management.
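The rule-and-phrase separation described above can be sketched roughly as follows. This is an illustrative toy, not BTC's actual design: every class name, rule key, and default value here is invented. It shows how one shared code path can serve business units that differ only in configuration:

```java
import java.util.Locale;
import java.util.Map;

// Toy sketch of rule-driven variability: the core logic below is shared by
// every business unit; only the rule and phrase tables differ per unit.
public class RuleDrivenBanking {

    // A business rule that varies by unit (for example, a local regulatory limit).
    static double dailyTransferLimit(Map<String, String> rules) {
        return Double.parseDouble(rules.getOrDefault("transfer.dailyLimit", "5000"));
    }

    // Language phrases live outside the application code, so the same
    // code base can present any language a unit requires.
    static String confirmation(Map<String, String> phrases, double amount) {
        String template = phrases.getOrDefault("transfer.confirm", "Transfer of %.2f accepted");
        return String.format(Locale.ROOT, template, amount);
    }

    // Shared core logic: identical code deployed to every unit.
    static boolean approveTransfer(Map<String, String> rules, double amount) {
        return amount > 0 && amount <= dailyTransferLimit(rules);
    }
}
```

A unit in a stricter regulatory regime would ship only a different rules table (say, a lower `transfer.dailyLimit`) and its own phrase file, while the application code itself, as with JBT, remains a single consistent code base.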
One of BigFinancial's regional technology leaders reported that JBT has been much easier to integrate than the legacy product, given its larger application base and ability to readily add applications to it. Notably, he indicated that JBT's design had taken into account the lessons learned from prior products, including improvements in performance, stability, and total cost of ownership. This resulted in a "win/win/win for businesses, technology groups, and customers."

From an economic viewpoint, BigFinancial indicates that the cost savings for first-time business unit implementations of products already deployed to other business units averaged between 20% and 40%, relative to the cost of new development. Further, the cost savings for subsequent deployments of updated releases to a group of business units resulted in cost savings of 50% to 75% relative to the cost of maintaining the software for each business unit independently.

All core banking functionality is supported by a single global application set. There remain, in some cases, functions required only by a specific business or region. The JBT architecture allows for those region-specific applications to be developed by the regional technology unit as required. An overview of the JBT architecture is
shown in Figure 1.

Figure 1. Java Banking Toolkit architecture overview.

BTC implemented JBT on principles of a layered architecture [12], focusing on interoperability and modularity. For example, the application components interact only with the application body section of the page; all other elements of navigation and branding are handled by the common and portal services elements. In addition, transactional messaging is isolated from the application via a message abstraction layer, so that unique messaging models can be used in each region, if necessary.

JBT includes both the infrastructure and applications components for a range of banking functionality. The infrastructure and applications components are defined as independently changeable releases, but are currently packaged as a group to simplify the deployment process.

Funding and governance of the projects are coordinated through BTC, with significant participation from the business units. Business units have the opportunity to choose other vendors for their technology needs, though the corporate technology strategy limited that option as the JBT project gained wider rollout status. Business units participate in a semi-annual in-person planning exercise to evaluate enhancement requests and prioritize new business deployments.

Table 2. JBT reuse results

    Region         Business units
    Europe         >18 business units
    Asia           >14 business units
    Latin America  >9 business units
    North America  >5 business units

Results
The authors examined a total of six different cases of software reuse. Three of these were subcases of the Java Banking Toolkit (JBT): Internet banking, portal services, and alerts, along with the reuse of the JBT platform itself. The others were the Worldwide SSO product, and the BigFinancial Message Switch. There were a variety of reuse success levels, and a variety of levels of evidence of anticipated supports and barriers to reuse. The range of outcomes is represented as a two-dimensional graph, as shown in Figure 2.

Figure 2. Reuse expectations and outcomes.

BigFinancial measures its reuse success in a very pragmatic, straightforward fashion. Rather than measuring reused modules, lines of code, or function points, BigFinancial instead simply measures total deployments of compatible code sets. Due to ongoing enhancements, the code base continues to evolve over time, but in a backwards-compatible fashion, so that older versions can be and are readily upgraded to the latest version as business needs dictate.

BTC did not explicitly capture hard economic measures of cost savings. However, their estimates of the range of cost savings are shown in Figure 3. Cost savings are smaller for new deployments due to the significant effort required to map business unit requirements to global product capabilities, along with the cost of training, development and testing of business rules, and ramp-up of operational processes. In contrast, ongoing maintenance savings are generally larger, due to the commonality across the code base for numerous business units. This commonality enables bug fixes, security patches, and other maintenance activities to be performed on one code base, rather than one for each business unit.

BigFinancial has demonstrated that it is possible for a large organization, building software for its own internal use, to move beyond the more common models of software reuse. In so doing, BigFinancial has achieved significant economies of scale across its many business units, and has shortened the time to market for new deployments of its products.

Numerous factors were critical to the success of the reuse projects. These included elements expected from the more traditional reuse literature, including organizational structure, technological foundations, and economic factors. In addition, several new elements have been identified. These include the notions of trust and culture, the concepts of a track record of large- and fine-grained reuse success, and the virtuous (and potentially vicious) cycle of corporate mandates. Conversely, organizational barriers prove to be the greatest inhibitor to successful reuse [13].

BTC took specific steps, over a period of many years, to create and strengthen its culture of reuse. Across numerous product lines, reuse of components and infrastructure packages was strongly encouraged. Reuse of large-grained elements was the next logical step, working with a group of business units within a single regional organization. This supported the necessary business alignment to enable large-grained reuse. In addition, due to its position as a global technology provider to BigFinancial, BTC was able to leverage its knowledge of requirements across business units, and explicitly design products to be readily reusable, as well as to drive commonality of requirements to support that reuse as well.

On the technical factors related to reuse, BTC's results have provided empirical evidence regarding the use of various technologies and patterns in actual reuse environments. Some of these technologies and patterns are platform-independent interfaces, business rule structures, rigorous isolation of concerns across software layers, and versioning of interfaces to allow phased
migration of components to updated interfaces. These techniques, among others, are commonly recognized as good architectural approaches for designing systems, and have been examined more closely for their contribution to the success of the reuse activities. In this examination, they have been found to contribute highly to the technological elements required for success of large-grained reuse projects.

Product vendors, and particularly application service providers, routinely conduct this type of development and reuse, though with different motivations. (Application service providers are now often referred to as providers of Software as a Service.) As commercial providers, they are more likely to be market-driven, often with sales of Professional Services for customization. In contrast, the motivations in evidence at BigFinancial seemed more aimed at achieving the best combinations of functionality, time to market, and cost.

The research provided an opportunity to examine, in-depth, the various forms of reuse practiced on three projects, and three subprojects, inside BigFinancial. Some of those forms include design reuse, code reuse, pattern reuse, and test case reuse. The authors have found, based on documents and reports from participants, that the active
practice of systematic, finer-grained reuse contributed to successful reuse of systems at larger levels of granularity.

This study has provided a view of management structures and leadership styles, and an opportunity to examine how those contribute to, or work against, successful reuse. Much has been captured about IT governance in general, and about organizational constructs to support reuse in various situations at BigFinancial/BTC. Leadership of both BTC and BigFinancial was cited as contributing to the success of the reuse efforts, and indeed also was cited as a prerequisite for even launching a project that intends to accomplish such large-grained reuse.

Sabherwal [11] notes the criticality of trust in outsourced IS relationships, where the participants in projects may not know one another before a project, and may only work together on the one project. As such, the establishment and maintenance of trust is critical in that environment. This is not entirely applicable to BTC, as it is a peer organization to its client's technology groups, and its members often have long-standing relationships with their peers. Ring and Van de Ven examine the broader notions of cooperative inter-organizational relationships (IORs), and note
that trust is a fundamental part of an IOR. Trust serves to mitigate the risks inherent in a relationship, and at both a personal and organizational level is itself mitigated by the potential overriding forces of the legal or organizational systems [10]. This element does seem to be applicable to BTC's environment, in that trust is reported to have been foundational to the assignment of the creation of JBT to BTC.

Griss notes that culture is one element of the organizational structure that can impede reuse. A culture that fears loss of creativity, lacks trust, or doesn't know how to effectively reuse software will not be as successful as an organization that doesn't have these impediments [4]. The converse is likely then also reasonable: a culture that focuses on and implicitly welcomes reuse will likely be more successful. BTC's long history of reuse, its lack of explicit incentives and metrics around more traditional reuse, and its position as a global provider of technology to its business partners make it likely that its culture is, indeed, a strong supporter of its reuse success.

Several other researchers have commented on the impact of organizational culture on reuse. Morisio et al. [8] refer in passing to cultural factors, primarily as potential inhibitors to reuse. Card and
Comer [1] examine four cultural aspects that can contribute to reuse adoption: training, incentives, measurement, and management. In addition, Card and Comer's work focuses generally on cultural barriers, and how to overcome them. In BTC's case, however, there is a solid cultural bias for reuse, and one that, for example, no longer requires incentives to promote reuse.

Figure 3. Reuse cost savings ranges.

Paul D. Witman ([email protected]) is an Assistant Professor of Information Technology at California Lutheran University. Terry Ryan ([email protected]) is an Associate Professor and Dean of the School of Information Systems at Claremont Graduate University. © 2010 ACM 0001-0782/10/0100 $10.00

One key participant in the study had a strong opinion to offer in relation to fine- vs. coarse-grained reuse. The lead architect for JBT was explicitly and vigorously opposed to a definition of reuse
that slanted toward fine-grained reuse, of objects and components at a fine-grained level. This person's opinion was that while reuse at this granularity was possible (indeed, BTC demonstrated success at this level), fine-grained reuse was very difficult to achieve in a distributed development project. The lead architect further believed that the leverage it provides was not nearly as great as the leverage from a large-grained reuse program. The integrators of such larger-grained components can then have more confidence that the component has been used in a similar environment, tested under appropriate loads, and so on, relieving the risk that a fine-grained component built for one domain may get misused in a new domain or at a new scale, and be unsuccessful in that environment.

While BTC's JBT product does, to some extent, work as part of a software product line (supporting its three major applications), JBT's real reuse does not come in the form of developing more instances from a common set of core assets. Rather, it appears that JBT is itself reused, intact, to support the needs of each of the various businesses in a highly configurable fashion.

Organizational barriers appeared, at least in part, to contribute to the lack of broad deployment of the BigFinancial Message Switch. Gallivan [3] defined a model for technology innovation assimilation and adoption, which included the notion that even in the face of management directive, some employees and organizations might not adopt and assimilate a particular technology or innovation. This concept might partly explain the results with BMS: it was possible for some business units and technology groups to resist its introduction on a variety of grounds, including business case, even with a decision by a global steering committee to proceed with deployment.

We noted previously the negative impact of inter-organizational barriers on reuse adoption, particularly in the BMS case. This was particularly evident in that the organization that created BMS, and was in large part responsible for "selling" it to other business units, was positioned at a regional rather than global technology level. This organizational location, along with the organization's more limited experience with globally reusable products, may have contributed to the difficulty in accomplishing broader reuse of that product.

Conclusion
While BTC's results and BigFinancial's specific business needs may be somewhat unusual, it is likely that the business and technology practices supporting reuse may be generalizable to other banks and other technology users. Good system architecture, supporting reuse, and an established business case that identifies the business value of the reuse were fundamental to establishing the global reuse accomplished by BTC, and should be readily scalable to smaller and less global environments.

Key factors contributing to a successful project will be a solid technology foundation, experience building and maintaining reusable software, and a financial and organizational structure that supports and promotes reuse. In addition, the organization will need to actively build a culture of large-grained reuse, and establish trust with its business partners. Establishing that trust will be vital to even having the opportunity to propose a large-grained reusable project.

References
1. Card, D. and Comer, E. Why do so many reuse programs fail? IEEE Software 11, 5, 114-115.
2. Clements, P. and Northrop, L.M. Software Product Lines: Practices and Patterns. Addison-Wesley Professional, 2002.
3. Gallivan, M.J. Organizational adoption and assimilation of complex technological innovations: Development
and application of a new framework. The DATA BASE for Advances in Information Systems 32, 3, 51-85.
4. Griss, M.L. Software reuse: From library to factory. IBM Systems Journal 32, 4, 548-566.
5. Karlsson, E.-A. Software Reuse: A Holistic Approach. John Wiley & Sons, West Sussex, England, 1995.
6. Krueger, C.W. New methods in software product line practice. Comm. ACM 49, 12 (Dec. 2006), 37-40.
7. Malan, R. and Wentzel, K. Economics of Software Reuse Revisited. Hewlett-Packard Software Technology Laboratory, Irvine, CA, 1993, 19.
8. Morisio, M., Ezran, M. and Tully, C. Success and failure factors in software reuse. IEEE Transactions on Software Engineering 28, 4, 340-357.
9. Ramachandran, M. and Fleischer, W. Design for large scale software reuse: An industrial case study. In Proceedings of the International Conference on Software Reuse (Orlando, FL, 1996), 104-111.
10. Ring, P.S. and Van de Ven, A.H. Developmental processes of cooperative interorganizational relationships. Academy of Management Review 19, 1, 90-118.
11. Sabherwal, R. The role of trust in outsourced IS development projects. Comm. ACM 42, 2 (Feb. 1999), 80-86.
12. Szyperski, C., Gruntz, D. and Murer, S. Component Software: Beyond Object-Oriented Programming. ACM
Press, New York, 2002.
13. Witman, P. and Ryan, T. Innovation in large-grained software reuse: A case from banking. In Proceedings of the Hawaii International Conference on System Sciences (Waikoloa, HI, 2007), IEEE Computer Society.

practice | 54 Communications of the ACM | January 2018 | Vol. 61 | No. 1

The heterogeneity, complexity, and scale of cloud applications make verification of their fault tolerance properties challenging. Companies are moving away from formal methods and toward large-scale testing in which components are deliberately compromised to identify weaknesses in the software. For example, techniques such as Jepsen apply fault-injection testing to distributed data stores, and Chaos Engineering performs fault injection experiments on production systems, often on live traffic. Both approaches have captured the attention of industry and academia alike.
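To make "deliberately compromising components" concrete, here is a minimal sketch of the idea; the names are invented, and this is not Jepsen's or Netflix's actual tooling. A fault injector wraps a dependency so an experiment can force it to fail, and the test then checks whether the caller's redundancy masks the failure:

```java
import java.util.function.Supplier;

// Minimal fault-injection sketch: wrap a dependency so an experiment can
// force it to fail a chosen number of times, then observe whether the
// system under test still produces a correct answer.
public class FaultInjection {

    // Returns a version of `call` that throws for the first `faults`
    // invocations, simulating a crashed or partitioned dependency.
    static <T> Supplier<T> failFirst(Supplier<T> call, int faults) {
        int[] remaining = {faults};  // mutable counter captured by the lambda
        return () -> {
            if (remaining[0] > 0) {
                remaining[0]--;
                throw new RuntimeException("injected fault");
            }
            return call.get();
        };
    }

    // A caller that masks faults with bounded retries; this is the kind of
    // redundancy a fault-injection experiment is meant to exercise.
    static <T> T withRetries(Supplier<T> call, int attempts) {
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }
}
```

In this toy, two injected faults are masked by three retries, while five are not; an experiment is essentially a search over which faults to inject, and choosing those faults well is the hard problem the rest of this article addresses.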
  • 47. Unfortunately, the search space of distinct fault combinations that an infrastructure can test is intractable. Existing failure-testing solutions require skilled and intelligent users who can supply the faults to inject. These superusers, known as Chaos Engineers and Jepsen experts, must study the sys- tems under test, observe system execu- tions, and then formulate hypotheses about which faults are most likely to expose real system-design flaws. This approach is fundamentally unscal- able and unprincipled. It relies on the superuser’s ability to interpret how a distributed system employs redun- dancy to mask or ameliorate faults and, moreover, the ability to recognize the insufficiencies in those redundan- cies—in other words, human genius. This article presents a call to arms for the distributed systems research community to improve the state of the art in fault tolerance testing. Ordinary users need tools that au- tomate the selection of custom-tai- lored faults to inject. We conjecture that the process by which superusers select experiments—observing execu- tions, constructing models of system redundancy, and identifying weak- nesses in the models—can be effec- tively modeled in software. The ar- ticle describes a prototype validating this conjecture, presents early results from the lab and the field, and identi-
  • 48. fies new research directions that can make this vision a reality. The Future Is Disorder Providing an “always-on” experience for users and customers means that distributed software must be fault tol- erant—that is to say, it must be writ- ten to anticipate, detect, and either mask or gracefully handle the effects of fault events such as hardware fail- ures and network partitions. Writing fault-tolerant software—whether for distributed data management systems involving the interaction of a handful of physical machines, or for Web ap- plications involving the cooperation of tens of thousands—remains extremely difficult. While the state of the art in verification and program analysis con- tinues to evolve in the academic world, the industry is moving very much in the opposite direction: away from for- mal methods (however, with some noteworthy exceptions,41) and toward Abstracting the Geniuses Away from Failure Testing D O I : 1 0 . 1 1 4 5 / 3 1 5 2 4 8 3 Article development led by queue.acm.org
  • 49. Ordinary users need tools that automate the selection of custom-tailored faults to inject. BY PETER ALVARO AND SEVERINE TYMON http://dx.doi.org/10.1145/3152483 J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 55 56 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 practice up the stack and frustrate any attempts at abstraction. The Old Guard. The modern myth: Formally verified distributed compo- nents. If we cannot rely on geniuses to hide the specter of partial failure, the next best hope is to face it head on, armed with tools. Until quite recently, many of us (academics in particular) looked to formal methods such as model checking16,20,29,39,40,53,54 to assist “mere mortal” programmers in writ- ing distributed code that upholds its guarantees despite pervasive uncer- tainty in distributed executions. It is not reasonable to exhaustively search the state space of large-scale systems
  • 50. (one cannot, for example, model check Netflix), but the hope is that modularity and composition (the next best tools for conquering complexity) can be brought to bear. If individual distributed components could be formally verified and combined into systems in a way that preserved their guarantees, then global fault toler- ance could be obtained via composi- tion of local fault tolerance. Unfortunately, this, too, is a pipe dream. Most model checkers require a formal specification; most real-world systems have none (or have not had one since the design phase, many versions ago). Software model checkers and oth- er program-analysis tools require the source code of the system under study. The accessibility of source code is also an increasingly tenuous assumption. Many of the data stores targeted by tools such as Jepsen are closed source; large-scale architectures, while typical- ly built from open source components, are increasingly polyglot (written in a wide variety of languages). Finally, even if you assume that spec- ifications or source code are available, techniques such as model checking are not a viable strategy for ensuring that applications are fault tolerant because, as mentioned, in the context of time- outs, fault tolerance itself is an end-to-
  • 51. end property that does not necessarily hold under composition. Even if you are lucky enough to build a system out of individually verified components, it does not follow the system is fault toler- ant—you may have made a critical error in the glue that binds them. The Vanguard. The emerging ethos: YOLO. Modern distributed systems approaches that combine testing with fault injection. Here, we describe the underlying causes of this trend, why it has been successful so far, and why it is doomed to fail in its current practice. The Old Gods. The ancient myth: Leave it to the experts. Once upon a time, distributed systems researchers and practitioners were confident that the responsibility for addressing the problem of fault tolerance could be relegated to a small priesthood of ex- perts. Protocols for failure detection, recovery, reliable communication, consensus, and replication could be implemented once and hidden away in libraries, ready for use by the layfolk. This has been a reasonable dream. After all, abstraction is the best tool for overcoming complexity in com- puter science, and composing reliable
  • 52. systems from unreliable components is fundamental to classical system design.33 Reliability techniques such as process pairs18 and RAID45 dem- onstrate that partial failure can, in certain cases, be handled at the low- est levels of a system and successfully masked from applications. Unfortunately, these approaches rely on failure detection. Perfect failure detectors are impossible to implement in a distributed system,9,15 in which it is impossible to distinguish between delay and failure. Attempts to mask the fundamental uncertainty arising from partial failure in a distributed system—for example, RPC (remote procedure calls8) and NFS (network file system49)—have met (famously) with difficulties. Despite the broad consen- sus that these attempts are failed ab- stractions,28 in the absence of better abstractions, people continue to rely on them to the consternation of devel- opers, operators, and users. In a distributed system—that is, a system of loosely coupled components interacting via messages—the failure of a component is only ever manifested as the absence of a message. The only way to detect the absence of a message is via a timeout, an ambiguous signal that means either the message will nev- er come or that it merely has not come
yet. Timeouts are an end-to-end concern28,48 that must ultimately be managed by the application. Hence, partial failures in distributed systems bubble

While the state of the art in verification and program analysis continues to evolve in the academic world, the industry is moving in the opposite direction: away from formal methods and toward approaches that combine testing with fault injection.

JANUARY 2018 | VOL. 61 | NO. 1 | COMMUNICATIONS OF THE ACM

are simply too large, too heterogeneous, and too dynamic for these classic approaches to software quality to take root. In reaction, practitioners increasingly rely on resiliency techniques based on testing and fault injection.6,14,19,23,27,35 These "black box" approaches (which perturb and observe the complete system, rather than its components) are (arguably) better suited for testing an end-to-end property such as fault tolerance. Instead of deriving guarantees from understanding how a system works on the inside, testers of the system observe its behavior from the outside, building confidence that it functions correctly under stress.

Two giants have recently emerged in this space: Chaos Engineering6 and Jepsen testing.24 Chaos Engineering, the practice of actively perturbing production systems to increase overall site resiliency, was pioneered by Netflix,6 but since then LinkedIn,52 Microsoft,38 Uber,47 and PagerDuty5 have developed Chaos-based infrastructures. Jepsen performs black box testing and fault injection on unmodified distributed data management systems, in search of correctness violations (for example, counterexamples that show an execution was not linearizable).

Both approaches are pragmatic and empirical. Each builds an understanding of how a system operates under faults by running the system and observing its behavior. Both approaches offer a pay-as-you-go method to resiliency: the initial cost of integration is low, and the more experiments that are performed, the higher the confidence
that the system under test is robust. Because these approaches represent a straightforward enrichment of existing best practices in testing with well-understood fault injection techniques, they are easy to adopt. Finally, and perhaps most importantly, both approaches have been shown to be effective at identifying bugs.

Unfortunately, both techniques also have a fatal flaw: they are manual processes that require an extremely sophisticated operator. Chaos Engineers are a highly specialized subclass of site reliability engineers. To devise a custom fault injection strategy, a Chaos Engineer typically meets with different service teams to build an understanding of the idiosyncrasies of various components and their interactions. The Chaos Engineer then targets those services and interactions that seem likely to have latent fault tolerance weaknesses. Not only is this approach difficult to scale, since it must be repeated for every new composition of services, but its critical currency—a mental model of the system under study—is hidden away in a person's brain. These points are reminiscent of a bigger (and more worrying) trend in industry toward reliability priesthoods,7 complete with icons (dashboards) and rituals (playbooks).
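The experiments such engineers run nevertheless follow a small, repeatable pattern. Below is a minimal, self-contained sketch (illustrative only: the `lookup_recommendations` service, the timeout value, and the fallback are invented, and real Chaos tooling perturbs production traffic rather than toy functions). A caller wraps a dependency in timeout-plus-fallback fault-tolerance logic, and the "experiment" checks a steady-state hypothesis—the caller always returns a non-empty response—both with and without an injected delay.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def lookup_recommendations(inject_delay=False):
    """Hypothetical 'callee' service. The only fault we model is delay,
    since from the caller's perspective a crashed or partitioned callee
    is indistinguishable from a very slow one."""
    if inject_delay:
        time.sleep(1.0)  # simulated network delay / partition
    return ["movie-a", "movie-b"]

def recommendations_with_fallback(inject_delay=False, timeout_s=0.2):
    """'Caller' fault-tolerance logic: a timeout plus a degraded fallback,
    so the steady state degrades rather than fails."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(lookup_recommendations, inject_delay)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return ["popular-default"]  # fallback response

def experiment():
    """A chaos-style experiment: observe the caller's behavior with and
    without the injected fault; the hypothesis is a non-empty response
    in both cases."""
    healthy = recommendations_with_fallback(inject_delay=False)
    faulted = recommendations_with_fallback(inject_delay=True)
    return healthy, faulted
```

The point of the sketch is the shape of the loop—hypothesize a steady state, inject a fault, observe—not the specific mechanism; a real deployment would inject the delay at a proxy or network layer rather than inside the callee.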
Jepsen is in principle a framework that anyone can use, but to the best of our knowledge all of the reported bugs discovered by Jepsen to date were discovered by its inventor, Kyle Kingsbury, who currently operates a "distributed systems safety research" consultancy.24 Applying Jepsen to a storage system requires that the superuser carefully read the system documentation, generate workloads, and observe the externally visible behaviors of the system under test. It is then up to the operator to choose—from the massive combinatorial space of "nemeses," including machine crashes and network partitions—those fault schedules that are likely to drive the system into returning incorrect responses.

A human in the loop is the kiss of death for systems that need to keep up with software evolution. Human attention should always be targeted at tasks that computers cannot do! Moreover, the specialists that Chaos and Jepsen testing require are expensive and rare. Here, we show how geniuses can be abstracted away from the process of failure testing.

We Don't Need Another Hero

Rapidly changing assumptions about our visibility into distributed system internals have made obsolete many if not all of the classic approaches to software quality, while emerging "chaos-based" approaches are fragile and unscalable because of their genius-in-the-loop requirement. We present our vision of automated failure testing by looking at how the same changing environments that hastened the demise of time-tested resiliency techniques can enable new ones. We argue the best way to automate the experts out of the failure-testing loop is to imitate their best practices in software, and show how the emergence of sophisticated observability infrastructure makes this possible.

The order is rapidly fadin'. For large-scale distributed systems, the three fundamental assumptions of traditional approaches to software quality are quickly fading in the rearview mirror. The first to go was the belief that you could rely on experts to solve the hardest problems in the domain. Second was the assumption that a formal specification of the system is available. Finally, any program analysis (broadly defined) that requires that source code is available must be taken off the table. The erosion of these assumptions helps explain the move away from classic academic approaches to resiliency in favor of the black box approaches described earlier.

What hope is there of understanding the behavior of complex systems in this new reality? Luckily, the fact that it is more difficult than ever to understand distributed systems from the inside has led to the rapid evolution of tools that allow us to understand them from the outside. Call-graph logging was first described by Google;51 similar systems are in use at Twitter,4 Netflix,1 and Uber,50 and the technique has since been standardized.43 It is reasonable to assume that a modern microservice-based Internet enterprise will already have instrumented its systems to collect call-graph traces. A number of startups that focus on observability have recently emerged.21,34 Meanwhile, provenance collection techniques for data processing systems11,22,42 are becoming mature, as are operating system-level provenance tools.44 Recent work12,55 has attempted to infer causal and communication structure of distributed computations from raw logs, bringing high-level explanations of outcomes within reach even for uninstrumented systems.

"Regarding testing distributed systems: Chaos Monkey, like they mention, is awesome, and I also highly recommend getting Kyle to run Jepsen tests."
—Commentator on HackerRumor

Away from the experts. While this quote is anecdotal, it is difficult to imagine a better example of the fundamental unscalability of the current state of the art. A single person cannot possibly keep pace with the explosion of distributed system implementations. If we can take the human out of this critical loop, we must; if we cannot, we should probably throw in the towel.

The first step to understanding how to automate any process is to comprehend the human component that we would like to abstract away. How do Chaos Engineers and Jepsen superusers apply their unique genius in practice? Here is the three-step recipe common to both approaches.

Step 1: Observe the system in action. The human element of the Chaos and Jepsen processes begins with principled observation, broadly defined. A Chaos Engineer will, after studying the external API of services relevant to a given class of interactions, meet with the engineering teams to better understand the details of the implementations of the individual services.25 To understand the high-level interactions among services, the engineer will then peruse call-graph traces in a trace repository.3 A Jepsen superuser typically begins by reviewing the product documentation, both to determine the guarantees that the system should uphold and to learn something about the mechanisms by which it does so. From there, the superuser builds a model of the behavior of the system based on interaction with the system's external API. Since the systems under study are typically data management and storage, these interactions involve generating histories of reads and writes.31 The first step to understanding what can go wrong in a distributed system is watching things go right: observing the system in the common case.

Step 2: Build a mental model of how the system tolerates faults. The common next step in both approaches is the most subtle and subjective. Once there is a mental model of how a distributed system behaves (at least in the common case), how is it used to help choose the appropriate faults to inject? At this point we are forced to dabble in conjecture: bear with us.

Fault tolerance is redundancy. Given some fixed set of faults, we say that a system is "fault tolerant" exactly if it operates correctly in all executions in which those faults occur. What does it mean to "operate correctly"? Correctness is a system-specific notion but, broadly speaking, is expressed in terms of properties that are either maintained throughout the system's execution (for example, system invariants or safety properties) or established during execution (for example, liveness properties). Most distributed systems with which we interact, though their executions may be unbounded, nevertheless provide finite, bounded interactions that have outcomes. For example, a broadcast protocol may run "forever" in a reactive system, but each broadcast delivered to all group members constitutes a successful execution. By viewing distributed systems in this way, we can revise the definition: A system is fault tolerant if it provides sufficient mechanisms to achieve its successful outcomes despite the given class of faults.

Step 3: Formulate experiments that target weaknesses in the façade. If we could understand all of the ways in which a system can obtain its good outcomes, we could understand which faults it can tolerate (or which faults it could be sensitive to). We assert that (whether they realize it or not!) the process by which Chaos Engineers and Jepsen superusers determine, on a system-by-system basis, which faults to inject uses precisely this kind of reasoning. A target experiment should exercise a combination of faults that knocks out all of the supports for an expected outcome.

Carrying out the experiments turns out to be the easy part. Fault injection infrastructure, much like observability infrastructure, has evolved rapidly in recent years. In contrast to random, coarse-grained approaches to distributed fault injection such as Chaos Monkey,23 approaches such as FIT (failure injection testing)17 and Gremlin32 allow faults to be injected at the granularity of individual requests with high precision.

Step 4: Profit! This process can be effectively automated. The emergence of sophisticated tracing tools described earlier makes it easier than ever to build redundancy models even from the executions of black box systems. The rapid evolution of fault injection infrastructure makes it easier than ever to test fault hypotheses on large-scale systems. Figure 1 illustrates how the automation described here fits neatly between existing observability infrastructure and fault injection infrastructure, consuming the former, maintaining a model of system redundancy, and using it to parameterize the latter. Explanations of system outcomes and fault injection infrastructures are already available. In the current state of the art, the puzzle piece that fits them together (models of redundancy) is a manual process. LDFI (as we will explain) shows that automation of this component is possible.

Figure 1. Our vision of automated failure testing.
Figure 2. Fault injection and fault-tolerant code.

A Blast from the Past

In previous work, we introduced a bug-finding tool called LDFI (lineage-driven fault injection).2 LDFI uses data provenance collected during simulations of distributed executions to build derivation graphs for system outcomes. These graphs function much like the models of system redundancy described earlier. LDFI then converts the derivation graphs into a Boolean formula whose satisfying assignments correspond to combinations of faults that invalidate all derivations of the outcome. An experiment targeting those faults will then either expose a bug (that is, the expected outcome fails to occur) or reveal additional derivations (for example, after a timeout, the system fails over to a backup) that can be used to enrich the model and constrain future solutions.
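The core of that encoding can be shown in miniature. The sketch below is a toy reconstruction, not LDFI's actual implementation: the fact names and the brute-force search are invented for illustration. It treats each derivation of an outcome as a set of supporting facts and searches for the smallest fault sets that "cut" every derivation—which is what the satisfying assignments of LDFI's Boolean formula represent.

```python
from itertools import combinations

# Toy redundancy model: the outcome "client got data" has two
# derivations, each a set of facts that must all hold.
derivations = [
    {"replicaA_up", "link_client_A"},   # read served by replica A
    {"replicaB_up", "link_client_B"},   # read served by replica B
]

# Facts the fault injector is allowed to falsify (crash a replica,
# partition a link).
injectable = {"replicaA_up", "replicaB_up", "link_client_A", "link_client_B"}

def invalidates_all(faults, derivations):
    # A fault set cuts a derivation if it falsifies at least one support.
    return all(d & faults for d in derivations)

def candidate_experiments(derivations, injectable, max_faults=2):
    """Smallest fault sets that invalidate every known derivation.

    Each returned set is a fault-injection experiment that, according to
    the current model, should make the good outcome impossible: running
    it either exposes a bug or reveals a new derivation (fallback path)
    that enriches the model."""
    for k in range(1, max_faults + 1):
        hits = [set(c) for c in combinations(sorted(injectable), k)
                if invalidates_all(set(c), derivations)]
        if hits:
            return hits
    return []
```

Here no single fault cuts both derivations, but any pair that hits both supports (for example, crashing both replicas) is a candidate experiment. A real implementation replaces the brute-force enumeration with a SAT or ILP solver over the Boolean encoding.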
  • 65. At its heart, LDFI reapplies well- understood techniques from data management systems, treating fault tolerance as a materialized view main- tenance problem.2,13 It models a dis- tributed system as a query, its expect- ed outcomes as query outcomes, and critical facts such as “replica A is up at time t” and “there is connectivity be- tween nodes X and Y during the inter- val i . . . j” as base facts. It can then ask a how-to query:37 What changes to base data will cause changes to the derived data in the view? The answers to this query are the faults that could, accord- ing to the current model, invalidate the expected outcomes. The idea seems far-fetched, but the LDFI approach shows a great deal of promise. The initial prototype demon- strated the efficacy of the approach at the level of protocols, identifying bugs in replication, broadcast, and commit protocols.2,46 Notably, LDFI reproduced a bug in the replication protocol used by the Kafka distributed log26 that was first (manually) identified by Kingsbury.30 A later iteration of LDFI is deployed at Netflix,1 where (much like the illustra- tion in Figure 1) it was implemented as a microservice that consumes traces from a call-graph repository service and provides inputs for a fault injection ser-
  • 66. vice. Since its deployment, LDFI has identified 11 critical bugs in user-fac- ing applications at Netflix.1 Rumors from the Future The prior research presented earlier is only the tip of the iceberg. Much work still needs to be undertaken to realize the vision of fully automated failure testing for distributed systems. Here, we highlight nascent research that shows promise and identifies new di- rections that will help realize our vision. Don’t overthink fault injection. In the context of resiliency testing for distribut- ed systems, attempting to enumerate and faithfully simulate every possible kind of fault is a tempting but dis- tracting path. The problem of under- standing all the causes of faults is not directly relevant to the target, which is to ensure that code (along with its configuration) intended to detect and mitigate faults performs as expected. Consider Figure 2: The diagram on the left shows a microservice-based architecture; arrows represent calls generated by a client request. The right-hand side zooms in on a pair of interacting services. The shaded box in the caller service represents the fault tolerance logic that is intended to detect and handle faults of the cal- lee. Failure testing targets bugs in this
  • 67. logic. The fault tolerance logic targeted in this bug search is represented as the shaded box in the caller service, while the injected faults affect the callee. The common effect of all faults, from the perspective of the caller, is explicit error returns, corrupted responses, and (possibly infinite) delay. Of these manifestations, the first two can be ad- equately tested with unit tests. The last is difficult to test, leading to branches of code that are infrequently executed. If we inject only delay, and only at com- ponent boundaries, we conjecture that we can address the majority of bugs re- lated to fault tolerance. Explanations everywhere. If we can provide better explanations of system outcomes, we can build better models The rapid evolution of fault injection infrastructure makes it easier than ever to test fault hypotheses on large-scale systems. 60 C O M M U N I C AT I O N S O F T H E A C M | J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1
  • 68. practice to embrace (rather than abstracting away) this uncertainty. Distributed systems are probabi- listic by nature and are arguably bet- ter modeled probabilistically. Future directions of work include the proba- bilistic representation of system re- dundancy and an exploration of how this representation can be exploited to guide the search of fault experiments. We encourage the research community to join in exploring alternative internal representations of system redundancy. Turning the explanations inside out. Most of the classic work on data provenance in database research has focused on aspects related to human- computer interaction. Explanations of why a query returned a particular result can be used to debug both the query and the initial database—given an un- expected result, what changes could be made to the query or the database to fix it? By contrast, in the class of systems we envision (and for LDFI concretely), explanations are part of the internal language of the reasoner, used to con- struct models of redundancy in order to drive the search through faults. Ideally, explanations should play a role in both worlds. After all, when a
  • 69. bug-finding tool such as LDFI identi- fies a counterexample to a correctness property, the job of the programmers has only just begun—now they must un- dertake the onerous job of distributed debugging. Tooling around debugging has not kept up with the explosive pace of distributed systems development. We continue to use tools that were de- signed for a single site, a uniform mem- ory, and a single clock. While we are not certain what an ideal distributed debug- ger should look like, we are quite certain that it does not look like GDB (GNU Proj- ect debugger).36 The derivation graphs used by LDFI show how provenance can also serve a role in debugging by provid- ing a concise, visual explanation of how the system reached a bad state. This line of research can be pushed further. To understand the root causes of a bug in LDFI, a human operator must review the provenance graphs of the good and bad executions and then examine the ways in which they differ. Intuitively, if you could abstractly subtract the (incomplete by assump- tion) explanations of the bad outcomes from the explanations of the good out- of redundancy. Unfortunately, a bar- rier to entry for systems such as LDFI is the unwillingness of software de- velopers and operators to instrument their systems for tracing or provenance
  • 70. collection. Fortunately, operating sys- tem-level provenance-collection tech- niques are mature and can be applied to uninstrumented systems. Moreover, the container revolution makes simulating distributed execu- tions of black box software within a single hypervisor easier than ever. We are actively exploring the collection of system call-level provenance from unmodified distributed software in order to select a custom-tailored fault injection schedule. Doing so requires extrapolating application-level causal structure from low-level traces, iden- tifying appropriate cut points in an observed execution, and finally syn- chronizing the execution with fault injection actions. We are also interested in the pos- sibility of inferring high-level explana- tions from even noisier signals, such as raw logs. This would allow us to relax the assumption that the systems un- der study have been instrumented to collect execution traces. While this is a difficult problem, work such as the Mystery Machine12 developed at Face- book shows great promise. Toward better models. The LDFI system represents system redundancy using derivation graphs and treats the task of identifying possible bugs as a
  • 71. materialized-view maintenance prob- lem. LDFI was hence able to exploit well-understood theory and mecha- nisms from the history of data man- agement systems research. But this is just one of many ways to represent how a system provides alternative computa- tions to achieve its expected outcomes. A shortcoming of the LDFI approach is its reliance on assumptions of de- terminism. In particular, it assumes that if it has witnessed a computation that, under a particular contingency (that is, given certain inputs and in the presence of certain faults), produces a successful outcome, then any future computation under that contingency will produce the same outcome. That is to say, it ignores the uncertainty in timing that is fundamental to distrib- uted systems. A more appropriate way to model system redundancy would be The container revolution makes simulating distributed executions of black-box software within a single hypervisor easier than ever.
  • 72. J A N U A R Y 2 0 1 8 | V O L . 6 1 | N O . 1 | C O M M U N I C AT I O N S O F T H E A C M 61 practice 36. Matloff, N., Salzman, P.J. The Art of Debugging with GDB, DDD, and Eclipse. No Starch Press, 2008. 37. Meliou, A., Suciu, D. Tiresias: The database oracle for how-to queries. Proceedings of the ACM SIGMOD International Conference on the Management of Data (2012), 337-348. 38. Microsoft Azure Documentation. Introduction to the fault analysis service, 2016; https://azure.microsoft. com/en-us/documentation/articles/ service-fabric- testability-overview/. 39. Musuvathi, M. et al. CMC: A pragmatic approach to model checking real code. ACM SIGOPS Operating Systems Review. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation 36 (2002), 75–88. 40. Musuvathi, M. et al. Finding and reproducing Heisenbugs in concurrent programs. In Proceedings of the 8th Usenix Conference on Operating Systems Design and Implementation (2008), 267–280. 41. Newcombe, C. et al. Use of formal methods at Amazon Web Services. Technical Report, 2014; http:// lamport.azurewebsites.net/tla/formal-methods- amazon.pdf. 42. Olston, C., Reed, B. Inspector Gadget: A framework for custom monitoring and debugging of distributed
  • 73. data flows. In Proceedings of the ACM SIGMOD International Conference on the Management of Data (2011), 1221–1224. 43. OpenTracing. 2016; http://opentracing.io/. 44. Pasquier, T.F. J.-M., Singh, J., Eyers, D.M., Bacon, J. CamFlow: Managed data-sharing for cloud services, 2015; https://arxiv.org/pdf/1506.04391.pdf. 45. Patterson, D.A., Gibson, G., Katz, R.H. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, 109–116; http://web.mit.edu/6.033/2015/wwwdocs/papers/ Patterson88.pdf. 46. Ramasubramanian, K. et al. Growing a protocol. In Proceedings of the 9th Usenix Workshop on Hot Topics in Cloud Computing (2017). 47. Reinhold, E. Rewriting Uber engineering: The opportunities microservices provide. Uber Engineering, 2016; https: //eng.uber.com/building-tincup/. 48. Saltzer, J. H., Reed, D.P., Clark, D.D. End-to-end arguments in system design. ACM Trans. Computing Systems 2, 4 (1984): 277–288. 49. Sandberg, R. The Sun network file system: design, implementation and experience. Technical report, Sun Microsystems. In Proceedings of the Summer 1986 Usenix Technical Conference and Exhibition. 50. Shkuro, Y. Jaeger: Uber’s distributed tracing system. Uber Engineering, 2017; https://uber.github.io/jaeger/.
  • 74. 51. Sigelman, B.H. et al. Dapper, a large-scale distributed systems tracing infrastructure. Technical report. Research at Google, 2010; https://research.google. com/pubs/pub36356.html. 52. Shenoy, A. A deep dive into Simoorg: Our open source failure induction framework. Linkedin Engineering, 2016; https://engineering.linkedin.com/blog/2016/03/ deep-dive-Simoorg-open-source-failure-induction- framework. 53. Yang, J. et al.L., Zhou, L. MODIST: Transparent model checking of unmodifed distributed systems. In Proceedings of the 6th Usenix Symposium on Networked Systems Design and Implementation (2009), 213–228. 54. Yu, Y., Manolios, P., Lamport, L. Model checking TLA+ specifications. In Proceedings of the 10th IFIP WG 10.5 Advanced Research Working Conference on Correct Hardware Design and Verification Methods (1999), 54–66. 55. Zhao, X. et al. Lprof: A non-intrusive request flow profiler for distributed systems. In Proceedings of the 11th Usenix Conference on Operating Systems Design and Implementation (2014), 629–644. Peter Alvaro is an assistant professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io). Severine Tymon is a technical writer who has written documentation for both internal and external users
  • 75. of enterprise and open source software, including for Microsoft, CNET, VMware, and Oracle. Copyright held by owners/authors. Publication rights licensed to ACM. $15.00. comes,10 then the root cause of the dis- crepancy would be likely to be near the “frontier” of the difference. Conclusion A sea change is occurring in the tech- niques used to determine whether distributed systems are fault tolerant. The emergence of fault injection ap- proaches such as Chaos Engineering and Jepsen is a reaction to the erosion of the availability of expert program- mers, formal specifications, and uni- form source code. For all of their prom- ise, these new approaches are crippled by their reliance on superusers who decide which faults to inject. To address this critical shortcom- ing, we propose a way of modeling and ultimately automating the process carried out by these superusers. The enabling technologies for this vision are the rapidly improving observabil- ity and fault injection infrastructures that are becoming commonplace in the industry. While LDFI provides con- structive proof that this approach is possible and profitable, it is only the beginning. Much work remains to be
  • 76. done in targeting faults at a finer grain, constructing more accurate models of system redundancy, and providing bet- ter explanations to end users of exactly what went wrong when bugs are identi- fied. The distributed systems research community is invited to join in explor- ing this new and promising domain. Related articles on queue.acm.org Fault Injection in Production John Allspaw http://queue.acm.org/detail.cfm?id=2353017 The Verification of a Distributed System Caitie McCaffrey http://queue.acm.org/detail.cfm?id=2889274 Injecting Errors for Fun and Profit Steve Chessin http://queue.acm.org/detail.cfm?id=1839574 References 1. Alvaro, P. et al. Automating failure-testing research at Internet scale. In Proceedings of the 7th ACM Symposium on Cloud Computing (2016), 17–28. 2. Alvaro, P., Rosen, J., Hellerstein, J.M. Lineage-driven fault injection. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2015), 331–346. 3. Andrus, K. Personal communication, 2016.
  • 77. 4. Aniszczyk, C. Distributed systems tracing with Zipkin. Twitter Engineering; https://blog.twitter.com/2012/ distributed-systems-tracing-with-zipkin. 5. Barth, D. Inject failure to make your systems more reliable. DevOps.com; http://devops.com/2014/06/03/ inject-failure/. 6. Basiri, A. et al. Chaos Engineering. IEEE Software 33, 3 (2016), 35–41. 7. Beyer, B., Jones, C., Petoff, J., Murphy, N.R. Site Reliability Engineering. O’Reilly, 2016. 8. Birrell, A.D., Nelson, B.J. Implementing remote procedure calls. ACM Trans. Computer Systems 2, 1 (1984), 39–59. 9. Chandra, T.D., Hadzilacos, V., Toueg, S. The weakest failure detector for solving consensus. J.ACM 43, 4 (1996), 685–722. 10. Chen, A. et al. The good, the bad, and the differences: better network diagnostics with differential provenance. In Proceedings of the ACM SIGCOMM Conference (2016), 115–128. 11. Chothia, Z., Liagouris, J., McSherry, F., Roscoe, T. Explaining outputs in modern data analytics. In Proceedings of the VLDB Endowment 9, 12 (2016): 1137–1148. 12. Chow, M. et al. The Mystery Machine: End-to-end performance analysis of large-scale Internet services. In Proceedings of the 11th Usenix Conference on
  • 78. Operating Systems Design and Implementation (2014), 217–231. 13. Cui, Y., Widom, J., Wiener, J.L. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Systems 25, 2 (2000), 179–227. 14. Dawson, S., Jahanian, F., Mitton, T. ORCHESTRA: A Fault Injection Environment for Distributed Systems. In Proceedings of the 26th International Symposium on Fault-tolerant Computing, (1996). 15. Fischer, M.J., Lynch, N.A., Paterson, M.S. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (1985): 374–382; https://groups.csail.mit. edu/tds/papers/Lynch/jacm85.pdf. 16. Fisman, D., Kupferman, O., Lustig, Y. On verifying fault tolerance of distributed protocols. In Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science 4963, Springer Verlag (2008). 315–331. 17. Gopalani, N., Andrus, K., Schmaus, B. FIT: Failure injection testing. Netflix Technology Blog; http:// techblog.netflix.com/2014/10/fit-failure-injection- testing.html. 18. Gray, J. Why do computers stop and what can be done about it? Tandem Technical Report 85.7 (1985); http://www.hpl.hp.com/techreports/ tandem/TR-85.7.pdf. 19. Gunawi, H.S. et al. FATE and DESTINI: A framework for cloud recovery testing. In Proceedings of the 8th Usenix Conference on Networked Systems Design
  • 79. and Implementation (2011), 238–252; http://db.cs. berkeley.edu/papers/nsdi11-fate-destini.pdf. 20. Holzmann, G. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003. 21. Honeycomb. 2016; https://honeycomb.io/. 22. Interlandi, M. et al. Titian: Data provenance support in Spark. In Proceedings of the VLDB Endowment 9, 33 (2015), 216–227. 23. Izrailevsky, Y., Tseitlin, A. The Netflix Simian Army. Netflix Technology Blog; http: //techblog.netflix. com/2011/07/ netflix-simian-army.html. 24. Jepsen. Distributed systems safety research, 2016; http://jepsen.io/. 25. Jones, N. Personal communication, 2016. 26. Kafka 0.8.0. Apache, 2013; https://kafka.apache. org/08/documentation.html. 27. Kanawati, G.A., Kanawati, N.A., Abraham, J.A. Ferrari: A flexible software-based fault and error injection system. IEEE Trans. Computers 44, 2 (1995): 248–260. 28. Kendall, S.C., Waldo, J., Wollrath, A., Wyant, G. A note on distributed computing. Technical Report, 1994. Sun Microsystems Laboratories. 29. Killian, C.E., Anderson, J.W., Jhala, R., Vahdat, A. Life, death, and the critical transition: Finding liveness bugs in systems code. Networked System Design and Implementation, (2007); 243–256.
  • 80. 30. Kingsbury, K. Call me maybe: Kafka, 2013; http:// aphyr.com/posts/293-call-me-maybe-kafka. 31. Kingsbury, K. Personal communication, 2016. 32. Lafeldt, M. The discipline of Chaos Engineering. Gremlin Inc., 2017; https://blog.gremlininc.com/the- discipline-of-chaos-engineering-e39d2383c459. 33. Lampson, B.W. Atomic transactions. In Distributed Systems—Architecture and Implementation, An Advanced Cours: (1980), 246–265; https://link. springer.com/chapter/10.1007%2F3-540-10571-9_11. 34. LightStep. 2016; http://lightstep.com/. 35. Marinescu, P.D., Candea, G. LFI: A practical and general library-level fault injector. In IEEE/IFIP International Conference on Dependable Systems and Networks (2009). Copyright of Communications of the ACM is the property of Association for Computing Machinery and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. International Journal of Performability Engineering Vol. 6, No.
6, November 2010, pp. 531-546. © RAMS Consultants, Printed in India

Successful Application of Software Reliability: A Case Study

NORMAN F. SCHNEIDEWIND, Fellow of the IEEE
2822 Raccoon Trail, Pebble Beach, California 93953, USA
(Received on July 30, 2009; revised on May 3, 2010)

Abstract: The purpose of this case study is to help readers implement or improve a software reliability program in their organizations, using a step-by-step approach based on the Institute of Electrical and Electronics Engineers (IEEE) and the American Institute of Aeronautics and Astronautics (AIAA) Recommended Practice for Software Reliability, released in June 2008, supported by a case study from the NASA Space Shuttle.
This case study covers the major phases that the software engineering practitioner needs in planning and executing a software reliability engineering program. These phases require a number of steps for their implementation. These steps provide a structured approach to the software reliability process. Each step will be discussed to provide a good understanding of the entire software reliability process. Major topics covered are: data collection, reliability risk assessment, reliability prediction, reliability prediction interpretation, testing, reliability decisions, and lessons learned from the NASA Space Shuttle software reliability engineering program.

Keywords: software reliability program, Institute of Electrical and Electronics Engineers and the American Institute of Aeronautics and Astronautics Recommended Practice for Software Reliability, NASA Space Shuttle application

1. Introduction

The IEEE/AIAA recommended practice provides a foundation on which
practitioners and researchers can build consistent methods [1]. This case study will describe the SRE process and show that it is important for an organization to have a disciplined process if it is to produce high-reliability software. To accomplish this purpose, an overview is presented of existing practice in software reliability, as represented by the recommended practice [1]. This will provide the reader with the foundation to understand the basic process of software reliability engineering (SRE). The Space Shuttle Primary Avionics Software Subsystem will be used to illustrate the SRE process. The reliability prediction models that will be used are based on some key definitions and assumptions, as follows:

Definitions

Interval: an integer time unit t of constant or variable length, defined by t-1 < t < t+1, where t > 0; failures are counted in intervals.

Number of Intervals: the number of contiguous integer time units t of constant or variable
length, represented by a positive real number.

Operational Increment (OI): a software system comprised of modules and configured from a series of builds to meet Shuttle mission functional requirements.

Time: continuous CPU execution time over an interval range.

Assumptions

1. Faults that cause failures are removed.
2. As more failures occur and more faults are corrected, remaining failures will be reduced.
3. The remaining failures are "zero" for those OIs that were executed for extremely long times (years) with no additional failure reports;
correspondingly, for these OIs, maximum failures equals total observed failures.

1.1 Space Shuttle Flight Software Application

The Shuttle software represents a successful integration of many of the computer industry's most advanced software engineering practices and approaches. Beginning in the late 1970s, this software development and maintenance project has evolved one of the world's most mature software processes, applying the principles of the highest levels of the Software Engineering Institute's (SEI) Capability Maturity Model (the software is rated Level 5 on the SEI scale) and ISO 9001 standards [2]. This software process includes state-of-the-practice software reliability engineering (SRE) methodologies. The goals of the recommended practice are to: interpret software reliability predictions, support verification and validation of the software, assess the risk of deploying the software, predict the reliability of the software, develop test strategies to
bring the software into conformance with reliability specifications, and make reliability decisions regarding deployment of the software. Reliability predictions are used by the developer to add confidence to a formal software certification process comprised of requirements risk analysis, design and code inspections, testing, and independent verification and validation. This case study uses the experience obtained from the application of SRE on the Shuttle project, because this application is judged by NASA and the developer to be a successful application of SRE [6]. These SRE techniques and concepts should be of value for other software systems.

1.2 Reliability Measurements and Predictions

There are a number of measurements and predictions that can be made of reliability to verify and validate the software. Among these are remaining failures, maximum failures, total test time required to attain a given fraction of remaining failures, and time to next failure. These have been shown to be useful measurements and predictions for: 1)
providing confidence that the software has achieved reliability goals; 2) rationalizing how long to test a software component (e.g., testing sufficiently long to verify that the measured reliability conforms to design specifications); and 3) analyzing the risk of not achieving remaining failures and time to next failure goals [6]. Having predictions of the extent to which the software is not fault free (remaining failures) and whether a failure is likely to occur during a mission (time to next failure) provides criteria for assessing the risk of deploying the software. Furthermore, fraction of remaining failures can be used as both an
can be divided into the following two categories, which are used in combination to help assure the desired level of reliability of the software in mission-critical systems like the Shuttle. The two categories are: 1) measurements and predictions that are associated with residual software faults and failures, and 2) measurements and predictions that are associated with the ability of the software to complete a mission without experiencing a failure of a specified severity. In the first category are: remaining failures, maximum failures, fraction of remaining failures, and total test time required to attain a given fraction of remaining failures. In the second category are: time to next failure and total test time required to attain a given time to next failure. In addition, there is the risk associated with not attaining the required remaining failures and time to next failure goals. Lastly, there is operational quality, which is derived from fraction of remaining failures. With this type of information, a software manager can determine whether more testing is warranted or
whether the software is sufficiently tested to allow its release or unrestricted use. These predictions provide a quantitative basis for achieving reliability goals [2].

1.3 Interpretations and Credibility

The two most critical factors in establishing credibility in software reliability predictions are the validation method and the way the predictions are interpreted. For example, a "conservative" prediction can be interpreted as providing an "additional margin of confidence" in the software reliability, if that predicted reliability already exceeds an established "acceptable level" or requirement. It may not be possible to validate predictions of the reliability of software precisely, but it is possible with "high confidence" to predict a lower bound on the reliability of that software within a specified environment. If historical failure data were available for a series of previous dates (and there is actual data for the failure history following those dates), it would be possible to compare
the predictions to the actual reliability and evaluate the performance of the model. Taking this approach will significantly enhance the credibility of predictions among those who must make software deployment decisions based on the predictions [9].

1.4 Verification and Validation

Software reliability measurement and prediction are useful approaches to verify and validate software. Measurement refers to collecting and analyzing data about the observed reliability of software, for example, the occurrence of failures during test. Prediction refers to using a model to forecast future software reliability, for example, failure rate during operation. Measurement also provides the failure data that is used to estimate the parameters of reliability models (i.e., make the best fit of the model to the observed failure data). Once the parameters have been estimated, the model is used to predict the future reliability of the software. Verification ensures that the software product, as it exists in a given project phase, satisfies the conditions imposed in the