Why Bad Things Happen to Good Projects

KAREN MACKEY, Lotus Development Corporation

At the start of a distributed software project, have you ever felt that you could finally "do it right the first time"? You have all your plans laid out, a super team, and you're primed to use the latest software technologies and development methodologies. But telling yourself you're going to "do it right" can be dangerous: You are likely to set yourself up for failure because of unforeseen complications.[1] Furthermore, if you share your optimism with your managers and they build business schedules around it, you're likely to both lose their trust and put your job or enterprise at risk.
If you're building a relatively complex system involving multiple computers and multiple users, and if the system entails significant innovation, such as new technology or expanded scale, something will inevitably go wrong. Realizing this might encourage you to use both design and user-interface prototypes[2,3] and the spiral model of development[4] so that you can look ahead and assess risks as you go. Before you start a major project, you need to understand what problems can affect it, even if you use the best available techniques and methods.
IEEE SOFTWARE 0740-7459/96/$05.00 © 1996 IEEE
This article characterizes two possible pitfalls: the Quality-Capacity Syndrome and the Missing-Tools Crisis. Since both have occurred with some frequency in unrelated software-development projects, they appear to be independent of individual and management skills, factors responsible for a large variance in project success.[5] Likewise, since the affected projects followed reasonably good development practices, the appearance of these problems serves to underscore that serious problems can occur even in the best-intentioned multiuser, distributed-application development project.
To examine these two pitfalls, I pasted together a fictitious development project called GEMS, the Greatest Electronic Mail Systems. GEMS is a composite of real projects that experienced the Quality-Capacity Syndrome and the Missing-Tools Crisis. Managers of the various projects were so sensitive about their projects' problems that the only way they would release the information was for me to create a composite project. This sensitivity underscores the difficulty within our industry of having open discussions about lessons learned, while the problems I describe under the guise of GEMS emphasize the need for such discussions.
The creators of GEMS wanted to create a uniform user interface for an electronic mail service in a heterogeneous environment comprising IBM and Amdahl mainframes and DEC minicomputers. The existing mail service was an internal system developed by the company's computer services department. It provided mail service across the different systems, but on each system the mail command behaved differently. Also, because each system had unique software, it was difficult to maintain the software and add features. The developers and maintainers of the existing system decided to create a replacement. They were going to do it again, and they were going to do it right.
The developers designed the new system top-down. First they found out what the users needed, and then they developed requirements. They worked from an understanding of the problem to the design of a solution, rather than conversely. The developers employed functional decomposition, carefully separating the transport facility and the mail-handling functions. They also used modularization[6] within the confines of decomposition to encapsulate related functionality. The design decomposed functions in a layered style, with complex layers built on top of more rudimentary layers, rising to a crescendo of sophisticated user capabilities at the highest layer. Furthermore, since the existing mail system was working, the team had time for a thoughtful design.
The design goal was to maximize portability between the different systems. A standard mail interface buffered the GEMS software from the idiosyncrasies of different operating systems. By using a single high-level language that had compilers on the various systems, the developers could port much of the software. The portability in turn helped improve maintainability, because a bug in the portable portion of the code had to be fixed only once.
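The buffering layer described above can be pictured as a thin abstraction: portable code calls a fixed interface, and one small adapter is written per host. This is a minimal sketch; the class and method names (`MailSystemInterface`, `spool_message`, and so on) are illustrative inventions, not the actual GEMS design.

```python
from abc import ABC, abstractmethod

class MailSystemInterface(ABC):
    """Hypothetical standard mail interface: portable code calls only
    these methods; one small adapter hides each host OS's quirks."""

    @abstractmethod
    def spool_message(self, sender: str, recipients: list, body: str) -> str:
        """Persist a message to the local queue; return a queue ID."""

    @abstractmethod
    def deliver(self, queue_id: str) -> bool:
        """Hand a spooled message to the local transport; True on success."""

class MinicomputerAdapter(MailSystemInterface):
    """Illustrative adapter for one host type (say, a DEC minicomputer).
    An in-memory dict stands in for the host's real spool area."""

    def __init__(self):
        self.queue = {}
        self.next_id = 0

    def spool_message(self, sender, recipients, body):
        self.next_id += 1
        qid = f"q{self.next_id}"
        self.queue[qid] = (sender, recipients, body)
        return qid

    def deliver(self, queue_id):
        # Delivery succeeds only if the message is still spooled.
        return self.queue.pop(queue_id, None) is not None
```

A bug fixed in code written against `MailSystemInterface` is fixed once for every host, which is exactly the maintainability payoff the text describes.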
The developers designed in many fault-recovery features: Point-to-point protocols detected errors and initiated retransmissions; end-to-end protocols resent mail if network hardware failed during delivery; the system automatically restarted if the code failed; if the destination host was down, a queue held the mail; the mail queue was crash-resistant and audited at every restart to protect mail integrity. In summary, the developers were doing things right. In fact, they considered GEMS a really neat project.
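A crash-resistant queue audited at every restart might work as in the following sketch, which journals each operation to disk and rebuilds the pending set on startup. The append-only journal format and all names are assumptions for illustration, not the real GEMS implementation.

```python
import json
import os

class CrashSafeQueue:
    """Sketch of a mail queue journaled to disk so that an audit at
    restart can rebuild state and re-queue undelivered messages."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.pending = self._audit()

    def _audit(self):
        # Restart audit: replay the journal, keeping messages that were
        # enqueued but never marked delivered.
        pending = {}
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["op"] == "enqueue":
                        pending[rec["id"]] = rec["msg"]
                    elif rec["op"] == "delivered":
                        pending.pop(rec["id"], None)
        return pending

    def _log(self, rec):
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(rec) + "\n")
            f.flush()
            os.fsync(f.fileno())  # so the record survives a crash

    def enqueue(self, msg_id, msg):
        self._log({"op": "enqueue", "id": msg_id, "msg": msg})
        self.pending[msg_id] = msg

    def mark_delivered(self, msg_id):
        self._log({"op": "delivered", "id": msg_id})
        self.pending.pop(msg_id, None)
```

The audit-on-restart step is what protects mail integrity: after a crash, a fresh instance reads the journal and finds exactly the messages that still need delivery.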
GEMS was originally deployed on a three-node pilot system. As problems arose, the team solved them. In general, the users were quite happy. Because the system seemed to be working just fine, the developers added five more hosts to the pilot system and made near-term plans to expand GEMS to 80 to 100 more hosts. That's when both the Quality-Capacity Syndrome and the Missing-Tools Crisis struck, and GEMS took a critical turn for the worse: The mail wasn't being delivered and the system was unresponsive. Users couldn't tell whether their individual systems were locked up or the mail software was just taking a long time. Users were no longer happy, and the system administrators faced a major quandary.
QUALITY-CAPACITY SYNDROME
Quality means how well a system is working. In the GEMS context, it corresponded to the inverse of the number of recovery actions per time interval. Quality problems like software bugs, hardware failures, and resource contention would trigger the fault-recovery features built into GEMS. Capacity refers to how much work a system can do, which in this case meant the number of messages GEMS delivered per time interval. Measures taken to increase quality directly reduced capacity.
The Quality-Capacity Syndrome has three symptoms:
+ With a light load, a system performs well.
+ As system load increases, system quality falls sharply.
+ Even if the quality problems are solved, the capacity remains unsatisfactory.
GEMS's system quality was good at a low mail volume. As the number of messages increased, the system had lots of retransmissions and code restarts. However, the main problem was that system capacity fell rapidly as traffic increased. Message queues backed up and had to be processed at night. System capacity seemed to be limited by the implementation, not just the bugs.
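Both measures defined above are cheap to instrument. The sketch below applies the article's definitions directly (quality as the inverse of recovery actions per interval, capacity as deliveries per interval); the counter names are invented.

```python
class QualityCapacityMeter:
    """Quality = inverse of recovery actions per time interval;
    capacity = messages delivered per time interval."""

    def __init__(self):
        self.recoveries = 0   # retransmissions, restarts, queue audits
        self.delivered = 0

    def record_recovery(self):
        self.recoveries += 1

    def record_delivery(self):
        self.delivered += 1

    def report(self, interval_hours):
        recoveries_per_hour = self.recoveries / interval_hours
        return {
            # A flawless interval gets "infinite" quality.
            "quality": (1.0 / recoveries_per_hour
                        if recoveries_per_hour else float("inf")),
            "capacity": self.delivered / interval_hours,
        }
```

Tracking both numbers from the pilot onward would have shown quality falling and capacity flattening well before the 80-to-100-host expansion was planned.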
Causes. Why did GEMS develop the Quality-Capacity Syndrome? Two things contributed to the problem: the design/development methodology and the pilot project.

GEMS was inherently complex, so it was essential to use a good design/development methodology. The use of modularity, clean interfaces, common functions, a high-level language, and a general-purpose operating system did result in less efficient code. However, once the designers understood where the inefficiencies were, they could improve efficiency fairly quickly. Poorly designed software would not have permitted such easy efficiency improvements.
The GEMS design methods encouraged functional partitioning and localization of concerns. Partitioning allowed the designers to break the project into development tasks that could proceed in parallel. However, because all developers were assigned a specific partition, there were no system generalists who understood thoroughly how the pieces fit together. Without this perspective, it was very difficult to debug across interfaces or boundaries. For example, when a bug surfaced between the transport mechanism and the mail-handling layer, the two teams assigned to those layers wasted a lot of time chasing the bug back and forth across the interface. Unfortunately, a lot of demoralizing blaming and finger-pointing accompanied the chase. The project needed someone who understood the interface from both sides.
The pilot project also contributed to the Quality-Capacity Syndrome. Both prototypes and pilot projects are important to developing large, complex systems. In general, these are positive activities. In this case, however, a demonstration of functionality early on gave a false impression of progress. The GEMS pilot succeeded in getting a message to go between system A and system B, so the developers broke out the champagne and moved up the schedule. The problem was that they, and even worse, their bosses, thought the project was further along than it really was.

The focus of the pilot project was to demonstrate feasibility and usability, not test failure legs. With all the fault-recovery features in GEMS, the latter was a sizable task. The developers simply got caught up in the pilot project's success and failed to assess accurately what essential testing they still needed to do.
Treatment. How do you treat the Quality-Capacity Syndrome? The GEMS crew applied the Hass Cure, named after R.J. Hass of AT&T Bell Laboratories, who first characterized, named, and treated the syndrome. This "cure" assumes that the underlying system architecture is in fact capable of handling the desired capacity. Without this guarantee, no amount of work can overcome the limitations.

The Hass Cure has four steps:
1. Stabilize the system.
2. Separate the quality and capacity concerns, which work at cross-purposes.
3. Address quality problems.
4. Address capacity problems.

To stabilize the system, you reduce the load to the level at which the system runs well, then take baseline measurements. To improve quality, reduce and control change so you can identify the sources of problems. To improve capacity, seek out changes that will have a large effect. Neither teams nor people can focus effectively on both thrusts at the same time. Thus, you can either have a single development team that deals first with quality issues and then with capacity, or if you have sufficient staff you can form a group to address quality and one to address capacity. In either case, you'll need systems generalists to complete a timely cure.

The initial focus of the cure should be to improve quality until the system can handle an acceptable load. For GEMS, this meant reducing the number of restarts, rather than raising the throughput level. Once this was accomplished, the team could look for ways to improve capacity. The developers discovered a traffic pattern in GEMS in which 70 percent of the messages passed through one host. Blocking the text in the message transfers through this host increased the transfer rate by two to five times. This improved the overall system capacity and successfully cured the Quality-Capacity Syndrome.
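The blocking fix works by amortizing per-transfer overhead: many small message texts are grouped into one larger transfer through the hub host. This back-of-the-envelope sketch, with purely illustrative cost figures, shows how such blocking can produce the severalfold speedup the team observed.

```python
import math

def transfer_cost(n_messages, per_transfer_overhead, per_byte_cost, avg_bytes):
    """Cost of sending n messages individually through the hub host."""
    return n_messages * (per_transfer_overhead + per_byte_cost * avg_bytes)

def blocked_transfer_cost(n_messages, per_transfer_overhead,
                          per_byte_cost, avg_bytes, block_size):
    """Cost when message texts are blocked into groups of block_size:
    the fixed overhead is paid once per block, not once per message."""
    n_blocks = math.ceil(n_messages / block_size)
    return (n_blocks * per_transfer_overhead
            + per_byte_cost * avg_bytes * n_messages)
```

With an assumed 50 ms per-transfer overhead, 0.01 ms per byte, 1,000-byte messages, and blocks of 10, moving 1,000 messages costs roughly 60,000 ms unblocked versus 15,000 ms blocked, about a 4x improvement, consistent with the two-to-five-times gain reported.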
MISSING-TOOLS CRISIS
When the Quality-Capacity Syndrome struck GEMS, the system administrators faced a major crisis that they were ill-equipped to handle. Although the GEMS team had plenty of development, debugging, and testing tools, their administration tools were totally inadequate. In addition to the quality-capacity crisis, administrators were faced with the crisis of missing tools.
The Missing-Tools Crisis has four characteristics:
+ A major problem, such as bugs, hardware failures, or a full-fledged Quality-Capacity Syndrome, illuminates the tool deficit.
+ The system lacks adequate monitoring and control tools.
+ Administrators lack adequate tools to change the software in the deployed system.
+ The existing system administrative procedures and tools do not scale up adequately.
In GEMS, the Missing-Tools Crisis emerged in the wake of the Quality-Capacity Syndrome. GEMS notified users of the success or failure of mail delivery, but offered them no window to watch their mail progress through the system. Even system administrators had no way to monitor what was going on and had no tools to take corrective action. If a problem arose, system restart was the main recourse.
One illustration of the importance of monitoring and control tools involved a clever adaptive algorithm for routing messages around failed components. When a memory overwrite confused GEMS, it responded by looping messages around the network. This looping was detected not by system administrators, who had no tools for this sort of surveillance, but by frustrated users who never received delivery notification. To reinitialize and resynchronize the system, administrators had to bring down the entire system and restart it. Monitoring and control tools could have prevented such a drastic measure.
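One cheap surveillance measure for exactly this failure is a hop count carried on each message, the same defense mail transports later standardized against routing loops. The threshold, field names, and alert mechanism below are assumptions for illustration.

```python
MAX_HOPS = 25  # illustrative threshold, in the spirit of SMTP hop limits

def forward(message, next_host, alert):
    """Increment the hop count on every forwarding decision. A message
    circling the network trips the threshold and raises an alert for
    the administrators instead of looping forever."""
    message["hops"] = message.get("hops", 0) + 1
    if message["hops"] > MAX_HOPS:
        alert(f"possible routing loop: {message['id']} "
              f"has made {message['hops']} hops")
        return None  # drop, or divert to an administrator queue
    message["route"].append(next_host)
    return message
```

A few lines like these would have surfaced the memory-overwrite loop to administrators directly, rather than through frustrated users, and without a full system restart.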
To overcome the Quality-Capacity Syndrome, the GEMS team needed to integrate changes into the running software quickly. Using the standard procedure, this integration was slow. At one point, a message with a bad address slipped through GEMS defenses and blocked the message queues. It brought the system to its knees for a week, despite the fact that the developers understood the problem and provided a solution within a day. They needed a more responsive way to insert changes into the system.
In place of an administrative interface designed for the new system, GEMS had different ad hoc tools on each host system. Even the logged messages generated by GEMS software were messages for debugging rather than management, but they were all that administrators had to work with. Furthermore, the administrators managed the system using one terminal per host. With the three-node pilot, this management strategy was possible. With the eight-node network, the task became cumbersome. With a projected addition of 80 to 100 nodes, it would be impossible. The makeshift administrative interface simply did not scale up.
Causes. Why did the system develop the Missing-Tools Crisis? How could developers have overlooked such critical needs? Again, the two main contributors to the problem were the design/development methodology and the pilot project. Actually, it was more the "religion" of the methodology than the methodology itself: People did not consider monitoring and control tools and the need for making corrective changes because they felt the system was going to work correctly. The GEMS designers thought the system would be totally automated and self-correcting; they never foresaw the need for humans to manage the system.
Also, many system designers had little or no system-administration experience. In many companies and universities, software engineers have their own personal computers and get little exposure to distributed-systems management. They are genuinely naive, and understandably so. Few career paths lead through systems administration into development, except in small enterprises where the developer does both. Even worse, a class distinction sometimes exists between system administrators and developers, which inhibits rapport and sensitivity to the management side of distributed systems.
Another unfortunate influence on the design/development methodology came from the focus infused into the development process: The user functions justified the funding for the GEMS project. With this orientation, support functions got slighted; they were unimportant until the users grew more sophisticated and demanded better performance.
Yet another influence came from
assumptions about the users. When
GEMS was designed, developers envi-
sioned small, single-page messages
going between users. This was in the
early days of e-mail, before its usefulness for file transfer was established,
and developers didn’t anticipate a user
sending a 3-Mbyte message and the
impact it would have on the system.
Because they failed to imagine things a
user could do with the system, devel-
opers provided no means to monitor
and control them.
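Once an interface for it exists, a guard against the unanticipated 3-Mbyte message is only a few lines of admission control. The size limits and three-way policy below are purely illustrative assumptions about what such a control might look like.

```python
MAX_INLINE_BYTES = 64 * 1024  # illustrative policy limit per message

def admit(message_bytes, defer_queue, reject):
    """Hypothetical admission control: bound the work any single
    message can impose on the shared transport. Small messages go
    straight through, large ones wait for off-peak delivery, and
    extreme ones are rejected with an explanation."""
    size = len(message_bytes)
    if size <= MAX_INLINE_BYTES:
        return "deliver"
    if size <= 50 * MAX_INLINE_BYTES:
        defer_queue.append(message_bytes)  # deliver off-peak
        return "deferred"
    reject(f"message of {size} bytes exceeds policy limit")
    return "rejected"
```

The point is not the particular thresholds but that the designers needed a monitoring-and-control seam where a policy like this could be attached once the unexpected usage appeared.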
The pilot project also helped bring on the Missing-Tools Crisis by, again, giving a false sense of progress. The pilot project focused on user functions rather than administration. It did make some sense not to build elaborate management tools before the feasibility of user functions was proven. Also, since the pilot focused on small-scale feasibility, management needs were not obvious. Scaling up the project uncovered the need for administrative interfaces and management tools.
Treatment. Treating the Missing-Tools Crisis is straightforward:
+ retrofit an interface for monitoring and control tools,
+ create quick-change tools and procedures so developers can make code corrections quickly, and
+ have the developers both use and manage the system they have built.
GEMS developers had to modify the system design to accommodate monitoring and control features, as the developers of many network systems, including DECnet, IBM SNA, and the ISO OSI model, have done before them. The advantage of incorporating tools after the system is running is that developers have a better idea of what administrative tools the system needs and how users are likely to use the system. If the software is well designed, adding this interface can be relatively easy. The work the GEMS designers put into their original design paid off here: the solid design allowed them to add the tools fairly quickly.
Many arguments show how much more expensive it is to make changes late in the development process than to get it right the first time.[7] However, even though GEMS developers were extremely careful and followed a good development methodology, they still had to make late-stage or post-deployment code changes. Large, complex systems can require several rounds of corrections, so every system should incorporate quick-change tools.
A few years ago, a former student of mine went on a job interview. At that time, the academic community had just fully embraced structured coding as a useful software-engineering technique. Wanting to make sure that the company he might work for was forward-looking and used up-to-date software practices, the student asked the interviewer if they did structured coding. The interviewer said yes, they structured their code with a block of assembly code followed by a block of reserved memory called a patch area. If they needed to insert a change, they could zap out the offending code, branch down to the corrected code loaded into the reserved block, then branch back in-line.
When he related the story in class, we all had a good, self-righteous laugh at the interviewer's old-fashioned definition of structured coding. However, as I realized later, we missed the usefulness of this "antiquated" technique for quickly fixing problems in deployed systems. We still have much to learn from solutions out of our pre-structured-coding and pre-object-oriented past, even though newer software-engineering practices might alter their exact form.
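In a long-running service written in a dynamic language, the moral equivalent of the patch area is rebinding a faulty routine to corrected code without restarting the process. The routines below are invented for illustration; the bad-address bug echoes the one that blocked the GEMS queues.

```python
def parse_address(raw):
    """Stand-in for the faulty deployed routine: it assumes every
    address contains an '@' and misbehaves on malformed input."""
    return raw.split("@")

def parse_address_fixed(raw):
    """The corrected code, prepared separately and loaded later."""
    if "@" not in raw:
        return None  # reject the bad address instead of choking the queue
    return raw.split("@")

def apply_patch(namespace, name, replacement):
    """Minimal quick-change tool: record the old binding so the patch
    can be backed out, then 'branch' all callers to the new code by
    rebinding the name they call through."""
    old = namespace[name]
    namespace[name] = replacement
    return old
```

The mechanics differ from zapping assembly into a reserved block, but the capability is the same: a correction reaches the running system in minutes, not after a week of queue blockage.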
Finally, having developers use what
they build has gained widespread
acceptance in the software-engineer-
ing community as a way of improving
understanding of user needs.
However, system administration often
remains an unexplored perspective.
Managing the system for a period of
time definitely enhanced GEMS
developers’ awareness of the need for
administrative tools.
DOING THINGS BETTER
To avoid the Quality-Capacity Syndrome, G. Scott Graham suggested the following steps in a University of Santa Cruz seminar:
+ Build a simple analytic performance model early in the design process, even as early as during system definition, and improve it as the design progresses.
+ Build a picture of logical resource use or resource demand. For example, determine how many disk accesses a routine might make.
+ Tie logical resource use to physical resource use. For example, tie the disk accesses to the actual disk-access time.
+ Use the analytic model to identify the most-used modules, then optimize those modules.
+ Conduct a walkthrough explicitly to review the design for performance.
+ Design into the system an interface to capture data that measures quality and capacity.
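Graham's first three steps amount to a basic operational-analysis model: multiply each logical demand by its physical cost, and let the busiest resource bound the throughput. The sketch below illustrates the arithmetic with invented numbers; it is a rough planning model, not a substitute for measurement.

```python
def throughput_bound(logical_demand, physical_cost_s):
    """Operational-analysis bound: the busiest resource caps system
    throughput at 1 / max over resources of (ops per message * seconds
    per op).

    logical_demand:  {resource: operations per message}
    physical_cost_s: {resource: seconds per operation}
    """
    demand_s = {r: logical_demand[r] * physical_cost_s[r]
                for r in logical_demand}
    bottleneck = max(demand_s, key=demand_s.get)
    return bottleneck, 1.0 / demand_s[bottleneck]

# Illustrative numbers: each message makes 6 disk accesses at 25 ms
# each and uses 40 ms of CPU, so the disk is the bottleneck and the
# system cannot exceed roughly 6.7 messages per second.
bottleneck, max_rate = throughput_bound(
    {"disk": 6, "cpu": 1},
    {"disk": 0.025, "cpu": 0.040},
)
```

A model this small, built during system definition and refined as the design progresses, would have predicted GEMS's capacity ceiling long before the pilot expanded.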
Frequently, capacity and performance goals get shelved during development. After the system is built, we push it off a cliff and see if it flies. It would be better to keep capacity goals in mind during design/development and to get performance feedback all along. Conducting an explicit performance walkthrough and designing in appropriate data-collection mechanisms are tangible activities that will elevate the design's performance aspects. To improve capacity, you can form a capacity group that follows behind the implementation group. Using the system model's numeric assumptions and analysis as the starting point, the group can measure the actual capacity and improve it.
Finally, an absolute necessity for avoiding or treating the Quality-Capacity Syndrome is to assign one or more developers the role of systems generalists. You must identify and cultivate systems generalists throughout the development process.
To avoid the Missing-Tools Crisis, try the following suggestions:
+ Design interfaces into the system that support monitoring and control tools. You don't have to build all the envisioned monitoring and control tools at the onset of the project, but at least design a control scheme and build in the appropriate interfaces, with extra attention to manual overrides of clever adaptive algorithms.
+ Have the developers manage the system prior to deployment. It's especially helpful to study the administration procedures in light of the scale of the final deployment.
+ Develop the tools and procedures to support quick changes. A vital online system will undoubtedly need them.
+ Conduct a systems-administration walkthrough, and include actual administrators. Also, bring in the people who will be using the system so they can share their perspectives.

The developers of multiuser distributed systems are particularly vulnerable to the pitfalls of the Quality-Capacity Syndrome and the Missing-Tools Crisis. What makes them so deadly is that they tend to occur together just as the project is nearing completion, putting the schedule in jeopardy. The Quality-Capacity Syndrome teaches us to start both performance modeling and model validation early. The Missing-Tools Crisis teaches us to consider the administration and management of the system under development.

Perhaps the best lesson learned from this experience is that we should beware of relying too heavily on our ability to "do it right." There's always a pitfall waiting to educate us.
REFERENCES
1. F.P. Brooks, Jr., The Mythical Man-Month, Addison-Wesley, Reading, Mass., 1975.
2. L. Bernstein, "Get the Design Right!," IEEE Software, Sept. 1993, pp. 61-63.
3. H. Ledgard, Software Engineering Concepts, Addison-Wesley, Reading, Mass., 1987.
4. B.W. Boehm, "A Spiral Model of Software Development and Enhancement," Computer, May 1988, pp. 61-72.
5. T. DeMarco and T. Lister, Peopleware, Dorset House, New York, 1987.
6. D.L. Parnas, "On the Criteria to Be Used in Decomposing Systems into Modules," Comm. ACM, Vol. 15, No. 12, Dec. 1972, pp. 1,053-1,058.
7. B.W. Boehm, Software Engineering Economics, Prentice-Hall, Upper Saddle River, N.J., 1981.
Karen Mackey is a development manager at Lotus Development Corporation, a subsidiary of IBM. Previously, she was a software engineer and manager at TRW and AT&T Bell Laboratories. Mackey received a PhD in computer science from Pennsylvania State University, University Park. She is a member of the IEEE Computer Society, ACM, and Silicon Valley SPIN.

Address questions about this article to Mackey at kmackey@best.com.