Software Implemented Fault Tolerance:
Technologies and Experience
Yennun Huang and Chandra Kintala
AT&T Bell Laboratories
Murray Hill, NJ 07974
Abstract


By software implemented fault tolerance, we mean a set of software facilities to detect and recover from faults that are not handled by the underlying hardware or operating system. We consider those faults that cause an application process to crash or hang; they include software faults as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. We define 4 levels of software fault tolerance based on availability and data consistency of an application in the presence of such faults. Watchd, libft and nDFS are reusable components that provide up to the 3rd level of software fault tolerance. They perform, respectively, automatic detection and restart of failed processes, periodic checkpointing and recovery of critical volatile data, and replication and synchronization of persistent data in an application software system. These modules have been ported to a number of UNIX(1) platforms and can be used by any application with minimal programming effort.

Some newer telecommunications products in AT&T have already enhanced their fault-tolerance capability using these three components. Experience with those products to date indicates that these modules provide efficient and economical means to increase the level of fault tolerance in a software product. The performance overhead due to these components depends on the level and varies from 0.1% to 14% based on the amount of critical data being checkpointed and replicated.

(1) UNIX is a registered trademark of UNIX System Laboratories, Inc.

1 Introduction

There are increasing demands to make the software in the applications we build today more tolerant to faults. From the users' point of view, fault tolerance has two dimensions: availability and data consistency of the application. For example, users of telephone switching systems demand continuous availability, whereas bank teller machine customers demand the highest degree of data consistency[13]. Most other applications have lower requirements for fault tolerance in both dimensions, but the trend is to increase those degrees as the costs, performance, technologies and other engineering considerations permit; see Figure 1 below. In this paper, we discuss three cost-effective software technologies to raise the degree of fault tolerance in both dimensions of the application.

[Figure 1: Dimensions of Fault Tolerance. The two axes are availability and data consistency; applications such as bank teller machines and telephone switching systems are placed according to their requirements.]

Following Cristian[7], we consider software applications that provide a "service" to clients. The applications in turn use the services provided by the underlying operating or database systems, which in turn use the computing and network communication services provided by the underlying hardware; see Figure 2. Tolerating faults in such applications involves detecting a failure, gathering knowledge about the failure and recovering from that failure. Traditionally, these fault tolerance actions are performed in the hardware, operating or database systems used in the underlying layers of the application software. Hardware fault tolerance is provided using Duplex, Triple-Module-Redundancy or other techniques[19]. Fault tolerance in the operating and database layers is often provided using replicated file systems[22], exception handling[23], disk shadowing[6], transaction-based checkpointing and recovery[18], and other system routines. These methods and technologies handle faults occurring in the underlying hardware, operating and database system layers only.

[Figure 2: Layers of Fault Tolerance. Application software sits above operating and database system facilities such as fault-tolerant DBMSs, which in turn sit above hardware techniques such as duplex and TMR.]

An increasing number of faults, however, occur in the application software layer, causing the application processes to crash or hang. A process is said to be crashed if the working process image is no longer present in the system. A process is said to be hung if the process image is alive and its entry is still present in the process table, but the process is not making any progress from a user's point of view. Such software failures arise from incorrect program designs or coding errors but, more often than not, they arise from transient and nondeterministic errors[9]; for example, unsatisfactory boundary value conditions, timing and race conditions in the underlying computing and communication services, performance failures, etc.[7]. Due to the complex and temporal nature of the interleaving of messages and computations in a distributed system, no amount of verification, validation and testing will eliminate all those software faults in an application and give complete confidence in the availability and data consistency of that application. So, those faults will occasionally manifest themselves and cause the application process to crash or hang.

It is possible to detect a crash and restart an application at a checkpointed state through operating system facilities, as in IBM's MVS[24]. In their paper on End-to-End Arguments[21], Saltzer et al. claim that such hardware and operating system based methods to detect and recover from software failures are necessarily incomplete. They show that fault tolerance cannot be complete without the knowledge and help of the endpoints of an application, i.e., the application software. We claim that such methods, i.e. services at a lower layer detecting and recovering from failures at a higher layer, are also inefficient. For example, file replication on a mirrored disk through a facility in the operating system will be less efficient than replicating only the "critical" files of the application in the application layer, since the operating system has no internal knowledge of that application. Similarly, generalized checkpointing schemes in an operating system checkpoint the entire in-memory data of an application, whereas application-assisted methods will checkpoint only the critical data[17, 3].

A common but misleading argument against embedding checkpointing, recovery and other fault tolerance schemes inside an application is that such schemes are not transparent, efficient or reliable because they are coded by application programmers; we claim that well-tested and efficient fault tolerance methods can be built as libraries or reusable software components executing in the application layer and that they provide as much transparency as the other methods do. All three components discussed in this paper have those properties, i.e. they are efficient, reliable and transparent.

The above observations and the End-to-End argument[21] lead to our notion of software fault tolerance as defined below:

    Software fault tolerance is a set of software facilities to detect and recover from faults that cause an application process to crash or hang and that are not handled by the underlying hardware or operating system.

Observe that this includes software faults as described earlier as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. Thus, if the underlying hardware and operating system are not fault-tolerant in an application system due to performance/cost trade-offs or other engineering considerations, then that system can increase its availability more cost effectively through software fault tolerance as described in this paper. It is also an easier migration path for making existing applications more fault-tolerant.
2 Model

For simplicity in the following discussions, we consider only client-server based applications running in a local or wide-area network of computers in a distributed system; the discussion also applies to other kinds of applications. Each application has a server process executing at the user level on top of a vendor-supplied hardware and operating system. To get services, clients send messages to the server; the server process acts on those messages one by one and, in each of those message processing steps, updates its data. We sometimes call the server process the application. For fault tolerance purposes, the nodes in the distributed system will be viewed as being in a circular configuration such that each node will be a backup node for its left neighbor in that circular list. As shown in Figure 3, each application will be executing primarily on one of the nodes in the network, called the primary node for that application. Each executing application has process text (the compiled code), volatile data (variables, structures, pointers and all the bytes in the static and dynamic memory segments of the process image) and persistent data (the application files being referred to and updated by the executing process).

[Figure 3: Modified Primary Site Approach. The application, its volatile data (checkpointed through libft) and its persistent data reside on the primary node; checkpoints and replicated persistent data are kept on the backup node.]

We use a modified primary-site approach to software fault tolerance[1]. In the primary site approach, the service to be made fault tolerant is replicated at many nodes, one of which is designated as primary and the others as backups. All the requests for the service are sent to the primary site. The primary site periodically checkpoints its state on the backups. If the primary fails, one of the backups takes over as primary. This model for fault tolerance has been analyzed for frequency of checkpointing, degree of service replication and the effect on response time by Huang and Jalote[11, 12]. This model was slightly modified, as described below, to build the three technologies described in this paper. The tasks in our modification of the primary site approach are:

• a watchdog process running on the primary node watching for application crashes or hangs,

• a watchdog process running on the backup node watching for primary node crashes,

• periodically checkpointing the critical volatile data in the application,

• logging of client messages to the application,

• replicating the application's persistent data and making them available on the backup node,

• when the application on the primary node crashes or hangs, restarting the application, if possible, on the primary node, otherwise, on the backup node,

• recovering the application to the last checkpointed state and reexecuting the message log, and

• connecting the replicated files to the backup node if the application restarts on the backup.

Observe that these software fault tolerance tasks can be used in addition to other methods such as N-version programming[2] or recovery blocks[20] inside an application program. Observe also that the application process on the backup node will not be running until it is started by the watchdog process; this is unlike the process-pair model[9], where the backup process will be passively running even during normal operations.

The degree to which the above software fault tolerance tasks are used in an application determines the availability and data consistency of that application. It is, therefore, useful to establish a classification of the different levels of software fault tolerance. We define the following 4 levels based on our experience in AT&T. Applications illustrating these levels are described in Section 4.
Level 0: No tolerance to faults in the application software:
In this level, when the executing application process dies or hangs, it has to be manually restarted from an initial internal state, i.e. the initial values of the volatile data. The application may leave its persistent data in an incorrect or inconsistent state due to the timing of the crash and may take a long time to restart due to elaborate initialization procedures.

Level 1: Automatic detection and restart:
When the application dies or hangs, the fault will be detected and the application will be restarted from an initial internal state on the same processor, if possible, or on a backup processor if available. In this level, the internal state of the application is not saved and, hence, the process will restart at the initial internal state. As stated above, restart along with reinitialization will be slow. The restarted internal state may not reflect all the messages that have been processed in the previous execution and, therefore, may not be consistent with the persistent data. The difference between Levels 0 and 1 is that the detection and restart are automatic in Level 1, and therefore, the application availability is higher in Level 1 than in Level 0.

Level 2: Level 1 plus periodic checkpointing, logging and recovery of internal state:
In addition to what is available in Level 1, the internal state of the application process is periodically checkpointed, i.e. the critical volatile data is saved, and the messages to the server are logged. After a failure is detected, the application is restarted at the most recent checkpointed internal state and the logged messages are reprocessed to bring the application to the state at which it crashed. The application availability and volatile data consistency are higher in Level 2 than in Level 1; a minimal code sketch of this checkpoint-and-replay cycle is given after the description of Level 4.

Level 3: Level 2 plus persistent data recovery:
In addition to what is available in Level 2, the persistent data of the application is replicated on a backup disk connected to a backup node, and is kept consistent with the primary server throughout the normal operation of the application. In case of a fault and the resulting recovery of the application on the backup node, the backup disk brings the application's persistent data as close as possible to the state at which the application crashed. The data consistency of the application in Level 3 is higher than that in Level 2.

Level 4: Continuous operation without any interruption:
This level of fault tolerance in software guarantees the highest degree of availability and data consistency. Often, this is provided, such as in switching systems, using replicated processing of the application on "hot" spare hardware. The state of a process need not be saved, but multicast messaging, voting and other mechanisms must be used to maintain consistency and concurrency control. Availability of the system during planned interruptions, such as those during upgrades, is made possible using dynamic loading or other operating system facilities. The technologies we describe in this paper do not provide this level of fault tolerance.
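To make the Level 2 mechanics concrete, the following is a minimal, self-contained sketch of the checkpoint-and-replay cycle: critical state is checkpointed periodically, incoming messages are logged before being processed, and a restart reloads the last checkpoint and replays the log. The file names, message format and helper functions are invented for this illustration; this is not libft, which is described in Section 3.

    /* Illustrative Level-2 recovery skeleton: checkpoint a counter and
     * replay a message log on restart.  All names are invented for this
     * sketch; a real application would save its own critical data. */
    #include <stdio.h>

    static long state = 0;                      /* the "critical volatile data" */

    static void save_checkpoint(void) {
        FILE *f = fopen("state.ckp", "w");
        if (f) { fprintf(f, "%ld\n", state); fclose(f); }
        remove("msg.log");                      /* log is truncated after a checkpoint */
    }

    static void recover(void) {
        long m;
        FILE *f = fopen("state.ckp", "r");
        if (f) { if (fscanf(f, "%ld", &state) != 1) state = 0; fclose(f); }
        f = fopen("msg.log", "r");              /* replay logged messages */
        if (f) { while (fscanf(f, "%ld", &m) == 1) state += m; fclose(f); }
    }

    static void process(long m) {
        FILE *log = fopen("msg.log", "a");      /* log before processing */
        if (log) { fprintf(log, "%ld\n", m); fclose(log); }
        state += m;                             /* the actual service work */
    }

    int main(void) {
        long m, n = 0;
        recover();                              /* restart at last checkpointed state */
        while (scanf("%ld", &m) == 1) {         /* stand-in for receiving client messages */
            process(m);
            if (++n % 10 == 0)
                save_checkpoint();              /* periodic checkpoint */
        }
        save_checkpoint();
        printf("state = %ld\n", state);
        return 0;
    }

If the process is killed and restarted while messages are still arriving, recover() reconstructs the state it had reached, which is exactly the behavior Level 2 promises.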

3 Technologies

Many applications perform some of these software fault tolerance features by coding them directly in their programs. We developed three reusable components - watchd, libft and nDFS - to embed those features in any application with minimal programming effort.

3.1 Watchd

Watchd is a watchdog daemon process that runs on a single machine or on a network of machines. It continually watches the life of a local application process by periodically sending a null signal to the process and checking the return value to detect whether that process is alive or dead. It detects whether that process is hung or not by using one of the following two methods. The first method sends a null message to the local application process using IPC (Inter-Process Communication) facilities on the local node and checks the return value. If it cannot make the connection, it waits for some time (specified by the application) and tries again. If it fails after the second attempt, it interprets this to mean that the process is hung. The second method asks the application process to send a heartbeat message to watchd periodically, and watchd periodically checks the heartbeat. If the heartbeat message from the application is not received by a specified time, watchd will assume that the application is hung. This implies that watchd cannot differentiate between hung processes and very slow processes.
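The liveness test described above corresponds to the standard UNIX null signal, kill(pid, 0), which performs error checking without delivering a signal. The fragment below is a hedged sketch of such a polling loop, not watchd's source code; the heartbeat file, process id and intervals are invented.

    /* Illustration of watchdog-style liveness and hang detection on UNIX.
     * Assumption: the monitored process records a heartbeat by touching a
     * well-known file; names and intervals are invented for this sketch. */
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    int process_alive(pid_t pid) {
        /* kill() with signal 0 performs error checking only: the process
           is gone when the call fails with ESRCH. */
        return kill(pid, 0) == 0 || errno != ESRCH;
    }

    int process_hung(const char *heartbeat_file, time_t max_silence) {
        struct stat st;
        if (stat(heartbeat_file, &st) != 0)
            return 1;                           /* no heartbeat at all */
        return (time(NULL) - st.st_mtime) > max_silence;
    }

    int main(void) {
        pid_t pid = 1234;                       /* pid of the watched process (illustrative) */
        for (;;) {
            if (!process_alive(pid))
                printf("process %d crashed: restart it\n", (int)pid);
            else if (process_hung("/tmp/app.heartbeat", 30))
                printf("process %d appears hung: restart it\n", (int)pid);
            sleep(10);                          /* poll every 10 seconds */
        }
        return 0;
    }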

When it detects that the application process has crashed or hung, watchd recovers that application at an initial internal state or at the last checkpointed state. The application is recovered on the primary node if that node has not crashed, otherwise on the backup node for the primary as specified in a configuration file. If libft is also used, watchd sets the restarted application to process all the logged messages from the log file generated by libft. Watchd also watches one neighboring watchd (left or right) in a circular fashion to detect node failures; this circular arrangement is similar to the adaptive distributed diagnosis algorithm[5]. When a node failure is detected, watchd can execute user-defined recovery commands and reconfigure the network. Observe that neighboring watchds cannot differentiate between node failures and link failures. In general, this is the problem of attaining common knowledge in the presence of communication failures, which is provably unsolvable[10].

Watchd also watches itself. A self-recovery mechanism is built into watchd in such a way that it can recover itself from an unexpected software failure. Watchd also facilitates restarting a failed process, restoring the saved values and reexecuting the logged events, and provides facilities for remote execution, remote copy, distributed election, and status report production.

3.2 Libft

Libft is a user-level library of C functions that can be used in application programs to specify and checkpoint critical data, recover the checkpointed data, log events, locate and reconnect a server, do exception handling, do N-version programming (NVP), and use recovery block techniques.

Libft provides a set of functions (e.g. critical()) to specify critical volatile data in an application. Those critical data items are allocated in a reserved region of the virtual memory and are periodically checkpointed. Values in critical data structures are saved using memory copy functions, thus avoiding traversal of application-dependent data structures. When an application does a checkpoint, its critical data will be saved on the primary and backup nodes. Unlike other checkpointing methods[17], the overhead in our checkpointing mechanism is minimized by saving only critical data and avoiding data-structure traversals. This idea of saving only critical data in an application is analogous to the Recovery Box concept in Sprite[3].
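The paper names critical() and checkpoint() but does not give their signatures, so the following sketch only illustrates the mechanism described: each critical item is registered by address and size, and a checkpoint copies those bytes out directly instead of traversing application data structures. The function names are invented and are not the libft API.

    /* Illustration of checkpointing only registered critical data with
     * plain memory copies.  Names are invented for this sketch. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_CRITICAL 32

    static struct { void *addr; size_t len; } critical_tab[MAX_CRITICAL];
    static int ncritical;

    static void declare_critical(void *addr, size_t len) {
        critical_tab[ncritical].addr = addr;
        critical_tab[ncritical].len = len;
        ncritical++;
    }

    static void save_checkpoint(const char *file) {
        FILE *f = fopen(file, "wb");
        if (!f) return;
        for (int i = 0; i < ncritical; i++)
            fwrite(critical_tab[i].addr, 1, critical_tab[i].len, f);
        fclose(f);
    }

    static void restore_checkpoint(const char *file) {
        FILE *f = fopen(file, "rb");
        if (!f) return;
        for (int i = 0; i < ncritical; i++)
            if (fread(critical_tab[i].addr, 1, critical_tab[i].len, f)
                    != critical_tab[i].len)
                break;
        fclose(f);
    }

    int main(void) {
        static int call_count;                  /* critical volatile data */
        static double balances[100];
        declare_critical(&call_count, sizeof call_count);
        declare_critical(balances, sizeof balances);

        restore_checkpoint("app.ckp");          /* no-op on the first run */
        call_count++;
        balances[0] += 1.0;
        save_checkpoint("app.ckp");             /* byte copies, no traversal */
        printf("call_count = %d\n", call_count);
        return 0;
    }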
Libft provides functions (e.g. getsvrloc(), getsvrport(), ftconnect() and ftbind()) for clients to locate servers and reconnect to servers in a network environment. The exception handling, NVP and recovery block facilities are implemented using C macros and standard C library functions. These facilities can be used by any application without changing the underlying operating system or adding new C preprocessors.
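As a generic illustration of the locate-and-reconnect idea behind getsvrloc() and ftconnect() (whose actual signatures are not given here), a client can re-resolve the server's current location and reconnect whenever its connection drops. The lookup_server() helper, address and port below are invented for this sketch.

    /* Generic locate-and-reconnect client loop; not the libft API. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Pretend name-service lookup: returns the server's current address. */
    static int lookup_server(struct sockaddr_in *addr) {
        memset(addr, 0, sizeof *addr);
        addr->sin_family = AF_INET;
        addr->sin_port = htons(7000);           /* illustrative port */
        return inet_pton(AF_INET, "127.0.0.1", &addr->sin_addr) == 1 ? 0 : -1;
    }

    static int connect_to_server(void) {
        struct sockaddr_in addr;
        if (lookup_server(&addr) != 0)
            return -1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    int main(void) {
        int fd = connect_to_server();
        while (fd >= 0) {
            char buf[128];
            ssize_t n = read(fd, buf, sizeof buf);
            if (n <= 0) {                       /* server died or moved */
                close(fd);
                sleep(1);
                fd = connect_to_server();       /* locate it again and reconnect */
                continue;
            }
            fwrite(buf, 1, (size_t)n, stdout);
        }
        return 0;
    }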
Libft also provides ftread() and ftwrite() functions to automatically log messages. When the ftread() function is called by a process under normal conditions, the data will be read from a channel and automatically logged in a file. The logged data will then be duplicated and logged by the watchd daemon on a backup machine. The replication of logged data is necessary for a process to recover from a primary machine failure. When the ftread() function is called by a process which is recovering from a failure, the input data will be read from the log file before any data is read from a regular input channel. Similarly, the ftwrite() function logs output data before it is sent out. The output data is also duplicated and logged by the watchd daemon on a backup machine. The log files created by the ftread() and ftwrite() functions are truncated after a checkpoint() function is successfully executed. Using the functions checkpoint(), ftread() and ftwrite(), one can implement either a sender-based or a receiver-based logging and recovery scheme[14]. There is a slight possibility that some messages may get lost during the automatic restart procedure. If this is a concern to an application, an additional message synchronization mechanism can be built into the application to check and retransmit lost messages.
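The logging behavior described for ftread() can be pictured with the following sketch: during normal operation every message read from a channel is appended to a log before it is processed, and during recovery the log is consumed before the real channel is read again. The wrapper and file names are invented; this is not libft's implementation, and a real system would also truncate the log after each successful checkpoint.

    /* Sketch of read-side message logging and replay; not libft code. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static FILE *msg_log;       /* opened in append mode during normal operation */
    static FILE *replay_log;    /* opened for reading during recovery, else NULL */

    ssize_t read_logged(int fd, void *buf, size_t count) {
        if (replay_log) {                               /* recovery: replay first */
            size_t got = fread(buf, 1, count, replay_log);
            if (got > 0)
                return (ssize_t)got;
            fclose(replay_log);                         /* log exhausted: */
            replay_log = NULL;                          /* fall through to the channel */
        }
        ssize_t n = read(fd, buf, count);               /* normal read */
        if (n > 0 && msg_log) {
            fwrite(buf, 1, (size_t)n, msg_log);         /* log before processing */
            fflush(msg_log);
        }
        return n;
    }

    int main(void) {
        replay_log = fopen("input.log", "rb");          /* non-NULL when recovering */
        msg_log = fopen("input.log", "ab");
        char buf[256];
        ssize_t n;
        while ((n = read_logged(0, buf, sizeof buf)) > 0)   /* stdin stands in for a socket */
            fwrite(buf, 1, (size_t)n, stdout);              /* "process" the message */
        return 0;
    }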
Speed and portability are primary concerns in implementing libft. The libft checkpoint mechanism is not fully transparent to programmers as in the Condor system[16]. However, libft does not require a new language, a new preprocessor or complex declarations and computations to save data structures[9]. The sacrifice of transparency for speed has proven to be useful in some projects that adopted libft. The installation of libft doesn't require any change to a UNIX-based operating system; it has been ported to several platforms.

Watchd and libft separate the fault detection and volatile data recovery facilities from the application functions. They provide those facilities as reusable components which can be combined with any application to make it fault tolerant. Since the messages received at the server site are logged and only the server process is recovered in this scheme, the consistency problems that occur in recovering multiple processes[14] are not issues in this implementation.

3.3 nDFS

The multi-dimensional file system, nDFS[8], is based on 3DFS[15] and provides facilities for replication of critical data. It allows users to specify and replicate critical files on backup file systems in real time. The implementation of nDFS uses the dynamic shared library mechanism to intercept file system calls and propagate those calls to the backup file systems. nDFS is built on top of UNIX file systems, so its use requires no change to the underlying file system. Speed, robustness and replication transparency are the primary design goals of nDFS.
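The call-interception technique referred to above can be illustrated with a preloaded shared library that interposes on write(): the library looks up the real write() with dlsym(RTLD_NEXT, ...) and could propagate the data to a backup file system before forwarding the call. The sketch below only illustrates that technique; it is not nDFS source.

    /* Interposed write(): compiled into a shared object and preloaded (for
     * example with LD_PRELOAD), it is called in place of the C library's
     * write().  A replicating library would forward (fd, buf, count) to a
     * backup file system here before calling the real write(); this sketch
     * only forwards the call. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/types.h>
    #include <unistd.h>

    ssize_t write(int fd, const void *buf, size_t count) {
        static ssize_t (*real_write)(int, const void *, size_t);
        if (!real_write)
            real_write = (ssize_t (*)(int, const void *, size_t))
                             dlsym(RTLD_NEXT, "write");
        /* ... propagate the write to the backup file system here ... */
        return real_write(fd, buf, count);
    }

On systems that support preloading, such a library is built with something like cc -shared -fPIC -o intercept.so intercept.c -ldl and activated by setting LD_PRELOAD before starting the application, so that the application needs no changes.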
The implementation of nDFS uses watchd and libft for fault detection and fast recovery. A failure of the underlying replication mechanism (a software failure) or a crash of a backup file system is transparent to applications and users. A failed software component is detected and recovered immediately by watchd; a crashed backup file system, after it is repaired, can catch up with the primary file system without interrupting or slowing down applications using the primary file system.

4 Experience

Fault tolerance in some of the newer telecommunications network management products in AT&T has been enhanced using watchd, libft and nDFS. Experience with those products to date indicates that these technologies are indeed economical and effective means to increase the level of fault tolerance in application software. The performance overhead due to these components depends on the level of fault tolerance, the amount of critical volatile data being checkpointed, the frequency of checkpointing, and the amount of persistent data being replicated. It varies from 0.1% to 14%. We describe some of those products to illustrate the availability, flexibility and efficiency of providing software fault tolerance through these 3 components. To protect the proprietary information of those products, we use generic terms and titles in the descriptions.

Level 1: Failure detection and restart using watchd: Application C uses watchd to check the "liveness" of some service daemon processes in C at 10-second intervals. When any of those processes fails, i.e. crashes or hangs, watchd restarts that process at its initial state. It took 2 people 3 hours to embed and configure watchd for this level of fault tolerance in application C. One potential use of this kind of fault tolerance would be in general purpose local area computing environments for stateless network services such as the lpr, finger or inetd daemons. Providing higher levels of fault tolerance in those services would be unnecessary.

Level 2: Failure detection, checkpointing, restart and recovery using watchd and libft: Application N maintains a certain segment of the telephone call routing information on a Sun server; maintenance operators use workstations running N's client processes communicating with N's server process using sockets. The server process in N was crashing or hanging for unknown reasons. During such failures, the system administrators had to manually bring back the server process, but they could not do so immediately because of the UNIX delay in cleaning up the socket table. Moreover, the maintenance operators had to restart client interactions from an initial state. Replacing the server node with fault tolerant hardware would have increased their capital and development costs by a factor of 4. Even then, all their problems would not have been solved; for example, saving the client states of interactions. Using watchd and libft, system N is now able to tolerate such failures. Watchd also detects primary server failures and restarts the server on the backup node. Location transparency is obtained using getsvrloc() and getsvrport() calls in the client programs and ftbind() in the server program. Libft's checkpoint and recovery mechanisms are used to save and recover all critical data. Checkpointing and recovery overheads are below 2%. Installing and integrating the two components into the application took 2 people 3 days.

Level 3: Failure detection, checkpointing, replication, restart and recovery using watchd, libft and nDFS: Application D is a real-time telecommunication network element currently being developed. In addition to the previous requirements for fault tolerance, this product needed to get its persistent files on-line immediately after a failure recovery on a backup node. During normal operations on the primary server, nDFS replicates all the critical persistent files on a backup server with an expected overhead of less than 14%. When the primary server fails, watchd starts application D on the backup node and automatically connects it to the backup disk on which the persistent files were replicated.

4.1 Other Possible Uses

The three software components, watchd, libft and nDFS, can be used not only to increase the level of fault tolerance in an application, as described above, but also to aid in other operations unrelated to fault tolerance, as described below.

• Overcoming persistent errors in software: Some errors in software are simply persistent, i.e. non-debuggable, due to the complexity and transient nature of the interactions and events in a distributed system[4]. Such errors sometimes do not reappear after the server process is restarted[9]; watchd, in those instances, can be used to bring the server process back up without clients noticing the failure and restart. After restart and restoration of the checkpointed state, message logs can be replayed in the order they originally arrived at the server or, if needed, in a different order[25]. Reordering the message logs sometimes eliminates transient errors due to "boundary" conditions.

• On-line upgrading of software: One can install a new version of software for an application without interrupting the service provided by the older version. This can be done by first loading the new version on the backup node, simulating a fault on the primary and then letting watchd dynamically move the service location to the backup node. This method assumes that the two versions are compatible at the level of the application's client-server protocol.

• Using checkpoint states and message logs for debugging distributed applications: In libft, all the checkpointed states, i.e. the values of the critical data, and the message logs can optionally be saved in a journal file. This journal can be used to aid in analyzing failures in distributed applications.

5 Summary

We identified some of the dimensions of fault tolerance and defined a role, a taxonomy and tasks for software fault tolerance based on the availability and data consistency requirements of an application. We then described three software components, watchd, libft and nDFS, to perform these tasks. These three components are flexible, portable and reusable; they can be embedded in any UNIX-based application software to provide different levels of fault tolerance with minimal programming effort. Experience in using these three components in some telecommunication products has shown that these components indeed increase the level of fault tolerance with acceptable increases in performance overhead.

Acknowledgments

Many thanks to Lawrence Bernstein, who suggested defining levels for fault tolerance, provided leadership to transfer this technology rapidly and encouraged using these components in a wide range of AT&T products and services. The authors have benefited from discussions, contributions and comments from several colleagues, particularly Rao Arimilli, David Belanger, Marilyn Chiang, Glenn Fowler, Kent Fuchs, Pankaj Jalote, Robin Knight, David Korn, Herman Rao and Yi-Min Wang.

References

[1] P. A. Alsberg and J. D. Day, "A Principle for Resilient Sharing of Distributed Services," Proceedings of the 2nd Intl. Conf. on Software Engineering, pp. 562-570, October 1976.

[2] A. Avizienis, "The N-Version Approach to Fault-Tolerant Software," IEEE Trans. on Software Engineering, SE-11, No. 12, pp. 1491-1501, Dec. 1985.

[3] M. Baker and M. Sullivan, "The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment," Proceedings of Summer '92 USENIX, pp. 31-43, June 1992.

[4] L. Bernstein, "On Software Discipline and the War of 1812," ACM Software Engineering Notes, p. 18, Oct. 1992.

[5] R. Bianchini, Jr. and R. Buskens, "An Adaptive Distributed System-Level Diagnosis Algorithm and Its Implementation," Proceedings of the 21st IEEE Conference on Fault-Tolerant Computing Systems (FTCS), pp. 222-229, July 1991.

[6] D. Bitton and J. Gray, "Disk Shadowing," Proceedings of the 14th Conference on Very Large Data Bases, pp. 331-338, Sept. 1988.

[7] F. Cristian, "Understanding Fault-Tolerant Distributed Systems," Communications of the ACM, Vol. 34, No. 2, pp. 56-78, February 1991.

[8] G. S. Fowler, Y. Huang, D. G. Korn and H. Rao, "A User-Level Replicated File System," to be presented at the Summer 1993 USENIX Conference, June 1993.

[9] J. Gray and D. P. Siewiorek, "High-Availability Computer Systems," IEEE Computer, Vol. 24, No. 9, pp. 39-48, September 1991.

[10] J. Y. Halpern and Y. Moses, "Knowledge and Common Knowledge in a Distributed Environment," Journal of the ACM, Vol. 37, No. 3, pp. 549-587, July 1990.

[11] Y. Huang and P. Jalote, "Analytic Models for the Primary Site Approach to Fault-Tolerance," Acta Informatica, Vol. 26, pp. 543-557, 1989.

[12] Y. Huang and P. Jalote, "Effect of Fault Tolerance on Response Time - Analysis of the Primary Site Approach," IEEE Transactions on Computers, Vol. 41, No. 4, pp. 420-428, April 1992.

[13] M. Iwama, D. Stan and G. Zimmerman, personal correspondence, July 1992.

[14] P. Jalote, "Fault Tolerant Processes," Distributed Computing, Vol. 3, pp. 187-195, 1989.

[15] D. G. Korn and E. Krell, "A New Dimension for the Unix File System," Software Practice and Experience, Vol. 20, Supplement 1, pp. S1/19-S1/34, June 1990.

[16] M. Litzkow, M. Livny and M. Mutka, "Condor - A Hunter of Idle Workstations," Proceedings of the 8th Intl. Conf. on Distributed Computing Systems, IEEE Computer Society Press, June 1988.

[17] J. Long, W. K. Fuchs and J. A. Abraham, "Compiler-Assisted Static Checkpoint Insertion," Proceedings of the 22nd IEEE Conference on Fault-Tolerant Computing Systems (FTCS), pp. 58-65, July 1992.

[18] A. Nangia and D. Finkel, "Transaction-based Fault-Tolerant Computing in Distributed Systems," Proceedings of the 1992 IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, pp. 92-97, July 1992.

[19] D. K. Pradhan (ed.), Fault-Tolerant Computing: Theory and Techniques, Vols. 1 and 2, Prentice-Hall, 1986.

[20] B. Randell, "System Structure for Software Fault Tolerance," IEEE Trans. on Software Engineering, SE-1, No. 2, pp. 220-232, June 1975.

[21] J. H. Saltzer, D. P. Reed and D. D. Clark, "End-To-End Arguments in System Design," ACM Transactions on Computer Systems, Vol. 2, No. 4, pp. 277-288, November 1984.

[22] M. Satyanarayanan, "Coda: A Highly Available File System for a Distributed Workstation Environment," IEEE Trans. on Computers, Vol. C-39, pp. 447-459, April 1990.

[23] S. K. Shrivastava (ed.), Reliable Computer Systems, Chapter 3, Springer-Verlag, 1985.

[24] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems Design and Implementation, Chapter 7, Digital Press, 1992.

[25] Y. M. Wang, Y. Huang and W. K. Fuchs, "Progressive Retry for Software Error Recovery in Distributed Systems," to be presented at this FTCS-23 Conference, June 1993.

More Related Content

What's hot

Ch 4 software engineering
Ch 4 software engineeringCh 4 software engineering
Ch 4 software engineering
Mohammed Romi
 
Ch8.testing
Ch8.testingCh8.testing
Ch20 systems of systems
Ch20 systems of systemsCh20 systems of systems
Ch20 systems of systems
software-engineering-book
 
Ch2 sw processes
Ch2 sw processesCh2 sw processes
Ch2 sw processes
software-engineering-book
 
Distributed Software Engineering with Client-Server Computing
Distributed Software Engineering with Client-Server ComputingDistributed Software Engineering with Client-Server Computing
Distributed Software Engineering with Client-Server Computing
Haseeb Rehman
 
Ch11 reliability engineering
Ch11 reliability engineeringCh11 reliability engineering
Ch11 reliability engineering
software-engineering-book
 
Ch1-Software Engineering 9
Ch1-Software Engineering 9Ch1-Software Engineering 9
Ch1-Software Engineering 9Ian Sommerville
 
Ch13-Software Engineering 9
Ch13-Software Engineering 9Ch13-Software Engineering 9
Ch13-Software Engineering 9Ian Sommerville
 
Ian Sommerville, Software Engineering, 9th Edition Ch1
Ian Sommerville,  Software Engineering, 9th Edition Ch1Ian Sommerville,  Software Engineering, 9th Edition Ch1
Ian Sommerville, Software Engineering, 9th Edition Ch1
Mohammed Romi
 
Ch11-Software Engineering 9
Ch11-Software Engineering 9Ch11-Software Engineering 9
Ch11-Software Engineering 9Ian Sommerville
 
Ch18 service oriented software engineering
Ch18 service oriented software engineeringCh18 service oriented software engineering
Ch18 service oriented software engineering
software-engineering-book
 
Ch15-Software Engineering 9
Ch15-Software Engineering 9Ch15-Software Engineering 9
Ch15-Software Engineering 9Ian Sommerville
 
Ch17 distributed software engineering
Ch17 distributed software engineeringCh17 distributed software engineering
Ch17 distributed software engineering
software-engineering-book
 
Ch2-Software Engineering 9
Ch2-Software Engineering 9Ch2-Software Engineering 9
Ch2-Software Engineering 9Ian Sommerville
 
Software Security Engineering
Software Security EngineeringSoftware Security Engineering
Software Security Engineering
Muhammad Asim
 
Intro softwareeng
Intro softwareengIntro softwareeng
Intro softwareengPINKU29
 
10. Software testing overview
10. Software testing overview10. Software testing overview
10. Software testing overview
ghayour abbas
 
Ch9 evolution
Ch9 evolutionCh9 evolution
SE18_Lec 01_Introduction to Software Engineering
SE18_Lec 01_Introduction to Software EngineeringSE18_Lec 01_Introduction to Software Engineering
SE18_Lec 01_Introduction to Software Engineering
Amr E. Mohamed
 

What's hot (20)

Ch 4 software engineering
Ch 4 software engineeringCh 4 software engineering
Ch 4 software engineering
 
Ch8.testing
Ch8.testingCh8.testing
Ch8.testing
 
Ch20 systems of systems
Ch20 systems of systemsCh20 systems of systems
Ch20 systems of systems
 
Ch2 sw processes
Ch2 sw processesCh2 sw processes
Ch2 sw processes
 
Distributed Software Engineering with Client-Server Computing
Distributed Software Engineering with Client-Server ComputingDistributed Software Engineering with Client-Server Computing
Distributed Software Engineering with Client-Server Computing
 
Ch11 reliability engineering
Ch11 reliability engineeringCh11 reliability engineering
Ch11 reliability engineering
 
Ch1-Software Engineering 9
Ch1-Software Engineering 9Ch1-Software Engineering 9
Ch1-Software Engineering 9
 
Ch13-Software Engineering 9
Ch13-Software Engineering 9Ch13-Software Engineering 9
Ch13-Software Engineering 9
 
Ian Sommerville, Software Engineering, 9th Edition Ch1
Ian Sommerville,  Software Engineering, 9th Edition Ch1Ian Sommerville,  Software Engineering, 9th Edition Ch1
Ian Sommerville, Software Engineering, 9th Edition Ch1
 
A2
A2A2
A2
 
Ch11-Software Engineering 9
Ch11-Software Engineering 9Ch11-Software Engineering 9
Ch11-Software Engineering 9
 
Ch18 service oriented software engineering
Ch18 service oriented software engineeringCh18 service oriented software engineering
Ch18 service oriented software engineering
 
Ch15-Software Engineering 9
Ch15-Software Engineering 9Ch15-Software Engineering 9
Ch15-Software Engineering 9
 
Ch17 distributed software engineering
Ch17 distributed software engineeringCh17 distributed software engineering
Ch17 distributed software engineering
 
Ch2-Software Engineering 9
Ch2-Software Engineering 9Ch2-Software Engineering 9
Ch2-Software Engineering 9
 
Software Security Engineering
Software Security EngineeringSoftware Security Engineering
Software Security Engineering
 
Intro softwareeng
Intro softwareengIntro softwareeng
Intro softwareeng
 
10. Software testing overview
10. Software testing overview10. Software testing overview
10. Software testing overview
 
Ch9 evolution
Ch9 evolutionCh9 evolution
Ch9 evolution
 
SE18_Lec 01_Introduction to Software Engineering
SE18_Lec 01_Introduction to Software EngineeringSE18_Lec 01_Introduction to Software Engineering
SE18_Lec 01_Introduction to Software Engineering
 

Similar to Atifalhas

An Investigation of Fault Tolerance Techniques in Cloud Computing
An Investigation of Fault Tolerance Techniques in Cloud ComputingAn Investigation of Fault Tolerance Techniques in Cloud Computing
An Investigation of Fault Tolerance Techniques in Cloud Computing
ijtsrd
 
An Efficient Approach Towards Mitigating Soft Errors Risks
An Efficient Approach Towards Mitigating Soft Errors RisksAn Efficient Approach Towards Mitigating Soft Errors Risks
An Efficient Approach Towards Mitigating Soft Errors Risks
sipij
 
Ch13
Ch13Ch13
www.ijerd.com
www.ijerd.comwww.ijerd.com
www.ijerd.com
IJERD Editor
 
Ch13.pptx
Ch13.pptxCh13.pptx
Ch13.pptx
MohammedNouh7
 
System Structure for Dependable Software Systems
System Structure for Dependable Software SystemsSystem Structure for Dependable Software Systems
System Structure for Dependable Software Systems
Vincenzo De Florio
 
SOFTWARE QUALITY FACTORS_SQE.pptx
SOFTWARE QUALITY FACTORS_SQE.pptxSOFTWARE QUALITY FACTORS_SQE.pptx
SOFTWARE QUALITY FACTORS_SQE.pptx
MusondaSichinga
 
How to save on software maintenance costs
How to save on software maintenance costsHow to save on software maintenance costs
How to save on software maintenance costsFrancisJansen
 
Restoration and Degeneration of the Applications
Restoration and Degeneration of the ApplicationsRestoration and Degeneration of the Applications
Restoration and Degeneration of the Applications
iosrjce
 
F017264143
F017264143F017264143
F017264143
IOSR Journals
 
Testing
TestingTesting
Testing
BinamraRegmi
 
Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust...
Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust...Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust...
Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust...
IOSR Journals
 
A BRIEF PROGRAM ROBUSTNESS SURVEY
A BRIEF PROGRAM ROBUSTNESS SURVEYA BRIEF PROGRAM ROBUSTNESS SURVEY
A BRIEF PROGRAM ROBUSTNESS SURVEY
ijseajournal
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
Qms
QmsQms
Software Process and Requirement
Software Process and RequirementSoftware Process and Requirement
Software Process and Requirement
cricket2ime
 

Similar to Atifalhas (20)

An Investigation of Fault Tolerance Techniques in Cloud Computing
An Investigation of Fault Tolerance Techniques in Cloud ComputingAn Investigation of Fault Tolerance Techniques in Cloud Computing
An Investigation of Fault Tolerance Techniques in Cloud Computing
 
Cloud Storage and Security
Cloud Storage and SecurityCloud Storage and Security
Cloud Storage and Security
 
An Efficient Approach Towards Mitigating Soft Errors Risks
An Efficient Approach Towards Mitigating Soft Errors RisksAn Efficient Approach Towards Mitigating Soft Errors Risks
An Efficient Approach Towards Mitigating Soft Errors Risks
 
Ch13
Ch13Ch13
Ch13
 
www.ijerd.com
www.ijerd.comwww.ijerd.com
www.ijerd.com
 
Ch13.pptx
Ch13.pptxCh13.pptx
Ch13.pptx
 
System Structure for Dependable Software Systems
System Structure for Dependable Software SystemsSystem Structure for Dependable Software Systems
System Structure for Dependable Software Systems
 
SOFTWARE QUALITY FACTORS_SQE.pptx
SOFTWARE QUALITY FACTORS_SQE.pptxSOFTWARE QUALITY FACTORS_SQE.pptx
SOFTWARE QUALITY FACTORS_SQE.pptx
 
How to save on software maintenance costs
How to save on software maintenance costsHow to save on software maintenance costs
How to save on software maintenance costs
 
Alft
AlftAlft
Alft
 
Restoration and Degeneration of the Applications
Restoration and Degeneration of the ApplicationsRestoration and Degeneration of the Applications
Restoration and Degeneration of the Applications
 
F017264143
F017264143F017264143
F017264143
 
Download
DownloadDownload
Download
 
Testing
TestingTesting
Testing
 
Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust...
Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust...Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust...
Using Fuzzy Clustering and Software Metrics to Predict Faults in large Indust...
 
A BRIEF PROGRAM ROBUSTNESS SURVEY
A BRIEF PROGRAM ROBUSTNESS SURVEYA BRIEF PROGRAM ROBUSTNESS SURVEY
A BRIEF PROGRAM ROBUSTNESS SURVEY
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Qms
QmsQms
Qms
 
Software Process and Requirement
Software Process and RequirementSoftware Process and Requirement
Software Process and Requirement
 

More from Evandro Madeira

Fit2014
Fit2014Fit2014
I pv6 eigrp
I pv6 eigrpI pv6 eigrp
I pv6 eigrp
Evandro Madeira
 
Exemplodeplanodegerenciamentodeprojeto 100726151218-phpapp02
Exemplodeplanodegerenciamentodeprojeto 100726151218-phpapp02Exemplodeplanodegerenciamentodeprojeto 100726151218-phpapp02
Exemplodeplanodegerenciamentodeprojeto 100726151218-phpapp02
Evandro Madeira
 
Gerenciamentodecustos 130918192946-phpapp02
Gerenciamentodecustos 130918192946-phpapp02Gerenciamentodecustos 130918192946-phpapp02
Gerenciamentodecustos 130918192946-phpapp02
Evandro Madeira
 
07 pmbok5 cap07 custo
07   pmbok5 cap07 custo07   pmbok5 cap07 custo
07 pmbok5 cap07 custo
Evandro Madeira
 

More from Evandro Madeira (6)

Fit2014
Fit2014Fit2014
Fit2014
 
I pv6 eigrp
I pv6 eigrpI pv6 eigrp
I pv6 eigrp
 
Custo pmbok
Custo pmbokCusto pmbok
Custo pmbok
 
Exemplodeplanodegerenciamentodeprojeto 100726151218-phpapp02
Exemplodeplanodegerenciamentodeprojeto 100726151218-phpapp02Exemplodeplanodegerenciamentodeprojeto 100726151218-phpapp02
Exemplodeplanodegerenciamentodeprojeto 100726151218-phpapp02
 
Gerenciamentodecustos 130918192946-phpapp02
Gerenciamentodecustos 130918192946-phpapp02Gerenciamentodecustos 130918192946-phpapp02
Gerenciamentodecustos 130918192946-phpapp02
 
07 pmbok5 cap07 custo
07   pmbok5 cap07 custo07   pmbok5 cap07 custo
07 pmbok5 cap07 custo
 

Recently uploaded

The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
Levi Shapiro
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
RaedMohamed3
 

Recently uploaded (20)

The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 

Atifalhas

  • 1. Software Implemented Fault Tolerance: Technologies and Experience Yennun Huang and Chandra Kintala AT&T Bell Laboratories Murray Hill, NJ 07974 Abstract ance has two dimensions: availability and data consistency of the application. For example, users of telephone switching systems demand continuous availability whereas bank teller machine customers demand the highest degree of data consistency[l3]. Most other applications have lower degrees of requirements for faulttolerance in both dimensions but the trend is to ins crease those degrees a the costs, performance, technologies and other engineering considerations permit; see Figure 1 below. In this paper, we discuss three B y software implemented fault tolerance, we mean a set of software facilities t o detect ‘and recover from faults that are are not handled by the underlying hardware or operating system. W e consider those faults that cause a n application process t o crash or hang; they include software faults as well as faults in the underlying hardware and operating system layers i f they are undetected in those layers. W e define 4 levels of software fault tolerance based on availability and data consistency of a n application in the presence of such faults. Watchd, libft and nDFS are reusable components that provide up t o the 3rd level of software fault tolerance. They perform, respectively, automatic detection and restart of failed processes, periodic checkpointing and recovery of critical volatile data, and replication and synchronization of persistent data in an application software system. These modules have been ported t o a number of UNIX’ platforms and can be used by any application with minimal programming egort. Some newer telecommunications products in AT&T have already enhanced their fault-tolerance capability using these three components. Experience with those products to date indicates that these modules provide eficient and economical means to increase the level of fault tolerance in a software product. The performance overhead due t o these components depends on the level and varies from 0.1% to 14% based on the amount of critical data being checkpointed and replicated. 1 Baoli Teller Machines Availability Figure 1: Dimensions of Fault Tolerance cost-effective software technologies t o raise the degree of fault tolerance in both dimensions of the application. Following Cristian[7], we consider software applications that provide a “service” t o clients. The applications in turn use the services provided by the underlying operating or database systems which in turn use the computing and network communication services provided by the underlying hardware; see Figure 2. Tolerating faults in such applications involves detecting a failure, gathering knowledge about the failure and recovering from that failure. Traditionally, these fault tolerance actions are performed Introduction There are increasing demands to make the software in the applications we build today more tolerant to faults. From users’ point of view, fault toler‘UNIX is a registered trademark of UNIX System Laboratories, Inc. 2 0731-3071/93$3.00 0 1993 IEEE
from failures at a higher layer, are also inefficient. For example, replicating files on a mirrored disk through a facility in the operating system is less efficient than replicating only the "critical" files of the application in the application layer, since the operating system has no internal knowledge of that application. Similarly, generalized checkpointing schemes in an operating system checkpoint the entire in-memory data of an application, whereas application-assisted methods checkpoint only the critical data[17, 3].

A common but misleading argument against embedding checkpointing, recovery and other fault tolerance schemes inside an application is that such schemes are not transparent, efficient or reliable because they are coded by application programmers. We claim that well-tested and efficient fault tolerance methods can be built as libraries or reusable software components executing in the application layer, and that they provide as much transparency as the other methods do. All three components discussed in this paper have those properties, i.e. they are efficient, reliable and transparent.

Figure 2: Layers of Fault Tolerance (application layer; FT-DBMS and operating system layer; duplex, TMR and other hardware techniques)

Traditionally, these fault tolerance actions are performed in the hardware, operating or database systems used in the underlying layers of the application software. Hardware fault tolerance is provided using Duplex, Triple-Module-Redundancy or other techniques[19]. Fault tolerance in the operating and database layers is often provided using replicated file systems[22], exception handling[23], disk shadowing[6], transaction-based checkpointing and recovery[18], and other system routines. These methods and technologies handle faults occurring in the underlying hardware, operating and database system layers only.

An increasing number of faults, however, occur in the application software layer, causing application processes to crash or hang. A process is said to be crashed if the working process image is no longer present in the system. A process is said to be hung if the process image is alive and its entry is still present in the process table, but the process is not making any progress from a user's point of view. Such software failures arise from incorrect program designs or coding errors but, more often than not, they arise from transient and nondeterministic errors[9]; for example, unsatisfactory boundary value conditions, timing and race conditions in the underlying computing and communication services, performance failures, etc.[7]. Due to the complex and temporal nature of the interleaving of messages and computations in a distributed system, no amount of verification, validation and testing will eliminate all those software faults in an application and give complete confidence in the availability and data consistency of that application. So, those faults will occasionally manifest themselves and cause the application process to crash or hang. It is possible to detect a crash and restart an application at a checkpointed state through operating system facilities, as in IBM's MVS[24].
The above observations and the End-to-End Argument[21] lead to our notion of software fault tolerance as defined below: software fault tolerance is a set of software facilities to detect and recover from faults that cause an application process to crash or hang and that are not handled by the underlying hardware or operating system. Observe that this includes software faults as described earlier, as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. Thus, if the underlying hardware and operating system of an application system are not fault-tolerant, due to performance/cost trade-offs or other engineering considerations, then that system can increase its availability more cost-effectively through software fault tolerance as described in this paper. It is also an easier migration path for making existing applications more fault-tolerant.
2 Model

For simplicity in the following discussions, we consider only client-server based applications running in a local or wide-area network of computers in a distributed system; the discussion also applies to other kinds of applications. Each application has a server process executing in the user level on top of a vendor-supplied hardware and operating system. To get services, clients send messages to the server; the server process acts on those messages one by one and, in each of those message processing steps, updates its data. We sometimes call the server process the application.

For fault tolerance purposes, the nodes in the distributed system are viewed as being in a circular configuration such that each node is a backup node for its left neighbor in that circular list (a minimal sketch of this arrangement is given below, after the task list). As shown in Figure 3, each application will be executing primarily on one of the nodes in the network, called the primary node for that application. Each executing application has process text (the compiled code), volatile data (variables, structures, pointers and all the bytes in the static and dynamic memory segments of the process image) and persistent data (the application files being referred to and updated by the executing process).

We use a modified primary-site approach to software fault tolerance[1]. In the primary site approach, the service to be made fault tolerant is replicated at many nodes, one of which is designated as primary and the others as backups. All the requests for the service are sent to the primary site. The primary site periodically checkpoints its state on the backups. If the primary fails, one of the backups takes over as primary. This model for fault tolerance has been analyzed for frequency of checkpointing, degree of service replication and the effect on response time by Huang and Jalote[11, 12]. This model was slightly modified, as described below, to build the three technologies described in this paper. The tasks in our modification of the primary site approach are:

- a watchdog process running on the primary node watching for application crashes or hangs,
- a watchdog process running on the backup node watching for primary node crashes,
- periodically checkpointing the critical volatile data in the application,
- logging of client messages to the application,
- replicating the application's persistent data and making it available on the backup node,
- when the application on the primary node crashes or hangs, restarting the application, if possible, on the primary node, otherwise on the backup node,
- recovering the application to the last checkpointed state and reexecuting the message log, and
- connecting the replicated files to the backup node if the application restarts on the backup.

Figure 3: Modified Primary Site Approach

Observe that these software fault tolerance tasks can be used in addition to other methods such as N-version programming[2] or recovery blocks[20] inside an application program. Observe also that the application process on the backup node will not be running until it is started by the watchdog process; this is unlike the process-pair model[9], where the backup process is passively running even during normal operations.
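The following minimal C sketch illustrates the circular backup arrangement just described; the node names and the fixed-size node table are hypothetical illustrations, not watchd's actual configuration format.

    #include <stdio.h>

    #define NNODES 4

    /* Hypothetical node list; in practice this would come from a
     * configuration file. */
    static const char *nodes[NNODES] = { "hostA", "hostB", "hostC", "hostD" };

    /* In a circular configuration, node i is the backup for its left
     * neighbor, (i + NNODES - 1) % NNODES, and is itself backed up by
     * its right neighbor, (i + 1) % NNODES. */
    static int left_neighbor(int i)  { return (i + NNODES - 1) % NNODES; }
    static int right_neighbor(int i) { return (i + 1) % NNODES; }

    int main(void)
    {
        for (int i = 0; i < NNODES; i++)
            printf("%s backs up %s and is backed up by %s\n",
                   nodes[i], nodes[left_neighbor(i)], nodes[right_neighbor(i)]);
        return 0;
    }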
The degree to which the above software fault tolerance tasks are used in an application determines the availability and data consistency of that application. It is, therefore, useful to establish a classification of the different levels of software fault tolerance. We define the following 4 levels based on our experience in AT&T. Applications illustrating these levels are described in Section 4.

Level 0: No tolerance to faults in the application software: In this level, when the executing application process dies or hangs, it has to be manually restarted from an initial internal state, i.e. the initial values of the volatile data. The application may leave its persistent data in an incorrect or inconsistent state due to the timing of the crash, and may take a long time to restart due to elaborate initialization procedures.

Level 1: Automatic detection and restart: When the application dies or hangs, the fault will be detected and the application will be restarted from an initial internal state on the same processor, if possible, or on a backup processor if available. In this level, the internal state of the application is not saved and, hence, the process restarts at the initial internal state. As stated above, restart along with reinitialization will be slow. The restarted internal state may not reflect all the messages that were processed in the previous execution and, therefore, may not be consistent with the persistent data. The difference between Levels 0 and 1 is that detection and restart are automatic in Level 1; therefore, application availability is higher in Level 1 than in Level 0.

Level 2: Level 1 plus periodic checkpointing, logging and recovery of internal state: In addition to what is available in Level 1, the internal state of the application process is periodically checkpointed, i.e. the critical volatile data is saved, and the messages to the server are logged. After a failure is detected, the application is restarted at the most recent checkpointed internal state and the logged messages are reprocessed to bring the application to the state at which it crashed. The application availability and volatile data consistency are higher in Level 2 than in Level 1.

Level 3: Level 2 plus persistent data recovery: In addition to what is available in Level 2, the persistent data of the application is replicated on a backup disk connected to a backup node, and is kept consistent with the primary server throughout the normal operation of the application. In case of a fault and the resulting recovery of the application on the backup node, the backup disk brings the application's persistent data as close as possible to the state at which the application crashed. The data consistency of the application in Level 3 is higher than that in Level 2.

Level 4: Continuous operation without any interruption: This level of fault tolerance in software guarantees the highest degree of availability and data consistency. Often, as in switching systems, this is provided using replicated processing of the application on "hot" spare hardware. The state of a process need not be saved, but multicast messaging, voting and other mechanisms must be used to maintain consistency and concurrency control. Availability of the system during planned interruptions, such as those during upgrades, is made possible using dynamic loading or other operating system facilities. The technologies we describe in this paper do not provide this level of fault tolerance.

3 Technologies

Many applications perform some of these software fault tolerance features by coding them directly in their programs. We developed three reusable components - watchd, libft and nDFS - to embed those features in any application with minimal programming effort.

3.1 Watchd

Watchd is a watchdog daemon process that runs on a single machine or on a network of machines. It continually watches the life of a local application process by periodically sending a null signal to the process and checking the return value to detect whether that process is alive or dead. It detects whether that process is hung by using one of the following two methods. The first method sends a null message to the local application process using IPC (Inter Process Communication) facilities on the local node and checks the return value. If it cannot make the connection, it waits for some time (specified by the application) and tries again. If it fails after the second attempt, it interprets this to mean that the process is hung. The second method asks the application process to send a heartbeat message to watchd periodically, and watchd periodically checks for that heartbeat. If the heartbeat message from the application is not received by a specified time, watchd assumes that the application is hung. This implies that watchd cannot differentiate between hung processes and very slow processes.
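The two detection ideas can be illustrated with a short, self-contained C sketch, assuming a POSIX environment: the "null signal" probe uses the standard kill() call with signal 0, and the heartbeat check is reduced to a timestamp comparison. This is an illustration of the technique only, not watchd's source.

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <time.h>
    #include <sys/types.h>

    /* Crash detection: signal 0 performs error checking only, so a return
     * of -1 with errno == ESRCH means the process no longer exists. */
    static int process_alive(pid_t pid)
    {
        return kill(pid, 0) == 0 || errno != ESRCH;
    }

    /* Hang detection (heartbeat method): the application is assumed to
     * refresh 'last_heartbeat' periodically; with no heartbeat within
     * 'timeout' seconds the process is declared hung.  As noted above,
     * this cannot distinguish a hung process from a very slow one. */
    static int process_hung(time_t last_heartbeat, time_t timeout)
    {
        return time(NULL) - last_heartbeat > timeout;
    }

    int main(void)
    {
        pid_t pid = 12345;                    /* hypothetical application pid */
        time_t last_heartbeat = time(NULL);   /* hypothetical last heartbeat  */

        if (!process_alive(pid))
            printf("process %ld crashed; restart it\n", (long)pid);
        else if (process_hung(last_heartbeat, 30))
            printf("process %ld appears hung; restart it\n", (long)pid);
        return 0;
    }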
When it detects that the application process has crashed or hung, watchd recovers that application at an initial internal state or at the last checkpointed state. The application is recovered on the primary node if that node has not crashed, otherwise on the backup node for the primary as specified in a configuration file. If libft is also used, watchd sets the restarted application to process all the logged messages from the log file generated by libft. Watchd also watches one neighboring watchd (left or right) in a circular fashion to detect node failures; this circular arrangement is similar to the adaptive distributed diagnosis algorithm[5]. When a node failure is detected, watchd can execute user-defined recovery commands and reconfigure the network. Observe that neighboring watchds cannot differentiate between node failures and link failures. In general, this is the problem of attaining common knowledge in the presence of communication failures, which is provably unsolvable[10].

Watchd also watches itself. A self-recovery mechanism is built into watchd in such a way that it can recover itself from an unexpected software failure. Watchd also facilitates restarting a failed process, restoring the saved values and reexecuting the logged events, and provides facilities for remote execution, remote copy, distributed election, and status report production.

3.2 Libft

Libft is a user-level library of C functions that can be used in application programs to specify and checkpoint critical data, recover the checkpointed data, log events, locate and reconnect a server, do exception handling, do N-version programming (NVP), and use recovery block techniques.

Libft provides a set of functions (e.g. critical()) to specify critical volatile data in an application. Those critical data items are allocated in a reserved region of the virtual memory and are periodically checkpointed. Values in critical data structures are saved using memory copy functions, thus avoiding traversal of application-dependent data structures. When an application takes a checkpoint, its critical data is saved on the primary and backup nodes. Unlike other checkpointing methods[17], the overhead in our checkpointing mechanism is minimized by saving only critical data and avoiding data-structure traversals. This idea of saving only critical data in an application is analogous to the Recovery Box concept in Sprite[3].

Libft provides functions (e.g. getsvrloc(), getsvrport(), ftconnect(), ftbind()) for clients to locate servers and reconnect to servers in a network environment. The exception handling, NVP and recovery block facilities are implemented using C macros and standard C library functions. These facilities can be used by any application without changing the underlying operating system or adding new C preprocessors.

Libft also provides ftread() and ftwrite() functions to automatically log messages. When the ftread() function is called by a process in normal operation, the data is read from a channel and automatically logged in a file. The logged data is then duplicated and logged by the watchd daemon on a backup machine; this replication of logged data is necessary for a process to recover from a primary machine failure. When the ftread() function is called by a process that is recovering from a failure, the input data is read from the log file before any data is read from a regular input channel. Similarly, the ftwrite() function logs output data before it is sent out, and the output data is also duplicated and logged by the watchd daemon on a backup machine. The log files created by the ftread() and ftwrite() functions are truncated after a checkpoint() function is successfully executed. Using the functions checkpoint(), ftread() and ftwrite(), one can implement either a sender-based or a receiver-based logging and recovery scheme[14]. There is a slight possibility that some messages may get lost during the automatic restart procedure. If this is a concern for an application, an additional message synchronization mechanism can be built into the application to check for and retransmit lost messages.

Speed and portability are primary concerns in implementing libft. The libft checkpoint mechanism is not fully transparent to programmers, as it is in the Condor system[16]. However, libft does not require a new language, a new preprocessor, or complex declarations and computations to save data structures[9]. This sacrifice of transparency for speed has proven useful in persuading some projects to adopt libft. Installing libft does not require any change to a UNIX-based operating system; it has been ported to several platforms.

Watchd and libft separate fault detection and volatile data recovery facilities from the application functions. They provide those facilities as reusable components which can be combined with any application to make it fault tolerant. Since the messages received at the server site are logged and only the server process is recovered in this scheme, the consistency problems that occur in recovering multiple processes[14] are not issues in this implementation.
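The idea of checkpointing only critical data with plain memory copies, rather than traversing application data structures, can be sketched as follows. The struct, file name and function names here are hypothetical; they illustrate the approach and are not libft's actual interface.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical critical state: only this contiguous region is
     * checkpointed, not the whole process image. */
    struct critical {
        long next_txn_id;
        int  open_sessions;
        char last_client[64];
    };

    static struct critical state;

    /* Save the critical region with a single block write; no traversal of
     * application data structures is needed. */
    static int save_checkpoint(const char *path)
    {
        FILE *fp = fopen(path, "wb");
        if (!fp) return -1;
        size_t n = fwrite(&state, sizeof state, 1, fp);
        fclose(fp);
        return n == 1 ? 0 : -1;
    }

    /* Restore the critical region after a restart. */
    static int restore_checkpoint(const char *path)
    {
        FILE *fp = fopen(path, "rb");
        if (!fp) return -1;
        size_t n = fread(&state, sizeof state, 1, fp);
        fclose(fp);
        return n == 1 ? 0 : -1;
    }

    int main(void)
    {
        state.next_txn_id = 42;
        state.open_sessions = 3;
        strcpy(state.last_client, "client-7");

        if (save_checkpoint("app.ckp") == 0 && restore_checkpoint("app.ckp") == 0)
            printf("recovered at txn %ld with %d sessions\n",
                   state.next_txn_id, state.open_sessions);
        return 0;
    }

A message log, of the kind produced by ftread() and ftwrite(), would then be replayed on top of such a restored state to bring the server back to the point at which it failed.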
3.3 nDFS

The multi-dimensional file system, nDFS[8], is based on 3DFS[15] and provides facilities for replication of critical data. It allows users to specify and replicate critical files on backup file systems in real time. The implementation of nDFS uses the dynamic shared library mechanism to intercept file system calls and propagate those calls to backup file systems. nDFS is built on top of UNIX file systems, so its use requires no change to the underlying file system. Speed, robustness and replication transparency are the primary design goals of nDFS. The implementation of nDFS uses watchd and libft for fault detection and fast recovery. A failure of the underlying replication mechanism (a software failure) or a crash of a backup file system is transparent to applications and users. A failed software component is detected and recovered immediately by watchd; a crashed backup file system, after it is repaired, can catch up with the primary file system without interrupting or slowing down applications using the primary file system.

4 Experience

Fault tolerance in some of the newer telecommunications network management products in AT&T has been enhanced using watchd, libft and nDFS. Experience with those products to date indicates that these technologies are indeed economical and effective means to increase the level of fault tolerance in application software. The performance overhead due to these components depends on the level of fault tolerance, the amount of critical volatile data being checkpointed, the frequency of checkpointing, and the amount of persistent data being replicated; it varies from 0.1% to 14%. We describe some of those products to illustrate the availability, flexibility and efficiency of providing software fault tolerance through these 3 components. To protect the proprietary information of those products, we use generic terms and titles in the descriptions.

Level 1: Failure detection and restart using watchd: Application C uses watchd to check the "liveness" of some service daemon processes in C at 10 second intervals. When any of those processes fails, i.e. crashes or hangs, watchd restarts that process at its initial state. It took 2 people 3 hours to embed and configure watchd for this level of fault tolerance in application C. One potential use of this kind of fault tolerance would be in general-purpose local area computing environments for stateless network services such as the lpr, finger or inetd daemons; providing higher levels of fault tolerance for those services would be unnecessary.

Level 2: Failure detection, checkpointing, restart and recovery using watchd and libft: Application N maintains a certain segment of the telephone call routing information on a Sun server; maintenance operators use workstations running N's client processes, which communicate with N's server process using sockets. The server process in N was crashing or hanging for unknown reasons. During such failures, the system administrators had to manually bring back the server process, but they could not do so immediately because of the UNIX delay in cleaning up the socket table. Moreover, the maintenance operators had to restart client interactions from an initial state. Replacing the server node with fault-tolerant hardware would have increased their capital and development costs by a factor of 4. Even then, not all of their problems would have been solved; for example, saving the client states of interactions. Using watchd and libft, system N is now able to tolerate such failures. Watchd also detects primary server failures and restarts the server on the backup node. Location transparency is obtained using getsvrloc() and getsvrport() calls in the client programs and ftbind() in the server program. Libft's checkpoint and recovery mechanisms are used to save and recover all critical data. Checkpointing and recovery overheads are below 2%. Installing and integrating the two components into the application took 2 people 3 days.

Level 3: Failure detection, checkpointing, replication, restart and recovery using watchd, libft and nDFS: Application D is a real-time telecommunication network element currently being developed. In addition to the previous requirements for fault tolerance, this product needs its persistent files to be on-line immediately after a failure recovery on a backup node. During normal operations on the primary server, nDFS replicates all the critical persistent files on a backup server with an expected overhead of less than 14%. When the primary server fails, watchd starts application D on the backup node and automatically connects it to the backup disk on which the persistent files were replicated.
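The dynamic-shared-library interception that nDFS relies on (Section 3.3) can be illustrated with the following sketch for an ELF-based UNIX system: a preloaded library wraps write() via dlsym(RTLD_NEXT, ...) and shows where a replication layer could forward data to a backup file system. The preload mechanism and the hypothetical forward_to_backup() hook are illustrative assumptions, not nDFS's implementation.

    /* intercept.c - build as a shared library and preload it, e.g.:
     *   cc -fPIC -shared -o intercept.so intercept.c -ldl
     *   LD_PRELOAD=./intercept.so some_application
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <unistd.h>

    /* Wrapper for write(): look up the real write() once, then call it.
     * A replication layer would also propagate buf to a backup file
     * system here, before or after the local write succeeds. */
    ssize_t write(int fd, const void *buf, size_t count)
    {
        static ssize_t (*real_write)(int, const void *, size_t);

        if (!real_write)
            real_write = (ssize_t (*)(int, const void *, size_t))
                         dlsym(RTLD_NEXT, "write");

        /* forward_to_backup(fd, buf, count);  -- hypothetical replication hook */

        return real_write(fd, buf, count);
    }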
4.1 Other Possible Uses

The three software components, watchd, libft and nDFS, can be used not only to increase the level of fault tolerance in an application, as described above, but also to aid in other operations unrelated to fault tolerance, as described below.

- Overcoming persistent errors in software: Some errors in software are simply persistent, i.e. non-debuggable, due to the complexity and transient nature of the interactions and events in a distributed system[4]. Such errors sometimes do not reappear after the server process is restarted[9]; watchd, in those instances, can be used to bring the server process back up without clients noticing the failure and restart. After restart and restoration of the checkpointed state, message logs can be replayed in the order they originally arrived at the server or, if needed, in a different order[25]. Reordering the message logs sometimes eliminates transient errors due to "boundary" conditions.

- On-line upgrading of software: One can install a new version of software for an application without interrupting the service provided by the older version. This can be done by first loading the new version on the backup node, simulating a fault on the primary and then letting watchd dynamically move the service location to the backup node. This method assumes that the two versions are compatible at the application-level client-server protocol.

- Using checkpoint states and message logs for debugging distributed applications: In libft, all the checkpointed states, i.e. the values in the critical data, and the message logs can optionally be saved in a journal file. This journal can be used to aid in analyzing failures in distributed applications.

5 Summary

We identified some of the dimensions of fault tolerance and defined a role, a taxonomy and tasks for software fault tolerance based on the availability and data consistency requirements of an application. We then described three software components, watchd, libft and nDFS, to perform these tasks. These three components are flexible, portable and reusable; they can be embedded in any UNIX-based application software to provide different levels of fault tolerance with minimal programming effort. Experience in using these three components in some telecommunication products has shown that they indeed increase the level of fault tolerance with acceptable increases in performance overhead.

Acknowledgments

Many thanks to Lawrence Bernstein, who suggested defining levels for fault tolerance, provided leadership to transfer this technology rapidly, and encouraged using these components in a wide range of AT&T products and services. The authors have benefited from discussions, contributions and comments from several colleagues, particularly Rao Arimilli, David Belanger, Marilyn Chiang, Glenn Fowler, Kent Fuchs, Pankaj Jalote, Robin Knight, David Korn, Herman Rao and Yi-Min Wang.

References

[1] P. A. Alsberg and J. D. Day, "A Principle for Resilient Sharing of Distributed Services," Proceedings of 2nd Intl. Conf. on Software Engineering, pp. 562-570, October 1976.

[2] A. Avizienis, "The N-Version Approach to Fault-Tolerant Software," IEEE Trans. on Software Engineering, SE-11, No. 12, pp. 1491-1501, Dec. 1985.

[3] M. Baker and M. Sullivan, "The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment," Proceedings of Summer '92 USENIX, pp. 31-43, June 1992.

[4] L. Bernstein, "On Software Discipline and the War of 1812," ACM Software Engineering Notes, p. 18, Oct. 1992.

[5] R. Bianchini, Jr. and R. Buskens, "An Adaptive Distributed System-Level Diagnosis Algorithm and Its Implementation," Proceedings of 21st IEEE Conference on Fault-Tolerant Computing Systems (FTCS), pp. 222-229, July 1991.

[6] D. Bitton and J. Gray, "Disk Shadowing," Proceedings of 14th Conference on Very Large Data Bases, pp. 331-338, Sept. 1988.

[7] F. Cristian, "Understanding Fault-Tolerant Distributed Systems," Communications of the ACM, Vol. 34, No. 2, pp. 56-78, February 1991.

[8] G. S. Fowler, Y. Huang, D. G. Korn and H. Rao, "A User-Level Replicated File System," to be presented at the Summer 1993 USENIX Conference, June 1993.

[9] J. Gray and D. P. Siewiorek, "High-Availability Computer Systems," IEEE Computer, Vol. 24, No. 9, pp. 39-48, September 1991.
[10] J. Y. Halpern and Y. Moses, "Knowledge and Common Knowledge in a Distributed Environment," Journal of the ACM, Vol. 37, No. 3, pp. 549-587, July 1990.

[11] Y. Huang and P. Jalote, "Analytic Models for the Primary Site Approach to Fault-Tolerance," Acta Informatica, Vol. 26, pp. 543-557, 1989.

[12] Y. Huang and P. Jalote, "Effect of Fault Tolerance on Response Time - Analysis of the Primary Site Approach," IEEE Transactions on Computers, Vol. 41, No. 4, pp. 420-428, April 1992.

[13] M. Iwama, D. Stan and G. Zimmerman, personal correspondence, July 1992.

[14] P. Jalote, "Fault Tolerant Processes," Distributed Computing, Vol. 3, pp. 187-195, 1989.

[15] D. G. Korn and E. Krell, "A New Dimension for the Unix File System," Software Practice and Experience, Vol. 20, Supplement 1, pp. S1/19-S1/34, June 1990.

[16] M. Litzkow, M. Livny and M. Mutka, "Condor - A Hunter of Idle Workstations," Proc. of the 8th Intl. Conf. on Distributed Computing Systems, IEEE Computer Society Press, June 1988.

[17] J. Long, W. K. Fuchs and J. A. Abraham, "Compiler-Assisted Static Checkpoint Insertion," Proceedings of 22nd IEEE Conference on Fault-Tolerant Computing Systems (FTCS), pp. 58-65, July 1992.

[18] A. Nangia and D. Finker, "Transaction-based Fault-Tolerant Computing in Distributed Systems," Proceedings of 1992 IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, pp. 92-97, July 1992.

[19] D. K. Pradhan (ed.), Fault-Tolerant Computing: Theory and Techniques, Vols. 1 and 2, Prentice-Hall, 1986.

[20] B. Randell, "System Structure for Software Fault Tolerance," IEEE Trans. on Software Engineering, SE-1, No. 2, pp. 220-232, June 1975.

[21] J. H. Saltzer, D. P. Reed and D. D. Clark, "End-To-End Arguments in System Design," ACM Transactions on Computer Systems, Vol. 2, No. 4, pp. 277-288, November 1984.

[22] M. Satyanarayanan, "Coda: A Highly Available File System for a Distributed Workstation Environment," IEEE Trans. on Computers, Vol. C-39, pp. 447-459, April 1990.

[23] S. K. Shrivastava (ed.), Reliable Computer Systems, Chapter 3, Springer-Verlag, 1985.

[24] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems: Design and Implementation, Chapter 7, Digital Press, 1992.

[25] Y. M. Wang, Y. Huang and W. K. Fuchs, "Progressive Retry for Software Error Recovery in Distributed Systems," to be presented at the FTCS-23 Conference, June 1993.