Fault Tolerance System

Group Members
Inaam Ilahi
Furqan Farooq
Sabuha Sarwar
Ehsan Ilahi

Faults, Errors and Failures
A fault is a defect within the system.
An error is a deviation from the expected behavior of the
system; it is how a fault becomes observable.
A failure occurs when the system can no longer perform
as required (it does not meet its specification).
Fault tolerance is the ability of a system to provide a service
even in the presence of errors.
Fault → Error → Failure

Fault Tolerant System
• A fault tolerant system is a system that is able to
continue operating despite the failure of a limited
subset of its hardware or software.
• Such systems degrade gracefully: as the size of the
faulty set increases, the system does not collapse
suddenly but continues executing part of its workload.
• The goal of this design is to ensure that the
probability of system failure is acceptably small.

Fault Types
• Hardware Fault:
A hardware fault is a physical defect that can cause a
component to malfunction.
E.g. a broken wire, or the output of a logic gate that is
perpetually stuck at some logic value (0 or 1).
• Software Fault:
A software fault is a bug that can cause the program to fail
for a given set of inputs.

Types Of Failure
Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times

Objectives Of Fault Tolerance
• Availability
The system is ready for use; measured as the probability that the
system is ready or available at a given time.
• Reliability
The property that a system can run without failure for a given time.
• Safety
Concerns what happens when the system fails: a safe system avoids
catastrophic consequences.
• Maintainability
The ease with which a failed system can be repaired.
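
The first two objectives are commonly quantified. The sketch below is not from the slides: it assumes the standard steady-state formula availability = MTTF / (MTTF + MTTR) and, for reliability, a constant failure rate (exponential model). The function and parameter names are illustrative.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up.

    mttf_hours: mean time to failure, mttr_hours: mean time to repair.
    """
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of running without failure for `hours`.

    Assumes a constant failure rate (exponential model).
    """
    return math.exp(-failure_rate_per_hour * hours)


if __name__ == "__main__":
    print(f"availability: {availability(999.0, 1.0):.3f}")
    print(f"reliability over 100 h: {reliability(1e-4, 100.0):.4f}")
```

A system can be highly available yet not very reliable (many short outages), which is why the two are listed as separate objectives.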

Miscommunication
The success of any software application depends on
communication between stakeholders, development teams,
and testing teams.

Changing Requirements
The customer may not understand the effects of
changes, or may understand and request them
anyway.

Poorly Documented Code
It’s tough to maintain and modify code that is badly
written or poorly documented.

Lack of Skilled Testing
No tester wants to admit it, but poor testing does take
place across organizations. There can be shortcomings in
the testing processes that are followed.

Fault And Error Containment
Containment is the process of preventing an error from
spreading from one part of the system to another.
When a fault or error occurs in one part of a
system, it can spread through the system like an
infectious disease unless it is contained.
E.g. a fault in one part of the system might
cause large voltage swings in another.
• A fault-free processor can give erroneous results
when it receives input from a faulty unit.
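
One common way to contain an error at a software boundary is to check every value arriving from another unit before using it. The sketch below is an illustration under assumptions, not a method from the slides: `ContainmentGuard`, its range limits, and the "last known good value" fallback are all hypothetical choices.

```python
from typing import Optional


class ContainmentGuard:
    """Checks values from an upstream unit before the consumer uses them."""

    def __init__(self, low: float, high: float, fallback: float) -> None:
        self.low = low
        self.high = high
        self.last_good = fallback  # value used until a valid reading arrives
        self.rejected = 0          # count of contained errors

    def accept(self, value: Optional[float]) -> float:
        # A missing or out-of-range reading is treated as a symptom of an
        # upstream fault. NaN also fails the range comparison below.
        if value is None or not (self.low <= value <= self.high):
            self.rejected += 1
            return self.last_good
        self.last_good = value
        return value


if __name__ == "__main__":
    guard = ContainmentGuard(low=0.0, high=5.0, fallback=2.5)
    for reading in [3.0, 400.0, None, 3.2]:
        print(reading, "->", guard.accept(reading))
```

The consumer keeps working on plausible data, and the `rejected` counter leaves a trace that the upstream unit needs attention.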

Recovery
Once a failure has occurred, in many cases it is important to recover
critical processes to a known state in order to resume processing.
The problem is compounded in distributed systems.
Two approaches:
• Backward recovery: use checkpointing (a global snapshot of the
distributed system's status) to record the system state and roll back
to it after a failure; checkpointing is costly (performance degradation).
• Forward recovery: attempt to bring the system to a new stable
state from which it is possible to proceed (applied in situations
where the nature of the errors is known and a reset can be applied).
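
Backward recovery can be sketched for a single process as follows. This is a simplified illustration: the class `CheckpointedWorker` and the function `run` are hypothetical, and a real distributed system would additionally need a consistent global snapshot across processes, which is not shown here.

```python
import copy


class CheckpointedWorker:
    """Keeps a saved copy of its state and can roll back to it."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Record the current state (this copy is the cost of checkpointing).
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Return to the last recorded state.
        self.state = copy.deepcopy(self._saved)

    def process(self, value):
        self.state["processed"] += 1      # first half of the update
        if value < 0:
            # Simulated error in the middle of an update: the state is now
            # inconsistent (counted but not added).
            raise ValueError("invalid input")
        self.state["total"] += value      # second half of the update


def run(values):
    worker = CheckpointedWorker()
    for value in values:
        try:
            worker.process(value)
            worker.checkpoint()           # checkpoint after each good step
        except ValueError:
            worker.rollback()             # backward recovery
    return worker.state


if __name__ == "__main__":
    print(run([1, 2, -1, 3]))
```

Checkpointing after every step keeps the amount of lost work small but pays the copying cost each time; checkpointing less often trades recovery time for lower overhead.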

Fault Tolerance in Distributed Systems
• System attributes:
  · Availability – the system is ready for use; the probability that
    the system is ready or available at a given time
  · Reliability – the property that a system can run without failure
    for a given time
  · Safety – concerns what happens when the system fails
  · Maintainability – the ease of repairing a failed system
• A failure in a distributed system occurs when a service cannot
be fully provided
• System failure may be partial
• A single failure may affect other parts of a system (failure
escalation)

Lost Request Messages when Server Crashes
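A server in client-server communication:
• Normal case: the request is received, executed, and a reply is sent
• Crash after execution: the request is executed, but the server
crashes before sending the reply
• Crash before execution: the server crashes before executing the request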
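
The three cases can be simulated to show why they are hard for the client to handle: in both crash cases the client only sees a missing reply. The names below (`Scenario`, `Server`, `client_view`) are hypothetical, and the crash is modelled simply as returning no reply.

```python
from enum import Enum


class Scenario(Enum):
    NORMAL = "normal"
    CRASH_AFTER_EXECUTION = "crash after execution"
    CRASH_BEFORE_EXECUTION = "crash before execution"


class Server:
    def __init__(self):
        self.executed = 0  # how many times the operation actually ran

    def handle(self, scenario):
        """Return a reply, or None if the server crashed before replying."""
        if scenario is Scenario.CRASH_BEFORE_EXECUTION:
            return None          # request received, crash, nothing executed
        self.executed += 1       # the operation is executed
        if scenario is Scenario.CRASH_AFTER_EXECUTION:
            return None          # crash before the reply is sent
        return "ok"


def client_view(scenario):
    """What the client observes, plus the hidden server-side truth."""
    server = Server()
    reply = server.handle(scenario)
    observed = "reply" if reply is not None else "timeout"
    return observed, server.executed


if __name__ == "__main__":
    for scenario in Scenario:
        print(scenario.value, "->", client_view(scenario))
```

Because the client sees the same timeout in both crash cases, blindly resending the request may execute the operation twice, while never resending may leave it unexecuted.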

REDUNDANCY
A fault tolerant system (FTS) relies on properly managed
redundancy, i.e. the system is kept running despite the
failure of some of its parts.
It must have spare capacity to begin with.
TYPES OF REDUNDANCY
• Hardware redundancy
• Software redundancy
• Time redundancy
• Information redundancy
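
Two of these types are easy to show in a few lines. The sketch below is illustrative only and its function names are assumptions: a single even-parity bit stands in for information redundancy, and running a computation twice and comparing the results stands in for time redundancy.

```python
def add_parity(bits):
    """Information redundancy: append one even-parity bit to the data."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the word (data plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0


def run_twice(compute, x):
    """Time redundancy: repeat the computation and compare the results."""
    first = compute(x)
    second = compute(x)
    if first != second:
        raise RuntimeError("results disagree: transient fault suspected")
    return first


if __name__ == "__main__":
    word = add_parity([1, 0, 1, 1])
    print("stored word:", word, "parity ok:", parity_ok(word))
    word[0] ^= 1  # simulate a single bit flip
    print("after bit flip, parity ok:", parity_ok(word))
    print("checked square:", run_twice(lambda v: v * v, 7))
```

A single parity bit only detects an odd number of flipped bits and cannot correct them; the repeated run only catches transient faults, since a permanent fault would give the same wrong answer twice.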

Software Redundancy
• Software faults are not like hardware faults: software
never wears out, and its faults are not generated
spontaneously during system operation.
• Software faults can be regarded as faults in design.
• For software redundancy, simply replicating the same
software N times will not work: all N copies will fail for
the same inputs.
• Instead, N versions of the software can be implemented.
The N versions can be developed by independent teams,
with no contact between them.
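
A minimal sketch of how the outputs of N versions could be combined by majority voting. The three `version_*` functions are hypothetical stand-ins for independently developed implementations (one is deliberately faulty), and the voter assumes results can be compared for exact equality.

```python
from collections import Counter


def version_a(x):
    return abs(x)


def version_b(x):
    return x if x >= 0 else -x


def version_c(x):
    return x  # deliberately faulty: wrong for negative inputs


def n_version_vote(versions, x):
    """Run every version on the same input and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            continue  # a crashing version simply does not vote
    if not results:
        raise RuntimeError("no version produced a result")
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


if __name__ == "__main__":
    versions = [version_a, version_b, version_c]
    print(n_version_vote(versions, -4))
    print(n_version_vote(versions, 5))
```

The faulty version is outvoted only because the other two fail independently of it, which is why the versions are developed without contact between teams.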

• Each version is developed by a team of developers
who never communicate with the other teams.
To minimize common-mode failures:
• The specifications should be written in formal terms and
be subject to a rigorous checking process.
• Multiple software versions should be developed in
different programming languages.
• The tools used should be selected carefully.
• The training and quality of the programmers should be
maintained.

There are two approaches to software redundancy:
• N-Version Programming
• Recovery Block Approach
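
The slides only name the recovery block approach. As generally described, it runs a primary routine, checks its result with an acceptance test, and falls back to an alternate routine if the test fails. The sketch below follows that outline; the square-root routines, the tolerance, and all names are hypothetical, and no state restoration is shown because the inputs are immutable.

```python
import math


def primary_sqrt(x):
    # Deliberately faulty primary: wrong for inputs above 100.
    return x / 2 if x > 100 else math.sqrt(x)


def alternate_sqrt(x):
    # Independent alternate: Newton's method.
    guess = x if x > 1 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess


def acceptance_test(x, result):
    """Accept a result whose square is close enough to the input."""
    return result >= 0 and abs(result * result - x) <= 1e-6 * max(1.0, x)


def recovery_block(x, routines):
    """Try each routine in order until one passes the acceptance test."""
    for routine in routines:
        try:
            result = routine(x)
        except Exception:
            continue  # treat a crash like a failed acceptance test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all routines failed the acceptance test")


if __name__ == "__main__":
    routines = [primary_sqrt, alternate_sqrt]
    print(recovery_block(16.0, routines))
    print(recovery_block(400.0, routines))
```

Unlike N-version voting, only one routine runs in the normal case; the price is that everything depends on the quality of the acceptance test.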

Applications Of Fault Tolerance System
1. Long-life applications:
• e.g. space systems, satellites
• typical requirement: Availability (10 years) ≥ 0.95
• outages in between are allowed
2. Critical-computation applications:
• e.g. systems critical to human safety: aircraft control systems
• typical requirement: Reliability (3 years) ≥ 0.9999999 (seven nines)
3. Maintenance-postponement applications:
• when maintenance operations are extremely costly
• e.g. space systems and remote processing systems, like telephone
switching systems (e.g. maintenance only once a month)

4. High-availability applications:
• e.g. banking, flight reservation
5. Transportation systems:
– train/subway
– ships
– automobiles
  • ABS (anti-lock braking system)
  • ESP (electronic stability program)
  • airbag activation
  • electronic ignition/fuel pump

STEPS TO PREVENT FAILURE
• Power failure
• Power surge
• Data loss
• Device or computer failure
• Unauthorized access
• Overload
• Virus

Disadvantages
• Interference with fault detection in the same
component.
• Interference with fault detection in another
component.
• Reduction of priority of fault correction.
• Test difficulty.
• Cost.
• Inferior components.

Conclusion
• Hardware, software and networks cannot be totally free
from failures.
• Fault tolerance is a non-functional requirement that requires
a system to continue to operate, even in the presence of faults.
• Distributed systems can be more fault tolerant than
centralized systems.
• Agreement in faulty systems and reliable group
communication are important problems in distributed systems.
• Replication of data is a major fault tolerance method in
distributed systems.
• Recovery is another property to consider in faulty
distributed environments.
