2. FAULT TOLERANCE
The ability of a system to continue operating uninterrupted despite the
failure of one or more of its components
How an OS responds to and tolerates malfunctions and failures
It guarantees no break in service
Recovers from failure completely and transparently
3. FAULT TOLERANCE
Every gain in fault tolerance comes with a drawback
somewhere else
The system will be slower, take more disk space, use more
machines and incur other costs
Therefore, fault tolerance is always a trade-off between cost and
the degree of fault tolerance.
4. FAILURE VS ERROR
Failure: the system's behavior differs from what is expected
A failure might involve the system being unreachable or
producing incorrect output
Error: an incorrect state within the system that may lead to a failure.
Errors do not necessarily cause failures; they can be detected in the
system before they produce a failure.
5. FAULT TOLERANCE
Fault tolerance usually runs through several phases.
Error detection: the error has to be detected in order to avoid a failure.
Damage confinement: the error must be prevented from spreading
to other components
Error recovery: the error must be removed, otherwise the system would
run into a failure
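The three phases above can be sketched in a few lines of Python. This is only an illustration: the `Component` class, its fields, and the rule that a negative state counts as an error are all assumptions made for the example, not part of any real system.

```python
class Component:
    """Hypothetical component whose state can be checked for errors."""

    def __init__(self, name):
        self.name = name
        self.state = 0         # current (possibly corrupted) state
        self.last_good = 0     # last state known to be correct
        self.isolated = False  # True while the component is confined

    def detect_error(self):
        # Error detection: a validity check (assumed rule: state must be >= 0).
        return self.state < 0

    def confine(self):
        # Damage confinement: mark the component so others stop using it.
        self.isolated = True

    def recover(self):
        # Error recovery: restore the last known good state and rejoin.
        self.state = self.last_good
        self.isolated = False


def handle(component):
    """Run the three phases; return the names of the phases that were executed."""
    phases = []
    if component.detect_error():
        phases.append("detect")
        component.confine()
        phases.append("confine")
        component.recover()
        phases.append("recover")
    else:
        # No error: remember this state as the latest known good one.
        component.last_good = component.state
    return phases
```

Recovery here only works because a good state was remembered before the error appeared; without it, the error could be detected and confined but not removed.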
6. PROCESSOR FAULT
Occurs when the processor behaves in an unexpected manner. Processor
faults may be classified into three kinds.
1. Fail stop: the processor has failed completely and will never respond; neighboring
processors can detect the failed processor
2. Slowdown: the processor might run in a degraded form or might
fail totally
3. Byzantine: the processor can fail, run in a degraded fashion for some
time, or execute at normal speed while trying to make the computation fail
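One common way for neighboring processors to detect a fail-stop fault is a heartbeat timeout. The sketch below assumes each processor periodically reports a timestamp; the function name, the dictionary layout, and the timeout value are illustrative choices.

```python
def detect_failed(last_heartbeat, now, timeout):
    """Return the sorted names of processors whose last heartbeat is older than timeout.

    last_heartbeat maps a processor name to the time (in seconds) its most
    recent heartbeat was received; now is the current time on the same clock.
    """
    return sorted(p for p, t in last_heartbeat.items() if now - t > timeout)


# p2 has been silent for 8 seconds, which exceeds the 5 second timeout.
suspected = detect_failed({"p1": 10.0, "p2": 4.0}, now=12.0, timeout=5.0)
```

A timeout alone cannot tell a fail-stop processor from a slowed-down one, and it does nothing against a Byzantine processor that keeps sending heartbeats while corrupting the computation.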
7. NETWORK FAULTS
Occur when processors are prevented from communicating with each
other. Link faults can cause new kinds of problems, such as:
One-way links: one processor can send messages to the other but
is not able to receive messages from it.
Network partition: a portion of the network is completely isolated
from the rest
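Both link problems can be illustrated on a small model of the network. In this sketch a link is a directed pair `(sender, receiver)`; the function names and the representation are assumptions made for the example.

```python
def one_way_links(links):
    """Return the links (a, b) where a can send to b but b cannot send to a."""
    return sorted((a, b) for (a, b) in links if (b, a) not in links)


def partitions(nodes, links):
    """Group nodes into sets that are still connected (link direction ignored)."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first search collects everything reachable from start.
        group, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(neighbours[n] - group)
        seen |= group
        groups.append(sorted(group))
    return groups
```

More than one group in the result of `partitions` means the network is partitioned: processors in different groups cannot exchange messages at all.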
8. ATTRIBUTES OF FAULT TOLERANT SYSTEM
A fault-tolerant system is a dependable system, which requires the following
attributes
1. Availability: the system is in a ready state and ready to deliver its
functions. A highly available system is very likely to be working at any given instant in time.
2. Reliability: the ability of a computer to run continuously without failure. It is
defined over a time interval rather than at an instant. A reliable system works
continuously without interruption.
3. Safety: when the system fails to carry out its processes correctly and
its operations are incorrect, nothing catastrophic happens and other
systems are not made faulty
4. Maintainability: how easily failures can be noticed and fixed.
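The difference between availability (an instant) and reliability (an interval) shows up clearly in the usual textbook formulas. The sketch below assumes steady-state availability MTTF / (MTTF + MTTR) and, for reliability, a constant failure rate (exponential model); both are standard simplifications and are not taken from the slides.

```python
import math


def availability(mttf, mttr):
    """Fraction of time the system is up, from mean time to failure and to repair."""
    return mttf / (mttf + mttr)


def reliability(t, mttf):
    """Probability of running for an interval t without failure (exponential model)."""
    return math.exp(-t / mttf)


# A system that fails every 999 hours on average and takes 1 hour to repair
# is available 99.9% of the time, yet its chance of running 999 hours
# without any failure is only about 37%.
a = availability(999, 1)
r = reliability(999, 999)
```

A system can therefore be highly available without being highly reliable: short, frequent outages hurt reliability far more than availability.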
11. FAULT TOLERANCE MECHANISM IN DISTRIBUTED SYSTEM
Replication based fault tolerance technique
Process level redundancy technique
Fusion based redundancy technique
12. REPLICATION BASED FAULT TOLERANCE TECHNIQUE
Replicate the data on other machines, so that the failure of one
machine does not cause the whole system to stop.
The replicas are placed on different servers.
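A minimal sketch of the idea, assuming a hypothetical in-memory store in which every write goes to all live servers and a read is answered by the first live replica that holds the key. The class names and this write-all, read-any policy are illustrative choices.

```python
class Server:
    """Hypothetical server holding one replica of the data."""

    def __init__(self):
        self.data = {}
        self.up = True


class ReplicatedStore:
    def __init__(self, n):
        self.servers = [Server() for _ in range(n)]

    def write(self, key, value):
        # Copy the value to every server that is currently up.
        for s in self.servers:
            if s.up:
                s.data[key] = value

    def read(self, key):
        # Any live replica holding the key can answer, so one crash does not stop reads.
        for s in self.servers:
            if s.up and key in s.data:
                return s.data[key]
        raise RuntimeError("no live replica holds the key")


store = ReplicatedStore(3)
store.write("x", 1)
store.servers[0].up = False  # simulate one server failing
value = store.read("x")      # still answered by another replica
```

A server that was down during a write misses the update, which is exactly the consistency problem that replication introduces.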
13. PROBLEMS OF REPLICATION
Consistency: the major problem of replication is keeping the replicas
consistent, because any client may update the data. Consistency of data is
ensured by a consistency model, such as the sequential or causal memory
consistency model
Degree of replication: a large number of replicas is needed in
order to achieve high fault tolerance.
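The link between the number of replicas and the degree of fault tolerance can be made concrete under one simplifying assumption: replicas fail independently, each with the same probability. The function names and the numbers below are illustrative.

```python
def all_replicas_fail(p_fail, n):
    """Probability that all n replicas are down at once (independent failures assumed)."""
    return p_fail ** n


def replicas_needed(p_fail, target):
    """Smallest number of replicas whose joint failure probability is at most target."""
    n = 1
    while all_replicas_fail(p_fail, n) > target:
        n += 1
    return n


# With a 50% chance of each replica being down, four replicas are needed
# to push the chance of losing all of them to 10% or less.
n = replicas_needed(0.5, 0.1)
```

Each extra replica multiplies the joint failure probability by `p_fail`, but it also adds a machine and more consistency traffic, which is the cost side of the trade-off.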
14. PROCESS LEVEL REDUNDANCY TECHNIQUES
Faults that disappear without anything being done are called transient
faults. This type of fault is hard to identify
To handle transient faults, software-based fault tolerance techniques
are used
PLR compares redundant processes to ensure correct execution
Checkpoint and rollback is a popular technique in which the
current state of the system is saved so that it can be restored after a fault.
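Both techniques can be sketched briefly. The first function runs redundant copies of a computation and compares their outputs; resolving a disagreement by majority vote is an assumption of this example. The second class saves a copy of the state and restores it on request. All names are hypothetical.

```python
import copy
from collections import Counter


def plr_run(replicas, x):
    """Run redundant copies of a computation on x and return the majority output."""
    outputs = [f(x) for f in replicas]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        # The copies disagree and no output has a strict majority.
        raise RuntimeError("fault detected but no majority to correct it")
    return value


class Checkpointed:
    """Holds a state that can be saved (checkpoint) and restored (rollback)."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Save a deep copy so later changes cannot alter the checkpoint.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and return to the last checkpoint.
        self.state = copy.deepcopy(self._saved)
```

With three copies, one transient fault is outvoted by the two correct results. With only two copies a mismatch can be detected but not corrected, and rolling back to a checkpoint and re-executing is one way to recover in that case.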
15. FUSION BASED TECHNIQUE
Replication: the downside is multiple backups, which increase cost
This problem is solved by the fusion-based technique because it
requires fewer backups
Backup machines are fused to a given set of systems (NP problem)
The fusion-based technique has very high overhead during the recovery
process, so it is acceptable when the probability of a fault in the
system is low.
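The trade-off can be illustrated with the simplest possible fused backup: a single XOR of the primaries' states. This is a deliberately simplified stand-in, assumed here for illustration only; it tolerates one failure among integer states and is not the construction used by actual fusion-based techniques.

```python
from functools import reduce
from operator import xor


def fuse(primaries):
    """Build one fused backup by XOR-ing the integer states of all primaries."""
    return reduce(xor, primaries, 0)


def recover(survivors, fused):
    """Rebuild the single lost state from the fused backup and all surviving states."""
    return reduce(xor, survivors, fused)


primaries = [5, 9, 12]
backup = fuse(primaries)       # one backup instead of three replicas
lost = recover([5, 12], backup)  # the primary holding 9 has failed
```

Replication would keep one backup per primary and recover by reading a single copy. The fused backup is one value, but recovery has to read every surviving primary, which mirrors the high recovery overhead noted above.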