FAULT TOLERANCE
By Gaurav Singh Rawat
Electrical Department
Systems Engineering
Fault Tolerance
Fault-tolerant computing is the art and science of
building computing systems that continue to operate
satisfactorily in the presence of faults. A fault-tolerant
system may be able to tolerate one or more fault types,
including:
i) transient (caused by external disturbances),
intermittent (caused by marginal design errors) or
permanent hardware faults,
ii) software and hardware design errors,
iii) operator errors, or
iv) externally induced upsets or physical damage.
Fault tolerance concept taxonomy
Fault-Tolerance
 Threats
◦ Faults
◦ Errors
◦ Failures
 Attributes
◦ Availability
◦ Performability
◦ Graceful Degradation
◦ Maintainability
◦ Testability
 Means
◦ Error Detection
◦ System Recovery
◦ Fault Masking
◦ Reconfiguration
◦ Redundancy
Basic Concept
Dependability includes:
 Availability
 Reliability
 Safety (security)
 Maintainability
Availability & Reliability
 Availability: A measure of whether a
system is ready to be used immediately
◦ The system is operating correctly at any given moment
 Reliability: A measure of whether a
system can run continuously without
failure
◦ The system continues to function for a long
period of time
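The difference between the two attributes can be put into numbers. The sketch below is only an illustration: it assumes the common steady-state formula availability = MTTF / (MTTF + MTTR) and a constant failure rate of 1/MTTF for reliability. The function names and the figures are hypothetical and do not come from the slides.

```python
import math


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is ready for use: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of running t hours without any failure.

    Assumes a constant failure rate of 1/MTTF (exponential model).
    """
    return math.exp(-t_hours / mttf_hours)


# Hypothetical system: fails on average every 999 hours, repaired in 1 hour.
availability = steady_state_availability(mttf_hours=999.0, mttr_hours=1.0)
reliability_1000h = reliability(t_hours=1000.0, mttf_hours=999.0)

print(f"availability            = {availability:.3f}")
print(f"reliability over 1000 h = {reliability_1000h:.3f}")
```

With these assumed figures the system is almost always available, yet it is unlikely to run for 1000 hours without a single failure, which is why the two attributes are listed separately.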
Safety & Maintainability
 Safety: A measure of how benign the
consequences of a failure are
◦ When the system fails, nothing serious happens
◦ For instance, a high degree of safety is required for
systems controlling nuclear power plants
 Maintainability: A measure of how
easy it is to repair a system
◦ A highly maintainable system may also show a
high degree of availability
◦ In self-healing systems, failures can be detected
and repaired automatically
What is a Fault?
 A system fails when it cannot meet its promises
(specifications)
 An error is part of a system state that may lead to
a failure
 A fault is the cause of the error
 Fault-Tolerance: the system can provide services
even in the presence of faults
 Faults can be:
◦ Transient (appear once and disappear)
◦ Intermittent (appear-disappear-reappear behavior)
 Example: a loose contact on a connector causes an intermittent fault
◦ Permanent (appear and persist until repaired)
Failure Model
Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times
(Byzantine failure)
Error Detection
 Error detection is the detection of errors
caused by noise or other impairments during
transmission from the transmitter to the
receiver.
 There are many error detection schemes:
1. Repetition codes.
2. Parity bits.
3. Checksums.
4. Cyclic redundancy checks.
5. Cryptographic hash functions.
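Several of the schemes listed above can be compared on the same corrupted message. This is a minimal sketch: `parity_bit` and `checksum8` are hypothetical helpers written for the illustration, while `zlib.crc32` and `hashlib.sha256` come from the Python standard library. Repetition codes are left out for brevity.

```python
import hashlib
import zlib


def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 if the number of 1-bits in data is odd, else 0."""
    return sum(bin(byte).count("1") for byte in data) % 2


def checksum8(data: bytes) -> int:
    """Simple additive checksum: sum of all bytes modulo 256."""
    return sum(data) % 256


def sha256_hex(data: bytes) -> str:
    """Cryptographic hash of the message as a hex string."""
    return hashlib.sha256(data).hexdigest()


message = b"fault tolerance"
one_bit_error = bytes([message[0] ^ 0x01]) + message[1:]  # flip one bit
two_bit_error = bytes([message[0] ^ 0x03]) + message[1:]  # flip two bits

schemes = {
    "parity": parity_bit,
    "checksum": checksum8,
    "crc32": zlib.crc32,
    "sha256": sha256_hex,
}

# A scheme detects the error if the check value of the received
# message differs from the check value of the original message.
results = {name: fn(message) != fn(one_bit_error) for name, fn in schemes.items()}
for name, detected in results.items():
    print(f"{name}: single-bit error detected = {detected}")

# A single parity bit cannot see an even number of flipped bits.
print("parity detects two-bit error =", parity_bit(message) != parity_bit(two_bit_error))
```

The two-bit case shows the trade-off between the schemes: a single parity bit is the cheapest check but misses every even number of bit flips, while the CRC and the hash still catch the error here.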
System Recovery
 We have talked a lot about fault tolerance,
but not about what happens after a fault
has occurred.
 A process that exhibits a failure has to be
able to recover to a correct state
 There are two types of recovery:
1. Backward Recovery.
2. Forward Recovery.
Backward Recovery
 The goal of backward recovery is to bring
the system from an erroneous state back
to a prior correct state
 The state of the system must be recorded
- checkpointed - from time to time, and
then restored when things go wrong
 Example:
◦ Reliable communication through packet
retransmission
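Checkpoint and restore can be sketched in a few lines. The class below is a hypothetical process invented for the illustration: its whole state is one dictionary, a checkpoint is a deep copy of it, and a negative balance stands in for a detected error.

```python
import copy


class CheckpointedAccount:
    """Hypothetical process whose state can be checkpointed and restored."""

    def __init__(self, balance: int):
        self.state = {"balance": balance}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        """Record the current (assumed correct) state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Backward recovery: return to the last recorded state."""
        self.state = copy.deepcopy(self._checkpoint)

    def withdraw(self, amount: int) -> None:
        self.state["balance"] -= amount
        if self.state["balance"] < 0:
            # Error detection: the state violates the specification.
            raise ValueError("balance went negative")


account = CheckpointedAccount(balance=100)
account.withdraw(30)
account.checkpoint()           # correct state recorded: balance 70

try:
    account.withdraw(100)      # drives the state to -30, an erroneous state
except ValueError:
    account.rollback()         # restore the prior correct state

print(account.state)
```

The cost of this approach is the checkpoint itself: state has to be copied from time to time even when nothing goes wrong.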
Forward Recovery
 The goal of forward recovery is to bring a
system from an erroneous state to a
correct new state (not a previous state)
 Example:
◦ Reliable communication via erasure correction
(lost packets are reconstructed from the packets
that did arrive), such as an (n, k) block erasure code.
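The simplest block erasure code is a single XOR parity block, which gives an (n, k) code with n = k + 1. The sketch below assumes equal-length blocks and at most one lost block. The function names are hypothetical, and practical systems typically use stronger codes such as Reed-Solomon.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Append one parity block: k data blocks become n = k + 1 blocks."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(received):
    """Rebuild the k data blocks when at most one block (None) was lost."""
    missing = [i for i, block in enumerate(received) if block is None]
    if len(missing) > 1:
        raise ValueError("a single parity block can repair only one erasure")
    repaired = list(received)
    if missing:
        present = [block for block in received if block is not None]
        # XOR of all surviving blocks equals the lost block.
        repaired[missing[0]] = xor_blocks(present)
    return repaired[:-1]  # drop the parity block


data = [b"abcd", b"efgh", b"ijkl"]
sent = encode(data)
sent[1] = None                 # the second packet is lost in transit
recovered = recover(sent)
print(recovered == data)
```

No retransmission is needed: the receiver moves forward to a correct state using only what arrived.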
Fault Masking
 Fault masking is a structural redundancy
technique that completely masks faults
within a set of redundant modules.
 Redundancy is the key technique for hiding
failures.
 Redundancy, however, can have an
adverse impact on the performance of a
system. For example, it can increase the
length of transmitted data or increase
resource consumption.
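A majority voter over three redundant modules (triple modular redundancy) is a common way to mask a fault. The sketch below is only an illustration: the two module functions are hypothetical stand-ins for replicated hardware or software units.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: the fault cannot be masked")
    return value


def healthy_module(x):
    return x * x


def faulty_module(x):
    return x * x + 1  # hypothetical module with a permanent fault


def run_redundant(modules, x):
    """Run every module on the same input and vote on the result."""
    return majority_vote([module(x) for module in modules])


masked = run_redundant([healthy_module, faulty_module, healthy_module], 7)
print(masked)
```

The caller never sees the faulty output, but the computation is carried out three times, which is the resource cost mentioned above.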
Reconfiguration
 Reconfiguration is the “process of
eliminating a faulty entity from a system
and restoring the system to some
operational condition or state”.
 When reconfiguration is used, the
designer must be concerned with fault
detection, fault location, fault containment,
and fault recovery.
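The four concerns can be seen together in a small standby-spare arrangement. This is a minimal sketch under simplifying assumptions: a failed module announces itself by raising an exception, and all class and attribute names are hypothetical.

```python
class Module:
    """Hypothetical processing module that doubles its input."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def run(self, x: int) -> int:
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed")
        return 2 * x


class ReconfigurableSystem:
    """One active module plus standby spares."""

    def __init__(self, active: Module, spares):
        self.active = active
        self.spares = list(spares)
        self.retired = []

    def run(self, x: int) -> int:
        while True:
            try:
                return self.active.run(x)
            except RuntimeError:
                # Detection and location: the active module raised.
                # Containment: it is retired and never used again.
                self.retired.append(self.active)
                if not self.spares:
                    raise RuntimeError("no spare modules left")
                # Recovery: a spare takes over and the request is retried.
                self.active = self.spares.pop(0)


system = ReconfigurableSystem(Module("primary", healthy=False), [Module("spare-1")])
result = system.run(21)
print(result, system.active.name, [m.name for m in system.retired])
```

Real systems rarely get such a clean failure signal, so detection and location usually need their own mechanisms, such as comparison, watchdog timers, or self-test.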
Redundancy
 In engineering, redundancy is the
duplication of critical components or
functions of a system with the intention of
increasing the reliability of the system.
 There are four types of redundancy:
1. Hardware (such as dual and triple modular
redundancy, DMR and TMR)
2. Software (N-version programming)
3. Time (transient fault detection, such as
alternating logic)
4. Information (error detection or
correction)
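Time redundancy can be illustrated by repeating a computation and comparing the results. The sketch below uses plain recomputation rather than the alternating logic named above, and simulates a transient fault with a hypothetical unit that corrupts only its first result.

```python
def run_with_time_redundancy(compute, x, max_attempts=5):
    """Repeat compute(x) until two consecutive results agree."""
    previous = compute(x)
    for _ in range(max_attempts - 1):
        current = compute(x)
        if current == previous:
            return current      # two runs agree: accept the result
        previous = current      # mismatch: a transient fault is suspected
    raise RuntimeError("results never agreed; the fault may be permanent")


class FlakyIncrementer:
    """Hypothetical unit that is hit by a transient upset on its first call."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        result = x + 1
        if self.calls == 1:
            result ^= 0x4       # simulated bit flip in the first result
        return result


unit = FlakyIncrementer()
value = run_with_time_redundancy(unit, 10)
print(value, unit.calls)
```

No extra hardware is needed, but the computation takes longer, and a permanent fault that corrupts every run in the same way would go unnoticed by simple recomputation.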
Conclusion
Fault tolerance is achieved by applying a set of
analysis and design techniques to create systems
with dramatically improved dependability. As new
technologies are developed and new applications
arise, new fault-tolerance approaches are also
needed. Chips now contain complex, highly
integrated functions, and hardware and software
must be crafted to meet a variety of standards to
be economically viable. Thus a great deal of
current research focuses on implementing fault
tolerance using COTS (Commercial-Off-The-Shelf)
technology.