Fault Tolerance
Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults. A fault-tolerant system may be able to tolerate one or more fault types, including:
i) transient (caused by an external disturbance), intermittent (caused by marginal design errors) or permanent hardware faults,
ii) software and hardware design errors,
iii) operator errors, or
iv) externally induced upsets or physical damage.
Availability & Reliability
Availability: a measurement of whether a system is ready to be used immediately
  ◦ System is available at any given moment
Reliability: a measurement of whether a system can run continuously without failure
  ◦ System continues to function for a long period of time
Safety & Maintainability
Safety: a measurement of how safe failures are
  ◦ System fails, nothing serious happens
  ◦ For instance, a high degree of safety is required for systems controlling nuclear power plants
Maintainability: a measurement of how easy it is to repair a system
  ◦ A highly maintainable system may also show a high degree of availability
  ◦ Failures can be detected and repaired automatically (self-healing systems)
What is a Fault?
A system fails when it cannot meet its promises (specifications)
An error is a part of the system state that may lead to a failure
A fault is the cause of the error
Fault tolerance: the system can provide its services even in the presence of faults
Faults can be:
  ◦ Transient (appear once and disappear)
  ◦ Intermittent (appear, disappear and reappear)
    Example: a loose contact on a connector causes an intermittent fault
  ◦ Permanent (appear and persist until repaired)
Failure Model
Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times
  (Byzantine failure)
Error Detection
Error detection is the detection of errors caused by noise or other impairments during transmission from the transmitter to the receiver.
There are many error-detection schemes:
1. Repetition codes
2. Parity bits
3. Checksums
4. Cyclic redundancy checks
5. Cryptographic hash functions
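To make the difference between some of these schemes concrete, here is a minimal Python sketch comparing a parity bit, a simple 8-bit additive checksum, and a CRC-32 on the same message. The function names and the byte-sum checksum are illustrative assumptions, not taken from the slides; only `zlib.crc32` is a standard-library call.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def additive_checksum(data: bytes) -> int:
    """Sum of all bytes modulo 256 (an assumed, very simple checksum)."""
    return sum(data) % 256


def detect(sent: bytes, received: bytes) -> dict:
    """Report, per scheme, whether the corruption would be noticed."""
    return {
        "parity": even_parity_bit(sent) != even_parity_bit(received),
        "checksum": additive_checksum(sent) != additive_checksum(received),
        "crc32": zlib.crc32(sent) != zlib.crc32(received),
    }


# One flipped bit ('f' -> 'g'): every scheme notices.
single_bit = detect(b"fault", b"gault")

# Two flipped bits that cancel out ('f' -> 'g', 'a' -> '`'):
# the bit count and the byte sum are unchanged, so only the CRC notices.
double_bit = detect(b"fault", b"g`ult")
```

In this sketch the weaker schemes miss the second corruption because the two bit flips compensate for each other, which is one reason CRCs are usually preferred for transmission errors.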
System Recovery
We have talked a lot about fault tolerance, but not about what happens after a fault has occurred.
A process that exhibits a failure has to be able to recover to a correct state.
There are two types of recovery:
1. Backward recovery
2. Forward recovery
Backward Recovery
The goal of backward recovery is to bring the system from an erroneous state back to a prior correct state.
The state of the system must be recorded (checkpointed) from time to time, and then restored when things go wrong.
Examples:
  ◦ Reliable communication through packet retransmission
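The checkpoint-and-restore idea can be sketched in a few lines of Python. The class `CheckpointedCounter`, its state fields, and the "negative value" error are hypothetical, chosen only to show a state that is partly modified before the error is detected.

```python
import copy


class CheckpointedCounter:
    """Hypothetical component whose state can be checkpointed and restored."""

    def __init__(self):
        self.state = {"total": 0, "applied": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Record the current (assumed correct) state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: return to the last recorded state.
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, value):
        self.state["applied"] += 1  # state is modified first...
        if value < 0:
            # ...then the error is detected, leaving the state inconsistent.
            raise ValueError("negative value")
        self.state["total"] += value


def run(values):
    counter = CheckpointedCounter()
    for value in values:
        counter.checkpoint()
        try:
            counter.apply(value)
        except ValueError:
            counter.rollback()
    return counter.state


final_state = run([5, -1, 7])
```

Here the failed update leaves no trace: after the rollback, `applied` counts only the two successful operations. The cost is the copy taken at every checkpoint, which real systems typically reduce by checkpointing less often.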
Forward Recovery
The goal of forward recovery is to bring a system from an erroneous state to a correct new state (not a previous state).
Examples:
  ◦ Reliable communication via erasure correction (recovering lost, i.e. erased, symbols from the ones that did arrive), such as an (n, k) block erasure code
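As a rough illustration of an (n, k) block erasure code, the sketch below uses the simplest case, a single XOR parity block, so n = k + 1 and any one lost block can be rebuilt without retransmission. The helper names and the equal-length-block assumption are illustrative choices, not part of the slides.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Append one parity block: k data blocks become n = k + 1 blocks."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(received):
    """Rebuild the k data blocks; `None` marks an erased (lost) block."""
    received = list(received)
    missing = [i for i, block in enumerate(received) if block is None]
    if len(missing) > 1:
        raise ValueError("a single parity block can repair only one erasure")
    if missing:
        present = [block for block in received if block is not None]
        # XOR of all surviving blocks equals the erased block.
        received[missing[0]] = xor_blocks(present)
    return received[:-1]


data = [b"ab", b"cd", b"ef"]
coded = encode(data)   # 4 blocks sent for 3 data blocks
coded[1] = None        # one block is lost in transit
restored = recover(coded)
```

The receiver moves forward to a correct state using only what it already has, in contrast to retransmission, which goes back and repeats the transfer.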
Fault Masking
Fault masking is a structural redundancy technique that completely masks faults within a set of redundant modules.
Redundancy is a key technique for hiding failures.
Redundancy, however, can have an adverse impact on the performance of a system. For example, it can increase the length of transmitted data or increase resource consumption.
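A common way to mask faults across redundant modules is majority voting. The following Python sketch assumes three software "modules" (plain functions) and a voter; the names `majority_vote` and `masked_call` are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: the fault cannot be masked")
    return value


def masked_call(modules, x):
    """Run every redundant module on the same input and vote on the result."""
    return majority_vote([module(x) for module in modules])


def good(x):
    return x * x


def faulty(x):
    return x * x + 1  # assumed faulty module producing a wrong value


result = masked_call([good, faulty, good], 4)
```

The caller never sees the wrong value from the faulty module, but every request now costs three computations plus a vote, which is the performance impact noted above.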
Reconfiguration
Reconfiguration is the “process of eliminating a faulty entity from a system and restoring the system to some operational condition or state”.
When reconfiguration is used, the designer must be concerned with fault detection, fault location, fault containment, and fault recovery.
Redundancy
In engineering, redundancy is the duplication of critical components or functions of a system with the intention of increasing the reliability of the system.
There are four types of redundancy:
1. Hardware (such as dual and triple modular redundancy, DMR and TMR)
2. Software (N-version programming)
3. Time (transient fault detection, such as alternating logic)
4. Information (error detection or correction)
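Time redundancy can be sketched as running the same computation twice and comparing the results, repeating the pair if they disagree. This is a simplified recompute-and-compare illustration, not alternating logic itself; the transient fault is simulated by a hypothetical helper, `make_flaky_square`, that corrupts the result on chosen call numbers.

```python
def make_flaky_square(faulty_calls):
    """Build a squaring function that returns a corrupted result on the
    given call numbers (simulating transient faults)."""
    calls = {"n": 0}

    def square(x):
        calls["n"] += 1
        result = x * x
        if calls["n"] in faulty_calls:
            result ^= 1  # flip the lowest bit
        return result

    return square


def run_with_time_redundancy(compute, x, max_attempts=3):
    """Execute `compute` twice per attempt and accept only matching results."""
    for _ in range(max_attempts):
        first = compute(x)
        second = compute(x)
        if first == second:
            return first
    raise RuntimeError("results never agreed; the fault may not be transient")


# The first call is hit by a transient fault; the second pair of runs agrees.
value = run_with_time_redundancy(make_flaky_square({1}), 6)
```

No extra hardware is needed, only extra time. A permanent fault that corrupts both runs identically would go unnoticed in this sketch, which is why the slide lists time redundancy under transient fault detection.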
Conclusion
Fault tolerance is achieved by applying a set of analysis and design techniques to create systems with dramatically improved dependability. As new technologies are developed and new applications arise, new fault-tolerance approaches are also needed. Chips now contain complex, highly integrated functions, and hardware and software must be crafted to meet a variety of standards to be economically viable. Thus a great deal of current research focuses on implementing fault tolerance using COTS (Commercial-Off-The-Shelf) technology.