A fault-tolerant system is one that is able to continue operating despite the failure of a limited subset of its hardware or software. Such systems degrade gracefully, i.e. as the size of the faulty set increases, the system does not collapse suddenly but continues executing part of its workload. The goal of fault-tolerant design is to ensure that the probability of system failure is acceptably small.
FAULT TYPES
Hardware Fault: A hardware fault is a physical defect that can cause a component to malfunction, e.g. a broken wire, or the output of a logic gate that is perpetually stuck at some logic value (0 or 1).
Software Fault: A software fault is a bug that can cause the program to fail for a given set of inputs.
ERROR
An error is the manifestation of a fault.
e.g. A broken wire will cause an error if the system tries to propagate a signal through it.
A program with a fault that induces incorrect output for some set of inputs will generate errors if that set of inputs is applied.
FAULT LATENCY
The fault latency is the duration between the onset of a fault and its manifestation as an error.
Faults themselves are invisible to the outside world and show themselves only when they cause errors, so this latency affects the reliability of the overall system.
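The relationship between a fault, an error, and fault latency can be sketched with a small simulation. This is only an illustration under stated assumptions: the AND gate, the stuck-at-0 fault, the input sequence, and the function names are all hypothetical and are not taken from the slides.

```python
def stuck_at_output(a, b, stuck_value=0):
    """Faulty AND gate: its output is stuck at a fixed logic value."""
    return stuck_value


def fault_latency(inputs, onset=0):
    """Steps from fault onset until the first wrong output, or None.

    `inputs` is a sequence of (a, b) bit pairs applied one per time step.
    The fault is assumed (hypothetically) to be present from step `onset`.
    """
    for t, (a, b) in enumerate(inputs):
        if t < onset:
            continue  # the fault has not occurred yet
        if stuck_at_output(a, b) != (a & b):
            return t - onset  # the fault has manifested as an error
    return None  # the fault stayed latent for the whole run


# The stuck-at-0 fault is invisible until both inputs are 1.
latency = fault_latency([(0, 0), (1, 0), (0, 1), (1, 1), (1, 1)])
print(latency)  # 3
```

While the inputs never require an output of 1, the fault produces no error at all, which is why a latent fault can go unnoticed for a long time.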
ERROR RECOVERY
Error recovery is the process by which the system attempts to recover from the effects of an error.
TYPES OF ERROR RECOVERY
Forward Error Recovery: The error is masked without any computations having to be redone.
Backward Error Recovery: The system is rolled back to a moment in time before the error is believed to have occurred, and the computation is carried out again. Masking the effects of the failure this way consumes additional time.
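Backward error recovery can be sketched as a checkpoint followed by rollback and retry. The sketch below is a minimal illustration, assuming the state is a Python dict and that errors are signalled by a hypothetical `TransientError` exception; none of these names come from the slides.

```python
import copy


class TransientError(Exception):
    """Hypothetical signal that an error was detected during a step."""


def run_with_rollback(state, step, max_retries=3):
    """Run step(state); on a detected error, restore the checkpoint and retry.

    Returns the number of rollbacks that were needed.
    """
    checkpoint = copy.deepcopy(state)  # snapshot taken before the computation
    for attempt in range(max_retries + 1):
        try:
            step(state)
            return attempt
        except TransientError:
            # Roll back: discard partial updates and restore the snapshot.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("error persisted after all retries")


def make_flaky_step(failures):
    """Build a step that fails `failures` times after a partial update."""
    remaining = {"n": failures}

    def step(state):
        state["total"] += 10  # partial update made before the error shows up
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientError("detected error")
        state["done"] = True

    return step


state = {"total": 0, "done": False}
rollbacks = run_with_rollback(state, make_flaky_step(2))
print(rollbacks, state)  # 2 {'total': 10, 'done': True}
```

The repeated computation is the time cost of backward recovery; forward recovery avoids it by masking the error instead, for example by voting.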
CAUSES FOR FAULTS
• Errors in the specification or design
• Defects in the components
• Environmental effects
Errors In The Specification Or Design
These faults arise from the communication gap between the person who writes the specification and the system designer.
The specification is the link between the design process and the real-world application.
If the specification is wrong, everything that proceeds from it is likely to be wrong.
Defects In Components
These faults arise from defects caused by the wear and tear of use.
e.g. A MOSFET may fail due to electromigration, which is the gradual drift of metal atoms over time in the direction of electron flow.
Environmental Effects
These faults arise from the operating environment.
Devices can be subjected to a whole array of stresses, depending on the application.
Poor ventilation or excessively high ambient temperatures can melt components or damage them.
e.g. If a computer is in a missile, it can undergo high g-forces and vibrational stress.
FAULT TYPES
Faults are classified according to their temporal behavior and output behavior.
A fault is said to be active when it is physically capable of generating errors and to be benign when it is not.
TEMPORAL BEHAVIOR CLASSIFICATION
Fault types: permanent, intermittent, transient.
A permanent fault does not die away with time, but remains until it is repaired or the affected unit is replaced.
An intermittent fault cycles between the fault-active and fault-benign states.
A transient fault dies away after some time.
Intermittent faults can be caused by loosely connected components.
Transient faults can be caused by environmental effects.
e.g. If there is a burst of electromagnetic radiation and the memory is not properly shielded, the contents of the memory can be altered without the memory chips themselves suffering any structural damage. When the memory is rewritten, the fault will go away.
OUTPUT BEHAVIOR CLASSIFICATION
Malicious faults
• Produce inconsistent output errors, which are harder to neutralize
• The faulty unit behaves arbitrarily
Non-malicious faults
• Produce consistent output errors
• These errors are easier to neutralize
Fail stop: A fail-stop unit responds to up to a certain maximum number of failures by simply stopping, rather than putting out incorrect outputs.
Fail safe: A fail-safe unit has its failure mode biased so that the application process does not suffer catastrophe upon failure.
INDEPENDENCE AND CORRELATION
Component failures may be independent or correlated.
Independent: A failure is said to be independent if it does not directly or indirectly cause another failure.
Correlated: Failures are said to be correlated if they are related in some way, e.g. they may be triggered by the same cause, or one of them might cause the others to occur.
FAULT DETECTION
There are two ways to determine that a processor is malfunctioning:
• Online
• Offline
ONLINE DETECTION
• This detection goes on in parallel with normal system operation.
• It is done by checking for behavior that is inconsistent with correct operation.
• Indications of a faulty processor:
- Branching to an invalid destination.
- Fetching an opcode from a location that contains data rather than instructions.
- Writing into a portion of memory to which the process has no write access.
- Fetching an illegal opcode.
- Being inactive for more than a prescribed period.
• A monitor is associated with each processor, looking for signs that the processor is faulty. The monitor watches the data and address lines.
• Another approach is to have multiple processors, which are supposed to put out the same result, and compare the results. If a discrepancy arises, it indicates a fault.
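The monitor-based checks listed above can be sketched as a function that scans observed bus activity for suspicious behavior. This is a minimal sketch, not a description of real monitor hardware: the instruction set, the address regions, the timeout, and every name in the code are hypothetical values chosen for the illustration.

```python
from dataclasses import dataclass

# All constants below are hypothetical values chosen for the illustration.
VALID_OPCODES = {"LOAD", "STORE", "ADD", "JMP"}
CODE_REGION = range(0, 100)        # addresses assumed to hold instructions
WRITABLE_REGION = range(200, 300)  # addresses the process may write to
TIMEOUT = 5                        # prescribed inactivity period, in ticks


@dataclass
class BusEvent:
    time: int
    kind: str      # "fetch", "branch", or "write"
    address: int
    opcode: str = ""


def check_events(events, now):
    """Return (time, reason) pairs for every sign of a faulty processor."""
    alarms = []
    last_active = 0
    for ev in events:
        if ev.time - last_active > TIMEOUT:
            alarms.append((ev.time, "inactive too long"))
        last_active = ev.time
        if ev.kind == "branch" and ev.address not in CODE_REGION:
            alarms.append((ev.time, "branch to invalid destination"))
        elif ev.kind == "fetch" and ev.address not in CODE_REGION:
            alarms.append((ev.time, "opcode fetched outside code region"))
        elif ev.kind == "fetch" and ev.opcode not in VALID_OPCODES:
            alarms.append((ev.time, "illegal opcode"))
        elif ev.kind == "write" and ev.address not in WRITABLE_REGION:
            alarms.append((ev.time, "write without write access"))
    if now - last_active > TIMEOUT:
        alarms.append((now, "inactive too long"))
    return alarms


events = [
    BusEvent(1, "fetch", 10, "ADD"),
    BusEvent(2, "branch", 150),
    BusEvent(3, "fetch", 12, "XYZ"),
    BusEvent(4, "write", 50),
]
for alarm in check_events(events, now=12):
    print(alarm)
```

Because the checks run on observed activity, they proceed alongside normal operation, which is what distinguishes online detection from a scheduled diagnostic test.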
OFFLINE DETECTION
It is done by running diagnostic tests.
These tests are scheduled just like ordinary tasks.
FAULT AND ERROR CONTAINMENT
The process of preventing a fault or error from spreading from one part of the system to another is called containment.
When a fault or error occurs in one part of a system, it can spread through the system like an infectious disease unless it is contained.
e.g. A fault in one part of the system might cause large voltage swings in another.
A fault-free processor can give erroneous results when it gets its input from a faulty unit.
FAULT CONTAINMENT IS ACCOMPLISHED BY
Dividing the system into fault containment zones (FCZs) and error containment zones (ECZs).
An FCZ is a subset of the system that operates correctly despite arbitrary logical or electrical faults outside the subset, i.e. the failure of some part of the computer outside an FCZ cannot cause any element inside the FCZ to fail.
Hardware inside an FCZ must be isolated from hardware outside it. The isolation should withstand either a short circuit or the application of the maximum voltage on the lines connecting an FCZ to the outside world.
Each FCZ should have an independent power supply and its own clocks. These clocks are synchronized with the clocks in other FCZs, but a malfunction in the outside clocks will not affect the clocks inside the FCZ.
The function of an ECZ is to prevent errors from propagating across zone boundaries. This is achieved by voting on redundant outputs.
REDUNDANCY
Fault-tolerant systems rely on properly managed redundancy, i.e. if the system is to be kept running despite the failure of some of its parts, it must have spare capacity to begin with.
TYPES OF REDUNDANCY
• Hardware redundancy
• Software redundancy
• Time redundancy
• Information redundancy
HARDWARE REDUNDANCY
Hardware redundancy is the use of additional hardware to compensate for failures. This can be accomplished in two ways.
• The first is fault detection, correction, and masking.
Fault detection: Multiple hardware units may be assigned to do the same task in parallel, and their results are compared. If one or more units are faulty, we can expect this to show up as a disagreement in the results.
Fault masking: If a minority of the units are faulty and a majority of the units produce the same output, the majority result is used and the effect of the failure is masked.
Fault correction: If a minority of the units disagree, the fault is detected, and the computation is repeated on other processors to correct it.
• The second is replacing the malfunctioning unit. The system can be designed so that faulty units are easily replaced with spare ones.
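Detection and masking by comparing redundant results can be sketched with a simple majority vote. This is a minimal sketch, assuming each unit's result is a hashable Python value; the function name `vote` is hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value_or_None, indices_of_disagreeing_units).

    A disagreement means a fault was detected. If a strict majority agrees,
    the fault is also masked, because the majority value is used as the result.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        # No strict majority: the fault is detected but cannot be masked.
        return None, list(range(len(outputs)))
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


print(vote([7, 7, 9]))  # (7, [2])
print(vote([4, 4, 4]))  # (4, [])
print(vote([1, 2, 3]))  # (None, [0, 1, 2])
```

The suspect indices identify the units whose work would be repeated elsewhere for correction, or which would be replaced by spares.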
Two methods used in hardware redundancy:
• Static Pairing
• N-Modular Redundancy (NMR)
STATIC PAIRING
• Processors are hardwired in pairs, and the entire pair is discarded if one of the processors fails. This is a very simple scheme.
• The pair runs identical software with identical inputs and should generate identical outputs. If the outputs are not identical, the pair is nonfunctional, so the entire pair is discarded.
• This approach is depicted in the following figure. It works only when the interface is working correctly and the two processors do not fail identically at around the same time.
• The interface is checked by a monitor. If the interface fails, the monitor handles the failure, and if the monitor fails, the interface handles it. If both the interface and the monitor fail, the system is down.
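The pairing scheme can be sketched by modelling each processor as a function and the comparison as an equality check. This is only a sketch under stated assumptions: the class `ProcessorPair` and the example functions are hypothetical, and the interface and its monitor are not modelled.

```python
class ProcessorPair:
    """Two processors run the same computation; a mismatch discards the pair."""

    def __init__(self, cpu_a, cpu_b):
        self.cpu_a = cpu_a
        self.cpu_b = cpu_b
        self.functional = True

    def run(self, x):
        if not self.functional:
            raise RuntimeError("pair has been discarded")
        a, b = self.cpu_a(x), self.cpu_b(x)
        if a != b:
            self.functional = False  # cannot tell which unit failed: drop both
            return None
        return a


def good(x):
    return x * x


def bad(x):
    return x * x + 1  # hypothetical faulty processor


healthy = ProcessorPair(good, good)
print(healthy.run(3))  # 9

mismatched = ProcessorPair(good, bad)
print(mismatched.run(3), mismatched.functional)  # None False

# Identical failures go unnoticed, which is the scheme's known weakness.
both_bad = ProcessorPair(bad, bad)
print(both_bad.run(3), both_bad.functional)  # 10 True
```

The last case shows why the scheme depends on the two processors not failing identically at around the same time.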
N-MODULAR REDUNDANCY (NMR)
• It is a scheme for forward error recovery.
• It uses N processors instead of one and votes on their outputs. N is usually odd.
• NMR can be arranged in either of the following two ways:
- There are N voters, and the entire cluster produces N outputs.
- There is just one voter.
• NMR clusters are designed to allow the purging of malfunctioning units. That is, when a failure is detected, the failed unit is checked to see whether or not the failure is transient. If it is not, it must be electrically isolated from the rest of the cluster and a replacement unit is switched on. The faster the unit is replaced, the more reliable the cluster.
• Purging can be done either by hardware or by the operating system.
• Self-purging consists of a monitor at each unit comparing the unit's output against the voted output. If there is a difference, the monitor disconnects the unit from the system.
• The monitor can be described as a finite state machine with two states, connect and isolate. There are two signals: diff, which is set to 1 whenever the module output disagrees with the voter output, and reconnect, which is a command from the system to reconnect the module.
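The two-state monitor described above can be sketched as a small state machine. The transition rules are an assumption based on the description: diff moves the monitor from connect to isolate, and reconnect moves it back. In particular, the choice that reconnect is honoured in the isolate state regardless of diff is hypothetical, as are the Python names.

```python
from enum import Enum


class State(Enum):
    CONNECT = "connect"
    ISOLATE = "isolate"


def next_state(state, diff, reconnect):
    """One transition of the self-purging monitor (assumed rules)."""
    if state is State.CONNECT:
        # A disagreement with the voted output isolates the module.
        return State.ISOLATE if diff else State.CONNECT
    # While isolated, only a reconnect command brings the module back.
    return State.CONNECT if reconnect else State.ISOLATE


def run_monitor(signals, state=State.CONNECT):
    """Apply a sequence of (diff, reconnect) pairs and record each state."""
    trace = []
    for diff, reconnect in signals:
        state = next_state(state, diff, reconnect)
        trace.append(state)
    return trace


trace = run_monitor([(0, 0), (1, 0), (0, 0), (0, 1), (0, 0)])
print([s.value for s in trace])
# ['connect', 'isolate', 'isolate', 'connect', 'connect']
```

A module that is reconnected while still faulty would raise diff again on the next comparison and be isolated once more.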
SOFTWARE REDUNDANCY
• Software faults are not like hardware faults, i.e. software never wears out, and the faults are not generated spontaneously during system operation.
• Software faults can be regarded as faults in design.
• For software redundancy, simply replicating the same software N times will not work, because all N copies will fail for the same inputs.
• Instead, N versions of the software can be implemented. The N versions can be developed by independent teams, with no contact between them.
• Each version is developed by a team of developers who never communicate with the other teams.
• To minimize common-mode failures:
- The specifications should be written in formal terms and be subjected to a rigorous process of checking.
- Multiple software versions should be developed in different programming languages.
- The tools that are used should be selected carefully.
- The training and quality of the programmers should be maintained.
There are two approaches to software redundancy:
• N-Version Programming
• Recovery Block Approach