Group Members
Inaam Ilahi
Furqan Farooq
Sabuha Sarwar
Ehsan Ilahi
Faults, Errors and Failures
A fault is a defect within the system.
An error is an observed deviation from the expected
behavior of the system.
A failure occurs when the system can no longer perform
as required (it does not meet its specification).
Fault tolerance is the ability of a system to provide a service
even in the presence of errors.
A fault can cause an error, and an error can lead to a failure:
Fault → Error → Failure
Fault Tolerant System
• A fault-tolerant system is a system that is able
to continue operating despite the failure of a
limited subset of its hardware or software.
• Such systems degrade gracefully, i.e. as the size of
the faulty set increases, the system does not collapse
suddenly but continues executing part of its
workload.
• The goal of this design is to ensure that the
probability of system failure is acceptably small.
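Graceful degradation can be illustrated with a small sketch. It assumes a hypothetical system made of identical units, each able to run one task, and a made-up priority convention (lower number = more important); none of these names come from the slides. As units fail, the system sheds the least important work instead of stopping altogether.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    priority: int  # hypothetical convention: lower number = more important


def schedule(tasks, healthy_units, tasks_per_unit=1):
    """Split the workload into (served, shed) for the units still healthy."""
    capacity = max(healthy_units, 0) * tasks_per_unit
    ranked = sorted(tasks, key=lambda t: t.priority)
    return ranked[:capacity], ranked[capacity:]


# Illustrative workload, most important task first.
WORKLOAD = [
    Task("flight control", 0),
    Task("telemetry", 1),
    Task("logging", 2),
    Task("entertainment", 3),
]

if __name__ == "__main__":
    # The faulty set grows from 0 to 3 units; service shrinks but never vanishes.
    for healthy in (4, 2, 1):
        served, shed = schedule(WORKLOAD, healthy)
        print(healthy, [t.name for t in served], [t.name for t in shed])
```

With two healthy units only the two most important tasks keep running; the system collapses completely only when no unit is left.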
Fault Types
• Hardware fault:
A hardware fault is a physical defect that can cause a
component to malfunction.
E.g. a broken wire, or the output of a logic gate that is
perpetually stuck at some logic value (0 or 1).
• Software fault:
A software fault is a bug that can cause the program to fail
for a given set of inputs.
Types Of Failure
Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times
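Some of the failure types above can be told apart from the client's side. The sketch below is a test-harness view that assumes the expected reply is known in advance, which is rarely true in practice; the names `classify`, `Observed` and `window_s` are illustrative. A client cannot distinguish a crash from an omission, so both map to the same outcome, and state transition or arbitrary failures are not covered.

```python
from enum import Enum


class Observed(Enum):
    OK = "no failure observed"
    OMISSION = "omission failure (or crash)"
    TIMING = "timing failure"
    VALUE = "response failure: value failure"


def classify(reply, elapsed_s, expected, window_s):
    """Classify one request as seen by the client.

    reply     -- the value received, or None if nothing arrived
    elapsed_s -- seconds until the reply arrived, or None if nothing arrived
    expected  -- the correct value (assumed known, for illustration only)
    window_s  -- (earliest, latest) acceptable response time in seconds
    """
    earliest, latest = window_s
    if reply is None or elapsed_s is None:
        return Observed.OMISSION
    if not (earliest <= elapsed_s <= latest):
        return Observed.TIMING  # too late, or too early
    if reply != expected:
        return Observed.VALUE
    return Observed.OK


if __name__ == "__main__":
    window = (0.0, 1.0)
    print(classify(None, None, 42, window).value)
    print(classify(42, 1.5, 42, window).value)
    print(classify(41, 0.2, 42, window).value)
    print(classify(42, 0.2, 42, window).value)
```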
Objectives Of Fault Tolerance
• Availability
the probability that the system is ready for use at a given time
• Reliability
the property that a system can run continuously without failure for a given period of time
• Safety
the property that nothing catastrophic happens when the system fails
• Maintainability
the ease with which a failed system can be repaired
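Availability and reliability are often put into numbers. The sketch below uses two common textbook models that are not stated on the slides: steady-state availability as MTTF / (MTTF + MTTR), and reliability as exp(-λt), which holds only under the assumption of a constant failure rate λ. The function and parameter names are illustrative.

```python
import math


def availability(mttf_hours, mttr_hours):
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(failure_rate_per_hour, hours):
    """Probability of running `hours` without failure, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


if __name__ == "__main__":
    # A system that fails on average every 999 h and takes 1 h to repair.
    print(round(availability(999.0, 1.0), 3))   # 0.999
    # One failure per 1000 h on average, mission time 1000 h: e^-1.
    print(round(reliability(0.001, 1000.0), 3))  # 0.368
```

The example shows why the two objectives differ: a system can be highly available (quick repairs) and still have a low probability of surviving a long mission without any failure.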
Miscommunication
Success of any software application depends on
communication between stakeholders, development
and testing teams.
Changing Requirements
The customer may not understand the effects of
changes, or may understand them and request them
anyway.
Poorly Documented Code
It is tough to maintain and modify code that is badly
written or poorly documented.
Lack of Skilled Testing
No tester wants to admit it, but poor testing does take
place across organizations. There can be shortcomings
in the testing process that is followed.
Fault And Error Containment
Containment is the process of preventing an error from
spreading from one part of the system to another.
When a fault or error occurs in one part of a
system, it can spread through the system like an
infectious disease.
E.g. a fault in one part of the system might
cause large voltage swings in another.
• A fault-free processor can give erroneous results
when it gets its input from a faulty unit.
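One simple way to contain an error in software is to check every input at the boundary between units. The sketch below assumes a hypothetical temperature sensor with a plausible range of -40 to 125; the class `GuardedInput` and the fallback policy (reuse the last good value) are illustrative choices, not something prescribed by the slides.

```python
class GuardedInput:
    """Boundary check that keeps a faulty producer from corrupting its consumer."""

    def __init__(self, low, high, fallback):
        self.low = low
        self.high = high
        self.last_good = fallback  # value used until a valid reading arrives
        self.rejected = 0          # count of contained errors

    def accept(self, reading):
        # NaN fails both comparisons, so it is rejected as well.
        if isinstance(reading, (int, float)) and self.low <= reading <= self.high:
            self.last_good = reading
            return reading
        self.rejected += 1
        return self.last_good


def average_temperature(readings, guard):
    """A fault-free computation that only ever sees guarded values."""
    values = [guard.accept(r) for r in readings]
    return sum(values) / len(values)


if __name__ == "__main__":
    guard = GuardedInput(-40.0, 125.0, fallback=20.0)
    print(average_temperature([21.0, 9999.0, 23.0], guard), guard.rejected)
```

The out-of-range reading is replaced by the last good value, so the averaging code does not produce an erroneous result from a faulty unit's output.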
Recovery
Once a failure has occurred, it is in many cases important to recover
critical processes to a known state in order to resume processing.
The problem is compounded in distributed systems.
Two approaches:
• Backward recovery: use checkpointing (a global
snapshot of the distributed system's status) to record the system state
and roll back to it after a failure; checkpointing is costly (performance degradation).
• Forward recovery: attempt to bring the system to a new stable
state from which it is possible to proceed (applied in situations
where the nature of the errors is known and a reset can be applied).
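Backward recovery can be sketched for a single process as follows. This is a minimal illustration, not a distributed snapshot: taking a consistent global checkpoint across several processes needs coordination that is not shown here. The class name and the failing update are made up for the example.

```python
import copy


class CheckpointedCounter:
    """State that can be saved and rolled back to the last checkpoint."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, value):
        self.state["processed"] += 1       # first half of the update
        if value < 0:
            # Failure in the middle of the update: state is now inconsistent.
            raise ValueError("negative input")
        self.state["total"] += value       # second half of the update


def run(values):
    acc = CheckpointedCounter()
    for v in values:
        acc.checkpoint()                   # the cost paid on every step
        try:
            acc.apply(v)
        except ValueError:
            acc.rollback()                 # return to the last known good state
    return acc.state


if __name__ == "__main__":
    print(run([5, -1, 7]))  # {'processed': 2, 'total': 12}
```

The deep copy on every step is where the performance cost of checkpointing shows up; real systems checkpoint less often and accept redoing more work after a rollback.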
Fault Tolerance in Distributed Systems
• System attributes:
  · Availability – the probability that the system is ready for use at a given time
  · Reliability – the property that a system can run continuously without failure for
a given period of time
  · Safety – the property that nothing catastrophic happens when the system fails
  · Maintainability – the ease with which a failed system can be repaired
• A failure in a distributed system occurs when a service cannot
be fully provided.
• System failure may be partial.
• A single failure may affect other parts of the system (failure
escalation).
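Lost Request Messages when Server Crashes
A server in client-server communication can be in one of three cases:
• Normal case: the request is received, executed, and a reply is sent
• Crash after execution: the request is executed, but the server crashes before replying
• Crash before execution: the server crashes before the request is executed
From the client's side the two crash cases look the same, since no reply arrives in either.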
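Because the client cannot tell the two crash cases apart, a common remedy is to retry the request and let the server recognise duplicates. The sketch below simulates this in one process. It assumes that the table of completed request ids survives a crash (for example on stable storage), which is a strong assumption; `Server`, `crash_plan` and `call_with_retry` are hypothetical names.

```python
class Server:
    """Simulated server that crashes according to a scripted plan."""

    def __init__(self, crash_plan):
        self.balance = 0
        self.completed = {}                 # request_id -> reply; assumed to survive crashes
        self.crash_plan = list(crash_plan)  # entries: "before", "after" or None

    def handle(self, request_id, amount):
        crash = self.crash_plan.pop(0) if self.crash_plan else None
        if crash == "before":
            return None                     # crash before execution: nothing happened
        if request_id not in self.completed:
            self.balance += amount          # executed at most once per request id
            self.completed[request_id] = self.balance
        if crash == "after":
            return None                     # crash after execution: the reply is lost
        return self.completed[request_id]


def call_with_retry(server, request_id, amount, max_attempts=5):
    """Resend the same request id until a reply arrives."""
    for attempt in range(1, max_attempts + 1):
        reply = server.handle(request_id, amount)
        if reply is not None:
            return reply, attempt
    raise TimeoutError("no reply after %d attempts" % max_attempts)


if __name__ == "__main__":
    server = Server(["after", "before", None])
    print(call_with_retry(server, "req-1", 100))  # (100, 3)
    print(server.balance)                         # 100
```

Without the duplicate check, the retry after the "crash after execution" case would apply the deposit twice.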
REDUNDANCY
A fault-tolerant system (FTS) relies on properly managed redundancy, i.e.
the system is kept running despite the failure of some of
its parts.
It must have spare capacity to begin with.
TYPES OF REDUNDANCY
• Hardware redundancy
• Software redundancy
• Time redundancy
• Information redundancy
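As a small illustration of information redundancy, the sketch below adds one even-parity bit to a word. This is only one of many possible codes and was chosen here for brevity; it detects any single-bit error but cannot correct it, and it misses errors that flip an even number of bits.

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def check_parity(word):
    """Return True if the word (data bits + parity bit) looks intact."""
    return sum(word) % 2 == 0


def flip(word, index):
    """Return a copy of the word with one bit inverted (a simulated fault)."""
    damaged = list(word)
    damaged[index] ^= 1
    return damaged


if __name__ == "__main__":
    word = add_parity([1, 0, 1, 1])
    print(word, check_parity(word))            # [1, 0, 1, 1, 1] True
    print(check_parity(flip(word, 2)))         # False: single-bit error detected
    print(check_parity(flip(flip(word, 0), 1)))  # True: double-bit error goes unnoticed
```

The extra bit is the spare capacity: it carries no new data, yet it is what makes the error visible.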
Software Redundancy
• Software faults are not like hardware faults: software
never wears out, so its faults are not generated
spontaneously during system operation.
• Software faults can be regarded as faults in design.
• For software redundancy, simply replicating the same
software N times will not work: all N copies will fail for
the same inputs.
• Instead, N versions of the software can be implemented.
Each version is developed by an independent team of developers
who never communicate with the other teams.
To minimize common-mode failures:
• The specifications should be written in formal terms and
be subject to a rigorous checking process.
• The software versions should be developed in
different programming languages.
• The tools used should be selected with care.
• The training and quality of the programmers should be
maintained.
There are two approaches to software redundancy:
• N-version programming
• Recovery block approach
N - Version Programming
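A minimal sketch of N-version programming with a majority voter follows. The three versions are toy functions that compute the sum 1 + 2 + ... + n in different ways, and one of them is deliberately faulty; in a real project each version would come from an independent team. All names are illustrative.

```python
from collections import Counter


def version_a(n):
    return n * (n + 1) // 2          # closed-form formula


def version_b(n):
    return sum(range(1, n + 1))      # straightforward summation


def version_c(n):
    return sum(range(1, n))          # deliberately faulty: off-by-one bug


def vote(versions, n):
    """Run every version and return the result a strict majority agrees on."""
    results = []
    for version in versions:
        try:
            results.append(version(n))
        except Exception:
            continue                  # a crashed version casts no vote
    if not results:
        raise RuntimeError("every version failed")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(versions):
        raise RuntimeError("no majority among the versions")
    return value


if __name__ == "__main__":
    print(vote([version_a, version_b, version_c], 10))  # 55
```

The faulty version is outvoted two to one, so its bug is masked. If two versions shared the same bug (a common-mode failure), the voter would confidently return the wrong answer, which is why independence between teams matters.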
Recovery Block Approach
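The recovery block approach runs a primary version first, checks its result with an acceptance test, and falls back to an alternate version from a saved checkpoint if the test fails. The sketch below is a minimal illustration; the sorting routines, the deliberately faulty primary, and the acceptance test are made up for the example.

```python
import copy
from collections import Counter


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order, restoring the checkpoint before every attempt."""
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)   # roll back to the saved state
        try:
            result = alternate(candidate)
        except Exception:
            continue                             # a crash counts as a failed attempt
        if acceptance_test(checkpoint, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def fast_sort(items):
    return sorted(set(items))                    # deliberately faulty: drops duplicates


def simple_sort(items):
    out = []                                     # plain insertion sort as the alternate
    for item in items:
        i = len(out)
        while i > 0 and out[i - 1] > item:
            i -= 1
        out.insert(i, item)
    return out


def is_sorted_permutation(original, result):
    """Acceptance test: same elements as the input, in non-decreasing order."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(result) == Counter(original)


if __name__ == "__main__":
    data = [3, 1, 3, 2]
    print(recovery_block(data, [fast_sort, simple_sort], is_sorted_permutation))
    # [1, 2, 3, 3]
```

Unlike N-version programming, only one version runs in the normal case, so the overhead is lower, but everything depends on how good the acceptance test is.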
Applications Of Fault Tolerance System
1. Long-life applications:
– e.g. space, satellites
– typical requirement: Availability (10 years) ≥ 0.95
– outages in between are allowed
2. Critical-computation applications:
– e.g. critical to human safety: aircraft control system
– typical requirement: Reliability (3 years) ≥ 0.9₇ (short for 0.9999999)
3. Maintenance-postponement applications:
– when maintenance operations are extremely costly
– e.g. space systems and remote processing systems, like telephone
switching systems (e.g. maintenance only once a month)
4. High-availability applications:
– e.g. banking, flight reservation
5. Transportation systems:
– train/subway
– ships
– automobiles
• ABS (anti-lock brakes)
• ESP (electronic stability program)
• airbag activation
• electronic ignition/fuel pump
STEPS TO PREVENT FAILURE
• Power failure
• Power surge
• Data loss
• Device or computer failure
• Unauthorized access
• Overload
• Virus
Disadvantages
• Interference with fault detection in the same
component.
• Interference with fault detection in another
component.
• Reduction of priority of fault correction.
• Test difficulty.
• Cost.
• Inferior components.
Conclusion
• Hardware, software and networks cannot be totally free
from failures.
• Fault tolerance is a non-functional requirement that requires
a system to continue to operate even in the presence of faults.
• Distributed systems can be more fault tolerant than
centralized systems.
• Agreement in faulty systems and reliable group
communication are important problems in distributed systems.
• Replication of data is a major fault tolerance method in
distributed systems.
• Recovery is another property to consider in faulty
distributed environments.
Fault Tolerance System
