FAULT TOLERANT SYSTEM
 A fault tolerant system is a system that is able to
  continue operating despite the failure of a limited
  subset of its hardware or software.

 Such systems degrade gracefully, i.e. as the size of the
  faulty set increases, the system does not collapse
  suddenly but continues executing part of its
  workload.

 The goal of this design is to ensure that the
  probability of system failure is acceptably small.
FAULT TYPES

Hardware Fault: A hardware fault is a physical
defect that can cause a component to malfunction.
      E.g. a broken wire, or the output of a logic gate
that is perpetually stuck at some logic value (0 or 1).

Software Fault: A software fault is a bug that can
cause the program to fail for a given set of inputs.
ERROR
 An error is a manifestation of a fault.
   E.g. a broken wire will cause an error if
the system tries to propagate a signal
through it.
A program with a fault that induces
incorrect output for some set of inputs will
generate errors if that set of inputs is
applied.
FAULT LATENCY
The fault latency is the duration between
the onset of a fault and its manifestation as
an error.

Faults themselves are invisible to
the outside world and show themselves only
when they cause errors. Such latency
affects the reliability of the overall system.
ERROR RECOVERY
   It is the process by which the system attempts to
recover from the effects of an error.

TYPES OF ERROR RECOVERY
Forward Error Recovery: The error is
masked without any computations having to be
redone.
Backward Error Recovery: The system is
rolled back to a moment in time before the error is
believed to have occurred, and the computation is carried out
again. It takes additional time to mask the effects
of the failure.
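
Backward error recovery can be sketched as a checkpoint-and-rollback loop. This is only a minimal illustration, not a prescribed design: the class CheckpointedTask, the function run_with_rollback, and the error_detected callback are hypothetical names, and the sketch assumes the state can be copied with copy.deepcopy.

```python
import copy


class CheckpointedTask:
    """Hypothetical task wrapper that supports rollback to a saved state."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a copy of the state that is believed to be error-free.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the last saved state, discarding work done since then.
        self.state = copy.deepcopy(self._checkpoint)


def run_with_rollback(task, step, error_detected, max_retries=3):
    """Run step(state); on a detected error, roll back and redo the step."""
    for attempt in range(1, max_retries + 1):
        step(task.state)
        if not error_detected(task.state):
            task.checkpoint()
            return attempt      # number of attempts that were needed
        task.rollback()         # time is lost here: the work is redone
    raise RuntimeError("error persisted after all retries")
```

The retry loop is where the additional time goes: every rollback repeats the step. A forward scheme would instead mask the error without redoing any work.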
CAUSES FOR FAULTS

Errors in the specification or design.

Defects in the components

Environmental effects.
Errors In The Specification Or Design

This kind of fault arises from the communication
gap between the person who writes the
specification and the system designer.

The specification is the link between the design
process and the real-world application.

If the specification is wrong, everything that
proceeds from it is likely to be wrong.
Defects In Components
  These faults arise from defects caused by the
wear and tear of use.

  E.g. a MOSFET may fail due to electromigration,
which is the gradual drift of metal atoms in a
conductor over time under sustained current flow.
Environmental Effects

These faults arise from the operating environment.

 Devices can be subjected to a whole array of
stresses, depending on the application.

Poor ventilation or excessively high ambient
temperatures can melt or otherwise damage
components.

  E.g. a computer in a missile can undergo
high g-forces and vibrational stress.
FAULT TYPES
Faults are classified according to their temporal
behavior and output behavior.

A fault is said to be active when it is physically
capable of generating errors and to be benign when
it is not.
TEMPORAL BEHAVIOR CLASSIFICATION

 Fault types: Permanent, intermittent, transient.
A permanent fault does not die away with time,
but remains until it is repaired or the affected unit is
replaced.

An intermittent fault cycles between the
fault-active and fault-benign states.

A transient fault dies away after some time.
Intermittent faults can be caused by loosely
 connected components.

Transient faults can be caused by environmental
 effects.
     e.g. If there is a burst of electromagnetic
 radiation and the memory is not properly shielded,
 the contents of the memory can be altered without
 the memory chips themselves suffering any
 structural damage. When the memory is rewritten,
 the fault will go away.
OUTPUT BEHAVIOR CLASSIFICATION
  Malicious faults

   • Output errors are inconsistent, so they are
     harder to neutralize

   • The faulty unit behaves arbitrarily
  Non-malicious faults
   • Output errors are consistent

   • These errors are easier to neutralize
Fail stop
   A fail-stop unit responds to up to a certain maximum
   number of failures by simply stopping,
   rather than putting out incorrect outputs.

Fail safe
   A fail-safe unit has a failure mode that is biased so that the
   application process does not suffer
   catastrophe upon failure.
INDEPENDENCE AND CORRELATION
  Component failures may be independent or
correlated.

 Independent: A failure is said to be
independent if it does not directly or indirectly
cause another failure.

 Correlated: Failures are said to be correlated if
they are related in some way, e.g. they may be
triggered by the same cause, or one of them might
cause the others to occur.
FAULT DETECTION
    There are two ways to determine that a processor is
malfunctioning:
• Online
• Offline

Online Detection:

• This detection runs in parallel with normal system operation.
• It is done by checking for behavior that is inconsistent with
correct operation.
• Indications of a faulty processor:
     - Branching to an invalid destination.
     - Fetching an opcode from a location that contains
       data rather than code.
     - Writing into a portion of memory to which the
       process has no write access.
     - Fetching an illegal opcode.
     - Being inactive for more than a prescribed period.

• A monitor is associated with each processor,
  looking for signs that the processor is faulty. The
  monitor watches the data and address lines.

• Another approach is to have multiple processors,
  which are supposed to put out the same result, and
  compare the results. A discrepancy
  indicates a fault.
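
Two of the online checks above can be sketched in a few lines: the inactivity check (a watchdog) and the comparison of replicated results. This is an illustrative sketch only. WatchdogMonitor and outputs_disagree are hypothetical names, and time is passed in explicitly as a number so the example stays deterministic.

```python
class WatchdogMonitor:
    """Flags a processor as suspect if it stays silent for too long."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_heartbeat = now

    def heartbeat(self, now):
        # Called by the monitored processor during normal operation.
        self.last_heartbeat = now

    def is_suspect(self, now):
        # True once the silent period exceeds the prescribed limit.
        return (now - self.last_heartbeat) > self.timeout


def outputs_disagree(results):
    """Comparison check: any discrepancy among replicated results signals a fault."""
    return len(set(results)) > 1
```

Note that the comparison only detects that some unit is faulty; with two units it cannot say which one.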
OFFLINE DETECTION

It is done by running diagnostic tests.

These tests are scheduled just like ordinary tasks.
FAULT AND ERROR CONTAINMENT

Containment is the process of preventing an error from spreading from one
part of the system to another.

When a fault or error occurs in one part of a system, it
can spread through the system like an infectious disease.
   E.g. a fault in one part of the system might cause
large voltage swings in another.

 A fault-free processor can give erroneous results
when it gets input from a faulty unit.
FAULT CONTAINMENT IS ACCOMPLISHED BY

Dividing the system into fault containment zones (FCZs)
and error containment zones (ECZs).

An FCZ is a subset of the system that operates
correctly despite arbitrary logical or electrical faults
outside the subset, i.e. the failure of some part of
the computer outside an FCZ cannot cause any
element inside the FCZ to fail.
 Hardware inside an FCZ must be isolated from
  hardware outside it. It should withstand either a short
  circuit or the application of the maximum voltage
  on the lines connecting the FCZ to the
  outside world.

 Each FCZ should have an independent power supply
  and its own clocks. These clocks are synchronized
  with the clocks in other FCZs, but a malfunction in
  the outside clocks won't affect the clocks inside the
  FCZ.

 The function of an ECZ is to prevent errors from
  propagating across zone boundaries. This is achieved
  by voting on redundant outputs.
REDUNDANCY
     Fault tolerant systems (FTS) rely on properly managed
redundancy. If the system is to be kept
running despite the failure of some of its parts,
it must have spare capacity to begin with.

TYPES OF REDUNDANCY
• Hardware redundancy
• Software redundancy
• Time redundancy
• Information redundancy
Hardware redundancy
         Hardware redundancy is the use of additional
hardware to compensate for failures. This can be
accomplished in two ways.

• The first is fault detection, correction, and masking.

Fault detection: Multiple hardware units may be
assigned to do the same task in parallel, and their results
are compared.
          If one or more units are faulty, we can expect
this to show up as a disagreement in the results.
Fault masking: If a minority of the units are faulty and a
majority of the units produce the same output, the majority
result can be used and the effect of the failure is masked.

Fault correction: If a minority of the units disagree, the fault
is detected, and the computation is repeated on other
processors to correct it.

• The second is replacing the
malfunctioning unit. The system can be
designed so that faulty units can be easily replaced with
spare ones.
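
Detection and masking by comparison can be sketched as a simple majority voter. This is a minimal software illustration, not a hardware voter design. The function name majority_vote is hypothetical, and the sketch assumes the unit outputs are hashable values that can be compared exactly.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, faulty_indices) if a strict majority agrees, else raise."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # A disagreement was detected, but no value has a strict majority.
        raise RuntimeError("no majority: fault detected but cannot be masked")
    # Units that disagree with the majority are the suspected faulty ones.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

For example, majority_vote([42, 42, 17]) returns (42, [2]): the failure of the third unit is masked, and its index is reported so that the unit can be replaced with a spare.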
Two methods used in hardware redundancy

  •Static Pairing

  •N modular Redundancy (NMR)
STATIC PAIRING
• A very simple scheme: hardwire processors in pairs and discard the
entire pair if one of the processors fails.

• The pair runs identical software with identical inputs
and should generate identical outputs. If the outputs are
not identical, the pair is non-functional, so the
entire pair is discarded.

• This approach is depicted in the following figure. It
works only when the interface is working correctly and
the two processors do not fail identically at around
the same time.
• The interface is checked by a
  monitor. If the interface fails, the monitor detects it,
  and if the monitor fails, the interface
  detects it. If both the interface and the monitor fail,
  then the system is down.
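
The pairing logic can be sketched as follows. This is only an illustration under simplifying assumptions: the class StaticPair is a hypothetical name, each processor is modeled as a plain Python callable, and the interface and its monitor are assumed to be working.

```python
class StaticPair:
    """Two processors hardwired together; one mismatch retires the whole pair."""

    def __init__(self, proc_a, proc_b):
        self.proc_a = proc_a
        self.proc_b = proc_b
        self.functional = True

    def run(self, inputs):
        if not self.functional:
            raise RuntimeError("pair has been discarded")
        out_a = self.proc_a(inputs)
        out_b = self.proc_b(inputs)
        if out_a != out_b:
            # The comparison cannot tell which processor is wrong,
            # so the entire pair is taken out of service.
            self.functional = False
            raise RuntimeError("outputs differ: pair discarded")
        return out_a
```

The sketch also shows the limitation stated above: if both processors fail identically, out_a equals out_b and the wrong result passes the comparison unnoticed.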
N MODULAR REDUNDANCY
• It is a scheme for forward error recovery.

• It uses N processors instead of one and
votes on their outputs. N is usually odd.

• NMR can be organized in either of the following
two ways:
   There are N voters and the entire cluster
   produces N outputs.

   There is just one voter.
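
The benefit of voting can be estimated with a short calculation. The sketch below is not from the slides: it assumes independent unit failures, a perfect voter, and the same reliability r for every unit, and it computes the probability that a strict majority of the N units is working. The function name nmr_reliability is hypothetical.

```python
from math import comb


def nmr_reliability(n, r):
    """Probability that a strict majority of n units works, each with reliability r.

    Assumes independent unit failures and a perfect voter.
    """
    if n < 1 or not 0.0 <= r <= 1.0:
        raise ValueError("need n >= 1 and 0 <= r <= 1")
    majority = n // 2 + 1
    # Sum the binomial probabilities of exactly k working units, k >= majority.
    return sum(comb(n, k) * r**k * (1 - r)**(n - k)
               for k in range(majority, n + 1))
```

For N = 3 the sum reduces to 3r^2 - 2r^3, so units with r = 0.9 give a cluster reliability of 0.972. Under these assumptions voting helps only when r is above 0.5; below that, the cluster is less reliable than a single unit.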
•   NMR clusters are designed to allow the purging
    of malfunctioning units. That is, when a failure is
    detected, the failed unit is checked to see
    whether or not the failure is transient. If it is not, it
    must be electrically isolated from the rest of the
    cluster and a replacement unit is switched on.
    The faster the unit is replaced, the more reliable
    the cluster.
• Purging can be done either by hardware or by the operating
  system.

• Self purging consists of a monitor at each unit comparing its
  output against the voted output. If there is a difference, the
  monitor disconnects the unit from the system.

• The monitor can be described as a finite state machine with
  two states, connect and isolate. There are two signals: diff,
  which is set to 1 whenever the module output disagrees
  with the voter output, and reconnect, which is a command
  from the system to reconnect the module.
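
The self-purging monitor can be sketched as a two-state machine. The state and signal names come from the description above, but the transition rules are an assumption made for this illustration: diff = 1 moves the monitor from connect to isolate, reconnect = 1 moves it back, and diff is ignored while the unit is isolated. The class name PurgeMonitor is hypothetical.

```python
class PurgeMonitor:
    """Two-state monitor: 'connect' (unit in service) or 'isolate' (unit purged)."""

    CONNECT = "connect"
    ISOLATE = "isolate"

    def __init__(self):
        self.state = self.CONNECT

    def step(self, diff, reconnect):
        """Advance one cycle given the diff and reconnect signals (0 or 1)."""
        if self.state == self.CONNECT and diff:
            # The unit's output disagreed with the voted output.
            self.state = self.ISOLATE
        elif self.state == self.ISOLATE and reconnect:
            # The system commanded that the unit be reconnected.
            self.state = self.CONNECT
        return self.state
```

A real design might also require diff = 0 before reconnecting, or count repeated disagreements to tell transient faults from permanent ones; those refinements are left out here.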
SOFTWARE REDUNDANCY
• Software faults are not like hardware faults:
software never wears out, and the faults are not
generated spontaneously during system operation.

• Software faults can be regarded as faults in
design.

• For software redundancy, simply replicating the
same software N times will not work: all N copies
will fail for the same inputs.

• Instead, N versions of the software can be
implemented. Each version is developed by an
independent team, with no contact between teams.

• To minimize common-mode failures:

      The specifications should be written in formal
       terms and subjected to a rigorous process of
       checking.

      Multiple software versions should be developed in
       different programming languages.

      The tools that are used should be
       selected carefully.

      The training and quality of the programmers should
       be maintained.
There are two approaches to software redundancy

   • N Version Programming

   • Recovery Block Approach
N Version Programming
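
In N version programming, independently written versions run on the same input and their results are voted on. The sketch below is a toy illustration with hypothetical names: three versions of a function that sums the integers 1..n, one of which carries a deliberate design fault. In practice the versions would come from separate teams rather than one file.

```python
from collections import Counter


def sum_closed_form(n):
    return n * (n + 1) // 2


def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_buggy(n):
    # Deliberately faulty version: off-by-one design fault (omits n).
    return sum(range(1, n))


def n_version_execute(versions, arg):
    """Run every version on the same input and return the majority result."""
    results = [version(arg) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("versions disagree without a majority")
    return value
```

With the three versions above, n_version_execute([sum_closed_form, sum_loop, sum_buggy], 10) returns 55: the faulty version answers 45 and is outvoted.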
Recovery Block Approach
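
In the recovery block approach, a primary routine runs first and its result is checked by an acceptance test. If the test fails, the state is restored and an alternate routine is tried. The sketch below is a minimal illustration with hypothetical names (recovery_block, fast_sort, safe_sort, is_sorted), and it assumes the input state can be copied with copy.deepcopy.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in turn on a fresh copy of state until one passes."""
    for alternate in alternates:
        trial = copy.deepcopy(state)    # recovery point: original stays intact
        try:
            result = alternate(trial)
        except Exception:
            continue                    # a crash counts as a failed alternate
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def fast_sort(data):
    # Hypothetical faulty primary: returns the data unsorted.
    return data


def safe_sort(data):
    # Hypothetical alternate: slower but trusted.
    return sorted(data)


def is_sorted(result):
    # Acceptance test: every element is <= its successor.
    return all(a <= b for a, b in zip(result, result[1:]))
```

Here recovery_block([3, 1, 2], [fast_sort, safe_sort], is_sorted) returns [1, 2, 3]: the primary fails the acceptance test, so the alternate is run on a fresh copy of the input. Unlike N version programming, only one version runs at a time, so extra time is spent only when a test fails.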
THANK YOU

 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 

Fault tolerant system

  • 2. A fault-tolerant system is one that is able to continue operating despite the failure of a limited subset of its hardware or software. Such systems are gracefully degradable, i.e. as the size of the faulty set increases, the system won't collapse suddenly but continues executing part of its workload. The goal of this design is to ensure that the probability of system failure is acceptably small.
  • 3. FAULT TYPES Hardware fault: A hardware fault is a physical defect that can cause a component to malfunction, e.g. a broken wire or the output of a logic gate that is perpetually stuck at some logic value (0 or 1). Software fault: A software fault is a bug that can cause the program to fail for a given set of inputs.
  • 4. ERROR An error is a manifestation of a fault, e.g. a broken wire will cause an error if the system tries to propagate a signal through it. A program with a fault that induces incorrect output for some set of inputs will generate errors if that set of inputs is applied.
  • 5. FAULT LATENCY The fault latency is the duration between the onset of a fault and its manifestation as an error. Faults themselves are invisible to the outside world and show themselves only when they cause errors, so this latency affects the reliability of the overall system.
  • 6. ERROR RECOVERY Error recovery is the process by which the system attempts to recover from the effects of an error. TYPES OF ERROR RECOVERY Forward error recovery: The error is masked without any computations having to be redone. Backward error recovery: The system is rolled back to a moment in time before the error is believed to have occurred, and the computation is carried out again. It consumes additional time to mask the effects of the failure.
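Backward error recovery can be illustrated with a small checkpoint-and-rollback sketch. Everything named here (CheckpointedState, run_with_rollback, the use of RuntimeError as the "detected error" signal) is a hypothetical illustration, not something taken from the slides.

```python
import copy


class CheckpointedState:
    """Toy system state that can be saved and restored (hypothetical helper)."""

    def __init__(self):
        self.state = {"total": 0}
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Remember a state that is believed to be error-free.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Return to the last checkpoint, discarding later updates.
        self.state = copy.deepcopy(self._saved)


def run_with_rollback(system, step, max_retries=3):
    """Run step(system.state); on a detected error, roll back and redo it.

    Returns the number of retries that were needed.
    """
    system.checkpoint()
    for attempt in range(max_retries + 1):
        try:
            step(system.state)
            return attempt
        except RuntimeError:
            # Assumption: errors are signalled by RuntimeError.
            system.rollback()
    raise RuntimeError("error persisted after all retries")


calls = {"n": 0}


def flaky_add(state):
    """Adds 5, but the first call fails after a partial update."""
    calls["n"] += 1
    state["total"] += 5
    if calls["n"] == 1:
        raise RuntimeError("transient error detected")


system = CheckpointedState()
retries = run_with_rollback(system, flaky_add)
```

The redone computation is where the additional time goes: the first attempt's work is thrown away and the step runs again from the checkpoint.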
  • 7. CAUSES OF FAULTS Errors in the specification or design. Defects in the components. Environmental effects.
  • 8. Errors In The Specification Or Design This error arises from the communication gap between the person who writes the specification and the system designer. The specification is the link between the design process and the real-world application. If the specification is wrong, everything that proceeds from it is likely to be wrong.
  • 9. Defects In Components These faults arise from defects caused by the wear and tear of use. E.g. a MOSFET may fail due to electromigration, which is the gradual drift of metal atoms over time in the direction of the electron flow.
  • 10. Environmental Effects These faults arise from the operating environment. Devices can be subjected to a whole array of stresses, depending on the application. Poor ventilation or excessively high ambient temperatures can melt components or otherwise damage them. E.g. if a computer is in a missile, it can undergo high g-forces and vibrational stress.
  • 11. FAULT TYPES Faults are classified according to their temporal behavior and output behavior. A fault is said to be active when it is physically capable of generating errors and to be benign when it is not.
  • 12. TEMPORAL BEHAVIOR CLASSIFICATION Fault types: permanent, intermittent, transient. A permanent fault does not die away with time, but remains until it is repaired or the affected unit is replaced. An intermittent fault cycles between the fault-active and fault-benign states. A transient fault dies away after some time.
  • 13. Intermittent faults can be caused by loosely connected components. Transient faults can be caused by environmental effects. E.g. if there is a burst of electromagnetic radiation and the memory is not properly shielded, the contents of the memory can be altered without the memory chips themselves suffering any structural damage. When the memory is rewritten, the fault will go away.
  • 14. OUTPUT BEHAVIOR CLASSIFICATION Malicious faults • Inconsistent output errors; the faulty unit behaves arbitrarily • Harder to neutralize Non-malicious faults • Consistent output errors • Easier to neutralize
  • 15. Fail-stop: The unit responds to up to a certain maximum number of failures by simply stopping, rather than putting out incorrect outputs. Fail-safe: The unit's failure mode is biased so that the application process does not suffer catastrophe upon failure.
  • 16. INDEPENDENCE AND CORRELATION Component failures may be independent or correlated. Independent: A failure is said to be independent if it does not directly or indirectly cause another failure. Correlated: Failures are said to be correlated if they are related in some way, e.g. they may be triggered by the same cause, or one of them might cause the others to occur.
  • 17. FAULT DETECTION There are two ways to determine that a processor is malfunctioning • Online • Offline Online detection: • This detection goes on in parallel with normal system operation • It is done by checking for behavior that is inconsistent with correct operation • Indications of a faulty processor: - Branching to an invalid destination. - Fetching an opcode from a location that contains data rather than instructions.
  • 18. - Writing into a portion of memory to which the process has no write access. - Fetching an illegal opcode. - Being inactive for more than a prescribed period. • A monitor is associated with each processor, looking for signs that the processor is faulty. The monitor watches the data and address lines. • Another approach is to have multiple processors that are supposed to put out the same result, and to compare their results. If a discrepancy arises, it indicates a fault.
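One of the online checks listed above, inactivity for more than a prescribed period, can be sketched as a simple watchdog. This is a minimal illustration under stated assumptions: the Watchdog class, the unit names and the explicit `now` timestamps are all hypothetical, and a real monitor would watch hardware signals rather than Python calls.

```python
class Watchdog:
    """Flags units that have been silent longer than `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # unit name -> time of its last heartbeat

    def heartbeat(self, unit, now):
        # Called whenever a unit shows a sign of activity.
        self.last_seen[unit] = now

    def suspects(self, now):
        # Units whose last heartbeat is older than the prescribed period.
        return sorted(
            unit
            for unit, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


wd = Watchdog(timeout=2.0)
wd.heartbeat("P1", now=0.0)
wd.heartbeat("P2", now=0.0)
wd.heartbeat("P1", now=3.0)   # P1 stays active, P2 goes silent
silent_units = wd.suspects(now=4.0)
```

Passing the time in explicitly keeps the sketch deterministic; in practice the monitor would read a clock itself.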
  • 19. OFFLINE DETECTION Offline detection is done by running diagnostic tests. These tests are scheduled just like ordinary tasks.
  • 20. FAULT AND ERROR CONTAINMENT The process of preventing an error from spreading from one part of the system to another is called containment. When a fault or error occurs in one part of a system, it can spread through the system like an infectious disease. E.g. a fault in one part of the system might cause large voltage swings in another. A fault-free processor can give erroneous results when it gets input from a faulty unit.
  • 21. HOW FAULT CONTAINMENT IS ACCOMPLISHED The system is divided into fault-containment zones (FCZs) and error-containment zones (ECZs). An FCZ is a subset of the system that operates correctly despite arbitrary logical or electrical faults outside the subset, i.e. the failure of some part of the computer outside an FCZ cannot cause any element inside the FCZ to fail.
  • 22. Hardware inside an FCZ must be isolated from hardware outside it. It should withstand either a short circuit or the application of the maximum voltage on the lines connecting the FCZ to the outside world. Each FCZ should have an independent power supply and its own clocks. These clocks are synchronized with the clocks in other FCZs, but a malfunction in the outside clocks won't affect the clocks inside the FCZ. The function of an ECZ is to prevent errors from propagating across zone boundaries. This is achieved by voting on redundant outputs.
  • 23. REDUNDANCY A fault-tolerant system (FTS) relies on properly managed redundancy, i.e. if the system is to be kept running despite the failure of some of its parts, it must have spare capacity to begin with. TYPES OF REDUNDANCY • Hardware redundancy • Software redundancy • Time redundancy • Information redundancy
  • 24. Hardware redundancy Hardware redundancy is the use of additional hardware to compensate for failures. This can be accomplished in two ways. • The first is fault detection, correction, and masking. Fault detection: Multiple hardware units may be assigned to do the same task in parallel and their results compared. If one or more units are faulty, we can expect this to show up as a disagreement in the results.
  • 25. Fault masking: If a minority of the units are faulty and a majority of the units produce the same output, the majority result can be used and the effect of the failure is masked. Fault correction: If a minority of the units disagree, the fault is detected, so the computation is repeated on other processors to correct it. • The second way is to replace the malfunctioning unit. The system can be designed so that faulty units are easily replaced with spare ones.
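Fault detection and masking by comparing redundant outputs can be sketched as a majority vote. The function name majority_vote and its return convention are assumptions made for this illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Vote on the outputs of redundant units.

    Returns (value, dissenters): the majority value and the indices of
    the units that disagreed with it. If no value has a strict majority,
    returns (None, all indices): the fault is detected but not masked.
    """
    if not outputs:
        raise ValueError("need at least one output to vote on")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


masked = majority_vote([7, 7, 9])      # unit 2 is outvoted
unmasked = majority_vote([1, 2, 3])    # disagreement, but no majority
```

A non-empty dissenter list corresponds to fault detection; a returned majority value corresponds to masking, since the caller never sees the faulty output.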
  • 26. Two methods used in hardware redundancy • Static pairing • N-modular redundancy (NMR)
  • 28. • Static pairing hardwires processors in pairs and discards the entire pair if one of the processors fails; this is a very simple scheme. • The pair runs identical software with identical inputs and should generate identical outputs. If the outputs are not identical, the pair is nonfunctional, so the entire pair is discarded. • This approach is depicted in the following figure, and it works only when the interface is working and the two processors do not fail identically at around the same time.
  • 29. • The interface is checked by a monitor. If the interface fails, the monitor handles it, and if the monitor fails, the interface handles it. If both the interface and the monitor fail, the system is down.
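The pairing scheme can be sketched as a small comparator around two processing functions. StaticPair and its methods are hypothetical names; the sketch models only the output comparison, not the interface or its monitor.

```python
class StaticPair:
    """Two units run the same computation; any mismatch discards the pair."""

    def __init__(self, proc_a, proc_b):
        self.proc_a = proc_a
        self.proc_b = proc_b
        self.functional = True

    def run(self, x):
        if not self.functional:
            raise RuntimeError("pair has been discarded")
        a, b = self.proc_a(x), self.proc_b(x)
        if a != b:
            # We cannot tell which unit is wrong, so give up the whole pair.
            self.functional = False
            raise RuntimeError("outputs differ: pair discarded")
        return a


good = StaticPair(lambda x: x * 2, lambda x: x * 2)
good_result = good.run(3)

bad = StaticPair(lambda x: x * 2, lambda x: x * 2 + 1)  # second unit is faulty
try:
    bad.run(3)
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
```

Note the limitation stated on the slide: if both units fail identically, the comparison sees matching outputs and nothing is detected.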
  • 31. • NMR is a scheme for forward error recovery. • It uses N processors instead of one and votes on their outputs; N is usually odd. • NMR can be arranged in either of the following two ways: there are N voters and the entire cluster produces N outputs, or there is just one voter.
  • 32. NMR clusters are designed to allow the purging of malfunctioning units. That is, when a failure is detected, the failed unit is checked to see whether or not the failure is transient. If it is not, it must be electrically isolated from the rest of the cluster and a replacement unit is switched on. The faster the unit is replaced, the more reliable the cluster.
  • 33. • Purging can be done either by hardware or by the operating system. • Self-purging consists of a monitor at each unit comparing its output against the voted output. If there is a difference, the monitor disconnects the unit from the system. • The monitor can be described as a finite state machine with two states, connect and isolate. There are two signals: diff, which is set to 1 whenever the module output disagrees with the voter output, and reconnect, which is a command from the system to reconnect the module.
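The two-state monitor can be written down as a transition function. The state and signal names follow the slide; the exact transition rules are an assumption made here (diff is only acted on while connected, reconnect only while isolated), since the slide does not spell them out.

```python
from enum import Enum


class State(Enum):
    CONNECT = "connect"
    ISOLATE = "isolate"


def next_state(state, diff, reconnect):
    """One step of the self-purging monitor (assumed transition rules).

    diff      -- 1 when the module output disagrees with the voted output
    reconnect -- 1 when the system commands the module to be reconnected
    """
    if state is State.CONNECT:
        # A disagreement with the voter isolates the module.
        return State.ISOLATE if diff else State.CONNECT
    # While isolated, only an explicit reconnect command brings it back.
    return State.CONNECT if reconnect else State.ISOLATE


# Trace: agree, disagree, stay isolated, then reconnect.
signals = [(0, 0), (1, 0), (0, 0), (0, 1)]
state = State.CONNECT
trace = []
for diff, reconnect in signals:
    state = next_state(state, diff, reconnect)
    trace.append(state.value)
```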
  • 35. SOFTWARE REDUNDANCY • Software faults are unlike hardware faults: software never wears out, and faults are not generated spontaneously during system operation. • Software faults can be regarded as faults in design. • For software redundancy, simply replicating the same software N times will not work, since all N copies will fail for the same inputs. • Instead, N versions of the software can be implemented. The N versions can be developed by independent teams, with no contact between them.
  • 36. Each version is developed by a team of developers who never communicate with the other teams. • To minimize common-mode failures: The specifications should be written in formal terms and subjected to a rigorous process of checking. Multiple software versions should be developed in different programming languages. The tools used should be selected carefully. The training and quality of the programmers should be maintained.
  • 37. There are two approaches to software redundancy • N-version programming • Recovery block approach
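The N-version idea can be sketched by running several independently written implementations of the same specification and voting on their results. All names below are hypothetical, and for brevity the three versions are written in one language, whereas the slides recommend different languages and teams. version_c contains a deliberate design fault to show the vote masking it.

```python
from collections import Counter


def version_a(values):
    """Version A: explicit loop."""
    total = 0
    for v in values:
        total += v
    return total


def version_b(values):
    """Version B: built-in sum."""
    return sum(values)


def version_c(values):
    """Version C: deliberately faulty, it ignores negative numbers."""
    return sum(v for v in values if v > 0)


def n_version_run(versions, data):
    """Run every version on the same input and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(data))
        except Exception:
            results.append(None)  # a crashed version casts no useful vote
    value, votes = Counter(results).most_common(1)[0]
    if value is None or votes * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


voted_sum = n_version_run([version_a, version_b, version_c], [3, -1, 4])
```

Here versions A and B agree on 6 while the faulty version C returns 7, so the vote yields 6. Running the same faulty version three times would give three identical wrong answers, which is why plain replication does not help against design faults.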
  • 40. THANK U