OPERATING SYSTEM
INDIVIDUAL ASSIGNMENT
NAME: Natnael serte
ID NUMBER: NSR/1856/14
Submitted to: Jira.k
Submission date: 06/07/2016
FAULT TOLERANCE DEFINITION
 Fault Tolerance simply means a system’s ability to
continue operating uninterrupted despite the failure
of one or more of its components. This is true whether
it is a computer system, a cloud cluster, a network, or
something else. In other words, fault tolerance refers
to how an operating system (OS) responds to and
allows for software or hardware malfunctions and
failures.
 An OS’s ability to tolerate and recover from faults
without failing can be provided by hardware, software,
or a combined solution that uses load balancers. Some
computer systems run multiple duplicate fault-tolerant
systems so that faults are handled gracefully. Such an
arrangement is called a fault-tolerant network.
WHAT IS FAULT TOLERANCE?
 The goal of fault tolerant computer systems is to
ensure business continuity and high availability
by preventing disruptions arising from a single
point of failure. Fault tolerance solutions
therefore tend to focus most on mission-critical
applications or systems.
FAULT TOLERANT COMPUTING MAY INCLUDE
SEVERAL LEVELS OF TOLERANCE:
 At the lowest level, the ability to respond to a power
failure, for example.
 A step up: during a system failure, the ability to use a
backup system immediately.
 Enhanced fault tolerance: a disk fails, and mirrored disks
take over for it immediately. This provides functionality
despite partial system failure, or graceful degradation,
rather than an immediate breakdown and loss of
function.
 High level fault tolerant computing: multiple processors
collaborate to scan data and output to detect errors, and
then immediately correct them.
 Fault tolerance software may be part of the OS
interface, allowing the programmer to check critical
data at specific points during a transaction.
 Fault-tolerant systems ensure no break in service by
using backup components that take the place of failed
components automatically. These may include:
 Hardware systems with identical or equivalent backup
systems. For example, a server is fault tolerant when an
identical server runs in parallel and mirrors all of its
operations as a backup. By eliminating single points of
failure, hardware fault tolerance in the form of
redundancy can make a component or system far safer
and more reliable.
 Software systems backed up by other instances of
software. For example, if you replicate your customer
database continuously, operations in the primary
database can be automatically redirected to the second
database if the first goes down.
 Redundant power sources can help avoid a system
fault if alternative sources can take over automatically
during power failures, ensuring no loss of service.
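The database example above can be sketched in a few lines of Python. This is only a minimal illustration: the `Replica` and `FailoverStore` classes are hypothetical in-memory stand-ins, not a real database API, and a server failure is simulated by setting a flag.

```python
class Replica:
    """Hypothetical in-memory stand-in for one database server."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True  # set to False to simulate a crash

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class FailoverStore:
    """Copies every write to a backup and redirects to it if the primary fails."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def _fail_over(self):
        # Promote the backup; after this there is no spare left.
        if self.backup is None:
            raise ConnectionError("no backup available")
        self.primary, self.backup = self.backup, None

    def write(self, key, value):
        try:
            self.primary.write(key, value)
        except ConnectionError:
            self._fail_over()
            self.primary.write(key, value)
        if self.backup is not None:
            try:
                self.backup.write(key, value)  # continuous replication
            except ConnectionError:
                self.backup = None  # backup lost; keep serving from primary

    def read(self, key):
        try:
            return self.primary.read(key)
        except ConnectionError:
            self._fail_over()
            return self.primary.read(key)


store = FailoverStore(Replica("primary"), Replica("backup"))
store.write("c1", "Alice")
store.primary.up = False      # simulate a primary failure
print(store.read("c1"))       # Alice (served by the former backup)
```

The caller keeps using the same `store` object before and after the failure, which is the sense in which the redirection is automatic.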
COMPONENTS OF A FAULT-TOLERANCE SYSTEM
 The key benefit of fault tolerance is to minimize or
avoid the risk of systems becoming unavailable due to
a component error. This is particularly important in
critical systems that are relied on to ensure people’s
safety, such as air traffic control, and systems that
protect and secure critical data and high-value
transactions.
 The core components for improving fault tolerance
include:
 Diversity
 If a system’s main electricity supply fails, for example
because a storm causes a power outage or affects a
power station, another connection to the same supply
will not help. In this event, fault tolerance can be
provided through diversity, which draws electricity
from a different kind of source, such as backup
generators that take over when a main power failure
occurs.
 Some diverse fault-tolerance options result in the
backup not having the same level of capacity as the
primary source. This may, in some cases, require the
system to ensure graceful degradation until the
primary power source is restored.
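One way to picture graceful degradation under a smaller backup source is load shedding by priority. The sketch below is an assumption made for illustration: the service names, priorities, and wattages are invented, and `select_services` is a hypothetical helper rather than part of any real power-management system.

```python
def select_services(services, capacity_watts):
    """Pick the services to keep running within the available capacity.

    services: list of (name, priority, watts); a lower priority number
    means the service is more critical. Services are considered from most
    to least critical, and any that do not fit are shed.
    """
    running = []
    remaining = capacity_watts
    for name, _priority, watts in sorted(services, key=lambda s: s[1]):
        if watts <= remaining:
            running.append(name)
            remaining -= watts
    return running


# Illustrative loads only.
services = [
    ("storage", 1, 400),
    ("compute", 2, 500),
    ("hvac", 3, 300),
    ("lighting", 4, 100),
]

print(select_services(services, 1500))  # mains: ['storage', 'compute', 'hvac', 'lighting']
print(select_services(services, 1000))  # generator: ['storage', 'compute', 'lighting']
```

With the reduced capacity the system keeps its most critical functions and drops what does not fit, instead of failing completely.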
Redundancy
 Fault-tolerant systems use redundancy to remove the
single point of failure. The system is equipped with
one or more redundant power supply units (PSUs), which
do not need to power the system while the primary PSU
functions normally. If the primary PSU fails or
suffers a fault, it can be removed from service and
replaced by a redundant PSU, which takes over
powering the system.
 Alternatively, redundancy can be applied at the system
level, which means an entire alternate computer
system is in place in case a failure occurs.
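The effect of removing a single point of failure can be estimated with the standard series and parallel reliability formulas. The sketch below assumes that components fail independently, which real systems only approximate, and the 0.95 figure is an invented example value.

```python
import math


def series_reliability(reliabilities):
    """All components are needed: the system works only if every one works."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """Redundant components: the system fails only if every one fails."""
    return 1 - math.prod(1 - r for r in reliabilities)


psu = 0.95  # assumed probability that one PSU survives the period

print(round(parallel_reliability([psu]), 4))       # 0.95   (single PSU)
print(round(parallel_reliability([psu, psu]), 4))  # 0.9975 (primary plus one redundant PSU)
print(round(series_reliability([psu, psu]), 4))    # 0.9025 (two PSUs that are both required)
```

Under these assumptions, adding one redundant unit cuts the failure probability from 5% to 0.25%, while chaining two required units makes the system less reliable than either unit alone.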
Replication
 Replication is a more complex approach to achieving
fault tolerance. It involves running multiple identical
versions of systems and subsystems and checking that
their functions always produce identical results. If the
results are not identical, a democratic (majority-vote)
procedure is used to identify the faulty system.
Alternatively, a comparison procedure can flag any
system whose result differs from the others as faulty.
 Replication can either take place at the component
level, which involves multiple processors running
simultaneously, or at the system level, which involves
identical computer systems running simultaneously.
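The democratic procedure described above is commonly implemented as a majority vote over the replica outputs. The following is a minimal sketch, assuming three replicas with hashable results; the replica names and the `majority_vote` helper are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (agreed_value, faulty_replicas) for a dict of replica -> result.

    Raises RuntimeError when no value is returned by a strict majority,
    because the faulty replica cannot then be identified.
    """
    counts = Counter(outputs.values())
    value, votes = counts.most_common(1)[0]
    if votes <= len(outputs) / 2:
        raise RuntimeError("no majority; cannot identify the faulty replica")
    faulty = sorted(name for name, out in outputs.items() if out != value)
    return value, faulty


# Replica C returns a different result and is outvoted.
print(majority_vote({"A": 42, "B": 42, "C": 41}))  # (42, ['C'])
```

With three replicas the vote masks a single faulty result; if two replicas fail in different ways there is no majority, and the sketch reports that rather than guessing.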
 Elements of Fault-tolerant Systems
 Fault-tolerant systems also use backup components,
which automatically replace failed components to
prevent a loss of service. These backup components
include:
Hardware Systems
 Hardware systems can be backed up by systems that
are identical or equivalent to them. A typical example
is a server made fault-tolerant by deploying an
identical server that runs in parallel to it and mirrors
all its operations. A similar idea at the storage level
is the redundant array of independent disks (RAID),
which combines physical disk components to achieve
redundancy and improved performance.
Software Systems
 Software systems can be made fault-tolerant by backing
them up with other software. A common example is a
database that contains customer data and is continuously
replicated onto another machine. As a result, if the
primary database fails, normal operations continue because
they are automatically redirected to the backup database.
Power Sources
 Power sources can also be made fault-tolerant by using
alternative sources to support them. One approach is to run
devices on an uninterruptible power supply (UPS). Another
is to use backup power generators that ensure storage and
hardware, heating, ventilation, and air conditioning (HVAC)
continue to operate as normal if the primary power source
fails.
What is Reliability?
 Reliability, on the other hand, is the measure of how
consistently a system performs its intended functions
without failure over a specific period. In operating
systems, reliability is essential for ensuring that users
can depend on the system to function correctly under
normal conditions.
RELIABILITY MEASURES IN
OPERATING SYSTEMS:
 MTBF (Mean Time Between Failures): MTBF is a
metric used to estimate the average time between failures
in a system. A higher MTBF value indicates greater
reliability, as it signifies longer intervals between
potential failures.
 MTTR (Mean Time To Repair): MTTR represents the
average time taken to repair a failed component or system
after a failure occurs. Minimizing MTTR is essential for
improving availability and reducing downtime.
 Availability: Availability is the proportion of time a
system is operational and able to deliver its services.
High availability comes from failing rarely (high MTBF)
and recovering quickly (low MTTR), so that users can
access the system whenever needed.
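The three measures above are linked by the commonly used steady-state estimate Availability = MTBF / (MTBF + MTTR). The short sketch below works through that formula; the MTBF and MTTR values are assumed example figures, not measurements.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail, hours_per_year=8760):
    """Expected downtime in a 365-day year for a given availability."""
    return (1 - avail) * hours_per_year


a = availability(mtbf_hours=999, mttr_hours=1)
print(a)                                      # 0.999
print(round(downtime_hours_per_year(a), 2))   # 8.76
```

Halving the MTTR in this example would roughly halve the yearly downtime, which is why repair time matters as much as failure frequency.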
TECHNIQUES FOR ENHANCING FAULT
TOLERANCE AND RELIABILITY:
 RAID (Redundant Array of Independent Disks): RAID
technology involves combining multiple disk drives into a single
logical unit to improve performance, fault tolerance, and data
redundancy. Different RAID levels offer varying degrees of
fault tolerance and performance benefits.
 Virtualization: Virtualization technology enables the creation
of virtual instances of hardware resources, allowing for better
resource utilization, scalability, and fault isolation.
Virtualization can enhance fault tolerance by enabling rapid
recovery from failures through migration or replication of
virtual machines.
 Clustering: Clustering involves grouping multiple
independent systems together to work as a single unit. Clusters
provide redundancy and load balancing capabilities, enhancing
fault tolerance by distributing workloads across multiple nodes
and enabling failover mechanisms.
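For the RAID entry above, parity-based levels such as RAID 5 rely on XOR parity to survive the loss of one disk. The sketch below shows only that idea on three small in-memory blocks; real arrays also stripe data and rotate parity across disks, and the block contents here are invented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per (simulated) disk, plus one parity block.
data = [b"faul", b"t-to", b"lera"]
parity = xor_blocks(data)

# Disk 1 fails: XOR the surviving blocks with the parity to rebuild it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)  # b't-to'
```

Because XOR is its own inverse, any single missing block can be recovered from the remaining blocks and the parity, at the cost of one extra disk's worth of storage.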
 Conclusion: Fault tolerance and reliability are
essential considerations in operating systems to
ensure continuous operation and minimize disruptions
caused by failures or errors. By implementing
redundancy, error detection mechanisms, and failover
strategies, and by tracking reliability measures such
as MTBF and availability, operating systems can
enhance their resilience against faults and improve
overall system reliability.
