Operating system.assig.ppt gokgfchvhj;;hhjcghfxgch

OPERATING SYSTEM
INDIVIDUAL ASSIGNMENT
NAME ID NUMBER
Natnael serte NSR/1856/14
Submitted to: Jira.k
Submission date: 06/07/2016

FAULT TOLERANCE DEFINITION
 Fault Tolerance simply means a system’s ability to
continue operating uninterrupted despite the failure
of one or more of its components. This is true whether
it is a computer system, a cloud cluster, a network, or
something else. In other words, fault tolerance refers
to how an operating system (OS) responds to and
allows for software or hardware malfunctions and
failures.
 An OS’s ability to recover and tolerate faults without
failing can be handled by hardware, software, or a
combined solution leveraging load balancers(see more
below). Some computer systems use multiple
duplicate fault tolerant systems to handle faults
gracefully. This is called a fault tolerant network.

WHAT IS FAULT TOLERANCE?
 The goal of fault tolerant computer systems is to
ensure business continuity and high availability
by preventing disruptions arising from a single
point of failure. Fault tolerance solutions
therefore tend to focus most on mission-critical
applications or systems.

FAULT TOLERANT COMPUTING MAY INCLUDE
SEVERAL LEVELS OF TOLERANCE:
 At the lowest level, the ability to respond to a power
failure, for example.
 A step up: during a system failure, the ability to use a
backup system immediately.
 Enhanced fault tolerance: a disk fails, and mirrored disks
take over for it immediately. This provides functionality
despite partial system failure, or graceful degradation,
rather than an immediate breakdown and loss of
function.
 High level fault tolerant computing: multiple processors
collaborate to scan data and output to detect errors, and
then immediately correct them.

 Fault tolerance software may be part of the OS
interface, allowing the programmer to check critical
data at specific points during a transaction.
 Fault-tolerant systems ensure no break in service by
using backup components that take the place of failed
components automatically. These may include:
 Hardware systems with identical or equivalent backup
operating systems. For example, a server with an
identical fault tolerant server mirroring all operations
in backup, running in parallel, is fault tolerant. By
eliminating single points of failure, hardware fault
tolerance in the form of redundancy can make any
component or system far safer and more reliable.

 Software systems backed up by other instances of
software. For example, if you replicate your customer
database continuously, operations in the primary
database can be automatically redirected to the second
database if the first goes down.
 Redundant power sources can help avoid a system
fault if alternative sources can take over automatically
during power failures, ensuring no loss of service.

COMPONENTS OF A FAULT-TOLERANCE SYSTEM
 The key benefit of fault tolerance is to minimize or
avoid the risk of systems becoming unavailable due to
a component error. This is particularly important in
critical systems that are relied on to ensure people’s
safety, such as air traffic control, and systems that
protect and secure critical data and high-value
transactions.
 The core components to improving fault tolerance
include:

 Diversity
 If a system’s main electricity supply fails, potentially
due to a storm that causes a power outage or affects a
power station, it will not be possible to access
alternative electricity sources. In this event, fault
tolerance can be sourced through diversity, which
provides electricity from sources like backup
generators that take over when a main power failure
occurs.
 Some diverse fault-tolerance options result in the
backup not having the same level of capacity as the
primary source. This may, in some cases, require the
system to ensure graceful degradation until the
primary power source is restored.

Redundancy
 Fault-tolerant systems use redundancy to remove the
single point of failure. The system is equipped with
one or more power supply units (PSUs), which do not
need to power the system when the primary PSU
functions as normal. In the event the primary PSU
fails or suffers a fault, it can be removed from service
and replaced by a redundant PSU, which takes over
system function and performance.
 Alternatively, redundancy can be imposed at a system
level, which means an entire alternate computer
system is in place in case a failure occurs.

Replication
 Replication is a more complex approach to achieving
fault tolerance. It involves using multiple identical
versions of systems and subsystems and ensuring
their functions always provide identical results. If the
results are not identical, then a democratic procedure
is used to identify the faulty system. Alternatively, a
procedure can be used to check for a system that
shows a different result, which indicates it is faulty.
 Replication can either take place at the component
level, which involves multiple processors running
simultaneously, or at the system level, which involves
identical computer systems running simultaneously.

 Elements of Fault-tolerant Systems
 Fault-tolerant systems also use backup components,
which automatically replace failed components to
prevent a loss of service. These backup components
include:
Hardware Systems
 Hardware systems can be backed up by systems that
are identical or equivalent to them. A typical example
is a server made fault-tolerant by deploying an
identical server that runs in parallel to it and mirrors
all its operations, such as the redundant array of
inexpensive disks (RAID), which combines physical
disk components to achieve redundancy and improved
performance.

Software Systems
 Software systems can be made fault-tolerant by backing
them up with other software. A common example is backing
up a database that contains customer data to ensure it can
continuously replicate onto another machine. As a result, in
the event that a primary database fails, normal operations
will continue because they are automatically replicated and
redirected onto the backup database.
Power Sources
 Power sources can also be made fault-tolerant by using
alternative sources to support them. One approach is to run
devices on an uninterruptible power supply (UPS). Another
is to use backup power generators that ensure storage and
hardware, heating, ventilation, and air conditioning (HVAC)
continue to operate as normal if the primary power source
fails.

What is Reliability?
 Reliability, on the other hand, is the measure of how
consistently a system performs its intended functions
without failure over a specific period. In operating
systems, reliability is essential for ensuring that users
can depend on the system to function correctly under
normal conditions.

RELIABILITY MEASURES IN
OPERATING SYSTEMS:
 MTBF (Mean Time Between Failures): MTBF is a
metric used to estimate the average time between failures
in a system. A higher MTBF value indicates greater
reliability, as it signifies longer intervals between
potential failures.
 MTTR (Mean Time To Repair): MTTR represents the
average time taken to repair a failed component or system
after a failure occurs. Minimizing MTTR is essential for
improving system reliability and reducing downtime.
 Availability: Availability is a measure of how
consistently a system can deliver its services without
interruption. High availability indicates greater
reliability, as it ensures that users can access the system
whenever needed.

TECHNIQUES FOR ENHANCING FAULT
TOLERANCE AND RELIABILITY:
 RAID (Redundant Array of Independent Disks): RAID
technology involves combining multiple disk drives into a single
logical unit to improve performance, fault tolerance, and data
redundancy. Different RAID levels offer varying degrees of
fault tolerance and performance benefits.
 Virtualization: Virtualization technology enables the creation
of virtual instances of hardware resources, allowing for better
resource utilization, scalability, and fault isolation.
Virtualization can enhance fault tolerance by enabling rapid
recovery from failures through migration or replication of
virtual machines.
 Clustering: Clustering involves grouping multiple
independent systems together to work as a single unit. Clusters
provide redundancy and load balancing capabilities, enhancing
fault tolerance by distributing workloads across multiple nodes
and enabling failover mechanisms.

 Conclusion: In conclusion, fault tolerance and
reliability are essential considerations in operating
systems to ensure continuous operation and minimize
disruptions caused by failures or errors. By
implementing redundancy, error detection
mechanisms, failover strategies, and reliability
measures like MTBF and availability metrics,
operating systems can enhance their resilience against
faults and improve overall system reliability.

Operating system.assig.ppt gokgfchvhj;;hhjcghfxgch

More Related Content

Similar to Operating system.assig.ppt gokgfchvhj;;hhjcghfxgch

Recently uploaded

Operating system.assig.ppt gokgfchvhj;;hhjcghfxgch