Adaptive fault tolerance in cloud survey

An Effective Survey on Fault Tolerance in
Cloud Computing
Dr. Mohammad Abdur Rouf (Professor)
Head of Department of Computer Science & Engineering
Dhaka University of Engineering and Technology, Gazipur
B.Sc. Engg. (KU), M.Sc. Engg. (BUET), Ph.D. (KAIST)
Presented by
Md. Mostafijur Rahman
M.Sc. Engg. Student ID #: 132431(p)
Department of Computer Science and Engineering

Outline of the talk
• Introduction
• Related Work
• Taxonomy of Fault, Error and Failure in Cloud Computing
• Fault Tolerance Requirements
• Existing Fault Tolerance Techniques/Policies
- Proactive Fault Tolerance
  - Rejuvenation
  - Self-Healing
  - Preemptive Migration
  - Load Balancing
- Reactive Fault Tolerance
  - Restart
  - Task Resubmission
  - Checkpointing
  - Job Migration
• Cloud Fault Tolerance Developed Techniques
- A Proactive Fault Tolerance
- Fault Tolerance Manager (FTM)
- A Self-tuning Failure Detection (SFD)
- An Efficient Fault-Tolerant Algorithm
- A Byzantine Fault Tolerance Framework (BFT Cloud)
- Adaptive Fault Tolerance in Real Time Cloud Computing Output Dependable Drawback
- Tools Used for Implementation of Fault Tolerance Techniques

Introduction
• Cloud computing delivers large-scale, distributed, multi-level services through
virtualization technologies. As demand for resources in the cloud grows,
flexibility in acquiring and releasing resources becomes an important issue.
• Fault tolerance covers the approaches that provide robustness, availability
and dependability. The main benefits of enforcing fault tolerance in cloud
computing include recovery from hardware and software failures, reduced
cost and improved performance.
• Different types of faults may occur in a cloud infrastructure, and various
fault tolerance techniques can be applied depending on the fault tolerance
policy in use.
• First, data failures, such as data corruption, missing source data and
other flaws in the data.
• Second, hardware failures, such as faulty or slow VMs and storage
access exceptions.

Related Work
• In 1999, Felix C. Gartner discussed fundamental fault tolerance methodologies
and their relation to distributed computing, using a formal approach that helps
in understanding the subject and in building more reliable and dependable
systems.
• In 2004, Lee Pike provided four abstractions for distributed fault-tolerant
systems, which together cover a wide variety of such systems. The
abstractions concern faults, communication, messages and fault masking.
• In 2011, Arvind Kumar reviewed various fault tolerance techniques that can
be used in real-time distributed systems, examining how each technique is
applied to tolerate the different faults present in such systems.
• In 2011, Zhang proposed a Byzantine fault tolerance (BFT) system for the
cloud that provides high reliability while also ensuring high performance.

Related Work
• In 2013, N. Chandrakala and P. Sivaprakasam provided a load
balancing algorithm for cloud computing that checks whether the load on
any virtual machine exceeds 75%. If so, the load is transferred to another
virtual machine whose load is below 75% and will not exceed 75% after
the transfer.
• In 2013, Anjali D. Meshram et al. provided a Fault Tolerance in Cloud
Computing (FTMC) model that works on the reliability of each computing
node. A node is selected only if its reliability is high; otherwise it is
removed.
• In 2013, Ravi Jhawar proposed a comprehensive high-level approach
that assesses fault tolerance mechanisms and uses virtualization
technology to improve the availability and reliability of applications
deployed in virtual machines in the cloud.

Taxonomy of Fault, Error, and Failure in Cloud Fault Tolerance
Fig.1. Path to Generation of Failure: Fault → (causes) → Error → (causes) → Failure
Faults are the underlying causes that degrade the performance of a cloud application.
One or more faults give rise to an error, which is the difference between a calculated or
observed value of a quantity and its true value. Errors in turn lead to failure, which is the
misbehaviour of a system as observed by a user. A system can be damaged by several failures.
Many kinds of faults occur in a system, such as omission faults, aging-related faults,
response faults, software faults, timing faults, interaction faults and miscellaneous faults.
Errors in cloud computing include network errors, software errors and miscellaneous errors.
The system failures that affect a cloud application are omission, hardware, software,
response, network, crash and miscellaneous failures.

Taxonomy of Faults
There are many kinds of faults in a cloud application, as shown in Fig.2.
Fig.2. Taxonomy of Faults
• Top-level categories: Omission Faults, Aging Related Faults, Response Faults,
Software Faults, Interaction Faults, Timing Faults, Miscellaneous
• Subtypes shown in the figure: Hardware, Software, Denial of Service, Disk Space Full,
Transient or Intermittent, Value Faults, State Transition, Early Faults, Late Faults,
Permanent Faults, Incorrect Design, Timing Overheads, Service Interdependencies,
Protocol Incompatibilities

Taxonomy of Errors
Fig.3. Taxonomy of Errors
• Top-level categories: Network, Software, Miscellaneous
• Subtypes shown in the figure: Packet Corruption, Packet Loss, Network Congestion,
Permanent Errors, Intermittent Errors, Transient Errors, Memory Leak,
Numerical Exception

Taxonomy of Failure
Fig.4. Taxonomy of Failure
• Top-level categories: Hardware, Response Faults, Software Faults,
Interaction Faults, Timing Faults, Miscellaneous
• Subtypes shown in the figure: Machine Failure, Disk Failure, Transient or Intermittent,
Value Faults, State Transition, Early Faults, Late Faults, Permanent Faults,
Incorrect Design, Timing Overheads, Service Interdependencies,
Protocol Incompatibilities

Fault Tolerance Requirements
To understand the role of fault tolerance, we should cover a number of useful
requirements for distributed systems including the following:
Mean Time To Failure (MTTF):
Mean Time To Repair (MTTR):
Mean Time Between Failures (MTBF):
MTBF = MTTF + MTTR
Reliability:

Fault Tolerance Requirements
Availability:
Safety:
Maintainability:
Two Phases:
a) Error Detection
b) Error Recovery
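
The metrics listed above can be illustrated with a small worked example. The sketch below uses the slide's relation MTBF = MTTF + MTTR. The availability formula MTTF / (MTTF + MTTR) is the common textbook definition of steady-state availability and is an assumption here, since the slides do not state it. The function names and the 990-hour and 10-hour figures are hypothetical.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF = MTTF + MTTR, the relation given on the slide."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR).

    Assumption: this is the common textbook definition, not taken from the slides.
    """
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical figures: a VM that runs 990 hours on average before failing
# and takes 10 hours on average to repair.
MTTF_HOURS = 990.0
MTTR_HOURS = 10.0

print(mtbf(MTTF_HOURS, MTTR_HOURS))          # 1000.0
print(availability(MTTF_HOURS, MTTR_HOURS))  # 0.99
```

Under these assumed figures the system is available 99% of the time. Shortening the repair time raises availability without changing MTTF.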

Existing Fault Tolerance Techniques/Policies
Fig.5. Fault Tolerance Techniques
• Proactive Fault Tolerance
- Preemptive Migration
- Load Balancing
- Self-Healing
- Rejuvenation
• Reactive Fault Tolerance
- Job Migration
- User Defined Exception Handling
- Checkpointing (Full Checkpointing, Incremental Checkpointing)
- Restart
- Task Resubmission
- Replication (Semi-active Replication, Passive Replication, Semi-passive Replication)
- Rescue Workflow

Describe Proactive Fault Tolerance Techniques
Proactive Fault Tolerance: Proactive fault tolerance policies help to avoid
faults by predicting them before they occur and replacing the suspected
component.
a) Rejuvenation: The system is restarted, and the operating environment and all
processes are reinitialized.
b) Self-Healing: Failures are handled automatically when multiple instances of an
application run on different virtual machines.
c) Preemptive Migration: A running job is preempted, its state is saved, and it is
then migrated to another system.
d) Load Balancing: Load is transferred whenever the load (% utilization) of CPU and
memory exceeds a certain limit. For example, if 75% CPU utilization is taken as the
limit, load from a CPU above that limit is transferred to another CPU that does not
exceed its limit.
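
The threshold rule in item d) can be sketched as follows. This is a minimal illustration under stated assumptions, not an algorithm from the surveyed papers. The names `CPU_LIMIT`, `pick_target` and `rebalance`, the VM names, and the choice to move only the excess load to the least-loaded eligible node are all hypothetical.

```python
from typing import Dict, Optional

CPU_LIMIT = 75.0  # percent utilization; the example threshold from the text


def pick_target(loads: Dict[str, float], source: str, amount: float) -> Optional[str]:
    """Return the least-loaded node that stays within the limit after taking `amount`."""
    candidates = [
        (load, name)
        for name, load in loads.items()
        if name != source and load + amount <= CPU_LIMIT
    ]
    return min(candidates)[1] if candidates else None


def rebalance(loads: Dict[str, float]) -> Dict[str, float]:
    """Move the excess load off every node above the limit, where a target exists."""
    result = dict(loads)
    for name in sorted(result):
        excess = result[name] - CPU_LIMIT
        if excess <= 0:
            continue  # node is within its limit
        target = pick_target(result, name, excess)
        if target is not None:
            result[name] -= excess
            result[target] += excess
    return result


# Hypothetical utilization figures for three VMs.
loads = {"vm-a": 90.0, "vm-b": 40.0, "vm-c": 70.0}
print(rebalance(loads))  # {'vm-a': 75.0, 'vm-b': 55.0, 'vm-c': 70.0}
```

Here vm-c is skipped as a target because accepting the 15% excess would push it past 75%. If no node can absorb the excess, the sketch leaves the overloaded node unchanged.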

Describe Reactive Fault Tolerance Techniques
Reactive fault tolerance policies are used after a failure has occurred in the system. They
help to recover the system from an unstable state to a stable state so that the
system can resume working and provide the desired results.
Checkpointing: Checkpointing systems save the system state at regular or irregular time
intervals. Upon failure the system recovers by rolling back to the most recent checkpoint.
It has two kinds: a) Full Checkpointing b) Incremental Checkpointing
Replication: Copies of the same data are created and stored at
different locations. It has three kinds: a) Semi-active Replication b) Semi-passive
Replication c) Passive Replication
Job Migration: When a task fails, the failed task can be migrated to
another machine.
Task Resubmission: When a failed task is detected in the system, it is
resubmitted to the same resource or to a new one.
User Defined Exception Handling: The user specifies how a particular task
failure should be treated within the workflow.
Rescue Workflow: The workflow is allowed to continue even if a task fails, until
it can no longer proceed without dealing with the failed task.
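
Checkpointing and task resubmission, as described above, can be combined in a short sketch. The code below is an illustration under stated assumptions, not an implementation from the surveyed work. The task (summing a list), the JSON checkpoint file, the function names and the `fail_at` fault-injection hook are all hypothetical. The checkpoint is written in full after every item and is not written atomically, which a production system would need to handle.

```python
import json
import os
import tempfile


def run_task(items, checkpoint_path, fail_at=None):
    """Sum `items`, saving a checkpoint after each one.

    `fail_at` is a hypothetical hook that simulates a crash at that index.
    """
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # Roll back to the most recent checkpoint instead of starting over.
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated failure at item {i}")
        state["total"] += items[i]
        state["next_index"] = i + 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)  # full checkpoint: the whole state is rewritten
    return state["total"]


def run_with_resubmission(items, checkpoint_path, max_attempts=3, fail_at=None):
    """Resubmit the task after a failure, up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Inject the simulated fault only on the first attempt.
            return run_task(items, checkpoint_path, fail_at if attempt == 1 else None)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up once the retry budget is spent


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "task.ckpt")
    print(run_with_resubmission([1, 2, 3, 4], path, fail_at=2))  # 10
```

The first attempt fails at the third item. The resubmitted attempt reloads the checkpoint and resumes from that item, so the first two items are not recomputed.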

Cloud Fault Tolerance Developed Techniques
A Proactive Fault Tolerance:

Answers
Thanks
