An Effective Survey on Fault Tolerance in
Cloud Computing
Dr. Mohammad Abdur Rouf (Professor)
Head of Department of Computer Science & Engineering
Dhaka University of Engineering and Technology, Gazipur
B.Sc. Engg. (KU), M.Sc. Engg. (BUET), Ph.D. (KAIST)
Presented by
Md. Mostafijur Rahman
M.Sc. Engg. Student ID #: 132431(p)
Department of Computer Science and Engineering
Outline of the talk
• Introduction
• Related Work
• Taxonomy of Fault, Error and Failure in Cloud Computing
• Fault Tolerance Requirements
• Existing Fault Tolerance Techniques/ Policies
- Proactive Fault Tolerance
- Rejuvenation
- Self-Healing
- Preemptive Migration
- Load Balancing
- Reactive Fault Tolerance
- Restart
- Task Resubmission
- Checkpointing
- Job Migration
• Cloud Fault Tolerance Developed Techniques
- A Proactive Fault Tolerance
- Fault Tolerance Manager (FTM)
- A Self-tuning Failure Detection (SFD)
- An Efficient Fault-Tolerant Algorithm
- A Byzantine Fault Tolerance Framework (BFT Cloud)
- Adaptive Fault Tolerance in Real time Cloud Computing Output Dependable Drawback
- Tools Used for Implementation of Fault Tolerance Techniques
Introduction
• Cloud computing delivers large-scale, distributed, multi-level services
through virtualization technologies. As demand for resources in the
cloud grows, flexibility in obtaining and releasing resources becomes
an important issue.
• Fault tolerance covers the approaches that provide robustness,
availability and dependability. The main benefits of enforcing fault
tolerance in cloud computing are recovery from hardware and software
failures, reduced cost and improved performance.
• Different types of faults may occur in a cloud infrastructure, and
various fault tolerance techniques can be applied depending on the
fault tolerance policy in use.
• First, data failures, such as data corruption, missing source data and
other flaws in the data.
• Second, hardware failures, such as faulty or slow VMs and storage
access exceptions.
Related Work
• In 1999, Felix C. Gartner discussed fundamental fault tolerance
methodologies and their relation to distributed computing, using a formal
approach that aids understanding of the subject and helps in building more
reliable and dependable systems.
• In 2004, Lee Pike provided four abstractions for distributed fault-tolerant
systems. Together they cover a wide variety of such systems and concern
faults, communication, messages and fault masking.
• In 2011, Arvind Kumar reviewed fault tolerance techniques that can be used
in real-time distributed systems, examining how each technique is applied
to tolerate the different faults present in such systems.
• In 2011, Zhang proposed a Byzantine fault tolerance (BFT) system for the
cloud that provides high reliability while also ensuring high performance.
Related Work
• In 2013, N. Chandrakala and P. Sivaprakasam provided a load
balancing algorithm for cloud computing that checks whether the load on
any virtual machine in the cloud is more than 75%. If so, the load is
transferred to another virtual machine whose load is below 75% and
does not exceed 75% after the transfer.
• In 2013, Anjali D. Meshram et al. provided a Fault Tolerance in Cloud
Computing (FTMC) model that works on the principle of the reliability of
each computing node. A node is selected only if its reliability is high;
otherwise it is removed.
• In 2013, Ravi Jhawar proposed a comprehensive high-level approach
that assesses fault tolerance mechanisms and uses virtualization
technology to improve the availability and reliability of applications
deployed in virtual machines in the cloud.
Taxonomy of Fault, Error, and Failure in Cloud
Fault Tolerance:
Fig.1. Path to Generation of Failure: a fault causes an error, and an error leads to a failure.
A fault is a cause that degrades the performance of a cloud application. Faults give rise to
errors, where an error is the difference between the calculated or observed value of a
quantity and its true value. Errors in turn lead to failures. A failure is a misbehaviour of the
system that a user can observe, and repeated failures damage the system.
The faults that occur in a system include omission faults, aging-related faults, response
faults, software faults, timing faults, interaction faults and miscellaneous faults.
The errors found in cloud computing include network errors, software errors and
miscellaneous errors.
Omission, hardware, software, response, network, crash and miscellaneous failures are the
system failures that affect cloud applications.
Taxonomy of Faults
There are many kinds of faults in cloud applications.
Fig.2. Taxonomy of Faults. Top-level categories shown: omission faults, aging-related faults, response faults, software faults, interaction faults, timing faults, miscellaneous. Subtypes shown: hardware, software, denial of service, disk space full, transient or intermittent, value faults, state transition, early faults, late faults, permanent faults, incorrect design, timing overheads, service interdependencies, protocol incompatibilities.
Taxonomy of Errors
Fig.3. Taxonomy of Errors. Top-level categories shown: network, software, miscellaneous. Subtypes shown: packet corruption, packet loss, network congestion, permanent errors, intermittent errors, transient errors, memory leak, numerical exception.
Taxonomy of Failure
Fig.4. Taxonomy of Failure. Top-level categories shown: hardware, response, software, interaction, timing, miscellaneous. Subtypes shown: machine failure, disk failure, transient or intermittent, value, state transition, early, late, permanent, incorrect design, timing overheads, service interdependencies, protocol incompatibilities.
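Fault Tolerance Requirements
To understand the role of fault tolerance, we should cover a number of useful
requirements for distributed systems, including the following:
Mean Time To Failure (MTTF): the average time the system operates before a failure occurs.
Mean Time To Repair (MTTR): the average time needed to repair the system after a failure.
Mean Time Between Failures (MTBF): the average time from one failure to the next.
MTBF = MTTF + MTTR
Reliability: the ability of the system to deliver correct service continuously.
Availability: the readiness of the system to deliver correct service when it is needed.
Safety: the absence of catastrophic consequences when the system fails.
Maintainability: the ease with which a failed system can be repaired.
Fault tolerance works in two phases:
a) Error detection
b) Error recovery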
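The relation MTBF = MTTF + MTTR can be checked with a small calculation. The sketch below is illustrative only: the function names and the sample figures are assumptions, and the availability formula MTTF / (MTTF + MTTR) is the commonly used steady-state definition, which the slides themselves do not state.

```python
def mtbf(mttf_hours, mttr_hours):
    """Mean Time Between Failures, as given on the slide: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up (assumed steady-state definition)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical figures: the system runs 990 hours on average before failing
# and takes 10 hours on average to repair.
print(mtbf(990, 10))
print(availability(990, 10))
```

With these sample figures the system is unavailable for 10 out of every 1000 hours, so shortening the repair time raises availability even when the time to failure stays the same.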
Existing Fault Tolerance Techniques/ Policies
Fig.5. Fault Tolerance Techniques
• Proactive Fault Tolerance
- Preemptive Migration
- Load Balancing
- Self-Healing
- Rejuvenation
• Reactive Fault Tolerance
- Job Migration
- User Defined Exception Handling
- Checkpointing (Full Checkpointing, Incremental Checkpointing)
- Restart
- Task Resubmission
- Replication (Semi-active Replication, Passive Replication, Semi-Passive Replication)
- Rescue Workflow
Proactive Fault Tolerance Techniques
Proactive Fault Tolerance: Proactive fault tolerance policies avoid faults by predicting
them before they occur and replacing the suspected component.
a) Rejuvenation: The system is restarted, reinitializing the operating environment and
all processes.
b) Self-Healing: Failures are handled automatically when multiple instances of an
application run on different virtual machines.
c) Preemptive Migration: A running job is preempted, its state is saved, and it is then
migrated to another system.
d) Load Balancing: Whenever the load (% utilization) of CPU or memory exceeds a
certain limit, for example 75% CPU utilization, load is transferred from that CPU to
another CPU that has not exceeded its limit.
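The load balancing rule in item d) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm of any cited paper: the function name `rebalance`, the VM names and the greedy "least-loaded target first" choice are all hypothetical. Only the 75% limit comes from the slides.

```python
def rebalance(loads, threshold=75.0):
    """Move load above `threshold` to machines that stay at or below it.

    `loads` maps a machine name to its utilization in percent.
    Returns a new dict and leaves the input untouched.
    """
    loads = dict(loads)
    for src in sorted(loads):
        excess = loads[src] - threshold
        if excess <= 0:
            continue  # this machine is within its limit
        # Assumption: try the least-loaded machines first.
        for dst in sorted(loads, key=loads.get):
            if dst == src:
                continue
            room = threshold - loads[dst]
            if room <= 0:
                continue  # target is already at its limit
            moved = min(excess, room)
            loads[src] -= moved
            loads[dst] += moved
            excess -= moved
            if excess <= 0:
                break
    return loads


# Hypothetical utilization figures for three virtual machines.
print(rebalance({"vm1": 90, "vm2": 40, "vm3": 70}))
```

A target never receives more than the room it has below the limit, so a transfer cannot push it past 75%. If no machine has room, the overloaded machine simply keeps its excess.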
Reactive Fault Tolerance Techniques
Reactive fault tolerance policies are applied after a failure has occurred in the system.
They help recover the system from an unstable state to a stable state so that it can
resume working and provide the desired results.
Checkpointing: Checkpointing systems save the system state at regular or irregular time
intervals. Upon failure, the system recovers by rolling back to the most recent checkpoint.
There are two kinds: a) Full Checkpointing b) Incremental Checkpointing
Replication: Copies of the same data are created and stored at different locations.
There are three kinds: a) Semi-active Replication b) Semi-passive Replication
c) Passive Replication
Job Migration: When a task fails, it can be migrated to another machine.
Task Resubmission: When a failed task is detected in the system, it is resubmitted to
the same resource or to a new one.
User Defined Exception Handling: The user specifies a workflow that defines how a
particular task failure is to be treated.
Rescue Workflow: Even if a task fails, the rescue workflow allows the workflow to continue
until it becomes impossible to proceed without handling the failed task.
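Checkpointing and task resubmission from the list above can be combined in one small sketch. Everything here is illustrative: the class `CheckpointedJob`, the helper `run_with_resubmission`, the simulated fault and the checkpoint interval are assumptions made for the example. The checkpoint is a full checkpoint, meaning the whole state is copied each time.

```python
import copy


class CheckpointedJob:
    """Sums a list of numbers, saving its full state at regular intervals."""

    def __init__(self, items):
        self.items = list(items)
        self.state = {"next_index": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def save_checkpoint(self):
        # Full checkpoint: copy the entire state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Recover by returning to the most recent checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def run(self, fail_at=None, checkpoint_every=2):
        while self.state["next_index"] < len(self.items):
            i = self.state["next_index"]
            if i == fail_at:
                raise RuntimeError(f"simulated fault at item {i}")
            self.state["total"] += self.items[i]
            self.state["next_index"] = i + 1
            if self.state["next_index"] % checkpoint_every == 0:
                self.save_checkpoint()
        return self.state["total"]


def run_with_resubmission(job, fail_at, max_attempts=3):
    """Resubmit the failed task, resuming from the last checkpoint."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Only the first attempt hits the simulated fault.
            return job.run(fail_at=fail_at if attempt == 1 else None)
        except RuntimeError:
            job.rollback()
    raise RuntimeError("job failed after all attempts")


job = CheckpointedJob([1, 2, 3, 4, 5])
print(run_with_resubmission(job, fail_at=3))
```

In this run the fault occurs after item 2 has been processed but before the next checkpoint, so the rollback discards that work and the resubmitted task repeats it. An incremental checkpoint would store only the changes since the previous checkpoint instead of the whole state.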
Cloud Fault Tolerance Developed Techniques
A Proactive Fault Tolerance:
Answers
Thanks
