SlideShare a Scribd company logo
1 of 12
Software Rejuvenation based Fault Tolerance
Scheme for Cloud Applications
Dr. Mohammad Abdur Rouf (Professor)
Head of Department of Computer Science & Engineering
Dhaka University of Engineering and Technology, Gazipur
© B.Sc. Engg. (KU) M.Sc. Engg. (BUET), Ph.D. (KAIST)
Dhaka University of Engineering and Technology, Gazipur
Presented by
Md. Mostafijur Rahman
M.Sc. Engg. Student ID #: 132431(p)
Department of Computer Science and Engineering
Outline of the talk
• Software Aging
• Holistic Software Rejuvenation
• Software Rejuvenation in Clouds
• System Architecture
- Tight coupling components (TCC)
- Loose-Coupling Components (LCC)
- Virtual Machine (VM)
- Interim Node
- Software Rejuvenation Manager
• Adaptive Failure Detection & Evaluation
- Computing Expected Arrival Time (EAT).
- Software Aging Degree Evaluation
• Migration Based Software Rejuvenation
- Checkpoint Generation and Remote Storing.
- Live Migration of the Original VM.
- Checkpoint with Trace Replay
- Rejuvenation from Checkpoint.
Dhaka University of Engineering and Technology, Gazipur
Introduction
• Cloud computing is a large-scale and complex distributed computing
paradigm where the configurable resources (servers, storage,
network, data and software applications) are provided as multi-level
services via virtualization technologies. Software aging is the
biggest problem in Cloud Application .
• I propose a novel and holistic software rejuvenation based fault
tolerance scheme to counteract aging obstacles for cloud
applications.
• First, an adaptive failure detection and aging degree evaluation
approach is proposed to predict which cloud service components
deserve foremost to be rejuvenated. Second, a component
rejuvenation approach based on checkpoints with trace replay is
proposed to guarantee the continuous running of cloud application
systems.
Dhaka University of Engineering and Technology, Gazipur
Operation Work
There is a set of virtual machines, running on cloud infrastructure, and
Separate TCC and LCC are running on Virtual Machine.
Dhaka University of Engineering and Technology, Gazipur
Operation
Dhaka University of Engineering and Technology, Gazipur
Applying software rejuvenation technologies to the cloud computing scenario
brings the software from a failure-prone state to an aging-free state, where the
most intuitive and widely-adopted rejuvenation method is to terminate software
transiently, clean up their internal runtime states (e.g. reinitializing internal data
structures or garbage collection) and finally restart it.
Tight coupling components (TCC) execute in the same virtual machine (VM),
while loose-coupling components (LCC) execute in different virtual machines.
Service components often communicate with each other via RPC over local high
speed network, and cloud services always communicate with each other via web
services protocols over Internet.
Aging Failure Detector (AFD) inspects CPU and memory usage.
Aging Degree Evaluator (ADE) is used to record contexts of this failure, and
then assign a fatal degree number to this failure event through an evaluation
process, and finally insert this event into a failure event queue. The event queue
is ordered according to the fatal degree number of every failure event
automatically.
Failure Detection And Evaluation
Dhaka University of Engineering and Technology, Gazipur
Computing Expected Arrival Time (EAT).
If next metrics data packet arrives after EAT or totally does not arrive, it indicates some kinds
of failures tend to appear. In every sampling interval ti, the agent collects the CPU usage Ci
And the free memory available data FMi, encapsulates Ci and Fmi data into a packet and
sends it to the AFD. Recent n packet p1, p2, ..., pn, and T1, T2, ..., Tn are their actual arrival
time according to AFD's local clock.
Expected Arrival Time (EAT):
EATk+1 (k> n) is the theoretically expected arrival time of next packet, and it is estimated as
follows based on studies.
EAT utilizes actual arrival time of recent packets. The average of packet transmission time
of latest n packets and shifts forward the average for next pk+1.
Then, EAT for next packet could be added an offset to EATk+1 .
Software Aging Degree Evaluation.
Dhaka University of Engineering and Technology, Gazipur
The aging degree of TCC or LCC is roughly divided into four levels:
Level 1 indicates the crucial case which
is very likely to result in fatal
consequence, such as service crash, and
needs rejuvenation immediately.
Level 2 indicates the serious case which
tends to result in serious consequence,
e.g., long transmission delay or high
performance loss, and needs rejuvenation as soon as possible.
Level 3 indicates the suspectable case which may cause unsatisfied consequence or
false detection. It needs on-going monitoring and now rejuvenation is just a
suggestion.
Level 4 indicates the normal case without any rejuvenation. It is assumed that for a
cloud application which is not computing-intensive, the metric .
The significance follows that EAT > Mi > Ci.
Migration Based Software Rejuvenation
 To avoid re-executing of a TCC or a LCC
from its initial states, which costs more due
to high downtime and complex
synchronization behaviors, we integrate
checkpoint technique with trace record
and replay and live VM migration technique
Check point Generation and Remote Storing.
When a TCC or a LCC is detected with aging
failure-prone in level 1 or 2, the ADE will
trigger the SRM to generate a rejuvenation
starting message and send it to the interim node.
The interim node will establish a connection with the agent, and ask
it to generate a checkpoint of the runtime status of that aging
cloud service component.
Dhaka University of Engineering and Technology, Gazipur
Live Migration of the Original VM
Dhaka University of Engineering and Technology, Gazipur
After generating the checkpoint, the VM hosting those failure-prone TCC or
LCC is migrated in pre-copy scheme to the interim node, while the continuous
communication of upper cloud service components are kept on as the
migration occurs in the same LAN.
It should be noted that during this live migration process, the trace-log
file for recording subsequent behaviors of the aging TCC or LCC is still written
by the rejuvenation agent.
Rejuvenation from Checkpoint:
When migration is done, the original VM environment could be rebooted automatically
controlled by the rejuvenation agent. That means a totally new VM environment (VMs)
for the cloud service running are well available. Next, the checkpoint image is
transmitted from the interim node to VMs to restore the TCC or LCC running from
that checkpoint time. Then following iterations copy the trace log file from the VM in
interim node (VMin) to VMs, to replay the subsequent behaviors accurately. This
iterative process is executed until trace-log files generated during the previous transfer
round is reduced to a pre-defined threshold value. Till now, VMin is suspended for
running and the remaining trace log file is transferred to the VMs.
DISCUSSION
Dhaka University of Engineering and Technology, Gazipur
 Our aging failure detection tends to be more comprehensive and more accurate
because we consider both runtime metrics of VMs and communication liveness of
cloud services as credible guidance to identify which cloud service components
need rejuvenation imperatively.
 Cooperation of the VM migration and trace-log based checkpoint replay
techniques guarantee continuous running of cloud applications
 Our method can be directly used to handle multiple rejuvenations of independent
VMs.
CONCLUSIONS AND FUTURE WORK
In order to counteract software aging obstacles for service
components in cloud applications, a novel and holistic software
rejuvenation based fault tolerance scheme is well presented to
improve their running availability.
By a preliminary evaluation, we discuss that the adaptive failure detection and aging
degree evaluation approach can accurately predict which cloud service components
deserve foremost to be rejuvenated, and the checkpoint with trace-log replay based
rejuvenation approach is a very promising fault tolerance mechanism choice to
guarantee continuous running of cloud applications.
As to future works, we intend to improve the accuracy of the failure detection metrics
and adaptive aging degree evaluation . More importantly, we will conduct optimal
checkpoint and trace-log replay file .
Dhaka University of Engineering and Technology, Gazipur
Section Questions and Answers
Thanks
Dhaka University of Engineering and Technology, Gazipur

More Related Content

What's hot

A Sense-based Registration Process for TDMA in IEEE 802.11 Network q
A Sense-based Registration Process for TDMA in IEEE 802.11 Network qA Sense-based Registration Process for TDMA in IEEE 802.11 Network q
A Sense-based Registration Process for TDMA in IEEE 802.11 Network qIJECEIAES
 
Reliability analysis of wireless automotive applications with transceiver red...
Reliability analysis of wireless automotive applications with transceiver red...Reliability analysis of wireless automotive applications with transceiver red...
Reliability analysis of wireless automotive applications with transceiver red...rchulyada
 
A Review of Different Types of Schedulers Used In Energy Management
A Review of Different Types of Schedulers Used In Energy ManagementA Review of Different Types of Schedulers Used In Energy Management
A Review of Different Types of Schedulers Used In Energy ManagementIRJET Journal
 
Rpc Case Studies (Distributed computing)
Rpc Case Studies (Distributed computing)Rpc Case Studies (Distributed computing)
Rpc Case Studies (Distributed computing)Sri Prasanna
 
Remote Procedure Call in Distributed System
Remote Procedure Call in Distributed SystemRemote Procedure Call in Distributed System
Remote Procedure Call in Distributed SystemPoojaBele1
 
Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Maria Stylianou
 
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...IRJET Journal
 
tcpcongest
tcpcongesttcpcongest
tcpcongestBill Bao
 
Towards Achieving Execution Time Predictability in Web Services Middleware
Towards Achieving Execution Time Predictability in Web Services MiddlewareTowards Achieving Execution Time Predictability in Web Services Middleware
Towards Achieving Execution Time Predictability in Web Services MiddlewareVidura Gamini Abhaya
 
Data Replication in Distributed System
Data Replication in  Distributed SystemData Replication in  Distributed System
Data Replication in Distributed SystemEhsan Hessami
 
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET Journal
 
Chapter 11d coordination agreement
Chapter 11d coordination agreementChapter 11d coordination agreement
Chapter 11d coordination agreementAbDul ThaYyal
 
9 fault-tolerance
9 fault-tolerance9 fault-tolerance
9 fault-tolerance4020132038
 

What's hot (20)

Hu3513551364
Hu3513551364Hu3513551364
Hu3513551364
 
Rtos 2
Rtos 2Rtos 2
Rtos 2
 
Ds ppt imp.
Ds ppt imp.Ds ppt imp.
Ds ppt imp.
 
Seminar report
Seminar reportSeminar report
Seminar report
 
Distributed System
Distributed SystemDistributed System
Distributed System
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
 
A Sense-based Registration Process for TDMA in IEEE 802.11 Network q
A Sense-based Registration Process for TDMA in IEEE 802.11 Network qA Sense-based Registration Process for TDMA in IEEE 802.11 Network q
A Sense-based Registration Process for TDMA in IEEE 802.11 Network q
 
Reliability analysis of wireless automotive applications with transceiver red...
Reliability analysis of wireless automotive applications with transceiver red...Reliability analysis of wireless automotive applications with transceiver red...
Reliability analysis of wireless automotive applications with transceiver red...
 
A Review of Different Types of Schedulers Used In Energy Management
A Review of Different Types of Schedulers Used In Energy ManagementA Review of Different Types of Schedulers Used In Energy Management
A Review of Different Types of Schedulers Used In Energy Management
 
Rpc Case Studies (Distributed computing)
Rpc Case Studies (Distributed computing)Rpc Case Studies (Distributed computing)
Rpc Case Studies (Distributed computing)
 
Remote Procedure Call in Distributed System
Remote Procedure Call in Distributed SystemRemote Procedure Call in Distributed System
Remote Procedure Call in Distributed System
 
Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...
 
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
 
tcpcongest
tcpcongesttcpcongest
tcpcongest
 
Towards Achieving Execution Time Predictability in Web Services Middleware
Towards Achieving Execution Time Predictability in Web Services MiddlewareTowards Achieving Execution Time Predictability in Web Services Middleware
Towards Achieving Execution Time Predictability in Web Services Middleware
 
Data Replication in Distributed System
Data Replication in  Distributed SystemData Replication in  Distributed System
Data Replication in Distributed System
 
thesis
thesisthesis
thesis
 
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
 
Chapter 11d coordination agreement
Chapter 11d coordination agreementChapter 11d coordination agreement
Chapter 11d coordination agreement
 
9 fault-tolerance
9 fault-tolerance9 fault-tolerance
9 fault-tolerance
 

Similar to Software rejuvenation based fault tolerance

IRJET- Dynamic Resource Allocation of Heterogeneous Workload in Cloud
IRJET- Dynamic Resource Allocation of Heterogeneous Workload in CloudIRJET- Dynamic Resource Allocation of Heterogeneous Workload in Cloud
IRJET- Dynamic Resource Allocation of Heterogeneous Workload in CloudIRJET Journal
 
A checkpointing mechanism for virtual clusters using memory- bound time-multi...
A checkpointing mechanism for virtual clusters using memory- bound time-multi...A checkpointing mechanism for virtual clusters using memory- bound time-multi...
A checkpointing mechanism for virtual clusters using memory- bound time-multi...IJECEIAES
 
AUTOMATED VM MIGRATION USING INTELLIGENT LEARNING TECHNIQUE
AUTOMATED VM MIGRATION USING INTELLIGENT LEARNING TECHNIQUEAUTOMATED VM MIGRATION USING INTELLIGENT LEARNING TECHNIQUE
AUTOMATED VM MIGRATION USING INTELLIGENT LEARNING TECHNIQUEIRJET Journal
 
An Optimized-Throttled Algorithm for Distributing Load in Cloud Computing
An Optimized-Throttled Algorithm for Distributing Load in Cloud ComputingAn Optimized-Throttled Algorithm for Distributing Load in Cloud Computing
An Optimized-Throttled Algorithm for Distributing Load in Cloud ComputingIRJET Journal
 
Adaptive fault tolerance in cloud survey
Adaptive fault tolerance in cloud surveyAdaptive fault tolerance in cloud survey
Adaptive fault tolerance in cloud surveywww.pixelsolutionbd.com
 
Resource Allocation using Virtual Machine Migration: A Survey
Resource Allocation using Virtual Machine Migration: A SurveyResource Allocation using Virtual Machine Migration: A Survey
Resource Allocation using Virtual Machine Migration: A Surveyidescitation
 
Adaptive fault tolerance in real time cloud_computing
Adaptive fault tolerance in real time cloud_computingAdaptive fault tolerance in real time cloud_computing
Adaptive fault tolerance in real time cloud_computingwww.pixelsolutionbd.com
 
Project-ReviewFinal.pptx
Project-ReviewFinal.pptxProject-ReviewFinal.pptx
Project-ReviewFinal.pptxNikhilRanjan93
 
CACMAN COMPARISION WITH MOCA USING PKI ON MANET.
CACMAN COMPARISION WITH MOCA USING PKI  ON MANET.CACMAN COMPARISION WITH MOCA USING PKI  ON MANET.
CACMAN COMPARISION WITH MOCA USING PKI ON MANET.neeravkubavat
 
Virtual Machine Migration Techniques in Cloud Environment: A Survey
Virtual Machine Migration Techniques in Cloud Environment: A SurveyVirtual Machine Migration Techniques in Cloud Environment: A Survey
Virtual Machine Migration Techniques in Cloud Environment: A Surveyijsrd.com
 
An investigation into Cluster CPU load balancing in the JVM
An investigation into Cluster CPU load balancing in the JVMAn investigation into Cluster CPU load balancing in the JVM
An investigation into Cluster CPU load balancing in the JVMCalum Beck
 
DesignOfAnExtensibleTelemetry&CommandArcitectureForSmallSatellites
DesignOfAnExtensibleTelemetry&CommandArcitectureForSmallSatellitesDesignOfAnExtensibleTelemetry&CommandArcitectureForSmallSatellites
DesignOfAnExtensibleTelemetry&CommandArcitectureForSmallSatellitesBrenden Hogan
 
Cloud service analysis using round-robin algorithm for qualityof-service awar...
Cloud service analysis using round-robin algorithm for qualityof-service awar...Cloud service analysis using round-robin algorithm for qualityof-service awar...
Cloud service analysis using round-robin algorithm for qualityof-service awar...IJECEIAES
 
Hybrid Scheduling Algorithm for Efficient Load Balancing In Cloud Computing
Hybrid Scheduling Algorithm for Efficient Load Balancing In Cloud ComputingHybrid Scheduling Algorithm for Efficient Load Balancing In Cloud Computing
Hybrid Scheduling Algorithm for Efficient Load Balancing In Cloud ComputingEswar Publications
 
IRJET- Online Failure Prediction for Railway Transportation System based ...
IRJET-  	  Online Failure Prediction for Railway Transportation System based ...IRJET-  	  Online Failure Prediction for Railway Transportation System based ...
IRJET- Online Failure Prediction for Railway Transportation System based ...IRJET Journal
 
A load balancing strategy for reducing data loss risk on cloud using remodif...
A load balancing strategy for reducing data loss risk on cloud  using remodif...A load balancing strategy for reducing data loss risk on cloud  using remodif...
A load balancing strategy for reducing data loss risk on cloud using remodif...IJECEIAES
 
ROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERS
ROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERSROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERS
ROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERSDeepak Shankar
 

Similar to Software rejuvenation based fault tolerance (20)

IRJET- Dynamic Resource Allocation of Heterogeneous Workload in Cloud
IRJET- Dynamic Resource Allocation of Heterogeneous Workload in CloudIRJET- Dynamic Resource Allocation of Heterogeneous Workload in Cloud
IRJET- Dynamic Resource Allocation of Heterogeneous Workload in Cloud
 
A checkpointing mechanism for virtual clusters using memory- bound time-multi...
A checkpointing mechanism for virtual clusters using memory- bound time-multi...A checkpointing mechanism for virtual clusters using memory- bound time-multi...
A checkpointing mechanism for virtual clusters using memory- bound time-multi...
 
AUTOMATED VM MIGRATION USING INTELLIGENT LEARNING TECHNIQUE
AUTOMATED VM MIGRATION USING INTELLIGENT LEARNING TECHNIQUEAUTOMATED VM MIGRATION USING INTELLIGENT LEARNING TECHNIQUE
AUTOMATED VM MIGRATION USING INTELLIGENT LEARNING TECHNIQUE
 
An Optimized-Throttled Algorithm for Distributing Load in Cloud Computing
An Optimized-Throttled Algorithm for Distributing Load in Cloud ComputingAn Optimized-Throttled Algorithm for Distributing Load in Cloud Computing
An Optimized-Throttled Algorithm for Distributing Load in Cloud Computing
 
Adaptive fault tolerance in cloud survey
Adaptive fault tolerance in cloud surveyAdaptive fault tolerance in cloud survey
Adaptive fault tolerance in cloud survey
 
Resource Allocation using Virtual Machine Migration: A Survey
Resource Allocation using Virtual Machine Migration: A SurveyResource Allocation using Virtual Machine Migration: A Survey
Resource Allocation using Virtual Machine Migration: A Survey
 
Adaptive fault tolerance in real time cloud_computing
Adaptive fault tolerance in real time cloud_computingAdaptive fault tolerance in real time cloud_computing
Adaptive fault tolerance in real time cloud_computing
 
Project-ReviewFinal.pptx
Project-ReviewFinal.pptxProject-ReviewFinal.pptx
Project-ReviewFinal.pptx
 
Coca1
Coca1Coca1
Coca1
 
CACMAN COMPARISION WITH MOCA USING PKI ON MANET.
CACMAN COMPARISION WITH MOCA USING PKI  ON MANET.CACMAN COMPARISION WITH MOCA USING PKI  ON MANET.
CACMAN COMPARISION WITH MOCA USING PKI ON MANET.
 
Virtual Machine Migration Techniques in Cloud Environment: A Survey
Virtual Machine Migration Techniques in Cloud Environment: A SurveyVirtual Machine Migration Techniques in Cloud Environment: A Survey
Virtual Machine Migration Techniques in Cloud Environment: A Survey
 
An investigation into Cluster CPU load balancing in the JVM
An investigation into Cluster CPU load balancing in the JVMAn investigation into Cluster CPU load balancing in the JVM
An investigation into Cluster CPU load balancing in the JVM
 
Clone cloud
Clone cloudClone cloud
Clone cloud
 
DesignOfAnExtensibleTelemetry&CommandArcitectureForSmallSatellites
DesignOfAnExtensibleTelemetry&CommandArcitectureForSmallSatellitesDesignOfAnExtensibleTelemetry&CommandArcitectureForSmallSatellites
DesignOfAnExtensibleTelemetry&CommandArcitectureForSmallSatellites
 
Cloud service analysis using round-robin algorithm for qualityof-service awar...
Cloud service analysis using round-robin algorithm for qualityof-service awar...Cloud service analysis using round-robin algorithm for qualityof-service awar...
Cloud service analysis using round-robin algorithm for qualityof-service awar...
 
Hybrid Scheduling Algorithm for Efficient Load Balancing In Cloud Computing
Hybrid Scheduling Algorithm for Efficient Load Balancing In Cloud ComputingHybrid Scheduling Algorithm for Efficient Load Balancing In Cloud Computing
Hybrid Scheduling Algorithm for Efficient Load Balancing In Cloud Computing
 
IRJET- Online Failure Prediction for Railway Transportation System based ...
IRJET-  	  Online Failure Prediction for Railway Transportation System based ...IRJET-  	  Online Failure Prediction for Railway Transportation System based ...
IRJET- Online Failure Prediction for Railway Transportation System based ...
 
A load balancing strategy for reducing data loss risk on cloud using remodif...
A load balancing strategy for reducing data loss risk on cloud  using remodif...A load balancing strategy for reducing data loss risk on cloud  using remodif...
A load balancing strategy for reducing data loss risk on cloud using remodif...
 
ROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERS
ROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERSROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERS
ROLE OF DIGITAL SIMULATION IN CONFIGURING NETWORK PARAMETERS
 
RamachandraParlapalli_RESUME
RamachandraParlapalli_RESUMERamachandraParlapalli_RESUME
RamachandraParlapalli_RESUME
 

More from www.pixelsolutionbd.com

Adaptive fault tolerance_in_real_time_cloud_computing
Adaptive fault tolerance_in_real_time_cloud_computingAdaptive fault tolerance_in_real_time_cloud_computing
Adaptive fault tolerance_in_real_time_cloud_computingwww.pixelsolutionbd.com
 
Privacy preserving secure data exchange in mobile p2 p
Privacy preserving secure data exchange in mobile p2 pPrivacy preserving secure data exchange in mobile p2 p
Privacy preserving secure data exchange in mobile p2 pwww.pixelsolutionbd.com
 
Protecting from transient failures in cloud deployments
Protecting from transient failures in cloud deploymentsProtecting from transient failures in cloud deployments
Protecting from transient failures in cloud deploymentswww.pixelsolutionbd.com
 
Protecting from transient failures in cloud microsoft azure deployments
Protecting from transient failures in cloud microsoft azure deploymentsProtecting from transient failures in cloud microsoft azure deployments
Protecting from transient failures in cloud microsoft azure deploymentswww.pixelsolutionbd.com
 
Real time service oriented cloud computing
Real time service oriented cloud computingReal time service oriented cloud computing
Real time service oriented cloud computingwww.pixelsolutionbd.com
 
Performance, fault tolerance and scalability analysis of virtual infrastructu...
Performance, fault tolerance and scalability analysis of virtual infrastructu...Performance, fault tolerance and scalability analysis of virtual infrastructu...
Performance, fault tolerance and scalability analysis of virtual infrastructu...www.pixelsolutionbd.com
 
Comprehensive analysis of performance, fault tolerance and scalability in gri...
Comprehensive analysis of performance, fault tolerance and scalability in gri...Comprehensive analysis of performance, fault tolerance and scalability in gri...
Comprehensive analysis of performance, fault tolerance and scalability in gri...www.pixelsolutionbd.com
 
A task based fault-tolerance mechanism to hierarchical master worker with div...
A task based fault-tolerance mechanism to hierarchical master worker with div...A task based fault-tolerance mechanism to hierarchical master worker with div...
A task based fault-tolerance mechanism to hierarchical master worker with div...www.pixelsolutionbd.com
 
Privacy preserving secure data exchange in mobile P2P
Privacy preserving secure data exchange in mobile P2PPrivacy preserving secure data exchange in mobile P2P
Privacy preserving secure data exchange in mobile P2Pwww.pixelsolutionbd.com
 
Software Testing in Cloud Platform A Survey_final
Software Testing in Cloud Platform A Survey_finalSoftware Testing in Cloud Platform A Survey_final
Software Testing in Cloud Platform A Survey_finalwww.pixelsolutionbd.com
 

More from www.pixelsolutionbd.com (19)

Adaptive fault tolerance_in_real_time_cloud_computing
Adaptive fault tolerance_in_real_time_cloud_computingAdaptive fault tolerance_in_real_time_cloud_computing
Adaptive fault tolerance_in_real_time_cloud_computing
 
Privacy preserving secure data exchange in mobile p2 p
Privacy preserving secure data exchange in mobile p2 pPrivacy preserving secure data exchange in mobile p2 p
Privacy preserving secure data exchange in mobile p2 p
 
Protecting from transient failures in cloud deployments
Protecting from transient failures in cloud deploymentsProtecting from transient failures in cloud deployments
Protecting from transient failures in cloud deployments
 
Protecting from transient failures in cloud microsoft azure deployments
Protecting from transient failures in cloud microsoft azure deploymentsProtecting from transient failures in cloud microsoft azure deployments
Protecting from transient failures in cloud microsoft azure deployments
 
Speaking
SpeakingSpeaking
Speaking
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Cyber Physical System
Cyber Physical SystemCyber Physical System
Cyber Physical System
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Real time service oriented cloud computing
Real time service oriented cloud computingReal time service oriented cloud computing
Real time service oriented cloud computing
 
Performance, fault tolerance and scalability analysis of virtual infrastructu...
Performance, fault tolerance and scalability analysis of virtual infrastructu...Performance, fault tolerance and scalability analysis of virtual infrastructu...
Performance, fault tolerance and scalability analysis of virtual infrastructu...
 
Comprehensive analysis of performance, fault tolerance and scalability in gri...
Comprehensive analysis of performance, fault tolerance and scalability in gri...Comprehensive analysis of performance, fault tolerance and scalability in gri...
Comprehensive analysis of performance, fault tolerance and scalability in gri...
 
A task based fault-tolerance mechanism to hierarchical master worker with div...
A task based fault-tolerance mechanism to hierarchical master worker with div...A task based fault-tolerance mechanism to hierarchical master worker with div...
A task based fault-tolerance mechanism to hierarchical master worker with div...
 
Privacy preserving secure data exchange in mobile P2P
Privacy preserving secure data exchange in mobile P2PPrivacy preserving secure data exchange in mobile P2P
Privacy preserving secure data exchange in mobile P2P
 
Software Testing in Cloud Platform A Survey_final
Software Testing in Cloud Platform A Survey_finalSoftware Testing in Cloud Platform A Survey_final
Software Testing in Cloud Platform A Survey_final
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Software rejuvenation based fault tolerance

  • 1. Software Rejuvenation based Fault Tolerance Scheme for Cloud Applications Dr. Mohammad Abdur Rouf (Professor) Head of Department of Computer Science & Engineering Dhaka University of Engineering and Technology, Gazipur © B.Sc. Engg. (KU) M.Sc. Engg. (BUET), Ph.D. (KAIST) Dhaka University of Engineering and Technology, Gazipur Presented by Md. Mostafijur Rahman M.Sc. Engg. Student ID #: 132431(p) Department of Computer Science and Engineering
  • 2. Outline of the talk • Software Aging • Holistic Software Rejuvenation • Software Rejuvenation in Clouds • System Architecture - Tight coupling components (TCC) - Loose-Coupling Components (LCC) - Virtual Machine (VM) - Interim Node - Software Rejuvenation Manager • Adaptive Failure Detection & Evaluation - Computing Expected Arrival Time (EAT). - Software Aging Degree Evaluation • Migration Based Software Rejuvenation - Checkpoint Generation and Remote Storing. - Live Migration of the Original VM. - Checkpoint with Trace Replay - Rejuvenation from Checkpoint. Dhaka University of Engineering and Technology, Gazipur
  • 3. Introduction • Cloud computing is a large-scale and complex distributed computing paradigm where the configurable resources (servers, storage, network, data and software applications) are provided as multi-level services via virtualization technologies. Software aging is the biggest problem in Cloud Application . • I propose a novel and holistic software rejuvenation based fault tolerance scheme to counteract aging obstacles for cloud applications. • First, an adaptive failure detection and aging degree evaluation approach is proposed to predict which cloud service components deserve foremost to be rejuvenated. Second, a component rejuvenation approach based on checkpoints with trace replay is proposed to guarantee the continuous running of cloud application systems. Dhaka University of Engineering and Technology, Gazipur
  • 4. Operation Work There is a set of virtual machines, running on cloud infrastructure, and Separate TCC and LCC are running on Virtual Machine. Dhaka University of Engineering and Technology, Gazipur
  • 5. Operation Dhaka University of Engineering and Technology, Gazipur Applying software rejuvenation technologies to the cloud computing scenario brings the software from a failure-prone state to an aging-free state, where the most intuitive and widely-adopted rejuvenation method is to terminate software transiently, clean up their internal runtime states (e.g. reinitializing internal data structures or garbage collection) and finally restart it. Tight coupling components (TCC) execute in the same virtual machine (VM), while loose-coupling components (LCC) execute in different virtual machines. Service components often communicate with each other via RPC over local high speed network, and cloud services always communicate with each other via web services protocols over Internet. Aging Failure Detector (AFD) inspects CPU and memory usage. Aging Degree Evaluator (ADE) is used to record contexts of this failure, and then assign a fatal degree number to this failure event through an evaluation process, and finally insert this event into a failure event queue. The event queue is ordered according to the fatal degree number of every failure event automatically.
  • 6. Failure Detection And Evaluation Dhaka University of Engineering and Technology, Gazipur Computing Expected Arrival Time (EAT). If next metrics data packet arrives after EAT or totally does not arrive, it indicates some kinds of failures tend to appear. In every sampling interval ti, the agent collects the CPU usage Ci And the free memory available data FMi, encapsulates Ci and Fmi data into a packet and sends it to the AFD. Recent n packet p1, p2, ..., pn, and T1, T2, ..., Tn are their actual arrival time according to AFD's local clock. Expected Arrival Time (EAT): EATk+1 (k> n) is the theoretically expected arrival time of next packet, and it is estimated as follows based on studies. EAT utilizes actual arrival time of recent packets. The average of packet transmission time of latest n packets and shifts forward the average for next pk+1. Then, EAT for next packet could be added an offset to EATk+1 .
  • 7. Software Aging Degree Evaluation. Dhaka University of Engineering and Technology, Gazipur The aging degree of TCC or LCC is roughly divided into four levels: Level 1 indicates the crucial case which is very likely to result in fatal consequence, such as service crash, and needs rejuvenation immediately. Level 2 indicates the serious case which tends to result in serious consequence, e.g., long transmission delay or high performance loss, and needs rejuvenation as soon as possible. Level 3 indicates the suspectable case which may cause unsatisfied consequence or false detection. It needs on-going monitoring and now rejuvenation is just a suggestion. Level 4 indicates the normal case without any rejuvenation. It is assumed that for a cloud application which is not computing-intensive, the metric . The significance follows that EAT > Mi > Ci.
  • 8. Migration Based Software Rejuvenation  To avoid re-executing of a TCC or a LCC from its initial states, which costs more due to high downtime and complex synchronization behaviors, we integrate checkpoint technique with trace record and replay and live VM migration technique Check point Generation and Remote Storing. When a TCC or a LCC is detected with aging failure-prone in level 1 or 2, the ADE will trigger the SRM to generate a rejuvenation starting message and send it to the interim node. The interim node will establish a connection with the agent, and ask it to generate a checkpoint of the runtime status of that aging cloud service component. Dhaka University of Engineering and Technology, Gazipur
  • 9. Live Migration of the Original VM Dhaka University of Engineering and Technology, Gazipur After generating the checkpoint, the VM hosting those failure-prone TCC or LCC is migrated in pre-copy scheme to the interim node, while the continuous communication of upper cloud service components are kept on as the migration occurs in the same LAN. It should be noted that during this live migration process, the trace-log file for recording subsequent behaviors of the aging TCC or LCC is still written by the rejuvenation agent. Rejuvenation from Checkpoint: When migration is done, the original VM environment could be rebooted automatically controlled by the rejuvenation agent. That means a totally new VM environment (VMs) for the cloud service running are well available. Next, the checkpoint image is transmitted from the interim node to VMs to restore the TCC or LCC running from that checkpoint time. Then following iterations copy the trace log file from the VM in interim node (VMin) to VMs, to replay the subsequent behaviors accurately. This iterative process is executed until trace-log files generated during the previous transfer round is reduced to a pre-defined threshold value. Till now, VMin is suspended for running and the remaining trace log file is transferred to the VMs.
  • 10. DISCUSSION Dhaka University of Engineering and Technology, Gazipur  Our aging failure detection tends to be more comprehensive and more accurate because we consider both runtime metrics of VMs and communication liveness of cloud services as credible guidance to identify which cloud service components need rejuvenation imperatively.  Cooperation of the VM migration and trace-log based checkpoint replay techniques guarantee continuous running of cloud applications  Our method can be directly used to handle multiple rejuvenations of independent VMs.
  • 11. CONCLUSIONS AND FUTURE WORK In order to counteract software aging obstacles for service components in cloud applications, a novel and holistic software rejuvenation based fault tolerance scheme is well presented to improve their running availability. By a preliminary evaluation, we discuss that the adaptive failure detection and aging degree evaluation approach can accurately predict which cloud service components deserve foremost to be rejuvenated, and the checkpoint with trace-log replay based rejuvenation approach is a very promising fault tolerance mechanism choice to guarantee continuous running of cloud applications. As to future works, we intend to improve the accuracy of the failure detection metrics and adaptive aging degree evaluation . More importantly, we will conduct optimal checkpoint and trace-log replay file . Dhaka University of Engineering and Technology, Gazipur
  • 12. Section Questions and Answers Thanks Dhaka University of Engineering and Technology, Gazipur

Editor's Notes

  1. 10