1. CENG 5334 - Fault Tolerant
Computing
Fall 2012
Fatih Karabacak
2. What is Cloud Computing?
Reliability of Cloud Service.
A Fault Tolerance Framework in Cloud Computing.
4. “Cloud computing is Web-based processing,
whereby shared resources, software, and
information are provided to computers and
other devices (such as smart phones) on
demand over the Internet.”
Common implies multi-tenancy, not single or
isolated tenancy
Location-independent
Online
Utility implies pay-for-use pricing
Demand implies ~infinite, ~immediate,
~invisible scalability
8. Request Stage Failures: Overflow and Timeout.
Execution Stage Failures: Data resource missing,
Computing resource missing, Software failure,
Database failure, Hardware failure, and Network
failure.
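The two request-stage failures above can be sketched as a toy bounded request queue. This is an illustrative model, not the paper's implementation; the class name, capacity, and timeout parameters are assumptions.

```python
import time
from collections import deque

class RequestQueue:
    """Toy cloud request queue illustrating the two request-stage
    failures: overflow (queue is full) and timeout (a request waited
    too long before being scheduled)."""

    def __init__(self, capacity, timeout_s):
        self.capacity = capacity
        self.timeout_s = timeout_s
        self._q = deque()

    def submit(self, request, now=None):
        """Enqueue a request, or report an overflow failure."""
        now = time.monotonic() if now is None else now
        if len(self._q) >= self.capacity:
            return "overflow"              # request-stage failure 1
        self._q.append((request, now))
        return "queued"

    def next_request(self, now=None):
        """Pop the oldest request, reporting stale ones as timeouts."""
        now = time.monotonic() if now is None else now
        if not self._q:
            return ("empty", None)
        request, enqueued_at = self._q.popleft()
        if now - enqueued_at > self.timeout_s:
            return ("timeout", request)    # request-stage failure 2
        return ("ok", request)
```

Passing `now` explicitly makes the failure modes easy to reproduce in tests; in a live system the monotonic clock would be used.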
10. The overhead created by proactive and reactive FT
should be minimized when checkpointing.
A good fault tolerance mechanism should be transparent: it
should not require source code or application
modifications.
It should use fault prediction mechanisms to determine
when to checkpoint.
It should use failure detection mechanisms to determine
when to recover the application from a failure.
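The first two requirements above can be combined into a single checkpoint decision: checkpoint only when the predictor considers a failure likely, and only if the work at risk exceeds the cost of writing a checkpoint. The heuristic and its parameters are illustrative assumptions, not taken from the paper.

```python
def should_checkpoint(predicted_failure_prob, time_since_checkpoint_s,
                      checkpoint_cost_s, threshold=0.5):
    """Decide whether to checkpoint now.

    Checkpoint when the fault predictor considers a failure likely
    (probability above `threshold`), but only if the work done since
    the last checkpoint exceeds the cost of writing a new one --
    otherwise the proactive-FT overhead dominates the benefit.
    Hypothetical heuristic for illustration only.
    """
    if predicted_failure_prob < threshold:
        return False                       # no predicted failure: skip
    return time_since_checkpoint_s > checkpoint_cost_s
```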
The cloud symbol was used to denote the boundary between the responsibility areas of the customer and the provider.
Simply, I can summarize some characteristics of cloud computing. Their first letters build up the word CLOUD, which is very easy to remember: Common, Location-independent, Online, Utility, and Demand. Cloud computing is called the fifth utility (after electric power, gas, water, and telephony), and it could change the way individuals and companies operate.
The CMS mainly fulfills four different functions as shown in Fig. 1: 1) To manage a request queue that receives job requests from different users for cloud services; 2) To manage computing resources (such as PCs, Clusters, Supercomputers, etc.) all over the Internet; 3) To manage data resources (such as Databases, Publicized Information, URL contents, etc.) all over the Internet; and 4) To schedule a request and divide it into different subtasks and assign the subtasks to different computing resources that may access different data resources over the Internet.
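Function 4 of the CMS, dividing a request into subtasks and assigning them to computing resources, can be sketched as a simple round-robin scheduler. The function name and round-robin policy are assumptions for illustration; the paper does not specify a scheduling policy.

```python
def schedule(subtasks, compute_resources):
    """Assign each subtask of a request to a compute resource in
    round-robin order (a stand-in for CMS function 4).

    Returns a mapping from resource name to its list of subtasks.
    """
    assignment = {resource: [] for resource in compute_resources}
    for i, subtask in enumerate(subtasks):
        # Cycle through the available resources.
        resource = compute_resources[i % len(compute_resources)]
        assignment[resource].append(subtask)
    return assignment
```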
The model for cloud computing reliability has to consider all of these types of failures, which would be very complicated. Moreover, these different types of failures are actually correlated with one another (i.e., not independent) in a cloud service, which is another reason why the cloud reliability model cannot simply utilize any single existing model from an individual topic (such as software reliability, hardware reliability, or network reliability). With such correlations, it is obvious that a new holistic model has to be developed for cloud reliability.
The framework consists of five modules: (1) fault predictor, (2) PLR controller daemon, (3) fault tolerance policy, (4) fault tolerance daemon protocol, and (5) checkpoint/restart module. The fault predictor (FP) runs on each compute node and filters local information to predict failures based on system data.
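The fault predictor's filtering of local node data can be sketched as simple threshold checks. The metric names and limits below are illustrative assumptions; the paper does not specify which system data the FP filters or how.

```python
def predict_fault(node_metrics, temp_limit_c=85.0, ecc_error_limit=10):
    """Toy fault predictor (FP): filter local node metrics and raise
    an alarm when any health indicator crosses its limit.

    `node_metrics` is a dict of locally collected system data, e.g.
    {"temperature_c": 72.0, "ecc_errors": 3}. Metric names and limits
    are hypothetical, not the paper's.
    """
    if node_metrics.get("temperature_c", 0.0) > temp_limit_c:
        return True   # thermal indicator crossed: predict a failure
    if node_metrics.get("ecc_errors", 0) > ecc_error_limit:
        return True   # memory errors accumulating: predict a failure
    return False      # node looks healthy
```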
Fault Definitions
Transient – a fault resulting from temporary environmental conditions; a soft fault.
Permanent – a failure or fault that is continuous and stable; a hard fault. In hardware, a permanent fault is an irreversible physical change until repaired.
Intermittent – a fault that is only occasionally present due to unstable hardware or varying hardware or software states, e.g., as a function of load or activity.
Step 1: The fault predictor module predicts a future fault and sends an alarm to the PLR controller daemon.
Step 2: The PLR controller daemon monitors the VMs and the MPI (Message Passing Interface) applications. It has visibility of the HPC applications running on the VMs and of the virtualized environments. The PLR controller daemon ensures that redundant nodes are available for live migration and checkpointing; if no nodes are available, it provisions redundant nodes. The PLR controller daemon also initiates and carries out live migrations of VMs to redundant nodes, and it initiates checkpointing of MPI applications after migration.
Step 3: The fault tolerance daemon protocol notifies the PLR controller daemon when a failure occurs through "I am alive" messages and continues monitoring the communications of the MPI applications.
Step 4: Checkpointing is initiated with the BLCR checkpointing library, which is available in the Open MPI implementation [9], after live migration of the VMs to the redundant nodes. The checkpoint files are saved on the network and on a neighboring node for easy recovery as well as to eliminate a single point of failure.
Step 5: After checkpointing, the resources that were used for migration are freed, because cloud resources can be relinquished at will.
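The controller's reaction to a predictor alarm (steps 2 and 4 above) can be sketched as follows. All callables are hypothetical stand-ins for the real daemons and libraries (live migration, BLCR checkpointing), injected here so the control flow is testable.

```python
def handle_fault_alarm(vm, redundant_nodes, provision_node,
                       migrate, checkpoint):
    """Sketch of the PLR controller daemon's reaction to an FP alarm.

    On a predicted fault, secure a redundant node (provisioning one if
    none is free), live-migrate the VM to it, then checkpoint the MPI
    application -- in that order, matching steps 2 and 4. The callables
    are stand-ins for the real migration and checkpoint machinery.
    """
    if redundant_nodes:
        target = redundant_nodes.pop()     # a redundant node is free
    else:
        target = provision_node()          # step 2: provision one
    migrate(vm, target)                    # step 2: live migration
    checkpoint(vm)                         # step 4: checkpoint after migration
    return target
```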