1. CENG 5334 - Fault Tolerant
Computing
Fall 2012
Fatih Karabacak
2. What is Cloud Computing?
Reliability of Cloud Service.
A Fault Tolerance Framework in Cloud Computing.
4. “Cloud computing is Web-based processing,
whereby shared resources, software, and
information are provided to computers and
other devices (such as smart phones) on
demand over the Internet.”
Common implies multi-tenancy, not single or
isolated tenancy
Location-independent
Online
Utility implies pay-for-use pricing
Demand implies ~infinite, ~immediate,
~invisible scalability
8. Request Stage Failures: Overflow and Timeout.
Execution Stage Failures: Data resource missing,
Computing resource missing, Software failure,
Database failure, Hardware failure, and Network
failure.
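The two request-stage failures above can be sketched as a toy bounded request queue. This is an illustrative model, not the paper's implementation; the class name, capacity, and timeout parameters are assumptions.

```python
import time
from collections import deque

class RequestQueue:
    """Toy cloud request queue illustrating the two request-stage
    failures: overflow (queue is full) and timeout (a request waited
    too long before being scheduled)."""

    def __init__(self, capacity, timeout_s):
        self.capacity = capacity
        self.timeout_s = timeout_s
        self._q = deque()

    def submit(self, request, now=None):
        """Enqueue a request, or report an overflow failure."""
        now = time.monotonic() if now is None else now
        if len(self._q) >= self.capacity:
            return "overflow"              # request-stage failure 1
        self._q.append((request, now))
        return "queued"

    def next_request(self, now=None):
        """Pop the oldest request, reporting stale ones as timeouts."""
        now = time.monotonic() if now is None else now
        if not self._q:
            return ("empty", None)
        request, enqueued_at = self._q.popleft()
        if now - enqueued_at > self.timeout_s:
            return ("timeout", request)    # request-stage failure 2
        return ("ok", request)
```

Passing `now` explicitly makes the failure modes easy to reproduce in tests; in a live system the monotonic clock would be used.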
10. The overhead created by proactive and reactive FT
should be minimized when checkpointing.
A good fault tolerance mechanism should be transparent: it
should not require source code or application
modifications.
It should use fault prediction mechanisms to determine
when to checkpoint.
It should use failure detection mechanisms to determine
when to recover the application from a failure.
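The first two requirements above can be combined into a single checkpoint decision: checkpoint only when the predictor considers a failure likely, and only if the work at risk exceeds the cost of writing a checkpoint. The heuristic and its parameters are illustrative assumptions, not taken from the paper.

```python
def should_checkpoint(predicted_failure_prob, time_since_checkpoint_s,
                      checkpoint_cost_s, threshold=0.5):
    """Decide whether to checkpoint now.

    Checkpoint when the fault predictor considers a failure likely
    (probability above `threshold`), but only if the work done since
    the last checkpoint exceeds the cost of writing a new one --
    otherwise the proactive-FT overhead dominates the benefit.
    Hypothetical heuristic for illustration only.
    """
    if predicted_failure_prob < threshold:
        return False                       # no predicted failure: skip
    return time_since_checkpoint_s > checkpoint_cost_s
```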
The cloud symbol was used to denote the boundary between the responsibility areas of the customer and the provider.
Simply, I can summarize some characteristics of cloud computing. Their first letters build up the word CLOUD, which is very easy to remember: Common, Location-independent, Online, Utility, and Demand. Cloud computing is called the fifth utility (after electric power, gas, water, and telephony), and it could change the way individuals and companies operate.
The CMS mainly fulfills four different functions as shown in Fig. 1: 1) To manage a request queue that receives job requests from different users for cloud services; 2) To manage computing resources (such as PCs, Clusters, Supercomputers, etc.) all over the Internet; 3) To manage data resources (such as Databases, Publicized Information, URL contents, etc.) all over the Internet; and 4) To schedule a request and divide it into different subtasks and assign the subtasks to different computing resources that may access different data resources over the Internet.
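Function 4 of the CMS, dividing a request into subtasks and assigning them to computing resources, can be sketched as a simple round-robin scheduler. The function name and round-robin policy are assumptions for illustration; the paper does not specify a scheduling policy.

```python
def schedule(subtasks, compute_resources):
    """Assign each subtask of a request to a compute resource in
    round-robin order (a stand-in for CMS function 4).

    Returns a mapping from resource name to its list of subtasks.
    """
    assignment = {resource: [] for resource in compute_resources}
    for i, subtask in enumerate(subtasks):
        # Cycle through the available resources.
        resource = compute_resources[i % len(compute_resources)]
        assignment[resource].append(subtask)
    return assignment
```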
The model for cloud computing reliability has to consider all of these types of failures, which would be very complicated. Moreover, these different types of failures are actually correlated with one another (i.e., not independent) in a cloud service, which is another reason why the cloud reliability model cannot simply utilize any single existing model from an individual topic (such as software reliability, hardware reliability, or network reliability). With such correlations, it is obvious that a new holistic model has to be developed for cloud reliability.
The framework consists of five modules: (1) fault predictor, (2) PLR controller daemon, (3) fault tolerance policy, (4) fault tolerance daemon protocol, and (5) checkpoint/restart module. The fault predictor (FP) runs on each compute node and filters local information to predict failures based on system data.
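The fault predictor's filtering of local node data can be sketched as simple threshold checks. The metric names and limits below are illustrative assumptions; the paper does not specify which system data the FP filters or how.

```python
def predict_fault(node_metrics, temp_limit_c=85.0, ecc_error_limit=10):
    """Toy fault predictor (FP): filter local node metrics and raise
    an alarm when any health indicator crosses its limit.

    `node_metrics` is a dict of locally collected system data, e.g.
    {"temperature_c": 72.0, "ecc_errors": 3}. Metric names and limits
    are hypothetical, not the paper's.
    """
    if node_metrics.get("temperature_c", 0.0) > temp_limit_c:
        return True   # thermal indicator crossed: predict a failure
    if node_metrics.get("ecc_errors", 0) > ecc_error_limit:
        return True   # memory errors accumulating: predict a failure
    return False      # node looks healthy
```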
Fault Definitions
Transient – a fault resulting from temporary environmental conditions; a soft fault.
Permanent – a failure or fault that is continuous and stable; a hard fault. In hardware, a permanent fault is an irreversible physical change until repaired.
Intermittent – a fault that is only occasionally present due to unstable hardware or varying hardware or software states, e.g., as a function of load or activity.
Step 1: The fault predictor module predicts a future fault and sends an alarm to the PLR controller daemon.
Step 2: The PLR controller daemon monitors the VMs and the MPI (Message Passing Interface) applications. It has visibility of the HPC applications running on the VMs and of the virtualized environments. The PLR controller daemon ensures that redundant nodes are available for live migration and checkpointing; if no nodes are available, it provisions redundant nodes. The PLR controller daemon also initiates and carries out live migrations of VMs to redundant nodes, and it initiates checkpointing of MPI applications after migration.
Step 3: The fault tolerance daemon protocol notifies the PLR controller daemon when a failure occurs through "I am alive" messages and continues monitoring the communications of the MPI applications.
Step 4: Checkpointing is initiated with the BLCR checkpointing library, which is available in the Open MPI implementation [9], after live migration of the VMs to the redundant nodes. The checkpoint files are saved on the network and on a neighboring node for easy recovery as well as to eliminate a single point of failure.
Step 5: After checkpointing, the resources that were used for migration are freed, because cloud resources can be relinquished at will.
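The controller's reaction to a predictor alarm (steps 2 and 4 above) can be sketched as follows. All callables are hypothetical stand-ins for the real daemons and libraries (live migration, BLCR checkpointing), injected here so the control flow is testable.

```python
def handle_fault_alarm(vm, redundant_nodes, provision_node,
                       migrate, checkpoint):
    """Sketch of the PLR controller daemon's reaction to an FP alarm.

    On a predicted fault, secure a redundant node (provisioning one if
    none is free), live-migrate the VM to it, then checkpoint the MPI
    application -- in that order, matching steps 2 and 4. The callables
    are stand-ins for the real migration and checkpoint machinery.
    """
    if redundant_nodes:
        target = redundant_nodes.pop()     # a redundant node is free
    else:
        target = provision_node()          # step 2: provision one
    migrate(vm, target)                    # step 2: live migration
    checkpoint(vm)                         # step 4: checkpoint after migration
    return target
```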