GRDS Conferences ICST and ICBELSH (9)
This presentation was given in June 2014 by one of our participants at the ICST and ICBELSH conferences.

Presentation Transcript

  • APPLICATION LEVEL CHECKPOINT-BASED APPROACH FOR CRASH FAILURE IN DISTRIBUTED SYSTEM Presented By Moh Moh Khaing
  • OUTLINE
      - Abstract
      - Introduction
      - Objectives
      - Background Theory
      - Proposed System
      - System Flow of the Proposed System
      - Two Phases of the Proposed System
      - Implementation
      - Conclusion
  • ABSTRACT
      - Fault tolerance for computing node failures is an important and critical issue in distributed and parallel processing systems.
      - As the number of computing nodes in a network grows concurrently and dynamically, node failures occur more often.
      - This system proposes an application-level checkpoint-based fault tolerance approach for distributed computing.
      - The proposed system uses coordinated checkpointing and systematic process logging as its global monitoring mechanism.
      - The proposed system is implemented on a distributed multiple sequence alignment (MSA) application using a genetic algorithm (GA).
  • DISTRIBUTED MULTIPLE SEQUENCE ALIGNMENT WITH GENETIC ALGORITHM (MSAGA)
      [Diagram: the head node divides the input DNA sequences (2…n) among worker nodes, each running MSA with GA; the aligned sequence results are combined into one alignment result and displayed.]
  • SEQUENCES ALIGNMENT EXAMPLE
      Input multiple DNA sequences:
        >DNAseq1: AAGGAAGGAAGGAAGGAAGGAAGG
        >DNAseq2: AAGGAAGGAATGGAAGGAAGGAAGG
        >DNAseq3: AAGGAACGGAATGGTAGGAAGGAAGG
      Output aligned DNA sequences:
        >DNAseq1: A-AGGA-AGGA-AGGAA-------GG-----AA-GGAAGG
        >DNAseq2: ----------------AAGGAAGGAATGGAAGGAAGGAAGG
        >DNAseq3: ----------------AAGGAACGGAATGGTAGGAAGGAAGG
  • NODE FAILURE CONDITIONS
      A node failure condition can occur while the worker node connects to the head node, while it accepts the input sequence, or while it sends the resulting sequence back to the head node. The failure conditions are:
        1. The worker node is denied as soon as it has connected to the head node, without doing any job.
        2. The worker node rejects the input sequence after the head node and worker node have connected and the head node has prepared the input sequence for it.
        3. The worker node sends a "No Send" message to the head node after it has computed the result sequence.
        4. The worker node crashes when it cannot connect to the head node at the correct address.
        5. The worker node crashes when it disconnects from the head node.
  • COORDINATED CHECKPOINTING
      - Checkpointing is used as a fault tolerance mechanism in distributed systems.
      - A checkpoint is a snapshot of the current state of a process and assists in monitoring that process.
      - Coordinated checkpointing takes checkpoints periodically and saves them in a log file.
      - This monitoring information is provided when a node failure condition arises.
      - If a node failure occurs during distributed computing, another available node can reconstruct the process state from the checkpoint information saved for the failed node. A minimal sketch follows.
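
A minimal Python sketch of this idea, written for this transcript rather than taken from the paper; the WorkerState fields, the file name, and the JSON-lines layout are all assumptions:

    import json
    import time
    from dataclasses import dataclass, asdict

    CHECKPOINT_LOG = "global_checkpoints.log"   # assumed file name

    @dataclass
    class WorkerState:
        worker_id: int
        ip: str
        status: str      # e.g. "Available", "Busy"
        progress: int    # e.g. GA generations completed so far

    def take_coordinated_checkpoint(workers: list[WorkerState]) -> None:
        """Snapshot the current state of every worker in one checkpoint."""
        snapshot = {"timestamp": time.time(),
                    "workers": [asdict(w) for w in workers]}
        with open(CHECKPOINT_LOG, "a") as log:
            log.write(json.dumps(snapshot) + "\n")

    def restore_latest_checkpoint() -> dict | None:
        """Read the most recent snapshot back, e.g. after a node failure."""
        try:
            with open(CHECKPOINT_LOG) as log:
                lines = log.readlines()
        except FileNotFoundError:
            return None
        return json.loads(lines[-1]) if lines else None
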
  • SYSTEMATIC PROCESS LOGGING
      - Systematic Process Logging (SPL) is derived from a log-based method.
      - The motivation for SPL is to reduce the amount of computation that can be lost, which is bounded by the execution time of a single failed task.
      - SPL saves the checkpoint information from coordinated checkpointing in a log file format, with exact times and contents.
      - Depending on the fault, it decides which node can accept the failed node's job by consulting the stored log file, as sketched below.
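
A hypothetical sketch of SPL in the same spirit: each checkpoint becomes one timestamped log line, and after a fault the log is scanned for a worker whose latest status allows it to take over. The tab-separated line format and function names are assumptions:

    from datetime import datetime

    SPL_LOG = "spl.log"   # assumed file name

    def log_checkpoint(worker_id: int, status: str, detail: str) -> None:
        """Append one checkpoint record with its exact time and contents."""
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(SPL_LOG, "a") as log:
            log.write(f"{stamp}\t{worker_id}\t{status}\t{detail}\n")

    def find_takeover_candidate() -> int | None:
        """Pick a worker whose most recently logged status is 'Available'."""
        latest: dict[int, str] = {}
        try:
            with open(SPL_LOG) as log:
                for line in log:
                    _, wid, status, _ = line.rstrip("\n").split("\t", 3)
                    latest[int(wid)] = status
        except FileNotFoundError:
            return None
        for wid, status in latest.items():
            if status == "Available":
                return wid
        return None
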
  • PROPOSED FAULT TOLERANCE SYSTEM
      - The checkpoint-based fault tolerance approach is implemented at the application layer, without any operating system support.
      - In the distributed multiple sequence alignment application, one head node and one or more worker nodes are connected over a local area network.
      - All worker nodes run MSAGA and align the input sequences from the head node independently.
      - The proposed fault tolerance system takes a local checkpoint of the MSA process on each computing worker node itself, and a global checkpoint of all workers' condition events on the head node.
  • ARCHITECTURE OF PROPOSED FAULT TOLERANCE SYSTEM
      [Diagram: the head node, running the GRM and GCS, is connected over a local area network to workers 1-3, each running an LC and an LCS.]
      GRM - Global Resource Monitor; GCS - Global Checkpoint Storage; LCS - Local Checkpoint Storage; LC - Local Checkpoint
  • SYSTEM FLOW OF PROPOSED SYSTEM
      [Diagram: the system runs a checkpointing phase between the HN and WNs (coordinated checkpointing via the GRM, LC, LCS, and GCS, followed by systematic process logging) and then a load balancing phase driven by the GRM and GCS on the HN.]
      HN - Head Node; WN - Worker Node
  • IMPLEMENTATION OF HEAD NODE: CHECKPOINTING PHASE
      - The global resource monitor (GRM) plays the main role in both the coordinated checkpointing phase and the systematic process logging phase.
      - The GRM takes a global checkpoint of all worker nodes' events in the coordinated checkpointing phase.
      - The GCS saves the global checkpoint information in log file format in the systematic process logging phase.
  • GLOBAL CHECKPOINT
      Global Resource Monitor (GRM)
      Begin
        1. Take a global checkpoint of each WN's current condition, with the WN's IP, port, status, and time duration
        2. Detect the failure condition of WNs
        3. Find the available worker nodes and decide which node is suitable to continue the failed WN's jobs
      End
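
The three GRM steps can be sketched as follows; the data structures and the rule for choosing a takeover node (longest-idle available worker) are editorial assumptions, not the paper's specification:

    from dataclasses import dataclass

    @dataclass
    class WorkerCheckpoint:
        ip: str
        port: int
        status: str            # Available / Denied / Busy / Receive / Crash
        no_send: bool = False  # a Receive checkpoint carrying "No Send"
        duration: float = 0.0  # seconds spent in the current state

    FAILURE_STATUSES = {"Denied", "Crash"}

    def detect_failures(cps: dict[int, WorkerCheckpoint]) -> list[int]:
        """Step 2: ids of workers whose checkpoint signals failure."""
        return [wid for wid, cp in cps.items()
                if cp.status in FAILURE_STATUSES
                or (cp.status == "Receive" and cp.no_send)]

    def pick_available(cps: dict[int, WorkerCheckpoint]) -> int | None:
        """Step 3: choose the available worker that has been idle longest."""
        idle = [(cp.duration, wid) for wid, cp in cps.items()
                if cp.status == "Available"]
        return max(idle)[1] if idle else None
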
  • TYPES OF CHECKPOINT
      1. Available - The worker node is connected to the head node and waits for jobs from the head node.
      2. Denied - The worker node is disconnected from the head node.
      3. Busy - The worker node is processing a job.
      4. Receive - The worker node sends the result (or an error message) to the head node and exits.
      5. Crash - The worker node sends a crash message to the head node.
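
The five checkpoint types map naturally onto an enumeration; a hypothetical representation (the slides do not prescribe one):

    from enum import Enum

    class Checkpoint(Enum):
        AVAILABLE = 1  # connected to the head node, waiting for a job
        DENIED = 2     # disconnected from the head node
        BUSY = 3       # processing a job
        RECEIVE = 4    # sent a result or an error message, then exited
        CRASH = 5      # sent a crash message to the head node
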
  • CHECKPOINT INFORMATION
      For each checkpoint, five fields are recorded:
      - Worker Type, the worker number;
      - IP Address, identifying the WN;
      - Checkpoint Name, the worker node's condition;
      - Current Time, the process's current time;
      - Time Duration, the time from the running state to the accept/receive state, or from the running state to the reject state.
      Record layout: Worker Type | IP Address | Checkpoint Name | Current Time | Time Duration
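
As a hypothetical Python record (the field names follow the slide; the types are assumptions):

    from dataclasses import dataclass

    @dataclass
    class CheckpointRecord:
        worker_type: str      # worker number, e.g. "Worker 2"
        ip_address: str       # identifies the worker node
        checkpoint_name: str  # Available / Denied / Busy / Receive / Crash
        current_time: str     # wall-clock time the checkpoint was taken
        time_duration: float  # running-to-accept/receive or running-to-reject
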
  • AVAILABLE CHECKPOINT OF ALL WORKERS
      The GRM takes an Available checkpoint when all worker nodes are connected to the head node.
  • CHECKPOINT CHANGES FROM AVAILABLE
      GlobalCheckpoint_Available ( )
      Begin
        1. IF HN and WNs are connected THEN
             GRM takes checkpoint as Available
           END IF
        2. IF checkpoint is Available THEN
             IF WN is still connected to HN THEN
               HN selects a sequence and sends it to the WN
               IF WN does not accept the sequence THEN
                 GRM takes checkpoint as Crash
                 The sequence goes to the crash queue
               ELSE
                 GRM takes checkpoint as Busy
                 WN runs the MSA application
               END IF
             ELSE
               GRM takes checkpoint as Denied
             END IF
           END IF
      End
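
A side-effect-free, hypothetical rendering of these transition rules; the boolean parameters stand for the conditions the GRM observes:

    def next_checkpoint_from_available(connected: bool,
                                       accepted_sequence: bool) -> str:
        """Next checkpoint for a worker currently at Available."""
        if not connected:
            return "Denied"   # worker dropped its connection
        if not accepted_sequence:
            return "Crash"    # the sequence is re-queued to the crash queue
        return "Busy"         # worker starts the MSA job

    assert next_checkpoint_from_available(True, True) == "Busy"
    assert next_checkpoint_from_available(True, False) == "Crash"
    assert next_checkpoint_from_available(False, False) == "Denied"
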
  • DETECTING NODE FAILURE BY GRM
      [Figure not captured in the transcript.]
  • BUSY CHECKPOINT OF ALL WORKERS
      [Figure not captured in the transcript.]
  • CHECKPOINT CHANGES FROM BUSY
      GlobalCheckpoint_Busy ( )
      Begin
        1. IF WN accepted the input sequence from HN THEN
             GRM takes checkpoint as Busy
           END IF
        2. IF checkpoint is Busy THEN
             IF WN sends an error message to HN THEN
               GRM takes checkpoint as Receive for the error
             ELSE
               GRM takes checkpoint as Receive for the result
             END IF
           END IF
      End
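
The matching hypothetical rendering for the Busy state: from Busy the worker always moves to Receive, tagged with whether it carried an error or a result:

    def next_checkpoint_from_busy(sent_error: bool) -> tuple[str, str]:
        """Next checkpoint and what the Receive carried."""
        return ("Receive", "error" if sent_error else "result")

    assert next_checkpoint_from_busy(False) == ("Receive", "result")
    assert next_checkpoint_from_busy(True) == ("Receive", "error")
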
  • RECEIVE CHECKPOINT WITH RESULT
      [Figure not captured in the transcript.]
  • RECEIVE CHECKPOINT WITH "NO SEND" MESSAGE
      [Figure not captured in the transcript.]
  • GLOBAL CHECKPOINT STORAGE (GCS)
      Global_Checkpoint_Storage ( )
      Begin
        1. GCS stores the current condition of every WN in the network as checkpoints taken by the GRM
        2. GCS records the detailed condition of each WN
        3. GCS creates a log file for all node checkpoints
      End
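
A minimal sketch of the GCS as one append-only log file; the CSV-style layout is an assumption, since the slides only say "log file format":

    import csv
    from datetime import datetime

    GCS_LOG = "gcs.log"   # assumed file name

    def gcs_store(worker_type: str, ip: str, checkpoint: str,
                  duration: float) -> None:
        """Record one worker's current condition in the GCS log file."""
        with open(GCS_LOG, "a", newline="") as f:
            csv.writer(f).writerow(
                [worker_type, ip, checkpoint,
                 datetime.now().isoformat(timespec="seconds"), duration])
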
  • GCS LOG FILE
      [Figure not captured in the transcript.]
  • LOAD BALANCING PHASE
      GRM_LoadBalancing ( )
      BEGIN
        IF GRM detects Denied, Crash, or Receive "No Send" THEN
          1. These are taken to be worker node failures.
          2. The GRM finds an available node using the GCS and decides which node is suitable to receive the job.
          3. If one is found, the HN sends the failed node's jobs to that available node.
          4. Call the Available and Busy algorithms
        END IF
      END
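
A hypothetical sketch of this load balancing step; the data shapes and the first-fit choice of available worker are editorial assumptions:

    FAILURE_CHECKPOINTS = {"Denied", "Crash", "NoSend"}

    def load_balance(checkpoints: dict[str, str],
                     pending_jobs: dict[str, str]) -> list[tuple[str, str]]:
        """Map each failed worker's job onto an available worker.

        checkpoints:  worker id -> latest checkpoint name
        pending_jobs: worker id -> the sequence that worker was assigned
        Returns (new worker id, job) pairs for the head node to dispatch.
        """
        available = [w for w, c in checkpoints.items() if c == "Available"]
        reassigned = []
        for worker, cp in checkpoints.items():
            if (cp in FAILURE_CHECKPOINTS and worker in pending_jobs
                    and available):
                reassigned.append((available.pop(0), pending_jobs[worker]))
        return reassigned
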
  • LOAD BALANCING ACCORDING TO NODE FAILURE AS A DENIED CHECKPOINT
      [Figure not captured in the transcript.]
  • LOAD BALANCING ACCORDING TO NODE FAILURE AS A CRASH CHECKPOINT
      [Figure not captured in the transcript.]
  • LOAD BALANCING ACCORDING TO NODE FAILURE AS A RECEIVE ("NO SEND") CHECKPOINT
      [Figure not captured in the transcript.]
  • IMPLEMENTATION OF WORKER NODE
      - The worker node aligns the DNA sequences using the MSAGA application.
      - The worker node takes the local checkpoint at the application level of MSAGA.
      - The worker node implements the checkpointing phase of the proposed fault tolerance system.
      - The local checkpoint (LC) and the local checkpoint storage (LCS) play the main role in that phase.
      - Every worker node takes its own local checkpoints and has its own local checkpoint storage.
      - The local checkpoint (LC) takes all checkpoints of its worker node.
      - The local checkpoint storage (LCS) stores the processing state of its worker.
  • LOCAL CHECKPOINT
      - The local checkpoint (LC) is responsible for taking local checkpoints of the worker's process states.
      - The LC starts taking checkpoints of the worker's processing state when the worker node (WN) connects to the head node.
      - This responsibility lasts until all the worker's processes have finished normally, or the worker leaves the local area network because of a node failure.
  • LOCAL CHECKPOINT OF EACH WORKER
      LocalCheckpoint ( )
      BEGIN
        1. Record the WN's starting time, ending time, and connection time
        2. Record every process state of the MSA for the sequence
      END
  • LOCAL CHECKPOINT STORAGE (LCS)
      - SPL produces a checkpoint log file and a processing log file for the local condition of each node.
      - All local checkpoint monitoring information is stored in the local checkpoint storage (LCS).
      - Each WN stores its own corresponding LCS. A combined sketch of LC and LCS follows.
      LocalCheckpointStorage ( )
      BEGIN
        1. Store the WN's starting time, ending time, and connection time
        2. Store every process state of the MSA for the sequence
      END
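
A combined hypothetical sketch of LC and LCS on the worker side: record the start, connection, and end times plus every MSA process state in the worker's own log file. The file layout and names are assumptions:

    from datetime import datetime

    class LocalCheckpointStorage:
        def __init__(self, worker_id: int):
            # Each worker keeps its own storage file.
            self.path = f"lcs_worker{worker_id}.log"

        def store(self, event: str, detail: str = "") -> None:
            stamp = datetime.now().isoformat(timespec="seconds")
            with open(self.path, "a") as f:
                f.write(f"{stamp}\t{event}\t{detail}\n")

    # Usage: local checkpoints across one worker's MSA run.
    lcs = LocalCheckpointStorage(worker_id=1)
    lcs.store("connect", "connected to head node")
    lcs.store("msa_state", "generation 50 of 200")
    lcs.store("end", "aligned sequence sent to head node")
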
  • LCS LOG FILE
      [Figure not captured in the transcript.]
  • CONCLUSION
      - The GRM does not take incorrect checkpoints, regardless of the number of worker nodes.
      - The GRM can distinguish an old worker node from a new one exactly when a worker node reconnects to the head node.
      - While the GRM takes a checkpoint for one worker node, the remaining workers do not need to stop their operation, so no worker node is blocked.
      - This approach lets the distributed multiple sequence alignment processing continue to the final result when a node failure occurs within the network.
      - The system computes the exact time of each worker node and the whole system's execution time. It obtains a portable checkpoint feature and does not need any operating system support.
  • THANK YOU!!