Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Fault tolerant mobile agent-based monitoring mechanism for highly dynamic distributed netwo


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Fault tolerant mobile agent-based monitoring mechanism for highly dynamic distributed netwo

  1. 1. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 3, May 2010 1ISSN (Online): 1694-0784ISSN (Print): 1694-0814 Fault-tolerant Mobile Agent-based Monitoring Mechanism for Highly Dynamic Distributed Networks Jinho Ahn1 1 Dept. of Computer Science, College of Natural Science, Kyonggi University Suwon, Gyeonggi-do 443-760, Republic of Korea environments within the systems needed to be monitored Abstract and the nature of managed resources becomes almostThanks to asynchronous and dynamic natures of mobile agents, a dynamic, not static, which forces traditional staticcertain number of mobile agent-based monitoring mechanisms centralized and distributed monitoring mechanisms to behave actively been developed to monitor large-scale and unsuitable for the systems [10]. Thus, mobile agent-baseddynamic distributed networked systems adaptively and monitoring mechanisms have actively been developed toefficiently. Among them, some mechanisms attempt to adapt to monitor these large scale and dynamic distributeddynamic changes in various aspects such as network trafficpatterns, resource addition and deletion, network topology and so networked systems adaptively and efficiently.on. However, failures of some domain managers are very critical Mobile agent is an autonomous and independent softwareto providing correct, real-time and efficient monitoring program to satisfy the corresponding user’s goal on behalffunctionality in a large-scale mobile agent-based distributed of the user while visiting various target nodes through amonitoring system. In this paper, we present a novel fault- network [3]. This mobile agent technology has severaltolerance mechanism to have the following advantageous advantages such as reduction of network traffic,features appropriate for large-scale and dynamic hierarchical overcoming of network delay, enabling asynchronousmobile agent-based monitoring organizations. It supports fast execution and enhancement of dynamic adaptability.failure detection functionality with low failure-free overhead by Thanks to these desirable features, this technology is veryeach domain manager transmitting heart-beat messages to itsimmediate higher-level manager. Also, it minimizes the number widely used in distributed systems, especially for networkof non-faulty monitoring managers affected by failures of management. In a network management system, eachdomain managers. Moreover, it allows consistent failure mobile agent is generally designed to move to one or moredetection actions to be performed continuously in case of agent agent-executable nodes in a network, sense temporally andcreation, migration and termination, and is able to execute permanently other nodes and resources, and filter andconsistent takeover actions even in concurrent failures of domain deliver the received management information to themanagers. appropriate network management nodes [10].Keywords: Distributed Network, Fault-tolerance, Mobile Agent, The previous mobile agent-based monitoring mechanismsScalability, Takeover. are classified as follows: centralized and hierarchical distributed monitoring mechanisms. Most of them are based on the centralized monitoring model and divided1. Introduction into two categories, single mobile agent-based and segment-based mechanisms. In the first [11], a singleRecently, as the number of users of distributed systems management station creates a mobile agent and allows theand networks considerably increases with the increasing agent to sequentially visit the required nodes in acomplexity of their services and policies, system particular order. This mechanism is simple to implement,administrators attempt to ensure high quality of services but causes the task completion time of a mobile agent toeach user requires by maximizing utilization of system become too long in large-scale distributed systems becauseresources [5]. To achieve this goal, correct, real-time and the number of visiting nodes significantly increases andefficient management and monitoring mechanisms are the size of the agent may grow considerably. In particular,essential for the systems. But, as the infrastructures of the if the visiting nodes are interconnected through low-systems rapidly scale up, a huge amount of monitoring bandwidth links, the round-trip delay may extremelyinformation is produced by a larger number of managed increase. Secondly, the segment-based mechanism [2]nodes and resources and so the complexity of network partitions a network into several sub-networks or domains,monitoring function becomes extremely high [1]. Also, and creates and transfers a mobile agent to each domainthere are heterogeneous and various network
  2. 2. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 3, May 2010 2www.IJCSI.orgrespectively. Therefore, the collection and filtering of the for a predefined number of consecutive timeout intervals,management information for monitored nodes can be it generates and delivers an AgentFailure message to aperformed in parallel per domain, which addresses the global recovery agent (GRA). Afterwards, the GRAscalability problem of the first mechanism to a certain recreates a new agent based on its most recentextent. However, in this mechanism, the single manager configuration information and redeploys it to theshould execute all the monitoring function and may appropriate target host. However, this behavior results inbecome the performance bottleneck of the entire system. high failure-free overhead due to the centralization ofIn addition, if the agent migration network includes failure detection functionality in a single point within aexpensive low bandwidth links, it is very difficult to large-scale hierarchical monitoring organization.perform the procedure to obtain and filter the monitoring Additionally, the takeover procedure performed by GRAsinformation in real-time. is much unsuitable for maintaining a tree-like managerTo solve the scalability problem, mobile agent-based structure efficiently. Also, this mechanism includes nomechanisms using hierarchical monitoring structure [6, 7] concrete method to detect failures of manager agentswere proposed. They allow a network to be partitioned correctly in case of agent creation, migration andinto a set of domains organized hierarchically and deploy termination triggered by dynamic changes in a network.a new monitoring agent to each domain. In this hierarchy, This paper proposes a novel fault-tolerance mechanism toa main manager is at the top-level (level 1) and delegates have the following desirable features appropriate for large-monitoring tasks with monitoring agents to the lower level scale and dynamic hierarchical mobile agent-baseddomain managers. Each manager clones and dispatches a monitoring organizations:monitoring agent to the appropriate domain manager nodeconsidering load redistribution of monitoring tasks. In this •Support fast failure detection functionality with lowcase, each domain manager collects the management failure-free overhead by each domain managerinformation from the lower-level managers and filters and periodically transmitting heart-beat messages to itsdelivers the processed information to its higher-level immediate higher-level manager.manager. The original hierarchical monitoring •Minimize the number of non-faulty monitoring managersmechanisms were almost based on a static manager affected by failures of domain managers.organization model. In other words, each network •Enable consistent failure detection actions to beadministrator configures a tree of network domains performed continuously in case of agent creation,according to its initial monitoring policy and then the main migration and termination.manager at the root domain creates and migrates •Can execute consistent takeover actions even inmonitoring manager agents to other domains. However, if concurrent failures of domain managers.any dynamic changes in various aspects such as networktraffic patterns, resource addition and deletion, network The remainder of this paper is organized as follows. Intopology and so on occur, this mechanism cannot adapt to sections 2 and 3, we describe our proposed mechanism inthese changes and will degrade significantly the entire both conceptual and algorithmic ways, and show itsmanagement performance. There were presented some correctness proof. Section 4 compares the proposedadaptive mobile agent-based mechanisms [8] to address mechanism with the existing ones in detail and section 5this important issue. In these mechanisms, if each domain concludes this paper.manager at level i estimates the need for some additionalmonitoring capability at run-time, it creates and installs anew manager agent to an appropriate node at level i+1 or 2. The Proposed Mechanismmigrates to another node for keeping location optimalityof its network monitoring. In the following subsections, data structures andHowever, failures of some domain managers even algorithms of the proposed mechanism are describedassuming the main manager can be reliable using informally.replication-based fault-tolerance mechanisms are verycritical to providing correct, real-time and efficient 2.1 Data structuresmonitoring functionality in a large-scale mobile agent- Every domain monitoring manager α has to keep thebased distributed monitoring system. To the best of our following three variables.knowledge, the fault-tolerance mechanism proposed in •AIDα: it is the agent identifier of domain manager α.[13] is the only one to address this issue. But, in this •MMaddrα: it is the main manager’s identifier neededmechanism, every agent should periodically send heart- when domain manager α is created or the organization ofbeat messages to global failure detection agents (GFDAs). its lower-level managers changes.If the GFDA receives no heart-beat message from an agent •IHMaddrα: it is the immediate higher-level manager’s
  3. 3. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 3, May 2010 3www.IJCSI.orgidentifier of domain manager α. received any heart-beat message from the lower-level•ptrα: it is the root node of a tree for saving the identifier manager until the timer expires, it suspects that the lower-and timer of every lower-level manager of main or domain level manager crashes. This behavior results in lowmonitoring manager α. Its node is a tuple (aid, tinterval, failure-free overhead incurred by failure detection byptr). tinterval for each lower-level manager aid is used so utilizing the tree-like organization of monitoring managersthat monitoring manager α detects whether its lower-level effectively.manager aid is alive or failed, and is initialized to τ . ptr If a monitoring manager determines that a new one isfor its lower-level manager aid is the next-level node needed as its immediate lower-level manager for effectivemaintaining references for all lower-level managers of the monitoring, it creates a new mobile agent for this likedomain manager aid in a hierarchical manner. figure 1. In this figure, manager DMx has a monitoring agent spawned and transferred to a new node DMz. The agent initiates its monitoring task and notifies of its2.2 Informal Description location all nodes on the path between the main manager MM and itself. When a manager knows that it cannot play its role well and effectively for guaranteeing the monitoring performance required, it is voluntarily replaced by agent migration like in figure 2. In this figure, after manager DMz has made the same decision mentioned above, it finds an appropriate substitute node DMα and forces its agent to migrate to the substitute, where the agent resumes its monitoring task. If a manager detects some immediate lower-level managers has failed, it activates our takeover procedure. Fig.1 In case of another DM being required.Every domain manager α periodically transmits eachheartbeat message only to its immediate higher-levelmanager IHMaddrα. Therefore, each monitoring managercan know which ones fail or are alive among its immediatelower level managers by their periodic notification. In ourmechanism, the manager α decrements the timer tintervalfor its corresponding immediate lower-level manager aidin ptrα by one every certain time interval. If α has not
  4. 4. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 3, May 2010 Fig.2 In case of DM replacement by agent migration.At this point, there can occur among three cases depending Fig.3 In case of a new DM taking over failed DM’s task.on availability and capability of nodes. First, like in figure3, when MM recognizes DMw’s failure and a new node If DMγ determines that DMy is just suitable for the role, itDMγ is its suitable substitute, the main manager creates allows DMy to take over DMx’s task like in figure 4. In thisand transfers a new monitoring agent with the same role to case, DMy notifies DMx’s other immediate lower-levelnode DMγ. Then, it performs the same monitoring function managers of this substitution and updates its location onthe failed node DMw executed, and inform its immediate all nodes on the path between the main manager MM andlower-level managers, e.g., DMx of this replacement. DMx. As the last case, when there is neither any new norSecond, when a manager DMγ identifies the failure of its lower-level manager capable of being substituted for thenext-level manager DMx in figure 3 and there is no failed one DMx in figure 3, DMx’s immediate higher-levelavailable node for replacing the failed one, it checks manager DMγ takes over DMx’s role aside from DMγ’swhether among DMx’s immediate lower-level managers own task in figure 5. Also, the mechanism performs theDMy and DMα, there exists a proper one as DMx’s consistent takeover procedure even in case of concurrentreplacement. failures of domain managers. Algorithmic description of the failure detection and takeover procedures for main or domain manager Self in our mechanism are formally given in figures 6 and 7.
  5. 5. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 3, May 2010 procedure CHECK_AGENTLIVENESS() failedMngrs ← invoke DECR_TINTERVAL() on Self ; for all fmngr in failedMngrs do if(there is a new node nmngr as an appropriate substitute for fmngr) then invoke MNGR_TAKEOVER(fmngr) on nmngr ; send a message Change_IHigherLevelMngr(AID Self) to nmngr ; else if(there is a suitable substitute lmngr for fmngr among its immediate lower-level managers) then invoke MNGR_TAKEOVER(fmngr) on lmngr ; send a message Change_IHigherLevelMngr(AID Self ) Fig.5 In case to lmngr ; of the else invoke MNGR_TAKEOVER(fmngr) on Self ; immediate higher-level procedure DECR_TINTERVAL() DM taking failedMngrs ←  ; over failed DMs. for all e in ptrSelf do e.tinterval ← e.tinterval - 1 ; Fig.4 In case of an existing DM taking over failed DMs task. Fig.6 Failure if(e.tinterval = 0) then failedMngrs ← failedMngrs U {e} ; detection and ptrSelf ← ptrSelf - failedMngrs ; takeover return failedMngrs ;3. Correctness Proof procedures for procedure MNGR_TAKEOVER(fmngr)This section shows theorems 1 and 2 to prove safety and manager Self. for all e in fmngr.ptr doliveness of our proposed mechanism in order. ptrSelf ← ptrSelf U {(e.aid, e.ptr, τ )} ; send a message Change_IHigherLevelMngr(AID Self) [Base to e.aid ;Theorem 1. Even if multiple domain managers crash case] As send a message Change_TreeTopologyAtMMngr(AID Self,concurrently, our mechanism enables other live managersto monitor all the network elements previously managed ptrSelf ) to MMaddrSelf ;by the failed ones. procedure NOTIF AGENTALIVEMSG() procedure CHANGE_TREETOPOLOGYATMMNGR(AID, send a message Update_AgentTInterval(AIDSelf) ptr)Proof: Suppose that the entire distributed monitoring to IHMaddrSelf ;find a path mngrs to AID in ptrSelf ;system consists of a finite set N of monitoring managers find a node e in ptrSelf st (e.aid = AID); e.ptr ← ptr ;whose size is n and there is the set of all crashed domain procedure UPDATE AGENTTINTERVAL(AID) mngrs ; NLMngr ← the first element e inmanagers, denoted by SCDM. The proof proceeds by for all e in ptrSelfmngrs ← mngrs - {e} ; doinduction on the number of all the crashed domain if(e.aid = AID) thenmessage Change_TreeTopology(mngrs, AID, ptr) send a e.tinterval ← τ ;managers in SCDM, denoted by |SCDM| (|SCDM| < n). return ; to NLMngr ; procedure MIGRATE AGENTTONEWNODE(nmngr) procedure CHANGE_TREETOPOLOGY(mngrs, AID, ptr) invoke MNGR TAKEOVER(AIDSelf) on nmngr= AID) ; find a node e in ptrSelf st (e.aid ; send a message e.ptr ← ptr ; Change_IHigherLevelMngr(IHMaddrSelf) to nmngr ; if(mngrs = ) then NLMngr ← the first element e in mngrs ; mngrs ← mngrs - {e} ; send a message Change_TreeTopology(mngrs, AID, ptr) to NLMngr ; procedure CHANGE_IHIGHERLEVELMNGR(AID) IHMaddrSelf ← AID ;
  6. 6. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 3, May 2010|SCDM|=1, there is only one crashed domain manager single agent to be created on the network manager and toDMx. In this case, the following three cases should be perform its task monitoring function according to theconsidered. itinerary consisting of its target nodes. It is simple toCase 1: there is a new available domain manager DMγ implement, but not scalable because in large distributedcapable of taking over DMx. networks, the round-trip delay for the agent may becomeIn this case, after detecting DMx’s failure, the immediate significantly increasing, especially on polling frequently,higher level manager of DMx creates and transfers a new and its size, considerably large while visiting its targetmonitoring agent with the same role to node DMγ. Then, nodes.DMγ performs the same monitoring function the failed Corradi et al. [2] presented a segment-based monitoringnode DMx executed, and inform DMx’s immediate lower- mechanism partitioning a network into a set of sub-level managers of this replacement and updates its location networks or domains and transferring a mobile agent toon all nodes on the path between the main manager and each domain. This mechanism can reduce greatly itsDMx. overall monitoring response time by collecting and filtering its management data per domain in parallel Fig.7 Failure detection and takeover procedures for Self (continued). compared with the single agent-based one [11]. Gavalas et al. [4] proposed a broadcast-based monitoringCase 2: among DMx’s immediate lower-level managers, mechanism, where the network manager instantiates andthere is a proper one DMγ as DMx’s substitute. migrates each a mobile agent to all managed nodes. AfterIn this case, DMx’s immediate higher level manager allows the agent collects and analyzes the network trafficDMγ to take over DMx’s task and notifies DMx’s other information from the corresponding node, it returns to theimmediate lower-level managers of this substitution and network manager platform with the requested information.updates its location on all nodes on the path between the Thus, this mechanism maximizes the parallelism of itsmain manager and DMx. monitoring process and achieves its short response time.Case 3: there is neither any new nor lower-level manager However, as the number of managed network elements orcapable of being substituted for DMx. In this case, DMx’s resources extremely increases, a large number of mobileimmediate higher-level manager DMγ takes over DMx’s agents are required. This feature may incur high agentrole aside from DMγ’s own task. The subsequent movement overhead by broadcasting and so degrade theprocedure is the same as in case 2. entire system performance remarkably.[Induction hypothesis] We assume that the theorem is All the mechanisms stated above may not overcome thetrue in case that |SCDM|=k. limitation of scalability fundamentally because of their[Induction step] Only if (k+1)-th crashed domain centralization nature. Also, this problem becomes gettingmanager (k+1 < n) can be taken over by any other live increasingly worse if expensive and low bandwidth linksdomain managers, the theorem is true in case that are included in routing paths of agents.|SCDM|=k+1. The following case is the same as the base Liotta et al. [7] introduced a scalable multi-levelcase mentioned above. monitoring mechanism based on the concept ofBy induction, even after |SCDM| concurrent domain Management by Delegation (MbM) [6]. In other words,manager failures occur, our mechanism allows their this mechanism partitions a networked system into severalmonitoring functions to be taken over other surviving ones. domains composing a hierarchical structure and deploys a mobile agent to each of them. A distributed java agent-based monitoring system JAMMTheorem 2. Our mechanism terminates within a finite was proposed for grid computing in [12]. This systemtime. enables monitoring sensors to execute by triggering their execution based on actual client usage. Clients can controlProof: As no more than |SCDM| (|SCDM| < n) domain remote sensors and obtain their requested informationmanager crashes occur, the proposed mechanism has only from the sensors in the form of re-execute its takeover procedure at most up to |SCDM| In [9], a multi-agent based distributed monitoring systemtimes as explained in theorem 1. Thus, the mechanism is implemented composed of dynamically controllableterminates within a finite time. agents. The structure of the system is divided into three layers to support independence among communication protocols, message interpretation and monitoring tasks.4. Comparisons This independence among the three layers may reduce agent development time and makeMost of monitoring systems using mobile agents were it easy to manage distributed systems.developed based on flat network infra-structure. Single However, these three mechanisms [7, 9, 12] cannot beagent-based monitoring system proposed in [11] forces a
  7. 7. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 3, May 2010 7www.IJCSI.orgautonomously adaptable for dynamic changes such as Referencesvariations of network traffic patterns, resource addition [1] H. Asgari, P. Trimintzios, M. Irons, G. Pavlou, S. Berghe,and deletion, changes of network topology and so on and R. Egan, "A Scalable Real-time Monitoring System forbecause their structure of monitoring managers is static Supporting Traffic Engineering", in Proc. of the IEEEafter the initial agent deployment. Workshop on IP Operations and Management, Dallas, USA,In [8], an adaptive and hierarchical mobile agent-based 2002.monitoring mechanism was presented to address the above [2] A. Corradi, C. Stefanelli, and F. Tarantino, "How to Employ Mobile Agents in Systems Management", in Proc. of the 3rdmentioned problems. In this mechanism, each middle-level Int. Conf. on the Practical Application of Intelligent Agentsmanager agent is not bound to a particular network node and Multi-Agent Technology (PAAM’98), 1998, pp. 17-26.and be able to sense the network, find and move to better [3] A. Fuggetta, G.P.Picco, and G. Vigna, "Understanding Codelocations for seeking monitoring location optimality. Mobility", IEEE Transactions on Software Engineering, Vol.But, among all the previously stated hierarchical mobile 24, No. 5, 1998, pp. 342-361.agent-based mechanisms, no one addresses the failure [4] D. Gavalas, D. Greenwood, M. Ghanbari, and M. O’Mahony,detection and recovery issue for a hierarchy of monitoring "Complimentary Polling Modes for Network Performancemanagers. Management Employing Mobile Agents", in Proc. of theTripathi et al. [13] presented a mobile agent-based IEEE Global Communications Conference (Globecom’99), 1999, pp. 401-405.distributed monitoring system supporting autonomic [5] D. Goderis, S. Bosch, and Y. T’Joens, "A Service-Centric IPconfiguration and recovery. In this system, there are Quality of Service Architecture for Next Generationseveral global failure detection agents subscribing to Networks", in Proc. of the IEEE/IFIP Network Operationsheart-beat events from all monitoring agents. Thus, every and Management Symposium, 2002, pp. 139-154.monitoring agent periodically sends its heart-beat message [6] G. Goldszmidt, and Y. Yemini, "Delegated Agents forto each global failure detection agent. If a monitoring Network Management", IEEE Communication Magazine,agent fails, one of global recovery agents in this system Vol. 36, No. 3, 1998, pp. 66-70.executes the following recovery procedure: the recovery [7] A. Liotta , G. Knight, and G. Pavlou, "Modelling Networkagent instantiates the monitoring agent based on its most and System Monitoring Over the Internet with Mobile Agents", in Proc. of the IEEE/IFIP Network Operations andrecent configuration information and re-launches it to an Management Symposium (NOMS’98), 1998, pp. 303-312.available node. Thus, if large-scale networks are assumed, [8] A. Liotta , G. Pavlou, and G. Knight, "Exploiting Agentthis feature results in high failure-free overhead due to the Mobility for Large-scale Network Monitoring", IEEEcentralized failure detection procedure. Additionally, the Network, 2002, pp. 7-15.takeover procedure of the global recovery agents in this [9] S. Kwon, and J. Choi, "An Agent-based Adaptive Monitoringsystem is very unsuitable for maintaining a tree-like System", Lecture Notes In Artificial Intelligence, Vol. 4088,monitoring manager structure efficiently. Also, this system 2006, pp. 672-677.presented no concrete mechanism to have the hierarchical [10] J. Philippe, M. Flatin, and S. Znaty, "Two Taxonomies ofstructure of monitoring managers adaptable for its correct Distributed Network and System Management Paradigms", Emerging Trends and Challenges in Network Management,and efficient failure detection in case of agent creation, 2000.migration and destruction caused by the dynamic changes [11] G. Susilo, A. Bieszczad, and B. Pagurek, "Infrastructure forwithin its entire network. Advanced Network Management based on Mobile Code", In Proc. of the IEEE/IFIP Network Operations and Management Symposium (NOMS’98), 1998, pp. 322-333.5. Conclusions [12] B. Tierney, B. Crowley, D. Gunter, J. Lee, and M. Thompson, "A Monitoring Sensor Management System forThis paper presented a novel fault-tolerance mechanism to Grid Environments", Cluster Computing Journal, Vol. 4, No.have the following advantageous features appropriate for 1, 2001, pp. 19–28.large-scale and dynamic hierarchical mobile agent-based [13] A. Tripathi, D. Kulkarni, H. Talkad, M. Koka, S. Karanth, T.monitoring organizations. It supports fast failure detection Ahmed, and I. Osipkov, "Autonomic Configuration andfunctionality with low failure-free overhead by each Recovery In A Mobile Agent-based Distributed Event Monitoring System", Software Practice and Experience, Vol.domain manager transmitting heart-beat messages to its 37, 2007, pp. 493–522.immediate higher-level manager. Also, it minimizes thenumber of non-faulty monitoring managers affected byfailures of domain managers. Moreover, it allows Jinho Ahn received his B.S., M.S. and Ph.D. degrees inconsistent failure detection actions to be performed Computer Science and Engineering from Korea University, Korea,continuously in case of agent creation, migration and in 1997, 1999 and 2003, respectively. He has been an associate professor in Department of Computer Science, Kyonggi University.termination, and is able to execute consistent takeover He has published more than 70 papers in refereed journals andactions even in concurrent failures of domain managers. conference proceedings and served as program or organizing committee member or session chair in several
  8. 8. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 3, May 2010 8www.IJCSI.orgdomestic/international conferences and editor-in-chief of journal ofKorean Institute of Information Technology and editorial boardmember of journal of Korean Society for Internet Information. Hisresearch interests include distributed computing, fault-tolerance,sensor networks and mobile agent systems.