Collcom2005 agent basedft



  1. 1. Dynamic Network Reconfiguration in Presence of Multiple Node and Link Failures Using Autonomous Agents

Juan Ramón Acosta and Dimiter R. Avresky
Network Computing Lab, Northeastern University, Boston, MA
{jracosta,avresky}@ece.neu.edu

Abstract

Currently, high-speed networks are indispensable commodities for all users and have become an integral part of their lifestyles. For this reason, the network must be available most of the time and must recover from failures transparently. This paper proposes Agent NetReconf¹, an agent based dynamic network reconfiguration algorithm that is capable of tolerating multiple router and link failures in high-speed networks with arbitrary topology. Agent NetReconf updates the routing tables asynchronously and does not require any global knowledge of the network topology. Agent NetReconf uses mobile and autonomous agents to detect failures and to recover the network from them. Agent NetReconf highlights the benefits of using smart networking devices as a means of building an active network. The complexity of Agent NetReconf is analyzed and its termination, liveness and safety are proved.

Keywords: high-speed networks, autonomous mobile agents, dynamic reconfiguration, fault tolerance, adaptive routing, arbitrary topologies

Introduction

The increasing number of users of the Internet has triggered a significant growth in the number of networked devices and the traffic they generate. Computer networks are now being pushed to their limit. In this context, computing capacity is available but it can be severely affected by failures.
The major challenge faced by service providers today is to keep their ability to give customers the level of service they require, regardless of system conditions and the number of faults on the network.

The need to provide increased availability has led researchers such as Hood and Ji [8] to develop a sophisticated intelligent software agent that performs fault detection accurately and in certain cases predicts the fault before it appears. Others such as Whit et al. [15] have implemented communities of mobile agents that roam the network collecting and exchanging network information based on the "social insects" paradigm (ant behavior) described by Schoonderwoerd et al. [11].

¹ This work was supported by the U.S. National Science Foundation under grant CCR-0004515.

In this paper, an algorithm is proposed for achieving dynamic network fault detection and avoidance in arbitrary topologies using autonomous agents running at each router. The reconfiguration algorithm is distributed and embedded in the agents' behavior. The paper is organized in six sections as follows. Section 1 presents an overview of agents and how they are used in adaptive routing. Section 2 describes a new router architecture that uses autonomous agents for its routing services. Section 3 describes Agent NetReconf and how it reconfigures the routing tables to restore routing capabilities in the network segment affected by the failure. Section 4 presents the complexity, termination, safety and cognitive properties of Agent NetReconf. Section 5 presents a fault recovery example showcasing the algorithm execution. The last section contains the conclusions.

1. Autonomous Agents

This section presents an overview of previously published work on how agents are used to achieve efficient network routing and fault tolerance.

The term agent has been used to refer to a software and/or hardware component that is capable of acting exactingly in order to accomplish tasks on behalf of its user [10]. An agent is able to cooperate with other agents, learn from its environment [17], and sometimes migrate under its own control from one machine to another, provided both computers are part of a network.

Agents communicate with other agents to complete successfully all the tasks given to them [16]. Communication between agents is modeled as a point-to-point exchange of messages whose content is a construction of a well defined language, for example: the Knowledge Query and Manipulation Language (KQML) [4], the Knowledge Interchange
  2. 2. Format (KIF) [14] or, the most recent, the OWL Web Ontology Language [2].

1.1. Applications on Network Fault Tolerance

Minar [9] describes an algorithm to discover the network topology using mobile agents. The agents travel the network and, from each node they visit, learn its current connectivity. In addition, the agents complement the acquired knowledge by cooperating with other agents they meet at the same node. When the agents finish exploring the network, the topology is fully discovered, and this information is then used to define the routing tables at each node. Agents have also been used in adaptive routing; for example, Gianni [3] introduced a distributed adaptive routing algorithm based on mobile agents that is capable of learning the routing tables of a computer network using the ant colony metaphor. Garijo, Cancer and Sánches [6] describe a centralized Multi-agent Cooperative Network-Fault Management system (CNFM) that uses ISO standard interfaces at each router to detect and avoid faults on the network. In CNFM the agents work as watchdogs of the network, monitoring each element and generating events into the CNFM engine when faults are recognized.

Cynthia Hood and Chuanyi Ji [8] took advantage of the increasingly available computation power in networking devices and the benefits of artificial intelligence to design an intelligent agent that processes information collected by the Simple Network Management Protocol agents (SNMP-agents) at each node and uses this information to detect network anomalies that typically precede a fault.
"The intelligent agent learns the normal behavior from each reading made by the SNMP-agent and combines the information using a Bayesian network that could trigger a local corrective action or a message to a centralized network manager." In a similar approach presented by Phuan and Yufang [19], an intelligent mobile agent extracts data from a network element using a local high-bandwidth communication session, without consuming network resources and while reducing the overall communication traffic. The intelligent mobile agent can integrate knowledge from a network manager and any network element to infer which type of fault recovery will be necessary.

The algorithm proposed in this paper differs from the solutions described earlier in that Agent NetReconf executes network failure recovery using only the local knowledge at each router, without having to know the network topology or the type of faulty element (router or link), and it is platform independent.

2. Agent Based Router

In order for network failure recovery to happen at the exact location where an element failed, the routing elements in the vicinity must take an active role in the detection and containment of the fault. As mentioned earlier, network fault detection and recovery are commonly implemented so that a central network monitoring station launches all the corrective actions from a remote site, as seen in [8, 6, 19]. Only a few implementations, such as those described in [1, 5], make the routers adjacent to the failure participate in the restoration of connectivity.

In this section, the authors propose an agent based router in which the detection and reconfiguration tasks are performed by a group of intelligent agents.
The agents are goal oriented and capable of incorporating new knowledge learned during the router operation and network reconfiguration.

In essence, the new router is an active intelligent network device capable of reacting and adjusting its operation based on the events that occur in its internal and external environment.

2.1. Architecture

The architecture of the new intelligent router, shown in Figure 1, is based on a high-speed crossbar switch with an enhanced embedded software module that contains an agent subsystem. For simplicity, the agent platform will not be specified.

The router hosts a community of agents that are responsible for controlling the router's activities and coordinating all the tasks involved in the dynamic reconfiguration of routing tables when the router participates in the recovery of a failure. The knowledge used by the agents to represent the router, links, neighbors and the execution parameters of the fault-tolerant reconfiguration algorithm is saved in the agent's main memory. The structural representation of the knowledge is defined using ontology classes written in the OWL Web Ontology Language [2].

The agents operating the router are defined as follows:

1. Node Manager Agent. This agent oversees the operation of the router and the other agents. The node manager is the router's public interface, which can be used by network administration tools, visiting explorer agents, neighbor routers and other external network elements to communicate with the router. The manager agent is also responsible for the security and integrity of the router; it supervises all accesses to the routing tables and memory, and makes sure that all requests made to it are safe. The node manager agent is the
  3. 3. only component in the router that can initiate a reconfiguration task. The node manager agent uses a reinforcement learning method to acquire new knowledge to make better decisions during node management and fault recovery.

[Figure 1. Agent based router architecture. Labels: Input Ports 0..N−1, N×N Crossbar, Output Ports 0..N−1, Arbitration, Routing Decision, Routing Tables, Node Manager Agent, Router Agent, Link Manager Agent.]

2. Router Agent. It is the only agent in the new architecture that can manipulate the routing tables and has the capability of accepting or declining updates. The agent behavior is determined by the inherent routing algorithm and the dynamic reconfiguration policies. As seen in Figure 1, the router's arbitration and routing decision logic are controlled by this agent. The router agent reacts only to requests from the node manager agent.

3. Link Manager Agent. Responsible for managing the router's connected links, ports and queues. The agent is in charge of detecting and reporting failures and congestion to the node manager. The agent uses a reinforcement learning model to learn the characteristic symptoms that appear before a failure or congestion takes place; this allows the agent to choose the appropriate corrective actions and promptly trigger a restoration task. The agent uses the "I'm alive" message model to determine failures and the flow-unaware statistical delay method described in [13] to accurately determine packet delays without depending on the dynamic information of the packet flow.

4. Explorer Agent. These agents are dynamically created in each router when Agent NetReconf is executed. When an explorer agent is working in search mode, it cooperates with other agents to build a restoration spanning tree that will re-connect the nodes disconnected by the failure. When an explorer agent is working in restoration mode, it collaborates with the node manager agents at each router on the restoration tree to update the local router tables.
An explorer agent is a delegate of the router that created it, such that any interaction between two different agents is equivalent to the two routers interacting directly point-to-point.

3. Network Failure Recovery

3.1. Agent NetReconf

This section describes a new dynamic network reconfiguration algorithm, Agent NetReconf. The algorithm uses a set of collaborative agents to restore network connectivity after a failure is detected. Agent NetReconf is a distributed intelligent algorithm that operates at the network level without any global information about the network topology.

The strategy used by Agent NetReconf consists of identifying the set of nodes adjacent to a failure and selecting from them a leader that coordinates the construction of a restoration spanning tree and synchronizes the updates to the routing tables at each node on the restoration tree.

The complete reconfiguration process consists of four phases: Leader Selection, Restoration Tree Construction, Reconfiguration Synchronization and Tables Update. The correct execution of these phases is subject to the validity of the following assumptions:

Assumption 3.1 After a failure F is detected, no additional failures will occur on any link or node that belongs to the restoration tree until Agent NetReconf finishes the reconfiguration process for F.

Assumption 3.2 The network is not partitioned as a result of the failures.

Before describing each phase in detail, for clarity, consider R to be the set of all routers in the network, where each router Ri is connected to N other routers, its immediate neighbors. Also let Sij be the collection of IDs of all routers that are two hops away from Ri via link Lj. Additionally, assume that each Lk is monitored and managed by one of the link manager agents (LMk).
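As a rough illustration of the notation above, the sketch below stores, for one router, its immediate neighbors per link and the two-hop sets Sij. The class name `RouterKnowledge`, the helper `build_knowledge` and the four-node adjacency map are assumptions made for this example only; the paper keeps this knowledge in OWL ontology classes and does not prescribe a concrete layout. The adjacency map is used here only to fill in the sets; the algorithm itself does not need a global view of the topology.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RouterKnowledge:
    """Local knowledge held at one router Ri (hypothetical layout)."""
    router_id: str
    # link index j -> ID of the immediate neighbor reached over Lj
    neighbors: Dict[int, str] = field(default_factory=dict)
    # link index j -> Sij, the IDs of routers two hops away via Lj
    two_hop: Dict[int, Set[str]] = field(default_factory=dict)


def build_knowledge(router_id: str, adjacency: Dict[str, List[str]]) -> RouterKnowledge:
    """Fill in neighbors and Sij for one router from an adjacency map.

    The position of a neighbor in adjacency[router_id] is taken as its
    link index (an assumption of this sketch).
    """
    knowledge = RouterKnowledge(router_id)
    for link, neighbor in enumerate(adjacency[router_id]):
        knowledge.neighbors[link] = neighbor
        # Routers reached through the neighbor, excluding Ri itself.
        knowledge.two_hop[link] = set(adjacency[neighbor]) - {router_id}
    return knowledge


# Hypothetical star topology: R is connected to A, B and C.
ADJACENCY = {"R": ["A", "B", "C"], "A": ["R"], "B": ["R"], "C": ["R"]}
knowledge_a = build_knowledge("A", ADJACENCY)
print(knowledge_a.neighbors)           # {0: 'R'}
print(sorted(knowledge_a.two_hop[0]))  # ['B', 'C']
```

In this toy topology, if R fails, router A already knows from its two-hop set which other routers were attached to R, which is the information the leader selection step relies on.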
At each router Ri, the link manager LMk that detects missing "I'm alive" messages from link Lk immediately notifies the Node Manager Agent (NMi) by raising the asynchronous NetworkFailureDetected event.

Leader Selection. After the failure is detected by router Ri, the node manager NMi suspends the traffic targeting Lk, the link leading to the presumed faulty node. From Sik, NMi selects the ID with the highest value and records it in memory as the ID corresponding to the Restoration Leader (RLF). If the selected ID equals Ri's ID, then Ri becomes the leader and immediately starts Phase 1. Otherwise, when the selected ID does not match Ri's, the router starts timer
  4. 4. Tstart and waits for a control signal from RLF that indicates that the node can join Phase 1. If Tstart times out and no signal from RLF was received, Ri marks RLF faulty and starts the leader selection again.

Definition 3.1 Node Adjacent to Failure (NAF). It is a node that was not selected "Restoration Leader" and was directly connected to a node or link that failed.

Phase 1. Restoration Tree Construction

The first step in Agent NetReconf is to build a restoration tree to establish a communication path between the leader and the NAFs.

Step 1a. Begin Phase

Phase 1 starts with the Restoration Leader RLF (RLF = Ri) creating one explorer agent Eij per active link Lj. Eij is initialized in search mode and is provided with the list of disconnected NAFs. Eij makes Ri its home and starts the search for NAFs by migrating to the neighbor connected to Lj.

After all Eij have migrated out of the leader node, RLF starts timer Tack and waits for the arrival of control signals confirming that a restoration path was found between RLF and each NAF.

Step 1b. Searching for NAFs

As the explorer agent Eij arrives at a node Rx, it adds the ID of the visited node to the restoration path it is building. Eij exchanges information with the current node and uses this information to define an itinerary for its next migration.

If the explorer agent did not arrive at a NAF, then it uses the information to create clones of itself to help it continue searching. The itinerary and the number of clones are based on the number of active links and the available feasible routes to the NAFs. For example, in Figure 2, explorer EH3 learns from RE that there are two active links, L0 and L3, and one feasible route via L3. NAFs {C,D} are presumed to be reachable through L3 and {A,B} will need to be searched via L0. This implies that at least two clones are required. However, since RE is not a NAF then EH3 can continue searching.
Therefore, only one clone is required for the next migration.

When Eij arrives at a NAF, the explorer agent removes Rx from the list of NAFs and tells NMx to save the restoration path Eij traveled. Then, NMx stops Tstart and creates a restoration explorer agent ERxi, which it sends back to the restoration leader Ri to confirm that the restoration path was found. Although Eij reached a NAF, the search needs to continue for the remaining NAFs in the list. Eij then creates clones and their itineraries following the same criteria mentioned before. Each clone then continues the search. Meanwhile, Eij stays at Rx and starts timer Tphase3 to wait for a signal from RLF to start Phase 3. The case in which Tphase3 times out represents a situation in which a failure might occur during reconfiguration. However, based on Assumption 3.1, this will not occur.

Cycles are prevented in the restoration paths by deactivating an explorer agent when it arrives at a node that has already been visited by itself, one of its clones or one of its siblings.

In order to distinguish between node and link failures, Agent NetReconf uses explorer agents as follows. If a NAF receives an Eij from a node that is assumed to be faulty, then a link failure is identified; therefore the NAF must update its reconfiguration information for the node and mark it safe. In the case in which the two nodes at the ends of the faulty link have both determined that they are restoration leaders for the link failure, the nodes must be synchronized so that only one leader remains. The synchronization occurs when both nodes receive an explorer agent from each other, Eij and Eyj. The restoration leader for the faulty link will be the parent of the explorer agent that has the highest ID value. For example, Ry, parent of Eyj, becomes the restoration leader for the failed link and node Ri becomes a NAF.
After the leader synchronization has occurred, Agent NetReconf continues with the reconfiguration.

Step 1c. Establishing Tree

At each node Rj that is on the path followed by ERxi, NMj marks the links on which ERxi arrives and departs as members of the restoration tree. Furthermore, if NMj detects that a different ERxy, from leader Ry, has already visited the node, then to avoid any conflicts with the reconfiguration, it gives ERxi the information about Ry, so that when ERxi gets to Ri, Ri can synchronize with Ry before it proceeds with Phase 3. ERxi continues migrating until it reaches the restoration leader.

When Tack times out at the restoration leader, RLF determines which NAFs did not reply with an ERxi in order to mark them faulty and exclude them from the reconfiguration. The restoration leader continues and builds the restoration tree by merging each root of the confirmed restoration paths. After the restoration tree is completed, each ERxi sends a point-to-point Restoration Tree Built (RTB) message signal to its parent.

Definition 3.2 Node On Restoration Tree (NORT). It is a node that has at least one link belonging to the restoration tree.

Phase 2. Multiple Failure Synchronization

When multiple failures appear, Agent NetReconf establishes an ordered sequence of priorities between the restoration leaders detected by the visited NORTs, such that the reconfigurations occur in a "safe" sequence in which the restoration leader with the highest ID always executes Phase 3 first, while the others await their turn. For example, if we assume that Ry's ID is higher than Ri's, then Ry will proceed to Phase 3 before Ri.
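The priority rule of Phase 2 can be sketched as follows. This is only an illustration under stated assumptions: the function names `may_start_phase3` and `safe_sequence`, the use of integers as router IDs, and the simplification that all listed leaders have mutually intersecting restoration trees are choices made for this example, not part of the paper.

```python
from typing import Iterable, List, Set


def may_start_phase3(leader_id: int,
                     intersecting_leaders: Iterable[int],
                     completed: Set[int]) -> bool:
    """A leader may enter Phase 3 once every intersecting leader with a
    higher ID has completed its own Phase 3."""
    return all(other in completed
               for other in intersecting_leaders
               if other > leader_id)


def safe_sequence(intersecting_leaders: Iterable[int]) -> List[int]:
    """Return the order in which mutually intersecting leaders run Phase 3."""
    leaders = set(intersecting_leaders)
    completed: Set[int] = set()
    order: List[int] = []
    while len(completed) < len(leaders):
        # Leaders that are still waiting in Phase 2 and are allowed to go.
        ready = [leader for leader in leaders - completed
                 if may_start_phase3(leader, leaders, completed)]
        current = max(ready)
        order.append(current)
        # Finishing Phase 3 is what releases the lower-priority leaders.
        completed.add(current)
    return order


print(safe_sequence({3, 7, 5}))  # [7, 5, 3]
```

Because the order depends only on the IDs, every leader reaches the same sequence from local information, and the leader with the lowest ID is never blocked forever: each completion unblocks the next one in line.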
  5. 5. Phase 3. Routing Information Update

This phase starts with a NAF processing an incoming RTB message and providing new routing information to the awaiting Eij. After the information exchange finishes, the explorer agent starts migrating back to RLF using the acknowledged restoration path. As Eij travels back to RLF, the node manager of each visited node exchanges routing information with Eij and, if necessary, updates its routing tables. Eij continues migrating until it reaches RLF. The information given to Eij by the NAF, and by each visited node, includes the IDs of all destinations that are reachable through each of these nodes using links that do not belong to the restoration tree.

Upon arrival at the restoration leader, Eij delivers to RLF the routing information it collected. RLF processes the data to adjust its routing tables and deactivates Eij. When RLF completes the update, it provides each ERxi with the IDs of all the destinations reachable through its active links, excluding the link on which the ERxi arrived, and then ERxi migrates to its parent NAF. After all restoration explorers have migrated, RLF starts a timer Tcomplete to wait for a confirmation signal from each NAF indicating that the updates were completed and that they are ready to resume operations. The case in which Tcomplete times out represents a case similar to that described earlier and will be dealt with in a future publication.

As ERxi travels back, the node manager of each visited node exchanges routing information with ERxi and, if necessary, updates its routing tables. ERxi continues traveling until it reaches its parent NAF.
The routing information provided by the visited node includes the IDs of all the destinations reachable through the visited node using the links that belong to the restoration tree, with the exception of the IDs of nodes accessible via the links on which ERxi arrives at and leaves the visited node.

Upon arrival of ERxi, Rx updates its routing tables with the information contained in the restoration explorer and ERxi is deactivated. The NAF sends RLF a point-to-point Update Complete Response (UCR) message signal. When RLF receives the UCR signal, it stops Tcomplete and resumes normal operations.

The reconfiguration algorithm, as described, makes the most of the ability of the agents to interact with each other. Communications between the explorer and node manager agents are performed mostly within the router's agent module; only very few leave the router and happen in point-to-point form. This is an important contribution of Agent NetReconf because it keeps the algorithm execution distributed at each router and keeps to a minimum the bandwidth overhead and the number of links preempted for the reconfiguration to work.

Agent NetReconf bases its execution on the natural ability of autonomous agents to acquire and share knowledge. For instance, when the explorer agents are searching for NAFs, they learn information at each node that helps them design an optimal migration pattern that reduces network flooding significantly.

4. Properties of Agent NetReconf

4.1. Complexity

The complexity of Agent NetReconf is analyzed in terms of the number of explorer agents created during restoration tree construction and routing table reconfiguration.
Let LActive be the number of active links on each router, nfin the number of NAFs for failure F, and P a path between RLF and a NAF.

Theorem 4.1 The complexity for Agent NetReconf for multiple failures is given by

O(LActive * ((nmax * Pmax) + 1))

where Pmax is the longest path connecting RLF and any NAF and nmax is the maximum number of NAFs.

Proof: Agent NetReconf determines RLF without creating explorer agents, such that leader selection is achieved with O(0) complexity.

In Phase 1, when the recovery step initiates, RLF creates LActive exploration agents Eij, one per active link. The corresponding complexity for this operation is O(LActive). As an explorer agent migrates searching for target NAFs, the maximum number of explorer agents created at the visited node Rx as described in Phase 3 is LActive − 1. In cases where Rx is a NAF, Rx creates one exploration agent for recovery, ERxi, such that the maximum number of explorer agents created at an intermediate router is LActive.

Now, considering that the longest restoration path between RLF and a NAF is Pmax, the total number of explorer agents needed to continue searching for a NAF is Pmax * LActive. Considering the worst case in which each NAF is reached via a disjoint restoration path, the total number of explorers created is given by nmax * Pmax * LActive.

Assuming that all restoration trees intersect, Phase 2 is executed independently for each RT without creating any agents, which results in O(0) complexity.

Then, by adding the number of agents created by the restoration leader, the complexity of Agent NetReconf becomes O((nmax * Pmax * LActive) + LActive), or O(LActive * ((nmax * Pmax) + 1)). Q.E.D.

Now, by comparing O(LActive * ((nmax * Pmax) + 1)) with the complexity of NetRec in [1], which is O(N * (L + nmax * Pmax + N * Pmax)), it is clear that Agent NetReconf reduces the complexity of NetRec by more than one order
  6. 6. of magnitude. This is explainable by the fact that LActive is expressed in terms of the number of active links instead of the total number of links in the network. The improvement presented here is possible because in Agent NetReconf the agents use their knowledge to make inferences and execute actions that otherwise, in standard NetRec, would require several point-to-point message exchanges. This, in fact, is a powerful feature of agent based systems, as mentioned in [14].

4.2. Termination

The following agent migration patterns and message delivery properties are used for proving Agent NetReconf's termination.

Definition 4.1 If a point-to-point message is sent from a source agent S to a destination agent D, then it will be received once and only once by D.

Definition 4.2 Every point-to-point message sent between an exploration agent Eij or ERxi and a node manager agent NMx will be routed following a path on the restoration tree and will be reliably delivered to its destination.

Definition 4.3 The restoration leader RLF considers an arriving ERxi to be the acknowledgment sent from a NAF to confirm that a restoration path has been created.

Definition 4.4 The restoration leader RLF considers a returning Eij to be the acknowledgment sent by a NAF to confirm that a restoration tree was established and the request to update its routing tables with the information carried by Eij.

Definition 4.5 A NAF considers a returning ERxi to be the acknowledgment sent by the RLF that it updated its routing information and that the NAF must update its table with the new information carried by ERxi.

Lemma 4.1 For a given faulty node F, all NAFs will elect the same RL.

Proof: We prove by contradiction. Suppose that two NAFs elect different RLs. Since the router with the highest ID among the NAFs is elected RL, these two NAFs must have used different NAF sets.
However, all NAFs are two hops from each other through F, and by definition each NAF knows its own ID and the IDs of all routers that are two hops away from it. Thus, the NAF sets determined by the NAFs cannot be different, which contradicts the supposition. Q.E.D.

Lemma 4.2 For a given fault F, the RLF and all the NAFs will successfully establish a restoration tree rooted at RLF such that Agent NetReconf can start the reconfiguration step.

Proof: According to Lemma 4.1, all non-faulty NAFs will elect the same RLF. Phase 3 and Def. 4.3 assure that a NAF is reached by RLF and that the restoration path is established. By sending a Restoration Tree Built (RTB) message, as described in Phase 3, it is guaranteed that a NAF is notified that the restoration tree was established. Def. 4.1 and 4.2 assure that this point-to-point message is delivered to its destination reliably. Finally, Def. 4.4 assures that both RLF and NAFs receive the routing information describing the restoration tree. Therefore the restoration tree is reliably established. Q.E.D.

Lemma 4.3 For a given failure, all NAFs, NORTs and RLF successfully update their routing tables and Agent NetReconf execution terminates.

Proof: Since Lemma 4.2 assures that the restoration tree is reliably established, then from Phase 3, it is assured that new routing information is collected by the explorer agents. Def. 4.4 assures that RLF receives the new information and updates its table before any NAF. Def. 4.5 guarantees that the NAFs receive new information after RLF completes its updates. Phase 3 makes sure that RLF knows that a NAF finished updating and that it is ready to resume operations. Q.E.D.

Lemma 4.4 All the explorer agents Eij and ERxi deactivate.

Proof: By Def. 4.4, an Eij explorer returns home after the restoration tree RTF has been established. Phase 3 assures that Eij deactivates after the RLF updates its routing information. Similarly, Def. 4.5 assures that ERxi returns home and deactivates after the NAF updates its table.
In addition, Phase 3 assures that the Eij that were created and never reach a NAF will deactivate. Q.E.D.

Lemma 4.5 In the presence of multiple intersecting restoration trees, none of the intersecting RLs will remain forever in Phase 2.

Proof: The goal of Phase 2 is to ensure that at any given time only RLs with non-intersecting restoration trees will be executing Phase 3, in which the routing information is updated. In the cases of consecutive failures and simultaneous disjoint failures, this is always true, so Phase 2 is skipped and the RLs will proceed to Phase 3 independently from each other. If there are simultaneous failures with intersecting restoration trees, then their RLs must establish such order, which results in a sequence of temporally disjoint reconfigurations around single failures or simultaneous disjoint failures.

For each two intersecting restoration trees there is at least one joint node, which detects the intersection. This
  7. 7. guarantees that at least one of the RLs in each intersection will be notified about it. The temporal order is established by the intersecting RLs based on their node IDs - nodes with higher IDs have higher priority. All lower priority RLs will wait in Phase 2 until all higher priority RLs have completed Phase 3. Following the algorithm, after completing Phase 3, each RL notifies all lower-priority RLs, which allows the next leader in the temporal order to execute Phase 3. Thus, all leaders that were waiting in Phase 2 will eventually receive the required synchronization messages that allow them to proceed to Phase 3. Q.E.D.

Theorem 4.2 On all nodes Agent NetReconf will successfully complete in the presence of multiple failures, i.e. Agent NetReconf will terminate and the nodes adjacent to the failures will be reachable.

Proof: Based on Lemmas 4.1 - 4.5, it can be concluded that the RLF and the NAFs will proceed with all phases of Agent NetReconf and will generate the required explorer agents to carry out the establishment of the restoration tree and the reconfiguration of each node (RLF, NAFs and NORTs) on the tree. Q.E.D.

4.3. Liveness

This section proves that on completion of Agent NetReconf the network will be reconfigured appropriately.

Theorem 4.3 On completion of Agent NetReconf, all connected nodes in the network are reachable.

Proof: The appearance of a failure causes all the paths that go through the faulty link or node to be bisected. The results are segments of unreachable nodes where each segment begins with a NAF. By Assumption 3.2, the network is not partitioned, such that all connected nodes are reachable through non-faulty physical paths. Lemma 4.2 assures that all the NAFs and NORTs are reachable through a spanning tree rooted at the NAF acting as restoration leader.
During the recovery phase, Lemma 4.3 guarantees that all the nodes on the restoration tree have their routing tables updated in a way such that all the faulty segments are replaced with restoration paths. Theorem 4.2 demonstrates that Agent NetReconf will terminate for any single failure by executing a "safe" sequence of reconfigurations that are performed synchronously and coordinated by the restoration leader. Q.E.D.

4.4. Safety

The goal of this section is to define and prove the safety property of Agent NetReconf, namely, avoidance of infinite loops and cyclic dependencies.

Theorem 4.4 Agent NetReconf does not create infinite loops or cyclic dependencies.

Proof: Cyclic dependencies among the nodes on the restoration tree will not be created, because Step 3.1 prevents any explorer agent Eij in search mode from either returning to the RLF or continuing to explore if the currently visited node was already visited by another Eij from RLF. Lemma 4.5 proves that no restoration leader will be blocked forever in Phase 2. Likewise, cyclic dependencies between the RLs cannot arise, because they are resolved by always giving priority to the nodes with higher ID or nodes that are already in Phase 3.

In the presence of multiple failures, the RLs will enter Phase 3 in the priority order that was established in Phase 2, i.e., at any time only RLs with disjoint restoration trees are permitted to concurrently execute Phase 3. Therefore, cyclic dependencies cannot be formed between the RLs. The RL-NAF relations are based on a strict request-response model, so there are no cyclic dependencies between them. Since all possible faulty NAFs have been isolated from the restoration tree in Phase 1 and all reconfiguration messages are reliably delivered, all loops in Phase 3 will terminate after the corresponding messages are received. Q.E.D.

4.5. Cognitive Properties

Having autonomous mobile agents execute the algorithm in parallel at each router reduces the required point-to-point interactions between the restoration leader and the NAFs. For instance, two agents exchange point-to-point messages only when necessary; otherwise they work with the knowledge that exists at each node and the knowledge they acquire from other agents during the construction of the restoration tree or the reconfiguration phase.

Having agents execute the recovery algorithm keeps the knowledge of a failure close to where it happened instead of widely spreading the information to other elements that are oblivious of the fault. Also, with agents, more intelligent interactions occur between routers. For example, the manager NMi at RLF knows that the arrival of an ERxi is the confirmation that the NAF is alive and that the path followed by an Eij is the desired restoration path. Similarly, if an ERxi returns home, the NAF knows that the restoration leader has completed updating its routing information and that it is its turn to do the same.

The lower complexity of Agent NetReconf allows the algorithm to scale because it only involves a small number of links, as was proved in Section 4.1.

In Agent NetReconf, an explorer agent represents more than one message type of those used in message based algorithms such as [1, 5], and without oversimplifying, an
agent is considered a smart message that has cognitive and evolutive capabilities.

These cognitive properties allow the reconfiguration algorithm to execute faster, because the agents retrieve the information from the knowledge base at the router and do not have to wait for a synchronous acknowledgment from any router. The use of agents in the reconfiguration algorithm helps reduce the number of message exchanges and the number of links used in the reconfiguration, and allows an agent to make an optimal selection of the link that leads to the next node.

5. Examples of Failure Recovery

5.1. Node Failure Recovery

To illustrate the behavior of Agent NetReconf in recovering from a node failure, consider that router R fails on the network shown in Figure 2. After a TIamAlive timeout expires, routers {A, B, C, D, E, H} detect the failure F. Each router then becomes a Node Adjacent to Failure (NAF), and in parallel they start selecting a restoration leader RLF.

Figure 2. Node failure recovery

Phase 0. In D, NMD queries SD1, its knowledge base, and determines that router H has the highest ID among the others that are two hops away via link L1. Similarly, {A, B, C, D} select H as RLF and then become NAFs.

Phase 1. At H, NMH creates three explorer agents, EH0, EH1 and EH3, one per active neighbor. Each agent learns the list of NAFs and starts migrating, searching for NAFs. Consider EH3. When the explorer arrives at RE, it learns that there are two active links, L0 and L3, and one feasible route via L3. NAFs {C, D} are presumed to be reachable through L3, and {A, B} will need to be searched via L0. This implies that at least two clones are required. However, since RE is not a NAF, EH3 can continue searching. As each explorer reaches a NAF, a restoration explorer is sent to RLF.
At RLF, when ERAH, ERBH, ERCH and ERDH arrive, the restoration tree is considered built; it is shown with black lines in Figure 2.

Phase 2. Since there are no overlapping restoration trees, the agents move to the next phase.

Phase 3. Each ERxi sends a point-to-point RTB message back home to make each Eij return to RLF. On its way back, each Eij learns routing information that it later shares with RLF.

Table 1. Router D, original table
Dest  Port    Dest  Port
A     1       F     0
B     1       G     1
C     2       H     1
D     -       R     1
E     0

Table 2. Router D, updated table
Dest  Port    Dest  Port
A     0       F     0
B     0       G     0
C     2       H     1
D     -       R     1
E     0

When all Eij have arrived, RLF determines the destinations that can be reached through its active links and gives each ERxi a list from which it excludes the destinations reachable through the port on which ERxi came in. ERDH, for example, will be provided with {A, B, F, G}. On its way home, each node visited by ERDH provides the destinations reachable through links belonging to RTF, excluding those reachable through the links on which ERDH arrived at and departed from the node. When ERDH gets home, it asks NMD to update its routing tables with the information that it is carrying. After NMD finishes updating its table, it sends a point-to-point UCR confirmation to RLF. The table for router D after the reconfiguration is complete is shown in Table 2.

5.2. Link Failure Recovery

The following example illustrates the behavior of Agent NetReconf in recovering from a link failure. Assume that the link connecting routers J and K fails in Figure 3. After the TIamAlive timeout expires, routers J and K start the leader selection phase, and each router assumes that its neighbor at the other end of the link has failed.

Phase 0. During leader selection, router J is selected restoration leader RLJ by routers {A, C, D}. Likewise, router K is selected restoration leader RLK by routers {E, G, H, I}.
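
The Phase 0 rule described for the node failure example in Section 5.1, where each router queries only its local knowledge base for the highest ID two hops away, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function and variable names are hypothetical, router IDs are assumed to be ordered as plain strings, and each router is assumed to include its own ID in the comparison so that H also selects itself.

```python
# Sketch of Phase 0 (leader selection) for the node failure example.
# All names are hypothetical; router IDs are compared as plain strings.

def select_restoration_leader(own_id, two_hop_ids):
    """Return the highest ID among this router and the routers it knows
    to be two hops away through the failed link (local query only)."""
    return max(set(two_hop_ids) | {own_id})

# Routers that detected the failure of R in the Section 5.1 example.
detecting_routers = {"A", "B", "C", "D", "E", "H"}

# Assumed local knowledge: each router sees the other detecting routers
# as two hops away via its link to the failed router R.
local_view = {r: detecting_routers - {r} for r in detecting_routers}

# Every router evaluates the rule on its own, without exchanging messages.
leaders = {r: select_restoration_leader(r, local_view[r])
           for r in sorted(detecting_routers)}
print(leaders)
```

Under these assumptions every detecting router reaches the same choice independently, which is consistent with the statement that the selection needs no global knowledge of the topology.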
Figure 3. Link failure recovery

Phase 1. At J, four explorer agents are created: EJ1, EJ2, EJ3 and EJ4. At K, three explorer agents are created: EK0, EK1 and EK3. To start building the restoration paths, the explorers from each leader start migrating to search for the NAFs known to that leader. In the search process, explorer agents EK3 and EJ4 arrive at restoration leaders RLJ and RLK, respectively. With the arrival of the explorers, both leaders realize that the router they presumed failed is in fact alive. Both leaders mark the link that connected them as faulty and proceed to determine the new role of the supposedly faulty node in this phase. Router J determines that router K's ID is higher and becomes a NAF belonging to RLK. Router J then issues a point-to-point deactivate message to all its explorers to indicate that it is no longer the leader; see the pseudo-code in Appendix A. After J assumes its new role, Phase 1 continues as described in Section 3.1. Note that EK3 stays at J, since J became a NAF.

Phase 2. Since there are no overlapping restoration trees, the agents move to the next phase.

Phase 3. Each ERxi sends a point-to-point RTB message back home to make each EKj return to RLK. On its way back, each EKj learns routing information that it later shares with RLK. Phase 3 continues as described in Section 3.1 to the end. The table for router J after the reconfiguration is complete is shown in Table 4.

6. Conclusions

This paper has presented Agent NetReconf, a dynamic network reconfiguration algorithm that uses collaborative agents.
It was proved by complexity analysis that Agent NetReconf is significantly more efficient than message-based algorithms [1, 5], and that it reduces by more than one order of magnitude the number of interactions and message exchanges required to perform the network reconfiguration, as was explained in Section 4.1.

Table 3. Router J, original table
Dest  Port    Dest  Port    Dest  Port
A     0       F     3       SB    2
B     2       G     4       SD    0
C     0       H     2       SE    1
D     0       I     3       SF    3
E     1       SA    0

Table 4. Router J, updated table
Dest  Port    Dest  Port    Dest  Port
A     4       F     3       SB    2
B     2       G     4       SD    4
C     4       H     2       SE    1
D     4       I     3       SF    3
E     1       SA    4

The improvement in complexity achieved in Agent NetReconf is based on the fact that all the agent interactions occur at each router and the number of point-to-point communications that leave a router is minimal.

Another important, but not obvious, contributor to Agent NetReconf's reduction in complexity is the representation of agent knowledge as an OWL ontology. Using OWL dramatically simplifies the way in which agents exchange information. For example, during Leader Selection an agent only has to query the router's knowledge base, specifying that it needs to know the neighbor with the highest ID that is two hops away. Querying the OWL knowledge base is executed in constant time and does not require any agents to be created, so its contribution to the communication complexity is zero. This is mainly because the queries are executed locally and never leave the current router.
This last property assures that there is no need for the agents, or for Agent NetReconf, to use any global network information.

The combination of the agent-based architecture and Agent NetReconf represents an important contribution to active networking, because the network takes control of all its tasks and uses intelligence as a way to provide improved reliability and quality routing.

The cognitive properties of the agents allow the reconfiguration algorithm to execute faster, because the agents retrieve the information from the knowledge base at the router and do not have to wait for a synchronous acknowledgment from any other router. This facilitates the optimal selection of the link that leads to the next node during the reconfiguration.

To conclude, Agent NetReconf is a low-complexity, intelligent, distributed dynamic network reconfiguration algorithm that is applicable to network computers with arbitrary topologies, is application-transparent, and is capable of isolating and tolerating multiple faulty links or nodes.
References

[1] D. Avresky and N. Natchev. Dynamic Reconfiguration in Computer Clusters with Irregular Topologies in the Presence of Multiple Node and Link Failures. IEEE Transactions on Computers, 55(2), May 2005.
[2] N. Bennacer, Y. Bourda, and B. Doan. Formalizing for Querying Learning Objects Using OWL. In Proceedings of IEEE International Conference on Advanced Learning Technologies, pages 321–325, 2004.
[3] G. D. Caro and M. Dorigo. Mobile Agents for Adaptive Routing. In Proceedings of 31st International Conference on System Sciences (HICSS-31), 1998.
[4] H. Chalupsky, T. Finin, R. Fritzson, D. McKay, S. Shapiro, and G. Weiderhold. An Overview of KQML: A Knowledge Query and Manipulation Language. Technical report, KQML Advisory Group, Apr. 1992.
[5] J. Duato, R. Casado, A. Bermúdez, and F. J. Quiles. A Protocol for Deadlock-Free Dynamic Reconfiguration in High-Speed Local Area Networks. IEEE Transactions on Parallel and Distributed Systems, 12(2):115–132, February 2001.
[6] M. Garijo, A. Cancer, and J. Sanchez. A Multi-Agent System for Cooperative Network-Fault Management. In Proceedings of the First International Conference and Exhibition on the Practical Applications of Intelligent Agents and Multi-agent Technology, pages 279–294, 1996.
[7] M. Heusse, S. Guérin, D. Snyers, and P. Kuntz. Adaptive Agent-Driven Routing and Load Balancing in Communication Networks. Complex Systems, 1998.
[8] C. S. Hood and C. Ji. Intelligent Agents for Proactive Fault Detection. IEEE Internet Computing, 2(2):65–72, March–April 1998.
[9] N. Minar, K. H. Kramer, and P. Maes. Cooperating Mobile Agents for Mapping Networks. In Proceedings of the First Hungarian National Conference on Agent Based Computation, 1999.
[10] H. S. Nwana. Software Agents: An Overview. Knowledge Engineering Review, 11(3):205–244, Oct./Nov. 1995.
[11] R. Schoonderwoerd, O. E. Holland, J. L. Bruten, and L. J. M. Rothkrantz. Ant-Based Load Balancing in Telecommunications Networks. Adaptive Behavior, 5(2):169–207, 1996.
[12] D. L. Tennenhouse, J. M. Smith, W. D. Sincoskie, D. J. Wetherall, and G. J. Minden. A Survey of Active Network Research. IEEE Communications Magazine, 35(1):80–86, 1997.
[13] S. Wang, D. Xuan, R. Bettati, and W. Zhao. A Study of Providing Statistical QoS in a Differentiated Services Network. In NCA'03, Proceedings of IEEE International Symposium on Network Computing and Applications, pages 0297–0304, 2003.
[14] G. Weiss. Multi Agent Systems, A Modern Approach to Distributed Artificial Intelligence. MIT Press, 2001. ISBN: 0-262-23203-0.
[15] T. White, A. Bieszczad, and B. Pagurek. Distributed Fault Location in Networks Using Mobile Agents. In IATA1998, Proceedings of the Second International Workshop on Intelligent Agents for Telecommunication, volume 1437, 1998.
[16] M. J. Wooldridge. The Logical Modeling of Computational Multi-Agent Systems. PhD thesis, University of Manchester, 1992.
[17] M. J. Wooldridge and N. R. Jennings. Intelligent Agents: Theory and Practice. Knowledge Engineering Review, 10(2):115–152, June 1995.
[18] Y. Yemini and S. daSilva. Towards programmable networks. In Proceedings of IFIP/IEEE International Workshop on Distributed Systems: Operations and Management, 1996.
[19] P. Zhang and Y. Sun. A New Approach Based on Mobile Agents to Network Fault Detection. In ICCNMC'01, Proceedings of the International Conference on Computer Networks and Mobile Computing, 2001.