Ft nmdoc



IMPLEMENTATION OF A FAULT TOLERANT NETWORK MANAGEMENT SYSTEM USING VOTING

High Performance Computing and Simulation Research Lab
Department of Electrical and Computer Engineering
University of Florida

Dr. Alan D. George and Edwin Hernandez

Abstract

Fault tolerance (FT) can be achieved through replication, and N-Module Redundancy (NMR) systems are widely applied in software and hardware architectures. An NMR software system can have dynamic or static voting mechanisms. Static voters decrease the complexity of the control algorithms but add new performance bottlenecks, whereas dynamic voters have increased control complexity and some other performance bottlenecks. Every Network Management System (NMS) requires replication to increase reliability and availability. The software implementation of a Fault Tolerant Network Management System (FT-NMS) shown here uses the NMR approach. The contributions of the paper relate to the implementation issues of the system and to performance measurements at the protocol and application levels. The resulting measurements lead to the conclusion that the system can reduce the bottlenecks and remain simple by taking into account synchronization and buffer management techniques.

Keywords: Network Management, Fault-Tolerant Distributed Applications, SNMP, NMR.

1. INTRODUCTION

Several local-area and wide-area networks rely on network management information for decision making and monitoring. Therefore, a Network Management System (NMS) has to maintain high availability and reliability in order to complete those tasks. Consequently, the implementation of an NMS has to be fault tolerant and include software replication. Moreover, fault masking and voting are techniques added to the system as a consequence of the replication [Prad97]. In addition, the most prevalent solutions for creating FT server models in distributed systems use replicated servers, redundant servers and event-based servers [Lan98].
If replicated servers are chosen, the system selects between static and dynamic coordination. Static coordination does not require a leader election mechanism, and fault masking using voting is easier to achieve. Dynamic coordination requires protocol overhead, and leader election and voting coordination algorithms have to be included in every transaction. There are several replica control protocols involving leader election. They depend on the quorum size, availability and system distribution, for example: Coterie, Quorum Consensus Protocol, Dynamic Voting, Maekawa's, Grid, Tree and Hierarchical. Those are found in [WU93] and [Beed95]. The algorithms mentioned above introduce overhead and can degrade the performance of the system in terms of communication or processing [SAHA93]. There are some other replica control protocols defined in [Paris94] and [Sing94] in which the location of the replicas of a database, or of the replicas in a network, is minimized. However, the protocol complexity is not avoided.

Triple Module Redundancy (TMR) and masking fault tolerance can be found in [Aro95] and [Brasi95]. A TMR voting protocol requires neither a leader nor leader election protocols, but it requires message broadcasting from one node to all nodes in the group; therefore the protocol complexity lies in message ordering and coordination. Experimental measurements of the TMR nodes yielded a total processing time between 50 and 200 ms using 100 messages of 64 bytes each.

Static voting architectures require high reliability of the network node where the voter resides, because the voter becomes a single point of failure (SPF). However, protocol overhead is light and fault recovery can be achieved almost instantaneously. Those advantages support the use of a voter in high performance networks and the implementation of static TMR systems such as the one presented in this paper.

In addition to the requirements of group communication, the detection of a failure is generally done by combining a heartbeat with a predefined timeout period [Maffe96], [Prad96], [Landi98] or a self-adapted timeout period [Agui97]. The predefined timeout was used in the FT-NMS implemented here.

Nevertheless, there are several other techniques for handling faults in network managers and agents, as defined in [Duar96] with the hierarchical adaptive distributed system-level diagnosis (HADSD). Moreover, monitoring is one of the main tasks of any NMS. For this purpose, a traditional approach was used, in which a set of replica managers is organized in a tree structure, each running at a given network node. The managers are able to monitor a set of agents using Simple Network Management Protocol (SNMP) requests as defined in [Rose90] and [Rose94]. Decentralization and database monitoring, as in [Scho97] and [Wolf91], were not implemented for the application.
This paper is organized in the following manner. Section 2 presents the assumptions used for the experiments and for the design of the system architecture. Section 3 describes the system model, its algorithms and its design. Finally, Section 4 reports performance measurements run on different testbeds such as Myrinet and ATM-LAN.

2. Assumptions

The FT-NMS application relies on a simple heartbeat error-detection mechanism and sender-based message logging to generate monitored information (the replicas, not the voter, are responsible for initiating the communication). A failure is detected when a system stops replying to heartbeats; the system is then considered to have a transient fault. If a transient fault lasts longer than the timeout period, the faulty unit is considered "down" and is no longer able to provide useful information to the system. Basically, a machine fails by crashing or by fail-stop behavior. A TCP/IP environment with naming services has to be available for the replica units. In addition, it is assumed that SNMP daemons are already installed and working correctly in all the agents to be monitored. The data types that can be retrieved from the agent's Management Information Base (MIB) are INTEGER, INTEGER32 and COUNTER, as defined in the ASN.1 standard in [Feit97] and [OSI87]. All nodes should be in the same sub-network to avoid long communication delays between managers; otherwise the timeouts have to be modified. Finally, the heartbeat interval used is one second.

3. System's Model

The system's model consists of two sub-systems: the managers (Section 3.1) and the voter-gateway (Section 3.2). Managers depend on the voter, which also works as a coordinator. All the applications mentioned here are multithreaded and client/server. Moreover, the manager makes use of the Carnegie Mellon University SNMP Application Programming Interface (CMU-SNMP API).

The voter runs in a separate, highly reliable computation node, whereas the replica managers should run at different network nodes. The different manager modules are shown in Figure 1.
[Figure 1. SNMP Manager Application, using the CMU-API to handle SNMP packets for the agents. Components shown: thread for heartbeat listening, SNMP object handler, UDP_echo server, MIB database, CMU API, reliable communication with the gateway, events, local data, UDP sockets, TCP sockets.]

3.1. Manager

The manager application uses the CMU-SNMP API to send snmpget packets to the agents. It has access to a local MIB database (Figure 1), which is handled by the HCS_SNMP object handler (Figure 2). This object is used to support all the different MIBs found in the network management agents. In addition, a simple UDP_Echo server runs concurrently to handle the heartbeat service provided by the voter.

    class HCS_SNMP {
    private:
        struct snmp_session session, *ss;
        struct snmp_pdu *pdu, *response1;
        struct variable_list *vars;
        char* gateway, *community;
        int Port;
        oid name[MAX_NAME_LEN];
        int name_length;
        int HCS_SNMPCommunication(snmp_pdu* response, char** varname, char** value);
    public:
        HCS_SNMP(char* gateway, char* community);
        ~HCS_SNMP();
        int HCS_SNMPGet(char** namesobjid, int number, char** varname, char** value);
    };

Figure 2. Class description of the SNMP Object

The main goal of the FT-NMS is a distributed method for reliable monitoring of the MIBs handled at different agents. The system is designed to keep one heavy-weight process running for each agent (Figure 1). Each heavy-weight process is able to handle 64 simultaneous Object Identifiers (OIDs) from the MIB of any agent. The SNMP API provides all the libraries and services to convert from Abstract Syntax Notation One (ASN.1) to the different data types used in C++. Each manager is designed to read a table of the OIDs that it has to monitor at the agents using polling. In addition, the manager should define a polling strategy and write all the responses to a file. Furthermore, each manager creates and sends to the main voter application a TCP packet, with the format shown in Figure 6, reporting the results gathered from the agents being monitored. The sampling time and the number of samples to monitor are also defined for every OID.

To obtain accurate measurements, the replicas must be synchronized so that each poll is issued concurrently. Otherwise, the value sampled from agent_i at time T0 plus any ∆T will be misaligned with the value sampled from agent_i+1 by another replica. This misalignment may not matter for large sampling intervals, where the sampling time is several orders of magnitude greater than the misalignment. In high performance networks, however, the sampling time must be very small, so synchronization is a priority. Therefore, a Two Phase Commit protocol is used to achieve synchronization between samples and to coordinate groups of replica monitoring applications (Section 3.1.1).

3.1.1. Sample synchronization through a Two Phase Commit Protocol (2PC).

As previously stated, the main weakness of a distributed network management system is time synchronization. Although the Network Time Protocol (NTP) could supply some useful information, its network traffic and lack of precision make it unsuitable for High Performance Networks (HPN). A 2PC protocol is easier to implement and provides the required synchronization. In addition, the implementation of the heartbeat was merged with the 2PC, avoiding any possibility of deadlock (Figure 3.b).

Monitoring at the manager follows the pseudo-code in Figure 4. The manager waits to execute the sampling action (commit) on the agent until the voter delivers the commit packet to all non-faulty managers. The manager then delivers the "SNMP Get" packet to the agent, polling the required information. Finally, the manager transmits the sampled information to the gateway-voter element. Observe that a correction_factor is subtracted from the waiting time between samples. This modification keeps the sampling time accurate and reduces the error introduced by the time spent on synchronization and on round-trip communication with the agent. It also follows from Figure 4 that the minimum sampling time at the manager is given by eq. 1.

$T_{\min} = T_{pc} + T_{snmpget} + T_{msgResponse}$ (eq. 1)
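The correction factor and the bound of eq. 1 can be illustrated with a minimal C++ sketch. It assumes `std::chrono::steady_clock` in place of `gethrtime()`, and the names `minSamplingTime`, `remainingWait` and `sampleLoop` are hypothetical helpers, not part of the FT-NMS sources. The callback `round` stands in for the TwoPC, SNMPGet and SendResponse calls of one iteration.

```cpp
#include <chrono>
#include <functional>
#include <thread>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Lower bound on the sampling period from eq. 1 (inputs are measured values).
constexpr Millis minSamplingTime(Millis t_pc, Millis t_snmpget, Millis t_msg_response) {
    return t_pc + t_snmpget + t_msg_response;
}

// Time left to sleep once the work of one round has been subtracted.
// Clamped at zero: a round that overran its period starts the next one at once.
constexpr Millis remainingWait(Millis sampling_time, Millis correction_factor) {
    return correction_factor >= sampling_time ? Millis{0}
                                              : sampling_time - correction_factor;
}

// Runs n_samples rounds, sleeping only for the part of the period not used by the round.
void sampleLoop(int n_samples, Millis sampling_time, const std::function<void()>& round) {
    for (int i = 0; i < n_samples; ++i) {
        const auto start = Clock::now();
        round();  // assumed to perform 2PC, SNMP GET and the send to the voter
        const auto correction =
            std::chrono::duration_cast<Millis>(Clock::now() - start);
        std::this_thread::sleep_for(remainingWait(sampling_time, correction));
    }
}
```

If the measured round takes longer than the configured sampling time, the wait collapses to zero, which is the situation eq. 1 is meant to rule out when the sampling time is chosen.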
[Figure 3. Two Phase Commit Protocol to achieve synchronization. Panel (a): multithreaded 2PC commit protocol for synchronization between the managers and the voter, in three steps: (1) each manager sends a Request to Commit, which is accepted by a bounded thread at the voter while an LWP monitors the commit; (2) the voter returns the Commit message packet to the managers; (3) each manager issues the SNMP GET. Panel (b): content of the bounded thread, reproduced below.]

    // Thread content
    void* commit(void* sock_desc){
        read(sock_desc, commit_request);
        P(mutex);
        n_commits++;
        V(mutex);
        wait_until(n_commits == n_available_managers);
        write(sock_desc, "GO");
        P(mutex);
        n_commits--;
        V(mutex);
    }

Notes from Figure 3: the timing at the manager is Tpc + Tmsg + Tsnmpget + Interval, and the manager only waits (Interval seconds) before making the next sample. The LWP monitors the TIMEOUT for the commit threads; therefore, if one of the available managers fails to COMMIT, the whole operation should fail by a TIMEOUT and send a CANCEL, without regard to the simultaneous approach. The goal is to use the 2PC to achieve synchronicity between the manager and the voting application.

    While (n_samples > I){
        I++;
        Start = gethrtime();
        HCS_MAN->TwoPC();                           // TCP socket connection
        HCS_MAN->SNMPGet(OIDs, agent, &Response);   // UDP socket connection
        HCS_MAN->SendResponse(Response, Gateway);   // TCP socket connection
        Correction_factor = gethrtime() - Start;
        Wait(sampling_time - Correction_factor);
    }

Figure 4. Pseudo-code executed at each replica manager

3.2. The voter-gateway (GW).

With different replicas monitoring the same agent, the voter collects all the non-faulty measurements. In order to achieve congruent results and generate a voted output, an instance of the voter-gateway has to be running in a highly reliable network node. The gateway or voter is shown in Figure 5. As mentioned above, the use of a voter avoids the implementation of complex leader election schemes in replica management. Fault masking is easily achieved in transitions from N to N-1 replicas, and processing delays are almost non-existent.
In other words, whenever a failure is injected into the system, any manager application can fail with graceful degradation of the overall system. However, the voter becomes a performance bottleneck, because all the traffic from the managers is directed to it. The resulting performance issues therefore have to be identified and ways to mitigate them found.
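The commit rendezvous that the voter runs for each sampling round (Figure 3.b) can be sketched in C++ as follows. This is only an illustration under stated assumptions: it replaces the P/V semaphores and the LWP timeout monitor with `std::mutex` and `std::condition_variable`, and the class `CommitBarrier` and the helper `allCommitted` are hypothetical names that do not appear in the original implementation.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Rendezvous for one group of replica managers (hypothetical helper).
class CommitBarrier {
public:
    explicit CommitBarrier(std::size_t expected) : expected_(expected) {}

    // Called once per manager per round. Returns true ("GO") when all expected
    // managers arrived before the timeout, false ("CANCEL") otherwise.
    bool arriveAndWait(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t my_round = round_;
        if (++arrived_ == expected_) {
            arrived_ = 0;
            ++round_;  // releases every manager waiting on this round
            cv_.notify_all();
            return true;
        }
        const bool released =
            cv_.wait_for(lock, timeout, [&] { return round_ != my_round; });
        if (!released) --arrived_;  // withdraw so the next round is counted correctly
        return released;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t expected_;
    std::size_t arrived_ = 0;
    std::size_t round_ = 0;
};

// Demo helper: n managers arrive concurrently; true if every one of them got "GO".
bool allCommitted(CommitBarrier& barrier, std::size_t n, std::chrono::milliseconds timeout) {
    std::vector<char> go(n, 0);
    std::vector<std::thread> managers;
    for (std::size_t i = 0; i < n; ++i)
        managers.emplace_back([&, i] { go[i] = barrier.arriveAndWait(timeout); });
    for (auto& m : managers) m.join();
    for (char g : go)
        if (!g) return false;
    return true;
}
```

The round counter lets the same barrier be reused for consecutive samples, and a manager that times out withdraws its arrival so that a failed replica does not leave a stale count behind.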
[Figure 5. Gateway Application Architecture. Components shown: main application; thread for reception of information (server side), with a thread for each TCP connect(); buffer of received data; voting threads (one per agent monitored); voted DB; heartbeat thread (client side, timing = 1 sec); local objects; general log file; TCP sockets; UDP sockets; manager; agent.]

As shown in Figure 5, several objects work together to achieve total monitoring, that is, N replica managers monitoring M agents concurrently. The voter-gateway was tested using two approaches: asynchronous messages from the managers, and total synchronization using the 2PC protocol [Chow97]. The communication frame between the manager and the voter is presented in Figure 6.

    Structure of Msg (message from manager to voter)

    Field                   Size
    TCP header              -
    Manager name            16 bytes
    Time stamp              16 bytes
    Agent name              16 bytes
    OID in octet format     256 bytes
    Value measured          256 bytes
    Total length            560 octets/bytes

Figure 6. SNMP information frame Manager-to-Gateway.

As shown here, the voter has local objects which are initialized with the information of all the replicas, such as the OIDs monitored by each manager. Each agent object holds information such as the agent name and the parameters to be measured from the agent. The overall scenario is drawn in Figure 7.

[Figure 7. Distributed Fault Tolerant Network Managers and the Gateway accessing a Network Agent. Components shown: replicas of the manager, each with a database of local agents; SNMPGET requests and responses; heartbeat/rpc shell and collection of votes; votes for each OID; voter/NM proxy with the voted database; ATM network; network agent running SNMPd with its MIB.]

First, the voter creates, via RPC calls, NxM instances of managers (hcs_snmp_man objects) in the network nodes, where M is the number of agents to be monitored by the voter and N is the number of replicas. There are K servers acting as network nodes or peers for remote execution of the NxM instances. Then, when all the instances are running, each replica manager polls its corresponding agent. The replica gathers the responses for all the OIDs it is configured to monitor and generates the manager-to-voter frame with the information gathered (Figure 6). The tasks done at the voter are summarized as follows:

a) The voter reads all configuration files about the agents to monitor and the references to the OIDs to be read by the manager. It generates local objects describing this environment.
b) The rshells are executed in each of the nodes of the system.
c) The voter activates the threads for heartbeat processing and failure detection, and the TCP port listener for SNMP results and commits.
d) Concurrently, replica managers communicate with the agents, collecting network management information.
e) Replica managers send the queried information to the voter, using the frame in Figure 6.
f) The voter handles all the arriving information using two techniques: a linear buffer and a hash table. Simultaneously, local threads are created per agent. Each thread is used to dig into the data collected and generate the voted result for every agent being monitored.

Each message that arrives at the voter is converted to the format of the class type Msgformat in Figure 8.
    class Msgformat{
    public:
        int marked;              // message marked to be deleted
        char* manager;           // manager name
        char* agent;             // agent owner of the information
        char* timestamp_client;  // timestamps client and server (manager and GW)
        char* timestamp_server;
        char* OID;               // Object Identifier according to the MIB
        char* SNMPResponse;      // Response from the Agent
        hrtime_t start;          // for performance measurements.
        Msgformat();
        ~Msgformat();
    };

Figure 8. Msgformat class used at the voter to collect the information from the replica managers.

The Msgformat instance is stored either in the buffer or in the hash table. A thread (fillbuffer in Figure 9.a) is created per message received from the replica managers; therefore, if all the replica managers sample concurrently, the number of threads created will be N_Agents*N_Managers. The access policy to the shared structure (buffer or hash table) used by the threads is round-robin. In addition to all of these threads from the replicas, there are "voter" threads created for each agent. Each voter thread is in charge of pulling messages from the buffer and generating the voted output. This process can be viewed as a join between two tables: the first is composed of the agents and OIDs to monitor, and the second is the "table" of messages, which is the buffer. The JOIN operation executed is msg->OID==agents[thread_id].OID and msg->agent==agents[thread_id].agentname (Voter in Figure 9.b).
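As an illustration of how the fixed-width frame of Figure 6 maps onto the fields of Msgformat, the following C++ sketch splits a 560-byte frame into strings. It assumes that unused bytes of each field are padded with NUL characters, which the source does not state. `ParsedMsg`, `field` and `parseFrame` are hypothetical names, and `std::string` is used instead of the raw `char*` members of Figure 8 so that no manual allocation is needed.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Field widths of the manager-to-voter frame (Figure 6).
constexpr std::size_t kNameLen = 16;
constexpr std::size_t kValueLen = 256;
constexpr std::size_t kFrameLen = 3 * kNameLen + 2 * kValueLen;  // 560 bytes

struct ParsedMsg {  // hypothetical stand-in for Msgformat
    std::string manager;
    std::string timestamp_client;
    std::string agent;
    std::string oid;
    std::string snmp_response;
};

// Copies one fixed-width field, stopping at the first NUL padding byte (assumed padding).
inline std::string field(const char* frame, std::size_t offset, std::size_t width) {
    const char* begin = frame + offset;
    const void* nul = std::memchr(begin, '\0', width);
    const std::size_t len =
        nul ? static_cast<std::size_t>(static_cast<const char*>(nul) - begin) : width;
    return std::string(begin, len);
}

// Offsets follow the strncpy calls of Figure 9.a: 0, 16, 32, 48 and 304.
inline ParsedMsg parseFrame(const char (&frame)[kFrameLen]) {
    return ParsedMsg{
        field(frame, 0, kNameLen),
        field(frame, 16, kNameLen),
        field(frame, 32, kNameLen),
        field(frame, 48, kValueLen),
        field(frame, 304, kValueLen),
    };
}
```

Bounding every copy by the field width keeps a field that fills its whole slot from running into the next one, which a plain `strncpy` into an unterminated buffer would not guarantee.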
(a)

    void* fillbuffer(void* sock_desc){
        while (message[1]!='q'){
            if (read(sock_desc, &message, SIZE_MSG)==SIZE_MSG) {
                msg = new Msgformat;
                msg->start = gethrtime();
                strncpy(msg->manager, message, 16);
                strncpy(msg->timestamp_client, message+16, 16);
                strncpy(msg->agent, message+32, 16);
                strncpy(msg->OID, message+48, 256);
                strncpy(msg->SNMPResponse, message+304, 256);
                gettimeStamp(msg->timestamp_server);
                msg->marked = 0;
                P(db);
                buffer->append((void*) msg);
                V(db);
            }
        }
        close(sock_desc);
        thr_exit((void*) 0);
    }

(b)

    void* Voter(int agent_id){
        while (NOT(cancel)){
            for each agent[agent_id].OID do {
                P(db);
                while (buffer->length()>=k){
                    k++;
                    msg=buffer->pop();
                    if ((msg->agent==agents[agent_id].agentname)
                        && (msg->OID==agents[agent_id].OID[j])){
                        T_buffer->append(msg);
                    }
                }
                if (T_buffer->length()==getAvailableManagers()){
                    agent[agent_id].file << Tstamp << Average_Values(T_buffer);
                    delete_elements_in_buffer();
                }
                delete T_buffer;
                V(db);
            }
        }
    }

Figure 9. Threads for filling and removing elements from the linear buffer (messages from replica managers to the voter).

Figure 9.a shows the pseudo-code for filling up the buffer, and Figure 9.b shows the pseudo-code for the voter's threads. As seen here, the voting function depends upon the number of available managers, getAvailableManagers(), which is used to determine whether the number of messages in the buffer is valid or not. When a replica manager fails, the number of messages will be greater or less than the number of available managers; the sample is then simply lost and not processed. Processing continues normally only when the next sequence of messages arrives in the queue. As mentioned before, the access method for all the voter's threads is a round-robin sequence. They also share the context with all the fillbuffer() threads generated upon the arrival of the manager-to-voter packets (Figure 6).

Alternatively, the linear buffer (Figures 9.a and 9.b) can be replaced by a double hashed array, with the OID and the agent name as inputs to the hashing functions (see Figures 10.a and 10.b).

(a)

    void* fillbuffer(void* sock_desc){
        Msgformat* msg;
        char message[600];
        int k, m;
        while (message[1]!='q'){
            if (read((int) sock_desc, &message, SIZE_MSG)==SIZE_MSG) {
                msg = new Msgformat;
                msg->start = gethrtime();
                strncpy(msg->manager, message, 16);
                strncpy(msg->timestamp_client, message+16, 16);
                strncpy(msg->agent, message+32, 16);
                strncpy(msg->OID, message+48, 256);
                strncpy(msg->SNMPResponse, message+304, 256);
                gettimeStamp(msg->timestamp_server);
                P(db);
                m=getAgentIndex(msg->agent);   // Hashing functions.
                k=getOIDIndex(msg->OID, m);
                if ((m>0) && (k>0)) {
                    Hashed_buf[m][k]->append((void*) msg);
                } else {
                    ERROR();
                    delete msg;
                }
                V(db);
            }
        }
        close((int) sock_desc);
        thr_exit((void*) 0);
    }

(b)

    void* VOTER_h(void* agent_id){
        while (NOT(cancel)){
            j=0;
            for each agent[agent_id].OID do {
                P(db);
                j++;
                if (buffer[agent_id][j]->length()>=getAvailableManagers()){
                    agent[agent_id].file << Tstamp << Average_Values(T_buffer);
                    delete_elements_in_buffer(agent_id, j);
                }
                V(db);
            }
        }
    }

Figure 10. Threads for filling and reading from the double hashed table.
As shown here, the number of iterations is reduced from O(n) for the linear buffer to approximately O(log n) for the hashed array. Dynamic voting is achieved in a similar way as with the linear buffer. The difference is that, instead of losing the sample, the hash structure will hold more than the allowed number of elements, getAvailableManagers(). Therefore, in the subsequent iteration an error of one sample will be introduced into the measurement after a failure, but all subsequent measurements will proceed normally. The failure detection system and the definition of getAvailableManagers() are presented in Section 3.3.
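The bucketed voting of Figure 10 can be sketched in C++17 as follows. This is a minimal illustration rather than the original data structure: an ordered `std::map` keyed by (agent, OID) stands in for the double hashed array, samples are assumed to be already converted to numeric values, and `VoteBuffer` is a hypothetical name. Lookups in `std::map` are logarithmic in the number of keys.

```cpp
#include <cstddef>
#include <map>
#include <numeric>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Samples grouped by (agent, OID); std::map stands in for the double hashed array.
class VoteBuffer {
public:
    using Key = std::pair<std::string, std::string>;

    void add(const std::string& agent, const std::string& oid, double value) {
        buckets_[Key{agent, oid}].push_back(value);
    }

    // Returns the averaged value once at least `available_managers` samples are
    // queued for (agent, oid) and clears that bucket; otherwise returns nothing.
    std::optional<double> vote(const std::string& agent, const std::string& oid,
                               std::size_t available_managers) {
        auto it = buckets_.find(Key{agent, oid});
        if (it == buckets_.end() || available_managers == 0 ||
            it->second.size() < available_managers) {
            return std::nullopt;
        }
        const std::vector<double>& v = it->second;
        const double mean = std::accumulate(v.begin(), v.end(), 0.0) / v.size();
        buckets_.erase(it);
        return mean;
    }

private:
    std::map<Key, std::vector<double>> buckets_;
};
```

Because the threshold test is "at least" rather than "exactly", a bucket that holds one sample too many after a manager failure is still voted on and cleared, which corresponds to the single erroneous sample described above.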
3.3. Heartbeat and Status of the Nodes, Managers and Agents.

As stated in the assumptions of the system, heartbeats are issued to the managers every second (or at an interval defined at compile time). A simple echo server runs on each node, and a timeout mechanism is used to switch a manager in a set of managers from the NORMAL state into the FAULTY state. Later on, if a second timeout is reached and the manager or node has not returned to the NORMAL state, it is erased from the group and declared DOWN. The number of available managers is decreased by one as soon as a FAULTY state is detected. A recovery action to keep the number of available managers above a threshold can easily be achieved by finding the next available network node and running a remote shell command on it to start the monitoring application.

Agents are likewise considered NORMAL, FAULTY or DOWN. An agent is considered FAULTY after a timeout during which it returns no SNMP responses, or only null responses, to the manager. The agent can switch from the FAULTY state into the DOWN state after a second timeout. The manager then terminates itself and the agent is not monitored anymore.

4. Experiments, Fault Injection and Performance Measurements.

Performance experiments were executed at the HCS lab, on the ATM LAN and the Myrinet SAN, using the following workstation architectures as nodes for the managers:

  • Ultra-Station 30/300, 128 MB of RAM (managers)
  • Ultra-Station 2/200, 256 MB of RAM (gateway station, managers and agents)
  • Ultra-Station 1/170, 128 MB of RAM (managers and agents)
  • Sparc-Station 20/85, 64 MB of RAM (managers and agents)
  • An ATM Fore-HUB and a Fore-ATM-switch.

All the measurements were made in terms of the latency added by the voting process and by monitoring, at both sides: the manager and the voter-gateway.
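The NORMAL, FAULTY and DOWN transitions of Section 3.3 can be sketched as a small C++ state machine. This is an illustration under assumptions: the second timeout is taken to be equal in length to the first, a reply during the FAULTY state is taken to restore NORMAL, and `HeartbeatMonitor` and its methods are hypothetical names rather than the original code.

```cpp
#include <chrono>

enum class NodeState { Normal, Faulty, Down };

// Tracks one manager (or agent) from the voter's point of view.
class HeartbeatMonitor {
public:
    using Millis = std::chrono::milliseconds;

    explicit HeartbeatMonitor(Millis timeout) : timeout_(timeout) {}

    // An echo reply arrived: a FAULTY node recovers, a DOWN node stays removed.
    void onReply() {
        if (state_ == NodeState::Down) return;
        silent_ = Millis{0};
        state_ = NodeState::Normal;
    }

    // Called once per heartbeat interval that passed without a reply.
    void onSilence(Millis interval) {
        if (state_ == NodeState::Down) return;
        silent_ += interval;
        if (silent_ >= 2 * timeout_) {
            state_ = NodeState::Down;    // second timeout reached
        } else if (silent_ >= timeout_) {
            state_ = NodeState::Faulty;  // first timeout reached
        }
    }

    NodeState state() const { return state_; }

    // A manager stops counting as available as soon as it is FAULTY.
    bool available() const { return state_ == NodeState::Normal; }

private:
    Millis timeout_;
    Millis silent_{0};
    NodeState state_ = NodeState::Normal;
};
```

A function such as getAvailableManagers() could then be obtained by counting the monitors for which `available()` is true, although the source does not show how that function is implemented.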
4.1. Performance of SNMP at the managers.

Testbed measurements were made to determine the latency of different OIDs using the CMU-SNMP API over Myrinet and ATM LANs.

[Figure 11. Latency measurements using CNTR32 and OCTET STRING data types in Myrinet and ATM testbeds. Panels: round-trip latency of SNMP using CNTR32 data types (1 to 64 OIDs); round-trip latency of SNMP using OCTET STRING data types (1 to 32 OIDs); time distribution of an SNMP GET request at the manager using CNTR32 data types; time distribution of an SNMP GET request at the manager using OCTET STRING data types. Axes: number of object identifiers (OIDs) versus time in ms. Series: round-trip timing for Myrinet and ATM; encoding, decoding, application, and protocol and agent.]

As shown in Figure 11, the latency of different combinations of SNMPGET commands using the CMU-SNMP API grows steadily as the number of OIDs is increased. Figures a) and c) were obtained using the OCTET STRING data type, and b) and d) using the CNTR32 data type. On average, CNTR32 and INTEGER requests using the SNMPGET frame make up more than 85% of all the requests. For this reason, the performance experiments run at the agents included only CNTR32 data types. Table 1 shows the percentage of time that each request to the agent spends on ASN.1 encoding and decoding, in the application, and in the protocol and agent. The protocol/agent overhead accounts for 43.8% to 94.9% of the total. This turns out to be the first performance bottleneck found in the whole process. It is important to remember that the ASN.1 encoding/decoding overhead is also incurred at the agent level and that, in addition, the agent has to access its Control Status Registers and create the frame. The Myrinet experiments were run on the link between two Ultra-2/200 workstations with 256 Mbytes; for ATM, one Ultra-2/200 polled information from the ATM Fore switch.

Table 1. Processing overhead distribution at the Manager.

    Latency distribution   1 CNTR32   64 CNTR32   1 CNTR32   64 CNTR32
                           ATM        ATM         Myrinet    Myrinet
    encoding                1.06%      2.41%       1.58%      1.32%
    decoding                1.78%      4.74%       1.97%      2.33%
    application            40.98%      3.04%      52.65%      1.44%
    Protocol and Agent     56.18%     89.80%      43.80%     94.90%

    No. of samples: 500, Sampling time: 1 second
[Figure 12. Average latency at all replica managers having different levels of replication and different numbers of monitored agents. Panels: (a) latency at the manager with 1 replica; (b) latency at the manager with 5 replicas. Axes: number of agents versus time in ms. Series: 2PC, SNMPGET, TCPconnect, latency to the voter.]

In Figure 12, both a) and b) correspond to the measurements made at the manager, but in this case the item labeled SNMPGET corresponds to the whole process described in Figure 11. For these measurements the number of OIDs used was fixed at eight. The eight OIDs selected are those presented in Table 2. The main requirement for all the experiments is that the sampling time and the number of samples are fixed and exactly the same for all the agents being monitored. As a consequence, the values of TCPConnect and Latency to Voter correspond to the transmission of the eight OIDs from the replica manager.

Table 2. OIDs selected for the performance experiments

    OID                                                                        ASN.1 Data Type
    .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.3   CNTR32
    .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.4   CNTR32
    .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.5   CNTR32
    .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifInOctets.6   CNTR32
    .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.3  CNTR32
    .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.4  CNTR32
    .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.5  CNTR32
    .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.6  CNTR32
Those entries correspond to the number of octets in and out at the different devices monitored by the snmpd at each workstation being polled by the manager. According to the results from Figure 12, the Two Phase Commit (2PC) protocol and the SNMPGET accounted for more than 95% of the overhead. Table 3 also shows that the SNMPGET accounts for most of the process (between about 60% and 77% in all but one configuration), whereas the 2PC occupies between about 22% and 32%. The algorithm is shown in Figure 4.

Table 3. Comparison of 2PC and SNMPGET at the manager using different agents and replication (percentage of utilization)

    Number of      2 agents              16 agents
    replicas       2PC       SNMPGET     2PC       SNMPGET
    1              23.80%    69.50%      32.45%    62.82%
    2              23.10%    68.10%      24.50%    47.10%
    3              21.80%    72.07%      25.90%    66.45%
    4              27.83%    76.87%      28.22%    63.45%
    5              27.14%    60.60%      29.80%    60.60%

Therefore, the overhead introduced by the manager has to be compensated for and taken into account when defining the minimum sampling time stated in Section 3.1, equation 1.

4.2. Testbed experiments using the 2PC.

Adding the 2PC protocol to each manager introduces an overhead of about 25% of the total processing time at the manager. It is important to point out that the "Total Processing Time" discussed below relates to the voter system and includes the inter-arrival time between manager-to-voter SNMP messages, the searching time in the shared buffer and the corresponding I/O to disk. In addition, the total processing time is per thread. In other words, N_Agents threads will be able to process the incoming OIDs concurrently in the average time mentioned above.
[Figure 13. Inter-arrival time and total processing time at the voter. Panels: (a) total processing time (TPT) of eight concurrent OIDs at the voter; (b) inter-arrival time of eight concurrent OIDs at the voter. Axes: number of agents versus time in ms. Series: one curve each for 1 to 5 managers.]

In both cases the dominant factor is the inter-arrival time of messages to the queue. This time is the difference between the arrival at the shared buffer of a message identifying an OID and the moment at which the last message for that OID, coming from a different manager, arrives at the shared structure. It is important to remember that the system shares access between the threads filling up the buffer, fillbuffer(), and the voters (one per agent). Thus, the inter-arrival rate is affected by the processing time of the voters, that is, of all the "join" processes mentioned in Section 3.2, Figure 9.

The number of messages received per sample is N_Managers*N_Agents*N_OIDs. In other words, a system with 16 agents, 5 managers and 8 OIDs will send 640 messages to the voter, each message with a fixed size of 560 bytes (Figure 6), which represents 358400 bytes received by the voter in this particular iteration. For instance, in an iteration with two replicas and 16 agents, the average processing time is 500 ms; with 8 OIDs per agent, the minimum sampling time is 4 seconds. Any shorter sampling time will lead to erroneous monitoring, and the results will not be accurate with respect to the sampling time. To avoid these problems, the sampling time for the 5 managers was set to 30 seconds and the number of samples to 20.

4.2.2. Testbed experiments with the hash table.

In order to reduce the searching time in the shared buffer structure and have that reduction reflected in the performance of the whole application, the linear search was replaced by the hash table (Figure 10). The results of using a hash table are shown in Figure 14. In principle, reducing the search time allows more time for the other threads to process and reduces the sections of the buffer in which semaphores are required for access. The hash functions are very simple and the relationship is one-to-one, since it is the voter that defines the managers and agents to monitor. In comparison with the linear search done on the previous structure, an improvement of 50% was achieved with three or more replicas, while there was no change for one or two replicas. An interesting behavior is shown with 4 or 5 replicas, in which the variation between processing times is no more than hundreds of milliseconds, which is expected given the nature of the search time in the hash table.
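The message-volume and sampling-time figures quoted in Section 4.2 (640 messages, 358400 bytes, 4 seconds) can be checked with a few compile-time expressions. The function names below are hypothetical and serve only to make the arithmetic explicit.

```cpp
#include <cstdint>

constexpr std::uint64_t kFrameBytes = 560;  // fixed frame size from Figure 6

// Messages reaching the voter in one sampling round.
constexpr std::uint64_t messagesPerSample(std::uint64_t managers,
                                          std::uint64_t agents,
                                          std::uint64_t oids) {
    return managers * agents * oids;
}

// Bytes reaching the voter in one sampling round.
constexpr std::uint64_t bytesPerSample(std::uint64_t managers,
                                       std::uint64_t agents,
                                       std::uint64_t oids) {
    return messagesPerSample(managers, agents, oids) * kFrameBytes;
}

// Minimum sampling time when each OID of an agent needs avg_processing_ms at the voter.
constexpr std::uint64_t minSamplingMs(std::uint64_t oids, std::uint64_t avg_processing_ms) {
    return oids * avg_processing_ms;
}

// 5 managers, 16 agents and 8 OIDs, as in the text.
static_assert(messagesPerSample(5, 16, 8) == 640, "640 messages per sample");
static_assert(bytesPerSample(5, 16, 8) == 358400, "358400 bytes per sample");
// 8 OIDs at an average of 500 ms each, as in the two-replica example.
static_assert(minSamplingMs(8, 500) == 4000, "4 seconds minimum sampling time");
```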
  18. 18. [Figure 14(a): Inter-arrival time of eight concurrent OIDs at the voter, in ms, versus number of agents (1, 2, 4, 8, 16), for 1 to 5 managers.] [Figure 14(b): Total processing time (TPT) of eight concurrent OIDs at the voter, in ms, versus number of agents (1, 2, 4, 8, 16), for 1 to 5 managers.] Figure 14. Inter-arrival and total processing time at the voter using a hash table. 4.2.3 Testbed experiments with an asynchronous system If the 2PC protocol is not included, the managers become fully independent and the overhead involved is reduced by more than 20%. However, the absence of a replica control protocol causes the performance of the system to degrade drastically. The total processing time measured at the voter ranges from a minimum of 200 ms to a maximum of 2.2 seconds (with 4 replicas and 16 agents). Therefore, in this particular system, the loss of synchronization drastically degrades the system. Table 4 and Figure 15 show that this degradation is on the order of 60% with respect to the hash table method using 2PC. Preliminary measurements showed that the degradation is even worse if the hash table is replaced by the linear search over the shared buffer structure.
  19. 19. Table 4. Total processing time (ms) without synchronization using the hash table
  No. of agents   No. of OIDs   1 replica   2 replicas   3 replicas   4 replicas
  1               8             220         436          483          461
  2               8             297         556          580          926
  4               8             345         778          890          1207
  8               8             322         800          1137         1659
  16              8             344         975          1325         2210
  As in the previous two sections, where the total processing time grows with the number of agents, Figure 15 shows the same behavior but in greater proportions. [Figure 15 plot: total processing time of eight concurrent OIDs at the voter, time (ms) versus number of agents, for one to four replicas.] Figure 15. Average total processing time (TPT) introduced at the voter in the asynchronous system. 4.3 Comparison between “non-voted” and “voted” outputs. One of the main goals of a fault-tolerant system is to achieve transparency for every measurement by reducing the noise added by the replication. To determine whether the
  20. 20. FT system reduces or gracefully degrades the accuracy of the measurement, Figure 15 shows a sample of voted and non-voted measurements. The experimental conditions for Figure 15 are: three replicated managers monitoring a FORE-ATM router/hub (hcs-gw). [Figure 15 plot: voted and non-voted data collected by one of the replica management applications (dagger, Ultra 2); ifOutOctets at the GB port of hcs-gw versus time (seconds); series: Voted-IfOutOctets and Dagger-IfOutOctets.] Figure 15. Fault-free voted and non-voted measurements at the Gb-Ethernet port of hcs-gw (router) using .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOutOctets.3 As shown in Figure 15, dagger is taken as the point of reference. The collected data showed that the error between the voted and non-voted measurements is no greater than 0.03%. This is also apparent from a qualitative inspection of the figure. 4.4. Performance Degradation after Fault Injections One of the main assumptions of the FT system is the fail-stop model. However, after the failure of one of the managers, monitoring should continue by fault masking. The experiments run on the testbed used a set of five managers and one agent, with 8 OIDs per request. Faults are injected by killing the remote shell; after that, the system depends on the heartbeat service for failure detection. Figure 16 shows how the values at the voter change after a failure in one of the managers. The measurement starts with five managers, and every minute the number of managers is decreased by one. As seen
  21. 21. here, the voted and the real measurements differ only slightly. In fact, after a failure the value at the time of the measurement is lost, and an interpolation is required to determine the sample in the gap between the system having N and N-1 replicas. [Figure 16 plot: comparison of voted data versus a replica manager's local information under fault injection; ifInOctets (GB port) at hcs-gw (router) versus number of non-faulty managers, decreasing from 5 to 1; series: Voted-ifInOctets and Dagger-ifInOctets.] Figure 16. Comparison of voted and non-voted results of ifInOctets after fail-stop faults at the managers
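The masking and gap-filling behavior described above can be sketched as follows. The sketch assumes a median voter over the replicas that are still alive and linear interpolation across the lost sample; the text specifies neither choice, so both are illustrative assumptions, and the function names `vote`, `fill_gaps`, and `relative_error` are hypothetical. The numbers in the usage note are made up for illustration and are not measurements from the testbed.

```python
from statistics import median
from typing import List, Optional, Sequence


def vote(readings: Sequence[Optional[float]]) -> Optional[float]:
    """Return the median of the readings reported by live replicas.

    A fail-stopped manager contributes None and is ignored (fault masking).
    Returns None when no replica answered, i.e. the sample is lost.
    """
    alive = [r for r in readings if r is not None]
    return median(alive) if alive else None


def fill_gaps(samples: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate lost samples (None) between known neighbors.

    Leading or trailing None entries have no neighbor on one side and are
    left unchanged.
    """
    filled = list(samples)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def relative_error(voted: float, reference: float) -> float:
    """Relative difference between the voted value and a reference replica."""
    return abs(voted - reference) / reference
```

For example, `vote([100, 102, None, 101, 99])` ignores the failed replica and returns 100.5, and `fill_gaps([10, None, 30, None, None, 60])` returns `[10, 20.0, 30, 40.0, 50.0, 60]`. A median is used here only because it tolerates a minority of outlying replicas; a majority or mean voter would fit the same interface.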
  22. 22. The behavior of the throughput is presented in Figures 17a and 17b; in both cases the reference is the manager at the node dagger, and the graph generated by “dagger” is followed by the graph generated by the voter. [Figure 17(a): Output throughput measured at the voter and local information at the replica (dagger), in octets/sec, versus number of voters (slots of 10 seconds), decreasing from 5 to 1; series: Voted-IfOutOctets/s and Dagger-IfOutOctets/s.] [Figure 17(b): Input throughput measured from the voted and local information in the replica (dagger), in octets/sec, versus number of voters (slots of 10 seconds), decreasing from 5 to 1; series: Voted-IfInOctets/s and Dagger-IfInOctets/s.] Figure 17. Input and output throughput measured from one of the router ports with fault injection. Observe that graceful degradation is achieved by avoiding gaps between two or more samples. Moreover, when only one manager, “dagger”, is left, both graphs are expected to be exactly the same, as shown in the figure. 5. Future Work Group communication is very important to keep concurrency at the application level. There are several possible improvements to the system. Using lightweight agents, by replacing UDP sockets with XTP [ ] or SCALE Messages [George98a], will decrease the overhead at the protocol layer in every manager. In addition, the message interchange between the voter and the managers can be done using lightweight multicast communication. Thread context switching can also be reduced, and access to the shared buffer at the voter can be sped up.
  23. 23. As an NM application, the voter should be able to re-run or re-schedule one of the replicas after detecting a dead node. A combination of Leader Election Replica Management and Voting can also reduce the load on every management node. Another improvement is to add a lightweight CORBA framework [George98b] to communicate with the agents instead of using SNMP only. Finally, one of the major improvements for an FT application is the use of faster data structures and an SQL engine [Wolf91] to relate parameters from the replicas and the original and to act as a failure detector, which would already include built-in checkpointing and error recovery. 6. Conclusions • FT distributed applications require a well-defined, efficient replica communication protocol. The combination of fault detection and the 2PC protocol provides an easy methodology to achieve synchronization. • Timing and latency overhead at the voter and the managers have to be taken into consideration, especially when defining small sampling intervals in high-performance networks. • The use of a voting system provides an efficient way to monitor and gracefully degrade measurements in a Network Management application. 7. Acknowledgements The authors thank the HCS Lab for their comments and reviews.
  24. 24. 8. References
  [Agui97] M. Aguilera, W. Chen, S. Toueg. “Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication”, Cornell University, July 1997.
  [Alvi96] L. Alvisi, K. Marzullo. “Message Logging: Pessimistic, Optimistic and Causal”, IEEE Int. Symp. of Distributed Computing, pp. 229-235, 1995.
  [Aro95] A. Arora, S. Kulkarni. “Designing Masking Fault-Tolerance via Nonmasking Fault-Tolerance”, IEEE Symposium on Reliable Distributed Systems, 1995, pp. 174-185.
  [Beed95] G. Beedubahil, A. Karmarkar, U. Pooch. “Fault Tolerant Object Replication Algorithm”, TR-95-042, Dept. of Computer Science, Texas A&M, October 1995.
  [Begu97] A. Beguelin, E. Seligman, P. Stephan. “Application Level Fault Tolerance in Heterogeneous Networks of Workstations”, Journal of Parallel and Distributed Computing, Vol. 43, pp. 147-155, 1997.
  [Bras95] F. Brasileiro, P. Ezhichelvan. “TMR Processing without Explicit Clock Synchronisation”, IEEE Symposium on Reliable Distributed Systems, 1995, pp. 186-195.
  [Doer90] W. Doeringer, D. Dykeman, et al. “A Survey of Light-Weight Transport Protocols for High-Speed Networks”, IEEE Trans. on Communications, Vol. 38, No. 11, pp. 2025-2035.
  [Dol97] S. Dolev, A. Israeli, S. Moran. “Uniform Dynamic Self-Stabilizing Leader Election”, IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 4, April 1997, pp. 424-440.
  [Duar96] E. Duarte, T. Nanya. “Hierarchical Adaptive Distributed System-Level Diagnosis Applied for SNMP-Based Network Fault Management”, IEEE Symposium on Reliable Distributed Systems, 1996, pp. 98-107.
  [Feit97] S. Feit. “SNMP”, McGraw Hill, NY, 1997.
  [George98a] Paper being fixed by Dave and Tim.
  [George98b] Luises Thesis…..
  [John97] D. Johnson. “Sender-Based Message Logging”, IEEE Fault Tolerant Computing, pp. 14-19, 1987.
  [Landi98] S. Landis, R. Stento. “CORBA with Fault Tolerance”, Object Magazine, March 1998.
  [Maffe96] S. Maffeis. “Fault Tolerant Name Server”, IEEE Symposium on Reliable Distributed Systems, 1996, pp. 188-197.
  [OSI87] OSI. Information Processing Systems – Specification of the Abstract Syntax Notation One (ASN.1), ISO 8824, December 1987.
  [Paris94] J.-F. Pâris. “A Highly Available Replication Control Protocol Using Volatile Witnesses”, IEEE Intl. Conference on Distributed Computing Systems, 1994, pp. 536-543.
  [Prad96] D. Pradhan. “Fault Tolerant Computer System Design”, Prentice Hall, NJ, 1995.
  [Rose90] M. Rose, K. McCloghrie. “Structure and Identification of Management Information for TCP/IP-based Internets”, RFC 1155, 1990.
  [Rose94] M. Rose. “The Simple Book – An Introduction to Internet Management”, 2nd edition, Prentice Hall, Englewood Cliffs, NJ, 1994.
  [Saha93] D. Saha, S. Rangarajan, S. Tripathi. “Average Message Overhead of Replica Control Protocols”, 13th IEEE Intl. Conference on Distributed Computing Systems, pp. 474-481, 1993.
  [Scho97] J. Schönwälder. “Network Management by Delegation”, Computer Networks and ISDN Systems, No. 29, 1997, pp. 1843-1852.
  [Sing94] G. Singh, M. Bommareddy. “Replica Placement in a Dynamic Network”, IEEE Intl. Conference on Distributed Computing Systems, pp. 528-535, 1994.
  [Wolf91] O. Wolfson, S. Sengupta, Y. Yemini. “Managing Communication Networks by Monitoring Databases”, IEEE Transactions on Software Engineering, Vol. 17, No. 9, Sep. 1991, pp. 944-953.
  [WU97] C. Wu. “Replica Control Protocols that Guarantee High Availability and Low Access Cost”, Ph.D. Thesis, University of Illinois at Urbana-Champaign, 1993.
  [XU96] J. Xu, B. Randell, et al. “Fault Tolerance in Concurrent Object-Oriented Software through Coordinated Error Recovery”, FTCS-25 Submission, University of Newcastle, 1996.