Network performance management: Three key technology challenges


Published in: Technology, Business

  1. SPECIAL REPORT. Network performance management: Three key technology challenges. By Enterprise Management Associates (EMA): Dennis Drogseth, Vice President; Steve Hultquist, Senior Analyst; Jeff Nudler, Senior Analyst. Sponsored Exclusively By: This Special Advertising Section Produced By: www.nw
  2. Issues of data collection, analytics and thresholds as they impact monitoring of network infrastructure

     Table of Contents
       3  Issues and Challenges in Network Monitoring
       3  Data Volume Issues
       5  Data Transit Issues
       5  Data Saturation Issues
       6  Data Triggering and Threshold Issues
       7  Data Analytics Issues
       8  Conclusions

     In the current environment, management of the networked infrastructure has become fundamentally more complex than ever. Both the number and the complexity of the active elements in networks today are higher than ever. Simultaneously, components have expanded to include functions that address needs across a greater variety of applications and across more layers of the network model. Even as new specialized devices are being added to networks, existing devices are expanding their functions. In short, the functions of many network elements are increasing while the number of network elements is also increasing, creating an ever-expanding management challenge.

     While the marketplace is evolving, it is still doing so with some level of confusion along critical areas of technology leadership. This is due in part to the historic evolution of network engineering and management approaches. Historically, management was focused on maintaining the health of the individual elements within a network, assuming that keeping each element healthy would guarantee the health of the overall network. With the expanding functions and changes in network designs, this has not proven to be the case.

     This report focuses on data collection, analytics and thresholds as they impact network monitoring. Because they represent the three most critical areas requiring both technology innovation within the vendor community and strategic solution planning from an IT adoption perspective, it is important that these areas are understood and used as focus areas in the development of network management systems (NMS) and network management planning.

     About the Authors
     Dennis Drogseth, Vice President; Steve Hultquist, Senior Analyst; and Jeff Nudler, Senior Analyst at Enterprise Management Associates (EMA). EMA is the first technology analyst firm to specialize exclusively in management software and services. Organized into three highly collaborative practice areas (e-Business Management, Networked Services Management, and Systems Enabled Services Management), EMA offers information and guidance about management software and services.
  3. Issues and challenges in network monitoring

     [Figure 1. 2003 Metrics important to SLAs]

     As the number of IT infrastructure elements has increased and the topological complexity of the network has escalated, the issues surrounding the monitoring and management of those elements have expanded exponentially. The management of network traffic passing through an ever-expanding labyrinth means an inexorably growing amount of statistical information. As the requirements for redundancy, performance and the number of mission-critical applications increase, the complexity of the overall architecture increases, adding to this issue as well. All of this growth has increased the strain on the monitoring of networks and introduced a need for new thinking about the resulting data collection and analysis.

     Understanding the implications of highly interdependent measurements (for instance, understanding how network, server and application components might all impact each other) and providing both proactive and reactive alerts requires effective analytics and complex threshold-setting alternatives. This combination of challenges creates an environment ripe for innovation and new thinking.

     Data volume issues

     Network elements have become exceedingly complex over the past few years. For example, Ethernet hubs have morphed into Layer 3 switches serving multiple virtual LANs (VLANs). This increased complexity has led to an explosion of statistics and configuration information available from the network elements. This data, when actually collected and used, does not necessarily provide useful knowledge for the management and operations team because of its extensive volume and specificity.

     In other words, excessive data does not equate to meaningful information, and the information gleaned from the data does not necessarily lead to useful knowledge with a clear context for management decision making. On the contrary, the volume of information captured constantly from the network can serve to drown human analysts and overwhelm systems designed to process the data to produce useful analytics. Effectively, the sheer volume of data can often sabotage the primary purpose for gathering it: efficient management of the network infrastructure.

     But given the complexity of the network and the need to understand more, not fewer, interdependencies, what are the alternatives for this swelling ocean of data?

     Engineering data collection

     The voluminous nature of management data produces a requirement to actually engineer its collection or to rely on automation for that. Among the decisions for collection are the specific statistics to be gathered from each element, a design for the inter- and intra-element analytic algorithms, and the development of an empirically supported model for monitoring and management. Much of this engineering is incorporated into management systems that develop unique methods and mechanisms for managing the volume of data. Even with those capabilities, it is likely that organizations will continue to expend significant engineering effort to develop the complete data collection strategy for the foreseeable future, as individual network engineers and operations staff members continue to craft their own sequences and searches in data gathering and analysis from a multiplicity of resources.
  4. [Figures 2 and 3. 2003 Traffic Engineering tools]

     Distinguishing data

     With the volume of data produced in the management of a large network, culling the valuable data for any given potential problem or actual failure is a great challenge. Data that might otherwise have little use becomes of great import due to its correlation to other data. For example, Ethernet addresses of client workstations might rarely be useful information. However, consider a notebook computer with a bad Ethernet interface. When it begins to generate excessive traffic or unusual errors while it is moved from LAN to LAN, being able to ascertain that the errors that seem to be erupting randomly through a network are all associated with the same physical address provides invaluable information for resolving the issue.

     Another example of this type of issue: After data was lost because of a series of undiscovered multiple failures, the root-cause analysis uncovered backup logs showing backup failures over a 10-day period that were never reported through an alerting system. The volume of data in the backup reports was such that the people and processes in place to find errors failed consistently for nearly two weeks! Consider that there are far more data points in a network monitoring system than in a backup log, and the escalating demands of effective problem notification become obvious.

     Examples of the capabilities of management systems to address the explosion of monitoring data are found in tools that dynamically alter data collection based on the status communicated by that data. For example, if a switch port begins to experience excessive errors, the management system would collect additional information from that port, possibly including complete packet capture for analysis. When the port returns to a normal range, however, that additional information would no longer be collected.

     While this approach generally reduces the volume of information gathered continually, there might be data that would be useful in understanding a problem’s cause that was not collected before the problem manifested itself. Recurring but intermittent problems might require additional data to be collected until the data is sufficient to pinpoint the root cause. In some cases, management solutions have been able to automate for that condition as well, proactively seeking added diagnostics based on “trigger conditions”. The conditions might be configured out-of-the-box or through policy rules available to the operations staff.
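The switch-port example on this slide (collect more when errors become excessive, fall back when the port returns to normal) can be sketched as a small trigger rule. This is only an illustration under stated assumptions: the statistic names, the 1% error-ratio trigger and the function `stats_to_collect` are hypothetical and do not come from the report or from any particular management product.

```python
# Hypothetical statistic names; a real system would map these to device counters.
BASIC_STATS = ("in_octets", "out_octets", "in_errors")
DETAILED_STATS = BASIC_STATS + ("crc_errors", "collisions", "packet_capture")


def stats_to_collect(packets, errors, error_threshold=0.01):
    """Return the statistics to gather from a port on the next polling cycle.

    error_threshold is an assumed trigger condition: the fraction of
    errored packets above which the port counts as "excessive".
    """
    # Guard against division by zero on an idle port.
    ratio = errors / packets if packets else 0.0
    # Above the trigger, collect the detailed set; otherwise only the basics.
    return DETAILED_STATS if ratio > error_threshold else BASIC_STATS


# A healthy port is polled lightly; an errored port gets the detailed set,
# and the lighter set applies again once the error ratio drops.
healthy = stats_to_collect(packets=1000, errors=1)
errored = stats_to_collect(packets=1000, errors=50)
recovered = stats_to_collect(packets=1000, errors=0)
```

As the slide notes, a rule like this cannot recover data from before the trigger fired, which is why intermittent problems may need the detailed set left on for longer.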
  5. Interrelated data

     The reality of network monitoring data is that it is most useful when correlated by proximity, technology service, possible root causes and business processes. Outside of that correlation, most network statistics are of limited usefulness for day-to-day network management. For example, measurements of metrics on network elements that are adjacent in the network provide information on typical traffic flow. Sudden changes in that flow might indicate a problem in one of the mutual connections, even if the metrics on the connection are not unusual.

     Different business processes and technology services tend to cross multiple networks and system boundaries. An issue with a single component of the overall service, such as a database server, affects the performance of the entire service. A clear correlation of these components for this service provides important information for problem determination and management.

     However, those correlations require deep understanding of the network architecture and design, the applications and business processes relying on it, and extensive root-cause engineering. Developing the network model to support this level of engineering is the purview of a small number of expert network engineers, and it is clear that the automation of this expertise is vital for widespread development of these capabilities.

     However, data volume isn’t the only issue. As elements record statistics, they must send them to management systems for collection and analysis.

     Data transit issues

     Most network statistics must be gathered at the elements in the network and then somehow communicated so that they are available to the network management and monitoring team. Traditionally, this is done by having a centralized collection system as part of the network management platform. All of the elements are then polled by the system and the data collected. Given the amount of data that it is possible to collect and the increasing need to span the elements that can shed light on network, systems and application interdependencies, as well as issues that might affect new services such as VoIP, VPNs, or wireless, the network traffic generated by monitoring and data collection can be very significant. For example, members of the InteropNet NOC team have estimated that 80% of the network traffic on the InteropNet is network monitoring and management traffic. Internet traffic reports have shown up to 30% of traffic as being network monitoring and management, while real-world estimates for large networks show that as much as 20% to 25% of the traffic is management specific. The multiplicative effects of the monitoring traffic can seriously impact the speed of acquisition for that monitoring data and the time to resolution for related problems.

     Two effective tools for reducing the load that network monitoring traffic places on a network are distributed data collection and distributed analytics. Using these techniques, a data collection system can collect data at multiple locations within the network, reduce and compress that data as appropriate, and send the resulting data to the network management system. Similarly, distributed analytics both collect and analyze the data in a distributed manner, communicating results of the analysis to the network management system. Both of these techniques reduce the traffic crossing the network while potentially providing the ability to capture information that would otherwise be impractical to capture. For example, the distributed systems could maintain a day’s worth of detailed statistics for all of the elements they collect from, but not send that data to the network management system (NMS) unless a review after an outage required it. This reduces the analytical load on the NMS while maintaining the data that might be needed in the event of a problem investigation.

     Data saturation issues

     A characteristic related to the volume and transit issues is the fraction of the total elements within the network infrastructure that are actively monitored and managed. In many cases, the ideal 100% of elements are not managed or even monitored because of the challenge of data saturation: so much data, most of which is not the least bit useful. So, many organizations monitor those elements that they believe are most likely to allow them to see problems either before or as they occur. If they are wrong, they might miss outages until the help desk informs them.

     Effectively, this is playing Russian roulette with a loaded gun. As long as none of the unmonitored elements is the culprit, you will survive. But when one of them causes a failure, you will have no way of managing that failure or understanding how it occurred.

     A fundamental engineering problem is guessing, before a problem occurs, what data might be necessary to understand and solve it. Eighty percent of performance problems are transient situations, and data capture is impossible once the problem occurs.

  6. To address these types of problems, vendors of management systems create areas of coverage for their products related to the problems they believe are the most prevalent. The problems that their product addresses will determine the perceived quality of their product.

     Targeted analytic capture can provide another approach to distributed data capture, in which isolated variables well known to affect certain critical conditions are automated for capture in either a distributed or centralized fashion. Limiting variable collection to enable more fluid monitoring across the infrastructure can be an effective tradeoff when the limitations imposed reflect in-depth insights into the networked environment, just as expert diagnosticians in the medical profession are in large part effective because they know what to ignore.

     For example, high traffic volume on the accounting department link on Thursdays most likely signifies payroll processing, while on Tuesday the same condition might signify an anomaly. The monitoring tool must be able to recognize this relationship between time and the monitored data. Beyond the collection of data, there are additional issues with how measurements trigger alerts and with the analysis of the data.

     Data triggering and threshold issues

     As distributed systems and networks emerged, it became clear that there were situations when the operations team needed to know that a threshold had been crossed or a particular situation had occurred. Classic examples include a 60% load on a shared Ethernet network and SNMP traps for power failures. These thresholds and triggering events provided mechanisms for warning engineers about problems as they occurred. If the values were judiciously selected, they could even serve as warnings for potential problems.

     However, over time, the traps and thresholds became a flood of incoming data without any relative value clearly discernible from the NMS. It was not unusual for threshold alerts, SNMP traps and related monitoring information to effectively paralyze the management infrastructure (people, processes and tools). In other words, the very phenomena designed to alert operations staff to issues with the infrastructure served to blunt the urgency of those insights. For example, a service provider reported receiving 60,000 alarms per day, but the NOC personnel depended on external technicians to alert them to actual problems.

     Among the challenges of setting appropriate thresholds are the typically constant nature of thresholds, the largely ad hoc approaches to setting thresholds, and the element-specific nature of thresholds.

     Threshold values

     In most systems, thresholds are set to a specific value. When that value is exceeded, an alert of some kind is generated for the operations staff. For example, it is typical to set bandwidth utilization thresholds on point-to-point links to alert operations staff to possible network choke points.

     One challenge with this approach is that it might not communicate potential issues until it is too late. The naturally cyclical nature of network utilization means that an environmental change, such as the introduction of new applications or the physical movement of a number of staff members, might cause an unexpected increase in traffic that causes a bottleneck. However, the trend is unseen until it rapidly crosses the threshold during the high-utilization time of day and causes a problem and potentially an outage. The static nature of many threshold values makes them less useful than they could be.

     Threshold setting

     In addition to the static nature of many thresholds, the process of determining threshold values often has been informal and experiential instead of rigorous. Consider the threshold value for bandwidth utilization, again.

  7. At what utilization should it be set? 60%? 80%? What are the characteristics of the channel that contribute to an appropriate setting? The complexities of setting these thresholds can lead either to overly conservative settings, effectively increasing inappropriate alerts, or to aggressive or even arbitrary settings that might contribute to network failures.

     The historical practice of setting thresholds in isolation seldom reflects the complex realities of today’s networks and the interrelation between the network components.

     Element-based thresholds

     Typically, thresholds are set for a single element and often for individual interfaces on an element (for example, each interface on a router might have different thresholds based on the type of medium connected). Relationships between elements and the effect of those relationships are often impossible to integrate into the thresholds and alerts.

     Consider the implications of multiple Internet connections managed by the Border Gateway Protocol. These connections are likely to have different bandwidths, and are also likely to connect to different ISPs and different points of presence. For most practical purposes, those links can be considered aggregated, and the routers connected to them are able to negotiate appropriate routing based on a complex algorithm. However, how will the operations team be alerted if the usage of those links is abnormally high? Typically, thresholds are not set for the aggregation of bandwidth that crosses element and interface boundaries, as is the requirement in this example. Effectively, the relationships between elements are seldom part of the threshold value or the threshold calculations.

     Consolidation of information to create useful thresholds is difficult, and data gathered from multiple elements is difficult to correlate and aggregate for both alerting and reporting. The multitude of ways that data can combine to represent a warning (or even to indicate that a value that could signal a problem doesn’t necessarily apply in a particular case) makes it difficult to manually engineer all of the possible permutations. Understanding all of the possible correlations is also challenging, and likely requires significant engineering effort that is often unavailable within the network infrastructure management team.

     Emerging threshold approaches

     As the extent of the threshold challenges has become clear in the industry, more products are beginning to be designed around the real-world requirements. Some products now use analytics, heuristics and/or baselines and trending to determine dynamic or adaptable thresholds that account for the interrelationship of measurements across a network.

     Using analytics, products can determine the impact of correlating measurements and communicate the overall implications for the network. For instance, when certain links, interfaces, or element backplanes begin to approach their capacity limits, analytical correlation can determine the snowball effect and alert network operations to impending trouble. Automation is really the only way to correlate and analyze the vast data that is available in a way that provides useful information to operations staff.

     Similarly, a set of measurements might indicate a possibility that seemingly unrelated components in the network must be checked or that thresholds must be set on metrics elsewhere. For example, a slowdown in end-user response time for an application might indicate a need for detailed inspection of network links, system utilization and server process performance. These heuristics allow for correlations and connections across metrics that provide an overall perspective on delivering end-user services. Without the correlations, it is likely that the implications of a set of measurements might not be recognized until the problems manifest themselves.

     Along this same vein, a number of products also provide the capability to create baselines for measurements that correlate to time of day, day of week and other trending characteristics. These baselines then allow for tighter threshold constraints within a band around the baseline, allowing for alerting based on measurements that are “outside the norm” but not necessarily above a strict, invariable threshold.

     Data analytics issues

     After considering the challenges inherent in the data collection and threshold issues, there are still additional issues in the analysis of the data. Among them are the computational and network load created by centralized data analysis, the scope of analytics and the challenges of root-cause issues and trending.

     When vast amounts of monitoring data generated by a large network are returned to many central analysis systems, those systems will typically be forced either to filter that data aggressively or to apply substantial computational resources to analyze the multiple permutations of the metrics gathered. This challenge implies that either potentially useful data is ignored
  8. when it is filtered or that substantial computational horsepower applied to analytics on a continuous basis will result in no useful information (until an anomaly occurs).

     [Figure 4. 2003 troubleshooting tools projected spending]

     Most analytical engines are limited in their ability to correlate across multiple elements and multiple metrics. In addition, combinational metrics are difficult to create as a part of the generic out-of-the-box analytics repertoire, so many products are unable to provide for these. Yet, those same metrics are often the keys to proactive management of complex networks. Combining multiple metrics into a single view is one way of avoiding a flood of alerts when a problem does occur.

     Of course, multiple alerts for a single problem are commonplace. Root-cause analysis technologies emerged in the late 1990s to address just this problem and to pinpoint the likely root causes by correlating a large set of data to rules or other means of determining likely causes. While a significant breakthrough at the time, root-cause analysis has limitations in proactively managing networks from a performance as well as availability perspective, and root cause does not always help to avoid outages or manage degradation.

     One of the more effective ways to deal with these analytics issues, especially as they affect both performance and availability, involves the inclusion of trending and “normative” baselines into the analysis. Even though these share their own set of implementation pitfalls, they are valuable touchstones for proactive management, capacity planning, and, as mentioned earlier, the determination of appropriate dynamic thresholds. The vital nature of this capability is increasingly recognized, so it will continue to be offered in more systems.

     Conclusions

     The nature of network management creates a number of challenges and demands on the infrastructure. From the extreme volume of data made available by network elements, through the challenges of moving that data to a central location, to the difficulty of sifting through the sheer number of measurements gathered, the complexity of today’s network and the components that comprise it introduces significant challenges to the process of monitoring, managing and operating the network infrastructure.

     These challenges beg for either some combination of centralized focus with selective metrics or a distributed solution to the distributed problems. Those solutions might be distributed data collection, distributed filtering and correlation, or distributed analytics. All of these approaches provide some level of relief for the challenges of the data issues. In addition, these distributed approaches can also provide some locality of reference for setting thresholds, allowing those thresholds to have greater applicability.

     In addition, using baselines for tracking trends and setting thresholds will continue to gather steam as organizations realize that current techniques fall short of the needs. The ability to correlate metrics across multiple elements and interfaces in the network will also continue to expand the capabilities of analytical platforms.

     The real key is to provide network operations with the ability to see the forest first when looking for problems, but then to quickly focus in on the trees, and the paths between them, to understand the implications and necessary actions as a result.

     © 2004 Network World, Inc. All rights reserved.
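The baseline idea the report returns to (a band around a norm that varies by time of day and day of week, rather than one fixed threshold) can be sketched as follows. This is a minimal illustration, not a description of any product: the `Baseline` class, the band of three standard deviations, the `min_spread` floor and the sample utilization figures are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


class Baseline:
    """Keeps utilization history per (weekday, hour) slot and flags outliers."""

    def __init__(self, band=3.0, min_spread=1.0):
        self.band = band              # assumed width of the band, in standard deviations
        self.min_spread = min_spread  # assumed floor so a flat history does not alert on noise
        self.samples = defaultdict(list)  # (weekday, hour) -> list of observed values

    def record(self, weekday, hour, value):
        self.samples[(weekday, hour)].append(value)

    def is_outside_norm(self, weekday, hour, value):
        history = self.samples.get((weekday, hour), [])
        if len(history) < 2:
            return False  # not enough history to define a norm for this slot
        center = mean(history)
        spread = max(pstdev(history), self.min_spread)
        return abs(value - center) > self.band * spread


# Weekday 3 = Thursday, weekday 1 = Tuesday (Monday = 0); values are % utilization.
baseline = Baseline()
for value in (80, 82, 78, 81):
    baseline.record(3, 14, value)   # heavy Thursday-afternoon traffic is normal here
for value in (20, 22, 18, 21):
    baseline.record(1, 14, value)   # Tuesday afternoons are normally quiet

thursday_alert = baseline.is_outside_norm(3, 14, 81)
tuesday_alert = baseline.is_outside_norm(1, 14, 81)
```

With this made-up history, 81% utilization is within the norm on Thursday afternoon but outside it on Tuesday afternoon, mirroring the report's accounting-department example. A static 60% or 80% threshold would treat both days the same.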