Network performance management: Three key technology challenges
SPECIAL REPORT
By Enterprise Management Associates (EMA)
Dennis Drogseth, Vice President
Steve Hultquist, Senior Analyst
Jeff Nudler, Senior Analyst
Issues of data collection, analytics and thresholds as they impact monitoring of network infrastructure
Table of Contents
Issues and Challenges in Network Monitoring
Data Volume Issues
Data Transit Issues
Data Saturation Issues
Data Triggering and Threshold Issues
Data Analytics Issues
Conclusions

In the current environment, management of the networked infrastructure has become fundamentally more complex than ever. This is true because both the number and the complexity of the active elements in networks today are higher than ever. Simultaneously, components have expanded to include functions that address needs across a greater variety of applications and across more layers of the network model. Even as new specialized devices are being added to networks, existing devices are expanding their functions. In short, the functions of many network elements are increasing while the number of network elements is also increasing, creating an ever-expanding management challenge.

While the marketplace is evolving, it is still doing so with some level of confusion along critical areas of technology leadership. This is due in part to the historic evolution of network engineering and management approaches. Historically, management focused on maintaining the health of the individual elements within a network, assuming that keeping each element healthy would guarantee the health of the overall network. With the expanding functions and changes in network designs, this has not proven to be the case.

This report focuses on data collection, analytics and thresholds as they impact network monitoring. Because they represent the three most critical areas requiring both technology innovation within the vendor community and strategic solution planning from an IT adoption perspective, it is important that these areas are understood and used as focus areas in the development of network management systems (NMS) and network management planning.
About the Authors
Dennis Drogseth, Vice President; Steve Hultquist, Senior Analyst; and
Jeff Nudler, Senior Analyst at Enterprise Management Associates (EMA)
EMA is the first technology analyst firm to specialize exclusively in management software and services. Organized into three highly collaborative practice areas (e-Business Management, Networked Services Management, and Systems Enabled Services Management), EMA offers information and guidance about management software and services.
Issues and challenges in network monitoring
Figure 1. 2003 Metrics important to SLAs

As the number of IT infrastructure elements has increased and the topological complexity of the network has escalated, the issues surrounding the monitoring and management of those elements have expanded exponentially. The management of network traffic passing through an ever-expanding labyrinth means an inexorably growing amount of statistical information. As the requirements for redundancy and performance and the number of mission-critical applications increase, the complexity of the overall architecture increases, adding to this issue as well. All of this growth has increased the strain on the monitoring of networks and introduced a need for new thinking about the resulting data collection and analysis. Understanding the implications of highly interdependent measurements (for instance, understanding how network, server and application components might all impact each other) and providing both proactive and reactive alerts requires effective analytics and complex threshold-setting alternatives. This combination of challenges creates an environment ripe for innovation.
Data volume issues
Network elements have become exceedingly complex over the past few years. For example, Ethernet hubs have morphed into Layer 3 switches serving multiple virtual LANs (VLANs). This increased complexity has led to an explosion of statistics and configuration information available from the network elements. This data, when actually collected and used, does not necessarily provide useful knowledge for the management and operations team because of its extensive volume and specificity.

In other words, excessive data does not equate to meaningful information, and the information gleaned from the data does not necessarily lead to useful knowledge with a clear context for management decision making. On the contrary, the volume of information captured constantly from the network can serve to drown human analysts and overwhelm systems designed to process the data to produce useful analytics. Effectively, the sheer volume of data can often sabotage the primary purpose for gathering it: efficient management of the network infrastructure.

But given the complexity of the network and the need to understand more, not fewer, interdependencies, what are the alternatives for this swelling ocean of data?

Engineering data collection
The characteristic of voluminous management data produces a requirement to actually engineer its collection, or to rely on automation for that. Among the decisions for collection are the specific statistics to be gathered from each element, a design for the inter- and intra-element analytic algorithms, and the development of an empirically supported model for monitoring and management. Much of this engineering is incorporated into management systems that develop unique methods and mechanisms for managing the volume of data. Even with those capabilities, it is likely that organizations will continue to expend significant engineering effort to develop the complete data collection strategy for the foreseeable future, as individual network engineers and operations staff members continue to craft their own sequences and searches in data gathering and analysis from a multiplicity of resources.
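The engineering decisions described above (which statistics to gather from each class of element, and how often) can be sketched as a small, explicit collection plan. This is a minimal illustration, not the report's method; the element classes, statistic names and polling intervals are assumptions.

```python
# Sketch of an engineered data-collection plan: for each class of
# network element, an explicit choice of which statistics to gather
# and how often. Classes, statistics and intervals are illustrative.

COLLECTION_PLAN = {
    "core_router":   {"stats": ["ifInOctets", "ifOutOctets", "ifInErrors", "cpuUtil"],
                      "interval_s": 60},
    "access_switch": {"stats": ["ifInOctets", "ifInErrors"],
                      "interval_s": 300},
    "server":        {"stats": ["cpuUtil", "memUtil"],
                      "interval_s": 120},
}

def stats_to_poll(element_class):
    """Return the engineered statistic list for an element class,
    falling back to a minimal default for unplanned classes."""
    plan = COLLECTION_PLAN.get(element_class)
    return plan["stats"] if plan else ["ifInOctets"]

def polls_per_hour(element_class):
    """Rough polling load per element implied by the plan."""
    plan = COLLECTION_PLAN.get(element_class, {"interval_s": 300})
    return 3600 // plan["interval_s"]
```

Making the plan a data structure rather than ad hoc polling scripts is the point: the choices are visible, reviewable and cheap to change as the network grows.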
Figures 2 and 3. 2003 Traffic Engineering tools

With the volume of data produced in the management of a large network, culling the valuable data for any given potential problem or actual failure is a great challenge. Data that might otherwise have little use becomes of great import due to its correlation to other data. For example, Ethernet addresses of client workstations might rarely be useful information. However, consider a notebook computer with a bad Ethernet interface. When it begins to generate excessive traffic or unusual errors while it is moved from LAN to LAN, being able to ascertain that the errors that seem to be erupting randomly through a network are all associated with the same physical address provides invaluable information for resolving the issue.

Another example of this type of issue: after losing data because of a series of undiscovered multiple failures, a root-cause analysis uncovered backup logs showing backup failures over a 10-day period that were never reported through an alerting system. The volume of data in the backup reports was such that the people and processes in place to find errors failed consistently for nearly two weeks. Consider that there are far more data points in a network monitoring system than in a backup log, and the escalating demands of effective problem notification become obvious.

Examples of the capabilities of management systems to address the explosion of monitoring data are found in tools that dynamically alter data collection based on the status communicated by that data. For example, if a switch port begins to experience excessive errors, the management system would collect additional information from that port, possibly including complete packet capture for analysis. When the port returns to a normal range, however, that additional information would no longer be collected.

While this approach generally reduces the volume of information gathered continually, there might be data that would be useful in understanding a problem's cause that was not collected before the problem manifested itself. Recurring but intermittent problems might require additional data to be collected until the data is sufficient to pinpoint the root cause. In some cases, management solutions have been able to automate that condition as well, proactively seeking added diagnostics based on "trigger conditions". The conditions might be configured out-of-the-box or through policy rules.
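The trigger-condition behavior described above, coarse collection until an error condition appears and detailed collection until the port returns to a normal range, can be sketched as a small state machine. A hypothetical illustration: the statistic names and the trigger and clear thresholds are assumptions, not values from the report.

```python
# Sketch of trigger-based adaptive collection for one switch port.
# Error-rate thresholds and statistic names are illustrative.

NORMAL_STATS = ["ifInOctets", "ifOutOctets"]
DETAILED_STATS = NORMAL_STATS + ["ifInErrors", "ifInDiscards", "packetCapture"]

class PortMonitor:
    def __init__(self, trigger=0.01, clear=0.001):
        self.trigger = trigger   # error rate that escalates collection
        self.clear = clear       # error rate that de-escalates it
        self.detailed = False

    def observe(self, error_rate):
        """Given the latest error rate (errors per packet), return the
        statistics to collect on the next poll."""
        if not self.detailed and error_rate >= self.trigger:
            self.detailed = True       # trigger condition crossed
        elif self.detailed and error_rate <= self.clear:
            self.detailed = False      # port back in normal range
        return DETAILED_STATS if self.detailed else NORMAL_STATS
```

The separate trigger and clear values give the monitor hysteresis, so a port hovering near the threshold does not flap between collection modes.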
Interrelated data
The reality of network monitoring data is that it is most useful when correlated by proximity, technology service, possible root causes and business processes. Outside of that correlation, most network statistics are of limited usefulness for day-to-day network management. For example, measurements of metrics on network elements that are adjacent in the network provide information on typical traffic flow. Sudden changes in that flow might indicate a problem in one of the mutual connections, even if the metrics on the connection are not unusual.

Different business processes and technology services tend to cross multiple networks and system boundaries. An issue with a single component of the overall service, such as a database server, affects the performance of the entire service. A clear correlation of these components for this service provides important information for problem determination and management.

However, those correlations require deep understanding of the network architecture and design, the applications and business processes relying on it, and extensive root-cause engineering. Developing the network model to support this level of engineering is the purview of a small number of expert network engineers, and it is clear that the automation of this expertise is vital for widespread development of these capabilities.

However, data volume isn't the only issue. As elements record statistics, they must send them to management systems for collection and analysis.

Data transit issues
Most network statistics must be gathered at the elements in the network and then somehow communicated so that they are available to the network management and monitoring team. Traditionally, this is done by having a centralized collection system as part of the network management platform. All of the elements then are polled by the system and the data collected. Given the amount of data that it is possible to collect, and the increasing need to span the elements that can shed light on network, systems and application interdependencies, as well as issues that might affect new services such as VoIP, VPNs or wireless, the network traffic generated by monitoring and data collection can be very significant. For example, members of the InteropNet NOC team have estimated that 80% of the network traffic on the InteropNet is network monitoring and management traffic. Internet traffic reports have shown up to 30% of traffic as being network monitoring and management, while real-world estimates for large networks show that as much as 20% to 25% of the traffic is management specific. The multiplicative effects of the monitoring traffic can seriously impact the speed of acquisition for that monitoring data and the time to resolution for related problems.

Two effective tools for reducing the load that network monitoring traffic places on a network are distributed data collection and distributed analytics. Using these techniques, a data collection system can collect data at multiple locations within the network, reduce and compress that data as appropriate, and send the resulting data to the network management system. Similarly, distributed analytics both collect and analyze the data in a distributed manner, communicating results of the analysis to the network management system. Both of these techniques reduce the traffic crossing the network while potentially providing the ability to capture information that would otherwise be impractical to capture. For example, the distributed systems could maintain a day's worth of detailed statistics for all of the elements they collect, but not send that data to the network management system (NMS) unless a review after an outage required it. This reduces the analytical load on the NMS while maintaining the data that might be needed in the event of a problem investigation.

Data saturation issues
A characteristic related to the volume and transit issues is the fraction of the total elements within the network infrastructure that are actively monitored and managed. In many cases, the ideal 100% of elements are not managed or even monitored because of the challenge of data saturation: so much data, most of which is not the least bit useful. So, many organizations monitor those elements that they believe are most likely to allow them to see problems either before or as they occur. If they are wrong, they might miss outages until the help desk informs them.

Effectively, this is playing Russian roulette with a loaded gun. As long as none of the unmonitored elements are the culprit, you will survive. But when one of them causes a failure you will have no way of managing that failure or understanding how it occurred.

A fundamental engineering problem is guessing, before a problem occurs, what data might be necessary to understand and solve it. Eighty percent of performance problems are transient situations, and data capture is impossible once the problem occurs.
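The distributed collection approach described under Data transit issues, keeping a day's worth of detail locally and sending only reduced results to the NMS unless an investigation asks for more, can be sketched as follows. The sample format and retention size are illustrative assumptions.

```python
# Sketch of a distributed collector: detailed per-interval samples
# are retained locally, only a compact summary travels to the
# central NMS, and full detail is shipped solely on demand.
from collections import deque

class DistributedCollector:
    def __init__(self, retention=86400):
        # roughly a day of samples at one-second intervals
        self.samples = deque(maxlen=retention)

    def record(self, sample):
        """Store one detailed sample locally (never sent by default)."""
        self.samples.append(sample)

    def summary(self):
        """Reduced result sent to the central NMS each interval."""
        util = [s["util"] for s in self.samples]
        if not util:
            return {"n": 0}
        return {"n": len(util),
                "max_util": max(util),
                "avg_util": sum(util) / len(util)}

    def detail_for_review(self, n):
        """Full detail, shipped only when an investigation requires it."""
        return list(self.samples)[-n:]
```

The bounded deque also enforces the retention policy automatically: once a day of samples has accumulated, the oldest detail ages out without any central coordination.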
To address these types of problems, vendors of management systems create areas of coverage for their products related to the problems they believe are the most prevalent. The problems that their product addresses will determine the perceived quality of their product.

Targeted analytic capture can provide another approach to distributed data capture, in which isolated variables well known to affect certain critical conditions are automated for capture in either a distributed or centralized fashion. Limiting variable collection to enable more fluid monitoring across the infrastructure can be an effective tradeoff when the limitations imposed reflect in-depth insights into the networked environment, just as expert diagnosticians in the medical profession are in large part effective because they know what to ignore.

For example, high traffic volume on the accounting department link on Thursdays most likely signifies payroll processing, while on Tuesday the same condition might signify an anomaly. The monitoring tool must be able to recognize this relationship between time and monitored data. Beyond the collection of data, there are additional issues with how measurements trigger alerts and then with the analysis of the data.

Data triggering and threshold issues
As distributed systems and networks emerged, it became clear that there were situations when the operations team needed to know that a threshold had been crossed or a particular situation had occurred. Classic examples include a 60% load on a shared Ethernet network and SNMP traps for power failures. These thresholds and triggering events provided mechanisms for warning engineers about problems as they occurred. If the values were judiciously selected, they could even serve as warnings for potential problems.

However, over time, the traps and thresholds became a flood of incoming data without any relative value clearly discernible from the NMS. It was not unusual for threshold alerts, SNMP traps and related monitoring information to effectively paralyze the management infrastructure (people, processes and tools). In other words, the very phenomena designed to alert operations staff to issues with the infrastructure served to blunt the urgency of those insights. For example, a service provider reported receiving 60,000 alarms per day, but the NOC personnel depended on external technicians to alert them to actual problems.

Among the challenges of setting appropriate thresholds are the typically constant nature of thresholds, the largely ad hoc approaches to setting thresholds, and the element-specific nature of thresholds.

In most systems, thresholds are set to a specific value. When that value is exceeded, an alert of some kind is generated for the operations staff. For example, it is typical to set bandwidth utilization thresholds on point-to-point links to alert operations staff to possible network choke points.

One challenge with this approach is that it might not communicate potential issues until it is too late. The naturally cyclical nature of network utilization means that an environmental change, such as the introduction of new applications or the physical movement of a number of staff members, might cause an unexpected increase in traffic that causes a bottleneck. However, the trend is unseen until it rapidly crosses the threshold during the high-utilization time of day and causes a problem and potentially an outage. The static nature of many threshold values makes them less useful than they could be.

In addition to the static nature of many thresholds, the process of determining threshold values often has been informal and experiential instead of rigorous.
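The static threshold pattern described above, a fixed bandwidth utilization value that generates an alert only once it is exceeded, can be sketched in a few lines. The 80% setting is an illustrative assumption, and the comments note the blind spot the report identifies.

```python
# Sketch of a static utilization threshold on a point-to-point link.
# The 0.8 value is illustrative. Note the limitation discussed above:
# a rising trend stays invisible until the instant it crosses the line.

THRESHOLD = 0.8  # fixed fraction of link bandwidth

def check_link(utilization):
    """Return an alert string when the threshold is exceeded, else None."""
    if utilization > THRESHOLD:
        return "ALERT: link at {:.0%} utilization".format(utilization)
    return None
```

A link climbing from 40% to 79% over several weeks produces no output at all under this scheme, which is exactly the trend-blindness that motivates the baseline-driven approaches discussed next.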
Consider the threshold value for bandwidth utilization again. At what utilization should it be set? 60%? 80%? What are the characteristics of the channel that contribute to an appropriate setting? The complexities of setting these thresholds can lead either to overly conservative settings, effectively increasing inappropriate alerts, or to aggressive or even arbitrary settings that might contribute to network failures.

The historical practice of setting thresholds in isolation seldom reflects the complex realities of today's networks and the interrelation between the network components.

Element-based thresholds
Typically, thresholds are set for a single element and often for individual interfaces on an element (for example, each interface on a router might have different thresholds based on the type of medium connected). Relationships between elements and the effect of those relationships are often impossible to integrate into the thresholds and alerts.

Consider the implications of multiple Internet connections managed by the Border Gateway Protocol. These connections are likely different bandwidths, and also likely to connect to different ISPs and different points of presence. For most practical purposes, those links can be considered aggregated, and the routers connected to them are able to negotiate appropriate routing based on a complex algorithm. However, how will the operations team be alerted if the usage of those links is abnormally high? Typically, thresholds are not set for the aggregation of bandwidth that crosses element and interface boundaries, as is the requirement in this example. Effectively, the relationships between elements are seldom part of the threshold value or the threshold calculations.

Consolidation of information to create useful thresholds is difficult, and data gathered from multiple elements is difficult to correlate and aggregate for both alerting and reporting. The multitude of ways that data can combine to represent a warning (or even to indicate that a value that could indicate a problem doesn't necessarily apply in a particular case) makes it difficult to manually engineer all of the possible permutations. Understanding all of the possible correlations is also challenging, and likely requires significant engineering effort that is often unavailable within the network infrastructure management team.

Emerging threshold approaches
As the extent of the threshold challenges has become clear in the industry, more products are beginning to be designed around real-world requirements. Some products now use analytics, heuristics and/or baselines and trending to determine dynamic or adaptable thresholds that account for the interrelationship of measurements across a network.

Using analytics, products can determine the impact of correlating measurements and communicate the overall implications for the network. For instance, when certain links, interfaces or element backplanes begin to approach their capacity limits, analytical correlation can determine the snowball effect and alert network operations to impending trouble. Automation is really the only way to correlate and analyze the vast data that is available in a way that provides useful information to operations staff.

Similarly, a set of measurements might indicate a possibility that seemingly unrelated components in the network must be checked or that thresholds must be set on metrics elsewhere. For example, a slowdown in end-user response time for an application might indicate a need for detailed inspection of network links, system utilization and server process performance. These heuristics allow for correlations and connections across metrics that provide an overall perspective on delivering end-user services. Without the correlations, it is likely that the implications of a set of measurements might not be recognized until the problems manifest themselves.

Along this same vein, a number of products also provide the capability to create baselines for measurements that correlate to time of day, day of week and other trending characteristics. These baselines then serve to allow for tighter threshold constraints within a band around the baseline, allowing for alerting based on measurements that are "outside the norm" but not necessarily above a strict, invariable threshold.

Data analytics issues
After considering the challenges inherent in the data collection and threshold issues, there are still additional issues in the analysis of the data. Among them are the computational and network load created by centralized data analysis, the scope of analytics, and the challenges of root-cause issues and trending.

When vast amounts of monitoring data generated by a large network are returned to many central analysis systems, those systems will typically be forced either to filter that data aggressively or to apply substantial computational resources to analyze the multiple permutations of the metrics gathered. This challenge implies that either potentially useful data is ignored