White Paper Leveraging Automation for Advanced Network Troubleshooting

Table of Contents
1. Executive Summary
2. Why is Network Troubleshooting So Hard?
Causes of Network Outages
The Cost of Network Outages
Finding a Needle in a Haystack: Troubleshooting with Limited Visibility
3. Divide & Conquer with Network Automation
A Network Map to Define the Scope of the Problem
Analyzing Network Performance
Analyzing Recent Changes
Diagnosing Network Segments in Parallel
4. Case Study: Dimension Data Accelerates Troubleshooting on Customer Networks

1. Executive Summary
When the network goes down, every minute counts. Data from a 2013 CDW
survey suggests that network outages cost enterprises over $1.7B in lost
revenue over the previous year. Much of this loss could have been avoided if
network teams were able to discover the source of problems more quickly.
Many enterprises have already deployed network monitoring systems to help
them react to incidents faster, but it’s not enough. It’s equally important to
improve mean-time-to-repair (MTTR) by accelerating troubleshooting times.
In this paper, we’ll examine why network troubleshooting is so challenging and
look at opportunities to improve incident response times with a divide and
conquer strategy. We’ll address how automation can be applied to a traditional
troubleshooting methodology for isolating the problem, gathering information,
and automating the analysis of critical data.
2. Why is Network Troubleshooting So Hard?
Effective troubleshooting requires both experience and an intimate knowledge
of the network’s design. Even when a network engineer possesses both, there’s
still the challenge of diagnosing network symptoms, which involves a lot of
manual data collection and analysis.
Causes of Network Outages
There’s a lot of hype and media coverage around network hacking and DDoS
attacks, but far more network outages are actually caused by mistakes made by
an organization’s own people. A recent Gartner study estimated that people
and process issues will cause 80% of outages impacting mission-critical services
through 2015. Of that number, more than 50% will be the result of a network
upgrade or configuration change.
The Cost of Network Outages
Early in 2014, both Xbox LIVE and Facebook suffered well-publicized network
outages, both caused by configuration errors during scheduled maintenance.
For Xbox LIVE the untimely outage crippled the launch of one of their biggest
online games. For Facebook, 30 minutes of downtime cost an estimated
$500,000 in lost ad revenue. Of course, the cost to a business’ reputation may
be far higher if customers are impacted.
Top Causes of Network Outages*
o 23% from router/switch failure (including DoS attacks)
o 32% from a link failure (fiber cuts, network congestion)
o 36% from a network change (upgrade, config change)
*Data from a 2013 Cisco Study

Finding a Needle in a Haystack: Troubleshooting with Limited Visibility
Network visibility is increasingly sought-after in the network industry, because
better visualization of the network leads to better decision-making and faster
problem resolution. Despite dozens of tools that claim to improve visibility,
the most common window a troubleshooter has into the network is the
command-line interface (CLI). Unfortunately, the CLI provides a narrow field of
vision for troubleshooters because the information they can gather is limited to
the rate at which they can issue and interpret commands – one device at a time.
When diagnosing a network problem, it’s estimated that engineers spend 80%
of their time manually gathering data, and only 20% analyzing it. This time
spent ‘data mining’ represents an opportunity for improvement. The figure
below shows how important the task of gathering and analyzing information is
during a typical troubleshooting scenario.
Figure 1: Visibility Challenges during Troubleshooting Diagnosis
Because the CLI provides limited visibility, engineers also need access to
accurate ‘troubleshoot-ready’ network diagrams. These are diagrams that
target the problem area and omit parts of the network that aren’t related to the
problem. These maps should include design parameters including routing
protocols, access-lists, VLANs, etc. Today, very few tools can provide
these types of maps; instead, engineers commonly rely on ‘static’ diagrams,
often created with MS Visio.
Although both the CLI and network diagrams (if available) help
troubleshooters gather information about topology and configuration, they’re
both poor tools for understanding what’s happening on the network. During an
incident, engineers need to understand both live performance and recent
changes. Even with a performance monitoring solution deployed, engineers
often struggle due to ‘information overload’.
The last factor we’ll address in this paper is the dependence network teams
have on ‘tribal’ knowledge. This refers to the all-too-common scenario where a
network ‘hero’ needs to come in and solve a difficult problem. The reason is
that only a small percentage of team members have the troubleshooting
experience or intimate network knowledge required to solve complex
problems. The figure below summarizes the challenges associated with
visibility, and how they impact an engineer’s ability to find answers to their
most critical questions.
Figure 2: Sources and Limitations of Network Visibility in an Enterprise Environment
3. Divide & Conquer with Network Automation
There’s no shortage of network monitoring tools to help engineers detect
network outages, but the steps to diagnose a detected alarm are almost always
manual. Effective troubleshooting techniques require a tool that can both
increase network visibility and help divide and conquer time-consuming
analyses.

A Network Map to Define the Scope of the Problem
Without visual aids, the ability to understand complex networks begins to break
down. Network diagrams serve as the go-to visual aid for network engineers,
but troubleshooting is dramatically hindered if the diagrams aren’t up-to-date
and reliable.
More than a repository of updated site diagrams, what a troubleshooter needs
is a customized diagram that omits irrelevant parts of the network, which only
serve to distract. For example, if a slow application is traversing three
data centers, an engineer needs a single diagram of the application flow, not
three diagrams, one for each data center. In other words, a tailored diagram is
the best asset.
A Fresh Approach: Dynamic Network Mapping
NetBrain’s unique network diagrams are dynamic in nature, which means they
are updated automatically when the network changes. NetBrain diagrams can
be created on-demand as well, so engineers don’t need to sort through dozens
of diagrams during an incident. Instead, they can instantly create a custom map
focused on the event.
Network engineers are frequently asked to troubleshoot poorly performing
applications, with little more to go on than a report of slowness. To tackle this
challenge, the engineer can dynamically create a custom layer-3 or layer-2 map
of the application flow by entering two IP addresses (i.e. the source IP address
and the IP of the application server). NetBrain will perform a comprehensive
analysis of the routing, access-lists, and NAT for every hop in the path. The
resulting map will show which devices are in the path of the application flow.
Figure 3: A Tailored Diagram of an Application Flow (Created On-Demand with NetBrain)
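The routing portion of a hop-by-hop path analysis like the one described above can be illustrated with a small Python sketch. It is not NetBrain's implementation: the device names, the `ROUTES` table, and the functions `next_hop` and `trace_path` are hypothetical, and the sketch assumes the first-hop device for the source IP is already known. Access-lists and NAT are not modeled. At each hop, the route with the longest matching prefix decides where the traffic goes next.

```python
import ipaddress

# Hypothetical routing tables: device name -> list of (prefix, next-hop device).
# A next hop of None means the destination prefix is directly connected.
ROUTES = {
    "branch-rtr": [("0.0.0.0/0", "wan-rtr")],
    "wan-rtr": [("10.20.0.0/16", "dc-core"), ("0.0.0.0/0", None)],
    "dc-core": [("10.20.30.0/24", None)],
}


def next_hop(device, dst_ip):
    """Pick the next hop on `device` for `dst_ip` by longest-prefix match."""
    dst = ipaddress.ip_address(dst_ip)
    matches = []
    for prefix, hop in ROUTES[device]:
        network = ipaddress.ip_network(prefix)
        if dst in network:
            matches.append((network.prefixlen, hop))
    if not matches:
        raise LookupError(f"{device} has no route to {dst_ip}")
    # The most specific (longest) matching prefix wins.
    return max(matches, key=lambda m: m[0])[1]


def trace_path(first_device, dst_ip, max_hops=16):
    """Follow next hops from `first_device` until the destination is connected."""
    path = [first_device]
    while len(path) <= max_hops:
        hop = next_hop(path[-1], dst_ip)
        if hop is None:
            return path
        if hop in path:
            raise RuntimeError(f"routing loop detected at {hop}")
        path.append(hop)
    raise RuntimeError("hop limit exceeded")


if __name__ == "__main__":
    print(" -> ".join(trace_path("branch-rtr", "10.20.30.40")))
```

The resulting list of devices is the raw material for a tailored map: only the devices in the path need to be drawn.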

Analyzing Network Performance
It’s difficult to troubleshoot performance problems without being able to see
what’s happening on the network. Many network teams have 24x7 network
monitoring systems that generate alarms when an incident occurs. Examples of
such monitoring tools include HP OpenView, IBM Tivoli, CA Spectrum, and
Solarwinds NPM.
Figure 4: Example Network Monitoring and Alerting Tools
Network monitoring tools solve only half of the puzzle; after an alarm is
generated, network teams still revert to manual methods of troubleshooting. An
effective troubleshooting tool should integrate with network monitoring and
ticketing systems to improve visibility into the problem area.
Diagnostic Monitoring on a Live Map
NetBrain’s monitoring function can be turned on from any map, or even
launched from a third-party monitoring tool, to visualize the performance
characteristics of each device and interface. When troubleshooting a slow
application, engineers can quickly spot bandwidth bottlenecks on the
interfaces (highlighted in red) or CPU/Memory over-utilization on each device.
For intermittent application behavior issues, monitoring can be left to run
overnight; it will collect and plot each data point to highlight trends.

Figure 5: Monitoring Application Performance Factors (Issues Highlighted in Red)
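As a rough illustration of how bandwidth bottlenecks might be flagged from collected data, the Python sketch below computes interface utilization and reports anything above a threshold. The `InterfaceSample` structure, the sample values, and the 80% threshold are assumptions made for this example; they are not taken from NetBrain or from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class InterfaceSample:
    """One polled data point for an interface (hypothetical structure)."""
    device: str
    interface: str
    speed_bps: int  # configured link speed
    in_bps: int     # measured inbound rate
    out_bps: int    # measured outbound rate


def utilization_pct(sample):
    """Utilization of the busier direction, as a percentage of link speed."""
    return 100.0 * max(sample.in_bps, sample.out_bps) / sample.speed_bps


def find_bottlenecks(samples, threshold_pct=80.0):
    """Return (device, interface, utilization) for samples at or over threshold."""
    return [
        (s.device, s.interface, round(utilization_pct(s), 1))
        for s in samples
        if utilization_pct(s) >= threshold_pct
    ]


# Illustrative values only.
SAMPLES = [
    InterfaceSample("wan-rtr", "Gi0/0", 1_000_000_000, 920_000_000, 310_000_000),
    InterfaceSample("dc-core", "Te1/1", 10_000_000_000, 1_200_000_000, 900_000_000),
]

if __name__ == "__main__":
    for device, interface, pct in find_bottlenecks(SAMPLES):
        print(f"{device} {interface}: {pct}% utilized")
```

The same pattern extends to CPU and memory: collect a sample per device, compare it against a threshold, and highlight only the outliers.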
Analyzing Recent Changes
With over one third of network outages resulting from a network change,
visibility into what’s changed is critical. That means understanding not just
what’s changed in configuration, but also the impact of those
changes on routing, topology, application traffic, and more.
Automated Change Analysis
NetBrain can be configured to benchmark the network regularly so that network
teams are better equipped to understand recent changes. During every
benchmark, NetBrain collects live data and looks for changes in configuration,
routing, and inventory, as well as in MAC/ARP/CDP/STP tables. NetBrain also
includes comparative analysis capabilities to automatically highlight the
changes side-by-side.
Figure 6: NetBrain’s System Benchmark Properties

By way of example, when troubleshooting application slowness, an engineer can
‘rewind the clock’ and see how application traffic was being routed before the
problem arose. Any changes could provide valuable clues into the problem.
Figure 7: Analyzing Application Traffic from Last Week
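The idea of comparing a stored benchmark against live data can be sketched with Python's standard `difflib` module. This is only an illustration of the general technique, not NetBrain's comparison engine; the function name `config_changes` and the two configuration snippets are made up for the example.

```python
import difflib


def config_changes(before, after):
    """Return only the added (+) and removed (-) lines between two configs."""
    diff = list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="benchmark",
            tofile="live",
            lineterm="",
            n=0,  # no context lines, only the changes themselves
        )
    )
    # The first two entries are the '---' and '+++' file headers; skip them,
    # then drop the '@@' hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical snapshots of the same device, taken a week apart.
BENCHMARK = "\n".join([
    "interface Gi0/0",
    " ip address 10.1.1.1 255.255.255.0",
    " duplex auto",
    "router ospf 1",
    " network 10.1.1.0 0.0.0.255 area 0",
])
LIVE = "\n".join([
    "interface Gi0/0",
    " ip address 10.1.1.1 255.255.255.0",
    " duplex half",
    "router ospf 1",
    " network 10.1.1.0 0.0.0.255 area 0",
])

if __name__ == "__main__":
    for line in config_changes(BENCHMARK, LIVE):
        print(line)
```

The same comparison applies to any text captured at benchmark time, such as routing tables or ARP tables, provided both snapshots are collected in the same format.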
Diagnosing Network Segments in Parallel
When engineers rely on the command line interface as their primary
troubleshooting tool, they’re forced to diagnose the network in a serialized
manner, one device at a time. That’s because the output of CLI commands is
often hard to scan, and important data points are hard to find. Finding the
‘missing pieces’ of information may take dozens of commands.
Figure 8: Serialized Troubleshooting with the CLI
[Figure labels: CLI ping and traceroute used to determine path; multiple show level commands in multiple CLI windows; quick “performance” test results; stare and compare to find deviations and anomalies; repeat until problem is found]

Effective troubleshooting should instead occur in parallel, meaning that
commands are issued on many devices simultaneously and only the relevant
data is parsed from the output. A network map serves as the best
troubleshooting user interface because it provides a canvas on which to
display the relevant data.
Figure 9: Diagnosing Interface Errors in Parallel (collisions and CRC errors labeled in red)
The image above shows what it may look like to diagnose the interfaces of
multiple devices, in parallel, on a live network map. Troubleshooting
automation can issue the appropriate commands on your behalf, and extract
the relevant data.
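For readers who do script, the parallel approach described above might look roughly like the following Python sketch. It assumes a hypothetical `run_command` collector, stubbed here with canned output so the example is self-contained; a real collector would open an SSH session to each device. The output format the regular expression expects is also an assumption and will vary by platform.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Canned output standing in for live CLI sessions (illustrative values only).
CANNED_OUTPUT = {
    "sw1": "GigabitEthernet0/1 is up\n  0 input errors, 0 CRC, 0 frame\n",
    "sw2": "GigabitEthernet0/1 is up\n  1532 input errors, 1498 CRC, 0 frame\n",
    "sw3": "GigabitEthernet0/1 is up\n  3 input errors, 0 CRC, 0 frame\n",
}

# Pull only the relevant counters out of the full command output.
ERROR_PATTERN = re.compile(r"(\d+) input errors, (\d+) CRC")


def run_command(device, command):
    """Hypothetical collector; replace with a real SSH or API call."""
    return CANNED_OUTPUT[device]


def check_device(device):
    """Run one command on one device and extract the error counters."""
    output = run_command(device, "show interfaces")
    match = ERROR_PATTERN.search(output)
    if match is None:
        return device, None
    return device, {"input_errors": int(match.group(1)), "crc": int(match.group(2))}


def diagnose(devices, max_workers=8):
    """Query all devices concurrently instead of one at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(check_device, devices))


if __name__ == "__main__":
    for device, counters in diagnose(["sw1", "sw2", "sw3"]).items():
        print(device, counters)
```

Threads suit this workload because most of the time is spent waiting on each device to respond, so the total run time approaches that of the slowest device rather than the sum of all of them.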
Adaptive Network Automation – A Powerful Alternative to Scripting
Writing Perl and Python scripts to automate data collection is powerful, but the
vast majority of network engineers aren’t programmers and they struggle to
realize the benefits. NetBrain eliminates the programming requirement from
network automation with its ‘quick’ programming environment. Engineers can
literally point and click to program their own NetBrain ‘Qapps’.
As an example, the Check Interface Errors Qapp, which was written by a
NetBrain engineer in less than 10 minutes, can be run to detect incrementing
interface errors and speed/duplex mismatches.
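The logic behind such a check can be expressed in a few lines of Python. The sketch below is not the Qapp itself, whose internals are not described here. It assumes error counters have been sampled twice and that the operational speed and duplex of each link end are already known; all names and values are hypothetical.

```python
def incrementing_errors(first, second):
    """Return interfaces whose error counter grew between two samples."""
    return {
        name: second[name] - first[name]
        for name in first
        if name in second and second[name] > first[name]
    }


def link_mismatches(links, settings):
    """Return links whose two ends disagree on (speed, duplex)."""
    return [(a, b) for a, b in links if settings[a] != settings[b]]


# Illustrative data: error counters sampled a few minutes apart.
FIRST_SAMPLE = {"sw1:Gi0/1": 10, "sw2:Gi0/1": 1498}
SECOND_SAMPLE = {"sw1:Gi0/1": 10, "sw2:Gi0/1": 1610}

# Illustrative data: links as (end A, end B), with operational settings per end.
LINKS = [("sw1:Gi0/2", "rtr1:Gi0/0"), ("sw2:Gi0/1", "rtr2:Gi0/0")]
SETTINGS = {
    "sw1:Gi0/2": (1000, "full"),
    "rtr1:Gi0/0": (1000, "full"),
    "sw2:Gi0/1": (100, "half"),
    "rtr2:Gi0/0": (100, "full"),
}

if __name__ == "__main__":
    print(incrementing_errors(FIRST_SAMPLE, SECOND_SAMPLE))
    print(link_mismatches(LINKS, SETTINGS))
```

Comparing two samples matters because a large but static error count may be left over from an old event, while a counter that is still climbing points to a live problem.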

Figure 10: NetBrain’s ‘Quick’ Programming Environment
Each new Qapp becomes a new feature, and it leverages a dynamic map to
display the output. For troubleshooters, every Qapp is an executable diagnosis
that can automatically extract and analyze the CLI data that would
otherwise be collected manually. This helps network teams troubleshoot
virtually any network issue in parallel, rather than one device at a time. It also
helps network teams digitize and share their troubleshooting checklists.

4. Case Study: Dimension Data Accelerates Troubleshooting on Customer Networks
CUSTOMER PROFILE:
Industry: Managed Services
Company: Dimension Data
CHALLENGE:
Dimension Data does not own the customer networks they manage, so they
struggle to gain and maintain intimate knowledge of those networks, which is
inherently gained through day-to-day operations.
SOLUTION:
Dimension Data utilizes NetBrain to automate diagram creation, visualize
performance issues to expedite diagnosis, and easily share information for
collaborative troubleshooting sessions.
BENEFIT:
NetBrain’s advanced network visualization and automation capabilities enable
Dimension Data to shorten typical diagnosis and repair time by as much as 50%.
Dimension Data specializes in information technology services, with
operations on every inhabited continent. Dimension Data's focus areas
include network integration, security solutions, data center solutions,
converged communications, and a range of professional, consulting, and
managed services. A major challenge the company consistently faces is
understanding their customers’ networks to the extent necessary to
diagnose and troubleshoot complex issues and resolve network outages
effectively.
Dimension Data deployed NetBrain in their customer environments, in many
cases integrating the tool with the NetCool alarm system, Opsware
configuration management solution, and Vitalnet’s performance trending
solution. With these integrations, an alarm reported by HP OpenView is
instantly translated to a map inside NetBrain Workstation.
NetBrain continues to offer value to Dimension Data in three areas:
• On-demand network mapping effectively removes dependencies on
manual network diagrams, which are often inconsistent and error-prone.
• Network performance diagnosis via Dynamic Diagrams enables
lower-level engineers to troubleshoot advanced problems.
• Engineers share information via NetBrain for collaboration.
The following are some ‘war stories’ reported by this customer:
Detecting Serious Routing Issues on the Accudyne Network
NetBrain was able to provide real-time network visibility into the
Accudyne network and help identify serious routing issues. The tool
was used to highlight the congestion points on the map and
ultimately tie the problem to equal cost routes and MPLS design
segregation.
Troubleshooting Slowness to a Server
Previously it took Dimension Data almost two and a half hours to
determine the source and destination path of an application server
inside the Accudyne network, followed by another two hours to
diagnose the problem. With NetBrain, the task to find the path took
two minutes, and another five minutes was all that was needed to
diagnose the issue.
Troubleshooting MS Outlook Slowness to Tokyo
The Tokyo office was experiencing slowness sending Outlook
attachments. Multiple tickets had been opened for this issue and
several engineers had already looked into it. NetBrain was then
applied and, within three minutes, it was determined that there was a
duplex issue on the edge WAN port.
NetBrain saves time when time is critical. As a Dimension Data Network
Integration Engineer reported, “It has changed the way I approach
troubleshooting.”

About NetBrain Technologies, Inc.
Founded in 2004, NetBrain set out to pursue a new vision: automate time-
consuming tasks associated with network documentation, design, and
troubleshooting. NetBrain’s customers are using map-driven automation to
eliminate manual network documentation, automate troubleshooting tasks,
and mitigate security risks. NetBrain is headquartered in Burlington, MA with
offices in Sacramento, CA, New York, and Beijing, China.
To learn more about NetBrain’s dynamic mapping solution, contact us at
781.221.7199 or download a free trial of NetBrain’s Enterprise Suite from
www.netbraintech.com/trial.
NetBrain Technologies, Inc.
15 Network Drive
Burlington, MA 01803
+1 800 605 7964
info@netbraintech.com
www.netbraintech.com
