[White Paper] Leveraging-Automation-For-Advanced-Network-Troubleshooting

A 10 page paper examining why network troubleshooting is so challenging and exploring opportunities to improve incident response times with a divide and conquer strategy. The paper addresses how automation can be applied to a traditional troubleshooting methodology for isolating the problem, gathering information, and automating the analysis of critical data.

Published in: Software

WHITEPAPER: LEVERAGING AUTOMATION FOR ADVANCED NETWORK TROUBLESHOOTING

Table of Contents

1. Executive Summary
2. Why is Network Troubleshooting So Hard?
   Causes of Network Outages
   The Cost of Network Outages
   Finding a Needle in a Haystack: Troubleshooting with Limited Visibility
3. Divide & Conquer with Network Automation
   A Network Map to Define the Scope of the Problem
   Analyzing Network Performance
   Analyzing Recent Changes
   Diagnosing Network Segments in Parallel

1. Executive Summary

When the network goes down, every minute counts. Data from a 2013 CDW survey suggests that network outages cost enterprises over $1.7B in lost revenue over the previous year. Much of this loss could have been avoided if network teams were able to discover the source of problems more quickly. Many enterprises have already deployed network monitoring systems to help them react to incidents faster, but it’s not enough. It’s equally important to improve mean-time-to-repair (MTTR) by accelerating troubleshooting times.

In this paper, we’ll examine why network troubleshooting is so challenging and look at opportunities to improve incident response times with a divide and conquer strategy. We’ll address how automation can be applied to a traditional troubleshooting methodology for isolating the problem, gathering information, and automating the analysis of critical data.

2. Why is Network Troubleshooting So Hard?

Effective troubleshooting requires a combination of both experience and an intimate knowledge of the network’s design. Even when a network engineer possesses both, there’s still the challenge of diagnosing network symptoms, involving a lot of manual data collection and analysis.

Causes of Network Outages

There’s a lot of hype and media coverage around network hacking and DDoS attacks, but far more network outages are actually caused by mistakes made by an organization’s own people. A recent Gartner study estimated that people and process issues will cause 80% of outages impacting mission-critical services through 2015. Of that number, more than 50% will be the result of a network upgrade or configuration change.

Top Causes of Network Outages*
  • 23% from router/switch failure (including DoS attacks)
  • 32% from a link failure (fiber cuts, network congestion)
  • 36% from a network change (upgrade, config change)
*Data from a 2013 Cisco Study

The Cost of Network Outages

Early in 2014, both Xbox LIVE and Facebook suffered well-publicized network outages, both caused by configuration errors during scheduled maintenance. For Xbox LIVE the untimely outage crippled the launch of one of their biggest online games. For Facebook, 30 minutes of downtime cost an estimated $500,000 in lost ad revenue. Of course, the cost to a business’ reputation may be far higher if customers are impacted.

Finding a Needle in a Haystack: Troubleshooting with Limited Visibility

Network visibility is increasingly sought after in the network industry, because better visualization of the network leads to better decision-making and faster problem resolution. Despite dozens of tools that claim to improve visibility, the most common window a troubleshooter has into the network is the command-line interface (CLI). Unfortunately, the CLI provides a narrow field of vision because the information troubleshooters can gather is limited to the rate at which they can issue and interpret commands, one device at a time.

When diagnosing a network problem, it’s estimated that engineers spend 80% of their time manually gathering data, and only 20% analyzing it. This time spent ‘data mining’ represents an opportunity for improvement. The figure below shows how important the task of gathering and analyzing information is during a typical troubleshooting scenario.

Figure 1: Visibility Challenges during Troubleshooting Diagnosis

Because the CLI provides limited visibility, engineers also need access to accurate ‘troubleshoot-ready’ network diagrams. These are diagrams that target the problem area and omit parts of the network that aren’t related to the problem. These maps should include design parameters such as routing protocols, access-lists, and VLANs. Today, very few tools can provide these types of maps; instead, engineers commonly rely on ‘static’ diagrams, typically created with MS Visio.

Although both the CLI and network diagrams (if available) help troubleshooters gather information about topology and configuration, they’re both poor tools for understanding what’s happening on the network. During an incident, engineers need to understand both live performance and recent changes. Even with a performance monitoring solution deployed, engineers often struggle due to ‘information overload’.

The last factor we’ll address in this paper is the dependence network teams have on ‘tribal’ knowledge. This refers to the all-too-common scenario where a network ‘hero’ needs to come in and solve a difficult problem. The reason is that only a very small percentage of team members have the troubleshooting experience or intimate network knowledge required to solve complex problems.

The figure below summarizes the challenges associated with visibility, and how they impact an engineer’s ability to find answers to their most critical questions.

Figure 2: Sources and Limitations of Network Visibility in an Enterprise Environment

3. Divide & Conquer with Network Automation

There’s no shortage of network monitoring tools to help engineers detect network outages, but the steps to diagnose a detected alarm are almost always manual. Effective troubleshooting requires a tool that can both increase network visibility and help divide and conquer time-consuming analyses.

A Network Map to Define the Scope of the Problem

Without visual aids, the ability to understand complex networks begins to break down. Network diagrams serve as the go-to visual aid for network engineers, but troubleshooting is dramatically hindered if the diagrams aren’t up-to-date and reliable. More than a repository of updated site diagrams, what a troubleshooter needs is a customized diagram that omits irrelevant parts of the network, which only serve to distract. For example, if a slow application traverses three data centers, an engineer needs a single diagram of the application flow, not three diagrams, one for each data center. In other words, a tailored diagram is the best asset.

A Fresh Approach: Dynamic Network Mapping

NetBrain’s unique network diagrams are dynamic in nature, which means they are updated automatically when the network changes. NetBrain diagrams can be created on-demand as well, so engineers don’t need to sort through dozens of diagrams during an incident. Instead, they can instantly create a custom map focused on the event.

Network engineers are frequently asked to troubleshoot poorly performing applications, with little more to go on than a report of slowness. To tackle this challenge, the engineer can dynamically create a custom layer-3 or layer-2 map of the application flow by entering two IP addresses (i.e. the source IP address and the IP of the application server). NetBrain will perform a comprehensive analysis of the routing, access-lists, and NAT for every hop in the path. The resulting map will show which devices are in the path of the application flow.

Figure 3: A Tailored Diagram of an Application Flow (Created On-Demand with NetBrain)
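
To make the idea of a path computed from two IP addresses concrete, here is a minimal Python sketch of hop-by-hop path discovery using longest-prefix matching. It is an illustration only, not NetBrain's method: the device names, the `ROUTES` tables, and the `find_gateway`, `next_hop`, and `trace_path` helpers are all hypothetical, and the sketch looks at routing alone, ignoring access-lists and NAT.

```python
import ipaddress

# Hypothetical routing tables: prefix -> next-hop device name.
# A next hop of None means the prefix is directly connected.
ROUTES = {
    "branch-rtr": {"0.0.0.0/0": "core-rtr", "10.1.0.0/16": None},
    "core-rtr": {"10.1.0.0/16": "branch-rtr", "10.20.0.0/16": "dc-rtr"},
    "dc-rtr": {"10.20.5.0/24": None, "0.0.0.0/0": "core-rtr"},
}


def find_gateway(src_ip):
    """Return the device with a directly connected prefix containing src_ip."""
    src = ipaddress.ip_address(src_ip)
    for device, table in ROUTES.items():
        for prefix, hop in table.items():
            if hop is None and src in ipaddress.ip_network(prefix):
                return device
    raise LookupError(f"no device is directly connected to {src_ip}")


def next_hop(device, dst_ip):
    """Longest-prefix match on one device. Returns (found, next_hop)."""
    dst = ipaddress.ip_address(dst_ip)
    best_net, best_hop = None, None
    for prefix, hop in ROUTES[device].items():
        net = ipaddress.ip_network(prefix)
        if dst in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, hop
    return best_net is not None, best_hop


def trace_path(src_ip, dst_ip, max_hops=16):
    """Return the ordered list of devices between src_ip and dst_ip."""
    device = find_gateway(src_ip)
    path = [device]
    for _ in range(max_hops):
        found, hop = next_hop(device, dst_ip)
        if not found:
            raise LookupError(f"{device} has no route to {dst_ip}")
        if hop is None:  # destination is directly connected: path complete
            return path
        if hop in path:
            raise RuntimeError(f"routing loop detected at {hop}")
        path.append(hop)
        device = hop
    raise RuntimeError("maximum hop count exceeded")
```

Calling `trace_path("10.1.4.20", "10.20.5.10")` on the sample tables yields the three devices an application-flow map would need to include, and nothing else.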

Analyzing Network Performance

It’s difficult to troubleshoot performance problems without being able to see what’s happening on the network. Many network teams have 24x7 network monitoring systems that generate alarms when an incident occurs. Examples of such monitoring tools include HP OpenView, IBM Tivoli, CA Spectrum, and Solarwinds NPM.

Figure 4: Example Network Monitoring and Alerting Tools

Network monitoring tools solve only half of the puzzle; after an alarm is generated, network teams still revert to manual methods of troubleshooting. An effective troubleshooting tool should integrate with network monitoring and ticketing systems to improve visibility into the problem area.

Diagnostic Monitoring on a Live Map

NetBrain’s monitoring function can be turned on from any map, or even launched from a third-party monitoring tool, to visualize the performance characteristics of each device and interface. When troubleshooting a slow application, engineers can quickly spot bandwidth bottlenecks on the interfaces (highlighted in red) or CPU/memory over-utilization on each device. For intermittent application issues, monitoring can be left to run overnight; it will collect and plot each data point to highlight trends.
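
As a rough illustration of how a bandwidth bottleneck can be spotted from polled data, the Python sketch below computes interface utilization from two octet-counter samples and flags interfaces above a threshold. The sample values, the 80% threshold, and the function names are assumptions made for the example; it does not describe how NetBrain or any particular monitoring tool calculates utilization, and it ignores counter wrap.

```python
# Hypothetical samples: interface -> (octets_start, octets_end, interval_s, speed_bps)
SAMPLES = {
    "core-rtr Gi0/1": (0, 3_375_000_000, 300, 100_000_000),
    "dc-rtr Gi0/2": (1_000_000, 376_000_000, 300, 1_000_000_000),
}


def utilization_pct(octets_start, octets_end, interval_s, speed_bps):
    """Percent utilization between two octet counter readings (no wrap handling)."""
    bits_sent = (octets_end - octets_start) * 8
    return 100.0 * bits_sent / (interval_s * speed_bps)


def find_bottlenecks(samples, threshold_pct=80.0):
    """Return (interface, utilization) pairs at or above the threshold, busiest first."""
    hot = []
    for name, (start, end, interval_s, speed_bps) in samples.items():
        pct = utilization_pct(start, end, interval_s, speed_bps)
        if pct >= threshold_pct:
            hot.append((name, round(pct, 1)))
    return sorted(hot, key=lambda item: item[1], reverse=True)
```

With the sample data, only the 100 Mbps interface on `core-rtr` crosses the threshold; it is the kind of link a live map would highlight in red.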

Figure 5: Monitoring Application Performance Factors (Issues Highlighted in Red)

Analyzing Recent Changes

With over one third of network outages resulting from a network change, visibility into what’s changed is critical. That means understanding not just what’s changed in configuration, but understanding the impact of those changes on routing, topology, application traffic, and more.

Automated Change Analysis

NetBrain can be configured to benchmark the network regularly so that network teams are better equipped to understand recent changes. During every benchmark, NetBrain collects live data and looks for changes in configuration, routing, inventory, as well as MAC/ARP/CDP/STP tables. NetBrain also includes comparative analysis capabilities to automatically highlight the changes side-by-side.

Figure 6: NetBrain’s System Benchmark Properties
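
The comparison step can be sketched in a few lines of Python. The example below diffs two configuration snapshots with the standard library's `difflib` and reports the lines that were added or removed. The snapshot contents and the `config_changes` helper are hypothetical, and this is only a sketch of the general technique, not a description of NetBrain's benchmark comparison.

```python
import difflib

# Hypothetical configuration snapshots taken at two benchmark times.
BEFORE = """interface Gi0/1
 ip address 10.1.1.1 255.255.255.0
 ip ospf cost 10"""

AFTER = """interface Gi0/1
 ip address 10.1.1.1 255.255.255.0
 ip ospf cost 100"""


def config_changes(before, after):
    """Return (added, removed) lists of lines between two config snapshots."""
    before_lines = before.splitlines()
    after_lines = after.splitlines()
    matcher = difflib.SequenceMatcher(a=before_lines, b=after_lines, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(before_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(after_lines[j1:j2])
    return added, removed
```

Here `config_changes(BEFORE, AFTER)` isolates the single changed OSPF cost line, which is exactly the kind of clue worth checking first when a path suddenly shifts.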

By way of example, when troubleshooting application slowness, an engineer can ‘rewind the clock’ and see how application traffic was being routed before the problem arose. Any changes could provide valuable clues into the problem.

Figure 7: Analyzing Application Traffic from Last Week

Diagnosing Network Segments in Parallel

When engineers rely on the command-line interface as their primary troubleshooting tool, they’re forced to diagnose the network in a serialized manner, one device at a time. That’s because the output of CLI commands is often hard to scan, and important data points are hard to find. Finding the ‘missing pieces’ of information may take dozens of commands.

Figure 8: Serialized Troubleshooting with the CLI (steps shown: CLI ping and traceroute used to determine path; multiple show-level commands in multiple CLI windows; quick “performance” test results; stare and compare to find deviations and anomalies; repeat until problem is found)

Effective troubleshooting should instead occur in parallel, meaning that commands are issued on many devices simultaneously and only the relevant data is parsed from the output. A network map serves as the best troubleshooting user interface because it provides a canvas on which to display the relevant data.

Figure 9: Diagnosing Interface Errors in Parallel (collisions and CRC errors labeled in red)

The image above shows what it may look like to diagnose the interfaces of multiple devices, in parallel, on a live network map. Troubleshooting automation can issue the appropriate commands on your behalf and extract the relevant data.

Adaptive Network Automation: A Powerful Alternative to Scripting

Writing Perl and Python scripts to automate data collection is powerful, but the vast majority of network engineers aren’t programmers and they struggle to realize the benefits. NetBrain eliminates the programming requirement from network automation with its ‘quick’ programming environment. Engineers can literally point and click to program their own NetBrain ‘Qapps’. As an example, the Check Interface Errors Qapp (which was written by a NetBrain engineer in less than 10 minutes) can be run to detect incrementing interface errors and speed/duplex mismatches.
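
For readers who do script, the parallel approach might look something like the following Python sketch: the same command is issued to several devices at once through a thread pool, and only the error counters are parsed out of the output. The `run_command` function is a hypothetical stand-in that returns canned text instead of opening a real session, and the device names and sample output are invented for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Canned, illustrative output keyed by hypothetical device name.
CANNED_OUTPUT = {
    "sw1": "GigabitEthernet0/1 is up\n     0 input errors, 0 CRC, 0 frame\n",
    "sw2": "GigabitEthernet0/1 is up\n     412 input errors, 398 CRC, 14 frame\n",
    "sw3": "GigabitEthernet0/1 is up\n     3 input errors, 0 CRC, 3 frame\n",
}

ERROR_RE = re.compile(r"(\d+) input errors, (\d+) CRC")


def run_command(device, command):
    """Hypothetical stand-in for a CLI session; a real script would connect here."""
    return CANNED_OUTPUT[device]


def check_device(device):
    """Run one command on one device and keep only the counters of interest."""
    output = run_command(device, "show interfaces")
    match = ERROR_RE.search(output)
    if match is None:
        return device, None
    return device, {"input_errors": int(match.group(1)), "crc": int(match.group(2))}


def check_all(devices, max_workers=8):
    """Check every device concurrently and return {device: counters}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(check_device, devices))
```

Running `check_all(["sw1", "sw2", "sw3"])` returns one small dictionary per device, so the engineer compares three pairs of numbers rather than three screens of raw output.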

Figure 10: NetBrain’s ‘Quick’ Programming Environment

Each new Qapp becomes a new feature, and it leverages a dynamic map to display the output. For troubleshooters, every Qapp is an executable diagnosis which can automatically extract and analyze the CLI data which would otherwise be collected manually. This helps network teams troubleshoot virtually any network issue in parallel, rather than one device at a time. It also helps network teams digitize and share their troubleshooting checklists.
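
To show what one item on such a checklist could look like once it is written down as code, the Python sketch below compares two samples of interface error counters and reports only the counters that grew between them. It is a hypothetical illustration of an 'incrementing errors' check, not the logic of any NetBrain Qapp; the data layout and the `incrementing_errors` name are assumptions.

```python
def incrementing_errors(first, second):
    """Compare two samples of {interface: {counter: value}}.

    Returns {interface: {counter: increase}} for counters that grew
    between the first and second sample.
    """
    findings = {}
    for interface, counters in second.items():
        earlier = first.get(interface, {})
        grew = {
            name: value - earlier[name]
            for name, value in counters.items()
            if name in earlier and value > earlier[name]
        }
        if grew:
            findings[interface] = grew
    return findings


# Hypothetical counter samples taken a few minutes apart.
SAMPLE_1 = {
    "sw2 Gi0/1": {"crc": 398, "collisions": 0},
    "sw3 Gi0/1": {"crc": 0, "collisions": 7},
}
SAMPLE_2 = {
    "sw2 Gi0/1": {"crc": 455, "collisions": 0},
    "sw3 Gi0/1": {"crc": 0, "collisions": 7},
}
```

A counter that is high but static (the collisions on `sw3`) is ignored, while the CRC count that is still climbing on `sw2` is reported, which separates an old problem from one that is happening now.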

About NetBrain Technologies, Inc.

Founded in 2004, NetBrain set out to pursue a new vision: automate time-consuming tasks associated with network documentation, design, and troubleshooting. NetBrain’s customers are using map-driven automation to eliminate manual network documentation, automate troubleshooting tasks, and mitigate security risks. NetBrain is headquartered in Burlington, MA with offices in Sacramento, CA, New York, and Beijing, China.

To learn more about NetBrain’s dynamic mapping solution, contact us at 781.221.7199 or download a free trial of NetBrain’s Enterprise Suite from www.netbraintech.com/trial.

NetBrain Technologies, Inc.
15 Network Drive
Burlington, MA 01803
+1 800 605 7964
info@netbraintech.com
www.netbraintech.com