Juniper Networks: Virtual Chassis High Availability
 


This presentation shares the findings of the second installment of a Network Test evaluation, commissioned by Juniper Networks, of its Virtual Chassis technology in Juniper EX8200 modular and Juniper EX4200/EX4500/EX4550 fixed-configuration switches.

In this second installment of a two-part project, the focus is on the reliability and resiliency of Virtual Chassis technology. Part I of this project focused on Virtual Chassis performance and scalability: http://juni.pr/13Zi1Sp. Visit http://juni.pr/dacenSS to learn more about Juniper's Data Center solutions.


Document Transcript

Juniper Networks Virtual Chassis: High Availability
November 2012
Juniper Virtual Chassis High Availability Assessment

Executive Summary

Juniper Networks commissioned Network Test to evaluate its Virtual Chassis technology in Juniper EX8200 modular and Juniper EX4200/EX4500/EX4550 fixed-configuration switches. In this second installment of a two-part project, the focus is on the reliability and resiliency of Virtual Chassis technology. Part I of this project focused on Virtual Chassis performance and scalability.

For most enterprise network managers, maintaining maximum uptime is an even more important consideration than raw performance. Application availability is not only expected but also demanded from IT infrastructure. Enterprises expect to have access to their data and applications 24/7/365. To ensure round-the-clock access, network infrastructure must be both robust and highly available. The tests described in this document validate that Juniper's Virtual Chassis technology addresses these requirements. Failovers in many cases are hitless, with no disruption in case of planned or unplanned events.

Among the highlights of high-availability testing:

• In all 46 test cases described here, the Virtual Chassis system recovered from component and/or link failures in less than 1 second, with hitless failover in many cases.

• Virtual Chassis technology offered total protection against a "split-brain" problem where multiple routing engines each try to act as a Virtual Chassis master. Even when test engineers simultaneously disabled multiple components, the Virtual Chassis system correctly migrated all control-plane state between master and backup routing engines.

• Juniper's Nonstop Software Upgrade (NSSU) feature performed a complete upgrade of all EX8200 Virtual Chassis components, including four switches and two external routing engines, with less than 1 second of downtime. In the Layer-2 test case, user data was "off the air" for less than 1/8 of a second with NSSU.
• Virtual Chassis technology recovered from component and/or link failure far faster than routing protocols. Enterprise routing protocols such as OSPF and Protocol Independent Multicast (PIM-SM) take tens of seconds, or longer, to recover from network topology changes. Since recovery times for Virtual Chassis configurations are less than 1 second, transitions are invisible to the routed network.

• Virtual Chassis configurations recovered from component and/or link failures far faster than spanning tree, the dominant switching protocol. Spanning tree is widely used for loop prevention, but even rapid spanning tree typically takes at least 1-3 seconds to converge after a failure. Virtual Chassis technology eliminates the need for spanning tree, and always recovers from failures in less than 1 second.
Introducing EX8200 Virtual Chassis Technology

Virtual Chassis technology allows up to four EX8200 switches, or any combination of up to 10 EX4200/EX4500/EX4550 switches, to be interconnected to form one logical entity. This unified approach has many advantages:

• Virtual Chassis technology doubles available bandwidth by using active/active redundancy instead of the active/passive model used by the spanning tree protocol.

• Virtual Chassis technology enhances scalability by adding capacity as needed. A Virtual Chassis configuration requires just two EX8200 or EX4200/EX4500/EX4550 chassis to get started; network architects can then add chassis as the network grows. There is no disruption to existing Virtual Chassis components, and the newly expanded Virtual Chassis system will continue to appear as one entity to the rest of the network.

• Virtual Chassis technology simplifies network management by using just one configuration file for all EX8200 or EX4200/EX4500/EX4550 chassis. This reduces the number of network elements seen by external monitoring and management tools, easing the management workload.

• Virtual Chassis technology allows "rightsizing" by combining switches with different port densities. In all tests described here, engineers combined smaller EX8208 and larger EX8216 switches to form a single logical entity. Similarly, engineers connected different combinations of EX4200/EX4500/EX4550 switches, each using different port densities and speeds, to create one logical device.

Test Methodology

Figure 1 shows the Layer-3 test bed used for this project. A key design goal of this project was to represent a typical data center or campus switching architecture, with core and access switches along with a WAN edge router.
In the core is one EX8200 Virtual Chassis instance comprising two Juniper EX8216 and two Juniper EX8208 switches, along with redundant EX8200-XRE200 external routing engines to handle control-plane tasks. The four Juniper EX8200 switches are also called line card chassis (LCCs).

At the access layer, there are two Virtual Chassis instances. One combined a Juniper EX4200 and the new Juniper EX4550 switch, while the other combined Juniper EX4200 and Juniper EX4500 switches. Also at the access layer is a standalone EX8208 deployed as an end-of-row or middle-of-row switch. The WAN edge router is represented by a single Juniper MX80.
As is often the case in modern data centers, most test traffic flowed in an "east-west" direction, between the various access nodes shown at the bottom of Figure 1. This traffic was evenly divided between IPv4 and IPv6 unicast flows. In addition, a small percentage of "north-south" traffic flowed between the Juniper MX80 WAN edge router and the access nodes. This north-south traffic also consisted of a mix of IPv4 and IPv6 unicast traffic, with IPv4 multicast added. In this Layer-3 scenario, all devices ran OSPF for unicast routing and Protocol Independent Multicast-Sparse Mode (PIM-SM) for multicast routing.

Figure 1: The Juniper Virtual Chassis Layer-3 high-availability test bed

The Spirent TestCenter traffic generator/analyzer served as the primary test instrument in this project. For the multicast traffic, the Spirent instrument emulated 48 IPv4 hosts sending to 50 multicast groups, for a total of 2,400 multicast routes. For the unicast traffic, the Spirent instrument emulated one IPv4 and one IPv6 host per port. The Spirent instrument connected to switch ports at the network edge via 12 Gigabit Ethernet ports and eight 10-Gbit/s Ethernet ports, and to the Juniper MX80 router via four 10-Gbit/s Ethernet ports. To showcase Virtual Chassis support for IEEE 802.3ad link aggregation, engineers used four-member link aggregation groups to connect 10-Gbit/s Ethernet switch ports.

Engineers repeated all tests twice, in Layer-2 and Layer-3 modes. In the Layer-2 configuration, engineers configured all EX8200 Virtual Chassis ports facing the access layer, along with all switch ports in the access layer, to use a single VLAN and broadcast domain.
In the Layer-3 tests, engineers used the routed VLAN interface (RVI) feature in Junos software to place all host-facing ports on the test bed in different IP subnets. Figure 2 shows the Layer-2 configuration of the test bed.

Figure 2: The Juniper Virtual Chassis Layer-2 high-availability test bed

A primary goal of all tests was to validate Juniper's claim of subsecond recovery from various types of hardware and software failures. For all tests, engineers determined recovery time using the following formula:

    Recovery time = frame loss / (total transmitted frames / test duration)

(Engineers also normalized overall transmit rates by configuring the test instrument's 10-Gbit/s interfaces to offer traffic at 1/10 the rate of its gigabit Ethernet interfaces.)
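The recovery-time formula amounts to dividing the number of lost frames by the offered frame rate. As a minimal illustration of the calculation (the frame counts and duration below are hypothetical, not figures from this report):

```python
def recovery_time(frame_loss, total_tx_frames, test_duration_s):
    """Recovery time = frame loss / (total transmitted frames / test duration).

    Equivalently: lost frames divided by the offered frame rate, on the
    assumption that frames are offered at a constant rate throughout the test.
    """
    offered_rate_fps = total_tx_frames / test_duration_s  # frames per second
    return frame_loss / offered_rate_fps  # seconds of outage

# Hypothetical run: 100 million frames offered over 60 seconds, 2,000 lost
t = recovery_time(frame_loss=2_000, total_tx_frames=100_000_000, test_duration_s=60)
print(f"{t * 1000:.3f} ms")  # prints "1.200 ms"
```

Because the method infers downtime from frame loss, a hitless failover (zero frame loss) correctly reports a recovery time of zero.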
EX8200 Virtual Chassis High Availability

In 24 different test cases, the EX8200 Virtual Chassis configuration recovered from link and component failures in well below 1 second, with hitless failover in many instances. Even in the absolute worst case – a Layer-3 test involving the loss of a line card – the highest recovery time seen was 174 ms, or less than one-fifth of a second.

As described in detail below, these tests involved failure of virtually every possible component attached to the EX8200 chassis on the test bed. Significantly, several of these tests validated Juniper's claim that even the simultaneous loss of multiple components will not cause a Virtual Chassis system to go into "split-brain" mode, where different routing engines each think they are the master controller. In all such test cases, the Virtual Chassis system correctly transferred "mastership" status when a component failure occurred.

Table 1 summarizes test results from high-availability testing of the Juniper EX8200 Virtual Chassis configuration. The remainder of this section discusses the tests performed in detail.

1. XRE failure

To increase resiliency, redundant EX8200-XRE200 external routing engines handle all control-plane tasks in an EX8200 Virtual Chassis configuration.
Control-plane tests                          Recovery time (seconds)
Test case                                    Layer 2    Layer 3
Master XRE failure                           0.020      0.000
Backup XRE failure                           0.000      0.000
Master LCC-RE failure                        0.000      0.000
Backup LCC-RE failure                        0.000      0.000
VCP failure (between master XRE and LCC-RE)  0.000      0.000
VCP failure (between backup XRE and LCC-RE)  0.000      0.000
VCP failure (between XREs)                   0.000      0.000
LCC failure                                  0.034      0.040
Line-card failure                            0.088      0.174

Data-plane tests                             Recovery time (seconds)
Test case                                    Layer 2    Layer 3
Link flapping                                0.024      0.020
LAG member failure                           0.031      0.032

Control- and data-plane tests                Recovery time (seconds)
Test case                                    Layer 2    Layer 3
Multiple failures                            0.081      0.102

Table 1: Juniper EX8200 Virtual Chassis high-availability test results

To determine the impact of the loss of one of these critical components, engineers rebooted the master XRE200 while
offering traffic from Spirent TestCenter at a constant rate. Next, engineers measured failover time using the formula described in the "Test Methodology" section above. Engineers then rebooted the same unit – after verifying it was now in a backup role – and again measured failover time. In all, these tests ran four times: once apiece for master and backup XRE200s, and once each in Layer-2 and Layer-3 configurations.

2. LCC-RE failure

The redundant line card chassis-routing engines (LCC-REs) in each EX8200 Virtual Chassis configuration also act in master and backup roles. To determine failover time, engineers rebooted the master LCC-RE in one member of the Virtual Chassis while offering test traffic at a constant rate. There was no frame loss in this test. Engineers then repeated the test by again rebooting the same LCC-RE after verifying it had shifted into a backup role. Again, there was zero frame loss. In fact, the EX8200 Virtual Chassis configuration dropped no frames in any LCC-RE failure test, in both Layer-2 and Layer-3 modes.

3. VCP failure (split-brain protection)

The Virtual Chassis Port (VCP) is a key component in any EX8200 Virtual Chassis configuration, since it carries not only Layer-2 and Layer-3 control-plane traffic but also the Virtual Chassis Control Protocol (VCCP) frames needed for Virtual Chassis technology to work. Given its importance, engineers tested three different types of VCP failures. All three VCP test cases involved the potential risk of "split-brain" configurations, where the loss of a link could cause multiple XREs and/or LCC-REs to claim master status at the same time. Engineers increased the risk of split-brain configurations by disabling multiple sets of links in all three test cases. In the first test case, engineers disabled two sets of links between master XRE and master LCC-RE ports while offering test traffic at a constant rate.
In the second case, engineers disabled two sets of links between backup XRE and backup LCC-RE ports, again while offering test traffic. Finally, engineers disabled multiple links between master and backup XREs, again with test traffic active. In all three cases, there was no frame loss and no split-brain configuration as a result of multiple VCP failures.

4. LCC failure

The LCC failure test determined the effect of the loss of an entire EX8200 chassis within a Virtual Chassis system. Here, engineers rebooted one Juniper EX8216 switch within the Virtual Chassis configuration while offering test traffic at a constant rate. This had the effect of taking the chassis and its line cards offline, forcing Virtual Chassis state migration. In Layer-2 and Layer-3 configurations, the EX8200 Virtual Chassis configuration recovered in less than 50 ms from the loss of a switch member.
5. Line card failure

Engineers rebooted one line card in a Juniper EX8216 switch within the Virtual Chassis while offering test traffic at a constant rate. This forced Virtual Chassis state migration for the flows that previously used this line card. (Engineers first verified that the line card carried test traffic.) In Layer-2 and Layer-3 configurations, the EX8200 Virtual Chassis configuration recovered in 174 ms or less from the loss of a line card, well below Juniper's stated ceiling of 1-second maximum recovery time.

6. Link flapping (soft failure)

In this scenario, engineers used the Junos command-line interface (CLI) to disable one member of the link aggregation group connecting the EX8200 Virtual Chassis with one of the other Virtual Chassis instances at the edge of the test bed. As in other cases, engineers configured Spirent TestCenter to offer traffic throughout the test, and derived failover time from frame loss. In both Layer-2 and Layer-3 scenarios, failover time due to link flapping was less than 25 ms.

7. Link flapping (hard failure)

This link-flapping test was similar to the previous one, only here engineers induced a failure by physically removing a cable from one member of the link aggregation group between the EX8200 Virtual Chassis and one of the Virtual Chassis instances at the edge of the network. Here, too, the Spirent test instrument offered traffic at a constant rate. In Layer-2 and Layer-3 scenarios, failover time due to loss of a physical link was 32 ms or less.

8. Multiple failures

It is unlikely, though not impossible, that several components will fail at once. To model this scenario, engineers created a multiple-failure test case, offering traffic while simultaneously disabling these components:

• Master XRE
• Master LCC-RE
• LCC (EX8216 chassis)
• Link aggregation group member

These failures required multiple concurrent state transitions.
Despite multiple concurrent component failures, the EX8200 Virtual Chassis recovered in less than 110 ms in both Layer-2 and Layer-3 test cases. Both results are well below Juniper’s stated guideline of 1-second recovery times.
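The headline claim of subsecond recovery can be double-checked against the Table 1 numbers. The sketch below simply transcribes the table and finds the worst-case result:

```python
# Recovery times in seconds (Layer 2, Layer 3), transcribed from Table 1
table1 = {
    "Master XRE failure": (0.020, 0.000),
    "Backup XRE failure": (0.000, 0.000),
    "Master LCC-RE failure": (0.000, 0.000),
    "Backup LCC-RE failure": (0.000, 0.000),
    "VCP failure (master XRE to LCC-RE)": (0.000, 0.000),
    "VCP failure (backup XRE to LCC-RE)": (0.000, 0.000),
    "VCP failure (between XREs)": (0.000, 0.000),
    "LCC failure": (0.034, 0.040),
    "Line-card failure": (0.088, 0.174),
    "Link flapping": (0.024, 0.020),
    "LAG member failure": (0.031, 0.032),
    "Multiple failures": (0.081, 0.102),
}

# Find the slowest recovery across both Layer-2 and Layer-3 columns
worst_case, worst_time = max(
    ((name, max(times)) for name, times in table1.items()),
    key=lambda pair: pair[1],
)
assert worst_time < 1.0  # every result beats the 1-second guideline
print(f"Worst case: {worst_case}, {worst_time * 1000:.0f} ms")
# prints "Worst case: Line-card failure, 174 ms"
```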
9. Nonstop Software Upgrade (NSSU)

In addition to measuring recovery times in various failover scenarios, engineers also exercised Juniper's Nonstop Software Upgrade capability on the EX8200 Virtual Chassis system. NSSU allows in-place software upgrades with little or no disruption. In this test, engineers upgraded the entire Virtual Chassis system – comprising two EX8216 chassis, two EX8208 chassis, and two EX8200-XRE200 external routing engines – from Junos version 12.1R2 to 12.1R3. As in other cases, engineers offered a mix of unicast and multicast traffic while conducting the upgrade, and derived recovery time from frame loss.

In the Layer-2 test case, the system recovered in 117 ms from NSSU. In the Layer-3 test case, which involved OSPF and PIM routing on every port, the system recovered in 857 ms from NSSU. Both figures are less than Juniper's stated guideline of 1-second recovery times.

EX4200/EX4500/EX4550 Virtual Chassis High Availability

While most tests focused on the Juniper EX8200 core switching platform, engineers also performed high-availability tests on Virtual Chassis instances at the edge of the network – those using the Juniper EX4200, Juniper EX4500, and the new Juniper EX4550 top-of-rack switches. In all test cases, Juniper EX4200/EX4500/EX4550 Virtual Chassis instances recovered from component and/or link failure in less than 1 second. Table 2 presents results from these tests.

Trials involved the same combination of IPv4, IPv6, unicast, and multicast traffic as in the EX8200 tests, with the majority of traffic in the "east-west" direction between Virtual Chassis instances. Most traffic used a partially meshed pattern between the two Virtual Chassis instances at the edge of the network; as defined in RFC 2285, a partial mesh is one in which all ports on one side of the network exchange traffic with all ports on the other side, but no traffic stays local.
That meant all traffic went through the core Virtual Chassis instance.

Control-plane tests                          Recovery time (seconds)
Test case                                    Layer 2    Layer 3
EX4200/EX4500 master failure                 0.000      0.284
EX4200/EX4550 master failure                 0.281      0.294
EX4200/EX4500 backup failure                 0.000      0.258
EX4200/EX4550 backup failure                 0.291      0.303
EX4200/EX4500 remove VCP                     0.000      0.000
EX4200/EX4550 remove VCP                     0.000      0.000

Data-plane tests                             Recovery time (seconds)
Test case                                    Layer 2    Layer 3
EX4200/EX4500 LAG member failure             0.073      0.056
EX4200/EX4550 LAG member failure             0.059      0.064

Table 2: Juniper EX4200/EX4500/EX4550 high-availability test results
As in the EX8200 tests, Juniper's stated guideline at the network edge is that recovery from component or link failure will take less than 1 second in all cases. As the results show, recovery times always fell within that limit. Even the very longest recovery time – 303 ms, in the case of Layer-3 recovery from loss of a Virtual Chassis backup switch – is still well below Juniper's 1-second guideline.

Because Virtual Chassis implementations at the edge of the network involve fewer components (there is no external routing engine, as was the case with the EX8200-XRE200 in the core switching tests), the number of test cases is reduced. Still, the results demonstrate that Virtual Chassis instances made up of Juniper EX4200/EX4500/EX4550 switches recover quickly from failures, in several cases with zero disruption.

1. Master failure

Working from the Junos CLI, engineers rebooted a master switch in each Virtual Chassis instance while offering a mix of unicast and multicast IPv4 and IPv6 traffic. As in the EX8200 Virtual Chassis tests, engineers then derived recovery time from frame-loss measurements. In this and all other EX4200/EX4500/EX4550 Virtual Chassis tests, engineers ensured Spirent test ports were attached only to the device not lost during the failover scenario. In all four combinations of switches and Layer-2 and Layer-3 configurations, the Virtual Chassis instances recovered in less than 300 ms. With Layer-2 traffic and the loss of a Juniper EX4200/EX4500 Virtual Chassis master, there was zero frame loss and thus zero disruption.

2. Backup failure

In this test, engineers used the Junos CLI to reboot a backup switch in each Virtual Chassis instance. In all four combinations of switches and Layer-2 and Layer-3 configurations, the Virtual Chassis instances recovered in 303 ms or less.
With Layer-2 traffic and the loss of a Juniper EX4200/EX4500 Virtual Chassis backup, there was zero frame loss and thus zero disruption.

3. Link flapping (soft failure)

In this scenario, engineers used the Junos CLI to disable one member of the link aggregation group linking each Virtual Chassis instance at the network edge with the EX8200 Virtual Chassis instance in the network core. As in other test cases, engineers configured Spirent TestCenter to offer test traffic throughout the test, and derived failover time from frame loss. In all Layer-2 and Layer-3 scenarios, failover time due to a software-initiated link flap was 120 ms or less.
4. Link flapping (hard failure)

This link-flapping test was similar to the previous one, only here engineers induced a failure by removing a cable from one member of the link aggregation group between the core EX8200 Virtual Chassis configuration and one of the Virtual Chassis instances at the edge of the network. Here, too, the Spirent test instrument offered traffic at a constant rate, mainly in an "east-west" direction between Virtual Chassis instances at the edge of the network. In Layer-2 and Layer-3 scenarios, failover time due to a loss of physical link was 73 ms or less.

5. VCP failure

In the context of EX4200/EX4500/EX4550 Virtual Chassis instances, VCPs are dedicated ports connecting each switch member, carrying all control-plane traffic including VCCP frames. Engineers assessed the loss of this critical component by physically disconnecting a primary VCP cable while simultaneously offering test traffic. In all four test cases, the loss of a VCP link caused little or no disruption to user traffic. In three of four cases, the Virtual Chassis system dropped zero frames. In the fourth instance, involving Layer-2 traffic and a Juniper EX4200/EX4550 Virtual Chassis configuration, the system dropped 16 frames out of more than 700 million total, the equivalent of about 9 microseconds of failover time.

Conclusion

These tests validated the high-availability features of Juniper's Virtual Chassis technology as implemented on Juniper EX8200 and Juniper EX4200/EX4500/EX4550 switches. In dozens of test cases involving split Layer-2/Layer-3 and pure Layer-3 scenarios, the systems under test always recovered from failure in less than 1 second. In every case, recovery times were far faster than those for common enterprise switching or routing protocols. Subsecond recovery is also helpful with management tasks, such as removing an XRE controller or Virtual Chassis member for maintenance or repair.
These tests also showcased NSSU for nearly hitless upgrades of Juniper EX8200 switches running Virtual Chassis technology. Here again, NSSU recovery times were less than 1 second in both Layer-2 and Layer-3 test cases. Moreover, the test results also showed how the multiple levels of redundancy in Virtual Chassis technology protect against “split-brain” problems, where different routing engines try to claim a master role. Despite engineers’ best efforts to create split-brain scenarios, Juniper’s Virtual Chassis technology always transferred master and backup roles as expected, with one routing engine playing the master role at any given instant. For most enterprise network managers, high availability is even more important than high performance; after all, a fast network is of little use if it can’t be reached. With subsecond recovery times in all cases (and zero frame loss in many tests), these results demonstrate how Juniper Virtual Chassis technology can make enterprise networks more reliable.
Appendix A: About Network Test

Network Test is an independent third-party test lab and engineering services consultancy. Our core competencies are performance, security, and conformance assessment of networking equipment and live networks. Our clients include equipment manufacturers, large enterprises, service providers, industry consortia, and trade publications.

Appendix B: Hardware and Software Releases Tested

This appendix describes the software versions used on the test bed. All tests were conducted in September 2012 at Juniper's headquarters facility in Sunnyvale, CA, USA.

Component                                          Version
Juniper EX8208, EX8216, EX8200-XRE200, EX4200,     Junos 12.3I0 (all tests except NSSU);
EX4500, EX4550, MX80                               Junos 12.1R2 and 12.1R3 (NSSU)
Spirent TestCenter                                 4.03.0496.0000

Appendix C: Disclaimer

Network Test Inc. has made every attempt to ensure that all test procedures were conducted with the utmost precision and accuracy, but acknowledges that errors do occur. Network Test Inc. shall not be held liable for damages that may result from the use of information contained in this document. All trademarks mentioned in this document are property of their respective owners.

Version 2012110100. Copyright © 2012 Network Test Inc. All rights reserved.

Network Test Inc.
31324 Via Colinas, Suite 113
Westlake Village, CA 91362-6761 USA
+1-818-889-0011
http://networktest.com
info@networktest.com