                                             Towards an Open Data Center
                                             with an Interoperable Network
                                             (ODIN)

                                             Volume 2: ECMP Layer 3
                                             Networks




                                             Casimer DeCusatis, Ph.D.
                                             Distinguished Engineer
                                             IBM System Networking, CTO Strategic Alliances
                                             IBM Systems and Technology Group


                                             May 2012




Executive Overview
       As the data center network scales out (both through the addition of more servers per pod and the
       interconnection of more pods per data center), conventional Ethernet designs need to be
       modified. This section considers the evolution from conventional network design to several
       emerging standards that support higher scalability and more complex network topologies. Note
       that this section does not differentiate between traditional, lossy Ethernet (in which frames may
       be dropped during transmission) and lossless Ethernet (also known as Converged Enhanced
       Ethernet, a different technology that guarantees frame delivery; it is discussed in a separate
       volume of the ODIN reference architecture).


2.1 STP protocol limitations
       In order to understand the motivation behind spine-leaf ECMP designs, we must first briefly
       review the traditional Ethernet approach using spanning tree protocol (STP) and multi-chassis link
       aggregation. Classic Ethernet uses STP to impose a hierarchical structure on the network, forcing
       the topology into a single-path, loop-free tree while still providing redundancy against both link
       and device failures. STP works by blocking ports on redundant paths so that every node in the
       network is reachable through exactly one path. If a device or link failure occurs, the spanning tree
       algorithm selectively opens one or more redundant paths to restore traffic flow, while still reducing
       the topology to a tree structure that prevents loops. Even when multiple links are connected for
       scalability and availability, only one link or link aggregation group (LAG) can be active. An
       enhanced version called multiple spanning tree protocol (MSTP) has also been standardized; it
       configures a separate spanning tree for each VLAN group and blocks all but one of the possible
       alternate paths within each spanning tree.
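
       To make the port-blocking behavior concrete, the short Python sketch below (an illustration added
       for this discussion, not taken from the original text) computes a tree over a small, hypothetical
       four-switch topology and reports which redundant links would be left blocked. Real STP elects a
       root bridge and compares path costs and bridge IDs rather than performing a simple breadth-first
       search, but the outcome is the same: redundant links carry no traffic until a failure occurs.

from collections import deque

# Hypothetical four-switch topology with redundant links (illustrative only).
links = [("root", "sw1"), ("root", "sw2"), ("sw1", "sw2"),
         ("sw1", "sw3"), ("sw2", "sw3")]

# Build an adjacency list.
adj = {}
for a, b in links:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

# Breadth-first search from the root bridge: the first link used to reach each
# switch becomes a forwarding (tree) link; everything else is redundant.
tree, visited, queue = set(), {"root"}, deque(["root"])
while queue:
    node = queue.popleft()
    for nbr in adj[node]:
        if nbr not in visited:
            visited.add(nbr)
            tree.add(frozenset((node, nbr)))
            queue.append(nbr)

# Links not in the tree are the ones a spanning tree would block.
blocked = [l for l in links if frozenset(l) not in tree]
print("forwarding links:", [tuple(sorted(l)) for l in tree])
print("blocked links:   ", blocked)
print("links blocked: %d of %d" % (len(blocked), len(links)))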

       The changing requirements of modern data center networks are forcing designers to reexamine
       the role of STP. One drawback of spanning tree protocol is that by blocking redundant ports and
       paths, it leaves the bandwidth on those redundant paths unused until a failure occurs. Put another
       way, spanning trees reduce aggregate bandwidth by forcing all traffic onto a single tree (they lack
       multi-pathing support), which significantly lowers utilization of the available network bandwidth.
       Additionally, in many situations the choice of which ports to block can lead to a suboptimal
       communication path between end nodes by forcing traffic to go up and down the spanning tree.
       Spanning tree also cannot be easily segregated into smaller domains to provide better scalability.
       Finally, the convergence time required to recompute the spanning tree and propagate the
       changes after a failure can vary and sometimes becomes quite large. When a link is added or
       removed, the entire network halts traffic while a new loop-free tree is configured; this can take
       anywhere from tens of seconds to minutes. Such interruptions are highly disruptive for virtual
       machine migration, storage traffic, and other applications; in some cases, they can lead to server
       or system crashes.

       As the data center grows larger and networking devices proliferate, designers are forced to pay
       closer attention than ever before to the complexity of managing a vast number of devices in a
       single fabric. Long-distance bridging between networks has also made the overall data center
       design more complex. Virtual machine mobility adds the requirement to extend Layer 2 VLANs
       between racks within a data center or between geographically separate data centers. These
       moves typically require network configuration changes, and in many cases the traffic may follow a
       non-optimal path between data centers.

       To optimize bandwidth utilization in this environment, several vendors have proposed proprietary
       alternatives to STP. Since these approaches only function within a single vendor network, and
        only for certain devices, we will not discuss them here. However, we will review link aggregation
        technology which can be used as part of a spine-leaf ECMP network.


2.2 MLAG – Multi-chassis Link Aggregation Groups
        The original link aggregation group (LAG) standard, IEEE 802.3ad, is supported by all switch
        vendors today and was developed in part to overcome the limitations of STP. LAG allows two or
        more physical links to be bonded into a single logical link between two switches, or between a
        server and a switch. Subsequently, an extension of LAG was proposed for standardization as
        IEEE 802.1AXbq, Link Aggregation Amendment: Distributed Resilient Network Interconnect. This
        extension is more commonly known as multi-chassis link aggregation, or MLAG. As illustrated in
        the figure below, one end of the link-aggregated port group is dual-homed into two different
        devices to provide device-level redundancy. The other end of the group is still single-homed into
        a single device. The single-homed end continues to run normal LAG and is unaware that MLAG
        is being used. For example, in the figure below, Device 1 continues to run normal LAG, while
        Device 2 and Device 3 run MLAG.




[Figure: Device 1 connects upward through a conventional LAG whose member links terminate on two
different switches, Device 2 and Device 3, which together run the MLAG.]

Figure 2.1 – Illustration of Multi-Chassis Link Aggregation


        If the data center network is designed with multiple links between devices, the connections
        between end systems and access switches, and between access switches and aggregation
        switches, can be based on MLAG. MLAG can be used to create a logically loop-free topology
        without relying on spanning tree protocol.

        MLAG builds on IEEE 802.1AX-2008 and inherits a key property of conventional LAG: all frames
        in a flow are sent over the same physical link, typically selected by hashing packet header fields,
        so that frame order is maintained and duplication is avoided. The number of hops traversed
        between two devices remains the same, so delay should be equivalent regardless of the path
        taken. As a relatively mature technology, MLAG has been deployed extensively, does not require
        new encapsulation, and works with existing OAM systems and multicast protocols.
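
        The sketch below (an illustration added here, not part of the original document) shows the kind of
        per-flow hashing a LAG or MLAG implementation typically applies: header fields are hashed to
        select one member link, so every frame of a flow follows the same physical path and ordering is
        preserved, while different flows spread across the aggregated links. Real switches use vendor-
        specific hash functions and field selections; the 5-tuple key and hash used here are assumptions
        made purely for illustration.

import hashlib

def pick_member_link(src_ip, dst_ip, src_port, dst_port, proto, num_links):
    """Hash a flow's 5-tuple to a LAG member link index (illustrative only)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# All frames of a given flow map to the same member link, preserving order;
# different flows are spread across the four aggregated links.
flows = [("10.1.1.10", "10.1.2.20", 49152 + i, 80, "tcp") for i in range(6)]
for flow in flows:
    print(flow, "-> link", pick_member_link(*flow, num_links=4))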

        MLAG is supported by a broad spectrum of switch vendors, many of whom refer to this technology
        by slightly different brand names. It is possible to run MLAG from a server to two top-of-rack
        (TOR) switches using NIC
       teaming on the server side with MLAG at the TORs, or from a blade switch to two aggregation
       switches, or from a TOR into two core switches. In each of these cases, each tier of switches or
       servers could come from a different vendor, since one end of the MLAG sees it simply as a
       traditional LAG. In other words, MLAG allows interoperability across tiers. The primary constraint
       is that the two devices used for dual-homing should come from the same vendor. For example, in
       the previous figure, Device 1 could be from one vendor while Device 2 and Device 3 could be
       from another vendor; however, Device 2 and Device 3 need to be from the same vendor.
       Furthermore, Device 1 and Devices 2/3 could be different device types. For example, Device 1
       could be a server or blade switch, while Device 2 and Device 3 could be aggregation switches.
       Here again the constraint is that Device 2 and Device 3 should typically be similar devices. In
       practice, most MLAG systems allow dual-homing across only two devices, because it is difficult to
       maintain a coherent state between more than two devices with sub-microsecond refresh times.

[Figure: Left panel ("MLAG and LAG"): downstream devices are dual-homed via MLAG into two upstream
switches, with every LAG link forwarding. Right panel ("STP and LAG"): the same topology with
single-homed LAGs, where STP blocks one of the redundant links.]

Figure 2.2 – MLAG and STP comparison

       As shown on the left, MLAG increases switching bandwidth and allows dual-homing; a change to
       the MLAG configuration impacts only the affected links. As shown on the right, STP can block
       certain links and does not allow dual-homing; a change to STP impacts the whole network.


2.3 Layer 3 Spine-Leaf Designs with VLAG and ECMP
       In this section, we describe the basic approach to a Layer 3 “Fat Tree” design (or Clos network)
       using Equal Cost Multi-Pathing (ECMP). As shown in the figure below, a Layer 3 ECMP design
       creates multiple paths between nodes in the network, across which traffic is load balanced. The
       number of paths varies by implementation; the figure shows a 4-way ECMP (in other words, there
       are 4 paths available for load balancing). Bandwidth can be adjusted by adding or removing paths
       up to the maximum allowed number of links. Unlike a Layer 2 network that relies on STP, no links
       are blocked with this approach. Broadcast loops are avoided by using different VLANs, and the
       network can route around link failures.
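
       To make the load-balancing behavior concrete, the following Python sketch (added here for
       illustration; the path names and hash choice are assumptions, not details from the original text)
       models one routed prefix with four equal-cost next hops. Each flow is hashed to one path, so no
       link is ever blocked, and when a path fails the routing protocol withdraws that next hop and flows
       are rehashed over the remaining equal-cost paths. Production routers perform this hashing in
       hardware with vendor-specific algorithms.

import zlib

# One destination prefix with four equal-cost next hops (4-way ECMP).
route = {"prefix": "10.1.0.0/16",
         "next_hops": ["spine1", "spine2", "spine3", "spine4"]}

def ecmp_next_hop(flow, next_hops):
    """Pick a next hop per flow; every packet of the flow takes the same path."""
    return next_hops[zlib.crc32(repr(flow).encode()) % len(next_hops)]

flow = ("10.1.1.10", "10.1.3.30", 51000, 443, "tcp")
print("all paths up:   ", ecmp_next_hop(flow, route["next_hops"]))

# If a link fails, the failed next hop is withdrawn and traffic is rehashed
# over the remaining equal-cost paths; no links are blocked at any time.
remaining = [h for h in route["next_hops"] if h != "spine3"]
print("spine3 removed: ", ecmp_next_hop(flow, remaining))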

       A typical Layer 3 ECMP implementation is shown in Figure 2.4 below. In this case, all attached
       servers are dual-homed (each server has two connections into the first tier of network switches,
       using active-active NIC teaming). This approach is known as a spine-and-leaf architecture: the
       switches closest to the servers are “leaf” switches, which interconnect with a set of “spine”
       switches over a set of load-balanced paths (a 4-way ECMP in this case). In this example, there
       are 16 IP subnets per rack and 64 IP subnets per uplink, for a total of 80 IP subnets. Using a two-
       tier design such as this, with reasonably sized (48-port) leaf and spine switches and relatively low
       oversubscription (3:1), it is possible to scale this L3 ECMP network up to around 1,000 – 2,000
       ports. The spine of the network supports east-west traffic between servers, which can account for
       over 90% of the traffic flow in modern data center networks. Note that the design does not require
       a larger form
        factor core switch, although we could certainly use core switches to replace the spine switches in
        this example. Any vendor product which supports L3 ECMP can be employed in this manner.


[Figure: Four parallel routed paths between a pair of switches, each path carrying its own subnet:
10.1.1.0/24, 10.1.2.0/24, 10.1.3.0/24, and 10.1.4.0/24.]

Figure 2.3 – Layer 3 ECMP design principles




[Figure: A two-tier leaf-spine fabric with 4 “spine” switches, 16 “leaf” switches, 40GbE links between
the tiers, and 48 servers attached to each “leaf” switch.]

Figure 2.4 – Example Layer 3 ECMP leaf-spine design
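
        The scaling figures quoted above can be checked with simple arithmetic. The sketch below
        assumes 48 x 10GbE server-facing ports and four 40GbE uplinks per leaf (one to each spine);
        the 10GbE server port speed is an assumption consistent with Figure 2.4, not a value stated in
        the text, but it reproduces the 3:1 oversubscription and shows how the design reaches the quoted
        port count.

# Assumed parameters, consistent with Figure 2.4 (the 10GbE server ports are
# an assumption; the figure states only 40GbE spine links and 48 servers/leaf).
servers_per_leaf = 48
server_port_gbps = 10
uplinks_per_leaf = 4           # one uplink to each of the 4 spine switches
uplink_gbps = 40
leaf_switches = 16

downlink_bw = servers_per_leaf * server_port_gbps    # 480 Gb/s toward servers
uplink_bw = uplinks_per_leaf * uplink_gbps           # 160 Gb/s toward spines
print("oversubscription: %d:%d = %.0f:1" % (downlink_bw, uplink_bw,
                                            downlink_bw / uplink_bw))

server_ports = servers_per_leaf * leaf_switches      # 768 server-facing ports
print("server ports in this two-tier example:", server_ports)
# Larger or additional spine switches push the same two-tier design toward
# the 1,000 - 2,000 port range quoted in the text.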


        A Layer 3 ECMP design can be enhanced by using virtual link aggregation groups (vLAGs), as
        shown in the figure below. If the devices attached to the network support the Link Aggregation
        Control Protocol (LACP), multiple connections to the same device can be logically aggregated
        under a common vLAG ID. It is also possible to use vLAG inter-switch links (ISLs) combined with
        the Virtual Router Redundancy Protocol (VRRP) to interconnect switches at the same tier of the
        network. VRRP supports IP forwarding between subnets, and protocols such as OSPF or BGP
        can be used to route around link failures. Server pods can be constructed as shown in this
        example, and VMs can be migrated to any server within the pod (note that migration across
        multiple pods is not supported by this design).



[Figure: Server pod 1 is dual-homed with LACP into a vLAG primary switch and a vLAG secondary switch
(vLAG3 through vLAG6 serving Subnets 1 through 4). The switch pair is joined by an ISL running
active/active VRRP at Layer 2/3, and connects to the Layer 3 tier over OSPF/BGP uplinks with a unique
subnet per link.]

Figure 2.5 – Spine-Leaf ECMP design example
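
        As a purely illustrative sketch (the addresses and identifiers below are invented for this
        discussion and do not appear in the document), the pod in Figure 2.5 can be summarized as a
        small data model: each subnet is served by a vLAG that spans the primary and secondary
        switches, and an active/active VRRP virtual IP per subnet gives the dual-homed servers a single
        default gateway that survives the loss of either switch.

# Hypothetical pod layout modeled after Figure 2.5 (all values are invented).
pod = {
    "vlag_pair": ("vlag-primary", "vlag-secondary"),   # joined by an ISL
    "uplinks": "OSPF/BGP to the Layer 3 tier, unique subnet per link",
    "subnets": {
        "Subnet 1": {"vlag": "vLAG3", "prefix": "10.10.1.0/24", "vrrp_vip": "10.10.1.1"},
        "Subnet 2": {"vlag": "vLAG4", "prefix": "10.10.2.0/24", "vrrp_vip": "10.10.2.1"},
        "Subnet 3": {"vlag": "vLAG5", "prefix": "10.10.3.0/24", "vrrp_vip": "10.10.3.1"},
        "Subnet 4": {"vlag": "vLAG6", "prefix": "10.10.4.0/24", "vrrp_vip": "10.10.4.1"},
    },
}

# Servers dual-home with LACP into the vLAG pair and use the VRRP virtual IP
# as their default gateway; VMs can move to any server within the pod, but not
# across pods, because the Layer 2 domain ends at the vLAG switch pair.
for name, net in pod["subnets"].items():
    print(f"{name}: {net['prefix']} via {net['vlag']}, gateway {net['vrrp_vip']}")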


       Layer 3 ECMP designs offer several advantages. They are based on proven, standardized
       technology that leverages smaller, less expensive rack or blade switches (virtual switches
       typically do not provide Layer 3 functions and would not participate in an ECMP network). The
       control plane is distributed, and smaller fault domains are possible using the pod design
       approach. These networks scale well (up to 1,000 – 2,000 ports with a slightly oversubscribed
       two-tier topology, and higher with more tiers).

       There are also some tradeoffs when using a Layer 3 ECMP design. The native Layer 2 domains
       are relatively small, which limits the ability to perform live VM migrations from any server to any
       other server. Such designs can also be fairly complex, requiring expertise in IP routing to set up
       and manage the network, and presenting complications with multicast domains. In the examples
       shown earlier, scaling is limited by the control plane, which can become unstable under some
       conditions (for example, if all the servers attached to a leaf switch boot up at once, the switch’s
       ability to process ARP and DHCP relay requests becomes a bottleneck in overall performance). In
       a Layer 3 design, the size of the ARP table supported by the switches can become a limiting
       factor in scaling the design, even if the MAC address tables are quite large. Finally, complications
       may result from the use of different hashing algorithms on the spine and leaf switches.
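
       A rough back-of-the-envelope calculation shows how the ARP table can become the limit; the VM
       density and table capacity below are illustrative assumptions added for this discussion, not figures
       from the document.

# Illustrative assumptions only: VM density and ARP table capacity vary widely.
leaf_switches = 16
servers_per_leaf = 48
vms_per_server = 20                      # assumed virtualization density
arp_table_capacity = 16_000              # assumed per-switch ARP table size

hosts = leaf_switches * servers_per_leaf * vms_per_server
print("host IP/MAC bindings to resolve:", hosts)                 # 15,360
print("fits in the assumed ARP table:", hosts <= arp_table_capacity)
# Doubling the VM density overruns this table even though much larger MAC
# address tables would not be the bottleneck.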


Summary
       We have outlined industry-standard best practices for the use of MLAG and Layer 3 ECMP
       “fat tree” networks within the data center. This approach addresses the rising CAPEX and OPEX
       associated with data center design, enables cost-effective scaling of the network, and supports
       virtualization of the servers attached to the network.




For More Information
IBM System Networking                                               http://ibm.com/networking/
IBM PureSystems                                                     http://ibm.com/puresystems/
IBM System x Servers                                                http://ibm.com/systems/x
IBM Power Systems                                                   http://ibm.com/systems/power
IBM BladeCenter Server and options                                  http://ibm.com/systems/bladecenter
IBM System x and BladeCenter Power Configurator                     http://ibm.com/systems/bladecenter/resources/powerconfig.html
IBM Standalone Solutions Configuration Tool                         http://ibm.com/systems/x/hardware/configtools.html
IBM Configuration and Options Guide                                 http://ibm.com/systems/x/hardware/configtools.html
Technical Support                                                   http://ibm.com/server/support
Other Technical Support Resources                                   http://ibm.com/systems/support

Legal Information

IBM Systems and Technology Group
Route 100
Somers, NY 10589

Produced in the USA
May 2012
All rights reserved.

IBM, the IBM logo, ibm.com, BladeCenter, and VMready are trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at
ibm.com/legal/copytrade.shtml

InfiniBand is a trademark of InfiniBand Trade Association.

Intel, the Intel logo, Celeron, Itanium, Pentium, and Xeon are trademarks or registered trademarks of
Intel Corporation or its subsidiaries in the United States and other countries.

Linux is a registered trademark of Linus Torvalds.

Lotus, Domino, Notes, and Symphony are trademarks or registered trademarks of Lotus Development
Corporation and/or IBM Corporation.

Microsoft, Windows, Windows Server, the Windows logo, Hyper-V, and SQL Server are trademarks or
registered trademarks of Microsoft Corporation.

TPC Benchmark is a trademark of the Transaction Processing Performance Council.

UNIX is a registered trademark in the U.S. and/or other countries licensed exclusively through The Open
Group.

Other company, product and service names may be trademarks or service marks of others.

IBM reserves the right to change specifications or other product information without notice. References
in this publication to IBM products or services do not imply that IBM intends to make them available in
all countries in which IBM operates. IBM PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow
disclaimer of express or implied warranties in certain transactions; therefore, this statement may not
apply to you.

This publication may contain links to third party sites that are not under the control of or maintained by
IBM. Access to any such third party site is at the user's own risk and IBM is not responsible for the
accuracy or reliability of any information, data, opinions, advice or statements made on these sites. IBM
provides these links merely as a convenience and the inclusion of such links does not imply an
endorsement.

Information in this presentation concerning non-IBM products was obtained from the suppliers of these
products, published announcement material or other publicly available sources. IBM has not tested
these products and cannot confirm the accuracy of performance, compatibility or any other claims
related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to
the suppliers of those products.

MB, GB and TB = 1,000,000, 1,000,000,000 and 1,000,000,000,000 bytes, respectively, when referring
to storage capacity. Accessible capacity is less; up to 3GB is used in service partition. Actual storage
capacity will vary based upon many factors and may be less than stated.

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using
standard IBM benchmarks in a controlled environment. The actual throughput that any user will
experience will depend on considerations such as the amount of multiprogramming in the user’s job
stream, the I/O configuration, the storage configuration and the workload processed. Therefore, no
assurance can be given that an individual user will achieve throughput improvements equivalent to the
performance ratios stated here.

Maximum internal hard disk and memory capacities may require the replacement of any standard hard
drives and/or memory and the population of all hard disk bays and memory slots with the largest
currently supported drives available. When referring to variable speed CD-ROMs, CD-Rs, CD-RWs and
DVDs, actual playback speed will vary and is often less than the maximum possible.
                                                                                                           QCW03020USEN-00
