
An FPGA for high end Open Networking



Use of an FPGA for High End Open Networking

Published in: Technology


  1. Open Networking with FPGAs: a bare and IPv6 FPGA networking box (Roberto Innocente, 8 Mar 2016)
  2. Summary
      ● HPRC: High Performance Reconfigurable Computing
      ● FPGA networking performance
      ● Interchip/Interboard communication: Interlaken
      ● OpenFlow/OpenSwitch
      ● Clos networks
      ● VLB: Valiant Load Balancing
      ● New DC network topologies
      ● VL2: Virtual Layer 2
      ● Monsoon
  3. HPRC: High Performance Reconfigurable Computing
  4. HPRC project
      ● High Performance Reconfigurable Computing (HPRC)
      ● In the last decade the peak floating-point performance of FPGAs has reached that of GPUs (1.5 to 10 Tflop/s)
      ● The same escalation of performance has happened in FPGA networking (the top is now 128 lanes of up to 32.75 Gb/s each per FPGA)
      ● Expected power consumption is 1/30 that of CPUs and 1/10 that of GPUs
      ● More info at
  5. HPRC project/2
      ● Part of HPC is strictly correlated with network performance
      ● FPGA performance is escalating in this field too. Recall the NetFPGA project from Stanford and the 2014 Huawei demonstration of 400 Gb/s line cards based on FPGAs
      ● We are therefore involved in:
        – Very high data rate cluster communications (Interlaken)
        – Statistically sound implementation of switching networks supporting random traffic matrices (Valiant LB) and OpenFlow control
        – OpenFlow switches implementing IPv6 on FPGA (follow-up of the open source NetFPGA project)
  6. FPGA networking performance
  7. June 2014, Huawei - Xilinx - Spirent
      ● Demonstrated and tested a 400 Gb/s core router implemented on FPGA cards, the Huawei NE5000E
      ● FPGA: Virtex-7 XC7VH870T
      ● [Block diagram: two Virtex-7 XC7VH870T FPGAs, each labelled "Interlaken 400 Gb/s MAC/PCS Bridge", with 40/48 x 10/12.5 G Interlaken links, 16 x 25 G links, four CFP2 modules and misc. control]
  8. 400 Gb/s core router
      ● Virtex-7 FPGA H870T:
        – 400 Gb/s throughput
        – 1280-bit buses
        – 312.5 MHz buses
        – foundry: TSMC, 28 nm lithography
      ● Virtex UltraScale (US) VU095: single-chip solution for 400 Gb/s, ~900k cells
      ● Virtex US VUP190: higher density
      ● 16 nm US ….
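As a quick consistency check on the figures above (a worked calculation added for illustration, assuming one full bus word is transferred per clock cycle), the bus width and the bus clock multiply out to the stated throughput:

```latex
\[
1280\ \text{bit} \times 312.5 \times 10^{6}\ \text{s}^{-1}
  = 4 \times 10^{11}\ \text{bit/s}
  = 400\ \text{Gb/s}
\]
```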
  9. Xilinx last-generation FPGAs
      ● 128 transceivers up to 32.75 Gb/s
      ● 4 PCIe Gen4 x8
      ● 8 Interlaken 150 Gb/s
      ● 12 Ethernet 100G w/ RS-FEC
      ● The VU13P has 128 GTY transceivers and 448 HP (High Performance) I/Os
      ● Recently, with 16 nm FinFET+, Xilinx showed 56 Gb/s PAM4 transceivers
  10. Xilinx Zynq UltraScale+ (US+) SoC
      ● 44 GTH transceivers up to 16.3 Gb/s
      ● 28 GTY transceivers up to 32.75 Gb/s
      ● 5 PCIe Gen4 x8
      ● 4 Interlaken 150 Gb/s
      ● 4 Ethernet 100G w/ RS-FEC
  11. InterChip/InterBoard communications
  12. Interlaken
      ● Originally specified by Cortina Systems and Cisco in 2006: a narrow, high-speed, channelized packet interface (Framer/MAC to L2/L3 interface, or switch fabric to switch fabric)
      ● Supports up to 256 channels, or up to 64K using extensions
      ● A simple control word delineates packets
      ● A continuous meta-frame of programmable frequency assures lane alignment
      ● In-band and out-of-band flow control with semantics similar to Xon/Xoff
      ● 64B/67B encoding with scrambling
      ● Data is segmented into bursts (subsets of the original packet)
      ● Each burst is bounded by 2 control words, one before and one after (start of burst, end of burst), indicating the channel it belongs to. The burst size is configurable
      ● As in ATM, the use of bursts allows channels to be multiplexed, avoiding long latencies for high-priority channels
      ● MetaFrame = 4 control words
      ● Data is transmitted over a configurable number of SerDes lanes (the protocol works from 1 lane upwards, with no maximum)
      ● The fundamental unit of data sent across the lanes is an 8-byte word
      ● Lane striping: SerDes (Serializers/Deserializers) rates went from 6 Gb/s at the time of the specification to 10/12 Gb/s and now ~28 Gb/s. Xilinx VUP: 128 lanes x 32.75 Gb/s
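The burst segmentation and lane striping described on this slide can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the Interlaken wire format: control words are represented as plain tuples, the 256-byte maximum burst size is an arbitrary choice (the slide only says it is configurable), and the function names are hypothetical.

```python
WORD_BYTES = 8  # fundamental unit sent across the lanes (from the slide)


def segment_into_bursts(packet, burst_max=256):
    """Split a packet into bursts of at most burst_max bytes (assumed size)."""
    return [packet[i:i + burst_max] for i in range(0, len(packet), burst_max)]


def frame_burst(channel, burst):
    """Pad a burst to whole 8-byte words and bound it with two control words.

    Control words are modelled as tuples; the real bit layout is not shown here.
    """
    padded = burst + b"\x00" * (-len(burst) % WORD_BYTES)
    words = [padded[i:i + WORD_BYTES] for i in range(0, len(padded), WORD_BYTES)]
    return [("ctrl", "start", channel)] + words + [("ctrl", "end", channel)]


def stripe(words, n_lanes):
    """Distribute consecutive words round-robin over the SerDes lanes."""
    lanes = [[] for _ in range(n_lanes)]
    for index, word in enumerate(words):
        lanes[index % n_lanes].append(word)
    return lanes


def unstripe(lanes):
    """Receiver side: read the lanes round-robin to restore the word order."""
    total = sum(len(lane) for lane in lanes)
    return [lanes[i % len(lanes)][i // len(lanes)] for i in range(total)]


def send_packet(channel, packet, n_lanes, burst_max=256):
    """Segment, frame and stripe one packet; returns (word stream, lanes)."""
    stream = []
    for burst in segment_into_bursts(packet, burst_max):
        stream.extend(frame_burst(channel, burst))
    return stream, stripe(stream, n_lanes)
```

Because every burst carries its channel number, bursts of different channels could be interleaved in the word stream before striping, which is what keeps high-priority channels from waiting behind a long packet.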
  13. OpenFlow, OpenFlow Switch, Mininet/Quagga
  14. OpenFlow/1
      [Diagram: a Controller speaks the OpenFlow Protocol to the OpenFlow channel of an OpenFlow Switch, which contains a Group Table and a Pipeline of Flow Tables]
      Using the OpenFlow protocol, the Controller can add, update and delete flow entries in the Flow Tables. Matching starts in the first Flow Table and can continue along the Pipeline. Within a table, the highest-priority matching entry is applied. If nothing matches, the packet is treated according to the table-miss flow entry (usually discard for the last table, go to the next table for the others).
  15. OpenFlow/2
      ● Instructions associated with flows:
        – packet forwarding: e.g. send through port 3
        – packet modification: e.g. incrementing the hop count
        – processing the packet according to the group table
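The matching rules of the two OpenFlow slides can be sketched as a toy pipeline in Python. This is an illustrative model only: the `FlowEntry` class, the field names such as `ip_proto` and the action strings are hypothetical and do not come from any OpenFlow library. The table-miss entry is modelled as a priority-0 entry with an empty match.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FlowEntry:
    priority: int
    match: dict                                   # field -> required value; {} matches all
    actions: list = field(default_factory=list)   # e.g. ["output:3"]
    goto_table: Optional[int] = None              # next table in the pipeline, if any


def matches(entry, packet):
    """An entry matches when every field it names has the required value."""
    return all(packet.get(key) == value for key, value in entry.match.items())


def run_pipeline(tables, packet):
    """Walk the flow tables starting at table 0 and collect the actions."""
    actions = []
    table_id = 0
    while table_id is not None:
        candidates = [e for e in tables.get(table_id, []) if matches(e, packet)]
        if not candidates:
            return ["drop"]                       # no entry at all, not even a table-miss
        best = max(candidates, key=lambda e: e.priority)
        actions.extend(best.actions)
        table_id = best.goto_table
    return actions


# Example tables (made up for illustration): table 0 filters, table 1 forwards.
TABLES = {
    0: [
        FlowEntry(10, {"ip_proto": "tcp", "tcp_dst": 80}, goto_table=1),
        FlowEntry(0, {}, ["drop"]),               # table-miss entry
    ],
    1: [FlowEntry(5, {}, ["output:3"])],
}
```

With these tables, a TCP packet to port 80 is passed from table 0 to table 1 and sent through port 3, while any other packet hits the table-miss entry of table 0.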
  16. Lab configuration
      [Diagram: hosts H1, H2 and H3, a Web proxy server and a Web server connected through an OpenFlow NAT switch/router (OFS)]
  17. Flow tables processing (processing pipeline)
      ● Flow table 0, Access Control:
        – Allow ARP and IP between 10.x.x.x, GoTo 2
        – Allow 10.x.x.x and 80/TCP or ICMP, GoTo 2
        – Default DROP
      ● Flow table 1, NAT:
        – Allow and 80/TCP and ICMP
        – NAT src to GoTo 2
        – NAT dst to GoTo 2
        – Default GoTo 2
      ● Flow table 2, Routing:
        – Set MAC dst and egress port for the Web if dst, the H1, H2, H3 port and MAC if src
        – Direct ARP between H1, H2, H3, Proxy
  18. NetFPGA
      ● Open source project of hardware and software for rapid prototyping of network devices using FPGAs
      ● The project started in 2007 at Stanford University as NetFPGA-1G (FPGA: Xilinx Virtex-II Pro, 4x1G interfaces)
      ● In 2009/2010 the new project NetFPGA-10G was started (FPGA: Xilinx Virtex-5 TX240T, 4x10G interfaces)
      ● NetFPGA SUME (FPGA: Virtex-7 690T; its high-speed interface subsystem supports 30 serial links at up to 13.1 Gb/s with GTH transceivers or up to 28.5 Gb/s with GTZ transceivers; towards 100 Gb/s: Zilberman et al.)
      ● NetFPGA 10G OpenFlow switch: [block diagram]
  19. Mininet
      Overview
      ● Mininet is a network emulator which creates a network of virtual hosts, switches, controllers, and links. Mininet hosts run standard Linux network software, and its switches support OpenFlow for highly flexible custom routing and Software-Defined Networking.
      Mininet:
      ● Provides a simple and inexpensive network testbed for developing OpenFlow applications
      ● Enables complex topology testing, without the need to wire up a physical network
      ● Includes a CLI that is topology-aware and OpenFlow-aware, for debugging or running network-wide tests
      ● Provides a straightforward and extensible Python API for network creation and experimentation
      ● Runs real code, including standard Unix/Linux network applications as well as the real Linux kernel and network stack (including any kernel extensions you may have available, as long as they are compatible with network namespaces)
      Implementation:
      ● Mininet uses process-based virtualization to run many (we've successfully booted up to 4096) hosts and switches on a single OS kernel.
      ● Since version 2.6.24, Linux has supported network namespaces, which provide individual processes with separate network interfaces, routing tables, and ARP tables.
      ● The Linux container architecture adds chroot() jails, process and user namespaces, and CPU and memory limits to provide full OS-level virtualization, but Mininet does not require these additional features.
      ● Mininet can create kernel or user-space OpenFlow switches, controllers to control the switches, and hosts to communicate over the simulated network.
      ● Mininet connects switches and hosts using virtual ethernet (veth) pairs.
      ● Mininet's code is Python, except for a C utility.
  20. Quagga
      ● Quagga is a 10-year-old fork of GNU Zebra (now abandoned as an open source project).
      ● It is an open source software suite that implements:
        – RIPv1/RIPv2 for IPv4 and RIPng for IPv6
        – OSPFv2 and OSPFv3
        – BGPv4+ (including address family support for multicast and IPv6)
        – IS-IS with support for IPv4 and IPv6
      ● It can be used to manage OpenFlow switches/routers
      ● A competing open source network suite is BIRD
      ● Software stack: Bidirectional Forwarding Detection, OLSR wireless mesh routing, MPLS label distribution protocol
  21. Clos networks
  22. Clos/1
      ● Formalized by Charles Clos in 1952: "A study of non-blocking switching networks", Bell System Technical Journal (Mar 1953)
      ● He found that for 36 or more inputs a strictly non-blocking 3-stage network can be built from small switches with fewer crosspoints (~$N^{3/2}$) than a complete crossbar ($N^2$). For N = 1,000 a xbar needs 1,000,000 xpoints, a Clos net ~200k xpoints
      ● Clos networks have 3 stages: an ingress stage, a middle stage and an egress stage. Each stage is made up of xbar switches
      ● Clos networks can be generalized to any odd number of stages. Replacing the center stage of a 3-stage Clos net with a Clos net gives a 5-stage Clos net, and so on
      ● Today the Clos topology has no alternatives
      ● It is here to stay ..
      ● [Figures: a 4x4 crossbar (xbar) switch, $N^2 = 16$ crosspoints; a 4x4 3-stage Clos network, $6N^{3/2} - 3N$ xpoints]
  23. Clos/2: topology
      ● Multistage switching network: a large number of input and output ports can be connected using small switches with xbar behaviour
      ● Ingress stage: r xbar switches with n input ports and m outputs each (N = r*n); in the figure n*m = 2*3
      ● Middle stage: m xbar switches of size r*r; in the figure r*r = 2*2
      ● Egress stage: r xbar switches with m input ports and n outputs each; in the figure m*n = 3*2
      ● [Figure: 3-stage Clos network, 4x4.] It is strictly non-blocking because m ≥ 2n-1, but it needs 36 xpoints!
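The 36 crosspoints quoted for the 4x4 example follow from counting the crosspoints of each stage. The general expression below is a worked calculation added for illustration, using the slide's symbols n, m and r:

```latex
\[
C(n, m, r) = \underbrace{r \cdot n m}_{\text{ingress}}
           + \underbrace{m \cdot r^{2}}_{\text{middle}}
           + \underbrace{r \cdot m n}_{\text{egress}}
           = 2 r n m + m r^{2}
\]
\[
C(2, 3, 2) = 2 \cdot 2 \cdot 2 \cdot 3 + 3 \cdot 2^{2} = 24 + 12 = 36 > 16 = N^{2}
\]
```

For such a small N the Clos network is therefore more expensive than the plain crossbar; the saving only appears for larger N.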
  24. Clos/3
      ● Rearrangeably non-blocking: when m ≥ n the Clos network is non-blocking like a xbar switch of N = r*n ports; that is, for any permutation of the lines we can arrange the switches to create independent paths (proof: Hall's marriage theorem).
      ● Strict-sense non-blocking: when m ≥ 2n-1 it is always possible to add another connection without rearranging the switches (Clos theorem). In the worst case n-1 inputs of the ingress switch are busy and go to n-1 different middle switches, and the same holds for the egress switch. When these 2 sets are disjoint, 2n-2 middle switches are busy, so just 1 more is needed to allow the new connection without rearrangement.
  25. Clos/4: planning a Clos network
      ● With N = 36 inputs, let us choose xbar ingress/egress switches with n = $N^{1/2}$ = 6 inputs
      ● To comply with the Clos theorem (strictly non-blocking) we need at least 2n - 1 = 11 middle switches, and therefore 11 outputs from each ingress switch
      ● Therefore there will be 6 ingress switches (6x11) and 6 egress switches (11x6)
      ● There will be 11 middle switches (6x6)
      ● This totals 1188 xpoints, less than $N^2$ = 1296
      ● The number of xpoints required by a 3-stage strictly non-blocking Clos net with ingress/egress switches of $N^{1/2}$ inputs is $6N^{3/2} - 3N$ (instead of the $N^2$ of the xbar). With a large N this is a huge difference: for N = 1,000 a xbar needs 1,000,000 xpoints, a 3-stage Clos net < 200k xpoints.
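The planning steps above can be reproduced with a few lines of Python. This is a minimal sketch whose function names are chosen here for illustration; it assumes N is a perfect square so that n = r = √N, as in the slide's example.

```python
import math


def clos_crosspoints(n, m, r):
    """Crosspoints of a 3-stage Clos net: r (n x m) + m (r x r) + r (m x n)."""
    return 2 * r * n * m + m * r * r


def plan_strict_clos(total_inputs):
    """Plan a strictly non-blocking 3-stage Clos net with n = r = sqrt(N)."""
    n = math.isqrt(total_inputs)
    if n * n != total_inputs:
        raise ValueError("this sketch assumes N is a perfect square")
    r = n              # number of ingress (and egress) switches
    m = 2 * n - 1      # Clos theorem: m >= 2n - 1 middle switches
    return n, m, r, clos_crosspoints(n, m, r)


def closed_form(total_inputs):
    """The slide's formula 6 N^(3/2) - 3 N, usable for any N as an estimate."""
    return 6 * total_inputs ** 1.5 - 3 * total_inputs


if __name__ == "__main__":
    n, m, r, xpoints = plan_strict_clos(36)
    print(f"N=36: {r} ingress ({n}x{m}), {m} middle ({r}x{r}), {xpoints} xpoints")
    print(f"N=1000: about {closed_form(1000):.0f} xpoints vs {1000 ** 2} for a xbar")
```

Substituting n = r = √N and m = 2n - 1 into 2rnm + mr² gives 3n²(2n - 1) = 6N^(3/2) - 3N, which is why the explicit count and the closed form agree for N = 36.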
  26. VLB: Valiant Load Balancing
  27. Valiant Load Balancing (VLB)/1
      ● Early work by Valiant (1981) on processor interconnection networks: "A scheme for fast parallel communication", L.G. Valiant, SIAM J. on Computing, 1982
      ● He analyzes a sparsely connected network, taking a hypercube as its representative: $N = 2^n$ vertices, each with n edges to its adjacent vertices (those obtained by flipping a single bit of the address). E.g. with n = 3, N = 8 (an ordinary 3-d cube), 010 is adjacent to 110, 000 and 011. The number of edges is $|E| = nN/2 = n \cdot 2^{n-1}$.
      ● The algorithm has 2 phases, A and B:
        – Phase A: for each message at a node v, choose randomly whether or not to take a step along the first dimension not yet considered, and so on n times. In this way the message can arrive at any node.
        – Phase B: route the message to its final destination, this time deterministically: in the hypercube, take the dimensions in which the current node differs from the destination and flip them one at a time.
      ● The route is clearly bounded by 2n hops. He proved that for every S there exists a C such that the algorithm finishes in fewer than 2*C*n steps with probability $P > 1 - 2^{-Sn}$ (also counting the time messages wait in the queues at the nodes). Steps ~ $\log_2 N$.
      ● Example on the 3-d hypercube network (nodes 000 to 111), start at 001, destination 110:
        – A. 1st toss: flip the 1st-dimension bit, go to 101. 2nd toss: don't flip the 2nd bit, stay at 101. 3rd toss: flip the 3rd bit, go to 100
        – B. From 100 route to 110
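The two phases can be sketched in Python on node addresses stored as integers. This is an illustrative model only: the function name and the optional `tosses` argument (which lets the slide's example be replayed deterministically) are assumptions, and queueing at the nodes is not modelled. Dimension 1 is taken to be the leftmost address bit, as in the slide's example.

```python
import random


def valiant_route(src, dst, n, tosses=None, rng=random):
    """Return the list of nodes visited from src to dst on an n-cube."""
    if tosses is None:
        tosses = [rng.random() < 0.5 for _ in range(n)]
    node = src
    path = [node]
    # Phase A: one coin toss per dimension decides whether to flip that bit.
    for dim, flip in enumerate(tosses):
        if flip:
            node ^= 1 << (n - 1 - dim)
            path.append(node)
    # Phase B: deterministically fix every bit that still differs from dst.
    for dim in range(n):
        mask = 1 << (n - 1 - dim)
        if (node ^ dst) & mask:
            node ^= mask
            path.append(node)
    return path


if __name__ == "__main__":
    # Replay the slide's example: tosses are flip, don't flip, flip.
    path = valiant_route(0b001, 0b110, 3, tosses=[True, False, True])
    print(" -> ".join(format(node, "03b") for node in path))
```

Each phase flips at most n bits, so the returned path never has more than 2n hops, which matches the bound stated on the slide.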
  28. Valiant Load Balancing (VLB)/2
      VLB for Internet backbones:
      ● Zhang-Shen, McKeown, VLB, HotNets III, 2004
      ● Sengupta et al., Traffic oblivious routing, HotNets 2004
      ● A. Greenberg et al., "A scalable and flexible Data Center Network", ACM SIGCOMM 2009
      ● Backbone of N PoPs connected to access networks through links of capacity r
      ● [Figure: backbone nodes labelled 1, 2, 3, 4, N]
  29. Valiant Load Balancing (VLB)/3
      The backbone topology is a full logical mesh in which each link has capacity 2r/N.
      ● A. Traffic entering the backbone is spread with uniform probability across all nodes (in this proposal the spreading is done per flow, not per packet): r/N to each node, including the transmitting node. Therefore the maximum traffic received by a single node is r/N * N = r
      ● B. Each node then forwards what it received to the final destinations. Because each destination can receive at most r, and that traffic arrives split evenly over the N intermediate nodes, each link carries at most r/N in this phase too. A capacity of 2r/N on each link of the full mesh is therefore enough to guarantee 100% throughput.
      It may seem counter-intuitive that this is the most efficient network, but consider that with links of capacity 2r/N it can assure a throughput of r between any 2 nodes of the backbone. In phase A round-robin can be used to spread the flows instead of a random choice.
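The 2r/N bound can be checked numerically with a short Python sketch. It assumes an idealized fluid model in which each node splits its traffic exactly evenly over the N nodes; the function names and the example traffic matrix are made up for illustration.

```python
def vlb_link_loads(traffic):
    """Load on every link a -> b of the full mesh under two-phase VLB.

    traffic[i][j] is the rate entering at node i and destined to node j.
    Phase A puts (row sum of a) / N on link a -> b; phase B puts
    (column sum of b) / N on the same link.
    """
    n = len(traffic)
    sent = [sum(traffic[i]) for i in range(n)]                       # row sums
    received = [sum(traffic[i][j] for i in range(n)) for j in range(n)]
    return [[sent[a] / n + received[b] / n for b in range(n)] for a in range(n)]


def max_link_load(traffic):
    """Largest load over the links between distinct nodes."""
    loads = vlb_link_loads(traffic)
    n = len(traffic)
    return max(loads[a][b] for a in range(n) for b in range(n) if a != b)


if __name__ == "__main__":
    N, r = 4, 1.0
    # Worst case for direct routing: every node sends all of r to one neighbour.
    permutation = [[r if j == (i + 1) % N else 0.0 for j in range(N)] for i in range(N)]
    print(max_link_load(permutation), "<=", 2 * r / N)
```

For the permutation matrix a direct route would need a link of capacity r, whereas with the two-phase spreading no link carries more than 2r/N.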
  30. Traffic oblivious routing
      ● Phase 1:
        – A fraction $\alpha_k$ of the traffic from node i to node j, denoted $T_{ij}$, is routed through an intermediate node k with tunneling: i → k → j
        – This is done independently of the final destination
        – Traffic is split over all possible 2-hop routes
        – It can be done at packet level or flow level: because of the burden packet reordering places on TCP, it is usually done per flow or even per flowlet (using a hash function or the IPv6 flow label)
      ● Phase 2:
        – Each node receives this randomly delivered traffic for different destinations and directs it to the final destination
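Per-flow splitting with a hash function can be sketched as follows. This is an assumption-laden illustration, not the scheme of the cited papers: the flow is identified by a hypothetical 5-tuple, the split is uniform (every $\alpha_k$ equal to 1/N), and SHA-256 is used only because it gives the same result on every run.

```python
import hashlib


def pick_intermediate(flow, n_nodes):
    """Map a flow identifier to an intermediate node k in 0 .. n_nodes-1.

    All packets of one flow hash to the same node, so they follow the same
    2-hop route i -> k -> j and are not reordered.
    """
    key = "|".join(str(part) for part in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes


if __name__ == "__main__":
    # Hypothetical 5-tuple: src IP, dst IP, protocol, src port, dst port.
    flow = ("2001:db8::1", "2001:db8::2", "tcp", 40000, 80)
    print("intermediate node:", pick_intermediate(flow, 16))
```

With IPv6 the flow label could replace the 5-tuple as the hash input, as the slide suggests.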
  31. New Data Center Architectures: VL2, Monsoon
  32. Data Center networks/1
      ● Should allow over 100,000 servers
      ● Conventional architectures depend on tree-like infrastructures built with expensive network hardware. STP is used at Layer 2 to avoid loops, and this disables redundant links.
      ● They are now being replaced by leaf-spine networks. In these networks all links are forwarding, because STP is replaced by other protocols such as SPB, TRILL, FabricPath, ..., IS-IS or OSPF.
      ● The network can give each service the illusion that all the servers connected to it, and only those, are interconnected by a single non-interfering Layer 2 VLAN, from 1 to over 100,000 servers.
      ● [Figures: Tree; Leaf-Spine] Pics from
  33. Data Center (DC) networks/2
      ● Monsoon, Greenberg et al., Microsoft Research
      ● VL2, Greenberg, Sengupta et al., Microsoft Research
      ● SEATTLE: A Scalable Ethernet Architecture for Large Enterprises (SIGCOMM 2008), Changhoon Kim et al., Princeton
      ● PortLand, Mysore et al., UCSD
  34. VL2: Virtual Layer 2
  35. Virtual Layer 2 (VL2), Microsoft/1 [Picture from Ankita Mahajan, IITG]
  36. VL2, Microsoft/2 [Picture from Ankita Mahajan, IITG]
  37. Monsoon
  38. IEEE 802.1ah-2008 Carrier Ethernet PBB
      Provider Backbone Bridges (aka MAC-in-MAC). Initially created by Nortel and then submitted as a standard, to conserve customer VLAN tags while traversing provider networks. The idea is to offer complete separation between customer and provider networks while always using Ethernet frames. The customer's original standard Ethernet frame is encapsulated by the Carrier Ethernet in another frame:
      ● Backbone components:
        – B-DA, backbone destination address (48 bits)
        – B-SA, backbone source address (48 bits)
        – Ethertype = 0x88A8
        – B-TAG/B-VID, backbone VLAN id tag (12 bits)
      ● Service encapsulation:
        – Ethertype = 0x88E7
        – Flags: priority, DEI (Drop Eligible Indicator), NCA (no customer address indicator)
        – I-SID, service identifier (3 bytes)
      Building on this, PBB-TE (traffic engineering) was approved in 2009 as IEEE 802.1Qay-2009, again following Nortel's Provider Backbone Transport (PBT). It proposes itself as a better and cheaper solution than T-MPLS.
      Example: a customer host with SA = MAC X sends to DA = MAC Y. At the edge, the backbone encapsulates the frame with B-SA = MAC A and B-DA = MAC B. At B the backbone decapsulates and delivers. [Pic from Nortel Networks]
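The encapsulation listed on this slide can be sketched with Python's `struct` module. The field order and the two Ethertypes are taken from the slide; the exact bit positions of the flags inside the service tag (priority in the top 3 bits, then DEI, then NCA, then reserved bits) are an assumption made for this sketch, and the function name is hypothetical.

```python
import struct

ETHERTYPE_BTAG = 0x88A8   # backbone VLAN tag (from the slide)
ETHERTYPE_ITAG = 0x88E7   # service encapsulation (from the slide)


def pbb_encapsulate(b_da, b_sa, b_vid, i_sid, customer_frame,
                    priority=0, dei=0, nca=0):
    """Prepend a MAC-in-MAC header to an unmodified customer Ethernet frame."""
    if len(b_da) != 6 or len(b_sa) != 6:
        raise ValueError("backbone addresses are 48 bits (6 bytes)")
    if not 0 <= b_vid < 2 ** 12:
        raise ValueError("B-VID is 12 bits")
    if not 0 <= i_sid < 2 ** 24:
        raise ValueError("I-SID is 3 bytes")
    b_tci = (priority << 13) | (dei << 12) | b_vid       # 16-bit backbone tag
    flags = (priority << 5) | (dei << 4) | (nca << 3)    # assumed bit layout
    i_tci = (flags << 24) | i_sid                        # 8 flag bits + 24-bit I-SID
    header = b_da + b_sa + struct.pack(
        "!HHHI", ETHERTYPE_BTAG, b_tci, ETHERTYPE_ITAG, i_tci)
    return header + customer_frame


if __name__ == "__main__":
    frame = pbb_encapsulate(b"\xbb" * 6, b"\xaa" * 6, b_vid=100,
                            i_sid=0x123456, customer_frame=b"customer frame")
    print(len(frame), frame[:22].hex())
```

The customer frame, including its own destination and source MAC addresses, travels untouched behind the 22-byte backbone header, which is what keeps customer and provider address spaces separate.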
  39. Monsoon/3
      [Figure: three-tier topology with D/2 intermediate switches (D ports each), D aggregation switches (D/2 ports up, D/2 ports down) and ToR (Top of Rack) switches (20 ports down)]

      Switches       Up             Down            #
      Intermediate   -              144 x 10 Gb/s   72
      Aggregation    72 x 10 Gb/s   72 x 10 Gb/s    144
      ToR            2 x 10 Gb/s    20 x 1 Gb/s     5,184

      Possible number of nodes: 5,184 ToR x 20 = 103,680 ($D^2/4 \cdot 20$ nodes)
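The switch counts in the table follow from the port count D alone. The Python sketch below (function name and defaults chosen here for illustration) reproduces the table for D = 144, assuming 2 uplinks and 20 servers per ToR as in the slide.

```python
def monsoon_size(d, servers_per_tor=20, tor_uplinks=2):
    """Return (intermediate, aggregation, ToR, servers) for D-port switches."""
    intermediate = d // 2                 # D/2 intermediate switches, D ports down
    aggregation = d                       # D aggregation switches, D/2 up and D/2 down
    tor_facing_ports = aggregation * (d // 2)
    tor = tor_facing_ports // tor_uplinks # each ToR uses 2 aggregation ports: D^2/4
    return intermediate, aggregation, tor, tor * servers_per_tor


if __name__ == "__main__":
    print(monsoon_size(144))
```

The D aggregation switches offer D * D/2 downward ports and each ToR consumes 2 of them, which gives the $D^2/4$ ToR switches of the slide.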
  40. VXLAN/1
      ● Virtual Extensible LAN (VXLAN): addresses the problems of large cloud providers. It encapsulates Layer 2 frames into Layer 4 UDP packets using UDP port 4789 (IANA assigned)
      ● Multicast or unicast HER (Head End Replication) is used to flood BUM (Broadcast, Unknown unicast, Multicast) traffic
      ● Described in RFC 7348
      ● Open vSwitch supports VXLAN
      ● Docker flannel, among others, uses it
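The 8-byte VXLAN header defined in RFC 7348 is simple enough to build by hand. The Python sketch below (function names are illustrative) packs and unpacks only that header; the outer UDP, IP and Ethernet headers that a real tunnel endpoint adds are left out.

```python
import struct

VXLAN_UDP_PORT = 4789          # IANA-assigned destination port
VXLAN_FLAG_VNI_VALID = 0x08    # the "I" flag: the VNI field is valid


def vxlan_header(vni):
    """8-byte header: flags (8 bits), reserved (24), VNI (24), reserved (8)."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("the VNI is a 24-bit value")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def vxlan_encapsulate(vni, inner_frame):
    """UDP payload of a VXLAN packet: header followed by the Layer 2 frame."""
    return vxlan_header(vni) + inner_frame


def vxlan_decapsulate(udp_payload):
    """Return (vni, inner_frame) from the UDP payload of a VXLAN packet."""
    word0, word1 = struct.unpack("!II", udp_payload[:8])
    if not word0 & (VXLAN_FLAG_VNI_VALID << 24):
        raise ValueError("VNI flag not set")
    return word1 >> 8, udp_payload[8:]


if __name__ == "__main__":
    payload = vxlan_encapsulate(5000, b"inner ethernet frame")
    print(payload[:8].hex(), vxlan_decapsulate(payload))
```

The 24-bit VNI allows about 16 million segments, compared with the 4096 values of a 12-bit VLAN id.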
  41. Project
  42. Final project: SoC (System on Chip)
      [Diagram: a SoC combining an ARM hardware embedded chip (running Linux and Quagga) with an FPGA driving 4 x 100 Gb/s Ethernet interfaces (QSFP+)]