Ethernet Alliance Tech Exploration Forum Keynote

Published on http://www.ethernetalliance.org/events/technology_exploration_forum__life_beyond_ieee_p8023ba/technology_exploration_forum_presentations

  1. 100GbE and Beyond for Datacenter Connectivity
     Bikash Koley, Network Architecture, Google, Sep 2009
  2. Our mission
     Organize the world's information and make it universally accessible and useful.
  3. Datacenter Interconnects
     • Large number of identical compute systems
     • Interconnected by a large number of identical switching gears
     • Can be within a single physical boundary or can span several physical boundaries
     • Interconnect length varies between a few meters and tens of kilometers
     • Current best practice: rack switches with oversubscribed uplinks
     [Chart: bandwidth demand vs. distance between compute elements]
  4. Background of this work
     • This is a theoretical study of datacenter interconnects based on architectures previously published in the high-performance-computing literature
     • We do not take into account the complexity or cost (or even the availability of switching technology) of building the interconnect fabrics that have been simulated
     • This work focuses purely on the trade-offs associated with the choice of interconnect speed and technology for these possible interconnect architectures: the focus is on the interconnects themselves, not on the switching architecture or technology, nor on the difficulty of actually implementing these fabrics in real life ☺
  5. • INTRA-DATACENTER CONNECTIONS: fiber-rich, very large BW demand
     • INTER-DATACENTER CONNECTIONS
  6. Datacenter Interconnect Fabrics
     • High-performance-computing / supercomputing architectures have often used complex multi-stage fabric architectures such as the Clos fabric, fat tree, or torus [1, 2, 3, 4, 5]
     • For this theoretical study, we picked the fat-tree architecture described in [2, 3] and analyzed the impact of the choice of interconnect speed and technology on overall interconnect cost
     • As described in [2, 3], fat-tree fabrics are built with identical N-port switching elements
     • Such a switch fabric architecture delivers a constant bisectional bandwidth (CBB)
     [Diagram: a 2-stage fat tree and a 3-stage fat tree built from N-port nodes, with N/2 spine links per node]
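
As a concrete illustration of the fat-tree construction cited in [2, 3], here is a minimal sketch (Python is not part of the original deck) of one plain 2-stage arrangement: N leaf switches each split their N ports evenly between system-facing links and uplinks, with one uplink to each of N/2 spine switches. This is just one possible realization under those assumptions.

```python
def two_stage_fat_tree(n: int):
    """Build the link list of a 2-stage fat tree from identical n-port switches.

    n must be even: each of the n leaf switches uses n/2 ports for system
    (server-facing) links and n/2 uplinks, one to each of the n/2 spine switches.
    """
    assert n % 2 == 0
    leaves = [f"leaf{i}" for i in range(n)]
    spines = [f"spine{j}" for j in range(n // 2)]
    uplinks = [(leaf, spine) for leaf in leaves for spine in spines]
    system_ports = n * (n // 2)                # N^2/2 system-facing ports in total
    return leaves + spines, uplinks, system_ports

nodes, links, ports = two_stage_fat_tree(8)
print(len(nodes), len(links), ports)           # 12 nodes (3N/2), 32 uplinks, 32 system ports
```
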
  7. How many stages?
     • For a k-stage fabric built with N-port nodes:
       System_Port_Count = N^k / 2^(k-1)
       System_Node_Count = N^(k-1) * (1 + 1/2^(k-1))
     • So a 2-stage fabric has N^2/2 system ports and 3N/2 nodes; a 3-stage fabric has N^3/4 system ports and 5N^2/4 nodes, and so on
     • For a given switching capacity per node (e.g. 1Tbps), N = switching_capacity / port_speed
     • For a given switching capacity per node, the port speed determines how many stages (k) are needed in the fabric in order to provide a pre-determined system port bandwidth (e.g. 10Tbps)
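
A minimal sketch of the closed forms above, with a sanity check against the 2-stage and 3-stage examples quoted on the slide:

```python
def system_port_count(n_ports: int, k: int) -> float:
    """System ports of a k-stage fat tree built from N-port nodes: N^k / 2^(k-1)."""
    return n_ports**k / 2**(k - 1)

def system_node_count(n_ports: int, k: int) -> float:
    """Switching nodes in a k-stage fat tree: N^(k-1) * (1 + 1/2^(k-1))."""
    return n_ports**(k - 1) * (1 + 1 / 2**(k - 1))

# Sanity check against the slide: k=2 -> N^2/2 ports, 3N/2 nodes; k=3 -> N^3/4 ports, 5N^2/4 nodes
N = 8
assert system_port_count(N, 2) == N**2 / 2 and system_node_count(N, 2) == 3 * N / 2
assert system_port_count(N, 3) == N**3 / 4 and system_node_count(N, 3) == 5 * N**2 / 4

# With a 1Tbps switching node, N = switching_capacity / port_speed
for port_speed in (10, 40, 100):               # Gbps
    N = 1000 // port_speed
    print(port_speed, N, system_port_count(N, 3) * port_speed)   # 3-stage system bandwidth, Gbps
```
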
  8. Interconnect at What Port Speed?
     • A switching node has a fixed switching capacity (i.e. CMOS gate count) within the same space and power envelope
     • Per-node switching capacity can be presented at different port speeds: e.g. a 400Gbps node can be 40x10Gbps, 10x40Gbps or 4x100Gbps
     • Lower per-port speed allows building a much larger maximal constant-bisectional-bandwidth fabric
     • There are of course trade-offs with the number of fiber connections needed to build the interconnect
     • Higher port speed may allow better utilization of the fabric capacity
     [Chart: maximal 3-stage fat-tree fabric capacity (Gbps, log scale) vs. per-node switching capacity (Gbps), for port speeds of 10Gbps, 40Gbps and 100Gbps]
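
A rough sketch of the slide-8 trade-off, assuming the closed forms from slide 7 apply (maximal 3-stage system bandwidth is N^3/4 ports times the port speed, with N = per-node capacity / port speed); the deck's plotted values may come from a more detailed model:

```python
def max_3stage_capacity_gbps(node_capacity_gbps: float, port_speed_gbps: float) -> float:
    """Maximal system bandwidth of a 3-stage fat tree: (N^3/4) ports x port speed,
    where N = per-node switching capacity / port speed (fractional N allowed here)."""
    n = node_capacity_gbps / port_speed_gbps
    return (n**3 / 4) * port_speed_gbps

for node_cap in (200, 400, 800):                                   # Gbps per node
    row = [round(max_3stage_capacity_gbps(node_cap, s)) for s in (10, 40, 100)]
    print(node_cap, row)
# e.g. a 400Gbps node: 40x10G -> 160,000 Gbps; 10x40G -> 10,000 Gbps; 4x100G -> 1,600 Gbps,
# so lower port speed buys a dramatically larger maximal CBB fabric.
```
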
  9. Fabric Size vs Port Speed
     Assumed: constant switching BW per node of 1Tbps and constant fabric cross-section BW of 10Tbps
     • Higher per-port bandwidth reduces the number of available ports in a node with constant switching bandwidth
     • In order to support the same cross-sectional BW, more stages are needed in the fabric and more fabric nodes are needed
     [Charts: number of fabric stages needed for a 10T fabric vs. port speed (Gbps), and number of fabric nodes needed for a 10T fabric vs. port speed (Gbps), both at 1Tbps switching BW per node]
  10. Fabric Size vs Port Speed
     Assumed: constant switching BW per node of 1Tbps and constant fabric cross-section BW of 10Tbps
     • Since higher port speed reduces the port count per node, the total number of ports needed for a given CBB follows a "U" shaped curve
     [Chart: number of fabric ports needed for a 10T fabric vs. port speed (Gbps), at 1Tbps switching BW per node]
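
A rough sketch of the slide-9 and slide-10 sizing sweep under the same assumptions (slide-7 closed forms, 1Tbps per node, a 10Tbps system-bandwidth target, integer ports per node); again, the deck's exact curves may be based on a more detailed model:

```python
NODE_CAPACITY = 1000       # Gbps of switching capacity per node (slide assumption)
TARGET_BW = 10_000         # Gbps of system bandwidth, i.e. the "10T fabric"

def fabric_size(port_speed_gbps: int, max_stages: int = 16):
    """Smallest fat tree (stages, nodes, total ports) whose system bandwidth meets TARGET_BW."""
    n = NODE_CAPACITY // port_speed_gbps                       # ports per node
    for k in range(2, max_stages + 1):
        system_bw = (n**k / 2**(k - 1)) * port_speed_gbps      # slide-7 port count x port speed
        if system_bw >= TARGET_BW:
            nodes = n**(k - 1) * (1 + 1 / 2**(k - 1))
            return k, nodes, nodes * n                         # every node contributes n ports
    return None                                                # not reachable within max_stages

for speed in (10, 40, 100, 200, 250, 333):
    print(speed, fabric_size(speed))
# The stage count rises with port speed; compare the total-port column with the
# "U" shaped curve on slide 10.
```
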
  11. Power vs Port Speed
     • Three power consumption curves for interface optical modules:
       – Bleeding Edge: 20x power for 10x speed; e.g. if 10G is 1W/port, 100G is 20W/port
       – Power Parity: power parity on a per-Gbps basis; e.g. if 10G is 1W/port, 100G is 10W/port
       – Mature: 4x power for 10x speed; e.g. if 10G is 1W/port, 100G is 4W/port
     • Lower port speed provides lower power consumption
     • For power consumption parity, power per optical module needs to follow the "mature" curve
     [Charts: per-port power consumption (W) and total interconnect power consumption (W) vs. port speed (Gbps), for the bleeding-edge, power-parity and mature curves]
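
A minimal sketch of the three per-port power curves, anchored at the slide's own example of 1W for a 10G port; total interconnect power is then simply the port count from the slide-9/10 sizing multiplied by the per-port figure:

```python
import math

P10_WATTS = 1.0   # the slide's example anchor: a 10G port at 1W

def per_port_power(port_speed_gbps: float, factor_per_decade: float) -> float:
    """Power per optical port when power grows by `factor_per_decade` for every 10x in speed."""
    decades = math.log10(port_speed_gbps / 10.0)
    return P10_WATTS * factor_per_decade**decades

for speed in (10, 40, 100, 400):
    bleeding = per_port_power(speed, 20.0)   # "bleeding edge": 20x power for 10x speed
    parity   = per_port_power(speed, 10.0)   # "power parity": constant power per Gbps
    mature   = per_port_power(speed, 4.0)    # "mature": 4x power for 10x speed
    print(speed, round(bleeding, 1), round(parity, 1), round(mature, 1))
# At 100G this prints 20W / 10W / 4W per port, exactly the examples quoted on the slide.
```
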
  12. Cost vs Port Speed
     • Three cost curves for optical interface modules:
       – Bleeding Edge: 20x cost for 10x speed
       – Cost Parity: cost parity on a per-Gbps basis
       – Mature: 4x cost for 10x speed
     • Fiber cost is assumed to be constant per port (10% of the 10G port cost)
     • For fabric cost parity, the cost of optical modules needs to increase by <4x for a 10x increase in interface speed
     [Charts: per-port optics cost ($) and total fabric cost ($) vs. port speed (Gbps), for the bleeding-edge, cost-parity and mature curves]
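
A similar sketch for the cost curves, using a hypothetical relative 10G module cost (the deck's dollar baseline is not recoverable from this transcript) and the slide's assumption of a constant per-port fiber cost equal to 10% of the 10G port cost:

```python
import math

C10 = 1.0               # hypothetical relative cost of a 10G optical module (not given in the deck)
FIBER_COST = 0.1 * C10  # slide assumption: constant per-port fiber cost, 10% of the 10G port cost

def per_port_cost(port_speed_gbps: float, factor_per_decade: float) -> float:
    """Optics plus fiber cost for one port when optics cost grows by
    `factor_per_decade` for every 10x increase in port speed."""
    optics = C10 * factor_per_decade ** math.log10(port_speed_gbps / 10.0)
    return optics + FIBER_COST

for speed in (10, 40, 100, 400):
    row = [round(per_port_cost(speed, f), 2) for f in (20.0, 10.0, 4.0)]  # bleeding / parity / mature
    print(speed, row)
# Total fabric cost = per-port cost x the port count from the slide-9/10 sizing; the deck's
# conclusion is that fabric cost parity requires optics cost to grow by <4x per 10x in speed.
```
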
  13. • INTRA-DATACENTER CONNECTIONS
      • INTER-DATACENTER CONNECTIONS: limited fiber availability, 2km+ reach
  14. Beyond 100G: What data rate?
     • 400Gbps? 1Tbps? Something "in-between"? How about all of the above?
     • Current optical PMD specs are designed for absolute worst-case penalties
     • Significant capacity is untapped within the statistical variation of the various penalties
     [Diagram: per-link capacity with the wasted margin per link highlighted]
  15. Where is the Untapped Capacity?
     • Unused link margin ≡ untapped SNR ≡ untapped capacity
     • In an ideal world, 3dB of link margin allows the link capacity to be doubled
     • Need the ability to use the additional capacity (speed up the link) when it is available (temporally or statistically) and to scale back to the baseline capacity (40G/100G?) when it is not
     [Chart: receiver sensitivity (dBm) vs. bit rate (Gbps), showing receiver performance against the optical channel capacity; quantum limit of sensitivity from M. Nakazawa, ECOC 2008 paper Tu.1.E.1]
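
A quick check of the "ideal world" claim that 3dB of margin doubles capacity: for a fixed modulation format and target SNR, the required receive power scales roughly linearly with bit rate (the noise bandwidth grows with the rate), so doubling the rate costs about 3dB. The 10G sensitivity used below is an illustrative number, not one from the deck:

```python
from math import log10

def required_power_dbm(bit_rate_gbps: float, sensitivity_at_10g_dbm: float = -28.0) -> float:
    """Receive power needed at a given bit rate for a fixed modulation format and target SNR:
    noise bandwidth scales with the rate, so required power scales linearly with bit rate.
    The -28dBm 10G sensitivity is illustrative only, not a number from the deck."""
    return sensitivity_at_10g_dbm + 10 * log10(bit_rate_gbps / 10.0)

# Doubling the rate costs ~3dB, so 3dB of unused link margin is ideally enough to double capacity
print(round(required_power_dbm(20.0) - required_power_dbm(10.0), 2))   # 3.01
```
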
  16. Rate Adaptive 100G+ Ethernet?
     • There are existing rate-adaptive standards within the IEEE 802.3 family:
       – IEEE 802.3ah 10PASS-TS: based on the MCM-VDSL standard
       – IEEE 802.3ah 2BASE-TL: based on the SHDSL standard
     • Needed when channels are close to the physics limit: we are getting there with 100Gbps+ Ethernet
     • Shorter links ≡ higher capacity (matches perfectly with the datacenter bandwidth demand distribution, see slide 3)
     • How to get there?
       – High-order modulation
       – Multi-carrier modulation / OFDM
       – Ultra-dense WDM
       – A combination of all of the above
     [Chart: achievable bit rate (Gbps) vs. SNR (dB) for mQAM and OOK]
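
A rough sketch of the kind of SNR-driven rate adaptation the slide's chart illustrates; the symbol rate and the rule mapping SNR to bits per symbol are assumptions made for illustration, not values from the deck:

```python
from math import log2, floor

SYMBOL_RATE_GBAUD = 56.0     # assumed symbol rate, for illustration only

def adaptive_bit_rate_gbps(snr_db: float) -> float:
    """Pick the largest whole number of bits/symbol that fits under the Shannon limit
    at this SNR (an idealized mQAM adaptation rule), then scale by the symbol rate."""
    snr = 10 ** (snr_db / 10.0)
    bits_per_symbol = floor(log2(1 + snr))
    return SYMBOL_RATE_GBAUD * max(bits_per_symbol, 0)

OOK_RATE_GBPS = SYMBOL_RATE_GBAUD * 1.0   # OOK stays at 1 bit/symbol whenever the link closes

for snr_db in range(0, 19, 3):
    print(snr_db, adaptive_bit_rate_gbps(snr_db), OOK_RATE_GBPS)
# Higher SNR (shorter or cleaner links) buys more bits/symbol, hence a higher adaptive rate,
# while the fixed OOK rate leaves that capacity untapped.
```
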
  17. Is There a Business Case?
     • An example link-length distribution between datacenters is shown; it can be supported by a 40km-capable PMD
     • Various rate-adaptive 100GbE+ options are considered:
       – Base rate is 100Gbps
       – Max adaptive bit rate varies from 100G to 500G
     • Aggregate capacity for 1000 such links is computed
     [Charts: example link-distance distribution (probability vs. link distance, km); example adaptive bit-rate implementations (possible bit rate, Gbps, vs. link distance, km, for scheme-1, scheme-2 and non-adaptive); aggregate capacity on 1000 links (Tbps) vs. max adaptive bit rate with a 100Gbps base rate (Gbps)]
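
A sketch of the slide-17 methodology; the link-distance distribution and the rate-vs-distance adaptation rule below are hypothetical stand-ins, since the actual curves cannot be recovered from this transcript:

```python
import random

random.seed(1)
# Hypothetical inter-datacenter link-distance distribution, capped at the 40km PMD reach
distances_km = [min(random.expovariate(1 / 12.0), 40.0) for _ in range(1000)]

def adaptive_rate_gbps(distance_km: float, max_rate_gbps: float, base_rate_gbps: float = 100.0) -> float:
    """Hypothetical adaptation rule: full max rate at 0km, falling linearly to the
    100G base rate at the 40km reach limit (shorter links -> more usable margin)."""
    return max_rate_gbps - (max_rate_gbps - base_rate_gbps) * min(distance_km, 40.0) / 40.0

for max_rate in (100, 200, 300, 400, 500):
    total_tbps = sum(adaptive_rate_gbps(d, max_rate) for d in distances_km) / 1000.0
    print(max_rate, round(total_tbps, 1))
# Aggregate capacity over the 1000 links grows with the maximum adaptive rate, while a fixed
# 100G PMD would deliver exactly 100Tbps no matter how short most of the links are.
```
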
  18. Rate-Adaptive vs Fixed-rate
     • 40km-capable fixed-rate PMDs: an nx100G PMD costs ~n times the 100G PMD cost
     • Adaptive PMDs capable of nx100G cost more than a 100G PMD but less than a fixed-rate nx100G PMD, as the spec requirements are much relaxed
     • The cost of aggregate capacity for a large number of "metro" links could be significantly lower with adaptive-rate Ethernet that has a base rate of 100G but is capable of speeding up for shorter distances / better link quality
     [Charts: number of links needed for 100Tbps capacity and relative link cost vs. max bit rate with a 100Gbps base rate (Gbps), for rate-adaptive and fixed-rate options; relative cost of 100T capacity vs. max adaptive bit rate, for rate-adaptive and fixed-rate options]
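
A sketch of the slide-18 comparison. The fixed-rate cost model follows the slide (an nx100G PMD costs ~n times a 100G PMD); the adaptive PMD cost factor and the average achieved rate per adaptive link are hypothetical placeholders:

```python
TARGET_GBPS = 100_000        # 100Tbps of aggregate "metro" capacity
C100 = 1.0                   # cost of a fixed-rate 100G PMD, in relative units

def fixed_rate_cost(n: int) -> float:
    """Fixed n x 100G PMDs: every link carries n*100G, and each PMD costs ~n times a 100G PMD."""
    links = TARGET_GBPS / (n * 100)
    return links * (n * C100)

def adaptive_cost(n: int, cost_factor: float, avg_rate_gbps: float) -> float:
    """Adaptive PMDs: 100G base rate, n*100G ceiling, cheaper than a fixed n x 100G PMD
    because the specs are relaxed (cost_factor < n is a hypothetical placeholder)."""
    links = TARGET_GBPS / avg_rate_gbps
    return links * (cost_factor * C100)

for n in (1, 2, 3, 4, 5):
    # hypothetical placeholders: adaptive PMD costs (1 + 0.4*(n-1)) x C100 and the link mix
    # averages 80% of the way from the 100G base rate up to the n*100G ceiling
    avg_rate = 100 + 0.8 * (n - 1) * 100
    print(n, round(fixed_rate_cost(n)), round(adaptive_cost(n, 1 + 0.4 * (n - 1), avg_rate)))
# In this simple model the fixed-rate cost of 100T stays flat at 1000 x C100, while the
# adaptive option's cost falls as its rate ceiling rises, echoing the trend on the slide.
```
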
  19. Conclusions
     • Datacenter interconnect economics is heavily driven by commoditization of edge links
     • For intra-datacenter links, fatter interconnect pipes help improve fabric throughput, but they are only viable when both the power dissipation and the cost increase follow the "4x for 10x speed" curve
     • For inter-datacenter links, >100Gbps Ethernet is required for better utilization of the limited fiber asset
     • Rate-adaptive 100G+ Ethernet is highly desirable and will probably be necessary to reach the right link economics
  20. References
     1. C. Clos, "A Study of Non-blocking Switching Networks," Bell System Technical Journal, Vol. 32, 1953, pp. 406-424.
     2. C. E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Transactions on Computers, Vol. 34, October 1985, pp. 892-901.
     3. S. R. Ohring, M. Ibel, S. K. Das, M. J. Kumar, "On Generalized Fat-tree," IEEE IPPS 1995.
     4. C. Gomez, F. Gilabert, M. E. Gomez, P. Lopez, J. Duato, "RUFT: Simplifying the Fat-Tree Topology," 14th IEEE International Conference on Parallel and Distributed Systems (ICPADS '08), 8-10 Dec. 2008, pp. 153-160.
     5. [Beowulf] torus versus (fat) tree topologies: http://www.beowulf.org/archive/2004-November/011114.html
     6. M. Nakazawa, "Toward the Shannon Limit Optical Communication," OSA Annual Meeting, Frontiers in Optics 2008, invited paper FTuH3, October 2008.
  21. Q&A
