Ethernet and TCP optimizations

With a trivial bit of tuning, you can extract fairly amazing small message latencies out of TCP.

This ain't your father's Ethernet (or TCP).

Published in: Technology

  1. 1. Ethernet: Hidden Secrets. Jeff Squyres. © 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
  2. 2. First: some background information…
  3. 3. Using lots and lots and lots of servers simultaneously to solve one computational problem
  4. 4. Racks of 36 1U servers. They tend to send lots and lots and lots of small messages across the network to stay in sync with each other
  5. 5. A sends a message → underlying network → B receives the message
  6. 6. Today’s fastest networks: 1-3μs (!) A sends a message → underlying network → B receives the message
  7. 7. • Typically not Ethernet networks • Usually have supercomputer-specific networks. Example: highly tuned for short message latency • …but that is changing. Ethernet / Ethernot
  8. 8. • Userspace NIC (“USNIC”): expose Cisco NIC hardware directly to Linux userspace; bypass the OS; bypass the TCP stack • Send raw Ethernet frames directly from user applications: much, much faster than traditional TCP-based networking, especially for latency of short messages
  9. 9. [Diagram: software stack, top to bottom] Userspace: application, MPI library, userspace sockets library. Kernel: TCP / IP stack, Cisco VIC driver. Hardware: Cisco VIC hardware
  10. 10. [Diagram: software stack, top to bottom] Userspace: application, MPI library, userspace verbs library. Kernel: Verbs IB core, Cisco USNIC driver. Hardware: Cisco VIC hardware. Two paths are labeled: “Bootstrapping and setup” and “Send and receive fast path”
  11. 11. With all that background…
  12. 12. Two servers
  13. 13. Two servers, each with a 2 x 10Gb NIC
  14. 14. Two servers, each with a 2 x 10Gb NIC, connected back-to-back
  15. 15. Send a message from here; receive the message here. Ping!
  16. 16. Send the message back; get the message back. Pong!
  17. 17. Because each ping and pong is soooo short, do this ping-pong exchange N times. Ping! / Pong!
  18. 18. (Total time for N ping-pongs) / N = time for one ping-pong
  19. 19. (Total time for N ping-pongs) / N = time for one ping-pong; (time for one ping-pong) / 2 = time for one ping
  20. 20. Time for one ping = half-round trip (HRT) ping-pong latency
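The HRT arithmetic above can be sketched in a few lines of Python. This is only an illustration over loopback TCP, not NetPIPE and not the benchmark used in these slides; the function names, the default message size, and the iteration count are all assumptions made for the example.

```python
import socket
import threading
import time


def hrt_latency(total_seconds, n):
    """Total time for n ping-pongs -> time for one ping (half-round trip)."""
    return total_seconds / n / 2


def _recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket (TCP may split reads)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def _pong_server(listener, msg_size, iterations):
    """Side B: receive each ping and send it straight back."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(iterations):
            conn.sendall(_recv_exact(conn, msg_size))


def measure_hrt(msg_size=1, iterations=1000):
    """Side A: time `iterations` ping-pongs over loopback, return average HRT in seconds."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    server = threading.Thread(
        target=_pong_server, args=(listener, msg_size, iterations)
    )
    server.start()

    payload = b"x" * msg_size
    with socket.create_connection(listener.getsockname()) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(iterations):
            sock.sendall(payload)        # ping
            _recv_exact(sock, msg_size)  # pong
        total = time.perf_counter() - start

    server.join()
    listener.close()
    return hrt_latency(total, iterations)
```

Calling `measure_hrt(msg_size=1)` returns an average in seconds. Loopback never touches a NIC, so the number says nothing about coalescing; the point is only the divide-by-N, divide-by-2 bookkeeping.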
  21. 21. TCP NetPIPE latency times: 1 10G Ethernet port. [Plot: time (seconds, log scale, 1e-05 to 0.1) vs. buffer size (log scale, 1 to 1e+07). Series: 1 10Gb Ethernet port. Annotations: 1 byte ~60μs; 8MB ~150ms]
  22. 22. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds, log scale, 1e-05 to 0.1) vs. buffer size (log scale, 1 to 1e+07). Series: 1 10Gb Ethernet port; 2 10Gb Ethernet ports. Annotations: 1 port: 1 byte ~60μs, 8MB ~150ms; 2 ports: 1 byte ~30μs (!), 8MB ~8.3ms]
  23. 23. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds, log scale, 1e-05 to 0.1) vs. buffer size (log scale, 1 to 1e+07). Series: 1 10Gb Ethernet port; 2 10Gb Ethernet ports. Annotations: 1 port: 1 byte ~60μs, 8MB ~150ms; 2 ports: 1 byte ~30μs (!), 8MB ~8.3ms]
  24. 24. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds, log scale, 1e-05 to 0.001) vs. buffer size (log scale, 1 to 1000). Series: 1 10Gb Ethernet port (~60μs); 2 10Gb Ethernet ports (~30μs)] The facts: from 1-1024 bytes, flat latency. Using 1 interface: ~60μs. Using 2 interfaces: ~30μs
  25. 25. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds, log scale, 1e-05 to 0.001) vs. buffer size (log scale, 1 to 1000). Series: 1 10Gb Ethernet port (~60μs); 2 10Gb Ethernet ports (~30μs)] The facts: from 1-1024 bytes, flat latency. Using 1 interface: ~60μs. Using 2 interfaces: ~30μs
  26. 26. 1. Ethernet frame arrives
  27. 27. 1. Ethernet frame arrives 2. NIC sends interrupt to OS Ethernet driver
  28. 28. 1. Ethernet frame arrives 2. NIC sends interrupt to OS Ethernet driver 3. OS Ethernet driver copies the packet to RAM
  29. 29. 1. Ethernet frame arrives 2. NIC sends interrupt to OS Ethernet driver 3. OS Ethernet driver copies the packet to RAM 4. OS TCP stack hands packet off to (whatever)
  30. 30. It’s always better in bulk
  31. 31. Let’s optimize this part
  32. 32. 1. Copy a bunch of packets across PCI at one time
  33. 33. 1. Copy a bunch of packets across PCI at one time 2. Only raise one interrupt for all of those packet copies
  34. 34. A.k.a. “Interrupt Coalescing”: 1. Copy a bunch of packets across PCI at one time 2. Only raise one interrupt for all of those packet copies
  35. 35. 1. Ethernet frame arrives
  36. 36. 1. Ethernet frame arrives 2. Has time N passed since we last sent an interrupt to the OS?
  37. 37. 1. Ethernet frame arrives 2. Has time N passed since we last sent an interrupt to the OS? ✖ No: queue up the frame
  38. 38. 1. Ethernet frame arrives 2. Has time N passed since we last sent an interrupt to the OS? ✖ No: queue up the frame ✔ Yes: send all queued frames and interrupt
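The decision on this slide can be written as a toy model. This is a sketch of the rule exactly as stated (the check runs when a frame arrives), not how any particular NIC implements it; the function name `coalesce` and the sample arrival times are made up for illustration.

```python
def coalesce(arrival_times, timer):
    """Apply the slide's rule to a sorted list of frame arrival times.

    Returns (batches, still_queued):
      batches      - list of (interrupt_time, [arrival times delivered by that interrupt])
      still_queued - frames left waiting because no later arrival triggered a flush
    """
    batches = []
    queued = []
    last_interrupt = float("-inf")  # no interrupt sent yet
    for t in arrival_times:
        queued.append(t)
        if t - last_interrupt >= timer:
            # Yes: enough time has passed; send all queued frames and interrupt.
            batches.append((t, queued))
            queued = []
            last_interrupt = t
        # No: the frame just stays in the queue.
    return batches, queued


# Six frames, 125 (microsecond) timer: only three interrupts are raised.
batches, leftover = coalesce([0, 10, 50, 130, 140, 300], timer=125)
```

Here the frames arriving at 10 and 50 wait until the arrival at 130 to be delivered. That waiting is the cost of coalescing. A purely arrival-driven rule can also strand the last frames in the queue, which is why the timelines later in the deck talk about a timer expiring instead.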
  39. 39. Ok… So what?
  41. 41. Periodic interrupt coalescing timeout: 125μs. [Timeline: NIC A, NIC B] 1. A sends ping frame 2. B receives ping frame
  42. 42. [Timeline: NIC A, NIC B] 3. Coalesce timer expires; B sends interrupt 4. B sends pong frame
  43. 43. [Timeline: NIC A, NIC B] 5. Coalesce timer expires; A sends interrupt 6. A sends ping frame 7. Rinse, repeat
  44. 44. [Timeline: NIC A, NIC B] 4 ping-pongs in ~8x timer duration
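The timeline above can be replayed with a small simulation. It assumes a deliberately simple model that is not stated in the slides: each NIC raises interrupts only on a fixed periodic tick, the host replies instantly once interrupted, and the wire takes 1μs. The function names and the `wire` value are hypothetical.

```python
import math


def next_tick(t, period, offset):
    """Earliest timer expiry at or after time t, for ticks at offset + k * period."""
    k = math.ceil((t - offset) / period)
    return offset + k * period


def simulate_ping_pong(n, period, offset_a=0.0, offset_b=0.0, wire=1.0):
    """Return the time (in μs) at which A has been interrupted for the n-th pong."""
    t = 0.0
    for _ in range(n):
        arrive_b = t + wire  # ping frame reaches NIC B
        t = arrive_b if period <= 0 else next_tick(arrive_b, period, offset_b)
        arrive_a = t + wire  # pong frame reaches NIC A
        t = arrive_a if period <= 0 else next_tick(arrive_a, period, offset_a)
    return t


aligned = simulate_ping_pong(4, period=125.0)                 # timers line up
skewed = simulate_ping_pong(4, period=125.0, offset_b=62.5)   # B's timer half a period off
```

With aligned timers, four ping-pongs take 1000μs, i.e. 8x the 125μs timer, matching the slide. Skewing B's timer by half a period (an arbitrary choice for this sketch) halves that to 500μs, an HRT of 62.5μs. That is one possible reading of why a 125μs timer can show up as roughly 60μs in a measurement, not a claim about the actual hardware.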
  45. 45. [Timeline: NIC A, NIC B] In general, coalescing interrupts is a very Very Good Thing
  46. 46. [Timeline: NIC A, NIC B] But it definitely hurts low-latency traffic
  47. 47. How do we reduce those artificial delays?
  48. 48. [Timelines: NIC A and NIC B on port 0; NIC A and NIC B on port 1]
  49. 49. [Timelines: NIC A and NIC B on port 0; NIC A and NIC B on port 1]
  50. 50. [Timelines: NIC A and NIC B on port 0; NIC A and NIC B on port 1] In reality, sender and receiver timers on each port are wholly unrelated; they don’t line up as nicely as in these examples. Meaning: in general, you usually get better overlap
  51. 51. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds, log scale, 1e-05 to 0.001) vs. buffer size (log scale, 1 to 1000). Series: 1 10Gb Ethernet port (~60μs); 2 10Gb Ethernet ports (~30μs)] In this case, we got such good asymmetry that the 2-port case is ~2x as fast (i.e., roughly twice as many interrupts in the same amount of time)
  52. 52. Remember: these are AVERAGE latencies! Individual ping-pong times are the same as in the 1-port case (from the network) …but you get higher throughput because we’re reducing the gaps between ping-pongs
  53. 53. Now let’s try something else…
  55. 55. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds, log scale, 1e-05 to 0.1) vs. buffer size (log scale, 1 to 1e+07). Series: 1 10Gb Ethernet port; 2 10Gb Ethernet ports; 1 10Gb Ethernet port, timer=0; 2 10Gb Ethernet ports, timer=0. Annotations: 1 port ~10.5μs and 2 ports ~10.6μs (small messages); 1 port ~7.2ms and 2 ports ~5.5ms (large messages)]
  56. 56. Pros: • (Much) faster TCP latency …without changing app! • Faster speeds seem to scale up to large messages, too • Great for low-latency, sparse comms apps • Best for NICs that are dedicated to MPI comms. Cons: • May not scale well for the case of an MPI process running on every core • Lots and lots of interrupts going to socket:0.core:0 • May need to run (N-1) MPI processes…? May also want to avoid socket:0.core:0, or move IRQ affinity
  57. 57. Some experimentation might be worth trying with real-world HPC apps: • Allow TCP to wholly utilize core 0 (i.e., run MPI processes only on cores 1-15) • Set the coalesce timer to something more than 0μs but less than 125μs; there’s a whole spectrum to play with
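For anyone who wants to try the two experiments above on Linux, here is a hedged sketch. The slides do not say how the timer was changed; this assumes the NIC driver exposes its receive coalescing timer through `ethtool -C <interface> rx-usecs <value>`, which depends on the driver. The interface name `eth0`, the helper names, and the 16-core count (taken from the "cores 1-15" example) are assumptions.

```python
import os


def coalesce_command(interface, rx_usecs):
    """Build (but do not run) an ethtool command that sets the RX coalescing timer.

    Assumption: the driver honors `rx-usecs`; check `ethtool -c <interface>` first.
    """
    if rx_usecs < 0:
        raise ValueError("rx_usecs must be >= 0")
    return ["ethtool", "-C", interface, "rx-usecs", str(rx_usecs)]


def mpi_cores(total_cores, reserved=(0,)):
    """Cores left for MPI processes once the reserved cores are set aside for TCP/IRQs."""
    return {core for core in range(total_cores) if core not in reserved}


def pin_current_process(cores):
    """Linux only: restrict the calling process to the given set of cores."""
    os.sched_setaffinity(0, cores)


# Somewhere in the spectrum between 0μs and 125μs; 25 is an arbitrary pick.
cmd = coalesce_command("eth0", 25)
cores = mpi_cores(16)  # cores 1-15, leaving core 0 alone
```

The command is only built here, not executed, since running it needs root and real hardware; something like `subprocess.run(cmd, check=True)` would apply it. Most MPI launchers have their own binding options, so `pin_current_process` is just the bare operating-system mechanism.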
  58. 58. • Many in HPC have Ethernot networks …but as HPC continues to commoditize itself, lots of HPC users have Ethernet-based environments • Today’s Ethernet switches and NICs are actually quite a bit faster and more advanced than what we old-time HPCers grew up with • Even good ol’ TCP is amazingly fast and optimized today • You may be able to tune your NIC and/or fabric to extract pretty darn good MPI TCP performance. The default settings on your Ethernet NIC / fabric are likely set for general TCP traffic, which produces very different performance characteristics than what HPC applications typically need
  59. 59. Thank you.
