Ethernet and TCP optimizations

With a trivial bit of tuning, you can extract fairly amazing small message latencies out of TCP.

This ain't your father's Ethernet (or TCP).

Transcript

  • 1. Ethernet: Hidden Secrets. Jeff Squyres. © 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
  • 2. First: some background information…
  • 3. Using lots and lots and lots of servers simultaneously to solve one computational problem
  • 4. Racks of 36 1U servers. Tend to send lots and lots and lots of small messages across the network to stay in sync with each other
  • 5. [Diagram: A sends a message; B receives the message; the underlying network connects A and B]
  • 6. Today’s fastest networks: 1-3μs (!) [Diagram: A sends a message; B receives the message; the underlying network connects A and B]
  • 7. [Diagram labels: Ethernet, Ethernot]
      • Typically not Ethernet networks
      • Usually have supercomputer-specific networks. Example: highly tuned for short message latency
      • …but that is changing
  • 8.
      • Userspace NIC (“USNIC”): expose Cisco NIC hardware directly to Linux userspace; bypass the OS; bypass the TCP stack
      • Send raw Ethernet frames directly from user applications: much, much faster than traditional TCP-based networking, especially for latency of short messages
  • 9. [Stack diagram, top to bottom] Userspace: application, MPI library, userspace sockets library. Kernel: TCP / IP stack, Cisco VIC driver. Below the kernel: Cisco VIC hardware
  • 10. [Stack diagram, top to bottom] Userspace: application, MPI library, userspace verbs library. Kernel: Verbs IB core, Cisco USNIC driver. Below the kernel: Cisco VIC hardware. Path labels: “Bootstrapping and setup”; “Send and receive fast path”
  • 11. With all that background…
  • 12. Two servers
  • 13. Two servers, each with a 2 x 10Gb NIC
  • 14. Two servers, each with a 2 x 10Gb NIC, connected back-to-back
  • 15. Send a message from here; receive the message here. Ping!
  • 16. Send the message back; get the message back. Pong!
  • 17. Because each ping and pong are soooo short, do this ping-pong exchange N times. Ping! / Pong!
  • 18. Time for one ping-pong = (Total time for N ping-pongs) / N
  • 19. Time for one ping-pong = (Total time for N ping-pongs) / N. Time for one ping = (Time for one ping-pong) / 2
  • 20. Time for one ping = Half-round trip (HRT) ping-pong latency
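
The ping-pong arithmetic above can be sketched in Python. This is only an illustration and not NetPIPE itself: `half_round_trip` and `measure_hrt` are hypothetical names, and the measurement runs over a local socket pair on a single machine rather than TCP over Ethernet, so the absolute numbers will not resemble the plots in this deck.

```python
import socket
import threading
import time


def half_round_trip(total_seconds, n):
    """Time for one ping = (total time for N ping-pongs) / N / 2."""
    if n <= 0:
        raise ValueError("n must be positive")
    return total_seconds / n / 2


def _recv_exact(sock, count):
    """Read exactly `count` bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return bytes(buf)


def measure_hrt(n=1000, size=1):
    """Run n ping-pongs of `size` bytes over a LOCAL socket pair (assumption:
    a stand-in for two servers) and return the average half-round trip in seconds."""
    a, b = socket.socketpair()
    payload = b"x" * size

    def ponger():
        # Side B: receive each ping and send it straight back as the pong.
        for _ in range(n):
            b.sendall(_recv_exact(b, size))

    worker = threading.Thread(target=ponger)
    worker.start()
    start = time.perf_counter()
    for _ in range(n):
        a.sendall(payload)      # ping
        _recv_exact(a, size)    # pong
    total = time.perf_counter() - start
    worker.join()
    a.close()
    b.close()
    return half_round_trip(total, n)
```

Repeating the exchange N times and dividing, as the slides describe, averages out timer resolution and per-call noise that would dominate a single measurement.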
  • 21. TCP NetPIPE latency times: 1 10G Ethernet port. [Plot: time (seconds) vs. buffer size, log-log axes; series: 1 10Gb Ethernet port; annotations: 1 byte ~60μs, 8MB ~150ms]
  • 22. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds) vs. buffer size, log-log axes; series: 1 10Gb Ethernet port, 2 10Gb Ethernet ports; annotations: 1 port: 1 byte ~60μs, 8MB ~150ms; 2 ports: 1 byte ~30μs (!), 8MB ~8.3ms]
  • 24. TCP NetPIPE latency times: 2 10G Ethernet ports. The facts: from 1-1024 bytes, flat latency. Using 1 interface: ~60μs. Using 2 interfaces: ~30μs. [Plot: time (seconds) vs. buffer size, zoomed to buffer sizes 1-1000; series: 1 10Gb Ethernet port, 2 10Gb Ethernet ports]
  • 26. 1. Ethernet frame arrives
  • 27. 1. Ethernet frame arrives. 2. NIC sends interrupt to OS Ethernet driver
  • 28. 1. Ethernet frame arrives. 2. NIC sends interrupt to OS Ethernet driver. 3. OS Ethernet driver copies the packet to RAM
  • 29. 1. Ethernet frame arrives. 2. NIC sends interrupt to OS Ethernet driver. 3. OS Ethernet driver copies the packet to RAM. 4. OS TCP stack hands packet off to (whatever)
  • 30. It’s always better in bulk
  • 31. Let’s optimize this part
  • 32. 1. Copy a bunch of packets across PCI at one time
  • 33. 1. Copy a bunch of packets across PCI at one time. 2. Only raise one interrupt for all of those packet copies
  • 34. A.k.a. “Interrupt Coalescing”: 1. Copy a bunch of packets across PCI at one time. 2. Only raise one interrupt for all of those packet copies
  • 35. 1. Ethernet frame arrives
  • 36. 1. Ethernet frame arrives. 2. Has N time passed since we sent an interrupt to the OS?
  • 37. 1. Ethernet frame arrives. 2. Has N time passed since we sent an interrupt to the OS? ✖ No: queue up the frame
  • 38. 1. Ethernet frame arrives. 2. Has N time passed since we sent an interrupt to the OS? ✖ No: queue up the frame. ✔ Yes: send all queued frames and interrupt
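
The decision in the last few slides can be sketched as a small Python model. It follows the slide wording literally (the check happens when a frame arrives); `CoalescingNic` and its fields are hypothetical names for illustration, not a real driver or firmware interface, and real NICs typically also flush on a timer.

```python
class CoalescingNic:
    """Toy model of receive-side interrupt coalescing (illustrative only)."""

    def __init__(self, timeout_us):
        self.timeout_us = timeout_us      # the "N time" from the slides
        self.last_interrupt_us = 0.0      # when we last interrupted the OS
        self.queue = []                   # frames held back on the NIC
        self.interrupts = []              # (time_us, frames delivered) per interrupt

    def frame_arrives(self, now_us, frame):
        # 1. Ethernet frame arrives.
        self.queue.append(frame)
        # 2. Has N time passed since we sent an interrupt to the OS?
        if now_us - self.last_interrupt_us >= self.timeout_us:
            # Yes: send all queued frames and raise a single interrupt.
            self.interrupts.append((now_us, self.queue))
            self.queue = []
            self.last_interrupt_us = now_us
        # No: the frame simply stays queued.


nic = CoalescingNic(timeout_us=125)
for t in (10, 50, 100, 130, 140, 260):
    nic.frame_arrives(t, f"frame@{t}")

# Six frames were delivered with only two interrupts.
print(len(nic.interrupts), [len(frames) for _, frames in nic.interrupts])
```

The trade-off is visible in the model: fewer interrupts per frame, but the frame that arrived at 10μs waits until 130μs before the OS hears about it.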
  • 39. Ok… So what?
  • 41. Periodic interrupt coalescing timeout: 125μs. 1. A sends ping frame. 2. B receives ping frame. [Timeline diagram: NIC A, NIC B]
  • 42. 3. Coalesce timer expires; B sends interrupt. 4. B sends pong frame. [Timeline diagram: NIC A, NIC B]
  • 43. 5. Coalesce timer expires; A sends interrupt. 6. A sends ping frame. 7. Rinse, repeat. [Timeline diagram: NIC A, NIC B]
  • 44. 4 ping-pongs in ~8x timer duration. [Timeline diagram: NIC A, NIC B]
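
The timeline in these slides can be reproduced with a toy simulation. The assumptions are mine, not the deck's: each NIC has a strictly periodic timer (period 125μs, with a configurable phase), a received frame is only handed to the OS at the next timer tick, the wire time is a made-up 1μs, and all software overhead is ignored. The function names are hypothetical.

```python
import math


def next_tick(t, period, phase):
    """First tick strictly after time t of a timer that fires at phase + k * period."""
    k = math.floor((t - phase) / period) + 1
    return phase + k * period


def simulate_ping_pongs(n, period_us=125.0, phase_a=0.0, phase_b=0.0, wire_us=1.0):
    """Return the time (μs) at which A has been interrupted for the n-th pong."""
    t = 0.0  # A sends the first ping at time 0
    for _ in range(n):
        # Ping reaches B, waits for B's coalesce timer, then B sends the pong.
        t = next_tick(t + wire_us, period_us, phase_b)
        # Pong reaches A, waits for A's coalesce timer, then A sends the next ping.
        t = next_tick(t + wire_us, period_us, phase_a)
    return t


aligned = simulate_ping_pongs(4)                 # timers line up, as drawn
shifted = simulate_ping_pongs(4, phase_b=60.0)   # B's timer offset by 60μs
print(aligned, shifted)
```

With aligned timers the model gives 1000.0μs for 4 ping-pongs, i.e. 8 timer periods, matching the slide. With B's timer shifted by 60μs it gives 500.0μs, an average half-round trip of 62.5μs, which is at least in the neighborhood of the ~60μs measured with one port.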
  • 45. In general, coalescing interrupts is a very Very Good Thing. [Timeline diagram: NIC A, NIC B]
  • 46. But it definitely hurts low-latency traffic. [Timeline diagram: NIC A, NIC B]
  • 47. How do we reduce those artificial delays?
  • 48. [Timeline diagram: NIC A and NIC B on Port 0; NIC A and NIC B on Port 1]
  • 50. In reality, sender and receiver timers on each port are wholly unrelated; they don’t line up nicely like I used in these examples. Meaning: in general, you actually usually get better overlap. [Timeline diagram: NIC A and NIC B on Port 0; NIC A and NIC B on Port 1]
  • 51. TCP NetPIPE latency times: 2 10G Ethernet ports. In this case, we got such good asymmetry that the 2 port case is ~2x as fast (i.e., roughly twice as many interrupts in the same amount of time). [Plot: time (seconds) vs. buffer size, zoomed to buffer sizes 1-1000; series: 1 10Gb Ethernet port, 2 10Gb Ethernet ports; annotations: 1 port ~60μs, 2 ports ~30μs]
  • 52. Remember: these are AVERAGE latencies! Individual ping-pong times are the same as the 1 port case (from the network) …but you get higher throughput because we’re reducing the gaps between each ping-pong
  • 53. Now let’s try something else…
  • 55. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds) vs. buffer size, log-log axes; series: 1 10Gb Ethernet port; 2 10Gb Ethernet ports; 1 10Gb Ethernet port, timer=0; 2 10Gb Ethernet ports, timer=0; annotations at the short-message end: 1 port ~10.5μs, 2 ports ~10.6μs; annotations at the large-message end: 1 port ~7.2ms, 2 ports ~5.5ms]
  • 56. Pros:
      • (Much) faster TCP latency …without changing app!
      • Faster speeds seem to scale up to large messages, too
      • Great for low-latency, sparse comms apps
      • Best for NICs that are dedicated to MPI comms
    Cons:
      • May not scale well for case of MPI process running on every core
      • Lots and lots of interrupts going to socket:0.core:0
      • May need to run (N-1) MPI processes…? May also want to avoid socket:0.core:0, or move IRQ affinity
  • 57. Some experimentation might be worth trying with real world HPC apps:
      • Allow TCP to wholly utilize core 0 (i.e., run MPI processes only on cores 1-15)
      • Set the coalesce timer to something more than 0μs, but less than 125μs; there’s a whole spectrum with which to play
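
A rough sketch of those two experiments in Python. The deck does not say how the coalesce timer was changed on the Cisco VIC, so this assumes a Linux host whose NIC driver accepts the standard `ethtool -C <interface> rx-usecs <N>` coalescing option; the interface name `eth0`, the 16-core count, and all function names are hypothetical. The helpers only build the command and the core set; nothing is executed against the system unless you call `pin_current_process` yourself.

```python
import os


def coalesce_command(interface, rx_usecs):
    """Build (but do not run) an ethtool command that sets the receive
    coalescing delay in microseconds. Assumes the driver supports rx-usecs."""
    if rx_usecs < 0:
        raise ValueError("rx_usecs must be >= 0")
    return ["ethtool", "-C", interface, "rx-usecs", str(rx_usecs)]


def mpi_cores(total_cores=16, reserved=(0,)):
    """Cores left for MPI processes after reserving some (core 0 by default)
    for TCP / interrupt handling."""
    return set(range(total_cores)) - set(reserved)


def pin_current_process(cores):
    """Restrict the calling process to the given cores (Linux only)."""
    os.sched_setaffinity(0, cores)


# Points along the spectrum between timer=0 and the 125μs default.
for usecs in (0, 25, 50, 125):
    print(" ".join(coalesce_command("eth0", usecs)))

print(sorted(mpi_cores()))  # cores 1-15 on a hypothetical 16-core node
```

In practice an MPI launcher's own binding options would usually handle the core placement; the affinity helper just makes the "leave core 0 alone" idea concrete.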
  • 58.
      • Many in HPC have Ethernot networks …but as HPC continues to commoditize itself, lots of HPC users have Ethernet-based environments
      • Today’s Ethernet switches and NICs are actually quite a bit faster and more advanced than what we old-time-HPCers grew up with
      • Even good ol’ TCP is amazingly fast and optimized today
      • You may be able to tune your NIC and/or fabric to extract pretty darn good MPI TCP performance. The default settings on your Ethernet NIC / fabric are likely set for general TCP traffic, and they produce very different performance characteristics than what HPC applications typically need
  • 59. Thank you.