Ethernet: Hidden Secrets
Jeff Squyres




© 2012 Cisco and/or its affiliates. All rights reserved.   Cisco Confidential   1
First: some background information…

Using lots and lots and lots of servers simultaneously to solve one computational problem

Racks of 36 1U servers
Tend to send lots and lots and lots of small messages across the network to stay in sync with each other

Send a message (A) → underlying network → (B) Receive the message

Today’s fastest networks: 1-3μs (!)

• Typically not Ethernet networks

• Usually have supercomputer-specific networks
            Example: highly tuned for short message latency

• …but that is changing




                                              Ethernet        Ethernot




• Userspace NIC (“USNIC”)
            Expose Cisco NIC hardware directly to Linux userspace
            Bypass the OS
            Bypass the TCP stack

• Send raw Ethernet frames directly from user applications
            Much, much faster than traditional TCP-based networking
            Especially for latency of short messages




Userspace
            Application
            MPI library
            Userspace sockets library
Kernel
            TCP / IP stack
            Cisco VIC driver
Cisco VIC hardware

Userspace
            Application
            MPI library
            Userspace verbs library
Kernel
            Verbs IB core
            Cisco USNIC driver
Cisco VIC hardware
Two paths from the userspace verbs library to the hardware:
            Bootstrapping and setup: through the kernel
            Send and receive fast path: directly to the hardware, bypassing the kernel

With all that background…
Two servers




                                                           Each with a 2 x 10Gb NIC
                                                           Connected back-to-back




Ping! Send a message from here → Receive the message here
Pong! Get the message back ← Send the message back

Because each ping and pong is soooo short, do this ping-pong exchange N times (Ping! / Pong!)

Time for one ping-pong = (Total time for N ping-pongs) / N

Time for one ping = (Time for one ping-pong) / 2

Time for one ping = Half-round trip (HRT) ping-pong latency

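As a rough illustration of the arithmetic above, here is a minimal Python sketch of a ping-pong timer. It is not NetPIPE (the tool behind the plots in this talk): the names `half_round_trip` and `measure_hrt` are made up for this example, and it only exercises loopback TCP, so its numbers say nothing about a real NIC.

```python
import socket
import threading
import time


def half_round_trip(total_seconds: float, n: int) -> float:
    """HRT latency: total time for n ping-pongs, divided by n, divided by 2."""
    if n <= 0:
        raise ValueError("n must be positive")
    return total_seconds / n / 2


def _recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes (TCP may deliver a message in pieces)."""
    chunks = []
    remaining = size
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def _pong_server(listener: socket.socket, size: int, n: int) -> None:
    """Receive each ping and send it straight back as the pong."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(n):
            conn.sendall(_recv_exact(conn, size))


def measure_hrt(size: int = 1, n: int = 1000) -> float:
    """Time n ping-pongs of `size` bytes over loopback; return the average HRT."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
        listener.listen(1)
        server = threading.Thread(target=_pong_server, args=(listener, size, n))
        server.start()
        with socket.create_connection(listener.getsockname()) as sock:
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            payload = b"x" * size
            start = time.perf_counter()
            for _ in range(n):
                sock.sendall(payload)        # ping
                _recv_exact(sock, size)      # pong
            elapsed = time.perf_counter() - start
        server.join()
    return half_round_trip(elapsed, n)
```

Calling `measure_hrt(size=1, n=1000)` returns an average in seconds. Repeating the exchange many times matters because a single 1-byte ping-pong is too short to time reliably on its own.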
TCP NetPIPE latency times: 1 10G Ethernet port
[Plot: Time (seconds) vs. Buffer size, log-log axes; series: 1 10Gb Ethernet port]
            1 byte: ~60μs
            8MB: ~150ms

TCP NetPIPE latency times: 2 10G Ethernet ports
[Plot: Time (seconds) vs. Buffer size, log-log axes; series: 1 10Gb Ethernet port, 2 10Gb Ethernet ports]
            1 10Gb Ethernet port: 1 byte ~60μs; 8MB ~150ms
            2 10Gb Ethernet ports: 1 byte ~30μs (!); 8MB ~8.3ms

TCP NetPIPE latency times: 2 10G Ethernet ports
[Plot, zoomed in on buffer sizes 1 to 1000: Time (seconds) vs. Buffer size; series: 1 10Gb Ethernet port, 2 10Gb Ethernet ports]
The facts:
            From 1-1024 bytes: flat latency
            Using 1 interface: ~60μs
            Using 2 interfaces: ~30μs

1. Ethernet frame arrives
2. NIC sends interrupt to OS Ethernet driver
3. OS Ethernet driver copies the packet to RAM
4. OS TCP stack hands packet off to (whatever)

It’s always better in bulk




Let’s optimize this part

A.k.a. “Interrupt Coalescing”
1. Copy a bunch of packets across PCI at one time
2. Only raise one interrupt for all of those packet copies

1. Ethernet frame arrives
2. Has N time passed since we sent an interrupt to the OS?
            ✖ No: queue up the frame
            ✔ Yes: Send all queued frames and interrupt

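The decision above can be sketched as a toy Python model. This is only an illustration of the two-step check, not how any real NIC or driver implements it: the class `CoalescingNic` and its fields are hypothetical, and the sketch only reacts to frame arrivals (it does not model a timer expiring on its own while frames sit in the queue).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoalescingNic:
    """Toy model of the receive-side decision described above (not a real driver)."""
    timer_us: float                           # "N": minimum gap between interrupts
    last_interrupt_us: float = float("-inf")  # no interrupt has been sent yet
    queued: List[bytes] = field(default_factory=list)
    interrupts: int = 0

    def frame_arrives(self, now_us: float, frame: bytes) -> List[bytes]:
        """Return the frames handed to the OS by this arrival (possibly none)."""
        self.queued.append(frame)
        if now_us - self.last_interrupt_us < self.timer_us:
            return []                         # No: queue up the frame
        return self._interrupt(now_us)        # Yes: send all queued frames and interrupt

    def _interrupt(self, now_us: float) -> List[bytes]:
        """Hand every queued frame to the OS with a single interrupt."""
        delivered, self.queued = self.queued, []
        self.last_interrupt_us = now_us
        self.interrupts += 1
        return delivered
```

With `timer_us=125`, frames arriving at 0, 50, 100, and 130μs cost two interrupts instead of four: the first frame is delivered immediately, and the next three are delivered together at 130μs.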
Ok… So what?
Periodic interrupt coalescing timeout: 125μs
[Timeline diagram: NIC A and NIC B]
1. A sends ping frame
2. B receives ping frame
3. Coalesce timer expires; B sends interrupt
4. B sends pong frame
5. Coalesce timer expires; A sends interrupt
6. A sends ping frame
7. Rinse, repeat

4 ping-pongs in ~8x timer duration

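The "~8x" figure can be checked with a small model. The sketch below assumes what the timeline draws: each NIC's coalescing timer is periodic, both timers are aligned, and a frame that has arrived is only seen by the OS at the next timer expiry. The 1μs wire time and the names `next_expiry` and `ping_pong_time` are assumptions made for this example.

```python
import math


def next_expiry(t_us: float, period_us: float, phase_us: float = 0.0) -> float:
    """First expiry of a periodic timer strictly after time t_us."""
    ticks = math.floor((t_us - phase_us) / period_us) + 1
    return phase_us + ticks * period_us


def ping_pong_time(n: int, period_us: float = 125.0, wire_us: float = 1.0,
                   phase_a_us: float = 0.0, phase_b_us: float = 0.0) -> float:
    """Time (μs) until A has been interrupted for the n-th pong."""
    t = 0.0
    for _ in range(n):
        # Ping reaches B, then waits for B's coalesce timer before the pong goes out.
        t = next_expiry(t + wire_us, period_us, phase_b_us)
        # Pong reaches A, then waits for A's coalesce timer before the next ping.
        t = next_expiry(t + wire_us, period_us, phase_a_us)
    return t
```

Under these assumptions `ping_pong_time(4)` comes out to 8 timer periods: each ping and each pong waits out one full period before the OS hears about it. The `phase_a_us` and `phase_b_us` parameters are there to experiment with timers that do not line up.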
In general, coalescing interrupts is a very Very Good Thing

But it definitely hurts low-latency traffic

How do we reduce those artificial delays?

[Timeline diagrams: ping-pong between NIC A and NIC B on Port 0, and between NIC A and NIC B on Port 1]

In reality, sender and receiver timers on each port are wholly unrelated; they don’t line up nicely like I used in these examples.
Meaning: in general, you actually usually get better overlap

TCP NetPIPE latency times: 2 10G Ethernet ports
[Plot, zoomed in on buffer sizes 1 to 1000: Time (seconds) vs. Buffer size; series: 1 10Gb Ethernet port (~60μs), 2 10Gb Ethernet ports (~30μs)]
In this case, we got such good asymmetry that the 2 port case is ~2x as fast (i.e., roughly twice as many interrupts in the same amount of time)

Remember: these are AVERAGE latencies!

Individual ping-pong times are the same as the 1 port case (from the network)
…but you get higher throughput because we’re reducing the gaps between each ping-pong

Now let’s try something else…

TCP NetPIPE latency times: 2 10G Ethernet ports
[Plot: Time (seconds) vs. Buffer size, log-log axes; series: 1 10Gb Ethernet port; 2 10Gb Ethernet ports; 1 10Gb Ethernet port, timer=0; 2 10Gb Ethernet ports, timer=0]
            Small buffers: 1 port ~10.5μs; 2 ports ~10.6μs
            Large buffers: 1 port ~7.2ms; 2 ports ~5.5ms

Pros
• (Much) faster TCP latency
            …without changing app!
• Faster speeds seem to scale up to large messages, too
• Great for low-latency, sparse comms apps
• Best for NICs that are dedicated to MPI comms

Cons
• May not scale well for case of MPI process running on every core
• Lots and lots of interrupts going to socket:0.core:0
• May need to run (N-1) MPI processes…?
            May also want to avoid socket:0.core:0, or move IRQ affinity

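One way to picture "avoid socket:0.core:0" is to restrict a process to every core except core 0. The Python sketch below uses the Linux-only `os.sched_getaffinity` / `os.sched_setaffinity` calls; the helper names are made up for this example, and real MPI launchers typically have their own process-binding options that would be used instead. It does not touch IRQ affinity.

```python
import os
from typing import Iterable, Set


def cores_excluding(reserved: Iterable[int], available: Iterable[int]) -> Set[int]:
    """Cores a compute process may use once the reserved ones are taken out."""
    return set(available) - set(reserved)


def pin_away_from_core0(pid: int = 0) -> Set[int]:
    """Restrict a process (0 = the caller) to its allowed cores minus core 0. Linux only."""
    allowed = os.sched_getaffinity(pid)
    target = cores_excluding({0}, allowed)
    if not target:
        raise RuntimeError("core 0 is the only core available")
    os.sched_setaffinity(pid, target)
    return target
```

On a 16-core machine, `cores_excluding({0}, range(16))` leaves cores 1-15 for compute, with core 0 left free to service the interrupts.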
• Some experimentation might be worth trying with real world HPC apps:
            Allow TCP to wholly utilize core 0 (i.e., run MPI processes only on cores 1-15)
            Set the coalesce timer to something more than 0μs, but less than 125μs – there’s a whole spectrum with which to play
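A sweep over that spectrum could be scripted along these lines. This is a sketch under assumptions: on Linux, `ethtool -C <interface> rx-usecs <N>` is the usual way to set a receive coalescing timer, but the talk does not say how the timer was set for these measurements, and whether a given NIC driver honors the setting has to be checked. The interface name `eth0` and both helper names are hypothetical, and the code only builds the commands; it does not run them.

```python
from typing import List


def coalesce_command(interface: str, rx_usecs: int) -> List[str]:
    """Build (but do not run) an ethtool command that sets the receive coalescing timer."""
    if rx_usecs < 0:
        raise ValueError("rx_usecs must be non-negative")
    return ["ethtool", "-C", interface, "rx-usecs", str(rx_usecs)]


def sweep_values(low_us: int = 0, high_us: int = 125, steps: int = 5) -> List[int]:
    """Roughly evenly spaced timer values strictly between low_us and high_us."""
    span = high_us - low_us
    return [low_us + span * i // (steps + 1) for i in range(1, steps + 1)]


# Candidate settings to benchmark one at a time (needs root to actually apply).
commands = [coalesce_command("eth0", value) for value in sweep_values()]
```

Each command would be applied, the application benchmark re-run, and the results compared against the 0μs and 125μs endpoints.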




• Many in HPC have Ethernot networks
            …but as HPC continues to commoditize itself, lots of HPC users have
            Ethernet-based environments

• Today’s Ethernet switches and NICs are actually quite a bit faster
       and more advanced than what we old-time-HPCers grew up with
• Even good ol’ TCP is amazingly fast and optimized today

• You may be able to tune your NIC and/or fabric to extract pretty
       darn good MPI TCP performance
            The default settings on your Ethernet NIC / fabric are likely tuned for general TCP
            traffic, which yields very different performance characteristics from what HPC
            applications typically need




Thank you.

More Related Content

What's hot

Cisco 640-864 Complete Study Guide
Cisco 640-864 Complete Study GuideCisco 640-864 Complete Study Guide
Cisco 640-864 Complete Study Guidenustouch
 
Redundant Internet service provision - customer viewpoint
Redundant Internet service provision - customer viewpointRedundant Internet service provision - customer viewpoint
Redundant Internet service provision - customer viewpointKae Hsu
 
ccmigration_09186a008033a3b4
ccmigration_09186a008033a3b4ccmigration_09186a008033a3b4
ccmigration_09186a008033a3b4guest66dc5f
 
Ccna prep from networkers
Ccna prep from networkersCcna prep from networkers
Ccna prep from networkersIvana Veljkovic
 

What's hot (8)

Cisco 640-864 Complete Study Guide
Cisco 640-864 Complete Study GuideCisco 640-864 Complete Study Guide
Cisco 640-864 Complete Study Guide
 
Cim 20070701 jul_2007
Cim 20070701 jul_2007Cim 20070701 jul_2007
Cim 20070701 jul_2007
 
Redundant Internet service provision - customer viewpoint
Redundant Internet service provision - customer viewpointRedundant Internet service provision - customer viewpoint
Redundant Internet service provision - customer viewpoint
 
Cim 20071101 nov_2007
Cim 20071101 nov_2007Cim 20071101 nov_2007
Cim 20071101 nov_2007
 
Day @ cio-gipfel 2007
Day @ cio-gipfel 2007Day @ cio-gipfel 2007
Day @ cio-gipfel 2007
 
ccmigration_09186a008033a3b4
ccmigration_09186a008033a3b4ccmigration_09186a008033a3b4
ccmigration_09186a008033a3b4
 
Ccna prep from networkers
Ccna prep from networkersCcna prep from networkers
Ccna prep from networkers
 
Cisco ios versions
Cisco ios versionsCisco ios versions
Cisco ios versions
 

Viewers also liked

I/O仮想化最前線〜ネットワークI/Oを中心に〜
I/O仮想化最前線〜ネットワークI/Oを中心に〜I/O仮想化最前線〜ネットワークI/Oを中心に〜
I/O仮想化最前線〜ネットワークI/Oを中心に〜Ryousei Takano
 
Linux Ethernet device driver
Linux Ethernet device driverLinux Ethernet device driver
Linux Ethernet device driver艾鍗科技
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelDivye Kapoor
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsHisaki Ohara
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Hajime Tazaki
 
Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013Hajime Tazaki
 
NUSE (Network Stack in Userspace) at #osio
NUSE (Network Stack in Userspace) at #osioNUSE (Network Stack in Userspace) at #osio
NUSE (Network Stack in Userspace) at #osioHajime Tazaki
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)micchie
 
Рекомендованные Cisco архитектуры для различных вертикалей
Рекомендованные Cisco архитектуры для различных вертикалейРекомендованные Cisco архитектуры для различных вертикалей
Рекомендованные Cisco архитектуры для различных вертикалейCisco Russia
 
Time Sensitive Networking in the Linux Kernel
Time Sensitive Networking in the Linux KernelTime Sensitive Networking in the Linux Kernel
Time Sensitive Networking in the Linux Kernelhenrikau
 
Cisco systems hacking layer 2 ethernet switches
Cisco systems   hacking layer 2 ethernet switchesCisco systems   hacking layer 2 ethernet switches
Cisco systems hacking layer 2 ethernet switchesKJ Savaliya
 
OMFW 2012: Analyzing Linux Kernel Rootkits with Volatlity
OMFW 2012: Analyzing Linux Kernel Rootkits with VolatlityOMFW 2012: Analyzing Linux Kernel Rootkits with Volatlity
OMFW 2012: Analyzing Linux Kernel Rootkits with VolatlityAndrew Case
 
A particle filter based scheme for indoor tracking on an Android Smartphone
A particle filter based scheme for indoor tracking on an Android SmartphoneA particle filter based scheme for indoor tracking on an Android Smartphone
A particle filter based scheme for indoor tracking on an Android SmartphoneDivye Kapoor
 
Cybermania Prelims
Cybermania PrelimsCybermania Prelims
Cybermania PrelimsDivye Kapoor
 

Viewers also liked (20)

I/O仮想化最前線〜ネットワークI/Oを中心に〜
I/O仮想化最前線〜ネットワークI/Oを中心に〜I/O仮想化最前線〜ネットワークI/Oを中心に〜
I/O仮想化最前線〜ネットワークI/Oを中心に〜
 
Userspace networking
Userspace networkingUserspace networking
Userspace networking
 
Linux Ethernet device driver
Linux Ethernet device driverLinux Ethernet device driver
Linux Ethernet device driver
 
Hands-on ethernet driver
Hands-on ethernet driverHands-on ethernet driver
Hands-on ethernet driver
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
DPDK KNI interface
DPDK KNI interfaceDPDK KNI interface
DPDK KNI interface
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux Kernel
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructions
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01
 
Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013Direct Code Execution @ CoNEXT 2013
Direct Code Execution @ CoNEXT 2013
 
NUSE (Network Stack in Userspace) at #osio
NUSE (Network Stack in Userspace) at #osioNUSE (Network Stack in Userspace) at #osio
NUSE (Network Stack in Userspace) at #osio
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)
 
Рекомендованные Cisco архитектуры для различных вертикалей
Рекомендованные Cisco архитектуры для различных вертикалейРекомендованные Cisco архитектуры для различных вертикалей
Рекомендованные Cisco архитектуры для различных вертикалей
 
Time Sensitive Networking in the Linux Kernel
Time Sensitive Networking in the Linux KernelTime Sensitive Networking in the Linux Kernel
Time Sensitive Networking in the Linux Kernel
 
Cisco systems hacking layer 2 ethernet switches
Cisco systems   hacking layer 2 ethernet switchesCisco systems   hacking layer 2 ethernet switches
Cisco systems hacking layer 2 ethernet switches
 
Linux performance
Linux performanceLinux performance
Linux performance
 
OMFW 2012: Analyzing Linux Kernel Rootkits with Volatlity
OMFW 2012: Analyzing Linux Kernel Rootkits with VolatlityOMFW 2012: Analyzing Linux Kernel Rootkits with Volatlity
OMFW 2012: Analyzing Linux Kernel Rootkits with Volatlity
 
A particle filter based scheme for indoor tracking on an Android Smartphone
A particle filter based scheme for indoor tracking on an Android SmartphoneA particle filter based scheme for indoor tracking on an Android Smartphone
A particle filter based scheme for indoor tracking on an Android Smartphone
 
Cybermania Prelims
Cybermania PrelimsCybermania Prelims
Cybermania Prelims
 







Ethernet and TCP optimizations

  • 1. Ethernet: Hidden Secrets Jeff Squyres © 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential 1
  • 2. First: some background information…
  • 3. Using lots and lots and lots of servers simultaneously to solve one computational problem
  • 4. Racks of 36 1U servers. Tend to send lots and lots and lots of small messages across the network to stay in sync with each other
  • 5. A sends a message; B receives the message (over the underlying network)
  • 6. Today’s fastest networks: 1-3μs (!) A sends a message; B receives the message (over the underlying network)
  • 7. • Typically not Ethernet networks • Usually have supercomputer-specific networks (example: highly tuned for short message latency) • …but that is changing (Ethernet vs. Ethernot)
  • 8. • Userspace NIC (“USNIC”): expose Cisco NIC hardware directly to Linux userspace; bypass the OS; bypass the TCP stack • Send raw Ethernet frames directly from user applications: much, much faster than traditional TCP-based networking, especially for latency of short messages
  • 9. Stack, top to bottom: Application, MPI library, userspace sockets library (userspace); TCP / IP stack, Cisco VIC driver (kernel); Cisco VIC hardware
  • 10. Stack, top to bottom: Application, MPI library, userspace verbs library (userspace); Verbs IB core, Cisco USNIC driver (kernel); Cisco VIC hardware. Two paths are labeled: bootstrapping and setup; send and receive fast path
  • 11. With all that background…
  • 12. Two servers
  • 13. Two servers, each with a 2 x 10Gb NIC
  • 14. Two servers, each with a 2 x 10Gb NIC, connected back-to-back
  • 15. Send a message from here; receive the message here. Ping!
  • 16. Get the message back; send the message back. Pong!
  • 17. Because each ping and pong are soooo short, do this ping-pong exchange N times. Ping! / Pong!
  • 18. Time for one ping-pong = (total time for N ping-pongs) / N
  • 19. Time for one ping-pong = (total time for N ping-pongs) / N; time for one ping = (time for one ping-pong) / 2
  • 20. Time for one ping = half-round trip (HRT) ping-pong latency
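
The arithmetic on slides 18-20 can be written out as a short Python sketch. The function name and the sample timing (10,000 ping-pongs in 1.2 seconds) are made up for illustration; they are not measurements from the talk.

```python
def half_round_trip(total_seconds: float, n_pingpongs: int) -> float:
    """Average half-round-trip (HRT) latency from N timed ping-pongs."""
    if n_pingpongs <= 0:
        raise ValueError("n_pingpongs must be positive")
    time_per_pingpong = total_seconds / n_pingpongs  # slide 18
    return time_per_pingpong / 2                     # slide 19: one ping


# Hypothetical input: 10,000 ping-pongs timed at 1.2 seconds in total.
hrt = half_round_trip(1.2, 10_000)
print(f"HRT latency: {hrt * 1e6:.1f} us")
```

Because the result is an average over N exchanges, it says nothing about how any single ping was delayed; that distinction matters later in the deck.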
  • 21. TCP NetPIPE latency times: 1 10G Ethernet port. [Plot: time (seconds) vs. buffer size, log-log axes. 1 10Gb Ethernet port: 1 byte ~60μs; 8MB ~150ms]
  • 22. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds) vs. buffer size, log-log axes. 1 10Gb Ethernet port: 1 byte ~60μs; 8MB ~150ms. 2 10Gb Ethernet ports: 1 byte ~30μs (!); 8MB ~8.3ms]
  • 23. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds) vs. buffer size, log-log axes. 1 10Gb Ethernet port: 1 byte ~60μs; 8MB ~150ms. 2 10Gb Ethernet ports: 1 byte ~30μs (!); 8MB ~8.3ms]
  • 24. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds) vs. buffer size from 1 to 1000, log-log axes; 1 10Gb Ethernet port vs. 2 10Gb Ethernet ports.] The facts: from 1-1024 bytes, latency is flat. Using 1 interface: ~60μs. Using 2 interfaces: ~30μs
  • 25. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds) vs. buffer size from 1 to 1000, log-log axes; 1 10Gb Ethernet port vs. 2 10Gb Ethernet ports.] The facts: from 1-1024 bytes, latency is flat. Using 1 interface: ~60μs. Using 2 interfaces: ~30μs
  • 26. 1. Ethernet frame arrives
  • 27. 1. Ethernet frame arrives 2. NIC sends interrupt to OS Ethernet driver
  • 28. 1. Ethernet frame arrives 2. NIC sends interrupt to OS Ethernet driver 3. OS Ethernet driver copies the packet to RAM
  • 29. 1. Ethernet frame arrives 2. NIC sends interrupt to OS Ethernet driver 3. OS Ethernet driver copies the packet to RAM 4. OS TCP stack hands packet off to (whatever)
  • 30. It’s always better in bulk
  • 31. Let’s optimize this part
  • 32. 1. Copy a bunch of packets across PCI at one time
  • 33. 1. Copy a bunch of packets across PCI at one time 2. Only raise one interrupt for all of those packet copies
  • 34. A.k.a. “Interrupt Coalescing”: 1. Copy a bunch of packets across PCI at one time 2. Only raise one interrupt for all of those packet copies
  • 35. 1. Ethernet frame arrives
  • 36. 1. Ethernet frame arrives 2. Has N time passed since we sent an interrupt to the OS?
  • 37. 1. Ethernet frame arrives 2. Has N time passed since we sent an interrupt to the OS? ✖ No: queue up the frame
  • 38. 1. Ethernet frame arrives 2. Has N time passed since we sent an interrupt to the OS? ✖ No: queue up the frame ✔ Yes: send all queued frames and interrupt
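
The decision on slides 35-38 can be sketched as a toy Python model. This is not Cisco VIC firmware or driver code: the class name, fields, and the choice to start the clock as if an interrupt fired at time 0 are all assumptions made for illustration. The model is also purely arrival-driven, whereas a real NIC would additionally fire when its timer expires with frames still queued.

```python
from dataclasses import dataclass, field


@dataclass
class CoalescingNic:
    """Toy model of the per-frame decision on slides 35-38 (hypothetical)."""
    timeout_us: float                 # the "N time" from the slides
    last_interrupt_us: float = 0.0    # assumption: an interrupt fired at t=0
    queue: list = field(default_factory=list)

    def frame_arrives(self, now_us: float, frame: str) -> list:
        """Return the frames handed to the OS by this arrival (maybe none)."""
        self.queue.append(frame)
        if now_us - self.last_interrupt_us < self.timeout_us:
            return []                 # No: queue up the frame
        # Yes: send all queued frames and raise a single interrupt
        delivered, self.queue = self.queue, []
        self.last_interrupt_us = now_us
        return delivered


nic = CoalescingNic(timeout_us=125.0)
for t, name in [(10, "a"), (50, "b"), (130, "c"), (140, "d")]:
    print(t, nic.frame_arrives(t, name))
```

In this run, frames "a" and "b" wait in the queue and are delivered together with "c" by one interrupt; "d" is queued again because the timeout restarted at t=130.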
  • 40.
  • 41. Periodic interrupt coalescing timeout: 125μs. 1. A sends ping frame 2. B receives ping frame
  • 42. 3. Coalesce timer expires; B sends interrupt 4. B sends pong frame
  • 43. 5. Coalesce timer expires; A sends interrupt 6. A sends ping frame 7. Rinse, repeat
  • 44. 4 ping-pongs in ~8x timer duration
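
The timeline on slides 41-44 can be reproduced with an idealized Python model. It assumes what the diagrams draw: each NIC has a periodic coalescing timer, both timers are aligned, wire time is negligible, and a received frame is only seen by the host at the next timer expiry. The function names and the `phase`/`wire_us` parameters are hypothetical knobs, not anything from the talk.

```python
import math


def next_expiry(t_us: float, period_us: float, phase_us: float = 0.0) -> float:
    """First expiry strictly after t_us of a periodic timer."""
    k = math.floor((t_us - phase_us) / period_us) + 1
    return phase_us + k * period_us


def pingpong_time(n: int, period_us: float,
                  phase_a_us: float = 0.0, phase_b_us: float = 0.0,
                  wire_us: float = 0.0) -> float:
    """Wall-clock time for n ping-pongs started by A at t=0."""
    t = 0.0
    for _ in range(n):
        # Ping reaches B; B's host sees it when B's timer expires, then pongs.
        t = next_expiry(t + wire_us, period_us, phase_b_us)
        # Pong reaches A; A's host sees it when A's timer expires.
        t = next_expiry(t + wire_us, period_us, phase_a_us)
    return t


print(pingpong_time(4, 125.0) / 125.0)  # 8.0 timer periods for 4 ping-pongs
```

Each hop costs one full timer period in this model, which is where the "~8x timer duration" for 4 ping-pongs comes from. Changing `phase_b_us` shows how sensitive the total is to the relative alignment of the two timers.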
  • 45. In general, coalescing interrupts is a very Very Good Thing
  • 46. But it definitely hurts low-latency traffic
  • 47. How do we reduce those artificial delays?
  • 48. [Timeline diagram: NIC A and NIC B on port 0; NIC A and NIC B on port 1]
  • 49. [Timeline diagram: NIC A and NIC B on port 0; NIC A and NIC B on port 1]
  • 50. In reality, sender and receiver timers on each port are wholly unrelated; they don’t line up nicely like I used in these examples. Meaning: in general, you actually usually get better overlap
  • 51. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds) vs. buffer size from 1 to 1000, log-log axes; 1 10Gb Ethernet port ~60μs, 2 10Gb Ethernet ports ~30μs.] In this case, we got such good asymmetry that the 2 port case is ~2x as fast (i.e., roughly twice as many interrupts in the same amount of time)
  • 52. Remember: these are AVERAGE latencies! Individual ping-pong times are the same as the 1 port case (from the network) …but you get higher throughput because we’re reducing the gaps between each ping-pong
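
The "~2x" effect on slides 48-52 can be illustrated with a small Python sketch. Everything about the model is an assumption: frames are taken to alternate round-robin across the ports, each port has its own periodic coalescing timer, wire time is ignored, and port 1's timer is offset from port 0's by half a period. Only the ratio between the two cases is the point; the absolute numbers are not meant to reproduce the measured ~60μs and ~30μs.

```python
import math


def next_expiry(t_us: float, period_us: float, phase_us: float) -> float:
    """First expiry strictly after t_us of a periodic timer."""
    k = math.floor((t_us - phase_us) / period_us) + 1
    return phase_us + k * period_us


def total_time(n_pingpongs: int, period_us: float, port_phases_us: list) -> float:
    """Time for n ping-pongs when frames alternate across ports (assumed)."""
    t = 0.0
    for frame in range(2 * n_pingpongs):      # one ping plus one pong each
        phase = port_phases_us[frame % len(port_phases_us)]
        t = next_expiry(t, period_us, phase)  # wait for that port's timer
    return t


def avg_hrt(n_pingpongs: int, period_us: float, port_phases_us: list) -> float:
    """Average half-round-trip time, as a ping-pong benchmark reports it."""
    return total_time(n_pingpongs, period_us, port_phases_us) / n_pingpongs / 2


one_port = avg_hrt(1000, 125.0, [0.0])
two_ports = avg_hrt(1000, 125.0, [0.0, 62.5])
print(f"1 port: {one_port:.1f} us, 2 ports: {two_ports:.1f} us, "
      f"ratio: {one_port / two_ports:.2f}")
```

The time on the wire is identical in both cases; the two-port run is faster on average only because each frame waits less for the next timer expiry.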
  • 54.
  • 55. TCP NetPIPE latency times: 2 10G Ethernet ports. [Plot: time (seconds) vs. buffer size, log-log axes. Series: 1 10Gb Ethernet port; 2 10Gb Ethernet ports; 1 10Gb Ethernet port, timer=0; 2 10Gb Ethernet ports, timer=0. Annotations: small buffers, 1 port ~10.5μs and 2 ports ~10.6μs; large buffers, 1 port ~7.2ms and 2 ports ~5.5ms]
  • 56. Pros: • (Much) faster TCP latency …without changing app! • Faster speeds seem to scale up to large messages, too • Great for low-latency, sparse comms apps • Best for NICs that are dedicated to MPI comms. Cons: • May not scale well for case of MPI process running on every core • Lots and lots of interrupts going to socket:0.core:0 • May need to run (N-1) MPI processes…? May also want to avoid socket:0.core:0, or move IRQ affinity
  • 57.
  • 58. • Some experimentation might be worth trying with real world HPC apps: • Allow TCP to wholly utilize core 0 (i.e., run MPI processes only on cores 1-15) • Set the coalesce timer to something more than 0μs, but less than 125μs; there’s a whole spectrum with which to play
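
One way to explore the spectrum suggested on slide 58 is to sweep the coalescing timer and rerun the benchmark at each setting. The Python sketch below only builds and prints the commands; it does not run them. It assumes the NIC driver exposes its receive coalescing timer through `ethtool -C <interface> rx-usecs <N>`, which is not confirmed by the talk for the Cisco VIC, and the interface name `eth0` and the step size are hypothetical.

```python
def coalesce_sweep(interface: str, values_us) -> list:
    """Build one `ethtool -C` command (as an argv list) per timer value."""
    return [["ethtool", "-C", interface, "rx-usecs", str(v)]
            for v in values_us]


# Hypothetical sweep: 0 and 125 are the two settings measured in the talk;
# the values in between are the spectrum worth exploring.
for cmd in coalesce_sweep("eth0", range(0, 126, 25)):
    print(" ".join(cmd))
```

Each command would typically need root privileges, followed by a fresh latency run so that results at different timer values are comparable.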
  • 59. • Many in HPC have Ethernot networks …but as HPC continues to commoditize itself, lots of HPC users have Ethernet-based environments • Today’s Ethernet switches and NICs are actually quite a bit faster and more advanced than what we old-time-HPCers grew up with • Even good ol’ TCP is amazingly fast and optimized today • You may be able to tune your NIC and/or fabric to extract pretty darn good MPI TCP performance. The default settings on your Ethernet NIC / fabric are likely set for general TCP traffic, which yields very different performance characteristics than what HPC applications typically need