Ryan Sciampacone – IBM Java Runtime Lead
1st October 2012




High Speed Networks
Free Performance or New Bottlenecks?




                                           © 2012 IBM Corporation
Important Disclaimers



    THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR
      INFORMATIONAL PURPOSES ONLY.
    WHILST EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF
      THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”,
      WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
    ALL PERFORMANCE DATA INCLUDED IN THIS PRESENTATION HAVE BEEN GATHERED
       IN A CONTROLLED ENVIRONMENT. YOUR OWN TEST RESULTS MAY VARY BASED
       ON HARDWARE, SOFTWARE OR INFRASTRUCTURE DIFFERENCES.
    ALL DATA INCLUDED IN THIS PRESENTATION ARE MEANT TO BE USED ONLY AS A
       GUIDE.
    IN ADDITION, THE INFORMATION CONTAINED IN THIS PRESENTATION IS BASED ON
       IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO
       CHANGE BY IBM, WITHOUT NOTICE.
    IBM AND ITS AFFILIATED COMPANIES SHALL NOT BE RESPONSIBLE FOR ANY
       DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS
       PRESENTATION OR ANY OTHER DOCUMENTATION.
    NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE
      EFFECT OF:
    - CREATING ANY WARRANTY OR REPRESENTATION FROM IBM, ITS AFFILIATED
       COMPANIES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS
2                                                                       © 2012 IBM Corporation
Introduction to the speaker



    ■   15 years of experience developing and deploying Java SDKs
    ■   Recent work focus:
         ■   Managed Runtime Architecture
         ■   Java Virtual Machine improvements
              ■   Multi-tenancy technology
              ■   Native data access and heap density
              ■   Footprint and performance
         ■   Garbage Collection
              ■   Scalability and pause time reduction
              ■   Advanced GC technology

    ■   My contact information:
         – Ryan_Sciampacone@ca.ibm.com


3                                                                © 2012 IBM Corporation
What should you get from this talk?



■   Understand the current state of high speed networks in the context of Java
    development and take away a clear view of the issues involved. Learn practical
    approaches to achieving great performance, including how to understand results
    that initially don’t make sense.




4                                                                         © 2012 IBM Corporation
Life In The Fast Lane



■   “Never underestimate the bandwidth of a station wagon full of tapes hurtling down
    the highway.”
    -- Andrew S. Tanenbaum, Computer Networks, 4th ed., p. 91


■   Networks are often thought of as just a simple interconnect between systems
■   No real differentiators
     – WAN vs. LAN
     – Wired vs. Wireless
■   APIs traditionally make this invisible
     – Socket API is good at hiding things (SDP, SMC-R, TCP/IP)


■   Can today’s network offerings be exploited to improve existing performance?



5                                                                           © 2012 IBM Corporation
Network Overview




6                      © 2012 IBM Corporation
Network Speeds Over Time


                      [Bar chart: Comparison of Network Speeds – 10Mb/s, 100Mb/s, 1GigE, 10GigE, InfiniBand]
■   Consistent advancement in speeds over the years
■   Networks have come a long way in that time


7                                                              © 2012 IBM Corporation
Network Speeds Over Time


                      [Bar chart: Comparison of Network Speeds – the same data, now on a linear scale]
■   Oh sorry – that was a logarithmically scaled chart!




8                                                               © 2012 IBM Corporation
Network Speeds vs. The World


                      [Bar chart: Networks vs. Other Storage Bandwidth – 1GigE, 10GigE, InfiniBand, Core i7 memory, SSD]

■   Bandwidth differences between memory and InfiniBand still a ways off
■   But the gap is getting smaller!


9                                                                          © 2012 IBM Corporation
Networks Now vs. Yesterday



■    Real opportunity to look at decentralized systems
■    Already true:
      – Cloud computing
      – Data grids
      – Distributed computation


■    Network distance isn’t as far as it used to be!




10                                                       © 2012 IBM Corporation
What is InfiniBand?



■    Originated in 1999 from the merger of two competing designs
■    Features
      – High throughput
      – Low Latency
      – Quality of Service
      – Failover
      – Designed to be scalable
■    Offers low latency RDMA (Remote Direct Memory Access)
■    Uses a different programming model than traditional sockets
      – No “standard” API – de facto: OFED (OpenFabrics Enterprise Distribution)
      – Upper layer protocols (ULPs) exist to ease the pain of development




11                                                                                 © 2012 IBM Corporation
IB vs. IPoIB vs. SDP – InfiniBand

           IB stack:  Application → IB Services → IB Core → Device Driver

                      Modified application using an IB-specific communication
                      mechanism; bypasses kernel facilities – effectively a
                      “zero hop” to the communication layer

■    Handles all transmission aspects (guarantees, transmission units, etc.)
■    Extremely low CPU cost
12                                                                            © 2012 IBM Corporation
IB vs. IPoIB vs. SDP – IP over InfiniBand

           IB stack:     Application → IB Services → IB Core → Device Driver
           IPoIB stack:  Application → Socket API → TCP/IP → IPoIB → IB Core → Device Driver

                         Application uses standard socket APIs for communication;
                         the entire TCP/IP stack is used but resides on a
                         mapping / conversion layer (IPoIB)


■    Effectively the TCP/IP stack using a “device driver” to interface with the IB layer
■    High CPU cost
13                                                                                 © 2012 IBM Corporation
IB vs. IPoIB vs. SDP – Sockets Direct Protocol

           IB stack:     Application → IB Services → IB Core → Device Driver
           IPoIB stack:  Application → Socket API → TCP/IP → IPoIB → IB Core → Device Driver
           SDP stack:    Application → Socket API → SDP → IB Core → Device Driver

                         Application uses standard socket APIs for communication;
                         although socket-API based, SDP uses its own lighter-weight
                         mechanisms and mappings to leverage IB


■    Largely bypasses the kernel but still incurs an extra hop during transmission
■    Medium CPU cost
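
■    Aside: some JDK 7 implementations can route standard socket traffic over SDP purely
     through configuration, with no code changes – a minimal sketch, where the subnet and
     port are assumptions:

         # sdp.conf – use SDP for sockets bound to, or connecting into, this subnet
         bind    192.168.1.0/24   *
         connect 192.168.1.0/24   5000

     The file is picked up at launch, e.g. java -Dcom.sun.sdp.conf=sdp.conf MyApp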
14                                                                             © 2012 IBM Corporation
Throughput vs. Latency




15                            © 2012 IBM Corporation
Throughput vs. Latency




16                            © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring
       throughput and latency




17                                            © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring
       throughput and latency




18                                            © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Length of time for a data unit to travel from the start point to the end point




19                                                                             © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)




20                                                                             © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)




21                                                                             © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)
     Throughput: number of data units that arrive per unit of time




22                                                                                                     © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)
     Throughput: number of data units that arrive per unit of time (e.g., 10Gb/s)




23                                                                                                     © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)
     Throughput: number of data units that arrive per unit of time (e.g., 10Gb/s)



■    Shower analogy
      – Diameter of the pipe gives you water throughput
      – Length determines the time it takes for a drop to travel from end to end
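
■    The two combine in the bandwidth-delay product – how much data is “in the pipe” at
     any instant (a worked example using the figures above, not a measured result):

         in-flight data = throughput × latency = 10 Gb/s × 10 ms = 100 Mb ≈ 12.5 MB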




24                                                                                                        © 2012 IBM Corporation
Throughput vs. Latency



■    Motivations can characterize priorities
      – Throughput and latency are not necessarily related!


■    Higher throughput rates offer interesting optimization possibilities
      – Reduced pressure on compressing data
      – Reduced pressure on being selective about what data to send


■    For something like RDMA… just send the entire page




25                                                                          © 2012 IBM Corporation
Simple Test using IB




26                          © 2012 IBM Corporation
Simple Test using IB – Background




■    Experiment: Can Java exploit RDMA to get better performance?
■    Tests conducted
      – Send different sized packets from a client to a server
      – Time required to complete write
      – Test variations include communication layer with RDMA
■    Conditions
      – Single threaded
      – 40Gb/s InfiniBand
■    Goal being to look at
      – Network speeds
      – Baseline overhead that Java imposes over C equivalent programs
      – Existing issues that may not have been predicted
■    Also going to look at very basic Java overhead
      – Comparisons will go against C equivalent program
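
■    A minimal sketch of what the Java client side of such a test can look like (host,
     port, payload size, and iteration count are assumptions – not the actual benchmark):

         import java.net.InetSocketAddress;
         import java.nio.ByteBuffer;
         import java.nio.channels.SocketChannel;

         public class ThroughputClient {
             public static void main(String[] args) throws Exception {
                 // Direct buffer: avoids the JNI copy out of the Java heap on write
                 ByteBuffer payload = ByteBuffer.allocateDirect(64 * 1024);
                 try (SocketChannel ch =
                         SocketChannel.open(new InetSocketAddress("server", 5000))) {
                     long start = System.nanoTime();
                     for (int i = 0; i < 10_000; i++) {
                         payload.clear();
                         while (payload.hasRemaining()) {
                             ch.write(payload);   // a large payload may need several writes
                         }
                     }
                     long elapsed = System.nanoTime() - start;
                     System.out.printf("%.2f MB/s%n",
                         (10_000L * 64 * 1024) / (elapsed / 1e9) / 1e6);
                 }
             }
         }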


27                                                                       © 2012 IBM Corporation
Simple Test using IB – IPoIB Comparison

                      [Chart: Throughput comparison for C / Java – throughput (higher is better)
                       vs. payload size, 1 byte to 16MB; series: C IPoIB, Java DBB IPoIB]
■    DirectByteBuffer (NIO socket channel) to avoid marshalling costs (JNI)
■    Observations
      – C code is initially faster than the Java implementation
      – Generally even beyond the 128k payload size


28                                                                                                                            © 2012 IBM Corporation
Simple Test using IB – SDP Comparison

                      [Chart: Throughput comparison for C / Java – throughput (higher is better)
                       vs. payload size, 1 byte to 16MB; series: C IPoIB, Java DBB IPoIB,
                       C SDP, Java DBB SDP]
■    DirectByteBuffer (NIO socket channel) to avoid marshalling costs (JNI)
■    Observations
      – C code is initially faster than the Java implementation
      – Generally even beyond the 128k payload size


29                                                                                                                            © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                       Java             Native          Kernel



                                 JNI




30                                                               © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java           Native             Kernel


              byte[ ]
                                    JNI




31                                                                    © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java           Native             Kernel


              byte[ ]
                                    JNI




               write data




32                                                                    © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java           Native             Kernel


              byte[ ]
                                    JNI




               write data




33                                                                    © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java                Native        Kernel
                                  copy


              byte[ ]
                                         JNI




               write data




34                                                                    © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java                Native          Kernel
                                  copy                  copy


              byte[ ]
                                         JNI




               write data




35                                                                      © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java                Native          Kernel
                                  copy                  copy


              byte[ ]
                                                                    Transmit
                                         JNI




               write data




■    2 copies before data gets transmitted
■    Lots of CPU burn, lots of memory being consumed
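
■    A minimal sketch of the classic path above (host, port, and size are assumptions) –
     the byte[] lives on the Java heap, so it is copied once via JNI and again by the kernel:

         import java.io.OutputStream;
         import java.net.Socket;

         public class ClassicWrite {
             public static void main(String[] args) throws Exception {
                 byte[] data = new byte[64 * 1024];        // lives on the Java heap
                 try (Socket socket = new Socket("server", 5000)) {
                     OutputStream out = socket.getOutputStream();
                     // heap byte[] -> native copy (JNI), then native -> kernel copy
                     out.write(data);
                     out.flush();
                 }
             }
         }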


36                                                                             © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                      Java             Native          Kernel




37                                                              © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                      Java             Native          Kernel




38                                                              © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                      Java             Native          Kernel


                     write data




39                                                              © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                      Java             Native          Kernel
                                                copy



                     write data




40                                                              © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                       Java                Native          Kernel
                                                    copy


                                                                Transmit
                      write data




■    1 copy before data gets transmitted
■    Less CPU burn, less memory being consumed


41                                                                         © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java              Native            Kernel




42                                                                 © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java              Native            Kernel

                                       >64KB




43                                                                 © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java              Native            Kernel

                                       >64KB
                     write data




44                                                                 © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java               Native                Kernel

                                         >64KB
                     write data



                              Memory is “registered” for use
                              with RDMA (direct send from
                                  user space memory)




45                                                                      © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java               Native                Kernel

                                         >64KB
                     write data



                              Memory is “registered” for use        This is extremely
                              with RDMA (direct send from           expensive / slow!
                                  user space memory)




46                                                                               © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java               Native                Kernel

                                         >64KB
                                               Transmit
                     write data



                              Memory is “registered” for use        This is extremely
                              with RDMA (direct send from           expensive / slow!
                                  user space memory)




47                                                                               © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                        Java               Native               Kernel

                                           >64KB
                       write data



                                Memory is “unregistered” when
                                    the send completes


■    1 copy before data gets transmitted
■    Register / unregister is prohibitively expensive (every transmit!)


48                                                                        © 2012 IBM Corporation
Simple Test using IB – SDP Comparison

                      [Chart: Throughput comparison for C / Java – throughput (higher is better)
                       vs. payload size, 1 byte to 16MB; series: C IPoIB, Java DBB IPoIB,
                       C SDP, Java DBB SDP]
■    Post Zero Copy threshold there is a sharp drop
      – Cost of memory register / unregister
■    Eventual climb and plateau
      – Benefits of zero copy cannot outweigh the drawbacks


49                                                                                                                            © 2012 IBM Corporation
Simple Test using IB – RDMA Comparison

                      [Chart: Throughput comparison for C / Java – throughput (higher is better)
                       vs. payload size, 1 byte to 16MB; series: C IPoIB, Java DBB IPoIB,
                       C SDP, Java DBB SDP, C RDMA/W, Java DBB RDMA/W]
■    No “zero copy” threshold issues
      – Always zero copy
      – Memory registered once, reused
■    Throughput does finally plateau
      – Single thread – pipe is hardly saturated

50                                                                                                                             © 2012 IBM Corporation
Simple Test using IB – What about that Zero Copy Threshold?
                      [Chart: Zero Copy Threshold Comparison – throughput (higher is better)
                       vs. payload size; one series per zero-copy threshold: 4k, 8k, 16k,
                       32k, 64k, 128k, 256k, 512k]
■    SDP ultimately has a plateau here
      – Possibly other deeper tuning aspects available
■    Pushing the threshold for zero copy out has no advantage
■    Claw back is still ultimately limited
      – Likely gated by some other aspect of the system
■    64KB threshold (default) seems to be the “sweet spot”

51                                                                                                                              © 2012 IBM Corporation
Simple Test using IB – Summary



■    Simple steps to start using
      – IPoIB lets you use your application ‘as is’
■    Increased speed can potentially involve significant application changes
      – Potential need for deeper technical knowledge
      – SDP is an interesting stop gap
■    There are hidden gotchas!
      – Increased load changes the game – but this is standard when dealing with computers




52                                                                                © 2012 IBM Corporation
ORB and High Speed Networks




53                                 © 2012 IBM Corporation
Benchmarking the ORB – Background




■    Experiment: How does the ORB perform over InfiniBand?
■    Tests conducted
      – Send different sized packets from a client to a server
      – Time required for write followed by read
      – Compare standard Ethernet to SDP / IPoIB
■    Conditions
      – 500 client threads
      – Echo style test (send to server, server echoes data back)
      – byte[] payload
      – 40Gb/s InfiniBand
■    Goal being to look at
      – ORB performance when the data pipe isn’t the bottleneck (time to complete the benchmark)
      – Threading performance
■    Realistically expecting to discover bottlenecks in the ORB


54                                                                               © 2012 IBM Corporation
Benchmarking the ORB – Ethernet Results



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH]
■    Standard Ethernet with the classic java.net package



55                                                                                                                      © 2012 IBM Corporation
Benchmarking the ORB – SDP



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH, SDP]
■    …And this is with SDP (could be better)



56                                                                                                                      © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




     byte[ ]




57                                                           © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                               ORB

     byte[ ]




58                                                           © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                               ORB

     byte[ ]



                               write data




59                                                           © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                               ORB

     byte[ ]



                               write data


                                                         Internal buffer
                                                         for transmission




60                                                             © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                               ORB

     byte[ ]



                               write data
                2KB                            1KB




61                                                           © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                                 ORB

     byte[ ]



                                 write data
                                                 1KB
           1KB         1KB




62                                                             © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                                 ORB
                        copy
     byte[ ]



                                 write data
                                                 1KB
           1KB         1KB




63                                                             © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                                             ORB
                             copy
        byte[ ]
                                                                   Transmit

                                        write data
                                                             1KB
              1KB           1KB




■    Many additional costs being incurred (per thread!) to transmit a byte array
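
■    The pattern can be sketched as follows (an illustration only, not the actual ORB
     internals; the 1KB buffer size matches the diagram):

         import java.io.IOException;
         import java.nio.ByteBuffer;
         import java.nio.channels.WritableByteChannel;

         // Internal transmission buffer: payloads larger than the buffer are
         // copied and flushed chunk by chunk – one extra copy per chunk.
         final class TransmissionBuffer {
             private final ByteBuffer internal = ByteBuffer.allocateDirect(1024); // 1KB

             void write(byte[] payload, WritableByteChannel channel) throws IOException {
                 int offset = 0;
                 while (offset < payload.length) {
                     int chunk = Math.min(internal.remaining(), payload.length - offset);
                     internal.put(payload, offset, chunk);  // copy into the internal buffer
                     offset += chunk;
                     internal.flip();
                     while (internal.hasRemaining()) {
                         channel.write(internal);           // transmit this chunk
                     }
                     internal.clear();
                 }
             }
         }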




64                                                                             © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers



     ■   3KB to 4KB ORB buffer sizes were sufficient for Ethernet


                       ORB                            Socket Layer


                              Transmit


                        4KB




■        Existing bottlenecks outside the ORB (buffer management)
■        Throughput couldn’t be pushed much further



65                                                                   © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers



■    64KB was the best size for SDP


                         ORB             Native

                                copy

                                              Transmit


                         64KB




■    Zero Copy Threshold!




66                                                              © 2012 IBM Corporation
Benchmarking the ORB – Garbage Collector Impact



■    Allocating large objects (e.g., buffers) can be a costly operation




       Heap


                                                                 Free Memory

                                                                 Allocated Memory




67                                                                             © 2012 IBM Corporation
Benchmarking the ORB – Garbage Collector Impact



■    Allocating large objects (e.g., buffers) can be a costly operation


                             Buffer


       Heap


                                                                 Free Memory

                                                                 Allocated Memory




68                                                                             © 2012 IBM Corporation
Benchmarking the ORB – Garbage Collector Impact



■    Allocating large objects (e.g., buffers) can be a costly operation


                             Buffer                       Allocate where?


       Heap


                                                                 Free Memory

                                                                 Allocated Memory




69                                                                             © 2012 IBM Corporation
Benchmarking the ORB – Garbage Collector Impact



■    Allocating large objects (e.g., buffers) can be a costly operation


                             Buffer                       Allocate where?


       Heap


                                                                 Free Memory

                                                                 Allocated Memory


■    Premature Garbage Collections in order to “clear space” for large allocations
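
■    One common mitigation is to reuse a per-thread buffer rather than allocate a fresh
     large object per request (a minimal sketch; the 64KB size is an assumption):

         import java.nio.ByteBuffer;

         // One reusable buffer per thread: avoids a large allocation per request,
         // and with it the premature GCs triggered to make room for it.
         final class BufferPool {
             private static final ThreadLocal<ByteBuffer> BUFFER =
                 ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(64 * 1024));

             static ByteBuffer acquire() {
                 ByteBuffer buf = BUFFER.get();
                 buf.clear();                 // reset position/limit for reuse
                 return buf;
             }
         }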



70                                                                             © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server




71                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server




72                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads




73                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads           1 Connection




74                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads           1 Connection




■    Highly contended resource
■    Couldn’t saturate the communication channel



75                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads




76                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads




77                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads

                                   500 Connections




78                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads

                                   500 Connections


■    Context switching disaster
■    Threads queued and unable to complete transmit
■    Memory / resource consumption nightmare

79                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads




80                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads           10 Connections




81                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                               Server



                      500 Threads           10 Connections




■    2-5% of the client thread count appeared to be best
■    Saturate the communication pipe enough to achieve best throughput
■    Keep resource consumption and context switches to a minimum
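
■    That ratio can be sketched as a small fixed pool of connections shared by many
     worker threads (a sketch with assumed types; the counts mirror the 500:10 ratio):

         import java.nio.channels.SocketChannel;
         import java.util.List;
         import java.util.concurrent.ArrayBlockingQueue;
         import java.util.concurrent.BlockingQueue;

         // 500 worker threads share ~10 connections: a thread borrows a channel,
         // transmits, and returns it – contention and context switching stay low.
         final class ConnectionPool {
             private final BlockingQueue<SocketChannel> channels;

             ConnectionPool(List<SocketChannel> openChannels) {
                 this.channels =
                     new ArrayBlockingQueue<>(openChannels.size(), true, openChannels);
             }

             <T> T withConnection(Operation<T> op) throws Exception {
                 SocketChannel ch = channels.take();  // blocks until a channel is free
                 try {
                     return op.run(ch);
                 } finally {
                     channels.put(ch);                // hand the channel to the next thread
                 }
             }

             interface Operation<T> { T run(SocketChannel ch) throws Exception; }
         }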

82                                                                       © 2012 IBM Corporation
Benchmarking the ORB – Post Optimization Round



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH, SDP]
83                                                                                                                © 2012 IBM Corporation
Benchmarking the ORB – Post Optimization Round



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH, SDP, SDP (New)]
■    Hey, great! Still not super (or the difference you’d expect) but it’s a good start
■    NOTE: the 64k zero-copy threshold is clearly a big part of the story


84                                                                                                                    © 2012 IBM Corporation
Benchmarking the ORB – Post Optimization Round



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH, SDP, SDP (New), IPoIB]
■    No surprises, IPoIB has higher overhead than SDP
■    64KB numbers are actually quite close – so still issues to discover and fix

85                                                                                                                   © 2012 IBM Corporation
Benchmarking the ORB – Summary



■    It’s not as easy as “stepping on the gas”
      – High speed networks alone don’t resolve your problems.
      – Software layers are going to have bottlenecks.
      – Improvements for high speed networks can help traditional ones as well
■    Benefit is not always clear cut




86                                                                               © 2012 IBM Corporation
And after all that…




87                         © 2012 IBM Corporation
Conclusion



■    High speed networks are a game changer
■    Simple to use, hard to use effectively
■    Expectations based on past results need to be re-evaluated
■    Existing applications / frameworks may need tuning or optimization
■    Opens up potentially new possibilities




88                                                                        © 2012 IBM Corporation
Questions?




89                © 2012 IBM Corporation
References



 ■   Get Products and Technologies:
      – IBM Java Runtimes and SDKs:
          • https://www.ibm.com/developerworks/java/jdk/
      – IBM Monitoring and Diagnostic Tools for Java:
          • https://www.ibm.com/developerworks/java/jdk/tools/


 ■   Learn:
      – IBM Java InfoCenter:
          • http://publib.boulder.ibm.com/infocenter/java7sdk/v7r0/index.jsp

 ■   Discuss:
      – IBM Java Runtimes and SDKs Forum:
          • http://www.ibm.com/developerworks/forums/forum.jspa?forumID=367&start=0




90                                                                                    © 2012 IBM Corporation
Copyright and Trademarks



 © IBM Corporation 2012. All Rights Reserved.


 IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of
   International Business Machines Corp., and registered in many jurisdictions
   worldwide.


 Other product and service names might be trademarks of IBM or other companies.


 A current list of IBM trademarks is available on the Web – see the IBM “Copyright
    and trademark information” page at URL: www.ibm.com/legal/copytrade.shtml




91                                                                        © 2012 IBM Corporation

High speed networks and Java (Ryan Sciampacone)

  • 1.
    Ryan Sciampacone –IBM Java Runtime Lead 1st October 2012 High Speed Networks Free Performance or New Bottlenecks? © 2012 IBM Corporation
  • 2.
    Important Disclaimers THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILST EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. ALL PERFORMANCE DATA INCLUDED IN THIS PRESENTATION HAVE BEEN GATHERED IN A CONTROLLED ENVIRONMENT. YOUR OWN TEST RESULTS MAY VARY BASED ON HARDWARE, SOFTWARE OR INFRASTRUCTURE DIFFERENCES. ALL DATA INCLUDED IN THIS PRESENTATION ARE MEANT TO BE USED ONLY AS A GUIDE. IN ADDITION, THE INFORMATION CONTAINED IN THIS PRESENTATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM, WITHOUT NOTICE. IBM AND ITS AFFILIATED COMPANIES SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF: - CREATING ANY WARRANT OR REPRESENTATION FROM IBM, ITS AFFILIATED COMPANIES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS 2 © 2012 IBM Corporation
  • 3.
    Introduction to thespeaker ■ 15 years experience developing and deploying Java SDKs ■ Recent work focus: ■ Managed Runtime Architecture ■ Java Virtual Machine improvements ■ Multi-tenancy technology ■ Native data access and heap density ■ Footprint and performance ■ Garbage Collection ■ Scalability and pause time reduction ■ Advanced GC technology ■ My contact information: – Ryan_Sciampacone@ca.ibm.com 3 © 2012 IBM Corporation
  • 4.
    What should youget from this talk? ■ Understand the current state of high speed networks in the context of Java development and take away a clear view of the issues involved. Learn practical approaches to achieving great performance, including how to understand results that initially don’t make sense. 4 © 2012 IBM Corporation
  • 5.
    Life In TheFast Lane ■ “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” -- Andrew S. Tanenbaum, Computer Networks, 4th ed., p. 91 ■ Networks often just thought of as a simple interconnect between systems ■ No real differentiators – WAN vs. LAN – Wired vs. Wireless ■ APIs traditionally make this invisible – Socket API is good at hiding things (SDP, SMC-R, TCP/IP) ■ Can today’s network offerings be exploited to improve existing performance? 5 © 2012 IBM Corporation
  • 6.
    Network Overview 6 © 2012 IBM Corporation
  • 7.
    Network Speeds OverTime Comparison of Network Speeds 10Mbs 100Mbs 1GigE 10GigE InfiniBand ■ Consistent advancement in speeds over the years ■ Networks have come a long way in that time 7 © 2012 IBM Corporation
  • 8.
    Network Speeds OverTime Comparison of Network Speeds 10Mbs 100Mbs 1GigE 10GigE InfiniBand ■ Oh sorry – that was a logarithmic scaled chart! 8 © 2012 IBM Corporation
  • 9.
    Network Speeds vs.The World Networks vs. Other Storage Bandwidth 1GigE 10GigE InfiniBand Core i7 SSD ■ Bandwidth differences between memory and InfiniBand still a ways off ■ But the gap is getting smaller! 9 © 2012 IBM Corporation
  • 10.
    Networks Now vs. Yesterday
    ■ Real opportunity to look at decentralized systems
    ■ Already true:
      – Cloud computing
      – Data grids
      – Distributed computation
    ■ Network distance isn’t as far as it used to be!
  • 11.
    What is InfiniBand?
    ■ Originated in 1999 from the merger of two competing designs
    ■ Features
      – High throughput
      – Low latency
      – Quality of service
      – Failover
      – Designed to be scalable
    ■ Offers low latency RDMA (Remote Direct Memory Access)
    ■ Uses a different programming model than traditional sockets
      – No “standard” API
      – De facto: OFED (OpenFabrics Enterprise Distribution)
      – Upper layer protocols (ULPs) exist to ease the pain of development
  • 12.
    IB vs. IPoIB vs. SDP – InfiniBand
    [Diagram: a modified application uses an IB-specific communication mechanism (IB Services)
    sitting directly on the IB Core / device driver; kernel facilities are bypassed –
    effectively a “zero hop” to the communication layer]
    ■ Handles all transmission aspects (guarantees, transmission units, etc.)
    ■ Extremely low CPU cost
  • 13.
    IB vs. IPoIB vs. SDP – IP over InfiniBand
    [Diagram: the application uses the standard socket APIs; the entire TCP/IP stack is
    retained but rests on a mapping / conversion layer (IPoIB) above the IB Core and
    device driver]
    ■ Effectively the TCP/IP stack using a “device driver” to interface with the IB layer
    ■ High CPU cost
  • 14.
    IB vs. IPoIB vs. SDP – Sockets Direct Protocol
    [Diagram: the application still uses the socket API, but SDP replaces TCP/IP with its own
    lighter-weight mechanisms and mappings to leverage IB]
    ■ Largely bypasses the kernel but still incurs an extra hop during transmission
    ■ Medium CPU cost
    (a configuration sketch for enabling SDP follows below)
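    A minimal sketch of how an unmodified socket application can be pointed at SDP, shown
    here with the com.sun.sdp.conf mechanism that JDK 7 introduced on Linux/Solaris with
    OFED (other SDKs may expose an equivalent switch – check your SDK documentation). The
    addresses, port, and class name are illustrative only:

      # sdp.conf – rules that route matching bind / connect calls over SDP
      bind    192.168.1.1   *
      connect 192.168.1.*   5000

      java -Dcom.sun.sdp.conf=sdp.conf -Djava.net.preferIPv4Stack=true MyServer

    The appeal is exactly what the diagram shows: the application keeps the socket API and
    only the transport underneath changes.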
  • 15.–24.
    Throughput vs. Latency
    [Diagram, built up across slides 15–24: a pipe carrying data units from a start point to
    an end point]
    ■ Data unit: the unit used for measuring throughput and latency
    ■ Latency: the length of time for a data unit to travel from the start point to the end
      point (e.g., 10ms)
    ■ Throughput: the number of data units that arrive per unit of time (e.g., 10Gb/s)
    ■ Shower analogy (a worked example follows below)
      – Diameter of the pipe gives you water throughput
      – Length determines the time it takes for a drop to travel from end to end
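    Since the two measures are independent, it is worth doing the arithmetic once: the pipe
    in the analogy holds throughput × latency bytes of in-flight data. A quick check using
    the example values above:

      // Bandwidth-delay product for the slide's example values.
      public class BandwidthDelay {
          public static void main(String[] args) {
              double bitsPerSecond = 10e9;    // throughput: 10Gb/s
              double latencySeconds = 0.010;  // latency: 10ms
              double inFlightBytes = bitsPerSecond * latencySeconds / 8;
              System.out.printf("%.1f MB in flight%n", inFlightBytes / 1e6); // 12.5 MB
          }
      }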
  • 25.
    Throughput vs. Latency
    ■ Motivations can characterize priorities
      – The two are not necessarily related!
    ■ Higher throughput rates offer interesting optimization possibilities
      – Reduced pressure on compressing data
      – Reduced pressure on being selective about what data to send
        ■ For something like RDMA… just send the entire page
  • 26.
    Simple Test using IB
  • 27.
    Simple Test using IB – Background
    ■ Experiment: can Java exploit RDMA to get better performance?
    ■ Tests conducted
      – Send different sized packets from a client to a server
      – Time required to complete the write
      – Test variations include the communication layer, up to RDMA
    ■ Conditions
      – Single threaded
      – 40Gb/s InfiniBand
    ■ Goal being to look at
      – Network speeds
      – Baseline overhead that Java imposes over equivalent C programs
      – Existing issues that may not have been predicted
    ■ Also going to look at very basic Java overhead
      – Comparisons are against an equivalent C program (a client-side sketch follows below)
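    The harness itself is not shown in the deck; a minimal client along these lines
    (hypothetical host argument and port, simplified loop – not the actual test code)
    illustrates the shape of the measurement and why a DirectByteBuffer is used:

      import java.net.InetSocketAddress;
      import java.nio.ByteBuffer;
      import java.nio.channels.SocketChannel;

      public class ThroughputClient {
          public static void main(String[] args) throws Exception {
              int payloadSize = Integer.parseInt(args[0]);      // e.g. 1k .. 256m
              // Direct buffer: the data lives in native memory, so no JNI-time copy.
              ByteBuffer buf = ByteBuffer.allocateDirect(payloadSize);
              try (SocketChannel ch = SocketChannel.open(
                      new InetSocketAddress(args[1], 9999))) {  // host / port illustrative
                  long bytes = 0;
                  long start = System.nanoTime();
                  for (int i = 0; i < 10_000; i++) {
                      buf.clear();
                      while (buf.hasRemaining()) {
                          bytes += ch.write(buf);               // timed write path
                      }
                  }
                  double secs = (System.nanoTime() - start) / 1e9;
                  System.out.printf("%d bytes in %.2fs = %.1f MB/s%n",
                          bytes, secs, bytes / secs / 1e6);
              }
          }
      }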
  • 28.
    Simple Test using IB – IPoIB Comparison
    [Chart: throughput comparison for C vs. Java (DirectByteBuffer) over IPoIB, payload sizes
    1k–256m; higher is better]
    ■ DirectByteBuffer (NIO socket channel) used to avoid marshalling costs (JNI)
    ■ Observations
      – C code is initially faster than the Java implementation
      – Generally even after a 128k payload size
  • 29.
    Simple Test using IB – SDP Comparison
    [Chart: throughput comparison for C vs. Java (DirectByteBuffer) over IPoIB and SDP,
    payload sizes 1k–256m; higher is better]
    ■ DirectByteBuffer (NIO socket channel) used to avoid marshalling costs (JNI)
    ■ Observations
      – C code is initially faster than the Java implementation
      – Generally even after a 128k payload size
  • 30.–36.
    Interlude – Zero Copy 64k Boundary
    ■ Classic networking (java.net) package
    [Diagram, built up across slides 30–36: a byte[] on the Java heap is passed through JNI
    on write; the data is copied once from the Java heap into native memory, then copied
    again into the kernel before it is finally transmitted]
    ■ 2 copies before the data gets transmitted
    ■ Lots of CPU burn, lots of memory being consumed
  • 37.–41.
    Interlude – Zero Copy 64k Boundary
    ■ Using DirectByteBuffer with SDP
    [Diagram, built up across slides 37–41: the write starts from native memory, so only one
    copy into the kernel is needed before transmission]
    ■ 1 copy before the data gets transmitted
    ■ Less CPU burn, less memory being consumed
    (the two write paths are contrasted in the sketch below)
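    The two paths side by side, as a sketch (a connected Socket / SocketChannel is assumed):

      import java.io.OutputStream;
      import java.net.Socket;
      import java.nio.ByteBuffer;
      import java.nio.channels.SocketChannel;

      class WritePaths {
          // Classic java.net: the heap byte[] must be copied into native memory
          // (heap objects can be moved by the GC), then copied again into the kernel.
          static void classicWrite(Socket socket, byte[] data) throws Exception {
              OutputStream out = socket.getOutputStream();
              out.write(data);
          }

          // NIO with a DirectByteBuffer: the data already lives in native memory,
          // so the Java-heap-to-native copy disappears.
          static void directWrite(SocketChannel channel, ByteBuffer direct) throws Exception {
              while (direct.hasRemaining()) {
                  channel.write(direct);
              }
          }
      }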
  • 42.–48.
    Interlude – Zero Copy 64k Boundary
    ■ But when the payload hits the “zero copy” threshold in SDP…
    [Diagram, built up across slides 42–48: for writes >64KB the user-space memory is
    “registered” for use with RDMA (direct send from user-space memory), the data is
    transmitted, and the memory is “unregistered” when the send completes]
    ■ Registration is extremely expensive / slow!
    ■ 1 copy before the data gets transmitted
    ■ Register / unregister is prohibitively expensive (and happens on every transmit!)
  • 49.
    Simple Test using IB – SDP Comparison
    [Chart: throughput comparison for C vs. Java over IPoIB and SDP, payload sizes 1k–256m;
    higher is better]
    ■ Past the zero copy threshold there is a sharp drop
      – Cost of the memory register / unregister
    ■ Eventual climb and plateau
      – The benefits of zero copy cannot outweigh the drawbacks
  • 50.
    Simple Test using IB – RDMA Comparison
    [Chart: throughput comparison for C vs. Java over IPoIB, SDP, and RDMA/W, payload sizes
    1k–256m; higher is better]
    ■ No “zero copy” threshold issues
      – Always zero copy
      – Memory is registered once and reused (see the sketch below)
    ■ Throughput does finally plateau
      – Single thread – the pipe is hardly saturated
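    The pattern that makes this fast is register-once, reuse-forever. Sketched against a
    hypothetical verbs-style wrapper – RdmaEndpoint, MemoryRegion, registerMemory, and
    postWrite are illustrative names, not a real Java API (the de facto OFED verbs
    interface is C):

      import java.nio.ByteBuffer;

      // Illustrative interfaces only – not a real RDMA binding.
      interface MemoryRegion { ByteBuffer buffer(); }
      interface RdmaEndpoint {
          MemoryRegion registerMemory(ByteBuffer buf);      // expensive: pin + register
          void postWrite(MemoryRegion region, int length);  // cheap: reuses registration
      }

      class RdmaSender {
          private final RdmaEndpoint endpoint;
          private final MemoryRegion region;  // registered once, reused for every send

          RdmaSender(RdmaEndpoint endpoint, int maxPayload) {
              this.endpoint = endpoint;
              this.region = endpoint.registerMemory(ByteBuffer.allocateDirect(maxPayload));
          }

          void send(byte[] payload) {
              ByteBuffer buf = region.buffer();
              buf.clear();
              buf.put(payload);                              // copy into the registered region
              endpoint.postWrite(region, payload.length);    // no register / unregister per send
          }
      }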
  • 51.
    Simple Test using IB – What about that Zero Copy Threshold?
    [Chart: zero copy threshold comparison for SDP at thresholds 4k–512k, payload sizes
    1k–256m; higher is better]
    ■ SDP ultimately has a plateau here
      – Possibly other, deeper tuning aspects available
    ■ Pushing the zero copy threshold out has no advantage
    ■ The claw back is still ultimately limited
      – Likely gated by some other aspect of the system
    ■ The 64KB (default) threshold seems to be the “sweet spot”
  • 52.
    Simple Test using IB – Summary
    ■ Simple steps to start using
      – IPoIB lets you use your application ‘as is’
    ■ Increased speed can potentially involve significant application changes
      – Potential need for deeper technical knowledge
      – SDP is an interesting stop gap
    ■ There are hidden gotchas!
      – Increased load changes the game – but this is standard when dealing with computers
  • 53.
    ORB and High Speed Networks
  • 54.
    Benchmarking the ORB – Background
    ■ Experiment: how does the ORB perform over InfiniBand?
    ■ Tests conducted
      – Send different sized packets from a client to a server
      – Time required for a write followed by a read
      – Compare standard Ethernet to SDP / IPoIB
    ■ Conditions
      – 500 client threads
      – Echo style test (send to server, server echoes the data back – a raw-socket sketch
        of this shape follows below)
      – byte[] payload
      – 40Gb/s InfiniBand
    ■ Goal being to look at
      – ORB performance when the data pipe isn’t the bottleneck (time to complete the benchmark)
      – Threading performance
    ■ Realistically expecting to discover bottlenecks in the ORB
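    For reference, the server half of an echo test of this shape, as a raw NIO sketch only –
    the actual benchmark drives the CORBA ORB, not bare sockets, and the port is illustrative:

      import java.net.InetSocketAddress;
      import java.nio.ByteBuffer;
      import java.nio.channels.ServerSocketChannel;
      import java.nio.channels.SocketChannel;

      public class EchoServer {
          public static void main(String[] args) throws Exception {
              try (ServerSocketChannel server = ServerSocketChannel.open()) {
                  server.bind(new InetSocketAddress(9999));
                  while (true) {
                      try (SocketChannel ch = server.accept()) {  // one client at a time
                          ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
                          while (ch.read(buf) != -1) {
                              buf.flip();
                              while (buf.hasRemaining()) {
                                  ch.write(buf);                  // echo the payload back
                              }
                              buf.clear();
                          }
                      }
                  }
              }
          }
      }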
  • 55.
    Benchmarking the ORB – Ethernet Results
    [Chart: ORB echo test time to complete over Ethernet, payload sizes 1k–1m; lower is better]
    ■ Standard Ethernet with the classic java.net package
  • 56.
    Benchmarking the ORB – SDP
    [Chart: ORB echo test time to complete, Ethernet vs. SDP, payload sizes 1k–1m; lower is better]
    ■ …And this is with SDP (it could be better)
  • 57.–64.
    Benchmarking the ORB – ORB Transmission Buffers
    [Diagram, built up across slides 57–64: a byte[] written through the ORB is chopped into
    the ORB’s internal transmission buffers (shown as 1KB pieces), copied, and then
    transmitted]
    ■ Many additional costs are being incurred (per thread!) to transmit a byte array
  • 65.
    Benchmarking the ORB – ORB Transmission Buffers
    ■ 3KB to 4KB ORB buffer sizes were sufficient for Ethernet
    [Diagram: ORB → socket layer → transmit, with a 4KB buffer]
    ■ The existing bottlenecks were outside the ORB (buffer management)
    ■ Throughput couldn’t be pushed much further
  • 66.
    Benchmarking the ORB – ORB Transmission Buffers
    ■ 64KB was the best size for SDP (a configuration sketch follows below)
    [Diagram: ORB → native copy → transmit, with a 64KB buffer]
    ■ Zero Copy Threshold!
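    If the ORB’s transmission buffer (fragment) size is configurable, it can be aligned with
    the transport. A sketch using the fragment-size property understood by the IBM ORB – the
    property name here is an assumption, so verify it against your ORB’s documentation:

      import java.util.Properties;
      import org.omg.CORBA.ORB;

      public class OrbSetup {
          public static ORB create() {
              Properties props = new Properties();
              // Assumed IBM ORB property: keep fragments at the SDP zero copy threshold.
              props.setProperty("com.ibm.CORBA.FragmentSize", "65536");
              return ORB.init(new String[0], props);
          }
      }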
  • 67.–70.
    Benchmarking the ORB – Garbage Collector Impact
    ■ Allocating large objects (e.g., buffers) can be a costly operation
    [Diagram, built up across slides 67–70: a heap of free and allocated regions, with a
    large buffer and the question “allocate where?”]
    ■ Premature garbage collections occur in order to “clear space” for large allocations
      (a pooling sketch follows below)
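    A common mitigation is to take the large allocations out of the steady state entirely:
    allocate a fixed set of buffers up front and recycle them. A minimal sketch, not the
    ORB’s actual buffer manager:

      import java.nio.ByteBuffer;
      import java.util.concurrent.ArrayBlockingQueue;
      import java.util.concurrent.BlockingQueue;

      public class BufferPool {
          private final BlockingQueue<ByteBuffer> pool;

          public BufferPool(int count, int size) {
              pool = new ArrayBlockingQueue<>(count);
              for (int i = 0; i < count; i++) {
                  pool.add(ByteBuffer.allocateDirect(size));  // allocate once, up front
              }
          }

          public ByteBuffer acquire() throws InterruptedException {
              return pool.take();   // blocks rather than allocating (and triggering GC)
          }

          public void release(ByteBuffer buf) {
              buf.clear();
              pool.offer(buf);
          }
      }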
  • 71.–82.
    Benchmarking the ORB – Thread Pools
    ■ Thread and connection count ratios are a factor (a sizing sketch follows below)
    [Diagram, built up across slides 71–82: 500 client threads sharing first 1, then 500,
    then 10 connections to the server]
    ■ 500 threads over 1 connection
      – Highly contended resource
      – Couldn’t saturate the communication channel
    ■ 500 threads over 500 connections
      – Context switching disaster
      – Threads queued and unable to complete their transmit
      – Memory / resource consumption nightmare
    ■ 500 threads over 10 connections
      – 2–5% of the client thread count appeared to be best
      – Saturates the communication pipe enough to achieve the best throughput
      – Keeps resource consumption and context switches to a minimum
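    That ratio is easy to enforce with a semaphore in front of a small connection pool – a
    sketch using the numbers from this test (500 client threads, ~2% connections):

      import java.util.concurrent.Semaphore;

      public class ConnectionGate {
          private final Semaphore permits;

          public ConnectionGate(int clientThreads) {
              // ~2% of the client thread count, minimum 1 (500 threads -> 10 permits)
              permits = new Semaphore(Math.max(1, clientThreads / 50));
          }

          public void withConnection(Runnable send) throws InterruptedException {
              permits.acquire();  // threads queue here instead of thrashing connections
              try {
                  send.run();     // transmit on one of the pooled connections
              } finally {
                  permits.release();
              }
          }
      }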
  • 83.
    Benchmarking the ORB – Post Optimization Round
    [Chart: ORB echo test time to complete, Ethernet vs. SDP, payload sizes 1k–1m; lower is better]
  • 84.
    Benchmarking the ORB – Post Optimization Round
    [Chart: ORB echo test time to complete, Ethernet vs. SDP vs. SDP (New), payload sizes
    1k–1m; lower is better]
    ■ Hey, great! Still not super (or the difference you’d expect), but it’s a good start
    ■ NOTE: the 64k threshold is definitely a big part of the whole thing
  • 85.
    Benchmarking the ORB – Post Optimization Round
    [Chart: ORB echo test time to complete, Ethernet vs. SDP vs. SDP (New) vs. IPoIB, payload
    sizes 1k–1m; lower is better]
    ■ No surprises – IPoIB has higher overhead than SDP
    ■ The 64KB numbers are actually quite close – so there are still issues to discover and fix
  • 86.
    Benchmarking the ORB – Summary
    ■ It’s not as easy as “stepping on the gas”
      – High speed networks alone don’t resolve your problems
      – Software layers are going to have bottlenecks
      – Improvements for high speed networks can help traditional ones as well
    ■ The benefit is not always clear cut
  • 87.
    And after all that…
  • 88.
    Conclusion
    ■ High speed networks are a game changer
    ■ Simple to use, hard to use effectively
    ■ Expectations based on past results need to be re-evaluated
    ■ Existing applications / frameworks may need tuning or optimization
    ■ Opens up potentially new possibilities
  • 89.
    Questions?
  • 90.
    References
    ■ Get Products and Technologies:
      – IBM Java Runtimes and SDKs:
        • https://www.ibm.com/developerworks/java/jdk/
      – IBM Monitoring and Diagnostic Tools for Java:
        • https://www.ibm.com/developerworks/java/jdk/tools/
    ■ Learn:
      – IBM Java InfoCenter:
        • http://publib.boulder.ibm.com/infocenter/java7sdk/v7r0/index.jsp
    ■ Discuss:
      – IBM Java Runtimes and SDKs Forum:
        • http://www.ibm.com/developerworks/forums/forum.jspa?forumID=367&start=0