Ryan Sciampacone – IBM Java Runtime Lead
1st October 2012




High Speed Networks
Free Performance or New Bottlenecks?




                                           © 2012 IBM Corporation
Important Disclaimers



    THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR
      INFORMATIONAL PURPOSES ONLY.
    WHILST EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF
      THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”,
      WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
    ALL PERFORMANCE DATA INCLUDED IN THIS PRESENTATION HAVE BEEN GATHERED
       IN A CONTROLLED ENVIRONMENT. YOUR OWN TEST RESULTS MAY VARY BASED
       ON HARDWARE, SOFTWARE OR INFRASTRUCTURE DIFFERENCES.
    ALL DATA INCLUDED IN THIS PRESENTATION ARE MEANT TO BE USED ONLY AS A
       GUIDE.
    IN ADDITION, THE INFORMATION CONTAINED IN THIS PRESENTATION IS BASED ON
       IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO
       CHANGE BY IBM, WITHOUT NOTICE.
    IBM AND ITS AFFILIATED COMPANIES SHALL NOT BE RESPONSIBLE FOR ANY
       DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS
       PRESENTATION OR ANY OTHER DOCUMENTATION.
    NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE
      EFFECT OF:
    - CREATING ANY WARRANTY OR REPRESENTATION FROM IBM, ITS AFFILIATED
       COMPANIES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS
2                                                                       © 2012 IBM Corporation
Introduction to the speaker



    ■   15 years of experience developing and deploying Java SDKs
    ■   Recent work focus:
         ■   Managed Runtime Architecture
         ■   Java Virtual Machine improvements
              ■   Multi-tenancy technology
              ■   Native data access and heap density
              ■   Footprint and performance
         ■   Garbage Collection
              ■   Scalability and pause time reduction
              ■   Advanced GC technology

    ■   My contact information:
         – Ryan_Sciampacone@ca.ibm.com


3                                                                © 2012 IBM Corporation
What should you get from this talk?



■   Understand the current state of high speed networks in the context of Java
    development and take away a clear view of the issues involved. Learn practical
    approaches to achieving great performance, including how to understand results
    that initially don’t make sense.




4                                                                         © 2012 IBM Corporation
Life In The Fast Lane



■   “Never underestimate the bandwidth of a station wagon full of tapes hurtling down
    the highway.”
    -- Andrew S. Tanenbaum, Computer Networks, 4th ed., p. 91


■   Networks are often thought of as just a simple interconnect between systems
■   No real differentiators
     – WAN vs. LAN
     – Wired vs. Wireless
■   APIs traditionally make this invisible
     – Socket API is good at hiding things (SDP, SMC-R, TCP/IP)


■   Can today’s network offerings be exploited to improve existing performance?



5                                                                           © 2012 IBM Corporation
Network Overview




6                      © 2012 IBM Corporation
Network Speeds Over Time


                      [Bar chart: Comparison of Network Speeds – 10Mb/s, 100Mb/s, 1GigE, 10GigE, InfiniBand]
■   Consistent advancement in speeds over the years
■   Networks have come a long way in that time


7                                                              © 2012 IBM Corporation
Network Speeds Over Time


                      [Bar chart: Comparison of Network Speeds – the same data, now on a linear scale]
■   Oh sorry – that was a logarithmically scaled chart!




8                                                               © 2012 IBM Corporation
Network Speeds vs. The World


                      [Bar chart: Networks vs. Other Storage Bandwidth – 1GigE, 10GigE, InfiniBand, Core i7 memory, SSD]

■   Bandwidth differences between memory and InfiniBand still a ways off
■   But the gap is getting smaller!


9                                                                          © 2012 IBM Corporation
Networks Now vs. Yesterday



■    Real opportunity to look at decentralized systems
■    Already true:
      – Cloud computing
      – Data grids
      – Distributed computation


■    Network distance isn’t as far as it used to be!




10                                                       © 2012 IBM Corporation
What is InfiniBand?



■    Originated in 1999 from the merger of two competing designs
■    Features
      – High throughput
      – Low Latency
      – Quality of Service
      – Failover
      – Designed to be scalable
■    Offers low latency RDMA (Remote Direct Memory Access)
■    Uses a different programming model than traditional sockets
      – No “standard” API – de facto: OFED (OpenFabrics Enterprise Distribution)
      – Upper layer protocols (ULPs) exist to ease the pain of development




11                                                                                 © 2012 IBM Corporation
IB vs. IPoIB vs. SDP – InfiniBand

           IB stack:  Application → IB Services → IB Core → Device Driver

                      Modified application using an IB-specific communication
                      mechanism; bypasses kernel facilities – effectively a
                      “zero hop” to the communication layer

■    Handles all transmission aspects (guarantees, transmission units, etc.)
■    Extremely low CPU cost
12                                                                            © 2012 IBM Corporation
IB vs. IPoIB vs. SDP – IP over InfiniBand

           IB stack:     Application → IB Services → IB Core → Device Driver
           IPoIB stack:  Application → Socket API → TCP/IP → IPoIB → IB Core → Device Driver

                         Application uses standard socket APIs for communication;
                         the entire TCP/IP stack is used but resides on a
                         mapping / conversion layer (IPoIB)


■    Effectively the TCP/IP stack using a “device driver” to interface with the IB layer
■    High CPU cost
13                                                                                 © 2012 IBM Corporation
IB vs. IPoIB vs. SDP – Sockets Direct Protocol

           IB stack:     Application → IB Services → IB Core → Device Driver
           IPoIB stack:  Application → Socket API → TCP/IP → IPoIB → IB Core → Device Driver
           SDP stack:    Application → Socket API → SDP → IB Core → Device Driver

                         Application uses standard socket APIs for communication;
                         although socket-API based, SDP uses its own lighter-weight
                         mechanisms and mappings to leverage IB


■    Largely bypasses the kernel but still incurs an extra hop during transmission
■    Medium CPU cost
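
■    Aside: some JDK 7 implementations can route standard socket traffic over SDP purely
     through configuration, with no code changes – a minimal sketch, where the subnet and
     port are assumptions:

         # sdp.conf – use SDP for sockets bound to, or connecting into, this subnet
         bind    192.168.1.0/24   *
         connect 192.168.1.0/24   5000

     The file is picked up at launch, e.g. java -Dcom.sun.sdp.conf=sdp.conf MyApp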
14                                                                             © 2012 IBM Corporation
Throughput vs. Latency




15                            © 2012 IBM Corporation
Throughput vs. Latency




16                            © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring
       throughput and latency




17                                            © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring
       throughput and latency




18                                            © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Length of time for a data unit to travel from the start point to the end point




19                                                                             © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)




20                                                                             © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)




21                                                                             © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)
     Throughput: number of data units that arrive per unit of time




22                                                                                                     © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)
     Throughput: number of data units that arrive per unit of time (e.g., 10Gb/s)




23                                                                                                     © 2012 IBM Corporation
Throughput vs. Latency




     Data unit used for measuring throughput and latency
     Latency: length of time for a data unit to travel from the start point to the end point (e.g., 10ms)
     Throughput: number of data units that arrive per unit of time (e.g., 10Gb/s)



■    Shower analogy
      – Diameter of the pipe gives you water throughput
      – Length determines the time it takes for a drop to travel from end to end
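
■    The two combine in the bandwidth-delay product – how much data is “in the pipe” at
     any instant (a worked example using the figures above, not a measured result):

         in-flight data = throughput × latency = 10 Gb/s × 10 ms = 100 Mb ≈ 12.5 MB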




24                                                                                                        © 2012 IBM Corporation
Throughput vs. Latency



■    Motivations can characterize priorities
      – Throughput and latency are not necessarily related!


■    Higher throughput rates offer interesting optimization possibilities
      – Reduced pressure on compressing data
      – Reduced pressure on being selective about what data to send


■    For something like RDMA… just send the entire page




25                                                                          © 2012 IBM Corporation
Simple Test using IB




26                          © 2012 IBM Corporation
Simple Test using IB – Background




■    Experiment: Can Java exploit RDMA to get better performance?
■    Tests conducted
      – Send different sized packets from a client to a server
      – Time required to complete write
      – Test variations include communication layer with RDMA
■    Conditions
      – Single threaded
      – 40Gb/s InfiniBand
■    Goal being to look at
      – Network speeds
      – Baseline overhead that Java imposes over C equivalent programs
      – Existing issues that may not have been predicted
■    Also going to look at very basic Java overhead
      – Comparisons will go against C equivalent program
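
■    A minimal sketch of what the Java client side of such a test can look like (host,
     port, payload size, and iteration count are assumptions – not the actual benchmark):

         import java.net.InetSocketAddress;
         import java.nio.ByteBuffer;
         import java.nio.channels.SocketChannel;

         public class ThroughputClient {
             public static void main(String[] args) throws Exception {
                 // Direct buffer: avoids the JNI copy out of the Java heap on write
                 ByteBuffer payload = ByteBuffer.allocateDirect(64 * 1024);
                 try (SocketChannel ch =
                         SocketChannel.open(new InetSocketAddress("server", 5000))) {
                     long start = System.nanoTime();
                     for (int i = 0; i < 10_000; i++) {
                         payload.clear();
                         while (payload.hasRemaining()) {
                             ch.write(payload);   // a large payload may need several writes
                         }
                     }
                     long elapsed = System.nanoTime() - start;
                     System.out.printf("%.2f MB/s%n",
                         (10_000L * 64 * 1024) / (elapsed / 1e9) / 1e6);
                 }
             }
         }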


27                                                                       © 2012 IBM Corporation
Simple Test using IB – IPoIB Comparison

                      [Chart: Throughput comparison for C / Java – throughput (higher is better)
                       vs. payload size, 1 byte to 16MB; series: C IPoIB, Java DBB IPoIB]
■    DirectByteBuffer (NIO socket channel) to avoid marshalling costs (JNI)
■    Observations
      – C code is initially faster than the Java implementation
      – Generally even beyond the 128k payload size


28                                                                                                                            © 2012 IBM Corporation
Simple Test using IB – SDP Comparison

                      [Chart: Throughput comparison for C / Java – throughput (higher is better)
                       vs. payload size, 1 byte to 16MB; series: C IPoIB, Java DBB IPoIB,
                       C SDP, Java DBB SDP]
■    DirectByteBuffer (NIO socket channel) to avoid marshalling costs (JNI)
■    Observations
      – C code is initially faster than the Java implementation
      – Generally even beyond the 128k payload size


29                                                                                                                            © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                       Java             Native          Kernel



                                 JNI




30                                                               © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java           Native             Kernel


              byte[ ]
                                    JNI




31                                                                    © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java           Native             Kernel


              byte[ ]
                                    JNI




               write data




32                                                                    © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java           Native             Kernel


              byte[ ]
                                    JNI




               write data




33                                                                    © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java                Native        Kernel
                                  copy


              byte[ ]
                                         JNI




               write data




34                                                                    © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java                Native          Kernel
                                  copy                  copy


              byte[ ]
                                         JNI




               write data




35                                                                      © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Classic networking (java.net) package



                           Java                Native          Kernel
                                  copy                  copy


              byte[ ]
                                                                    Transmit
                                         JNI




               write data




■    2 copies before data gets transmitted
■    Lots of CPU burn, lots of memory being consumed
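
■    A minimal sketch of the classic path above (host, port, and size are assumptions) –
     the byte[] lives on the Java heap, so it is copied once via JNI and again by the kernel:

         import java.io.OutputStream;
         import java.net.Socket;

         public class ClassicWrite {
             public static void main(String[] args) throws Exception {
                 byte[] data = new byte[64 * 1024];        // lives on the Java heap
                 try (Socket socket = new Socket("server", 5000)) {
                     OutputStream out = socket.getOutputStream();
                     // heap byte[] -> native copy (JNI), then native -> kernel copy
                     out.write(data);
                     out.flush();
                 }
             }
         }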


36                                                                             © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                      Java             Native          Kernel




37                                                              © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                      Java             Native          Kernel




38                                                              © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                      Java             Native          Kernel


                     write data




39                                                              © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                      Java             Native          Kernel
                                                copy



                     write data




40                                                              © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    Using DirectByteBuffer with SDP



                       Java                Native          Kernel
                                                    copy


                                                                Transmit
                      write data




■    1 copy before data gets transmitted
■    Less CPU burn, less memory being consumed


41                                                                         © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java              Native            Kernel




42                                                                 © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java              Native            Kernel

                                       >64KB




43                                                                 © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java              Native            Kernel

                                       >64KB
                     write data




44                                                                 © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java               Native                Kernel

                                         >64KB
                     write data



                              Memory is “registered” for use
                              with RDMA (direct send from
                                  user space memory)




45                                                                      © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java               Native                Kernel

                                         >64KB
                     write data



                              Memory is “registered” for use        This is extremely
                              with RDMA (direct send from           expensive / slow!
                                  user space memory)




46                                                                               © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                      Java               Native                Kernel

                                         >64KB
                                               Transmit
                     write data



                              Memory is “registered” for use        This is extremely
                              with RDMA (direct send from           expensive / slow!
                                  user space memory)




47                                                                               © 2012 IBM Corporation
Interlude – Zero Copy 64k Boundary


■    But when the payload hits the “zero copy” threshold in SDP…



                        Java               Native               Kernel

                                           >64KB
                       write data



                                Memory is “unregistered” when
                                    the send completes


■    1 copy before data gets transmitted
■    Register / unregister is prohibitively expensive (every transmit!)


48                                                                        © 2012 IBM Corporation
Simple Test using IB – SDP Comparison

                      [Chart: Throughput comparison for C / Java – throughput (higher is better)
                       vs. payload size, 1 byte to 16MB; series: C IPoIB, Java DBB IPoIB,
                       C SDP, Java DBB SDP]
■    Post Zero Copy threshold there is a sharp drop
      – Cost of memory register / unregister
■    Eventual climb and plateau
      – Benefits of zero copy cannot outweigh the drawbacks


49                                                                                                                            © 2012 IBM Corporation
Simple Test using IB – RDMA Comparison

                      [Chart: Throughput comparison for C / Java – throughput (higher is better)
                       vs. payload size, 1 byte to 16MB; series: C IPoIB, Java DBB IPoIB,
                       C SDP, Java DBB SDP, C RDMA/W, Java DBB RDMA/W]
■    No “zero copy” threshold issues
      – Always zero copy
      – Memory registered once, reused
■    Throughput does finally plateau
      – Single thread – pipe is hardly saturated

50                                                                                                                             © 2012 IBM Corporation
Simple Test using IB – What about that Zero Copy Threshold?
                      [Chart: Zero Copy Threshold Comparison – throughput (higher is better)
                       vs. payload size; one series per zero-copy threshold: 4k, 8k, 16k,
                       32k, 64k, 128k, 256k, 512k]
■    SDP ultimately has a plateau here
      – Possibly other deeper tuning aspects available
■    Pushing the threshold for zero copy out has no advantage
■    Claw back is still ultimately limited
      – Likely gated by some other aspect of the system
■    64KB threshold (default) seems to be the “sweet spot”

51                                                                                                                              © 2012 IBM Corporation
Simple Test using IB – Summary



■    Simple steps to start using
      – IPoIB lets you use your application ‘as is’
■    Increased speed can potentially involve significant application changes
      – Potential need for deeper technical knowledge
      – SDP is an interesting stop gap
■    There are hidden gotchas!
      – Increased load changes the game – but this is standard when dealing with computers




52                                                                                © 2012 IBM Corporation
ORB and High Speed Networks




53                                 © 2012 IBM Corporation
Benchmarking the ORB – Background




■    Experiment: How does the ORB perform over InfiniBand?
■    Tests conducted
      – Send different sized packets from a client to a server
      – Time required for write followed by read
      – Compare standard Ethernet to SDP / IPoIB
■    Conditions
      – 500 client threads
      – Echo style test (send to server, server echoes data back)
      – byte[] payload
      – 40Gb/s InfiniBand
■    Goal being to look at
      – ORB performance when the data pipe isn’t the bottleneck (time to complete the benchmark)
      – Threading performance
■    Realistically expecting to discover bottlenecks in the ORB


54                                                                               © 2012 IBM Corporation
Benchmarking the ORB – Ethernet Results



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH]
■    Standard Ethernet with the classic java.net package



55                                                                                                                      © 2012 IBM Corporation
Benchmarking the ORB – SDP



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH, SDP]
■    …And this is with SDP (could be better)



56                                                                                                                      © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




     byte[ ]




57                                                           © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                               ORB

     byte[ ]




58                                                           © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                               ORB

     byte[ ]



                               write data




59                                                           © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                               ORB

     byte[ ]



                               write data


                                                         Internal buffer
                                                         for transmission




60                                                             © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                               ORB

     byte[ ]



                               write data
                2KB                            1KB




61                                                           © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                                 ORB

     byte[ ]



                                 write data
                                                 1KB
           1KB         1KB




62                                                             © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                                 ORB
                        copy
     byte[ ]



                                 write data
                                                 1KB
           1KB         1KB




63                                                             © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers




                                                             ORB
                             copy
        byte[ ]
                                                                   Transmit

                                        write data
                                                             1KB
              1KB           1KB




■    Many additional costs being incurred (per thread!) to transmit a byte array
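
■    The pattern can be sketched as follows (an illustration only, not the actual ORB
     internals; the 1KB buffer size matches the diagram):

         import java.io.IOException;
         import java.nio.ByteBuffer;
         import java.nio.channels.WritableByteChannel;

         // Internal transmission buffer: payloads larger than the buffer are
         // copied and flushed chunk by chunk – one extra copy per chunk.
         final class TransmissionBuffer {
             private final ByteBuffer internal = ByteBuffer.allocateDirect(1024); // 1KB

             void write(byte[] payload, WritableByteChannel channel) throws IOException {
                 int offset = 0;
                 while (offset < payload.length) {
                     int chunk = Math.min(internal.remaining(), payload.length - offset);
                     internal.put(payload, offset, chunk);  // copy into the internal buffer
                     offset += chunk;
                     internal.flip();
                     while (internal.hasRemaining()) {
                         channel.write(internal);           // transmit this chunk
                     }
                     internal.clear();
                 }
             }
         }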




64                                                                             © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers



     ■   3KB to 4KB ORB buffer sizes were sufficient for Ethernet


                       ORB                            Socket Layer


                              Transmit


                        4KB




■        Existing bottlenecks outside the ORB (buffer management)
■        Throughput couldn’t be pushed much further



65                                                                   © 2012 IBM Corporation
Benchmarking the ORB – ORB Transmission Buffers



■    64KB was the best size for SDP


                         ORB             Native

                                copy

                                              Transmit


                         64KB




■    Zero Copy Threshold!




66                                                              © 2012 IBM Corporation
Benchmarking the ORB – Garbage Collector Impact



■    Allocating large objects (e.g., buffers) can be a costly operation




       Heap


                                                                 Free Memory

                                                                 Allocated Memory




67                                                                             © 2012 IBM Corporation
Benchmarking the ORB – Garbage Collector Impact



■    Allocating large objects (e.g., buffers) can be a costly operation


                             Buffer


       Heap


                                                                 Free Memory

                                                                 Allocated Memory




68                                                                             © 2012 IBM Corporation
Benchmarking the ORB – Garbage Collector Impact



■    Allocating large objects (e.g., buffers) can be a costly operation


                             Buffer                       Allocate where?


       Heap


                                                                 Free Memory

                                                                 Allocated Memory




69                                                                             © 2012 IBM Corporation
Benchmarking the ORB – Garbage Collector Impact



■    Allocating large objects (e.g., buffers) can be a costly operation


                             Buffer                       Allocate where?


       Heap


                                                                 Free Memory

                                                                 Allocated Memory


■    Premature Garbage Collections in order to “clear space” for large allocations
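
■    One common mitigation is to reuse a per-thread buffer rather than allocate a fresh
     large object per request (a minimal sketch; the 64KB size is an assumption):

         import java.nio.ByteBuffer;

         // One reusable buffer per thread: avoids a large allocation per request,
         // and with it the premature GCs triggered to make room for it.
         final class BufferPool {
             private static final ThreadLocal<ByteBuffer> BUFFER =
                 ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(64 * 1024));

             static ByteBuffer acquire() {
                 ByteBuffer buf = BUFFER.get();
                 buf.clear();                 // reset position/limit for reuse
                 return buf;
             }
         }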



70                                                                             © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server




71                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server




72                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads




73                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads           1 Connection




74                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads           1 Connection




■    Highly contended resource
■    Couldn’t saturate the communication channel



75                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads




76                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads




77                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads

                                   500 Connections




78                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads

                                   500 Connections


■    Context switching disaster
■    Threads queued and unable to complete transmit
■    Memory / resource consumption nightmare

79                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads




80                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                           Server



                      500 Threads           10 Connections




81                                                              © 2012 IBM Corporation
Benchmarking the ORB – Thread Pools



■    Thread and connection count ratios are a factor


                      Client                               Server



                      500 Threads           10 Connections




■    2-5% of the client thread count appeared to be best
■    Saturate the communication pipe enough to achieve best throughput
■    Keep resource consumption and context switches to a minimum
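
■    That ratio can be sketched as a small fixed pool of connections shared by many
     worker threads (a sketch with assumed types; the counts mirror the 500:10 ratio):

         import java.nio.channels.SocketChannel;
         import java.util.List;
         import java.util.concurrent.ArrayBlockingQueue;
         import java.util.concurrent.BlockingQueue;

         // 500 worker threads share ~10 connections: a thread borrows a channel,
         // transmits, and returns it – contention and context switching stay low.
         final class ConnectionPool {
             private final BlockingQueue<SocketChannel> channels;

             ConnectionPool(List<SocketChannel> openChannels) {
                 this.channels =
                     new ArrayBlockingQueue<>(openChannels.size(), true, openChannels);
             }

             <T> T withConnection(Operation<T> op) throws Exception {
                 SocketChannel ch = channels.take();  // blocks until a channel is free
                 try {
                     return op.run(ch);
                 } finally {
                     channels.put(ch);                // hand the channel to the next thread
                 }
             }

             interface Operation<T> { T run(SocketChannel ch) throws Exception; }
         }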

82                                                                       © 2012 IBM Corporation
Benchmarking the ORB – Post Optimization Round



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH, SDP]
83                                                                                                                © 2012 IBM Corporation
Benchmarking the ORB – Post Optimization Round



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH, SDP, SDP (New)]
■    Hey, great! Still not super (or the difference you’d expect) but it’s a good start
■    NOTE: the 64k zero-copy threshold is clearly a big part of the story


84                                                                                                                    © 2012 IBM Corporation
Benchmarking the ORB – Post Optimization Round



                      [Chart: ORB Echo Test Performance – time to complete (lower is better)
                       vs. payload size, 1k to 1m; series: ETH, SDP, SDP (New), IPoIB]
■    No surprises, IPoIB has higher overhead than SDP
■    64KB numbers are actually quite close – so still issues to discover and fix

85                                                                                                                   © 2012 IBM Corporation
Benchmarking the ORB – Summary



■    It’s not as easy as “stepping on the gas”
      – High speed networks alone don’t resolve your problems.
      – Software layers are going to have bottlenecks.
      – Improvements for high speed networks can help traditional ones as well
■    Benefit is not always clear cut




86                                                                               © 2012 IBM Corporation
And after all that…




87                         © 2012 IBM Corporation
Conclusion



■    High speed networks are a game changer
■    Simple to use, hard to use effectively
■    Expectations based on past results need to be re-evaluated
■    Existing applications / frameworks may need tuning or optimization
■    Opens up potentially new possibilities




88                                                                        © 2012 IBM Corporation
Questions?




89                © 2012 IBM Corporation
References



 ■   Get Products and Technologies:
      – IBM Java Runtimes and SDKs:
          • https://www.ibm.com/developerworks/java/jdk/
      – IBM Monitoring and Diagnostic Tools for Java:
          • https://www.ibm.com/developerworks/java/jdk/tools/


 ■   Learn:
      – IBM Java InfoCenter:
          • http://publib.boulder.ibm.com/infocenter/java7sdk/v7r0/index.jsp

 ■   Discuss:
      – IBM Java Runtimes and SDKs Forum:
          • http://www.ibm.com/developerworks/forums/forum.jspa?forumID=367&start=0




90                                                                                    © 2012 IBM Corporation
Copyright and Trademarks



 © IBM Corporation 2012. All Rights Reserved.


 IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of
   International Business Machines Corp., and registered in many jurisdictions
   worldwide.


 Other product and service names might be trademarks of IBM or other companies.


 A current list of IBM trademarks is available on the Web – see the IBM “Copyright
    and trademark information” page at URL: www.ibm.com/legal/copytrade.shtml




91                                                                        © 2012 IBM Corporation

High speed networks and Java (Ryan Sciampacone)

  • 1.
    Ryan Sciampacone –IBM Java Runtime Lead 1st October 2012 High Speed Networks Free Performance or New Bottlenecks? © 2012 IBM Corporation
  • 2.
    Important Disclaimers THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILST EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. ALL PERFORMANCE DATA INCLUDED IN THIS PRESENTATION HAVE BEEN GATHERED IN A CONTROLLED ENVIRONMENT. YOUR OWN TEST RESULTS MAY VARY BASED ON HARDWARE, SOFTWARE OR INFRASTRUCTURE DIFFERENCES. ALL DATA INCLUDED IN THIS PRESENTATION ARE MEANT TO BE USED ONLY AS A GUIDE. IN ADDITION, THE INFORMATION CONTAINED IN THIS PRESENTATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM, WITHOUT NOTICE. IBM AND ITS AFFILIATED COMPANIES SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF: - CREATING ANY WARRANT OR REPRESENTATION FROM IBM, ITS AFFILIATED COMPANIES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS 2 © 2012 IBM Corporation
  • 3.
    Introduction to thespeaker ■ 15 years experience developing and deploying Java SDKs ■ Recent work focus: ■ Managed Runtime Architecture ■ Java Virtual Machine improvements ■ Multi-tenancy technology ■ Native data access and heap density ■ Footprint and performance ■ Garbage Collection ■ Scalability and pause time reduction ■ Advanced GC technology ■ My contact information: – Ryan_Sciampacone@ca.ibm.com 3 © 2012 IBM Corporation
  • 4.
    What should youget from this talk? ■ Understand the current state of high speed networks in the context of Java development and take away a clear view of the issues involved. Learn practical approaches to achieving great performance, including how to understand results that initially don’t make sense. 4 © 2012 IBM Corporation
  • 5.
    Life In TheFast Lane ■ “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” -- Andrew S. Tanenbaum, Computer Networks, 4th ed., p. 91 ■ Networks often just thought of as a simple interconnect between systems ■ No real differentiators – WAN vs. LAN – Wired vs. Wireless ■ APIs traditionally make this invisible – Socket API is good at hiding things (SDP, SMC-R, TCP/IP) ■ Can today’s network offerings be exploited to improve existing performance? 5 © 2012 IBM Corporation
  • 6.
    Network Overview 6 © 2012 IBM Corporation
  • 7.
    Network Speeds OverTime Comparison of Network Speeds 10Mbs 100Mbs 1GigE 10GigE InfiniBand ■ Consistent advancement in speeds over the years ■ Networks have come a long way in that time 7 © 2012 IBM Corporation
  • 8.
    Network Speeds OverTime Comparison of Network Speeds 10Mbs 100Mbs 1GigE 10GigE InfiniBand ■ Oh sorry – that was a logarithmic scaled chart! 8 © 2012 IBM Corporation
  • 9.
    Network Speeds vs.The World Networks vs. Other Storage Bandwidth 1GigE 10GigE InfiniBand Core i7 SSD ■ Bandwidth differences between memory and InfiniBand still a ways off ■ But the gap is getting smaller! 9 © 2012 IBM Corporation
  • 10.
    Networks Now vs. Yesterday
    ■ Real opportunity to look at decentralized systems
    ■ Already true:
      – Cloud computing
      – Data grids
      – Distributed computation
    ■ Network distance isn’t as far as it used to be!
  • 11.
    What is InfiniBand?
    ■ Originated in 1999 from the merger of two competing designs
    ■ Features
      – High throughput
      – Low latency
      – Quality of service
      – Failover
      – Designed to be scalable
    ■ Offers low latency RDMA (Remote Direct Memory Access)
    ■ Uses a different programming model than traditional sockets
      – No “standard” API
      – De facto: OFED (OpenFabrics Enterprise Distribution)
      – Upper layer protocols (ULPs) exist to ease the pain of development
  • 12.
    IB vs. IPoIB vs. SDP – InfiniBand
    [Diagram: a modified application uses an IB-specific communication mechanism (IB Services)
    sitting directly on the IB Core / device driver; kernel facilities are bypassed –
    effectively a “zero hop” to the communication layer]
    ■ Handles all transmission aspects (guarantees, transmission units, etc.)
    ■ Extremely low CPU cost
  • 13.
    IB vs. IPoIB vs. SDP – IP over InfiniBand
    [Diagram: the application uses the standard socket APIs; the entire TCP/IP stack is
    retained but rests on a mapping / conversion layer (IPoIB) above the IB Core and
    device driver]
    ■ Effectively the TCP/IP stack using a “device driver” to interface with the IB layer
    ■ High CPU cost
  • 14.
    IB vs. IPoIB vs. SDP – Sockets Direct Protocol
    [Diagram: the application still uses the socket API, but SDP replaces TCP/IP with its own
    lighter-weight mechanisms and mappings to leverage IB]
    ■ Largely bypasses the kernel but still incurs an extra hop during transmission
    ■ Medium CPU cost
    (a configuration sketch for enabling SDP follows below)
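    A minimal sketch of how an unmodified socket application can be pointed at SDP, shown
    here with the com.sun.sdp.conf mechanism that JDK 7 introduced on Linux/Solaris with
    OFED (other SDKs may expose an equivalent switch – check your SDK documentation). The
    addresses, port, and class name are illustrative only:

      # sdp.conf – rules that route matching bind / connect calls over SDP
      bind    192.168.1.1   *
      connect 192.168.1.*   5000

      java -Dcom.sun.sdp.conf=sdp.conf -Djava.net.preferIPv4Stack=true MyServer

    The appeal is exactly what the diagram shows: the application keeps the socket API and
    only the transport underneath changes.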
  • 15.–24.
    Throughput vs. Latency
    [Diagram, built up across slides 15–24: a pipe carrying data units from a start point to
    an end point]
    ■ Data unit: the unit used for measuring throughput and latency
    ■ Latency: the length of time for a data unit to travel from the start point to the end
      point (e.g., 10ms)
    ■ Throughput: the number of data units that arrive per unit of time (e.g., 10Gb/s)
    ■ Shower analogy (a worked example follows below)
      – Diameter of the pipe gives you water throughput
      – Length determines the time it takes for a drop to travel from end to end
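    Since the two measures are independent, it is worth doing the arithmetic once: the pipe
    in the analogy holds throughput × latency bytes of in-flight data. A quick check using
    the example values above:

      // Bandwidth-delay product for the slide's example values.
      public class BandwidthDelay {
          public static void main(String[] args) {
              double bitsPerSecond = 10e9;    // throughput: 10Gb/s
              double latencySeconds = 0.010;  // latency: 10ms
              double inFlightBytes = bitsPerSecond * latencySeconds / 8;
              System.out.printf("%.1f MB in flight%n", inFlightBytes / 1e6); // 12.5 MB
          }
      }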
  • 25.
    Throughput vs. Latency
    ■ Motivations can characterize priorities
      – The two are not necessarily related!
    ■ Higher throughput rates offer interesting optimization possibilities
      – Reduced pressure on compressing data
      – Reduced pressure on being selective about what data to send
        ■ For something like RDMA… just send the entire page
  • 26.
    Simple Test using IB
  • 27.
    Simple Test using IB – Background
    ■ Experiment: can Java exploit RDMA to get better performance?
    ■ Tests conducted
      – Send different sized packets from a client to a server
      – Time required to complete the write
      – Test variations include the communication layer, up to RDMA
    ■ Conditions
      – Single threaded
      – 40Gb/s InfiniBand
    ■ Goal being to look at
      – Network speeds
      – Baseline overhead that Java imposes over equivalent C programs
      – Existing issues that may not have been predicted
    ■ Also going to look at very basic Java overhead
      – Comparisons are against an equivalent C program (a client-side sketch follows below)
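    The harness itself is not shown in the deck; a minimal client along these lines
    (hypothetical host argument and port, simplified loop – not the actual test code)
    illustrates the shape of the measurement and why a DirectByteBuffer is used:

      import java.net.InetSocketAddress;
      import java.nio.ByteBuffer;
      import java.nio.channels.SocketChannel;

      public class ThroughputClient {
          public static void main(String[] args) throws Exception {
              int payloadSize = Integer.parseInt(args[0]);      // e.g. 1k .. 256m
              // Direct buffer: the data lives in native memory, so no JNI-time copy.
              ByteBuffer buf = ByteBuffer.allocateDirect(payloadSize);
              try (SocketChannel ch = SocketChannel.open(
                      new InetSocketAddress(args[1], 9999))) {  // host / port illustrative
                  long bytes = 0;
                  long start = System.nanoTime();
                  for (int i = 0; i < 10_000; i++) {
                      buf.clear();
                      while (buf.hasRemaining()) {
                          bytes += ch.write(buf);               // timed write path
                      }
                  }
                  double secs = (System.nanoTime() - start) / 1e9;
                  System.out.printf("%d bytes in %.2fs = %.1f MB/s%n",
                          bytes, secs, bytes / secs / 1e6);
              }
          }
      }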
  • 28.
    Simple Test using IB – IPoIB Comparison
    [Chart: throughput comparison for C vs. Java (DirectByteBuffer) over IPoIB, payload sizes
    1k–256m; higher is better]
    ■ DirectByteBuffer (NIO socket channel) used to avoid marshalling costs (JNI)
    ■ Observations
      – C code is initially faster than the Java implementation
      – Generally even after a 128k payload size
  • 29.
    Simple Test using IB – SDP Comparison
    [Chart: throughput comparison for C vs. Java (DirectByteBuffer) over IPoIB and SDP,
    payload sizes 1k–256m; higher is better]
    ■ DirectByteBuffer (NIO socket channel) used to avoid marshalling costs (JNI)
    ■ Observations
      – C code is initially faster than the Java implementation
      – Generally even after a 128k payload size
  • 30.–36.
    Interlude – Zero Copy 64k Boundary
    ■ Classic networking (java.net) package
    [Diagram, built up across slides 30–36: a byte[] on the Java heap is passed through JNI
    on write; the data is copied once from the Java heap into native memory, then copied
    again into the kernel before it is finally transmitted]
    ■ 2 copies before the data gets transmitted
    ■ Lots of CPU burn, lots of memory being consumed
  • 37.–41.
    Interlude – Zero Copy 64k Boundary
    ■ Using DirectByteBuffer with SDP
    [Diagram, built up across slides 37–41: the write starts from native memory, so only one
    copy into the kernel is needed before transmission]
    ■ 1 copy before the data gets transmitted
    ■ Less CPU burn, less memory being consumed
    (the two write paths are contrasted in the sketch below)
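    The two paths side by side, as a sketch (a connected Socket / SocketChannel is assumed):

      import java.io.OutputStream;
      import java.net.Socket;
      import java.nio.ByteBuffer;
      import java.nio.channels.SocketChannel;

      class WritePaths {
          // Classic java.net: the heap byte[] must be copied into native memory
          // (heap objects can be moved by the GC), then copied again into the kernel.
          static void classicWrite(Socket socket, byte[] data) throws Exception {
              OutputStream out = socket.getOutputStream();
              out.write(data);
          }

          // NIO with a DirectByteBuffer: the data already lives in native memory,
          // so the Java-heap-to-native copy disappears.
          static void directWrite(SocketChannel channel, ByteBuffer direct) throws Exception {
              while (direct.hasRemaining()) {
                  channel.write(direct);
              }
          }
      }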
  • 42.–48.
    Interlude – Zero Copy 64k Boundary
    ■ But when the payload hits the “zero copy” threshold in SDP…
    [Diagram, built up across slides 42–48: for writes >64KB the user-space memory is
    “registered” for use with RDMA (direct send from user-space memory), the data is
    transmitted, and the memory is “unregistered” when the send completes]
    ■ Registration is extremely expensive / slow!
    ■ 1 copy before the data gets transmitted
    ■ Register / unregister is prohibitively expensive (and happens on every transmit!)
  • 49.
    Simple Test using IB – SDP Comparison
    [Chart: throughput comparison for C vs. Java over IPoIB and SDP, payload sizes 1k–256m;
    higher is better]
    ■ Past the zero copy threshold there is a sharp drop
      – Cost of the memory register / unregister
    ■ Eventual climb and plateau
      – The benefits of zero copy cannot outweigh the drawbacks
  • 50.
    Simple Test using IB – RDMA Comparison
    [Chart: throughput comparison for C vs. Java over IPoIB, SDP, and RDMA/W, payload sizes
    1k–256m; higher is better]
    ■ No “zero copy” threshold issues
      – Always zero copy
      – Memory is registered once and reused (see the sketch below)
    ■ Throughput does finally plateau
      – Single thread – the pipe is hardly saturated
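    The pattern that makes this fast is register-once, reuse-forever. Sketched against a
    hypothetical verbs-style wrapper – RdmaEndpoint, MemoryRegion, registerMemory, and
    postWrite are illustrative names, not a real Java API (the de facto OFED verbs
    interface is C):

      import java.nio.ByteBuffer;

      // Illustrative interfaces only – not a real RDMA binding.
      interface MemoryRegion { ByteBuffer buffer(); }
      interface RdmaEndpoint {
          MemoryRegion registerMemory(ByteBuffer buf);      // expensive: pin + register
          void postWrite(MemoryRegion region, int length);  // cheap: reuses registration
      }

      class RdmaSender {
          private final RdmaEndpoint endpoint;
          private final MemoryRegion region;  // registered once, reused for every send

          RdmaSender(RdmaEndpoint endpoint, int maxPayload) {
              this.endpoint = endpoint;
              this.region = endpoint.registerMemory(ByteBuffer.allocateDirect(maxPayload));
          }

          void send(byte[] payload) {
              ByteBuffer buf = region.buffer();
              buf.clear();
              buf.put(payload);                              // copy into the registered region
              endpoint.postWrite(region, payload.length);    // no register / unregister per send
          }
      }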
  • 51.
    Simple Test using IB – What about that Zero Copy Threshold?
    [Chart: zero copy threshold comparison for SDP at thresholds 4k–512k, payload sizes
    1k–256m; higher is better]
    ■ SDP ultimately has a plateau here
      – Possibly other, deeper tuning aspects available
    ■ Pushing the zero copy threshold out has no advantage
    ■ The claw back is still ultimately limited
      – Likely gated by some other aspect of the system
    ■ The 64KB (default) threshold seems to be the “sweet spot”
  • 52.
    Simple Test using IB – Summary
    ■ Simple steps to start using
      – IPoIB lets you use your application ‘as is’
    ■ Increased speed can potentially involve significant application changes
      – Potential need for deeper technical knowledge
      – SDP is an interesting stop gap
    ■ There are hidden gotchas!
      – Increased load changes the game – but this is standard when dealing with computers
  • 53.
    ORB and High Speed Networks
  • 54.
    Benchmarking the ORB – Background
    ■ Experiment: how does the ORB perform over InfiniBand?
    ■ Tests conducted
      – Send different sized packets from a client to a server
      – Time required for a write followed by a read
      – Compare standard Ethernet to SDP / IPoIB
    ■ Conditions
      – 500 client threads
      – Echo style test (send to server, server echoes the data back – a raw-socket sketch
        of this shape follows below)
      – byte[] payload
      – 40Gb/s InfiniBand
    ■ Goal being to look at
      – ORB performance when the data pipe isn’t the bottleneck (time to complete the benchmark)
      – Threading performance
    ■ Realistically expecting to discover bottlenecks in the ORB
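    For reference, the server half of an echo test of this shape, as a raw NIO sketch only –
    the actual benchmark drives the CORBA ORB, not bare sockets, and the port is illustrative:

      import java.net.InetSocketAddress;
      import java.nio.ByteBuffer;
      import java.nio.channels.ServerSocketChannel;
      import java.nio.channels.SocketChannel;

      public class EchoServer {
          public static void main(String[] args) throws Exception {
              try (ServerSocketChannel server = ServerSocketChannel.open()) {
                  server.bind(new InetSocketAddress(9999));
                  while (true) {
                      try (SocketChannel ch = server.accept()) {  // one client at a time
                          ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
                          while (ch.read(buf) != -1) {
                              buf.flip();
                              while (buf.hasRemaining()) {
                                  ch.write(buf);                  // echo the payload back
                              }
                              buf.clear();
                          }
                      }
                  }
              }
          }
      }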
  • 55.
    Benchmarking the ORB – Ethernet Results
    [Chart: ORB echo test time to complete over Ethernet, payload sizes 1k–1m; lower is better]
    ■ Standard Ethernet with the classic java.net package
  • 56.
    Benchmarking the ORB – SDP
    [Chart: ORB echo test time to complete, Ethernet vs. SDP, payload sizes 1k–1m; lower is better]
    ■ …And this is with SDP (it could be better)
  • 57.–64.
    Benchmarking the ORB – ORB Transmission Buffers
    [Diagram, built up across slides 57–64: a byte[] written through the ORB is chopped into
    the ORB’s internal transmission buffers (shown as 1KB pieces), copied, and then
    transmitted]
    ■ Many additional costs are being incurred (per thread!) to transmit a byte array
  • 65.
    Benchmarking the ORB – ORB Transmission Buffers
    ■ 3KB to 4KB ORB buffer sizes were sufficient for Ethernet
    [Diagram: ORB → socket layer → transmit, with a 4KB buffer]
    ■ The existing bottlenecks were outside the ORB (buffer management)
    ■ Throughput couldn’t be pushed much further
  • 66.
    Benchmarking the ORB – ORB Transmission Buffers
    ■ 64KB was the best size for SDP (a configuration sketch follows below)
    [Diagram: ORB → native copy → transmit, with a 64KB buffer]
    ■ Zero Copy Threshold!
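    If the ORB’s transmission buffer (fragment) size is configurable, it can be aligned with
    the transport. A sketch using the fragment-size property understood by the IBM ORB – the
    property name here is an assumption, so verify it against your ORB’s documentation:

      import java.util.Properties;
      import org.omg.CORBA.ORB;

      public class OrbSetup {
          public static ORB create() {
              Properties props = new Properties();
              // Assumed IBM ORB property: keep fragments at the SDP zero copy threshold.
              props.setProperty("com.ibm.CORBA.FragmentSize", "65536");
              return ORB.init(new String[0], props);
          }
      }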
  • 67.–70.
    Benchmarking the ORB – Garbage Collector Impact
    ■ Allocating large objects (e.g., buffers) can be a costly operation
    [Diagram, built up across slides 67–70: a heap of free and allocated regions, with a
    large buffer and the question “allocate where?”]
    ■ Premature garbage collections occur in order to “clear space” for large allocations
      (a pooling sketch follows below)
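    A common mitigation is to take the large allocations out of the steady state entirely:
    allocate a fixed set of buffers up front and recycle them. A minimal sketch, not the
    ORB’s actual buffer manager:

      import java.nio.ByteBuffer;
      import java.util.concurrent.ArrayBlockingQueue;
      import java.util.concurrent.BlockingQueue;

      public class BufferPool {
          private final BlockingQueue<ByteBuffer> pool;

          public BufferPool(int count, int size) {
              pool = new ArrayBlockingQueue<>(count);
              for (int i = 0; i < count; i++) {
                  pool.add(ByteBuffer.allocateDirect(size));  // allocate once, up front
              }
          }

          public ByteBuffer acquire() throws InterruptedException {
              return pool.take();   // blocks rather than allocating (and triggering GC)
          }

          public void release(ByteBuffer buf) {
              buf.clear();
              pool.offer(buf);
          }
      }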
  • 71.–82.
    Benchmarking the ORB – Thread Pools
    ■ Thread and connection count ratios are a factor (a sizing sketch follows below)
    [Diagram, built up across slides 71–82: 500 client threads sharing first 1, then 500,
    then 10 connections to the server]
    ■ 500 threads over 1 connection
      – Highly contended resource
      – Couldn’t saturate the communication channel
    ■ 500 threads over 500 connections
      – Context switching disaster
      – Threads queued and unable to complete their transmit
      – Memory / resource consumption nightmare
    ■ 500 threads over 10 connections
      – 2–5% of the client thread count appeared to be best
      – Saturates the communication pipe enough to achieve the best throughput
      – Keeps resource consumption and context switches to a minimum
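    That ratio is easy to enforce with a semaphore in front of a small connection pool – a
    sketch using the numbers from this test (500 client threads, ~2% connections):

      import java.util.concurrent.Semaphore;

      public class ConnectionGate {
          private final Semaphore permits;

          public ConnectionGate(int clientThreads) {
              // ~2% of the client thread count, minimum 1 (500 threads -> 10 permits)
              permits = new Semaphore(Math.max(1, clientThreads / 50));
          }

          public void withConnection(Runnable send) throws InterruptedException {
              permits.acquire();  // threads queue here instead of thrashing connections
              try {
                  send.run();     // transmit on one of the pooled connections
              } finally {
                  permits.release();
              }
          }
      }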
  • 83.
    Benchmarking the ORB – Post Optimization Round
    [Chart: ORB echo test time to complete, Ethernet vs. SDP, payload sizes 1k–1m; lower is better]
  • 84.
    Benchmarking the ORB – Post Optimization Round
    [Chart: ORB echo test time to complete, Ethernet vs. SDP vs. SDP (New), payload sizes
    1k–1m; lower is better]
    ■ Hey, great! Still not super (or the difference you’d expect), but it’s a good start
    ■ NOTE: the 64k threshold is definitely a big part of the whole thing
  • 85.
    Benchmarking the ORB – Post Optimization Round
    [Chart: ORB echo test time to complete, Ethernet vs. SDP vs. SDP (New) vs. IPoIB, payload
    sizes 1k–1m; lower is better]
    ■ No surprises – IPoIB has higher overhead than SDP
    ■ The 64KB numbers are actually quite close – so there are still issues to discover and fix
  • 86.
    Benchmarking the ORB – Summary
    ■ It’s not as easy as “stepping on the gas”
      – High speed networks alone don’t resolve your problems
      – Software layers are going to have bottlenecks
      – Improvements for high speed networks can help traditional ones as well
    ■ The benefit is not always clear cut
  • 87.
    And after all that…
  • 88.
    Conclusion
    ■ High speed networks are a game changer
    ■ Simple to use, hard to use effectively
    ■ Expectations based on past results need to be re-evaluated
    ■ Existing applications / frameworks may need tuning or optimization
    ■ Opens up potentially new possibilities
  • 89.
    Questions?
  • 90.
    References
    ■ Get Products and Technologies:
      – IBM Java Runtimes and SDKs:
        • https://www.ibm.com/developerworks/java/jdk/
      – IBM Monitoring and Diagnostic Tools for Java:
        • https://www.ibm.com/developerworks/java/jdk/tools/
    ■ Learn:
      – IBM Java InfoCenter:
        • http://publib.boulder.ibm.com/infocenter/java7sdk/v7r0/index.jsp
    ■ Discuss:
      – IBM Java Runtimes and SDKs Forum:
        • http://www.ibm.com/developerworks/forums/forum.jspa?forumID=367&start=0