Understanding Low and Scalable MPI Latency



  1. WHITE PAPER: Understanding Low and Scalable Message Passing Interface Latency. Latency Benchmarks for High Performance Computing. QLogic InfiniBand Solutions Offer 70% Advantage Over the Competition.

Executive Summary

Considerable improvements in InfiniBand® (IB) interconnect technology for High Performance Computing (HPC) applications have pushed bandwidth to a point where streaming large amounts of data off-node is nearly as fast as within a node. However, latencies for small-message transfers have not kept up with memory subsystems, and are increasingly the bottleneck in high performance clusters.

Different IB solutions provide dramatically varying latencies, especially as cluster sizes scale upward. Understanding how latencies will scale as your cluster grows is critical to choosing a network that will optimize your time to solution.

The traditional latency benchmarks, which send 0-byte messages between two adjacent systems, result in similar latency measurements of about 1.4 microseconds (µs) for emerging DDR IB Host Channel Adapters (HCAs) from QLogic® and competitors. However, on larger messages, or across more nodes in a cluster, QLogic shows a 60-70% latency advantage over competitive offerings. These scalable latency measurements indicate why QLogic IB products provide a significant advantage on real HPC applications.

Key Findings

  • The QLogic QLE7140 and QLE7280 HCAs outperform the Mellanox® ConnectX™ HCA in osu_latency at the 128-byte message size and the 1024-byte message size by as much as 70%.
  • The QLogic QLE7140 and QLE7280 HCAs outperform the ConnectX HCA in "scalable latency" by as much as 70% as the number of MPI processes increases.

Introduction

Today's HPC applications are overwhelmingly implemented using a parallel programming model known as the Message Passing Interface (MPI). To achieve maximum performance, HPC applications require a high-performing MPI solution, involving both a high-performance interconnect and highly tuned MPI libraries. InfiniBand has rapidly become the HPC interconnect of choice, used on 128 systems in the June 2007 Top 500 list. This rapid upswing was due to its high (2 GB/s) maximum bandwidth and its low (~1.4-3 µs) latency. High bandwidth is important because it allows an application to move large amounts of data very quickly. Low latency is important because it allows rapid synchronization and exchanges of small amounts of data.
  2. This white paper compares several benchmark results. For all of these results, the test bed consists of eight servers with standard "off-the-shelf" components, and a QLogic SilverStorm® 9024 24-port DDR IB Switch.

Servers

  • 2-socket rack-mounted servers
  • 2.6 GHz dual-core AMD™ Opteron® 2218 processors
  • 8 GB of DDR2-667 memory
  • Tyan® Thunder n3600R (S2912) motherboards

The HCAs benchmarked were:

  • Mellanox MHGH28-XTC (ConnectX) DDR HCA
  • QLogic QLE7140 SDR HCA
  • QLogic QLE7280 DDR HCA

All benchmarks were run using MVAPICH-0.9.9 as the MPI. For the Mellanox ConnectX HCAs, MVAPICH was run over the user-space verbs provided by the OFED-1.2.5 release. For the QLE7140 and QLE7280, MVAPICH was run over the InfiniPath™ 2.2 software stack, using the QLogic PSM API and OFED-1.2 based drivers.

Motivation for Studying Latency

Bandwidths over the network are approaching memory bandwidths within a system. Running the bandwidth microbenchmark from Ohio State (osu_bw) on a node, using the MVAPICH-0.9.9 implementation of MPI, measures large-message intra-node (socket-to-socket) MPI bandwidth of 2 GB/s with message sizes 512k or smaller. This bandwidth is at a 1:1 ratio with the available bandwidth from a DDR IB connection.

In contrast, socket-to-socket MPI latency in either system is 0.40 µs, while the fastest inter-node IB MPI latency that can be achieved is 1.3-3 µs. That is a ratio of roughly 3x to 7x between IB and socket-to-socket latency. Thus, small-message latency is one of the areas where there is a significant penalty for going off-node. Though there are some "back-to-back" 2-node benchmarks available to help, the latency they report does not always represent the latency that an application will see on a high-performance cluster.

Different Ways to Measure Latency

MPI latency is often measured by one of a number of common microbenchmarks such as osu_latency, the ping-pong component of the Intel® MPI Benchmarks (formerly Pallas MPI Benchmarks), or the ping-pong latency component of the High Performance Computing Challenge (HPCC) suite of benchmarks. All of these microbenchmarks have the same basic pattern. Each runs a single ping-pong test sending a 0- or 1-byte message between two cores on different cluster nodes, reporting the latency as half the time of one round-trip. The graphs in the original paper show the results of running osu_latency using three different IB HCAs.

HSG-WP07017 SN0032014-00 A
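The ping-pong pattern described above can be sketched without an MPI installation. The following is a minimal illustration only, assuming a local socket pair and a helper thread in place of two MPI ranks on different nodes; the function names are hypothetical and the numbers it produces say nothing about InfiniBand hardware. It shows only how "latency" is derived as half of the average round-trip time.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo(sock, iterations, size):
    """'Pong' side: receive each message and send it straight back."""
    for _ in range(iterations):
        sock.sendall(recv_exact(sock, size))


def pingpong_latency_us(size=1, iterations=1000):
    """Return one-way latency in microseconds, i.e. half the mean round trip.

    A stream socket cannot carry a 0-byte message, so `size` must be >= 1
    (the microbenchmarks in the text use 0- or 1-byte MPI messages).
    """
    ping, pong = socket.socketpair()
    worker = threading.Thread(target=echo, args=(pong, iterations, size))
    worker.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(iterations):
        ping.sendall(payload)      # ping
        recv_exact(ping, size)     # wait for the pong
    elapsed = time.perf_counter() - start
    worker.join()
    ping.close()
    pong.close()
    return elapsed / iterations / 2 * 1e6


if __name__ == "__main__":
    print(f"1-byte half round trip: {pingpong_latency_us():.2f} us")
```

Because only one pair of endpoints is ever involved, a test of this shape cannot reveal how latency changes as the number of communicating processes grows.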
  3. Judging from this test, the QLE7280, QLE7140, and ConnectX HCAs are all similar with respect to 0-byte latency. However, as the message size increases, significant differences are observed. For example, with a 128-byte message size, the QLE7280 has a latency of 1.7 µs, whereas the ConnectX DDR adapter has a latency of 2.7 µs, providing a 60% performance advantage for the QLE7280. With a 1024-byte message size, the QLE7280's latency is 2.80 µs, a 70% advantage over ConnectX's latency of 4.74 µs.

Another test that measures latency is the RandomRing latency benchmark, which is part of the High Performance Computing Challenge (HPCC) suite of benchmarks. The benchmark tests latency across a series of randomly assigned rings, averaging across all of them [1]. The benchmark forces each process to talk to every other process in the cluster. This is important because there is a substantial difference in scalability at large core counts between those HCAs that seemed so similar when running osu_latency.

As demonstrated, the QLE7280 and QLE7140 latencies largely remain flat with increasing process count. The ConnectX HCA's latency, however, rises as the number of processes increases. At 32 cores, the RandomRing latency of the QLogic QLE7280 DDR HCA is 1.33 µs, compared to 2.26 µs for the ConnectX HCA. This amounts to 70% better performance for the QLE7280. The trend is for larger differences at larger core counts. Since low latency is required even at large core counts to scale application performance to the greatest extent possible, the QLogic HCA's consistently low latency is referred to as "scalable latency."

[1] The measurement differs from the ping-pong case since the messages are sent by two processes calling MPI_Sendrecv, rather than one calling MPI_Send followed by MPI_Recv.
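As a check on the percentages quoted above, the arithmetic can be reproduced from the stated latencies. The paper does not give its formula; this sketch assumes that "advantage" means the competitor's latency divided by the QLogic latency, minus one. The function name is hypothetical.

```python
def advantage_percent(competitor_us, qlogic_us):
    """Percentage by which competitor latency exceeds QLogic latency.

    Assumed definition (not stated in the paper): ratio of latencies minus one.
    """
    return (competitor_us / qlogic_us - 1.0) * 100.0


# Latency pairs (ConnectX, QLE7280) in microseconds, as quoted in the text.
measurements = {
    "osu_latency, 128 bytes": (2.7, 1.7),
    "osu_latency, 1024 bytes": (4.74, 2.80),
    "RandomRing, 32 cores": (2.26, 1.33),
}

if __name__ == "__main__":
    for name, (connectx, qle7280) in measurements.items():
        print(f"{name}: {advantage_percent(connectx, qle7280):.1f}% advantage")
```

Under that assumed definition, the three results come out near 59%, 69%, and 70%, which matches the rounded 60% and 70% figures in the text.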
  4. Understanding Why Latency Scalability Varies

To understand why latency scalability would differ, it helps to understand, at least at a basic level, how MPI works. The following is the basic path of an MPI packet, from a sending application process to a receiving application process.

  1. Sending process has data for some remote process.
  2. Sender places data in a buffer, passes a pointer to the MPI stack, along with an indication of who the receiver is and a tag for identifying the message.
  3. 'context' or 'communication id' identifies the context over which the point-to-point communication happens; only messages in the same communicator can be matched (there is no "any" communicator).

There are some variations in how this process is implemented, often based on the underlying mechanism for data transfer.

With many interconnects offering high performance Remote Direct Memory Access (RDMA), there is a push towards utilizing it to improve MPI performance. RDMA is a one-sided communication model, allowing data to be transferred from one host to another without the involvement of the remote CPU. This has the advantage of reducing CPU utilization, but requires the RDMA initiator to know where it is writing to or reading from. This requires an exchange of information before the data can be sent.

Another mechanism that is used is known as the Send/Recv model. This is a two-sided communication model where the receiver maintains a single queue where all messages go initially, and then the receiver is involved in directing messages from that queue to their final destination. This has the advantage of not requiring remote knowledge to begin a transfer, as each side only needs to know about its own buffers, but at the cost of involving the CPU on both sides.

Most high performance interconnects provide mechanisms for both of these models, but make different optimization choices in terms of tuning them. Almost all implementations use RDMA for large messages, where the setup cost to exchange information initially is small relative to the cost of involving the CPU in transferring large amounts of data. Thus, most MPIs implement a 'rendezvous protocol' for large messages, where the sender sends a 'request to send', the receiver pins the final location buffer and sends a key, and the sender does an RDMA write to the final location. MPIs implemented on OpenFabrics verbs do this explicitly, while the PSM layer provided with the QLogic QLE7100 and QLE7200 series HCAs does it behind the scenes.

However, for small messages the latency cost of that initial setup is large compared to the cost of sending a message. A round-trip on the wire can triple the cost of sending a small message, while copying a couple of cache lines from a receive buffer to their final location costs very little. This leads most implementors to use a Send/Recv based approach for small messages. However, in HCAs that have been tuned for RDMA to the exclusion of Send/Recv, this causes a large slowdown, resulting in poor latency. An RDMA write is much faster, but it requires that costly setup. The following describes a mechanism used to sidestep this problem.

Achieving Low Latency with RDMA

For interconnects that have been optimized for RDMA, it can be desirable to use RDMA not only for large messages but also for small messages. This is done without incurring the setup latency cost by mimicking a receive mailbox in memory. For each MPI process, the MPI library sets up a temporary memory location for every other process in the job. The setup and coordination is done at initialization time, so by the time communication starts every MPI process has knowledge of the memory location to write to, and can use RDMA. When receiving, the MPI library in the receiving process checks each temporary memory location, and then copies any messages that may have arrived to the correct buffers.

This can work well in small clusters or jobs, such as when running the common point-to-point microbenchmark. Each receiving process has only one memory location to check, and can very quickly find and copy any incoming message.
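The per-peer mailbox scheme described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not any MPI library's implementation: the class and method names are hypothetical, an ordinary Python list stands in for pre-registered memory, and an assignment stands in for a one-sided RDMA write. It shows that the receiver must inspect every slot on each poll, even when only one message has arrived.

```python
class MailboxReceiver:
    """Receiver with one pre-arranged slot per remote peer (simulated)."""

    def __init__(self, num_peers):
        # One temporary location per remote process, set up at init time.
        self.slots = [None] * num_peers
        self.slots_checked = 0

    def rdma_write(self, peer, message):
        # Stand-in for a one-sided write: the receiver's CPU is not involved.
        self.slots[peer] = message

    def poll(self):
        """Scan every slot and copy out whatever has arrived."""
        delivered = []
        for peer, message in enumerate(self.slots):
            self.slots_checked += 1
            if message is not None:
                delivered.append((peer, message))
                self.slots[peer] = None  # free the slot for the next message
        return delivered


if __name__ == "__main__":
    # Two-process microbenchmark: a single slot to check.
    small = MailboxReceiver(num_peers=1)
    small.rdma_write(0, b"ping")
    print(small.poll(), "slots checked:", small.slots_checked)

    # 28 remote peers (for example 8 nodes x 4 cores, minus the local node).
    large = MailboxReceiver(num_peers=28)
    large.rdma_write(5, b"ping")
    print(large.poll(), "slots checked:", large.slots_checked)
```

In the two-process case the poll touches one slot; with 28 remote peers the same single message costs 28 slot checks.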
  5. The issue with this approach is that it doesn't scale. With RDMA, each remote process needs its own temporary memory location to write to. Thus, as a cluster grows, the receiving process has to check an additional memory location for every remote process. In today's world of multicore processors and large clusters, the array of memory locations grows rapidly.

The per-local-process memory and host software time requirements of this algorithm go up linearly with the number of processors in the cluster. This means that in a cluster made up of N nodes with M cores each, per-process memory use and latency grow as O(M * (N-1)), while per-node memory use grows even faster, as O(M^2 * (N-1)).

A Scalable Solution: Send/Recv

A more scalable solution is to use Send/Recv. Because the location in memory where messages are placed is determined locally, all messages can go into a single queue with a single place to check, instead of requiring a memory location per remote process. The results are then copied out, in the order they arrive, to the memory buffers posted by the application. Thus, the per-local-process memory requirements for this approach are constant, and the per-node memory requirements increase only with the size of the node.

Connection State

A final element, which is harder to measure but apparent in very large clusters, is the advantage of a connectionless protocol. PSM is based on a connectionless protocol, as opposed to the connected protocol (RC) that is used for most verbs-based MPIs.

The effect of a connected protocol is to require some amount of per-partner state both on the host and on the chip. When the number of processes scales up, this can lead to strange caching effects as data is sent to and received from the HCA. This can be mitigated to some extent using methods like Shared Receive Queues (SRQ) and Scalable RC, but remains a problem for very large clusters using RC-based MPIs.

The QLogic approach with the PSM API sidesteps this by using a connectionless protocol and keeping the minimum necessary state to ensure reliability. Investigations at Ohio State showed the advantages of a connectionless protocol at scale when compared to an RC-based protocol, although those advantages were limited by the small MTU and lack of reliability in the UD IB protocol [1]. In another paper, the investigators at OSU showed a need for a 'UD RDMA' approach in order to achieve full bandwidth [2].

PSM takes account of all of these issues behind the scenes. It gives the MPI implementor access to all of the scalability of a connectionless protocol, without the need to develop yet another implementation of segmentation and reliability, and without running into any of the high-end bandwidth performance issues seen with UD.

[1] http://nowlab.cse.ohio-state.edu/publications/conf-papers/2007/koop-ics07.pdf
[2] http://nowlab.cse.ohio-state.edu/publications/conf-papers/2007/koop-cluster07.pdf
  6. Summary and Conclusion

This white paper explored latency measurements and illustrated how benchmarks measuring point-to-point latency may not be representative of the latencies that applications on large-scale clusters require. It also explained some of the underlying architectural reasons behind the varying approaches to low MPI latency, and how the QLogic QLE7140 and QLE7280 IB HCAs scale efficiently to large node counts.

The current trend towards RDMA in high-performance interconnects is very useful for those applications with large amounts of data to move. As system resources are already constrained, it is vital to limit CPU usage in moving large amounts of data through the system. However, a large and growing number of applications are more latency-bound than bandwidth-bound, and for those an approach to low latency that scales is necessary. The QLogic QLE7100 and QLE7200 series IB HCAs provide scalable low latency.

Disclaimer

Reasonable efforts have been made to ensure the validity and accuracy of these performance tests. QLogic Corporation is not liable for any error in this published white paper or the results thereof. Variation in results may be a result of change in configuration or in the environment. QLogic specifically disclaims any warranty, expressed or implied, relating to the test results and their accuracy, analysis, completeness or quality.

www.qlogic.com

Corporate Headquarters: QLogic Corporation, 26650 Aliso Viejo Parkway, Aliso Viejo, CA 92656, 949.389.6000
Europe Headquarters: QLogic (UK) LTD., Surrey Technology Centre, 40 Occam Road, Guildford, Surrey GU2 7YG, UK, +44 (0)1483 295825

© 2007 QLogic Corporation. Specifications are subject to change without notice. All rights reserved worldwide. QLogic, the QLogic logo, and SilverStorm are registered trademarks of QLogic Corporation. InfiniBand is a registered trademark of the InfiniBand Trade Association. AMD and Opteron are trademarks or registered trademarks of Advanced Micro Devices. Tyan is a registered trademark of Tyan Computer Corporation. Mellanox and ConnectX are trademarks or registered trademarks of Mellanox Technologies, Inc. InfiniPath is a trademark of PathScale, Inc. Intel is a registered trademark of Intel Corporation. All other brand and product names are trademarks or registered trademarks of their respective owners. Information supplied by QLogic Corporation is believed to be accurate and reliable. QLogic Corporation assumes no responsibility for any errors in this brochure. QLogic Corporation reserves the right, without notice, to make changes in product design or specifications.