True Scale Ddr Best In Class Performance


Published in: Technology

  1. WHITE PAPER

QLogic TrueScale™ DDR IB Adapter Provides Scalable, Best-In-Class Performance

QLogic’s DDR Adapters Outperform Mellanox®

QLogic’s Message Rate 340% Better and Scalable Latency Up to 33% Superior

Executive Summary

Solving today’s most challenging computational problems requires more powerful, cost-effective, and power-efficient systems. As clusters and the number of processors per cluster grow to address problems of increasing complexity, the communication needs of the applications also increase. Consequently, interconnect performance is crucial for application scaling. Satisfying the high performance requirements of Inter-Processor Communications (IPC) requires an interconnect that:

  • Efficiently processes a variety of message patterns
  • Leverages the benefits of multi-core processors
  • Scales with the size of the fabric
  • Minimizes power requirements

QLogic Host Channel Adapters (HCAs) have been architected with these design goals in mind to provide significantly better scaling performance than any other InfiniBand™ (IB) architecture. As a result, a measurable and sustainable difference in application performance can be realized when deploying the TrueScale IB architecture.

QLogic has performed a series of head-to-head performance benchmarks showing the I/O performance and scalability advantages of their 7200 Series of Dual Data Rate (DDR) IB adapters over Mellanox ConnectX™ adapters. The findings in this paper demonstrate that QLogic TrueScale adapters are the best choice for High Performance Computing (HPC) applications.

Key Findings

The QLogic 7200 Series DDR InfiniBand adapters offer better message rate and scalable latency than Mellanox’s ConnectX adapters. The test results described in this paper suggest that:

  • Message rate performance is over 340 percent better than ConnectX
  • Scalable latency is up to 33 percent superior to ConnectX
  • TrueScale bandwidth performance is anywhere from 120 to 70 percent better at 128- and 1024-byte message sizes, respectively
  • HPC customers can reap the benefits of TrueScale adapters, which significantly outperform Mellanox DDR adapters as the size of the cluster increases
  2. Architecturally, ConnectX is designed to offload more of the burden of communication processing from the CPU to the adapter. This design can provide benefits in CPU utilization, especially when using single- or dual-core compute nodes. However, given the availability of multiple cores in today’s compute nodes, this approach is no longer optimal. As more cores are added to a node, the communications burden on a single adapter increases significantly. This results in an increased dependency on the adapter’s capabilities for scalable “system” performance. Consequently, scalability anomalies can begin to appear when the number of cores in a compute node increases to four or five.

Primarily due to the offload capability of ConnectX, Mellanox’s adapters require significantly more power to operate, as much as 50 percent compared to TrueScale adapters. The additional wattage required to power the compute nodes is also reflected in the associated higher cooling costs to bring down the ambient temperature in the data center.

TrueScale architecture is designed to support highly-scaled applications with high message rate and ultra-low scalable latency performance. In both “scale-up” (multi-core environments) and “scale-out” (large node count) clusters, the efficient message processing capabilities of the adapter enable more effective use of the available compute resources, resulting in application performance benefits as the number of cores per node and the number of nodes in a cluster increase.

Microbenchmarks

For applications with heavy messaging requirements, message rate performance is a good indicator of how well an interconnect will be able to support the needs of an application. Another factor to consider is how well the interconnect maintains its performance as the system is scaled. The High Performance Computing Challenge (HPCC) scalable latency and scalable message rate benchmarks are strong indicators of how well the interconnect will support an application at scale.

Results

The most accurate way to establish the best interconnect option for a given application is to install and run the application on a variety of fabrics to determine the best performing option. However, given the costs associated with this approach, the use of industry standard benchmarks is a more pragmatic means of evaluating an interconnect. Table 1 summarizes QLogic’s findings in scalable benchmark performance between ConnectX and TrueScale IB adapters.

Message Rate

As seen in Table 1, at eight processes per node (ppn), TrueScale message rate performance is over three times that of ConnectX.

OSU’s Multiple Bandwidth/Message Rate benchmark (osu_mbw_mr) was run on two servers connected by a 1m cable (no switch), each server with 2x 3.0 GHz Intel® Harpertown E5472 quad-core CPUs, 16GB RAM, RHEL 5. ConnectX runs used OFED 1.3 and MVAPICH-1.0.0 (default options). TrueScale runs used InfiniPath® 2.2/OFED 1.3 and QLogic MPI (default options).

As multi-core systems become increasingly prevalent, the cluster interconnect must be able to accommodate more processes per compute node. The TrueScale architecture was designed with this trend in mind, enabling users to take maximum advantage of all the cores in their compute nodes. This is accomplished through high message rate and superior inter- and intra-node communication capabilities.

Table 1. Summary of QLogic’s Message Rate and Scalable Latency Advantage Over Mellanox

Comparison        Benchmark                                  Mellanox® MHGH28 | MHGH29       QLogic QLE7240 | QLE7280       QLogic advantage
Message rate      OSU Message Rate @ 8 ppn (non-coalesced)   4.5 | 5.5 million messages/s    19 | 26 million messages/s     Over 340%
Scalable latency  HPCC Random Ring Latency @ 128 cores       4.4 | 8.9 µs                    1.3 | 1.1 µs                   Up to 33%

HSG-WP08014 IB0030901-00 A
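The figures in Table 1 can be turned into ratios with a few lines of arithmetic. The sketch below is not part of the original paper: it assumes that the first value in each "a | b" cell belongs to the first adapter named in the column header, and the names `message_rate`, `latency_us`, `ratio`, and `percent_better` are illustrative. Because the paper does not say which adapter pairing its headline figures refer to, the script prints every pairing instead of picking one.

```python
# Values copied from Table 1. The mapping of each value to an adapter model
# assumes the "a | b" cells follow the order of the column headers.
message_rate = {  # million messages per second at 8 ppn
    "MHGH28": 4.5, "MHGH29": 5.5,
    "QLE7240": 19.0, "QLE7280": 26.0,
}
latency_us = {  # HPCC Random Ring Latency at 128 cores, in microseconds
    "MHGH28": 4.4, "MHGH29": 8.9,
    "QLE7240": 1.3, "QLE7280": 1.1,
}


def ratio(table, qlogic, mellanox):
    """Return the QLogic value divided by the Mellanox value."""
    return table[qlogic] / table[mellanox]


def percent_better(r):
    """Convert a ratio into 'percent better' (a ratio of 1.0 means no difference)."""
    return (r - 1.0) * 100.0


if __name__ == "__main__":
    for q in ("QLE7240", "QLE7280"):
        for m in ("MHGH28", "MHGH29"):
            mr = ratio(message_rate, q, m)
            lat = ratio(latency_us, q, m)
            print(f"{q} vs {m}: message rate {mr:.2f}x "
                  f"({percent_better(mr):.0f}% higher), "
                  f"latency {lat * 100:.0f}% of {m}")
```

Keeping "times as fast" and "percent better" as separate functions makes it explicit which of the two a given headline number is using.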
  3. Figure 1 illustrates the ability of TrueScale to make effective use of multi-core nodes.[1] Note that ConnectX does not scale as the processes per node increase. With TrueScale, more application work is accomplished as the node size increases.

Figure 1. TrueScale Multi-core Advantage in Message Rate Performance

Scalable Latency

In terms of scalable latency performance, at 128 cores, QLogic’s MPI latency ranges from 13 percent to 33 percent of Mellanox’s ConnectX.

All scalable latency results are from the HPC Challenge web site and use the Random Ring Latency benchmark. ConnectX Gen1 results are from the 2008-05-15 submission by Intel using 128 cores of the Intel Endeavour cluster with Xeon® E5462 CPUs (2.8 GHz); ConnectX Gen2 results are from the 2008-05-09 submission by TU Dresden using 128 cores of the SGI® Altix® ICE 8200EX cluster with Xeon X5472 CPUs (3.0 GHz). QLogic QLE7240 results are from their 2008-08-05 submission using 128 cores of the Darwin Cluster with Xeon 5160 CPUs (3.0 GHz); QLogic QLE7280 results are from their 2008-08-01 submission using 128 cores of the QLogic Benchmark Cluster with Xeon E5472 CPUs (3.0 GHz).

Figure 2 shows that TrueScale adapters maintain consistent latency performance as more cores are added to a node.[2] Consequently, more of the compute power can be used for application workload rather than waiting for the adapter to process messages.

Figure 2. TrueScale Multi-core Advantage in Latency Performance

When measuring latency with a realistic 128-byte message size, the latency performance of ConnectX drops off at about four to five cores per node. Under the same conditions, TrueScale provides consistent and predictable levels of performance.

Application Performance

SPEC MPI2007

There are more sophisticated benchmarks, such as SPEC MPI2007, which measure performance at a system level over a variety of different applications. This benchmark suite includes 13 different codes and emphasizes areas of performance that are most relevant to MPI applications running on large scale systems. The quantity and performance of the microprocessors, memory architecture, interconnect, compiler, and shared file system are all evaluated.

In August 2008, QLogic ran the SPECmpiM_base2007 benchmark on a TrueScale enabled cluster that yielded the best overall performance at 96 and 128 cores.[3] This result represents third-party validation of the scalable performance capabilities of the architecture over a variety of application types. This result compared favorably not only to other commodity x86-based compute clusters, but also against platforms from large system vendors.

Halo Test

The halo test from Argonne National Laboratory’s mpptest benchmark suite simulates communications patterns in layered ocean models.

[1] These are the results of the OSU multiple bandwidth message rate (osu_mbw_mr) test. The test used a 1-byte message size when run on two nodes, each with 2x 3.0 GHz Intel Xeon E5472 quad-core CPUs. The test used QLogic MPI 2.2 for QLE7280 adapters and MVAPICH-1.0.0 and OFED 1.3 on Gen2 ConnectX DDR adapters.

[2] These are the results of the OSU Multiple-latency (osu_multi_lat) test of QLE7240 and Gen1 ConnectX HCAs at 128 bytes message size when run on two nodes, each with 2x 2.33 GHz Intel Xeon E5410 quad-core CPUs.

[3] Details of the submission and results can be found at:
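The Random Ring Latency benchmark behind the scalable latency results arranges the MPI ranks in a randomly ordered ring and times small messages between ring neighbors. The sketch below only illustrates that communication pattern in plain Python and sends no messages. The function names, the fixed seed, the block placement of ranks on nodes, and the figure of 8 cores per node are all assumptions made for illustration; the node layouts of the cited submissions are not given in this paper.

```python
import random


def random_ring_neighbors(num_ranks, seed=0):
    """Map each rank to its (left, right) neighbors in a randomly ordered ring."""
    order = list(range(num_ranks))
    random.Random(seed).shuffle(order)  # seeded so the example is repeatable
    neighbors = {}
    for pos, rank in enumerate(order):
        left = order[(pos - 1) % num_ranks]
        right = order[(pos + 1) % num_ranks]
        neighbors[rank] = (left, right)
    return neighbors


def off_node_fraction(neighbors, cores_per_node):
    """Fraction of ring links that join ranks on different nodes.

    Assumes block placement: ranks 0..cores_per_node-1 share node 0, and so on.
    """
    links = [(rank, right) for rank, (_, right) in neighbors.items()]
    off_node = sum(1 for a, b in links
                   if a // cores_per_node != b // cores_per_node)
    return off_node / len(links)


if __name__ == "__main__":
    ring = random_ring_neighbors(128)
    # 8 cores per node is a hypothetical layout, not taken from the paper.
    print(f"off-node links: {off_node_fraction(ring, 8):.0%}")
```

With a random ordering, a rank's ring neighbors usually sit on other nodes, which is why this pattern exercises the adapter and fabric more than shared memory within a node.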
  4. Unlike many of the point-to-point microbenchmarks that measure peak bandwidth, this benchmark measures throughput performance over a variety of message sizes. As seen in Figure 3, TrueScale outperforms Mellanox across the entire range of message sizes.[1]

Figure 3. TrueScale Bandwidth Performance on Halo Benchmark

Application requirements vary in terms of message sizes and patterns, so performance over a variety of message sizes is a better predictor of performance than peak measurements. At four processes per node, TrueScale bandwidth performance is anywhere from 120 to 70 percent better at 128- and 1024-byte message sizes, respectively.

Summary and Conclusion

TrueScale is architecturally designed to take advantage of two significant trends in high performance computing clusters: the prevalence of multi-core processors in compute nodes and the need to deploy increasingly larger clusters to tackle more complex computational problems. The benefits of the TrueScale architecture can be demonstrated in a variety of industry standard benchmarks that measure the scalable performance characteristics of the interconnect. More importantly, the advantages can be realized through improved application performance and a reduced time-to-solution at about half the power of ConnectX.

[1] The benchmark is the Halo test from Argonne National Laboratory’s mpptest. In particular, the 2D halo psendrecv test at 4 processes per node on 8 nodes of 2 x 2.6 GHz AMD® Opteron™ 2218 CPUs, 8 GB DDR2-667 memory; NVIDIA® MCP55 PCIe chipset, for a total of 32 MPI ranks. QLogic MPI 2.2 used for TrueScale adapters and MVAPICH 0.9.9 for ConnectX.
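A 2D halo exchange of the kind the Halo test footnote describes can be pictured as each rank trading boundary data with its nearest neighbors on a process grid. The sketch below is a minimal illustration of that pattern and is not mpptest's implementation. The 4 x 8 grid for the 32 ranks, the periodic wrap-around, the four-neighbor stencil, and the names `halo_neighbors` and `bytes_per_exchange` are assumptions chosen for the example.

```python
def halo_neighbors(rank, rows, cols):
    """Return the four neighbors of a rank on a periodic rows x cols grid.

    Ranks are numbered row by row, so rank = row * cols + col.
    """
    r, c = divmod(rank, cols)
    return {
        "north": ((r - 1) % rows) * cols + c,
        "south": ((r + 1) % rows) * cols + c,
        "west": r * cols + (c - 1) % cols,
        "east": r * cols + (c + 1) % cols,
    }


def bytes_per_exchange(message_bytes, num_neighbors=4):
    """Bytes one rank sends in one halo step, one message per neighbor."""
    return message_bytes * num_neighbors


if __name__ == "__main__":
    # 32 ranks as in the footnote; the 4 x 8 grid shape is hypothetical.
    rows, cols = 4, 8
    for rank in (0, 31):
        print(rank, halo_neighbors(rank, rows, cols))
    for size in (128, 1024):
        print(f"{size}-byte messages: {bytes_per_exchange(size)} bytes sent per rank per step")
```

Because every rank sends several small or medium messages per step, throughput across a range of message sizes matters more to this pattern than peak bandwidth at a single large size.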
  5. Disclaimer

Reasonable efforts have been made to ensure the validity and accuracy of these performance tests. QLogic Corporation is not liable for any error in this published white paper or the results thereof. Variation in results may be a result of change in configuration or in the environment. QLogic specifically disclaims any warranty, expressed or implied, relating to the test results and their accuracy, analysis, completeness or quality.

Corporate Headquarters
QLogic Corporation
26650 Aliso Viejo Parkway
Aliso Viejo, CA 92656
949.389.6000

Europe Headquarters
QLogic (UK) LTD.
Quatro House
Lyon Way, Frimley
Camberley
Surrey, GU16 7ER UK
+44 (0) 1276 804 670

© 2008 QLogic Corporation. Specifications are subject to change without notice. All rights reserved worldwide. QLogic, the QLogic logo, InfiniPath, and TrueScale are trademarks or registered trademarks of QLogic Corporation. Mellanox and ConnectX are trademarks or registered trademarks of Mellanox Technologies, Inc. InfiniBand is a trademark and service mark of the InfiniBand Trade Association. Intel and Xeon are registered trademarks of Intel Corporation. SGI and Altix are registered trademarks of Silicon Graphics, Inc., in the United States and/or other countries worldwide. AMD and Opteron are trademarks or registered trademarks of Advanced Micro Devices, Inc. NVIDIA is a registered trademark of NVIDIA in the United States and other countries. All other brand and product names are trademarks or registered trademarks of their respective owners. Information supplied by QLogic Corporation is believed to be accurate and reliable. QLogic Corporation assumes no responsibility for any errors in this brochure. QLogic Corporation reserves the right, without notice, to make changes in product design or specifications.