Experiences in Scaling Scientific Applications on Current-generation Quad-core Processors

Kevin Barker, Kei Davis, Adolfy Hoisie, Darren Kerbyson, Mike Lang, Scott Pakin, José Carlos Sancho
Performance and Architecture Lab (PAL), Los Alamos National Laboratory, USA
{kjbarker,kei,hoisie,djk,mlang,pakin,jcsancho}@lanl.gov

Abstract

In this work we present an initial performance evaluation of AMD and Intel's first quad-core processor offerings: the AMD Barcelona and the Intel Xeon X7350. We examine the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of the intra-processor and intra-node scalability of microbenchmarks and a range of large-scale scientific applications indicates that quad-core processors can deliver an improvement in performance of up to 4x per processor, but that the improvement is heavily dependent on the workload being processed. While the Intel processor has a higher clock rate and peak performance, the AMD processor has higher memory bandwidth and intra-node scalability. The scientific applications we analyzed exhibit a range of performance improvements, from only 3x up to the full 16x speed-up over a single core. Also, we note that the maximum node performance is not necessarily achieved by using all 16 cores.

1. Introduction

The advancing level of transistor integration is producing increasingly complex processor solutions, ranging from mainstream multi-cores and heterogeneous many-cores to special-purpose processors including GPUs. There is no doubt that this will continue into the future until Moore's Law can no longer be satisfied. This increasing integration will require increases in the performance of the memory hierarchy to feed the processors. Innovations such as putting memory on top of processors, putting processors on top of memory (PIMs), or a combination of both may be a future way forward. However, the utility of future processor generations will depend on demonstrable increases in achievable performance on real workloads.

In this work we examine the performance of two state-of-the-art quad-core processors: the quad-core AMD Opteron 8350 (Barcelona) and the quad-core Intel Xeon X7350 (Tigerton). Both are based on 65nm process technology. The Barcelona is fabricated as a single die whereas the Tigerton incorporates two dual-core dies into a single package. In this study we compare the performance of two 16-core nodes, one with four Barcelona processors and the other with four Tigerton processors. Our analysis relies on performance measurements of application-independent tests (microbenchmarks) and a suite of scientific applications taken from existing workloads within the U.S. Department of Energy that represent various scientific domains and program structures.

The performance and scaling behavior of each application was measured on one core, when scaling from one to four cores on a single processor, and also when using all four processors in a node. In addition, we determined the best achievable performance of each application on each node, which is not necessarily obtained when using all processing cores within a socket, or all cores within a node; this is heavily dependent on the application characteristics.

Though much of our work is focused on large-scale system performance, including examining the largest systems available (for instance Blue Gene/L and Blue Gene/P, ASC Purple, and ASC Red Storm, e.g. [2]), we note that performance at large scale results from both the performance of the computational nodes and their integration into the system as a whole.

This paper is organized as follows. An overview of the Barcelona and Xeon nodes is given in Section 2. Low-level microbenchmarks are described in Section 3, together with measured results for both nodes. Section 4 describes the suite of applications, the input decks used, and the methodology used to undertake the scalability analysis. Results are presented in Section 5 for the three types of analysis described. Conclusions from this work are discussed in Section 6.

The contribution of this work is in the analysis of empirical performance data from a large suite of complete scientific applications on the first generation of quad-core processors from both AMD and Intel, in a quad-socket environment. These data are obtained from a strict measurement methodology to ensure that conclusions drawn from the scalability analysis are fair. Note that in this present work we do not consider physical or economic issues such as hardware cost, power, or physical node size. The process that we follow is directly applicable to other multi-core studies.
2. Processor and Node Descriptions

The Intel Xeon X7350 (Tigerton) and AMD 8350 Opteron (Barcelona) represent competing first-generation quad-core processor designs, both initially made available in September 2007. They are detailed below and illustrate different implementations in terms of both processor configuration and connectivity to memory.

2.1. The Intel X7350 quad-core (Tigerton)

The Intel Tigerton processor contains two dual-core dies that are packaged into a single dual-chip module (DCM) seated within a single socket. Each core contains a private 64KB L1 cache (32KB data + 32KB instruction), and the two cores on each die share a 4MB L2 cache; thus the total amount of L2 cache within the DCM is 8MB. The processor implements the 128-bit SSE3 instruction set for SIMD operations, so each core can perform 4 double-precision floating-point operations per cycle. The cores are clocked at 2.93GHz, so the DCM has a theoretical peak performance of 46.9 Gflops/s.

Each node contains four processors for a total of 16 cores, as shown in Figure 1, and contains a total of 16GB of main memory using fully-buffered DIMMs (FBDIMMs). Central to the node is a single memory controller hub (MCH). This hub interconnects the front-side bus (FSB) of each processor to four FBDIMM memory channels. The MCH contains a 64MB snoop buffer and a Dedicated High Speed Interconnect (DHSI) as well as PCI Express channels. The purpose of the snoop buffer is to minimize main memory accesses, while the DHSI provides a point-to-point link between each processor and the memory channels. The front-side bus of each processor runs at 1066MHz. The memory speed is 667MHz, which provides a peak memory bandwidth of 10.7GB/s per processor, shared among the four cores.

2.2. The AMD 8350 quad-core Opteron processor (Barcelona)

Barcelona, the latest generation of the Opteron, combines four Opteron cores onto a single die. Each die contains a single integrated memory controller and uses a HyperTransport (HT) network for point-to-point connections between processors. Each core has a private 64KB L1 cache (32KB data + 32KB instruction) and a private 512KB L2 cache, and each processor has a shared 2MB L3 cache. The shared L3 cache is new to the Opteron architecture. The new 128-bit SSE4a instructions enable each core to execute 4 double-precision floating-point operations per clock. The clock speed of each core is 2.0GHz, giving each chip a peak performance of 32 Gflops/s.

Each node contains four quad-core processors, as shown in Figure 2. Because each processor contains a separate memory controller, a key difference from the Xeon node is that memory is connected directly to each processor in a non-uniform memory access (NUMA) configuration, versus the Xeon's symmetric-multiprocessor (SMP) configuration. DDR2 667MHz memory is used, and thus the memory bandwidth per processor is 10.7GB/s. The total memory capacity of the node is 16GB (4GB per processor). The HT links connect the four processors in a 2x2 mesh; further HT links provide PCI Express I/O capability. Each HT link has a theoretical peak of 8GB/s for data transfer.

Figure 1. Overview of the Intel Tigerton: (a) quad-core processor (two dual-core dies, each pair of cores sharing an L2 cache, on a 1066MHz FSB); (b) quad-processor node (four sockets connected via the memory controller hub (MCH) to four FBDIMM-667 channels over 8GB/s links).

Figure 2. Overview of the AMD Barcelona: (a) quad-core processor (four cores with private L2 caches, a shared L3, a crossbar, integrated memory controller, and HT links); (b) quad-processor node (four sockets, each with locally attached DDR2-667 memory, connected by 8GB/s HT links).
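The peak figures quoted above follow directly from the clock rates, the four double-precision operations per cycle per core, and the four cores per package; as a worked restatement of the numbers given in Sections 2.1 and 2.2 (no new data):

    Peak per package = f_clock x 4 flops/cycle/core x 4 cores
    Tigerton:  2.93 GHz x 4 x 4 = 46.9 Gflops/s per package; x 4 packages = 187.6 Gflops/s per node
    Barcelona: 2.0 GHz  x 4 x 4 = 32.0 Gflops/s per package; x 4 packages = 128.0 Gflops/s per node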
A summary of the two processor architectures and nodes is presented in Table 1. It is interesting to note the differences in power consumption and transistor count per processor. The lower transistor count of the Barcelona is due mainly to its reduced cache capacity.

Table 1. Characteristics of the Intel and AMD processor, memory, and node organization.

                              Intel Tigerton         AMD Barcelona
  Chip
    Speed (GHz)               2.93                   2.0
    Peak (Gflops/s)           46.9                   32
    L1 (KB)                   64                     64
    L2 (MB)                   4 (shared, 2 cores)    0.5 (per core)
    L3 (MB)                   None                   2 (shared, 4 cores)
    Power (W)                 130                    75
    Transistor count (M)      582                    463
  Memory
    Type                      FBDIMM 667MHz          DDR2 667MHz
    Memory controllers        1                      4
  Node
    Peak (Gflops)             187.6                  128.0

3. Low-level performance characteristics

3.1. Memory bandwidth

To examine the memory bandwidth per core we ran the MPI version of the University of Virginia's Streams benchmark [11]. This benchmark is a memory stress test covering a number of different operations; we report here the performance of the "triad" test. Figure 3(a) shows the aggregate memory bandwidth for both the Xeon and Barcelona nodes for two cases: when using a single processor and when using all four processors in the node. The number of cores used per processor is varied from one to four in both cases. As shown by the figure, the Barcelona node outperforms the Xeon node in all cases. The measured single-core memory bandwidth is 4.4GB/s on the Barcelona and 3.7GB/s on the Xeon. The aggregate memory bandwidth, using all 16 cores, is 17.4GB/s and 10.2GB/s respectively.

Figure 3(b) is based on the same data as Figure 3(a) but presents the observed memory bandwidth per core. As shown, the per-core bandwidth decreases from 4.4GB/s to 1.1GB/s for the Barcelona (a factor-of-four decrease), and from 3.7GB/s to 0.63GB/s for the Xeon (a factor-of-six decrease). This decrease is significant, and there is clearly room for improvement on both architectures. The aggregate achievable memory bandwidth is important to memory-intensive applications, and the Barcelona node has a clear advantage over the Xeon node for such applications.

Figure 3. Streams bandwidth as a function of cores per socket, for one socket and four sockets on each node: (a) aggregate memory bandwidth (GB/s); (b) bandwidth per core (GB/s).
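The triad kernel at the heart of this test is simple enough to sketch. The following is a minimal, self-contained illustration of a triad-style bandwidth measurement, not the actual Streams source used in the study; the array size, timer choice, and single-process structure are our assumptions.

    /* Minimal sketch of a STREAM-triad-style bandwidth test (illustrative,
     * not the benchmark used in Section 3.1). Build with: cc -O2 triad.c */
    #define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 25)   /* ~33.5M doubles per array: large enough to defeat caches */

    static double elapsed(struct timespec s, struct timespec e) {
        return (double)(e.tv_sec - s.tv_sec) + 1e-9 * (e.tv_nsec - s.tv_nsec);
    }

    int main(void) {
        double *a = malloc((size_t)N * sizeof *a);
        double *b = malloc((size_t)N * sizeof *b);
        double *c = malloc((size_t)N * sizeof *c);
        const double scalar = 3.0;
        if (!a || !b || !c) return 1;

        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)          /* the triad kernel: a = b + s*c */
            a[i] = b[i] + scalar * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* triad streams three arrays of N doubles: two reads and one write */
        double gbytes = 3.0 * N * sizeof(double) / 1e9;
        printf("triad: %.2f GB/s (check %.1f)\n", gbytes / elapsed(t0, t1), a[N - 1]);

        free(a); free(b); free(c);
        return 0;
    }

Running one copy per core (for example, one MPI rank pinned to each core as described in Section 3.2) and summing the reported rates gives an aggregate bandwidth of the kind plotted in Figure 3(a).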
3.2. Processor locality

The mapping of application processes to cores within a node affects the memory contention that the application induces. In our experience, the core-to-processor ordering is not always obvious and should at a minimum be verified. Linux determines core numbering via information provided by the BIOS.

In this testing we used an MPI benchmark to measure the latency between each core and all of the others. The latency varies depending on whether the two communicating cores are on the same die, on different dies in the same processor (in the case of the Intel Xeon), or on different processors (single- or multiple-hop remote processors for the AMD Barcelona). From our latency measurements we were able to determine the arrangement of the cores as seen by an application. Figure 4 shows the observed latency from this test in the form of a matrix: the vertical axis indicates the sending core, the horizontal axis the receiving core, and shading denotes the different latencies. No test was performed for a core sending to itself (the major diagonal of the matrix).

It can be seen that the Barcelona node uses a linear ordering of cores to processors; that is, the first four cores reside on the first processor, as indicated by the lowest-latency (black-shaded) 4x4 block of cores in Figure 4(b), and so on. In the case of the Xeon, a round-robin ordering across the dies is used: the first two dies reside on the first processor, and so on. Cores on the same die are an MPI logical task distance of eight apart, as shown by the black diagonal lines in Figure 4(a).

The latency from one core to another falls into very distinct ranges depending on the cores' relationship (same die, remote die, etc.). We observe that the maximum latency is similar on both nodes, but the Xeon enjoys a much lower intra-die and intra-processor latency. The core and processor ordering shown in Figure 1 and Figure 2 is based on the observed processor locality map of Figure 4. To handle the differences in processor numbering between the two nodes we implemented a small software shim that uses the Linux sched_setaffinity() system call to allow user-defined mappings between MPI ranks and physical cores. This shim gives us the ability to map processes to cores identically across the two nodes and thereby perform fair comparisons of application performance.

Figure 4. Observed latency from any core to any other core in a node. (a) Intel Xeon: 0.43-0.44µs same die, 0.84-0.85µs same processor, 1.63-1.64µs remote processor. (b) AMD Barcelona: 1.20-1.21µs same die/processor, 1.47-1.49µs one HT hop, 1.55-1.56µs two HT hops.
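The paper does not list the MPI latency benchmark itself; the sketch below shows one minimal way such a ping-pong measurement can be structured (the message size, repetition count, and output format are our assumptions, not the authors' benchmark).

    /* Minimal sketch of a core-to-core MPI ping-pong latency test in the
     * spirit of Section 3.2. Run with one rank pinned to each core; rank 0
     * reports the one-way latency to every other rank. */
    #include <stdio.h>
    #include <mpi.h>

    #define REPS 1000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char byte = 0;
        for (int peer = 1; peer < size; peer++) {
            MPI_Barrier(MPI_COMM_WORLD);
            if (rank == 0) {
                double t0 = MPI_Wtime();
                for (int i = 0; i < REPS; i++) {
                    MPI_Send(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                    MPI_Recv(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                }
                double t1 = MPI_Wtime();
                /* half the round-trip time = one-way latency */
                printf("rank 0 -> rank %d: %.2f us\n",
                       peer, 0.5e6 * (t1 - t0) / REPS);
            } else if (rank == peer) {
                for (int i = 0; i < REPS; i++) {
                    MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
        }
        MPI_Finalize();
        return 0;
    }

Repeating the measurement with each rank in turn acting as the source yields the full latency matrix of Figure 4.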
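The binding shim itself is likewise not reproduced in the paper; the following sketch shows how an MPI-rank-to-core mapping can be applied with sched_setaffinity(). The CORE_MAP environment variable and its comma-separated format are illustrative assumptions, not the authors' interface.

    /* Minimal sketch of an MPI-rank-to-core binding shim using
     * sched_setaffinity(), in the spirit of the shim described in Section 3.2.
     * CORE_MAP (a comma-separated list of core ids, one per rank) is an
     * illustrative assumption. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    static int core_for_rank(int rank) {
        char *map = getenv("CORE_MAP");      /* e.g. "0,4,8,12,1,5,9,13,..." */
        if (!map) return rank;               /* default: identity mapping */
        char *copy = strdup(map), *tok = strtok(copy, ",");
        int core = rank;
        for (int r = 0; tok != NULL; r++, tok = strtok(NULL, ",")) {
            if (r == rank) { core = atoi(tok); break; }
        }
        free(copy);
        return core;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core_for_rank(rank), &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)  /* 0 = this process */
            perror("sched_setaffinity");

        /* ... application or benchmark runs here, pinned to one core ... */

        MPI_Finalize();
        return 0;
    }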
4. Application testing process

The suite of applications utilized includes many large-scale production applications that are currently used within the U.S. Department of Energy. The testing for each application consisted of:

1) comparing the performance on a single core of the Barcelona and Xeon;
2) examining the scaling behavior for two cases: (a) using only one processor, and (b) using all 4 processors in a node;
3) determining the configuration (processors and cores/processor) that yields the best performance.

All of the applications are typically run in a weak-scaling mode; that is, the global problem size grows with the number of nodes in the system. All available memory is typically used, either for increased fidelity in the physical simulations or for simulating larger physical systems. Our approach mimics typical usage by fixing the sub-problem per processor no matter how many cores per processor are used, while the global problem grows in proportion to the number of processors used. This can be stated succinctly as doing strong scaling within a processor and weak scaling across processors, as illustrated below.
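As a concrete illustration of this scaling rule, using the SAGE input deck from Table 2 (the per-core figures are derived from the stated methodology rather than quoted directly from the paper):

    cells per processor (fixed)               = 140K
    cells per core at 4 cores per processor   = 140K / 4 = 35K      (strong scaling within a processor)
    global cells when using all 4 processors  = 4 x 140K = 560K     (weak scaling across processors)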
4.1. The application suite

An overview of each application is given below. Each is typically run on high-performance parallel systems utilizing many thousands of processors at a time. A summary of the application input decks used here is given in Table 2. The input decks used are typical of problems processed on large-scale systems.

GTC – Gyrokinetic Toroidal Code, a Particle-in-Cell (PIC) code from Princeton [12]. It was developed to study energy transport in fusion devices.
Milagro – an implicit Monte Carlo (IMC) code for thermal radiative transfer from Los Alamos [5].
Partisn – an SN transport code from Los Alamos [1] that solves the Boltzmann equation using the discrete-ordinates method on structured meshes.
S3D – a high-fidelity 3D simulation of turbulent combustion that includes detailed chemistry. It originates from Sandia National Laboratories [6].
SAGE – an adaptive mesh refinement (AMR) hydrodynamics code used for the simulation of shock waves. Developed jointly by Los Alamos and SAIC [9].
SPaSM – the Scalable Parallel Short-range Molecular dynamics code from Los Alamos, used to study material fracture and deformation properties [13].
Sweep3D – a code kernel from Los Alamos [7,10] that implements deterministic SN transport. The computation is in the form of wavefronts that originate at the corners of a 3-D physical space.
VH1 – the Virginia Hydrodynamics code, which simulates ideal inviscid compressible gas flow and is capable of simulating three-dimensional turbulent stellar flows [3].
VPIC – a Particle-In-Cell code from Los Alamos used to model particle flow within a plasma [4].

Table 2. Summary of the input decks used for each application.

            Input deck   Problem per processor   Problem per node   Memory use per processor   Processing characteristic
  GTC       1Dwedge      6.2M particles          24.8M particles    320MB                      Particle based
  Milagro   Doublebend   0.5M particles          2M particles       50MB                       Particle based with replicated mesh
  Partisn   Pencil       20x10x400               40x20x400          80MB                       Compute intensive
  S3D       typical      50x50x50                100x100x50         140MB                      Memory/compute intensive
  SAGE      timing_h     140K cells              560K cells         280MB                      Memory intensive
  SPaSM     BCC          64x64x64                128x128x64         150MB                      Compute intensive
  Sweep3D   Pencil       20x10x400               40x20x400          8MB                        Kernel, small memory footprint
  VH1       Shock_tube   200x200x200             400x400x200        900MB                      Memory/compute intensive
  VPIC      3D-HOT       4M particles            16M particles      256MB                      SSE optimized, memory intensive

5. Application Performance Analysis

5.1. Comparison of single-core performance

The performance of each application on a single core of both processor types is shown in Figure 5(a). The "time" metric denotes the iteration time of the main computational loop (without I/O) for all applications except Sweep3D and Partisn, which are run for 10 iterations, and VPIC, which is run for 100 iterations, to aid visual comparison. The reduction in application runtime on the Xeon core relative to the Barcelona core is shown in Figure 5(b). A value of 50% indicates that the Tigerton achieves a runtime reduction of 50% (halving the iteration time); similarly, a value of -50% indicates that the Barcelona has a 50% reduction in runtime relative to the Tigerton core.

The advantage of the Tigerton core over the Barcelona core is between 25% and 44%, depending on the application. Recall that the Tigerton has a clock speed of 2.93GHz compared with the Barcelona's 2.0GHz, and that the memory bandwidth for a single core was 3.7GB/s and 4.4GB/s respectively. Hence up to a 50% advantage could be expected for computationally intensive codes executing the same instruction mix over the same number of cycles without memory stalls.

Figure 5. Single-core performance comparison: (a) application iteration time (s); (b) performance advantage of the Xeon core (runtime advantage of Intel vs. AMD, 1 core).
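The advantage metric plotted in Figures 5(b) and 6(b) is described above but not written out in the paper. One definition consistent with that description, where T_B and T_X are the Barcelona and Xeon iteration times (positive when the Xeon is faster, negative when the Barcelona is faster), is:

    advantage =  100 x (T_B - T_X) / T_B    if T_X <= T_B
    advantage = -100 x (T_X - T_B) / T_X    if T_X >  T_B

Under this reading, +50% corresponds to the Xeon halving the Barcelona iteration time, and -50% to the Barcelona halving the Xeon iteration time, as stated above.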
5.2. Comparison of node performance

The best performance of each application on the Xeon node (16 cores) and the Barcelona node (16 cores) is shown in Figure 6. In the same way as for the single-core comparison, Figure 6(a) depicts the iteration time and Figure 6(b) depicts the runtime advantage of the Xeon node over the Barcelona node. Several observations are clear in this comparison: i) the iteration time is lower in all cases compared to using only one core per socket; ii) the runtime advantage of the Xeon is much reduced from the single-core comparison of Section 5.1; and iii) the Xeon node no longer outperforms the Barcelona node for all applications, with the runtime on the Barcelona node lower by as much as 60%.

The results shown in Figure 6 are in line with the differences in the memory bandwidth measurements of Section 3.1. The per-core bandwidth when using all cores within a node is 0.63GB/s on the Xeon and 1.1GB/s on the Barcelona (an advantage of almost a factor of two). The 60% runtime advantage of the Barcelona at the node level is directly in line with this for the memory-intensive applications SAGE and VH1.

Note, however, that the performance presented above is based on the best observed node performance, which does not necessarily result from using all 16 cores in the node. In fact, the best performance observed for VPIC and Partisn was when using 2 cores per processor (8 cores total), and for SAGE when using 3 cores per processor (12 cores total) on the Xeon node. The best performance in all other cases was observed when using all 16 cores. In Section 5.3 below we analyze the performance as a function of the number of cores and number of processors used for each application.

Figure 6. Single-node (16 cores) application performance comparison: (a) application iteration time (s); (b) performance advantage of the Xeon node (runtime advantage of Intel vs. AMD, 16 cores).

5.3. Quad-core application scalability analysis

To analyze the performance of using multiple cores we followed a strict process in which the problem size per processor was constant for all tests, as described in Section 4. For each application we show the performance when using between one and four cores per processor, for the cases of using one processor and all four processors in a node. The performance shown is relative to the single-core performance for each processor type; that is, the speedup when using multiple cores. Figure 7 shows this data for all applications; the legend for all graphs is shown in Figure 7(i).
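The relative-performance values plotted in Figure 7 are not given as a formula in the paper. A reading consistent with the methodology of Section 4 (fixed problem per processor, so the global problem is four times larger when four processors are used) and with the up-to-16x speedups reported is the following sketch:

    speedup(1 socket, c cores)  = T(1 processor, 1 core) / T(1 processor, c cores)        (at most 4 ideally)
    speedup(4 sockets, c cores) = 4 x T(1 processor, 1 core) / T(4 processors, c cores)   (at most 16 ideally)

where the factor of four accounts for the four-times-larger global problem under weak scaling across processors.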
The first observation is that scalability is higher on the Barcelona quad-core than on the Xeon quad-core. This explains why the performance advantage of the Xeon quad-core node is smaller than its advantage for a single core. The applications in Figure 7 are ordered by their observed scalability. The applications with the best scaling behavior on both nodes are Milagro, SPaSM, and Sweep3D (Figures 7(a)-7(c)). Both Milagro and SPaSM are compute bound, and Sweep3D has a small memory footprint resulting in high cache utilization. In contrast, SAGE and Partisn are memory bound and show the slowest scalability on both nodes. Codes that are neither compute nor memory bound, namely VH1, GTC, VPIC, and S3D, scale better on the Barcelona than on the Xeon.

Note that even if an application has a higher speedup on the Barcelona node in comparison to the Xeon node, it does not necessarily have higher performance, as was evident in Figure 6(b).

Figure 7. Application speedup when using multiple cores (relative performance to 1 core versus cores per socket, for one socket and four sockets on each node): (a) Milagro, (b) SPaSM, (c) Sweep3D, (d) VH1, (e) GTC, (f) VPIC, (g) S3D, (h) SAGE, (i) Partisn.
6. Conclusions

Using a suite of applications we have evaluated the first generation of quad-core processors available from AMD and Intel. Data was obtained using a strict measurement methodology that used a shim to control the mapping of application processes to processors, and we compared per-core performance as well as scaling up to the 16 cores in a node. The process followed is directly applicable to other multi-core studies.

When considering the performance of a single core, where there are no memory bottlenecks, the higher clock speed of Intel's Xeon gives applications a measured 25-44% reduction in runtime compared with AMD's Barcelona. When using all of the cores in a processor, the results are more dependent on the way each application uses memory; the Barcelona has the edge in memory bandwidth available on a single processor.

Finally, when examining scaling across the entire 16-core node, the results are somewhat mixed. In general, applications with a small memory footprint perform better on the Xeon and see almost perfect scaling. Memory-bandwidth-intensive applications, on the other hand, scale better on the Barcelona because of the reduced memory contention. For many of the applications, the better scaling behavior results in a higher achievable performance on the Barcelona.

While this study represents a snapshot of current processors and node architectures, it also represents a snapshot of current application structures. All of the applications we ran use the "one MPI rank per core" model. Although this is an extremely portable way to structure an application, it may be possible to gain more performance by exploiting the properties of multi-core processors, such as the fact that physically proximate processes can benefit from sharing cached data. We have shown that, for applications as they exist today, it is important to consider the balance between compute rate and memory rate when selecting a processor from which to build a cluster. Neither the Barcelona nor the Xeon is unambiguously faster than the other. The decision of which to use must be made on a per-application (or per-workload) basis and can benefit from the results we presented in this paper.

Acknowledgements

We thank AMD and Intel for providing early systems for this performance evaluation. This work was funded in part by the Accelerated Strategic Computing program and the Office of Science of the Department of Energy. Los Alamos National Laboratory is operated by Los Alamos National Security LLC for the US Department of Energy under contract DE-AC52-06NA25396.

References

[1] R.S. Baker. "A Block Adaptive Mesh Refinement Algorithm for the Neutral Particle Transport Equation", Nuclear Science & Engineering, 141(1), pp. 1-12, 2002.

[2] A. Hoisie, G. Johnson, D.J. Kerbyson, M. Lang, S. Pakin. "A Performance Comparison Through Benchmarking and Modeling of Three Leading Supercomputers: Blue Gene/L, Red Storm, and Purple", in Proc. IEEE/ACM Conf. on Supercomputing (SC06), Tampa, FL, 2006.

[3] J.M. Blondin. VH-1 User's Guide. North Carolina State University, 1999.

[4] K. Bowers. "Speed optimal implementation of a fully relativistic 3d particle push with charge conserving current accumulation on modern processors", in Proc. 18th Int. Conf. on Numerical Simulation of Plasmas, 2003, p. 383.

[5] T.M. Evans, T.J. Urbatsch. "MILAGRO: A parallel Implicit Monte Carlo code for 3-D radiative transfer (U)", in Proc. of the Nuclear Explosives Code Development Conference, Las Vegas, NV, Oct. 1998.

[6] E.R. Hawkes, R. Sankaran, J.C. Sutherland, J.H. Chen. "Direct numerical simulation of turbulent combustion: fundamental insights towards predictive models", J. of Physics: Conference Series, 16:65-79, 2005.

[7] A. Hoisie, O. Lubeck, H.J. Wasserman. "Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures using Multidimensional Wavefront Applications", Int. J. of High Performance Computing Applications, 14(4), pp. 330-346, 2000.

[8] Intel Corporation. Quad-core Intel Xeon Processor 7300 Series. Product Brief, 2007.

[9] D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M.L. Gittings. "Predictive Performance and Scalability Modeling of a Large-scale Application", in Proc. IEEE/ACM Conf. on Supercomputing (SC01), Denver, CO, 2001.

[10] K.R. Koch, R.S. Baker, R.E. Alcouffe. "Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor", Trans. of the American Nuclear Society, 65:198-199, 1992.

[11] J. McCalpin. "Memory bandwidth and machine balance in current high performance computers", IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19-25, Dec. 1995.

[12] N. Wichmann, M. Adams, S. Ethier. "New Advances in the Gyrokinetic Toroidal Code and Their Impact on Performance on the Cray XT Series", in Proc. Cray User Group (CUG), Seattle, WA, 2007.

[13] S.J. Zhou, D.M. Beazley, P.S. Lomdahl, B.L. Holian. "Large-scale molecular dynamics simulations of fracture and deformation", J. of Computer-Aided Materials Design, 3(1-3), pp. 183-186, 1995.
