It has quad-core Intel Nehalem processors running at 2.67 GHz, with dual-socket nodes and a single Quad Data Rate (QDR) IB link per node to a network that is locally a fat tree with a global 2D mesh.
Each XT4 compute node contains a single quad-core 2.3 GHz AMD Opteron "Budapest" processor, which is tightly integrated with the XT4 interconnect via a Cray SeaStar-2 ASIC through a 6.4 GB/s bidirectional HyperTransport interface.
Each compute node is a Dell PowerEdge 1950 server equipped with two Intel Xeon quad-core 64-bit, 2.66 GHz Harpertown processors, connected to a Dual Data Rate (DDR) InfiniBand network configured as a fat tree.
Amazon EC2 is a virtual computing environment that provides a web services API for launching and managing virtual machine instances. Amazon provides a number of different instance types that have varying performance characteristics. CPU capacity is defined in terms of an abstract Amazon EC2 Compute Unit. One EC2 Compute Unit is approximately equivalent to a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. For our tests we used the m1.large instance type. The m1.large instance type has four EC2 Compute Units (two virtual cores with two EC2 Compute Units each) and 7.5 GB of memory. The nodes are connected with Gigabit Ethernet.
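The m1.large arithmetic above can be checked with a short sketch. The function name `instance_capacity` is hypothetical, and treating EC2 Compute Units as a per-core GHz-equivalent is only a rough reading of the "approximately equivalent" wording; the 1.0-1.2 GHz range and the 2 x 2 layout are the only figures taken from the text.

```python
# Rough sketch of the EC2 Compute Unit (ECU) arithmetic described above.
# ECU_GHZ_RANGE comes from the text; everything else is illustrative.
ECU_GHZ_RANGE = (1.0, 1.2)  # one ECU ~ a 1.0-1.2 GHz 2007 Opteron/Xeon


def instance_capacity(virtual_cores, ecu_per_core):
    """Return (total ECUs, (low, high) approximate GHz-equivalent per core)."""
    total_ecu = virtual_cores * ecu_per_core
    low, high = (ecu_per_core * ghz for ghz in ECU_GHZ_RANGE)
    return total_ecu, (low, high)


if __name__ == "__main__":
    # m1.large: two virtual cores with two ECUs each.
    total, per_core = instance_capacity(virtual_cores=2, ecu_per_core=2)
    print(f"total ECUs: {total}, per-core GHz-equivalent: {per_core}")
```

This only restates the vendor's abstract capacity unit; it says nothing about memory or network performance.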
major differences between the Amazon Web Services environment and that at a typical supercomputing center. For example, almost all HPC applications assume the presence of a shared parallel filesystem between compute nodes, and a head node that can submit MPI jobs to all of the worker nodes.
The head node could submit MPI jobs to all of the worker nodes, and the file server provided a shared filesystem between the nodes.
Targeted: These are microkernels which quantify basic system parameters that separately characterize computation and communication performance.
Proxy apps
P2P vs. all-to-all
Communication vs. computation vs. memory
Small messages vs. large messages
The DGEMM results are as one would expect based on the properties of the CPUs. The STREAM results show that EC2 is significantly faster for this benchmark than Lawrencium. We believe this is because of the particular processor distribution we received for our EC2 nodes for this test.
The network latency and bandwidth results clearly show the difference between the interconnects on the tested systems. The ping-pong results show the latency and the bandwidth with no self-induced contention, while the randomly ordered ring tests show the performance degradation with self-contention. The uncontended latency and bandwidth measurements of the EC2 Gigabit Ethernet interconnect are more than 20 times worse than those of the slowest other machine.
However, for EC2 the less capable network clearly inhibits overall HPL performance, by a factor of six or more. The FFTE benchmark measures the floating-point rate of execution of a double-precision complex one-dimensional discrete Fourier transform, and the PTRANS benchmark measures the time to transpose a large matrix. The performance of both of these benchmarks depends upon the memory and network bandwidth, and they therefore show similar trends. EC2 is approximately 20 times slower than Carver and four times slower than Lawrencium in both cases. The RandomAccess benchmark measures the rate of random updates of memory, and its performance depends on memory and network latency. In this case EC2 is approximately 10 times slower than Carver and three times slower than Lawrencium.
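The split between latency and bandwidth in the ping-pong results can be illustrated with a common first-order cost model, time = latency + size / bandwidth. This model is an assumption introduced here for illustration and is not stated in the paper; the latency and bandwidth values in the example are placeholders, not measured results from any of the four systems.

```python
# First-order message cost model: time = latency + bytes / bandwidth.
# All numeric values used below are hypothetical placeholders.


def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated one-way time in seconds for a single uncontended message."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def effective_bandwidth(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually achieved once latency is included."""
    return message_bytes / transfer_time(
        message_bytes, latency_s, bandwidth_bytes_per_s
    )


if __name__ == "__main__":
    latency = 50e-6      # hypothetical: 50 microseconds
    bandwidth = 100e6    # hypothetical: 100 MB/s peak
    for size in (64, 4 * 1024, 1024 * 1024):
        achieved = effective_bandwidth(size, latency, bandwidth)
        print(f"{size:>8} bytes: {achieved / bandwidth:.1%} of peak bandwidth")
```

Under this model small messages are dominated by the latency term and large messages by the bandwidth term, which is why the two quantities are reported separately.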
GAMESS (2.7 times slower), for this benchmark problem, places relatively little demand upon the network, and therefore is hardly slowed down at all on EC2.
PARATEC shows the worst performance on EC2, 52 times slower than Carver. It performs 3-D FFTs, and the global (i.e., all-to-all) data transposes within these FFT operations can incur a large communications overhead.
Qualitatively, it seems that those applications that perform the most collective communication with the most messages are those that perform the worst on EC2.
relative runtime on EC2 compared to Lawrencium plotted against the percentage communication for each application as measured on Lawrencium. The overall trend is clear: the greater the fraction of its runtime an application spends communicating, the worse the performance is on EC2.
To determine these characteristics we classified the MPI calls of the applications into four categories: small and large messages (latency vs. bandwidth limited) and point-to-point vs. collective. (Note: for the purposes of this work we classified all messages < 4 KB to be latency bound. The overall conclusions shown here contain no significant dependence on this choice.) From this analysis it is clear why fvCAM behaves anomalously; it is the only one of the applications that performs most of its communication via large messages, both point-to-point and collectives.
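The four-way classification described above can be sketched as follows. The 4 KB threshold is the one given in the text; the `COLLECTIVES` set is an illustrative, incomplete list of standard MPI collective names, and the record format (call name, message bytes, seconds) is a hypothetical stand-in for whatever a profiler actually reports.

```python
from collections import Counter

SMALL_MESSAGE_LIMIT = 4 * 1024  # bytes; messages below this count as latency bound

# Illustrative subset of MPI collectives; a real analysis would need the full list.
COLLECTIVES = {
    "MPI_Allreduce", "MPI_Alltoall", "MPI_Bcast",
    "MPI_Reduce", "MPI_Allgather", "MPI_Barrier",
}


def classify(call_name, message_bytes):
    """Place one MPI call into one of the four categories."""
    kind = "collective" if call_name in COLLECTIVES else "point-to-point"
    size = "small" if message_bytes < SMALL_MESSAGE_LIMIT else "large"
    return f"{size} {kind}"


def time_by_category(records):
    """Sum the time spent per category for (call_name, bytes, seconds) records."""
    totals = Counter()
    for call_name, message_bytes, seconds in records:
        totals[classify(call_name, message_bytes)] += seconds
    return totals


if __name__ == "__main__":
    # Hypothetical profile records, not data from the paper.
    sample = [
        ("MPI_Send", 512, 0.10),
        ("MPI_Alltoall", 65536, 0.75),
        ("MPI_Allreduce", 8, 0.25),
    ]
    for category, seconds in sorted(time_by_category(sample).items()):
        print(f"{category}: {seconds:.2f} s")
```

An application whose time falls mostly in the two "large" categories would be bandwidth limited rather than latency limited, which is the distinction the text draws for fvCAM.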
PERFORMANCE ANALYSIS OF HIGH PERFORMANCE COMPUTING APPLICATIONS ON THE AMAZON WEB SERVICES CLOUD
Keith R. Jackson, Lavanya Ramakrishnan, Krishna Muriki, Shane Canon, Shreyas Cholia, Harvey J. Wasserman, Nicholas J. Wright
Lawrence Berkeley National Lab
Presentation by Abhishek Gupta, CS 598 Cloud Computing
GOALS
• Examine the performance of existing cloud computing infrastructures and create a mechanism for their quantitative evaluation
• Build upon previous studies by using the NERSC benchmarking framework to evaluate the performance of real scientific workloads on EC2
• Under the DOE Magellan project: evaluate the ability of cloud computing to meet DOE's computing needs
CONTRIBUTIONS
• Broadest evaluation to date of application performance on virtualized cloud computing platforms
• Experiences with running on Amazon EC2 and the encountered performance and availability variations
• Analysis of the impact of virtualization based on the communication characteristics of the application
• Impact of virtualization through a simple, well-documented aggregate measure that expresses the useful potential of the systems considered
METHODS - MACHINES
• Carver: Quad-core, dual-socket Linux / Nehalem / QDR IB cluster
  o Medium-sized cluster for jobs scaling to hundreds of processors; 3,200 total cores
• Franklin: Cray XT4 Linux environment / Quad-core AMD Opteron / SeaStar interconnect, Lustre parallel filesystem
  o Integrated HPC system for jobs scaling to tens of thousands of processors; 38,640 total cores
METHODS - MACHINES
• Lawrencium: Quad-core, dual-socket Linux / Harpertown / DDR IB cluster
  o Designed for jobs scaling to tens to hundreds of processors; 1,584 total cores
• Amazon EC2
  o m1.large instance type: four EC2 Compute Units (two virtual cores with two EC2 Compute Units each) and 7.5 GB of memory
  o Heterogeneous processor types
METHODS - APPLICATIONS AND BENCHMARKS USED
• High Performance Computing Challenge (HPCC) benchmark suite
  o Consists of seven synthetic benchmarks
  o Targeted synthetics: DGEMM, STREAM, and two measures of network latency and bandwidth
  o Complex synthetics: HPL, FFTE, PTRANS, and RandomAccess
• NERSC 6 Benchmarks
  o Set of applications representative of the NERSC workload
  o Covers the science domains, parallelization schemes, and concurrencies, as well as machine-based characteristics that influence performance such as message size, memory access pattern, and working set sizes
METHODS - NERSC APPLICATIONS
• CAM: The Community Atmospheric Model
  o Lower computational intensity
  o Large point-to-point & collective MPI messages
• GAMESS: General Atomic and Molecular Electronic Structure System
  o Memory access
  o No collectives, very little communication
• GTC: Gyrokinetic Turbulence Code
  o High computational intensity
  o Bandwidth-bound nearest-neighbor communication plus collectives with small data payload
METHODS - NERSC APPLICATIONS
• IMPACT-T: Integrated Map and Particle Accelerator Tracking Time
  o Memory bandwidth & moderate computational intensity
  o Collective performance with small to moderate message sizes
• MAESTRO: A Low Mach Number Stellar Hydrodynamics Code
  o Low computational intensity
  o Irregular communication patterns
• MILC: QCD
  o High computational intensity
  o Global communication with small messages
• PARATEC: PARAllel Total Energy Code
  o Global communication with small messages
RESULTS: APPLICATION PERFORMANCE
Franklin and Lawrencium are 1.4 to 2.6 times slower than Carver.
EC2
• Best case, GAMESS: EC2 is only 2.7 times slower than Carver.
• Worst case, PARATEC: EC2 is more than 50 times slower than Carver.
• Large performance spread caused by the differing demands the applications place on the network.
  o More detailed analysis required
RESULTS: PERFORMANCE ANALYSIS USING IPM
Integrated Performance Monitoring (IPM) framework
• Uses the MPI profiling interface
• Examines the relative amounts of time taken by an application for computing and communicating, and the types of MPI calls made
RESULTS: SUSTAINED SYSTEM PERFORMANCE
SSP: aggregate measure of the workload-specific, delivered performance of a computing system
For each code, measure:
• FLOP counts on a reference system
• Wall-clock run time on various systems
• N chosen to be 3,200
Problem sets drastically reduced
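A minimal sketch of how such an aggregate could be computed from the two measured quantities. It assumes the geometric-mean form of SSP (per-core rate for each code, geometric mean across codes, scaled by N cores); the slide does not spell out the formula, so treat that form, the function names, and the input tuples as assumptions. Only N = 3,200 comes from the slide.

```python
import math


def per_core_rate(flop_count, wall_seconds, cores):
    """FLOP/s per core: reference FLOP count over measured time and cores used."""
    return flop_count / (wall_seconds * cores)


def ssp(runs, n_cores=3200):
    """Assumed SSP form: N times the geometric mean of the per-core rates.

    `runs` is a list of (flop_count, wall_seconds, cores) tuples, one per code.
    """
    rates = [per_core_rate(flops, secs, cores) for flops, secs, cores in runs]
    geometric_mean = math.exp(sum(math.log(rate) for rate in rates) / len(rates))
    return n_cores * geometric_mean


if __name__ == "__main__":
    # Hypothetical inputs for two codes, not measurements from the paper.
    example_runs = [(2.0e13, 400.0, 64), (5.0e12, 250.0, 64)]
    print(f"SSP estimate: {ssp(example_runs) / 1e12:.3f} TFLOP/s")
```

A geometric mean keeps a single very fast or very slow code from dominating the aggregate, which suits a workload-level comparison across systems.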
RESULTS: VARIABILITY
Performance variability across runs
• Non-homogeneous nature of the systems allocated
• Network sharing and contention
• Sharing the un-virtualized hardware
CONCLUSIONS
• EC2 performance degrades significantly as applications spend more time communicating.
• Applications with global, all-to-all communication perform worse than those that mostly use point-to-point communication.
• The amount of variability in EC2 performance can be significant.
DISCUSSION QUESTIONS
• This paper focused on performance alone. What are the performance-cost tradeoffs for different platforms?
• How does the above tradeoff differ with application characteristics such as granularity, communication sensitivity, etc.?
• What is the primary source of performance variability on Amazon EC2?