Performance Evaluation and Energy Efficiency of HPC Platforms Based on Intel, AMD and ARM Processors (Presentation Transcript)

  • Performance Evaluation and Energy Efficiency of HPC Platforms Based on Intel, AMD and ARM Processors. M. Jarus, S. Varrette, A. Oleksiak and P. Bouvry. Poznań Supercomputing and Networking Center; CSC, University of Luxembourg, Luxembourg. EE-LSDS 2013 (PSNC & UL).
  • Summary: 1 Introduction, 2 Context & Motivations, 3 Experimental Setup, 4 Experiments Results, 5 Conclusion
  • Introduction
  • Why High Performance Computing? "The country that out-computes will be the one that out-competes." (Council on Competitiveness)
    → Accelerate research by accelerating computations: 14.4 GFlops (dual-core i7 @ 1.8 GHz) vs. 27.363 TFlops (291 computing nodes, 2944 cores)
    → Increase storage capacity: 2 TB (1 disk) vs. 1042 TB raw (444 disks)
    → Communicate faster: 1 GbE (1 Gb/s) vs. InfiniBand QDR (40 Gb/s)
  • HPC at the Heart of our Daily Life. Today: research, industry, local authorities... Tomorrow: applied research, digital health, nano/bio technologies.
  • HPC Evolution towards Exascale: major investments since 2012 to build an Exascale platform by 2019, with > 1.5 G$ for each leading country (US, EU, Russia, etc.)
                          2010       2015      2020
    Power                 6 MW       15 MW     20 MW
    #Nodes                18,700     50,000    100,000
    Node concurrency      12         1,000     10,000
    Interconnect BW       1.5 GB/s   1 TB/s    2 TB/s
    MTTI                  Day        Day       Day
    ⇒ Max power consumption: 0.1 W per core
  • Current Leading Processor Technologies
    Top500 count   Model                               Max. TDP   Per core
    225 (45%)      Intel Xeon X5650 6C 2.66 GHz        85 W       14.1 W/core
    134 (26.8%)    Intel Xeon E5-2680 8C 2.7 GHz       130 W      16.25 W/core
    61 (12.2%)     AMD Opteron 6200 16C "Interlagos"   115 W      7.2 W/core
    53 (10.6%)     IBM Power BQC 16C 1.6 GHz           65 W       4.1 W/core
    → Alternative low-power processor architectures are required:
      1 GPGPU accelerators (Nvidia Tesla cards / IBM PowerXCell 8i)
      2 the mobile and embedded devices market (ARM, Intel Atom)
    ⇒ Can low-power processors really suit HPC?
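    The W/core column in the table above is simply the maximum TDP divided by the core count; a trivial, illustrative check (not part of the original slides):

    ```python
    # Per-core TDP = max TDP / number of cores, reproducing the table above.
    processors = {
        "Intel Xeon X5650":   (85, 6),    # (max TDP in W, cores)
        "Intel Xeon E5-2680": (130, 8),
        "AMD Opteron 6200":   (115, 16),
        "IBM Power BQC":      (65, 16),
    }
    for name, (tdp_w, cores) in processors.items():
        print(f"{name:20s} {tdp_w / cores:5.2f} W/core")
    ```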
  • Context & Motivations
  • The Mont Blanc Project: EU project, started October 2011.
    Objective: develop an ARM-based Exascale HPC platform using 15 to 30 times less energy.
    Current status: the Tibidabo cluster
    → based on the NVidia Tegra2 SoC (1 ARM Cortex-A9 dual-core @ 1 GHz)
    → 8 Q7 boards (1 GbE) per blade; total: 128 nodes (38U)
    → interconnect: minimalistic tree network based on 1 GbE switches
    → measured performance: 120 MFlops/W
    Other state-of-the-art projects:
    → EuroCloud project (http://www.eurocloudserver.com/): energy-conscious 3D Server-on-Chip for green cloud
  • [Low-Power] HPC Platforms @ PSNC & UL
    Name       Location   Size   #cpus   RAM      Processor                           Max TDP/proc.
    i7         PSNC       1U     18      288 GB   Intel Core i7-3615QE @ 2.3 GHz 8C   45 W   (5.63 W/c)
    atom64     PSNC       1U     18      36 GB    Intel Atom N2600 @ 1.6 GHz 2C       3.5 W  (1.75 W/c)
    amdf       PSNC       1U     18      72 GB    AMD Fusion G-T40N @ 1 GHz 2C        9 W    (4.5 W/c)
    bull-bcs   UL         8U     16      1 TB     Intel Xeon E7-4850 @ 2 GHz 10C      130 W  (13 W/c)
    viridis    UL         2U     48      192 GB   ARM Cortex A9 @ 1.1 GHz 4C          1.9 W  (0.48 W/c)
    Objective: compare the performance of cutting-edge high-density HPC platforms:
    → low-power platforms (atom64, amdf and viridis) vs.
    → pure computing-efficient platforms (i7 and bull-bcs)
  • Experimental Setup
  • Considered Benchmarks
    → Phoronix Test Suite, stressing system-wide components (disk, RAM or CPU): C-ray, Hmmer, Pybench
    → CPU performance (single-threaded): CoreMark, Fhourstones, Whetstone, Linpack
    → MPI performance: OSU Micro-Benchmarks (osu_get_latency & osu_get_bw only)
    → HPC performance: High-Performance Linpack (HPL), which solves a dense linear system of order N, A x = b, using Gaussian elimination with partial pivoting; the N by N + 1 coefficient matrix is split into NB × NB blocks distributed over a two-dimensional P × Q grid of processes (a single-node sketch of this computation follows below).
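    HPL itself is a distributed MPI code; the following is only a minimal single-node sketch of the computation it performs (an LU solve with partial pivoting plus a residual check) on a toy problem size, and is not the HPL implementation:

    ```python
    # Minimal single-node illustration of what HPL computes: solve a dense
    # linear system A x = b via LU factorization with partial pivoting.
    # This is NOT the distributed HPL code (which splits the matrix into
    # NB x NB blocks over a P x Q process grid); it only shows the math.
    import time
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    N = 2000                                   # toy size; the HPL runs above use tens of thousands
    rng = np.random.default_rng(0)
    A = rng.standard_normal((N, N))
    b = rng.standard_normal(N)

    t0 = time.perf_counter()
    lu, piv = lu_factor(A)                     # Gaussian elimination with partial pivoting
    x = lu_solve((lu, piv), b)
    elapsed = time.perf_counter() - t0

    flops = (2.0 / 3.0) * N**3 + 2.0 * N**2    # approximate HPL operation count
    print(f"residual ||Ax - b|| = {np.linalg.norm(A @ x - b):.2e}")
    print(f"~{flops / elapsed / 1e9:.2f} GFlops on this toy problem")
    ```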
  • Performance Measurements. On a given platform, 100 runs per benchmark, each with the following operations:
    1. t0: [fix the CPU frequency] & start system/power monitoring
    2. t0 + Δs: start the selected benchmark
    3. t1: benchmark finishes execution
    4. t1 + Δs: end of monitoring
    PpMHz (Performance per MHz): impact of the CPU frequency on the final benchmark results. Frequencies tested:
    → i7: 1.2 GHz → 2.3 GHz → 2.31 GHz (Turbo Mode)
    → atom64: 0.6 GHz → 1.6 GHz
    → amdf: 0.8 GHz → 1 GHz
    → bull-bcs: 1.064 GHz → 1.995 GHz → 1.996 GHz (Turbo Mode)
    → viridis: 1.1 GHz
    (A sketch of such a measurement loop is shown below.)
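    As an illustration of this measurement loop, a minimal sketch (not the authors' actual scripts): the benchmark is assumed to be launched as an external command, and read_power() is a hypothetical placeholder for whatever power meter the platform exposes (IPMI, PDU, external meter).

    ```python
    import subprocess
    import threading
    import time

    DELTA_S = 5              # settling margin around the benchmark run (seconds)
    samples = []             # (timestamp, watts) pairs collected by the monitor

    def read_power():
        # Hypothetical placeholder: replace with an IPMI/PDU/meter query returning watts.
        return 0.0

    def monitor(stop_event, period=1.0):
        while not stop_event.is_set():
            samples.append((time.time(), read_power()))
            time.sleep(period)

    def run_once(benchmark_cmd):
        stop = threading.Event()
        t = threading.Thread(target=monitor, args=(stop,))
        t.start()                      # t0: start system/power monitoring
        time.sleep(DELTA_S)            # t0 + Δs: start the selected benchmark
        t_start = time.time()
        subprocess.run(benchmark_cmd, check=True)
        t_end = time.time()            # t1: benchmark finished execution
        time.sleep(DELTA_S)            # t1 + Δs: end of monitoring
        stop.set()
        t.join()
        return t_end - t_start, list(samples)
    ```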
  • Performance Measurements. PpW (Performance per Watt): the raw benchmark result divided by
    → (official) the average power draw (W), or
    → (better) the energy consumed (J).
    Different results are obtained with different CPU frequency values; the PpW metrics are reported for the highest frequency value.
    Technical details:
    → viridis: power measurements available only per group of 4 nodes
    → bull-bcs: high latency between measurements (slow IPMI), 40 s minimum
    → atom64: erratic sensor readings
    (See the small helper below for the two variants of the metric.)
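    A minimal helper showing the two PpW variants mentioned above (illustrative only; the numbers in the example are made up):

    ```python
    def perf_per_watt(result, avg_power_w):
        """'Official' PpW: raw benchmark result divided by the average power draw (W)."""
        return result / avg_power_w

    def perf_per_joule(result, energy_j):
        """Arguably better: raw result divided by the energy consumed (J), which also
        accounts for how long the run took."""
        return result / energy_j

    # Illustrative numbers only: a run scoring 1000 points at 40 W average for 120 s (4800 J).
    print(perf_per_watt(1000, 40.0))     # 25.0  points per watt
    print(perf_per_joule(1000, 4800.0))  # ~0.208 points per joule
    ```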
  • Experiments Results
  • CPU Performance (single-threaded). [Figure: raw benchmark results (log scale) for CoreMark, Fhourstones, Whetstone and Linpack on the Intel Core i7, AMD G-T40N, Intel Atom N2600, Intel Xeon E7 and ARM Cortex A9.]
    Best results are obtained by the Intel Core i7, then the Intel Xeon E7; AMD, ARM and Atom achieve comparable results.
  • OSU MPI Benchmark 3.8 Results. [Figure: OSU One Sided MPI Get Latency Test v3.8; latency in µs (log scale, the lower the better) vs. packet size (log scale) for CoolEmAll Atom64, CoolEmAll AMDF, CoolEmAll i7, Viridis ARM and BullX BCS.]
  • OSU MPI Benchmark 3.8 Results. [Figure: OSU One Sided MPI Get Bandwidth Test v3.8; bandwidth in MB/s (log scale, the higher the better) vs. packet size (log scale) for BullX BCS, Viridis ARM, CoolEmAll AMDF, CoolEmAll i7 and CoolEmAll Atom64.]
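    The OSU tests above use one-sided MPI_Get operations; the sketch below is a plain two-sided ping-pong (mpi4py) that only illustrates how per-message latency and bandwidth are derived from message size and round-trip time. It is not the OSU benchmark code.

    ```python
    # Run with: mpirun -np 2 python pingpong.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    REPS = 1000

    for size in [2**k for k in range(0, 21)]:        # 1 B .. 1 MiB messages
        buf = np.zeros(size, dtype='b')
        comm.Barrier()
        t0 = MPI.Wtime()
        for _ in range(REPS):
            if rank == 0:
                comm.Send([buf, MPI.BYTE], dest=1)
                comm.Recv([buf, MPI.BYTE], source=1)
            else:
                comm.Recv([buf, MPI.BYTE], source=0)
                comm.Send([buf, MPI.BYTE], dest=0)
        dt = MPI.Wtime() - t0
        if rank == 0:
            latency_us = dt / (2 * REPS) * 1e6       # one-way latency estimate
            bandwidth = size * 2 * REPS / dt / 1e6   # total bytes moved / total time, MB/s
            print(f"{size:8d} B  {latency_us:10.2f} us  {bandwidth:10.2f} MB/s")
    ```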
  • OSU MPI Power Measures. [Figure: power usage over time during the OSU 3.8 latency and bandwidth tests, run on 2 nodes of each platform.]
    → CoolEmAll i7: latency test 4442 J in 111 s, bandwidth test 3093 J in 75 s
    → CoolEmAll AMDF: latency test 3642 J in 102 s, bandwidth test 2816 J in 77 s
    → CoolEmAll Atom64: latency test 7555 J in 275 s, bandwidth test 4806 J in 172 s
    → Viridis ARM: latency test 499 J in 45 s, bandwidth test 526 J in 45 s
  • HPL 2.1 Benchmark Results, single-node runs (i7). [Figure: performance (GFlops) vs. problem size N ∈ {36609, 38897, 41185} for NB ∈ {48, 96, 128, 160}, P×Q = 2×4, plus power usage over time for the best run (N = 41185, P = 2, Q = 4), energy 43976 J.]
  • HPL 2.1 Benchmark Results, single-node runs (amdf). [Figure: performance (GFlops) vs. problem size N ∈ {18413, 19496, 19929} for NB ∈ {96, 128, 160}, P×Q = 1×2, plus power usage over time for the best run (N = 19496, P = 1, Q = 2), energy 57912 J.]
  • HPL 2.1 Benchmark Results, single-node runs (atom64). [Figure: performance (GFlops) vs. problem size N ∈ {12086, 12891, 13697} for NB ∈ {48, 64, 96, 112, 128}, P×Q = 2×2, plus power usage over time for the best run (N = 12891, P = 2, Q = 2), energy 20671 J.]
  • HPL 2.1 Benchmark Results, single-node runs (viridis). [Figure: performance (GFlops) vs. problem size N ∈ {17259, 18410, 19560, 20711} for NB ∈ {64, 96, 112, 128}, P×Q = 2×2, plus power usage over time for the best run (N = 20711, P = 2, Q = 2), energy 9983 J.]
  • HPL Power Measures, full-platform runs. [Figure: power usage over time.]
    → i7: 18 nodes, N = 174733, NB = 96, P×Q = 12×12, energy 6338465 J
    → amdf: 16 nodes, N = 77984, NB = 160, P×Q = 4×8, energy 4818744 J
  • HPL Power Measures, full-platform runs. [Figure: power usage over time.]
    → atom64: 18 nodes, N = 54692, NB = 112, P×Q = 8×9, energy 1994357 J
    → bull-bcs: 1 node, N = 87920, NB = 112, P×Q = 10×16, energy 50363322 J
  • HPL Benchmark Results
    Best HPL results (single-node runs):
    Name      #cpu   Rpeak    N        NB    P   Q    Time [s]    GFlops   Effic.   Energy [J]
    i7        1      73.6     41185    96    2   4    1175.15     39.63    53.85%   43976
    amdf      1      8        19496    160   1   2    3071.36     1.609    25.14%   57912
    atom64    1      6.4      12891    112   2   2    1491.84     0.9575   11.97%   20671
    bcs       1      80       n/a
    viridis   1      4.4      20711    96    2   2    1840.87     3.218    73.14%   9983
    Full-platform runs:
    Name      #nodes   Rpeak    N        NB    P    Q    Time [s]    GFlops   Effic.   Energy [J]
    i7        18       1324.8   174733   96    12   12   7867.53     452.1    34.25%   6338465
    amdf      16*      128      77984    160   4    8    16770.58    18.85    18.41%   4818744
    atom64    18       115.2    54692    112   8    9    8547.09     12.76    8.86%    1994357
    bcs       1        1280     87920    112   10   16   15115.57    1072     83.75%   50363322
    viridis   12*      52.8     63774    96    6    8    5090        34.39    65.14%   n/a
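    The MFlops/W figures quoted in the conclusion appear consistent with GFlops divided by the average power draw (energy/time) of the full-platform runs; a quick consistency check under that assumption:

    ```python
    # Assumption: MFlops/W ≈ GFlops / average power, with average power taken as
    # energy / wall-clock time from the full-platform table above.
    runs = {
        #          GFlops   energy [J]   time [s]
        "i7":     (452.1,   6_338_465,   7867.53),
        "amdf":   (18.85,   4_818_744,   16770.58),
        "atom64": (12.76,   1_994_357,   8547.09),
        "bcs":    (1072.0,  50_363_322,  15115.57),
    }
    for name, (gflops, energy_j, time_s) in runs.items():
        avg_power_w = energy_j / time_s
        mflops_per_w = gflops * 1000 / avg_power_w
        print(f"{name:8s} avg power ~ {avg_power_w:7.1f} W, ~ {mflops_per_w:6.1f} MFlops/W")
    # This lands close to the conclusion slide's figures (roughly 561 MFlops/W for i7,
    # 55 for atom64, 66 for amdf and 322 for bcs).
    ```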
  • Performance per MHz. [Figure: PpMHz (log scale) for OSU latency, OSU bandwidth, HPL, HPL Full, CoreMark, Fhourstones, Whetstone and Linpack on the Intel Core i7, AMD G-T40N, Atom N2600, Intel Xeon E7 and ARM Cortex A9.]
    PpMHz values remain quite constant under varying CPU frequencies; bull-bcs outperforms the others in all HPC-oriented tests.
  • Energy Efficiency. [Figure: energy consumed (J, log scale) for OSU latency, OSU bandwidth, HPL, HPL Full, C-ray, Hmmer and Pybench on the five processors.]
    The ARM Cortex A9 is almost always the most energy-efficient CPU; the Intel Xeon E7 requires much more energy to execute the same application.
  • Conclusion
  • The path to Exascale requires alternative low-power processor architectures; the most promising direction builds on mobile and embedded devices, as in the ARM-based HPC cluster prototypes of the Mont Blanc project (Tibidabo cluster: 128 nodes, 38U, 120 MFlops/W).
    Here: performance of cutting-edge high-density HPC platforms, namely the CoolEmAll RECS platform @ PSNC and Boston Viridis & Bull BCS @ UL.
    There is room for improvement, yet these platforms definitely suit HPC environments. Best obtained results:
    Name       Processor           MFlops/W   Green500 rank*
    viridis    ARM Cortex A9       595.93     130
    i7         Intel Core i7       565.13     133
    bull-bcs   Intel Xeon E7       324.85     186
    atom64     Intel Atom N2600    55.00      476
    amdf       AMD Fusion G-T40N   65.45      467
    * Based on the November 2012 list, http://www.green500.org/
  • Thank you for your attention. http://hpc.uni.lu and http://www.man.poznan.pl/
  • CoolEmAll RECS platform @ PSNC: 35 kW, 1U, up to 18 nodes in a single enclosure; 3 enclosure units (3U) available @ PSNC:
    → i7: Intel i7-3615QE @ 2.3 GHz, 4C HT, TB, 45 W TDP
    → atom64: Intel Atom N2600 @ 1.6 GHz, 2C HT, 3.5 W TDP
    → amdf: AMD G-T40N @ 1.0 GHz, 2C HT, 9 W TDP
  • Boston Viridis platform @ UL: 300 W, 2U, 10 GbE interconnect, 48 ultra-low-power SoCs
    → ARM Cortex A9 processors @ 1.1 GHz, 4C HT, 1.9 W TDP
  • BullX BCS (4× S6030) platform @ UL: 8U, aggregation of 4× BullX S6030 in a single SMP node
    → 4×4 Intel Xeon E7-4850 @ 2 GHz, 10C HT, TB, 130 W TDP