Trends & challenges in supercomputing (for EITA-EITC 2012)
Slides for EITA-EITC 2012 @ University of Toronto, Aug 16-17, 2012 http://www.eitc.org/conferences/eita-eitc-2012


Presentation Transcript

    • Trends and Challenges in Supercomputing August 2012
    • Outline The Need for Supercomputers Top500 & Green500 lists Trends  Multi-core, many-core  Accelerators (GPU, FPGA,..) Challenges  Power Consumption  Software  Reliability  R&D Cost CCR’s contribution to the HPC community
    • Three Pillars of Science: Theory, Experiment, Simulation
    • The Need for Supercomputers. Paradigm shift: data- and simulation-driven science.
      - Aerospace/Engineering: computational fluid dynamics, engine design, ...
      - Astrophysics: N-body, dark energy, dark matter, ...
      - Climate/Geophysics: weather forecast, global warming, earthquakes, ...
      - Energy: reactor design, fission/fusion reactions
      - Life Science: bioinformatics, proteomics, drug design, ...
      - Materials/Chemistry: multi-scale modeling (atom → molecule → genome → cell, ...); renewable energy, catalysts, fuel cells, efficient combustion, ...
    • The Need for Even More Powerful Supercomputers: Grand Challenge Problems
      Application                Required performance (PetaFLOPS)
      Automotive Development     0.1
      Human Vision Simulation    0.1
      Aerodynamics Analysis      1 (achieved in 2008)
      Laser Optics               10 (we are almost here!)
      Molecular Dynamics         20
      Aerodynamic Design         1,000 (= 1 ExaFLOPS, the next milestone; hopefully 2018)
      Computational Cosmology    10,000
      Turbulence in Physics      100,000
      Computational Chemistry    1,000,000
      (Source: Steven Chen, “Neuroscience, Bio-system science, Third-Brain and Bio-Supercomputing”)
      1 PetaFLOPS = 10^15 floating-point operations per second; your home computer ~ 0.00001 PetaFLOPS
    • Taiwan’s No. 1 Supercomputer
    • Taiwan’s No. 1 Supercomputer
    • USA’s No. 1: Sequoia
    • In Taiwan: international collaboration with IBM Thomas J. Watson Research Center. Cost: $2.5 million. 0.4 PetaFLOPS peak performance. (Feb 3, 2012)
    • Japan’s No. 1: K Computer
    • K Computer
    • K Computer
    • K Computer: the Next-Generation Supercomputer Site is on Port Island, Kobe, about 280 miles west of Tokyo; 5 km from Sannomiya Station and 12 min. by Portliner Monorail, near Kobe Airport and the Kobe Sky Bridge. [Map labels: Mt. Rokko, Shinkansen line, Shin-Kobe Station, Ashiya City. Photo: June 2006]
    • K Computer in Taiwan: expected to be fully delivered by December 2014; 1 PetaFLOPS peak performance; 100 times faster than the existing system. (June 24, 2012)
    • China’s No. 1: Tianhe-1A. TOP500 rank: 5. Location: Tianjin NSC (天津超算中心). Maker: NUDT (國防科大). CPU-GPU hybrid.
    • China’s No. 1: 天河1A
    • The TOP500 List
      #1   Sequoia @ LLNL (IBM): 16.3 PFLOPS, 81% LINPACK efficiency, 1.57M cores, 2069 MFLOPS/W
      #2   K Computer (京) @ RIKEN (理化研) (Fujitsu): 10.5 PFLOPS, 93%, 705K cores, 830 MFLOPS/W
      #3   Mira @ Argonne NL (IBM): 8.2 PFLOPS, 81%, 786K cores, 2069 MFLOPS/W
      #4   SuperMUC @ LRZ (IBM): 2.9 PFLOPS, 91%, 147K cores, 846 MFLOPS/W
      #5   Tianhe-1A (天河) @ NSCC Tianjin (天津超算中心) (NUDT, 國防科大): 2.6 PFLOPS, 55%, 186K cores, 635 MFLOPS/W
      #6   Jaguar @ Oak Ridge NL (Cray): 1.9 PFLOPS, 74%, 298K cores, 377 MFLOPS/W
      #7   Fermi @ CINECA (IBM): 1.7 PFLOPS, 82%, 164K cores, 2099 MFLOPS/W
      #8   JuQUEEN @ FZJ (IBM): 1.4 PFLOPS, 82%, 131K cores, 2099 MFLOPS/W
      #9   Curie @ CEA (Bull): 1.3 PFLOPS, 82%, 77K cores, 603 MFLOPS/W
      #10  Nebulae (星雲) @ NSCC Shenzhen (深圳超算中心) (Dawning, 曙光): 1.3 PFLOPS, 43%, 121K cores, 493 MFLOPS/W
      #98  Windrider/ALPS (御風者) @ NCHC (國網中心) (Acer): 0.18 PFLOPS, 76%, 26K cores, 400 MFLOPS/W
    • The Green500 List
      #1-20  2100 MFLOPS/W, 81% LINPACK efficiency: (various)
      #21    1380 MFLOPS/W, 66%: Discovery @ Intel
      #22    1380 MFLOPS/W, 47%: DEGIMA (出島) @ Nagasaki Univ. (長崎大學)
      #23    1266 MFLOPS/W, 56%: Barcelona Supercomputing Center
      #24    1152 MFLOPS/W, 54%: HA-PACS @ Univ. of Tsukuba (筑波大學)
      #25    1050 MFLOPS/W, 48%: Moonlight @ Los Alamos NL
      #62    810 MFLOPS/W: NCHC (國網中心)
      #111   400 MFLOPS/W, 76%: Windrider/ALPS (御風者) @ NCHC (國網中心)
    • Top 50 Systems
    • Trends in the TOP500 List
    • Trends in the TOP500 List Countries/Systems Share
    • Trends in the TOP500 List MultiCore-ness
    • Trends in the TOP500 List 11% have GPU/Accelerator
    • MultiCore: the end of the GigaHertz game. Source: D. Patterson & J. Dongarra
    • Why MultiCore? Do the math! Power ∝ Voltage² × Frequency and Frequency ∝ Voltage, so Power ∝ Frequency³.
      Configuration         Cores  V      Freq   Perf  Power  Perf/Power
      Single core           1      1      1      1     1      1
      Faster single core    1X     1.5X   1.5X   1.5X  3.3X   0.45X
      Multi core (winner)   2X     0.75X  0.75X  1.5X  0.8X   1.88X
      Source: D. Patterson & J. Dongarra
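      The arithmetic behind this table can be checked with a few lines of C; this sketch assumes ideal scaling (performance = cores x frequency) and the Power ∝ Frequency³ relation above:

        #include <stdio.h>

        /* Power ~ V^2 * f and f ~ V  =>  Power ~ f^3 (per core).
           Performance is modeled as cores * frequency (ideal scaling). */
        static void scenario(const char *name, double cores, double freq) {
            double perf  = cores * freq;
            double power = cores * freq * freq * freq;
            printf("%-20s perf=%.2f  power=%.2f  perf/power=%.2f\n",
                   name, perf, power, perf / power);
        }

        int main(void) {
            scenario("single core",        1.0, 1.0);   /* 1.00, 1.00, 1.00 */
            scenario("faster single core", 1.0, 1.5);   /* 1.50, 3.38, 0.44 */
            scenario("two slower cores",   2.0, 0.75);  /* 1.50, 0.84, 1.78 (the slide rounds power to 0.8, giving 1.88) */
            return 0;
        }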
    • MultiCore  Moore’s Law # transistors doubles every 2 years.  Moore’s Law Reinterpreted # cores per chip doubles every 2 years, while clock rate decreases. Gordon Moore # threads of execution Chairman Emeritus, Intel doubles every 2 years.Source: D. Patterson & J. Dongarra
    • What’s Next ?
    • ManyCore: MultiCore is great, but you cannot keep replicating a full modern CPU core forever. Besides, do you really need ALL of its features? Remember the 80/20 rule: a ManyCore chip uses many simple cores that are little more than + − × ÷ units. Source: A. Chien
    • ManyCore & Accelerators
      - The Graphics Processing Unit (GPU) is the most popular ManyCore accelerator. Why? Image rendering is already massively parallel, and GPUs are cheap (~80M cards sold per year).
      - Other competing technologies: Intel Xeon Phi (Many-Integrated-Core), Tilera TILE64, FPGA (Convey, SRC, ...)
      - They are just “accelerators”; a CPU is still needed. A hybrid system pairs a MultiCore CPU with a ManyCore accelerator to balance performance and power.
    • Accelerators
    • Alternatively, simplify: instead of augmenting a feature-rich CPU with a GPU core, use many lightweight CPU cores. Blue Gene: a compute node (x32) → a node board (x16) → a rack (512 nodes!)
    • Blue Gene Q node board
    • ARM-based Supercomputers? Project Moonshot: 1.1-1.4 GHz quad-core Calxeda SoC; 72 CPUs (288 cores) per 4U chassis; each CPU has up to 4 GB RAM.
    • ARM-based Supercomputers? Project Copper: 1.6 GHz quad-core Marvell Armada XP SoC; 48 CPUs (192 cores) per 3U chassis; each CPU has up to 8 GB RAM; one SATA hard drive per CPU.
    • CPU vs. GPU: heavyweight CPUs, lightweight CPUs, and GPU/accelerators compared
      Columns: Intel Xeon E5 | AMD Interlagos | IBM POWER7 | SPARC64 VIIIfx | Blue Gene/Q | nVidia M2090 | AMD Firestream 9370 | Intel Xeon Phi
      Process (nm):      32 | 32 | 45 | 45 | 45 | 40 | 40 | 22 (winner)
      Cores:             8 | 16 | 8 | 8 | 16+2 | 32 | 20 | 61
      Clock (GHz):       2.7 | 2.4 | 4 | 2 | 1.6 | 1.3 | 0.825 | 1.3
      Cache (MB):        20 | 16 | 32 | 6 | 32 | 0 | 0 | 32
      GFLOPS per core:   10.8 | 9.6 | 32 | 16 | 12.8 | - | - | -
      GFLOPS per chip:   86 | 154 | 256 | 128 | 205 | 665 | 528 | >1024 (winner)
      Power (W):         130 | 115 | 200 | 58 (winner) | <100 | 225 | 225 | 300
      MFLOPS/W:          664 | 1336 | 1280 | 2207 | >2100 | 2955 (winner) | 2347 | >2930
      Street price ($):  1723 | 1130 | 2400 | 1500 | ?
    • Future Challenge: Exascale (10^18). U.S. DoE’s guesstimate (2009); where two values are given, they correspond to two design points (Blue Gene-style vs. CPU-GPU hybrid).
      System attribute             2010     2015                2018
      Performance (PFLOPS)         2        200                 1000
      Power (MW)                   6        15                  20
      Memory (PB)                  0.3      5                   32-64
      Node performance (TFLOPS)    0.125    0.5 / 7             1 / 10
      Node memory BW (GB/s)        25       100 / 1000          400 / 4000
      Node concurrency             12       O(100) / O(1000)    O(1000) / O(10000)
      System size (nodes)          187K     50K / 5K            1M / 100K
      Node network BW (GB/s)       1.5      150 / 1000          250 / 2000
      Storage (PB)                 15       100                 500-1000
      MTTI                         Days                         O(1 day)
      Cost: $200 Million
    • Exascale System Requirements A User’s Perspective Source: Oak Ridge “Preparing for Exascale” study, 2009
    • The Power & Energy Challenge: a TeraFLOP machine today (not to scale) draws ~200 W for compute, ~150 W for memory, ~100 W for the interconnect, ~100 W for disk, and ~4,550 W for cooling, power-supply inefficiency, instruction decode & control, translations, and so on: roughly 5,000 W in total. At full-system scale that becomes ~10 MW ($10 million to operate per year), enough to power ~10,000 homes! Source: S. Borkar of Intel
    • The Power & Energy Challenge: with today’s technologies, 1 ExaFLOPS would need on the order of 500 MW (roughly 100 MW compute, 150 MW memory, 100 MW network, 200 MW disk, plus other overheads). Reaching the ~20 MW target requires improvements of roughly 40X in compute, 75X in memory, 20X in network, 33X in disk, and 900X elsewhere, bringing each part down to a few MW:
      - Memory: new memory interfaces (3D chip stacking and vias); replace DRAM with zero-power non-volatile RAM
      - Processor: reduce data movement; domain/core power gating and aggressive voltage scaling
      - Interconnect: denser interconnect on package; replace copper with optics
      - Storage: solid-state drives; phase-change RAM
      - Power efficiency: higher operating-temperature tolerance; power supply & cooling efficiency
      Source: Intel
    • Software Challenges Source: 2012 NCSI Parallel & Cluster Computing Workshop
    • Three-Legged Race
    • More legs
    • Even more legs
    • Scalability: Amdahl’s law (1967): Speedup = N / (N(1-f) + f), where N = # of processors and f = fraction of the work that can be parallelized. f = 1 → perfect speedup, but for f < 1 the speedup curve flattens out as N grows. It’s all about scalability!
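      A small C illustration of the formula; the core count and the values of f below are illustrative, not taken from the slide:

        #include <stdio.h>

        /* Amdahl's law: speedup = N / (N*(1-f) + f),
           N = number of processors, f = parallelizable fraction of the work. */
        static double amdahl(double f, double n) {
            return n / (n * (1.0 - f) + f);
        }

        int main(void) {
            const double n = 100000.0;                          /* a 100,000-core machine */
            const double fs[] = { 0.5, 0.9, 0.99, 0.999, 1.0 };
            for (int i = 0; i < 5; ++i)
                printf("f=%.3f  speedup on %.0f cores = %.1f\n", fs[i], n, amdahl(fs[i], n));
            /* Even f = 0.999 caps the speedup near 1000x, nowhere near 100000x. */
            return 0;
        }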
    • Extremely Scalable Codes (all results reported on Jaguar: 224K cores, 2.33 PetaFLOPS peak performance)
      Science area    Code       Cores (K)  Perf. (PetaFLOPS)  Languages & libraries
      Materials       DCA++      213        1.9                C++, MPI
      Materials       WL-LSMS    223        1.8                Fortran, C/C++, MPI
      Chemistry       NWChem     224        1.4                Fortran, ARMCI
      Materials       DRC        186        1.3                Fortran, MPI
      Nano Materials  OMEN       223        >1                 C, MPI
      Biomedical      MoBo       197        0.78               C++, MPI
      Chemistry       MADNESS    140        0.55               C++, MPI, PThreads
      Materials       LS3DF      147        0.44               Fortran, MPI
      Seismology      SPECFEM3D  150        0.17               Fortran, MPI
      Combustion      S3D        147        0.08               Fortran, MPI
      Weather         WRF        150        0.05               Fortran, MPI
      How did they do it? Weak scaling: use a larger problem size! Gustafson’s law (1988): Speedup = (1-f) + Nf, where N = # of processors and f = fraction of the work that can be parallelized (compared with Amdahl’s law in the sketch below).
      Source: T. Schulthess. Moral: large systems are for large problems!
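      A short C comparison of the two laws at a Jaguar-scale core count; the value of f is illustrative, not a measured number:

        #include <stdio.h>

        /* Gustafson's law (weak scaling):  speedup = (1-f) + N*f
           Amdahl's law  (strong scaling):  speedup = N / (N*(1-f) + f) */
        int main(void) {
            const double f = 0.999;     /* parallel fraction (illustrative) */
            const double n = 224000.0;  /* Jaguar-scale core count from the table above */
            double weak   = (1.0 - f) + n * f;
            double strong = n / (n * (1.0 - f) + f);
            printf("N=%.0f f=%.3f  weak-scaling speedup=%.0f  strong-scaling speedup=%.0f\n",
                   n, f, weak, strong);
            /* Growing the problem with the machine (weak scaling) keeps all the cores busy. */
            return 0;
        }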
    • Computation Kernels
    • Computation Kernels: Professor Colella’s “Dwarfs”. Important application areas and computation kernels identified by Oak Ridge’s 2009 Exascale Study: Molecular Physics, Nano Science, Climate, Environment, Combustion, Fusion, Nuclear Energy, Astrophysics, Nuclear Physics, Accelerator, QCD, Aerodynamics.
    • GPU Challenges: although the GPU is powerful, keeping it busy is not easy! [Plot: LINPACK efficiency (%) vs. system rank in the TOP500 list; the K Computer sits near the top, while CPU-GPU hybrids such as Tianhe-1A and Gigabit Ethernet systems cluster at much lower efficiency.]
    • GPU Performance Bottleneck: why? Too much time is spent on data movement! Main RAM: 32 GB/s (DDR3-1333 PC3-10600, triple channel); GPU RAM: 178 GB/s (GDDR5); but the PCI-E 2.0 x16 link between CPU and GPU: only 8 GB/s. [Plot: time distribution of LU decomposition for problem sizes 1024-8192.] Source: V. Volkov of UC Berkeley
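      A rough C estimate of why the smaller problem sizes are dominated by data movement; the 8 GB/s PCI-E figure is the one quoted above, the 665 GFLOPS double-precision peak is the M2090 number from the earlier comparison table, and no overlap of transfer and compute is assumed:

        #include <stdio.h>

        /* Compare the time to ship an N x N double-precision matrix across PCI-E 2.0 x16
           with the time an M2090-class GPU needs to LU-factor it (about 2/3 * N^3 flops). */
        int main(void) {
            const double pcie_bytes_per_s = 8.0e9;    /* ~8 GB/s host <-> GPU         */
            const double gpu_flops        = 665.0e9;  /* ~665 GFLOPS double precision */
            for (double n = 1024; n <= 8192; n *= 2) {
                double t_xfer = n * n * 8.0 / pcie_bytes_per_s;
                double t_lu   = (2.0 / 3.0) * n * n * n / gpu_flops;
                printf("N=%5.0f  transfer=%.4f s  LU compute=%.4f s  transfer/compute=%.2f\n",
                       n, t_xfer, t_lu, t_xfer / t_lu);
            }
            return 0;  /* at N=1024 the transfer costs about as much as the factorization itself */
        }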
    • GPU Programming Difficulty: not for the faint of heart!
      Code for CPU:
        x = new float[n+1];
        y = new float[n+1];
        // initialize x & y....
        for (int i=1; i<n; ++i) {
          x[i] += ( y[i+1] + y[i-1] )*.5;
        }

      Code rewritten for GPU (CUDA), host side:
        x = new float[n+1];
        y = new float[n+1];
        // initialize x & y....
        // allocate GPU memory
        float *fx, *fy;
        cudaMalloc((void**)&fx, (n-1+2) * sizeof(float));
        cudaMalloc((void**)&fy, (n-1+2) * sizeof(float));
        // copy to GPU memory
        cudaMemcpy(fx+1, &x[1], (n-1) * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(fy, &y[1-1], (n-1+2) * sizeof(float), cudaMemcpyHostToDevice);
        // call GPU kernel code
        dim3 dimGrid((n-1+2)/BLOCK, 1, 1);
        sub1<<<dimGrid>>>(fx, fy);
        // copy result back to host memory
        cudaMemcpy(&x[1], fx+1, (n-1) * sizeof(float), cudaMemcpyDeviceToHost);
        // release GPU memory
        cudaFree(fy);
        cudaFree(fx);

      The GPU kernel code:
        #define BLOCK (512)
        __global__ void sub1(float* fx, float* fy) {
          int t = threadIdx.x;
          int b = blockIdx.x;
          __shared__ float sx[BLOCK];
          __shared__ float sy[BLOCK+2];
          // copy from device memory to on-chip shared memory
          sx[t] = fx[BLOCK*b+t];
          sy[t] = fy[BLOCK*b+t];
          if (t<2)
            sy[t+BLOCK] = fy[BLOCK*b+t+BLOCK];
          __syncthreads();
          // do computation
          sx[t] += ( sy[t+2] + sy[t] )*.5;
          // (not over yet...)
        }
      Source: http://parallel-for.sf.net
    • Top 10 Objections to GPU Computing
      1. Don’t want to rewrite my code or learn a new language
      2. Don’t know what kind of performance to expect
      3. Rather wait for the magic CPU→GPU source code converter
      4. Rather wait for more CPU cores or Intel’s MIC
      5. PCI-E bandwidth kills my performance
      6. Amdahl’s law
      7. Don’t like proprietary languages
      8. What if nVidia goes away?
      9. GPU boards don’t have enough memory
      10. Don’t have enough IT budget
      Source: V. Natoli, InsideHPC.com, 2011
    • Exascale Initiatives: Partnership for Advanced Computing in Europe; Exascale Co-Design Centers
    • Upcoming Supercomputers
      - Stampede: 10 PFLOPS peak. First major supercomputer based on Intel Xeon Phi. Cost: $50M. Memory: 272 TB. Power: 9 MW. Deployment by 2013.
      - Titan: 10-20 PFLOPS peak. Upgrade of Jaguar. Cost: $97M. Power: 3.5-7 MW.
      - ??? @ NCSA/Univ. of Ill.: 10 PFLOPS peak. 2007-2012 project. Cost: $200-300M (actual: $1.5B).
    • Blue Waters
    • Blue Waters
    • (IBM) Blue Waters: each node has four 8-core 4 GHz POWER7 processors; four threads per core → 128 threads per node; >800 W power consumption; water cooling. (Photo credit: Ray Cunningham, joshmeans)
    • IBM Blue Waters is dead, yet the same technology (POWER7) drove the Watson supercomputer that took on humans in the $1M Jeopardy tournament in 2011.
    • The New Blue Waters
    • High R&D Cost: high-end HPC is a niche market
      - The R&D costs of proprietary systems have skyrocketed and need substantial subsidy from the government: Earth Simulator $600M, BlueGene/L $90M, Roadrunner $125M, Blue Waters >$200M, K Computer $1.3B
      - Risk is mitigated by adopting commoditized technologies: Beowulf-type PC clusters, Intel processors, mass-market video cards
      - But this also inhibits innovation: the monotony of supercomputer architectures
      Source: E. Strohmaier et al., “Recent Trends in The Marketplace of HPC”, 2005
    • Supercomputer Funding (Source: EESI’s “Existing HPC Initiatives” report, 2010)
      USA: budget $1276M/yr (0.09‰ of GDP)
        Sequoia: 20 PF, $250M, DoE; classified (nuclear stockpile reliability)
        Mira: 10 PF, >$50M, DoE; climate, materials, biology, energy, astrophysics, ...
        Jaguar: 2.6 PF, $220M, DoE; fusion, climate, astrophysics, materials, ...
        Pleiades: 1.7 PF, cost ?, NASA; aerospace, engineering, ...
        Cielo: 1.4 PF, $45M, DoE; classified (nuclear stockpile reliability)
        Hopper: 1.3 PF, cost ?, DoE; climate, energy, astrophysics, particle physics, ...
      China: budget $67M/yr (0.014‰ of GDP)
        Tianhe-1A: 4.7 PF, $60M; CFD, animation rendering, oil, engineering, ...
        Nebulae: 3 PF, $30M; climate, materials, ...
        Mole8.5: 1.1 PF, cost ?
      Europe: €100M/yr
        SuperMUC: 3.2 PF, cost ?; CFD, medicine, astrophysics, ...
        Fermi: 2.1 PF, cost ?; plasma physics, climate, energy, ...
      Japan: budget $278M/yr (0.055‰ of GDP)
        K: 11.2 PF, $1300M; biology, climate, earthquake, drug design, materials, clean energy, ...
        Tsubame2: 2.3 PF, $33M; engineering, astrophysics, ...
      Russia: $40M/yr
        Lomonosov: 1.7 PF, $30M; biology, climate, drug design, engineering, materials, ...
      S. Korea: $66M/yr. India: $20M/yr.
    • Supercomputing vs. Cloud: the difference is the interconnect. [Plots: bandwidth (Gbit/s) and latency (μs) vs. message size for Amazon EC2 vs. NCSA (InfiniBand).] Cloud is good for programs that need little communication, but most scientific codes are very communication-intensive. Source: E. Walker, “Benchmarking Amazon EC2 for High Performance Computing”, 2008.
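      The effect of the interconnect can be seen with a simple latency/bandwidth model; the numbers below are order-of-magnitude assumptions for a Gigabit-Ethernet cloud node vs. InfiniBand, not values read off the measured plots:

        #include <stdio.h>

        /* Message time model: t(m) = latency + m / bandwidth. */
        int main(void) {
            const double lat_cloud = 150e-6, bw_cloud = 0.1e9;   /* ~150 us, ~0.8 Gbit/s (assumed) */
            const double lat_ib    = 2e-6,   bw_ib    = 2.0e9;   /* ~2 us,   ~16 Gbit/s  (assumed) */
            const double sizes[]   = { 8.0, 1024.0, 1048576.0 }; /* bytes per message */
            for (int i = 0; i < 3; ++i) {
                double m = sizes[i];
                printf("%8.0f bytes:  cloud %.1f us   InfiniBand %.1f us\n", m,
                       (lat_cloud + m / bw_cloud) * 1e6, (lat_ib + m / bw_ib) * 1e6);
            }
            return 0;  /* small messages come out roughly 100x slower on the cloud interconnect */
        }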
    • Supercomputing in the Clouds (?): slowdown factor of scientific codes on various systems (e.g. MILC and PARATEC shown in the plot).
      Systems: Baseline: 2.67 GHz Nehalem, QDR InfiniBand. Amazon EC2: 1-1.2 GHz Xeon (?), Gigabit Ethernet (?). Lawrencium: 2.66 GHz Xeon, DDR InfiniBand. Franklin: Cray XT4 (2.3 GHz Opteron, SeaStar).
      Scientific codes: GAMESS (computational chemistry), GTC (turbulence), Impact (Integrated Map & Particle Accelerator Tracking Time), fvCAM (atmosphere model), MAESTRO256 (hydrodynamics), MILC (QCD), PARATEC (computational chemistry).
      Source: K. Jackson, “Performance of HPC applications on the Amazon Web Services Cloud”, 2010.
    • What’s New in SC’11? November 12-18, 2011; more than 11,000 attendees
    • SC’11: Intel Xeon Phi (MIC): 1 TeraFLOPS, 1 chip, Many Integrated Cores. Existing codes only need recompilation! You can run an MPI job on a MIC!
    • SC’11: Intel Xeon Phi (MIC)
    • SC’11: Intel Xeon Phi (MIC) Each MIC core:  Simplified Pentium core • Short pipeline • No out-of-order execution  Enhanced with >100 new instructions • 512-bit vector instructions • 16 single-precision FP per instruction • Fused multiply-add (a*x+b) • Gather-scatter  Fully cache coherent  Four threads per core Inter-core network: 1024-bit ring bus
    • SC’11: Intel Xeon Phi (MIC)
    • SC’11: Intel Xeon Phi (MIC) Two execution modes:  Native: Recompilation  Offload: Directive-based programming, e.g. x86 CPU MIC
    • SC’11: Mont-Blanc Project. Board made by the Italian company SECO. Goal: exascale-level performance with 15-30 times less power than current supercomputers. http://www.montblanc-project.eu
    • SC’11: Mont-Blanc Project
    • SC’11: Mont-Blanc Project
    • SC’11: OpenACC (Open Accelerator Computing): a standard for directive-based GPU/accelerator programming; at the time it was expected to become part of OpenMP 4.0.
        float f(int n, float* v1, float* v2) {
          float sum = 0;
          #pragma acc parallel loop reduction(+:sum)
          for (int i = 0; i < n; ++i) {
            sum += v1[i] + v2[i];
          }
          return sum;
        }
      http://www.openacc-standard.org
    • SC’11: Sunway Bluelight China’s first supercomputer using homegrown CPUs Installed in September 2011 at Jinan NSC (濟南超算中心) Peak 1.07 PF, actual 0.8 PF (74% efficiency) #20 in TOP500, #39 in Green500 CPU:  Based on DEC Alpha 21164  0.975 - 1.2 GHz 16-core, ~ 141 GF @ 1.1GHz  65 nm process, 2-3 generations behind the latest technology  Designed by Jiangnan Computing Lab (江南計算所) Infiniband QDR Water cooling
    • SC’11: Sunway Bluelight
    • SC’11: Sunway Bluelight Water nozzles. Aluminum chassis cold plate.
    • SC’11: Amazon AWS at #72 in the TOP500. Node configuration: Cluster Compute Eight Extra Large (cc2.8xlarge), eight-core 2.6 GHz Xeon E5 X5570, 10 Gigabit Ethernet. Cost: $2.4 per node-hour (on-demand). 1064 cc2.8xlarge instances → 354 TeraFLOPS (67.8% efficiency), rank #72, at $2,554 per hour.
    • Q/A ?
    • Supercomputer Benchmark: LINPACK
      - Solve a system of linear equations Ax = b using Gaussian elimination (LU decomposition).
      - Why LINPACK? Simple performance model: LINPACK only uses + - × ÷, and the number of operations required is 2N³/3 + 3N²/2 (N = # of unknowns). It stress-tests computation, memory, and network, and it is resistant to compiler optimizations or architecture differences: it can run on any number of processors and has been tested on everything from the Cray 1 to the iPad 2 (~1.6 GigaFLOPS).
      - LINPACK FLOPS = (number of operations) / (time to run); LINPACK efficiency = (LINPACK FLOPS) / (theoretical peak FLOPS).
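      The bookkeeping in those last two definitions, written out in C; the problem size, runtime, and peak below are made-up inputs for illustration:

        #include <stdio.h>

        /* LINPACK operation count: 2N^3/3 + 3N^2/2.
           Achieved FLOPS = operations / runtime; efficiency = achieved / peak. */
        int main(void) {
            const double n    = 1.0e6;     /* number of unknowns (illustrative)  */
            const double time = 3600.0;    /* seconds to solve (illustrative)    */
            const double peak = 0.5e15;    /* theoretical peak, 0.5 PetaFLOPS    */
            double ops   = 2.0 * n * n * n / 3.0 + 3.0 * n * n / 2.0;
            double flops = ops / time;
            printf("LINPACK: %.3g FLOPS, efficiency %.1f%%\n", flops, 100.0 * flops / peak);
            return 0;
        }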
    • Center for Computational Research (CCR)
      - More than 10 years of experience delivering HPC in an academic setting
      - Mission: enabling and facilitating research within the university
      - Provides cycles, software engineering, scientific computing/modeling, visualization
      - Computational cycles delivered in 2009: 360,000 jobs run (1,000 per day); 720,000 CPU-days delivered
      - $9M infrastructure upgrades in 2010 (6,000 cores, 800 TB storage)
      - Portal/tool development: WebMO (chemistry), iNquiry (bioinformatics), UBMoD (metrics on demand)
      - TeraGrid Technology Audit Services
    • Technology Audit Services. Objectives: HPC resource usage reporting; HPC resource quality-of-service assurance through continuous testing with application kernels; quantitative and qualitative metrics of impact to science and engineering.