High Performance Computing - Challenges on the Road to Exascale Computing
Presentation Transcript

  • April 2011. High Performance Computing: Challenges on the Road to Exascale Computing. H. J. Schick, IBM Germany Research & Development GmbH. © 2011 IBM Corporation
  • Agenda Introduction The What and Why of High Performance Computing Exascale Challenges Balanced Systems Blue Gene Architecture and Blue Gene Active Storage Supercomputers in a Sugar Cube2 © 2011 IBM Corporation
  • Origination of the “Jugene” supercomputer
  • Supercomputers Satisfy the Need for FLOPS. FLOPS = FLoating point OPerations per Second (Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18). Simulation is a major application area, and many simulations are based on the notion of a "timestep": at each timestep, the constituent parts are advanced according to their physics or chemistry. Example challenge: molecular dynamics operates on a picosecond (10^-12 s) timescale, but many biological processes span milliseconds (10^-3 s), so such a simulation needs 10^9 timesteps, and each timestep requires many operations!
  • Simulation pseudo-code:
        // Initialize state of atoms.
        while (time < 1 millisecond) {
            // Calculate forces on 40,000 atoms.
            // Calculate velocities of all atoms.
            // Advance position of all atoms.
            time = time + 1 picosecond
        }
        // Write biology result.
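A runnable toy version of that loop, as a sketch in C. Only the atom count and the timescales come from the slides; the harmonic force law, unit masses, and a single coordinate per atom are illustrative simplifications.

```c
#include <stdio.h>

#define N_ATOMS 40000          /* atom count from the slide             */
#define DT      1.0e-12        /* 1 picosecond timestep, in seconds     */
#define T_END   1.0e-3         /* 1 millisecond of simulated biology    */

/* Toy state: one coordinate per atom; a real MD code stores 3D
   positions, velocities, masses, neighbour lists, and more.            */
static double pos[N_ATOMS], vel[N_ATOMS], force[N_ATOMS];

/* Placeholder force calculation (harmonic restoring force).
   In real molecular dynamics this is the dominant per-step cost.       */
static void compute_forces(void)
{
    for (int i = 0; i < N_ATOMS; i++)
        force[i] = -1.0e-3 * pos[i];
}

int main(void)
{
    double time = 0.0;
    long steps = 0;

    while (time < T_END) {             /* about 10^9 iterations          */
        compute_forces();              /* forces on all atoms            */
        for (int i = 0; i < N_ATOMS; i++) {
            vel[i] += force[i] * DT;   /* advance velocities (unit mass) */
            pos[i] += vel[i] * DT;     /* advance positions              */
        }
        time += DT;
        steps++;
    }
    printf("finished after %ld timesteps\n", steps);   /* "biology result" */
    return 0;
}
```

Even this toy makes the slide's point: the loop body runs roughly 10^9 times over 40,000 atoms, so each timestep must be cheap and heavily parallelized.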
  • Supercomputing is Capability Computing: a single instance of an application using large, tightly coupled computer resources, for example a single 1000-year climate simulation. Contrast this with Capacity Computing: many instances of one or more applications using large, loosely coupled computer resources, for example 1000 independent 1-year climate simulations. Capacity workloads are often trivially parallel and are well suited to GRID or SETI@home-style systems.
  • Supercomputer Versus Your Desktop. Assume a 2000-processor supercomputer delivers a simulation result in 1 day. Assuming memory size is not a problem, your 1-processor desktop would deliver the same result in 2000 days, roughly 5 years. Supercomputers make results available on a human timescale.
  • But what could you do if all objects were intelligent … and connected?
  • What could you do with unlimited computing power, for pennies? Could you predict the path of a storm down to the square kilometer? Could you identify another 20% of proven oil reserves without drilling one hole?
  • Grand Challenges. "A grand challenge is a fundamental problem in science or engineering, with broad applications, whose solution would be enabled by the application of high performance computing resources that could become available in the near future." Examples:
    – Computational fluid dynamics calculations for the design of hypersonic aircraft, efficient automobile bodies, and extremely quiet submarines, for weather forecasting of short and long term effects, for efficient recovery of oil, and for many other applications.
    – Electronic structure calculations for the design of new materials: chemical catalysts, immunological agents, superconductors.
    – Calculations to understand the fundamental nature of matter: quantum chromodynamics, condensed matter theory.
  • Enough Atoms to See Grains in Solidification of Metal. Source: http://www-phys.llnl.gov/Research/Metals_Alloys/news.html
  • Building Blocks of Matter. QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.). Quarks are the constituents of matter, which interact strongly by exchanging gluons. Particular phenomena: confinement and asymptotic freedom (Nobel Prize 2004). The theory of the strong interaction is Quantum Chromodynamics (QCD).
  • Projected Performance Development: almost a doubling every year!
  • Extrapolating an Exaflop in 2018. Standard technology scaling will not get us there in 2018. Each entry gives the BlueGene/L value (2005), then the value for an exaflop machine built by directly scaling that technology, then the value for a "compromise guess" exaflop machine, followed by the assumptions behind the compromise:
    – Node peak performance: 5.6 GF → 20 TF / 20 TF. Same node count (64k).
    – Hardware concurrency per node: 2 → 8,000 / 1,600. Assumes 3.5 GHz.
    – System power in compute chips: 1 MW → 3.5 GW / 25 MW. Expected based on technology improvement through 4 technology generations (only compute chip power scaling; I/Os also scaled the same way).
    – Link bandwidth (each unidirectional 3D link): 1.4 Gbps → 5 Tbps / 1 Tbps. Not possible to maintain the bandwidth ratio.
    – Wires per unidirectional 3D link: 2 → 400 / 80. A large wire count would eliminate high density and drive links onto cables, where they are 100x more expensive; assumes 20 Gbps signaling.
    – Pins in network per node: 24 → 5,000 / 1,000. 20 Gbps differential assumed; 20 Gbps over copper will be limited to 12 inches, so optics will be needed for in-rack interconnects (10 Gbps is possible today in both copper and optics).
    – Power in network: 100 kW → 20 MW / 4 MW. 10 mW/Gbps assumed. Today: 25 mW/Gbps for long distance (greater than 2 feet on copper), both ends, one direction; 45 mW/Gbps for optics, both ends, one direction, plus 15 mW/Gbps of electrical. In the future, links will have to be separately optimized for power.
    – Memory bandwidth per node: 5.6 GB/s → 20 TB/s / 1 TB/s. Not possible to maintain external bandwidth per Flop.
    – L2 cache per node: 4 MB → 16 GB / 500 MB. About 6-7 technology generations with expected eDRAM density improvements.
    – Data pins for memory per node: 128 → 40,000 / 2,000. 3.2 Gbps per pin.
    – Power in memory I/O (not DRAM): 12.8 kW → 80 MW / 4 MW. 10 mW/Gbps assumed. Most current power is in the address bus; the future is probably about 15 mW/Gbps, maybe 10 mW/Gbps (2.5 mW/Gbps is C·V²·f for random data on data pins), and address power is higher.
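As an illustrative sanity check of the "power in network" row (my arithmetic, not from the slides): assuming six outgoing torus links per node at the compromise link bandwidth of 1 Tbps and the table's assumed 10 mW/Gbps, a 64k-node machine lands close to the quoted 4 MW:

```latex
P_{\text{node}} \approx 6 \times 1000\,\text{Gbps} \times 10\,\tfrac{\text{mW}}{\text{Gbps}} = 60\,\text{W},
\qquad
P_{\text{network}} \approx 65{,}536 \times 60\,\text{W} \approx 3.9\,\text{MW} \approx 4\,\text{MW}.
```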
  • The Big Leap from Petaflops to Exaflops. We will hit 20 Petaflops in 2011/2012, and research is now beginning for an ~2018 exascale system. The IT/CMOS industry is trying to double performance every 2 years; the HPC industry is trying to double performance every year. Technology disruptions loom in many areas. Bad news: the scalability of current technologies (silicon power, interconnect, memory, packaging) is in question. Good news: emerging technologies (memory technologies such as storage class memory, 3D chips, etc.). Exploiting exascale machines: we want to maximize science output per €, and we need multiple partner applications to evaluate hardware trade-offs.
  • Exascale Challenges – Energy. Power consumption will increase in the future; what is the critical limit? JSC has 5 MW, with the potential for 10 MW; 1 MW costs about 1 M€ per year; 20 MW is expected to be the critical limit. Are exascale systems a Large Scale Facility? The LHC uses 100 MW. Energy efficiency: cooling uses a significant fraction of the power (PUE > 1.2 today, with 1.0 as the goal); hot cooling water (40 °C and more) might help; free cooling uses outside air to cool the water; heat recycling puts the waste heat to work for heating, cooling, etc.
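As a rough illustration of what the PUE figure means in money, using only the numbers quoted above (PUE of 1.2, the 5 MW site, 1 M€ per MW-year) and, for simplicity, treating the 5 MW as IT load:

```latex
P_{\text{overhead}} = (\mathrm{PUE} - 1)\,P_{\text{IT}}
\approx (1.2 - 1) \times 5\,\mathrm{MW}
= 1\,\mathrm{MW}
\;\approx\; 1\,\text{M€ per year},
```

which is roughly the saving at stake in pushing PUE toward 1.0.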
  • Exascale Challenges – Resiliency. An ever increasing number of components: O(10,000) nodes, O(100,000) DIMMs of RAM. The MTBF of each component will not increase (optimistically it remains constant; realistically, smaller structures and lower voltages will decrease it), so the global MTBF will decrease. What is the critical limit: 1 day? 1 hour? The time to write a checkpoint! Handling failures means trying to anticipate them via monitoring, and software must help to handle them through checkpoints, process migration, and transactional computing.
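The scaling argument can be made concrete with a small sketch (mine, not from the slides): with independent failures the system MTBF is roughly the per-component MTBF divided by the component count, and Young's well-known approximation puts the optimal checkpoint interval near sqrt(2 · checkpoint time · system MTBF). The per-node MTBF and the 30-minute checkpoint time below are illustrative assumptions.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions (not from the slides). */
    const double node_mtbf_h  = 5.0e5;   /* per-node MTBF, hours           */
    const double nodes        = 1.0e4;   /* O(10000) nodes                 */
    const double ckpt_write_h = 0.5;     /* time to write one checkpoint   */

    /* With independent failures the system MTBF shrinks with scale.       */
    double system_mtbf_h = node_mtbf_h / nodes;              /* ~50 hours  */

    /* Young's approximation for the optimal checkpoint interval:
       T_opt ~ sqrt(2 * checkpoint_time * system_MTBF).                    */
    double t_opt_h = sqrt(2.0 * ckpt_write_h * system_mtbf_h);

    printf("system MTBF        : %.1f hours\n", system_mtbf_h);
    printf("checkpoint every   : ~%.1f hours\n", t_opt_h);
    return 0;
}
```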
  • Exascale Challenges – Applications. Ever increasing levels of parallelism: thousands of nodes, hundreds of cores, dozens of registers. Automatic parallelization vs. explicit exposure; how large are coherency domains; how many languages do we have to learn? MPI + X is most probably not sufficient: one process per core makes orchestration of processes harder, and GPUs require explicit handling today (CUDA, OpenCL). What is the future paradigm: MPI + X + Y? PGAS + X (+ Y)? PGAS options include UPC, Co-Array Fortran, X10, Chapel, Fortress, and others. And which applications are inherently scalable enough at all?
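For concreteness, a minimal "MPI + X" sketch with X = OpenMP; the toy reduction and the array size are illustrative, not taken from the slides.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI process owns a slice of the data ...                   */
    static double local[N];
    double local_sum = 0.0;

    /* ... and uses OpenMP threads ("X") for on-node parallelism.      */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N; i++) {
        local[i] = (double)(rank + i);
        local_sum += local[i];
    }

    /* MPI handles the inter-node part of the computation.             */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %e (from %d ranks)\n", global_sum, size);

    MPI_Finalize();
    return 0;
}
```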
  • Balanced Systems Example caxpy: Processor FPU throughput Memory bandwidth [FLOPS / cycle] [words / cycle] [FLOPS / word] apeNEXT 8 2 4 QCDOC (MM) 2 0.63 3.2 QCDOC (LS) 2 2 1 Xeon 2 0.29 7 GPU 128 x 2 17.3 (*) 14.8 Cell/B.E. (MM) 8x4 1 32 Cell/B.E. (LS) 8x4 8x4 120 © 2011 IBM Corporation
  • Balanced Systems ???
  • … but are they Reliable, Available and Serviceable ???
  • Blue Gene/P
  • Blue Gene/P packaging hierarchy:
    – Chip: 4 processors, 13.6 GF/s, 8 MB eDRAM
    – Compute card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB as of 6/30/08)
    – Node card: 32 compute cards (32 chips, 4x4x2), 0-1 I/O cards, 435 GF/s, 64 (128) GB
    – Rack: 32 node cards, 13.9 TF/s, 2 (4) TB
    – System: 72 racks, 72x32x32, cabled 8x8x16, 1 PF/s, 144 (288) TB
  • Blue Gene/P compute ASIC (block diagram): four PPC450 cores, each with a double FPU, 32 KB L1 instruction and data caches, and a snoop filter; per-core L2 units feeding a multiplexing switch; two banks of 4 MB shared L3 eDRAM (512-bit data, 72-bit ECC, with a directory, usable as L3 cache or on-chip memory); shared SRAM; arbiter and DMA engine; hybrid PMU with 256 x 64b SRAM; two DDR2 controllers with ECC driving a 13.6 GB/s DDR2 DRAM bus; and network interfaces for the torus (6 x 3.4 Gb/s bidirectional), the collective network (3 x 6.8 Gb/s bidirectional), 4 global barriers or interrupts, 10 Gb Ethernet, and JTAG access.
  • Blue Gene/P compute card: one BG/P compute ASIC (29 mm x 29 mm FC-PBGA), a 2 x 16B interface to 2 or 4 GB of SDRAM-DDR2, plus NVRAM, monitors, decoupling, and Vtt termination; all network, I/O, and power input come through the card connector.
  • Blue Gene/P node board: 32 compute nodes, an optional I/O card (one of 2 possible), local DC-DC regulators (6 required, 8 with redundancy), and a 10 Gb optical link.
  • Blue Gene Interconnection Networks, optimized for parallel programming and scalable management:
    – 3D torus: interconnects all compute nodes (65,536); virtual cut-through hardware routing; 1.4 Gb/s on all 12 node links (2.1 GB/s per node); the communications backbone for computations; 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth.
    – Global collective network: one-to-all broadcast and reduction operations; 2.8 Gb/s of bandwidth per link; one-way global latency 2.5 µs; ~23 TB/s total bandwidth (64k machine); interconnects all compute and I/O nodes (1024).
    – Low latency global barrier and interrupt: round trip latency 1.3 µs.
    – Control network: boot, monitoring, and diagnostics.
    – Ethernet: incorporated into every node ASIC; active in the I/O nodes (1:64); carries all external communication (file I/O, control, user interaction, etc.).
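To illustrate what the 3D torus topology means for routing, here is a small sketch (mine, not from the slides) that computes the minimal hop count between two nodes as the wrap-around Manhattan distance. The 72 x 32 x 32 shape comes from the Blue Gene/P system slide above; the example coordinates are arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hops along one torus dimension: traffic may wrap around, so the
   distance is the shorter of the two directions.                     */
static int torus_dist(int a, int b, int dim)
{
    int d = abs(a - b);
    return d < dim - d ? d : dim - d;
}

int main(void)
{
    /* Machine shape from the Blue Gene/P system slide: 72 x 32 x 32. */
    const int X = 72, Y = 32, Z = 32;

    /* Example pair of node coordinates (illustrative values).        */
    int a[3] = {1, 2, 3}, b[3] = {70, 30, 5};

    int hops = torus_dist(a[0], b[0], X)
             + torus_dist(a[1], b[1], Y)
             + torus_dist(a[2], b[2], Z);

    printf("minimal torus route: %d hops\n", hops);
    return 0;
}
```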
  • Source: Kirk Borne, "Data Science Challenges from Distributed Petabyte Astronomical Data Collections: Preparing for the Data Avalanche through Persistence, Parallelization, and Provenance."
  • Blue Gene Architecture in Review. Blue Gene is not just FLOPs; it is also the torus network, power efficiency, and dense packaging. The focus on scalability rather than configurability gives the Blue Gene family's System-on-a-Chip architecture unprecedented scalability and reliability.
  • Thought Experiment: A Blue Gene Active Storage Machine (from "Blue Gene Active Storage", HEC FSIO 2010). Integrate significant storage class memory (SCM) at each node: for now Flash memory, with function perhaps similar to a Fusion-io ioDrive Duo; future systems may deploy Phase Change Memory (PCM), memristors, or other technologies. Assume node density drops 50%, to 512 nodes per rack, for embedded applications; the objective is to balance Flash bandwidth against the network's all-to-all throughput.
    ioDrive Duo figures, per board and scaled to 512 nodes:
    – SLC NAND capacity: 320 GB per board, 160 TB per 512 nodes
    – Read bandwidth (64 KB): 1,450 MB/s per board, 725 GB/s per 512 nodes
    – Write bandwidth (64 KB): 1,400 MB/s per board, 700 GB/s per 512 nodes
    – Read IOPS (4 KB): 270,000 per board, 138 million per 512 nodes
    – Write IOPS (4 KB): 257,000 per board, 131 million per 512 nodes
    – Mixed R/W IOPS (75/25 at 4 KB): 207,000 per board, 105 million per 512 nodes
    Resulting system attributes: a rack holds 0.5 petabyte of Flash, 512 Blue Gene processors, and the embedded torus network; 700 GB/s of I/O bandwidth to Flash, competitive with ~70 large disk controllers; an order of magnitude less space and power than an equivalent-performance disk solution; fewer disk controllers can be configured and optimized for archival use. With network all-to-all throughput of 1 GB/s per node, anticipate a 1 TB sort from/to persistent storage in on the order of 10 seconds, 130 million IOPS per rack, and 700 GB/s of I/O bandwidth, while inheriting the Blue Gene attributes of scalability, reliability, and power efficiency.
    Research challenges (not exhaustive): packaging (can the integration succeed?); resilience (storage, network, system management, middleware); data management (a clear split between on-line and archival data is needed); and data structures and algorithms that take specific advantage of the BGAS architecture (no one cares that it is not x86, since the software is embedded in the storage).
    Related work: Gordon (UCSD) http://nvsl.ucsd.edu/papers/Asplos2009Gordon.pdf; FAWN (CMU) http://www.cs.cmu.edu/~fawnproj/papers/fawn-sosp2009.pdf; RAMCloud (Stanford) http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
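A rough consistency check of the "1 TB sort in on the order of 10 seconds" figure (my arithmetic), using only the numbers above: 725 GB/s aggregate Flash read, 512 x 1 GB/s all-to-all, 700 GB/s aggregate Flash write, and ignoring compute time:

```latex
t \approx \frac{1\,\mathrm{TB}}{725\,\mathrm{GB/s}}
        + \frac{1\,\mathrm{TB}}{512\,\mathrm{GB/s}}
        + \frac{1\,\mathrm{TB}}{700\,\mathrm{GB/s}}
  \approx 1.4\,\mathrm{s} + 2.0\,\mathrm{s} + 1.4\,\mathrm{s}
  \approx 5\,\mathrm{s},
```

leaving roughly half of the quoted 10 seconds for the in-memory sort and software overheads.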
  • From individual transistors to the globe: energy-consumption issues (and thermal issues) propagate through all hardware levels.
  • Energy consumption of datacenters today (source: APC, Whitepaper #154, 2008). Current air-cooled datacenters are extremely inefficient: cooling needs as much energy as the IT equipment, and both heat streams are simply thrown away. Provocatively put, the datacenter is a huge "heater with integrated logic". For a 10 MW datacenter, US$ 3-5M is wasted per year.
  • Hot-water-cooled datacenters – towards zero emission: micro-channel liquid coolers (CMOS at 80 °C), a heat exchanger, water at 60 °C, and direct "waste"-heat usage, e.g. for heating.
  • Paradigm change: Moore's law goes 3D. From multi-chip designs to system-on-chip to 3D integration, approaching the brain's synapse network (Meindl et al., 2005). Benefits of 3D integration: high core-cache bandwidth; separation of technologies; reduction in wire length, with global wire lengths reduced by the equivalent of two generations of scaling; and no impact on software development.
  • Scalable Heat Removal by Interlayer Cooling. 3D integration requires (scalable) interlayer liquid cooling; the challenge is to isolate the electrical interconnects from the liquid, using microchannel and pin-fin cavities with a through-silicon-via electrical bonding and water insulation scheme (figures: cross-section through fluid port and cavities; test vehicle with fluid manifold and connection). A large fraction of the energy in computers is spent on data transport, so shrinking computers saves energy.
  • On the Cube Road. Paradigm changes: energy will cost more than servers, and coolers are a million fold larger than transistors. Moore's law goes 3D: single-layer scaling slows down, stacking of layers allows an extension of Moore's law, approaching the functional density of the human brain. Future computers will look different: liquid cooling and heat re-use (e.g. Aquasar), interlayer-cooled 3D chip stacks, and smarter energy by bionic designs. Energy aspects are key: cooling, power delivery, photonics; shrinking a rack to a "sugar cube" promises 50x efficiency.
  • Thank you very much for your attention.