  1. "Future Exascale Supercomputers", Prof. Mateo Valero, Director. Mexico DF, November 2011.

     Top10 (Top500 list):

     Rank  Site                                         Computer                                                 Procs    Rmax (GF)   Rpeak (GF)
     1     RIKEN Advanced Institute for Computational   Fujitsu K computer, SPARC64 VIIIfx 2.0 GHz,              705024   10510000    11280384
           Science (AICS)                               Tofu interconnect
     2     Tianjin, China                               Xeon X5670 + NVIDIA                                      186368    2566000     4701000
     3     Oak Ridge Nat. Lab.                          Cray XT5, 6 cores                                        224162    1759000     2331000
     4     Shenzhen, China                              Xeon X5670 + NVIDIA                                      120640    1271000     2984300
     5     GSIC Center, Tokyo                           Xeon X5670 + NVIDIA                                       73278    1192000     2287630
     6     DOE/NNSA/LANL/SNL                            Cray XE6, 8-core 2.4 GHz                                 142272    1110000     1365811
     7     NASA/Ames Research Center/NAS                SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0 /            111104    1088000     1315328
                                                        Xeon 5570/5670 2.93 GHz, Infiniband
     8     DOE/SC/LBNL/NERSC                            Cray XE6, 12 cores                                       153408    1054000     1288627
     9     Commissariat à l'Energie Atomique (CEA)      Bull bullx super-node S6010/S6030                        138368    1050000     1254550
     10    DOE/NNSA/LANL                                QS22/LS21 Cluster, PowerXCell 8i / Opteron, Infiniband   122400    1042000     1375776
  2. Parallel Systems: an interconnect (Myrinet, IB, GE, 3D torus, tree, ...) connects nodes; each node is an SMP
     with memory and a network-on-chip (bus, ring, direct, ...). Node types:
     - homogeneous multicore (e.g. the BlueGene/Q chip)
     - heterogeneous multicore: general-purpose cores plus accelerator (e.g. Cell)
     - multicore + GPU
     - multicore + FPGA
     - ASIC (e.g. Anton for molecular dynamics)

     Riken's Fujitsu K with SPARC64 VIIIfx:
     - Homogeneous architecture
     - Compute node: one SPARC64 VIIIfx processor, 2 GHz, 8 cores per chip, 128 Gigaflops per chip; 16 GB memory per node
     - Number of nodes and cores: 864 cabinets * 102 compute nodes/cabinet * (1 socket * 8 CPU cores) = 705024 cores;
       footprint of about 50 by 60 meters
     - Peak performance (DP): 705024 cores * 16 GFLOPS per core = 11280384 GFLOPS (about 11.28 PFLOPS)
     - Linpack: 10510 TFLOPS (10.51 PFLOPS), 93% efficiency; matrix with more than 13725120 rows; 29 hours and 28 minutes
     - Power consumption: 12.6 MWatt, 0.8 Gigaflops/W
     (The arithmetic behind these figures is spelled out in the sketch below.)
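     A minimal sketch in plain C that reproduces the K computer numbers quoted on the slide (the program itself
     is illustrative and not part of the original deck):

         #include <stdio.h>

         int main(void) {
             /* Figures quoted on the K computer slide */
             const double cabinets        = 864;
             const double nodes_per_cab   = 102;
             const double cores_per_node  = 8;         /* 1 socket x 8 CPU cores */
             const double gflops_per_core = 16;        /* SPARC64 VIIIfx, 2 GHz  */
             const double linpack_gflops  = 10510000;  /* 10.51 PFLOPS           */
             const double power_mwatt     = 12.6;

             double cores       = cabinets * nodes_per_cab * cores_per_node;  /* 705024       */
             double peak_gflops = cores * gflops_per_core;                    /* 11280384 GF  */

             printf("cores       = %.0f\n", cores);
             printf("peak        = %.2f PFLOPS\n", peak_gflops / 1e6);
             printf("efficiency  = %.1f %%\n", 100.0 * linpack_gflops / peak_gflops);
             printf("energy eff. = %.2f GFLOPS/W\n", linpack_gflops / (power_mwatt * 1e6));
             return 0;
         }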
  3. Looking at the Gordon Bell Prize:
     - 1 GFlop/s; 1988; Cray Y-MP; 8 processors. Static finite element analysis.
     - 1 TFlop/s; 1998; Cray T3E; 1024 processors. Modeling of metallic magnet atoms, using a variation of the
       locally self-consistent multiple scattering method.
     - 1 PFlop/s; 2008; Cray XT5; 1.5x10^5 processors. Superconductive materials.
     - 1 EFlop/s; ~2018; ?; 1x10^8 processors?? (10^9 threads).
     Source: Jack Dongarra.
  4. Nvidia GPU instruction execution (figure: instructions 1-4, one with long latency, issued across
     multiprocessors MP1-MP4).

     Potential System Architecture for Exascale Supercomputers (EESI Final Conference, 10-11 Oct. 2011, Barcelona):

     System attributes            2010        "2015"                "2018"                 Difference 2010-18
     System peak                  2 Pflop/s   200 Pflop/s           1 Eflop/s              O(1000)
     Power                        6 MW        15 MW                 ~20 MW                 20
     System memory                0.3 PB      5 PB                  32-64 PB               O(100)
     Node performance             125 GF      0.5 TF or 7 TF        1 TF or 10 TF          O(10) - O(100)
     Node memory BW               25 GB/s     0.1-1 TB/s            0.4-4 TB/s             O(100)
     Node concurrency             12          O(100) - O(1,000)     O(1,000) - O(10,000)   O(100) - O(1000)
     Total concurrency            225,000     O(10^8)               O(10^9)                O(10,000)
     Total node interconnect BW   1.5 GB/s    20 GB/s               200 GB/s               O(100)
     MTTI                         days        O(1 day)              O(1 day)               - O(10)
  5. 2. Towards faster airplane design
     Boeing: number of wing prototypes prepared for wind-tunnel testing:

     Date               1980        1995   2005
     Airplane           B757/B767   B777   B787
     # wing prototypes  77          11     5

     Plateau due to RANS limitations. Further decrease expected from LES with ExaFlop systems.
     (EESI Final Conference, 10-11 Oct. 2011, Barcelona)

     Design of the Airbus 380.
  6. 2. Towards faster airplane design
     Airbus: "More simulation, less tests". From the A380 to the A350:
     - 40% fewer wind-tunnel days
     - 25% saving in aerodynamics development time
     - 20% saving on wind-tunnel test cost
     thanks to HPC-enabled CFD runs, especially in the high-speed regime, providing an even better representation
     of aerodynamic phenomena that is turned into better design choices.
     Acknowledgements: E. CHAPUT (AIRBUS). (EESI Final Conference, 10-11 Oct. 2011, Barcelona)

     2. Oil industry (EESI Final Conference, 10-11 Oct. 2011, Barcelona)
  7. Design of ITER: TOKAMAK (JET).

     Fundamental Sciences (EESI Final Conference, 10-11 Oct. 2011, Barcelona)
  8. Materials: a new path to competitiveness
     - On-demand materials for effective commercial use
     - Conductivity: energy loss reduction
     - Lifetime: corrosion protection, e.g. chrome
     - Fissures: safety insurance from molecular design
     - Optimisation of materials / lubricants: less friction, longer lifetime, lower energy losses
     - Industrial need to speed up simulation from months to days
     - From all-atom to multi-scale: exascale enables simulation of larger, realistic systems and devices
     (EESI Final Conference, 10-11 Oct. 2011, Barcelona)

     Life Sciences and Health: Population, Organ, Tissue, Cell, Macromolecule, Small Molecule, Atom.
     (EESI Final Conference, 10-11 Oct. 2011, Barcelona)
  9. Supercomputación, teoría y experimentación (Supercomputing, theory and experimentation). Courtesy of IBM.
  10. Holistic approach towards exaflop:
      - Applications: computational complexity, asynchronous algorithms, moldability
      - Job scheduling: resource awareness, load balancing, user satisfaction
      - Programming model: address space, dependencies, work generation
      - Run time: locality optimization, concurrency extraction
      - Interconnection: topology and routing, external contention
      - Processor/node architecture: NIC design, run-time support, HW counters, memory subsystem, core structure

      10+ Pflop/s systems planned:
      - Fujitsu Kei: 80,000 8-core SPARC64 VIIIfx processors at 2 GHz (16 Gflops/core, 58 watts, 3.2 Gflops/watt),
        16 GB/node, 1 PB memory, 6D mesh-torus, 10 Pflops
      - Cray's Titan at DOE, Oak Ridge National Laboratory: hybrid system with Nvidia GPUs, 1 Pflop/s in 2011,
        20 Pflop/s in 2012, late-2011 prototype, $100 million
  11. 10+ Pflop/s systems planned (continued):
      - IBM Blue Waters at Illinois: 40,000 8-core Power7, 1 PB memory, 18 PB disk, 500 PB archival storage,
        10 Pflop/s, 2012, $200 million
      - IBM Blue Gene/Q systems:
        - Mira, to DOE, Argonne National Lab: 49,000 nodes, 16-core Power A2 processor (1.6-3 GHz), 750 K cores,
          750 TB memory, 70 PB disk, 5D torus, 10 Pflop/s
        - Sequoia, to Lawrence Livermore National Lab: 98304 nodes (96 racks), 16-core A2 processor, 1.6 M cores
          (1 GB/core), 1.6 Petabytes memory, 6 MWatt, 3 Gflops/watt, 20 Pflop/s, 2012

      Japan Plan for Exascale: heterogeneous, distributed-memory GigaHz KiloCore MegaNode system.
      2012: K Machine, 10 PF; 2015: 10K Machine, 100 PF; 2018-2020: 100K Machine, ExaFlops.
      Feasibility Study (2012-2013), Exascale Project (2014-2020), Post-Petascale Projects.
  12. (Figures; thanks to S. Borkar, Intel.)
  13. Nvidia: Chip for the Exaflop Computer; Nvidia: Node for the Exaflop Computer. (Figures; thanks to Bill Dally.)
  14. Exascale Supercomputer (figure; thanks to Bill Dally).

      BSC-CNS: International Initiatives (IESP)
      - Improve the world's simulation and modeling capability by improving the coordination and development of
        the HPC software environment
      - Build an international plan for developing the next generation of open-source software for scientific
        high-performance computing
  15. Back to Babel?
      Book of Genesis: "Now the whole earth had one language and the same words" ... "Come, let us make bricks,
      and burn them thoroughly." ... "Come, let us build ourselves a city, and a tower with its top in the
      heavens, and let us make a name for ourselves" ... And the LORD said, "Look, they are one people, and they
      have all one language; and this is only the beginning of what they will do; nothing that they propose to do
      will now be impossible for them. Come, let us go down, and confuse their language there, so that they will
      not understand one another's speech."
      The computer age: Fortran & MPI ... and then Cilk++, Fortress, X10, CUDA, Sisal, HPF, StarSs, RapidMind,
      Sequoia, CAF, ALF, OpenMP, UPC, SDK, Chapel, MPI, ...
      (Thanks to Jesus Labarta)

      "You will see... in 400 years from now people will get crazy": new generation of programmers, parallel
      programming, multicore/manycore architectures, new usage models.
      Source: Picasso, Don Quixote. Dr. Avi Mendelson (Microsoft), keynote at ISC-2007.
  16. Different models of computation:
      - The dream of automatic parallelizing compilers has not come true ...
      - ... so the programmer needs to express opportunities for parallel execution in the application:
        SPMD; nested fork-join (OpenMP 2.5); DAG / data flow (OpenMP 3.0); huge lookahead and reuse;
        latency/EBW/scheduling
      - And ... asynchrony (MPI and OpenMP are too synchronous): collectives/barriers multiply the effects of
        microscopic load imbalance, OS noise, ...

      StarSs: ... generates the task graph at run time ... (the figure shows the 20 tasks generated by the code
      below and their dependencies; an OpenMP-style rendering of the first loops appears in the sketch after the
      listing)

          #pragma css task input(A, B) output(C)
          void vadd3 (float A[BS], float B[BS], float C[BS]);

          #pragma css task input(sum, A) output(B)
          void scale_add (float sum, float A[BS], float B[BS]);

          #pragma css task input(A) inout(sum)
          void accum (float A[BS], float *sum);

          for (i = 0; i < N; i += BS)     // C = A + B
              vadd3 (&A[i], &B[i], &C[i]);
          ...
          for (i = 0; i < N; i += BS)     // sum(C[i])
              accum (&C[i], &sum);
          ...
          for (i = 0; i < N; i += BS)     // B = sum * E
              scale_add (sum, &E[i], &B[i]);
          ...
          for (i = 0; i < N; i += BS)     // A = C + D
              vadd3 (&C[i], &D[i], &A[i]);
          ...
          for (i = 0; i < N; i += BS)     // E = C + F
              vadd3 (&C[i], &F[i], &E[i]);
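      For readers more familiar with standard OpenMP than with StarSs: OpenMP 4.0 and later express the same
      DAG/data-flow idea with task depend clauses. The sketch below is an illustrative rewrite of the first three
      loops above in that style; it is not from the original slides, and the sizes N and BS are arbitrary.

          /* task_dag.c - build with: gcc -fopenmp task_dag.c */
          #include <stdio.h>

          #define N  1024
          #define BS 256

          static float A[N], B[N], C[N], E[N];
          static float sum = 0.0f;

          int main(void) {
              for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; E[i] = 3; }

              #pragma omp parallel
              #pragma omp single
              {
                  for (int i = 0; i < N; i += BS) {   /* C = A + B, one task per block */
                      #pragma omp task depend(in: A[i:BS], B[i:BS]) depend(out: C[i:BS])
                      for (int j = i; j < i + BS; j++) C[j] = A[j] + B[j];
                  }
                  for (int i = 0; i < N; i += BS) {   /* sum += C[i]; inout on sum serializes these tasks */
                      #pragma omp task depend(in: C[i:BS]) depend(inout: sum)
                      for (int j = i; j < i + BS; j++) sum += C[j];
                  }
                  for (int i = 0; i < N; i += BS) {   /* B = sum * E, waits for the whole reduction */
                      #pragma omp task depend(in: sum, E[i:BS]) depend(out: B[i:BS])
                      for (int j = i; j < i + BS; j++) B[j] = sum * E[j];
                  }
                  /* the runtime builds and schedules the task graph from these dependences */
              }
              printf("sum = %f\n", sum);
              return 0;
          }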
  17. StarSs: ... and executes as efficiently as possible ... (same code as the previous slide; the figure now
      shows the task graph being executed, the numbers indicating the order in which tasks run)

      StarSs: ... benefiting from data access information
      - Flat global address space seen by the programmer
      - Flexibility to dynamically traverse the dataflow graph, "optimizing":
        - concurrency (critical path)
        - memory access
      - Opportunities for:
        - prefetch
        - reuse
        - eliminating antidependences (renaming; illustrated in the sketch below)
        - replication management
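      A small illustration of the renaming idea, written with plain OpenMP tasks rather than StarSs (hypothetical
      example, not from the slides): a later producer wants to overwrite a block that an earlier task still reads
      (a write-after-read antidependence); giving the producer a fresh block, as a renaming runtime would, removes
      the ordering.

          /* rename.c - build with: gcc -fopenmp rename.c */
          #include <stdio.h>

          #define BS 4

          int main(void) {
              float A[BS] = {1, 2, 3, 4}, B[BS], A2[BS];

              #pragma omp parallel
              #pragma omp single
              {
                  /* consumer reads A ... */
                  #pragma omp task depend(in: A) depend(out: B)
                  for (int i = 0; i < BS; i++) B[i] = 2 * A[i];

                  /* ... while the next producer would like to overwrite A. Writing
                   * into a renamed block A2 instead means it no longer has to wait
                   * for the reader to finish. */
                  #pragma omp task depend(out: A2)
                  for (int i = 0; i < BS; i++) A2[i] = 10 * i;

                  #pragma omp taskwait
              }
              printf("B[0] = %g, renamed A[0] = %g\n", B[0], A2[0]);
              return 0;
          }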
  18. StarSs: enabler for exascale
      - Can exploit very unstructured parallelism: not just loop/data parallelism; easy to change structure
      - Supports large amounts of lookahead: not stalling for dependence satisfaction; allows locality
        optimizations to tolerate latency; overlaps data transfers, prefetch; reuse
      - Nicely hybridizes into MPI/StarSs: propagates the node-level dataflow characteristics to large scale;
        overlaps communication and computation; a chance against Amdahl's law
      - Support for heterogeneity: any number and combination of CPUs and GPUs; including autotuning
      - Malleability: decouples the program from the resources, allowing dynamic resource allocation and load
        balance; tolerates noise
      - Data flow; asynchrony. The potential is there; one can blame the runtime
      - Compatible with proprietary low-level technologies

      StarSs: history/strategy/versions
      - Basic SMPSs: must provide directionality for every argument; contiguous, non partially overlapped
        arguments; renaming; several schedulers (priority, locality, ...); no nesting; C/Fortran; MPI/SMPSs
        optimizations
      - SMPSs regions: C, no Fortran; must provide directionality for every argument; overlapping and strided
        arguments
      - OmpSs: reshaping of strided accesses; priority- and locality-aware scheduling; C/C++, Fortran under
        development; OpenMP compatibility (~); dependences based only on arguments with directionality;
        contiguous arguments (addresses used as sentinels); separate dependences/transfers; inlined/outlined
        pragmas; nesting; SMP/GPU/Cluster; no renaming; several schedulers ("simple" locality-aware scheduling, ...)
  19. Multidisciplinary top-down approach: application and algorithms; programming models; performance analysis
      and prediction tools; load balancing; power; processor, interconnect and node. Investigate solutions to
      these and other problems.
      (Figure: computer center power projections, Power (MW) vs. year 2005-2011, cooling vs. computers, with the
      annual cost labels rising from $3M to $31M.)
  20. Green500 / Top500, November 2011:

      Green500  Top500  Mflops/W  Power (kW)  Site                                          Computer
      1         64      2026.48      85.12    IBM - Rochester                               BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
      2         65      2026.48      85.12    IBM Thomas J. Watson Research Center          BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
      3         29      1996.09     170.25    IBM - Rochester                               BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
      4         17      1988.56     340.50    DOE/NNSA/LLNL                                 BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
      5         284     1689.86      38.67    IBM Thomas J. Watson Research Center          NNSA/SC Blue Gene/Q Prototype 1
      6         328     1378.32      47.05    Nagasaki University                           DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR
      7         114     1266.26      81.50    Barcelona Supercomputing Center               Bullx B505, Xeon E5649 6C 2.53 GHz, Infiniband QDR, NVIDIA 2090
      8         102     1010.11     108.80    TGCC / GENCI                                  Curie Hybrid Nodes, Bullx B505, Xeon E5640 2.67 GHz, Infiniband QDR
      9         21       963.70     515.20    Institute of Process Engineering,             Mole-8.5 Cluster, Xeon X5520 4C 2.27 GHz, Infiniband QDR, NVIDIA 2050
                                              Chinese Academy of Sciences
      10        5        958.35    1243.80    GSIC Center, Tokyo Institute of Technology    HP ProLiant SL390s G7, Xeon 6C X5670, Nvidia GPU, Linux/Windows
      11        96       928.96     126.27    Virginia Tech                                 SuperServer 2026GT-TRF, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2050
      12        111      901.54     117.91    Georgia Institute of Technology               HP ProLiant SL390s G7, Xeon 6C X5660 2.8 GHz, nVidia Fermi, Infiniband QDR
      13        82       891.88     160.00    CINECA / SCS - SuperComputing Solution        iDataPlex DX360M3, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2070
      14        256      891.87      76.25    Forschungszentrum Juelich (FZJ)               iDataPlex DX360M3, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2070
      15        61       889.19     198.72    Sandia National Laboratories                  Xtreme-X GreenBlade GB512X, Xeon E5 (Sandy Bridge-EP) 8C 2.60 GHz, Infiniband QDR
      32        1        830.18   12659.89    RIKEN Advanced Institute for Computational    K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
                                              Science (AICS)
      47        2        635.15    4040.00    National Supercomputing Center in Tianjin     NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
      149       3        253.09    6950.00    DOE/SC/Oak Ridge National Laboratory          Cray XT5-HE, Opteron 6-core 2.6 GHz
      56        4        492.64    2580.00    National Supercomputing Centre in Shenzhen    Dawning TC3600 Blade System, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2050
                                              (NSCS)

      Summary (Mflops/watt and the corresponding MWatts per Exaflop; the conversion is spelled out in the sketch below):
      - IBM and NNSA, Blue Gene/Q:              2026.48 -> 493;  1689.86 -> 592
      - Nagasaki U., Intel i5, ATI Radeon GPU:  1378.32 -> 726
      - BSC, Xeon 6C, NVIDIA 2090 GPU:          1266.26 -> 790
      (Efficiency bands: >1 GF/watt, 500-1000 MF/watt, 100-500 MF/watt)
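      The MW-per-Exaflop figures follow directly from the Green500 efficiencies; a minimal sketch of the
      conversion (efficiency values copied from the table above, the program itself is illustrative):

          #include <stdio.h>

          int main(void) {
              /* Green500 efficiencies (Mflops/W) from the table above */
              const double mflops_per_watt[] = { 2026.48, 1689.86, 1378.32, 1266.26 };
              const char  *name[] = { "Blue Gene/Q", "Blue Gene/Q prototype",
                                      "Nagasaki DEGIMA", "BSC Bullx + NVIDIA 2090" };

              for (int i = 0; i < 4; i++) {
                  /* 1 Eflop/s = 1e12 Mflop/s; divide by efficiency to get watts,
                     then scale to megawatts */
                  double mw_per_exaflop = 1e12 / mflops_per_watt[i] / 1e6;
                  printf("%-26s %8.2f Mflops/W -> %4.0f MW per Exaflop\n",
                         name[i], mflops_per_watt[i], mw_per_exaflop);
              }
              return 0;
          }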
