Bisection bandwidth is calculated from the partition sizes and the link bandwidth. For Sequoia, 96 racks are arranged as a 16x12x16x16x2 torus. The minimal bisecting surface is 12x16x16x2, so the bandwidth across it is (12x16x16x2) x 2 x 2 x 2 = 49.152 TB/s. Here, one 2x2 reflects the bidirectional 2 GB/s links; the remaining 2 comes from the torus wrap-around, which brings a pair of links in each dimension.

* Node memory BW for BG/Q is 42.6 GB/s advertised and 28.6 GB/s on STREAM triad. That's **NODE**, not per core. What is the BG/Q bisection BW?

Missing from the table: latencies, message injection rates.

Notes on power and cost (flops/watt costs money, and bytes/flop costs money):
* Joule/op in 2009, 2015, and 2018. 2015: 100 pJ/op.
* Capacity doesn't cost as much power as bandwidth. How many joules to move a bit? About 2 pJ/bit on-chip, but 75 pJ/bit for accessing DRAM.
* 32 PB, with system memory at the same fraction of the system; need a $ number.
* Best machine for 20 MW, and best machine for $200M.
* A memory op is a 64-bit word: 75 pJ/bit (DDR3 spec) multiplied by 64 bits, versus roughly 50 pJ for an entire 64-bit arithmetic op.
* Memory technology can reach 5 pJ/bit by 2015 if we invest soon. Anything more aggressive than 4 pJ/bit is close to the limit (vendors will not sign up for 2 pJ/bit).
* 10 pJ/flop in 2015, 5 pJ/flop in 2018. So we are talking about a 30:1 energy ratio of memory reference to flop, at roughly 10 pJ per operation to bring a byte in.
* 8 terabits/s x 1 pJ/bit -> 8 watts.
* JEDEC is fundamentally broken (DDR4 is the end). Low-swing differential signaling is an insertion of known technology.
* 20 GB/s per component, up to one order of magnitude more; 10-12 Gbit/s per wire.
* 16-64x concurrency growth, using Courant-limited scaling of hydro codes.
* Need the cost per DRAM part in that timeframe, and how much to spend.

Memory concurrency (# outstanding memory references per cycle): bandwidth x latency, divided by the memory reference size. 200 cycles from DRAM at 2 GHz is 100 ns (40 ns for the memory alone); with queues it will still be ~100 ns. That is O(1000) references per node to memory, or O(10k) for 64-byte cache lines?

Still need to add system bisection bandwidth. 2015: some multiple of local node bandwidth, a factor of 4-8 (or 2-4) against per-node interconnect bandwidth; 2018: ?? Occupancy vs. latency: zero occupancy (1 slot for message launch), 5 ns; 2-4 in 2015, 2-4 in 2018; 10^4 vs. 10^9.
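The two calculations in the notes above, the Sequoia bisection-bandwidth product and the "outstanding references = bandwidth x latency" concurrency rule (Little's law), can be sanity-checked directly. A minimal sketch; the function names are illustrative, and all figures come from the notes:

```python
def bisection_bandwidth_tb_s(surface_links, link_gb_s=2.0):
    """Bisection BW = links crossing the cut x 2 (bidirectional) x 2 (torus pair)."""
    return surface_links * link_gb_s * 2 * 2 / 1000.0  # GB/s -> TB/s

surface = 12 * 16 * 16 * 2  # minimal surface of the 16x12x16x16x2 arrangement
print(bisection_bandwidth_tb_s(surface))  # 49.152 TB/s

def required_concurrency(bandwidth_b_s, latency_s, reference_bytes):
    """Little's law: in-flight references needed to cover the latency."""
    return bandwidth_b_s * latency_s / reference_bytes

# BG/Q node: 42.6 GB/s, ~100 ns effective latency, 64 B cache lines
print(required_concurrency(42.6e9, 100e-9, 64))  # ~67 lines in flight
# A hypothetical 4 TB/s 2022 node under the same assumptions
print(required_concurrency(4e12, 100e-9, 64))    # 6250, i.e. approaching O(10k)
```

The last figure is how the notes' "O(10k) for 64-byte cache lines" estimate arises: terabyte-per-second node bandwidth at ~100 ns latency forces thousands of concurrent outstanding references.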
The PowerPack platform consists of software and hardware instrumentation. Hardware tools include a WattsUp Pro meter and an NI meter, which connect to the monitored system to measure system-level and component-level power streams.
The system power is measured upstream of the AC/DC converter, which has an efficiency of 80-85%. Power is lost in this conversion, which is why the system-level and component-level numbers do not add up. The measured node has an Intel Xeon E5462 (Harpertown) processor with 8 DIMMs, i.e. 16 GB of memory (each DIMM is 2 GB DDR2).
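The gap between wall power and component power follows directly from the converter efficiency. A small sketch, assuming a hypothetical 400 W wall reading; only the 80-85% efficiency range comes from the text:

```python
def dc_power(ac_watts, efficiency):
    """Power actually delivered to components, given wall power and PSU efficiency."""
    return ac_watts * efficiency

wall = 400.0  # hypothetical WattsUp Pro reading, watts
for eff in (0.80, 0.85):
    delivered = dc_power(wall, eff)
    print(f"efficiency {eff:.0%}: {delivered:.0f} W reaches the components, "
          f"{wall - delivered:.0f} W lost in conversion")
```

So a component-level sum 60-80 W short of the wall reading is expected, not a measurement error.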
Transcript
1.
Critical Issues at Exascale for Algorithm and Software Design. SC12, Salt Lake City, Utah, Nov 2012. Jack Dongarra, University of Tennessee, USA.
4.
Potential System Architecture with a cap of $200M and 20 MW

| Systems | 2012 (Titan) | 2022 | Difference Today & 2022 |
| --- | --- | --- | --- |
| System peak | 27 Pflop/s | 1 Eflop/s | O(100) |
| Power | 8.3 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) | |
| System memory | 710 TB (38 GB x 18,688) | 32-64 PB | O(10) |
| Node performance | 1,452 GF/s (1,311 + 141) | 1.2 or 15 TF/s | O(10) - O(100) |
| Node memory BW | 232 GB/s (52 + 180) | 2-4 TB/s | O(1000) |
| Node concurrency | 16 CPU cores, 2,688 CUDA cores | O(1k) or 10k | O(100) - O(1000) |
| Total node interconnect BW | 8 GB/s | 200-400 GB/s | O(10) |
| System size (nodes) | 18,688 | O(100,000) or O(1M) | O(100) - O(1000) |
| Total concurrency | 50 M | O(billion) | O(1,000) |
| MTTI | ?? unknown | O(<1 day) | -O(10) |
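The power-efficiency parentheticals in the table are just peak divided by power; a quick check (note the 2012 Titan ratio works out closer to 3.3 Gflops/W than the slide's rounded figure):

```python
def gflops_per_watt(peak_flops, watts):
    """Power efficiency: peak flop rate divided by power draw, in Gflops/W."""
    return peak_flops / watts / 1e9

print(gflops_per_watt(1e18, 20e6))    # 2022 target: 50.0 Gflops/W
print(gflops_per_watt(27e15, 8.3e6))  # Titan 2012: ~3.25 Gflops/W
```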
5.
Critical Issues at Peta & Exascale for Algorithm and Software Design
• Synchronization-reducing algorithms: break the fork-join model.
• Communication-reducing algorithms: use methods which attain the lower bound on communication.
• Mixed-precision methods: 2x the speed of ops and 2x the speed of data movement.
• Autotuning: today's machines are too complicated; build "smarts" into software to adapt to the hardware.
• Fault-resilient algorithms: implement algorithms that can recover from failures/bit flips.
• Reproducibility of results: today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
6.
Major Changes to Algorithms/Software
• Must rethink the design of our algorithms and software.
• Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.
• Data movement is expensive; flops are cheap.
7.
Fork-Join Parallelization of LU and QR
Parallelize the update (dgemm):
• Easy and done in any reasonable software.
• This is the 2/3 n^3 term in the FLOP count.
• Can be done efficiently with LAPACK + multithreaded BLAS.
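The structure the slide describes can be sketched as a right-looking blocked LU: a small sequential panel factorization, then one big dgemm-style trailing update that carries the 2/3 n^3 flops. A minimal sketch without pivoting (so the test matrix is made diagonally dominant); variable names are illustrative:

```python
import numpy as np

def blocked_lu(A, nb=32):
    """Blocked LU without pivoting; returns L and U packed in one matrix."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Panel factorization: small, sequential, the fork-join bottleneck.
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # Triangular solve for the block row of U.
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
            # Trailing update: one big GEMM, the 2/3 n^3 term, easy to thread.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

rng = np.random.default_rng(0)
n = 128
M = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant
F = blocked_lu(M)
L, U = np.tril(F, -1) + np.eye(n), np.triu(F)
print(np.allclose(L @ U, M))  # True
```

With a multithreaded BLAS behind `@`, only the trailing update parallelizes; the panel loop stays serial, which is exactly the synchronization problem the following slides address.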
8.
Synchronization (in LAPACK LU): Step 1, Step 2, Step 3, Step 4, ... synchronous processing.
• Fork-join, bulk synchronous processing.
• The alternative: allow for delayed update, out-of-order, asynchronous, dataflow execution.
9.
PLASMA/MAGMA: Parallel Linear Algebra s/w for Multicore/Hybrid Architectures
Objectives:
• High utilization of each core
• Scaling to large numbers of cores
• Synchronization-reducing algorithms
Methodology:
• Dynamic DAG scheduling (QUARK)
• Explicit parallelism
• Implicit communication
• Fine granularity / block data layout
Arbitrary DAGs with dynamic scheduling replace fork-join parallelism with DAG-scheduled parallelism.
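The dynamic-scheduling idea can be sketched in a few lines: each task declares its inputs, and it launches as soon as those inputs are ready rather than waiting at a global fork-join barrier. This is a toy illustration, not QUARK's API; task names are invented, and a real runtime tracks readiness instead of blocking worker threads on futures:

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(tasks):
    """tasks: {name: (fn, [dependency names])}, listed in dependency order."""
    futures = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        for name, (fn, deps) in tasks.items():
            dep_futs = [futures[d] for d in deps]
            # Each task waits only on *its own* inputs, not on a global barrier.
            futures[name] = pool.submit(
                lambda fn=fn, dep_futs=dep_futs: fn(*[f.result() for f in dep_futs]))
        return {name: f.result() for name, f in futures.items()}

tasks = {
    "panel": (lambda: 2, []),
    "upd_a": (lambda p: p + 1, ["panel"]),
    "upd_b": (lambda p: p * 3, ["panel"]),  # runs concurrently with upd_a
    "merge": (lambda a, b: a + b, ["upd_a", "upd_b"]),
}
print(run_dag(tasks)["merge"])  # 9
```

Here `upd_a` and `upd_b` overlap as soon as `panel` finishes, the dataflow behavior that fork-join execution forbids.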
10.
Communication Avoiding QR Example
(Figure: the tall matrix is split into domains D0-D3; each domain is factored independently by Domain_Tile_QR, and the resulting R factors R0-R3 are combined in a pairwise reduction.)
A. Pothen and P. Raghavan. Distributed orthogonal factorization. In Proceedings of the 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM.
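The reduction pattern in the figure is what is now usually called TSQR. A minimal NumPy sketch, with illustrative sizes and a single combine step instead of a full binary tree: factor each domain independently, stack the small R factors, and factor the stack once more. The result matches the R of a full QR up to row signs:

```python
import numpy as np

def tsqr_r(A, domains=4):
    """R factor of a tall-skinny A via per-domain QR plus one combine step."""
    rs = [np.linalg.qr(block, mode="r") for block in np.array_split(A, domains)]
    return np.linalg.qr(np.vstack(rs), mode="r")

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 32))   # very tall and skinny
R_tsqr = tsqr_r(A)
R_full = np.linalg.qr(A, mode="r")
# R is unique up to the sign of each row, so compare magnitudes.
print(np.allclose(np.abs(R_tsqr), np.abs(R_full)))  # True
```

The communication saving is that each domain's QR needs no data from the others; only the tiny n x n R factors move, instead of panel columns at every step.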
15.
PowerPack 2.0
The PowerPack platform consists of software and hardware instrumentation.
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
16.
Power for QR Factorization
• LAPACK's QR factorization: fork-join based
• MKL's QR factorization: fork-join based
• PLASMA's conventional QR factorization: DAG based
• PLASMA's communication-reducing QR factorization: DAG based
Platform: dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS. The matrix is very tall and skinny (m x n = 1,152,000 x 288).