Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Like this presentation? Why not share!

- JPS - It's Time for You to Grade! by glkeeler 110 views
- http://catering.inthebostonlocalare... by cateringboston 211 views
- Welcome to IPAR Industrial Partners... by LeonPuts 409 views
- La nuova società italiana trasforma... by Cagliostro Puntodue 494 views
- M&m project Kat + Raph by newham5-6 191 views
- Hipoperfusao oculta by wruivo 618 views

7,584 views

Published on

Published in:
Technology

No Downloads

Total views

7,584

On SlideShare

0

From Embeds

0

Number of Embeds

6,238

Shares

0

Downloads

21

Comments

0

Likes

1

No embeds

No notes for slide

- 1. Critical Issues at Exascale for Algorithm and Software DesignSC12, Salt Lake City, Utah, Nov 2012 Jack Dongarra, University of Tennessee, Tennessee, USA
- 2. Performance Development in Top500 1E+11 1E+10 1 Eflop/s 1E+0910000000 100 Pflop/s 10 Pflop/s10000000 1000000 1 Pflop/s N=1 100000 100 Tflop/s 10000 10 Tflop/s 1 1000 Tflop/s N=500 100 100 Gflop/s 10 10 Gflop/s 1 1 Gflop/s 1996 2002 2008 2014 2020 0.1
- 3. Potential System ArchitectureSystems 2012 2022 Difference Titan Computer Today & 2022System peak 27 Pflop/s 1 Eflop/s O(100)Power 8.3 MW ~20 MW (2 Gflops/W) (50 Gflops/W)System memory 710 TB 32 - 64 PB O(10) (38*18688)Node performance 1,452 GF/s 1.2 or 15TF/s O(10) – O(100) (1311+141)Node memory BW 232 GB/s 2 - 4TB/s O(1000) (52+180)Node concurrency 16 cores CPU O(1k) or 10k O(100) – O(1000) 2688 CUDA coresTotal Node Interconnect 8 GB/s 200-400GB/s O(10)BWSystem size (nodes) 18,688 O(100,000) or O(1M) O(100) – O(1000)Total concurrency 50 M O(billion) O(1,000)MTTI ?? unknown O(<1 day) - O(10)
- 4. Potential System Architecture with a cap of $200M and 20MWSystems 2012 2022 Difference Titan Computer Today & 2022System peak 27 Pflop/s 1 Eflop/s O(100)Power 8.3 MW ~20 MW (2 Gflops/W) (50 Gflops/W)System memory 710 TB 32 - 64 PB O(10) (38*18688)Node performance 1,452 GF/s 1.2 or 15TF/s O(10) – O(100) (1311+141)Node memory BW 232 GB/s 2 - 4TB/s O(1000) (52+180)Node concurrency 16 cores CPU O(1k) or 10k O(100) – O(1000) 2688 CUDA coresTotal Node Interconnect 8 GB/s 200-400GB/s O(10)BWSystem size (nodes) 18,688 O(100,000) or O(1M) O(100) – O(1000)Total concurrency 50 M O(billion) O(1,000)MTTI ?? unknown O(<1 day) - O(10)
- 5. Critical Issues at Peta & Exascale forAlgorithm and Software Design Synchronization-reducing algorithms Break Fork-Join model Communication-reducing algorithms Use methods which have lower bound on communication Mixed precision methods 2x speed of ops and 2x speed for data movement Autotuning Today’s machines are too complicated, build “smarts” into software to adapt to the hardware Fault resilient algorithms Implement algorithms that can recover from failures/bit flips Reproducibility of results Today we can’t guarantee this. We understand the issues, 5 but some of our “colleagues” have a hard time with this.
- 6. Major Changes to Algorithms/Software• Must rethink the design of our algorithms and software Manycore and Hybrid architectures are disruptive technology Similar to what happened with cluster computing and message passing Rethink and rewrite the applications, algorithms, and software Data movement is expensive Flops are cheap 6
- 7. Fork-Join Parallelization of LU and QR.Parallelize the update: dgemm • Easy and done in any reasonable software. • This is the 2/3n3 term in the FLOPs count. - • Can be done efficiently with LAPACK+multithreaded BLAS Cores
- 8. Synchronization (in LAPACK LU) Step 1 Step 2 Step 3 Step 4 ... synchronous processing • Fork-join, bulk fork join 27 bulk synchronous processing 8Allowing for delayed update, out of order, asynchronous, dataflow execution
- 9. PLASMA/MAGMA: Parallel Linear Algebra s/w for Multicore/Hybrid ArchitecturesObjectives High utilization of each core Scaling to large number of cores Synchronization reducing algorithmsMethodology Dynamic DAG scheduling (QUARK) Explicit parallelism Implicit communication Fine granularity / block data layoutArbitrary DAG with dynamic scheduling Fork-join parallelism DAG scheduled parallelism 9
- 10. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rdConference on Hypercube Concurrent Computers and Applications, volume II, Applications,pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 10
- 11. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rdConference on Hypercube Concurrent Computers and Applications, volume II, Applications,pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 11
- 12. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rdConference on Hypercube Concurrent Computers and Applications, volume II, Applications,pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 12
- 13. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rdConference on Hypercube Concurrent Computers and Applications, volume II, Applications,pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 13
- 14. Communication Avoiding QR Example R0 R0 R0R R D0 Domain_Tile_QR D0 R1 D1 Domain_Tile_QR D1 R2 R2 D2 Domain_Tile_QR D2 R3 D3 Domain_Tile_QR D3A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rdConference on Hypercube Concurrent Computers and Applications, volume II, Applications,pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012 14
- 15. PowerPack 2.0The PowerPack platform consists of software and hardware instrumentation. 15Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
- 16. Power for QR Factorization LAPACK’s QR Factorization Fork-join based MKL’s QR Factorization Fork-join based PLASMA’s Conventional QR Factorization DAG based PLASMA’s Communication Reducing QR Factorization DAG baseddual-socket quad-core Intel Xeon E5462 (Harpertown) processor@ 2.80GHz (8 cores total) w / MLK BLAS 16matrix size is very tall and skinny (mxn is 1,152,000 by 288)

No public clipboards found for this slide

×
### Save the most important slides with Clipping

Clipping is a handy way to collect and organize the most important slides from a presentation. You can keep your great finds in clipboards organized around topics.

Be the first to comment