- 1. Critical Issues at Exascale for Algorithm and Software Design. SC12, Salt Lake City, Utah, Nov. 2012. Jack Dongarra, University of Tennessee, Tennessee, USA.
- 2. Performance Development in Top500 [chart: log-scale performance (1 Gflop/s to 1 Eflop/s) vs. year (1996-2020); trend lines for the N=1 and N=500 systems]
- 3. Potential System Architecture. Titan (2012) → projected 2022 system, with the difference between today and 2022:
  - System peak: 27 Pflop/s → 1 Eflop/s (O(100))
  - Power: 8.3 MW (2 Gflops/W) → ~20 MW (50 Gflops/W)
  - System memory: 710 TB (38 GB × 18,688 nodes) → 32-64 PB (O(10))
  - Node performance: 1,452 GF/s (1,311 + 141) → 1.2 or 15 TF/s (O(10)-O(100))
  - Node memory BW: 232 GB/s (52 + 180) → 2-4 TB/s (O(1000))
  - Node concurrency: 16 CPU cores + 2,688 CUDA cores → O(1k) or 10k (O(100)-O(1000))
  - Total node interconnect BW: 8 GB/s → 200-400 GB/s (O(10))
  - System size: 18,688 nodes → O(100,000) or O(1M) nodes (O(100)-O(1000))
  - Total concurrency: 50 M → O(billion) (O(1,000))
  - MTTI: unknown → O(<1 day) (O(10))
- 4. Potential System Architecture with a cap of $200M and 20MW (table identical to slide 3)
- 5. Critical Issues at Peta & Exascale for Algorithm and Software Design
  - Synchronization-reducing algorithms: break the fork-join model.
  - Communication-reducing algorithms: use methods that have a lower bound on communication.
  - Mixed-precision methods: 2x speed of ops and 2x speed for data movement.
  - Autotuning: today's machines are too complicated; build "smarts" into software to adapt to the hardware.
  - Fault-resilient algorithms: implement algorithms that can recover from failures/bit flips.
  - Reproducibility of results: today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
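The mixed-precision idea above is classically realized as iterative refinement: do the O(n^3) factorization in single precision, then recover double-precision accuracy with cheap O(n^2) correction steps. A minimal NumPy sketch of that pattern (an explicit inverse stands in for a single-precision LU factorization, and `mixed_precision_solve` is a name chosen here for illustration, not a PLASMA/MAGMA routine):

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b: heavy work in float32, accuracy recovered in float64."""
    A32 = A.astype(np.float32)
    inv32 = np.linalg.inv(A32)   # stand-in for a single-precision LU factorization
    # Initial solve entirely in single precision.
    x = (inv32 @ b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                      # residual in double precision
        d = inv32 @ r.astype(np.float32)   # cheap correction from float32 factors
        x += d.astype(np.float64)
    return x
```

For a reasonably conditioned matrix, a handful of refinement steps brings the solution back to double-precision accuracy while most of the arithmetic and data movement happened at single precision.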
- 6. Major Changes to Algorithms/Software. We must rethink the design of our algorithms and software: manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing, so we must rethink and rewrite the applications, algorithms, and software. Data movement is expensive; flops are cheap.
- 7. Fork-Join Parallelization of LU and QR. Parallelize the update (dgemm, spread across the cores): easy and done in any reasonable software; this is the 2/3 n^3 term in the FLOP count; it can be done efficiently with LAPACK + multithreaded BLAS.
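The structure described here, a small panel factorization followed by a large trailing-matrix dgemm update, can be sketched as a right-looking blocked LU. This is a simplified illustration (no pivoting, so it is only safe for matrices such as diagonally dominant ones; it is not LAPACK's actual dgetrf):

```python
import numpy as np

def blocked_lu(A, nb=32):
    """Right-looking blocked LU without pivoting; returns L\\U packed in one matrix."""
    A = A.astype(np.float64).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Panel factorization: small and sequential (the fork-join bottleneck).
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        # Triangular solve for the block row (forward substitution, unit-diagonal L).
        for j in range(k + 1, e):
            A[j, e:] -= A[j, k:j] @ A[k:j, e:]
        # Trailing-matrix update: the 2/3 n^3 dgemm term that a
        # multithreaded BLAS parallelizes across cores.
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```

The point of the slide is that nearly all the flops land in the last line, which a multithreaded BLAS handles well, while the panel step forces the cores to wait.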
- 8. Synchronization (in LAPACK LU) [diagram: Step 1, Step 2, Step 3, Step 4, ... executed as fork-join, bulk synchronous processing]. Allowing for delayed update, out-of-order, asynchronous, dataflow execution.
- 9. PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures
  - Objectives: high utilization of each core; scaling to a large number of cores; synchronization-reducing algorithms.
  - Methodology: dynamic DAG scheduling (QUARK); explicit parallelism; implicit communication; fine granularity / block data layout.
  - [diagram: fork-join parallelism vs. DAG-scheduled parallelism for an arbitrary DAG with dynamic scheduling]
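The contrast on this slide, tasks released by dataflow dependencies rather than global fork-join barriers, can be illustrated with a toy dynamic scheduler. This is a hypothetical sketch in Python threads; QUARK itself is a C runtime with a different API:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_dag(tasks, deps, workers=4):
    """tasks: name -> zero-arg callable; deps: name -> prerequisite names.
    Each task starts as soon as its own inputs are done (no global barrier)."""
    done, futures, results = set(), {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(tasks):
            # Release every task whose dependencies have all completed.
            for name in tasks:
                if name not in futures and all(d in done for d in deps.get(name, ())):
                    futures[name] = pool.submit(tasks[name])
            # Harvest at least one finished task, then look for newly ready ones.
            pending = {f: n for n, f in futures.items() if n not in done}
            finished, _ = wait(pending, return_when=FIRST_COMPLETED)
            for f in finished:
                results[pending[f]] = f.result()
                done.add(pending[f])
    return results
```

With tile tasks (panel, triangular solve, gemm updates) as the nodes, independent updates from different steps can overlap instead of waiting at a bulk-synchronous join.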
- 10. Communication Avoiding QR Example [figure: domains D0-D3 each factored independently by Domain_Tile_QR, producing local factors R0-R3 that are merged pairwise into R]. A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM.
- 11.-14. (animation builds repeating the slide-10 figure and reference)
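The reduction pattern in the slide-10 figure, independent per-domain QR factorizations followed by pairwise merging of the small R factors, is the tall-skinny / communication-avoiding QR. A minimal NumPy sketch that recovers R (rebuilding the full Q takes extra bookkeeping, omitted here; `ca_qr_R` is a name chosen for illustration):

```python
import numpy as np

def ca_qr_R(A, domains=4):
    """Communication-avoiding QR of a tall-skinny A: returns the n x n R factor."""
    # Step 1: each domain factors its own block -- embarrassingly parallel,
    # no communication (the Domain_Tile_QR boxes in the figure).
    blocks = np.array_split(A, domains, axis=0)
    Rs = [np.linalg.qr(b, mode='r') for b in blocks]
    # Step 2: reduction tree -- stack neighboring R factors and re-factor;
    # only the small n x n R matrices ever move between domains.
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]
```

Because each merge involves only n x n triangles, the communication volume is independent of the (very large) row count, which is what makes the method attractive for matrices like the 1,152,000 x 288 case on slide 16.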
- 15. PowerPack 2.0. The PowerPack platform consists of software and hardware instrumentation. Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
- 16. Power for QR Factorization [figure: power traces for LAPACK's QR factorization (fork-join based), MKL's QR factorization (fork-join based), PLASMA's conventional QR factorization (DAG based), and PLASMA's communication-reducing QR factorization (DAG based)]. Platform: dual-socket quad-core Intel Xeon E5462 (Harpertown) at 2.80 GHz (8 cores total) with MKL BLAS; the matrix is very tall and skinny (m × n = 1,152,000 × 288).
