- 1. Critical Issues at Exascale for Algorithm and Software Design. SC12, Salt Lake City, Utah, Nov 2012. Jack Dongarra, University of Tennessee, Tennessee, USA
- 2. Performance Development in Top500 [Figure: performance on a log scale, from 1 Gflop/s to 1 Eflop/s, versus year, 1996 to 2020, with curves labeled N=1 and N=500]
- 3. Potential System Architecture

  | Systems | 2012 (Titan Computer) | 2022 | Difference Today & 2022 |
  |---|---|---|---|
  | System peak | 27 Pflop/s | 1 Eflop/s | O(100) |
  | Power | 8.3 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) | |
  | System memory | 710 TB (38*18688) | 32 - 64 PB | O(10) |
  | Node performance | 1,452 GF/s (1311+141) | 1.2 or 15 TF/s | O(10) – O(100) |
  | Node memory BW | 232 GB/s (52+180) | 2 - 4 TB/s | O(1000) |
  | Node concurrency | 16 cores CPU, 2688 CUDA cores | O(1k) or 10k | O(100) – O(1000) |
  | Total Node Interconnect BW | 8 GB/s | 200-400 GB/s | O(10) |
  | System size (nodes) | 18,688 | O(100,000) or O(1M) | O(100) – O(1000) |
  | Total concurrency | 50 M | O(billion) | O(1,000) |
  | MTTI | ?? unknown | O(<1 day) | - O(10) |

- 4. Potential System Architecture with a cap of $200M and 20MW (same table as slide 3)
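The parenthesized figures in the system table on slides 3 and 4 can be checked with a few lines of arithmetic. The sketch below copies the numbers from the slide; the unit readings (for example, that 38 is GB of memory per node) are assumptions, since the slide does not state them.

```python
# Numbers copied from the slide; unit interpretations are assumptions.
nodes = 18_688
mem_per_node_gb = 38                       # assumed: GB per node, from "(38*18688)"
system_memory_tb = nodes * mem_per_node_gb / 1000

node_gflops = 1311 + 141                   # from "(1311+141)", in GF/s
node_mem_bw_gbs = 52 + 180                 # from "(52+180)", in GB/s

# 2022 target: 1 Eflop/s within about 20 MW.
exa_gflops_per_watt = 1e18 / 20e6 / 1e9

# Ratio of the 2022 peak to the 2012 peak.
peak_ratio = 1e18 / 27e15

print(f"system memory   : {system_memory_tb:.0f} TB")
print(f"node performance: {node_gflops} GF/s")
print(f"node memory BW  : {node_mem_bw_gbs} GB/s")
print(f"2022 efficiency : {exa_gflops_per_watt:.0f} Gflops/W")
print(f"peak ratio      : {peak_ratio:.0f}x")
```

The sums reproduce the 710 TB, 1,452 GF/s, and 232 GB/s entries, and 1 Eflop/s at 20 MW gives the 50 Gflops/W quoted for 2022.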
- 5. Critical Issues at Peta & Exascale for Algorithm and Software Design
  - Synchronization-reducing algorithms: break the fork-join model.
  - Communication-reducing algorithms: use methods that attain the lower bound on communication.
  - Mixed precision methods: 2x speed of ops and 2x speed for data movement.
  - Autotuning: today's machines are too complicated; build "smarts" into software to adapt to the hardware.
  - Fault resilient algorithms: implement algorithms that can recover from failures/bit flips.
  - Reproducibility of results: today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
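One common form of the mixed precision item on slide 5 is iterative refinement: factor the matrix once in single precision, then recover double-precision accuracy with cheap residual corrections. The slide does not say which method it has in mind, so the following is only a sketch under that assumption. The helper names (`f32`, `lu_factor_single`, `lu_solve_single`, `solve_mixed`) are hypothetical, single precision is emulated by rounding through `struct`, and the code shows the accuracy behavior only, not the 2x speed.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_single(a):
    """LU with partial pivoting, every result rounded to single precision."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))                      # perm[k]: original row now at k
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])            # multiplier (L part)
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm

def lu_solve_single(lu, perm, b):
    """Forward and back substitution in emulated single precision."""
    n = len(lu)
    y = [f32(b[p]) for p in perm]
    for i in range(n):                         # solve L y = P b (unit diagonal)
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):               # solve U x = y, in place
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y

def solve_mixed(a, b, iters=5):
    """Single-precision factorization plus double-precision refinement."""
    lu, perm = lu_factor_single(a)             # the O(n^3) work, done once
    x = lu_solve_single(lu, perm, b)
    for _ in range(iters):
        # Residual in double precision: r = b - A x.
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(a, b)]
        d = lu_solve_single(lu, perm, r)       # O(n^2) correction
        x = [xi + di for xi, di in zip(x, d)]
    return x

# Small well-conditioned test system (made up for illustration).
a = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
x_true = [0.1, 0.2, 0.3, 0.4]
b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in a]

lu, perm = lu_factor_single(a)
x_single = lu_solve_single(lu, perm, b)
x_mixed = solve_mixed(a, b)
err_single = max(abs(u - v) for u, v in zip(x_single, x_true))
err_mixed = max(abs(u - v) for u, v in zip(x_mixed, x_true))
print("single-only error:", err_single)
print("refined error    :", err_mixed)
```

The refinement only converges when the matrix is reasonably well conditioned relative to single precision, which is why the test system here is diagonally dominant.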
- 6. Major Changes to Algorithms/Software
  - Must rethink the design of our algorithms and software.
    - Manycore and hybrid architectures are disruptive technology.
    - Similar to what happened with cluster computing and message passing.
    - Rethink and rewrite the applications, algorithms, and software.
    - Data movement is expensive.
    - Flops are cheap.
- 7. Fork-Join Parallelization of LU and QR. Parallelize the update (dgemm):
  - Easy and done in any reasonable software.
  - This is the $\frac{2}{3}n^3$ term in the FLOPs count.
  - Can be done efficiently with LAPACK + multithreaded BLAS.
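The $\frac{2}{3}n^3$ term quoted on slide 7 can be checked by counting the flops of the trailing-matrix update directly. This is a minimal sketch for unblocked LU, assuming one multiply and one add per updated entry; the function name `lu_update_flops` is made up for illustration.

```python
def lu_update_flops(n):
    """Flops spent in the trailing-matrix updates of an n x n LU.

    At step k the (n-k-1) x (n-k-1) trailing matrix receives one multiply
    and one add per entry, so the step costs 2 * (n-k-1)**2 flops.
    """
    return sum(2 * (n - k - 1) ** 2 for k in range(n))

n = 1000
update = lu_update_flops(n)
leading_term = 2 * n**3 / 3
ratio = update / leading_term
print(f"update flops: {update}, 2/3 n^3: {leading_term:.0f}, ratio: {ratio:.4f}")
```

For large `n` the ratio approaches 1, which is why parallelizing only the update already covers almost all of the arithmetic.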
- 8. Synchronization (in LAPACK LU) [Figure: execution trace of Step 1, Step 2, Step 3, Step 4, ...]
  - Fork-join, bulk synchronous processing.
  - Allowing for delayed update, out of order, asynchronous, dataflow execution.
- 9. PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures
  - Objectives
    - High utilization of each core
    - Scaling to large number of cores
    - Synchronization reducing algorithms
  - Methodology
    - Dynamic DAG scheduling (QUARK)
    - Explicit parallelism
    - Implicit communication
    - Fine granularity / block data layout
  - Arbitrary DAG with dynamic scheduling [Figure: fork-join parallelism compared with DAG scheduled parallelism]
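The idea behind DAG scheduling on slide 9 can be illustrated by deriving task dependencies from the tiles each task reads and writes. The sketch below uses a tile LU without pivoting, chosen for simplicity and not taken from the slides. The task names and the functions `tile_lu_tasks`, `build_dag`, and `critical_path` are hypothetical, and this is not QUARK's API.

```python
def tile_lu_tasks(nt):
    """List tile LU tasks (no pivoting) in sequential program order.

    Each task is (name, tiles_read, tile_written) on an nt x nt tile grid.
    """
    tasks = []
    for k in range(nt):
        tasks.append((f"GETRF({k})", [], (k, k)))
        for j in range(k + 1, nt):
            tasks.append((f"TRSM_ROW({k},{j})", [(k, k)], (k, j)))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM_COL({i},{k})", [(k, k)], (i, k)))
        for i in range(k + 1, nt):
            for j in range(k + 1, nt):
                tasks.append((f"GEMM({i},{j},{k})", [(i, k), (k, j)], (i, j)))
    return tasks

def build_dag(tasks):
    """Map each task to the set of earlier tasks it must wait for.

    A task depends on the last writer of every tile it touches. Only
    read-after-write and write-after-write hazards are tracked, which is
    enough here because no tile is rewritten after another task reads it.
    """
    last_writer = {}
    deps = {}
    for name, reads, write in tasks:
        touched = reads + [write]          # the written tile is updated in place
        deps[name] = {last_writer[t] for t in touched if t in last_writer}
        last_writer[write] = name
    return deps

def critical_path(tasks, deps):
    """Longest dependency chain, counting each task as one unit of work."""
    depth = {}
    for name, _, _ in tasks:               # program order is a topological order
        depth[name] = 1 + max((depth[d] for d in deps[name]), default=0)
    return max(depth.values())

tasks = tile_lu_tasks(3)
deps = build_dag(tasks)
print("tasks:", len(tasks))
print("GETRF(1) waits for:", sorted(deps["GETRF(1)"]))
print("critical path length:", critical_path(tasks, deps))
```

In this sketch `GETRF(1)` waits only for `GEMM(1,1,0)`, not for every update of step 0, so a dataflow scheduler may start the next panel while the remaining updates are still running. A fork-join code would place a barrier there instead.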
- 10. Communication Avoiding QR Example [Figure: domains D0 to D3 are each processed by Domain_Tile_QR to give R0 to R3, which are combined into the final R] A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State. 11/20/2012
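The communication avoiding QR example on slide 10 can be sketched for a tall and skinny matrix as follows: factor each row domain independently, then combine the small R factors. The pairwise (binary tree) merge order is an assumption read off the figure labels, and modified Gram-Schmidt stands in for the Householder-based kernels a production library would use. The names `qr_r` and `tsqr_r` are hypothetical.

```python
import math
import random

def qr_r(a):
    """Return the upper-triangular R of a = QR via modified Gram-Schmidt.

    Assumes a has at least as many rows as columns and full column rank.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]   # column copies
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / r[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(qi * x for qi, x in zip(q, cols[k]))
            cols[k] = [x - r[j][k] * qi for x, qi in zip(cols[k], q)]
    return r

def tsqr_r(a, domains=4):
    """R factor of a tall-skinny matrix from per-domain QRs merged pairwise."""
    size = len(a) // domains
    blocks = [a[d * size:(d + 1) * size] for d in range(domains - 1)]
    blocks.append(a[(domains - 1) * size:])        # last domain takes the rest
    rs = [qr_r(block) for block in blocks]         # independent local QRs
    while len(rs) > 1:
        merged = []
        for i in range(0, len(rs), 2):
            if i + 1 < len(rs):
                merged.append(qr_r(rs[i] + rs[i + 1]))   # stack two Rs, refactor
            else:
                merged.append(rs[i])
        rs = merged
    return rs[0]

rng = random.Random(0)
m, n = 32, 3
a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]

r_direct = qr_r(a)
r_tree = tsqr_r(a, domains=4)
max_diff = max(abs(r_direct[i][j] - r_tree[i][j])
               for i in range(n) for j in range(n))
print("max difference between R factors:", max_diff)
```

Because both routes produce an R with a positive diagonal, the two factors agree up to rounding. In a distributed setting, only the small n x n R factors would need to travel between domains.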
- 15. PowerPack 2.0. The PowerPack platform consists of software and hardware instrumentation. Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
- 16. Power for QR Factorization
  - LAPACK's QR Factorization: fork-join based
  - MKL's QR Factorization: fork-join based
  - PLASMA's Conventional QR Factorization: DAG based
  - PLASMA's Communication Reducing QR Factorization: DAG based

  Dual-socket quad-core Intel Xeon E5462 (Harpertown) processor @ 2.80GHz (8 cores total) with MKL BLAS. The matrix is very tall and skinny (m x n is 1,152,000 by 288).
