Architecture-aware Algorithms and Software for Peta and Exascale Computing

Jack Dongarra from the University of Tennessee presented these slides at the Ken Kennedy Institute for Information Technology on Feb 13, 2014.

Listen to the podcast review of this talk: http://insidehpc.com/2014/02/13/week-hpc-jack-dongarra-talks-algorithms-exascale/


Architecture-aware Algorithms and Software for Peta and Exascale Computing (slide transcript)

1. K2I Distinguished Lecture Series
   Architecture-aware Algorithms and Software for Peta and Exascale Computing
   Jack Dongarra
   University of Tennessee / Oak Ridge National Laboratory / University of Manchester
   Rice University, 2/13/14
2. Outline
   •  Overview of High Performance Computing
   •  Look at an implementation of some linear algebra algorithms on today's High Performance Computers
      §  As an example of the kind of thing needed.
3. H. Meuer, H. Simon, E. Strohmaier, & JD
   - Listing of the 500 most powerful computers in the world (the TOP500)
   - Yardstick: Rmax from LINPACK (Ax=b, dense problem), i.e. TPP performance (rate) vs. size
   - Updated twice a year: SC'xy in the States in November, meeting in Germany in June
   - All data available from www.top500.org
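Rmax comes from timing the solution of a dense linear system Ax = b. As a minimal illustration of that problem class (not HPL itself, which is a blocked, pivoted, MPI-distributed implementation), here is a single-node solve with LAPACK's dgesv driver; the matrix size is arbitrary:

    #include <stdio.h>
    #include <stdlib.h>
    #include <lapacke.h>

    /* Solve a random dense Ax = b, the problem class LINPACK/HPL measures.
     * Illustrative only: HPL itself blocks, pivots and communicates over MPI. */
    int main(void)
    {
        lapack_int n = 1000, nrhs = 1, info;
        double *A = malloc((size_t)n * n * sizeof *A);
        double *b = malloc((size_t)n * sizeof *b);
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

        for (lapack_int i = 0; i < n * n; i++) A[i] = drand48();
        for (lapack_int i = 0; i < n; i++)     b[i] = drand48();

        /* LU factorization with partial pivoting + triangular solves */
        info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv, b, n);
        printf("dgesv info = %d (0 means success)\n", (int)info);

        free(A); free(b); free(ipiv);
        return (int)info;
    }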
4. Performance Development of HPC Over the Last 20 Years
   [Chart: TOP500 performance, 1993-2013, log scale from 100 Mflop/s to 100 Pflop/s.
    SUM = 224 PFlop/s, N=1 = 33.9 PFlop/s, N=500 = 118 TFlop/s today; earlier labeled points
    include 1.17 TFlop/s, 59.7 GFlop/s and 400 MFlop/s, with roughly a 6-8 year lag between curves.
    For scale: my laptop is about 70 Gflop/s; my iPad2 & iPhone 4s about 1.02 Gflop/s.]
5. State of Supercomputing in 2014
   •  Pflops computing fully established with 31 systems.
   •  Three technology architecture possibilities or "swim lanes" are thriving:
      •  Commodity (e.g. Intel)
      •  Commodity + accelerator (e.g. GPUs)
      •  Special purpose lightweight cores (e.g. IBM BG)
   •  Interest in supercomputing is now worldwide, and growing in many new markets (over 50% of Top500 computers are in industry).
   •  Exascale projects exist in many countries and regions.
6. November 2013: The TOP10
   Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt
   1   | National University of Defense Technology | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
   2   | DOE / OS Oak Ridge Nat Lab | Titan, Cray XK7 (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
   3   | DOE / NNSA L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
   4   | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
   5   | DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
   6   | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726
   7   | Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 2.66 | 61 | 3.3 | 806
   8   | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178
   9   | DOE / NNSA L Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177
   10  | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 91* | 3.42 | 848
   500 | Banking | HP | USA | 22,212 | .118 | 50 | |
7. Accelerators (53 systems)
   [Chart: number of accelerated systems in the Top500, 2006-2013.]
   By accelerator type: Intel MIC (13), Clearspeed CSX600 (0), ATI GPU (2), IBM PowerXCell 8i (0),
   NVIDIA 2070 (4), NVIDIA 2050 (7), NVIDIA 2090 (11), NVIDIA K20 (16).
   By country: 19 US, 9 China, 6 Japan, 4 Russia, 2 France, 2 Germany, 2 India, 1 Italy, 1 Poland,
   1 Australia, 2 Brazil, 1 Saudi Arabia, 1 South Korea, 1 Spain, 2 Switzerland, 1 UK.
8. Top500 Performance Share of Accelerators
   53 of the 500 systems provide 35% of the accumulated performance.
   [Chart: fraction of total TOP500 performance delivered by accelerated systems, 2006-2013, rising from about 0% to about 35%.]
9. For the Top 500: Rank at which Half of Total Performance is Accumulated
   [Charts: number of systems needed to accumulate half of the total performance, 1994-2012,
    and Pflop/s vs. rank for the November 2013 Top 500.]
10. Commodity plus Accelerator Today
    Commodity: Intel Xeon, 8 cores, 3 GHz, 8*4 ops/cycle = 96 Gflop/s (DP).
    Accelerator (GPU): Nvidia K20X "Kepler", 2688 "Cuda cores" (192 Cuda cores/SMX), .732 GHz,
    2688*2/3 ops/cycle = 1.31 Tflop/s (DP), 6 GB memory.
    Interconnect: PCI-e Gen2/3, 16 lane, 64 Gb/s (8 GB/s), 1 GW/s.
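The peak numbers on this slide are just cores x clock x operations per cycle. A small sketch that reproduces the arithmetic with the slide's values:

    #include <stdio.h>

    /* Theoretical double-precision peak = cores * clock (GHz) * flops per cycle. */
    static double peak_gflops(double cores, double ghz, double flops_per_cycle)
    {
        return cores * ghz * flops_per_cycle;
    }

    int main(void)
    {
        /* Intel Xeon: 8 cores at 3 GHz, 4 DP flops/cycle/core -> 96 Gflop/s        */
        printf("Xeon peak: %6.1f Gflop/s\n", peak_gflops(8, 3.0, 4.0));
        /* Nvidia K20X: 2688 CUDA cores at 0.732 GHz, 2/3 DP flop/cycle/core
           (one DP unit per 3 SP cores, counting fused multiply-add) -> ~1311 Gflop/s */
        printf("K20X peak: %6.1f Gflop/s\n", peak_gflops(2688, 0.732, 2.0 / 3.0));
        return 0;
    }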
11.-14. Linpack Efficiency
    [Chart, repeated on four consecutive slides as a build: Linpack efficiency (Rmax as a fraction of peak, 0-100%) plotted against TOP500 rank, 0-500.]
15. DLA Solvers
    ¨  We are interested in developing Dense Linear Algebra Solvers
    ¨  Retool LAPACK and ScaLAPACK for multicore and hybrid architectures
16. Last Generations of DLA Software
    Software/Algorithms follow hardware evolution in time:
    - LINPACK (70's): vector operations; relies on Level-1 BLAS operations.
    - LAPACK (80's): blocking, cache friendly; relies on Level-3 BLAS operations.
    - ScaLAPACK (90's): distributed memory; relies on PBLAS and message passing.
    - PLASMA: new algorithms (many-core friendly); relies on a DAG/scheduler, block data layout, and some extra kernels.
    - MAGMA: hybrid algorithms (heterogeneity friendly); relies on a hybrid scheduler and hybrid kernels.
17. A New Generation of DLA Software
    (Same progression as the previous slide, highlighting PLASMA and MAGMA as the new generation.)
18. Parallelization of LU and QR
    Parallelize the update (dgemm):
    •  Easy and done in any reasonable software.
    •  This is the 2/3 n^3 term in the FLOPs count.
    •  Can be done efficiently with LAPACK + multithreaded BLAS.
    [Diagram: A(1) -> panel factorization dgetf2 (lu), dtrsm (+ dswp), then the dgemm update -> A(2);
     fork-join parallelism, bulk synchronous processing.]
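To make the "parallelize the update" point concrete, here is a minimal sketch of one right-looking LU update step using CBLAS, with pivoting and the panel factorization itself left out for brevity; the routine name and the column-major layout are assumptions for illustration:

    #include <cblas.h>

    /* One right-looking LU update step, the part that is "just dgemm":
     * A is n x n (column-major, leading dimension lda); the panel of width nb
     * starting at column k has already been factored in place.               */
    void trailing_update(int n, int k, int nb, double *A, int lda)
    {
        int m = n - k - nb;          /* rows/cols remaining below/right of the panel */
        if (m <= 0) return;

        double *Lkk = &A[k + (size_t)k * lda];               /* unit-lower panel block */
        double *U12 = &A[k + (size_t)(k + nb) * lda];        /* block row of U         */
        double *A21 = &A[(k + nb) + (size_t)k * lda];        /* block column of L      */
        double *A22 = &A[(k + nb) + (size_t)(k + nb) * lda]; /* trailing submatrix     */

        /* U12 := L11^{-1} * U12  (triangular solve with the panel's unit-lower part) */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                    nb, m, 1.0, Lkk, lda, U12, lda);

        /* A22 := A22 - A21 * U12  (the 2/3 n^3 dgemm term, trivially multithreaded) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, m, nb, -1.0, A21, lda, U12, lda, 1.0, A22, lda);
    }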
19. Synchronization (in LAPACK LU)
    Ø  fork join
    Ø  bulk synchronous processing
    [Diagram: cores vs. time trace of fork-join, bulk synchronous processing.]
20. PLASMA LU Factorization: Dataflow Driven
    The numerical program generates tasks and the runtime system executes the tasks while respecting their data dependences.
    [DAG of tile tasks: xGETF2 panel factorizations feeding xTRSM and xGEMM updates.]
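A generic sketch of how a runtime can turn a sequential stream of tile accesses into a task DAG by tracking each tile's last writer. This is purely illustrative (all names and sizes are invented here), not PLASMA or QUARK source:

    #include <stdio.h>

    #define MAX_TASKS 256
    #define MAX_TILES 64

    static int last_writer[MAX_TILES];      /* -1 if the tile has not been written yet   */
    static int deps[MAX_TASKS][MAX_TASKS];  /* deps[a][b] = 1 : task b must run after a  */

    static int new_task(void) { static int next = 0; return next++; }

    /* Register one data access of 'task' on 'tile'; mode is 'W' or 'R'.
     * Edges go from the last writer to any later reader or writer (RAW and WAW);
     * WAR dependences are ignored in this toy.                                  */
    static void access_tile(int task, int tile, char mode)
    {
        int w = last_writer[tile];
        if (w >= 0 && w != task)
            deps[w][task] = 1;
        if (mode == 'W')
            last_writer[tile] = task;
    }

    int main(void)
    {
        for (int t = 0; t < MAX_TILES; t++) last_writer[t] = -1;

        /* Tasks touch tiles in program order, as a tile algorithm would. */
        int f = new_task();                 /* e.g., a panel factorization */
        access_tile(f, 0, 'W');
        int u = new_task();                 /* e.g., a GEMM update         */
        access_tile(u, 0, 'R');
        access_tile(u, 1, 'W');

        printf("edge task %d -> task %d : %d\n", f, u, deps[f][u]);  /* prints 1 */
        return 0;
    }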
21. QUARK
    ¨  A runtime environment for the dynamic execution of precedence-constrained tasks (DAGs) on a multicore machine
    Ø  Translation: if you have a serial program that consists of computational kernels (tasks) related by data dependencies, QUARK can help you execute that program (relatively efficiently and easily) in parallel on a multicore machine.
22. The Purpose of a QUARK Runtime
    ¨  Objectives
       Ø  High utilization of each core
       Ø  Scaling to large numbers of cores
       Ø  Synchronization-reducing algorithms
    ¨  Methodology
       Ø  Dynamic DAG scheduling (QUARK)
       Ø  Explicit parallelism
       Ø  Implicit communication
       Ø  Fine granularity / block data layout
    ¨  Arbitrary DAGs with dynamic scheduling
    [Diagram: DAG-scheduled parallelism vs. fork-join parallelism; notice the synchronization penalty in the presence of heterogeneity.]
23. QUARK: Shared Memory Superscalar Scheduling

    Definition (pseudocode, tile Cholesky):
       FOR k = 0..TILES-1
         A[k][k] ← DPOTRF(A[k][k])
         FOR m = k+1..TILES-1
           A[m][k] ← DTRSM(A[k][k], A[m][k])
         FOR m = k+1..TILES-1
           A[m][m] ← DSYRK(A[m][k], A[m][m])
           FOR n = k+1..m-1
             A[m][n] ← DGEMM(A[m][k], A[n][k], A[m][n])

    Implementation (QUARK code in PLASMA):
       for (k = 0; k < A.mt; k++) {
         QUARK_CORE(dpotrf,...);
         for (m = k+1; m < A.mt; m++) {
           QUARK_CORE(dtrsm,...);
         }
         for (m = k+1; m < A.mt; m++) {
           QUARK_CORE(dsyrk,...);
           for (n = k+1; n < m; n++) {
             QUARK_CORE(dgemm,...);
           }
         }
       }
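For reference, the same tile algorithm written out sequentially, with each task as a LAPACKE/CBLAS call on nb x nb tiles (lower-triangular variant; the tile-pointer layout is an assumption for illustration). The QUARK version above parallelizes exactly this loop nest by inserting each call as a task and letting the data dependences drive execution:

    #include <cblas.h>
    #include <lapacke.h>

    /* Sequential tile Cholesky (lower). A is a TILES x TILES array of pointers,
     * A[m][n] pointing to an nb x nb column-major tile of the symmetric matrix. */
    void tile_cholesky(int TILES, int nb, double ***A)
    {
        for (int k = 0; k < TILES; k++) {
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, A[k][k], nb);     /* DPOTRF */

            for (int m = k + 1; m < TILES; m++)                          /* DTRSM  */
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                            CblasTrans, CblasNonUnit,
                            nb, nb, 1.0, A[k][k], nb, A[m][k], nb);

            for (int m = k + 1; m < TILES; m++) {
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,     /* DSYRK  */
                            nb, nb, -1.0, A[m][k], nb, 1.0, A[m][m], nb);

                for (int n = k + 1; n < m; n++)                          /* DGEMM  */
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                nb, nb, nb, -1.0, A[m][k], nb,
                                A[n][k], nb, 1.0, A[m][n], nb);
            }
        }
    }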
24. Pipelining: Cholesky Inversion
    3 steps: factor, invert L, multiply L's (POTRF, TRTRI and LAUUM).
    48 cores; the matrix is 4000 x 4000, tile size is 200 x 200.
    [Trace annotations: POTRF+TRTRI+LAUUM: 25 (7t-3); Cholesky factorization alone: 3t-2; pipelined: 18 (3t+6).]
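The three steps correspond to three LAPACK routines. A minimal single-node sketch with LAPACKE is below (the function name is illustrative); PLASMA's gain comes from running the tile-level tasks of all three steps as one pipelined DAG instead of three synchronized sweeps:

    #include <lapacke.h>

    /* Invert a symmetric positive definite matrix A (n x n, column-major,
     * lower triangle stored) in the three steps named on the slide.        */
    int spd_inverse(int n, double *A, int lda)
    {
        int info;

        /* Step 1: factor        A = L * L^T               (POTRF) */
        info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, lda);
        if (info != 0) return info;

        /* Step 2: invert L      L := L^{-1}               (TRTRI) */
        info = LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'L', 'N', n, A, lda);
        if (info != 0) return info;

        /* Step 3: multiply L's  A^{-1} = L^{-T} * L^{-1}  (LAUUM) */
        return LAPACKE_dlauum(LAPACK_COL_MAJOR, 'L', n, A, lda);
    }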
25. Performance of PLASMA Cholesky, Double Precision
    Comparing various numbers of cores (percentage of theoretical peak), static scheduling.
    1024 cores (64 x 16-core) 2.00 GHz Intel Xeon X7550, 8,192 Gflop/s peak (double precision) [nautilus].
    [Chart: Gflop/s vs. matrix size (20,000-160,000), one curve per core count; fraction of peak per curve:
     16 cores (.78), 32 (.70), 64 (.58), 96 (.51), 128 (.43), 160 (.46), 192 (.48),
     384 (.31), 576 (.33), 768 (.30), 960 (.27), 1000 (.27).]
26. ¨  DAG too large to be generated ahead of time
       Ø  Generate it dynamically
       Ø  Merge parameterized DAGs with dynamically generated DAGs
    ¨  HPC is about distributed heterogeneous resources
       Ø  Have to be able to move the data across multiple nodes
       Ø  The scheduling cannot be centralized
       Ø  Take advantage of all available resources with minimal intervention from the user
    ¨  Facilitate the usage of the HPC resources
       Ø  Languages
       Ø  Portability
27. DPLASMA: Going to Distributed Memory
    [DAG figure: tile Cholesky tasks (PO = POTRF, TR = TRSM, SY = SYRK, GE = GEMM) across iterations.]
28. Start with PLASMA:
       for i,j = 0..N
         QUARK_Insert( GEMM, A[i,j],INPUT, B[j,i],INPUT, C[i,i],INOUT )
         QUARK_Insert( TRSM, A[i,j],INPUT, B[j,i],INOUT )
    Parse the C source code to an Abstract Syntax Tree.
    Analyze dependencies with the Omega Test:
       { 1 < i < N : GEMM(i, j) => TRSM(j) }
    Generate code which has the parameterized DAG: GEMM(i, j) -> TRSM(j).
    Loops & array references have to be affine.
29. PLASMA (On Node) vs. DPLASMA (Distributed System)
    [Diagram: inputs -> execution window -> tasks -> outputs; QUARK on node, DAGuE distributed.]
    Number of tasks in the DAG: O(n^3)
       Cholesky: 1/3 n^3;  LU: 2/3 n^3;  QR: 4/3 n^3
    Number of tasks in the parameterized DAG: O(1)
       Cholesky: 4 (POTRF, SYRK, GEMM, TRSM)
       LU: 4 (GETRF, GESSM, TSTRF, SSSSM)
       QR: 4 (GEQRT, LARFB, TSQRT, SSRFB)
    DAG: conceptualized & parameterized, small enough to store on each core in every node = scalable.
30. Task Affinity in DPLASMA
    User-defined data distribution function.
    The runtime system, called PaRSEC, is distributed.
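A data distribution function of the kind mentioned here just maps a tile coordinate to the rank that owns it, which in turn gives tasks touching that tile their preferred home. A minimal 2D block-cyclic example, purely illustrative and not the PaRSEC API:

    /* Map tile (m, n) of a tiled matrix onto a P x Q process grid, 2D block-cyclic:
     * the owning rank also gets affinity for tasks that write the tile.           */
    static int tile_owner(int m, int n, int P, int Q)
    {
        return (m % P) * Q + (n % Q);
    }

    /* Example: on a 2 x 3 grid, tile (5, 4) lives on rank (5%2)*3 + (4%3) = 4. */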
31. Runtime DAG Scheduling
    [DAG figure: tile Cholesky tasks (PO, TR, SY, GE) partitioned across Node0-Node3.]
    ¨  Every node has the symbolic DAG representation
       Ø  Only the (node-local) frontier of the DAG is considered
       Ø  Distributed scheduling based on remote completion notifications
    ¨  Background remote data transfer, automatic with overlap
    ¨  NUMA / cache aware scheduling
       Ø  Work stealing and sharing based on memory hierarchies
32. Cholesky / LU / QR
    [Performance charts: Cholesky (DSBP = Distributed Square Block Packed), LU, and QR
     on 81 dual-socket quad-core Xeon L5420 nodes (648 cores total at 2.5 GHz), ConnectX InfiniBand DDR 4x.]
33. The PaRSEC Framework
    [Layered diagram:
     Domain-specific extensions: Dense LA (compact representation - PTG), Sparse LA (dynamic discovered representation - DTG), Chemistry, ...
     Parallel runtime: scheduling, data movement / coherence, tasks, specialized kernels.
     Hardware: cores, memory hierarchies, data movement, accelerators.]
34. Other Systems
    Systems compared: PaRSEC (PTG), SMPss, StarPU, Charm++, FLAME, QUARK, Tblas.
    Scheduling: Distr. (1/core); Repl (1/node); Repl (1/node); Distr. (Actors); w/ SuperMatrix Repl (1/node); Centr.; Centr.
    Language: Internal or Seq. w/ Affine Loops; Seq. w/ add_task; Seq. w/ add_task; MsgDriven Objects; Internal (LA DSL); Seq. w/ add_task; Seq. w/ add_task; Internal
    Accelerator: GPU; GPU; GPU; GPU; GPU
    Availability: Public; Public; Public; Public; Public; Not Avail.; Not Avail.
    Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC. Public.
    All projects support Distributed and Shared Memory (QUARK with QUARKd; FLAME with Elemental).
35. Performance Development in Top500
    [Chart: TOP500 performance, 1994-2020, log scale from 100 Mflop/s to 1 Eflop/s,
     with the N=1 and N=500 trend lines extrapolated to 2020.]
36. Today's #1 System
    Systems              | 2014 (Tianhe-2)                      | 2020-2022             | Difference Today & Exa
    System peak          | 55 Pflop/s                           | 1 Eflop/s             | ~20x
    Power                | 18 MW (3 Gflops/W)                   | ~20 MW (50 Gflops/W)  | O(1) ~15x
    System memory        | 1.4 PB (1.024 PB CPU + .384 PB CoP)  | 32 - 64 PB            | ~50x
    Node performance     | 3.43 TF/s (.4 CPU + 3 CoP)           | 1.2 or 15 TF/s        | O(1)
    Node concurrency     | 24 cores CPU + 171 cores CoP         | O(1k) or 10k          | ~5x - ~50x
    Node interconnect BW | 6.36 GB/s                            | 200-400 GB/s          | ~40x
    System size (nodes)  | 16,000                               | O(100,000) or O(1M)   | ~6x - ~60x
    Total concurrency    | 3.12 M (12.48M threads, 4/core)      | O(billion)            | ~100x
    MTTF                 | Few / day                            | O(<1 day)             | O(?)
37. Exascale System Architecture with a cap of $200M and 20MW
    (Same comparison table as slide 36.)
38. Exascale System Architecture with a cap of $200M and 20MW
    (Same table, except MTTF for 2020-2022 is given as "Many / day".)
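The 20 MW cap and the 50 Gflops/W target in the table above are two views of the same arithmetic; a quick check:

    #include <stdio.h>

    int main(void)
    {
        double peak_flops   = 1e18;   /* 1 Eflop/s target  */
        double gflops_per_w = 50.0;   /* efficiency target */

        /* power = peak / efficiency ; 1e18 / 50e9 = 2e7 W = 20 MW */
        double watts = peak_flops / (gflops_per_w * 1e9);
        printf("Power at 1 Eflop/s and 50 Gflops/W: %.1f MW\n", watts / 1e6);

        /* today's reference point from the table: 3 Gflops/W */
        printf("Same machine at 3 Gflops/W: %.0f MW\n",
               peak_flops / (3.0 * 1e9) / 1e6);
        return 0;
    }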
39. Major Changes to Software & Algorithms
    •  Must rethink the design of our algorithms and software
       §  Another disruptive technology, similar to what happened with cluster computing and message passing
       §  Rethink and rewrite the applications, algorithms, and software
       §  Data movement is expensive
       §  Flop/s are cheap, so they are provisioned in excess
40. Critical Issues at Peta & Exascale for Algorithm and Software Design
    •  Synchronization-reducing algorithms
       §  Break the fork-join model
    •  Communication-reducing algorithms
       §  Use methods which have a lower bound on communication
    •  Mixed-precision methods
       §  2x speed of ops and 2x speed for data movement (see the sketch after this list)
    •  Autotuning
       §  Today's machines are too complicated; build "smarts" into software to adapt to the hardware
    •  Fault-resilient algorithms
       §  Implement algorithms that can recover from failures/bit flips
    •  Reproducibility of results
       §  Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
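A minimal sketch of the mixed-precision idea above: factor in single precision (ops are roughly 2x faster and move half the data), then recover double-precision accuracy by iterative refinement. This uses LAPACKE/CBLAS for illustration, not the production PLASMA/MAGMA kernels; the function name and fixed iteration count are arbitrary:

    #include <stdlib.h>
    #include <cblas.h>
    #include <lapacke.h>

    /* Solve Ax = b: LU in single precision, refine the solution in double. */
    int mixed_precision_solve(int n, const double *A, const double *b,
                              double *x, int iters)
    {
        float  *As = malloc((size_t)n * n * sizeof *As);  /* single-precision copy of A */
        float  *ws = malloc((size_t)n * sizeof *ws);      /* single-precision workspace */
        double *r  = malloc((size_t)n * sizeof *r);       /* double-precision residual  */
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
        int info, i, it;

        for (i = 0; i < n * n; i++) As[i] = (float)A[i];  /* demote A             */
        info = LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);
        if (info != 0) goto done;

        for (i = 0; i < n; i++) ws[i] = (float)b[i];      /* initial solve in SP  */
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, ws, n);
        for (i = 0; i < n; i++) x[i] = ws[i];

        for (it = 0; it < iters; it++) {
            /* r = b - A*x, computed in double precision                          */
            for (i = 0; i < n; i++) r[i] = b[i];
            cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                        -1.0, A, n, x, 1, 1.0, r, 1);
            /* correction solved with the single-precision factors                */
            for (i = 0; i < n; i++) ws[i] = (float)r[i];
            LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, ws, n);
            for (i = 0; i < n; i++) x[i] += ws[i];
        }
    done:
        free(As); free(ws); free(r); free(ipiv);
        return info;
    }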
41. Summary
    •  Major challenges are ahead for extreme computing
       §  Parallelism O(10^9)
          •  Programming issues
       §  Hybrid
          •  Peak and HPL may be very misleading
          •  Nowhere near close to peak for most apps
       §  Fault tolerance
          •  Today the Sequoia BG/Q node failure rate is 1.25 failures/day
       §  Power
          •  50 Gflops/W (today at 2 Gflops/W)
    •  We will need completely new approaches and technologies to reach the Exascale level
42. Collaborators / Software / Support
    •  PLASMA: http://icl.cs.utk.edu/plasma/
    •  MAGMA: http://icl.cs.utk.edu/magma/
    •  QUARK (runtime for shared memory): http://icl.cs.utk.edu/quark/
    •  PaRSEC (Parallel Runtime Scheduling and Execution Control): http://icl.cs.utk.edu/parsec/
    •  Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver
