K2I Distinguished Lecture Series

Architecture-aware Algorithms and
Software for Peta and Exascale
Computing
Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
University of Manchester
Rice University

2/13/14

1
Outline
•  Overview of High Performance
Computing
•  Look at an implementation for some
linear algebra algorithms on today’s
High Performance Computers
§  As an example of the kind of thing
needed.

2
H. Meuer, H. Simon, E. Strohmaier, & JD

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax = b, dense problem, TPP performance)
- Updated twice a year:
  SC'xy in the States in November
  Meeting in Germany in June
- All data available from www.top500.org
3
Performance Development of HPC Over the Last 20 Years
[Chart: TOP500 performance, 1993-2013, on a log scale from 100 Mflop/s to 100 Pflop/s. In 2013 the SUM of all 500 systems is 224 PFlop/s, N=1 is 33.9 PFlop/s, and N=500 is 118 TFlop/s; in 1993 the corresponding values were 1.17 TFlop/s, 59.7 GFlop/s, and 400 MFlop/s. The N=500 line trails the N=1 line by roughly 6-8 years. Reference lines: "My Laptop (70 Gflop/s)" and "My iPad2 & iPhone 4s (1.02 Gflop/s)".]
State of Supercomputing in 2014
•  Pflops computing fully established with 31
systems.
•  Three technology architecture possibilities or
“swim lanes” are thriving.
•  Commodity (e.g. Intel)
•  Commodity + accelerator (e.g. GPUs)
•  Special purpose lightweight cores (e.g. IBM BG)

•  Interest in supercomputing is now worldwide, and
growing in many new markets (over 50% of Top500
computers are in industry).
•  Exascale projects exist in many countries and
regions.

5
November 2013: The TOP10
(Rank. Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt)
1. National University of Defense Technology | Tianhe-2 NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2. DOE / OS Oak Ridge Nat Lab | Titan, Cray XK7 (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3. DOE / NNSA L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4. RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5. DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6. Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726
7. Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 2.66 | 61 | 3.3 | 806
8. Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178
9. DOE / NNSA L Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177
10. Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 91* | 3.42 | 848
500. Banking | HP | USA | 22,212 | .118 | 50
Accelerators (53 systems)
[Chart: number of TOP500 systems with accelerators/co-processors, 2006-2013, rising from zero to over 50. November 2013 breakdown: Intel MIC (13), Clearspeed CSX600 (0), ATI GPU (2), IBM PowerXCell 8i (0), NVIDIA 2070 (4), NVIDIA 2050 (7), NVIDIA 2090 (11), NVIDIA K20 (16).]
By country: 19 US, 9 China, 6 Japan, 4 Russia, 2 France, 2 Germany, 2 India, 1 Italy, 1 Poland, 1 Australia, 2 Brazil, 1 Saudi Arabia, 1 South Korea, 1 Spain, 2 Switzerland, 1 UK
Top500 Performance Share of Accelerators
53 of the 500 systems provide 35% of the accumulated performance.
[Chart: fraction of total TOP500 performance contributed by accelerated systems, 2006-2013, rising from 0% to about 35% (vertical scale to 40%).]
For the Top 500: Rank at which Half of Total Performance is Accumulated
[Chart: number of systems accounting for half of the total TOP500 performance, 1994-2012, with an inset of the November 2013 list showing cumulative Pflop/s against rank 0-500.]
Commodity plus Accelerator Today
Commodity: Intel Xeon, 8 cores, 3 GHz, 8*4 ops/cycle, 96 Gflop/s (DP)
Accelerator (GPU): Nvidia K20X "Kepler", 2688 "Cuda cores" (192 Cuda cores/SMX), .732 GHz, 2688*2/3 ops/cycle, 1.31 Tflop/s (DP), 6 GB memory
Interconnect: PCI-e Gen2/3, 16 lanes, 64 Gb/s (8 GB/s), 1 GW/s
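As a sanity check on the peak numbers above, the same arithmetic (cores x clock x flops per cycle per core) in a few lines of C; this is only a sketch reproducing the slide's figures, not vendor documentation:

/* Reproduce the double-precision peak numbers quoted on this slide:
 * peak = #cores * clock (GHz) * DP flops per cycle per core.            */
#include <stdio.h>

static double peak_gflops(double cores, double ghz, double flops_per_cycle)
{
    return cores * ghz * flops_per_cycle;        /* result in Gflop/s */
}

int main(void)
{
    /* Intel Xeon: 8 cores, 3 GHz, 4 DP flops/cycle/core -> 96 Gflop/s */
    printf("Xeon : %7.1f Gflop/s\n", peak_gflops(8, 3.0, 4.0));
    /* Nvidia K20X: 2688 CUDA cores, 0.732 GHz, 2/3 DP flop/cycle/core
       -> about 1312 Gflop/s, i.e. 1.31 Tflop/s                        */
    printf("K20X : %7.1f Gflop/s\n", peak_gflops(2688, 0.732, 2.0 / 3.0));
    return 0;
}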
Linpack Efficiency
[Chart, shown on four consecutive slides: Linpack efficiency (Rmax as a percentage of theoretical peak, 0-100%) plotted across the 500 ranked systems.]
DLA Solvers
•  We are interested in developing Dense Linear Algebra Solvers
•  Retool LAPACK and ScaLAPACK for multicore and hybrid architectures

2/13/14
15
Last Generations of DLA Software
Software/Algorithms follow hardware evolution in time:
- LINPACK (70's): vector operations; relies on Level-1 BLAS operations
- LAPACK (80's): blocking, cache friendly; relies on Level-3 BLAS operations
- ScaLAPACK (90's): distributed memory; relies on PBLAS message passing
- PLASMA: new algorithms (many-core friendly); relies on a DAG/scheduler, block data layout, and some extra kernels
- MAGMA: hybrid algorithms (heterogeneity friendly); relies on a hybrid scheduler and hybrid kernels
A New Generation of DLA Software
(Same software/algorithm progression as on the previous slide: LINPACK, LAPACK, ScaLAPACK, and now PLASMA and MAGMA.)
Parallelization of LU and QR
Parallelize the update (dgemm):
•  Easy and done in any reasonable software.
•  This is the 2/3 n^3 term in the FLOPs count.
•  Can be done efficiently with LAPACK + multithreaded BLAS.
[Figure: one step of blocked LU. dgetf2 factors the panel (lu of the panel), dtrsm (+ dswp) updates the block row, and dgemm updates the trailing matrix A(1) -> A(2). Fork-join parallelism, bulk synchronous processing.]
Synchronization (in LAPACK LU)
•  Fork-join, bulk synchronous processing
[Figure: execution trace of the cores over time for the fork-join LU.]
PLASMA LU Factorization: Dataflow Driven
Numerical program generates tasks and the run time system executes tasks respecting data dependences.
[Figure: DAG of tile-LU tasks, with xGETF2, xTRSM, and xGEMM nodes connected by their data dependences.]
QUARK
•  A runtime environment for the dynamic execution of precedence-constrained tasks (DAGs) on a multicore machine
•  Translation: if you have a serial program that consists of computational kernels (tasks) related by data dependencies, QUARK can help you execute that program (relatively efficiently and easily) in parallel on a multicore machine

21
The Purpose of a QUARK Runtime
•  Objectives
   §  High utilization of each core
   §  Scaling to large numbers of cores
   §  Synchronization-reducing algorithms
•  Methodology
   §  Dynamic DAG scheduling (QUARK)
   §  Explicit parallelism
   §  Implicit communication
   §  Fine granularity / block data layout
•  Arbitrary DAG with dynamic scheduling
[Figure: DAG-scheduled parallelism vs. fork-join parallelism; notice the synchronization penalty in the presence of heterogeneity.]

22
QUARK: Shared Memory Superscalar Scheduling

definition - pseudocode:

FOR k = 0..TILES-1
  A[k][k] <- DPOTRF(A[k][k])
  FOR m = k+1..TILES-1
    A[m][k] <- DTRSM(A[k][k], A[m][k])
  FOR m = k+1..TILES-1
    A[m][m] <- DSYRK(A[m][k], A[m][m])
    FOR n = k+1..m-1
      A[m][n] <- DGEMM(A[m][k], A[n][k], A[m][n])

implementation - QUARK code in PLASMA:

for (k = 0; k < A.mt; k++) {
  QUARK_CORE(dpotrf,...);
  for (m = k+1; m < A.mt; m++) {
    QUARK_CORE(dtrsm,...);
  }
  for (m = k+1; m < A.mt; m++) {
    QUARK_CORE(dsyrk,...);
    for (n = k+1; n < m; n++) {
      QUARK_CORE(dgemm,...);
    }
  }
}
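The QUARK_CORE arguments are elided ("...") on the slide. As a hedged analogue of the same superscalar-scheduling idea using a standard API rather than QUARK, the loop nest can be written with OpenMP task dependences; this is not PLASMA code, and the tile layout (each tile a separate nb x nb column-major block) is an assumption made only for this sketch:

/* Tile Cholesky (lower) with OpenMP tasks; dependences via depend().    */
#include <cblas.h>
#include <lapacke.h>
#include <stddef.h>

#define T(m, n) tiles[(m) + (size_t)(n) * mt]   /* pointer to tile (m,n) */

void tile_cholesky(int mt, int nb, double **tiles)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < mt; k++) {
        double *Akk = T(k, k);
        #pragma omp task depend(inout: Akk[0:nb*nb])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, Akk, nb);

        for (int m = k + 1; m < mt; m++) {
            double *Amk = T(m, k);
            #pragma omp task depend(in: Akk[0:nb*nb]) depend(inout: Amk[0:nb*nb])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, nb, nb, 1.0, Akk, nb, Amk, nb);
        }
        for (int m = k + 1; m < mt; m++) {
            double *Amk = T(m, k), *Amm = T(m, m);
            #pragma omp task depend(in: Amk[0:nb*nb]) depend(inout: Amm[0:nb*nb])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, nb, nb,
                        -1.0, Amk, nb, 1.0, Amm, nb);

            for (int n = k + 1; n < m; n++) {
                double *Ank = T(n, k), *Amn = T(m, n);
                #pragma omp task depend(in: Amk[0:nb*nb], Ank[0:nb*nb]) \
                                 depend(inout: Amn[0:nb*nb])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, nb, nb, nb,
                            -1.0, Amk, nb, Ank, nb, 1.0, Amn, nb);
            }
        }
    }
}

The runtime (here the OpenMP runtime, in PLASMA the QUARK runtime) discovers the DAG from these declared dependences and executes tasks as their inputs become available, which is exactly the dataflow-driven execution described two slides earlier.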
Pipelining: Cholesky Inversion
3 steps: factor, invert L, multiply L's (POTRF, TRTRI and LAUUM)
48 cores; the matrix is 4000 x 4000, tile size is 200 x 200.
POTRF + TRTRI + LAUUM: 25 (7t-3)
Cholesky factorization alone: 3t-2
Pipelined: 18 (3t+6)

24
Performance of PLASMA Cholesky, Double Precision
Comparing Various Numbers of Cores (Percentage of Theoretical Peak)
1024 Cores (64 x 16-cores) 2.00 GHz Intel Xeon X7550, 8,192 Gflop/s Peak (Double Precision) [nautilus]
Static Scheduling
[Chart: Gflop/s vs. matrix size (up to 160,000) for 16 to 1000 cores; fraction of theoretical peak in parentheses: 1000 cores (.27), 960 (.27), 768 (.30), 576 (.33), 384 (.31), 192 (.48), 160 (.46), 128 (.43), 96 (.51), 64 (.58), 32 (.70), 16 cores (.78).]
•  DAG too large to be generated ahead of time
   §  Generate it dynamically
   §  Merge parameterized DAGs with dynamically generated DAGs
•  HPC is about distributed heterogeneous resources
   §  Have to be able to move the data across multiple nodes
   §  The scheduling cannot be centralized
   §  Take advantage of all available resources with minimal intervention from the user
•  Facilitate the usage of the HPC resources
   §  Languages
   §  Portability
DPLASMA: Going to Distributed Memory
[Figure: DAG of tile-Cholesky tasks (POTRF, TRSM, SYRK, GEMM nodes) to be spread across distributed-memory nodes.]
Start with PLASMA:

  for i,j = 0..N
    QUARK_Insert( GEMM, A[i,j], INPUT, B[j,i], INPUT, C[i,i], INOUT )
    QUARK_Insert( TRSM, A[i,j], INPUT, B[j,i], INOUT )

Parse the C source code to an Abstract Syntax Tree.

Analyze dependencies with the Omega Test:
  { 1 < i < N : GEMM(i,j) => TRSM(j) }

Generate code which has the Parameterized DAG: GEMM(i,j) => TRSM(j).

Loops & array references have to be affine.
PLASMA (On Node) vs. DPLASMA (Distributed System)
[Figure: inputs, an execution window of tasks, and outputs flowing through the QUARK (PLASMA) and DAGuE (DPLASMA) runtimes.]
Number of tasks in the DAG: O(n^3)
  Cholesky: 1/3 n^3    LU: 2/3 n^3    QR: 4/3 n^3
Number of tasks in the parameterized DAG: O(1)
  Cholesky: 4 (POTRF, SYRK, GEMM, TRSM)
  LU: 4 (GETRF, GESSM, TSTRF, SSSSM)
  QR: 4 (GEQRT, LARFB, TSQRT, SSRFB)
DAG: Conceptualized & Parameterized: small enough to store on each core in every node = Scalable
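A quick way to see the O(n^3)-vs-O(1) contrast is to count the tasks the tile-Cholesky loop nest from the QUARK slide generates for NT tiles (a sketch; the parameterized DAG never enumerates these, it keeps only the four task classes):

#include <stdio.h>

static long cholesky_tasks(long nt)
{
    long count = 0;
    for (long k = 0; k < nt; k++) {
        count++;                                 /* POTRF(k)      */
        for (long m = k + 1; m < nt; m++) {
            count++;                             /* TRSM(m,k)     */
            count++;                             /* SYRK(m,k)     */
            for (long n = k + 1; n < m; n++)
                count++;                         /* GEMM(m,n,k)   */
        }
    }
    return count;
}

int main(void)
{
    /* task count grows cubically with the number of tiles */
    for (long nt = 10; nt <= 1000; nt *= 10)
        printf("NT = %4ld tiles -> %ld tasks\n", nt, cholesky_tasks(nt));
    return 0;
}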
Task Affinity in DPLASMA
User defined data distribution function.
Runtime system, called PaRSEC, is distributed.
Runtime DAG scheduling.
[Figure: the tile-Cholesky DAG partitioned across Node0-Node3 according to the data distribution.]
•  Every node has the symbolic DAG representation
   §  Only the (node local) frontier of the DAG is considered
   §  Distributed scheduling based on remote completion notifications
•  Background remote data transfer, automatic with overlap
•  NUMA / cache aware scheduling
   §  Work stealing and sharing based on memory hierarchies
[Performance plots: Cholesky (DSBP = Distributed Square Block Packed), LU, and QR on 81 nodes: dual-socket nodes, quad-core Xeon L5420, 648 cores total at 2.5 GHz, ConnectX InfiniBand DDR 4x.]
The PaRSEC framework
[Diagram: Domain Specific Extensions (Dense LA with a compact representation, the PTG; Sparse LA with a dynamically discovered representation, the DTG; Chemistry; ...) sit on top of the Parallel Runtime (data, tasks, scheduling, data movement, specialized kernels), which maps onto the Hardware (cores, memory hierarchies, coherence, data movement, accelerators).]
Other Systems
PaRSEC compared with SMPss, StarPU, Charm++, FLAME (w/ SuperMatrix), QUARK, and Tblas along four dimensions:
- Scheduling: distributed, one scheduler per core (PaRSEC); distributed actors (Charm++); replicated, one per node, or centralized for the others.
- Language: internal (PTG) or sequential code with affine loops (PaRSEC); sequential code with add_task calls; message-driven objects (Charm++); internal LA DSL (FLAME).
- Accelerator support: GPU.
- Availability: most are public; some are not available.
Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC.
All projects support Distributed and Shared Memory (QUARK with QUARKd; FLAME with Elemental).
Performance Development in Top500
[Chart: TOP500 performance, 1994-2020, on a log scale from 100 Mflop/s to 1 Eflop/s, extrapolating the N=1 and N=500 trend lines out toward 2020.]
Today's #1 System
(Columns: 2014, Tianhe-2 | 2020-2022 | Difference Today & Exa)
System peak: 55 Pflop/s | 1 Eflop/s | ~20x
Power: 18 MW (3 Gflops/W) | ~20 MW (50 Gflops/W) | O(1), ~15x
System memory: 1.4 PB (1.024 PB CPU + .384 PB CoP) | 32 - 64 PB | ~50x
Node performance: 3.43 TF/s (.4 CPU + 3 CoP) | 1.2 or 15 TF/s | O(1)
Node concurrency: 24 cores CPU + 171 cores CoP | O(1k) or 10k | ~5x - ~50x
Node Interconnect BW: 6.36 GB/s | 200-400 GB/s | ~40x
System size (nodes): 16,000 | O(100,000) or O(1M) | ~6x - ~60x
Total concurrency: 3.12 M (12.48M threads, 4/core) | O(billion) | ~100x
MTTF: Few / day | O(<1 day) | O(?)
Exascale System Architecture with a cap of $200M and 20MW
(Shown twice with the same Tianhe-2 vs. exascale comparison table as above; the final version projects the exascale MTTF as "Many / day" rather than "O(<1 day)".)
Major Changes to Software & Algorithms
•  Must rethink the design of our algorithms and software
   §  Another disruptive technology
      •  Similar to what happened with cluster computing and message passing
   §  Rethink and rewrite the applications, algorithms, and software
   §  Data movement is expensive
   §  Flop/s are cheap, so they are provisioned in excess
39
Critical Issues at Peta & Exascale for Algorithm and Software Design
•  Synchronization-reducing algorithms
   §  Break the fork-join model
•  Communication-reducing algorithms
   §  Use methods that attain the lower bounds on communication
•  Mixed precision methods
   §  2x speed of ops and 2x speed for data movement (see the sketch after this list)
•  Autotuning
   §  Today's machines are too complicated; build "smarts" into software to adapt to the hardware
•  Fault resilient algorithms
   §  Implement algorithms that can recover from failures/bit flips
•  Reproducibility of results
   §  Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
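A minimal sketch of the mixed-precision idea (the approach behind LAPACK's dsgesv): run the O(n^3) factorization in single precision, then recover double-precision accuracy with a few cheap O(n^2) refinement steps. Error handling and convergence tests are omitted here; a fixed iteration count stands in for them:

#include <cblas.h>
#include <lapacke.h>
#include <stdlib.h>

int mixed_precision_solve(int n, const double *A, const double *b,
                          double *x, int iters)
{
    float      *As   = malloc((size_t)n * n * sizeof *As);
    float      *cs   = malloc((size_t)n * sizeof *cs);
    double     *r    = malloc((size_t)n * sizeof *r);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

    for (size_t k = 0; k < (size_t)n * n; k++) As[k] = (float)A[k];

    /* single-precision LU factorization: the expensive O(n^3) step      */
    LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);

    /* initial solve in single precision                                 */
    for (int i = 0; i < n; i++) cs[i] = (float)b[i];
    LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, cs, n);
    for (int i = 0; i < n; i++) x[i] = (double)cs[i];

    for (int it = 0; it < iters; it++) {
        /* residual r = b - A*x, accumulated in double precision         */
        for (int i = 0; i < n; i++) r[i] = b[i];
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1.0, A, n,
                    x, 1, 1.0, r, 1);

        /* correction solve in single precision, update in double        */
        for (int i = 0; i < n; i++) cs[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, cs, n);
        for (int i = 0; i < n; i++) x[i] += (double)cs[i];
    }

    free(As); free(cs); free(r); free(ipiv);
    return 0;
}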
Summary
•  Major challenges are ahead for extreme computing
   §  Parallelism O(10^9)
      •  Programming issues
   §  Hybrid
      •  Peak and HPL may be very misleading
      •  Nowhere near peak for most apps
   §  Fault tolerance
      •  Today the Sequoia BG/Q node failure rate is 1.25 failures/day
   §  Power
      •  50 Gflops/W (today at 2 Gflops/W)
•  We will need completely new approaches and technologies to reach the Exascale level
Collaborators / Software / Support
•  PLASMA: http://icl.cs.utk.edu/plasma/
•  MAGMA: http://icl.cs.utk.edu/magma/
•  Quark (RT for Shared Memory): http://icl.cs.utk.edu/quark/
•  PaRSEC (Parallel Runtime Scheduling and Execution Control): http://icl.cs.utk.edu/parsec/
•  Collaborating partners:
   University of Tennessee, Knoxville
   University of California, Berkeley
   University of Colorado, Denver

42
More Related Content

What's hot

IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
Preferred Networks
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
 
USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)
Ryousei Takano
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
BigDataEverywhere
 
IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告
Ryousei Takano
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
Sagar Dolas
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
AMD Developer Central
 
Apache Nemo
Apache NemoApache Nemo
Apache Nemo
NAVER Engineering
 
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
PhtRaveller
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
Kyong-Ha Lee
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
DataWorks Summit/Hadoop Summit
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...
Fisnik Kraja
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
Sagar Dolas
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codes
NAVER D2
 
PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)
Hansol Kang
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리
NAVER D2
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
Ryousei Takano
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
Kohei KaiGai
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 

What's hot (20)

IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
 
IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Apache Nemo
Apache NemoApache Nemo
Apache Nemo
 
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codes
 
PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 

Similar to Achitecture Aware Algorithms and Software for Peta and Exascale

Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
Facultad de Informática UCM
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
Ganesan Narayanasamy
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
Vipin Varghese
 
Sierra overview
Sierra overviewSierra overview
Sierra overview
Ganesan Narayanasamy
 
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA
 
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
RISC-V International
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE
 
Nvidia tesla-k80-overview
Nvidia tesla-k80-overviewNvidia tesla-k80-overview
Nvidia tesla-k80-overview
Communication Progress
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
Kohei KaiGai
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
NECST Lab @ Politecnico di Milano
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Akihiro Hayashi
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
Kohei KaiGai
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
RISC-V International
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
inside-BigData.com
 
Latest HPC News from NVIDIA
Latest HPC News from NVIDIALatest HPC News from NVIDIA
Latest HPC News from NVIDIA
inside-BigData.com
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
LEGATO project
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
Ferdinand Jamitzky
 

Similar to Achitecture Aware Algorithms and Software for Peta and Exascale (20)

Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Sierra overview
Sierra overviewSierra overview
Sierra overview
 
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
 
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
 
Nvidia tesla-k80-overview
Nvidia tesla-k80-overviewNvidia tesla-k80-overview
Nvidia tesla-k80-overview
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Anegdotic Maxeler (Romania)
  Anegdotic Maxeler (Romania)  Anegdotic Maxeler (Romania)
Anegdotic Maxeler (Romania)
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
 
Latest HPC News from NVIDIA
Latest HPC News from NVIDIALatest HPC News from NVIDIA
Latest HPC News from NVIDIA
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 

More from inside-BigData.com

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
inside-BigData.com
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
inside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
inside-BigData.com
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
inside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
inside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 

Recently uploaded (20)

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 

Achitecture Aware Algorithms and Software for Peta and Exascale

  • 1. K2I Distinguished Lecture Series Architecture-aware Algorithms and Software for Peta and Exascale Computing Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester Rice University 2/13/14 1
  • 2. Outline •  Overview of High Performance Computing •  Look at an implementation for some linear algebra algorithms on today’s High Performance Computers §  As an examples of the kind of thing needed. 2
  • 3. H. Meuer, H. Simon, E. Strohmaier, & JD Rate - Listing of the 500 most powerful Computers in the World - Yardstick: Rmax from LINPACK MPP Ax=b, dense problem TPP performance - Updated twice a year Size SC‘xy in the States in November Meeting in Germany in June - All data available from www.top500.org 3
  • 4. Performance Development of HPC Over the Last 20 Years 1E+09 224    PFlop/s   100 Pflop/s 100000000 33.9  PFlop/s   10 Pflop/s 10000000 1 Pflop/s 1000000 SUM   100 Tflop/s 100000 N=1    118  TFlop/s   10 Tflop/s 10000 1 Tflop/s 1000 6-8 years 1.17  TFlop/s   N=500   100 Gflop/s 100 My Laptop (70 Gflop/s) 59.7  GFlop/s   10 Gflop/s 10 My iPad2 & iPhone 4s (1.02 Gflop/s) 1 Gflop/s 1 400  MFlop/s   100 Mflop/s 0.1 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011 2013
  • 5. State of Supercomputing in 2014 •  Pflops computing fully established with 31 systems. •  Three technology architecture possibilities or “swim lanes” are thriving. •  Commodity (e.g. Intel) •  Commodity + accelerator (e.g. GPUs) •  Special purpose lightweight cores (e.g. IBM BG) •  Interest in supercomputing is now worldwide, and growing in many new markets (over 50% of Top500 computers are in industry). •  Exascale projects exist in many countries and regions. 5
  • 6. November 2013: The TOP10 Country Cores Rmax [Pflops] % of Peak China 3,120,000 33.9 62 17.8 1905 USA 560,640 17.6 65 8.3 2120 Sequoia, BlueGene/Q (16c) + custom USA 1,572,864 17.2 85 7.9 2063 RIKEN Advanced Inst for Comp Sci K computer Fujitsu SPARC64 VIIIfx (8c) + Custom Japan 705,024 10.5 93 12.7 827 5 DOE / OS Argonne Nat Lab Mira, BlueGene/Q (16c) + Custom USA 786,432 8.16 85 3.95 2066 6 Swiss CSCS Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom Swiss 115,984 6.27 81 2.3 2726 7 Texas Advanced Computing Center Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB USA 204,900 2.66 61 3.3 806 8 Forschungszentrum Juelich (FZJ) JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz+Custom Germany 458,752 5.01 85 2.30 2178 USA 393,216 4.29 85 1.97 2177 Germany 147,456 2.90 91* 3.42 848 22,212 .118 50 Rank Site Computer National University of Defense Technology DOE / OS Oak Ridge Nat Lab Tianhe-2 NUDT, Xeon 12C 2.2GHz + IntelXeon Phi (57c) + Custom Titan, Cray XK7 (16C) + Nvidia Kepler GPU (14c) + Custom 3 DOE / NNSA L Livermore Nat Lab 4 1 2 9 10 500 DOE / NNSA Vulcan, BlueGene/Q, L Livermore Nat Lab Power BQC 16C 1.6GHz+Custom Leibniz Rechenzentrum Banking SuperMUC, Intel (8c) + IB HP USA Power MFlops [MW] /Watt
  • 7. Accelerators (53 systems) 60   Systems   50   40   30   20   10   0   2006   2007   2008   2009   2010   2011   2012   2013   Intel  MIC  (13)   Clearspeed  CSX600  (0)   ATI  GPU  (2)   IBM  PowerXCell  8i  (0)   NVIDIA  2070  (4)   NVIDIA  2050  (7)   NVIDIA  2090  (11)   NVIDIA  K20  (16)   19 US 9 China 6 Japan 4 Russia 2 France 2 Germany 2 India 1 Italy 1 Poland 1 Australia 2 Brazil 1 Saudi Arabia 1 South Korea 1 Spain 2 Switzerland 1 UK
  • 8. Top500 Performance Share of Accelerators 53 of the 500 systems provide 35% of the accumulated performance 35% 30% 25% 20% 15% 10% 5% 2013 2012 2011 2010 2009 2008 2007 0% 2006 Fraction of Total TOP500 Performance 40%
  • 9. For the Top 500: Rank at which Half of Total Performance is Accumulated 90 80 70 60 50 40 35 30 30 20 25 10 0 Pflop/s Numbers of Systems 100 Top 500 November 2013 20 15 10 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 5 0 0 100 200 300 400 500
  • 10. Commodity plus Accelerator Today Commodity Accelerator (GPU) Intel Xeon 8 cores 3 GHz 8*4 ops/cycle 96 Gflop/s (DP) 192 Cuda cores/SMX 2688 “Cuda cores” Nvidia K20X “Kepler” 2688 “Cuda cores” .732 GHz 2688*2/3 ops/cycle 1.31 Tflop/s (DP) Interconnect PCI-e Gen2/3 16 lane 64 Gb/s (8 GB/s) 1 GW/s 6 GB 10
  • 15. DLA Solvers ¨  We are interested in developing Dense Linear Algebra Solvers ¨  Retool LAPACK and ScaLAPACK for multicore and hybrid architectures 2/13/14 15
  • 16. Last Generations of DLA Software — software/algorithms follow hardware evolution in time:
    - LINPACK (70's), vector operations: relies on Level-1 BLAS operations.
    - LAPACK (80's), blocking, cache friendly: relies on Level-3 BLAS operations.
    - ScaLAPACK (90's), distributed memory: relies on PBLAS and message passing.
    - PLASMA, new algorithms (many-core friendly): relies on a DAG/scheduler, block data layout, and some extra kernels.
    - MAGMA, hybrid algorithms (heterogeneity friendly): relies on a hybrid scheduler and hybrid kernels.
  • 17. A New Generation of DLA Software — the same progression as the previous slide (LINPACK → LAPACK → ScaLAPACK → PLASMA and MAGMA), under the new title.
  • 18. Parallelization of LU and QR: parallelize the update (dgemm).
    - Easy and done in any reasonable software; this is the 2/3 n^3 term in the FLOP count; can be done efficiently with LAPACK + multithreaded BLAS.
    - Per panel: dgetf2 (panel factorization, lu()), dtrsm (+ dswp), then the dgemm update of the trailing matrix A(1) → A(2).
    - Fork-join parallelism, bulk synchronous processing (a sketch of this formulation follows).
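A minimal sketch of the fork-join formulation described above, assuming LAPACKE and CBLAS (e.g. OpenBLAS or MKL) are available; the function name blocked_lu is just for illustration, pivoting follows the dgetrf pattern, and error checking is omitted:

    /* Right-looking blocked LU (the LAPACK dgetrf pattern): panel, pivot swaps,
       dtrsm, dgemm. The only parallelism is the multithreaded BLAS inside
       dtrsm/dgemm; column-major storage. */
    #include <lapacke.h>
    #include <cblas.h>

    void blocked_lu(int n, double *A, int lda, lapack_int *ipiv, int nb)
    {
        for (int k = 0; k < n; k += nb) {
            int kb = (n - k < nb) ? (n - k) : nb;

            /* 1. Panel factorization of the current block column (mostly sequential) */
            LAPACKE_dgetf2(LAPACK_COL_MAJOR, n - k, kb, &A[k + k * lda], lda, &ipiv[k]);

            /* Make the panel's local pivot indices global */
            for (int i = 0; i < kb; i++) ipiv[k + i] += k;

            /* 2. Apply the row interchanges left and right of the panel */
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, k, A, lda, k + 1, k + kb, ipiv, 1);
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - k - kb, &A[(k + kb) * lda], lda,
                           k + 1, k + kb, ipiv, 1);

            if (k + kb < n) {
                /* 3. Triangular solve: U12 = L11^{-1} A12 */
                cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                            CblasUnit, kb, n - k - kb, 1.0,
                            &A[k + k * lda], lda, &A[k + (k + kb) * lda], lda);

                /* 4. Trailing update: A22 -= L21 * U12 (the 2/3 n^3 dgemm term) */
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            n - k - kb, n - k - kb, kb, -1.0,
                            &A[(k + kb) + k * lda], lda,
                            &A[k + (k + kb) * lda], lda, 1.0,
                            &A[(k + kb) + (k + kb) * lda], lda);
            }
        }
    }

Each pass through the loop is a fork-join step: the panel is essentially sequential, the dtrsm/dgemm fan out across the BLAS threads, and all threads join before the next panel — exactly the synchronization the following slides set out to remove.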
  • 19. Synchronization (in LAPACK LU): fork-join, bulk synchronous processing. (Execution trace of the cores over time, showing the synchronization points between steps.)
  • 20. PLASMA LU Factorization: Dataflow Driven. The numerical program generates tasks and the runtime system executes the tasks respecting data dependences. (DAG excerpt of xGETF2, xTRSM, and xGEMM tasks.)
  • 21. QUARK: a runtime environment for the dynamic execution of precedence-constrained tasks (DAGs) on a multicore machine. Translation: if you have a serial program that consists of computational kernels (tasks) related by data dependencies, QUARK can help you execute that program, relatively efficiently and easily, in parallel on a multicore machine.
  • 22. The Purpose of a QUARK Runtime.
    - Objectives: high utilization of each core; scaling to a large number of cores; synchronization-reducing algorithms.
    - Methodology: dynamic DAG scheduling (QUARK); explicit parallelism; implicit communication; fine granularity / block data layout.
    - Arbitrary DAGs with dynamic scheduling. (Figure contrasting DAG-scheduled parallelism with fork-join parallelism; notice the synchronization penalty in the presence of heterogeneity.)
  • 23. QUARK Shared Memory Superscalar Scheduling.
    Definition (pseudocode):
      FOR k = 0..TILES-1
        A[k][k] ← DPOTRF(A[k][k])
        FOR m = k+1..TILES-1
          A[m][k] ← DTRSM(A[k][k], A[m][k])
        FOR m = k+1..TILES-1
          A[m][m] ← DSYRK(A[m][k], A[m][m])
          FOR n = k+1..m-1
            A[m][n] ← DGEMM(A[m][k], A[n][k], A[m][n])
    Implementation (QUARK code in PLASMA):
      for (k = 0; k < A.mt; k++) {
        QUARK_CORE(dpotrf, ...);
        for (m = k+1; m < A.mt; m++) {
          QUARK_CORE(dtrsm, ...);
        }
        for (m = k+1; m < A.mt; m++) {
          QUARK_CORE(dsyrk, ...);
          for (n = k+1; n < m; n++) {
            QUARK_CORE(dgemm, ...);
          }
        }
      }
    (A runnable sketch of the same loop nest follows.)
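As a runnable stand-in for the QUARK calls (QUARK itself may not be installed), the same tile Cholesky loop nest can be expressed with OpenMP task dependences. This is only a sketch, assuming each tile is a separate contiguous NB-by-NB column-major block, LAPACKE/CBLAS are linked, and the code is compiled with -fopenmp; the function name tile_cholesky is just for illustration.

    /* Tile Cholesky (lower) with the DAG expressed through task dependences,
       standing in for the QUARK insert-task calls above. Only tiles with
       m >= k are referenced. */
    #include <lapacke.h>
    #include <cblas.h>

    void tile_cholesky(int TILES, int NB, double *A[TILES][TILES])
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < TILES; k++) {
            /* Factor the diagonal tile */
            #pragma omp task depend(inout: A[k][k][0:NB*NB])
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, A[k][k], NB);

            /* Triangular solves against the diagonal tile */
            for (int m = k + 1; m < TILES; m++) {
                #pragma omp task depend(in: A[k][k][0:NB*NB]) \
                                 depend(inout: A[m][k][0:NB*NB])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, NB, NB, 1.0, A[k][k], NB, A[m][k], NB);
            }

            /* Trailing-matrix updates; each task starts as soon as its inputs are ready */
            for (int m = k + 1; m < TILES; m++) {
                #pragma omp task depend(in: A[m][k][0:NB*NB]) \
                                 depend(inout: A[m][m][0:NB*NB])
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB,
                            -1.0, A[m][k], NB, 1.0, A[m][m], NB);

                for (int n = k + 1; n < m; n++) {
                    #pragma omp task depend(in: A[m][k][0:NB*NB], A[n][k][0:NB*NB]) \
                                     depend(inout: A[m][n][0:NB*NB])
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, NB, NB, NB,
                                -1.0, A[m][k], NB, A[n][k], NB, 1.0, A[m][n], NB);
                }
            }
        }   /* no barrier between iterations: the dependences drive execution */
    }

With QUARK the equivalent would be QUARK_Insert_Task calls carrying INPUT/INOUT flags, as in the PLASMA fragment above; in both cases the point is that the runtime, not the loop structure, decides the execution order.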
  • 24. Pipelining: Cholesky Inversion. Three steps: factor (POTRF), invert L (TRTRI), multiply the L's together (LAUUM). 48 cores; the matrix is 4000 x 4000, tile size 200 x 200. Critical path lengths (in tasks, for t tiles): POTRF+TRTRI+LAUUM run back to back: 25 (7t-3); Cholesky factorization alone: 3t-2; pipelined: 18 (3t+6).
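The three steps being pipelined are ordinary LAPACK routines; as a non-pipelined reference point, the sequence looks like this (a sketch assuming LAPACKE and lower-triangular storage; the function name spd_inverse is just for illustration):

    /* Factor, invert L, multiply the triangular factors: the plain, sequential
       version of the three steps pipelined on slide 24. On exit the lower part
       of A holds the inverse of the original SPD matrix. */
    #include <lapacke.h>

    int spd_inverse(int n, double *A, int lda)
    {
        int info;
        info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, lda);      /* A = L L^T      */
        if (info != 0) return info;
        info = LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'L', 'N', n, A, lda); /* L := L^{-1}     */
        if (info != 0) return info;
        /* dlauum forms L^T L; with L replaced by L^{-1} this is L^{-T} L^{-1} = A^{-1} */
        info = LAPACKE_dlauum(LAPACK_COL_MAJOR, 'L', n, A, lda);
        return info;
    }

Run this way, each step waits for the previous one to finish; the DAG runtime instead interleaves tasks from all three sweeps, shortening the critical path as noted above.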
  • 25. Performance of PLASMA Cholesky, Double Precision, comparing various numbers of cores (percentage of theoretical peak). 1024 cores (64 x 16-core), 2.00 GHz Intel Xeon X7550, 8,192 Gflop/s peak (double precision) [nautilus], static scheduling. (Chart of Gflop/s versus matrix size up to 160,000.) Fractions of peak: 16 cores (.78), 32 cores (.70), 64 cores (.58), 96 cores (.51), 128 cores (.43), 160 cores (.46), 192 cores (.48), 384 cores (.31), 576 cores (.33), 768 cores (.30), 960 cores (.27), 1000 cores (.27).
  • 26.
    - The DAG is too large to be generated ahead of time: generate it dynamically; merge parameterized DAGs with dynamically generated DAGs.
    - HPC is about distributed heterogeneous resources: have to be able to move the data across multiple nodes; the scheduling cannot be centralized; take advantage of all available resources with minimal intervention from the user.
    - Facilitate the usage of the HPC resources: languages; portability.
  • 27. DPLASMA: Going to Distributed Memory. (Task DAG with POTRF, TRSM, SYRK, and GEMM nodes.)
  • 28.
    1. Start with PLASMA:
         for i,j = 0..N
           QUARK_Insert( GEMM, A[i,j], INPUT, B[j,i], INPUT, C[i,i], INOUT )
           QUARK_Insert( TRSM, A[i,j], INPUT, B[j,i], INOUT )
    2. Parse the C source code into an abstract syntax tree.
    3. Analyze dependencies with the Omega Test, e.g. { 1 < i < N : GEMM(i, j) => TRSM(j) }.
    4. Generate code which has the parameterized DAG: GEMM(i, j) -> TRSM(j).
    Loops and array references have to be affine.
  • 29. PLASMA (on node, QUARK) vs. DPLASMA (distributed system, DAGuE). (Diagram: inputs, execution window, tasks, outputs.)
    - Number of tasks in the explicit DAG: O(n^3) — Cholesky: 1/3 n^3, LU: 2/3 n^3, QR: 4/3 n^3.
    - Number of tasks in the parameterized DAG: O(1) — Cholesky: 4 (POTRF, SYRK, GEMM, TRSM); LU: 4 (GETRF, GESSM, TSTRF, SSSSM); QR: 4 (GEQRT, LARFB, TSQRT, SSRFB).
    - The DAG is conceptualized and parameterized, small enough to store on each core in every node = scalable.
  • 30. Task Affinity in DPLASMA: user-defined data distribution function; the runtime system, called PaRSEC, is distributed.
  • 31. Runtime DAG scheduling. (Cholesky task DAG of POTRF/TRSM/SYRK/GEMM nodes partitioned across Node0-Node3.)
    - Every node has the symbolic DAG representation; only the (node-local) frontier of the DAG is considered; distributed scheduling is based on remote completion notifications.
    - Background remote data transfer is automatic, with overlap.
    - NUMA / cache-aware scheduling: work stealing and sharing based on memory hierarchies.
  • 32. (Performance plots for Cholesky, LU, and QR; DSBP = Distributed Square Block Packed.) Hardware: 81 dual-socket quad-core Xeon L5420 nodes, 648 cores total at 2.5 GHz, ConnectX InfiniBand DDR 4x.
  • 33. The PaRSEC framework. (Layered diagram.) Domain Specific Extensions: Dense LA (compact representation - PTG), Sparse LA (dynamic discovered representation - DTG), Chemistry, ... Parallel Runtime: scheduling, data, tasks, data movement, specialized kernels. Hardware: cores, memory hierarchies, coherence, data movement, accelerators.
  • 34. Other Systems — runtime systems compared: PaRSEC (PTG), SMPss, StarPU, Charm++, FLAME, QUARK, Tblas.
    - Scheduling: ranges from distributed (1/core, PaRSEC) and distributed actors (Charm++) to replicated (1/node), SuperMatrix-based (FLAME), and centralized schedulers.
    - Language: sequential code with add_task calls for several systems; message-driven objects for Charm++; an internal LA DSL for FLAME; an internal representation or sequential code with affine loops for PaRSEC.
    - Accelerator: GPU support in five of the systems.
    - Availability: public for most; two not available. Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC.
    - All projects support distributed and shared memory (QUARK with QUARKd; FLAME with Elemental).
  • 35. Performance Development in Top500. (Chart: the N=1 and N=500 trend lines from 1994 extrapolated through 2020, on a scale running from 100 Mflop/s up past 1 Eflop/s.)
  • 36. Today's #1 System — 2014 (Tianhe-2) versus a 2020-2022 exascale system (difference in parentheses):
    - System peak: 55 Pflop/s → 1 Eflop/s (~20x)
    - Power: 18 MW (3 Gflops/W) → ~20 MW (50 Gflops/W) (O(1), ~15x)
    - System memory: 1.4 PB (1.024 PB CPU + .384 PB CoP) → 32 - 64 PB (~50x)
    - Node performance: 3.43 TF/s (.4 CPU + 3 CoP) → 1.2 or 15 TF/s (O(1))
    - Node concurrency: 24 cores CPU + 171 cores CoP → O(1k) or 10k (~5x - ~50x)
    - Node interconnect BW: 6.36 GB/s → 200-400 GB/s (~40x)
    - System size (nodes): 16,000 → O(100,000) or O(1M) (~6x - ~60x)
    - Total concurrency: 3.12 M (12.48M threads, 4/core) → O(billion) (~100x)
    - MTTF: Few / day → O(<1 day) (O(?))
  • 37. Exascale System Architecture with a cap of $200M and 20MW: the same table as the previous slide.
  • 38. Exascale System Architecture with a cap of $200M and 20MW: the same table again, except the exascale MTTF entry now reads "Many / day" instead of "O(<1 day)".
  • 39. Major Changes to Software & Algorithms.
    - Must rethink the design of our algorithms and software: another disruptive technology, similar to what happened with cluster computing and message passing; rethink and rewrite the applications, algorithms, and software.
    - Data movement is expensive; flop/s are cheap, so they are provisioned in excess (a rough illustration follows).
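A rough, illustrative calculation of the "data movement is expensive" point, reusing the node numbers from slide 10 (not from the original slides):

    /* How many operations must be performed per double moved over PCIe for the
       accelerator of slide 10 to stay busy. Illustrative only. */
    #include <stdio.h>

    int main(void) {
        double gpu_flops  = 1.31e12;   /* K20X peak, flop/s (slide 10) */
        double pcie_bytes = 8.0e9;     /* PCIe Gen2 x16, bytes/s       */
        double doubles_per_s = pcie_bytes / 8.0;

        /* ~1310 flops per 8-byte word: any kernel with lower arithmetic
           intensity is limited by data movement, not by flop/s. */
        printf("Required arithmetic intensity: %.0f flops per double moved\n",
               gpu_flops / doubles_per_s);
        return 0;
    }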
  • 40. Critical Issues at Peta & Exascale for Algorithm and Software Design.
    - Synchronization-reducing algorithms: break the fork-join model.
    - Communication-reducing algorithms: use methods which have a lower bound on communication.
    - Mixed precision methods: 2x speed of ops and 2x speed for data movement (see the example below).
    - Autotuning: today's machines are too complicated; build "smarts" into software to adapt to the hardware.
    - Fault resilient algorithms: implement algorithms that can recover from failures/bit flips.
    - Reproducibility of results: today we can't guarantee this; we understand the issues, but some of our "colleagues" have a hard time with this.
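One concrete mixed-precision building block already in LAPACK is dsgesv, which factors the matrix in single precision and iteratively refines the solution back to double-precision accuracy. A small self-contained example (the diagonally dominant random matrix is only for illustration):

    /* Mixed-precision solve: single-precision factorization + iterative
       refinement to double-precision accuracy via LAPACK's dsgesv. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <lapacke.h>

    int main(void) {
        const lapack_int n = 1000, nrhs = 1;
        double *A = malloc(sizeof(double) * n * n);
        double *b = malloc(sizeof(double) * n);
        double *x = malloc(sizeof(double) * n);
        lapack_int *ipiv = malloc(sizeof(lapack_int) * n);
        lapack_int iter, info;

        /* Random matrix, made diagonally dominant so the single-precision
           factorization is well conditioned. Column-major storage. */
        for (lapack_int j = 0; j < n; j++) {
            for (lapack_int i = 0; i < n; i++)
                A[i + j * n] = (double)rand() / RAND_MAX;
            A[j + j * n] += n;
            b[j] = 1.0;
        }

        info = LAPACKE_dsgesv(LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv, b, n, x, n, &iter);

        /* iter > 0: number of refinement sweeps; iter < 0: fell back to a
           full double-precision factorization. */
        printf("info = %d, refinement iterations = %d\n", (int)info, (int)iter);

        free(A); free(b); free(x); free(ipiv);
        return 0;
    }

The factorization, which dominates the cost, runs at the (roughly 2x faster) single-precision rate while the refined answer retains double-precision accuracy.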
  • 41. Summary.
    - Major challenges are ahead for extreme computing: parallelism of O(10^9); programming issues (hybrid).
    - Peak and HPL may be very misleading: nowhere near peak for most apps.
    - Fault tolerance: today the Sequoia BG/Q node failure rate is 1.25 failures/day.
    - Power: 50 Gflops/W needed (today at 2 Gflops/W).
    - We will need completely new approaches and technologies to reach the exascale level.
  • 42. Collaborators / Software / Support.
    - PLASMA http://icl.cs.utk.edu/plasma/
    - MAGMA http://icl.cs.utk.edu/magma/
    - Quark (RT for Shared Memory) http://icl.cs.utk.edu/quark/
    - PaRSEC (Parallel Runtime Scheduling and Execution Control) http://icl.cs.utk.edu/parsec/
    - Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver