K2I Distinguished Lecture Series

Architecture-aware Algorithms and
Software for Peta and Exascale
Computing
Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
University of Manchester
Rice University

2/13/14

1
Outline
•  Overview of High Performance
Computing
•  Look at an implementation for some
linear algebra algorithms on today’s
High Performance Computers
§  As an example of the kind of thing
needed.

2
H. Meuer, H. Simon, E. Strohmaier, & JD

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax = b, dense problem, TPP performance)
- Updated twice a year:
  SC'xy in the States in November
  Meeting in Germany in June
- All data available from www.top500.org
3
Performance Development of HPC Over the Last 20 Years
[Chart: TOP500 performance, 1993-2013, on a log scale from 100 Mflop/s to 100 Pflop/s. In 2013 the SUM of all 500 systems is 224 PFlop/s, N=1 is 33.9 PFlop/s, and N=500 is 118 TFlop/s; in 1993 the corresponding values were 1.17 TFlop/s, 59.7 GFlop/s, and 400 MFlop/s. The N=500 line trails the N=1 line by roughly 6-8 years. Reference lines: "My Laptop (70 Gflop/s)" and "My iPad2 & iPhone 4s (1.02 Gflop/s)".]
State of Supercomputing in 2014
•  Pflops computing fully established with 31
systems.
•  Three technology architecture possibilities or
“swim lanes” are thriving.
•  Commodity (e.g. Intel)
•  Commodity + accelerator (e.g. GPUs)
•  Special purpose lightweight cores (e.g. IBM BG)

•  Interest in supercomputing is now worldwide, and
growing in many new markets (over 50% of Top500
computers are in industry).
•  Exascale projects exist in many countries and
regions.

5
November 2013: The TOP10
(Rank. Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt)
1. National University of Defense Technology | Tianhe-2 NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2. DOE / OS Oak Ridge Nat Lab | Titan, Cray XK7 (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3. DOE / NNSA L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4. RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5. DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6. Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726
7. Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 2.66 | 61 | 3.3 | 806
8. Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178
9. DOE / NNSA L Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177
10. Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 91* | 3.42 | 848
500. Banking | HP | USA | 22,212 | .118 | 50
Accelerators (53 systems)
[Chart: number of TOP500 systems with accelerators/co-processors, 2006-2013, rising from zero to over 50. November 2013 breakdown: Intel MIC (13), Clearspeed CSX600 (0), ATI GPU (2), IBM PowerXCell 8i (0), NVIDIA 2070 (4), NVIDIA 2050 (7), NVIDIA 2090 (11), NVIDIA K20 (16).]
By country: 19 US, 9 China, 6 Japan, 4 Russia, 2 France, 2 Germany, 2 India, 1 Italy, 1 Poland, 1 Australia, 2 Brazil, 1 Saudi Arabia, 1 South Korea, 1 Spain, 2 Switzerland, 1 UK
Top500 Performance Share of Accelerators
53 of the 500 systems provide 35% of the accumulated performance.
[Chart: fraction of total TOP500 performance contributed by accelerated systems, 2006-2013, rising from 0% to about 35% (vertical scale to 40%).]
For the Top 500: Rank at which Half of Total Performance is Accumulated
[Chart: number of systems accounting for half of the total TOP500 performance, 1994-2012, with an inset of the November 2013 list showing cumulative Pflop/s against rank 0-500.]
Commodity plus Accelerator Today
Commodity: Intel Xeon, 8 cores, 3 GHz, 8*4 ops/cycle, 96 Gflop/s (DP)
Accelerator (GPU): Nvidia K20X "Kepler", 2688 "Cuda cores" (192 Cuda cores/SMX), .732 GHz, 2688*2/3 ops/cycle, 1.31 Tflop/s (DP), 6 GB memory
Interconnect: PCI-e Gen2/3, 16 lanes, 64 Gb/s (8 GB/s), 1 GW/s
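As a sanity check on the peak numbers above, the same arithmetic (cores x clock x flops per cycle per core) in a few lines of C; this is only a sketch reproducing the slide's figures, not vendor documentation:

/* Reproduce the double-precision peak numbers quoted on this slide:
 * peak = #cores * clock (GHz) * DP flops per cycle per core.            */
#include <stdio.h>

static double peak_gflops(double cores, double ghz, double flops_per_cycle)
{
    return cores * ghz * flops_per_cycle;        /* result in Gflop/s */
}

int main(void)
{
    /* Intel Xeon: 8 cores, 3 GHz, 4 DP flops/cycle/core -> 96 Gflop/s */
    printf("Xeon : %7.1f Gflop/s\n", peak_gflops(8, 3.0, 4.0));
    /* Nvidia K20X: 2688 CUDA cores, 0.732 GHz, 2/3 DP flop/cycle/core
       -> about 1312 Gflop/s, i.e. 1.31 Tflop/s                        */
    printf("K20X : %7.1f Gflop/s\n", peak_gflops(2688, 0.732, 2.0 / 3.0));
    return 0;
}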
Linpack Efficiency
[Chart, shown on four consecutive slides: Linpack efficiency (Rmax as a percentage of theoretical peak, 0-100%) plotted across the 500 ranked systems.]
DLA Solvers
•  We are interested in developing Dense Linear Algebra Solvers
•  Retool LAPACK and ScaLAPACK for multicore and hybrid architectures

2/13/14
15
Last Generations of DLA Software
Software/Algorithms follow hardware evolution in time:
- LINPACK (70's): vector operations; relies on Level-1 BLAS operations
- LAPACK (80's): blocking, cache friendly; relies on Level-3 BLAS operations
- ScaLAPACK (90's): distributed memory; relies on PBLAS message passing
- PLASMA: new algorithms (many-core friendly); relies on a DAG/scheduler, block data layout, and some extra kernels
- MAGMA: hybrid algorithms (heterogeneity friendly); relies on a hybrid scheduler and hybrid kernels
A New Generation of DLA Software
(Same software/algorithm progression as on the previous slide: LINPACK, LAPACK, ScaLAPACK, and now PLASMA and MAGMA.)
Parallelization of LU and QR
Parallelize the update (dgemm):
•  Easy and done in any reasonable software.
•  This is the 2/3 n^3 term in the FLOPs count.
•  Can be done efficiently with LAPACK + multithreaded BLAS.
[Figure: one step of blocked LU. dgetf2 factors the panel (lu of the panel), dtrsm (+ dswp) updates the block row, and dgemm updates the trailing matrix A(1) -> A(2). Fork-join parallelism, bulk synchronous processing.]
Synchronization (in LAPACK LU)
•  Fork-join, bulk synchronous processing
[Figure: execution trace of the cores over time for the fork-join LU.]
PLASMA LU Factorization: Dataflow Driven
Numerical program generates tasks and the run time system executes tasks respecting data dependences.
[Figure: DAG of tile-LU tasks, with xGETF2, xTRSM, and xGEMM nodes connected by their data dependences.]
QUARK
•  A runtime environment for the dynamic execution of precedence-constrained tasks (DAGs) on a multicore machine
•  Translation: if you have a serial program that consists of computational kernels (tasks) related by data dependencies, QUARK can help you execute that program (relatively efficiently and easily) in parallel on a multicore machine

21
The Purpose of a QUARK Runtime
•  Objectives
   §  High utilization of each core
   §  Scaling to large numbers of cores
   §  Synchronization-reducing algorithms
•  Methodology
   §  Dynamic DAG scheduling (QUARK)
   §  Explicit parallelism
   §  Implicit communication
   §  Fine granularity / block data layout
•  Arbitrary DAG with dynamic scheduling
[Figure: DAG-scheduled parallelism vs. fork-join parallelism; notice the synchronization penalty in the presence of heterogeneity.]

22
QUARK: Shared Memory Superscalar Scheduling

definition - pseudocode:

FOR k = 0..TILES-1
  A[k][k] <- DPOTRF(A[k][k])
  FOR m = k+1..TILES-1
    A[m][k] <- DTRSM(A[k][k], A[m][k])
  FOR m = k+1..TILES-1
    A[m][m] <- DSYRK(A[m][k], A[m][m])
    FOR n = k+1..m-1
      A[m][n] <- DGEMM(A[m][k], A[n][k], A[m][n])

implementation - QUARK code in PLASMA:

for (k = 0; k < A.mt; k++) {
  QUARK_CORE(dpotrf,...);
  for (m = k+1; m < A.mt; m++) {
    QUARK_CORE(dtrsm,...);
  }
  for (m = k+1; m < A.mt; m++) {
    QUARK_CORE(dsyrk,...);
    for (n = k+1; n < m; n++) {
      QUARK_CORE(dgemm,...);
    }
  }
}
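The QUARK_CORE arguments are elided ("...") on the slide. As a hedged analogue of the same superscalar-scheduling idea using a standard API rather than QUARK, the loop nest can be written with OpenMP task dependences; this is not PLASMA code, and the tile layout (each tile a separate nb x nb column-major block) is an assumption made only for this sketch:

/* Tile Cholesky (lower) with OpenMP tasks; dependences via depend().    */
#include <cblas.h>
#include <lapacke.h>
#include <stddef.h>

#define T(m, n) tiles[(m) + (size_t)(n) * mt]   /* pointer to tile (m,n) */

void tile_cholesky(int mt, int nb, double **tiles)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < mt; k++) {
        double *Akk = T(k, k);
        #pragma omp task depend(inout: Akk[0:nb*nb])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, Akk, nb);

        for (int m = k + 1; m < mt; m++) {
            double *Amk = T(m, k);
            #pragma omp task depend(in: Akk[0:nb*nb]) depend(inout: Amk[0:nb*nb])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, nb, nb, 1.0, Akk, nb, Amk, nb);
        }
        for (int m = k + 1; m < mt; m++) {
            double *Amk = T(m, k), *Amm = T(m, m);
            #pragma omp task depend(in: Amk[0:nb*nb]) depend(inout: Amm[0:nb*nb])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, nb, nb,
                        -1.0, Amk, nb, 1.0, Amm, nb);

            for (int n = k + 1; n < m; n++) {
                double *Ank = T(n, k), *Amn = T(m, n);
                #pragma omp task depend(in: Amk[0:nb*nb], Ank[0:nb*nb]) \
                                 depend(inout: Amn[0:nb*nb])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, nb, nb, nb,
                            -1.0, Amk, nb, Ank, nb, 1.0, Amn, nb);
            }
        }
    }
}

The runtime (here the OpenMP runtime, in PLASMA the QUARK runtime) discovers the DAG from these declared dependences and executes tasks as their inputs become available, which is exactly the dataflow-driven execution described two slides earlier.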
Pipelining: Cholesky Inversion
3 steps: factor, invert L, multiply L's (POTRF, TRTRI and LAUUM)
48 cores; the matrix is 4000 x 4000, tile size is 200 x 200.
POTRF + TRTRI + LAUUM: 25 (7t-3)
Cholesky factorization alone: 3t-2
Pipelined: 18 (3t+6)

24
Performance of PLASMA Cholesky, Double Precision
Comparing Various Numbers of Cores (Percentage of Theoretical Peak)
1024 Cores (64 x 16-cores) 2.00 GHz Intel Xeon X7550, 8,192 Gflop/s Peak (Double Precision) [nautilus]
Static Scheduling
[Chart: Gflop/s vs. matrix size (up to 160,000) for 16 to 1000 cores; fraction of theoretical peak in parentheses: 1000 cores (.27), 960 (.27), 768 (.30), 576 (.33), 384 (.31), 192 (.48), 160 (.46), 128 (.43), 96 (.51), 64 (.58), 32 (.70), 16 cores (.78).]
•  DAG too large to be generated ahead of time
   §  Generate it dynamically
   §  Merge parameterized DAGs with dynamically generated DAGs
•  HPC is about distributed heterogeneous resources
   §  Have to be able to move the data across multiple nodes
   §  The scheduling cannot be centralized
   §  Take advantage of all available resources with minimal intervention from the user
•  Facilitate the usage of the HPC resources
   §  Languages
   §  Portability
DPLASMA: Going to Distributed Memory
[Figure: DAG of tile-Cholesky tasks (POTRF, TRSM, SYRK, GEMM nodes) to be spread across distributed-memory nodes.]
Start with PLASMA:

  for i,j = 0..N
    QUARK_Insert( GEMM, A[i,j], INPUT, B[j,i], INPUT, C[i,i], INOUT )
    QUARK_Insert( TRSM, A[i,j], INPUT, B[j,i], INOUT )

Parse the C source code to an Abstract Syntax Tree.

Analyze dependencies with the Omega Test:
  { 1 < i < N : GEMM(i,j) => TRSM(j) }

Generate code which has the Parameterized DAG: GEMM(i,j) => TRSM(j).

Loops & array references have to be affine.
PLASMA (On Node) vs. DPLASMA (Distributed System)
[Figure: inputs, an execution window of tasks, and outputs flowing through the QUARK (PLASMA) and DAGuE (DPLASMA) runtimes.]
Number of tasks in the DAG: O(n^3)
  Cholesky: 1/3 n^3    LU: 2/3 n^3    QR: 4/3 n^3
Number of tasks in the parameterized DAG: O(1)
  Cholesky: 4 (POTRF, SYRK, GEMM, TRSM)
  LU: 4 (GETRF, GESSM, TSTRF, SSSSM)
  QR: 4 (GEQRT, LARFB, TSQRT, SSRFB)
DAG: Conceptualized & Parameterized: small enough to store on each core in every node = Scalable
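A quick way to see the O(n^3)-vs-O(1) contrast is to count the tasks the tile-Cholesky loop nest from the QUARK slide generates for NT tiles (a sketch; the parameterized DAG never enumerates these, it keeps only the four task classes):

#include <stdio.h>

static long cholesky_tasks(long nt)
{
    long count = 0;
    for (long k = 0; k < nt; k++) {
        count++;                                 /* POTRF(k)      */
        for (long m = k + 1; m < nt; m++) {
            count++;                             /* TRSM(m,k)     */
            count++;                             /* SYRK(m,k)     */
            for (long n = k + 1; n < m; n++)
                count++;                         /* GEMM(m,n,k)   */
        }
    }
    return count;
}

int main(void)
{
    /* task count grows cubically with the number of tiles */
    for (long nt = 10; nt <= 1000; nt *= 10)
        printf("NT = %4ld tiles -> %ld tasks\n", nt, cholesky_tasks(nt));
    return 0;
}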
Task Affinity in DPLASMA
User defined data distribution function.
Runtime system, called PaRSEC, is distributed.
Runtime DAG scheduling.
[Figure: the tile-Cholesky DAG partitioned across Node0-Node3 according to the data distribution.]
•  Every node has the symbolic DAG representation
   §  Only the (node local) frontier of the DAG is considered
   §  Distributed scheduling based on remote completion notifications
•  Background remote data transfer, automatic with overlap
•  NUMA / cache aware scheduling
   §  Work stealing and sharing based on memory hierarchies
[Performance plots: Cholesky (DSBP = Distributed Square Block Packed), LU, and QR on 81 nodes: dual-socket nodes, quad-core Xeon L5420, 648 cores total at 2.5 GHz, ConnectX InfiniBand DDR 4x.]
The PaRSEC framework
[Diagram: Domain Specific Extensions (Dense LA with a compact representation, the PTG; Sparse LA with a dynamically discovered representation, the DTG; Chemistry; ...) sit on top of the Parallel Runtime (data, tasks, scheduling, data movement, specialized kernels), which maps onto the Hardware (cores, memory hierarchies, coherence, data movement, accelerators).]
Other Systems
PaRSEC compared with SMPss, StarPU, Charm++, FLAME (w/ SuperMatrix), QUARK, and Tblas along four dimensions:
- Scheduling: distributed, one scheduler per core (PaRSEC); distributed actors (Charm++); replicated, one per node, or centralized for the others.
- Language: internal (PTG) or sequential code with affine loops (PaRSEC); sequential code with add_task calls; message-driven objects (Charm++); internal LA DSL (FLAME).
- Accelerator support: GPU.
- Availability: most are public; some are not available.
Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC.
All projects support Distributed and Shared Memory (QUARK with QUARKd; FLAME with Elemental).
Performance Development in Top500
[Chart: TOP500 performance, 1994-2020, on a log scale from 100 Mflop/s to 1 Eflop/s, extrapolating the N=1 and N=500 trend lines out toward 2020.]
Today's #1 System
(Columns: 2014, Tianhe-2 | 2020-2022 | Difference Today & Exa)
System peak: 55 Pflop/s | 1 Eflop/s | ~20x
Power: 18 MW (3 Gflops/W) | ~20 MW (50 Gflops/W) | O(1), ~15x
System memory: 1.4 PB (1.024 PB CPU + .384 PB CoP) | 32 - 64 PB | ~50x
Node performance: 3.43 TF/s (.4 CPU + 3 CoP) | 1.2 or 15 TF/s | O(1)
Node concurrency: 24 cores CPU + 171 cores CoP | O(1k) or 10k | ~5x - ~50x
Node Interconnect BW: 6.36 GB/s | 200-400 GB/s | ~40x
System size (nodes): 16,000 | O(100,000) or O(1M) | ~6x - ~60x
Total concurrency: 3.12 M (12.48M threads, 4/core) | O(billion) | ~100x
MTTF: Few / day | O(<1 day) | O(?)
Exascale System Architecture with a cap of $200M and 20MW
(Shown twice with the same Tianhe-2 vs. exascale comparison table as above; the final version projects the exascale MTTF as "Many / day" rather than "O(<1 day)".)
Major Changes to Software & Algorithms
•  Must rethink the design of our algorithms and software
   §  Another disruptive technology
      •  Similar to what happened with cluster computing and message passing
   §  Rethink and rewrite the applications, algorithms, and software
   §  Data movement is expensive
   §  Flop/s are cheap, so they are provisioned in excess
39
Critical Issues at Peta & Exascale for Algorithm and Software Design
•  Synchronization-reducing algorithms
   §  Break the fork-join model
•  Communication-reducing algorithms
   §  Use methods that attain the lower bounds on communication
•  Mixed precision methods
   §  2x speed of ops and 2x speed for data movement (see the sketch after this list)
•  Autotuning
   §  Today's machines are too complicated; build "smarts" into software to adapt to the hardware
•  Fault resilient algorithms
   §  Implement algorithms that can recover from failures/bit flips
•  Reproducibility of results
   §  Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
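A minimal sketch of the mixed-precision idea (the approach behind LAPACK's dsgesv): run the O(n^3) factorization in single precision, then recover double-precision accuracy with a few cheap O(n^2) refinement steps. Error handling and convergence tests are omitted here; a fixed iteration count stands in for them:

#include <cblas.h>
#include <lapacke.h>
#include <stdlib.h>

int mixed_precision_solve(int n, const double *A, const double *b,
                          double *x, int iters)
{
    float      *As   = malloc((size_t)n * n * sizeof *As);
    float      *cs   = malloc((size_t)n * sizeof *cs);
    double     *r    = malloc((size_t)n * sizeof *r);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

    for (size_t k = 0; k < (size_t)n * n; k++) As[k] = (float)A[k];

    /* single-precision LU factorization: the expensive O(n^3) step      */
    LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);

    /* initial solve in single precision                                 */
    for (int i = 0; i < n; i++) cs[i] = (float)b[i];
    LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, cs, n);
    for (int i = 0; i < n; i++) x[i] = (double)cs[i];

    for (int it = 0; it < iters; it++) {
        /* residual r = b - A*x, accumulated in double precision         */
        for (int i = 0; i < n; i++) r[i] = b[i];
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1.0, A, n,
                    x, 1, 1.0, r, 1);

        /* correction solve in single precision, update in double        */
        for (int i = 0; i < n; i++) cs[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, cs, n);
        for (int i = 0; i < n; i++) x[i] += (double)cs[i];
    }

    free(As); free(cs); free(r); free(ipiv);
    return 0;
}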
Summary
•  Major challenges are ahead for extreme computing
   §  Parallelism O(10^9)
      •  Programming issues
   §  Hybrid
      •  Peak and HPL may be very misleading
      •  Nowhere near peak for most apps
   §  Fault tolerance
      •  Today the Sequoia BG/Q node failure rate is 1.25 failures/day
   §  Power
      •  50 Gflops/W (today at 2 Gflops/W)
•  We will need completely new approaches and technologies to reach the Exascale level
Collaborators / Software / Support
•  PLASMA: http://icl.cs.utk.edu/plasma/
•  MAGMA: http://icl.cs.utk.edu/magma/
•  Quark (RT for Shared Memory): http://icl.cs.utk.edu/quark/
•  PaRSEC (Parallel Runtime Scheduling and Execution Control): http://icl.cs.utk.edu/parsec/
•  Collaborating partners:
   University of Tennessee, Knoxville
   University of California, Berkeley
   University of Colorado, Denver

42
More Related Content

What's hot

IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
Preferred Networks
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
 
USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)
Ryousei Takano
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
BigDataEverywhere
 
IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告
Ryousei Takano
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
Sagar Dolas
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
AMD Developer Central
 
Apache Nemo
Apache NemoApache Nemo
Apache Nemo
NAVER Engineering
 
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
PhtRaveller
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
Kyong-Ha Lee
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
DataWorks Summit/Hadoop Summit
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...
Fisnik Kraja
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
Sagar Dolas
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codes
NAVER D2
 
PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)
Hansol Kang
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리
NAVER D2
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
Ryousei Takano
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
Kohei KaiGai
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 

What's hot (20)

IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
 
IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Apache Nemo
Apache NemoApache Nemo
Apache Nemo
 
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)Report on GPGPU at FCA  (Lyon, France, 11-15 October, 2010)
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
[241]large scale search with polysemous codes
[241]large scale search with polysemous codes[241]large scale search with polysemous codes
[241]large scale search with polysemous codes
 
PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)
 
유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리유연하고 확장성 있는 빅데이터 처리
유연하고 확장성 있는 빅데이터 처리
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 

Similar to Achitecture Aware Algorithms and Software for Peta and Exascale

Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
Facultad de Informática UCM
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
Ganesan Narayanasamy
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
Vipin Varghese
 
Sierra overview
Sierra overviewSierra overview
Sierra overview
Ganesan Narayanasamy
 
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA
 
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
RISC-V International
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE
 
Nvidia tesla-k80-overview
Nvidia tesla-k80-overviewNvidia tesla-k80-overview
Nvidia tesla-k80-overview
Communication Progress
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
Kohei KaiGai
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
NECST Lab @ Politecnico di Milano
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Akihiro Hayashi
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
Kohei KaiGai
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
RISC-V International
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
inside-BigData.com
 
Latest HPC News from NVIDIA
Latest HPC News from NVIDIALatest HPC News from NVIDIA
Latest HPC News from NVIDIA
inside-BigData.com
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
LEGATO project
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
Ferdinand Jamitzky
 

Similar to Achitecture Aware Algorithms and Software for Peta and Exascale (20)

Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Sierra overview
Sierra overviewSierra overview
Sierra overview
 
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
 
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
 
Nvidia tesla-k80-overview
Nvidia tesla-k80-overviewNvidia tesla-k80-overview
Nvidia tesla-k80-overview
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Anegdotic Maxeler (Romania)
  Anegdotic Maxeler (Romania)  Anegdotic Maxeler (Romania)
Anegdotic Maxeler (Romania)
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
 
Latest HPC News from NVIDIA
Latest HPC News from NVIDIALatest HPC News from NVIDIA
Latest HPC News from NVIDIA
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 

More from inside-BigData.com

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
inside-BigData.com
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
inside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
inside-BigData.com
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
inside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
inside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 

Recently uploaded (20)

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 

Achitecture Aware Algorithms and Software for Peta and Exascale

  • 1. K2I Distinguished Lecture Series Architecture-aware Algorithms and Software for Peta and Exascale Computing Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester Rice University 2/13/14 1
  • 2. Outline •  Overview of High Performance Computing •  Look at an implementation for some linear algebra algorithms on today’s High Performance Computers §  As an examples of the kind of thing needed. 2
  • 3. H. Meuer, H. Simon, E. Strohmaier, & JD Rate - Listing of the 500 most powerful Computers in the World - Yardstick: Rmax from LINPACK MPP Ax=b, dense problem TPP performance - Updated twice a year Size SC‘xy in the States in November Meeting in Germany in June - All data available from www.top500.org 3
  • 4. Performance Development of HPC Over the Last 20 Years 1E+09 224    PFlop/s   100 Pflop/s 100000000 33.9  PFlop/s   10 Pflop/s 10000000 1 Pflop/s 1000000 SUM   100 Tflop/s 100000 N=1    118  TFlop/s   10 Tflop/s 10000 1 Tflop/s 1000 6-8 years 1.17  TFlop/s   N=500   100 Gflop/s 100 My Laptop (70 Gflop/s) 59.7  GFlop/s   10 Gflop/s 10 My iPad2 & iPhone 4s (1.02 Gflop/s) 1 Gflop/s 1 400  MFlop/s   100 Mflop/s 0.1 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011 2013
  • 5. State of Supercomputing in 2014 •  Pflops computing fully established with 31 systems. •  Three technology architecture possibilities or “swim lanes” are thriving. •  Commodity (e.g. Intel) •  Commodity + accelerator (e.g. GPUs) •  Special purpose lightweight cores (e.g. IBM BG) •  Interest in supercomputing is now worldwide, and growing in many new markets (over 50% of Top500 computers are in industry). •  Exascale projects exist in many countries and regions. 5
  • 6. November 2013: The TOP10 Country Cores Rmax [Pflops] % of Peak China 3,120,000 33.9 62 17.8 1905 USA 560,640 17.6 65 8.3 2120 Sequoia, BlueGene/Q (16c) + custom USA 1,572,864 17.2 85 7.9 2063 RIKEN Advanced Inst for Comp Sci K computer Fujitsu SPARC64 VIIIfx (8c) + Custom Japan 705,024 10.5 93 12.7 827 5 DOE / OS Argonne Nat Lab Mira, BlueGene/Q (16c) + Custom USA 786,432 8.16 85 3.95 2066 6 Swiss CSCS Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom Swiss 115,984 6.27 81 2.3 2726 7 Texas Advanced Computing Center Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB USA 204,900 2.66 61 3.3 806 8 Forschungszentrum Juelich (FZJ) JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz+Custom Germany 458,752 5.01 85 2.30 2178 USA 393,216 4.29 85 1.97 2177 Germany 147,456 2.90 91* 3.42 848 22,212 .118 50 Rank Site Computer National University of Defense Technology DOE / OS Oak Ridge Nat Lab Tianhe-2 NUDT, Xeon 12C 2.2GHz + IntelXeon Phi (57c) + Custom Titan, Cray XK7 (16C) + Nvidia Kepler GPU (14c) + Custom 3 DOE / NNSA L Livermore Nat Lab 4 1 2 9 10 500 DOE / NNSA Vulcan, BlueGene/Q, L Livermore Nat Lab Power BQC 16C 1.6GHz+Custom Leibniz Rechenzentrum Banking SuperMUC, Intel (8c) + IB HP USA Power MFlops [MW] /Watt
  • 7. Accelerators (53 systems) 60   Systems   50   40   30   20   10   0   2006   2007   2008   2009   2010   2011   2012   2013   Intel  MIC  (13)   Clearspeed  CSX600  (0)   ATI  GPU  (2)   IBM  PowerXCell  8i  (0)   NVIDIA  2070  (4)   NVIDIA  2050  (7)   NVIDIA  2090  (11)   NVIDIA  K20  (16)   19 US 9 China 6 Japan 4 Russia 2 France 2 Germany 2 India 1 Italy 1 Poland 1 Australia 2 Brazil 1 Saudi Arabia 1 South Korea 1 Spain 2 Switzerland 1 UK
  • 8. Top500 Performance Share of Accelerators 53 of the 500 systems provide 35% of the accumulated performance 35% 30% 25% 20% 15% 10% 5% 2013 2012 2011 2010 2009 2008 2007 0% 2006 Fraction of Total TOP500 Performance 40%
  • 9. For the Top 500: Rank at which Half of Total Performance is Accumulated 90 80 70 60 50 40 35 30 30 20 25 10 0 Pflop/s Numbers of Systems 100 Top 500 November 2013 20 15 10 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 5 0 0 100 200 300 400 500
  • 10. Commodity plus Accelerator Today Commodity Accelerator (GPU) Intel Xeon 8 cores 3 GHz 8*4 ops/cycle 96 Gflop/s (DP) 192 Cuda cores/SMX 2688 “Cuda cores” Nvidia K20X “Kepler” 2688 “Cuda cores” .732 GHz 2688*2/3 ops/cycle 1.31 Tflop/s (DP) Interconnect PCI-e Gen2/3 16 lane 64 Gb/s (8 GB/s) 1 GW/s 6 GB 10
  • 15. DLA Solvers ¨  We are interested in developing Dense Linear Algebra Solvers ¨  Retool LAPACK and ScaLAPACK for multicore and hybrid architectures 2/13/14 15
  • 16. Last Generations of DLA Software — software/algorithms follow hardware evolution in time:
    - LINPACK (70's), vector operations: relies on Level-1 BLAS operations.
    - LAPACK (80's), blocking, cache friendly: relies on Level-3 BLAS operations.
    - ScaLAPACK (90's), distributed memory: relies on PBLAS and message passing.
    - PLASMA, new algorithms (many-core friendly): relies on a DAG/scheduler, block data layout, and some extra kernels.
    - MAGMA, hybrid algorithms (heterogeneity friendly): relies on a hybrid scheduler and hybrid kernels.
  • 17. A New Generation of DLA Software — the same progression as the previous slide (LINPACK → LAPACK → ScaLAPACK → PLASMA and MAGMA), under the new title.
  • 18. Parallelization of LU and QR: parallelize the update (dgemm).
    - Easy and done in any reasonable software; this is the 2/3 n^3 term in the FLOP count; can be done efficiently with LAPACK + multithreaded BLAS.
    - Per panel: dgetf2 (panel factorization, lu()), dtrsm (+ dswp), then the dgemm update of the trailing matrix A(1) → A(2).
    - Fork-join parallelism, bulk synchronous processing (a sketch of this formulation follows).
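A minimal sketch of the fork-join formulation described above, assuming LAPACKE and CBLAS (e.g. OpenBLAS or MKL) are available; the function name blocked_lu is just for illustration, pivoting follows the dgetrf pattern, and error checking is omitted:

    /* Right-looking blocked LU (the LAPACK dgetrf pattern): panel, pivot swaps,
       dtrsm, dgemm. The only parallelism is the multithreaded BLAS inside
       dtrsm/dgemm; column-major storage. */
    #include <lapacke.h>
    #include <cblas.h>

    void blocked_lu(int n, double *A, int lda, lapack_int *ipiv, int nb)
    {
        for (int k = 0; k < n; k += nb) {
            int kb = (n - k < nb) ? (n - k) : nb;

            /* 1. Panel factorization of the current block column (mostly sequential) */
            LAPACKE_dgetf2(LAPACK_COL_MAJOR, n - k, kb, &A[k + k * lda], lda, &ipiv[k]);

            /* Make the panel's local pivot indices global */
            for (int i = 0; i < kb; i++) ipiv[k + i] += k;

            /* 2. Apply the row interchanges left and right of the panel */
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, k, A, lda, k + 1, k + kb, ipiv, 1);
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - k - kb, &A[(k + kb) * lda], lda,
                           k + 1, k + kb, ipiv, 1);

            if (k + kb < n) {
                /* 3. Triangular solve: U12 = L11^{-1} A12 */
                cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                            CblasUnit, kb, n - k - kb, 1.0,
                            &A[k + k * lda], lda, &A[k + (k + kb) * lda], lda);

                /* 4. Trailing update: A22 -= L21 * U12 (the 2/3 n^3 dgemm term) */
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            n - k - kb, n - k - kb, kb, -1.0,
                            &A[(k + kb) + k * lda], lda,
                            &A[k + (k + kb) * lda], lda, 1.0,
                            &A[(k + kb) + (k + kb) * lda], lda);
            }
        }
    }

Each pass through the loop is a fork-join step: the panel is essentially sequential, the dtrsm/dgemm fan out across the BLAS threads, and all threads join before the next panel — exactly the synchronization the following slides set out to remove.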
  • 19. Synchronization (in LAPACK LU): fork-join, bulk synchronous processing. (Execution trace of the cores over time, showing the synchronization points between steps.)
  • 20. PLASMA LU Factorization: Dataflow Driven. The numerical program generates tasks and the runtime system executes the tasks respecting data dependences. (DAG excerpt of xGETF2, xTRSM, and xGEMM tasks.)
  • 21. QUARK: a runtime environment for the dynamic execution of precedence-constrained tasks (DAGs) on a multicore machine. Translation: if you have a serial program that consists of computational kernels (tasks) related by data dependencies, QUARK can help you execute that program, relatively efficiently and easily, in parallel on a multicore machine.
  • 22. The Purpose of a QUARK Runtime.
    - Objectives: high utilization of each core; scaling to a large number of cores; synchronization-reducing algorithms.
    - Methodology: dynamic DAG scheduling (QUARK); explicit parallelism; implicit communication; fine granularity / block data layout.
    - Arbitrary DAGs with dynamic scheduling. (Figure contrasting DAG-scheduled parallelism with fork-join parallelism; notice the synchronization penalty in the presence of heterogeneity.)
  • 23. QUARK Shared Memory Superscalar Scheduling.
    Definition (pseudocode):
      FOR k = 0..TILES-1
        A[k][k] ← DPOTRF(A[k][k])
        FOR m = k+1..TILES-1
          A[m][k] ← DTRSM(A[k][k], A[m][k])
        FOR m = k+1..TILES-1
          A[m][m] ← DSYRK(A[m][k], A[m][m])
          FOR n = k+1..m-1
            A[m][n] ← DGEMM(A[m][k], A[n][k], A[m][n])
    Implementation (QUARK code in PLASMA):
      for (k = 0; k < A.mt; k++) {
        QUARK_CORE(dpotrf, ...);
        for (m = k+1; m < A.mt; m++) {
          QUARK_CORE(dtrsm, ...);
        }
        for (m = k+1; m < A.mt; m++) {
          QUARK_CORE(dsyrk, ...);
          for (n = k+1; n < m; n++) {
            QUARK_CORE(dgemm, ...);
          }
        }
      }
    (A runnable sketch of the same loop nest follows.)
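As a runnable stand-in for the QUARK calls (QUARK itself may not be installed), the same tile Cholesky loop nest can be expressed with OpenMP task dependences. This is only a sketch, assuming each tile is a separate contiguous NB-by-NB column-major block, LAPACKE/CBLAS are linked, and the code is compiled with -fopenmp; the function name tile_cholesky is just for illustration.

    /* Tile Cholesky (lower) with the DAG expressed through task dependences,
       standing in for the QUARK insert-task calls above. Only tiles with
       m >= k are referenced. */
    #include <lapacke.h>
    #include <cblas.h>

    void tile_cholesky(int TILES, int NB, double *A[TILES][TILES])
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < TILES; k++) {
            /* Factor the diagonal tile */
            #pragma omp task depend(inout: A[k][k][0:NB*NB])
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, A[k][k], NB);

            /* Triangular solves against the diagonal tile */
            for (int m = k + 1; m < TILES; m++) {
                #pragma omp task depend(in: A[k][k][0:NB*NB]) \
                                 depend(inout: A[m][k][0:NB*NB])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, NB, NB, 1.0, A[k][k], NB, A[m][k], NB);
            }

            /* Trailing-matrix updates; each task starts as soon as its inputs are ready */
            for (int m = k + 1; m < TILES; m++) {
                #pragma omp task depend(in: A[m][k][0:NB*NB]) \
                                 depend(inout: A[m][m][0:NB*NB])
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB,
                            -1.0, A[m][k], NB, 1.0, A[m][m], NB);

                for (int n = k + 1; n < m; n++) {
                    #pragma omp task depend(in: A[m][k][0:NB*NB], A[n][k][0:NB*NB]) \
                                     depend(inout: A[m][n][0:NB*NB])
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, NB, NB, NB,
                                -1.0, A[m][k], NB, A[n][k], NB, 1.0, A[m][n], NB);
                }
            }
        }   /* no barrier between iterations: the dependences drive execution */
    }

With QUARK the equivalent would be QUARK_Insert_Task calls carrying INPUT/INOUT flags, as in the PLASMA fragment above; in both cases the point is that the runtime, not the loop structure, decides the execution order.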
  • 24. Pipelining: Cholesky Inversion. Three steps: factor (POTRF), invert L (TRTRI), multiply the L's together (LAUUM). 48 cores; the matrix is 4000 x 4000, tile size 200 x 200. Critical path lengths (in tasks, for t tiles): POTRF+TRTRI+LAUUM run back to back: 25 (7t-3); Cholesky factorization alone: 3t-2; pipelined: 18 (3t+6).
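The three steps being pipelined are ordinary LAPACK routines; as a non-pipelined reference point, the sequence looks like this (a sketch assuming LAPACKE and lower-triangular storage; the function name spd_inverse is just for illustration):

    /* Factor, invert L, multiply the triangular factors: the plain, sequential
       version of the three steps pipelined on slide 24. On exit the lower part
       of A holds the inverse of the original SPD matrix. */
    #include <lapacke.h>

    int spd_inverse(int n, double *A, int lda)
    {
        int info;
        info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, lda);      /* A = L L^T      */
        if (info != 0) return info;
        info = LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'L', 'N', n, A, lda); /* L := L^{-1}     */
        if (info != 0) return info;
        /* dlauum forms L^T L; with L replaced by L^{-1} this is L^{-T} L^{-1} = A^{-1} */
        info = LAPACKE_dlauum(LAPACK_COL_MAJOR, 'L', n, A, lda);
        return info;
    }

Run this way, each step waits for the previous one to finish; the DAG runtime instead interleaves tasks from all three sweeps, shortening the critical path as noted above.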
  • 25. Performance of PLASMA Cholesky, Double Precision, comparing various numbers of cores (percentage of theoretical peak). 1024 cores (64 x 16-core), 2.00 GHz Intel Xeon X7550, 8,192 Gflop/s peak (double precision) [nautilus], static scheduling. (Chart of Gflop/s versus matrix size up to 160,000.) Fractions of peak: 16 cores (.78), 32 cores (.70), 64 cores (.58), 96 cores (.51), 128 cores (.43), 160 cores (.46), 192 cores (.48), 384 cores (.31), 576 cores (.33), 768 cores (.30), 960 cores (.27), 1000 cores (.27).
  • 26.
    - The DAG is too large to be generated ahead of time: generate it dynamically; merge parameterized DAGs with dynamically generated DAGs.
    - HPC is about distributed heterogeneous resources: have to be able to move the data across multiple nodes; the scheduling cannot be centralized; take advantage of all available resources with minimal intervention from the user.
    - Facilitate the usage of the HPC resources: languages; portability.
  • 27. DPLASMA: Going to Distributed Memory. (Task DAG with POTRF, TRSM, SYRK, and GEMM nodes.)
  • 28.
    1. Start with PLASMA:
         for i,j = 0..N
           QUARK_Insert( GEMM, A[i,j], INPUT, B[j,i], INPUT, C[i,i], INOUT )
           QUARK_Insert( TRSM, A[i,j], INPUT, B[j,i], INOUT )
    2. Parse the C source code into an abstract syntax tree.
    3. Analyze dependencies with the Omega Test, e.g. { 1 < i < N : GEMM(i, j) => TRSM(j) }.
    4. Generate code which has the parameterized DAG: GEMM(i, j) -> TRSM(j).
    Loops and array references have to be affine.
  • 29. PLASMA (on node, QUARK) vs. DPLASMA (distributed system, DAGuE). (Diagram: inputs, execution window, tasks, outputs.)
    - Number of tasks in the explicit DAG: O(n^3) — Cholesky: 1/3 n^3, LU: 2/3 n^3, QR: 4/3 n^3.
    - Number of tasks in the parameterized DAG: O(1) — Cholesky: 4 (POTRF, SYRK, GEMM, TRSM); LU: 4 (GETRF, GESSM, TSTRF, SSSSM); QR: 4 (GEQRT, LARFB, TSQRT, SSRFB).
    - The DAG is conceptualized and parameterized, small enough to store on each core in every node = scalable.
  • 30. Task Affinity in DPLASMA: user-defined data distribution function; the runtime system, called PaRSEC, is distributed.
  • 31. Runtime DAG scheduling. (Cholesky task DAG of POTRF/TRSM/SYRK/GEMM nodes partitioned across Node0-Node3.)
    - Every node has the symbolic DAG representation; only the (node-local) frontier of the DAG is considered; distributed scheduling is based on remote completion notifications.
    - Background remote data transfer is automatic, with overlap.
    - NUMA / cache-aware scheduling: work stealing and sharing based on memory hierarchies.
  • 32. (Performance plots for Cholesky, LU, and QR; DSBP = Distributed Square Block Packed.) Hardware: 81 dual-socket quad-core Xeon L5420 nodes, 648 cores total at 2.5 GHz, ConnectX InfiniBand DDR 4x.
  • 33. The PaRSEC framework. (Layered diagram.) Domain Specific Extensions: Dense LA (compact representation - PTG), Sparse LA (dynamic discovered representation - DTG), Chemistry, ... Parallel Runtime: scheduling, data, tasks, data movement, specialized kernels. Hardware: cores, memory hierarchies, coherence, data movement, accelerators.
  • 34. Other Systems — runtime systems compared: PaRSEC (PTG), SMPss, StarPU, Charm++, FLAME, QUARK, Tblas.
    - Scheduling: ranges from distributed (1/core, PaRSEC) and distributed actors (Charm++) to replicated (1/node), SuperMatrix-based (FLAME), and centralized schedulers.
    - Language: sequential code with add_task calls for several systems; message-driven objects for Charm++; an internal LA DSL for FLAME; an internal representation or sequential code with affine loops for PaRSEC.
    - Accelerator: GPU support in five of the systems.
    - Availability: public for most; two not available. Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC.
    - All projects support distributed and shared memory (QUARK with QUARKd; FLAME with Elemental).
  • 35. Performance Development in Top500. (Chart: the N=1 and N=500 trend lines from 1994 extrapolated through 2020, on a scale running from 100 Mflop/s up past 1 Eflop/s.)
  • 36. Today's #1 System — 2014 (Tianhe-2) versus a 2020-2022 exascale system (difference in parentheses):
    - System peak: 55 Pflop/s → 1 Eflop/s (~20x)
    - Power: 18 MW (3 Gflops/W) → ~20 MW (50 Gflops/W) (O(1), ~15x)
    - System memory: 1.4 PB (1.024 PB CPU + .384 PB CoP) → 32 - 64 PB (~50x)
    - Node performance: 3.43 TF/s (.4 CPU + 3 CoP) → 1.2 or 15 TF/s (O(1))
    - Node concurrency: 24 cores CPU + 171 cores CoP → O(1k) or 10k (~5x - ~50x)
    - Node interconnect BW: 6.36 GB/s → 200-400 GB/s (~40x)
    - System size (nodes): 16,000 → O(100,000) or O(1M) (~6x - ~60x)
    - Total concurrency: 3.12 M (12.48M threads, 4/core) → O(billion) (~100x)
    - MTTF: Few / day → O(<1 day) (O(?))
  • 37. Exascale System Architecture with a cap of $200M and 20MW: the same table as the previous slide.
  • 38. Exascale System Architecture with a cap of $200M and 20MW: the same table again, except the exascale MTTF entry now reads "Many / day" instead of "O(<1 day)".
  • 39. Major Changes to Software & Algorithms.
    - Must rethink the design of our algorithms and software: another disruptive technology, similar to what happened with cluster computing and message passing; rethink and rewrite the applications, algorithms, and software.
    - Data movement is expensive; flop/s are cheap, so they are provisioned in excess (a rough illustration follows).
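A rough, illustrative calculation of the "data movement is expensive" point, reusing the node numbers from slide 10 (not from the original slides):

    /* How many operations must be performed per double moved over PCIe for the
       accelerator of slide 10 to stay busy. Illustrative only. */
    #include <stdio.h>

    int main(void) {
        double gpu_flops  = 1.31e12;   /* K20X peak, flop/s (slide 10) */
        double pcie_bytes = 8.0e9;     /* PCIe Gen2 x16, bytes/s       */
        double doubles_per_s = pcie_bytes / 8.0;

        /* ~1310 flops per 8-byte word: any kernel with lower arithmetic
           intensity is limited by data movement, not by flop/s. */
        printf("Required arithmetic intensity: %.0f flops per double moved\n",
               gpu_flops / doubles_per_s);
        return 0;
    }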
  • 40. Critical Issues at Peta & Exascale for Algorithm and Software Design.
    - Synchronization-reducing algorithms: break the fork-join model.
    - Communication-reducing algorithms: use methods which have a lower bound on communication.
    - Mixed precision methods: 2x speed of ops and 2x speed for data movement (see the example below).
    - Autotuning: today's machines are too complicated; build "smarts" into software to adapt to the hardware.
    - Fault resilient algorithms: implement algorithms that can recover from failures/bit flips.
    - Reproducibility of results: today we can't guarantee this; we understand the issues, but some of our "colleagues" have a hard time with this.
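One concrete mixed-precision building block already in LAPACK is dsgesv, which factors the matrix in single precision and iteratively refines the solution back to double-precision accuracy. A small self-contained example (the diagonally dominant random matrix is only for illustration):

    /* Mixed-precision solve: single-precision factorization + iterative
       refinement to double-precision accuracy via LAPACK's dsgesv. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <lapacke.h>

    int main(void) {
        const lapack_int n = 1000, nrhs = 1;
        double *A = malloc(sizeof(double) * n * n);
        double *b = malloc(sizeof(double) * n);
        double *x = malloc(sizeof(double) * n);
        lapack_int *ipiv = malloc(sizeof(lapack_int) * n);
        lapack_int iter, info;

        /* Random matrix, made diagonally dominant so the single-precision
           factorization is well conditioned. Column-major storage. */
        for (lapack_int j = 0; j < n; j++) {
            for (lapack_int i = 0; i < n; i++)
                A[i + j * n] = (double)rand() / RAND_MAX;
            A[j + j * n] += n;
            b[j] = 1.0;
        }

        info = LAPACKE_dsgesv(LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv, b, n, x, n, &iter);

        /* iter > 0: number of refinement sweeps; iter < 0: fell back to a
           full double-precision factorization. */
        printf("info = %d, refinement iterations = %d\n", (int)info, (int)iter);

        free(A); free(b); free(x); free(ipiv);
        return 0;
    }

The factorization, which dominates the cost, runs at the (roughly 2x faster) single-precision rate while the refined answer retains double-precision accuracy.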
  • 41. Summary.
    - Major challenges are ahead for extreme computing: parallelism of O(10^9); programming issues (hybrid).
    - Peak and HPL may be very misleading: nowhere near peak for most apps.
    - Fault tolerance: today the Sequoia BG/Q node failure rate is 1.25 failures/day.
    - Power: 50 Gflops/W needed (today at 2 Gflops/W).
    - We will need completely new approaches and technologies to reach the exascale level.
  • 42. Collaborators / Software / Support.
    - PLASMA http://icl.cs.utk.edu/plasma/
    - MAGMA http://icl.cs.utk.edu/magma/
    - Quark (RT for Shared Memory) http://icl.cs.utk.edu/quark/
    - PaRSEC (Parallel Runtime Scheduling and Execution Control) http://icl.cs.utk.edu/parsec/
    - Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver