The Effect of Hierarchical Memory on
the Design of Parallel Applications
and the Expression of Parallelism
David W. Walker
Cardiff School of Computer Science & Informatics
All modern computer systems have
hierarchical memories
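Memory Hierarchy Pyramid
(Figure: a pyramid with, from top to base, registers, L1 cache, L2 cache, L3 cache, main memory, remote memory, remoter memory, and even more remote memory. Capacity grows towards the base; access speed grows towards the top.)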
Typical Quad-Core Chip
(Figure: four CPUs, each with its own L1 cache and instruction cache. Each pair of CPUs shares an L2 cache, and all four share memory.)
Efficient access to hierarchical memory is
important in achieving good performance in both
sequential and parallel applications
$$c_{ij} = \sum_{k=0}^{n-1} a_{ik} b_{kj}$$
void matMul_ikj(float* A, float* B, float* C, float* work, int n)
{
    int i, j, k;
    for (i = 0; i < n; ++i) {
        /* Accumulate row i of C in work[]. */
        for (j = 0; j < n; ++j) work[j] = 0;
        for (k = 0; k < n; ++k) {
            float a = A[i * n + k];
            /* The innermost loop walks along row k of B. */
            for (j = 0; j < n; ++j) {
                float b = B[k * n + j];
                work[j] += a * b;
            }
        }
        for (j = 0; j < n; ++j) C[i * n + j] = work[j];
    }
}
In C, which stores arrays in row-major order, matrix multiply
performs best when A and B are accessed by row, as this
improves data locality
Transforming loops is a key technique for
improving performance by changing the order
in which computations occur
This 1989 paper has
won the SC17 “Test of
Time” award
Proceedings of the 1989 ACM/IEEE Conference on Supercomputing
© 1989 ACM 089791-341-8/89/0011/0655
Block algorithms, used in libraries such as LAPACK,
can be seen as a particular type of loop
transformation.
Block algorithm has more flops per memory
reference: O(n) vs. O(1)
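As a rough illustration of the block idea, the sketch below tiles the three loops of a row-major matrix multiply so that each tile is reused many times while it is in cache. The function name matmul_blocked and the tile-size parameter bs are assumptions made for this sketch (bs is assumed to divide n); this is not the LAPACK implementation.

```c
#include <assert.h>
#include <stddef.h>

/* C = C + A*B for n x n row-major matrices, computed tile by tile.
   bs is the tile size; this sketch assumes that bs divides n. */
void matmul_blocked(const float *A, const float *B, float *C,
                    size_t n, size_t bs)
{
    assert(bs > 0 && n % bs == 0);
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs)
                /* Multiply one bs x bs tile of A by one tile of B. */
                for (size_t i = ii; i < ii + bs; ++i)
                    for (size_t k = kk; k < kk + bs; ++k) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + bs; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

With a tile size chosen so that three bs x bs tiles fit in cache, each element loaded is reused about bs times before it is evicted.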
If efficient access to hierarchical memory
is so important, how is it supported in
programming languages?
Occam: processes communicating via
messages on typed channels
MPI: non-blocking communication allows
data movement to be overlapped with
computation
Overlap in a parallel algorithm
(Figure: a subdomain divided into interior points and boundary points.)
1. Send boundary data to neighbours
2. Update interior points
3. Receive boundary data from neighbours
4. Update boundary points
Multithreading: expose more parallelism at a finer
granularity to increase scope for latency hiding
OpenMP: parallel programming model based
on threads with access to shared memory
CUDA: used on NVIDIA GPUs. Fine-grained
parallelism, large numbers of threads running
on thousands of cores
OpenACC: “a single programming model that
will allow you to write a single program that
runs with high performance in parallel across
a wide range of target systems”
Michael Wolfe in OpenACC for Multicore GPUs
https://www.pgroup.com/lit/brochures/openacc_sc15.pdf
Cooperative Parallel Programming
Programmer: indicates
opportunities for parallelism,
gives hints
Compiler: applies
transformations
Runtime:
manages threads
PGAS languages: each thread has its own private
memory and also has access to globally shared
memory
PGAS: Local and shared variables
(Figure: Threads 0 to 3 each have their own private memory. An array is spread across the global shared address space, with part of it held by each thread.)
A thread is said to have an “affinity” for certain elements in an array, which it can access faster than others.
To optimize performance PGAS languages still
require the programmer to reason about data
locality and synchronization
Example: 2D Laplace Problem
(Figure: a grid of NPTSX by NPTSY points divided into N horizontal strips, Strip 0 to Strip (N-1), each NY rows deep.)
Solution is held at 0 on the boundary, and 1 at the 4 centre squares.
2D Laplace Problem: MPI solution
(Figure: the grid of NPTSX by NPTSY points divided among Process 0 to Process (N-1), each holding NY rows.)
At start of each Jacobi iteration, each process exchanges its first and last rows with the processes above and below.
MPI Program
#include <stdio.h>
#include <mpi.h>
#define NPROCS 4
#define NY 20
#define NPTSX 200
#define NPTSY (NY*NPROCS)
#define NSTEPS 5000
// Routines setup_grid(), exchange_rows(), output_array()
int main (int argc, char *argv[])
{
    float phi[NY+2][NPTSX], oldphi[NY+2][NPTSX];
    int mask[NY+2][NPTSX];
    int i, j, k, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    setup_grid(rank, phi, mask);
    for (k = 1; k <= NSTEPS; k++) {
        // Copy: save the current solution in oldphi
        for (j = 1; j < NY+1; j++)
            for (i = 0; i < NPTSX; i++) oldphi[j][i] = phi[j][i];
        // Exchange rows with the processes above and below
        exchange_rows(rank, oldphi);
        // Update: Jacobi step at the points where mask is set
        for (j = 1; j < NY+1; j++)
            for (i = 0; i < NPTSX; i++) {
                if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] +
                    oldphi[j][i+1] + oldphi[j-1][i] + oldphi[j+1][i]);
            }
    }
    output_array(rank, phi);
    MPI_Finalize();
    return 0;
}
2D Laplace Problem: UPC solution
(Figure: the grid of NPTSX by NPTSY points divided among Thread 0 to Thread (THREADS-1), each holding NY rows.)
UPC Program #1
#include <stdio.h>
#include <upc.h>
#define NY 20
#define NPTSX 200
#define NPTSY (NY*THREADS)
#define NSTEPS 5000
// Can use shared arrays
shared[*] float phi[NPTSY][NPTSX], oldphi[NPTSY][NPTSX];
shared[*] int mask[NPTSY][NPTSX];
// Routines setup_grid(), output_array(), and RGBval()
int main ()
{
    int i, j, k;
    setup_grid();
    upc_barrier;
    for (k = 1; k <= NSTEPS; k++) {
        // Copy: each thread copies the rows it has affinity for
        upc_forall(j = 0; j < NPTSY; j++; j/NY)
            for (i = 0; i < NPTSX; i++) oldphi[j][i] = phi[j][i];
        // Note the barriers: all copies must finish before any update
        upc_barrier;
        upc_forall(j = 0; j < NPTSY; j++; j/NY)
            for (i = 0; i < NPTSX; i++) {
                if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] +
                    oldphi[j][i+1] + oldphi[j-1][i] + oldphi[j+1][i]);
            }
        // ... and all updates must finish before the next copy
        upc_barrier;
    }
    output_array();
    return 0;
}
Updating values lying on upper and lower
boundaries of a thread requires access to data values
with different affinities. These accesses are slow.
Data Sharing Between Threads
(Figure: the grid divided into horizontal strips of NY rows by NPTSX columns, one strip per thread, from Thread 0 to Thread THREADS-1.)
Updating a value on a thread's upper or lower boundary requires data from the thread above or below.
Each thread copies its first and last rows into
shared memory at start of time step, and then
reads rows from neighbouring threads from
shared memory.
Coordinating Private and Shared Memory
(Figure: each thread, from Thread 0 to Thread THREADS-1, holds NY rows of NPTSX points in private memory and copies its Row 1 and Row NY into shared memory.)
Shared memory is used as a way of coordinating the sharing of data between threads. This reduces the explicit barriers to one per time step, and coalesces data movement between local and remote memory.
Main Program: Array Declarations
#include <stdio.h>
#include <upc.h>
#define NY 20
#define NPTSX 200
#define NPTSY (NY*THREADS)
#define NSTEPS 5000
// Shared array to hold rows 1 and NY of each thread
shared[NPTSX] float ud[2][THREADS*NPTSX];
// Needed for output
shared[*] float finalphi[NPTSY][NPTSX];
// Arrays in private memory
float phi[NY+2][NPTSX], oldphi[NY+2][NPTSX];
int mask[NY+2][NPTSX];
// Routines setup_grid(), output_array(), and RGBval()
int main ()
{
    int i, j, k;
    setup_grid();
    upc_barrier;
    for (k = 1; k <= NSTEPS; k++) {
        …   // update code: see next slide
    }
    output_array();
}
Main Program: Update
// Copy rows 1 and NY of phi to shared memory
for (i = 0; i < NPTSX; i++) {
    ud[0][MYTHREAD*NPTSX+i] = phi[1][i];
    ud[1][MYTHREAD*NPTSX+i] = phi[NY][i];
}
// Only one barrier
upc_barrier;
// Copy into row 0 the last row of the thread above
if (MYTHREAD > 0) {
    for (i = 0; i < NPTSX; i++)
        phi[0][i] = ud[1][(MYTHREAD-1)*NPTSX+i];
}
// Copy into row NY+1 the first row of the thread below
if (MYTHREAD < THREADS-1) {
    for (i = 0; i < NPTSX; i++)
        phi[NY+1][i] = ud[0][(MYTHREAD+1)*NPTSX+i];
}
// Copy phi to oldphi
for (j = 0; j < NY+2; j++)
    for (i = 0; i < NPTSX; i++) oldphi[j][i] = phi[j][i];
// Do the update
for (j = 1; j < NY+1; j++)
    for (i = 0; i < NPTSX; i++) {
        if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] +
            oldphi[j][i+1] + oldphi[j-1][i] + oldphi[j+1][i]);
    }
That was a straightforward example: regular
communication and good load balance.
Start at the root and visit
every node of the tree
using a depth-first traversal
algorithm.
Note: it’s an implicit tree –
every node contains all the
information needed to
specify its children.
Each thread has a node stack. When it’s
empty a thread will steal work from the
stack of another thread.
(Figure: a stack holding Nodes A to E. When Node A is visited it is removed from the stack and its children, Nodes X and Y, are added.)
UPC implementation allows a thread to do push and pop operations on the top of its stack, and to steal nodes from the bottom of other threads' stacks.
(Figure: a stack of nodes divided into a top part and a bottom part.)
Stack top: the thread has affinity for this part of its stack and can access it using a local pointer.
Stack bottom: other threads can steal nodes from this part of its stack by accessing it using a global pointer.
Complications: Need to use locks to synchronise access to
bottom part of stack, and when moving nodes between the
top and bottom parts of the stack.
OpenMP implementation can use the task directive
Aim to visit each node and process it in some way.
Node x;
int n = make_tree(&x, max_depth);
// Do not adjust number of threads at runtime
omp_set_dynamic(0);
// Create a pool of threads
#pragma omp parallel shared(x) num_threads(nthreads)
{
    // One thread visits root node
    #pragma omp single
    visit_node(&x);
}
void visit_node(Node *x){
    Node *y = x->children;
    // Loop over children of node x
    while (y != NULL) {
        // Creates a new task for each child to call visit_node(y)
        #pragma omp task firstprivate(y)
        visit_node(y);
        y = y->next;
    }
    // Wait here until all the child tasks have finished
    #pragma omp taskwait
    process_node(x);
    return;
}
The runtime system schedules the tasks on
the threads.
OpenMP tasks work well for parallelizing recursive
problems with dynamic load imbalance
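The same task pattern can be tried on a small self-contained problem. The sketch below is an assumption-laden stand-in for the slides' tree: the implicit tree is hypothetical (a node is identified only by its depth, and every node above max_depth has two children), and count_nodes and count_tree are names invented here. It counts nodes rather than calling process_node().

```c
#include <assert.h>

/* Hypothetical implicit tree: a node is identified by its depth alone, and
   every node above max_depth has exactly two children. */
static long count_nodes(int depth, int max_depth)
{
    long left = 0, right = 0;
    if (depth == max_depth)
        return 1;                       /* leaf: counts only itself */
    /* One task per child, as in visit_node(). */
    #pragma omp task shared(left)
    left = count_nodes(depth + 1, max_depth);
    #pragma omp task shared(right)
    right = count_nodes(depth + 1, max_depth);
    /* Wait for both child tasks before combining their results. */
    #pragma omp taskwait
    return 1 + left + right;
}

long count_tree(int max_depth)
{
    long total = 0;
    assert(max_depth >= 0);
    /* Create the thread pool; one thread starts at the root. */
    #pragma omp parallel
    {
        #pragma omp single
        total = count_nodes(0, max_depth);
    }
    return total;
}
```

For example, count_tree(3) returns 15 (1 + 2 + 4 + 8 nodes). If the code is built without OpenMP support the pragmas are ignored and the traversal simply runs serially.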
SC12: https://doi.org/10.1109/SC.2012.71
SC06: https://doi.org/10.1145/1188455.1188543
“To achieve good performance the
programmer and the programming system
must reason about locality and
independence”
In Sequoia, recursive tasks act as self-
contained units of computation, and
hierarchical memory is represented by a
tree.
Programmer must provide Sequoia with a
task mapping specification that maps
different levels of the memory hierarchy to
different granularities of task.
In addition to changing the order of
arithmetical operations, we can also
change the layout of data in memory
Hilbert Space-Filling Curve
Morton Order
 0  1  4  5
 2  3  6  7
 8  9 12 13
10 11 14 15
Morton Order: Recursive Definition
Square $2^n \times 2^n$ Arrays: RM and Morton index
Block size, $b = 2^{n-r}$ (maximum $r = n-1$)
Example: n = 5, r = 2, so b = 8
Consider (i,j) = (18,13)
$i = 10010_2$, $j = 01101_2$
Interlace the top 2 bits of i (10) and j (01) to get the block number:
1001 → 9
Append the remaining bits of i (010) and of j (101), the row-major (RM) position within the block. The Morton index is:
1001010101 → 597
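The index calculation in this example can be written as a few bitwise operations. This is a minimal sketch under the layout implied by the example (blocks in Morton order, row-major order within a block); the function name morton_index is an assumption.

```c
#include <assert.h>

/* Index of element (i,j) of a 2^n x 2^n array stored as b x b blocks,
   b = 2^(n-r): blocks are in Morton order, and elements within a block
   are in row-major order. */
unsigned morton_index(unsigned i, unsigned j, unsigned n, unsigned r)
{
    assert(r <= n);
    unsigned low = n - r;               /* bits that index within a block */
    unsigned b = 1u << low;             /* block size */
    unsigned bi = i >> low;             /* block row: top r bits of i */
    unsigned bj = j >> low;             /* block column: top r bits of j */
    unsigned block = 0;
    /* Interlace the bits of bi and bj, with the bit of bi more significant. */
    for (unsigned bit = 0; bit < r; ++bit) {
        block |= ((bj >> bit) & 1u) << (2 * bit);
        block |= ((bi >> bit) & 1u) << (2 * bit + 1);
    }
    return block * b * b + (i & (b - 1)) * b + (j & (b - 1));
}
```

With n = 5 and r = 2, morton_index(18, 13, 5, 2) gives 597, matching the worked example; with r = 0 the function reduces to the ordinary row-major index.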
The unshuffle operation takes a shuffled sequence of items and unshuffles them:
$$a_1 b_1 a_2 b_2 \ldots a_n b_n \rightarrow a_1 a_2 \ldots a_n b_1 b_2 \ldots b_n$$
where each $a_i$ is a contiguous vector of $\ell_a$ items, and each $b_i$ is a contiguous vector of $\ell_b$ items.
20 July 2017
Apply Morton Ordering to Matrix A
n is matrix size, b is block size. Both are powers of 2.
(Figure: the four quadrants of A start at offsets 0 and p1 in the top row, and p2 and p3 in the bottom row.)
mortonOrder (A,n,b){
    if( b < n ){
        p1 = (n*n)/4
        p2 = 2*p1
        p3 = 3*p1
        unshuffle(A,n/2,n/2)
        unshuffle(A+p2,n/2,n/2)
        mortonOrder(A,n/2,b)
        mortonOrder(A+p1,n/2,b)
        mortonOrder(A+p2,n/2,b)
        mortonOrder(A+p3,n/2,b)
    }
}
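The unshuffle operation and the mortonOrder pseudocode can be sketched in C as follows. This version is an illustration only: it uses a temporary buffer rather than an in-place unshuffle, and it passes the number of (a, b) pairs as an explicit count argument, which the pseudocode leaves implicit. The names unshuffle and morton_order, and their signatures, are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* a1 b1 a2 b2 ... -> a1 a2 ... b1 b2 ... for `count` pairs, where each a
   holds la floats and each b holds lb floats. Uses a temporary buffer. */
void unshuffle(float *x, size_t count, size_t la, size_t lb)
{
    size_t total = count * (la + lb);
    if (total == 0) return;
    float *tmp = malloc(total * sizeof *tmp);
    assert(tmp != NULL);
    for (size_t p = 0; p < count; ++p) {
        const float *pair = x + p * (la + lb);
        memcpy(tmp + p * la, pair, la * sizeof *x);
        memcpy(tmp + count * la + p * lb, pair + la, lb * sizeof *x);
    }
    memcpy(x, tmp, total * sizeof *x);
    free(tmp);
}

/* Convert an n x n row-major matrix to Morton order with b x b blocks.
   n and b are assumed to be powers of 2 with b <= n. */
void morton_order(float *A, size_t n, size_t b)
{
    if (b < n) {
        size_t h = n / 2;
        size_t p1 = (n * n) / 4, p2 = 2 * p1, p3 = 3 * p1;
        unshuffle(A, h, h, h);          /* top n/2 rows -> A00 then A01 */
        unshuffle(A + p2, h, h, h);     /* bottom n/2 rows -> A10 then A11 */
        morton_order(A, h, b);
        morton_order(A + p1, h, b);
        morton_order(A + p2, h, b);
        morton_order(A + p3, h, b);
    }
}
```

After the two unshuffles each quadrant is a contiguous row-major matrix of size n/2, which is why the recursive calls can treat it as an independent matrix.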
Possible use of Morton or SFC ordering would be
in a library – optionally convert between matrix
layouts on entry to, and exit from, the library.
Recursive Matrix Multiply
(Figure: A partitioned into quadrants A00, A01, A10 and A11.)
mm_Recursive (A,B,C,n,b){ // C = C + AB
    if(n==b){
        // End of recursion. Choose b so matrices fit in cache.
        matmul(A,B,C,n)
    }
    else{
        mm_Recursive(A00,B00,C00,n/2,b)
        mm_Recursive(A01,B10,C00,n/2,b)
        mm_Recursive(A00,B01,C01,n/2,b)
        mm_Recursive(A01,B11,C01,n/2,b)
        mm_Recursive(A10,B00,C10,n/2,b)
        mm_Recursive(A11,B10,C10,n/2,b)
        mm_Recursive(A10,B01,C11,n/2,b)
        mm_Recursive(A11,B11,C11,n/2,b)
    }
    return
}
Note: all the computational work happens in the leaves of the recursion tree.
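A runnable C version of the recursive multiply might look like the sketch below. It assumes row-major storage, so each quadrant is addressed through a pointer offset and a shared row stride ld, rather than the Morton layout discussed earlier. The names mm_recursive and matmul_leaf and the ld parameter are assumptions; n and b are assumed to be powers of 2 with b <= n.

```c
#include <assert.h>

/* Leaf kernel: C = C + A*B for n x n blocks stored with row stride ld. */
static void matmul_leaf(const float *A, const float *B, float *C,
                        int n, int ld)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            float a = A[i * ld + k];
            for (int j = 0; j < n; ++j)
                C[i * ld + j] += a * B[k * ld + j];
        }
}

/* C = C + A*B by recursive quadrant splitting down to b x b leaves. */
void mm_recursive(const float *A, const float *B, float *C,
                  int n, int b, int ld)
{
    assert(b > 0 && n >= b);
    if (n == b) {
        matmul_leaf(A, B, C, n, ld);
        return;
    }
    int h = n / 2;
    /* Quadrant pointers: X00 top-left, X01 top-right, X10 and X11 below. */
    const float *A00 = A, *A01 = A + h, *A10 = A + h * ld, *A11 = A10 + h;
    const float *B00 = B, *B01 = B + h, *B10 = B + h * ld, *B11 = B10 + h;
    float *C00 = C, *C01 = C + h, *C10 = C + h * ld, *C11 = C10 + h;
    mm_recursive(A00, B00, C00, h, b, ld);
    mm_recursive(A01, B10, C00, h, b, ld);
    mm_recursive(A00, B01, C01, h, b, ld);
    mm_recursive(A01, B11, C01, h, b, ld);
    mm_recursive(A10, B00, C10, h, b, ld);
    mm_recursive(A11, B10, C10, h, b, ld);
    mm_recursive(A10, B01, C11, h, b, ld);
    mm_recursive(A11, B11, C11, h, b, ld);
}
```

The top-level call passes ld equal to n. With Morton storage each quadrant would instead be contiguous, so no stride would be needed.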
Platform 1: MacBook
Pro, Intel i7, 4 cores,
256KB L2 cache/core,
6MB L3 cache
Platform 2: two Xeon
E5-2620, 6 cores each,
256KB L2 cache/core,
15MB L3 cache
gcc compiler used
with -O3 flag set
Cholesky Factorization
$$A = LL^T$$
$$A_{00} = L_{00}L_{00}^T$$
$$A_{10} = L_{10}L_{00}^T$$
$$A_{11} = L_{10}L_{10}^T + L_{11}L_{11}^T$$
Tail Recursive Cholesky
(Figure: A partitioned into blocks A00, A10 and A11.)
choleskyTailRecursive (A,n,b){ // A = LL^T
    if(n==b){
        // End of recursion. Choose b so matrices fit in cache.
        cholesky(A,b)
    }
    else{
        cholesky(A00,b)
        triangularSolve(A10,A00,n-b,b)
        symmetricRankUpdate(A11,A10,n-b,b)
        choleskyTailRecursive(A11,n-b,b)
    }
    return
}
Note: computational work happens at all levels of the recursion tree.
Binary Recursive Cholesky
(Figure: A partitioned into blocks A00, A10 and A11.)
choleskyBinaryRecursive (A,n,b){ // A = LL^T
    if(n==b){
        // End of recursion. Choose b so matrices fit in cache.
        cholesky(A,b)
    }
    else{
        choleskyBinaryRecursive(A00,n/2,b)
        triangularSolve(A10,A00,n/2,n/2)
        symmetricRankUpdate(A11,A10,n/2,n/2)
        choleskyBinaryRecursive(A11,n/2,b)
    }
    return
}
Note: the 4 operations at the inner nodes of the recursion tree have to be done in order, so cannot do recursive calls in parallel.
Blocked RM order: standard algorithm
based on rectangular blocks
Tiled RM order: all operations are
expressed in terms of operations involving
square tiles, but matrices are stored in RM
order
Tiled Morton order: as above, but matrices
are stored in Morton order.
All times are relative
to time for single call
to DPOTRF
Morton order algorithms require Morton
index computations. There are a number
of ways to do these (bitwise operations,
lookup tables) but the method used does
not impact performance much.
These plots show results for the binary
recursive algorithm on Platform 1. Similar
results were obtained on Platform 2.
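The Fourier transform of an n×n array, X, can be expressed as:
$$Y = F_n X F_n$$
where element (p,q) of matrix $F_n$ is $\omega_n^{pq}$ and $\omega_n = \exp(-2\pi i / n)$. For n = 4, writing $\omega = \omega_4$:
$$F_4 = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega & \omega^2 & \omega^3 \\ 1 & \omega^2 & \omega^4 & \omega^6 \\ 1 & \omega^3 & \omega^6 & \omega^9 \end{pmatrix}$$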
2D Fast Fourier Transform
$$Y = F_n X F_n = F_n X F_n^T = A_t \cdots A_1 (P_n^T X P_n) A_1^T \cdots A_t^T$$
where $t = \log_2(n)$ and $P_n^T$ is a permutation matrix such that $P_n^T X$ exchanges row k of X with row k', where k' is the t bits of k in reverse order.
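$$\begin{pmatrix} 1&0&0&0&0&0&0&0 \\ 0&0&0&0&1&0&0&0 \\ 0&0&1&0&0&0&0&0 \\ 0&0&0&0&0&0&1&0 \\ 0&1&0&0&0&0&0&0 \\ 0&0&0&0&0&1&0&0 \\ 0&0&0&1&0&0&0&0 \\ 0&0&0&0&0&0&0&1 \end{pmatrix}^T \begin{pmatrix} 0\\1\\2\\3\\4\\5\\6\\7 \end{pmatrix} = \begin{pmatrix} 0\\4\\2\\6\\1\\5\\3\\7 \end{pmatrix}$$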
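The bit-reversal row permutation above is easy to express in code. The following is a minimal sketch; the names bit_reverse and bit_reverse_rows are assumptions, and the matrix is assumed to be stored in row-major order with n = 2^t.

```c
#include <assert.h>

/* Reverse the lowest t bits of k. */
unsigned bit_reverse(unsigned k, unsigned t)
{
    unsigned rev = 0;
    for (unsigned bit = 0; bit < t; ++bit) {
        rev = (rev << 1) | (k & 1u);
        k >>= 1;
    }
    return rev;
}

/* Exchange row k of the n x n row-major matrix X with row k', where k' is
   the t-bit reversal of k and n = 2^t. */
void bit_reverse_rows(float *X, unsigned n, unsigned t)
{
    assert(n == (1u << t));
    for (unsigned k = 0; k < n; ++k) {
        unsigned kr = bit_reverse(k, t);
        if (k < kr)                     /* swap each pair only once */
            for (unsigned j = 0; j < n; ++j) {
                float tmp = X[k * n + j];
                X[k * n + j] = X[kr * n + j];
                X[kr * n + j] = tmp;
            }
    }
}
```

For t = 3, bit_reverse maps 0, 1, ..., 7 to 0, 4, 2, 6, 1, 5, 3, 7, which is the permutation shown in the example.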
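$$A_q = \begin{pmatrix} B_L & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & B_L \end{pmatrix}$$
where $L = 2^q$, $r = n/L$, $L_* = L/2$. $A_q$ has r diagonal blocks of $B_L$, the “butterfly” matrix.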
Kronecker matrix product:
$$A \otimes B = \begin{pmatrix} a_{00} B & \cdots & a_{0,n-1} B \\ \vdots & \ddots & \vdots \\ a_{m-1,0} B & \cdots & a_{m-1,n-1} B \end{pmatrix}$$
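To make the definition concrete, the sketch below forms the Kronecker product of two small dense matrices explicitly. The function name kron and its argument order are assumptions, and matrices are taken to be row-major. Forming the product like this is for illustration only; an FFT would apply such factors without ever storing them.

```c
#include <assert.h>
#include <stddef.h>

/* K = A (x) B, where A is m x n, B is p x q and K is (m*p) x (n*q).
   All matrices are dense and row-major. */
void kron(const float *A, size_t m, size_t n,
          const float *B, size_t p, size_t q, float *K)
{
    assert(A != NULL && B != NULL && K != NULL);
    size_t ldk = n * q;                 /* row length of K */
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            /* Block (i,j) of K is a(i,j) times B. */
            for (size_t r = 0; r < p; ++r)
                for (size_t c = 0; c < q; ++c)
                    K[(i * p + r) * ldk + (j * q + c)] =
                        A[i * n + j] * B[r * q + c];
}
```

When A is the 2 x 2 identity, the result is a block-diagonal matrix with two copies of B on the diagonal, which is the structure of factors written as $I_2 \otimes B$.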
I recommend this book if
you want to understand
the mathematics behind
the FFT algorithm.
A Common 2D FFT Algorithm
$$Y = A_t \cdots A_1 (P_n^T X P_n) A_1^T \cdots A_t^T$$
1. Evaluate $X \leftarrow A_t \cdots A_1 P_n^T X$
2. Transpose to get $X^T$
3. Evaluate $Y^T = A_t \cdots A_1 P_n^T X^T$
4. Transpose $Y^T$ to get $Y$
$\Pi_n$ is a permutation matrix that performs a perfect shuffle index operation, and $\Pi_{b,n}$ performs a partial bit reversal on indices.
Basis of recursive 2D FFT
$$F_n \Pi_{b,n} = B_{b,n} (I_{n/b} \otimes F_b)$$
$$\Pi_{b,n} = \Pi_n (I_2 \otimes \Pi_{n/2}) (I_4 \otimes \Pi_{n/4}) \cdots (I_{n/(2b)} \otimes \Pi_{2b})$$
$$B_{b,n} = B_n (I_2 \otimes B_{n/2}) (I_4 \otimes B_{n/4}) \cdots (I_{n/(2b)} \otimes B_{2b})$$
$H_{b,n}$ permutes the columns and rows of X based on a partial bit-reversal of indices:
$$H_{b,n} = \Pi_{b,n}^T X \Pi_{b,n}$$
$$F_n X F_n = F_n X F_n^T = B_{b,n} \left[ (I_{n/b} \otimes F_b) H_{b,n} (I_{n/b} \otimes F_b) \right] B_{b,n}^T$$
What is in the red box (the bracketed factor)? This is the result of partitioning the matrix into b×b blocks and performing a 2D FFT on each.
Denote this by $K_{b,n}$: the 2D FFT of the b×b blocks of the partially bit-reversed matrix, X.
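$$F_n X F_n = F_n X F_n^T = A_t \cdots A_{s+1} K_{b,n} A_{s+1}^T \cdots A_t^T$$
where $b = 2^s$.
1. Evaluate $Y_{b,n} = A_t \cdots A_{s+1} K_{b,n}$
2. Transpose $Y_{b,n}$
3. Evaluate $\hat{X}^T = A_t \cdots A_{s+1} Y_{b,n}^T$
4. Transpose $\hat{X}^T$ to get $\hat{X} = F_n X F_n^T = Y_{b,n} A_{s+1}^T \cdots A_t^T$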
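Evaluate $K_{b,n}$: the FFTs of the b×b blocks
Evaluate $K_{2b,n} = (I_2 \otimes B_{2b}) K_{b,n} (I_2 \otimes B_{2b})^T$: the FFTs of the 2b×2b blocks
Evaluate $B_{4b} K_{2b,n} B_{4b}^T$: the FFT of the whole n×n array
(Figures: the n×n array partitioned into b×b blocks, and then into 2b×2b blocks.)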
Transpose-based 2D FFT
transposeFFT2D (X,n,b){
    partialBitReversal(X,n,b)
    // Do FFT of each block using any algorithm.
    for (each bxb block, B, of X)
        fft2D(B,b)
    // Pre-multiply blocks as we move up the recursion tree
    recursiveTransposeFFT(X,n,b)
    // Transpose X
    transpose(X,n,b)
    recursiveTransposeFFT(X,n,b)
    transpose(X,n,b)
    return
}
Recursive Transpose-Based 2D FFT
recursiveTransposeFFT (X,n,b){
    // End recursion when n=b. Choose b so matrices fit in cache.
    if(n>b){
        recursiveTransposeFFT(X00,n/2,b)
        recursiveTransposeFFT(X01,n/2,b)
        recursiveTransposeFFT(X10,n/2,b)
        recursiveTransposeFFT(X11,n/2,b)
        // Pre-multiply nxn block by butterfly matrix, overwriting X.
        butterflyPre(X,n,b)
    }
    return
}
Note: includes work at each level of the recursion tree.
Note: recursive calls are readily parallelizable.
Vector Radix 2D FFT
vectorRadixFFT2D (X,n,b){
    partialBitReversal(X,n,b)
    recursiveVRFFT(X,n,b)
    return
}
Recursive Vector Radix 2D FFT
recursiveVRFFT (X,n,b){
    if(n==b){
        // End recursion when n=b. Choose b so matrices fit in cache.
        fft2D(X,n)
    }
    else{
        recursiveVRFFT(X00,n/2,b)
        recursiveVRFFT(X01,n/2,b)
        recursiveVRFFT(X10,n/2,b)
        recursiveVRFFT(X11,n/2,b)
        // Pre-multiply nxn block by butterfly matrix, overwriting X,
        // and then post-multiply.
        butterflyPre(X,n,b)
        butterflyPost(X,n,b)
    }
    return
}
Note: recursive calls are readily parallelizable.
All times are relative to
time for transpose-
based FFT on RM matrix
of same size
Morton ordering doesn’t improve FFT timings by as
much as for matrix multiplication. Computation to
data movement ratio is n for matrix multiply, and
log(n) for FFT
Morton ordering and related recursive parallel
algorithms may work well when hierarchical
memory is handled programmatically.
Thank you for your attention.
Any Questions?

platelets- lifespan -Clot retraction-disorders.pptx
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditions
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 
Large scale production of streptomycin.pptx
Large scale production of streptomycin.pptxLarge scale production of streptomycin.pptx
Large scale production of streptomycin.pptx
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCINGRNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
 
Viksit bharat till 2047 India@2047.pptx
Viksit bharat till 2047  India@2047.pptxViksit bharat till 2047  India@2047.pptx
Viksit bharat till 2047 India@2047.pptx
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 

The Effect of Hierarchical Memory on the Design of Parallel Algorithms and the Expression of Parallelism

  • 1. The Effect of Hierarchical Memory on the Design of Parallel Applications and the Expression of Parallelism David W. Walker Cardiff School of Computer Science & Informatics
  • 2. All modern computer systems have hierarchical memories
  • 3. Memory Hierarchy Pyramid, from apex to base: registers, L1 cache, L2 cache, L3 cache, main memory, remote memory, remoter memory, even more remote memory. Capacity increases and access speed decreases from apex to base.
  • 4. Typical Quad-Core Chip: four CPUs, each with its own L1 cache and instruction cache; each pair of CPUs shares an L2 cache, and both L2 caches connect to shared memory.
  • 5. Efficient access to hierarchical memory is important in achieving good performance in both sequential and parallel applications
  • 6. $c_{ij} = \sum_{k=0}^{n-1} a_{ik} b_{kj}$

        void matMul_ikj(float* A, float* B, float* C, float* work, int n)
        {
            int i, j, k;
            for (i = 0; i < n; ++i){
                for (j = 0; j < n; ++j) work[j] = 0;
                for (k = 0; k < n; ++k) {
                    float a = A[i * n + k];
                    for (j = 0; j < n; ++j) {
                        float b = B[k * n + j];
                        work[j] += a * b;
                    }
                }
                for (j = 0; j < n; ++j) C[i * n + j] = work[j];
            }
        }

  • 7. In C, which stores arrays in row-major order, matrix multiply performs best when A and B are accessed by row, as this improves data locality
  • 8. Transforming loops is a key technique for improving performance by changing the order in which computations occur
  • 9. This 1989 paper has won the SC17 “Test of Time” award Proceedings of the 1989 ACM/IEEE Conference on Supercomputing © 1989 ACM 089791-341-8/89/0011/0655
  • 10. Block algorithms, used in libraries such as LAPACK, can be seen as a particular type of loop transformation.
  • 11. Block algorithm has more flops per memory reference: O(n) vs. O(1)
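The blocking idea can be sketched in C as a tiled version of the matrix multiply shown earlier. This is only an illustration under stated assumptions: the function name `matmul_blocked` and the tile-size parameter `bs` are hypothetical, the matrices are square and row-major, and `n` is assumed to be a multiple of `bs`. Once three `bs` x `bs` tiles fit in cache, each loaded element is reused about `bs` times before it is evicted.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for n x n row-major matrices, processed in bs x bs tiles.
   bs is a hypothetical tuning parameter; n must be a multiple of bs. */
void matmul_blocked(const float *A, const float *B, float *C, int n, int bs)
{
    assert(bs > 0 && n % bs == 0);
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                /* Multiply tile A[ii.., kk..] by tile B[kk.., jj..]
                   and accumulate into tile C[ii.., jj..]. */
                for (int i = ii; i < ii + bs; ++i)
                    for (int k = kk; k < kk + bs; ++k) {
                        float a = A[(size_t)i * n + k];
                        for (int j = jj; j < jj + bs; ++j)
                            C[(size_t)i * n + j] += a * B[(size_t)k * n + j];
                    }
}
```

The three outer loops walk over tiles and the three inner loops are the same i-k-j ordering as before, so the transformation changes only the order of the arithmetic, not its result.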
  • 12. If efficient access to hierarchical memory is so important, how is it supported in programming languages?
  • 13. Occam: processes communicating via messages on typed channels
  • 14. MPI: non-blocking communication allows data movement to be overlapped with computation
  • 15. Overlap in a parallel algorithm. The grid is divided into interior points and boundary points, and each step proceeds as follows: (1) send boundary data to neighbours; (2) update interior points; (3) receive boundary data from neighbours; (4) update boundary points.
  • 16. Multithreading: expose more parallelism at a finer granularity to increase scope for latency hiding
  • 17. OpenMP: parallel programming model based on threads with access to shared memory
  • 18. CUDA: used on NVIDIA GPUs. Fine-grained parallelism, with large numbers of threads running on thousands of cores
  • 19. OpenACC: “a single programming model that will allow you to write a single program that runs with high performance in parallel across a wide range of target systems” Michael Wolfe in OpenACC for Multicore GPUs https://www.pgroup.com/lit/brochures/openacc_sc15.pdf
  • 20. Cooperative Parallel Programming. Programmer: indicates opportunities for parallelism, gives hints. Compiler: applies transformations. Runtime: manages threads.
  • 21. PGAS languages: each thread has its own private memory and also has access to globally shared memory
  • 22. PGAS: Local and shared variables. [Figure: Threads 0 to 3, each with its own private memory, and an array distributed across the global shared address space.] A thread is said to have an “affinity” for certain elements in an array, which it can access faster than others.
  • 23. To optimize performance PGAS languages still require the programmer to reason about data locality and synchronization
  • 24. Example: 2D Laplace Problem. [Figure: an NPTSX × NPTSY grid divided into N horizontal strips, Strip 0 to Strip (N-1), each NY rows high.] Solution is held at 0 on the boundary, and 1 at the 4 centre squares.
  • 25. 2D Laplace Problem: MPI solution. [Figure: the NPTSX × NPTSY grid with one strip of NY rows per process, Process 0 to Process (N-1).] At start of each Jacobi iteration, each process exchanges its first and last rows with the processes above and below.
  • 26. MPI Program

        #include <stdio.h>
        #include <mpi.h>
        #define NPROCS 4
        #define NY 20
        #define NPTSX 200
        #define NPTSY (NY*NPROCS)
        #define NSTEPS 5000

        // Routines setup_grid(), exchange_rows(), output_array()

        int main (int argc, char *argv[])
        {
            float phi[NY+2][NPTSX], oldphi[NY+2][NPTSX];
            int mask[NY+2][NPTSX];
            int i, j, k, rank;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank (MPI_COMM_WORLD, &rank);
            setup_grid(rank, phi, mask);
            for(k=1;k<=NSTEPS;k++){
                // Copy
                for(j=1;j<NY+1;j++)
                    for(i=0;i<NPTSX;i++)
                        oldphi[j][i] = phi[j][i];
                // Exchange rows
                exchange_rows(rank,oldphi);
                // Update
                for(j=1;j<NY+1;j++)
                    for(i=0;i<NPTSX;i++) {
                        if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] + oldphi[j][i+1]
                                                        + oldphi[j-1][i] + oldphi[j+1][i]);
                    }
            }
            output_array(rank,phi);
            MPI_Finalize();
        }

  • 27. 2D Laplace Problem: UPC solution. [Figure: the NPTSX × NPTSY grid with one strip of NY rows per thread, Thread 0 to Thread (THREADS-1).]
  • 28. UPC Program #1

        #include <stdio.h>
        #include <upc.h>
        #define NY 20
        #define NPTSX 200
        #define NPTSY (NY*THREADS)
        #define NSTEPS 5000

        // Can use shared arrays
        shared[*] float phi[NPTSY][NPTSX], oldphi[NPTSY][NPTSX];
        shared[*] int mask[NPTSY][NPTSX];

        // Routines setup_grid(), output_array(), and RGBval()

        int main ()
        {
            int i, j, k;

            setup_grid();
            upc_barrier;                      // Note the barriers
            for(k=1;k<=NSTEPS;k++){
                upc_forall(j=0;j<NPTSY;j++;j/NY)
                    for(i=0;i<NPTSX;i++)
                        oldphi[j][i] = phi[j][i];
                upc_barrier;
                upc_forall(j=0;j<NPTSY;j++;j/NY)
                    for(i=0;i<NPTSX;i++) {
                        if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] + oldphi[j][i+1]
                                                        + oldphi[j-1][i] + oldphi[j+1][i]);
                    }
                upc_barrier;
            }
            output_array();
        }

        Updating values lying on upper and lower boundaries of a thread requires access to data values with different affinities. These accesses are slow.
  • 29. Data Sharing Between Threads. [Figure: strips of NY rows by NPTSX columns, one per thread, Thread 0 to Thread THREADS-1.] To update a value on a thread’s upper or lower boundary requires data from the thread above or below
  • 30. Each thread copies its first and last rows into shared memory at start of time step, and then reads rows from neighbouring threads from shared memory.
  • 31. Coordinating Private and Shared Memory. [Figure: each thread's strip of NY rows by NPTSX columns, with Row 1 and Row NY of each strip copied to shared memory.] Shared memory is used as a way of coordinating the sharing of data between threads. This avoids the explicit barriers, and coalesces data movement between local and remote memory.
  • 32. Main Program: Array Declarations

        #include <stdio.h>
        #include <upc.h>
        #define NY 20
        #define NPTSX 200
        #define NPTSY (NY*THREADS)
        #define NSTEPS 5000

        shared[NPTSX] float ud[2][THREADS*NPTSX];         // Shared array to hold rows 1 and NY of each thread
        shared[*] float finalphi[NPTSY][NPTSX];           // Needed for output
        float phi[NY+2][NPTSX], oldphi[NY+2][NPTSX];      // Arrays in private memory
        int mask[NY+2][NPTSX];

        // Routines setup_grid(), output_array(), and RGBval()

        int main ()
        {
            int i, j, k;

            setup_grid();
            upc_barrier;
            for(k=1;k<=NSTEPS;k++){
                …                                         // See next slide for update code
            }
            output_array();
        }

  • 33. Main Program: Update

        // Copy rows 1 and NY of phi to shared memory
        for(i=0;i<NPTSX;i++){
            ud[0][MYTHREAD*NPTSX+i] = phi[1][i];
            ud[1][MYTHREAD*NPTSX+i] = phi[NY][i];
        }
        upc_barrier;                                      // Only one barrier
        if (MYTHREAD>0) {                                 // Copy into row 0
            for(i=0;i<NPTSX;i++) phi[0][i] = ud[1][(MYTHREAD-1)*NPTSX+i];
        }
        if (MYTHREAD<THREADS-1) {                         // Copy into row NY+1
            for(i=0;i<NPTSX;i++) phi[NY+1][i] = ud[0][(MYTHREAD+1)*NPTSX+i];
        }
        // Copy phi to oldphi
        for(j=0;j<NY+2;j++)
            for(i=0;i<NPTSX;i++)
                oldphi[j][i] = phi[j][i];
        // Do the update
        for(j=1;j<NY+1;j++)
            for(i=0;i<NPTSX;i++) {
                if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] + oldphi[j][i+1]
                                                + oldphi[j-1][i] + oldphi[j+1][i]);
            }
  • 34. That was a straightforward example: regular communication and good load balance.
  • 35.
  • 36. Start at the root and visit every node of the tree using a depth-first traversal algorithm. Note: it’s an implicit tree: every node contains all the information needed to specify its children.
  • 37. Each thread has a node stack. When it’s empty a thread will steal work from the stack of another thread. [Figure: a stack holding Nodes A to E; once Node A is taken from the stack, its children, Node X and Node Y, join Nodes B to E on the stack.]
  • 38. UPC implementation allows a thread to do push and pop operations on the top of its stack, and to steal nodes from the bottom of other threads’ stacks. Stack top: the thread has affinity for this part of its stack and can access it using a local pointer. Stack bottom: other threads can steal nodes from this part of its stack by accessing it using a global pointer. Complications: locks are needed to synchronise access to the bottom part of the stack, and when moving nodes between the top and bottom parts of the stack.
  • 39. OpenMP implementation can use the task directive
  • 40. Aim to visit each node and process it in some way.

        Node x;
        int n = make_tree(&x, max_depth);
        omp_set_dynamic(0);                                    // Do not adjust number of threads at runtime
        #pragma omp parallel shared(x) num_threads(nthreads)   // Create a pool of threads
        {
            #pragma omp single                                 // One thread visits root node
            visit_node(&x);
        }

  • 41. void visit_node(Node *x){
            Node *y = x->children;
            while(y != NULL){                                  // Loop over children of node x
                #pragma omp task firstprivate(y)               // Creates a new task for each child to call visit_node(y)
                visit_node(y);
                y = y->next;
            }
            #pragma omp taskwait                               // Wait here until all the child tasks have finished
            process_node(x);
            return;
        }

        The runtime system schedules the tasks on the threads.
  • 42. OpenMP tasks work well for parallelizing recursive problems with dynamic load imbalance
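To make the task-based traversal concrete, the two fragments can be combined into one self-contained sketch. The `Node` layout (first-child and next-sibling pointers), the helper `make_node`, the counter `nodes_processed` and the wrapper `traverse` are assumptions made for illustration and are not taken from the slides. If the file is compiled without OpenMP support the pragmas are ignored and the traversal simply runs sequentially.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node layout: first-child and next-sibling links. */
typedef struct Node {
    struct Node *children;  /* first child, or NULL for a leaf */
    struct Node *next;      /* next sibling, or NULL */
    int visited;
} Node;

/* Build a complete tree with `fanout` children per node, `depth` levels below the root. */
Node *make_node(int depth, int fanout)
{
    Node *x = calloc(1, sizeof *x);
    assert(x != NULL);
    if (depth > 0)
        for (int c = 0; c < fanout; ++c) {
            Node *y = make_node(depth - 1, fanout);
            y->next = x->children;
            x->children = y;
        }
    return x;
}

static long nodes_processed = 0;

/* Stand-in for real per-node work: mark the node and count it. */
void process_node(Node *x)
{
    x->visited = 1;
    #pragma omp atomic
    nodes_processed++;
}

/* One task per child; the parent is processed after all its children. */
void visit_node(Node *x)
{
    for (Node *y = x->children; y != NULL; y = y->next) {
        #pragma omp task firstprivate(y)
        visit_node(y);
    }
    #pragma omp taskwait
    process_node(x);
}

/* Create the thread pool, let one thread start at the root, return the node count. */
long traverse(Node *root)
{
    nodes_processed = 0;
    #pragma omp parallel
    {
        #pragma omp single
        visit_node(root);
    }
    return nodes_processed;
}
```

For example, `traverse(make_node(3, 2))` visits a complete binary tree with 15 nodes. The sketch does not free the tree; a real program would.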
  • 44. “To achieve good performance the programmer and the programming system must reason about locality and independence”
  • 45. In Sequoia, recursive tasks act as self-contained units of computation, and hierarchical memory is represented by a tree.
  • 46. Programmer must provide Sequoia with a task mapping specification that maps different levels of the memory hierarchy to different granularities of task.
  • 47. In addition to changing the order of arithmetical operations, we can also change the layout of data in memory
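  • 49. Morton Order. [Figure: a 4×4 grid with cells numbered 0 to 15.]
  • 51. Square $2^n \times 2^n$ Arrays: RM and Morton index. Block size, $b = 2^{n-r}$ (maximum r = n-1)
  • 52. Morton Order. [Figure: a 4×4 grid with cells numbered 0 to 15.] Example with n = 5, r = 2, so the array is 32×32 and the block size is b = 8. Consider (i,j) = (18,13): in binary, i = 10010 and j = 01101. Interlace the top 2 bits of i and j: 1001 → 9, which is the block number. The remaining 3 bits of i and of j, 010 and 101, give the row-major position 010101 within the block. Morton index is: 1001010101 → 597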
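The index computation in this example can be sketched with bitwise operations. The function name `morton_index` and its argument order are hypothetical. The layout assumed is the one the example implies: b×b blocks with b = 2^(n-r), row-major storage inside each block, and block numbers formed by interleaving the top r bits of i and j with the bit of i first.

```c
#include <assert.h>

/* Index of element (i,j) of a 2^n x 2^n matrix stored as b x b row-major
   blocks, b = 2^(n-r), with the blocks arranged in Morton order. */
unsigned morton_index(unsigned i, unsigned j, unsigned n, unsigned r)
{
    assert(r < n && n <= 15);               /* keeps the result within 32 bits */
    unsigned low = n - r;                   /* bits that address within a block */
    unsigned bi = i >> low, bj = j >> low;  /* block row and block column */
    unsigned block = 0;
    for (unsigned bit = 0; bit < r; ++bit) {
        /* Interleave: the bit of i sits just above the matching bit of j. */
        block |= ((bj >> bit) & 1u) << (2 * bit);
        block |= ((bi >> bit) & 1u) << (2 * bit + 1);
    }
    unsigned mask = (1u << low) - 1u;
    unsigned offset = ((i & mask) << low) | (j & mask);  /* row-major within block */
    return (block << (2 * low)) | offset;
}
```

With the values from the slide, `morton_index(18, 13, 5, 2)` gives block 9 and offset 21, that is 9·64 + 21 = 597.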
  • 53. The unshuffle operation takes a shuffled sequence of items and unshuffles them: $a_1 b_1 a_2 b_2 \ldots a_n b_n \rightarrow a_1 a_2 \ldots a_n b_1 b_2 \ldots b_n$, where each $a_i$ is a contiguous vector of $\ell_a$ items, and each $b_i$ is a contiguous vector of $\ell_b$ items. 20 July 2017
  • 54. Apply Morton Ordering to Matrix A

        mortonOrder (A,n,b){
            if( b < n ){
                p1 = (n*n)/4
                p2 = 2*p1
                p3 = 3*p1
                unshuffle(A,n/2,n/2)
                unshuffle(A+p2,n/2,n/2)
                mortonOrder(A,n/2,b)
                mortonOrder(A+p1,n/2,b)
                mortonOrder(A+p2,n/2,b)
                mortonOrder(A+p3,n/2,b)
            }
        }

        n is matrix size, b is block size. Both are powers of 2. [Figure: p1, p2 and p3 mark where the second, third and fourth quadrants start.]
  • 55. A possible use of Morton or other space-filling curve (SFC) ordering would be in a library: optionally convert between matrix layouts on entry to, and exit from, the library.
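A minimal C sketch of the unshuffle and Morton reordering steps follows. It makes two assumptions that are not in the slides: `unshuffle` takes an explicit pair count in addition to the two vector lengths, and it copies through a temporary buffer for simplicity, whereas an in-place unshuffle would avoid the extra memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* a1 b1 a2 b2 ... -> a1 a2 ... b1 b2 ..., for `count` pairs in which each a
   has la items and each b has lb items. Out-of-place via a temporary buffer. */
void unshuffle(float *x, size_t count, size_t la, size_t lb)
{
    size_t total = count * (la + lb);
    float *tmp = malloc(total * sizeof *tmp);
    assert(tmp != NULL);
    for (size_t p = 0; p < count; ++p) {
        memcpy(tmp + p * la, x + p * (la + lb), la * sizeof *x);
        memcpy(tmp + count * la + p * lb, x + p * (la + lb) + la, lb * sizeof *x);
    }
    memcpy(x, tmp, total * sizeof *x);
    free(tmp);
}

/* Convert an n x n row-major matrix, in place, to b x b row-major blocks
   arranged in Morton order. n and b are powers of 2 with b <= n. */
void morton_order(float *A, size_t n, size_t b)
{
    if (b < n) {
        size_t h = n / 2, p1 = h * h;     /* p1 = size of one quadrant */
        unshuffle(A, h, h, h);            /* top half: rows split into A00 then A01 */
        unshuffle(A + 2 * p1, h, h, h);   /* bottom half: A10 then A11 */
        for (size_t q = 0; q < 4; ++q)    /* each quadrant is now contiguous */
            morton_order(A + q * p1, h, b);
    }
}
```

For a 4×4 matrix holding 0 to 15 in row-major order, `morton_order(A, 4, 1)` leaves the storage order 0 1 4 5 2 3 6 7 8 9 12 13 10 11 14 15.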
  • 56. Recursive Matrix Multiply

        mm_Recursive (A,B,C,n,b){        // C = C + AB
            if(n==b){                    // End of recursion. Choose b so matrices fit in cache.
                matmul(A,B,C,n)
            }
            else{
                mm_Recursive(A00,B00,C00,n/2,b)
                mm_Recursive(A01,B10,C00,n/2,b)
                mm_Recursive(A00,B01,C01,n/2,b)
                mm_Recursive(A01,B11,C01,n/2,b)
                mm_Recursive(A10,B00,C10,n/2,b)
                mm_Recursive(A11,B10,C10,n/2,b)
                mm_Recursive(A10,B01,C11,n/2,b)
                mm_Recursive(A11,B11,C11,n/2,b)
            }
            return
        }

        Note: all the computational work happens in the leaves of the recursion tree. [Figure: A partitioned into quadrants A00, A01, A10, A11.]
  • 57. Platform 1: MacBook Pro, Intel i7, 4 cores, 256KB L2 cache/core, 6MB L3 cache. Platform 2: two Xeon E5-2620, 6 cores each, 256KB L2 cache/core, 15MB L3 cache. gcc compiler used with -O3 flag set.
  • 58. Cholesky Factorization: $A = LL^T$, so that $A_{00} = L_{00}L_{00}^T$, $A_{10} = L_{10}L_{00}^T$, $A_{11} = L_{10}L_{10}^T + L_{11}L_{11}^T$
  • 59. Tail Recursive Cholesky

        choleskyTailRecursive (A,n,b){   // A = LL^T
            if(n==b){                    // End of recursion. Choose b so matrices fit in cache.
                cholesky(A,b)
            }
            else{
                cholesky(A00,b)
                triangularSolve(A10,A00,n-b,b)
                symmetricRankUpdate(A11,A10,n-b,b)
                choleskyTailRecursive(A11,n-b,b)
            }
            return
        }

        Note: computational work happens at all levels of the recursion tree. [Figure: A partitioned into blocks A00, A10, A11.]
  • 60. Binary Recursive Cholesky

        choleskyBinaryRecursive (A,n,b){ // A = LL^T
            if(n==b){                    // End of recursion. Choose b so matrices fit in cache.
                cholesky(A,b)
            }
            else{
                choleskyBinaryRecursive(A00,n/2,b)
                triangularSolve(A10,A00,n/2,n/2)
                symmetricRankUpdate(A11,A10,n/2,n/2)
                choleskyBinaryRecursive(A11,n/2,b)
            }
            return
        }

        Note: the 4 operations at the inner nodes of the recursion tree have to be done in order, so cannot do recursive calls in parallel. [Figure: A partitioned into blocks A00, A10, A11.]
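The recursive multiply can be written as runnable C if the quadrants are addressed by pointer offsets. The sketch below assumes row-major storage with an explicit row stride `ld`, which is a simplification: the deck pairs the recursion with Morton-ordered storage, where each quadrant is contiguous. The names `matmul_leaf` and `ld` are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Leaf kernel: C += A * B for n x n blocks whose rows are ld elements apart. */
static void matmul_leaf(const float *A, const float *B, float *C, int n, int ld)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            float a = A[(size_t)i * ld + k];
            for (int j = 0; j < n; ++j)
                C[(size_t)i * ld + j] += a * B[(size_t)k * ld + j];
        }
}

/* C += A * B by recursive quadrant splitting; n and b are powers of 2. */
void mm_recursive(const float *A, const float *B, float *C, int n, int b, int ld)
{
    assert(n > 0 && b > 0);
    if (n <= b) {                 /* end of recursion: block fits in cache */
        matmul_leaf(A, B, C, n, ld);
        return;
    }
    int h = n / 2;
    /* Offsets of quadrants 01, 10 and 11 relative to quadrant 00. */
    size_t q01 = (size_t)h, q10 = (size_t)h * ld, q11 = q10 + h;
    mm_recursive(A,       B,       C,       h, b, ld);  /* C00 += A00*B00 */
    mm_recursive(A + q01, B + q10, C,       h, b, ld);  /* C00 += A01*B10 */
    mm_recursive(A,       B + q01, C + q01, h, b, ld);  /* C01 += A00*B01 */
    mm_recursive(A + q01, B + q11, C + q01, h, b, ld);  /* C01 += A01*B11 */
    mm_recursive(A + q10, B,       C + q10, h, b, ld);  /* C10 += A10*B00 */
    mm_recursive(A + q11, B + q10, C + q10, h, b, ld);  /* C10 += A11*B10 */
    mm_recursive(A + q10, B + q01, C + q11, h, b, ld);  /* C11 += A10*B01 */
    mm_recursive(A + q11, B + q11, C + q11, h, b, ld);  /* C11 += A11*B11 */
}
```

A top-level call on full matrices passes `ld = n`, for example `mm_recursive(A, B, C, 8, 2, 8)`.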
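The binary recursive factorization can be sketched in C as follows. Several choices here are assumptions rather than the author's implementation: the matrix is row-major with row stride `ld`, only the lower triangle is read and overwritten with L, the three kernels are plain loops rather than tuned BLAS/LAPACK calls, and `sqrt_newton` is a stand-in for `sqrt()` from `<math.h>` so that the sketch needs no extra link flags.

```c
#include <assert.h>
#include <stddef.h>

/* Newton iteration for sqrt(a), a > 0; replace with sqrt() in real code. */
static double sqrt_newton(double a)
{
    assert(a > 0.0);              /* fails if A is not positive definite */
    double x = a > 1.0 ? a : 1.0;
    for (int it = 0; it < 64; ++it)
        x = 0.5 * (x + a / x);
    return x;
}

/* Unblocked Cholesky: overwrite the lower triangle of the n x n block with L. */
static void cholesky_leaf(double *A, int n, int ld)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j) {
            double s = A[(size_t)i * ld + j];
            for (int k = 0; k < j; ++k)
                s -= A[(size_t)i * ld + k] * A[(size_t)j * ld + k];
            A[(size_t)i * ld + j] = (i == j) ? sqrt_newton(s) : s / A[(size_t)j * ld + j];
        }
}

/* A10 <- A10 * L00^{-T}: A10 is m x nb, L00 is nb x nb lower triangular. */
static void triangular_solve(double *A10, const double *L00, int m, int nb, int ld)
{
    for (int r = 0; r < m; ++r)
        for (int j = 0; j < nb; ++j) {
            double s = A10[(size_t)r * ld + j];
            for (int k = 0; k < j; ++k)
                s -= A10[(size_t)r * ld + k] * L00[(size_t)j * ld + k];
            A10[(size_t)r * ld + j] = s / L00[(size_t)j * ld + j];
        }
}

/* A11 <- A11 - L10 * L10^T, lower triangle only: A11 is m x m, L10 is m x nb. */
static void symmetric_rank_update(double *A11, const double *L10, int m, int nb, int ld)
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j <= i; ++j) {
            double s = 0.0;
            for (int k = 0; k < nb; ++k)
                s += L10[(size_t)i * ld + k] * L10[(size_t)j * ld + k];
            A11[(size_t)i * ld + j] -= s;
        }
}

/* Binary recursive Cholesky; n and b are powers of 2. */
void cholesky_binary_recursive(double *A, int n, int b, int ld)
{
    if (n <= b) { cholesky_leaf(A, n, ld); return; }
    int h = n / 2;
    double *A10 = A + (size_t)h * ld, *A11 = A10 + h;
    cholesky_binary_recursive(A, h, b, ld);      /* A00 = L00 L00^T */
    triangular_solve(A10, A, h, h, ld);          /* L10 = A10 L00^{-T} */
    symmetric_rank_update(A11, A10, h, h, ld);   /* A11 -= L10 L10^T */
    cholesky_binary_recursive(A11, h, b, ld);    /* A11 = L11 L11^T */
}
```

The four steps in the recursive case must run in the order shown, which matches the note on the slide. The strict upper triangle of A is left untouched.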
  • 61. Blocked RM order: standard algorithm based on rectangular blocks. Tiled RM order: all operations are expressed in terms of operations involving square tiles, but matrices are stored in RM order. Tiled Morton order: as above, but matrices are stored in Morton order. All times are relative to time for single call to DPOTRF.
  • 62. Morton order algorithms require Morton index computations. There are a number of ways to do these (bitwise operations, lookup tables) but the method used does not impact performance much.
  • 63. These plots show results for the binary recursive algorithm on Platform 1. Similar results were obtained on Platform 2.
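  • 64. The Fourier transform of an n×n array, X, can be expressed as $Y = F_n X F_n$, where element (p,q) of matrix $F_n$ is $\omega_n^{pq}$ and $\omega_n = \exp(-2\pi i/n)$. For example, writing $\omega = \omega_4$:
        $F_4 = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega & \omega^2 & \omega^3 \\ 1 & \omega^2 & \omega^4 & \omega^6 \\ 1 & \omega^3 & \omega^6 & \omega^9 \end{pmatrix}$
  • 65. 2D Fast Fourier Transform: $Y = F_n X F_n = F_n X F_n^T = A_t \cdots A_1 (P_n^T X P_n) A_1^T \cdots A_t^T$, where $t = \log_2(n)$ and $P_n^T$ is a permutation matrix such that $P_n^T X$ exchanges row k of X with row k’, where k’ is the t bits of k in reverse order. For n = 8:
        $\begin{pmatrix} 1&0&0&0&0&0&0&0 \\ 0&0&0&0&1&0&0&0 \\ 0&0&1&0&0&0&0&0 \\ 0&0&0&0&0&0&1&0 \\ 0&1&0&0&0&0&0&0 \\ 0&0&0&0&0&1&0&0 \\ 0&0&0&1&0&0&0&0 \\ 0&0&0&0&0&0&0&1 \end{pmatrix}^T \begin{pmatrix} 0\\1\\2\\3\\4\\5\\6\\7 \end{pmatrix} = \begin{pmatrix} 0\\4\\2\\6\\1\\5\\3\\7 \end{pmatrix}$
  • 66. $A_q = I_r \otimes B_L$, where $L = 2^q$, $r = n/L$, $L_* = L/2$, and $B_L$ is the “butterfly” matrix, so $A_q$ has r diagonal blocks of $B_L$:
        $A_q = \begin{pmatrix} B_L & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & B_L \end{pmatrix}$
        Kronecker matrix product:
        $A \otimes B = \begin{pmatrix} a_{00}B & \cdots & a_{0,n-1}B \\ \vdots & \ddots & \vdots \\ a_{m-1,0}B & \cdots & a_{m-1,n-1}B \end{pmatrix}$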
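The bit-reversal permutation on this slide is easy to express in C. The sketch below is illustrative only: the function names `bit_reverse` and `bit_reverse_rows` are hypothetical, and the matrix is assumed to be square, row-major, and of size n = 2^t.

```c
#include <assert.h>

/* Reverse the low t bits of k, e.g. t = 3: 001 -> 100. */
unsigned bit_reverse(unsigned k, unsigned t)
{
    unsigned r = 0;
    for (unsigned bit = 0; bit < t; ++bit) {
        r = (r << 1) | (k & 1u);
        k >>= 1;
    }
    return r;
}

/* In-place row permutation of an n x n row-major matrix, n = 2^t:
   row k is exchanged with row bit_reverse(k, t). */
void bit_reverse_rows(float *X, unsigned n, unsigned t)
{
    assert(n == (1u << t));
    for (unsigned k = 0; k < n; ++k) {
        unsigned kr = bit_reverse(k, t);
        if (k < kr)                      /* swap each pair only once */
            for (unsigned j = 0; j < n; ++j) {
                float tmp = X[k * n + j];
                X[k * n + j] = X[kr * n + j];
                X[kr * n + j] = tmp;
            }
    }
}
```

For t = 3 the indices 0 to 7 map to 0, 4, 2, 6, 1, 5, 3, 7, matching the vector on the slide. Applying the same permutation to the columns as well gives the full reordering of X.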
  • 67. I recommend this book if you want to understand the mathematics behind the FFT algorithm.
  • 68. A Common 2D FFT Algorithm: $Y = A_t \cdots A_1 (P_n^T X P_n) A_1^T \cdots A_t^T$
        1. Evaluate $\tilde{X} = A_t \cdots A_1 P_n^T X$
        2. Transpose to get $\tilde{X}^T$
        3. Evaluate $Y^T = A_t \cdots A_1 P_n^T \tilde{X}^T$
        4. Transpose $Y^T$ to get $Y$
  • 69. Basis of recursive 2D FFT:
        $F_n \Pi_{b,n} = B_{b,n} (I_{n/b} \otimes F_b)$
        $\Pi_{b,n} = \Pi_n (I_2 \otimes \Pi_{n/2}) (I_4 \otimes \Pi_{n/4}) \cdots (I_{n/(2b)} \otimes \Pi_{2b})$
        $B_{b,n} = B_n (I_2 \otimes B_{n/2}) (I_4 \otimes B_{n/4}) \cdots (I_{n/(2b)} \otimes B_{2b})$
        $\Pi_n$ is a permutation matrix that performs a perfect shuffle index operation, and $\Pi_{b,n}$ performs a partial bit reversal on indices.
  • 70. $H_{b,n} = \Pi_{b,n}^T X \Pi_{b,n}$
        $F_n X F_n = F_n X F_n^T = B_{b,n} \left[ (I_{n/b} \otimes F_b) H_{b,n} (I_{n/b} \otimes F_b) \right] B_{b,n}^T$
        $H_{b,n}$ permutes the columns and rows of X based on a partial bit-reversal of indices. What is in the red box (the bracketed factor)? This is the result of partitioning the matrix into b×b blocks and performing a 2D FFT on each. Denote this by $K_{b,n}$.
  • 71. $K_{b,n}$ = 2D FFT of b×b blocks of partially bit-reversed matrix, X, with $b = 2^s$:
        $F_n X F_n = F_n X F_n^T = A_t \cdots A_{s+1} K_{b,n} A_{s+1}^T \cdots A_t^T$
        1. Evaluate $Y_{b,n} = A_t \cdots A_{s+1} K_{b,n}$
        2. Transpose to get $Y_{b,n}^T$
        3. Evaluate $\tilde{X}^T = A_t \cdots A_{s+1} Y_{b,n}^T$
        4. Transpose $\tilde{X}^T$ to get $\tilde{X} = Y_{b,n} A_{s+1}^T \cdots A_t^T = F_n X F_n^T$
  • 72. Three stages, illustrated for n = 4b:
        1. Evaluate $K_{b,n}$: the FFTs of the b×b blocks
        2. Evaluate $K_{2b,n} = (I_2 \otimes B_{2b}) K_{b,n} (I_2 \otimes B_{2b})^T$: the FFTs of the 2b×2b blocks
        3. Evaluate $B_{4b} K_{2b,n} B_{4b}^T$: the FFT of the whole n×n array
  • 73. Transpose-based 2D FFT

        transposeFFT2D (X,n,b){
            partialBitReversal(X,n,b)
            for (each bxb block, B, of X) fft2D(B,b)   // Do FFT of each block using any algorithm.
            recursiveTransposeFFT(X,n,b)               // Pre-multiply blocks as we move up the recursion tree
            transpose(X,n,b)                           // Transpose X
            recursiveTransposeFFT(X,n,b)
            transpose(X,n,b)
            return
        }

  • 74. Recursive Transpose-Based 2D FFT

        recursiveTransposeFFT (X,n,b){
            if(n>b){                                   // End recursion when n=b. Choose b so matrices fit in cache.
                recursiveTransposeFFT(X00,n/2,b)
                recursiveTransposeFFT(X01,n/2,b)
                recursiveTransposeFFT(X10,n/2,b)
                recursiveTransposeFFT(X11,n/2,b)
                butterflyPre(X,n,b)                    // Pre-multiply nxn block by butterfly matrix, overwriting X.
            }
            return
        }

        Note: includes work at each level of the recursion tree. Note: recursive calls are readily parallelizable.
  • 75. Vector Radix 2D FFT

        vectorRadixFFT2D (X,n,b){
            partialBitReversal(X,n,b)
            recursiveVRFFT(X,n,b)
            return
        }

  • 76. Recursive Vector Radix 2D FFT

        recursiveVRFFT (X,n,b){
            if(n==b){                                  // End recursion when n=b. Choose b so matrices fit in cache.
                fft2D(X,n)
            }
            else{
                recursiveVRFFT(X00,n/2,b)
                recursiveVRFFT(X01,n/2,b)
                recursiveVRFFT(X10,n/2,b)
                recursiveVRFFT(X11,n/2,b)
                butterflyPre(X,n,b)                    // Pre-multiply nxn block by butterfly matrix, overwriting X,
                butterflyPost(X,n,b)                   // and then post-multiply.
            }
            return
        }

        Note: recursive calls are readily parallelizable.
  • 77. All times are relative to time for transpose-based FFT on RM matrix of same size
  • 78. Morton ordering doesn’t improve FFT timings by as much as for matrix multiplication. Computation to data movement ratio is n for matrix multiply, and log(n) for FFT
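The two ratios can be made explicit with standard operation counts. These counts are textbook estimates added for illustration and are not taken from the slides: an n×n matrix multiply performs about 2n³ floating-point operations on 3n² matrix elements, while a 2D FFT of an n×n array performs O(n² log n) operations on n² elements.

```latex
\[
\text{matrix multiply: } \frac{2n^3 \ \text{flops}}{3n^2 \ \text{elements}} = O(n),
\qquad
\text{2D FFT: } \frac{O(n^2 \log n) \ \text{flops}}{n^2 \ \text{elements}} = O(\log n).
\]
```

Since the FFT reuses each element far less often, a layout that improves locality has less arithmetic over which to amortize each memory access.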
  • 79. Morton ordering and related recursive parallel algorithms may work well when hierarchical memory is handled programmatically.
  • 80. Thank you for your attention. Any Questions?