The Effect of Hierarchical Memory on the Design of Parallel Algorithms and the Expression of Parallelism

The Effect of Hierarchical Memory on
the Design of Parallel Applications
and the Expression of Parallelism
David W. Walker
Cardiff School of Computer Science & Informatics

All modern computer systems have
hierarchical memories

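Memory Hierarchy Pyramid
[Figure: a pyramid with access speed increasing towards the top and capacity increasing towards the bottom. Levels from top to bottom: Registers, L1 Cache, L2 Cache, L3 Cache, Main memory, Remote memory, Remoter memory, Even more remote memory.]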

Typical Quad-Core Chip
[Figure: four CPUs, each with its own L1 cache and instruction cache. Each pair of CPUs shares an L2 cache, and all four share memory.]

Efficient access to hierarchical memory is
important in achieving good performance in both
sequential and parallel applications

\[ c_{ij} = \sum_{k=0}^{n-1} a_{ik} b_{kj} \]
void matMul_ikj(float* A, float* B, float* C, float* work, int n)
{
    int i, j, k;
    for (i = 0; i < n; ++i) {
        /* work accumulates row i of C */
        for (j = 0; j < n; ++j) work[j] = 0;
        for (k = 0; k < n; ++k) {
            float a = A[i * n + k];
            /* innermost loop walks row k of B and work with unit stride */
            for (j = 0; j < n; ++j) {
                float b = B[k * n + j];
                work[j] += a * b;
            }
        }
        for (j = 0; j < n; ++j) C[i * n + j] = work[j];
    }
}

In C, which stores arrays in row-major order, matrix
multiply performs best when A and B are accessed by row,
because this improves data locality

Transforming loops is a key technique for
improving performance by changing the order
in which computations occur

This 1989 paper has
won the SC17 “Test of
Time” award
Proceedings of the 1989 ACM/IEEE Conference on Supercomputing
© 1989 ACM 089791-341-8/89/0011/0655

Block algorithms, used in libraries such as LAPACK,
can be seen as a particular type of loop
transformation.

A block algorithm has more flops per memory
reference: O(n) vs. O(1)
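The idea of blocking can be sketched for matrix multiply as follows. This is a minimal illustration, not the LAPACK implementation: the function name matmul_blocked and the tile-size parameter bs are assumptions introduced here, and bs would be chosen so that the tiles in use fit in cache.

```c
#include <stddef.h>

/* C += A*B for n x n row-major matrices, computed tile by tile.
   bs is a hypothetical tile-size parameter; each element loaded into
   cache is reused about bs times before it is evicted. */
void matmul_blocked(const float *A, const float *B, float *C,
                    size_t n, size_t bs)
{
    if (bs == 0) bs = n;                       /* guard against a zero step */
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                /* clip the tile at the matrix edge */
                size_t imax = ii + bs < n ? ii + bs : n;
                size_t kmax = kk + bs < n ? kk + bs : n;
                size_t jmax = jj + bs < n ? jj + bs : n;
                for (size_t i = ii; i < imax; ++i)
                    for (size_t k = kk; k < kmax; ++k) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jmax; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Within each tile the loops keep the row-wise ikj order, so blocking is applied on top of the loop ordering rather than replacing it.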

If efficient access to hierarchical memory
is so important, how is it supported in
programming languages?

Occam: processes communicating via
messages on typed channels

MPI: non-blocking communication allows
data movement to be overlapped with
computation

Overlap in a parallel algorithm
[Figure: a subdomain divided into interior points and boundary points.]
1. Send boundary data to neighbours
2. Update interior points
3. Receive boundary data from neighbours
4. Update boundary points

Multithreading: expose more parallelism at a finer
granularity to increase scope for latency hiding

OpenMP: parallel programming model based
on threads with access to shared memory
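As a small illustration of this model, the sketch below shares the rows of one Jacobi sweep among threads with a single directive. The function name jacobi_sweep and its flat row-major array arguments are assumptions made for this example. Compiled without OpenMP support, the pragma is ignored and the loop runs sequentially.

```c
/* One Jacobi sweep over the interior of an ny x nx row-major grid.
   old and phi are both in shared memory; each thread updates a
   disjoint set of rows of phi, so no synchronization is needed
   inside the sweep. */
void jacobi_sweep(int ny, int nx, const float *old, float *phi)
{
    #pragma omp parallel for
    for (int j = 1; j < ny - 1; ++j)
        for (int i = 1; i < nx - 1; ++i)
            phi[j * nx + i] = 0.25f * (old[j * nx + i - 1] + old[j * nx + i + 1]
                                     + old[(j - 1) * nx + i] + old[(j + 1) * nx + i]);
}
```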

CUDA: used on NVIDIA GPUs. Fine-grained
parallelism, with large numbers of threads running
on thousands of cores

OpenACC: “a single programming model that
will allow you to write a single program that
runs with high performance in parallel across
a wide range of target systems”
Michael Wolfe in OpenACC for Multicore GPUs
https://www.pgroup.com/lit/brochures/openacc_sc15.pdf

Cooperative Parallel Programming
Programmer: indicates opportunities for parallelism, gives hints
Compiler: applies transformations
Runtime: manages threads

PGAS languages: each thread has its own private
memory and also has access to globally shared
memory

PGAS: Local and shared variables
[Figure: Threads 0 to 3, each with its own private memory, and a global shared address space holding an array that is distributed across the threads.]
A thread is said to have an “affinity” for certain elements in an array, which it can access faster than others.

To optimize performance PGAS languages still
require the programmer to reason about data
locality and synchronization

Example: 2D Laplace Problem
[Figure: an NPTSX x NPTSY grid divided into N strips, Strip 0 to Strip (N-1), each NY rows high.]
Solution is held at 0 on the boundary, and 1 at the 4 centre squares.

2D Laplace Problem: MPI solution
[Figure: the NPTSX x NPTSY grid divided among N processes, Process 0 to Process (N-1), each holding NY rows.]
At start of each Jacobi iteration, each process exchanges its first and last rows with the processes above and below.

MPI Program

#include <stdio.h>
#include <mpi.h>
#define NPROCS 4
#define NY 20
#define NPTSX 200
#define NPTSY (NY*NPROCS)
#define NSTEPS 5000
// Routines setup_grid(), exchange_rows(), output_array()
int main (int argc, char *argv[])
{
    float phi[NY+2][NPTSX], oldphi[NY+2][NPTSX];
    int mask[NY+2][NPTSX];
    int i, j, k, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    setup_grid(rank, phi, mask);
    for(k=1;k<=NSTEPS;k++){
        // Copy
        for(j=1;j<NY+1;j++)
            for(i=0;i<NPTSX;i++) oldphi[j][i] = phi[j][i];
        // Exchange rows
        exchange_rows(rank,oldphi);
        // Update (mask is assumed to be 0 at fixed points, including the
        // grid edges, so the i-1 and i+1 accesses stay in range)
        for(j=1;j<NY+1;j++)
            for(i=0;i<NPTSX;i++) {
                if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] +
                    oldphi[j][i+1] + oldphi[j-1][i] + oldphi[j+1][i]);
            }
    }
    output_array(rank,phi);
    MPI_Finalize();
    return 0;
}

2D Laplace Problem: UPC solution
[Figure: the NPTSX x NPTSY grid divided among the threads, Thread 0 to Thread (THREADS-1), each with affinity for NY rows.]

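#include <stdio.h>
#include <upc.h>
#define NY 20
#define NPTSX 200
#define NPTSY (NY*THREADS)
#define NSTEPS 5000
shared[*] float phi[NPTSY][NPTSX], oldphi[NPTSY][NPTSX];
shared[*] int mask[NPTSY][NPTSX];
// Routines setup_grid(), output_array(), and RGBval()
int main ()
{
    int i, j, k;
    setup_grid();
    upc_barrier;
    // The time-step loop header and the two loop bodies are missing from
    // the slide text; they are assumed here to mirror the MPI program.
    for(k=1;k<=NSTEPS;k++){
        upc_forall(j=0;j<NPTSY;j++;j/NY)
            for(i=0;i<NPTSX;i++) oldphi[j][i] = phi[j][i];
        upc_barrier;
        upc_forall(j=0;j<NPTSY;j++;j/NY)
            for(i=0;i<NPTSX;i++){
                if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] +
                    oldphi[j][i+1] + oldphi[j-1][i] + oldphi[j+1][i]);
            }
        upc_barrier;
    }
    output_array();
}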
UPC Program #1
Can use shared arrays. Note the barriers.
Updating values lying on the upper and lower
boundaries of a thread requires access to data values
with different affinities. These accesses are slow.

Data Sharing Between Threads
[Figure: Threads 0 to THREADS-1, each holding NY rows of NPTSX points.]
To update a value on a thread’s upper or lower boundary requires data from the thread above or below.

Each thread copies its first and last rows into
shared memory at start of time step, and then
reads rows from neighbouring threads from
shared memory.

Coordinating Private and Shared Memory
[Figure: Threads 0 to THREADS-1, each holding NY rows of NPTSX points in private memory. Row 1 and Row NY of each thread are copied into shared memory.]
Shared memory is used as a way of coordinating the sharing of
data between threads. This reduces the explicit barriers to one per
time step, and coalesces data movement between local and remote memory.

Main Program: Array Declarations

#include <stdio.h>
#include <upc.h>
#define NY 20
#define NPTSX 200
#define NPTSY (NY*THREADS)
#define NSTEPS 5000
// Shared array to hold rows 1 and NY of each thread
shared[NPTSX] float ud[2][THREADS*NPTSX];
// Needed for output
shared[*] float finalphi[NPTSY][NPTSX];
// Arrays in private memory
float phi[NY+2][NPTSX], oldphi[NY+2][NPTSX];
int mask[NY+2][NPTSX];
// Routines setup_grid(), output_array(), and RGBval()
int main ()
{
    int i, j, k;
    setup_grid();
    upc_barrier;
    …    // See next slide for update code
    }
    output_array();
}
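Main Program: Update

// Copy rows 1 and NY of phi to shared memory
for(i=0;i<NPTSX;i++){
    ud[0][MYTHREAD*NPTSX+i] = phi[1][i];
    ud[1][MYTHREAD*NPTSX+i] = phi[NY][i];
}
upc_barrier;    // Only one barrier
// Copy into row 0
if (MYTHREAD>0) {
    for(i=0;i<NPTSX;i++)
        phi[0][i] = ud[1][(MYTHREAD-1)*NPTSX+i];
}
// Copy into row NY+1
if (MYTHREAD<THREADS-1) {
    for(i=0;i<NPTSX;i++)
        phi[NY+1][i] = ud[0][(MYTHREAD+1)*NPTSX+i];
}
// Copy phi to oldphi (loop body missing from the slide text; assumed)
for(j=0;j<NY+2;j++)
    for(i=0;i<NPTSX;i++) oldphi[j][i] = phi[j][i];
// Do the update (loop body missing from the slide text; assumed to
// match the MPI program)
for(j=1;j<NY+1;j++)
    for(i=0;i<NPTSX;i++){
        if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] +
            oldphi[j][i+1] + oldphi[j-1][i] + oldphi[j+1][i]);
    }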
That was a straightforward example: regular
communication and good load balance.

Start at the root and visit
every node of the tree
using a depth-first traversal
algorithm.
Note: it’s an implicit tree –
every node contains all the
information needed to
specify its children.

Each thread has a node stack. When it’s
empty a thread will steal work from the
stack of another thread.
[Figure: a stack holding Nodes A to E. After Node A is visited, its children, Nodes X and Y, are placed on the stack with Nodes B to E.]

UPC implementation allows a thread to do push and pop
operations on the top of its stack, and to steal nodes from the
bottom of other threads’ stacks.
[Figure: a stack of nodes divided into a top part and a bottom part.]
Stack top: the thread has affinity for this part of its stack
and can access it using a local pointer.
Stack bottom: other threads can steal nodes from this
part of its stack by accessing it using a
global pointer.
Complications: Need to use locks to synchronise access to
bottom part of stack, and when moving nodes between the
top and bottom parts of the stack.

OpenMP implementation can use the
task directive

Aim to visit each node and process it in some way.

Node x;
int n = make_tree(&x, max_depth);
// Do not adjust number of threads at runtime
omp_set_dynamic(0);
// Create a pool of threads
#pragma omp parallel shared(x) num_threads(nthreads)
{
    // One thread visits root node
    #pragma omp single
    visit_node(&x);
}

void visit_node(Node *x){
    Node *y = x->children;
    // Loop over children of node x
    while(y != NULL){
        // Creates a new task for each child to call visit_node(y)
        #pragma omp task firstprivate(y)
        visit_node(y);
        y = y->next;
    }
    // Wait here until all the child tasks have finished
    #pragma omp taskwait
    process_node(x);
    return;
}
The runtime system schedules the tasks on the threads.

OpenMP tasks work well for parallelizing recursive
problems with dynamic load imbalance
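A self-contained variant of the traversal above is sketched below. The Node layout (children, next, size) and the helper subtree_size are assumptions made for this example; here "processing" a node means recording the size of its subtree, which is only safe after the taskwait because it reads results written by the child tasks. Without OpenMP support the pragmas are ignored and the traversal is sequential.

```c
#include <stddef.h>

/* Hypothetical node layout: first child plus sibling list. */
typedef struct Node {
    struct Node *children;  /* first child, or NULL */
    struct Node *next;      /* next sibling, or NULL */
    long size;              /* subtree size, filled in by visit_node */
} Node;

/* Each child subtree becomes a task. After the taskwait every child's
   size is final, so the parent can combine them. */
void visit_node(Node *x)
{
    for (Node *y = x->children; y != NULL; y = y->next) {
        #pragma omp task firstprivate(y)
        visit_node(y);
    }
    #pragma omp taskwait
    long size = 1;
    for (Node *y = x->children; y != NULL; y = y->next)
        size += y->size;
    x->size = size;
}

/* Create the thread pool; one thread starts the traversal at the root. */
long subtree_size(Node *root)
{
    #pragma omp parallel
    {
        #pragma omp single
        visit_node(root);
    }
    return root->size;
}
```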

SC12: https://doi.org/10.1109/SC.2012.71
SC06: https://doi.org/10.1145/1188455.1188543

“To achieve good performance the
programmer and the programming system
must reason about locality and
independence”

In Sequoia, recursive tasks act as self-
contained units of computation, and
hierarchical memory is represented by a
tree.

Programmer must provide Sequoia with a
task mapping specification that maps
different levels of the memory hierarchy to
different granularities of task.

In addition to changing the order of
arithmetical operations, we can also
change the layout of data in memory

Morton Order
(4 x 4 array; each entry is the position of that element in memory)
 0  1  4  5
 2  3  6  7
 8  9 12 13
10 11 14 15

Morton Order: Recursive Definition

Square 2^n x 2^n Arrays: RM and Morton index
Block size, b = 2^(n-r) (maximum r = n-1)

Worked example: n = 5, r = 2, so b = 8
Consider (i,j) = (18,13)
In binary, i = 10010 and j = 01101
Interlace the top 2 bits of i and j: 1001 → 9
The remaining bits of i (010) followed by those of j (101) give the
row-major offset within the block.
Morton index is: 1001010101 → 597
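The worked example can be reproduced with a few bitwise operations. In this sketch the function name morton_index and its argument order are assumptions; it interleaves the top r bits of i and j to number the block, then uses the row-major offset within the block.

```c
/* Index of element (i,j) of a 2^n x 2^n array stored as b x b blocks,
   b = 2^(n-r): blocks are in Morton order, and elements are in
   row-major (RM) order inside each block. */
unsigned morton_index(unsigned i, unsigned j, unsigned n, unsigned r)
{
    unsigned low_bits = n - r;               /* log2 of the block size */
    unsigned bi = i >> low_bits;             /* block row */
    unsigned bj = j >> low_bits;             /* block column */
    unsigned block = 0;
    for (unsigned bit = 0; bit < r; ++bit) { /* interleave: i bit above j bit */
        block |= ((bj >> bit) & 1u) << (2 * bit);
        block |= ((bi >> bit) & 1u) << (2 * bit + 1);
    }
    unsigned mask = (1u << low_bits) - 1u;
    unsigned within = ((i & mask) << low_bits) | (j & mask);
    return (block << (2 * low_bits)) | within;
}
```

For the example above, morton_index(18, 13, 5, 2) returns 597.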

The unshuffle operation takes a shuffled sequence of items
and unshuffles them:
a_1 b_1 a_2 b_2 … a_n b_n → a_1 a_2 … a_n b_1 b_2 … b_n
where each a_i is a contiguous vector of ℓ_a items,
and each b_i is a contiguous vector of ℓ_b items.
20 July 2017

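Apply Morton Ordering to Matrix A
n is matrix size, b is block size. Both are powers of 2.
p1, p2 and p3 are the offsets of the second, third and fourth quadrants.
mortonOrder (A,n,b){
    if( b < n ){
        p1 = (n*n)/4
        p2 = 2*p1
        p3 = 3*p1
        unshuffle(A,n/2,n/2)        // top half of A
        unshuffle(A+p2,n/2,n/2)     // bottom half of A
        mortonOrder(A,n/2,b)
        mortonOrder(A+p1,n/2,b)
        // last two calls assumed, so that all four quadrants are covered
        mortonOrder(A+p2,n/2,b)
        mortonOrder(A+p3,n/2,b)
    }
}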
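A runnable sketch of this conversion is given below. It makes two assumptions that are not in the slides: unshuffle takes an explicit count of (a, b) pairs and uses a scratch copy rather than working in place, and all four quadrants are visited recursively. The names morton_order and unshuffle follow the pseudocode, but the signatures are illustrative.

```c
#include <stdlib.h>
#include <string.h>

/* a1 b1 a2 b2 ... -> a1 a2 ... b1 b2 ..., where each a has la items and
   each b has lb items. Uses a scratch copy for simplicity. */
static int unshuffle(float *x, size_t pairs, size_t la, size_t lb)
{
    size_t total = pairs * (la + lb);
    if (total == 0) return 0;
    float *tmp = malloc(total * sizeof *tmp);
    if (tmp == NULL) return -1;
    memcpy(tmp, x, total * sizeof *tmp);
    for (size_t p = 0; p < pairs; ++p) {
        memcpy(x + p * la, tmp + p * (la + lb), la * sizeof *x);
        memcpy(x + pairs * la + p * lb, tmp + p * (la + lb) + la,
               lb * sizeof *x);
    }
    free(tmp);
    return 0;
}

/* Convert an n x n row-major matrix to Morton order with b x b blocks.
   n and b are powers of 2. Returns 0 on success, -1 if allocation fails. */
int morton_order(float *A, size_t n, size_t b)
{
    if (b >= n) return 0;
    size_t h = n / 2, p1 = h * h;            /* p1 = (n*n)/4 */
    /* Each half holds h rows; split every row into left and right parts. */
    if (unshuffle(A, h, h, h) != 0) return -1;
    if (unshuffle(A + 2 * p1, h, h, h) != 0) return -1;
    for (size_t q = 0; q < 4; ++q)           /* quadrants at 0, p1, p2, p3 */
        if (morton_order(A + q * p1, h, b) != 0) return -1;
    return 0;
}
```

An in-place unshuffle would avoid the scratch buffer at the cost of a more involved permutation; the copy-based version is used here only to keep the sketch short.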

Possible use of Morton or SFC ordering would be
in a library – optionally convert between matrix
layouts on entry to, and exit from, the library.

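Recursive Matrix Multiply
A is partitioned into quadrants A00, A01, A10, A11 (B and C likewise).
mm_Recursive (A,B,C,n,b){ // C = C + AB
    if(n==b){
        // End of recursion. Choose b so matrices fit in cache.
        matmul(A,B,C,n)
    }
    else{
        mm_Recursive(A00,B00,C00,n/2,b)
        mm_Recursive(A00,B01,C01,n/2,b)
        // the six remaining block products (assumed; not in the slide text)
        mm_Recursive(A01,B10,C00,n/2,b)
        mm_Recursive(A01,B11,C01,n/2,b)
        mm_Recursive(A10,B00,C10,n/2,b)
        mm_Recursive(A10,B01,C11,n/2,b)
        mm_Recursive(A11,B10,C10,n/2,b)
        mm_Recursive(A11,B11,C11,n/2,b)
    }
    return
}
Note: all the computational work happens in the leaves of the recursion tree.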
Platform 1: MacBook Pro, Intel i7, 4 cores, 256KB L2 cache/core, 6MB L3 cache
Platform 2: two Xeon E5-2620, 6 cores each, 15MB L3 cache
gcc compiler used with -O3 flag set

Cholesky Factorization
\[ A = L L^T \]
\[ A_{00} = L_{00} L_{00}^T \]
\[ A_{10} = L_{10} L_{00}^T \]
\[ A_{11} = L_{10} L_{10}^T + L_{11} L_{11}^T \]
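Rearranging the block equations gives the order in which the blocks of L can be computed: factor the leading block, solve a triangular system for the block below it, then update and factor the trailing block. The hat on A_11 and the name chol are notation introduced here for the derivation.

```latex
\begin{aligned}
L_{00} &= \operatorname{chol}(A_{00}) \\
L_{10} &= A_{10} L_{00}^{-T} \\
\hat{A}_{11} &= A_{11} - L_{10} L_{10}^T = L_{11} L_{11}^T
  \quad\Rightarrow\quad L_{11} = \operatorname{chol}(\hat{A}_{11})
\end{aligned}
```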

Tail Recursive Cholesky
A is partitioned into blocks A00 (top left), A10 (bottom left) and A11 (bottom right).
choleskyTailRecursive (A,n,b){ // A = L L^T
    if(n==b){
        cholesky(A,b)
    }
    else{
        cholesky(A00,b)
        triangularSolve(A10,A00,n-b,b)
        symmetricRankUpdate(A11,A10,n-b,b)
        choleskyTailRecursive(A11,n-b,b)
    }
    return
}
Note: computational work happens at all levels of the recursion tree.

Binary Recursive Cholesky
A is partitioned into blocks A00 (top left), A10 (bottom left) and A11 (bottom right).
choleskyBinaryRecursive (A,n,b){ // A = L L^T
    if(n==b){
        cholesky(A,b)
    }
    else{
        choleskyBinaryRecursive(A00,n/2,b)
        triangularSolve(A10,A00,n/2,n/2)
        symmetricRankUpdate(A11,A10,n/2,n/2)
        choleskyBinaryRecursive(A11,n/2,b)
    }
    return
}
Note: the 4 operations at the inner nodes of the recursion tree have to
be done in order, so the recursive calls cannot be done in parallel.

Blocked RM order: standard algorithm based on rectangular blocks.
Tiled RM order: all operations are expressed in terms of operations involving square tiles, but matrices are stored in RM order.
Tiled Morton order: as above, but matrices are stored in Morton order.
All times are relative to time for single call to DPOTRF.

Morton order algorithms require Morton
index computations. There are a number
of ways to do these (bitwise operations,
lookup tables) but the method used does
not impact performance much.

These plots show results for the binary
recursive algorithm on Platform 1. Similar
results were obtained on Platform 2.

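The Fourier transform of an n x n array, X, can be
expressed as:
\[ Y = F_n X F_n \]
where element (p,q) of matrix F_n is \omega_n^{pq} and
\[ \omega_n = \exp(-2\pi i / n) \]
\[ F_4 = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega & \omega^2 & \omega^3 \\ 1 & \omega^2 & \omega^4 & \omega^6 \\ 1 & \omega^3 & \omega^6 & \omega^9 \end{pmatrix} \]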

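2D Fast Fourier Transform
\[ Y = F_n X F_n = F_n X F_n^T = A_t \cdots A_1 (P_n^T X P_n) A_1^T \cdots A_t^T \]
where t = log2(n) and P_n^T is a permutation matrix such that P_n^T X
exchanges row k of X with row k’, where k’ is the t bits of k in
reverse order. For n = 8:
\[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}^T
\begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 4 \\ 2 \\ 6 \\ 1 \\ 5 \\ 3 \\ 7 \end{pmatrix}
\]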
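The row permutation can be sketched in a few lines of C. The function names bit_reverse and bit_reverse_rows are assumptions made for this example; the second routine applies the permutation to the rows of a row-major n x n array with n = 2^t.

```c
/* Reverse the low t bits of k, e.g. t = 3: 001 -> 100, 011 -> 110. */
unsigned bit_reverse(unsigned k, unsigned t)
{
    unsigned r = 0;
    for (unsigned bit = 0; bit < t; ++bit) {
        r = (r << 1) | (k & 1u);
        k >>= 1;
    }
    return r;
}

/* Exchange row k with row k' = bit_reverse(k, t) in a row-major
   n x n array, n = 2^t. Each pair is swapped once (when k < k'). */
void bit_reverse_rows(float *X, unsigned t)
{
    unsigned n = 1u << t;
    for (unsigned k = 0; k < n; ++k) {
        unsigned kr = bit_reverse(k, t);
        if (k < kr)
            for (unsigned j = 0; j < n; ++j) {
                float tmp = X[k * n + j];
                X[k * n + j] = X[kr * n + j];
                X[kr * n + j] = tmp;
            }
    }
}
```

Because reversing the bits twice restores the original index, the permutation is its own inverse, which is why a single pass of pairwise swaps is enough.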

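”Butterfly” matrix
\[ A_q = \begin{pmatrix} B_L & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & B_L \end{pmatrix} \]
where L = 2^q, r = n/L, L* = L/2, and A_q has r diagonal blocks of B_L.
Kronecker matrix product:
\[ A \otimes B = \begin{pmatrix} a_{00} B & \cdots & a_{0,n-1} B \\ \vdots & \ddots & \vdots \\ a_{m-1,0} B & \cdots & a_{m-1,n-1} B \end{pmatrix} \]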
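The Kronecker product is what links the two definitions: a block-diagonal matrix with r copies of B on the diagonal is I_r ⊗ B. A direct, unoptimized sketch in C is shown below; the function name kron and the row-major argument layout are assumptions made for illustration.

```c
#include <stddef.h>

/* C = A (x) B, where A is m x n and B is p x q, all row-major.
   C must have room for (m*p) x (n*q) entries. */
void kron(const float *A, size_t m, size_t n,
          const float *B, size_t p, size_t q, float *C)
{
    size_t ldc = n * q;                      /* row length of C */
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t r = 0; r < p; ++r)
                for (size_t s = 0; s < q; ++s)
                    C[(i * p + r) * ldc + (j * q + s)] =
                        A[i * n + j] * B[r * q + s];
}
```

In practice such a matrix is never formed explicitly; the block structure is used to apply B to each group of rows in turn.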

I recommend this book if
you want to understand
the mathematics behind
the FFT algorithm.

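A Common 2D FFT Algorithm
\[ Y = A_t \cdots A_1 (P_n^T X P_n) A_1^T \cdots A_t^T \]
1. Evaluate \( \tilde{X} = A_t \cdots A_1 P_n^T X \)
2. Transpose to get \( \tilde{X}^T \)
3. Evaluate \( Y^T = A_t \cdots A_1 P_n^T \tilde{X}^T \)
4. Transpose \( Y^T \) to get \( Y \)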

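Basis of recursive 2D FFT
\[ F_n \Pi_{b,n} = B_{b,n} (I_{n/b} \otimes F_b) \]
\[ \Pi_{b,n} = \Pi_n (I_2 \otimes \Pi_{n/2}) (I_4 \otimes \Pi_{n/4}) \cdots (I_{n/(2b)} \otimes \Pi_{2b}) \]
\[ B_{b,n} = B_n (I_2 \otimes B_{n/2}) (I_4 \otimes B_{n/4}) \cdots (I_{n/(2b)} \otimes B_{2b}) \]
\Pi_n is a permutation matrix that performs a perfect
shuffle index operation, and \Pi_{b,n} performs a partial
bit reversal on indices.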

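\[ F_n X F_n = F_n X F_n^T = B_{b,n} (I_{n/b} \otimes F_b) H_{b,n} (I_{n/b} \otimes F_b) B_{b,n}^T \]
\[ H_{b,n} = \Pi_{b,n}^T X \Pi_{b,n} \]
H_{b,n} permutes the columns and rows of X
based on a partial bit-reversal of indices.
What is in the red box, that is, the middle factor (I_{n/b} \otimes F_b) H_{b,n} (I_{n/b} \otimes F_b)?
This is the result of partitioning the
matrix into b x b blocks and performing a
2D FFT on each.
Denote this by K_{b,n}.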

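K_{b,n} = 2D FFT of b x b blocks of
partially bit-reversed matrix, X
\[ F_n X F_n = F_n X F_n^T = A_t \cdots A_{s+1} K_{b,n} A_{s+1}^T \cdots A_t^T, \qquad b = 2^s \]
1. Evaluate \( Y_{b,n} = A_t \cdots A_{s+1} K_{b,n} \)
2. Transpose to get \( Y_{b,n}^T \)
3. Evaluate \( \hat{X}^T = A_t \cdots A_{s+1} Y_{b,n}^T \)
4. Transpose \( \hat{X}^T \) to get \( \hat{X} = Y_{b,n} A_{s+1}^T \cdots A_t^T = F_n X F_n^T \)
(The hat distinguishes the transformed array from the input X.)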

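Example with n = 4b:
1. Evaluate \( K_{b,n} \): the FFTs of the b x b blocks.
2. Evaluate \( K_{2b,n} = (I_2 \otimes B_{2b}) K_{b,n} (I_2 \otimes B_{2b})^T \): the FFTs of the 2b x 2b blocks.
3. Evaluate \( B_{4b} K_{2b,n} B_{4b}^T \): the FFT of the whole n x n array.
[Figures: the n x n array partitioned into b x b blocks, then into 2b x 2b blocks.]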

Transpose-based 2D FFT
transposeFFT2D (X,n,b){
    partialBitReversal(X,n,b)
    // Do FFT of each block using any algorithm.
    for (each bxb block, B, of X)
        fft2D(B,b)
    // Pre-multiply blocks as we move up the recursion tree
    recursiveTransposeFFT(X,n,b)
    // Transpose X
    transpose(X,n,b)
    recursiveTransposeFFT(X,n,b)
    transpose(X,n,b)
    return
}

Recursive Transpose-Based 2D FFT
recursiveTransposeFFT (X,n,b){
    // End recursion when n=b. Choose b so matrices fit in cache.
    if(n>b){
        recursiveTransposeFFT(X00,n/2,b)
        // calls for the other three quadrants (assumed; not in the slide text)
        recursiveTransposeFFT(X01,n/2,b)
        recursiveTransposeFFT(X10,n/2,b)
        recursiveTransposeFFT(X11,n/2,b)
        // Pre-multiply nxn block by butterfly matrix, overwriting X.
        butterflyPre(X,n,b)
    }
    return
}
Note: includes work at each level of the recursion tree.
Note: recursive calls are readily parallelizable.

Vector Radix 2D FFT
vectorRadixFFT2D (X,n,b){
    partialBitReversal(X,n,b)
    recursiveVRFFT(X,n,b)
    return
}

Recursive Vector Radix 2D FFT
recursiveVRFFT (X,n,b){
    // End recursion when n=b. Choose b so matrices fit in cache.
    if(n==b){
        fft2D(X,n)
    }
    else{
        recursiveVRFFT(X00,n/2,b)
        // calls for the other three quadrants (assumed; not in the slide text)
        recursiveVRFFT(X01,n/2,b)
        recursiveVRFFT(X10,n/2,b)
        recursiveVRFFT(X11,n/2,b)
        // Pre-multiply nxn block by butterfly matrix, overwriting X,
        // and then post-multiply.
        butterflyPre(X,n,b)
        butterflyPost(X,n,b)
    }
    return
}
Note: recursive calls are readily parallelizable.

All times are relative to
time for transpose-
based FFT on RM matrix
of same size

Morton ordering doesn’t improve FFT timings by as
much as for matrix multiplication. Computation to
data movement ratio is n for matrix multiply, and
log(n) for FFT

Morton ordering and related recursive parallel
algorithms may work well when hierarchical
memory is handled programmatically.

Thank you for your attention.
Any Questions?
