A FRAMEWORK FOR 
PRACTICAL FAST MATRIX 
MULTIPLICATION 
1 
[Figure: Parallel performance of Strassen on <N,N,N> (effective GFLOPS / core vs. dimension N) for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]
arXiv: 1409.2908 
Austin Benson (arbenson@stanford.edu), ICME, Stanford 
Grey Ballard, Sandia National Laboratories 
BLIS Retreat, September 26, 2014
Fast matrix multiplication: 
bridging theory and practice 
2 
• There are a number of Strassen-like algorithms for matrix 
multiplication that have only been “discovered” recently. 
[Smirnov13], [Benson&Ballard14] 
• We show that they can achieve higher performance than MKL (sequentially, and sometimes in parallel). 
• We use code generation to do extensive prototyping. There 
are several practical issues, and there is plenty of room for 
improvement (lots of expertise at UT to help here!) 
[Figure: the exponent of matrix multiplication, between 2 and 3: 2.81 [Strassen69] and 2.37 [Williams12].]
Strassen’s algorithm 
3
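For reference, Strassen's <2, 2, 2> algorithm written out (the same Sr and Tr appear in the speaker notes at the end): seven products of linear combinations of the blocks, and C assembled from them.
\[
\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} =
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \cdot
\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
\]
\begin{align*}
M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}) & C_{11} &= M_1 + M_4 - M_5 + M_7 \\
M_2 &= (A_{21} + A_{22})\,B_{11}          & C_{12} &= M_3 + M_5 \\
M_3 &= A_{11}\,(B_{12} - B_{22})          & C_{21} &= M_2 + M_4 \\
M_4 &= A_{22}\,(B_{21} - B_{11})          & C_{22} &= M_1 - M_2 + M_3 + M_6 \\
M_5 &= (A_{11} + A_{12})\,B_{22} \\
M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &= (A_{12} - A_{22})(B_{21} + B_{22})
\end{align*}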
4 
Key ingredients of Strassen’s algorithm 
• 1. Block partitioning of matrices (<2, 2, 2>) 
• 2. Seven linear combinations of sub-blocks of A 
• 3. Seven linear combinations of sub-blocks of B 
• 4. Seven matrix multiplies to form Mr (recursive) 
• 5. Linear combinations of Mr to form Cij
Key ingredients of fast matmul algorithms 
• 1. Block partitioning of matrices (<M, K, N>) 
• 2. R linear combinations of sub-blocks of A 
• 3. R linear combinations of sub-blocks of B 
• 4. R matrix multiplies to form Mr (recursive) 
R < MKN → faster than classical 
• 5. Linear combinations of Mr to form Cij 
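One standard way to quantify "faster than classical" (a summary; the exponents match the table in the speaker notes): if one recursive step replaces the MKN classical block multiplies with R, then
\[
\text{speedup per step} = \frac{MKN}{R},
\qquad
\omega_0 = 3\,\frac{\log R}{\log (MKN)} .
\]
For Strassen's <2, 2, 2> algorithm, R = 7 gives 8/7 ≈ 14% per step and an exponent of 3 log 7 / log 8 ≈ 2.81.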
5
“Outer product” fast algorithm 
• <4, 2, 4> partitioning 
• R = 26 multiplies (< 4 * 2 * 4 = 32) 
 23% speedup per recursive step (if everything else free) 
• Linear combinations of Aij to form Sr: 68 terms 
• Linear combinations of Bij to form Tr: 52 terms 
• Linear combinations of Mr to form Cij: 69 terms 
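A quick arithmetic check against the formulas above (not on the original slide):
\[
\frac{MKN}{R} = \frac{32}{26} \approx 1.23,
\qquad
\omega_0 = 3\,\frac{\log 26}{\log 32} \approx 2.82 .
\]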
6
Discovering fast algorithms is a 
numerical challenge 
7 
• Low-rank tensor decompositions lead to fast algorithms 
• Tensors are small, but we need exact decompositions 
→ NP-hard 
• Use alternating least squares with regularization and 
rounding tricks [Smirnov13], [Benson&Ballard14] 
• We have around 10 fast algorithms for <M, K, N> 
decompositions. Also have permutations, e.g., <K, M, N>.
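The connection being used (standard, sketched here for completeness): the <M, K, N> matrix multiplication map is a 3-way tensor, and an exact rank-R decomposition of that tensor is the same thing as an R-multiplication algorithm:
\[
T_{\langle M,K,N \rangle} = \sum_{r=1}^{R} u_r \otimes v_r \otimes w_r
\quad\Longleftrightarrow\quad
S_r = \sum_{i,j} (u_r)_{ij} A_{ij},\;\;
T_r = \sum_{j,l} (v_r)_{jl} B_{jl},\;\;
M_r = S_r T_r,\;\;
C_{il} = \sum_{r=1}^{R} (w_r)_{il} M_r .
\]
Strassen's algorithm is exactly a rank-7 decomposition of the <2, 2, 2> tensor.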
8 
[Figure: example fast algorithm base cases and their multiplication counts, from [Strassen69] and [Smirnov13]; see the table in the speaker notes at the end.]
Code generation lets us prototype 
algorithms quickly 
9 
• We have a compact representation of many fast algorithms: 
1. dimensions of block partitioning (<M, K, N>) 
2. linear combinations of sub-blocks (Sr, Tr) 
3. linear combinations of Mr to form Cij 
• We use code generation to rapidly prototype fast algorithms 
• Our approach: test all algorithms on a bunch of different 
problem sizes and look for patterns
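As a rough illustration of the kind of routine the generator emits, here is a minimal hand-written sketch of one recursive step for the <2, 2, 2> case (assumptions: square row-major matrices with n divisible by 2 at every level, cutoff >= 1, C zero-initialized by the caller, and a naive triple loop standing in for dgemm at the base case; the real generated code handles general shapes and calls MKL):

#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // row-major, n x n, contiguous

// Stand-in for dgemm below the recursion cutoff: C += A * B.
static void classical_mm(const Mat& A, const Mat& B, Mat& C, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < n; ++k)
      for (std::size_t j = 0; j < n; ++j)
        C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Copy block (bi, bj) of the 2 x 2 partition of X (size n) into an h x h matrix.
static Mat block(const Mat& X, std::size_t n, int bi, int bj) {
  std::size_t h = n / 2;
  Mat Y(h * h);
  for (std::size_t i = 0; i < h; ++i)
    for (std::size_t j = 0; j < h; ++j)
      Y[i * h + j] = X[(bi * h + i) * n + (bj * h + j)];
  return Y;
}

// Z = X + alpha * Y (the "linear combinations of sub-blocks").
static Mat add(const Mat& X, const Mat& Y, double alpha) {
  Mat Z(X.size());
  for (std::size_t i = 0; i < X.size(); ++i) Z[i] = X[i] + alpha * Y[i];
  return Z;
}

// One recursive step of Strassen's <2,2,2> algorithm: C += A * B.
void strassen(const Mat& A, const Mat& B, Mat& C, std::size_t n, std::size_t cutoff) {
  if (n <= cutoff) { classical_mm(A, B, C, n); return; }
  const std::size_t h = n / 2;
  Mat A11 = block(A, n, 0, 0), A12 = block(A, n, 0, 1),
      A21 = block(A, n, 1, 0), A22 = block(A, n, 1, 1);
  Mat B11 = block(B, n, 0, 0), B12 = block(B, n, 0, 1),
      B21 = block(B, n, 1, 0), B22 = block(B, n, 1, 1);
  // Seven S_r and T_r (Strassen's linear combinations), then M_r = S_r * T_r.
  Mat S[7] = {add(A11, A22, 1), add(A21, A22, 1), A11, A22,
              add(A11, A12, 1), add(A21, A11, -1), add(A12, A22, -1)};
  Mat T[7] = {add(B11, B22, 1), B11, add(B12, B22, -1), add(B21, B11, -1),
              B22, add(B11, B12, 1), add(B21, B22, 1)};
  Mat M[7];
  for (int r = 0; r < 7; ++r) {
    M[r] = Mat(h * h, 0.0);
    strassen(S[r], T[r], M[r], h, cutoff);
  }
  // Accumulate block (bi, bj) of C: C_{bi,bj} += Z.
  auto put = [&](int bi, int bj, const Mat& Z) {
    for (std::size_t i = 0; i < h; ++i)
      for (std::size_t j = 0; j < h; ++j)
        C[(bi * h + i) * n + (bj * h + j)] += Z[i * h + j];
  };
  put(0, 0, add(add(M[0], M[3], 1), add(M[6], M[4], -1), 1));  // C11 = M1+M4-M5+M7
  put(0, 1, add(M[2], M[4], 1));                               // C12 = M3+M5
  put(1, 0, add(M[1], M[3], 1));                               // C21 = M2+M4
  put(1, 1, add(add(M[0], M[1], -1), add(M[2], M[5], 1), 1));  // C22 = M1-M2+M3+M6
}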
Practical issues 
10 
• Best way to do matrix additions? (in paper) 
• Can we eliminate redundant linear combinations? (in paper) 
• Different problem shapes other than square (this talk) 
• When to stop recursion? (this talk) 
• How to parallelize? (this talk) 
Recursion cutoff: look at gemm curve 
[Figure: sequential dgemm performance (GFLOPS vs. dimension N) for N x 800 x 800, N x 800 x N, and N x N x N, with peak shown; and parallel dgemm performance on 24 cores (GFLOPS / core vs. dimension N).]
Basic idea: take another 
recursive step if the sub-problems 
will still operate at 
high performance 
11 
<M, K, N> = <4, 2, 3>
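A minimal sketch of this cutoff test (the threshold below is hypothetical; in practice it is read off the dgemm curves above):

// Recurse only while every sub-multiply is still big enough that dgemm
// (or the next recursive level) runs near its peak rate.
bool take_recursive_step(long m, long k, long n, long M, long K, long N) {
  const long min_dim = 1000;  // hypothetical threshold read off the gemm curve
  return (m / M) >= min_dim && (k / K) >= min_dim && (n / N) >= min_dim;
}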
Sequential performance 
[Figure: Sequential performance on N x N x N (effective GFLOPS vs. dimension N) for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, and <2,3,4>; true peak marked for reference.]
Effective GFLOPS for M x K x N multiplies = 1e-9 * 2 * MKN / time in seconds 
(the classical flop count 2MKN is used for every algorithm, so fast algorithms can exceed the machine's true peak) 
12 
Sequential performance 
[Figure: Sequential performance on N x N x N (effective GFLOPS vs. dimension N) for MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, and <6,3,3>.]
• All algorithms beat MKL on large problems 
• Strassen’s algorithm is hard to beat 
13
Sequential performance 
[Figure: Sequential performance on N x 1600 x N (effective GFLOPS vs. dimension N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, and STRASSEN.]
• Almost all algorithms beat MKL 
• <4, 2, 4> and <3, 2, 3> tend to perform the best 
14
Sequential performance 
[Figure: Sequential performance on N x 2400 x 2400 (effective GFLOPS vs. dimension N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, and STRASSEN.]
• Almost all algorithms beat MKL 
• <4, 3, 3> and <4, 2, 3> tend to perform the best 
15 
Parallelization 
[Diagram: the recursion tree. C is assembled from M1, M2, …, M7, and each Mr recursively spawns its own seven subproblems.]
16
DFS Parallelization 
[Diagram: recursion tree; every node is executed in sequence by all threads, using parallel MKL for each Mr at the base case.]
17 
+ Easy to implement 
+ Load balanced 
+ Same memory footprint as sequential 
- Need large base cases for high performance
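A minimal sketch of the DFS strategy (cblas_dgemm is the real CBLAS call; fast_mm_dfs and its elided recursion are hypothetical stand-ins for the generated code):

#include <mkl.h>

// DFS: the recursion itself is sequential; all parallelism comes from
// threaded MKL in the base-case multiply.
void fast_mm_dfs(const double* A, const double* B, double* C,
                 long m, long k, long n, int depth) {
  if (depth == 0) {
    // One large dgemm using all available MKL threads: C += A * B (row-major).
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 1.0, C, n);
    return;
  }
  // ... form the S_r and T_r with matrix additions, then recurse
  //     sequentially on each of the R products, and assemble C ...
}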
BFS Parallelization 
[Diagram: recursion tree; the seven Mr subproblems at each level are spawned as OpenMP tasks, one thread each, and joined with omp taskwait before C is assembled.]
18 
+ High performance for smaller base cases 
- Sometimes harder to load balance: e.g., 24 threads, 49 subproblems (two recursive levels of Strassen) 
- More memory
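A minimal sketch of the BFS strategy with OpenMP tasks (the Subproblem struct and multiply_base_case are hypothetical stand-ins; only the pragmas reflect the mechanism named on the slide):

#include <omp.h>
#include <vector>
#include <cstddef>

// One leaf multiply of the recursion; a naive loop stands in for sequential dgemm.
struct Subproblem { const double* S; const double* T; double* M; long m, k, n; };

void multiply_base_case(const Subproblem& p) {
  for (long i = 0; i < p.m; ++i)
    for (long kk = 0; kk < p.k; ++kk)
      for (long j = 0; j < p.n; ++j)
        p.M[i * p.n + j] += p.S[i * p.k + kk] * p.T[kk * p.n + j];
}

// BFS: each subproblem at this level becomes a task (one thread each);
// taskwait joins them before the linear combinations that form C.
void run_level_bfs(std::vector<Subproblem>& subs) {
  #pragma omp parallel
  #pragma omp single
  {
    for (std::size_t r = 0; r < subs.size(); ++r) {
      #pragma omp task firstprivate(r) shared(subs)
      multiply_base_case(subs[r]);
    }
    #pragma omp taskwait
  }
}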
HYBRID parallelization 
[Diagram: recursion tree; most Mr subproblems run as single-threaded OpenMP tasks (as in BFS), while the leftover subproblem runs with all threads (as in DFS).]
19 
+ Better load balancing 
- Needs explicit synchronization, or else we can over-subscribe threads
20 
[Figure: Parallel performance of <4,2,4> on <N,2800,N> (effective GFLOPS / core vs. dimension N) for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]
Bandwidth problems 
• We rely on the matrix multiplications being much more expensive than the matrix additions 
• Parallel dgemm on 24 cores: easily get 50-75% of peak 
• STREAM benchmark: < 6x speedup in read/write 
performance on 24 cores 
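A back-of-the-envelope view of why the additions become the bottleneck (simple arithmetic, consistent with the STREAM numbers above):
\[
\text{multiply: } \frac{2mkn \text{ flops}}{(mk + kn + mn) \text{ words}} = O(n) \text{ flops/word}
\qquad\text{vs.}\qquad
\text{add: } \frac{mn \text{ flops}}{3mn \text{ words}} = O(1) \text{ flops/word},
\]
so the multiplies scale with the cores while the additions are limited by memory bandwidth.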
21
Parallel performance 
22 
[Figure: Performance on N x N x N with 6 cores and with 24 cores (effective GFLOPS / core vs. dimension N) for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, and <2,3,4>.]
• 6 cores: similar performance to sequential 
• 24 cores: can sometimes beat MKL, but barely
Parallel performance 
[Figure: Performance on N x 2800 x N with 6 cores and with 24 cores (effective GFLOPS / core vs. dimension N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, and STRASSEN; an annotation marks a region of bad MKL performance.]
• 6 cores: similar performance to sequential 
• 24 cores: MKL best for large problems 
23
Parallel performance 
[Figure: Performance on N x 3000 x 3000 with 6 cores and with 24 cores (effective GFLOPS / core vs. dimension N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, and STRASSEN.]
• 6 cores: similar performance to sequential 
• 24 cores: MKL usually the best 
24 
High-level conclusions 
25 
• For square matrix multiplication, Strassen’s algorithm is 
hard to beat 
• For rectangular matrix multiplication, use a fast algorithm 
that “matches the shape” 
• Bandwidth limits the performance of shared memory 
parallel fast matrix multiplication 
→ should be less of an issue in distributed memory 
Future work: 
• Numerical stability 
• Using fast matmul as a kernel for other algorithms in 
numerical linear algebra
A FRAMEWORK FOR 
PRACTICAL FAST MATRIX 
MULTIPLICATION 
26 
[Figure: Parallel performance of Strassen on <N,N,N> (effective GFLOPS / core vs. dimension N) for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]
arXiv: 1409.2908 
Austin Benson (arbenson@stanford.edu), ICME, Stanford 
Grey Ballard, Sandia National Laboratories 
BLIS Retreat, September 26, 2014
Matrix additions (linear combinations) 
[Diagram: forming S1, …, S7 from A11, A12, A21, A22. "Pairwise": each Sr is built up two operands at a time with repeated DAXPY calls (2x DAXPY per combination).]
27
Matrix additions (linear combinations) 
[Diagram: forming S1, …, S7 from A11, A12, A21, A22. "Write once": each Sr is computed in a single pass by a custom fused "DAXPY" over all of its operands.]
28
Matrix additions (linear combinations) 
[Diagram: forming S1, …, S7 from A11, A12, A21, A22. "Streaming": a single pass over the Aij performs entry-wise updates to all of the Sr at once.]
29
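A minimal sketch contrasting the first two variants for one combination S = alpha1*A1 + alpha2*A2 (cblas_daxpy is the real CBLAS call; everything else is a hypothetical stand-in for the generated addition code):

#include <mkl.h>
#include <cstring>

// "Pairwise": build S with repeated DAXPY passes over the data (2x DAXPY here).
void form_s_pairwise(const double* A1, const double* A2, double* S,
                     long nelem, double alpha1, double alpha2) {
  std::memset(S, 0, nelem * sizeof(double));
  cblas_daxpy(nelem, alpha1, A1, 1, S, 1);  // S += alpha1 * A1
  cblas_daxpy(nelem, alpha2, A2, 1, S, 1);  // S += alpha2 * A2
}

// "Write once": a custom fused "DAXPY" writes each entry of S exactly once.
void form_s_write_once(const double* A1, const double* A2, double* S,
                       long nelem, double alpha1, double alpha2) {
  for (long i = 0; i < nelem; ++i)
    S[i] = alpha1 * A1[i] + alpha2 * A2[i];
}

The "streaming" variant would go one step further and form all of the Sr in a single pass over the Aij.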
Common subexpression elimination (CSE) 
• Example in the <4, 2, 4> algorithm (R = 26 multiplies): 
T11 = B24 - (B12 + B22) 
T25 = B23 + B12 + B22 
Four additions, six reads, two writes 
30
Common subexpression elimination (CSE) 
• Example in the <4, 2, 4> algorithm (R = 26 multiplies): 
Y = B12 + B22 
T11 = B24 - Y 
T25 = B23 + Y 
Three additions, six reads, three writes 
→ Net increase in communication! 
31
CSE does not really help 
Effective GFLOPS for M x K x N multiplies 
= 1e-9 * 2 * MKN / time in seconds 
32

Editor's Notes

  • #4
    \[
    \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} =
    \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \cdot
    \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
    \]
    \begin{align*}
    S_1 &= A_{11} + A_{22} & T_1 &= B_{11} + B_{22} \\
    S_2 &= A_{21} + A_{22} & T_2 &= B_{11} \\
    S_3 &= A_{11}          & T_3 &= B_{12} - B_{22} \\
    S_4 &= A_{22}          & T_4 &= B_{21} - B_{11} \\
    S_5 &= A_{11} + A_{12} & T_5 &= B_{22} \\
    S_6 &= A_{21} - A_{11} & T_6 &= B_{11} + B_{12} \\
    S_7 &= A_{12} - A_{22} & T_7 &= B_{21} + B_{22}
    \end{align*}
  • #9
    \begin{tabular}{l c c c c}
    Base case & Multiplies (fast) & Multiplies (classical) & Speedup per recursive step & Exponent \\
    $\langle 2,2,3 \rangle$ & 11 & 12 & 9\%  & 2.89 \\
    $\langle 2,2,5 \rangle$ & 18 & 20 & 11\% & 2.89 \\
    $\langle 2,2,2 \rangle$ & 7  & 8  & 14\% & 2.81 \\
    $\langle 2,2,4 \rangle$ & 14 & 16 & 14\% & 2.85 \\
    $\langle 3,3,3 \rangle$ & 23 & 26 & 17\% & 2.85 \\
    $\langle 2,3,3 \rangle$ & 15 & 18 & 20\% & 2.81 \\
    $\langle 2,3,4 \rangle$ & 20 & 24 & 20\% & 2.83 \\
    $\langle 2,4,4 \rangle$ & 26 & 32 & 23\% & 2.82 \\
    $\langle 3,3,4 \rangle$ & 29 & 36 & 24\% & 2.82 \\
    $\langle 3,4,4 \rangle$ & 38 & 48 & 26\% & 2.82 \\
    $\langle 3,3,6 \rangle$ & 40 & 54 & 35\% & 2.77 \\
    \end{tabular}
  • #17–#19
    \begin{align*}
    S_7 &= A_{12} - A_{22} \\
    T_7 &= B_{21} + B_{22} \\
    M_7 &= S_7 \cdot T_7
    \end{align*}
  • #31
    \begin{align*}
    T_{11} &= B_{24} - \left(B_{12} + B_{22}\right) \\
    T_{25} &= B_{23} + B_{12} + B_{22}
    \end{align*}
  • #32
    \begin{align*}
    Y &= B_{12} + B_{22} \\
    T_{11} &= B_{24} - Y \\
    T_{25} &= B_{23} + Y
    \end{align*}