
Sandia Fast Matmul

Slides from my talk at Sandia.


  1. A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION (arXiv: 1409.2908)
     Austin Benson (arbenson@stanford.edu), ICME, Stanford
     Grey Ballard, Sandia National Laboratories
     Sandia National Laboratories, October 21, 2014
     [Plot: parallel performance of <4,2,4> on <N,2800,N>, effective GFLOPS per core vs. dimension N; curves for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]
  2. Fast matrix multiplication: bridging theory and practice
     • There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently [Smirnov13], [Benson&Ballard14].
     • We show that they can achieve higher performance than Intel’s Math Kernel Library (MKL).
     • We use code generation for extensive prototyping.
     [Figure: number line of the matrix multiplication exponent, from 3 (classical) down to 2.81 [Strassen69] and 2.37 [Williams12], approaching the lower bound of 2.]
  3. Strassen’s algorithm
  4. Key ingredients of Strassen’s algorithm
     1. Block partitioning of matrices (<2, 2, 2>)
     2. Seven linear combinations of sub-blocks of A
     3. Seven linear combinations of sub-blocks of B
     4. Seven matrix multiplies to form Mr (recursive)
     5. Linear combinations of Mr to form Cij
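For reference (these are the standard Strassen formulas; they are not spelled out in the extracted slide text), the seven products and the output combinations are:

    M1 = (A11 + A22)(B11 + B22)
    M2 = (A21 + A22) B11
    M3 = A11 (B12 - B22)
    M4 = A22 (B21 - B11)
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6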
  5. Key ingredients of fast matmul algorithms
     1. Block partitioning of matrices (<M, K, N>)
     2. R linear combinations of sub-blocks of A
     3. R linear combinations of sub-blocks of B
     4. R matrix multiplies to form Mr (recursive); R < MKN means faster than classical
     5. Linear combinations of Mr to form Cij
     Main idea: trade N^3 computations (matmul) for more N^2 computations (matrix additions). A recursion sketch follows this slide.
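To make the recipe concrete, here is a minimal, self-contained Strassen recursion in C++. It is a sketch under simplifying assumptions (square power-of-two sizes, a naive triple loop instead of MKL at the base case) and is not the authors' implementation:

    #include <cstddef>
    #include <vector>

    using Mat = std::vector<double>;  // row-major n x n matrix

    // Classical triple loop, used at the recursion cutoff. A real code would
    // call a tuned dgemm (e.g., MKL) here instead. C must be zero-initialized.
    static void classical(const Mat& A, const Mat& B, Mat& C, int n) {
      for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
          for (int j = 0; j < n; ++j)
            C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    // Z = X + sign * Y (the "linear combinations of sub-blocks" step).
    static Mat add(const Mat& X, const Mat& Y, double sign) {
      Mat Z(X.size());
      for (std::size_t i = 0; i < X.size(); ++i) Z[i] = X[i] + sign * Y[i];
      return Z;
    }

    // Extract the (bi, bj) sub-block of a 2 x 2 partitioning.
    static Mat block(const Mat& X, int n, int bi, int bj) {
      int h = n / 2;
      Mat Y(h * h);
      for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j)
          Y[i * h + j] = X[(bi * h + i) * n + (bj * h + j)];
      return Y;
    }

    // Strassen's <2,2,2> algorithm: 7 recursive multiplies instead of 8.
    // n must be a power of two; 'cutoff' controls when to switch to classical.
    // C must be pre-sized to n*n and zero-initialized.
    void strassen(const Mat& A, const Mat& B, Mat& C, int n, int cutoff) {
      if (n <= cutoff) { classical(A, B, C, n); return; }
      int h = n / 2;
      Mat A11 = block(A, n, 0, 0), A12 = block(A, n, 0, 1),
          A21 = block(A, n, 1, 0), A22 = block(A, n, 1, 1);
      Mat B11 = block(B, n, 0, 0), B12 = block(B, n, 0, 1),
          B21 = block(B, n, 1, 0), B22 = block(B, n, 1, 1);
      std::vector<Mat> M(7, Mat(h * h, 0.0));
      strassen(add(A11, A22, +1), add(B11, B22, +1), M[0], h, cutoff);  // M1
      strassen(add(A21, A22, +1), B11,               M[1], h, cutoff);  // M2
      strassen(A11,               add(B12, B22, -1), M[2], h, cutoff);  // M3
      strassen(A22,               add(B21, B11, -1), M[3], h, cutoff);  // M4
      strassen(add(A11, A12, +1), B22,               M[4], h, cutoff);  // M5
      strassen(add(A21, A11, -1), add(B11, B12, +1), M[5], h, cutoff);  // M6
      strassen(add(A12, A22, -1), add(B21, B22, +1), M[6], h, cutoff);  // M7
      // Combine: C11 = M1+M4-M5+M7, C12 = M3+M5, C21 = M2+M4, C22 = M1-M2+M3+M6.
      for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) {
          int q = i * h + j;
          C[i * n + j]             = M[0][q] + M[3][q] - M[4][q] + M[6][q];
          C[i * n + (j + h)]       = M[2][q] + M[4][q];
          C[(i + h) * n + j]       = M[1][q] + M[3][q];
          C[(i + h) * n + (j + h)] = M[0][q] - M[1][q] + M[2][q] + M[5][q];
        }
    }

A general <M, K, N> algorithm has exactly the same structure, with the hard-coded combinations replaced by the R rows of its coefficient tables.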
  6. “Outer product” fast algorithm
     • <4, 2, 4> partitioning
     • R = 26 multiplies (< 4 * 2 * 4 = 32), giving a 23% speedup per recursive step (if everything else were free); see the arithmetic below
     • Linear combinations of Aij to form Sr: 68 terms
     • Linear combinations of Bij to form Tr: 52 terms
     • Linear combinations of Mr to form Cij: 69 terms
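Where the 23% comes from: one recursive step of the classical algorithm on this partitioning would perform MKN = 32 block multiplies, while the fast algorithm performs R = 26 of roughly the same size, so

    speedup per recursive step ≈ MKN / R = 32 / 26 ≈ 1.23,

i.e., about 23% faster per step if the additions were free.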
  7. Discovering fast algorithms is a numerical challenge
     • Low-rank tensor decompositions lead to fast algorithms
     • The tensors are small, but we need exact decompositions (an NP-hard problem)
     • Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]
     • We have around 10 fast algorithms for <M, K, N> partitions. Also have permutations, e.g., <K, M, N>.
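The connection the slide refers to, added here for context (a standard fact about bilinear algorithms): a fast <M, K, N> algorithm with R multiplies is exactly a rank-R decomposition of the matrix multiplication tensor of size MK x KN x MN,

    T_<M,K,N> = sum_{r=1}^{R} u_r ⊗ v_r ⊗ w_r,

where u_r holds the coefficients of S_r over the sub-blocks of A, v_r holds the coefficients of T_r over the sub-blocks of B, and w_r tells how M_r contributes to the blocks of C.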
  8. [Figure: fast algorithms from [Strassen69] and [Smirnov13].]
  9. Code generation lets us prototype algorithms quickly
     • We have a compact representation of many fast algorithms:
       1. dimensions of the block partitioning (<M, K, N>)
       2. linear combinations of sub-blocks (Sr, Tr)
       3. linear combinations of Mr to form Cij
     • We use code generation to rapidly prototype fast algorithms (an example encoding is sketched below)
     • Our approach: test all algorithms on many different problem sizes and look for patterns
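As an illustration of what such a compact representation can look like (a common convention, not necessarily the authors' exact file format): an algorithm is stored as three coefficient tables U, V, W with one row per multiply. For Strassen's <2,2,2> algorithm, row r of U gives the coefficients of S_r over (A11, A12, A21, A22):

         A11  A12  A21  A22
    S1:   1    0    0    1
    S2:   0    0    1    1
    S3:   1    0    0    0
    S4:   0    0    0    1
    S5:   1    1    0    0
    S6:  -1    0    1    0
    S7:   0    1    0   -1

with V and W encoding the Tr and the Cij combinations the same way. A generator only needs these coefficients plus <M, K, N> to emit the addition and recursion code.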
  10. Sequential performance
      [Plot: sequential performance on N x N x N, effective GFLOPS vs. dimension N; curves for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>; the true hardware peak is marked.]
      Effective GFLOPS for M x P x Q multiplies = 1e-9 * 2 * M * P * Q / (time in seconds)
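A minimal helper for the metric above (a sketch of the definition on the slide; since a fast algorithm performs less arithmetic than the classical 2*M*P*Q flops, its effective GFLOPS can legitimately exceed the true hardware peak):

    #include <cstdint>

    // Effective GFLOPS for an M x P x Q multiply: the classical flop count
    // 2*M*P*Q divided by the measured time, scaled to giga-flops.
    double effective_gflops(std::int64_t M, std::int64_t P, std::int64_t Q,
                            double seconds) {
      return 1e-9 * 2.0 * static_cast<double>(M) * P * Q / seconds;
    }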
  11. Sequential performance
      [Plot: sequential performance on N x N x N, effective GFLOPS vs. dimension N; curves for MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, <6,3,3>.]
      • All algorithms beat MKL on large problems
      • Strassen’s algorithm is hard to beat
  12. Sequential performance
      [Plot: sequential performance on N x 1600 x N, effective GFLOPS vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
      • Almost all algorithms beat MKL
      • <4, 2, 4> and <3, 2, 3> tend to perform the best
  13. Sequential performance
      [Plot: sequential performance on N x 2400 x 2400, effective GFLOPS vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
      • Almost all algorithms beat MKL
      • <4, 3, 3> and <4, 2, 3> tend to perform the best
  14. Practical issues, or what is not obvious when...
      • ...you just think about the theory
      • ...your performance models are too simple
  15. Practical issue #1: when to stop recursion? Look at the gemm curve.
      [Plots: sequential dgemm performance (GFLOPS vs. dimension N) for N x 800 x 800, N x 800 x N, and N x N x N, with the peak marked; parallel dgemm performance per core on 24 cores.]
      Basic idea: take another recursive step if the sub-problems will still operate at high performance. Example: <M, K, N> = <4, 2, 3>.
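A sketch of how such a cutoff rule might look in code. The threshold and the way "high performance" is judged are hypothetical; the slide's rule is qualitative and based on the measured dgemm curve:

    // Take another recursive step only if the sub-problems are still large
    // enough that dgemm runs near its peak; otherwise call dgemm directly.
    // 'min_dim_for_peak' is a machine-dependent, empirically chosen threshold.
    bool should_recurse(long m, long k, long n, long min_dim_for_peak) {
      // Sub-problem dimensions after one <M, K, N> = <4, 2, 3> step.
      long sub_m = m / 4, sub_k = k / 2, sub_n = n / 3;
      return sub_m >= min_dim_for_peak &&
             sub_k >= min_dim_for_peak &&
             sub_n >= min_dim_for_peak;
    }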
  16. Practical issue #2: matrix additions
      [Diagram: forming S1, ..., S7 from the sub-blocks A11, A12, A21, A22.]
      “Pairwise” approach: build each Sr with repeated DAXPY-style updates (Y := alpha*X + Y), e.g., 2x DAXPY per Sr.
  17. Practical issue #2: matrix additions
      [Diagram: forming S1, ..., S7 from the sub-blocks A11, A12, A21, A22.]
      “Write once” approach: custom “DAXPY”-like kernels that write each Sr only once.
  18. Practical issue #2: matrix additions
      [Diagram: forming S1, ..., S7 from the sub-blocks A11, A12, A21, A22.]
      “Streaming” approach: entry-wise updates, streaming through the input blocks and updating all of the Sr.
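A small sketch of the difference between the pairwise and fused ("write once" / streaming) styles for a single Sr, say Sr = A11 + A21 + A22 (this particular combination is made up for illustration, and blocks are simplified to contiguous arrays):

    #include <cstddef>

    // Pairwise: one DAXPY-style pass per term. S is read and rewritten for
    // every term, so memory traffic grows with the number of terms.
    void form_S_pairwise(const double* A11, const double* A21,
                         const double* A22, double* S, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) S[i] = A11[i];    // copy
      for (std::size_t i = 0; i < n; ++i) S[i] += A21[i];   // daxpy-like pass
      for (std::size_t i = 0; i < n; ++i) S[i] += A22[i];   // daxpy-like pass
    }

    // Fused ("write once"): each entry of S is written exactly once and each
    // input entry is read exactly once in a single streaming pass.
    void form_S_fused(const double* A11, const double* A21,
                      const double* A22, double* S, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i)
        S[i] = A11[i] + A21[i] + A22[i];
    }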
  19. Practical issue #3: redundant linear combinations
      • Example in the <4, 2, 4> algorithm (R = 26 multiplies): [Diagram: T11 and T25 are each formed from the sub-blocks B12, B22, B23, B24.]
      • Forming them independently: four additions, six reads, two writes
  20. Common subexpression elimination
      • Example in the <4, 2, 4> algorithm (R = 26 multiplies): [Diagram: a shared intermediate Y is computed once and reused to form T11 and T25 from B12, B22, B23, B24.]
      • Three additions, six reads, three writes, so a net increase in communication!
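A concrete instance of the trade-off. The combinations below are made up for illustration (the actual <4,2,4> coefficients are not in the extracted text), but they reproduce the operation counts on the two slides:

    Without CSE:
      T11 = B12 + B22 + B23      (2 additions, 3 reads, 1 write)
      T25 = B22 + B23 + B24      (2 additions, 3 reads, 1 write)
      Total: 4 additions, 6 reads, 2 writes

    With CSE:
      Y   = B22 + B23            (1 addition, 2 reads, 1 write)
      T11 = B12 + Y              (1 addition, 2 reads, 1 write)
      T25 = Y + B24              (1 addition, 2 reads, 1 write)
      Total: 3 additions, 6 reads, 3 writes

One addition is saved, but the extra write of Y (and the re-reads of it) increases data movement, which is why CSE does not pay off for a bandwidth-bound step.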
  21. Common subexpression elimination (CSE) does not really help
  22. Practical issue #4: parallelization on shared memory
  23. Fast matmul recursion tree
      [Diagram: C is combined from M1, M2, ..., M7; each Mr is itself the root of another M1, ..., M7 subtree, and so on recursively.]
  24. DFS parallelization
      [Diagram: the recursion tree is traversed one subproblem at a time; all threads work on the current node, using parallel MKL.]
      + Easy to implement
      + Load balanced
      + Same memory footprint as sequential
      - Need large base cases for high performance
  25. BFS parallelization
      [Diagram: the subproblems at one level are spawned as tasks (omp taskwait at the end), each handled by 1 thread.]
      + High performance for smaller base cases
      - Sometimes harder to load balance: 24 threads, 49 subproblems
      - More memory
      (A minimal OpenMP task sketch follows.)
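A minimal sketch of the BFS style with OpenMP tasks, reusing the Mat type and the strassen() function from the earlier sketch (this shows the structure only, not the authors' implementation; two such levels give the 49 = 7*7 subproblems mentioned above):

    #include <omp.h>
    #include <vector>

    // BFS step: the 7 subproblems of one Strassen level are spawned as
    // OpenMP tasks, each executed by a single thread; taskwait ensures all
    // products are ready before they are combined into C.
    void bfs_level(const std::vector<Mat>& S, const std::vector<Mat>& T,
                   std::vector<Mat>& M, int n, int cutoff) {
      #pragma omp parallel
      #pragma omp single
      {
        for (int r = 0; r < 7; ++r) {
          #pragma omp task firstprivate(r)
          strassen(S[r], T[r], M[r], n, cutoff);
        }
        #pragma omp taskwait
        // ...combine the M[r] into the blocks of C here...
      }
    }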
  26. HYBRID parallelization
      [Diagram: some subproblems run on 1 thread each (BFS-style, with omp taskwait); the remainder run with all threads (DFS-style).]
      + Better load balancing
      - Requires explicit synchronization, or else we can over-subscribe threads
  27. [Plot: parallel performance of <4,2,4> on <N,2800,N>, effective GFLOPS per core vs. dimension N; curves for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]
  28. Bandwidth problems
      • We rely on matrix multiplications being much more expensive than matrix additions
      • Parallel dgemm on 24 cores: easily get 50-75% of peak
      • STREAM benchmark: < 6x speedup in read/write performance on 24 cores
      [Diagram: the step that combines M1, M2, ..., M7 into C.]
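Why the additions hit the bandwidth limit (a back-of-the-envelope estimate added here, not from the slides): a matrix addition does one flop per entry while touching three doubles, whereas a classical multiply does O(N) flops per entry touched.

    Addition C = A + B (double precision):
      1 flop per entry, 3 * 8 = 24 bytes moved per entry
      arithmetic intensity ≈ 1/24 flop per byte  ->  bandwidth bound

    Classical multiply, N x N x N:
      2*N^3 flops over about 3*8*N^2 bytes of data
      arithmetic intensity ≈ N/12 flop per byte  ->  compute bound for large N

So when dgemm scales well to 24 cores but reads and writes speed up by less than 6x, the additions (and the combine step in the diagram above) become the bottleneck.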
  29. Parallel performance
      [Plots: performance on N x N x N, effective GFLOPS per core vs. dimension N, on 6 cores and on 24 cores; curves for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>.]
      • 6 cores: similar performance to sequential
      • 24 cores: can sometimes beat MKL, but barely
  30. Parallel performance
      [Plots: performance on N x 2800 x N, effective GFLOPS per core vs. dimension N, on 6 cores and on 24 cores; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN; a region of bad MKL performance is annotated.]
      • 6 cores: similar performance to sequential
      • 24 cores: MKL best for large problems
  31. Parallel performance
      [Plots: performance on N x 3000 x 3000, effective GFLOPS per core vs. dimension N, on 6 cores and on 24 cores; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
      • 6 cores: similar performance to sequential
      • 24 cores: MKL usually the best
  32. High-level conclusions
      • For square matrix multiplication, Strassen’s algorithm is hard to beat
      • For rectangular matrix multiplication, use a fast algorithm that “matches the shape”
      • Bandwidth limits the performance of shared-memory parallel fast matrix multiplication; this should be less of an issue in distributed memory
      Future work:
      • Numerical stability
      • Using fast matmul as a kernel for other algorithms in numerical linear algebra
  33. A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION (arXiv: 1409.2908)
      Austin Benson (arbenson@stanford.edu), ICME, Stanford
      Grey Ballard, Sandia National Laboratories
      Sandia National Laboratories, October 21, 2014
      [Closing slide: repeats the title slide and the <4,2,4> on <N,2800,N> performance plot.]
