The document presents a framework for practical fast matrix multiplication algorithms that can outperform Intel's MKL library. It covers block partitioning schemes, code generation for rapid prototyping of new algorithms, and shared-memory parallelization strategies. Several fast algorithms outperform MKL on large problems, but memory-bandwidth limits often prevent large speedups at high core counts. Matching an algorithm's partitioning shape to the problem's dimensions, as well as moving to distributed-memory parallelism, may help address these limitations.
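To make the idea of a block-partitioned fast algorithm concrete, here is a minimal sketch of the best-known example, Strassen's <2,2,2> scheme, which splits each operand into 2x2 blocks and uses 7 block multiplications instead of 8. This is an illustrative recursion in NumPy, not the framework's implementation; the `leaf` cutoff and the power-of-two size assumption are simplifications for the sketch.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's <2,2,2> fast algorithm: recurse on 2x2 block partitions,
    trading 8 block multiplies for 7 multiplies plus extra additions.
    Assumes square matrices whose side is `leaf` times a power of two."""
    n = A.shape[0]
    if n <= leaf:  # fall back to ordinary matrix multiply at the leaves
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 7 recursive block products
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Reassemble the four blocks of the result
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The extra block additions are what make these algorithms bandwidth-hungry in practice: they perform more memory traffic per flop than a classical multiply, which is one reason the speedups shrink as core counts grow.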