A FRAMEWORK FOR PRACTICAL 
FAST MATRIX MULTIPLICATION 
github.com/arbenson/fast-matmul 
arXiv: 1409.2908 
[Figure: sequential dgemm performance (GFLOPS vs. dimension N) for N x 800 x 800, N x 800 x N, and N x N x N problems, with machine peak shown for reference.]
Austin Benson (arbenson@stanford.edu), ICME, Stanford 
Grey Ballard, Sandia National Laboratories 
ICME Colloquium, October 20, 2014
Fast matrix multiplication: 
bridging theory and practice 
• There are a number of Strassen-like algorithms for matrix multiplication that have only been "discovered" recently. [Smirnov13], [Benson&Ballard14]
• We show that they can achieve higher performance than Intel's Math Kernel Library (MKL).
• We use code generation for extensive prototyping.
[Figure: the exponent of matrix multiplication lies between 2 and 3: classical ω = 3, Strassen ω ≈ 2.81 [Strassen69], best known ω ≈ 2.37 [Williams12].]
Strassen’s algorithm 
Key ingredients of Strassen's algorithm
1. Block partitioning of the matrices (<2, 2, 2>)
2. Seven linear combinations of sub-blocks of A
3. Seven linear combinations of sub-blocks of B
4. Seven matrix multiplies to form the M_r (recursive)
5. Linear combinations of the M_r to form the C_ij (see the sketch below)
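To make these five ingredients concrete, here is a minimal one-level sketch in NumPy. It is not the fast-matmul implementation: it assumes square inputs with even dimension and performs the seven sub-multiplies with ordinary matrix multiplication instead of recursing.

```python
import numpy as np

def strassen_one_level(A, B):
    """One recursive step of Strassen's <2, 2, 2> algorithm.

    Assumes A and B are n x n with n even; the seven products M_1..M_7
    are formed with ordinary matrix multiplication instead of recursing.
    """
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # Linear combinations of sub-blocks of A and B (the S_r and T_r).
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Linear combinations of the M_r form the blocks of C.
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen_one_level(A, B), A @ B)
```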
Key ingredients of fast matmul algorithms
1. Block partitioning of the matrices (<M, K, N>)
2. R linear combinations of sub-blocks of A
3. R linear combinations of sub-blocks of B
4. R matrix multiplies to form the M_r (recursive); R < MKN ⇒ faster than classical
5. Linear combinations of the M_r to form the C_ij
Key idea: trade N^3 computation (matmul) for more N^2 computation (adding matrices together).
"Outer product" fast algorithm
• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32) ⇒ 23% speedup per recursive step, if everything else were free (see the arithmetic sketch below)
• Linear combinations of the A_ij to form the S_r: 68 terms
• Linear combinations of the B_ij to form the T_r: 52 terms
• Linear combinations of the M_r to form the C_ij: 69 terms
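As a rough check on those numbers, the per-step speedup and asymptotic exponent can be computed directly. The sketch below uses the standard bound of 3·log(R)/log(MKN) for an <M, K, N> base case with R multiplies (obtained by symmetrizing the base case) and assumes the extra additions are free.

```python
import math

def per_step_speedup(M, K, N, R):
    # Classical recursion on an <M, K, N> blocking does M*K*N sub-multiplies;
    # a fast algorithm does only R, so each recursive step saves this factor
    # (ignoring the cost of the extra additions).
    return M * K * N / R

def exponent(M, K, N, R):
    # Asymptotic exponent of the recursive algorithm: 3 * log(R) / log(M*K*N).
    return 3 * math.log(R) / math.log(M * K * N)

# The <4, 2, 4> "outer product" algorithm with R = 26 multiplies.
print(per_step_speedup(4, 2, 4, 26))  # ~1.23, i.e. a 23% speedup per step
print(exponent(4, 2, 4, 26))          # ~2.82, vs. ~2.81 for Strassen
```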
Discovering fast algorithms is a numerical challenge
• Low-rank tensor decompositions lead to fast algorithms (see the sketch below)
• The tensors are small, but we need exact decompositions ⇒ NP-hard
• Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]
• We have around 10 fast algorithms for <M, K, N> base cases, plus their permutations, e.g. <K, M, N>.
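A minimal sketch of that correspondence (the notation below is ours, not necessarily the papers'): an exact rank-R decomposition of the <M, K, N> multiplication tensor, with factor matrices U, V, W, reads off directly as a fast algorithm with R multiplies.

```latex
% Multiplication tensor for C = A B, with A of size M x K and B of size K x N:
%   \mathcal{T}_{(i,k),(k',j),(i',j')} = [i = i'] \, [k = k'] \, [j = j'].
% An exact rank-R decomposition \mathcal{T} = \sum_r u_r \otimes v_r \otimes w_r
% with factor matrices U, V, W gives the bilinear algorithm
\begin{align*}
  S_r = \sum_{i,k} U_{r,(i,k)} A_{ik}, \qquad
  T_r = \sum_{k,j} V_{r,(k,j)} B_{kj}, \qquad
  M_r = S_r \, T_r, \qquad
  C_{ij} = \sum_{r=1}^{R} W_{r,(i,j)} M_r .
\end{align*}
```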
[Figure: examples of fast algorithms, from [Strassen69] and [Smirnov13].]
Code generation lets us prototype algorithms quickly
• We have a compact representation of many fast algorithms:
  1. dimensions of the block partitioning (<M, K, N>)
  2. linear combinations of sub-blocks (the S_r and T_r)
  3. linear combinations of the M_r to form the C_ij
• We use code generation to rapidly prototype fast algorithms (a toy generator is sketched below)
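A toy illustration of the generation step (this is not the fast-matmul generator; the function name, argument conventions, and flat-list block layout are invented for the sketch): given the U, V, W coefficient rows of an algorithm, emit straight-line code for the S_r, T_r, M_r, and the output blocks.

```python
def generate(name, U, V, W):
    """Emit Python source for one recursive step of a fast algorithm.

    U, V, W hold R rows of coefficients for the S_r, the T_r, and the output
    blocks.  The generated function takes A and B as flat lists of sub-blocks
    plus a `multiply` callback for the R sub-multiplies.  (Toy sketch only,
    not the actual fast-matmul code generator.)
    """
    def combo(coeffs, names):
        terms = [f"({c}) * {v}" for c, v in zip(coeffs, names) if c != 0]
        return " + ".join(terms) if terms else "0"

    R = len(U)
    A_names = [f"A[{i}]" for i in range(len(U[0]))]
    B_names = [f"B[{i}]" for i in range(len(V[0]))]
    M_names = [f"M{r}" for r in range(R)]

    lines = [f"def {name}(A, B, multiply):"]
    for r in range(R):
        lines.append(f"    S{r} = {combo(U[r], A_names)}")
        lines.append(f"    T{r} = {combo(V[r], B_names)}")
        lines.append(f"    M{r} = multiply(S{r}, T{r})")
    for c in range(len(W[0])):
        lines.append(f"    C{c} = {combo([W[r][c] for r in range(R)], M_names)}")
    lines.append("    return [" + ", ".join(f"C{c}" for c in range(len(W[0]))) + "]")
    return "\n".join(lines)
```

Feeding in Strassen's seven rows of coefficients reproduces the S_r, T_r, and C combinations from the earlier sketch.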
Sequential performance
Effective GFLOPS for M x K x N multiplies = 1e-9 * 2 * MKN / (time in seconds)
[Figure: sequential performance on N x N x N (effective GFLOPS vs. dimension N, up to N = 8000) for MKL, STRASSEN, and the fast algorithms <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>; the true peak is marked.]
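For concreteness, the metric as a one-liner; the key point is that the numerator is the classical flop count 2MKN regardless of which algorithm is actually run, which is why fast algorithms can appear above the true peak (the timing in the example is made up).

```python
def effective_gflops(M, K, N, seconds):
    # Classical flop count 2*M*K*N divided by measured time, in GFLOPS.
    return 1e-9 * 2 * M * K * N / seconds

print(effective_gflops(8000, 8000, 8000, 36.0))  # ~28.4, with a made-up time
```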
Sequential performance
[Figure: sequential performance on N x N x N (effective GFLOPS vs. dimension N, up to N = 8000) for MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, <6,3,3>.]
• All algorithms beat MKL on large problems
• Strassen's algorithm is hard to beat
Sequential performance
[Figure: sequential performance on N x 1600 x N (effective GFLOPS vs. dimension N, 2000-12000) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best
Sequential performance
[Figure: sequential performance on N x 2400 x 2400 (effective GFLOPS vs. dimension N, 10000-18000) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best
Practical issues 
• When to stop recursion? 
• How to implement the linear combinations? 
• Can we eliminate redundant linear combinations? 
Is this useful? 
• How to parallelize? How to model parallel performance?
When to stop recursion? Look at the gemm curve.
[Figure: sequential dgemm performance (GFLOPS vs. dimension N) for N x 800 x 800, N x 800 x N, and N x N x N problems, with peak marked. Annotation: <M, K, N> = <4, 2, 3>.]
Basic idea: take another recursive step only if the sub-problems will still run at high performance (a sketch of this cutoff test follows).
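A minimal sketch of that rule (the cutoff value and the function name are illustrative, not taken from fast-matmul): take another <M, K, N> step only if the resulting sub-problems are still in the regime where dgemm runs near peak.

```python
CUTOFF = 1000  # illustrative: read off the dgemm curve where performance flattens out

def should_recurse(m, k, n, M, K, N):
    """Take another <M, K, N> recursive step only if the sub-problems,
    roughly (m/M) x (k/K) times (k/K) x (n/N), are still big enough that
    dgemm would run at high performance on them.  Toy heuristic."""
    return min(m // M, k // K, n // N) >= CUTOFF

print(should_recurse(8000, 2800, 8000, 4, 2, 3))  # True: worth one more step
print(should_recurse(2000, 2800, 2000, 4, 2, 3))  # False: just call dgemm
```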
Understanding parallel performance
[Figure, left: performance on 6 cores (effective GFLOPS per core vs. dimension N) on N x 2800 x N for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
[Figure, right: the same comparison on 24 cores.]
• 6 cores: similar performance to the sequential case
• 24 cores: MKL is best for large problems
Bandwidth problems
• We rely on matrix multiplications being much more expensive than matrix additions
• Parallel dgemm on 24 cores: easily gets 50-75% of peak
• STREAM benchmark: less than a 6x speedup in read/write performance on 24 cores (a rough estimate of the implied cap is sketched below)
[Figure: C is assembled by additions of the products M_1, M_2, ..., M_7.]
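A back-of-the-envelope view of why the additions start to dominate on 24 cores (the bandwidth figure below is a placeholder, not a measurement from this work): a matrix addition moves three n x n arrays of doubles for only n^2 flops, so its speed is capped by memory bandwidth no matter how many cores participate, while dgemm keeps scaling with cores.

```python
# Arithmetic intensity of C = A + B on doubles: n^2 flops over 3 * 8 * n^2 bytes.
flops_per_byte = 1.0 / 24.0

# Placeholder socket bandwidth (NOT a measured number from the paper).
bandwidth_bytes_per_s = 50e9

addition_gflops_cap = 1e-9 * bandwidth_bytes_per_s * flops_per_byte
print(addition_gflops_cap)  # ~2 GFLOPS: the additions are bandwidth bound
```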
High-level conclusions
• For square matrix multiplication, Strassen's algorithm is best
• For rectangular matrix multiplication, use a fast algorithm that "matches the shape"
• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication ⇒ it should be less of an issue in distributed memory
Future work: 
• Numerical stability 
• Using fast matmul as a kernel for other algorithms in 
numerical linear algebra
A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION
github.com/arbenson/fast-matmul
arXiv: 1409.2908
Austin Benson (arbenson@stanford.edu), ICME, Stanford
Grey Ballard, Sandia National Laboratories
ICME Colloquium, October 20, 2014


Editor's Notes

  • #4 The block partitioning and Strassen's linear combinations:
    $$\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \cdot \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$$
    \begin{align*}
    S_1 &= A_{11} + A_{22} & T_1 &= B_{11} + B_{22} \\
    S_2 &= A_{21} + A_{22} & T_2 &= B_{11} \\
    S_3 &= A_{11}          & T_3 &= B_{12} - B_{22} \\
    S_4 &= A_{22}          & T_4 &= B_{21} - B_{11} \\
    S_5 &= A_{11} + A_{12} & T_5 &= B_{22} \\
    S_6 &= A_{21} - A_{11} & T_6 &= B_{11} + B_{12} \\
    S_7 &= A_{12} - A_{22} & T_7 &= B_{21} + B_{22}
    \end{align*}
  • #9 Table of fast algorithm base cases:
    \begin{table}
    \begin{tabular}{l c c c c}
    Algorithm base case & Multiplies (fast) & Multiplies (classical) & Speedup per recursive step & Exponent \\
    $\langle 2,2,3\rangle$ & 11 & 12 & 9\%  & 2.89 \\
    $\langle 2,2,5\rangle$ & 18 & 20 & 11\% & 2.89 \\
    $\langle 2,2,2\rangle$ & 7  & 8  & 14\% & 2.81 \\
    $\langle 2,2,4\rangle$ & 14 & 16 & 14\% & 2.85 \\
    $\langle 3,3,3\rangle$ & 23 & 27 & 17\% & 2.85 \\
    $\langle 2,3,3\rangle$ & 15 & 18 & 20\% & 2.81 \\
    $\langle 2,3,4\rangle$ & 20 & 24 & 20\% & 2.83 \\
    $\langle 2,4,4\rangle$ & 26 & 32 & 23\% & 2.82 \\
    $\langle 3,3,4\rangle$ & 29 & 36 & 24\% & 2.82 \\
    $\langle 3,4,4\rangle$ & 38 & 48 & 26\% & 2.82 \\
    $\langle 3,3,6\rangle$ & 40 & 54 & 35\% & 2.77 \\
    \end{tabular}
    \end{table}