A FRAMEWORK FOR PRACTICAL 
FAST MATRIX MULTIPLICATION 
github.com/arbenson/fast-matmul 
arXiv: 1409.2908 
[Figure: sequential dgemm performance (GFLOPS vs. dimension N) for N x 800 x 800, N x 800 x N, and N x N x N problems, with machine peak shown for reference.]
Austin Benson (arbenson@stanford.edu), ICME, Stanford 
Grey Ballard, Sandia National Laboratories 
ICME Colloquium, October 20, 2014
Fast matrix multiplication: 
bridging theory and practice 
• There are a number of Strassen-like algorithms for matrix multiplication that have only been "discovered" recently. [Smirnov13], [Benson&Ballard14]
• We show that they can achieve higher performance than Intel's Math Kernel Library (MKL).
• We use code generation for extensive prototyping.
[Figure: the exponent of matrix multiplication lies between 2 and 3: classical ω = 3, Strassen ω ≈ 2.81 [Strassen69], best known ω ≈ 2.37 [Williams12].]
Strassen’s algorithm 
Key ingredients of Strassen's algorithm
1. Block partitioning of the matrices (<2, 2, 2>)
2. Seven linear combinations of sub-blocks of A
3. Seven linear combinations of sub-blocks of B
4. Seven matrix multiplies to form the M_r (recursive)
5. Linear combinations of the M_r to form the C_ij (see the sketch below)
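To make these five ingredients concrete, here is a minimal one-level sketch in NumPy. It is not the fast-matmul implementation: it assumes square inputs with even dimension and performs the seven sub-multiplies with ordinary matrix multiplication instead of recursing.

```python
import numpy as np

def strassen_one_level(A, B):
    """One recursive step of Strassen's <2, 2, 2> algorithm.

    Assumes A and B are n x n with n even; the seven products M_1..M_7
    are formed with ordinary matrix multiplication instead of recursing.
    """
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # Linear combinations of sub-blocks of A and B (the S_r and T_r).
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Linear combinations of the M_r form the blocks of C.
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen_one_level(A, B), A @ B)
```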
Key ingredients of fast matmul algorithms
1. Block partitioning of the matrices (<M, K, N>)
2. R linear combinations of sub-blocks of A
3. R linear combinations of sub-blocks of B
4. R matrix multiplies to form the M_r (recursive); R < MKN ⇒ faster than classical
5. Linear combinations of the M_r to form the C_ij
Key idea: trade N^3 computation (matmul) for more N^2 computation (adding matrices together).
"Outer product" fast algorithm
• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32) ⇒ 23% speedup per recursive step, if everything else were free (see the arithmetic sketch below)
• Linear combinations of the A_ij to form the S_r: 68 terms
• Linear combinations of the B_ij to form the T_r: 52 terms
• Linear combinations of the M_r to form the C_ij: 69 terms
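As a rough check on those numbers, the per-step speedup and asymptotic exponent can be computed directly. The sketch below uses the standard bound of 3·log(R)/log(MKN) for an <M, K, N> base case with R multiplies (obtained by symmetrizing the base case) and assumes the extra additions are free.

```python
import math

def per_step_speedup(M, K, N, R):
    # Classical recursion on an <M, K, N> blocking does M*K*N sub-multiplies;
    # a fast algorithm does only R, so each recursive step saves this factor
    # (ignoring the cost of the extra additions).
    return M * K * N / R

def exponent(M, K, N, R):
    # Asymptotic exponent of the recursive algorithm: 3 * log(R) / log(M*K*N).
    return 3 * math.log(R) / math.log(M * K * N)

# The <4, 2, 4> "outer product" algorithm with R = 26 multiplies.
print(per_step_speedup(4, 2, 4, 26))  # ~1.23, i.e. a 23% speedup per step
print(exponent(4, 2, 4, 26))          # ~2.82, vs. ~2.81 for Strassen
```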
Discovering fast algorithms is a numerical challenge
• Low-rank tensor decompositions lead to fast algorithms (see the sketch below)
• The tensors are small, but we need exact decompositions ⇒ NP-hard
• Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]
• We have around 10 fast algorithms for <M, K, N> base cases, plus their permutations, e.g. <K, M, N>.
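A minimal sketch of that correspondence (the notation below is ours, not necessarily the papers'): an exact rank-R decomposition of the <M, K, N> multiplication tensor, with factor matrices U, V, W, reads off directly as a fast algorithm with R multiplies.

```latex
% Multiplication tensor for C = A B, with A of size M x K and B of size K x N:
%   \mathcal{T}_{(i,k),(k',j),(i',j')} = [i = i'] \, [k = k'] \, [j = j'].
% An exact rank-R decomposition \mathcal{T} = \sum_r u_r \otimes v_r \otimes w_r
% with factor matrices U, V, W gives the bilinear algorithm
\begin{align*}
  S_r = \sum_{i,k} U_{r,(i,k)} A_{ik}, \qquad
  T_r = \sum_{k,j} V_{r,(k,j)} B_{kj}, \qquad
  M_r = S_r \, T_r, \qquad
  C_{ij} = \sum_{r=1}^{R} W_{r,(i,j)} M_r .
\end{align*}
```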
[Figure: examples of fast algorithms, from [Strassen69] and [Smirnov13].]
Code generation lets us prototype algorithms quickly
• We have a compact representation of many fast algorithms:
  1. dimensions of the block partitioning (<M, K, N>)
  2. linear combinations of sub-blocks (the S_r and T_r)
  3. linear combinations of the M_r to form the C_ij
• We use code generation to rapidly prototype fast algorithms (a toy generator is sketched below)
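A toy illustration of the generation step (this is not the fast-matmul generator; the function name, argument conventions, and flat-list block layout are invented for the sketch): given the U, V, W coefficient rows of an algorithm, emit straight-line code for the S_r, T_r, M_r, and the output blocks.

```python
def generate(name, U, V, W):
    """Emit Python source for one recursive step of a fast algorithm.

    U, V, W hold R rows of coefficients for the S_r, the T_r, and the output
    blocks.  The generated function takes A and B as flat lists of sub-blocks
    plus a `multiply` callback for the R sub-multiplies.  (Toy sketch only,
    not the actual fast-matmul code generator.)
    """
    def combo(coeffs, names):
        terms = [f"({c}) * {v}" for c, v in zip(coeffs, names) if c != 0]
        return " + ".join(terms) if terms else "0"

    R = len(U)
    A_names = [f"A[{i}]" for i in range(len(U[0]))]
    B_names = [f"B[{i}]" for i in range(len(V[0]))]
    M_names = [f"M{r}" for r in range(R)]

    lines = [f"def {name}(A, B, multiply):"]
    for r in range(R):
        lines.append(f"    S{r} = {combo(U[r], A_names)}")
        lines.append(f"    T{r} = {combo(V[r], B_names)}")
        lines.append(f"    M{r} = multiply(S{r}, T{r})")
    for c in range(len(W[0])):
        lines.append(f"    C{c} = {combo([W[r][c] for r in range(R)], M_names)}")
    lines.append("    return [" + ", ".join(f"C{c}" for c in range(len(W[0]))) + "]")
    return "\n".join(lines)
```

Feeding in Strassen's seven rows of coefficients reproduces the S_r, T_r, and C combinations from the earlier sketch.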
Sequential performance
Effective GFLOPS for M x K x N multiplies = 1e-9 * 2 * MKN / (time in seconds)
[Figure: sequential performance on N x N x N (effective GFLOPS vs. dimension N, up to N = 8000) for MKL, STRASSEN, and the fast algorithms <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>; the true peak is marked.]
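For concreteness, the metric as a one-liner; the key point is that the numerator is the classical flop count 2MKN regardless of which algorithm is actually run, which is why fast algorithms can appear above the true peak (the timing in the example is made up).

```python
def effective_gflops(M, K, N, seconds):
    # Classical flop count 2*M*K*N divided by measured time, in GFLOPS.
    return 1e-9 * 2 * M * K * N / seconds

print(effective_gflops(8000, 8000, 8000, 36.0))  # ~28.4, with a made-up time
```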
Sequential performance
[Figure: sequential performance on N x N x N (effective GFLOPS vs. dimension N, up to N = 8000) for MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, <6,3,3>.]
• All algorithms beat MKL on large problems
• Strassen's algorithm is hard to beat
Sequential performance
[Figure: sequential performance on N x 1600 x N (effective GFLOPS vs. dimension N, 2000-12000) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best
Sequential performance
[Figure: sequential performance on N x 2400 x 2400 (effective GFLOPS vs. dimension N, 10000-18000) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best
Practical issues 
• When to stop recursion? 
• How to implement the linear combinations? 
• Can we eliminate redundant linear combinations? 
Is this useful? 
• How to parallelize? How to model parallel performance?
When to stop recursion? Look at the gemm curve.
[Figure: sequential dgemm performance (GFLOPS vs. dimension N) for N x 800 x 800, N x 800 x N, and N x N x N problems, with peak marked. Annotation: <M, K, N> = <4, 2, 3>.]
Basic idea: take another recursive step only if the sub-problems will still run at high performance (a sketch of this cutoff test follows).
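A minimal sketch of that rule (the cutoff value and the function name are illustrative, not taken from fast-matmul): take another <M, K, N> step only if the resulting sub-problems are still in the regime where dgemm runs near peak.

```python
CUTOFF = 1000  # illustrative: read off the dgemm curve where performance flattens out

def should_recurse(m, k, n, M, K, N):
    """Take another <M, K, N> recursive step only if the sub-problems,
    roughly (m/M) x (k/K) times (k/K) x (n/N), are still big enough that
    dgemm would run at high performance on them.  Toy heuristic."""
    return min(m // M, k // K, n // N) >= CUTOFF

print(should_recurse(8000, 2800, 8000, 4, 2, 3))  # True: worth one more step
print(should_recurse(2000, 2800, 2000, 4, 2, 3))  # False: just call dgemm
```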
Understanding parallel performance
[Figure, left: performance on 6 cores (effective GFLOPS per core vs. dimension N) on N x 2800 x N for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
[Figure, right: the same comparison on 24 cores.]
• 6 cores: similar performance to the sequential case
• 24 cores: MKL is best for large problems
Bandwidth problems
• We rely on matrix multiplications being much more expensive than matrix additions
• Parallel dgemm on 24 cores: easily gets 50-75% of peak
• STREAM benchmark: less than a 6x speedup in read/write performance on 24 cores (a rough estimate of the implied cap is sketched below)
[Figure: C is assembled by additions of the products M_1, M_2, ..., M_7.]
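A back-of-the-envelope view of why the additions start to dominate on 24 cores (the bandwidth figure below is a placeholder, not a measurement from this work): a matrix addition moves three n x n arrays of doubles for only n^2 flops, so its speed is capped by memory bandwidth no matter how many cores participate, while dgemm keeps scaling with cores.

```python
# Arithmetic intensity of C = A + B on doubles: n^2 flops over 3 * 8 * n^2 bytes.
flops_per_byte = 1.0 / 24.0

# Placeholder socket bandwidth (NOT a measured number from the paper).
bandwidth_bytes_per_s = 50e9

addition_gflops_cap = 1e-9 * bandwidth_bytes_per_s * flops_per_byte
print(addition_gflops_cap)  # ~2 GFLOPS: the additions are bandwidth bound
```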
High-level conclusions
• For square matrix multiplication, Strassen's algorithm is best
• For rectangular matrix multiplication, use a fast algorithm that "matches the shape"
• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication ⇒ it should be less of an issue in distributed memory
Future work: 
• Numerical stability 
• Using fast matmul as a kernel for other algorithms in 
numerical linear algebra
A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION
github.com/arbenson/fast-matmul
arXiv: 1409.2908
Austin Benson (arbenson@stanford.edu), ICME, Stanford
Grey Ballard, Sandia National Laboratories
ICME Colloquium, October 20, 2014


Editor's Notes

  • #4 The block partitioning and Strassen's linear combinations:
    $$\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \cdot \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$$
    \begin{align*}
    S_1 &= A_{11} + A_{22} & T_1 &= B_{11} + B_{22} \\
    S_2 &= A_{21} + A_{22} & T_2 &= B_{11} \\
    S_3 &= A_{11}          & T_3 &= B_{12} - B_{22} \\
    S_4 &= A_{22}          & T_4 &= B_{21} - B_{11} \\
    S_5 &= A_{11} + A_{12} & T_5 &= B_{22} \\
    S_6 &= A_{21} - A_{11} & T_6 &= B_{11} + B_{12} \\
    S_7 &= A_{12} - A_{22} & T_7 &= B_{21} + B_{22}
    \end{align*}
  • #9 Table of fast algorithm base cases:
    \begin{table}
    \begin{tabular}{l c c c c}
    Algorithm base case & Multiplies (fast) & Multiplies (classical) & Speedup per recursive step & Exponent \\
    $\langle 2,2,3\rangle$ & 11 & 12 & 9\%  & 2.89 \\
    $\langle 2,2,5\rangle$ & 18 & 20 & 11\% & 2.89 \\
    $\langle 2,2,2\rangle$ & 7  & 8  & 14\% & 2.81 \\
    $\langle 2,2,4\rangle$ & 14 & 16 & 14\% & 2.85 \\
    $\langle 3,3,3\rangle$ & 23 & 27 & 17\% & 2.85 \\
    $\langle 2,3,3\rangle$ & 15 & 18 & 20\% & 2.81 \\
    $\langle 2,3,4\rangle$ & 20 & 24 & 20\% & 2.83 \\
    $\langle 2,4,4\rangle$ & 26 & 32 & 23\% & 2.82 \\
    $\langle 3,3,4\rangle$ & 29 & 36 & 24\% & 2.82 \\
    $\langle 3,4,4\rangle$ & 38 & 48 & 26\% & 2.82 \\
    $\langle 3,3,6\rangle$ & 40 & 54 & 35\% & 2.77 \\
    \end{tabular}
    \end{table}