fast-matmul-ppopp2015

Talk from PPoPP 2015 in San Francisco, CA.

  1. A FRAMEWORK FOR PRACTICAL PARALLEL FAST MATRIX MULTIPLICATION. github.com/arbenson/fast-matmul. Austin Benson (arbenson@stanford.edu), Stanford University. Joint work with Grey Ballard, Sandia. PPoPP 2015, San Francisco, CA. [Title-slide plot: Performance (6 cores) on N x N x N; Effective GFLOPS/core vs. dimension N; curves for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>]
  2. Fast matrix multiplication: bridging theory and practice
     • There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently. [Smirnov13], [Benson&Ballard14]
     • We show that they can achieve higher performance than the Intel Math Kernel Library (MKL).
     • We use code generation for extensive prototyping.
     [Graphic: matmul exponent milestones, 3 (classical), 2.81 [Strassen], 2.37 [Williams12], with a block-partition sketch]
  3. Strassen’s algorithm
  4. Key ingredients of Strassen’s algorithm
     • 1. Block partitioning of matrices (<2, 2, 2>)
     • 2. Seven linear combinations of sub-blocks of A
     • 3. Seven linear combinations of sub-blocks of B
     • 4. Seven matrix multiplies to form the Mr (recursive)
     • 5. Linear combinations of the Mr to form the Cij
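     To make these five ingredients concrete, here is a minimal, unoptimized sketch of one recursive step of Strassen's algorithm (the same structure generalizes to the <M, K, N> algorithms on the next slides). It assumes square matrices of even dimension; the Mat type, the lin helper, and the cutoff value are illustrative choices, not the interface of the fast-matmul code.

     #include <cstddef>
     #include <vector>

     // Dense square matrix, row-major. Illustrative only.
     struct Mat {
         std::size_t n;
         std::vector<double> a;
         explicit Mat(std::size_t n_) : n(n_), a(n_ * n_, 0.0) {}
         double& at(std::size_t i, std::size_t j) { return a[i * n + j]; }
         double  at(std::size_t i, std::size_t j) const { return a[i * n + j]; }
     };

     // alpha*X + beta*Y, elementwise (X and Y have the same dimension).
     static Mat lin(double alpha, const Mat& X, double beta, const Mat& Y) {
         Mat Z(X.n);
         for (std::size_t k = 0; k < Z.a.size(); ++k)
             Z.a[k] = alpha * X.a[k] + beta * Y.a[k];
         return Z;
     }

     // The (bi, bj) sub-block of the 2x2 partition of X.
     static Mat block(const Mat& X, std::size_t bi, std::size_t bj) {
         std::size_t h = X.n / 2;
         Mat Z(h);
         for (std::size_t i = 0; i < h; ++i)
             for (std::size_t j = 0; j < h; ++j)
                 Z.at(i, j) = X.at(bi * h + i, bj * h + j);
         return Z;
     }

     // Classical O(n^3) multiply, used at the base of the recursion.
     static Mat classical(const Mat& A, const Mat& B) {
         Mat C(A.n);
         for (std::size_t i = 0; i < A.n; ++i)
             for (std::size_t k = 0; k < A.n; ++k)
                 for (std::size_t j = 0; j < A.n; ++j)
                     C.at(i, j) += A.at(i, k) * B.at(k, j);
         return C;
     }

     Mat strassen(const Mat& A, const Mat& B, std::size_t cutoff = 128) {
         if (A.n <= cutoff || A.n % 2 != 0) return classical(A, B);
         // 1. Block partitioning (<2, 2, 2>).
         Mat A11 = block(A,0,0), A12 = block(A,0,1), A21 = block(A,1,0), A22 = block(A,1,1);
         Mat B11 = block(B,0,0), B12 = block(B,0,1), B21 = block(B,1,0), B22 = block(B,1,1);
         // 2.-4. Seven linear combinations of A's and B's sub-blocks, seven recursive multiplies.
         Mat M1 = strassen(lin(1,A11, 1,A22), lin(1,B11, 1,B22), cutoff);
         Mat M2 = strassen(lin(1,A21, 1,A22), B11,               cutoff);
         Mat M3 = strassen(A11,               lin(1,B12,-1,B22), cutoff);
         Mat M4 = strassen(A22,               lin(1,B21,-1,B11), cutoff);
         Mat M5 = strassen(lin(1,A11, 1,A12), B22,               cutoff);
         Mat M6 = strassen(lin(1,A21,-1,A11), lin(1,B11, 1,B12), cutoff);
         Mat M7 = strassen(lin(1,A12,-1,A22), lin(1,B21, 1,B22), cutoff);
         // 5. Linear combinations of the Mr form the four blocks of C.
         Mat C11 = lin(1, lin(1,M1, 1,M4), 1, lin(-1,M5, 1,M7));
         Mat C12 = lin(1,M3, 1,M5);
         Mat C21 = lin(1,M2, 1,M4);
         Mat C22 = lin(1, lin(1,M1,-1,M2), 1, lin(1,M3, 1,M6));
         Mat C(A.n);
         std::size_t h = A.n / 2;
         for (std::size_t i = 0; i < h; ++i)
             for (std::size_t j = 0; j < h; ++j) {
                 C.at(i, j)         = C11.at(i, j);
                 C.at(i, j + h)     = C12.at(i, j);
                 C.at(i + h, j)     = C21.at(i, j);
                 C.at(i + h, j + h) = C22.at(i, j);
             }
         return C;
     }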
  5. Key ingredients of fast matmul algorithms
     • 1. Block partitioning of matrices (<M, K, N>)
     • 2. R linear combinations of sub-blocks of A
     • 3. R linear combinations of sub-blocks of B
     • 4. R matrix multiplies to form the Mr (recursive); R < MKN → faster than classical
     • 5. Linear combinations of the Mr to form the Cij
     Key idea: trade N^3 computation (matmul) for more N^2 computation (adding matrices together).
  6. “Outer product” fast algorithm
     • <4, 2, 4> partitioning
     • R = 26 multiplies (< 4 * 2 * 4 = 32) → 23% speedup per recursive step (if everything else free)
     • Linear combinations of the Aij to form the Sr: 68 terms
     • Linear combinations of the Bij to form the Tr: 52 terms
     • Linear combinations of the Mr to form the Cij: 69 terms
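     The 23% figure, and the exponents quoted earlier, can be reproduced with a few lines of arithmetic: per recursive step a fast <M, K, N> algorithm replaces the M*K*N block multiplies of the classical algorithm with R of them (assuming, as the slide does, that everything else is free), and the recursion's asymptotic exponent is 3*log(R)/log(M*K*N). A small sketch; the report helper is just for illustration.

     #include <cmath>
     #include <cstdio>

     static void report(int M, int K, int N, int R) {
         double ops   = double(M) * K * N;                   // classical block multiplies per step
         double speed = ops / R;                             // per-recursive-step speedup
         double omega = 3.0 * std::log(R) / std::log(ops);   // asymptotic exponent of the recursion
         std::printf("<%d,%d,%d>, R=%d: %.0f%% per-step speedup, exponent %.2f\n",
                     M, K, N, R, 100.0 * (speed - 1.0), omega);
     }

     int main() {
         report(2, 2, 2, 7);    // Strassen: 14% per step, exponent 2.81
         report(4, 2, 4, 26);   // "outer product" algorithm: 23% per step, exponent 2.82
     }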
  7. [Figure-only slide: example fast algorithms from [Smirnov13] and [Strassen69]]
  8. Code generation lets us prototype algorithms quickly
     • We have a compact representation of many fast algorithms: 1. dimensions of the block partitioning (<M, K, N>); 2. linear combinations of sub-blocks (Sr, Tr); 3. linear combinations of the Mr to form the Cij
     • We use code generation to rapidly prototype fast algorithms
     • All code available at github.com/arbenson/fast-matmul
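     The compact representation amounts to three coefficient tables with one column per multiply: how the Sr are formed from the blocks of A, how the Tr are formed from the blocks of B, and how the blocks of C are formed from the Mr. The sketch below hard-codes such tables for Strassen's <2, 2, 2> algorithm and applies them to 1x1 blocks (plain doubles); the names U, V, W and the layout are my own illustration, not necessarily the repository's file format. The generator emits the same structure with matrix blocks and recursion.

     #include <cstdio>

     constexpr int R = 7;
     // Row-major block order: A = [a11 a12 a21 a22], B and C likewise.
     constexpr double U[4][R] = { {1, 0, 1, 0, 1, -1,  0},    // a11
                                  {0, 0, 0, 0, 1,  0,  1},    // a12
                                  {0, 1, 0, 0, 0,  1,  0},    // a21
                                  {1, 1, 0, 1, 0,  0, -1} };  // a22
     constexpr double V[4][R] = { {1, 1, 0,-1, 0,  1,  0},    // b11
                                  {0, 0, 1, 0, 0,  1,  0},    // b12
                                  {0, 0, 0, 1, 0,  0,  1},    // b21
                                  {1, 0,-1, 0, 1,  0,  1} };  // b22
     constexpr double W[4][R] = { {1, 0, 0, 1,-1,  0,  1},    // c11
                                  {0, 0, 1, 0, 1,  0,  0},    // c12
                                  {0, 1, 0, 1, 0,  0,  0},    // c21
                                  {1,-1, 1, 0, 0,  1,  0} };  // c22

     void fast_2x2(const double A[4], const double B[4], double C[4]) {
         double Mr[R];
         for (int r = 0; r < R; ++r) {
             double S = 0, T = 0;
             for (int i = 0; i < 4; ++i) S += U[i][r] * A[i];   // S_r from the blocks of A
             for (int j = 0; j < 4; ++j) T += V[j][r] * B[j];   // T_r from the blocks of B
             Mr[r] = S * T;                                     // the r-th multiply
         }
         for (int k = 0; k < 4; ++k) {
             C[k] = 0;
             for (int r = 0; r < R; ++r) C[k] += W[k][r] * Mr[r];   // combine the M_r
         }
     }

     int main() {
         double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4];
         fast_2x2(A, B, C);                                   // expect {19, 22, 43, 50}
         std::printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);
     }

     Changing the three tables (and the partition dimensions) is all it takes to describe a different fast algorithm, which is what makes code generation attractive for prototyping many of them.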
  9. Sequential performance. Effective GFLOPS for an M x K x N multiply = 1e-9 * 2 * M * K * N / (time in seconds). [Plot: sequential performance on N x N x N; Effective GFLOPS vs. dimension N; classical peak marked; curves for MKL, STRASSEN, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, S<4,3,3>]
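     A minimal helper for the metric used in all of the performance plots: every algorithm, fast or classical, is charged the classical flop count 2*M*K*N, so the curves are directly comparable and a fast algorithm can sit above the classical peak line. The numbers in main() are made up, only to show the calculation.

     #include <cstdio>

     double effective_gflops(double M, double K, double N, double seconds) {
         return 1e-9 * 2.0 * M * K * N / seconds;
     }

     int main() {
         // Hypothetical run: a 10000 x 10000 x 10000 multiply that took 40 seconds.
         double g = effective_gflops(10000, 10000, 10000, 40.0);
         std::printf("effective GFLOPS: %.1f\n", g);   // 2e12 flops / 40 s = 50 GFLOPS
     }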
  10. Sequential performance. [Plot: sequential performance on N x 1600 x N; Effective GFLOPS vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
     • Almost all algorithms beat MKL
     • <4, 2, 4> and <3, 2, 3> tend to perform the best
  11. Sequential performance. [Plot: sequential performance on N x 2400 x 2400; Effective GFLOPS vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN, S<4,3,3>]
     • Almost all algorithms beat MKL
     • <4, 3, 3> and <4, 2, 3> tend to perform the best
  12. Parallelization. [Diagram: the recursion tree; each multiply M1 ... M7 expands into its own M1 ... M7, and the results are combined into C]
  13. DFS parallelization. [Diagram: recursion tree traversed one subproblem at a time; all threads work on each node; use parallel MKL at the base case]
     + Easy to implement
     + Load balanced
     + Same memory footprint as sequential
     - Need large base cases for high performance
  14. BFS parallelization. [Diagram: each subproblem M1 ... M7 runs as its own task on 1 thread; omp taskwait synchronizes each recursion level]
     + High performance for smaller base cases
     - Sometimes harder to load balance: e.g., 24 threads, 49 subproblems
     - More memory
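     A runnable toy of the BFS structure described here: the R independent subproblems of one recursive step become OpenMP tasks, each executed by one thread, and omp taskwait synchronizes before the Mr are combined. The do_subproblem function is a stand-in for forming Sr and Tr and multiplying them; it is not the repository's interface. Compile with -fopenmp.

     #include <omp.h>
     #include <cstdio>
     #include <vector>

     static double do_subproblem(int r) {
         // Placeholder for: form S_r and T_r, then compute M_r = S_r * T_r
         // with a single-threaded multiply (or a further recursive step).
         double x = 0.0;
         for (int i = 0; i < 1000000; ++i) x += (r + 1) * 1e-6;
         return x;
     }

     int main() {
         const int R = 26;                    // e.g., the <4,2,4> algorithm
         std::vector<double> M(R, 0.0);       // the R subproblem results

         #pragma omp parallel
         #pragma omp single
         {
             for (int r = 0; r < R; ++r) {
                 #pragma omp task shared(M) firstprivate(r)
                 M[r] = do_subproblem(r);     // BFS: each task runs on one thread
             }
             #pragma omp taskwait             // all M_r must finish before combining
         }

         double c = 0.0;                      // stand-in for combining the M_r into C
         for (double m : M) c += m;
         std::printf("combined result: %f (max threads: %d)\n", c, omp_get_max_threads());
     }

     The load-balance issue on this slide falls out of the task counts: with 24 threads and, say, 49 subproblems at the second level of recursion, the tasks do not divide evenly, so some threads sit idle while others finish an extra multiply.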
  15. HYBRID parallelization. [Diagram: recursion tree with omp taskwait at each level; most subproblems run on 1 thread each, the remainder use all threads]
     + Better load balancing
     - Needs explicit synchronization, or else we can over-subscribe threads
  16. [Plot: parallel performance of <4,2,4> (26 multiplies) on <N, 2800, N>; Effective GFLOPS/core vs. dimension N; curves for MKL, DFS, BFS, and HYBRID on 6 and 24 cores]
  17. Parallel performance. [Plots: performance on N x N x N at 6 cores and at 24 cores; Effective GFLOPS/core vs. dimension N; curves for MKL, STRASSEN, S<4,3,3>, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, <2,5,2>]
     • 6 cores: similar performance to sequential
     • 24 cores: can sometimes edge out MKL
  18. Parallel performance. [Plots: performance on N x 2800 x N at 6 cores and at 24 cores; Effective GFLOPS/core vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN; a region of bad MKL performance is marked]
     • 6 cores: similar performance to sequential
     • 24 cores: MKL best for large problems
  19. Parallel performance. [Plots: performance on N x 3000 x 3000 at 6 cores and at 24 cores; Effective GFLOPS/core vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
     • 6 cores: similar performance to sequential
     • 24 cores: MKL usually the best
  20. Bandwidth problems
     • We rely on matrix multiplications being much more expensive than matrix additions
     • Parallel dgemm on 24 cores: easily get 50-75% of peak
     • STREAM benchmark: < 6x speedup in read/write performance on 24 cores
     [Diagram: the additions that combine M1 ... M7 into C]
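     A rough arithmetic-intensity estimate of why the additions, not the multiplies, hit the bandwidth wall. The byte counts assume double precision and a single pass over each operand; they are illustrative, not measurements from the paper.

     #include <cstdio>

     int main() {
         double n = 4000.0;                       // a block size somewhere in the recursion
         double bytes = 3.0 * 8.0 * n * n;        // read A, read B, write C (doubles)

         double add_flops = n * n;                // C = A + B
         double mul_flops = 2.0 * n * n * n;      // C = A * B (classical count)

         std::printf("addition intensity: %.3f flops/byte\n", add_flops / bytes);
         std::printf("multiply intensity: %.1f flops/byte\n", mul_flops / bytes);
         // Roughly 0.04 vs. 333 flops/byte: the additions run at memory bandwidth,
         // so a machine where 24 cores get < 6x the bandwidth of 1 core hurts the
         // fast algorithms, which do relatively more additions than classical matmul.
     }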
  21. High-level conclusions
     • Sequential: match the shape of the algorithm to the shape of the matrix. Straightforward to beat the vendor implementation.
     • Parallel: bandwidth limits performance → should be less of an issue in distributed memory
     Future work:
     • Distributed-memory implementation
     • Numerical stability (do not panic)
  22. A FRAMEWORK FOR PRACTICAL PARALLEL FAST MATRIX MULTIPLICATION. github.com/arbenson/fast-matmul. Austin Benson (arbenson@stanford.edu), Stanford University. Joint work with Grey Ballard, Sandia. PPoPP 2015, San Francisco, CA. [Closing slide repeats the title-slide plot: Performance (6 cores) on N x N x N]
