
Sandia Fast Matmul

Slides from my talk at Sandia.


  1. A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION (arXiv: 1409.2908)
     Austin Benson (arbenson@stanford.edu), ICME, Stanford
     Grey Ballard, Sandia National Laboratories
     Sandia National Laboratories, October 21, 2014
     [Plot: parallel performance of <4,2,4> on <N,2800,N>, effective GFLOPS per core vs. dimension N; curves for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]
  2. Fast matrix multiplication: bridging theory and practice
     • There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently [Smirnov13], [Benson&Ballard14].
     • We show that they can achieve higher performance than Intel’s Math Kernel Library (MKL).
     • We use code generation for extensive prototyping.
     [Figure: number line of the matrix multiplication exponent, from 3 (classical) down to 2.81 [Strassen69] and 2.37 [Williams12], approaching the lower bound of 2.]
  3. Strassen’s algorithm
  4. Key ingredients of Strassen’s algorithm
     1. Block partitioning of matrices (<2, 2, 2>)
     2. Seven linear combinations of sub-blocks of A
     3. Seven linear combinations of sub-blocks of B
     4. Seven matrix multiplies to form Mr (recursive)
     5. Linear combinations of Mr to form Cij
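For reference (these are the standard Strassen formulas; they are not spelled out in the extracted slide text), the seven products and the output combinations are:

    M1 = (A11 + A22)(B11 + B22)
    M2 = (A21 + A22) B11
    M3 = A11 (B12 - B22)
    M4 = A22 (B21 - B11)
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6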
  5. Key ingredients of fast matmul algorithms
     1. Block partitioning of matrices (<M, K, N>)
     2. R linear combinations of sub-blocks of A
     3. R linear combinations of sub-blocks of B
     4. R matrix multiplies to form Mr (recursive); R < MKN means faster than classical
     5. Linear combinations of Mr to form Cij
     Main idea: trade N^3 computations (matmul) for more N^2 computations (matrix additions). A recursion sketch follows this slide.
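To make the recipe concrete, here is a minimal, self-contained Strassen recursion in C++. It is a sketch under simplifying assumptions (square power-of-two sizes, a naive triple loop instead of MKL at the base case) and is not the authors' implementation:

    #include <cstddef>
    #include <vector>

    using Mat = std::vector<double>;  // row-major n x n matrix

    // Classical triple loop, used at the recursion cutoff. A real code would
    // call a tuned dgemm (e.g., MKL) here instead. C must be zero-initialized.
    static void classical(const Mat& A, const Mat& B, Mat& C, int n) {
      for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
          for (int j = 0; j < n; ++j)
            C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    // Z = X + sign * Y (the "linear combinations of sub-blocks" step).
    static Mat add(const Mat& X, const Mat& Y, double sign) {
      Mat Z(X.size());
      for (std::size_t i = 0; i < X.size(); ++i) Z[i] = X[i] + sign * Y[i];
      return Z;
    }

    // Extract the (bi, bj) sub-block of a 2 x 2 partitioning.
    static Mat block(const Mat& X, int n, int bi, int bj) {
      int h = n / 2;
      Mat Y(h * h);
      for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j)
          Y[i * h + j] = X[(bi * h + i) * n + (bj * h + j)];
      return Y;
    }

    // Strassen's <2,2,2> algorithm: 7 recursive multiplies instead of 8.
    // n must be a power of two; 'cutoff' controls when to switch to classical.
    // C must be pre-sized to n*n and zero-initialized.
    void strassen(const Mat& A, const Mat& B, Mat& C, int n, int cutoff) {
      if (n <= cutoff) { classical(A, B, C, n); return; }
      int h = n / 2;
      Mat A11 = block(A, n, 0, 0), A12 = block(A, n, 0, 1),
          A21 = block(A, n, 1, 0), A22 = block(A, n, 1, 1);
      Mat B11 = block(B, n, 0, 0), B12 = block(B, n, 0, 1),
          B21 = block(B, n, 1, 0), B22 = block(B, n, 1, 1);
      std::vector<Mat> M(7, Mat(h * h, 0.0));
      strassen(add(A11, A22, +1), add(B11, B22, +1), M[0], h, cutoff);  // M1
      strassen(add(A21, A22, +1), B11,               M[1], h, cutoff);  // M2
      strassen(A11,               add(B12, B22, -1), M[2], h, cutoff);  // M3
      strassen(A22,               add(B21, B11, -1), M[3], h, cutoff);  // M4
      strassen(add(A11, A12, +1), B22,               M[4], h, cutoff);  // M5
      strassen(add(A21, A11, -1), add(B11, B12, +1), M[5], h, cutoff);  // M6
      strassen(add(A12, A22, -1), add(B21, B22, +1), M[6], h, cutoff);  // M7
      // Combine: C11 = M1+M4-M5+M7, C12 = M3+M5, C21 = M2+M4, C22 = M1-M2+M3+M6.
      for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) {
          int q = i * h + j;
          C[i * n + j]             = M[0][q] + M[3][q] - M[4][q] + M[6][q];
          C[i * n + (j + h)]       = M[2][q] + M[4][q];
          C[(i + h) * n + j]       = M[1][q] + M[3][q];
          C[(i + h) * n + (j + h)] = M[0][q] - M[1][q] + M[2][q] + M[5][q];
        }
    }

A general <M, K, N> algorithm has exactly the same structure, with the hard-coded combinations replaced by the R rows of its coefficient tables.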
  6. “Outer product” fast algorithm
     • <4, 2, 4> partitioning
     • R = 26 multiplies (< 4 * 2 * 4 = 32), giving a 23% speedup per recursive step (if everything else were free); see the arithmetic below
     • Linear combinations of Aij to form Sr: 68 terms
     • Linear combinations of Bij to form Tr: 52 terms
     • Linear combinations of Mr to form Cij: 69 terms
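Where the 23% comes from: one recursive step of the classical algorithm on this partitioning would perform MKN = 32 block multiplies, while the fast algorithm performs R = 26 of roughly the same size, so

    speedup per recursive step ≈ MKN / R = 32 / 26 ≈ 1.23,

i.e., about 23% faster per step if the additions were free.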
  7. Discovering fast algorithms is a numerical challenge
     • Low-rank tensor decompositions lead to fast algorithms
     • The tensors are small, but we need exact decompositions (an NP-hard problem)
     • Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]
     • We have around 10 fast algorithms for <M, K, N> partitions. Also have permutations, e.g., <K, M, N>.
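The connection the slide refers to, added here for context (a standard fact about bilinear algorithms): a fast <M, K, N> algorithm with R multiplies is exactly a rank-R decomposition of the matrix multiplication tensor of size MK x KN x MN,

    T_<M,K,N> = sum_{r=1}^{R} u_r ⊗ v_r ⊗ w_r,

where u_r holds the coefficients of S_r over the sub-blocks of A, v_r holds the coefficients of T_r over the sub-blocks of B, and w_r tells how M_r contributes to the blocks of C.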
  8. [Figure: fast algorithms from [Strassen69] and [Smirnov13].]
  9. Code generation lets us prototype algorithms quickly
     • We have a compact representation of many fast algorithms:
       1. dimensions of the block partitioning (<M, K, N>)
       2. linear combinations of sub-blocks (Sr, Tr)
       3. linear combinations of Mr to form Cij
     • We use code generation to rapidly prototype fast algorithms (an example encoding is sketched below)
     • Our approach: test all algorithms on many different problem sizes and look for patterns
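As an illustration of what such a compact representation can look like (a common convention, not necessarily the authors' exact file format): an algorithm is stored as three coefficient tables U, V, W with one row per multiply. For Strassen's <2,2,2> algorithm, row r of U gives the coefficients of S_r over (A11, A12, A21, A22):

         A11  A12  A21  A22
    S1:   1    0    0    1
    S2:   0    0    1    1
    S3:   1    0    0    0
    S4:   0    0    0    1
    S5:   1    1    0    0
    S6:  -1    0    1    0
    S7:   0    1    0   -1

with V and W encoding the Tr and the Cij combinations the same way. A generator only needs these coefficients plus <M, K, N> to emit the addition and recursion code.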
  10. Sequential performance
      [Plot: sequential performance on N x N x N, effective GFLOPS vs. dimension N; curves for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>; the true hardware peak is marked.]
      Effective GFLOPS for M x P x Q multiplies = 1e-9 * 2 * M * P * Q / (time in seconds)
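A minimal helper for the metric above (a sketch of the definition on the slide; since a fast algorithm performs less arithmetic than the classical 2*M*P*Q flops, its effective GFLOPS can legitimately exceed the true hardware peak):

    #include <cstdint>

    // Effective GFLOPS for an M x P x Q multiply: the classical flop count
    // 2*M*P*Q divided by the measured time, scaled to giga-flops.
    double effective_gflops(std::int64_t M, std::int64_t P, std::int64_t Q,
                            double seconds) {
      return 1e-9 * 2.0 * static_cast<double>(M) * P * Q / seconds;
    }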
  11. Sequential performance
      [Plot: sequential performance on N x N x N, effective GFLOPS vs. dimension N; curves for MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, <6,3,3>.]
      • All algorithms beat MKL on large problems
      • Strassen’s algorithm is hard to beat
  12. Sequential performance
      [Plot: sequential performance on N x 1600 x N, effective GFLOPS vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
      • Almost all algorithms beat MKL
      • <4, 2, 4> and <3, 2, 3> tend to perform the best
  13. Sequential performance
      [Plot: sequential performance on N x 2400 x 2400, effective GFLOPS vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
      • Almost all algorithms beat MKL
      • <4, 3, 3> and <4, 2, 3> tend to perform the best
  14. Practical issues, or what is not obvious when...
      • ...you just think about the theory
      • ...your performance models are too simple
  15. Practical issue #1: when to stop recursion? Look at the gemm curve.
      [Plots: sequential dgemm performance (GFLOPS vs. dimension N) for N x 800 x 800, N x 800 x N, and N x N x N, with the peak marked; parallel dgemm performance per core on 24 cores.]
      Basic idea: take another recursive step if the sub-problems will still operate at high performance. Example: <M, K, N> = <4, 2, 3>.
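A sketch of how such a cutoff rule might look in code. The threshold and the way "high performance" is judged are hypothetical; the slide's rule is qualitative and based on the measured dgemm curve:

    // Take another recursive step only if the sub-problems are still large
    // enough that dgemm runs near its peak; otherwise call dgemm directly.
    // 'min_dim_for_peak' is a machine-dependent, empirically chosen threshold.
    bool should_recurse(long m, long k, long n, long min_dim_for_peak) {
      // Sub-problem dimensions after one <M, K, N> = <4, 2, 3> step.
      long sub_m = m / 4, sub_k = k / 2, sub_n = n / 3;
      return sub_m >= min_dim_for_peak &&
             sub_k >= min_dim_for_peak &&
             sub_n >= min_dim_for_peak;
    }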
  16. Practical issue #2: matrix additions
      [Diagram: forming S1, ..., S7 from the sub-blocks A11, A12, A21, A22.]
      “Pairwise” approach: build each Sr with repeated DAXPY-style updates (Y := alpha*X + Y), e.g., 2x DAXPY per Sr.
  17. Practical issue #2: matrix additions
      [Diagram: forming S1, ..., S7 from the sub-blocks A11, A12, A21, A22.]
      “Write once” approach: custom “DAXPY”-like kernels that write each Sr only once.
  18. Practical issue #2: matrix additions
      [Diagram: forming S1, ..., S7 from the sub-blocks A11, A12, A21, A22.]
      “Streaming” approach: entry-wise updates, streaming through the input blocks and updating all of the Sr.
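A small sketch of the difference between the pairwise and fused ("write once" / streaming) styles for a single Sr, say Sr = A11 + A21 + A22 (this particular combination is made up for illustration, and blocks are simplified to contiguous arrays):

    #include <cstddef>

    // Pairwise: one DAXPY-style pass per term. S is read and rewritten for
    // every term, so memory traffic grows with the number of terms.
    void form_S_pairwise(const double* A11, const double* A21,
                         const double* A22, double* S, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i) S[i] = A11[i];    // copy
      for (std::size_t i = 0; i < n; ++i) S[i] += A21[i];   // daxpy-like pass
      for (std::size_t i = 0; i < n; ++i) S[i] += A22[i];   // daxpy-like pass
    }

    // Fused ("write once"): each entry of S is written exactly once and each
    // input entry is read exactly once in a single streaming pass.
    void form_S_fused(const double* A11, const double* A21,
                      const double* A22, double* S, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i)
        S[i] = A11[i] + A21[i] + A22[i];
    }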
  19. Practical issue #3: redundant linear combinations
      • Example in the <4, 2, 4> algorithm (R = 26 multiplies): [Diagram: T11 and T25 are each formed from the sub-blocks B12, B22, B23, B24.]
      • Forming them independently: four additions, six reads, two writes
  20. Common subexpression elimination
      • Example in the <4, 2, 4> algorithm (R = 26 multiplies): [Diagram: a shared intermediate Y is computed once and reused to form T11 and T25 from B12, B22, B23, B24.]
      • Three additions, six reads, three writes, so a net increase in communication!
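A concrete instance of the trade-off. The combinations below are made up for illustration (the actual <4,2,4> coefficients are not in the extracted text), but they reproduce the operation counts on the two slides:

    Without CSE:
      T11 = B12 + B22 + B23      (2 additions, 3 reads, 1 write)
      T25 = B22 + B23 + B24      (2 additions, 3 reads, 1 write)
      Total: 4 additions, 6 reads, 2 writes

    With CSE:
      Y   = B22 + B23            (1 addition, 2 reads, 1 write)
      T11 = B12 + Y              (1 addition, 2 reads, 1 write)
      T25 = Y + B24              (1 addition, 2 reads, 1 write)
      Total: 3 additions, 6 reads, 3 writes

One addition is saved, but the extra write of Y (and the re-reads of it) increases data movement, which is why CSE does not pay off for a bandwidth-bound step.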
  21. Common subexpression elimination (CSE) does not really help
  22. Practical issue #4: parallelization on shared memory
  23. Fast matmul recursion tree
      [Diagram: C is combined from M1, M2, ..., M7; each Mr is itself the root of another M1, ..., M7 subtree, and so on recursively.]
  24. DFS parallelization
      [Diagram: the recursion tree is traversed one subproblem at a time; all threads work on the current node, using parallel MKL.]
      + Easy to implement
      + Load balanced
      + Same memory footprint as sequential
      - Need large base cases for high performance
  25. BFS parallelization
      [Diagram: the subproblems at one level are spawned as tasks (omp taskwait at the end), each handled by 1 thread.]
      + High performance for smaller base cases
      - Sometimes harder to load balance: 24 threads, 49 subproblems
      - More memory
      (A minimal OpenMP task sketch follows.)
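A minimal sketch of the BFS style with OpenMP tasks, reusing the Mat type and the strassen() function from the earlier sketch (this shows the structure only, not the authors' implementation; two such levels give the 49 = 7*7 subproblems mentioned above):

    #include <omp.h>
    #include <vector>

    // BFS step: the 7 subproblems of one Strassen level are spawned as
    // OpenMP tasks, each executed by a single thread; taskwait ensures all
    // products are ready before they are combined into C.
    void bfs_level(const std::vector<Mat>& S, const std::vector<Mat>& T,
                   std::vector<Mat>& M, int n, int cutoff) {
      #pragma omp parallel
      #pragma omp single
      {
        for (int r = 0; r < 7; ++r) {
          #pragma omp task firstprivate(r)
          strassen(S[r], T[r], M[r], n, cutoff);
        }
        #pragma omp taskwait
        // ...combine the M[r] into the blocks of C here...
      }
    }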
  26. HYBRID parallelization
      [Diagram: some subproblems run on 1 thread each (BFS-style, with omp taskwait); the remainder run with all threads (DFS-style).]
      + Better load balancing
      - Requires explicit synchronization, or else we can over-subscribe threads
  27. [Plot: parallel performance of <4,2,4> on <N,2800,N>, effective GFLOPS per core vs. dimension N; curves for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]
  28. Bandwidth problems
      • We rely on matrix multiplications being much more expensive than matrix additions
      • Parallel dgemm on 24 cores: easily get 50-75% of peak
      • STREAM benchmark: < 6x speedup in read/write performance on 24 cores
      [Diagram: the step that combines M1, M2, ..., M7 into C.]
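Why the additions hit the bandwidth limit (a back-of-the-envelope estimate added here, not from the slides): a matrix addition does one flop per entry while touching three doubles, whereas a classical multiply does O(N) flops per entry touched.

    Addition C = A + B (double precision):
      1 flop per entry, 3 * 8 = 24 bytes moved per entry
      arithmetic intensity ≈ 1/24 flop per byte  ->  bandwidth bound

    Classical multiply, N x N x N:
      2*N^3 flops over about 3*8*N^2 bytes of data
      arithmetic intensity ≈ N/12 flop per byte  ->  compute bound for large N

So when dgemm scales well to 24 cores but reads and writes speed up by less than 6x, the additions (and the combine step in the diagram above) become the bottleneck.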
  29. Parallel performance
      [Plots: performance on N x N x N, effective GFLOPS per core vs. dimension N, on 6 cores and on 24 cores; curves for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>.]
      • 6 cores: similar performance to sequential
      • 24 cores: can sometimes beat MKL, but barely
  30. Parallel performance
      [Plots: performance on N x 2800 x N, effective GFLOPS per core vs. dimension N, on 6 cores and on 24 cores; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN; a region of bad MKL performance is annotated.]
      • 6 cores: similar performance to sequential
      • 24 cores: MKL best for large problems
  31. Parallel performance
      [Plots: performance on N x 3000 x 3000, effective GFLOPS per core vs. dimension N, on 6 cores and on 24 cores; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
      • 6 cores: similar performance to sequential
      • 24 cores: MKL usually the best
  32. High-level conclusions
      • For square matrix multiplication, Strassen’s algorithm is hard to beat
      • For rectangular matrix multiplication, use a fast algorithm that “matches the shape”
      • Bandwidth limits the performance of shared-memory parallel fast matrix multiplication; this should be less of an issue in distributed memory
      Future work:
      • Numerical stability
      • Using fast matmul as a kernel for other algorithms in numerical linear algebra
  33. A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION (arXiv: 1409.2908)
      Austin Benson (arbenson@stanford.edu), ICME, Stanford
      Grey Ballard, Sandia National Laboratories
      Sandia National Laboratories, October 21, 2014
      [Closing slide: repeats the title slide and the <4,2,4> on <N,2800,N> performance plot.]
