
fast-matmul-cse15


Slides from my contributed talk at SIAM CSE 2015 in Salt Lake City, UT.


  1. A FRAMEWORK FOR PRACTICAL PARALLEL FAST MATRIX MULTIPLICATION. Austin Benson (arbenson@stanford.edu), Stanford University. Joint work with Grey Ballard, Sandia. SIAM CSE 2015, Salt Lake City, UT. Code and paper: github.com/arbenson/fast-matmul. [Figure: performance (24 cores) on N x N x N, effective GFLOPS/core vs. dimension N, for MKL, STRASSEN, S<4,3,3>, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, <2,5,2>.]
  2. Fast matrix multiplication: bridging theory and practice. There are a number of Strassen-like algorithms for matrix multiplication that have only been "discovered" recently [Smirnov13], [Benson&Ballard14]. How well do they work in practice? [Figure: the exponent of matrix multiplication lies between 2 and 3: the classical algorithm gives 3, Strassen's algorithm gives 2.81 [Strassen69], and the best known bound is 2.37 [Le Gall14].]
  3. Strassen's algorithm. [Figure: the seven products of Strassen's algorithm and the additions that combine them.]
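The equations on this slide do not survive the extraction, but they are standard: Strassen's algorithm multiplies a 2 x 2 block partitioning of A and B with seven block products instead of eight, and applying it recursively gives the O(n^2.81) running time.

    M1 = (A11 + A22)(B11 + B22)
    M2 = (A21 + A22) B11
    M3 = A11 (B12 - B22)
    M4 = A22 (B21 - B11)
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6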
  4. [Figure: the fast algorithms tested, drawn from [Smirnov13] and [Strassen69].] All implemented with code generation.
  5. Sequential performance. Effective GFLOPS for an M x K x N multiply = 1e-9 * 2 * M * K * N / (time in seconds). [Figure: sequential performance on N x N x N, effective GFLOPS vs. dimension N, for MKL, STRASSEN, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, and S<4,3,3>, with the classical peak marked.]
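A minimal sketch of the metric (the helper name is mine, not from the fast-matmul code). Fast algorithms perform fewer than 2MKN flops, so this normalization gives a rate that is comparable across algorithms rather than a true flop count.

    #include <cstdio>

    // Effective GFLOPS for an M x K x N multiply, as defined on the slide.
    double effective_gflops(double M, double K, double N, double seconds) {
        return 1e-9 * 2.0 * M * K * N / seconds;
    }

    int main() {
        // Example: an 8000 x 8000 x 8000 multiply that took 25 seconds
        // corresponds to about 41 effective GFLOPS.
        std::printf("%.1f\n", effective_gflops(8000, 8000, 8000, 25.0));
        return 0;
    }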
  6. Sequential performance (continued). [Figure: sequential performance on N x 1600 x N, effective GFLOPS vs. dimension N, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, and STRASSEN.] Almost all algorithms beat MKL; <4,2,4> and <3,2,3> tend to perform the best.
  7. DFS parallelization. [Figure: recursion tree for C = M1 + M2 + ... + M7; all threads descend into one subproblem at a time, using parallel MKL.] + Easy to implement. - Needs large base cases for high performance.
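A minimal sketch of the DFS strategy with the arithmetic stubbed out (the real framework is generated C++ code): the recursion itself is sequential, and all parallelism comes from the threaded kernel at the leaves.

    #include <omp.h>

    // Stand-in for a call into a threaded BLAS (parallel MKL in the talk):
    // all threads cooperate on a single classical multiply.
    void parallel_base_case() {
        #pragma omp parallel
        {
            // ... each thread computes a slice of one base-case multiply ...
        }
    }

    // DFS: walk the seven subproblems of each recursion step one after
    // another. Parallelism comes only from the kernel at the leaves, which
    // is why DFS needs large base cases to run fast.
    void multiply_dfs(int depth) {
        if (depth == 0) {
            parallel_base_case();
            return;
        }
        for (int i = 0; i < 7; ++i)
            multiply_dfs(depth - 1);
    }

    int main() {
        multiply_dfs(2);  // 49 leaf multiplies, each using all threads
        return 0;
    }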
  8. BFS parallelization. [Figure: recursion tree; the seven subproblems are spawned as one-thread OpenMP tasks and joined with omp taskwait.] + High performance for smaller base cases. - Sometimes harder to load balance: 24 threads, 49 subproblems. - More memory.
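The corresponding sketch for BFS, again with the arithmetic stubbed out: each subproblem becomes an OpenMP task executed by one thread, and omp taskwait joins the seven products before they are combined.

    #include <omp.h>

    // BFS: spawn the seven subproblems as OpenMP tasks, one thread per
    // task, and join them before combining. Two levels of recursion give
    // 49 leaf tasks, which do not divide evenly among 24 threads (the
    // load-balancing drawback on the slide).
    void multiply_bfs(int depth) {
        if (depth == 0) {
            // ... sequential base-case multiply on one thread ...
            return;
        }
        for (int i = 0; i < 7; ++i) {
            #pragma omp task
            multiply_bfs(depth - 1);
        }
        #pragma omp taskwait  // all seven M_i must finish before forming C
    }

    int main() {
        #pragma omp parallel
        #pragma omp single
        multiply_bfs(2);
        return 0;
    }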
  9. HYBRID parallelization. [Figure: recursion tree; most subproblems run as one-thread tasks, the rest use all threads, with omp taskwait in between.] + Better load balancing. - Requires explicit synchronization, or else we can over-subscribe threads.
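One plausible arrangement of the hybrid (a sketch under my own assumption about the split, not necessarily the framework's exact schedule): run most subproblems as single-threaded tasks, synchronize, then let the leftover subproblem engage all threads.

    #include <omp.h>

    // HYBRID: six of the seven subproblems run as one-thread tasks (the
    // BFS part); after an explicit taskwait, the seventh recurses and its
    // tasks are picked up by the now-idle threads (the DFS part). Without
    // the taskwait, that step could overlap the earlier tasks and
    // over-subscribe the cores.
    void multiply_hybrid(int depth) {
        if (depth == 0) {
            // ... base-case multiply ...
            return;
        }
        for (int i = 0; i < 6; ++i) {
            #pragma omp task
            multiply_hybrid(depth - 1);  // BFS part: one thread each
        }
        #pragma omp taskwait             // join before the all-threads step
        multiply_hybrid(depth - 1);      // DFS part: seventh subproblem
    }

    int main() {
        #pragma omp parallel
        #pragma omp single
        multiply_hybrid(2);
        return 0;
    }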
  10. Parallel performance. [Figure: performance on N x N x N at 6 cores and at 24 cores, effective GFLOPS/core vs. dimension N, for MKL, STRASSEN, S<4,3,3>, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, <2,5,2>.] 6 cores: similar performance to sequential. 24 cores: can sometimes edge out MKL.
  11. Parallel performance (continued). [Figure: performance on N x 2800 x N at 6 cores and at 24 cores, effective GFLOPS/core vs. dimension N, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.] 6 cores: similar performance to sequential. 24 cores: MKL is best for large problems.
  12. A FRAMEWORK FOR PRACTICAL PARALLEL FAST MATRIX MULTIPLICATION (closing slide, repeating the title slide). Code and paper: github.com/arbenson/fast-matmul. Austin Benson (arbenson@stanford.edu), Stanford University. Joint work with Grey Ballard, Sandia. SIAM CSE 2015, Salt Lake City, UT.
