A FRAMEWORK FOR PRACTICAL
FAST MATRIX MULTIPLICATION
arXiv: 1409.2908
Austin Benson (arbenson@stanford.edu), ICME, Stanford
Grey Ballard, Sandia National Laboratories
Sandia National Laboratories, October 21, 2014

[Figure: parallel performance of <4,2,4> on <N,2800,N>; effective GFLOPS / core vs. dimension N (up to 1.8 x 10^4) for MKL, DFS, BFS, and HYBRID on 6 and 24 cores]
Fast matrix multiplication:
bridging theory and practice
• There are a number of Strassen-like algorithms for matrix
multiplication that have only been "discovered" recently.
[Smirnov13], [Benson&Ballard14]
• We show that they can achieve higher performance than Intel's
Math Kernel Library (MKL).
• We use code generation for extensive prototyping.

[Figure: number line of the matrix multiplication exponent, marking 2, 2.81 [Strassen79], 2.37 [Williams12], and 3]
Strassen’s algorithm 
Key ingredients of Strassen's algorithm
1. Block partitioning of matrices (<2, 2, 2>)
2. Seven linear combinations of sub-blocks of A
3. Seven linear combinations of sub-blocks of B
4. Seven matrix multiplies to form the Mr (recursive)
5. Linear combinations of the Mr to form the Cij
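The five ingredients above can be sketched as one level of recursion in NumPy. This is an illustrative sketch, not the paper's implementation (which is generated C++ that recurses and calls dgemm at the base case):

```python
import numpy as np

def strassen_2x2(A, B):
    """One step of Strassen's <2,2,2> algorithm (dimensions must be even)."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # Seven linear combinations of sub-blocks of A and of B,
    # then seven multiplies (np.dot here; recursive in the real algorithm).
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Linear combinations of the Mr form the blocks of C.
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```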
Key ingredients of fast matmul algorithms
1. Block partitioning of matrices (<M, K, N>)
2. R linear combinations of sub-blocks of A
3. R linear combinations of sub-blocks of B
4. R matrix multiplies to form the Mr (recursive);
R < MKN means faster than classical
5. Linear combinations of the Mr to form the Cij

Main idea: trade O(N^3) computations (matmul) for more O(N^2)
computations (matrix additions).
"Outer product" fast algorithm
• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32)
→ 23% speedup per recursive step (if everything else were free)
• Linear combinations of the Aij to form the Sr: 68 terms
• Linear combinations of the Bij to form the Tr: 52 terms
• Linear combinations of the Mr to form the Cij: 69 terms
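Both the 23% figure and the asymptotic rate follow from R and the partition sizes: one recursive step replaces M·K·N block multiplies with R, and the standard estimate for the resulting exponent is 3·log(R)/log(M·K·N). A quick check in Python:

```python
import math

def per_step_speedup(M, K, N, R):
    # One recursive step does R multiplies instead of M*K*N
    # (ignoring the cost of the additions).
    return (M * K * N) / R

def exponent(M, K, N, R):
    # Asymptotic exponent 3*log(R)/log(M*K*N): exact for square
    # partitions, the standard estimate for rectangular ones.
    return 3 * math.log(R) / math.log(M * K * N)

print(round(per_step_speedup(4, 2, 4, 26), 2))  # 1.23 -> the 23% speedup
print(round(exponent(4, 2, 4, 26), 2))          # 2.82
print(round(exponent(2, 2, 2, 7), 2))           # 2.81 (Strassen)
```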
Discovering fast algorithms is a
numerical challenge
• Low-rank tensor decompositions lead to fast algorithms
• The tensors are small, but we need exact decompositions
→ NP-hard
• Use alternating least squares with regularization and
rounding tricks [Smirnov13], [Benson&Ballard14]
• We have around 10 fast algorithms for <M, K, N> partitions.
We also have permutations, e.g., <K, M, N>.
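The alternating least squares idea can be sketched with NumPy: fix two factor matrices of a CP decomposition, solve a linear least-squares problem for the third, and cycle. This is a bare-bones illustration only, without the regularization and rounding tricks the actual search needs:

```python
import numpy as np

def als_sweep(T, U, V, W):
    """One ALS sweep for T ~= sum_r U[:,r] (x) V[:,r] (x) W[:,r]."""
    I, J, K = T.shape
    # Solve for U with V, W fixed (mode-1 unfolding vs. Khatri-Rao product).
    KR = np.einsum('jr,kr->jkr', V, W).reshape(J * K, -1)
    U = np.linalg.lstsq(KR, T.reshape(I, J * K).T, rcond=None)[0].T
    # Solve for V with U, W fixed.
    KR = np.einsum('ir,kr->ikr', U, W).reshape(I * K, -1)
    V = np.linalg.lstsq(KR, T.transpose(1, 0, 2).reshape(J, I * K).T, rcond=None)[0].T
    # Solve for W with U, V fixed.
    KR = np.einsum('ir,jr->ijr', U, V).reshape(I * J, -1)
    W = np.linalg.lstsq(KR, T.transpose(2, 0, 1).reshape(K, I * J).T, rcond=None)[0].T
    return U, V, W

def residual(T, U, V, W):
    return np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', U, V, W))
```

Each exact least-squares solve cannot increase the objective, so the residual is non-increasing; the hard part (which the regularization and rounding address) is landing on an *exact* decomposition.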
[Figure: the fast algorithms of [Strassen69] and [Smirnov13]]
Code generation lets us prototype
algorithms quickly
• We have a compact representation of many fast algorithms:
1. dimensions of the block partitioning (<M, K, N>)
2. linear combinations of sub-blocks (Sr, Tr)
3. linear combinations of the Mr to form the Cij
• We use code generation to rapidly prototype fast algorithms
• Our approach: test all algorithms on a range of different
problem sizes and look for patterns
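Given the compact representation, emitting the straight-line addition code is mechanical. A toy generator (hypothetical, printing pseudo-C; the actual generator produces full C++ routines):

```python
def gen_adds(name, coeffs):
    """Generate pseudo-C for S_r = sum_i coeffs[r][i] * A_i, where
    coeffs is the compact coefficient table for one operand."""
    lines = []
    for r, row in enumerate(coeffs, start=1):
        terms = []
        for i, c in enumerate(row, start=1):
            if c == 0:
                continue
            sign = '-' if c < 0 else ('+' if terms else '')
            scale = '' if abs(c) == 1 else f'{abs(c)} * '
            terms.append(f'{sign} {scale}{name}{i}'.strip())
        lines.append(f'S{r} = ' + ' '.join(terms) + ';')
    return lines

# Strassen's first two combinations, S1 = A11 + A22 and S2 = A21 + A22,
# with sub-blocks ordered A1=A11, A2=A12, A3=A21, A4=A22:
print('\n'.join(gen_adds('A', [[1, 0, 0, 1], [0, 0, 1, 1]])))
```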
Sequential performance

[Figure: sequential performance on N x N x N; effective GFLOPS vs. dimension N (up to 8000) for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, and <2,3,4>, with the true peak marked]

Effective GFLOPS for M x P x Q multiplies
= 1e-9 * 2 * MPQ / (time in seconds)
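The normalization counts classical flops regardless of which algorithm ran, which is why a fast algorithm can post an effective rate above the machine's true peak. As a helper:

```python
def effective_gflops(M, P, Q, seconds):
    # Count 2*M*P*Q flops (the classical cost of an M x P x Q multiply),
    # even if the fast algorithm actually performed fewer operations.
    return 1e-9 * 2 * M * P * Q / seconds

# An 8000 x 8000 x 8000 multiply finishing in 40 s: ~25.6 effective GFLOPS
print(effective_gflops(8000, 8000, 8000, 40.0))
```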
Sequential performance

[Figure: sequential performance on N x N x N; effective GFLOPS vs. dimension N (up to 8000) for MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, and <6,3,3>]

• All algorithms beat MKL on large problems
• Strassen's algorithm is hard to beat
Sequential performance

[Figure: sequential performance on N x 1600 x N; effective GFLOPS vs. dimension N (2000-12000) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, and STRASSEN]

• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best
Sequential performance

[Figure: sequential performance on N x 2400 x 2400; effective GFLOPS vs. dimension N (10000-18000) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, and STRASSEN]

• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best
Practical issues, or what is not obvious when 
…you just think about the theory 
…your performance models are too simple
Practical issue #1: when to stop recursion?
Look at the gemm curve.

[Figure: sequential dgemm performance (GFLOPS vs. dimension N, up to 3000) and parallel dgemm performance on 24 cores (GFLOPS / core vs. N, up to 8000), for N x 800 x 800, N x 800 x N, and N x N x N shapes, with peak marked]

Basic idea: take another recursive step if the sub-problems
will still operate at high performance.
<M, K, N> = <4, 2, 3>
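The stopping rule can be expressed as a check against a machine-dependent cutoff read off the measured dgemm curve. A hypothetical sketch (the cutoff value here is purely illustrative):

```python
def should_take_recursive_step(m, k, n, cutoff=3000):
    """Recurse with <M, K, N> = <4, 2, 3> only if the resulting
    sub-problems (m/4 x k/2 x n/3) are still large enough that
    dgemm runs near peak on them; `cutoff` comes from the curve."""
    sub_m, sub_k, sub_n = m // 4, k // 2, n // 3
    return min(sub_m, sub_k, sub_n) >= cutoff
```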
Practical issue #2: matrix additions

[Figure: A11, A12, A21, A22 combined into S1, ..., S7; "Pairwise" strategy: each multi-term Sr is built with repeated (2x) DAXPY calls, Y := αX + Y]
Practical issue #2: matrix additions

[Figure: A11, A12, A21, A22 combined into S1, ..., S7; "Write once" strategy: a custom "DAXPY"-like kernel writes each Sr once]
Practical issue #2: matrix additions

[Figure: A11, A12, A21, A22 combined into S1, ..., S7; "Streaming" strategy: entry-wise updates form all the Sr in one pass]
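The three addition strategies can be contrasted in NumPy (illustrative only; the paper's kernels are fused C loops, not NumPy calls):

```python
import numpy as np

def pairwise(A11, A21, A22):
    # "Pairwise": a chain of DAXPY updates (Y := aX + Y);
    # S is re-read and re-written once per call.
    S = A11.copy()
    S += A21   # first DAXPY
    S += A22   # second DAXPY
    return S

def write_once(A11, A21, A22):
    # "Write once": conceptually one fused custom kernel,
    # so each entry of S is written exactly once.
    return A11 + A21 + A22

def streaming(blocks, coeff_rows):
    # "Streaming": one pass forms every S_r via entry-wise updates
    # (shown unfused here; the real code fuses the loop over entries).
    return [sum(c * blk for c, blk in zip(row, blocks) if c != 0)
            for row in coeff_rows]
```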
Practical issue #3: redundant linear combinations
• Example in the <4, 2, 4> algorithm (R = 26 multiplies):

[Figure: T11 and T25 each formed directly from B12, B22, B23, B24]

Four additions, six reads, two writes
Common subexpression elimination
• Example in the <4, 2, 4> algorithm (R = 26 multiplies):

[Figure: T11 and T25 formed via a shared intermediate Y from B12, B22, B23, B24]

Three additions, six reads, three writes
→ Net increase in communication!
Common subexpression elimination (CSE) 
does not really help
Practical issue #4: 
parallelization on shared memory
Fast matmul recursion tree

[Figure: recursion tree rooted at C; each node spawns sub-multiplies M1, M2, ..., M7, combined with additions (+) at every level]
DFS Parallelization

[Figure: recursion tree; all threads work on each Mr in turn, using parallel MKL]

+ Easy to implement
+ Load balanced
+ Same memory footprint as sequential
- Need large base cases for high performance
BFS Parallelization

[Figure: recursion tree; each subtree M1, ..., M7 gets 1 thread, with an omp taskwait before the additions]

+ High performance for smaller base cases
- Sometimes harder to load balance: 24 threads, 49 subproblems
- More memory
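A rough Python analogue of BFS-style task parallelism, using a thread pool in place of OpenMP tasks (one level of Strassen; waiting on the futures plays the role of `omp taskwait` — a sketch, not the paper's C++/OpenMP code):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def strassen_bfs(A, B, pool):
    """One BFS-style level of Strassen: the seven multiplies are
    submitted as independent tasks; NumPy's dot releases the GIL,
    so threads run the multiplies concurrently."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    tasks = [(A11 + A22, B11 + B22), (A21 + A22, B11),
             (A11, B12 - B22), (A22, B21 - B11),
             (A11 + A12, B22), (A21 - A11, B11 + B12),
             (A12 - A22, B21 + B22)]
    futures = [pool.submit(np.dot, X, Y) for X, Y in tasks]    # spawn all 7
    M1, M2, M3, M4, M5, M6, M7 = (f.result() for f in futures)  # "taskwait"
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```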
HYBRID parallelization

[Figure: recursion tree; some subtrees get 1 thread each (BFS), the rest use all threads (DFS), with omp taskwait synchronization]

+ Better load balancing
- Needs explicit synchronization, or else we can over-subscribe threads
[Figure: parallel performance of <4,2,4> on <N,2800,N>; effective GFLOPS / core vs. dimension N (up to 1.8 x 10^4) for MKL, DFS, BFS, and HYBRID, each on 6 and 24 cores]
Bandwidth problems
• We rely on matrix multiplications being much more
expensive than matrix additions
• Parallel dgemm on 24 cores: easily get 50-75% of peak
• STREAM benchmark: < 6x speedup in read/write
performance on 24 cores

[Figure: one level of the recursion tree: C formed by additions of M1, M2, ..., M7]
Parallel performance

[Figure: performance on N x N x N at 6 cores and at 24 cores; effective GFLOPS / core vs. dimension N (9000-13000) for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, and <2,3,4>]

• 6 cores: similar performance to sequential
• 24 cores: can sometimes beat MKL, but barely
Parallel performance

[Figure: performance on N x 2800 x N at 6 cores and at 24 cores; effective GFLOPS / core vs. dimension N (5000-20000) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, and STRASSEN; a region of bad MKL performance is marked]

• 6 cores: similar performance to sequential
• 24 cores: MKL best for large problems
Parallel performance

[Figure: performance on N x 3000 x 3000 at 6 cores (dimension 10000-20000) and at 24 cores (dimension 16000-26000); effective GFLOPS / core for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, and STRASSEN]

• 6 cores: similar performance to sequential
• 24 cores: MKL usually the best
High-level conclusions
• For square matrix multiplication, Strassen's algorithm is
hard to beat
• For rectangular matrix multiplication, use a fast algorithm
that "matches the shape"
• Bandwidth limits the performance of shared-memory
parallel fast matrix multiplication
→ should be less of an issue in distributed memory

Future work:
• Numerical stability
• Using fast matmul as a kernel for other algorithms in
numerical linear algebra

Simplicial closure & higher-order link predictionAustin Benson
 
Three hypergraph eigenvector centralities
Three hypergraph eigenvector centralitiesThree hypergraph eigenvector centralities
Three hypergraph eigenvector centralitiesAustin Benson
 
Semi-supervised learning of edge flows
Semi-supervised learning of edge flowsSemi-supervised learning of edge flows
Semi-supervised learning of edge flowsAustin Benson
 
Choosing to grow a graph
Choosing to grow a graphChoosing to grow a graph
Choosing to grow a graphAustin Benson
 
Link prediction in networks with core-fringe structure
Link prediction in networks with core-fringe structureLink prediction in networks with core-fringe structure
Link prediction in networks with core-fringe structureAustin Benson
 
Higher-order Link Prediction GraphEx
Higher-order Link Prediction GraphExHigher-order Link Prediction GraphEx
Higher-order Link Prediction GraphExAustin Benson
 
Higher-order Link Prediction Syracuse
Higher-order Link Prediction SyracuseHigher-order Link Prediction Syracuse
Higher-order Link Prediction SyracuseAustin Benson
 
Random spatial network models for core-periphery structure
Random spatial network models for core-periphery structureRandom spatial network models for core-periphery structure
Random spatial network models for core-periphery structureAustin Benson
 
Random spatial network models for core-periphery structure.
Random spatial network models for core-periphery structure.Random spatial network models for core-periphery structure.
Random spatial network models for core-periphery structure.Austin Benson
 
Simplicial closure & higher-order link prediction
Simplicial closure & higher-order link predictionSimplicial closure & higher-order link prediction
Simplicial closure & higher-order link predictionAustin Benson
 
Simplicial closure and simplicial diffusions
Simplicial closure and simplicial diffusionsSimplicial closure and simplicial diffusions
Simplicial closure and simplicial diffusionsAustin Benson
 
Sampling methods for counting temporal motifs
Sampling methods for counting temporal motifsSampling methods for counting temporal motifs
Sampling methods for counting temporal motifsAustin Benson
 
Set prediction three ways
Set prediction three waysSet prediction three ways
Set prediction three waysAustin Benson
 

More from Austin Benson (20)

Hypergraph Cuts with General Splitting Functions (JMM)
Hypergraph Cuts with General Splitting Functions (JMM)Hypergraph Cuts with General Splitting Functions (JMM)
Hypergraph Cuts with General Splitting Functions (JMM)
 
Spectral embeddings and evolving networks
Spectral embeddings and evolving networksSpectral embeddings and evolving networks
Spectral embeddings and evolving networks
 
Computational Frameworks for Higher-order Network Data Analysis
Computational Frameworks for Higher-order Network Data AnalysisComputational Frameworks for Higher-order Network Data Analysis
Computational Frameworks for Higher-order Network Data Analysis
 
Higher-order link prediction and other hypergraph modeling
Higher-order link prediction and other hypergraph modelingHigher-order link prediction and other hypergraph modeling
Higher-order link prediction and other hypergraph modeling
 
Hypergraph Cuts with General Splitting Functions
Hypergraph Cuts with General Splitting FunctionsHypergraph Cuts with General Splitting Functions
Hypergraph Cuts with General Splitting Functions
 
Hypergraph Cuts with General Splitting Functions
Hypergraph Cuts with General Splitting FunctionsHypergraph Cuts with General Splitting Functions
Hypergraph Cuts with General Splitting Functions
 
Higher-order link prediction
Higher-order link predictionHigher-order link prediction
Higher-order link prediction
 
Simplicial closure & higher-order link prediction
Simplicial closure & higher-order link predictionSimplicial closure & higher-order link prediction
Simplicial closure & higher-order link prediction
 
Three hypergraph eigenvector centralities
Three hypergraph eigenvector centralitiesThree hypergraph eigenvector centralities
Three hypergraph eigenvector centralities
 
Semi-supervised learning of edge flows
Semi-supervised learning of edge flowsSemi-supervised learning of edge flows
Semi-supervised learning of edge flows
 
Choosing to grow a graph
Choosing to grow a graphChoosing to grow a graph
Choosing to grow a graph
 
Link prediction in networks with core-fringe structure
Link prediction in networks with core-fringe structureLink prediction in networks with core-fringe structure
Link prediction in networks with core-fringe structure
 
Higher-order Link Prediction GraphEx
Higher-order Link Prediction GraphExHigher-order Link Prediction GraphEx
Higher-order Link Prediction GraphEx
 
Higher-order Link Prediction Syracuse
Higher-order Link Prediction SyracuseHigher-order Link Prediction Syracuse
Higher-order Link Prediction Syracuse
 
Random spatial network models for core-periphery structure
Random spatial network models for core-periphery structureRandom spatial network models for core-periphery structure
Random spatial network models for core-periphery structure
 
Random spatial network models for core-periphery structure.
Random spatial network models for core-periphery structure.Random spatial network models for core-periphery structure.
Random spatial network models for core-periphery structure.
 
Simplicial closure & higher-order link prediction
Simplicial closure & higher-order link predictionSimplicial closure & higher-order link prediction
Simplicial closure & higher-order link prediction
 
Simplicial closure and simplicial diffusions
Simplicial closure and simplicial diffusionsSimplicial closure and simplicial diffusions
Simplicial closure and simplicial diffusions
 
Sampling methods for counting temporal motifs
Sampling methods for counting temporal motifsSampling methods for counting temporal motifs
Sampling methods for counting temporal motifs
 
Set prediction three ways
Set prediction three waysSet prediction three ways
Set prediction three ways
 

Recently uploaded

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 

Recently uploaded (20)

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 

Practical fast matrix multiplication: Bridging theory and practice

  • 1. A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION. arXiv: 1409.2908. Austin Benson (arbenson@stanford.edu), ICME, Stanford. Grey Ballard, Sandia National Laboratories. Sandia National Laboratories, October 21, 2014. [Figure: parallel performance of <4,2,4> on <N,2800,N>; Effective GFLOPS / core vs. dimension N for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]
  • 2. Fast matrix multiplication: bridging theory and practice • There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently. [Smirnov13], [Benson&Ballard14] • We show that they can achieve higher performance than Intel’s Math Kernel Library (MKL) • We use code generation for extensive prototyping. [Figure: number line of matmul exponents between 2 and 3: 2.81 [Strassen79] and 2.37 [Williams12].]
  • 4. Key ingredients of Strassen’s algorithm • 1. Block partitioning of matrices (<2, 2, 2>) • 2. Seven linear combinations of sub-blocks of A • 3. Seven linear combinations of sub-blocks of B • 4. Seven matrix multiplies to form Mr (recursive) • 5. Linear combinations of Mr to form Cij
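The five ingredients are easiest to see in code. Below is a minimal NumPy sketch of Strassen's <2, 2, 2> algorithm for square, even-sized matrices; the recursion cutoff is a hypothetical choice, not the tuned value from the framework:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """One step of Strassen's <2,2,2> recursion on square, even-sized inputs.
    Falls back to ordinary matmul at or below the (hypothetical) cutoff."""
    n = A.shape[0]
    if n <= cutoff or n % 2 != 0:
        return A @ B
    h = n // 2
    # Ingredient 1: block partitioning.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Ingredients 2-4: seven linear combinations of the blocks of A and of B,
    # then seven recursive multiplies.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # Ingredient 5: linear combinations of the Mr form the blocks of C.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Seven multiplies replace the eight of the classical blocked algorithm, at the cost of 18 matrix additions.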
  • 5. Key ingredients of fast matmul algorithms • 1. Block partitioning of matrices (<M, K, N>) • 2. R linear combinations of sub-blocks of A • 3. R linear combinations of sub-blocks of B • 4. R matrix multiplies to form Mr (recursive); R < MKN ⇒ faster than classical • 5. Linear combinations of Mr to form Cij. Main idea: trade N^3 computations (matmul) for more N^2 computations (matrix addition)
  • 6. “Outer product” fast algorithm • <4, 2, 4> partitioning • R = 26 multiplies (< 4 * 2 * 4 = 32) ⇒ 23% speedup per recursive step (if everything else were free) • Linear combinations of Aij to form Sr: 68 terms • Linear combinations of Bij to form Tr: 52 terms • Linear combinations of Mr to form Cij: 69 terms
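The 23% figure is just the ratio of classical to fast multiply counts per recursive step; a quick check of the arithmetic:

```python
# Per-step speedup of the <4,2,4> algorithm, assuming additions are free:
# classical block multiplies (M*K*N) divided by fast multiplies (R).
M, K, N, R = 4, 2, 4, 26
speedup = (M * K * N) / R        # 32 / 26 ≈ 1.23
print(f"{(speedup - 1) * 100:.0f}% speedup per recursive step")  # prints "23% ..."
```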
  • 7. Discovering fast algorithms is a numerical challenge • Low-rank tensor decompositions lead to fast algorithms • Tensors are small, but we need exact decompositions ⇒ NP-hard • Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14] • We have around 10 fast algorithms for <M, K, N> partitions. Also have permutations, e.g., <K, M, N>.
  • 9. Code generation lets us prototype algorithms quickly • We have a compact representation of many fast algorithms: 1. dimensions of block partitioning (<M, K, N>) 2. linear combinations of sub-blocks (Sr, Tr) 3. linear combinations of Mr to form Cij • We use code generation to rapidly prototype fast algorithms • Our approach: test all algorithms on many different problem sizes and look for patterns
  • 10. Sequential performance. [Figure: Effective GFLOPS vs. dimension N, sequential performance on N x N x N, for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>; true peak marked.] Effective GFLOPS for M x P x Q multiplies = 1e-9 * 2 * MPQ / time in seconds
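"Effective GFLOPS" normalizes by the classical flop count, so fast algorithms can appear to exceed the machine peak. A small helper (the function name is illustrative):

```python
def effective_gflops(M, P, Q, seconds):
    """Effective GFLOPS of an M x P x Q multiply: 1e-9 * 2 * M * P * Q / time.
    Counts classical flops regardless of how many flops the algorithm actually
    performed, which makes different algorithms directly comparable."""
    return 1e-9 * 2 * M * P * Q / seconds
```

For example, a 1000 x 1000 x 1000 multiply finishing in 0.1 s rates 20 effective GFLOPS, whichever algorithm produced it.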
  • 11. Sequential performance. [Figure: Effective GFLOPS vs. dimension N, sequential performance on N x N x N, for MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, <6,3,3>.] • All algorithms beat MKL on large problems • Strassen’s algorithm is hard to beat
  • 12. Sequential performance. [Figure: Effective GFLOPS vs. dimension N, sequential performance on N x 1600 x N, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.] • Almost all algorithms beat MKL • <4, 2, 4> and <3, 2, 3> tend to perform the best
  • 13. Sequential performance. [Figure: Effective GFLOPS vs. dimension N, sequential performance on N x 2400 x 2400, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.] • Almost all algorithms beat MKL • <4, 3, 3> and <4, 2, 3> tend to perform the best
  • 14. Practical issues, or what is not obvious when …you just think about the theory …your performance models are too simple
  • 15. Practical issue #1: when to stop recursion? Look at the gemm curve. [Figures: sequential dgemm performance (GFLOPS vs. dimension N) and parallel dgemm performance on 24 cores (GFLOPS / core vs. N), each for N x 800 x 800, N x 800 x N, and N x N x N, with peak marked.] Basic idea: take another recursive step if the sub-problems will still operate at high performance. <M, K, N> = <4, 2, 3>
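One way to encode the stopping rule: after one step of an <M, K, N> algorithm the subproblem is (m/M) x (k/K) x (n/N), so recurse only while those dimensions stay in the regime where dgemm runs near peak. A sketch, where the threshold is a hypothetical value read off a dgemm curve like the one above:

```python
def should_recurse(m, k, n, M=4, K=2, N=3, threshold=1000):
    """Take another recursive step of an <M,K,N> fast algorithm only if the
    resulting subproblems are still large enough for dgemm to run fast.
    The threshold is machine-dependent (read off the dgemm curve)."""
    return min(m // M, k // K, n // N) >= threshold
```

With these (assumed) values, an 8000 x 8000 x 8000 problem takes another step, while a 1200 x 800 x 900 problem goes straight to dgemm.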
  • 16. Practical issue #2: matrix additions. [Diagram: forming S1, ..., S7 from the blocks A11, A12, A21, A22.] “Pairwise”: build each Sr with repeated DAXPY calls (Y := aX + Y), two per three-term combination.
  • 17. Practical issue #2: matrix additions. [Diagram: forming S1, ..., S7 from the blocks A11, A12, A21, A22.] “Write once”: a custom “DAXPY” that forms each Sr in a single pass.
  • 18. Practical issue #2: matrix additions. [Diagram: forming S1, ..., S7 from the blocks A11, A12, A21, A22 with entry-wise updates.] “Streaming”.
  • 19. Practical issue #3: redundant linear combinations • Example in <4, 2, 4> algorithm (R = 26 multiplies): T11 = B24 - (B12 + B22), T25 = B23 + B12 + B22. Four additions, six reads, two writes.
  • 20. Common subexpression elimination • Example in <4, 2, 4> algorithm (R = 26 multiplies): Y = B12 + B22, T11 = B24 - Y, T25 = B23 + Y. Three additions, six reads, three writes ⇒ net increase in communication!
  • 21. Common subexpression elimination (CSE) does not really help
  • 22. Practical issue #4: parallelization on shared memory
  • 23. Fast matmul recursion tree. [Diagram: C is a sum of M1, ..., M7, each of which is itself a sum of seven recursive subproblems.]
  • 24. DFS parallelization. [Diagram: the recursion tree is traversed one subproblem at a time; all threads cooperate on each node via parallel MKL.] + Easy to implement + Load balanced + Same memory footprint as sequential - Need large base cases for high performance
  • 25. BFS parallelization. [Diagram: each subproblem M1, ..., M7 runs as its own task on 1 thread, synchronized with omp taskwait.] + High performance for smaller base cases - Sometimes harder to load balance: 24 threads, 49 subproblems - More memory
  • 26. HYBRID parallelization. [Diagram: subproblems run as 1-thread tasks (omp taskwait), and leftover subproblems use all threads.] + Better load balancing - Explicit synchronization or else we can over-subscribe threads
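The slides express BFS with OpenMP tasks; the same idea can be sketched in Python with a thread pool, where gathering the futures plays the role of omp taskwait (the helper name is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def bfs_step(pairs, pool):
    """BFS parallelization sketch: submit all R subproblems (Sr * Tr) as
    independent tasks, then wait for every result -- the analogue of
    'omp taskwait'. A real implementation would recurse inside each task."""
    futures = [pool.submit(np.matmul, S, T) for S, T in pairs]
    return [f.result() for f in futures]
```

With 24 threads and 49 subproblems (two recursive levels of Strassen), this static one-task-per-subproblem split is exactly where the load-balancing problem on the slide comes from.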
  • 27. [Figure: parallel performance of <4,2,4> on <N,2800,N>; Effective GFLOPS / core vs. dimension N for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]
  • 28. Bandwidth problems • We rely on the cost of matrix multiplications to be much more expensive than the cost of matrix additions • Parallel dgemm on 24 cores: easily get 50-75% of peak • STREAM benchmark: < 6x speedup in read/write performance on 24 cores
  • 29. Parallel performance. [Figures: Effective GFLOPS / core vs. dimension N on N x N x N, on 6 and 24 cores, for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>.] • 6 cores: similar performance to sequential • 24 cores: can sometimes beat MKL, but barely
  • 30. Parallel performance. [Figures: Effective GFLOPS / core vs. dimension N on N x 2800 x N, on 6 and 24 cores, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN; bad MKL performance here.] • 6 cores: similar performance to sequential • 24 cores: MKL best for large problems
  • 31. Parallel performance. [Figures: Effective GFLOPS / core vs. dimension N on N x 3000 x 3000, on 6 and 24 cores, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.] • 6 cores: similar performance to sequential • 24 cores: MKL usually the best
  • 32. High-level conclusions • For square matrix multiplication, Strassen’s algorithm is hard to beat • For rectangular matrix multiplication, use a fast algorithm that “matches the shape” • Bandwidth limits the performance of shared memory parallel fast matrix multiplication ⇒ should be less of an issue in distributed memory. Future work: • Numerical stability • Using fast matmul as a kernel for other algorithms in numerical linear algebra
  • 33. [Closing slide: repeats the title slide and the <4,2,4> parallel performance figure.]

Editor's Notes

  1. \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \cdot \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} S_1 &= A_{11} + A_{22} \\ S_2 &= A_{21} + A_{22} \\ S_3 &= A_{11} \\ S_4 &= A_{22} \\ S_5 &= A_{11} + A_{12} \\ S_6 &= A_{21} - A_{11} \\ S_7 &= A_{12} - A_{22} \\ T_1 &= B_{11} + B_{22} \\ T_2 &= B_{11} \\ T_3 &= B_{12} - B_{22} \\ T_4 &= B_{21} - B_{11} \\ T_5 &= B_{22} \\ T_6 &= B_{11} + B_{12} \\ T_7 &= B_{21} + B_{22} \\ C_{11} &= M_1 + M_4 - M_5 + M_7 \\ C_{12} &= M_3 + M_5 \\ C_{21} &= M_2 + M_4 \\ C_{22} &= M_1 - M_2 + M_3 + M_6
  2. \begin{table} \begin{tabular}{l c c c c} Algorithm & Multiplies & Multiplies & Speedup per & \\ base case & (fast) & (classical) & recursive step & Exponent\\ $\langle 2,2,3\rangle $ & 11 & 12 & 9\% & 2.89 \\ $\langle 2,2,5\rangle $ & 18 & 20 & 11\% & 2.89\\ $\langle 2,2,2\rangle $ & 7 & 8 & 14\% & 2.81 \\ $\langle 2,2,4\rangle $ & 14 & 16 & 14\% & 2.85\\ $\langle 3,3,3\rangle $ & 23 & 27 & 17\% & 2.85 \\ $\langle 2,3,3\rangle $ & 15 & 18 & 20\% & 2.81 \\ $\langle 2,3,4\rangle $ & 20 & 24 & 20\% & 2.83\\ $\langle 2,4,4\rangle $ & 26 & 32 & 23\% & 2.82 \\ $\langle 3,3,4\rangle $ & 29 & 36 & 24\% & 2.82 \\ $\langle 3,4,4\rangle $ & 38 & 48 & 26\% & 2.82 \\ $\langle 3,3,6\rangle $ & 40 & 54 & 35\% & 2.77 \\ \end{tabular} \end{table}
  3. S_1 &= A_{11} + A_{22} \\ S_2 &= A_{21} + A_{22} \\ S_3 &= A_{11} \\ S_4 &= A_{22} \\ S_5 &= A_{11} + A_{12} \\ S_6 &= A_{21} - A_{11} \\ S_7 &= A_{12} - A_{22} \\
  4. T_{11} &= B_{24} - \left(B_{12} + B_{22}\right) \\ T_{25} &= B_{23} + B_{12} + B_{22}
  5. Y &= B_{12} + B_{22} \\ T_{11} &= B_{24} - Y \\ T_{25} &= B_{23} + Y
  6. \begin{eqnarray*} S_7 &=& A_{12} - A_{22} \\ T_7 &=& B_{21} + B_{22} \\ M_7 &=& S_7 \cdot T_7 \end{eqnarray*}