I/O Efficient Matrix Multiplication
Presented By
Shubham Joshi (2011227)
Shubham Jaju (2011226)
Contents
• I/O efficiency
- Parallel Disk Model
- How to make algorithms I/O efficient
• Matrix Multiplication in 2D mesh
- Cannon’s Algorithm
• I/O efficient matrix multiplication (MPI model)
• Results & Conclusion
• References
Parallel Disk Model
• Main memory of size M; problem of size N
• External memory: D disks
• Data is transferred in blocks of size B
• Up to D·B data items per I/O step (on the order of 10^2 steps per sec.)
• Goal 1: Minimize the number of I/O steps
• Goal 2: Minimize the number of CPU instructions
• scan(x) := O(x/(D·B)) I/Os
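As a quick worked example with illustrative numbers (not from the slides): for N = 10^9 elements, D = 4 disks and a block size of B = 10^6 elements, scan(N) = O(N/(D·B)) corresponds to roughly 250 I/O steps, whereas touching the same data in an unstructured order could cost up to N individual I/Os.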
How to make algorithms I/O-efficient?
Only a few golden rules:
• Avoid unstructured access patterns (e.g. avoid goto statements).
• Incorporate LOCALITY directly into the algorithm.
Tools:
• Scanning: scan(N) = O(N/(D·B)) I/Os.
• Special I/O-efficient data structures (B-trees, B+ trees).
• “Simulation” of parallel algorithms.
I/O Efficient Matrix Multiplication
The principle of locality takes two forms:
Temporal locality: if a memory location is accessed now, it is likely to be accessed again in the near future.
Spatial locality: if a memory location is accessed now, its neighboring memory locations are likely to be accessed in the near future.
Let's see how we can use these cache properties to speed up our matrix multiplication program.
Assume that each n×n matrix is stored as a 2-D array in row-major order:
a[0][0], a[0][1], ..., a[0][n−1], a[1][0], a[1][1], ..., a[1][n−1], ..., a[n−1][n−1]
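For reference, here is the conventional i−j−k triple loop that the access analysis below assumes (a minimal C sketch; the names a, b, c and n are illustrative, and c is assumed to be zero-initialized):

/* Naive i-j-k matrix multiplication: C = A * B for n x n matrices. */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];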
The accesses into array C are
made in the following order:
• C[0][0] is accessed n times – For the first execution of k loop.
• C[0][1] is accessed n times – For the second execution of k loop.
• C[0][2] is accessed n times – For the third execution of k loop.
• ...
• C[1][0] is accessed n times – For the (n+1)th execution of k loop.
Since the same element is accessed repeatedly and successive accesses move sequentially through the row, both temporal and spatial locality of reference are observed.
For C, accesses are already optimized for cache!
• The accesses into array A are made in the following order:
• A[0][0], A[0][1], A[0][2], ..., A[0][n−1] – the whole sequence is traversed once per full run of the k loop, i.e. n times in total (once for each iteration of the j loop) while i = 0.
• A[1][0], A[1][1], A[1][2], ..., A[1][n−1] – again n times, while i = 1.
• ...
• A[n−1][0], A[n−1][1], A[n−1][2], ..., A[n−1][n−1] – again n times, while i = n−1.
Hence, each row of array A is accessed n times, and the elements within a row are accessed sequentially. We observe good spatial locality of reference, but since an element is not referenced again until the whole row has been traversed, temporal locality is poor.
• The accesses into array B are made in the following order:
• B[0][0], B[1][0], B[2][0], ..., B[n−1][0]
• B[0][1], B[1][1], B[2][1], ..., B[n−1][1]
• ...
• B[0][n−1], B[1][n−1], B[2][n−1], ..., B[n−1][n−1]
• Then again,
• B[0][0], B[1][0], B[2][0], ..., B[n−1][0]
• B[0][1], B[1][1], B[2][1], ..., B[n−1][1]
• ...
• B[0][n−1], B[1][n−1], B[2][n−1], ..., B[n−1][n−1]
B is traversed column by column: the whole array is accessed once before any element is revisited. Because the matrix is stored in row-major order, consecutive accesses are n elements apart, so B exhibits neither spatial nor temporal locality. For matrices larger than the cache, each element incurs a cache miss on every access, and we therefore take a cache miss on essentially every execution of the multiplication statement.
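To put rough numbers on this (an illustrative estimate under the simplified model above, assuming a cache line holds L consecutive elements and the matrices are far larger than the cache): the i−j−k loop incurs about n³ misses on B, about n³/L on A and about n²/L on C, so the column-wise traversal of B dominates the total miss count.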
Improving cache efficiency
To improve upon this traditional i−j−k ordering, we can employ loop interchange.
Trying the other five loop orderings, namely i−k−j, j−k−i, j−i−k, k−i−j and k−j−i, we find that i−k−j is the optimal one.
With the i−k−j ordering, the accesses into array C become:
• C[0][0], C[0][1], C[0][2], ..., C[0][n−1] – the whole sequence n times, once for each iteration of the k loop while i = 0.
• ...
• C[n−1][0], C[n−1][1], C[n−1][2], ..., C[n−1][n−1] – the whole sequence n times, once for each iteration of the k loop while i = n−1.
Now C has good spatial locality.
Similarly, A enjoys good temporal as well as spatial locality, and B now has spatial locality.
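A minimal C sketch of the interchanged i−k−j loop (the same computation as the i−j−k version above; only the loop order changes, and hoisting a[i][k] into a scalar is an optional extra):

/* i-k-j ordering: better cache behaviour for row-major storage. */
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++) {
        double r = a[i][k];             /* reused across the whole j loop: temporal locality */
        for (int j = 0; j < n; j++)
            c[i][j] += r * b[k][j];     /* c and b are walked row-wise: spatial locality */
    }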
Cache efficiency of the different loop orderings [chart omitted; y-axis: time in seconds]
Matrix Multiplication in 2D mesh
Mesh Network:
A set of nodes arranged in the form of a p-dimensional lattice is called a mesh network.
In a mesh network, only neighbouring nodes can communicate with each other.
Cannon’s Algorithm
Procedure MATRIXMULT
begin
    for k = 1 to n-1 step 1 do
    begin
        for all P(i,j), where i and j range from 1 to n, do
            if i > k then
                rotate a in the east direction
            end if
            if j > k then
                rotate b in the south direction
            end if
    end
    for all P(i,j), where i and j range from 1 to n, do
        compute the product of a and b and store it in c
    for k = 1 to n-1 step 1 do
        for all P(i,j), where i and j range from 1 to n, do
            rotate a in the east direction
            rotate b in the south direction
            c = c + a × b
end
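For the MPI model mentioned in the contents, a hedged C sketch of Cannon's algorithm on a process grid is given below. It is not a transcription of the pseudocode above, nor of the code behind the timings reported next; it follows the common formulation that skews A westward and B northward once and then rotates by one block per step. Assumptions (not from the slides): the communicator holds p = q×q ranks, each rank already owns one nb×nb row-major block of A, B and C, and the helper local_multiply is illustrative.

#include <mpi.h>

/* Multiply the local nb x nb blocks, using the i-k-j ordering discussed earlier. */
static void local_multiply(const double *A, const double *B, double *C, int nb)
{
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                C[i*nb + j] += A[i*nb + k] * B[k*nb + j];
}

void cannon(double *A, double *B, double *C, int nb, MPI_Comm comm)
{
    int p, rank, dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    int src, dst, count = nb * nb;
    MPI_Comm grid;

    MPI_Comm_size(comm, &p);
    MPI_Dims_create(p, 2, dims);                        /* expects a square process grid */
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    /* Initial skew: shift row i of A west by i and column j of B north by j. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, count, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, count, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    /* q steps: multiply the local blocks, then rotate A one block west and B one block north. */
    for (int step = 0; step < dims[0]; step++) {
        local_multiply(A, B, C, nb);
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(A, count, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(B, count, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}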
Results of I/O Efficient Parallel Matrix Multiplication
Matrix size                  500 × 500    700 × 700    1000 × 1000    1200 × 1200
Time taken, parallel (sec)   0.38         1.156053     3.409883       7.037
Machine Specification
Intel Core i3 CPU @ 2.20 GHz
RAM: 2 GB
Windows, 64-bit
References
Harald Prokop. Cache-Oblivious Algorithms. Master's thesis, MIT, June 1999.
Matrix Multiplication on a Distributed Memory Machine, www.phy.ornl.gov
Thank You