International Journal of Advanced Research in Engineering and Technology
(IJARET)
Volume 6, Issue 7, Jul 2015, pp. 11-23, Article ID: IJARET_06_07_003
Available online at
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=6&IType=7
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
Β© IAEME Publication
SPARSE STORAGE RECOMMENDATION
SYSTEM FOR SPARSE MATRIX VECTOR
MULTIPLICATION ON GPU
Monika Shah
Department of Computer Science & Engineering, Nirma University
ABSTRACT
Sparse Matrix Vector Multiplication (SpMV), Ax = b, is a well-known kernel in science, engineering, and the web world. Harnessing the large computing capabilities of GPU devices, many sparse storage formats have been proposed to optimize the performance of SpMV on GPU. The Compressed Sparse Row (CSR), ELLPACK (ELL), Hybrid (HYB), and Aligned COO sparse storage formats are known for efficient implementation of SpMV on GPU over a wide spectrum of sparse matrix patterns. Researchers have observed that the performance of SpMV on GPU for a given matrix A can vary widely depending on the sparse storage format used. Hence, it has become a great challenge to choose an appropriate storage format from this collection for a given sparse matrix. To resolve this problem, this paper proposes an algorithm that recommends a highly suitable storage format for a given sparse matrix. The system uses simple metrics of the matrix (like row length, number of rows, number of columns, and number of non-zero elements) to analyse the impact of different storage formats on the performance of SpMV. To demonstrate the influence of this algorithm, the performance of SpMV and its associated application, the Conjugate Gradient Solver (CGS), over various sparse matrix patterns with various sparse formats has been compared.
Key words: Sparse Matrix, SpMV, Sparse format, Heuristics, K-mean clustering, Load balance
Cite this Article: Shah, M. Sparse Storage Recommendation System for
Sparse Matrix Vector Multiplication on GPU. International Journal of
Advanced Research in Engineering and Technology, 6(7), 2015, pp. 11-23.
http://www.iaeme.com/currentissue.asp?JType=IJARET&VType=6&IType=7
1. INTRODUCTION
For many years, Sparse Matrix Vector Multiplication (SpMV) has been one of the most prominent computing dwarfs for science and engineering applications. Linear algebra solvers (like partial differential equations [1], [2], conjugate gradient solvers [3], [4], Gaussian reduction of complex matrices, etc.), fluid dynamics [5], database query processing on large databases (LDB) [6], information retrieval [7], network theory [8], [9], page rank computation [10], and the physics of disordered and quantum systems [11] are well-known applications that make recurrent use of SpMV. The sparse matrices used in these applications vary widely in non-zero pattern.
The continuous growth of computer users, and their increasing usage, constantly increase the size of many datasets used in such applications. This continuous and exponential growth of datasets has raised the need to apply High Performance Computing. Researchers have provided many solutions through inventions of high performance computing device architectures, like the Graphical Processing Unit (GPU), and through algorithms optimized for these devices. The GPU is a well-known, promising high performance device for regular applications. Hence, it is a great challenge to use the GPU for an irregular application like SpMV.
Generalized implementation of parallel SpMV has become complex because of the following properties of sparse matrices:
1. Imbalanced number of nonzero elements in each row
2. Imbalanced number of nonzero elements in each column
3. Wide range of sparse patterns (diagonal, skewed, power-law distribution of non-zero elements for each row, almost equal number of non-zero elements per row, block, etc.)
4. Varied sparsity level of the matrix (ratio of nonzero elements to size of matrix)
For an efficient and generalized implementation of SpMV on GPU, two important factors are identified by past research [12]: (i) synchronization free load distribution among computational resources, and (ii) reduced fetch operations to avoid the drawback of high-latency memory access on GPU. Hence, it is preferred to select a sparse storage format that supports high compression along with better synchronization free load distribution. The major challenges in satisfying these factors are:
1. Continuous growth in datasets makes the sparse matrix very large.
2. Indirection used in the storage representation of a sparse matrix increases the size of data to be transferred from CPU to GPU device as an additional overhead.
3. A large class of sparse matrix patterns exists.
4. It is difficult to balance work distribution due to the imbalanced number of nonzero elements in each row as well as in each column.
5. Concurrency is restricted due to data dependency among row elements when computing the output vector.
Harnessing the high computing capabilities of the GPU, and the unceasing performance demand of the SpMV kernel, motivate researchers to optimize SpMV on GPU in a way that deals with all the challenges listed above. In past research, the Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELLPACK (ELL), Hybrid (HYB), and Jagged Diagonal Storage (JDS) formats have been proposed with different compression strategies [13], together with SpMV algorithms for these sparse storage formats on GPU. The bulky index structure of the COO format reduces the degree of synchronization free load distribution among parallel threads and increases communication overhead between CPU and GPU. CSC schedules all columns of a sparse matrix sequentially in SpMV, and the vector b is loaded and stored frequently in each iteration; this causes recurrent communication overhead, which limits the performance of CSC on GPU. These factors are responsible for the low popularity of the COO and CSC sparse formats on GPU. Aligned Coordinate (Aligned COO) [12] was introduced as a compressed format suitable for synchronization free balanced load distribution and proper cache utilization.
Sparse matrix metrics like the Number of Rows (NR), Number of Columns (NC), Number of Non-Zero elements (NNZ), non-zero elements in a row (row_len), and non-zero elements in a column (col_len) play an important role in the compression ratio and degree of parallelism of the various sparse formats. An important point to note here is that the compression ratio of these recognized sparse storage formats varies with the sparsity level and sparse pattern of the input matrix. Considering these factors, this paper proposes an algorithm to recommend a highly suitable storage format for a given sparse matrix. The remaining paper is structured as follows: the course of optimizing sparse formats and their SpMV implementations is traced in Section 2. Section 3 brings forth our attempt to define heuristics and an algorithm that recommends a highly suitable storage format for implementing SpMV on GPU. The parallel algorithm of CGS is discussed in Section 3.9. Section 4 demonstrates and analyses the results of this proposed work. The conclusion of the paper is given in Section 5.
2. SPARSE STORAGE FORMATS
Many storage formats have been proposed as a result of past efforts by researchers. As mentioned in Section 1, compressed storage, synchronization free load distribution, and the highest possible concurrency have become the main goals in designing sparse matrix formats for NVIDIA GPUs and the CUDA programming environment. Bell et al. [13] introduced the storage formats COO, CSR, CSC, ELL, and HYB, supporting different levels of compression for different sparse matrix patterns. Shah et al. [12] introduced Aligned COO. Many other extensions of these benchmark sparse formats [14], [15], [16], [17], as well as hybrids of these storage formats [18], [19], [20], have also been proposed in the past. The tragedy, even after research into this large set of sparse formats, is that there is no standard format suitable for almost all classes of sparse matrix patterns. In addition, it is difficult to identify a sparse matrix format supporting the best compression as well as synchronization free and balanced work-load distribution.
Table 1 Sparse Matrix Formats and Their Space Complexity

Sparse matrix format    Space complexity
COO                     NNZ x 3
CSC                     NNZ x 2 + (NC + 1)
CSR                     NNZ x 2 + (NR + 1)
ELL                     (NR x max_row_length) x 2
HYB                     ≅ ELL for rows with similar length; ≅ COO for the rest of the row elements
Aligned COO             Num_segments x Segment_length x 3 ≅ (max_row_length x (≤ NR) x 3)
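As a concrete illustration of Table 1 (not part of the original paper), the following small host-side sketch evaluates the space-complexity expressions for a given matrix. The word counts simply follow the table, counting each stored index or value entry as one word; the max_row_length used for the add32-like example is a hypothetical value.

    #include <stdio.h>

    /* Space complexity of Table 1, in stored words (indices + values). */
    long long words_coo(long long nnz)                  { return 3 * nnz; }
    long long words_csc(long long nnz, long long nc)    { return 2 * nnz + nc + 1; }
    long long words_csr(long long nnz, long long nr)    { return 2 * nnz + nr + 1; }
    long long words_ell(long long nr, long long maxlen) { return 2 * nr * maxlen; }

    int main(void) {
        /* add32 metrics from Table 3; max_row_length = 32 is assumed. */
        long long nr = 4960, nc = 4960, nnz = 23884, max_row_length = 32;
        printf("COO: %lld  CSC: %lld  CSR: %lld  ELL: %lld\n",
               words_coo(nnz), words_csc(nnz, nc),
               words_csr(nnz, nr), words_ell(nr, max_row_length));
        return 0;
    }

Running such a calculator over a matrix's basic metrics makes the format-dependence of compression in the next paragraph directly visible.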
Selection of a proper data compression strategy is important for two major reasons: (i) the data transfer overhead between CPU and GPU, and (ii) the memory access pattern of each concurrent thread depends on the data structure. Table 1 presents the memory space required by the various sparse formats. It shows that the compression percentage of the same format varies from one sparse matrix to another based on the basic statistics of the matrix. For example, COO provides the highest compression for a small and highly sparse matrix; CSC and CSR give better compression for matrices that are small in terms of columns and rows respectively; ELL is suitable for compressing a sparse matrix with little difference in NNZ across rows and a large number of rows. COO, CSC, CSR, and ELL are known as the core sparse storage formats designed to support higher compression. HYB is designed to reduce the padding space of the ELL format: it offers better compression in the form of a hybrid pattern of ELL and COO. Aligned COO provides better compression compared to ELL for a highly skewed sparse matrix with power-law distribution.
Table 2 SpMV Algorithms and Their Time Complexity (excluding memory access overhead)

SpMV            Time complexity
COO_flat        NNZ / max_concurrent_threads
CSC             NC x (max_col_length / max_concurrent_threads)
CSR             ≤ (NR / max_concurrent_threads) x max_row_length
CSR (vector)    ≤ (NR / max_warps) x (max_row_length / warp_size + log2(warp_size)),
                where max_warps = max_concurrent_threads / warp_size
ELL             ≥ (NR / max_concurrent_threads) x max_row_length
HYB             ≅ ELL for rows with similar length + ≅ COO_flat for the rest of the row elements
Aligned_COO     ≅ CSR for aligned rows + ≅ COO_flat for the rest of the row elements
Increased concurrency and synchronization free load distribution are important factors in reducing the runtime of parallel SpMV on GPU. Table 2 presents the run-time complexity of the SpMV implementations of the above listed sparse storage formats. The COO_flat algorithm offers the highest concurrency but does not ensure synchronization free load distribution among concurrent threads, due to row elements crossing warp boundaries. CSC is also less preferred due to the additional overhead of accessing the output vector in every iteration. On the other side, ELL has the overhead of transferring extraneous memory containing zero-value padding over high-latency memory. CSR and ELL have very similar SpMV algorithms, except for the additional overhead in CSR of accessing memory to fetch the row index. The CSR implementation on GPU is more efficient than ELL where the NNZ to be accessed by one thread block and iteration is much larger than in another block or iteration. CSR Vector provides much higher concurrency than CSR and ELL, but has the overhead of performing a series of parallel reduction steps in each thread. Hence, CSR Vector is not suitable when the average NNZ per row is less than the number of steps required by the parallel reduction in each thread, that is log2(warp size). The HYB and Aligned COO kernels are designed to make SpMV efficient using hybrids of the above mentioned sparse formats and their kernels. Aligned COO reorders nonzero elements to balance the workload distribution among the computing resources, and thus reduces the number of row segments compared to the number of rows in the ELL format while keeping the maximum row length the same as the original. Hence, Aligned COO gives optimized performance for highly skewed sparse matrix patterns.
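For concreteness, here is a minimal CUDA sketch of the scalar CSR kernel discussed above, one thread per row, following the widely known formulation of Bell and Garland [13]; the array names are our choices, not the cusp library's API.

    __global__ void spmv_csr_scalar(int nr, const int *row_ptr,
                                    const int *col_idx, const float *val,
                                    const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nr) {
            float sum = 0.0f;
            /* Each thread walks one row; rows of different length finish at
               different times, which is the load imbalance noted above. */
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                sum += val[j] * x[col_idx[j]];
            y[row] = sum;
        }
    }

Launched with, for example, spmv_csr_scalar<<<(nr + 255) / 256, 256>>>(...), each thread serializes over its row, which is why Table 2 charges roughly (NR / max_concurrent_threads) x max_row_length for CSR.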
3. PROPOSED WORK
Section 2 discusses the strengths and weaknesses of various sparse matrix storage formats and their SpMV implementations. It indicates that the selection of the sparse storage format is an important factor for efficient SpMV on GPU. The collection of SpMV algorithms like JAD, CSR, ELL, CSR Vector, HYB, and Aligned COO covers a wide spectrum of sparse matrix patterns for better performance. Recognizing the sparse matrix pattern is a great challenge. Statistical analysis is considered to be a good methodology for recognizing sparse patterns. A diagonal pattern is simple to recognize, and JAD is recommended for a diagonal pattern of sparse matrix. This section proposes a strategy to suggest the most appropriate SpMV implementation for all sparse patterns except diagonal.
The working flow of this proposed work is described in Figure 1. Here, K-mean clustering is used to generate detailed statistics from basic matrix statistics like NR, NC, NNZ, and the row length vector rl []. These derived statistics are analysed and compared with pre-defined heuristics to suggest the most appropriate SpMV algorithm. Section 3.1 explains the input and output parameters of the K-mean clustering algorithm. Section 3.2 defines heuristics for the CSR, ELL, CSR Vector, HYB, and Aligned COO SpMV algorithms.
Figure 1 Working flow of Heuristic based Selection of SpMV algorithm
A detailed description of the heuristics based SpMV selection algorithm is given in Section 3.8. To prove the effectiveness of this proposed algorithm, a well-known SpMV application, CGS on GPU, is implemented as shown in Section 3.9.
3.1. K-Mean Clustering and Its Parameters
Here, K-mean clustering is used to identify the level of similarity among sparse matrix rows using the row-length parameter. The K-mean clustering algorithm constructs 2 clusters based on the row length vector rl []. For a highly skewed sparse matrix, the centroid of a cluster is not sufficient to predict the similarity of row lengths. Hence, the K-mean clustering algorithm is slightly modified to also identify the Lower Bound (LB), Upper Bound (UB), Number of elements (CNT), and Centroid (C) of both cluster bins. The clusters are named cluster H and cluster L based on their centroid values, i.e. C_L < C_H. Accordingly, the output parameters of the K-mean clustering algorithm are named LB_H, UB_H, CNT_H, C_H, LB_L, UB_L, CNT_L, and C_L, as shown in Figure 2.
Figure 2 K-mean clustering for this proposed work
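A minimal host-side sketch of this modified two-bin K-mean step follows (our own illustration, not the paper's code; it assumes a Lloyd-style iteration over the row length vector rl[] with at least one row):

    #include <limits.h>
    #include <math.h>

    /* Per-bin output statistics of Section 3.1: LB, UB, CNT, and centroid C. */
    typedef struct { int LB, UB, CNT; double C; } ClusterStats;

    void kmean2(const int *rl, int nr, ClusterStats *L, ClusterStats *H)
    {
        double cL = rl[0], cH = rl[0];
        for (int i = 1; i < nr; ++i) {              /* seed centroids at min/max */
            if (rl[i] < cL) cL = rl[i];
            if (rl[i] > cH) cH = rl[i];
        }
        for (int iter = 0; iter < 100; ++iter) {    /* Lloyd iterations */
            double sL = 0, sH = 0; int nL = 0, nH = 0;
            for (int i = 0; i < nr; ++i) {
                if (fabs(rl[i] - cL) <= fabs(rl[i] - cH)) { sL += rl[i]; ++nL; }
                else                                      { sH += rl[i]; ++nH; }
            }
            double newL = nL ? sL / nL : cL, newH = nH ? sH / nH : cH;
            if (newL == cL && newH == cH) break;    /* converged */
            cL = newL; cH = newH;
        }
        L->LB = H->LB = INT_MAX; L->UB = H->UB = 0; L->CNT = H->CNT = 0;
        double sL = 0, sH = 0;
        for (int i = 0; i < nr; ++i) {              /* final per-bin statistics */
            int low = fabs(rl[i] - cL) <= fabs(rl[i] - cH);
            ClusterStats *c = low ? L : H;
            if (rl[i] < c->LB) c->LB = rl[i];
            if (rl[i] > c->UB) c->UB = rl[i];
            c->CNT++;
            if (low) sL += rl[i]; else sH += rl[i];
        }
        L->C = L->CNT ? sL / L->CNT : cL;
        H->C = H->CNT ? sH / H->CNT : cH;
    }

The outputs map directly onto the parameters of Figure 2: (L->LB, L->UB, L->CNT, L->C) correspond to LB_L, UB_L, CNT_L, C_L, and likewise for cluster H.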
3.2. Heuristics
Based on empirical result analysis and a basic understanding of the various SpMV algorithms, heuristics are defined to suggest a suitable sparse storage format and a GPU based SpMV algorithm capable of giving better performance for a given sparse matrix. The following points are the center of focus in the design of these heuristics:
1. Obtain the highest possible degree of concurrency
2. Better compression of the sparse matrix to reduce memory access cost
3. Balanced work load distribution among threads
4. Synchronization free load distribution as far as possible
5. Reducing the number of blocks to reduce block scheduling cost
3.3. Heuristics for CSR Vector
CSR Vector is designed to provide the highest possible concurrency with synchronization free load distribution, which in turn ensures good accuracy. Every execution thread of this SpMV algorithm executes at least 1 multiplication and log2 W addition operations. For execution of CSR Vector, a warp W (a collection of execution threads, 32 threads in general) is allotted to each row of the sparse matrix for computation. The CSR storage format is used to implement this SpMV algorithm. But CSR is preferred for small matrix sizes, to avoid a large number of high-latency memory accesses for fetching row indices. Considering all these criteria, CSR Vector is preferred if the following condition is satisfied:
(C_L ≥ log2 W) AND (C_H ≥ W/2) AND (NNZ ≤ max_threads)

3.4. Heuristics for CSR
CSR SpMV is preferred for a small matrix where each row does not have a large number of non-zero elements and the majority of rows do not have equivalent size in terms of non-zero elements. Hence, CSR SpMV is preferred when CSR Vector is not applicable and a corresponding condition on the matrix size is satisfied.

3.5. Heuristics for ELL
The ELL storage format and ELL SpMV algorithm are preferred for a sparse matrix with equivalent row lengths, which reduces the padding overhead and improves performance. But a large row length reduces the degree of concurrency in ELL SpMV. ELL is preferred when there is not much difference either between the centroid values of the two clusters, or between the upper bound of the higher-value cluster and the centroid value of the cluster having the lower centroid value. Hence, it is concluded that ELLPACK is preferred when CSR or CSR Vector are not applicable for the given sparse matrix, and the following condition is satisfied:
((C_H / C_L ≤ 1.49) OR ((C_H − C_L) ≤ 6))

3.6. Heuristics for HYB
When a large sparse matrix does not have equivalent row lengths, but rather a power-law distribution of non-zero elements among the rows of the matrix with a highly skewed visualization, the Hybrid sparse format and its SpMV are preferred. Hence, it is concluded that HYB is preferred when the CSR or ELL sparse formats are not suitable for the given sparse matrix, and the following condition is satisfied:
(C_H / C_L ≥ 100) OR (UB_H / LB_H ≥ 100) OR (NNZ / NR ≥ 100)

3.7. Heuristics for Aligned COO
The Aligned COO format and its SpMV are designed to optimize performance for a large sparse matrix having a skewed distribution of non-zero elements that also permits alignment of large rows with small rows, such that the required number of execution units is reduced. But as it is based on the COO format, it provides less compression compared to HYB. Hence, Aligned COO is preferred when neither the CSR nor ELL nor HYB formats are suitable and the matrix satisfies a corresponding skewness condition.
3.8. Heuristics based Sparse format recommendation
This section describes an algorithm to suggest the most suitable sparse format and its associated SpMV for a given sparse matrix. It performs K-mean clustering on the sparse matrix metrics, and compares the output parameters with the heuristics defined in the sections above.

Algorithm 1 Heuristics based Sparse format recommendation
Input: NNZ, NR, NC, rl [ ]
Output: Suitable_SpMV
Perform K-mean clustering with two bins
if (C_L ≥ log2 W) AND (C_H ≥ W/2) AND (NNZ ≤ max_threads) then
    Suitable_SpMV = CSR Vector
else if the CSR condition of Section 3.4 holds then
    Suitable_SpMV = CSR
else if ((C_H / C_L ≤ 1.49) OR ((C_H − C_L) ≤ 6)) then
    Suitable_SpMV = ELL
else if ((C_H / C_L ≥ 100) OR (UB_H / LB_H ≥ 100) OR (NNZ / NR ≥ 100)) then
    Suitable_SpMV = HYB
else
    Suitable_SpMV = Aligned COO
end if
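Putting Algorithm 1 and the heuristics of Sections 3.3 to 3.7 together, a host-side sketch of the decision chain might look as follows. It reuses the ClusterStats type from the Section 3.1 sketch; the CSR branch condition and the exact constants are assumptions on our part, since the original conditions for CSR and Aligned COO did not survive in the source.

    #include <math.h>

    typedef enum { SPMV_CSR_VECTOR, SPMV_CSR, SPMV_ELL,
                   SPMV_HYB, SPMV_ALIGNED_COO } SpMVKind;

    /* W = warp size (32 on the paper's hardware); maxThreads = maximum
       number of concurrently resident threads on the device. */
    SpMVKind recommend(long long NNZ, long long NR,
                       ClusterStats L, ClusterStats H,
                       int W, long long maxThreads)
    {
        /* Section 3.3: enough work per warp, and a small enough matrix. */
        if (L.C >= log2((double)W) && H.C >= W / 2.0 && NNZ <= maxThreads)
            return SPMV_CSR_VECTOR;
        /* Section 3.4: small matrix (exact condition assumed). */
        if (NNZ <= maxThreads)
            return SPMV_CSR;
        /* Section 3.5: the two row-length clusters are close together. */
        if (H.C / L.C <= 1.49 || (H.C - L.C) <= 6.0)
            return SPMV_ELL;
        /* Section 3.6: strongly skewed, power-law-like row lengths. */
        if (H.C / L.C >= 100.0 || (double)H.UB / H.LB >= 100.0 ||
            (double)NNZ / NR >= 100.0)
            return SPMV_HYB;
        /* Section 3.7: everything else. */
        return SPMV_ALIGNED_COO;
    }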
3.9. Parallel CGS
To demonstrate the effectiveness of the hereby proposed heuristics based sparse format recommendation algorithm for efficient SpMV, it is preferred to test the algorithm on an application that makes frequent and heavy use of the SpMV kernel and is applicable to a wide category of sparse patterns. The Conjugate Gradient Solver (CGS) is such a well-known application: it finds the solution vector x for Ax = b.
Every CGS call invokes the SpMV kernel in a loop of hundreds to thousands of iterations. As the GPU has a major overhead of memory transfer between CPU and GPU, this parallel CGS is designed in such a way that it needs to transfer the sparse matrix A and the input vector b at the time of the first iteration only. The GPU based parallel CGS is described in Algorithm 2, where the SpMV kernel is executed for the specified number of iterations, which is in the hundreds, bounded by the size of the sparse matrix.
4. EXPERIMENTAL RESULTS
For a proper evaluation of the proposed algorithm, various SpMV algorithms (CSR, CSR Vector, ELL, and HYB SpMV from the NVIDIA cusp library, and the Aligned COO algorithm) have been implemented on an NVIDIA GPU.
The collection of sparse matrices used in this experiment is listed along with their basic properties in Table 3.
4.1. Test Platform
These experiments have been executed on an Intel(R) Core(TM) i3 CPU @ 3.20 GHz with 4 GB RAM, 2 Γ— 256 KB L2 cache, and 4 MB L3 cache, with an NVIDIA C2070 GPU device, using CUDA version 4.0 on Ubuntu 11.
The dataset contains 31 sparse matrices, retrieved from the well-known source The University of Florida Sparse Matrix Collection. The sparse matrices are selected such that the collection contains matrices with various sparsity levels and a large category of sparse patterns.
4.2. Result Analysis
Table 4 lists the performance of the CSR, CSR Vector, ELL, HYB, and Aligned COO algorithms in GFLOP/sec for each matrix listed in Table 3. The heuristics based SpMV selection algorithm has been implemented and its output compared with the performance results recorded in Table 4. The overhead of memory transfer between CPU and GPU is always an important factor for overall performance; this cost is considered to be amortized over a large number of iterations. The parallel CGS listed in Algorithm 2 has also been implemented, for 200 iterations. The execution time of this CGS, including memory transfer time, has been compared with the result of our proposed algorithm. The recommendation of the proposed algorithm is satisfied for 30 out of the 31 sparse matrices in the dataset.
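The paper does not state its FLOP accounting, but SpMV throughput figures like those in Table 4 conventionally count one multiply and one add per stored non-zero; under that assumption the conversion from runtime to the tabulated numbers is:

    /* Conventional SpMV throughput: 2 floating-point operations per non-zero. */
    double spmv_gflops(long long nnz, double seconds)
    {
        return (2.0 * nnz) / seconds / 1e9;
    }

For example, under this accounting crankseg_2 (NNZ = 14,148,858) at 11.5 GFLOP/sec corresponds to roughly 2.5 ms per SpMV call.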
Algorithm 2 Parallel CGS
Input: Sparse Matrix A, Vector b, NR, NC, iterations
Output: Vector x
Initialize vector dev_x = 0
Copy vectors from Host memory to Device memory (b β†’ dev_r, and b β†’ dev_p)
Copy matrix from Host memory to Device memory (A β†’ dev_A)
Compute dev_rsold = dev_r^T Γ— dev_r using dev_inner_product(dev_r, dev_r, NR, dev_rsold)
for i = 1 β†’ min(iterations, NR Γ— NC) do
    Initialize vector dev_Ap = 0
    Perform Ap = A Γ— p using dev_SpMV(dev_A, dev_p, dev_Ap)
    Perform pAp = p^T Γ— Ap using dev_inner_product(dev_p, dev_Ap, NR, dev_pAp)
    dev_alpha = dev_rsold / dev_pAp
    Asynchronous computation of dev_x and dev_r:
        Perform x += alpha Γ— p using dev_add_scalarMul(dev_alpha, dev_p, dev_x, 1, dev_x)
        Perform r βˆ’= alpha Γ— Ap using dev_add_scalarMul(dev_alpha, dev_Ap, dev_r, -1, dev_r)
    Compute dev_rsnew = dev_r^T Γ— dev_r using dev_inner_product(dev_r, dev_r, NR, dev_rsnew)
    Copy device rsnew (dev_rsnew) to host rsnew
    if √rsnew < 1e-10 then
        Exit for
    end if
    Compute p = r + (rsnew / rsold) Γ— p using
        dev_temp = dev_rsnew / dev_rsold
        dev_add_scalarMul(dev_temp, dev_p, dev_r, 1, dev_p)
    dev_rsold = dev_rsnew
end for
Copy device vector to host vector (dev_x β†’ x)
Return x
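For reference, here is a minimal single-threaded C sketch of the same CG flow (our illustration, not the paper's code). It mirrors Algorithm 2 but runs entirely on the host using the CSR layout from Section 2, so the device transfers and the dev_* kernels are elided; the 1e-10 tolerance matches the one assumed in Algorithm 2.

    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    /* y = A*x for a CSR matrix (host reference version). */
    static void spmv_csr(int nr, const int *rp, const int *ci,
                         const double *v, const double *x, double *y)
    {
        for (int r = 0; r < nr; ++r) {
            double s = 0.0;
            for (int j = rp[r]; j < rp[r + 1]; ++j)
                s += v[j] * x[ci[j]];
            y[r] = s;
        }
    }

    static double dot(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    /* Conjugate gradient mirroring Algorithm 2: solves A x = b for a
       symmetric positive definite CSR matrix, stopping after `iterations`
       steps or when ||r|| < 1e-10. */
    void cg_solve(int nr, const int *rp, const int *ci, const double *v,
                  const double *b, double *x, int iterations)
    {
        double *r  = (double *)malloc(nr * sizeof(double));
        double *p  = (double *)malloc(nr * sizeof(double));
        double *Ap = (double *)malloc(nr * sizeof(double));
        memset(x, 0, nr * sizeof(double));          /* x = 0, hence r = p = b */
        memcpy(r, b, nr * sizeof(double));
        memcpy(p, b, nr * sizeof(double));
        double rsold = dot(r, r, nr);
        for (int i = 0; i < iterations; ++i) {
            spmv_csr(nr, rp, ci, v, p, Ap);         /* Ap = A*p   (dev_SpMV)  */
            double alpha = rsold / dot(p, Ap, nr);  /* alpha = rsold / (p.Ap) */
            for (int k = 0; k < nr; ++k) x[k] += alpha * p[k];
            for (int k = 0; k < nr; ++k) r[k] -= alpha * Ap[k];
            double rsnew = dot(r, r, nr);
            if (sqrt(rsnew) < 1e-10) break;         /* converged */
            double beta = rsnew / rsold;            /* dev_temp in Algorithm 2 */
            for (int k = 0; k < nr; ++k) p[k] = r[k] + beta * p[k];
            rsold = rsnew;
        }
        free(r); free(p); free(Ap);
    }

On the GPU, the per-iteration SpMV call is exactly the kernel selected by Algorithm 1, which is why the format recommendation carries over directly to CGS performance.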
5. CONCLUSION
In this paper, various factors responsible for achieving higher performance of SpMV on GPU for various sparse patterns have been discussed. It has been realized that a decision making algorithm is required to suggest the highest performing SpMV algorithm, especially for applications that use large sparse matrices with a variety of sparse patterns and make recurrent use of SpMV. The hereby proposed algorithm performs a statistical analysis of the sparse pattern and provides approximately 97% successful results. This result recommends the use of such clustering based heuristic design for appropriate sparse format selection.
Table 3 Sparse Matrix Collection Used in Experimentation

Matrix    NR    NC    NNZ    Sparsity = NNZ / (NR Γ— NC)
3D 51448_3D 51448 514484 1056610 0.00003992
add20 2395 2395 17319 0.00301934
add32 4960 4960 23884 0.00097083
adder_dcop_19 1813 1813 11245 0.00342109
aircraft 3754 7517 20267 0.00071821
airfoil 4253 4253 24578 0.00135880
airfoil_2d 14214 14214 259688 0.00128534
aug3dcqp 35543 35543 128115 0.00010141
bayer01 57735 57735 277774 0.00008333
bcsstk36 23052 23052 1143140 0.00215121
bcsstm38 8032 8033 10485 0.00016251
bfwa782 782 783 7514 0.01227164
bips07_3078_iv 21128 21128 75729 0.00016965
Bloweybl 30003 30003 120000 0.00013331
c64b 51035 51035 717841 0.00027561
coater1 1348 1348 19457 0.01070770
crankseg_2 63838 63838 14148858 0.00347187
crashbasis 160000 160000 1750416 0.00683756
delaunay_n15 32768 32768 196548 0.00018305
epb0 1794 1794 7764 0.00241235
FEM_3D_ thermal1 17880 17880 430740 0.00134735
fpga_trans_01 1220 1220 7382 0.00495969
G2_circuit 150102 150102 726674 0.00003225
gupta1 31802 31802 2164210 0.00213989
Hamrle2 5952 5952 22162 0.00062558
jagmesh2 1009 1009 6865 0.00674308
jagmesh3 1089 1089 7361 0.00620699
lhr07 7337 7337 156508 0.00290737
lung2 109460 109460 492564 0.00004111
net100 29920 29920 2033200 0.00227121
Zd_Jac6 22835 22835 1711983 0.00328320
Table 4 Execution Performance of Various SpMV Kernels, in GFLOP/sec

Matrix    CSR    CSR Vector    ELL    HYB    A_COO
3D 51448_3D 0.56 3.46 0.11 6.45 5.41
add20 0.34 0.5 0.22 0.15 0.15
add32 1.42 0.99 1.06 0.2 1.31
adder_dcop_19 0.1 0.38 0.02 0.09 0.1
aircraft 2.83 0.71 3.33 0.17 4.28
airfoil 2.95 0.41 3.6 0.2 3.86
airfoil_2d 1.17 2.03 10.99 6.64 11.2
aug3dcqp 4.51 0.24 6.58 1.01 4.87
bayer01 3.91 0.28 3.63 1.92 5.57
bcsstk36 0.66 7.95 2.32 6.04 4.81
bcsstm38 1.62 0.3 0.83 0.09 1
bfwa782 0.47 0.95 0.65 0.06 0.76
bips07_3078_iv 0.74 0.04 0.7 0.05 0.94
bloweybl 0.14 0.27 0.01 0.94 0.96
c64b 0.15 0.79 0.05 3.61 3.6
coater1 0.76 2.13 0.77 0.16 0.96
crankseg_2 0.61 8.39 2.28 11.5 6.37
crashbasis 2.09 2.06 16.21 16.18 11.06
delaunay_n15 3.31 1.31 5.7 1.55 4.26
epb0 0.8 0.54 1.07 0.06 1.19
FEM_3D_thermal1 0.87 4.44 14.63 14.59 10.01
Fpga_trans_01 0.35 0.88 0.31 0.06 0.06
G2_circuit 6.48 0.52 12.29 4.65 8.8
gupta1 0.3 4.56 0.2 5.64 5.24
Hamrle2 3.26 0.8 4.18 0.19 4.63
jagmesh2 1.17 0.45 1.15 0.05 1.46
jagmesh3 1.25 0.44 1.25 0.06 1.51
lhr07 1.2 1.36 3.76 1.09 4.51
lung2 5.41 0.86 9.14 3.34 9.45
net100 0.65 6.59 7.11 5.78 6.74
Zd_Jac6 1.82 2.93 1.82 4.14 4.31
REFERENCES
[1] Lee, I. Efficient sparse matrix vector multiplication using compressed graph, in
IEEE SoutheastCon 2010 (SoutheastCon), Proceedings of the, March 2010, pp.
328–331.
[2] Wang, H. C. and Hwang, K. Multicoloring for fast sparse matrix-vector
multiplication in solving pde problems, in Parallel Processing, 1993. ICPP 1993.
International Conference on, Vol. 3, Aug 1993, pp. 215–222.
[3] Jamroz, B. and Mullowney, P. Performance of parallel sparse matrix-vector
multiplications in linear solves on multiple gpus, in Application Accelerators in
High Performance Computing (SAAHPC), 2012 Symposium on, July 2012, pp.
149–152.
[4] Hestenes, M. R. and Stiefel, E. Methods of conjugate gradients for solving linear
systems, 1952.
[5] van der Veen, M. Sparse matrix vector multiplication on a field programmable
gate array, September 2007.
[6] Ashany, R. Application of sparse matrix techniques to search, retrieval,
classification and relationship analysis in large data base systems βˆ’ sparcom, in
Proceedings of the Fourth International Conference on Very Large Data Bases βˆ’
Volume 4, VLDB ’78, VLDB Endowment, 1978, pp. 499–516.
[7] Goharian, N., Grossman, D. and El-Ghazawi, T. Enterprise text processing: A
sparse matrix approach, Information Technology: Coding and Computing,
International Conference on, vol. 0, 2001.
[8] Bender, M. A., Brodal, G. S., Fagerberg, R., Jacob, R. and Vicari, E. Optimal
sparse matrix dense vector multiplication in the i/o-model, in Proceedings of the
Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures,
SPAA ’07, ACM, 2007.
[9] Manzini, G. Lower bounds for sparse matrix vector multiplication on hypercubic
networks, Vol. 2, 1998.
[10] Wu, T., Wang, B., Shan, Y., Yan, F., Wang, Y. and Xu, N. Efficient pagerank
and spmv computation on amd gpus, in ICPP, 2010, pp. 81–89.
[11] Gan, Z. and Harrison, R. Calibrating quantum chemistry: A multi-teraflop,
parallel-vector, full-configuration interaction program for the cray-x1, in
Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, Nov
2005.
[12] Shah, M. and Patel, V. An efficient sparse matrix multiplication for skewed
matrix on gpu, in High Performance Computing and Communication 2012 IEEE
9th International Conference on Embedded Software and Systems (HPCC-
ICESS), 2012 IEEE 14th International Conference on, June 2012, pp. 1301–1306.
[13] Bell, N. and Garland, M. Implementing sparse matrix-vector multiplication on
throughput-oriented processors, in SC, 2009.
[14] Dziekonski, A., Lamecki, A. and Mrozowski, M. A memory efficient and fast
sparse matrix vector product on a gpu, Progress In Electromagnetics Research,
Vol. 116, 2011, pp. 49–63.
[15] Vazquez, F., Ortega, G., Fernandez, J. and Garzon, E. Improving the performance
of the sparse matrix vector product with gpus, Computer and Information
Technology, International Conference on, vol. 0, 2010.
[16] Pinar, A. and Heath, M. T. Improving performance of sparse matrix-vector
multiplication, in Proceedings of the 1999 ACM/IEEE conference on
Supercomputing (CDROM), Supercomputing ’99, 1999.
[17] Shahnaz, R. and Usman, A. Blocked-based sparse matrix-vector multiplication on
distributed memory parallel computers. Int. Arab J. Inf. Technol., 2011.
[18] Yang, X., Parthasarathy, S. and Sadayappan, P. Fast sparse matrix-vector
multiplication on gpus: Implications for graph mining. CoRR, vol. abs/1103.2405,
2011.
[19] Cao, W., Yao, L., Li, Z., Wang, Y. and Wang, Z. Implementing sparse matrix-
vector multiplication using cuda based on a hybrid sparse matrix format, in
International Conference on Computer Application and System Modeling, 2010.
[20] Choi, J. W., Singh, A. and Vuduc, R. W. Model-driven autotuning of sparse
matrix-vector multiply on gpus, in Proceedings of the 15th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’10,
2010.
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...IAEME Publication
Β 
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...IAEME Publication
Β 
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...IAEME Publication
Β 
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...IAEME Publication
Β 
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...IAEME Publication
Β 
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTA MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTIAEME Publication
Β 

More from IAEME Publication (20)

IAEME_Publication_Call_for_Paper_September_2022.pdf
IAEME_Publication_Call_for_Paper_September_2022.pdfIAEME_Publication_Call_for_Paper_September_2022.pdf
IAEME_Publication_Call_for_Paper_September_2022.pdf
Β 
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
Β 
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURSA STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
Β 
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURSBROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
Β 
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONSDETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
Β 
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONSANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
Β 
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINOVOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
Β 
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
Β 
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMYVISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
Β 
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
Β 
GANDHI ON NON-VIOLENT POLICE
GANDHI ON NON-VIOLENT POLICEGANDHI ON NON-VIOLENT POLICE
GANDHI ON NON-VIOLENT POLICE
Β 
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
Β 
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
Β 
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
Β 
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
Β 
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
Β 
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
Β 
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
Β 
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
Β 
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTA MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
Β 

Recently uploaded

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
Β 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
Β 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
Β 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
Β 
Call Us ≽ 8377877756 β‰Ό Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 β‰Ό Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 β‰Ό Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 β‰Ό Call Girls In Shastri Nagar (Delhi)dollysharma2066
Β 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
Β 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
Β 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
Β 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
Β 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
Β 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
Β 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
Β 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoΓ£o Esperancinha
Β 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
Β 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
Β 
Gurgaon ✑️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✑️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✑️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✑️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
Β 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
Β 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
Β 

Recently uploaded (20)

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
Β 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
Β 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
Β 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
Β 
Call Us ≽ 8377877756 β‰Ό Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 β‰Ό Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 β‰Ό Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 β‰Ό Call Girls In Shastri Nagar (Delhi)
Β 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
Β 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
Β 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
Β 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
Β 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
Β 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
Β 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
Β 
young call girls in Rajiv ChowkπŸ” 9953056974 πŸ” Delhi escort Service
young call girls in Rajiv ChowkπŸ” 9953056974 πŸ” Delhi escort Serviceyoung call girls in Rajiv ChowkπŸ” 9953056974 πŸ” Delhi escort Service
young call girls in Rajiv ChowkπŸ” 9953056974 πŸ” Delhi escort Service
Β 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Β 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
Β 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
Β 
Gurgaon ✑️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✑️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✑️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✑️9711147426✨Call In girls Gurgaon Sector 51 escort service
Β 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
Β 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
Β 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
Β 

Sparse matrix GPU recommendation system

regular applications. Hence, it is a great challenge to use the GPU for an irregular application like SpMV. A generalized implementation of parallel SpMV is complicated by the following properties of sparse matrices:

1. Imbalanced number of non-zero elements in each row
2. Imbalanced number of non-zero elements in each column
3. Wide range of sparse patterns (diagonal, skewed, power-law distribution of non-zero elements per row, nearly equal number of non-zero elements per row, block, etc.)
4. Varied sparsity level (the ratio of non-zero elements to matrix size)

For an efficient and generalized implementation of SpMV on GPU, past research [12] highlights two important factors: (i) synchronization-free load distribution among computational resources, and (ii) a reduced number of fetch operations, to avoid the penalty of the GPU's costly memory accesses. Hence, it is preferable to select a sparse storage format that supports high compression together with synchronization-free load distribution. The major challenges in satisfying these factors are:

1. Continuous growth in datasets produces very large sparse matrices.
2. The indirection used in sparse matrix storage increases the amount of data transferred from CPU to GPU, adding overhead.
3. A large class of sparse matrix patterns exists.
4. Work distribution is hard to balance because the number of non-zero elements is imbalanced across rows as well as columns.
5. Concurrency is restricted by data dependencies among row elements when computing the output vector.

The high computing capability of the GPU and the unceasing performance demands of the SpMV kernel motivate researchers to optimize SpMV on GPU in ways that address all of the challenges listed above. Past research has produced the Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELLPACK (ELL), Hybrid (HYB), and Jagged Diagonal Storage (JDS) formats, each with a different compression strategy [13], together with SpMV algorithms for these formats on GPU. The bulky index structure of the COO format lowers the degree of synchronization-free load distribution among parallel threads and increases communication overhead between CPU and GPU. CSC schedules all columns of a sparse matrix sequentially in SpMV, and vector b is loaded and stored in every iteration; the resulting recurrent communication overhead limits the performance of CSC on GPU.
These factors account for the limited popularity of the COO and CSC sparse formats on GPU. Aligned Coordinate (Aligned COO) [12] was introduced as a compressed format suited to synchronization-free, balanced load distribution and proper cache utilization.
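To make the indirection overhead of these formats concrete, the short Python snippet below (an illustration using SciPy, not part of the original paper) builds one small matrix and prints the index structures kept by COO and CSR: COO stores a full row index per non-zero, while CSR compresses the row indices into NR + 1 row pointers.

    import numpy as np
    from scipy import sparse

    # A small 4 x 4 matrix with 7 non-zero elements.
    A = np.array([[5, 0, 0, 1],
                  [0, 3, 0, 0],
                  [2, 0, 4, 0],
                  [0, 0, 7, 6]], dtype=np.float32)

    coo = sparse.coo_matrix(A)            # three parallel arrays of length NNZ
    csr = sparse.csr_matrix(A)            # row pointers replace the row array

    print("COO row:", coo.row)            # [0 0 1 2 2 3 3]
    print("COO col:", coo.col)            # [0 3 1 0 2 2 3]
    print("CSR indptr:", csr.indptr)      # [0 2 3 5 7] -> NR + 1 entries
    print("CSR indices:", csr.indices)    # same column array as COO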
Sparse matrix metrics such as the Number of Rows (NR), Number of Columns (NC), Number of Non-Zero elements (NNZ), non-zero elements in a row (row_len), and non-zero elements in a column (col_len) play an important role in the compression ratio and the degree of parallelism of the various sparse formats. An important point is that the compression ratio of these recognized storage formats varies with the sparsity level and sparse pattern of the input matrix. Considering these factors, this paper proposes an algorithm to recommend a highly suitable storage format for a given sparse matrix.

The remainder of the paper is structured as follows. Section II traces the course of optimizing sparse formats and their SpMV implementations. Section III presents our heuristics and an algorithm that recommends a highly suitable storage format for implementing SpMV on GPU; the parallel CGS algorithm is discussed in Section III-D. Section IV demonstrates and analyses the results of the proposed work. Section V concludes the paper.

2. SPARSE STORAGE FORMATS

Many storage formats have been proposed in past research. As mentioned in Section I, compressed storage, synchronization-free load distribution, and the highest possible concurrency are the main goals when designing sparse matrix formats for NVIDIA GPUs and the CUDA programming environment. Bell et al. [13] introduced the COO, CSR, CSC, ELL, and HYB storage formats, which offer different levels of compression for different sparse matrix patterns. Shah et al. [12] introduced Aligned COO. Many extensions of these benchmark formats [14], [15], [16], [17], as well as hybrids of them [18], [19], [20], have also been proposed. Even after research on this large set of formats, no single standard format suits almost every class of sparse matrix pattern. In addition, it is difficult to identify a format that offers the best compression together with synchronization-free and balanced workload distribution.

Table 1 Sparse Matrix Formats and Their Space Complexity

Sparse matrix format    Space Complexity
COO                     NNZ x 3
CSC                     NNZ x 2 + (NC + 1)
CSR                     NNZ x 2 + (NR + 1)
ELL                     (NR x max row length) x 2
HYB                     ≅ ELL for rows of similar length; ≅ COO for the remaining row elements
Aligned COO             Num_segments x Segment_length x 3 ≅ max_row_length x (≤ NR) x 3

Selection of a proper data compression strategy is important for two major reasons: (i) the data transfer overhead between CPU and GPU, and (ii) the memory access pattern of each concurrent thread depends on the data structure. Table 1 presents the memory space required by each format. It shows that the compression percentage of the same format varies from one sparse matrix to another according to the basic statistics of the matrix. For example, COO provides the highest compression for small and highly sparse matrices; CSC and CSR give better compression for matrices that are small in terms of columns and rows, respectively; and ELL suits matrices with little variation in per-row NNZ and a large number of rows.
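The space complexities of Table 1 can be evaluated directly from the basic matrix metrics. The following sketch (our own helper, assuming one machine word per stored index or value) counts the words each format would occupy for a given row-length vector rl[], and makes the ELL padding penalty for skewed rows visible:

    def storage_words(rl, NR, NC):
        # Stored words per format, following Table 1.
        NNZ = sum(rl)
        max_row_len = max(rl)
        return {
            "COO": NNZ * 3,                 # row index, column index, value
            "CSC": NNZ * 2 + (NC + 1),      # row index + value, column pointers
            "CSR": NNZ * 2 + (NR + 1),      # column index + value, row pointers
            "ELL": (NR * max_row_len) * 2,  # padded values and column indices
        }

    # Four rows of very unequal length: ELL pads every row to length 97.
    print(storage_words(rl=[1, 1, 1, 97], NR=4, NC=100))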
COO, CSC, CSR, and ELL are the core sparse storage formats designed to support higher compression. HYB is designed to reduce the padding space of the ELL format, offering better compression as a hybrid of ELL and COO. Aligned COO provides better compression than ELL for highly skewed sparse matrices with a power-law distribution.

Table 2 SpMV Algorithms and Their Time Complexity

SpMV           Time Complexity (excluding memory access overhead)
COO_flat       ≅ NNZ / max_concurrent_threads
CSC            ≅ NC x max_col_len / max_concurrent_threads
CSR            ≤ (NR / max_concurrent_threads) x max_row_len
CSR (vector)   ≤ (NR / max_warps) x (max_row_len / warp_size + log2(warp_size)), where max_warps = max_concurrent_threads / warp_size
ELL            ≥ (NR / max_concurrent_threads) x max_row_len
HYB            ≅ ELL for rows of similar length + ≅ COO_flat for the remaining row elements
Aligned_COO    ≅ CSR for aligned rows + ≅ COO_flat for the remaining row elements

Increased concurrency and synchronization-free load distribution are the key factors in reducing the runtime of parallel SpMV on GPU. Table 2 summarizes the run-time complexity of the SpMV implementations of the storage formats listed above. The COO_flat algorithm offers the highest concurrency but does not ensure synchronization-free load distribution among concurrent threads, because row elements may cross warp boundaries. CSC is also less preferred because of the additional overhead of accessing the output vector in every iteration. On the other side, ELL incurs the overhead of transferring extraneous memory, containing padded zero values, through costly memory accesses.
CSR and ELL have very similar SpMV algorithms, except that CSR incurs an additional memory access to fetch the row index. The CSR implementation on GPU is more efficient than ELL when the NNZ accessed by one thread block or iteration is much larger than in another block or iteration. CSR Vector provides much higher concurrency than CSR and ELL, but carries the overhead of a series of parallel reduction steps in each thread. Hence, CSR Vector is not suitable when the average NNZ per row is smaller than the number of steps required by the parallel reduction, that is, log2(warp size). The HYB and Aligned COO kernels make SpMV efficient by hybridizing the formats and kernels mentioned above. Aligned COO reorders non-zero elements to balance the workload across computing resources, reducing the number of row segments compared with the number of rows in ELL while retaining the original maximum row length. Hence, Aligned COO gives optimized performance for highly skewed sparse matrix patterns.
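The structural difference between the CSR (scalar) and CSR Vector kernels compared above can be emulated host-side; the Python sketch below is our illustration of the thread and warp mapping, not the CUDA kernels themselves. CSR assigns one logical thread per row, whereas CSR Vector strides W lanes of a warp through each row and then combines the W partial sums in log2(W) reduction steps, which is why it only pays off when rows hold enough non-zeros.

    import numpy as np

    def spmv_csr_scalar(indptr, indices, data, x):
        # One logical thread per row; each thread walks its row serially.
        y = np.zeros(len(indptr) - 1)
        for row in range(len(y)):                 # rows run in parallel on GPU
            for j in range(indptr[row], indptr[row + 1]):
                y[row] += data[j] * x[indices[j]]
        return y

    def spmv_csr_vector(indptr, indices, data, x, W=32):
        # One warp of W lanes per row, followed by log2(W) reduction steps.
        y = np.zeros(len(indptr) - 1)
        for row in range(len(y)):
            partial = np.zeros(W)
            for lane in range(W):                 # lanes execute in lockstep
                for j in range(indptr[row] + lane, indptr[row + 1], W):
                    partial[lane] += data[j] * x[indices[j]]
            step = W // 2
            while step:                           # parallel reduction
                partial[:step] += partial[step:2 * step]
                step //= 2
            y[row] = partial[0]
        return y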
3. PROPOSED WORK

Section II discussed the strengths and weaknesses of the various sparse matrix storage formats and their SpMV implementations, indicating that the selection of the sparse storage format is an important factor for efficient SpMV on GPU. The collection of SpMV algorithms JAD, CSR, ELL, CSR Vector, HYB, and Aligned COO covers a wide spectrum of sparse matrix patterns. Recognizing the sparse matrix pattern, however, is a great challenge, and statistical analysis is a good methodology for doing so. A diagonal pattern is simple to recognize, and JAD is recommended for it. This section therefore proposes a strategy to suggest the most appropriate SpMV implementation for every sparse pattern except diagonal. The working flow of the proposed approach is described in Figure 1. K-mean clustering is used to derive detailed statistics from the basic matrix statistics NR, NC, NNZ, and the row-length vector rl[]. The derived statistics are compared against pre-defined heuristics to suggest the most appropriate SpMV algorithm. Section III-A explains the input and output parameters of the K-mean clustering step, and Section III-B defines the heuristics for the CSR, ELL, CSR Vector, HYB, and Aligned COO algorithms.

Figure 1 Working flow of Heuristic based Selection of SpMV algorithm

A detailed description of the heuristics-based SpMV selection algorithm is given in Section III-C. To prove the effectiveness of the proposed algorithm, a well-known SpMV application, CGS on GPU, is implemented as shown in Section III-D.

3.1. K-Mean Clustering and Its Parameters

Here, K-mean clustering is used to identify the level of similarity among sparse matrix rows using the row-length parameter. The algorithm constructs 2 clusters based on the row-length vector rl[]. For a highly skewed sparse matrix, the cluster centroid alone is not sufficient to predict row-length similarity. Hence, the K-mean clustering is slightly modified to report the Lower Bound (LB), Upper Bound (UB), Number of elements (CNT), and Centroid (C) for both cluster bins. The clusters are named H and L according to their centroid values, i.e. C_L < C_H, and the output parameters are correspondingly named LB_H, UB_H, CNT_H, C_H, LB_L, UB_L, CNT_L, and C_L, as shown in Figure 2. A minimal sketch of this two-cluster computation is given after the design goals below.

Figure 2 K-mean clustering for this proposed work

3.2. Heuristics

Based on empirical result analysis and a basic understanding of the various SpMV algorithms, heuristics are defined to suggest a sparse storage format and GPU-based SpMV algorithm capable of giving better performance for a given sparse matrix. The following points are the focus of the design of these heuristics:

1. Obtain the highest possible degree of concurrency
2. Better compression of the sparse matrix, to reduce memory access cost
3. Balanced workload distribution among threads
4. Synchronization-free load distribution as far as possible
5. A reduced number of blocks, to reduce block scheduling cost
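The two-cluster statistics of Section 3.1 can be sketched as follows; this is our minimal illustration (the paper does not give its implementation), running a plain two-centroid K-mean pass over rl[] and then deriving LB, UB, CNT, and C for clusters L and H:

    def cluster_row_lengths(rl, iters=20):
        # Two-bin K-mean over row lengths, seeded at the extremes.
        c_lo, c_hi = float(min(rl)), float(max(rl))
        for _ in range(iters):
            lo = [r for r in rl if abs(r - c_lo) <= abs(r - c_hi)]
            hi = [r for r in rl if abs(r - c_lo) > abs(r - c_hi)]
            if lo: c_lo = sum(lo) / len(lo)
            if hi: c_hi = sum(hi) / len(hi)
        if not hi:                        # degenerate case: all rows alike
            hi = lo
        stats = lambda xs: {"LB": min(xs), "UB": max(xs),
                            "CNT": len(xs), "C": sum(xs) / len(xs)}
        return {"L": stats(lo), "H": stats(hi)}

    # A highly skewed matrix: many short rows and a few very long ones.
    print(cluster_row_lengths([2, 3, 2, 4, 3, 250, 300]))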
3.3. Heuristics for CSR Vector

CSR Vector is designed to deliver the highest possible concurrency with synchronization-free load distribution, which in turn ensures good accuracy. Every execution thread of this SpMV algorithm executes at least 1 multiplication and log2 W addition operations. For the execution of CSR Vector, a warp W (a collection of execution threads, 32 in general) is allotted to each row of the sparse matrix. The CSR storage format is used to implement this SpMV algorithm; however, CSR is preferred for small matrices, to avoid a large number of costly memory accesses for fetching row indices. Considering all these criteria, CSR Vector is preferred when the following condition is satisfied (see Algorithm 1): (C_L ≥ log2 W) and (C_H ≥ W/2) and (NNZ ≤ max_threads).

3.4. Heuristics for CSR

CSR SpMV is preferred for a small matrix in which no row has a large number of non-zero elements and the majority of rows are not of equivalent size in terms of non-zero elements. Hence, CSR SpMV is preferred when CSR Vector is not applicable and these criteria are satisfied.

3.5. Heuristics for ELL

The ELL storage format and its SpMV algorithm are preferred for a sparse matrix with equivalent row lengths, which reduces padding overhead and improves performance; a large row length, however, reduces the degree of concurrency in ELL SpMV. ELL is preferred when there is not much difference either between the centroid values of the two clusters or between the upper bound of the higher-value cluster and the centroid of the lower-value cluster. Hence, ELLPACK is preferred when CSR and CSR Vector are not applicable for the given sparse matrix and the following condition is satisfied: ((C_L / C_H ≥ 0.49) or ((C_H − C_L) ≤ 6)).

3.6. Heuristics for HYB

When a large sparse matrix does not have equivalent row lengths but shows a power-law distribution of non-zero elements among its rows, with a highly skewed visualization, the Hybrid sparse format and its SpMV are preferred. Hence, HYB is preferred when the CSR and ELL formats are not suitable for the given sparse matrix and the following condition is satisfied: ((C_H / C_L ≥ 100) or (UB_H / LB_H ≥ 100) or (NNZ / NR ≥ 100)).

3.7. Heuristics for Aligned COO

The Aligned COO format and its SpMV are designed to optimize performance for a large sparse matrix with a skewed distribution of non-zero elements that also admits alignment of long rows with short rows, which reduces the required number of execution units. As it is based on the COO format, however, it provides less compression than the hybrid format. Hence, Aligned COO is preferred when neither CSR nor ELL nor HYB is suitable for the given sparse matrix.
3.8. Heuristics-Based Sparse Format Recommendation

This section describes an algorithm to suggest the most suitable sparse format, and its associated SpMV, for a given sparse matrix. It performs K-mean clustering on the sparse matrix metrics and compares the output parameters with the heuristics defined in the sections above.

Algorithm 1 Heuristics based Sparse format recommendation
Input: NNZ, NR, NC, rl[]
Output: Suitable_SpMV
Perform K-mean clustering with two bins on rl[]
if (C_L ≥ log2 W) and (C_H ≥ W/2) and (NNZ ≤ max_threads) then
    Suitable_SpMV = CSR Vector
else if the CSR criteria of Section 3.4 hold then
    Suitable_SpMV = CSR
else if ((C_L / C_H ≥ 0.49) or ((C_H − C_L) ≤ 6)) then
    Suitable_SpMV = ELL
else if ((C_H / C_L ≥ 100) or (UB_H / LB_H ≥ 100) or (NNZ / NR ≥ 100)) then
    Suitable_SpMV = HYB
else
    Suitable_SpMV = Aligned COO
end if
Return Suitable_SpMV
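A compact Python rendering of Algorithm 1 may look like the sketch below. This is our reading of Sections 3.3 to 3.7, not the authors' code: the CSR test is stated only in prose in the source, so it appears here as a simple size check, and max_threads is a hypothetical device capacity (21504 is the resident-thread limit of the C2070 used in Section 4).

    from math import log2

    def recommend_spmv(NNZ, NR, NC, C_L, C_H, LB_H, UB_H,
                       W=32, max_threads=21504):
        # Cluster statistics come from the two-bin K-mean of Section 3.1.
        if C_L >= log2(W) and C_H >= W / 2 and NNZ <= max_threads:
            return "CSR Vector"                        # Section 3.3
        if NNZ <= max_threads and C_H < W / 2:
            return "CSR"                               # Section 3.4 (prose test)
        if C_L / C_H >= 0.49 or (C_H - C_L) <= 6:
            return "ELL"                               # Section 3.5
        if C_H / C_L >= 100 or UB_H / LB_H >= 100 or NNZ / NR >= 100:
            return "HYB"                               # Section 3.6
        return "Aligned COO"                           # Section 3.7 (fallback)

    # A large, highly skewed matrix falls through to the HYB test.
    print(recommend_spmv(NNZ=14148858, NR=63838, NC=63838,
                         C_L=40, C_H=4000, LB_H=600, UB_H=12000))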
3.9. Parallel CGS

To demonstrate the effectiveness of the proposed heuristics-based sparse format recommendation algorithm, it is best tested on an application that makes frequent and heavy use of the SpMV kernel and is applicable to a wide category of sparse patterns. The Conjugate Gradient Solver (CGS) is such a well-known application: it finds the solution vector x for Ax = b, and every CGS call invokes the SpMV kernel in a loop of hundreds to thousands of iterations. As the GPU carries a major memory transfer overhead between CPU and GPU, this parallel CGS is designed so that the sparse matrix A and input vector b need to be transferred only at the first iteration. The GPU-based parallel CGS is described in Algorithm 2, where the SpMV kernel is executed for the specified number of iterations or for the size of the sparse matrix, which is in the hundreds.

Algorithm 2 Parallel CGS
Input: Sparse Matrix A, Vector b, NR, NC, iterations
Output: Vector x
Initialize vector dev_x = 0
Copy vectors from host memory to device memory (b → dev_r, and b → dev_p)
Copy matrix from host memory to device memory (A → dev_A)
Compute dev_rsold = dev_r^T x dev_r using dev_inner_product(dev_r, dev_r, NR, dev_rsold)
for i = 1 → min(iterations, NR * NC) do
    Initialize vector dev_Ap = 0
    Perform Ap = A * p using dev_SpMV(dev_A, dev_p, dev_Ap)
    Perform pAp = p^T * Ap using dev_inner_product(dev_p, dev_Ap, NR, dev_pAp)
    dev_alpha = dev_rsold / dev_pAp
    Asynchronously compute dev_x and dev_r:
        Perform x += alpha * p using dev_add_scalarMul(dev_alpha, dev_p, dev_x, 1, dev_x)
        Perform r -= alpha * Ap using dev_add_scalarMul(dev_alpha, dev_Ap, dev_r, -1, dev_r)
    Compute dev_rsnew = dev_r^T x dev_r using dev_inner_product(dev_r, dev_r, NR, dev_rsnew)
    Copy device dev_rsnew to host rsnew
    if √rsnew < 1e-10 then
        Exit for
    end if
    Compute dev_temp = dev_rsnew / dev_rsold
    Perform p = r + ((rsnew / rsold) * p) using dev_add_scalarMul(dev_temp, dev_p, dev_r, 1, dev_p)
    dev_rsold = dev_rsnew
end for
Copy device vector to host vector (dev_x → x)
Return x
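For reference, the same CGS iteration can be written host-side in a few lines of NumPy. This is an illustrative restatement of Algorithm 2 without the device-memory traffic; the 1e-10 tolerance mirrors the exit test above, and each A @ p is exactly one SpMV call, the kernel the recommender selects.

    import numpy as np
    from scipy import sparse

    def cgs(A, b, iterations=200, tol=1e-10):
        x = np.zeros_like(b)
        r = b.copy()
        p = b.copy()
        rsold = r @ r
        for _ in range(iterations):
            Ap = A @ p                     # SpMV: the dominant per-iteration cost
            alpha = rsold / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rsnew = r @ r
            if np.sqrt(rsnew) < tol:       # same convergence test as Algorithm 2
                break
            p = r + (rsnew / rsold) * p
            rsold = rsnew
        return x

    A = sparse.csr_matrix(np.array([[4.0, 1.0], [1.0, 3.0]]))
    print(cgs(A, np.array([1.0, 2.0])))    # approx. [0.0909, 0.6364]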
4. EXPERIMENTAL RESULT

For a proper evaluation of the proposed algorithm, the CSR, CSR Vector, ELL, and HYB SpMV implementations from the NVIDIA cusp library, together with the Aligned COO algorithm, have been implemented on an NVIDIA GPU. The collection of sparse matrices used in this experiment is listed, along with basic properties, in Table 3.

4.1. Test Platform

The experiments were executed on an Intel(R) Core(TM) i3 CPU @ 3.20 GHz with 4 GB RAM, 2 x 256 KB L2 cache, and 4 MB L3 cache, and an NVIDIA C2070 GPU device, using CUDA version 4.0 on Ubuntu 11. The dataset contains 31 sparse matrices retrieved from the well-known University of Florida Sparse Matrix Collection. The matrices were selected so that the collection spans various sparsity levels and a large category of sparse patterns.

4.2. Result Analysis

Table 4 lists the performance of the CSR, CSR Vector, ELL, HYB, and Aligned COO algorithms in GFLOP/s for each matrix of Table 3. The heuristics-based SpMV selection algorithm was implemented and its output compared with the performance results recorded in Table 4. The overhead of memory transfer between CPU and GPU is always an important factor in overall performance; this cost is considered to be amortized over a large number of iterations. The parallel CGS of Algorithm 2 was also implemented, for 200 iterations, and its execution time including memory transfer time was compared with the result of the proposed algorithm. The recommendation of the proposed algorithm is satisfied for 30 of the 31 sparse matrices in the dataset.

5. CONCLUSION

In this paper, the various factors responsible for achieving higher SpMV performance on GPU across sparse patterns have been discussed. Some decision-making algorithm is clearly required to suggest the highest-performing SpMV algorithm, especially for applications that use large sparse matrices with a variety of sparse patterns and make recurrent use of SpMV. The proposed algorithm performs a statistical analysis of the sparse pattern and provides approximately 97% successful results. This result recommends the use of such clustering-based heuristics for appropriate sparse format selection.
Table 3 Sparse Matrix Collection Used in Experimentation

Matrix            NR      NC      NNZ       Sparsity (NNZ / (NR x NC))
3D_51448_3D       51448   514484  1056610   0.00003992
add20             2395    2395    17319     0.00301934
add32             4960    4960    23884     0.00097083
adder_dcop_19     1813    1813    11245     0.00342109
aircraft          3754    7517    20267     0.00071821
airfoil           4253    4253    24578     0.00135880
airfoil_2d        14214   14214   259688    0.00128534
aug3dcqp          35543   35543   128115    0.00010141
bayer01           57735   57735   277774    0.00008333
bcsstk36          23052   23052   1143140   0.00215121
bcsstm38          8032    8033    10485     0.00016251
bfwa782           782     783     7514      0.01227164
bips07_3078_iv    21128   21128   75729     0.00016965
bloweybl          30003   30003   120000    0.00013331
c64b              51035   51035   717841    0.00027561
coater1           1348    1348    19457     0.01070770
crankseg_2        63838   63838   14148858  0.00347187
crashbasis        160000  160000  1750416   0.00006838
delaunay_n15      32768   32768   196548    0.00018305
epb0              1794    1794    7764      0.00241235
FEM_3D_thermal1   17880   17880   430740    0.00134735
fpga_trans_01     1220    1220    7382      0.00495969
G2_circuit        150102  150102  726674    0.00003225
gupta1            31802   31802   2164210   0.00213989
Hamrle2           5952    5952    22162     0.00062558
jagmesh2          1009    1009    6865      0.00674308
jagmesh3          1089    1089    7361      0.00620699
lhr07             7337    7337    156508    0.00290737
lung2             109460  109460  492564    0.00004111
net100            29920   29920   2033200   0.00227121
Zd_Jac6           22835   22835   1711983   0.00328320
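The statistics of Table 3 can be reproduced for any matrix from the same collection. The snippet below (the file name is a placeholder; assumes SciPy) reads a Matrix Market file and prints NR, NC, NNZ, and the sparsity used in the last column:

    from scipy.io import mmread

    A = mmread("add32.mtx").tocsr()           # any Matrix Market file
    NR, NC = A.shape
    print(NR, NC, A.nnz, A.nnz / (NR * NC))   # e.g. 4960 4960 23884 0.00097083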
Table 4 Execution Performance of Various SpMV Kernels

                   Performance (GFLOP/sec)
Matrix             CSR    CSR Vector  ELL    HYB    A_COO
3D_51448_3D        0.56   3.46        0.11   6.45   5.41
add20              0.34   0.5         0.22   0.15   0.15
add32              1.42   0.99        1.06   0.2    1.31
adder_dcop_19      0.1    0.38        0.02   0.09   0.1
aircraft           2.83   0.71        3.33   0.17   4.28
airfoil            2.95   0.41        3.6    0.2    3.86
airfoil_2d         1.17   2.03        10.99  6.64   11.2
aug3dcqp           4.51   0.24        6.58   1.01   4.87
bayer01            3.91   0.28        3.63   1.92   5.57
bcsstk36           0.66   7.95        2.32   6.04   4.81
bcsstm38           1.62   0.3         0.83   0.09   1
bfwa782            0.47   0.95        0.65   0.06   0.76
bips07_3078_iv     0.74   0.04        0.7    0.05   0.94
bloweybl           0.14   0.27        0.01   0.94   0.96
c64b               0.15   0.79        0.05   3.61   3.6
coater1            0.76   2.13        0.77   0.16   0.96
crankseg_2         0.61   8.39        2.28   11.5   6.37
crashbasis         2.09   2.06        16.21  16.18  11.06
delaunay_n15       3.31   1.31        5.7    1.55   4.26
epb0               0.8    0.54        1.07   0.06   1.19
FEM_3D_thermal1    0.87   4.44        14.63  14.59  10.01
fpga_trans_01      0.35   0.88        0.31   0.06   0.06
G2_circuit         6.48   0.52        12.29  4.65   8.8
gupta1             0.3    4.56        0.2    5.64   5.24
Hamrle2            3.26   0.8         4.18   0.19   4.63
jagmesh2           1.17   0.45        1.15   0.05   1.46
jagmesh3           1.25   0.44        1.25   0.06   1.51
lhr07              1.2    1.36        3.76   1.09   4.51
lung2              5.41   0.86        9.14   3.34   9.45
net100             0.65   6.59        7.11   5.78   6.74
Zd_Jac6            1.82   2.93        1.82   4.14   4.31
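The throughput figures in Table 4 follow the usual SpMV convention of counting one multiply and one add per non-zero element; assuming that convention, GFLOP/s can be derived as in this small helper (the elapsed time shown is an invented example, not a measurement from the paper):

    def spmv_gflops(NNZ, iterations, elapsed_seconds):
        # 2 floating-point operations (multiply + add) per non-zero, per pass.
        return 2.0 * NNZ * iterations / elapsed_seconds / 1e9

    # 200 SpMV passes over a matrix with 1,750,416 non-zeros in 43.2 ms.
    print(spmv_gflops(NNZ=1750416, iterations=200, elapsed_seconds=0.0432))  # ~16.2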
REFERENCES

[1] Lee, I. Efficient sparse matrix vector multiplication using compressed graph, in IEEE SoutheastCon 2010 (SoutheastCon), Proceedings of the, March 2010, pp. 328–331.
[2] Wang, H. C. and Hwang, K. Multicoloring for fast sparse matrix-vector multiplication in solving PDE problems, in Parallel Processing, 1993. ICPP 1993. International Conference on, Vol. 3, Aug 1993, pp. 215–222.
[3] Jamroz, B. and Mullowney, P. Performance of parallel sparse matrix-vector multiplications in linear solves on multiple GPUs, in Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on, July 2012, pp. 149–152.
[4] Hestenes, M. R. and Stiefel, E. Methods of conjugate gradients for solving linear systems, 1952.
[5] van der Veen, M. Sparse matrix vector multiplication on a field programmable gate array, September 2007.
[6] Ashany, R. Application of sparse matrix techniques to search, retrieval, classification and relationship analysis in large data base systems – SPARCOM, in Proceedings of the Fourth International Conference on Very Large Data Bases – Volume 4, VLDB '78, VLDB Endowment, 1978, pp. 499–516.
[7] Goharian, N., Grossman, D. and El-Ghazawi, T. Enterprise text processing: A sparse matrix approach, in Information Technology: Coding and Computing, International Conference on, Vol. 0, 2001.
[8] Bender, M. A., Brodal, G. S., Fagerberg, R., Jacob, R. and Vicari, E. Optimal sparse matrix dense vector multiplication in the I/O-model, in Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '07, ACM, 2007.
[9] Manzini, G. Lower bounds for sparse matrix vector multiplication on hypercubic networks, Vol. 2, 1998.
[10] Wu, T., Wang, B., Shan, Y., Yan, F., Wang, Y. and Xu, N. Efficient PageRank and SpMV computation on AMD GPUs, in ICPP, 2010, pp. 81–89.
[11] Gan, Z. and Harrison, R. Calibrating quantum chemistry: A multi-teraflop, parallel-vector, full-configuration interaction program for the Cray-X1, in Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, Nov 2005.
[12] Shah, M. and Patel, V. An efficient sparse matrix multiplication for skewed matrix on GPU, in High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on, June 2012, pp. 1301–1306.
[13] Bell, N. and Garland, M. Implementing sparse matrix-vector multiplication on throughput-oriented processors, in SC, 2009.
[14] Dziekonski, A., Lamecki, A. and Mrozowski, M. A memory efficient and fast sparse matrix vector product on a GPU, Progress In Electromagnetics Research, Vol. 116, 2011, pp. 49–63.
[15] Vazquez, F., Ortega, G., Fernandez, J. and Garzon, E. Improving the performance of the sparse matrix vector product with GPUs, in Computer and Information Technology, International Conference on, Vol. 0, 2010.
[16] Pinar, A. and Heath, M. T. Improving performance of sparse matrix-vector multiplication, in Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (CDROM), Supercomputing '99, 1999.
[17] Shahnaz, R. and Usman, A. Blocked-based sparse matrix-vector multiplication on distributed memory parallel computers, Int. Arab J. Inf. Technol., 2011.
[18] Yang, X., Parthasarathy, S. and Sadayappan, P. Fast sparse matrix-vector multiplication on GPUs: Implications for graph mining, CoRR, Vol. abs/1103.2405, 2011.
[19] Cao, W., Yao, L., Li, Z., Wang, Y. and Wang, Z. Implementing sparse matrix-vector multiplication using CUDA based on a hybrid sparse matrix format, in International Conference on Computer Application and System Modeling, 2010.
[20] Choi, J. W., Singh, A. and Vuduc, R. W. Model-driven autotuning of sparse matrix-vector multiply on GPUs, in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, 2010.