This document proposes an algorithm to recommend the most suitable sparse storage format for implementing sparse matrix-vector multiplication (SpMV) on a GPU. The algorithm applies k-means clustering to metrics of the input sparse matrix, such as row length, to analyze the matrix pattern. Based on this analysis and predefined heuristics about the performance of different SpMV algorithms, it recommends a storage format, such as CSR, ELL, Hybrid, or Aligned COO, that is well suited to the matrix pattern. The performance of the recommended storage format is demonstrated on a conjugate gradient solver application to show the effectiveness of the algorithm at choosing a high-performing approach.
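A minimal sketch of the idea described above, assuming a toy 1-D k-means over row lengths and illustrative decision thresholds; the actual metrics, cluster count, and heuristics used by the algorithm are not specified here, so everything below is an assumption:

```python
import random

def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means on row lengths (illustrative, not the paper's code)."""
    centers = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[i].append(v)
        # Keep the old center if a cluster goes empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

def recommend_format(row_lengths):
    lo, hi = kmeans_1d(row_lengths, k=2)
    mean = sum(row_lengths) / len(row_lengths)
    if hi - lo < 0.25 * mean:   # rows nearly uniform: ELL pads little
        return "ELL"
    if hi > 4 * mean:           # a few very long rows: split storage (Hybrid)
        return "HYB"
    return "CSR"                # irregular but moderate rows
```

The thresholds (`0.25 * mean`, `4 * mean`) are hypothetical stand-ins for the paper's predefined heuristics.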
Effective Sparse Matrix Representation for the GPU Architectures - IJCSEA Journal
General-purpose computation on graphics processing units (GPUs) is prominent in today's high performance computing era. Porting data-parallel applications to the GPU yields a baseline performance improvement simply from the larger number of computational units, but better performance is seen when application-specific fine tuning is done with respect to the architecture under consideration. One very widely used, computation-intensive kernel in sparse matrix based applications is sparse matrix-vector multiplication (SpMV). Most existing sparse matrix data formats were developed with respect to the central processing unit (CPU) or multi-cores. This paper gives a new sparse matrix representation designed for the graphics processor architecture that, for the class of applications that fit the proposed format, gives a 2x to 5x performance improvement compared to CSR (compressed row format), 2x to 54x compared to COO (coordinate format), and 3x to 10x compared to the CSR vector format. It also gives 10% to 133% improvement in the CPU-GPU memory transfer of the sparse matrix access information alone. The paper gives the details of the new format and its requirements, with complete experimentation details and comparison results.
Accelerating sparse matrix-vector multiplication in iterative methods using GPU - Subhajit Sahu
Kiran Kumar Matam; Kishore Kothapalli
19 Paper Citations · 1 Patent Citation · 486 Full Text Views
Multiplying a sparse matrix with a vector (spmv for short) is a fundamental operation in many linear algebra kernels. Having an efficient spmv kernel on modern architectures such as GPUs is therefore of principal interest. The computational challenges that spmv poses are significantly different from those of dense linear algebra kernels. Recent work in this direction has focused on designing data structures to represent sparse matrices so as to improve the efficiency of spmv kernels. However, as the nature of sparseness differs across sparse matrices, there is no clear answer as to which data structure to use for a given sparse matrix. In this work, we address this problem by devising techniques to understand the nature of the sparse matrix and then choose appropriate data structures accordingly. By using our technique, we are able to improve the performance of the spmv kernel on an Nvidia Tesla GPU (C1060) by a factor of up to 80% in some instances, and about 25% on average, compared to the best results of Bell and Garland [3] on the standard dataset (cf. Williams et al. SC'07) used in recent literature. We also use our spmv in the conjugate gradient method and show an average 20% improvement compared to using the HYB spmv of [3], on the dataset obtained from the University of Florida Sparse Matrix Collection [9].
Published in: 2011 International Conference on Parallel Processing
Date of Conference: 13-16 Sept. 2011
Date Added to IEEE Xplore: 17 October 2011
INSPEC Accession Number: 12316254
DOI: 10.1109/ICPP.2011.82
Publisher: IEEE
Conference Location: Taipei City, Taiwan
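The CSR baseline this abstract compares against can be sketched in a few lines; the matrix below is a made-up example, and this plain-Python loop is only an illustration of the data structure, not the paper's GPU kernel:

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for A stored in compressed sparse row (CSR) form."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        # Nonzeros of row i live in vals[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```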
The document discusses porting a seismic inversion code to run in parallel using standard message passing libraries. It describes three options considered for distributing the large 3D seismic data across processors: mapping the data to a processor grid, treating it as a sparse matrix problem, or distributing the data as 1D vectors assigned to each processor. The third option was chosen as it best preserved the code structure, had regular dependencies, and simplified communications. The parallel code was implemented using the Distributed Data Library (DDL) for data management and the Message Passing Interface (MPI) for basic point-to-point communication between processors. Initial tests on clusters showed near linear speedup on up to 30 processors.
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl... - Dhiraj Chaudhary
This document describes a coarse-grained reconfigurable architecture with a Network-on-Chip (NoC) router designed for variable block size motion estimation. The architecture contains 16 processing elements arranged in a 2D array that can calculate Sum of Absolute Differences (SAD) for different block sizes. An NoC with intelligent routers is used to direct reference block data between processing elements to reduce memory interactions and increase computation efficiency. The architecture supports fast search algorithms like diamond search that further improve performance over full search.
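The Sum of Absolute Differences cost that each processing element computes can be illustrated as follows; the block contents and sizes are arbitrary examples, not taken from the document:

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally-sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

cur = [[10, 12], [11, 13]]   # block from the current frame
ref = [[ 9, 12], [14, 10]]   # candidate block from the reference frame
print(sad(cur, ref))  # 1 + 0 + 3 + 3 = 7
```

In motion estimation, the search picks the reference block position minimizing this cost; supporting variable block sizes just means evaluating SAD over different block dimensions.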
This document discusses the design of a pipelined architecture for sparse matrix-vector multiplication on an FPGA. It begins with introductions to matrices, linear algebra, and matrix multiplication. It then describes the objective of building a hardware processor to perform multiple arithmetic operations in parallel through pipelining. The document reviews literature on pipelined floating point units. It provides details on the proposed pipelined design for sparse matrix-vector multiplication, including storing vector values in on-chip memory and using multiple pipelines to complete results in parallel. Simulation results showing reduced power and execution time are presented before concluding the design can improve performance for scientific applications.
The document proposes a new model for emulating massively parallel single instruction multiple data (SIMD) machines in a distributed system using a network of virtual processing elements managed by distributed host agents. It describes the architecture of the model, which uses virtual processing elements arranged in topological structures like meshes and GPUs to emulate different parallel machine architectures. An example application of edge detection on an MRI image is provided to demonstrate the performance of the proposed parallel virtual machine model.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
This document proposes a new magnetic flip-flop (MFF) design that uses checkpointing, power gating, and self-enable mechanisms to achieve ultra-low power compared to conventional CMOS flip-flops and previous MFF designs. The key aspects of the design are:
1) Checkpointing periodically stores the system state in spin-transfer torque magnetic random access memory (STT-MRAM) cells integrated locally in each flip-flop, reducing dynamic power needs.
2) Power gating cuts off power supply voltage when entering standby mode, suppressing static leakage power.
3) Self-enable switching allows programming pulses to STT-MRAM cells to be shortened.
Memory and I/O optimized rectilinear Steiner minimum tree routing for VLSI - IJECEIAES
As device sizes scale down at a rapid pace, interconnect delay plays a major part in the performance of IC chips, so minimizing delay and wire length is the most desired objective. FLUTE (fast look-up table) presented a fast and accurate RSMT (rectilinear Steiner minimum tree) construction, with an optimization technique that reduces the time complexity of RSMT construction for both smaller and larger degree nets. However, for larger degree nets this technique incurs memory overhead, as it does not consider the memory requirement of constructing the RSMT. Since memory is scarce and expensive, it is desirable to utilize it more efficiently, which in turn reduces I/O time (i.e., the number of I/O disk accesses). The proposed work presents a memory-optimized RSMT (MORSMT) construction to address the memory overhead for larger degree nets, adopting a depth-first search and divide-and-conquer approach to build the memory-optimized tree. Experiments evaluate the proposed approach against the existing model on varied benchmarks in terms of computation time, memory overhead, and wire length; the results show that the proposed model is scalable and efficient.
Parallel algorithms for multi-source graph traversal and its applications - Subhajit Sahu
Highlighted notes on Parallel algorithms for multi-source graph traversal and its applications.
While doing research work under Prof. Kishore Kothapalli.
Seema is working on Multi-source BFS with hybrid-CSR, with applications in APSP, diameter, centrality, reachability.
BFS can be either top-down (from visited nodes, mark the next frontier) or bottom-up (from unvisited nodes, mark the next frontier). She mentioned that a hybrid approach is more efficient. EtaGraph uses unified degree cut (UDC) graph partitioning, and also overlaps data transfer with kernel execution. iCENTRAL uses biconnected components for betweenness centrality on dynamic graphs.
Hybrid CSR uses an additional value array storing packed "has edge/neighbour" bits. This can give a better memory access pattern if many bits are set, but cause many threads to wait if many bits are zero. She mentioned the Volta architecture has an independent PC and stack per thread (similar to a CPU?). Does it then not matter if the threads in a block diverge?
(BFS = G*v, Multi-source BFS = G*vs)
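The top-down/bottom-up hybrid mentioned in the notes above can be sketched sequentially; the frontier-size threshold (`n // 2`) and the tiny 4-cycle graph are illustrative assumptions, and real implementations switch based on edge counts rather than vertex counts:

```python
def hybrid_bfs(adj, src):
    """Direction-optimizing BFS: top-down for small frontiers, bottom-up otherwise."""
    n = len(adj)
    dist = [-1] * n
    dist[src] = 0
    frontier = {src}
    level = 0
    while frontier:
        level += 1
        if len(frontier) < n // 2:
            # Top-down: visited frontier vertices push to unvisited neighbours.
            nxt = {v for u in frontier for v in adj[u] if dist[v] < 0}
        else:
            # Bottom-up: each unvisited vertex checks for a parent in the frontier.
            nxt = {v for v in range(n) if dist[v] < 0
                   and any(dist[u] == level - 1 for u in adj[v])}
        for v in nxt:
            dist[v] = level
        frontier = nxt
    return dist

adj = [[1, 2], [0, 3], [0, 3], [1, 2]]   # a 4-cycle
print(hybrid_bfs(adj, 0))  # [0, 1, 1, 2]
```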
HOMOGENEOUS MULTISTAGE ARCHITECTURE FOR REAL-TIME IMAGE PROCESSING - cscpconf
The document describes a homogeneous multistage architecture for real-time image processing. It proposes a parallel architecture using multiple identical processing elements connected by different communication links. As an example application, it discusses a multi-hypothesis approach for road recognition, which uses multiple hypotheses to detect and track road edges in video in real-time. Experimental results using a FPGA demonstrate the architecture can detect roadsides in images within 60 milliseconds.
The processor-memory bandwidth in modern generation processors is an important bottleneck, since a number of processor cores contend for the same bus/processor-memory interface. Caches consume a significant amount of energy in current microprocessors, so to design an energy-efficient microprocessor it is important to optimize cache energy consumption. Effective utilization of this resource is consequently an important aspect of the memory hierarchy design of multi-core processors. This is presently an active field of research, with a large number of studies that have suggested techniques to address the problem. The main contribution of this work is an assessment of the effectiveness of some of the techniques that were employed in recent chip multiprocessors. Cache optimization techniques that were designed for single-core processors but have not been implemented in multi-core processors are also tested to forecast their effectiveness.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document discusses sparse matrix formats and their impact on GPU performance for solving large sparse linear systems that arise from computational modeling techniques like finite element analysis. It analyzes the compressed sparse row (CSR) format and an alternative blocked CSR format. Performance tests on a CPU and GPU show that CSR's irregular memory access hurts GPU performance, while blocked CSR improves data locality and mitigates this issue. The findings illustrate how hardware, software strategies, and sparse matrix storage affect computational performance.
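The blocked CSR idea described above can be sketched as follows: nonzeros are stored as small dense blocks so each block reuses a contiguous slice of x, which is the locality benefit the document credits to blocked CSR. The 2x2 block size, layout, and example matrix are common conventions assumed here, not necessarily those tested in the document:

```python
R = C = 2  # block dimensions

def spmv_bcsr(brow_ptr, bcol_idx, blocks, x):
    """y = A @ x for A stored as 2x2 block-CSR (BCSR)."""
    nb = len(brow_ptr) - 1
    y = [0.0] * (nb * R)
    for bi in range(nb):
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            bj = bcol_idx[k]
            # Dense block times a contiguous slice of x: good locality.
            for r in range(R):
                for c in range(C):
                    y[bi * R + r] += blocks[k][r][c] * x[bj * C + c]
    return y

# A = [[1, 2, 0, 0],
#      [3, 4, 0, 0],
#      [0, 0, 5, 6],
#      [0, 0, 7, 8]]
brow_ptr = [0, 1, 2]
bcol_idx = [0, 1]
blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(spmv_bcsr(brow_ptr, bcol_idx, blocks, [1.0] * 4))  # [3.0, 7.0, 11.0, 15.0]
```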
This document summarizes the results of applying the Successive Survivable Routing (SSR) algorithm to different network models to calculate backup capacity requirements. The SSR algorithm provides an approximate solution to the Spare Capacity Assignment problem of finding the minimum backup capacity needed to deploy equivalent failure-disjoint recovery paths. The document finds that enhanced versions of SSR that incorporate capacity giveback and state-dependent restoration can reduce backup capacity requirements by an average of 12% compared to the standard SSR algorithm. These SSR techniques were applied to reference networks, realistic networks from internet topology data, and synthetic networks generated by the BRITE topology generator.
A STUDY OF ENERGY-AREA TRADEOFFS OF VARIOUS ARCHITECTURAL STYLES FOR ROUTING ... - VLSICS Design
This document presents a study exploring various architectural styles for routing inputs in a coarse-grained reconfigurable fabric. It examines integrated constants, inputs coming from the side of the fabric, and combinations with dedicated pass gates. It implements these styles using a 90nm ASIC process and analyzes the area and energy tradeoffs. It finds that an approach combining inputs coming from the side with 50% dedicated pass gates provides the best results, with 31% energy savings and 62% area savings over the baseline architecture.
A tutorial on CGAL polyhedron for subdivision algorithms - Radu Ursu
This document provides a tutorial on implementing subdivision algorithms using the CGAL polyhedron data structure. It summarizes two approaches for subdivision: using Euler operators for √3 subdivision and a modifier callback mechanism for quad-triangle subdivision. It then introduces a combinatorial subdivision library (CSL) with increased abstraction, demonstrating Catmull-Clark and Doo-Sabin subdivisions. Accompanying applications visualize the subdivision schemes and provide interaction capabilities. The goal is to demonstrate connectivity and geometry operations on CGAL polyhedra in the context of subdivision algorithms.
The document summarizes progressive meshes, which provide an efficient, lossless, and continuous resolution representation for storing and transmitting triangle meshes. Progressive meshes use a sequence of vertex split operations to refine an initial coarse mesh into the original mesh. This representation supports smooth level-of-detail transitions, reduced transmission bandwidth, selective refinement of areas, and mesh compression through delta encoding of attributes. The representation uses data structures like a base mesh, vertex split records, and traversal classes to apply splits and iterate through the progressive mesh sequence.
Performance Improvement Technique in Column-Store - IDES Editor
Column-oriented databases have gained popularity as "Data Warehousing" data and performance issues for "Analytical Queries" have increased. Each attribute of a relation is physically stored as a separate column, which helps analytical queries run fast. The overhead is incurred in tuple reconstruction for multi-attribute queries: each tuple reconstruction is a join of two columns based on tuple IDs, making it a significant cost component. To reduce this cost, physical designs keep multiple presorted copies of each base table, so that tuples are already appropriately organized in different orders across the various columns.
This paper proposes a novel design, called partitioning, that minimizes the tuple reconstruction cost. It achieves performance similar to using presorted data, but without requiring the heavy initial presorting step. In addition, it handles dynamic, unpredictable workloads with no idle time and frequent updates. Partitioning provides direct loading of the data into the respective partitions. Partitions are created on the fly depending on the distribution of the data, which works well in limited storage space environments.
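The tuple-reconstruction cost described above can be illustrated with a toy column store; the table, columns, and query here are made-up examples:

```python
# Each attribute is a separate column; the tuple ID is the list index.
name  = ["ann", "bob", "cat", "dan"]
sales = [ 10,    75,    30,    75 ]
year  = [2020,  2021,  2020,  2021]

# "SELECT name, year WHERE sales = 75": scan one column for matching
# tuple IDs, then join back to the other columns by those IDs. This
# second step is the reconstruction cost a multi-attribute query pays.
tids = [i for i, s in enumerate(sales) if s == 75]
result = [(name[i], year[i]) for i in tids]
print(result)  # [('bob', 2021), ('dan', 2021)]
```

Presorting (or the paper's partitioning) aims to make the matching tuple IDs land in contiguous runs, so the join back to the other columns touches contiguous memory instead of scattered positions.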
Performance comparison of XY, OE and DyAd routing algorithm by Load Variation... - Jayesh Kumar Dalal
This summarizes a document that compares the performance of three routing algorithms - XY, Odd-Even (OE), and DyAD - under varying load conditions on a 2D 3x3 mesh network-on-chip topology. The document presents simulation results showing that DyAD routing achieves the minimum overall average latency per channel in clock cycles per flit and packets, as well as the minimum total network power, making it the best performing algorithm compared to XY and OE routing.
This document summarizes a research paper that proposes a new cache coherence protocol called Phase-Priority Based (PPB) cache coherence. PPB aims to optimize directory-based cache coherence protocols for multicore processors. It introduces the concepts of "phase" and "priority" for coherence messages to reduce unnecessary transient states and message stalling. PPB differentiates messages into inner and outer phases based on their place in the coherence transaction ordering. It also prioritizes messages in the on-chip network to improve efficiency. Analysis shows PPB outperforms traditional MESI, reducing transient states and stalls by up to 24% with a 7.4% speedup.
The document discusses different ways to measure heights and lengths. It provides an example of measuring the heights of Lia and Nick using centimeters. Lia is 100cm and Nick is 150cm as he is 50cm taller than Lia. The document also mentions that non-standard measurements like string can be used, and that mathematics aims to simplify complicated things.
This document provides examples of addition and subtraction word problems and their step-by-step solutions. It contains 10 practice problems involving quantities like football games attended, pencils in a drawer, seashells collected, baseball cards owned, and orange balloons. It also provides 4 multi-step word problems involving oranges, students at a school, weight of sugar in a bag, and chocolate bars bought and distributed. The document aims to help students practice solving addition and subtraction word problems.
The document provides an overview of the Common Core requirements for Grade 3 math. It covers the following topics: operations and algebraic thinking; number and operations in base ten; number and operations-fractions; measurement and data; and geometry. For each topic, it lists the key standards and provides an illustrative question example to demonstrate how the standard might appear in practice.
This document discusses strategies for teaching multiplication and division. It provides examples of ways to represent multiplication, such as arrays and repeated addition. It also lists multiplication fact strategies like doubles and nines. For division, it discusses strategies like repeated subtraction and partial quotients. The document then compares lessons from textbooks and state standards. It introduces an assessment called ANIE that evaluates student work in estimation, calculation, representation, and application of math concepts.
3rd Grade Math Activity: Metric Mango Tree (measurement; number sense)Mango Math Group
A sample math lesson from Mango Math's 3rd grade math curriculum.
Mango Math provides grade-level math games and activities that reinforce core math concepts. Our activities are designed to enhance and complement existing curricula and are aligned with NCTM standards. Our innovative and fun math curriculum products are designed to help teachers, resource room instructors, home school organizations, and parents build positive attitudes towards math while reinforcing key math skills.
for more information visit www.mangomathgroup.com
The document describes making various numbers of items (e.g. sausages, buttons) and sharing them equally across plates, cakes, leaves, etc. It asks how many items there are altogether and how many are on each plate/cake/leaf. The numbers are then divided to show the equal distribution.
Multiplication and division are opposites. Multiplication involves grouping numbers, while division involves sharing numbers into groups. Some key rules are: multiplying by 0 equals 0; multiplying by 1 equals the original number; multiplying by 2 is doubling the number; dividing by 2 is halving the number; and multiplying by 10 moves the digits left and adds a 0.
This document proposes a new magnetic flip-flop (MFF) design that uses checkpointing, power gating, and self-enable mechanisms to achieve ultra-low power compared to conventional CMOS flip-flops and previous MFF designs. The key aspects of the design are:
1) Checkpointing periodically stores the system state in spin-transfer torque magnetic random access memory (STT-MRAM) cells integrated locally in each flip-flop, reducing dynamic power needs.
2) Power gating cuts off power supply voltage when entering standby mode, suppressing static leakage power.
3) Self-enable switching allows programming pulses to STT-MRAM cells to be shortened,
Memory and I/O optimized rectilinear Steiner minimum tree routing for VLSI IJECEIAES
As the size of devices are scaling down at rapid pace, the interconnect delay play a major part in performance of IC chips. Therefore minimizing delay and wire length is the most desired objective. FLUTE (Fast Look-Up table) presented a fast and accurate RSMT (Rectilinear Steiner Minimum Tree) construction for both smaller and higher degree net. FLUTE presented an optimization technique that reduces time complexity for RSMT construction for both smaller and larger degree nets. However for larger degree net this technique induces memory overhead, as it does not consider the memory requirement in constructing RSMT. Since availability of memory is very less and is expensive, it is desired to utilize memory more efficiently which in turn results in reducing I/O time (i.e. reduce the number of I/O disk access). The proposed work presents a Memory Optimized RSMT (MORSMT) construction in order to address the memory overhead for larger degree net. The depth-first search and divide and conquer approach is adopted to build a Memory optimized tree. Experiments are conducted to evaluate the performance of proposed approach over existing model for varied benchmarks in terms of computation time, memory overhead and wire length. The experimental results show that the proposed model is scalable and efficient.
Parallel algorithms for multi-source graph traversal and its applicationsSubhajit Sahu
Highlighted notes on Parallel algorithms for multi-source graph traversal and its applications.
While doing research work under Prof. Kishore Kothapalli.
Seema is working on Multi-source BFS with hybrid-CSR, with applications in APSP, diameter, centrality, reachability.
BFS can be either top-down (from visited nodes, mark next frontier), or bottom-up (from unvisited nodes, mark next frontier). She mentioned that hybrid approach is more efficient. EtaGraph uses unified degree cut (UDC) graph partitioning. Also overlaps data transfer with kernel execution. iCENTRAL uses biconnected components for betwenness centrality on dynamic graphs.
Hybrid CSR uses an additional value array for storing packed "has edge/neighbour" bits. This can give better memory access pattern if many bits are set, and cause many threads to wait if many bits are zero. She mentioned Volta architecture has independent PC, stack per thread (similar to CPU?). Does is not matter then if the threads in a block diverge?
(BFS = G*v, Multi-source BFS = G*vs)
HOMOGENEOUS MULTISTAGE ARCHITECTURE FOR REAL-TIME IMAGE PROCESSINGcscpconf
The document describes a homogeneous multistage architecture for real-time image processing. It proposes a parallel architecture using multiple identical processing elements connected by different communication links. As an example application, it discusses a multi-hypothesis approach for road recognition, which uses multiple hypotheses to detect and track road edges in video in real-time. Experimental results using a FPGA demonstrate the architecture can detect roadsides in images within 60 milliseconds.
The processor-memory bandwidth in modern-generation processors is an important bottleneck, because a number of processor cores deal with memory through the same bus/processor-memory interface. Caches take a significant amount of energy in current microprocessors, so to design an energy-efficient microprocessor it is important to optimize cache energy consumption. Effective utilization of this resource is consequently an important aspect of memory hierarchy design for multi-core processors. This is presently an active field of research, and a large number of studies have suggested techniques to address the problem. The main contribution of this work is an assessment of the effectiveness of some of the techniques that were employed in recent chip multiprocessors. Cache optimization techniques that were designed for single-core processors but have not been implemented in multi-core processors are also tested to forecast their effectiveness.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document discusses sparse matrix formats and their impact on GPU performance for solving large sparse linear systems that arise from computational modeling techniques like finite element analysis. It analyzes the compressed sparse row (CSR) format and an alternative blocked CSR format. Performance tests on a CPU and GPU show that CSR's irregular memory access hurts GPU performance, while blocked CSR improves data locality and mitigates this issue. The findings illustrate how hardware, software strategies, and sparse matrix storage affect computational performance.
This document summarizes the results of applying the Successive Survivable Routing (SSR) algorithm to different network models to calculate backup capacity requirements. The SSR algorithm provides an approximate solution to the Spare Capacity Assignment problem of finding the minimum backup capacity needed to deploy equivalent failure-disjoint recovery paths. The document finds that enhanced versions of SSR that incorporate capacity giveback and state-dependent restoration can reduce backup capacity requirements by an average of 12% compared to the standard SSR algorithm. These SSR techniques were applied to reference networks, realistic networks from internet topology data, and synthetic networks generated by the BRITE topology generator.
A STUDY OF ENERGY-AREA TRADEOFFS OF VARIOUS ARCHITECTURAL STYLES FOR ROUTING ... (VLSICS Design)
This document presents a study exploring various architectural styles for routing inputs in a coarse-grained reconfigurable fabric. It examines integrated constants, inputs coming from the side of the fabric, and combinations with dedicated pass gates. It implements these styles using a 90nm ASIC process and analyzes the area and energy tradeoffs. It finds that an approach combining inputs coming from the side with 50% dedicated pass gates provides the best results, with 31% energy savings and 62% area savings over the baseline architecture.
A tutorial on CGAL polyhedron for subdivision algorithms (Radu Ursu)
This document provides a tutorial on implementing subdivision algorithms using the CGAL polyhedron data structure. It summarizes two approaches for subdivision: using Euler operators for √3 subdivision and a modifier callback mechanism for quad-triangle subdivision. It then introduces a combinatorial subdivision library (CSL) with increased abstraction, demonstrating Catmull-Clark and Doo-Sabin subdivisions. Accompanying applications visualize the subdivision schemes and provide interaction capabilities. The goal is to demonstrate connectivity and geometry operations on CGAL polyhedra in the context of subdivision algorithms.
The document summarizes progressive meshes, which provide an efficient, lossless, and continuous resolution representation for storing and transmitting triangle meshes. Progressive meshes use a sequence of vertex split operations to refine an initial coarse mesh into the original mesh. This representation supports smooth level-of-detail transitions, reduced transmission bandwidth, selective refinement of areas, and mesh compression through delta encoding of attributes. The representation uses data structures like a base mesh, vertex split records, and traversal classes to apply splits and iterate through the progressive mesh sequence.
Performance Improvement Technique in Column-Store (IDES Editor)
Column-oriented databases have gained popularity as "data warehousing" data volumes and the performance demands of "analytical queries" have increased. Each attribute of a relation is physically stored as a separate column, which helps analytical queries run fast. The overhead is incurred in tuple reconstruction for multi-attribute queries: each tuple reconstruction is a join of two columns based on tuple IDs, making it a significant cost component. To reduce this cost, physical designs keep multiple presorted copies of each base table, such that tuples are already appropriately organized in different orders across the various columns.
This paper proposes a novel design, called partitioning, that minimizes the tuple reconstruction cost. It achieves performance similar to using presorted data, but without requiring the heavy initial presorting step. In addition, it handles dynamic, unpredictable workloads with no idle time and frequent updates. Partitioning provides direct loading of the data into the respective partitions. Partitions are created on the fly and depend on the distribution of the data, which works well in limited-storage environments.
Performance comparison of XY, OE and DyAd routing algorithm by Load Variation... (Jayesh Kumar Dalal)
This summarizes a document that compares the performance of three routing algorithms - XY, Odd-Even (OE), and DyAD - under varying load conditions on a 2D 3x3 mesh network-on-chip topology. The document presents simulation results showing that DyAD routing achieves the minimum overall average latency per channel in clock cycles per flit and packets, as well as the minimum total network power, making it the best performing algorithm compared to XY and OE routing.
This document summarizes a research paper that proposes a new cache coherence protocol called Phase-Priority Based (PPB) cache coherence. PPB aims to optimize directory-based cache coherence protocols for multicore processors. It introduces the concepts of "phase" and "priority" for coherence messages to reduce unnecessary transient states and message stalling. PPB differentiates messages into inner and outer phases based on their place in the coherence transaction ordering. It also prioritizes messages in the on-chip network to improve efficiency. Analysis shows PPB outperforms traditional MESI, reducing transient states and stalls by up to 24% with a 7.4% speedup.
The document discusses different ways to measure heights and lengths. It provides an example of measuring the heights of Lia and Nick using centimeters. Lia is 100cm and Nick is 150cm as he is 50cm taller than Lia. The document also mentions that non-standard measurements like string can be used, and that mathematics aims to simplify complicated things.
This document provides examples of addition and subtraction word problems and their step-by-step solutions. It contains 10 practice problems involving quantities like football games attended, pencils in a drawer, seashells collected, baseball cards owned, and orange balloons. It also provides 4 multi-step word problems involving oranges, students at a school, weight of sugar in a bag, and chocolate bars bought and distributed. The document aims to help students practice solving addition and subtraction word problems.
The document provides an overview of the Common Core requirements for Grade 3 math. It covers the following topics: operations and algebraic thinking; number and operations in base ten; number and operations-fractions; measurement and data; and geometry. For each topic, it lists the key standards and provides an illustrative question example to demonstrate how the standard might appear in practice.
This document discusses strategies for teaching multiplication and division. It provides examples of ways to represent multiplication, such as arrays and repeated addition. It also lists multiplication fact strategies like doubles and nines. For division, it discusses strategies like repeated subtraction and partial quotients. The document then compares lessons from textbooks and state standards. It introduces an assessment called ANIE that evaluates student work in estimation, calculation, representation, and application of math concepts.
3rd Grade Math Activity: Metric Mango Tree (measurement; number sense) (Mango Math Group)
A sample math lesson from Mango Math's 3rd grade math curriculum.
Mango Math provides grade-level math games and activities that reinforce core math concepts. Our activities are designed to enhance and complement existing curricula and are aligned with NCTM standards. Our innovative and fun math curriculum products are designed to help teachers, resource room instructors, home school organizations, and parents build positive attitudes towards math while reinforcing key math skills.
for more information visit www.mangomathgroup.com
The document describes making various numbers of items (e.g. sausages, buttons) and sharing them equally across plates, cakes, leaves, etc. It asks how many items there are altogether and how many are on each plate/cake/leaf. The numbers are then divided to show the equal distribution.
Multiplication and division are opposites. Multiplication involves grouping numbers, while division involves sharing numbers into groups. Some key rules are: multiplying by 0 equals 0; multiplying by 1 equals the original number; multiplying by 2 is doubling the number; dividing by 2 is halving the number; and multiplying by 10 moves the digits left and adds a 0.
1) Multiplying numbers can be done by breaking them into simpler factors and distributing multiplication across addition.
2) Doubling multiplication tables allows calculating unknown products, like doubling the 6 times table to find 7 x 12.
3) Using multiples of 10 near the numbers being multiplied allows calculating the product without fully multiplying, like using 8 x 20 to find 8 x 19.
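The three strategies above can be checked numerically; a tiny Python sketch using the examples from the text:

```python
# Numeric check of the three multiplication strategies summarized above
# (illustrative only; the pairings mirror the text's examples).
checks = [
    (7 * 12, 7 * 10 + 7 * 2),  # 1) distribute multiplication over addition
    (7 * 12, 2 * (7 * 6)),     # 2) double the 6 times table to reach 7 x 12
    (8 * 19, 8 * 20 - 8),      # 3) use the nearby multiple of 10
]
for got, via in checks:
    print(got, via)  # each pair agrees
```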
Multiplication is a repeated addition. It can be represented by using fingers to count groups being added together. The order of the factors does not change the product, so 2 x 3 equals 3 x 2 and both equal 6. Practice multiplication by representing problems with fingers to find that the product is the same regardless of the order of the factors.
Multiplication and division are opposites. Multiplication means grouping numbers, while division means sharing a number into groups. The document provides rules and examples for multiplying and dividing small whole numbers. Key rules include: multiplying by 0 equals 0; multiplying by 1 equals the original number; doubling a number is the same as multiplying by 2; and multiplying by 10 moves the digits left and adds a 0.
The document discusses various properties and methods of multiplication. It defines factors and products, and covers the associative, commutative, and distributive properties. It also discusses finding multiples of a number, and methods for multiplying numbers by 1, 2, or 3 digits as well as powers of ten.
This document provides 16 teaching ideas for teaching multiplication and division to students. The teaching ideas include revising number patterns online, investigating multiples, using visual representations and words to teach concepts, creating instructional videos and songs with QR codes, using apps and games to practice, exploring arrays with blocks and in the environment, playing games like the array game to practice, creating a multiplication pyramid together, and using strategies like Study Ladder for rapid recall practice. Bloom's Taxonomy and Multiple Intelligences are also incorporated into activity ideas.
This document contains 14 word problems involving multiplication and division. The problems involve sharing items equally between groups of characters from cartoons and movies. They ask how many items each character or group would receive, or how many items there were total.
This document discusses different methods for helping students memorize their multiplication facts, including memorization, manipulatives, and multiplication tables. It explains that memorization requires time, practice, and different learning tools. Manipulatives are physical objects that can represent abstract math concepts, like blocks or calculators. Multiplication tables display the factors and products in an organized chart to make patterns and answers easier to find.
The document explains how to perform 2-digit multiplication. It goes through the step-by-step process, which includes: 1) lining up the numbers with their place values, 2) multiplying the ones place and carrying numbers, 3) multiplying the tens place and using a placeholder zero, and 4) adding the partial products together to get the final product. The example shown is 26 x 12 = 312, and each step of the multiplication is demonstrated.
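The place-value procedure described above can be mirrored in code. This sketch (my own, reusing the text's 26 x 12 example) accumulates one partial product per digit of the multiplier, shifted by its place value:

```python
# Long multiplication via partial products, one per digit of b
# (illustrative sketch of the classroom procedure).
def long_multiply(a, b):
    total = 0
    shift = 0
    while b:
        digit = b % 10
        total += a * digit * 10 ** shift  # partial product, shifted by place
        b //= 10
        shift += 1
    return total

print(long_multiply(26, 12))  # 26*2 = 52, 26*10 = 260, 52 + 260 = 312
```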
An octa core processor with shared memory and message-passing (eSAT Journals)
Abstract: This being the era of fast, high-performance computing, there is a need for efficient optimizations in the processor architecture and, at the same time, in the memory hierarchy too. Every day, advancing applications in communication and multimedia systems compel an increase in the number of cores in the main processor, viz. dual-core, quad-core, octa-core and so on. But to enhance the overall performance of a multiprocessor chip, there are stringent requirements to improve inter-core synchronization. Thus, an MPSoC with 8 cores supporting both message-passing and shared-memory inter-core communication mechanisms is implemented on a Virtex 5 LX110T FPGA. Each core is based on the MIPS III (Microprocessor without Interlocked Pipelined Stages) ISA, handles only integer-type instructions, and has a six-stage pipeline with a data hazard detection unit and forwarding logic. The eight processing cores and one central shared-memory core are interconnected using a 3x3 2-D mesh topology based Network-on-Chip (NoC) with a virtual channel router. The router is four-stage pipelined, supports the DOR X-Y routing algorithm, and uses round-robin arbitration. For verification and functionality testing of the fully synthesized multi-core processor, a matrix multiplication operation is mapped onto it. Partitioning and scheduling of the multiplications and additions for each element of the resultant matrix has been done among the eight cores to obtain maximum throughput. All processor design code is written in Verilog HDL. Keywords: MPSoC, message-passing, shared memory, MIPS, ISA, wormhole router, network-on-chip, SIMD, data level parallelism, 2-D Mesh, virtual channel
HYBRID APPROACH TO DESIGN OF STORAGE ATTACHED NETWORK SIMULATION SYSTEMS (IAEME Publication)
This document presents a hybrid approach to modeling Storage Attached Networks (SAN) using a combination of deterministic simulation and machine learning. A Software Package for Operation Simulation (SPOS) is developed to simulate SAN components and their interactions. A Set of Algorithms for Automated Parameter Settings (SAAPS) uses reinforcement learning to dynamically adjust SPOS parameters based on real SAN data, improving simulation accuracy even outside training ranges. This hybrid approach reduces development costs compared to detailed simulation alone or machine learning without a simulator framework.
Dominant block guided optimal cache size estimation to maximize IPC of embedd... (ijesajournal)
Embedded system software is highly constrained from performance, memory footprint, energy consumption and implementing cost view point. It is always desirable to obtain better Instructions per Cycle (IPC). Instruction cache has major contribution in improving IPC. Cache memories are realized on the same chip where the processor is running. This considerably increases the system cost as well. Hence, it is required to maintain a trade-off between cache sizes and performance improvement offered. Determining the number of cache lines and size of cache line are important parameters for cache designing. The design space for cache is quite large. It is time taking to execute the given application with different cache sizes on an instruction set simulator (ISS) to figure out the optimal cache size. In this paper, a technique is proposed to identify a number of cache lines and cache line size for the L1 instruction cache that will offer best or nearly best IPC. Cache size is derived, at a higher abstraction level, from basic block analysis in the Low Level Virtual Machine (LLVM) environment. The cache size estimated from the LLVM environment is cross validated by simulating the set of benchmark applications with different cache sizes in SimpleScalar’s out-of-order simulator. The proposed method seems to be superior in terms of estimation accuracy and/or estimation time as compared to the existing methods for estimation of optimal cache size parameters (cache line size, number of cache lines).
This document discusses profiling and optimization of sparse matrix-vector multiplication (SpMV). It proposes a system that uses performance modeling to predict execution times of different SpMV kernels for a given sparse matrix stored in various formats. An auto-selection algorithm then identifies the optimal storage format and predicted execution time based on the performance modeling. The system was evaluated on NVIDIA GPUs and achieved accurate performance predictions to select the best SpMV solution for a target sparse matrix.
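As a toy illustration of the auto-selection step (not the paper's actual performance model), one could predict an execution time per storage format and take the minimum; the format names and numbers below are hypothetical placeholders:

```python
# Auto-selection sketch: given per-format predicted times (from some
# performance model, not shown), pick the format with the minimum.
def pick_format(predictions):
    # predictions: {format_name: predicted_time_ms}
    best = min(predictions, key=predictions.get)
    return best, predictions[best]

preds = {"CSR": 1.8, "ELL": 1.2, "COO": 3.5, "HYB": 1.3}  # made-up numbers
print(pick_format(preds))  # ('ELL', 1.2)
```

The real system's value lies in the modeling that produces the predictions; the selection itself is just this argmin.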
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Coarse grained hybrid reconfigurable architecture with NoC router (Dhiraj Chaudhary)
The document describes a coarse-grained reconfigurable architecture with a Network-on-Chip router that is designed to perform variable block size motion estimation for video compression. The architecture uses intelligent NoC routers to direct reference block data between processing elements, reducing data interactions with external memory and decreasing execution time. The paper also proposes two enhancements that reduce the architecture's area by 4.8% and router power consumption by 42%.
This document describes a coarse-grained reconfigurable architecture with a Network-on-Chip (NoC) router designed for variable block size motion estimation. The architecture contains 16 processing elements arranged in a 2D array that can calculate Sum of Absolute Differences (SAD) for different block sizes. An NoC with intelligent routers is used to direct reference block data between processing elements to reduce memory interactions and improve execution time. The architecture supports fast search algorithms like diamond search that further improve efficiency over full search.
Different Approaches in Energy Efficient Cache Memory (Dhritiman Halder)
This document discusses various approaches for improving the energy efficiency of cache memory architectures, specifically for write-through caches. It begins by introducing the way-tagged cache approach, which maintains way tags for the L2 cache in the L1 cache. This allows the L2 cache to operate in a direct-mapped manner for write hits, reducing energy. The document then reviews related work on cache sub-banking, bit line segmentation, way prediction, way memoization, and a new way memoization technique using a memory address buffer to skip redundant tag/way accesses. The goal of these techniques is to reduce unnecessary accesses and optimize for write-through policy overhead while maintaining performance.
Adaptive load balancing techniques in global scale grid environment (iaemedu)
The document discusses various adaptive load balancing techniques for distributed applications in grid environments. It first describes adaptive mesh refinement algorithms that partition computational domains using space-filling curves or by distributing grids independently or at different levels. It also discusses dynamic load balancing using tiling and multi-criteria geometric partitioning. The document then covers repartitioning algorithms based on multilevel diffusion and the adaptive characteristics of structured adaptive mesh refinement applications. Finally, it discusses adaptive workload balancing on heterogeneous resources by benchmarking resource characteristics and estimating application parameters to find optimal load distribution.
Many-core chip multiprocessors offer high parallel processing power for big data analytics; however, they require an efficient multi-level cache and interconnection to achieve high system throughput. Using on-chip first-level (L1) and second-level (L2) per-core fast private caches is expensive for a large number of cores. In this paper, for moderate core counts from 16 to 64, we present a cost- and performance-efficient multi-level cache system with a per-core L1 and a last-level shared bus cache on each bus line of a cost-efficient geometrically bus-based interconnection. In our approach, we extracted cache hit and miss concurrencies and applied concurrent average memory access time to more accurately determine cache system performance. We conducted least-recently-used cache-policy-based simulation for cache systems with L1 only, with L1/L2, and with L1/shared bus cache. Our simulation results show that an average system throughput improvement of 2.5x can be achieved by using the L1/shared bus cache system compared to using only L1 or L1/L2. Further, we show that the throughput degradation for the proposed cache system is only within 5% for a single bus fault, suggesting good bus fault tolerance.
IEEE Emerging topic in computing Title and Abstract 2016 (tsysglobalsolutions)
This document contains 3 summaries of research papers from the IEEE Transactions on Emerging Topics in Computing from May and June 2016.
The first paper proposes a software toolchain that introduces variability awareness from high-level modeling down to runtime management on heterogeneous multicore platforms. It demonstrates the toolchain on 2 platforms.
The second paper proposes a method to jointly tune on-chip lasers and microring resonators in nanophotonic interconnects to improve energy efficiency under thermal variations. It shows up to 53% energy reduction is possible.
The third paper introduces a new multiple-access single-charge associative memory architecture called MASC TCAM that can search contents multiple times with a single precharge, achieving
Clustering based performance improvement strategies for mobile ad hoc networks (IAEME Publication)
This document discusses various clustering techniques that can improve performance in mobile ad hoc networks (MANETs). It begins by introducing MANETs and clustering concepts. It then reviews several clustering algorithms including lowest-ID, highest degree, least clustering change, and trust-based clustering. It also discusses clustering based on outlier detection for identifying misbehaving nodes. The document concludes that clustering is an important technique for resource management and routing in MANETs, and that selecting optimal cluster heads is critical to network performance and energy efficiency.
Dark silicon and the end of multicore scaling (Léia de Sousa)
This document models limits on multicore scaling over the next 5 technology generations from 45nm to 8nm. It combines models of device scaling, single-core performance scaling, and multicore performance scaling. The key findings are:
- Using optimistic ITRS projections, only 7.9x average speedup is possible across common parallel workloads, far short of the target 32x speedup from 5 generations of doubling cores.
- Power limitations mean an increasing fraction of chip area must be powered off ("dark silicon") - 21% at 22nm and over 50% at 8nm.
- Neither CPU-like nor GPU-like multicore designs can achieve expected performance levels; radical innovations are needed.
Compositional Analysis for the Multi-Resource Server (Ericsson)
The Multi-Resource Server (MRS) technique has been proposed to enable predictable execution of memory intensive real-time applications on COTS multi-core platforms.
Implementation of FSM-MBIST and Design of Hybrid MBIST for Memory cluster in ... (Editor IJCATR)
In the current scenario, power-efficient MPSoCs are in great demand. Power-efficient asynchronous MPSoCs with multiple memories are expected to replace clocked synchronous SoCs, in which the clock consumes more than 40% of the total power, so it is the right time to develop test-compliant asynchronous MPSoCs. In this paper, traditional MBIST and FSM-based MBIST schemes are designed and applied to a single-port RAM. The results are discussed based on the synthesis reports obtained from RTL Compiler from Cadence. FSM-based MBIST is a power- and area-efficient method for single-memory testing; it consumes 40% less power compared with traditional MBIST. But in multiple-memory scenarios, separate MBIST controllers are required to test each individual memory, so this scheme consumes huge area and becomes inefficient. A technique for testing different memories working at different frequencies is needed. Therefore, an area-efficient Hybrid MBIST with a single MBIST controller is proposed to test multiple memories in an asynchronous SoC. It also includes multiple test algorithms to detect various faults. An asynchronous SoC with a DWT processor and multiple memories is discussed in this paper, used as the Design Under Test (DUT), and the Hybrid MBIST is built around it to test the heterogeneous memories. The design is coded in Verilog and validated on a Spartan-3E FPGA kit.
Performance comparison of row per slave and rows set per slave method in PVM ... (eSAT Journals)
Abstract: Parallel computing operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently to save time, taking advantage of non-local resources and overcoming memory constraints. Multiplication of larger matrices requires a lot of computation time. This paper deals with two methods for handling parallel matrix multiplication. The first divides the rows of one of the input matrices into sets of rows based on the number of slaves and assigns one row set to each slave for computation. The second assigns one row of one of the input matrices at a time to each slave, starting from the first row to the first slave and the second row to the second slave, and so on, looping back to the first slave when the last slave has been assigned, repeating until all rows are assigned. These two methods are implemented using Parallel Virtual Machine, and the computation is performed for different sizes of matrices over different numbers of nodes. The results show that the row-per-slave method gives the optimal computation time in PVM-based parallel matrix multiplication. Keywords: Parallel Execution, Cluster Computing, MPI (Message Passing Interface), PVM (Parallel Virtual Machine), RAM (Random Access Memory).
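The two work-distribution schemes described above amount to block versus cyclic row assignment; this sketch (function names are mine, not the paper's) shows both for 6 rows and 2 slaves:

```python
# Block assignment: each slave gets one contiguous set of rows
# (the "rows set per slave" method).
def rows_set_per_slave(n_rows, n_slaves):
    size = -(-n_rows // n_slaves)  # ceiling division
    return {s: list(range(s * size, min((s + 1) * size, n_rows)))
            for s in range(n_slaves)}

# Cyclic assignment: row i goes to slave i mod n_slaves
# (the "row per slave" method, looping back after the last slave).
def row_per_slave(n_rows, n_slaves):
    assign = {s: [] for s in range(n_slaves)}
    for i in range(n_rows):
        assign[i % n_slaves].append(i)
    return assign

print(rows_set_per_slave(6, 2))  # {0: [0, 1, 2], 1: [3, 4, 5]}
print(row_per_slave(6, 2))       # {0: [0, 2, 4], 1: [1, 3, 5]}
```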
The paper presents a nature-inspired algorithm that mimics the big bang theory of evolution. The algorithm is simple with regard to the number of parameters. Embedded systems are powered by batteries, and extending battery operating time by reducing power consumption is vital. Embedded systems consume power while accessing memory during operation, so an efficient method for power management is proposed in this work. The proposed method reduces energy consumption in memories by 76% up to 98% compared to other methods reported in the literature.
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LAYER THICKNESS IN WIRE-... - IAEME Publication
White layer thickness (WLT) and surface roughness in wire electric discharge turning (WEDT) of tungsten carbide composite were modeled through response surface methodology (RSM). A Taguchi standard design of experiments involving five input variables at three levels was employed to establish a mathematical model between input parameters and responses. Percentage of cobalt content, spindle speed, pulse on-time, wire feed, and pulse off-time were varied during the experimental tests based on the Taguchi orthogonal array L27 (3^13). Analysis of variance (ANOVA) revealed that the mathematical models obtained can adequately describe performance within the ranges of the factors considered. There was good agreement between the experimental and predicted values in this study.
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS - IAEME Publication
The study explores the reasons for transgender persons to become entrepreneurs. Transgender entrepreneurship was taken as the independent variable and the reasons for becoming an entrepreneur as the dependent variable. Data were collected through a structured questionnaire containing a five-point Likert scale. The study examined data from 30 transgender entrepreneurs in the Salem Municipal Corporation of Tamil Nadu State, India, selected by simple random sampling. The Garrett ranking technique (percentile position, mean scores) was used to identify the top 13 stimulus factors for establishing a trans entrepreneurial venture. The economic advancement of a nation depends on the outcome of resolute entrepreneurial activity, and the conception of entrepreneurship has stretched and materialized to the socially marginalized, uncharted sections of the transgender community. Presently, transgender people have smashed stereotypes and are making headlines with achievements in various fields of Indian society; the trans community is gradually being observed in a new light and has been trying to achieve growth in entrepreneurship. The findings reveal that optimistic changes are taking place toward an affirmative societal outlook on transgender entrepreneurship, and they encourage other transgender people to transform their traditional living. The paper also highlights that legislators and supervisory bodies should endorse impartial canons and reforms in the Tamil Nadu Transgender Welfare Board Association.
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS - IAEME Publication
Since ages, gender difference has been a debatable theme, whether caused by nature, evolution, or environment. The birth of a transgender child is dreadful not only for the child but also for the parents. The pain of living in the wrong physique and being treated as a second-class, victimized citizen is outrageous, harboured with vicious, baseless negative scruples. For so long, social exclusion has perpetuated inequality and deprivation, with transgender people experiencing ingrained malign stigma and becoming besieged victims of crime or violence across their life spans. They are pushed into a murky way of life marked by eternal disgust, bereft sexual potency, and perennial fear. Although they are highly visible, very little is known about them. The public needs to comprehend the arrogance directed at these insensitive wounds and assist in integrating transgender people into the mainstream by offering equal opportunity, treating them with humanity, and respecting their dignity. Entrepreneurship in the current age is endorsing the gender-fairness movement. Unstable careers and economic inadequacy have inclined one group of gender-variant people, the transgender community, to become entrepreneurs. These tiny budding entrepreneurs have brought economic transition by means of employment, freedom from the clutches of stereotyped jobs, a raised standard of living, and a measure of financial empowerment. Despite all these inhibitions, they were able to find a platform for skill-set development that ignited them to enter the entrepreneurial domain. This paper epitomizes the skill sets of trans entrepreneurs in the Thoothukudi Municipal Corporation of Tamil Nadu State and is a groundbreaking effort to explore the various skills involved and their impact on entrepreneurship.
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS - IAEME Publication
The banking and financial services industries are experiencing increased technology penetration. Among them, the banking industry has made technological advancements to better serve the general populace, with the economy focused on transforming the banking sector into a cashless, paperless, and faceless system. The researcher evaluates the user's intention to use mobile banking applications. The study also examines the variables affecting the user's behavioural intention when selecting specific applications for financial transactions. The researcher employed a well-structured questionnaire and a descriptive study methodology to gather primary data from respondents using the snowball sampling technique. The study includes variables such as performance expectations, effort expectations, social impact, enabling circumstances, and perceived risk; each of these variables has a major impact on how users utilise mobile banking applications. The outcome will assist service providers in understanding users' experience with mobile banking applications.
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS - IAEME Publication
Technology upgradation in the banking sector has moved the economy toward online payment modes using mobile applications. This system enables convenient connectivity between banks, merchants, and users. Various applications are used for online transactions, such as Google Pay, Paytm, Freecharge, MobiKwik, Oxigen, and PhonePe, along with mobile banking applications. The study aimed to evaluate users' predilection in adopting digital transactions. The study is descriptive in nature, and the researcher used random sampling techniques to collect the data. The findings reveal that the applications differ in the quality of service rendered by GPay and PhonePe. The researcher suggests that PhonePe should focus on making its interface more user-friendly, and GPay on motivating users to appreciate the request-money feature and the modes of payment in the application.
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO - IAEME Publication
This prototype of a voice-based ATM for the visually impaired using Arduino is intended to help people who are blind. It uses RFID cards containing the user's encrypted fingerprint and interacts with users through voice commands. The ATM operates when a sensor detects the presence of one person in the cabin. After the RFID card is scanned, the machine asks the user to select a mode, normal or blind, through voice input. If blind mode is selected, balance checks and cash withdrawals can be done through voice input; the normal mode follows the same procedure as an existing ATM.
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG... - IAEME Publication
There is increasing acceptance of emotional intelligence as a major factor in personality assessment and effective human resource management. Emotional intelligence, as the ability to build capacity, empathize, co-operate, motivate, and develop others, cannot be divorced from either effective performance or human resource management systems. The human person is crucial in defining organizational leadership and fortunes in terms of challenges and opportunities, and in bridging both multinational and bilateral relationships. The growing complexity of the business world requires a great deal of self-confidence, integrity, communication, and conflict and diversity management to keep the global enterprise on the path of productivity and sustainability. Using an exploratory research design and 255 participants, the results of this original study indicate a strong positive correlation between emotional intelligence and effective human resource management. The paper offers suggestions for further studies on emotional intelligence and human capital development and recommends conflict management as an integral part of effective human resource management.
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY - IAEME Publication
Our life journey, in general, is closely defined by the way we understand the meaning of why we coexist and deal with its challenges. As we develop the "inspiration economy", we could say that nearly all of the challenges we have faced are opportunities that help us to discover the rest of our journey. In this note paper, we explore how being faced with the opportunity of being a close carer for an aging parent with dementia brought intangible discoveries that changed our insight of the meaning of the rest of our life journey.
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO... - IAEME Publication
The main objective of this study is to analyze the impact of aspects of Organizational Culture on the Effectiveness of the Performance Management System (PMS) in the Health Care Organization at Thanjavur. Organizational Culture and PMS play a crucial role in present-day organizations in achieving their objectives. PMS needs employees’ cooperation to achieve its intended objectives. Employees' cooperation depends upon the organization’s culture. The present study uses exploratory research to examine the relationship between the Organization's culture and the Effectiveness of the Performance Management System. The study uses a Structured Questionnaire to collect the primary data. For this study, Thirty-six non-clinical employees were selected from twelve randomly selected Health Care organizations at Thanjavur. Thirty-two fully completed questionnaires were received.
Living in the 21st century reminds all of us of the necessity of the police and its administration. The more we enter modern society and culture, the more we require the services of the so-called 'khaki-worthy' men, the police personnel. Whether we speak of the Indian police or another nation's police, they have the same recognition as they have in India. But, as already mentioned, their services and requirements have changed after incidents like that of 26 November 2008, when they sacrificed themselves without hesitation and without regard for their own lives or for their family members and wards. In other words, they are like heroes and mentors who can guide us out of the darkness of fear, militancy, corruption, and the other dark sides of life. Now the question arises: if Gandhi were alive today, what would be his reaction to the police and its functioning? Would he have something different in mind now from what he had before Partition, or would he start a Satyagraha aimed at improving the functioning of police administration? These questions, or rather nightmares, can come to anyone's mind when so much confusion prevails in our minds, when there is so much corruption in society, and when the working of the police is questioned because of one case or another throughout India. It is a matter of great concern that we must think over our administration and our practical approach, because police personnel are like us: they are part and parcel of our society and among us, so why do we all point fingers at them?
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED... - IAEME Publication
The goal of this study was to see how talent management affected employee retention in selected IT organizations in Chennai. The fundamental issue was the difficulty of attracting, hiring, and retaining talented personnel who perform well, and the gap between supply and demand in talent acquisition and retention within firms. The study's main goals were to determine the impact of talent management on employee retention in IT companies in Chennai, to investigate talent management strategies that IT companies could use to improve talent acquisition, performance management, and career planning, and to formulate retention strategies that IT firms could use. The respondents were given a structured close-ended questionnaire with a 5-point Likert scale as part of the study's quantitative research design. The target population consisted of 289 IT professionals. The questionnaires were distributed and collected by the researcher directly, and the responses were analysed using the Statistical Package for the Social Sciences (SPSS). Hypotheses formulated for the various areas of the study were tested using a variety of statistical tests. The key findings suggested that talent management had an impact on employee retention and that there is a clear link between the implementation of talent management and retention measures. Management should provide adequate training and development for employees, clarify job responsibilities, provide adequate remuneration packages, and recognise employees for exceptional performance.
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE... - IAEME Publication
Globally, organizations spend millions of dollars employing skilled Information Technology (IT) professionals, and it is costly to replace IT professionals who possess the technical skills and competencies that interconnect business processes. Globalization and technological innovation have forced organizations to alter their employment tactics as they downsize to remain lean, outsource to concentrate on core competencies, and restructure or reallocate personnel for efficiency. As other jobs, organizations, or professions become comparatively more attractive in a shifting employment landscape, these changes trigger both involuntary and voluntary turnover. Employee views on jobs have also been affected by the COVID-19 pandemic and the employee-driven labour market, so effective strategies are necessary to tackle employee withdrawal rates. This study analyzed the rise in attrition rate by associating Emotional Intelligence (EI) with Talent Management (TM) in the IT industry. Of the 350 participants to whom questionnaires were distributed, 303 responses were collected, gathered from employees of IT organizations located in Bangalore (India) using a simple random sampling methodology. Hypotheses were generated and tested, and the effect of EI and TM, along with a regression analysis between TM and EI, was analyzed. The outcomes indicated that effective EI and TM elevate employee and Organizational Performance (OP).
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD... - IAEME Publication
By implementing a talent management strategy, organizations can retain their skilled professionals while also improving their overall performance. Talent management is the process of properly utilizing the right individuals, preparing them for future top positions, reviewing and managing their performance, and keeping them from leaving the organization. Employee performance determines the success of every organization. A firm quickly obtains an advantage over its rivals if its employees have distinctive skills that competitors cannot duplicate. Thus, firms are focused on creating effective talent management practices and processes to manage their unique human resources. Firms also endeavour to keep their top and key staff, since if those staff leave, the entire store of knowledge leaves the firm's hands. The study's objective was to determine the impact of talent management on organizational performance among selected IT organizations in Chennai. The study finds that talent management has a limited effect on performance; if talent is properly managed and strategies are implemented well, organizations can make the most of their retained assets to support growth and productivity, both monetarily and non-monetarily.
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS... - IAEME Publication
The Banking Regulation Act of India, 1949 defines banking as the "acceptance of deposits for the purpose of lending or investment from the public, repayable on demand or otherwise and withdrawable through cheques, drafts, orders or otherwise". The major participants in the Indian financial system are commercial banks; financial institutions encompassing term-lending institutions, investment institutions, specialized financial institutions, and state-level development banks; non-banking financial companies (NBFCs); and other market intermediaries such as stock brokers and money lenders, certain variants of NBFCs being among the oldest market participants. The asset quality of banks is one of the most important indicators of their financial health. The Indian banking sector has been facing severe problems of increasing Non-Performing Assets (NPAs). NPA growth directly and indirectly affects the quality of assets and the profitability of banks; it also reflects the efficiency of banks' credit risk management and recovery effectiveness. NPAs do not generate any income, while the bank is required to make provisions for such assets, which is why they are a double-edged weapon. This paper examines the quality of different types of bank loans, such as housing, agriculture, and MSME loans, of selected public and private sector banks in the state of Haryana. The study highlights problems associated with the role of commercial banks in financing Small and Medium Enterprises (SMEs). The overall objective of the research was to assess the effect of the existing financing provisions on the setting up and operation of MSMEs in the country, to generate recommendations for more robust financing mechanisms for the successful operation of MSMEs, and in turn to understand the impact of MSME loans on financial institutions due to NPAs.
Much research has been conducted on the topic of Non-Performing Asset (NPA) management, concerning particular banks, comparative studies of public and private banks, and so on. In this paper the researcher considers aggregate data for selected public sector and private sector banks and compares the NPAs of housing, agriculture, and MSME loans in the state of Haryana. The tools used in the study are averages, variance, and the ANOVA test. The findings reveal that NPAs are a common problem for both public and private sector banks and are associated with all types of loans, whether housing loans, agriculture loans, or loans to SMEs. NPAs of both public and private sector banks show an increasing trend. In 2010-11, the GNPA of the public and private sectors was at the same level, 2%, but after 2010-11 it increased manyfold, and at present GNPA in some cases exceeds 15%. This shows a dark area of the Indian banking sector.
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL... - IAEME Publication
The experiment conducted in this study found that BaSO4 changed the mechanical properties of Nylon 6. Nylon 6/BaSO4 composites were prepared with varying weight ratios, and their hardness and wear behaviour were investigated. Experiments were designed using the Taguchi L9 array. The hardness number of the Nylon 6/BaSO4 composites was measured with a Rockwell hardness testing apparatus. Wear behaviour was measured on a pin-on-disc wear monitor by varying reinforcement, sliding speed, and sliding distance, and the microstructure of the fracture surfaces was observed by SEM. This study shows a significant contribution to ultimate strength from increasing BaSO4 content up to 16% in the composites, while sliding speed contributes 72.45% to the wear rate.
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ... - IAEME Publication
The majority of the population in India lives in villages; the village is the backbone of the country. Village or rural industries play an important role in the national economy, particularly in rural development, and developing the rural economy is one of the key indicators of a country's success. Whether it is the need to look after the welfare of farmers or to invest in rural infrastructure, governments have to ensure that rural development is not compromised. The economic development of our country largely depends on the progress of rural areas and the standard of living of the rural masses. Rural entrepreneurship is based on stimulating local entrepreneurial talent and the subsequent growth of indigenous enterprises. It recognizes opportunity in rural areas and accelerates a unique blend of resources either inside or outside agriculture. Rural entrepreneurship brings economic value to the rural sector by creating new methods of production, new markets, and new products, and by generating employment opportunities, thereby ensuring continuous rural development. Social entrepreneurship has the direct and primary objective of serving society along with earning profits. It differs from economic entrepreneurship in that its basic objective is not to earn profits but to provide innovative solutions to societal needs that are not addressed by the majority of entrepreneurs, who are in business with profit-making as their sole objective. Social entrepreneurs therefore have huge growth potential, particularly in developing countries like India, where there are huge societal disparities in the financial positions of the population.
Still, 22 percent of the Indian population is below the poverty line, and there is disparity between the rural and urban populations in terms of families living under the BPL: 25.7 percent of the rural population and 13.7 percent of the urban population are under the BPL, which clearly shows the concentration of poor people in rural areas. The need to develop social entrepreneurship in agriculture is dictated by a large number of social problems, including low living standards, unemployment, and social tension; these are the factors that led to the emergence of the practice of social entrepreneurship. The research problem lies in disclosing the importance of the role of social entrepreneurship in the rural development of India. The paper examines the tendencies of social entrepreneurship in India and presents successful examples of such businesses in order to provide recommendations on how to improve the situation in rural areas through social entrepreneurship development. The Indian government has made some steps toward the development of social enterprises, social entrepreneurship, and social innovation, but a lot remains to be improved.
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET... - IAEME Publication
The distribution system is a critical link between the electric power distributor and consumers. The network most commonly used by electric utilities is the radial distribution network; however, this type of network suffers technical issues such as large power losses that affect the quality of supply. Nowadays, the introduction of Distributed Generation (DG) units helps improve and support the voltage profile of the network as well as the performance of system components through power-loss mitigation. In this study, network reconfiguration was performed using two meta-heuristic algorithms, Particle Swarm Optimization and the Gravitational Search Algorithm (PSO-GSA), to enhance power quality and the voltage profile when applied simultaneously with DG units. The backward/forward sweep method was used for load-flow analysis, simulated in MATLAB. Five cases were considered in the reconfiguration based on the contribution of DG units, and the proposed method was tested on the IEEE 33-bus system. Based on the results, the voltage profile improved from 0.9038 p.u. to 0.9594 p.u., and the integration of DG in the network reduced power losses from 210.98 kW to 69.3963 kW. Simulated results are presented to show the performance of each case.
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF... - IAEME Publication
Manufacturing industries have witnessed an outburst in productivity, and for productivity improvement they are taking various initiatives using lean tools and techniques. In different manufacturing industries, however, the frugal approach is applied in product design and services as a tool for improvement. The frugal approach has helped prove that less is more, and it seems to contribute indirectly to productivity improvement; hence, there is a need to understand the status of frugal-approach application in manufacturing industries. All manufacturing industries are trying hard and putting in continuous effort for competitive existence, coming up with effective and efficient solutions in manufacturing processes and operations. To overcome current challenges, manufacturing industries have started using the frugal approach in product design and services. The methodology for this study draws on both primary and secondary sources of data: interviews and observation for the primary source, and a review of available literature on websites, printed magazines, manuals, and so on for the secondary source. An attempt has been made to understand the application of the frugal approach through the study of a manufacturing-industry project; the industry selected for this study is Mahindra and Mahindra Ltd. This paper will help researchers find the connections between the two concepts of productivity improvement and the frugal approach, understand the significance of the frugal approach for productivity improvement, and understand the current scenario of the frugal approach in manufacturing industry. Many processes are involved in delivering the final product, and productivity plays a very critical role in converting input into output. Hence this study helps establish the status of the frugal approach in productivity-improvement programmes; the notion of frugality can be viewed as an approach toward productivity improvement in manufacturing industries.
A MULTIPLE-CHANNEL QUEUING MODEL IN A FUZZY ENVIRONMENT - IAEME Publication
In this paper, we investigate a multiple-channel queuing model (M/M/C) ( /FCFS) in a fuzzy environment and study its performance under realistic conditions. A nonagonal fuzzy number is applied to analyse the relevant performance measures of the model. Based on the sub-interval average ranking method for nonagonal fuzzy numbers, the fuzzy numbers are converted to crisp ones. Numerical results reveal the efficiency of this method; intuitively, the fuzzy environment adapts very well to the multiple-channel queuing model (M/M/C) ( /FCFS).
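Once the ranking method has converted the fuzzy arrival and service rates to crisp ones, the standard M/M/c performance measures apply. The sketch below shows those crisp formulas (Erlang C for the waiting probability); it illustrates the underlying queueing model only, not the paper's fuzzy ranking step, and the example rates are hypothetical.

```python
from math import factorial

def mmc_measures(lam, mu, c):
    """Crisp M/M/c measures: lam = arrival rate, mu = per-server service
    rate, c = number of servers (channels)."""
    a = lam / mu          # offered load in Erlangs
    rho = a / c           # server utilisation; must be < 1 for stability
    assert rho < 1, "unstable queue: lam must be below c * mu"
    # Erlang C: probability that an arriving customer must wait
    tail = (a ** c) / (factorial(c) * (1 - rho))
    p_wait = tail / (sum(a ** n / factorial(n) for n in range(c)) + tail)
    lq = p_wait * rho / (1 - rho)   # mean number waiting in queue
    wq = lq / lam                   # mean waiting time in queue (Little's law)
    w = wq + 1 / mu                 # mean time in system
    return {"P_wait": p_wait, "Lq": lq, "Wq": wq, "W": w}

# Example: 2 servers, arrivals at 3 per hour, each server serves 2 per hour
m = mmc_measures(lam=3.0, mu=2.0, c=2)
print({k: round(v, 4) for k, v in m.items()})
```

With these rates the utilisation is 0.75, so roughly 64% of arrivals must wait; in the fuzzy variant each of these measures becomes a fuzzy quantity before ranking.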
Introduction - e-waste: definition, sources of e-waste, hazardous substances in e-waste, effects of e-waste on environment and human health, need for e-waste management, e-waste handling rules, waste minimization techniques for managing e-waste, recycling of e-waste, disposal treatment methods of e-waste, mechanism of extraction of precious metal from leaching solution, global scenario of e-waste, e-waste in India, case studies.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... - IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art DeepLabv3+ architecture with a ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including a global accuracy of 99.286%, a class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. A detailed comparative analysis with existing methods showcases the superiority of the proposed model. These findings underscore the model's competence in precise brain tumor localization and its potential to advance medical image analysis and healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, with emphasis on addressing false positives and resource efficiency.
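The per-class IoU and mean IoU metrics reported above are typically computed from predicted and ground-truth label masks as intersection over union per class, averaged. The sketch below shows that computation in miniature; the tiny flattened masks are hypothetical, not data from the study.

```python
def iou_per_class(pred, truth, classes):
    """IoU for each class between two equal-length flattened label masks."""
    ious = {}
    for c in classes:
        # Count pixels labelled c in both masks (intersection) and in either (union)
        inter = sum(p == c and t == c for p, t in zip(pred, truth))
        union = sum(p == c or t == c for p, t in zip(pred, truth))
        ious[c] = inter / union if union else float("nan")
    return ious

# Flattened 0/1 masks (0 = background, 1 = tumour), hypothetical values
truth = [0, 0, 1, 1, 1, 0, 0, 1]
pred  = [0, 1, 1, 1, 0, 0, 0, 1]
ious = iou_per_class(pred, truth, classes=[0, 1])
mean_iou = sum(ious.values()) / len(ious)
print(ious, round(mean_iou, 3))  # {0: 0.6, 1: 0.6} 0.6
```

Weighted IoU additionally weights each class's IoU by its pixel frequency, which is why it tracks the dominant background class far more closely than mean IoU does.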
This webinar presentation on Data Driven Maintenance covers traditional maintenance challenges, the right approach to utilizing data, and the benefits of adopting a Data Driven Maintenance strategy. It explores real-world examples, industry best practices, and solutions such as FMECA and the D3M model. The presentation, led by Jules Oudmans, is aimed at asset owners looking to optimize their maintenance processes and leverage digital technologies for improved efficiency and performance.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions - Victor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 - Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build-cache optimizations, covering the team's journey into solving complex build-cache problems that affect Gradle builds. By examining the challenges and solutions found along the way, the talk demonstrates the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon
reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been
referred to as the "New Great Game." This research centres on the power struggle, considering
geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil
politics, and conventional and nontraditional security are all explored and explained by the researcher.
Using Mackinder's Heartland, Spykman Rimland, and Hegemonic Stability theories, examines China's role
in Central Asia. This study adheres to the empirical epistemological method and has taken care of
objectivity. This study analyze primary and secondary research documents critically to elaborate role of
china’s geo economic outreach in central Asian countries and its future prospect. China is thriving in trade,
pipeline politics, and winning states, according to this study, thanks to important instruments like the
Shanghai Cooperation Organisation and the Belt and Road Economic Initiative. According to this study,
China is seeing significant success in commerce, pipeline politics, and gaining influence on other
governments. This success may be attributed to the effective utilisation of key tools such as the Shanghai
Cooperation Organisation and the Belt and Road Economic Initiative.
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...shadow0702a
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
artificial intelligence and data science contents.pptxGauravCar
What is artificial intelligence? Artificial intelligence is the ability of a computer or computer-controlled robot to perform tasks that are commonly associated with the intellectual processes characteristic of humans, such as the ability to reason.
› ...
Artificial intelligence (AI) | Definitio
Embedded machine learning-based road conditions and driving behavior monitoringIJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
2. Monika Shah
http://www.iaeme.com/IJARET/index.asp 12 editor@iaeme.com
Solvers (for partial differential equations [1], [2], conjugate gradient methods [3], [4],
Gaussian reduction of complex matrices, etc.), fluid dynamics [5], database query
processing on large databases (LDB) [6], information retrieval [7], network theory [8],
[9], page rank computation [10], and the physics of disordered and quantum systems [11] are
well-known applications that make recurrent use of SpMV. The sparse matrices used in
these applications vary widely in non-zero pattern.
The continuous growth in the number of computer users, and their increasing usage, constantly
increases the size of the datasets used in such applications. This continuous and exponential
growth of datasets has raised the need for High Performance Computing. Researchers
have provided many solutions through inventions in high performance computing
device architectures, like the Graphical Processing Unit (GPU), and through algorithms optimized
for these devices. The GPU is a well-known and promising high performance device for
regular applications. Hence, it is a great challenge to use the GPU for an irregular
application like SpMV.
Generalized implementation of parallel SpMV has become complex because of the
following properties of sparse matrices:
1. Imbalanced number of non-zero elements in each row
2. Imbalanced number of non-zero elements in each column
3. Wide range of sparse patterns (diagonal, skewed, power-law distribution of non-zero elements per row, almost equal number of non-zero elements per row, block, etc.)
4. Varied sparsity level of the matrix (ratio of non-zero elements to the size of the matrix)
For an efficient and generalized implementation of SpMV on GPU, two important
factors have been identified by past research [12]: (i) synchronization-free load distribution
among computational resources, and (ii) reduced fetch operations to mitigate the drawback of
high-latency memory access on the GPU. Hence, it is preferable to select a sparse
storage format that supports high compression along with synchronization-free
load distribution. The major challenges in satisfying these factors are:
1. Continuous growth in datasets makes the sparse matrix very large.
2. The indirection used in the storage representation of a sparse matrix increases the size of the data to be transferred from CPU to GPU as an additional overhead.
3. A large class of sparse matrix patterns exists.
4. It is difficult to balance work distribution due to the imbalanced number of non-zero elements in each row as well as in each column.
5. Concurrency is restricted by the data dependency among row elements when computing the output vector.
The high computing capability of GPUs and the unceasing performance demand
of the SpMV kernel motivate researchers to optimize SpMV on GPU in a way that deals
with all the challenges listed above. In past research, Coordinate (COO),
Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELLPACK
(ELL), Hybrid (HYB), and Jagged Diagonal Storage (JDS) formats have been proposed with
different compression strategies [13], along with SpMV algorithms for
these sparse storage formats on GPU. The bulky index structure of the COO format reduces the
degree of synchronization-free load distribution among parallel threads, and increases
communication overhead between CPU and GPU. CSC schedules all columns of a
sparse matrix sequentially in SpMV, and the vector b is loaded and stored frequently in
each iteration. These factors cause recurrent communication overhead,
which limits the performance of CSC on GPU, and explain the limited
popularity of the COO and CSC sparse formats on GPU. Aligned Coordinate (Aligned
3. Sparse Storage Recommendation System for Sparse Matrix Vector Multiplication on GPU
COO) [12] was introduced as a compressed format suitable for synchronization-free, balanced
load distribution and proper cache utilization.
Sparse matrix metrics like the Number of Rows (NR), Number of Columns (NC),
Number of Non-Zero elements (NNZ), non-zero elements in a row (row_len), and
non-zero elements in a column (col_len) play an important role in the compression
ratio and the degree of parallelism of the various sparse formats. An important point here
is that the compression ratio of these recognized sparse storage formats varies with the
sparsity level and sparse pattern of the input matrix. Considering these factors, this
paper proposes an algorithm to recommend a highly suitable storage format for a given
sparse matrix. The remainder of the paper is structured as follows: Section II traces the course of
optimizing sparse formats and their SpMV implementations. Section III
presents our heuristics and an algorithm that recommend a highly
suitable storage format for implementing SpMV on GPU. The parallel algorithm of CGS
is discussed in Section III-D. Section IV demonstrates and analyses the results of this
proposed work. Section V concludes the paper.
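As a small illustration (the function and variable names here are ours, not the paper's), the basic metrics NR, NC, NNZ, and row_len can be computed directly from a COO triplet list:

```python
# Hypothetical helper (not from the paper): compute the basic sparse-matrix
# metrics NR, NC, NNZ, and the per-row non-zero counts (row_len).
def matrix_metrics(rows, cols, vals, nr, nc):
    row_len = [0] * nr
    for r in rows:
        row_len[r] += 1          # count non-zeros falling in each row
    return nr, nc, len(vals), row_len

# 4x4 matrix with a skewed pattern: row 0 holds 3 of the 5 non-zeros.
rows = [0, 0, 0, 2, 3]
cols = [0, 1, 3, 2, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
nr, nc, nnz, rl = matrix_metrics(rows, cols, vals, 4, 4)
print(nnz, rl)   # 5 [3, 0, 1, 1]
```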
2. SPARSE STORAGE FORMATS
Many storage formats have been proposed by past research efforts. As
mentioned in Section I, compressed storage, synchronization-free load distribution,
and the highest possible concurrency have become the main goals in designing sparse matrix
formats for NVIDIA GPUs and the CUDA programming environment. Bell et al. [13]
introduced the storage formats COO, CSR, CSC, ELL, and HYB, which support different levels of
compression for different sparse matrix patterns. Shah M. et al. [12] introduced
Aligned COO. Many other extensions of these benchmark sparse formats [14], [15], [16],
[17], as well as hybrids of these storage formats [18], [19], [20], have also been
proposed. The difficulty, even after research on this large set of sparse formats, is that
there is no standard format suitable for almost all classes of sparse matrix patterns.
In addition, it is also difficult to identify a sparse matrix format
that supports the best compression as well as synchronization-free and balanced work-load
distribution.
Table 1 Sparse Matrix Formats and Their Space Complexity
Sparse matrix format Space Complexity
COO NNZ x 3
CSC NNZ x 2 + ( NC +1)
CSR NNZ x 2 + (NR +1)
ELL (NR x max row length) x 2
HYB ≅ ELL, for rows with similar length
≅ COO, for rest row elements
Aligned COO Num_segments x Segment_length x 3
≅ (max_row_length x (≤ NR) x 3)
Selection of a proper data compression strategy is important for two major
reasons: (i) the data transfer overhead between CPU and GPU, and (ii) the memory
access pattern of each concurrent thread depends on the data structure. Table 1 presents
the memory space required by the various sparse formats. It shows that the compression
percentage of the same format varies from one sparse matrix to another
based on the basic statistics of the matrix. For example, COO provides the highest
compression for small and highly sparse matrices; CSC and CSR give better
compression for sparse matrices that are small in terms of columns and rows,
respectively; ELL is suitable for compressing sparse matrices with little difference in
NNZ per row and a large number of rows. COO, CSC, CSR,
and ELL are known as the core sparse storage formats designed to support higher
compression. HYB is designed to reduce the padding space of the ELL format, and offers
better compression as a hybrid of the ELL and COO patterns. Aligned COO
provides better compression than ELL for highly skewed sparse matrices with
power-law distribution.
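The space formulas of Table 1 can be made concrete with a short sketch (the entry counts below follow Table 1; the example figures are ours):

```python
# Table 1 space complexities, counted in stored entries (not bytes).
def space_coo(nnz):              return nnz * 3
def space_csr(nnz, nr):          return nnz * 2 + (nr + 1)
def space_csc(nnz, nc):          return nnz * 2 + (nc + 1)
def space_ell(nr, max_row_len):  return nr * max_row_len * 2

# Hypothetical 1000 x 1000 matrix with 5000 non-zeros whose longest row
# holds 50 of them: ELL's zero padding dominates for such skewed rows.
nr = nc = 1000
nnz, max_rl = 5000, 50
print(space_coo(nnz))        # 15000
print(space_csr(nnz, nr))    # 11001
print(space_ell(nr, max_rl)) # 100000
```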
Table 2 SpMV Algorithms and Their Time Complexity

SpMV Time Complexity (excluding memory access overhead)
COO_flat NNZ / max_concurrent_threads
CSC (NC x max_col_length) / max_concurrent_threads
CSR ≤ (NR / max_concurrent_threads) x max_row_NNZ
CSR (vector) ≤ (NR / max_warps) x (max_iter_per_warp + log2(warp_size)),
where max_iter_per_warp = max_row_length / warp_size
and max_warps = max_concurrent_threads / warp_size
ELL ≥ (NR / max_concurrent_threads) x max_row_NNZ
HYB ≅ ELL, for rows with similar length
+ ≅ COO_flat, for rest row elements
Aligned_COO ≅ CSR, for aligned rows
+ ≅ COO_flat, for rest row elements
Increased concurrency and synchronization-free load distribution are important
factors in reducing the runtime of parallel SpMV on GPU. Table 2 presents the run-time
complexity of the SpMV implementations of the sparse storage formats listed above.
The COO_flat algorithm offers the highest concurrency but does not ensure synchronization-free
load distribution among concurrent threads, because row elements can cross warp
boundaries. CSC is also less preferred due to the additional overhead of accessing the
output vector in every iteration. On the other side, ELL has the overhead of transferring
extraneous memory containing padded zero values and accessing it over high-latency memory.
CSR and ELL have very similar SpMV algorithms, except for the additional
overhead in CSR of fetching the row index from memory. The CSR implementation on GPU
is more efficient than ELL where the NNZ to be accessed by one thread block and
iteration is much larger than in another block or iteration. CSR Vector provides much
higher concurrency than CSR and ELL, but has the overhead of a series of
parallel reduction steps performed by each thread. Hence, CSR Vector is not suitable when the
average NNZ per row is less than the number of steps required by the parallel reductions,
that is, log2(warp size). The HYB and Aligned COO kernels are designed to make SpMV efficient
using hybrids of the above-mentioned sparse formats and their kernels. Aligned
COO reorders non-zero elements to balance the workload distribution among the
computing resources, and thus reduces the number of row segments compared to the number of
rows in the ELL format while retaining the same maximum row length as the original. Hence, Aligned
COO gives optimized performance for highly skewed sparse matrix patterns.
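For reference, the scalar CSR kernel that Table 2 bounds can be sketched sequentially (a GPU maps one thread per row; this plain-Python version is our illustration, not the paper's kernel):

```python
# Scalar CSR SpMV: one output element per row, mirroring the per-row
# work that the CSR (scalar) GPU kernel assigns to a single thread.
def spmv_csr(row_ptr, col_idx, vals, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):                      # one "thread" per row
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# Matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form, multiplied by [1,1,1].
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

Rows with many non-zeros make the inner loop long, which is exactly the imbalance that motivates the CSR Vector and Aligned COO variants discussed above.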
3. PROPOSED WORK
Section II discussed the strengths and weaknesses of various sparse matrix storage formats
and their SpMV implementations. It indicates that the selection of the sparse storage format is
an important factor for efficient SpMV on GPU. The collection of SpMV algorithms like
JAD, CSR, ELL, CSR Vector, HYB, and Aligned COO covers a wide spectrum of sparse
matrix patterns for better performance. Recognizing the sparse matrix pattern is a great
challenge, and statistical analysis is considered a good methodology for recognizing
sparse patterns. A diagonal pattern is simple to recognize, and JAD is recommended
for diagonal sparse matrices. This section proposes a strategy to suggest the most
appropriate SpMV implementation for all sparse patterns except diagonal.
The working flow of the proposed method is described in Figure 1. Here, K-mean
clustering is used to generate detailed statistics from basic matrix statistics like NR,
NC, NNZ, and the row length vector rl[]. These derived statistics are analysed and
compared with pre-defined heuristics to suggest the most appropriate SpMV algorithm.
Section III-A explains the input and output parameters of the K-mean clustering algorithm.
Section III-B defines the heuristics for the CSR, ELL, CSR Vector, HYB,
and Aligned COO SpMV algorithms.
Figure 1 Working flow of Heuristic based Selection of SpMV algorithm
A detailed description of the heuristics-based SpMV selection algorithm is given in
Section III-C. To prove the effectiveness of the proposed algorithm, a well-known SpMV
application, CGS on GPU, is implemented as shown in Section III-D.
3.1. K-Mean Clustering And Its Parameters
Here, K-mean clustering is used to identify the level of similarity among sparse matrix rows
using the row-length parameter. The K-mean clustering algorithm constructs 2 clusters
based on the row length vector rl[]. For a highly skewed sparse matrix, the centroid of a cluster
is not sufficient to predict the similarity of row lengths. Hence, the K-mean clustering
algorithm is slightly modified to also identify the Lower Bound (LB), Upper Bound (UB),
Number of elements (CNT), and Centroid (C) for both cluster bins. The clusters are
named cluster H and cluster L based on their centroid values, i.e. C_L < C_H.
Accordingly, the output parameters of the K-mean clustering algorithm are named LB_H,
UB_H, CNT_H, C_H, LB_L, UB_L, CNT_L, and C_L, as shown in Figure 2.
Figure 2 K-mean clustering for this proposed work
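A minimal sketch of this two-cluster pass over the row-length vector (our own illustration; the paper does not spell out the update details) returns (LB, UB, CNT, C) for cluster L and cluster H:

```python
# Two-cluster K-mean on the row-length vector rl[], extended to report
# Lower Bound, Upper Bound, element Count, and Centroid per cluster.
def two_means(rl, iters=50):
    c = [min(rl), max(rl)]                  # initial centroids
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for v in rl:                        # assign each row length to nearest centroid
            groups[0 if abs(v - c[0]) <= abs(v - c[1]) else 1].append(v)
        c = [sum(g) / len(g) if g else c[j] for j, g in enumerate(groups)]
    stats = [(min(g), max(g), len(g), sum(g) / len(g)) for g in groups if g]
    return sorted(stats, key=lambda s: s[3])   # order so that C_L < C_H

rl = [1, 2, 2, 1, 3, 90, 100, 95]           # highly skewed row lengths
lo, hi = two_means(rl)                      # (LB_L, UB_L, CNT_L, C_L), (LB_H, UB_H, CNT_H, C_H)
print(lo, hi)
```

For such a skewed vector, the wide gap between C_L and C_H (and between the cluster bounds) is the signal the heuristics below exploit.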
3.2. Heuristics
Based on empirical result analysis and a basic understanding of the various SpMV
algorithms, heuristics are defined to suggest a suitable sparse storage format and GPU-based
SpMV algorithm capable of giving better performance for a given sparse matrix.
The following points are the centre of focus in the design of these heuristics:
1. Obtaining the highest possible degree of concurrency
2. Better compression of the sparse matrix to reduce memory access cost
3. Balanced work-load distribution among threads
4. Synchronization-free load distribution as far as possible
5. Reducing the number of blocks to reduce block scheduling cost
3.3. Heuristics for CSR Vector
CSR Vector is designed to provide the highest possible concurrency with synchronization-free
load distribution, which in turn ensures good accuracy. Every execution thread of
this SpMV algorithm executes at least 1 multiplication and log2 W addition
operations. For the execution of CSR Vector, a warp W (a collection of execution threads,
32 threads in general) is allotted to each row of the sparse matrix for computation. The CSR
storage format is used to implement this SpMV algorithm. But CSR is preferred for
small matrix sizes to avoid a large number of high-latency memory accesses for fetching the
row index. Considering all these criteria, CSR Vector is preferred if the following
condition is satisfied:
(C_L ≥ log2 W) and (C_H ≥ W/2) and (NNZ ≤ max_threads)

3.4. Heuristics for CSR
CSR SpMV is preferred for small matrices, where no row has a large number
of non-zero elements and the majority of rows do not have equivalent sizes in terms
of non-zero elements. Hence, CSR SpMV is preferred when CSR Vector is not
applicable and the following condition is satisfied:

3.5. Heuristics for ELL
The ELL storage format and the ELL SpMV algorithm are preferred for a sparse matrix with
equivalent row lengths, which reduces the padding overhead and improves performance.
However, a large row length reduces the concurrency degree of ELL SpMV. ELL is preferred
when there is not much difference either between the centroid values of the two clusters or
between the upper bound of the higher-value cluster and the centroid value of the cluster having
the lower centroid value. Hence, it is concluded that ELLPACK is preferred when CSR
or CSR Vector are not applicable for the given sparse matrix and the following condition is
satisfied:
(… ≤ 0.49) or ((C_H − C_L) ≤ 6)

3.6. Heuristics for HYB
When a large sparse matrix does not have equivalent row lengths, but rather a power-law
distribution of non-zero elements among the rows of the matrix with a highly skewed
visualization, the Hybrid sparse format and its SpMV are preferred. Hence, it is concluded
that HYB is preferred when the CSR or ELL sparse formats are not suitable for the given
sparse matrix and the following condition is satisfied:
(C_H / C_L ≥ 100) or (UB_H / LB_H ≥ 100) or (NNZ / NR ≥ 100)

3.7. Heuristics for Aligned COO
The Aligned COO format and its SpMV are designed to optimize performance for a large
sparse matrix having a skewed distribution of non-zero elements that also permits
alignment of large-sized rows with small-sized rows, such that the number of execution units
required can be reduced. But as it is based on the COO format, it provides less
compression compared to the hybrid format. Hence, Aligned COO is preferred when neither
the CSR, ELL, nor HYB formats are suitable, and it should satisfy the following
condition:
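As a rough combined sketch of the heuristics above (not the paper's code: the warp size, the max_threads value, and the CSR and Aligned COO tests are placeholders of our own, while the remaining thresholds follow the conditions quoted above):

```python
import math

WARP = 32                 # assumed warp size W
MAX_THREADS = 65536       # placeholder for the device's max concurrent threads

def recommend(nnz, nr, nc, C_L, C_H, LB_H, UB_H):
    # CSR Vector: condition as quoted in Section 3.3 / Algorithm 1.
    if C_L >= math.log2(WARP) and C_H >= WARP / 2 and nnz <= MAX_THREADS:
        return "CSR Vector"
    # CSR: placeholder "small matrix" test (exact condition not reproduced here).
    if nnz <= MAX_THREADS:
        return "CSR"
    # ELL: near-uniform row lengths (second clause of the ELL condition).
    if (C_H - C_L) <= 6:
        return "ELL"
    # HYB: highly skewed / power-law distribution of row lengths.
    if C_H / C_L >= 100 or UB_H / LB_H >= 100 or nnz / nr >= 100:
        return "HYB"
    return "Aligned COO"   # fallback when no other format fits

print(recommend(nnz=10**7, nr=10**6, nc=10**6,
                C_L=2.0, C_H=400.0, LB_H=100, UB_H=5000))   # HYB
```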
3.8. Heuristics based Sparse format recommendation
This section describes an algorithm to suggest the most suitable sparse format and its
associated SpMV for a given sparse matrix. It performs K-mean clustering on the sparse
matrix metrics, and compares the output parameters with the heuristics defined in the
sections above.

Algorithm 1 Heuristics based Sparse format recommendation
Input: NNZ, NR, NC, rl[]
Output: Suitable_SpMV
Perform K-mean clustering with two bins on rl[]
if (C_L ≥ log2 W) and (C_H ≥ W/2) and (NNZ ≤ max_threads) then
Suitable_SpMV ← CSR Vector

3.9. Parallel CGS
To demonstrate the effectiveness of the proposed heuristics-based sparse format
recommendation algorithm for efficient SpMV, it is preferable to test the algorithm on
an application that makes frequent and heavy use of the SpMV kernel and is
applicable to a wide category of sparse patterns. The Conjugate Gradient Solver (CGS) is
such a well-known application: it finds the solution vector x for Ax = b.
Every CGS call invokes the SpMV kernel in a loop of hundreds to thousands of
iterations. As the GPU has a major overhead of memory transfer between CPU and GPU,
this parallel CGS is designed in such a way that it needs to transfer the sparse matrix A and the
input vector b at the time of the first iteration only. The GPU-based parallel CGS is described in
Algorithm 2, where the SpMV kernel is executed for the specified number of iterations or
the size of the sparse matrix, which is in the hundreds.
4. EXPERIMENTAL RESULT
For a proper evaluation of the proposed algorithm, various SpMV algorithms (CSR,
CSR Vector, ELL, and HYB SpMV from the NVIDIA cusp library, and the Aligned COO
algorithm) have been implemented on an NVIDIA GPU.
The collection of sparse matrices used in this experiment is listed along with its basic
properties in Table 3.
4.1. Test Platform
These experiments have been executed on an Intel(R) Core(TM) i3 CPU @ 3.20 GHz
with 4 GB RAM, 2 × 256 KB (L2 cache) and 4 MB (L3 cache), and an NVIDIA C2070
GPU device, using CUDA version 4.0 on Ubuntu 11.
The dataset contains 31 sparse matrices, retrieved from the well-known
source The University of Florida Sparse Matrix Collection. The sparse matrices
are selected such that the collection contains matrices with various sparsity levels and a large
category of sparse patterns.
4.2. Result Analysis
Table 4 lists the performance of the CSR, CSR Vector, ELL, HYB, and Aligned COO
algorithms in GFLOP/s for each matrix listed in Table 3.
The heuristics-based SpMV selection algorithm is implemented, and its output compared
with the performance results recorded in Table 4. The overhead of memory transfer
between CPU and GPU is always an important factor in overall performance; this cost
is considered to be amortized over a large number of iterations. The parallel CGS algorithm
listed in Algorithm 2 has also been implemented for 200 iterations. The execution time
of this CGS, including memory transfer time, has been compared with the result of our
proposed algorithm. The recommendation of the proposed algorithm is satisfied for 30
out of the 31 sparse matrices listed in the dataset.
Algorithm 2 Parallel CGS
Input: Sparse Matrix A, Vector b, NR, NC, iterations
Output: Vector x
Initialize vector dev_x = 0
Copy vectors from host memory to device memory (b → dev_r, and b → dev_p)
Copy matrix from host memory to device memory (A → dev_A)
Compute dev_rsold = dev_r^T × dev_r using dev_inner_product(dev_r, dev_r, NR, dev_rsold)
for i = 1 → min(iterations, NR × NC) do
Initialize vector dev_Ap = 0
Perform Ap = A × p using dev_SpMV(dev_A, dev_p, dev_Ap)
Perform pAp = p^T × Ap using dev_inner_product(dev_p, dev_Ap, NR, dev_pAp)
dev_alpha = dev_rsold / dev_pAp
Asynchronous computation of dev_x and dev_r:
Perform x += alpha × p using dev_add_scalarMul(dev_alpha, dev_p, dev_x, 1, dev_x)
Perform r -= alpha × Ap using dev_add_scalarMul(dev_alpha, dev_Ap, dev_r, -1, dev_r)
Compute dev_rsnew = dev_r^T × dev_r using dev_inner_product(dev_r, dev_r, NR, dev_rsnew)
Copy device rsnew (dev_rsnew) to host rsnew
if √rsnew < 1e-10 then
Exit for
end if
Compute p = r + ((rsnew/rsold) × p) using
dev_temp = dev_rsnew / dev_rsold
dev_add_scalarMul(dev_temp, dev_p, dev_r, 1, dev_p)
dev_rsold = dev_rsnew
end for
Copy device vector to host vector (dev_x → x)
Return x
__________________________________________________________________
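A host-side sketch of Algorithm 2 in plain Python may help (our illustration: a dense matvec stands in for the dev_SpMV kernel, and the 1e-10 stopping tolerance follows the standard conjugate-gradient recipe):

```python
# Conjugate Gradient mirroring Algorithm 2; matvec stands in for dev_SpMV.
def cg(A, b, iterations, tol=1e-10):
    n = len(b)
    x = [0.0] * n                 # dev_x = 0
    r = b[:]                      # b -> dev_r
    p = b[:]                      # b -> dev_p
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]
    rsold = dot(r, r)             # dev_rsold = r^T r
    for _ in range(min(iterations, n * n)):
        Ap = matvec(A, p)                                   # dev_SpMV call
        alpha = rsold / dot(p, Ap)                          # rsold / (p^T Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]       # x += alpha p
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]      # r -= alpha Ap
        rsnew = dot(r, r)
        if rsnew ** 0.5 < tol:                              # convergence test
            break
        p = [ri + (rsnew / rsold) * pi for ri, pi in zip(r, p)]
        rsold = rsnew
    return x

A = [[4.0, 1.0], [1.0, 3.0]]      # small SPD system
print(cg(A, [1.0, 2.0], 200))     # approximately [1/11, 7/11]
```

Because the matrix and the vector b cross the CPU-GPU boundary only once, the per-iteration cost is dominated by the SpMV call, which is why the choice of storage format matters so much here.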
5. CONCLUSION
In this paper, the various factors responsible for achieving higher SpMV performance
on GPU for various sparse patterns have been discussed. It has been realized that a
decision-making algorithm is required to suggest the highest-performing SpMV
algorithm, especially for applications that use large sparse matrices with a
variety of sparse patterns and make recurrent use of SpMV. The proposed
algorithm performs a statistical analysis of the sparse pattern and provides approximately
97% successful results. This statistical result recommends the use of such clustering-based
heuristics for appropriate sparse format selection.
Table 4 Execution Performance of Various SpMV Formats (GFLOP/sec)

Matrix             CSR    CSR Vector    ELL     HYB    A_COO
3D_51448_3D        0.56         3.46   0.11    6.45    5.41
add20              0.34         0.50   0.22    0.15    0.15
add32              1.42         0.99   1.06    0.20    1.31
adder_dcop_19      0.10         0.38   0.02    0.09    0.10
aircraft           2.83         0.71   3.33    0.17    4.28
airfoil            2.95         0.41   3.60    0.20    3.86
airfoil_2d         1.17         2.03  10.99    6.64   11.20
aug3dcqp           4.51         0.24   6.58    1.01    4.87
bayer01            3.91         0.28   3.63    1.92    5.57
bcsstk36           0.66         7.95   2.32    6.04    4.81
bcsstm38           1.62         0.30   0.83    0.09    1.00
bfwa782            0.47         0.95   0.65    0.06    0.76
bips07_3078_iv     0.74         0.04   0.70    0.05    0.94
bloweybl           0.14         0.27   0.01    0.94    0.96
c64b               0.15         0.79   0.05    3.61    3.60
coater1            0.76         2.13   0.77    0.16    0.96
crankseg_2         0.61         8.39   2.28   11.50    6.37
crashbasis         2.09         2.06  16.21   16.18   11.06
delaunay_n15       3.31         1.31   5.70    1.55    4.26
epb0               0.80         0.54   1.07    0.06    1.19
FEM_3D_thermal1    0.87         4.44  14.63   14.59   10.01
fpga_trans_01      0.35         0.88   0.31    0.06    0.06
G2_circuit         6.48         0.52  12.29    4.65    8.80
gupta1             0.30         4.56   0.20    5.64    5.24
Hamrle2            3.26         0.80   4.18    0.19    4.63
jagmesh2           1.17         0.45   1.15    0.05    1.46
jagmesh3           1.25         0.44   1.25    0.06    1.51
lhr07              1.20         1.36   3.76    1.09    4.51
lung2              5.41         0.86   9.14    3.34    9.45
net100             0.65         6.59   7.11    5.78    6.74
Zd_Jac6            1.82         2.93   1.82    4.14    4.31
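For context on what the CSR columns of Table 4 benchmark, a small pure-NumPy sketch of the CSR layout (values, column indices, row pointers) and the row-wise SpMV it enables; the helper names are illustrative, not the kernels measured above:

```python
import numpy as np

def to_csr(dense):
    """Build the three CSR arrays (values, column indices, row pointers)
    from a dense matrix."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)
                indices.append(j)
        indptr.append(len(data))  # running count of stored nonzeros
    return np.array(data), np.array(indices), np.array(indptr)

def csr_spmv(data, indices, indptr, x):
    """y = A*x computed row by row from the CSR arrays."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]     # slice of row i's nonzeros
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, 3.0],
              [4.0, 5.0, 6.0]])
data, indices, indptr = to_csr(A)
y = csr_spmv(data, indices, indptr, np.array([1.0, 1.0, 1.0]))
# y equals the dense product A @ [1, 1, 1] = [3, 3, 15]
```

ELL instead pads every row to the longest row length, which is why it wins in the table for matrices with uniform rows (e.g. crashbasis) and collapses when row lengths are skewed (e.g. add20).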
REFERENCES
[1] Lee, I. Efficient sparse matrix vector multiplication using compressed graph, in Proceedings of the IEEE SoutheastCon 2010, March 2010, pp. 328–331.
[2] Wang, H. C. and Hwang, K. Multicoloring for fast sparse matrix-vector multiplication in solving PDE problems, in International Conference on Parallel Processing (ICPP 1993), Vol. 3, Aug 1993, pp. 215–222.
[3] Jamroz, B. and Mullowney, P. Performance of parallel sparse matrix-vector multiplications in linear solves on multiple GPUs, in Symposium on Application Accelerators in High Performance Computing (SAAHPC), July 2012, pp. 149–152.
[4] Hestenes, M. R. and Stiefel, E. Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards, Vol. 49, 1952.
[5] van der Veen, M. Sparse matrix vector multiplication on a field programmable
gate array, September 2007.
[6] Ashany, R. Application of sparse matrix techniques to search, retrieval,
classification and relationship analysis in large data base systems − sparcom, in
Proceedings of the Fourth International Conference on Very Large Data Bases −
Volume 4, VLDB ’78, VLDB Endowment, 1978, pp. 499–516.
[7] Goharian, N., Grossman, D. and El-Ghazawi, T. Enterprise text processing: A sparse matrix approach, in International Conference on Information Technology: Coding and Computing, 2001.
[8] Bender, M. A., Brodal, G. S., Fagerberg, R., Jacob, R. and Vicari, E. Optimal
sparse matrix dense vector multiplication in the i/o-model, in Proceedings of the
Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures,
SPAA ’07, ACM, 2007.
[9] Manzini, G. Lower bounds for sparse matrix vector multiplication on hypercubic
networks, Vol. 2, 1998.
[10] Wu, T., Wang, B., Shan, Y., Yan, F., Wang, Y. and Xu, N. Efficient PageRank and SpMV computation on AMD GPUs, in ICPP, 2010, pp. 81–89.
[11] Gan, Z. and Harrison, R. Calibrating quantum chemistry: A multi-teraflop, parallel-vector, full-configuration interaction program for the Cray X1, in Proceedings of the ACM/IEEE SC 2005 Conference, Nov 2005.
[12] Shah, M. and Patel, V. An efficient sparse matrix multiplication for skewed matrix on GPU, in 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), June 2012, pp. 1301–1306.
[13] Bell, N. and Garland, M. Implementing sparse matrix-vector multiplication on
throughput-oriented processors, in SC, 2009.
[14] Dziekonski, A., Lamecki, A. and Mrozowski, M. A memory efficient and fast
sparse matrix vector product on a gpu, Progress In Electromagnetics Research,
Vol. 116, 2011, pp. 49–63.
[15] Vazquez, F., Ortega, G., Fernandez, J. and Garzon, E. Improving the performance of the sparse matrix vector product with GPUs, in International Conference on Computer and Information Technology, 2010.
[16] Pinar, A. and Heath, M. T. Improving performance of sparse matrix-vector
multiplication, in Proceedings of the 1999 ACM/IEEE conference on
Supercomputing (CDROM), Supercomputing ’99, 1999.
[17] Shahnaz, R. and Usman, A. Blocked-based sparse matrix-vector multiplication on
distributed memory parallel computers. Int. Arab J. Inf. Technol., 2011.
[18] Yang, X., Parthasarathy, S. and Sadayappan, P. Fast sparse matrix-vector multiplication on GPUs: Implications for graph mining, CoRR, vol. abs/1103.2405, 2011.
[19] Cao, W., Yao, L., Li, Z., Wang, Y. and Wang, Z. Implementing sparse matrix-
vector multiplication using cuda based on a hybrid sparse matrix format, in
International Conference on Computer Application and System Modeling, 2010.
[20] Choi, J. W., Singh, A. and Vuduc, R. W. Model-driven autotuning of sparse
matrix-vector multiply on gpus, in Proceedings of the 15th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’10,
2010.