This document summarizes various parallel models that have been used to accelerate the Smith-Waterman algorithm for sequence alignment. It discusses implementations using associative massive parallelism on architectures like associative computing, ClearSpeed coprocessors and Convey computers. Parallel programming models of MPI, OpenMP and hybrid approaches are also covered. Implementations using systolic arrays and single/multi graphics processors are described. The document concludes that parallel models provide an excellent platform for processing vast biological sequence data efficiently.
Enhancing the matrix transpose operation using intel avx instruction set exte...ijcsit
General-purpose microprocessors are augmented with short-vector instruction extensions in order to
simultaneously process more than one data element using the same operation. This type of parallelism is
known as data-parallel processing. Many scientific, engineering, and signal processing applications can be
formulated as matrix operations. Therefore, accelerating these kernel operations on microprocessors,
which are the building blocks or large high-performance computing systems, will definitely boost the
performance of the aforementioned applications. In this paper, we consider the acceleration of the matrix
transpose operation using the 256-bit Intel advanced vector extension (AVX) instructions. We present a
novel vector-based matrix transpose algorithm and its optimized implementation using AVX instructions.
The experimental results on Intel Core i7 processor demonstrates a 2.83 speedup over the standard
sequential implementation, and a maximum of 1.53 speedup over the GCC library implementation. When
the transpose is combined with matrix addition to compute the matrix update, B + AT, where A and B are
squared matrices, the speedup of our implementation over the sequential algorithm increased to 3.19.
DYNAMIC TASK PARTITIONING MODEL IN PARALLEL COMPUTINGcscpconf
Parallel computing systems compose task partitioning strategies in a true multiprocessing
manner. Such systems share the algorithm and processing unit as computing resources which
leads to highly inter process communications capabilities. The main part of the proposed
algorithm is resource management unit which performs task partitioning and co-scheduling .In
this paper, we present a technique for integrated task partitioning and co-scheduling on the
privately owned network. We focus on real-time and non preemptive systems. A large variety of
experiments have been conducted on the proposed algorithm using synthetic and real tasks.
Goal of computation model is to provide a realistic representation of the costs of programming
The results show the benefit of the task partitioning. The main characteristics of our method are
optimal scheduling and strong link between partitioning, scheduling and communication. Some
important models for task partitioning are also discussed in the paper. We target the algorithm
for task partitioning which improve the inter process communication between the tasks and use
the recourses of the system in the efficient manner. The proposed algorithm contributes the
inter-process communication cost minimization amongst the executing processes.
Performance comparison of row per slave and rows set per slave method in pvm ...eSAT Journals
Abstract Parallel computing operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently to save time by taking advantage of non-local resources and overcoming memory constraints. Multiplication of larger matrices requires a lot of computation time. This paper deals with the two methods for handling Parallel Matrix Multiplication. First is, dividing the rows of one of the input matrices into set of rows based on the number of slaves and assigning one rows set for each slave for computation. Second method is, assigning one row of one of the input matrices at a time for each slave starting from first row to first slave and second row to second slave and so on and loop backs to the first slave when last slave assignment is finished and repeated until all rows are finished assigning. These two methods are implemented using Parallel Virtual Machine and the computation is performed for different sizes of matrices over the different number of nodes. The results show that the row per slave method gives the optimal computation time in PVM based parallel matrix multiplication. Keywords: Parallel Execution, Cluster Computing, MPI (Message Passing Interface), PVM (Parallel Virtual Machine) RAM (Random Access Memory).
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Enhancing the matrix transpose operation using intel avx instruction set exte...ijcsit
General-purpose microprocessors are augmented with short-vector instruction extensions in order to
simultaneously process more than one data element using the same operation. This type of parallelism is
known as data-parallel processing. Many scientific, engineering, and signal processing applications can be
formulated as matrix operations. Therefore, accelerating these kernel operations on microprocessors,
which are the building blocks or large high-performance computing systems, will definitely boost the
performance of the aforementioned applications. In this paper, we consider the acceleration of the matrix
transpose operation using the 256-bit Intel advanced vector extension (AVX) instructions. We present a
novel vector-based matrix transpose algorithm and its optimized implementation using AVX instructions.
The experimental results on Intel Core i7 processor demonstrates a 2.83 speedup over the standard
sequential implementation, and a maximum of 1.53 speedup over the GCC library implementation. When
the transpose is combined with matrix addition to compute the matrix update, B + AT, where A and B are
squared matrices, the speedup of our implementation over the sequential algorithm increased to 3.19.
DYNAMIC TASK PARTITIONING MODEL IN PARALLEL COMPUTINGcscpconf
Parallel computing systems compose task partitioning strategies in a true multiprocessing
manner. Such systems share the algorithm and processing unit as computing resources which
leads to highly inter process communications capabilities. The main part of the proposed
algorithm is resource management unit which performs task partitioning and co-scheduling .In
this paper, we present a technique for integrated task partitioning and co-scheduling on the
privately owned network. We focus on real-time and non preemptive systems. A large variety of
experiments have been conducted on the proposed algorithm using synthetic and real tasks.
Goal of computation model is to provide a realistic representation of the costs of programming
The results show the benefit of the task partitioning. The main characteristics of our method are
optimal scheduling and strong link between partitioning, scheduling and communication. Some
important models for task partitioning are also discussed in the paper. We target the algorithm
for task partitioning which improve the inter process communication between the tasks and use
the recourses of the system in the efficient manner. The proposed algorithm contributes the
inter-process communication cost minimization amongst the executing processes.
Performance comparison of row per slave and rows set per slave method in pvm ...eSAT Journals
Abstract Parallel computing operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently to save time by taking advantage of non-local resources and overcoming memory constraints. Multiplication of larger matrices requires a lot of computation time. This paper deals with the two methods for handling Parallel Matrix Multiplication. First is, dividing the rows of one of the input matrices into set of rows based on the number of slaves and assigning one rows set for each slave for computation. Second method is, assigning one row of one of the input matrices at a time for each slave starting from first row to first slave and second row to second slave and so on and loop backs to the first slave when last slave assignment is finished and repeated until all rows are finished assigning. These two methods are implemented using Parallel Virtual Machine and the computation is performed for different sizes of matrices over the different number of nodes. The results show that the row per slave method gives the optimal computation time in PVM based parallel matrix multiplication. Keywords: Parallel Execution, Cluster Computing, MPI (Message Passing Interface), PVM (Parallel Virtual Machine) RAM (Random Access Memory).
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLABJournal For Research
Image compression technique is used in many applications for example, satellite imaging, medical imaging, video where the size of the iamge requires more space to store, in such application image compression effectively can be used. There are two types in image compression techniques Lossy and Lossless comression. Both these techniques are used for compression of images, but these techniques are not fast. The image compression techniques both lossy and lossless image compression techniques are not fast, they take more time for compression and decompression. For fast and efficient image compression a parallel computing technique is used in matlab. Matlab is used in this project for parallel computing of images. In this paper we will discuss Regular image compression technique, three alternatives of parallel computing using matlab, comparison of image compression with and without parallel computing.
The proposed system is an efficient processing of 16-bit Multiplier Accumulator using Radix-8 and Radix-16 modified Booth Algorithm and other adders (SPST adder, Carry select adder, Parallel Prefix adder) using VHDL (Very High Speed Integrated Circuit Hardware Description Language). This proposed system provides low power, high speed and fewer delays. In both booth multipliers, comparison between the power consumption (mw) and estimated delay (ns) are calculated. The application of digital signal processing like fast fourier transform, finite impulse response and convolution needs high speed and low power MAC (Multiplier and Accumulator) units to construct an added. By reducing the glitches (from 1 to 0 transition) and spikes (from 0 to 1 transition), the speed of operation is improved and dynamic power is reduced. The adder designed with SPST avoids the unwanted glitches and spikes, reduce the switching power dissipation and the dynamic power. The speed can be improved by reducing the number of partial products to half, by grouping of bits in the multiplier term. The proposed Radix-8 and Radix-16 Modified Booth Algorithm MAC with SPST reduces the delay and obtain low power consumption as compared to array MAC.
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...ijdpsjournal
In this paper, a new progressive mesh algorithm is introduced in order to perform fast physical simulations by the use of a lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. This algorithm is able to mesh automatically the simulation domain according to the propagation of fluids. This method can also be useful in order to perform several types of physical simulations. In this paper, we associate this
algorithm with a multiphase and multicomponent lattice Boltzmann model (MPMC–LBM) because it is
able to perform various types of simulations on complex geometries. The use of this algorithm combined
with the massive parallelism of GPUs[5] allows to obtain very good performance in comparison with the
staticmesh method used in literature. Several simulations are shown in order to evaluate the algorithm.
Transformation and dynamic visualization of images from computer through an F...TELKOMNIKA JOURNAL
This article shows the implementation of a system that uses a graphic interface to load a digital image into a programmable logic device, which is stored in its internal RAM memory and is responsible for visualizing it in a matrix of RGB LEDs, so that This way, the LEDs show an equivalent to the image that was sent from the PC, conserving an aspect ratio and respecting as much as possible the color of the original image. To carry out this task, a Matlab script was designed to load the image, convert and format the data, which are transmitted to the FPGA using the RS232 protocol. The FPGA is in charge of receiving them, storing them and generating all the signals of control and synchronization of the system including the control of the PWM signals necessary to conserve the brightness of each one of the LEDs. This system allows the visualization of static images in standard formats and, in addition, thanks to the flexibility of the hardware used, it allows the visualization of moving images type GIF.
A Review on Image Compression in Parallel using CUDAIJERD Editor
Now a days images are prodigiously and sizably voluminous in size. So, this size is not facilely fits in applications. For that image compression is require. Image Compression algorithms are more resource conserving. It takes more time to consummate the task of compression. Utilizing Parallel implementation of the compression algorithm this quandary can be overcome. CUDA (Compute Unified Device Architecture) Provides parallel execution for algorithm utilizing the multi-threading. CUDA is NVIDIA`s parallel computing platform. CUDA uses GPU (Graphical Processing Unit) for the parallel execution. GPU have the number of the cores for parallel execution support. Image compression can additionally implemented in parallel utilizing CUDA. There are number of algorithms for image compression. Among them DWT (Discrete Wavelet Transform) is best suited for parallel implementation due to its more mathematical calculation and good compression result compare to other methods. In this paper included different parallel techniques for image compression. With the actualizing this image compression algorithm over the GPU utilizing CUDA it will perform the operations in parallel. In this way, vast diminish in processing time is conceivable. Furthermore it is conceivable to enhance the execution of image compression algorithms.
A BINARY TO RESIDUE CONVERSION USING NEW PROPOSED NON-COPRIME MODULI SETcsandit
Residue Number System is generally supposed to use co-prime moduli set. Non-coprime moduli sets are a field in RNS which is little studied. That's why this work was devoted to them. The resources that discuss non-coprime in RNS are very limited. For the previous reasons, this paper analyses the RNS conversion using suggested non-coprime moduli set.
Effective Sparse Matrix Representation for the GPU ArchitecturesIJCSEA Journal
General purpose computation on graphics processing unit (GPU) is prominent in the high performance computing era of this time. Porting or accelerating the data parallel applications onto GPU gives the default performance improvement because of the increased computational units. Better performances can be seen if application specific fine tuning is done with respect to the architecture under consideration. One such very widely used computation intensive kernel is sparse matrix vector multiplication (SPMV) in sparse matrix based applications. Most of the existing data format representations of sparse matrix are developed with respect to the central processing unit (CPU) or multi cores. This paper gives a new format for sparse matrix representation with respect to graphics processor architecture that can give 2x to 5x performance improvement compared to CSR (compressed row format), 2x to 54x performance improvement with respect to COO (coordinate format) and 3x to 10 x improvement compared to CSR vector format for the class of application that fit for the proposed new format. It also gives 10% to 133% improvements in memory transfer (of only access information of sparse matrix) between CPU and GPU. This paper gives the details of the new format and its requirement with complete experimentation details and results of comparison.
program partitioning and scheduling IN Advanced Computer ArchitecturePankaj Kumar Jain
Advanced Computer Architecture,Program Partitioning and Scheduling,Program Partitioning & Scheduling,Latency,Levels of Parallelism,Loop-level Parallelism,Subprogram-level Parallelism,Job or Program-Level Parallelism,Communication Latency,Grain Packing and Scheduling,Program Graphs and Packing
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
Hardware/software co-design for a parallel three-dimensional bresenham’s algo...IJECEIAES
Line plotting is the one of the basic operations in the scan conversion. Bresenham’s line drawing algorithm is an efficient and high popular algorithm utilized for this purpose. This algorithm starts from one end-point of the line to the other end-point by calculating one point at each step. As a result, the calculation time for all the points depends on the length of the line thereby the number of the total points presented. In this paper, we developed an approach to speed up the Bresenham algorithm by partitioning each line into number of segments, find the points belong to those segments and drawing them simultaneously to formulate the main line. As a result, the higher number of segments generated, the faster the points are calculated. By employing 32 cores in the Field Programmable Gate Array, a line of length 992 points is formulated in 0.31μs only. The complete system is implemented using Zybo board that contains the Xilinx Zynq-7000 chip (Z-7010).
International Journal of Engineering Research and DevelopmentIJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIijtsrd
In Matrix multiplication we refer to a concept that is used in technology applications such as digital image processing, digital signal processing and graph problem solving. Multiplication of huge matrices requires a lot of computing time as its complexity is O n3 . Because most engineering science applications require higher computational throughput with minimum time, many sequential and analogue algorithms are developed. In this paper, methods of matrix multiplication are elect, implemented, and analyzed. A performance analysis is evaluated, and some recommendations are given when using open MP and MPI methods of parallel of latitude computing. Adamu Abubakar I | Oyku A | Mehmet K | Amina M. Tako ""Comprehensive Performance Evaluation on Multiplication of Matrices using MPI""
Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-2 , February 2020,
URL: https://www.ijtsrd.com/papers/ijtsrd30015.pdf
Paper Url : https://www.ijtsrd.com/engineering/electrical-engineering/30015/comprehensive-performance-evaluation-on-multiplication-of-matrices-using-mpi/adamu-abubakar-i
HOMOGENEOUS MULTISTAGE ARCHITECTURE FOR REAL-TIME IMAGE PROCESSINGcscpconf
In this article, we present a new multistage architecture oriented to real-time complex processing applications. Given a set of rules, this proposed architecture allows the using of different communication links (point to point link, hardware router…) to connect unlimited number of parallel computing elements (software processors) to follow the increasing complexity of algorithms. In particular, this work brings out a parallel implementation of multihypothesis approach for road recognition application on the proposed Multiprocessor Systemon-Chip (MP-SoC) architecture. This algorithm is usually the main part of the lane keeping applications. Experimental results using images of a real road scene are presented. Using a low cost FPGA-based System-on-Chip, our hardware architecture is able to detect and recognize the roadsides in a time limit of 60 mSec. Moreover, we demonstrate that our multistage architecture may be used to achieve good speed-up in solving automotive applications.
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLABJournal For Research
Image compression technique is used in many applications for example, satellite imaging, medical imaging, video where the size of the iamge requires more space to store, in such application image compression effectively can be used. There are two types in image compression techniques Lossy and Lossless comression. Both these techniques are used for compression of images, but these techniques are not fast. The image compression techniques both lossy and lossless image compression techniques are not fast, they take more time for compression and decompression. For fast and efficient image compression a parallel computing technique is used in matlab. Matlab is used in this project for parallel computing of images. In this paper we will discuss Regular image compression technique, three alternatives of parallel computing using matlab, comparison of image compression with and without parallel computing.
The proposed system is an efficient processing of 16-bit Multiplier Accumulator using Radix-8 and Radix-16 modified Booth Algorithm and other adders (SPST adder, Carry select adder, Parallel Prefix adder) using VHDL (Very High Speed Integrated Circuit Hardware Description Language). This proposed system provides low power, high speed and fewer delays. In both booth multipliers, comparison between the power consumption (mw) and estimated delay (ns) are calculated. The application of digital signal processing like fast fourier transform, finite impulse response and convolution needs high speed and low power MAC (Multiplier and Accumulator) units to construct an added. By reducing the glitches (from 1 to 0 transition) and spikes (from 0 to 1 transition), the speed of operation is improved and dynamic power is reduced. The adder designed with SPST avoids the unwanted glitches and spikes, reduce the switching power dissipation and the dynamic power. The speed can be improved by reducing the number of partial products to half, by grouping of bits in the multiplier term. The proposed Radix-8 and Radix-16 Modified Booth Algorithm MAC with SPST reduces the delay and obtain low power consumption as compared to array MAC.
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...ijdpsjournal
In this paper, a new progressive mesh algorithm is introduced in order to perform fast physical simulations by the use of a lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. This algorithm is able to mesh automatically the simulation domain according to the propagation of fluids. This method can also be useful in order to perform several types of physical simulations. In this paper, we associate this
algorithm with a multiphase and multicomponent lattice Boltzmann model (MPMC–LBM) because it is
able to perform various types of simulations on complex geometries. The use of this algorithm combined
with the massive parallelism of GPUs[5] allows to obtain very good performance in comparison with the
staticmesh method used in literature. Several simulations are shown in order to evaluate the algorithm.
Transformation and dynamic visualization of images from computer through an F...TELKOMNIKA JOURNAL
This article shows the implementation of a system that uses a graphic interface to load a digital image into a programmable logic device, which is stored in its internal RAM memory and is responsible for visualizing it in a matrix of RGB LEDs, so that This way, the LEDs show an equivalent to the image that was sent from the PC, conserving an aspect ratio and respecting as much as possible the color of the original image. To carry out this task, a Matlab script was designed to load the image, convert and format the data, which are transmitted to the FPGA using the RS232 protocol. The FPGA is in charge of receiving them, storing them and generating all the signals of control and synchronization of the system including the control of the PWM signals necessary to conserve the brightness of each one of the LEDs. This system allows the visualization of static images in standard formats and, in addition, thanks to the flexibility of the hardware used, it allows the visualization of moving images type GIF.
A Review on Image Compression in Parallel using CUDAIJERD Editor
Now a days images are prodigiously and sizably voluminous in size. So, this size is not facilely fits in applications. For that image compression is require. Image Compression algorithms are more resource conserving. It takes more time to consummate the task of compression. Utilizing Parallel implementation of the compression algorithm this quandary can be overcome. CUDA (Compute Unified Device Architecture) Provides parallel execution for algorithm utilizing the multi-threading. CUDA is NVIDIA`s parallel computing platform. CUDA uses GPU (Graphical Processing Unit) for the parallel execution. GPU have the number of the cores for parallel execution support. Image compression can additionally implemented in parallel utilizing CUDA. There are number of algorithms for image compression. Among them DWT (Discrete Wavelet Transform) is best suited for parallel implementation due to its more mathematical calculation and good compression result compare to other methods. In this paper included different parallel techniques for image compression. With the actualizing this image compression algorithm over the GPU utilizing CUDA it will perform the operations in parallel. In this way, vast diminish in processing time is conceivable. Furthermore it is conceivable to enhance the execution of image compression algorithms.
A BINARY TO RESIDUE CONVERSION USING NEW PROPOSED NON-COPRIME MODULI SETcsandit
Residue Number System is generally supposed to use co-prime moduli set. Non-coprime moduli sets are a field in RNS which is little studied. That's why this work was devoted to them. The resources that discuss non-coprime in RNS are very limited. For the previous reasons, this paper analyses the RNS conversion using suggested non-coprime moduli set.
Effective Sparse Matrix Representation for the GPU ArchitecturesIJCSEA Journal
General purpose computation on graphics processing unit (GPU) is prominent in the high performance computing era of this time. Porting or accelerating the data parallel applications onto GPU gives the default performance improvement because of the increased computational units. Better performances can be seen if application specific fine tuning is done with respect to the architecture under consideration. One such very widely used computation intensive kernel is sparse matrix vector multiplication (SPMV) in sparse matrix based applications. Most of the existing data format representations of sparse matrix are developed with respect to the central processing unit (CPU) or multi cores. This paper gives a new format for sparse matrix representation with respect to graphics processor architecture that can give 2x to 5x performance improvement compared to CSR (compressed row format), 2x to 54x performance improvement with respect to COO (coordinate format) and 3x to 10 x improvement compared to CSR vector format for the class of application that fit for the proposed new format. It also gives 10% to 133% improvements in memory transfer (of only access information of sparse matrix) between CPU and GPU. This paper gives the details of the new format and its requirement with complete experimentation details and results of comparison.
program partitioning and scheduling IN Advanced Computer ArchitecturePankaj Kumar Jain
Advanced Computer Architecture,Program Partitioning and Scheduling,Program Partitioning & Scheduling,Latency,Levels of Parallelism,Loop-level Parallelism,Subprogram-level Parallelism,Job or Program-Level Parallelism,Communication Latency,Grain Packing and Scheduling,Program Graphs and Packing
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
Hardware/software co-design for a parallel three-dimensional bresenham’s algo...IJECEIAES
Line plotting is the one of the basic operations in the scan conversion. Bresenham’s line drawing algorithm is an efficient and high popular algorithm utilized for this purpose. This algorithm starts from one end-point of the line to the other end-point by calculating one point at each step. As a result, the calculation time for all the points depends on the length of the line thereby the number of the total points presented. In this paper, we developed an approach to speed up the Bresenham algorithm by partitioning each line into number of segments, find the points belong to those segments and drawing them simultaneously to formulate the main line. As a result, the higher number of segments generated, the faster the points are calculated. By employing 32 cores in the Field Programmable Gate Array, a line of length 992 points is formulated in 0.31μs only. The complete system is implemented using Zybo board that contains the Xilinx Zynq-7000 chip (Z-7010).
International Journal of Engineering Research and DevelopmentIJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIijtsrd
In Matrix multiplication we refer to a concept that is used in technology applications such as digital image processing, digital signal processing and graph problem solving. Multiplication of huge matrices requires a lot of computing time as its complexity is O n3 . Because most engineering science applications require higher computational throughput with minimum time, many sequential and analogue algorithms are developed. In this paper, methods of matrix multiplication are elect, implemented, and analyzed. A performance analysis is evaluated, and some recommendations are given when using open MP and MPI methods of parallel of latitude computing. Adamu Abubakar I | Oyku A | Mehmet K | Amina M. Tako ""Comprehensive Performance Evaluation on Multiplication of Matrices using MPI""
Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-2 , February 2020,
URL: https://www.ijtsrd.com/papers/ijtsrd30015.pdf
Paper Url : https://www.ijtsrd.com/engineering/electrical-engineering/30015/comprehensive-performance-evaluation-on-multiplication-of-matrices-using-mpi/adamu-abubakar-i
HOMOGENEOUS MULTISTAGE ARCHITECTURE FOR REAL-TIME IMAGE PROCESSINGcscpconf
In this article, we present a new multistage architecture oriented to real-time complex processing applications. Given a set of rules, this proposed architecture allows the using of different communication links (point to point link, hardware router…) to connect unlimited number of parallel computing elements (software processors) to follow the increasing complexity of algorithms. In particular, this work brings out a parallel implementation of multihypothesis approach for road recognition application on the proposed Multiprocessor Systemon-Chip (MP-SoC) architecture. This algorithm is usually the main part of the lane keeping applications. Experimental results using images of a real road scene are presented. Using a low cost FPGA-based System-on-Chip, our hardware architecture is able to detect and recognize the roadsides in a time limit of 60 mSec. Moreover, we demonstrate that our multistage architecture may be used to achieve good speed-up in solving automotive applications.
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP IJCSEIT Journal
The current multi-core architectures have become popular due to performance, and efficient processing of
multiple tasks simultaneously. Today’s the parallel algorithms are focusing on multi-core systems. The
design of parallel algorithm and performance measurement is the major issue on multi-core environment. If
one wishes to execute a single application faster, then the application must be divided into subtask or
threads to deliver desired result. Numerical problems, especially the solution of linear system of equation
have many applications in science and engineering. This paper describes and analyzes the parallel
algorithms for computing the solution of dense system of linear equations, and to approximately compute
the value of π using OpenMP interface. The performances (speedup) of parallel algorithms on multi-core
system have been presented. The experimental results on a multi-core processor show that the proposed
parallel algorithms achieves good performance (speedup) compared to the sequential
The network anomaly detection technology based
on support vector machine (SVM) can efficiently detect unknown
attacks or variants of known attacks. However, it cannot be used
for detection of large-scale intrusion scenarios due to the demand
of computational time. The graphics processing unit (GPU) has
the characteristics of multi-threads and powerful parallel
processing capability. Hence Parallel computing framework is
used to accelerate the SVM-based classification.
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...csandit
Computational Grid (CG) creates a large heterogeneous and distributed paradigm to manage and execute the applications which are computationally intensive. In grid scheduling tasks are assigned to the proper processors in the grid system to for its execution by considering the execution policy and the optimization objectives. In this paper, makespan and the faulttolerance of the computational nodes of the grid which are the two important parameters for the task execution, are considered and tried to optimize it. As the grid scheduling is considered to be NP-Hard, so a meta-heuristics evolutionary based techniques are often used to find a solution for this. We have proposed a NSGA II for this purpose. The performance estimation ofthe proposed Fault tolerance Aware NSGA II (FTNSGA II) has been done by writing program in Matlab. The simulation results evaluates the performance of the all proposed algorithm and the results of proposed model is compared with existing model Min-Min and Max-Min algorithm which proves effectiveness of the model.
The aim of the proposed research will be to develop software for implementing a parallel solution for the RSA decryption algorithm. Multithread and distributed computing methods will be used to reach the aimed objective. This effort will include the development of a hybrid OpenMP/MPI program to maximize the use of computational resources and, consequently, decrease the time to decrypt large ciphertexts.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Concurrent Matrix Multiplication on Multi-core ProcessorsCSCJournals
With the advent of multi-cores every processor has built-in parallel computational power and that can only be fully utilized only if the program in execution is written accordingly. This study is a part of an on-going research for designing of a new parallel programming model for multi-core architectures. In this paper we have presented a simple, highly efficient and scalable implementation of a common matrix multiplication algorithm using a newly developed parallel programming model SPC3 PM for general purpose multi-core processors. From our study it is found that matrix multiplication done concurrently on multi-cores using SPC3 PM requires much less execution time than that required using the present standard parallel programming environments like OpenMP. Our approach also shows scalability, better and uniform speedup and better utilization of available cores than that the algorithm written using standard OpenMP or similar parallel programming tools. We have tested our approach for up to 24 cores with different matrices size varying from 100 x 100 to 10000 x 10000 elements. And for all these tests our proposed approach has shown much improved performance and scalability
What is high performance Computing, examples of OpenMP, MPI and Pthreads. How obtain HPC benefits using OpenSolaris.
This version was presented at OpenSolaris Day in Porto Alegre, Brazil, in April 16, 2008.
Similar to A survey of Parallel models for Sequence Alignment using Smith Waterman Algorithm (20)
An Examination of Effectuation Dimension as Financing Practice of Small and M...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Does Goods and Services Tax (GST) Leads to Indian Economic Development?iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Childhood Factors that influence success in later lifeiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Customer’s Acceptance of Internet Banking in Dubaiiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Consumer Perspectives on Brand Preference: A Choice Based Model Approachiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Student`S Approach towards Social Network Sitesiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Broadcast Management in Nigeria: The systems approach as an imperativeiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
A Study on Retailer’s Perception on Soya Products with Special Reference to T...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladeshiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Media Innovations and its Impact on Brand awareness & Considerationiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Customer experience in supermarkets and hypermarkets – A comparative studyiosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Social Media and Small Businesses: A Combinational Strategic Approach under t...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Implementation of Quality Management principles at Zimbabwe Open University (...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...iosrjce
IOSR Journal of Business and Management (IOSR-JBM) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of business and managemant and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications inbusiness and management. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Democratizing Fuzzing at Scale by Abhishek Aryaabh.arya
Presented at NUS: Fuzzing and Software Security Summer School 2024
This keynote talks about the democratization of fuzzing at scale, highlighting the collaboration between open source communities, academia, and industry to advance the field of fuzzing. It delves into the history of fuzzing, the development of scalable fuzzing platforms, and the empowerment of community-driven research. The talk will further discuss recent advancements leveraging AI/ML and offer insights into the future evolution of the fuzzing landscape.
Automobile Management System Project Report.pdfKamal Acharya
The proposed project is developed to manage the automobile in the automobile dealer company. The main module in this project is login, automobile management, customer management, sales, complaints and reports. The first module is the login. The automobile showroom owner should login to the project for usage. The username and password are verified and if it is correct, next form opens. If the username and password are not correct, it shows the error message.
When a customer search for a automobile, if the automobile is available, they will be taken to a page that shows the details of the automobile including automobile name, automobile ID, quantity, price etc. “Automobile Management System” is useful for maintaining automobiles, customers effectively and hence helps for establishing good relation between customer and automobile organization. It contains various customized modules for effectively maintaining automobiles and stock information accurately and safely.
When the automobile is sold to the customer, stock will be reduced automatically. When a new purchase is made, stock will be increased automatically. While selecting automobiles for sale, the proposed software will automatically check for total number of available stock of that particular item, if the total stock of that particular item is less than 5, software will notify the user to purchase the particular item.
Also when the user tries to sale items which are not in stock, the system will prompt the user that the stock is not enough. Customers of this system can search for a automobile; can purchase a automobile easily by selecting fast. On the other hand the stock of automobiles can be maintained perfectly by the automobile shop manager overcoming the drawbacks of existing system.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfKamal Acharya
The College Bus Management system is completely developed by Visual Basic .NET Version. The application is connect with most secured database language MS SQL Server. The application is develop by using best combination of front-end and back-end languages. The application is totally design like flat user interface. This flat user interface is more attractive user interface in 2017. The application is gives more important to the system functionality. The application is to manage the student’s details, driver’s details, bus details, bus route details, bus fees details and more. The application has only one unit for admin. The admin can manage the entire application. The admin can login into the application by using username and password of the admin. The application is develop for big and small colleges. It is more user friendly for non-computer person. Even they can easily learn how to manage the application within hours. The application is more secure by the admin. The system will give an effective output for the VB.Net and SQL Server given as input to the system. The compiled java program given as input to the system, after scanning the program will generate different reports. The application generates the report for users. The admin can view and download the report of the data. The application deliver the excel format reports. Because, excel formatted reports is very easy to understand the income and expense of the college bus. This application is mainly develop for windows operating system users. In 2017, 73% of people enterprises are using windows operating system. So the application will easily install for all the windows operating system users. The application-developed size is very low. The application consumes very low space in disk. Therefore, the user can allocate very minimum local disk space for this application.
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Vaccine management system project report documentation..pdfKamal Acharya
The Division of Vaccine and Immunization is facing increasing difficulty monitoring vaccines and other commodities distribution once they have been distributed from the national stores. With the introduction of new vaccines, more challenges have been anticipated with this additions posing serious threat to the already over strained vaccine supply chain system in Kenya.
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSEDuvanRamosGarzon1
AIRCRAFT GENERAL
The Single Aisle is the most advanced family aircraft in service today, with fly-by-wire flight controls.
The A318, A319, A320 and A321 are twin-engine subsonic medium range aircraft.
The family offers a choice of engines
A survey of Parallel models for Sequence Alignment using Smith Waterman Algorithm
1. IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 2, Ver. VI (Mar – Apr. 2015), PP 48-52
www.iosrjournals.org
DOI: 10.9790/0661-17264852 www.iosrjournals.org 48 | Page
A survey of Parallel models for Sequence Alignment using Smith
Waterman Algorithm
Jyoti Pandey1
, Dr. Nilay Khare2,
Rahul Pandey3
1,2
(Computer Science & Engineering, Maulana Azad National Institute of Technology, India)
3
(Computer Science & Engineering, International Institute of Technology, Hyderabad, India)
Abstract: Nowadays stack of biological data growing steeply, so there is need of smart way to handle and
process these data to extract meaningful information related to biological life. The purpose of this survey is to
study various parallel models which perform alignment of the sequences as fast as possible, which is a big
challenge for both engineer and biologist. The various parallel models discussed in this paper are:
implementation using associative massive parallelism contain architecture such as associative computing,
ClearSpeed coprocessor and Convey Computer. Some parallel programming models such as MPI, OpenMP and
hybrid (combination of both). Then the implementation of alignment using systolic array and lastly uses single
and multi-graphics processors, that is, using graphics processing units.
Keywords: ASC, MTAP, PE, recursive variable expansion, Smith Waterman Algorithm, SIMD, scoring matrix.
I. Introduction
Whenever a new gene is found, its functionalities are inferred by finding the percentage of similarity to
one of the known sequence. The general problem is to find the degree of similarity between two separate
sequences and best match of the query string inside much large database sequence. The focus is mainly on the
sequence of the nucleic acid which contain Adenine (‘A’), Cytosine (‘C’), Guanine (‘G’) and Thymine (‘T’) for
DNA or the sequence of amino acid form an alphabet of 20 possible amino acid.
There are various algorithms for the alignment of two sequences which are classified as local and
global alignment. The Smith Waterman algorithm is a local alignment algorithm, which finds a query sequence
into the large sized database sequence. It uses the standard dynamic programming since it incorporates both
overlapping subproblem and optimal substructure property. The main extract of the algorithm has been
explained as mathematically below:
H (i, 0) =0 for 0<=i<=m
H (0, j) =0 for 0<=j<=n
1<=i<=m, 1<=j<=n
H (i-1, j-1) +s (i, j) is to match or mismatch, H (i-1, j) +d is for deletion and H (i, j-1) +e is for insertion.
where
a, b = strings over the alphabet
m=length (a)
n=length (b)
S (a, b) =similarity function on a, b
H (i, j) =maximum similarity score
d=penalty for deletion
e=penalty for insertion
There are various accelerated versions of the algorithm like implementation on GPU, FPGA, SIMD,
Cell Broadband engine. Also the various models have been developed like Convey Computer uses HC-1, Clear
speed coprocessor and so on. The various modes for parallelizing the sequence alignment algorithm are
discussed below.
2. A survey of Parallel models for Sequence Alignment using Smith Waterman Algorithm
DOI: 10.9790/0661-17264852 www.iosrjournals.org 49 | Page
II. Parallel Models
2.1 Associative Massive Parallelism
The implementation of the Smith Waterman algorithm on three different parallel architectures using
associative massive parallelism [1] where different architectures include: Associative Computing (ASC), the
ClearSpeed coprocessor, and the Convey Computer FPGA coprocessor [2]. These three architectures allow non-
overlapping, non-intersecting subalignments.
2.1.1 Associating computing model
Associative Computing (ASC) [3] is an associative model of parallel computation can be used to
implement and extend local sequence alignment algorithm such as a Smith Waterman algorithm. As the name
indicates, it does not use associative memory. In associative computing, data searching is performed by content
rather than the using address of memory. The Associative computing model finds the best local alignment in
O(m+n) time using (m+1) processing element, where m and n are size of comparing sequences.
The ASC model consists of an array of cells where each cell contains processing element and local
memory, it also contains an instruction stream (IS). As seen in Fig1 every processing element has access to the
memory local to the processing element. Here, each processing element can perform the task of the sequential
processor, but not of issuing instruction.
Rather than using a 2-D contiguous matrix of sequential computer, m+1 PE’s is used where each PE hold one
row of data in sequential Smith Waterman algorithm.
Fig1: ASC model.
2.1.2 ClearSpeed Coprocessor
To deploy Smith Waterman alignment ClearSpeed SIMD CSX620 PCI-X accelerator board is used as
platform, which has 96 processing elements per chip with two chips per board so 196 processing elements, the
MTAP architecture [2] is used by the board. Since similarity to SIMD like board so CSX620 was chosen, but it
lacks associative function, so associative functionality is handled at the software level by CSX620. The
functions were written in ClearSpeed assembly language for speed and efficiency.
The Smith Waterman alignment code implemented on ClearSpeed accelerator contain essentially two
parts, firstly, the parallel computation of score matrix and secondly, the sequential traceback. The parallel part
of the code has an optimal running time of O(m+n) on m processing elements. The performance is excellent for
small input size, but it is not good to handle large strings. So to overcome the limitation, the Convey Computer
HC1 was chosen to perform the Smith Waterman alignment.
2.1.3 Convey Computer
The Convey Computer combines both host processor and coprocessor in a single hardware system,
both processors share global memory and program can be executed concurrently on both processors. FPGA
technology has been employed on Convey coprocessor which makes it reconfigurable computing.
Convey Computer’s Smith Waterman algorithm run on the x86/64-bit CPU processor, coprocessor or
on the both processors concurrently. The implementation on both the processor and coprocessor achieve very
high throughput as compared to earlier generation of HC-1 coprocessor, one of the fastest GCUPS rating at its
release. The Convey computer is more robust, scalable that can even implement the large data set alignment as
compared to the earlier ones.
3. A survey of Parallel models for Sequence Alignment using Smith Waterman Algorithm
DOI: 10.9790/0661-17264852 www.iosrjournals.org 50 | Page
2.2 Parallel programming model
2.2.1 Open MP
All processors can access data from shared memory without explicit communication. A joint effort by
compiler vendors to establish standards in this field, known as OpenMP [3]. Open multiprocessing (OpenMP) is
a shared memory architecture API, having a multithreaded architecture, which is an open specification for
shared memory parallelism. Its API consist Compiler Directives, Runtime library routines and Environment
variables. The main idea of OpenMP is data-shared parallel execution.
In Smith Waterman algorithm implementation, the OpenMP based parallelism using various sizes of
the blocks for filling score matrix by different processors in parallel.
2.2.2 MPI
The message passing interface is a standardization of message passing interface library specification.
Its central focus is on distributed processing. An MPI program contains a collection of processes, those
exchange messages. All MPI processes execute the same program in an SPMD style, that is, the single program
executes on multiple data, which enhances the execution speed.
2.2.3 Hybrid (OpenMP and MPI)
By using mixed mode programming, we can take advantage of both shared memory and message
passing interface. In mixed mode programming two levels of communication are used, that is, inter-node
communication and intra-node communication. In inter-node communication, the message passing method is
used between various processors and intra-node uses the shared memory method, where various processors
access shared memory
Fig2: MPI/Open MP program execution on two processors.
Fig 2 shows a hybrid system in which two nodes communicate through a message passing interface (MPI) and
the processor themselves fork four threads which uses shared memory for communication.
Basically, all three models discussed above are inherited from SIMD modelling [4], [5].
2.3. Systolic Architecture
Most of the hardware implementation for sequence alignment algorithm such as Smith Waterman
algorithm do not present full implementation of algorithms and the traceback process is performed on general
purpose processor. There are various techniques to accelerate the algorithm like linear recursive variable
expansion (RVE) implementations [5]; systolic arrays [6]; incremental approaches [7]; and other simplifications
of the algorithm.
Milik and Pulka speed up the processing task by optimizing the hardware architecture and taking advantage of
the properties of FPGA [8].
Using Systolic Array
Systolic Array is an arrangement of processors in 1-D or 2-D form where data flow across the array
between neighbor synchronously [9]. In systolic array design, the systolic processing element array is mapped to
4. A survey of Parallel models for Sequence Alignment using Smith Waterman Algorithm
DOI: 10.9790/0661-17264852 www.iosrjournals.org 51 | Page
an anti-diagonal line of the score matrix in sequence alignment algorithm like Smith Waterman algorithm.
Limited number of PE can be implemented on FPGA due to hardware limitation. So, in the calculation of
similarity matrix matrices needed to be divided into submatrices. The complete Smith Waterman algorithm is
implemented on FPGA not like earlier ones where scoring matrix is calculated on FPGA and traceback is
calculated on general purpose processor.
Smith-Waterman algorithm on an innovative reconfigurable supercomputing platform, the XD1000, for
DNA takes a 384-PE systolic array working at 66.7 MHz, achieves 25.6 GCUPS peak performance [10].
Fig3: Mapping Smith Waterman to Systolic PE Array
2.4 Single and multi graphics processor
In this paper, firstly basics of the Smith Waterman algorithm are presented, the column maximum and
row maximum are calculated for each table entry causing O(L1L2(L1+L2)) computational work which is
comparatively more as compared to retaining the previous column and row sums. By doing so the storage
requirement increases, but it reduces work significantly.
The classic way to parallelize the Smith Waterman algorithm is to work along anti-diagonals. As each
diagonal element depends upon cell above, left and diagonally above, so each diagonal element can be
calculated independently of others.
The main problem with parallelizing through diagonal approach is how to access memory. Any
hardware can be used to accelerate memory accesses. In Graphics processing unit, mainly it is the matter of how
to fetch database and query sequence from memory, through the implementation of memory banks and memory
streaming, the memory access can be enhanced with hardware. So it implies that, it is very efficient to access 32
consecutive memory locations in comparison to those scattered along the anti-diagonals [11], [12].
To deal with slow random memory access Smith Waterman algorithm can be reformulated in which
calculations can be performed in parallel one row or column at a time.
Using SLI (Nvidia) or ATI CrossFire technology, Smith Waterman algorithm can be implemented on
upto 4 GPUs which are installed on a single motherboard.
If it needs to be done by implementing a large number of GPUs operated on different motherboard and
communicating via the connecting network, then a smart parallelizing and implementation strategy needed. In
this, the sequence is distributed in N slightly overlapping blocks and blocks distributed on different processing
cores and then result is collected through Ethernet, so needed much attention while combining the result
collected from GPUs on the host CPU. Here the size of overlapping need to be considered taking care as the
result depends on alignment from left to right (because Smith Waterman compute from left to right) as
compared to alignment in right to left.
III. Conclusion
The sequence alignment using a Smith Waterman algorithm has been the most extensively used
algorithms in Bioinformatics to align two sequences for extracting information relevant to the evolutionary and
biological life. So far there are various models or we can say the accelerator used for alignment of sequences
5. A survey of Parallel models for Sequence Alignment using Smith Waterman Algorithm
DOI: 10.9790/0661-17264852 www.iosrjournals.org 52 | Page
which has specific advantages, according to the needs like depending on small or large dataset or just to get
highest alignment or scoring value in scoring matrix. This paper discusses about the FPGA, GPU, Associative
massive parallelism and various hybrid system.
We conclude that parallel models provide an excellent platform on which to a run sequence alignment,
and those parallel architectures will be able to cope far more easily with the vast data produced by the high
throughput sequencer.
References
[1]. S. Steinfadt, J.W. Baker, SWAMP: Smith–Waterman using associative massive parallelism, International Symposium on Parallel
and Distributed Processing, 2008, 1–8.
[2]. S. Steinfadt, Fine-grained parallel implementations for SWAMP+ Smith–Waterman alignment, Parallel Computing, 39(12), 2013,
819-833.
[3]. Potter, J. W. Baker, S. Scott, A. Bansal, C. Leangsuksun, and C. Asthagiri, ASC: an associative-computing paradigm, Computer, 27
(11), 1994, 19-25.
[4]. M. Farrar, Striped Smith-Waterman speeds database searches six times over other SIMD implementations, Bioinformatics, 23(2),
2007, 156–161.
[5]. L. Hasan and Z. Al-Ars, An efficient and high performance linear recursive variable expansion implementation of the Smith-
Waterman algorithm, Proc. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, The
Netherlands, 2009, 3845–3848.
[6]. M. Gok and C. Ylmaz, Efficient cell designs for systolic smith-Waterman implementations, International Conference on, 2006, 1–
4.
[7]. A. Pulka and A. Milik, Considerations on incremental approach to hardware implementation of smith-waterman algorithm, in
Mixed Design of Integrated Circuits and Systems (MIXDES), Proceedings of the 18th International Conference, 2011, 283–288.
[8]. A. Milik and A. Pulka, Hardware oriented optimization of Smith Waterman algorithm, International Conference on, 2010, 319–322.
[9]. L. Hasan, Y.M. Khawaja, A. Bais, A Systolic Array Architecture for The Smith-Waterman Algorithm With High Performance Cell
Design, Proc. IADIS European Conference on Data Mining, Amsterdam, The Netherlands, July 2008.
[10]. P. Zhang, G. Tan, and G. R. Gao, Implementation of the Smith Waterman algorithm on a reconfigurable supercomputing platform,
Proc. 1st international workshop on High-performance reconfigurable computing technology and applications, New York, NY,
USA, 2007, 39-48.
[11]. Ł. Ligowski, W. Rudnicki, An efficient implementation of Smith–Waterman algorithm on GPU using CUDA for massively parallel
scanning of sequence databases, 23rd IEEE International Parallel and Distributed Processing Symposium, May 25–29, Aurelia
Convention Centre and Expo Rome, Italy, 2009.
[12]. V. Chaudhary, F. Liu, V Matta, and T. Yang, Parallel implementations of local Sequence alignment: hardware and software,
Parallel Computing for Bioinformatics and Computational Biology: Models, Enabling Technologies, and Case Studies, Wiley
Series on Parallel and Distributed Computing , 2006, 234.