The document discusses porting a seismic inversion code to run in parallel using standard message-passing libraries. It describes three options considered for distributing the large 3D seismic data across processors: mapping the data onto a processor grid, treating it as a sparse-matrix problem, or distributing the data as 1D vectors assigned to each processor. The third option was chosen because it best preserved the code structure, had regular dependencies, and simplified communications. The parallel code was implemented using the Distributed Data Library (DDL) for data management and the Message Passing Interface (MPI) for basic point-to-point communication between processors. Initial tests showed near-linear speedup on up to 30 processors.
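The 1D-vector distribution described above amounts to a standard block decomposition: each processor owns a contiguous slice of the data, so neighbour exchange reduces to regular point-to-point messages. The sketch below shows how such a decomposition is typically computed; the function name and the even-block scheme are illustrative assumptions, not taken from the original code.

```python
# Sketch of a 1D block distribution: each processor owns a contiguous
# slice of the data vector, the usual MPI-style decomposition.

def block_range(n_items, n_procs, rank):
    """Return the [start, stop) slice of a length-n_items vector owned
    by processor `rank`, with any remainder spread over the first
    processors so slice sizes differ by at most one."""
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# 10 items over 3 processors: ranks get 4, 3 and 3 contiguous items.
print([block_range(10, 3, r) for r in range(3)])  # [(0, 4), (4, 7), (7, 10)]
```

In an actual MPI code each rank would compute its own range once and then exchange only boundary elements with its neighbours.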
Effective Sparse Matrix Representation for the GPU Architectures (IJCSEA Journal)
General-purpose computation on the graphics processing unit (GPU) is prominent in the current high performance computing era. Porting data-parallel applications onto the GPU gives a default performance improvement because of the increased number of computational units, and better performance can be obtained if application-specific fine tuning is done with respect to the architecture under consideration. One very widely used computation-intensive kernel is sparse matrix-vector multiplication (SpMV) in sparse-matrix-based applications. Most existing data formats for sparse matrix representation were developed with the central processing unit (CPU) or multi-cores in mind. This paper gives a new format for sparse matrix representation targeted at the graphics processor architecture that can give a 2x to 5x performance improvement compared to CSR (compressed sparse row format), 2x to 54x compared to COO (coordinate format), and 3x to 10x compared to the CSR vector format, for the class of applications that fit the proposed format. It also gives 10% to 133% improvements in memory transfer (of only the access information of the sparse matrix) between CPU and GPU. The paper gives the details of the new format and its requirements, with complete experimentation details and comparison results.
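For readers unfamiliar with the two baseline formats the abstract compares against, the sketch below builds them for a toy matrix: COO stores (row, col, val) triples, while CSR compresses the row indices into a row-pointer array so SpMV can walk each row's slice. The matrix and helper name are illustrative.

```python
import numpy as np

# COO vs CSR on a small matrix, plus a scalar CSR SpMV kernel.
A = np.array([[5, 0, 0],
              [0, 8, 3],
              [0, 0, 6]])

rows, cols = np.nonzero(A)
vals = A[rows, cols]                  # COO: three parallel arrays of length nnz

# CSR keeps cols/vals but replaces `rows` with row pointers of length n+1:
# row i's non-zeros live at indices row_ptr[i] .. row_ptr[i+1].
row_ptr = np.zeros(A.shape[0] + 1, dtype=int)
np.cumsum(np.bincount(rows, minlength=A.shape[0]), out=row_ptr[1:])

def csr_spmv(row_ptr, cols, vals, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):           # one dot product per row
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]
    return y

print(row_ptr)                                            # [0 1 3 4]
print(csr_spmv(row_ptr, cols, vals, np.ones(3)))          # [ 5. 11.  6.]
```

On a GPU the interesting question, which the paper addresses, is how to lay these arrays out so that threads in a warp make coalesced memory accesses.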
A Tale of Data Pattern Discovery in Parallel (Jenny Liu)
In the era of IoT and AI, distributed and parallel computing is embracing big-data-driven and algorithm-focused applications and services. Despite rapid progress on parallel frameworks, algorithms, and accelerated computing capacity, it remains challenging to deliver an efficient and scalable data analysis solution. This talk shares research experience on data pattern discovery in domain applications. In particular, the research scrutinizes key factors in analysis workflow design and data parallelism improvement on the cloud.
Over time, machine learning inference workloads have become more and more demanding in terms of latency and throughput, with multiple models being deployed in the same system. This scenario leaves considerable room for runtime and memory optimizations, which current systems fall short of exploiting because they treat ML models and tasks as black boxes.
In contrast, Pretzel adopts a white-box description of ML models, which allows the framework to perform optimizations over deployed models and running tasks, saving memory and increasing overall system performance. In this talk we will show the motivation behind Pretzel, its current design, and possible future developments.
The objective of this paper is to present a hybrid approach for edge detection. Under this technique, edge detection is performed in two phases: in the first phase, the Canny algorithm is applied for image smoothing, and in the second phase a neural network detects the actual edges. A neural network is a good tool for edge detection, as it is a non-linear network with built-in thresholding capability. The network can be trained with the back-propagation technique using a few training patterns, but the most important and difficult part is to identify a correct and proper training set.
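The two-phase idea can be sketched in a few lines: smooth first, then let a single thresholding neuron decide edge/no-edge from the local gradient. This is only a hedged illustration; the 3x3 mean filter and hand-set weights below stand in for the paper's Canny stage and trained back-propagation network, and all names are invented for the example.

```python
import numpy as np

def smooth(img):
    """Phase 1 stand-in: 3x3 mean filter on interior pixels."""
    out = img.astype(float).copy()
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = img[i-1:i+2, j-1:j+2].mean()
    return out

def neuron_edges(img, w=1.0, bias=-2.0):
    """Phase 2 stand-in: one neuron = weighted sum of the gradient
    magnitude plus a bias, followed by a built-in step threshold."""
    gy, gx = np.gradient(smooth(img))
    activation = w * np.hypot(gx, gy) + bias
    return (activation > 0).astype(int)

img = np.zeros((6, 6))
img[:, 3:] = 10.0                     # vertical step edge at column 3
print(neuron_edges(img)[2])           # the neuron fires near the step only
```

In the paper the weights and bias would come from back-propagation on the training set rather than being hand-chosen.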
A NEW PARALLEL MATRIX MULTIPLICATION ALGORITHM ON HEX-CELL NETWORK (PMMHC) US... (ijcsit)
Widespread attention has been paid to parallelizing algorithms for computationally intensive applications. In this paper, we propose a new parallel matrix multiplication algorithm on the Hex-Cell interconnection network. The proposed algorithm has been evaluated and compared with the sequential algorithm in terms of speedup and efficiency using IMAN1, where a set of simulation runs was carried out on different input data distributions with different sizes. The simulation results supported the theoretical analysis and m…
A New Approach to Linear Estimation Problem in Multiuser Massive MIMO Systems (Radita Apriana)
A novel approach for solving the linear estimation problem in multi-user massive MIMO systems is proposed. In this approach, the difficulty of matrix inversion is attributed to the incomplete definition of the dot product. The general definition of the dot product implies that the columns of the channel matrix are always orthogonal whereas, in practice, they may not be. If the latter information can be incorporated into the dot product, then the unknowns can be computed directly from projections without inverting the channel matrix. By doing so, the proposed method is able to achieve an exact solution with a 25% reduction in computational complexity compared to the QR method. The proposed method is stable, offers the extra flexibility of computing any single unknown, and can be implemented in just twelve lines of code.
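The baseline fact behind this abstract is easy to check: when the channel matrix really does have orthogonal columns, every unknown is just a projection, x_i = (a_i . b) / (a_i . a_i), and no inversion is needed. The sketch below shows only that orthogonal special case (the abstract's contribution is extending the idea to non-orthogonal columns, which is not reproduced here); the matrix and values are illustrative.

```python
import numpy as np

A = np.array([[2.0,  1.0],
              [1.0, -2.0]])         # the two columns are orthogonal
x_true = np.array([3.0, -1.0])
b = A @ x_true                      # noiseless observation

# Each unknown recovered independently by projecting b onto one column.
x = np.array([A[:, i] @ b / (A[:, i] @ A[:, i]) for i in range(A.shape[1])])
print(x)  # [ 3. -1.]
```

Note how each x_i is computed on its own, which is the "flexibility of computing any single unknown" the abstract mentions.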
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI (ijtsrd)
Matrix multiplication is a concept used in technology applications such as digital image processing, digital signal processing, and graph problem solving. Multiplication of huge matrices requires a lot of computing time, as its complexity is O(n³). Because most engineering and science applications require high computational throughput in minimum time, many sequential and parallel algorithms have been developed. In this paper, methods of matrix multiplication are selected, implemented, and analyzed. A performance analysis is presented, and some recommendations are given on using the OpenMP and MPI methods of parallel computing. Adamu Abubakar I | Oyku A | Mehmet K | Amina M. Tako, "Comprehensive Performance Evaluation on Multiplication of Matrices using MPI".
Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 4, Issue 2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30015.pdf
Paper URL: https://www.ijtsrd.com/engineering/electrical-engineering/30015/comprehensive-performance-evaluation-on-multiplication-of-matrices-using-mpi/adamu-abubakar-i
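The O(n³) cost mentioned above comes from the classic triple loop: n² output entries, each an n-term dot product. The sequential sketch below is the baseline that OpenMP/MPI variants split across threads or processes, typically by distributing the outer loop over rows; the function name is illustrative.

```python
def matmul(A, B):
    """Naive O(n^3) matrix multiplication over nested lists."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):          # rows of A: the loop usually parallelised
        for j in range(p):      # columns of B
            for k in range(m):  # inner dot product: m multiply-adds
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

Distributing the i-loop is attractive because each row of C is computed independently, so no communication is needed until the results are gathered.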
Distributed graph frameworks formulate tasks as sequences of supersteps within which communication is performed asynchronously by sending messages over the graph edges. PageRank's communication pattern is identical across supersteps since each vertex sends messages to all its edges. We exploit this pattern to develop a new communication paradigm that allows us to exchange messages that include only edge payloads, dramatically reducing bandwidth requirements. Experiments on a web graph of 38 billion vertices and 3.1 trillion edges yield execution times of 34.4 seconds per iteration, suggesting more than an order of magnitude improvement over the state-of-the-art.
Author:
Stergios Stergiou
Publication:
WWW '20: Proceedings of The Web Conference 2020, April 2020. Pages 2761–2767.
https://doi.org/10.1145/3366423.3380035
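The communication pattern the paper exploits is visible even in a toy PageRank: in every superstep each vertex sends a message down every out-edge, so the (source, destination) structure of the message set never changes and only the payload (rank mass) differs between supersteps. The sketch below is a generic iterative PageRank, not the paper's system; the graph, damping factor, and iteration count are illustrative.

```python
def pagerank(edges, n, d=0.85, iters=50):
    """Plain iterative PageRank over an edge list of n vertices."""
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):                       # one superstep per iteration
        incoming = [0.0] * n
        for u, v in edges:                       # identical edge set each step:
            incoming[v] += rank[u] / out_deg[u]  # only the payload varies
        rank = [(1 - d) / n + d * m for m in incoming]
    return rank

r = pagerank([(0, 1), (1, 2), (2, 0)], 3)
print(r)  # symmetric 3-cycle: all ranks stay at 1/3
```

Because the (u, v) part of each message is constant, a framework can transmit it once and thereafter exchange only the payloads, which is the bandwidth saving the abstract describes.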
A Dependent Set Based Approach for Large Graph Analysis (IJCATR)
Nowadays, social and computer networks produce graphs of thousands of nodes and millions of edges. Such large graphs are used to store and represent information. As a complex data structure, a large graph requires extra processing, so partitioning or clustering methods are used to decompose it. In this paper, a dependent-set-based graph partitioning approach is proposed which decomposes a large graph into subgraphs. It creates uniform partitions with very few edge cuts and also prevents loss of information. The work additionally focuses on an approach that handles dynamic updates to a large graph and represents it in abstract form.
Hex-Cell is an interconnection network with attractive features, such as the ability to embed topological structures like bus, ring, tree, and mesh topologies. In this paper, we present two algorithms for embedding bus and ring topologies onto the Hex-Cell interconnection network. We use three metrics to evaluate the proposed algorithms: dilation, congestion, and expansion. Our evaluation results show that the congestion of both proposed algorithms is equal to one, and that the dilation is equal to 2d-1 for the first algorithm and 1 for the second.
Stochastic Computing Correlation Utilization in Convolutional Neural Network ... (TELKOMNIKA Journal)
In recent years, many applications have been implemented in embedded systems and mobile Internet of Things (IoT) devices that typically have constrained resources and smaller power budgets, yet exhibit "smartness" or intelligence. To implement computation-intensive and resource-hungry Convolutional Neural Networks (CNNs) in this class of devices, many research groups have developed specialized parallel accelerators using Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), or Application-Specific Integrated Circuits (ASICs). An alternative computing paradigm called Stochastic Computing (SC) can implement CNNs with a low hardware footprint and power consumption. To enable building more efficient SC CNNs, this work incorporates CNN basic functions in SC that exploit correlation, share Random Number Generators (RNGs), and are more robust to rounding error. Experimental results show the proposed solution provides significant savings in hardware footprint and increased accuracy for the SC CNN basic function circuits compared to previous work.
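A standard stochastic-computing fact underlying circuits like these: with unipolar coding, a value p in [0, 1] becomes a random bitstream whose bits are 1 with probability p, and multiplying two *uncorrelated* streams is a single AND gate. Correlation between streams (for example, from a shared RNG) changes the result, which is exactly what correlation-aware designs must manage. The stream length and seed below are arbitrary.

```python
import random

random.seed(1)
N = 100_000                      # stream length; longer = less rounding noise

def bitstream(p):
    """Unipolar stochastic encoding: bit is 1 with probability p."""
    return [1 if random.random() < p else 0 for _ in range(N)]

a, b = bitstream(0.8), bitstream(0.5)
prod = [x & y for x, y in zip(a, b)]   # one AND gate acts as a multiplier
print(sum(prod) / N)                   # close to 0.8 * 0.5 = 0.4
```

The appeal for CNN hardware is that an entire multiplier shrinks to one gate, at the cost of long bitstreams and stochastic rounding error.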
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES... (ijdpsjournal)
The Science Information Network (SINET) is the Japanese academic backbone network for more than 800 universities and research institutions. The characteristic of SINET traffic is that it is enormous and highly variable. In this paper, we present a task-decomposition-based anomaly detection of massive and high-volatility session data from SINET. Three main features are discussed: task scheduling, traffic discrimination, and histogramming. We adopt a task-decomposition-based dynamic scheduling method to handle the massive session data stream of SINET. In the experiment, we analysed SINET traffic from 2/27 to 3/8 and detected some anomalies using LSTM-based time-series data processing.
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS (ijscmcj)
The purpose of this article is to determine the usefulness of Graphics Processing Unit (GPU) calculations for implementing the Latent Semantic Indexing (LSI) reduction of the term-by-document matrix. The considered reduction of the matrix is based on the SVD (Singular Value Decomposition). The high computational complexity of the SVD, O(n³), makes the reduction of a large indexing structure a difficult task. The article compares the time complexity and accuracy of the algorithms implemented in two different environments: the first is associated with the CPU and MATLAB R2011a, the second with graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to obtain a resulting matrix of large size. For both environments, computations were performed for double- and single-precision data.
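The LSI reduction itself is compact to state: take the SVD of the term-by-document matrix and keep only the k largest singular values. The sketch below does this for a toy matrix; the matrix contents and k are illustrative, and the O(n³) cost of the full SVD is what motivates offloading it to the GPU in the article.

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(6, 4)                 # toy 6-terms x 4-documents matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k]          # rank-k LSI approximation of A

# Eckart-Young: no rank-k matrix is closer to A in Frobenius norm, and the
# approximation error equals the energy of the discarded singular values.
err = np.linalg.norm(A - A_k)
print(err, np.sqrt((s[k:] ** 2).sum()))  # the two numbers agree
```

In LSI, queries and documents are then compared in the reduced k-dimensional space instead of the full term space.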
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ... (IJCNCJournal)
Intrusion detection poses a significant challenge within expansive and persistently interconnected environments. As malicious code continues to advance and sophisticated attack methodologies proliferate, various advanced deep learning-based detection approaches have been proposed. Nevertheless, the complexity and accuracy of intrusion detection models still need further enhancement to render them more adaptable to diverse system categories, particularly within resource-constrained devices, such as those embedded in edge computing systems. This research introduces a three-stage training paradigm, augmented by an enhanced pruning methodology and model compression techniques. The objective is to elevate the system's effectiveness, concurrently maintaining a high level of accuracy for intrusion detection. Empirical assessments conducted on the UNSW-NB15 dataset evince that this solution notably reduces the model's dimensions, while upholding accuracy levels equivalent to similar proposals.
Implementation of p pic algorithm in MapReduce to handle big data (eSAT Publishing House)
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio... (RSIS International)
In this paper, we have designed the VLSI hardware for a novel RS decoding algorithm suitable for multi-Gb/s communication systems. We show that the performance benefit of the algorithm is truly realised when it is implemented in hardware, thus avoiding the extra processing time of the fetch-decode-execute cycle of traditional microprocessor-based computing systems. The new algorithm, with lower time complexity combined with its application-specific hardware implementation, is suitable for high-speed real-time systems with hard timing constraints. The design is implemented as digital hardware using VHDL.
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO... (cscpconf)
For performing distributed data mining, two approaches are possible: first, data from several sources are copied to a data warehouse and mining algorithms are applied there; secondly, mining can be performed at the local sites and the results aggregated. When the number of features is high, a lot of bandwidth is consumed in transferring datasets to a centralized location. To address this, dimensionality reduction can be done at the local sites: an encoding is applied to the data so as to obtain a compressed form. The reduced features obtained at the local sites are then aggregated, and data mining algorithms are applied to them. There are several methods of performing dimensionality reduction; two of the most important are Discrete Wavelet Transforms (DWT) and Principal Component Analysis (PCA). Here a detailed study is done of how PCA can be useful in reducing data flow across a distributed network.
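The local-site step described above can be sketched directly: project each site's data onto its top principal components and ship only the reduced representation (plus the components themselves) to the aggregator. The dataset and the number of retained components below are illustrative.

```python
import numpy as np

np.random.seed(0)
local_data = np.random.rand(200, 10)     # 200 records x 10 features at a site

# PCA via the eigendecomposition of the covariance matrix.
centered = local_data - local_data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

k = 3
components = eigvecs[:, -k:]             # top-k principal directions
reduced = centered @ components          # this is what crosses the network

print(local_data.shape, '->', reduced.shape)  # (200, 10) -> (200, 3)
```

Here the per-record payload drops from 10 features to 3, which is the bandwidth saving the paper studies; the aggregator can mine the reduced features directly or approximately reconstruct the originals from the shipped components.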
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Vertex covering has important applications in wireless sensor networks, such as monitoring link failures, facility location, clustering, and data aggregation. In this study, we designed three algorithms for constructing a vertex cover in wireless sensor networks. The first algorithm, an adaptation of Parnas and Ron's algorithm, is a greedy approach that finds a vertex cover using the degrees of the nodes. The second algorithm finds a vertex cover from a graph matching, where Hoepman's weighted matching algorithm is used. The third algorithm first forms a breadth-first search tree and then constructs a vertex cover by selecting nodes at predefined levels of the tree. We show the operation of the designed algorithms, analyze them, and provide simulation results in the TOSSIM environment. Finally, we implemented, compared, and assessed all three approaches. The transmitted message count of the first algorithm is the smallest of the three, while the third algorithm turns out to give the best vertex cover approximation ratio.
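The first (greedy, degree-based) approach described above can be sketched centrally in a few lines: repeatedly add a highest-degree node to the cover and delete its edges until every edge is covered. The toy graph is illustrative, and note the paper's version runs distributed on sensor nodes rather than as the centralized loop shown here.

```python
def greedy_vertex_cover(edges):
    """Greedy degree-based vertex cover over a list of (u, v) edges."""
    remaining = set(edges)
    cover = set()
    while remaining:
        # Pick the node touching the most still-uncovered edges.
        degree = {}
        for u, v in remaining:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        best = max(degree, key=degree.get)
        cover.add(best)
        remaining = {(u, v) for u, v in remaining if best not in (u, v)}
    return cover

edges = [(0, 1), (0, 2), (0, 3), (2, 3)]
cover = greedy_vertex_cover(edges)
print(cover)  # node 0 covers three edges; one of {2, 3} covers the last
```

The greedy rule keeps the cover small in practice, though its worst-case approximation ratio is weaker than that of matching-based methods such as the second algorithm.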
Worldwide Interoperability for Microwave Access (WiMAX) is a broadband wireless access technology based on the IEEE 802.16 standards. It uses orthogonal frequency division multiple access (OFDMA) as one of its multiple access techniques. Major design factors in OFDMA resource allocation are scheduling and burst allocation. The burst allocation algorithm is responsible for calculating the appropriate dimensions and location of each user's data so as to construct the bursts in the downlink subframe; bursts are calculated in terms of the number of slots for each user. Burst allocation is used to overcome resource wastage in the form of unused and unallocated slots per frame, which affects base station performance in mobile WiMAX systems. In this paper, the HOCSA (Hybrid One Column Striping with Non-Increasing Area) algorithm is proposed to overcome frame wastage. HOCSA is implemented by improving the eOCSA algorithm and is evaluated using MATLAB. HOCSA achieves a significant reduction in resource wastage per frame, leading to better exploitation of the WiMAX frame.
Similar to Achieving Portability and Efficiency in a HPC Code Using Standard Message-passing Libraries (20)
HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-passing Libraries
HPCS 96: The 10th Annual International Conference on High Performance Computers

Achieving Portability and Efficiency in a HPC code using standard message-passing libraries

Derryck L. Lamptey, G. A. Manson, R. K. England
National Transputer Support Centre, 5 Palmerston Road, Sheffield, S10 2TE, U.K.
Contact author: bq176@torfree.net
Telephone: +44 742 76 87 40
Fax: +44 742 72 75 63
1 Introduction
As part of a European Union EUREKA project (EU 638, PARSIM) the NTSC has ported sequential FORTRAN code for seismic inversion onto a parallel platform. The sequential code has been industrially tested by one of the PARSIM partners, Ødegaard & Danneskiold-Samsøe, and remains commercially confidential. To respect confidentiality the parallel code presented here is a sanitised version of the parallel code developed at the NTSC. As part of the GP-MIMD2 project the code has been tested on the CS-2 parallel processing supercomputer at CERN and results show near linear scalability on up to 30 processors. The development environment for this successful project involved using MPI on a number of different platforms to ensure a portable yet efficient parallel code.
The code has been functionally tested on in-house Silicon Graphics' servers, and is due to be benchmarked on a Silicon Graphics' PowerChallenge Array (16 nodes) at the Silicon Graphics' SuperComputer Technology Centre in Neuchâtel, Switzerland. Results from these runs should be available by the end of February, 1996.
2 The Code
A sequential implementation of the algorithm was available, from which the parallelisation could be developed. The relevant code, referred to as "sequential code", is the code necessary to solve the problem once initial estimates and a number of preset variables have been set up. The solution of the seismic problem principally involves repeating a number of 3-dimensional operations designed to minimise the error between the seismic (input) data and the synthetic (output) data sets. The calculation computes a set of variables which reduce the entropy of a given data set over a number of iterations, using a conjugate gradient method. The total seismic inversion involves a complex algorithm, most of which will not benefit greatly from parallelisation. However, a small section of the code, shown as "Parallel Prototype" in Figure 1, accounts for 95% of the computation time. This is the portion of the algorithm which has been parallelised.
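The paper does not reproduce the inversion kernel, but the conjugate gradient idea it relies on can be sketched in a few lines. This is an illustrative sketch only: the matrix, right-hand side and tolerance below are made up for the example and are not taken from the seismic code.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Iteratively minimise the quadratic error for a symmetric
    positive-definite system A x = b (A given as a list of rows)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual b - A x (x = 0 initially)
    p = r[:]                      # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        # New direction is conjugate to the previous ones.
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

# Small SPD system whose exact solution is x = [1.0, 1.0].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [5.0, 4.0]
x = conjugate_gradient(A, b)
```

In the seismic code the same error-minimisation loop runs over 3-dimensional data sets rather than a toy 2-by-2 system.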
// Global Optimisation (labelled "Main") =
Loop for N
    Non linear optimisation of some binary variables
    // Local Optimisation (small) =
    1 dimensional optimisation
    3 dimensional optimisation
    // 3 dimensional optimisation (big, labelled "Parallel Prototype") =
    Loop for M
        scan 1: calculation of global scalar.
        scans 2..5: update of a number of data structures in each
            processing cell. Inter-cell communications required
            between each scan.
    End loop M
End loop N

Figure 1 Overview of the optimisation algorithm.
The main optimisation code (Main) does some external optimisation, then calls the internal
optimisation procedure (Parallel Prototype) to solve the system of equations.
3 Options for Code Parallelisation
For this problem, the data structures are large (circa 2GB), regular and 3-dimensional, and there are spatial dependencies between the data elements in all 3 dimensions. The data could not be distributed conveniently in the vertical direction. The computational code in the parallel prototype would be well-structured, if these spatial dependencies could be accommodated. Considerable time was therefore spent analysing these data structures and the spatial dependencies. Three data distribution approaches were proposed for this parallelisation:
• Option 1 (The processor grid approach)
• Option 2 (The sparse matrix solver approach)
• Option 3 (The distributed vector approach)
3.1 Option 1 (The processor grid approach)
Figure 2 Mapping the seismic data on to a grid of processors
[Figure: the (x, τ, y) seismic volume partitioned into blocks assigned to processors P1..P6]
This solution provides a straightforward mapping, and is conceptually simple, but has a number of disadvantages:
• the option would require complicated communications and process synchronisation, and scalability was not thought to be easily obtainable in a parallel version.
• load balancing was expected to be a problem because of the distribution patterns that would necessarily arise.
3.2 Option 2 (The sparse matrix solver approach)
Figure 3 Mapping the seismic data on to a sparse matrix
[Figure: the (x, τ, y) volume flattened into a banded sparse matrix of dimension x ∗ τ ∗ y]
The problem could be formulated as a sparse banded matrix type problem. The solution could then be handled by a number of parallel sparse matrix solvers currently available. Conceptually the sparse matrix approach is simple and elegant, but a number of implementation issues remained:
• Parallel formulation of the sparse matrix is not straightforward, because the data structures necessary for formulation as a matrix problem are not easily mapped into the data structures of the calling routine ("Main" in Figure 1). With this approach, the sparse matrix would have to be constructed on each entry to the prototype section of code, and "deconstructed" on each exit.
• Several sparse matrices would need to be set up to represent all the distributed data structures.
• Sparse matrix data storage was likely to be in the "triad" format, giving rise to a memory requirement up to three times that of any other approach.
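The triad (coordinate) format stores one (row, column, value) triple per non-zero element, which is where the roughly threefold memory estimate comes from. A minimal sketch of the idea; the band matrix here is invented for illustration:

```python
def to_triad(dense):
    """Convert a dense matrix (list of rows) to triad/coordinate
    storage: parallel lists of row index, column index and value."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0.0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals

# A small tridiagonal (banded) matrix as an illustration.
band = [
    [2.0, -1.0, 0.0, 0.0],
    [-1.0, 2.0, -1.0, 0.0],
    [0.0, -1.0, 2.0, -1.0],
    [0.0, 0.0, -1.0, 2.0],
]
rows, cols, vals = to_triad(band)
# 10 non-zeros, but 30 stored numbers: three entries per non-zero.
```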
3.3 Option 3 (The distributed vector approach)
Figure 4 Mapping the seismic data on to one dimensional vectors
[Figure: the (x, τ, y) volume flattened to a 1-dimensional vector of length x ∗ τ ∗ y, split into contiguous blocks across processors P1..P4]
This approach formulates the 3-dimensional structures as 1-dimensional structures, permitting the parallel data structures to be viewed in a manner similar to that described in Option 1, but reducing the communication complexity by assigning entire seismic lines (y direction) to processors. This permits the data on each processor to be viewed as a contiguous data space. This option is possible, and beneficial in terms of processing, because:
• loop control is highly regular.
• the spatial dependencies between the data elements are well ordered.
• the code structure of the original sequential code could be preserved to a great degree in the parallel prototype.
• there was a fairly straightforward mapping of existing data structures.
In this formulation of the problem each processor has a number of complete seismic lines (see Figure 4), permitting an SPMD (single program multiple data) approach to be employed. Figure 4 also illustrates the mapping of the seismic data onto a number of processors, the seismic data being represented by equivalent 1-dimensional vectors.
Option 3 was chosen.
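The chosen distribution can be sketched in a few lines. The line and trace counts below are those of the benchmark data set reported later (60 lines of 151 traces by 56 samples); the 8-processor configuration and the rule for spreading left-over lines are assumptions made for the example, as the paper does not specify them.

```python
def distribute_lines(num_lines, num_procs):
    """Assign complete seismic lines to processors in contiguous
    blocks, spreading any remainder over the first processors.
    Returns one (offset, count) pair per processor, in line units."""
    base, extra = divmod(num_lines, num_procs)
    layout, offset = [], 0
    for p in range(num_procs):
        count = base + (1 if p < extra else 0)
        layout.append((offset, count))
        offset += count
    return layout

# 60 lines over 8 processors: each processor owns whole lines, so its
# portion of the flattened 1-D vector is one contiguous slab.
samples_per_line = 151 * 56
layout = distribute_lines(60, 8)
slabs = [(off * samples_per_line, cnt * samples_per_line)
         for off, cnt in layout]
```

Because each processor holds whole lines, the inter-cell communication between scans only crosses the line boundaries at each processor's edges.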
4 Library Interfaces
The software architecture of the parallel prototype is shown in Figure 5.
Figure 5 Organisation of the Library hierarchy
[Figure: the parallel prototype (application layer) sits above the DDL and MPI application libraries, with I/O handled through DDL and computation/communication through MPI]
4.1 Distributed Data Libraries
DDL is the Distributed Data Library developed at the Institute for Advanced Scientific Computation at the University of Liverpool[1]. DDL is a library system which creates, manages and operates upon distributed objects such as matrices and vectors, thus permitting the exploitation of distributed memory parallel computers from a single-threaded FORTRAN program. Many of the conventions employed by the DDL are derived from the MPI standard. The distributed data structures supported by the library are treated for the most part as opaque objects, i.e. they can only be manipulated through calls to DDL procedures. DDL objects may be passed to procedures via the use of handles, but DDL also permits the application to directly access the data inside a distributed vector, in which case the application can treat the data like a local data segment. The DDL manages system memory, allocating space required for new objects and de-allocating redundant objects. DDL is used in the parallel prototype primarily for input and output, memory allocation, and distributed data management. Because of the parallelisation option chosen, only a small subset of the DDL interface is required, namely:
• Input and Output (DDL_Open(), DDL_Fileformat(), DDL_Read(), DDL_Open_host(), DDL_Write(), DDL_Close()).
• Memory Allocation and Data distribution (DDL_Create_vector(), DDL_Free()).
• Data Access (DDL_Get_vector()).
• Data distribution query (DDL_Size(), DDL_GSize(), DDL_Offset_Vector(), DDL_Size_Vector()).
• Miscellaneous (DDL_Init_sparse(), DDL_Finalize_sparse(), DDL_Ishost()).
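The handle-plus-direct-access model described above can be mimicked in a few lines. This sketch only illustrates the semantics of a distributed vector with size and offset queries; it is not the DDL's actual FORTRAN interface, and every name in it is invented for the example.

```python
class DistVector:
    """Toy model of a distributed vector handle: each of nprocs
    ranks owns a contiguous slice of a global vector, and can query
    its local size and global offset (cf. the DDL distribution
    query routines). The local slice is ordinary storage, so the
    application may access it directly like a local data segment."""
    def __init__(self, global_size, nprocs, rank):
        base, extra = divmod(global_size, nprocs)
        self.local_size = base + (1 if rank < extra else 0)
        self.offset = rank * base + min(rank, extra)
        self.global_size = global_size
        self.data = [0.0] * self.local_size  # direct-access segment

# Three ranks sharing a 10-element vector.
vecs = [DistVector(10, 3, r) for r in range(3)]
sizes = [v.local_size for v in vecs]
offsets = [v.offset for v in vecs]
```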
4.2 Message Passing Interface
MPI, the Message Passing Interface, is intended to become a standard for applications running on distributed memory MIMD concurrent computers, and is well described elsewhere[2]. It
is not intended as a complete parallel programming environment, and currently lacks universal support for features such as parallel I/O, parallel program composition, dynamic process control and debugging. However, the core of MPI consists of a set of routines which support point-to-point communication between pairs of processes or between groups of processes. Because of the regular nature of the seismic data structures, the message-passing requirements of the chosen approach are not complicated and rely on a very basic and standard subset of the MPI library:
• Message Passing routines (MPI_Send(), MPI_Recv()).
• Communication contexts (MPI_Comm_size(), MPI_Comm_rank()).
• Data reduction (MPI_Allreduce()).
• Miscellaneous (MPI_Init(), MPI_Finalize()).
More and more vendors are providing proprietary implementations of MPI, e.g. Silicon Graphics.
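Scan 1's global scalar is exactly the pattern MPI_Allreduce serves: every rank contributes a partial value and every rank receives the combined result. A pure-Python illustration of those semantics (the per-rank partial sums are made up for the example; real code would call MPI_Allreduce with MPI_SUM):

```python
def allreduce_sum(partials):
    """Model of MPI_Allreduce with the MPI_SUM operation: rank i
    contributes partials[i]; afterwards every rank holds the same
    global total."""
    total = sum(partials)
    return [total for _ in partials]

# Four ranks each reduce their own slab of data to a partial scalar,
# and all four end up holding the identical global scalar.
partials = [12.5, 7.25, 9.0, 3.25]
results = allreduce_sum(partials)
```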
4.3 Use of standard libraries in porting
During the design stage, the portable libraries (DDL and MPI) were chosen. These libraries were used during the development, but a conscious effort had to be made during the development process to use the smallest subset of functions from these libraries, in order to minimise the library dependencies. All known implementations of MPI incorporate the MPI routines required by the parallel prototype. DDL is portable, and available for an increasing number of platforms.
During the development phase, coding and functional testing was carried out on the NTSC's set of Sun workstations, and performance testing and module integration was carried out on the CS-2 supercomputer at CERN, mostly because the system bandwidth of the in-house Sun network is around 0.5MB/sec, as compared to the 50MB/sec node-to-node bandwidth that the CS-2 can sustain. Portability was also ensured for a minimal cost in efficiency (benchmarks show that on the CS-2, around 80% of the native message-passing bandwidth is available through the use of the MPI libraries).
5 Results
5.1 Parallel Performance
Using data from North Sea oilfields, the parallel prototypes developed have demonstrated impressive parallel scalability in benchmarking tests on the CS-2 machine at CERN (see Figure 6).
Figure 6 Parallel scalability of the parallel prototype on real seismic data
[Plot: computation speedup versus number of processors (0 to 35), with the achieved speedup tracking linear speedup closely. Test data: 60 lines, 151 traces per line, 56 samples per trace, 9 samples per wavelet.]
5.2 Numerical Performance
A key objective for the parallel prototype is to reduce the entropy of the provided data set,
whilst maintaining identical numerical accuracy for all parallel configurations (invariant of the number
of processors). This objective has been met. See Figure 7.
Figure 7 Final energies for a number of processors on real seismic data.
[Plot: residual energy (circa 318000 to 321000) versus number of processes (0 to 35); the final energy is the same for all processor counts.]
6 Conclusions
It is clear that this port of the numerical kernel of a large sequential program has been a success, and that the code runs on a number of parallel processing machines which are currently commercially available. Portability was obtained without significant efficiency concessions. The authors conclude that the role played by MPI and DDL is crucial in formulating parallelisation strategies, and in providing prototype implementations. From this good base the project partners can consider moving the parallelisation to other parts of the problem solution, secure in the framework of data management which has been provided by MPI and DDL.
7 References
[1] Tim Oliver et al., "Sparse DDL – User Guide", Institute for Advanced Scientific Computation, Liverpool, England, 1995. tim@supr.scm.liv.ac.uk (http://supr.scm.liv.ac.uk/~tim/parsim/parsim.html)
[2] Bill Gropp, Rusty Lusk, Tony Skjellum, Nathan Doss, "A Portable MPI Implementation", Argonne National Laboratory/Mississippi State University, November 1994. gropp@mcs.anl.gov