This document summarizes a research paper that implemented Levenberg-Marquardt artificial neural network training using graphics processing unit (GPU) hardware acceleration. The key points are:
1) This appears to be the first description of implementing artificial neural networks using the Levenberg-Marquardt training method on a GPU.
2) The paper describes their approach for implementing the Levenberg-Marquardt algorithm on a GPU, which involves solving the matrix inversion operation that is typically computationally expensive.
3) Results show that training networks using the GPU implementation can be up to 10 times faster than using a CPU-only implementation on the same hardware.
Cell Charge Approximation for Accelerating Molecular Simulation on CUDA-Enabl... ijcax
Methods for Molecular Dynamics (MD) simulations are investigated. MD simulation is a widely used computer simulation approach for studying the properties of molecular systems. Force calculation in MD is computationally intensive. Parallel programming techniques can be applied to speed up those calculations.
The major aim of this paper is to speed up MD simulation calculations using the General Purpose Graphics Processing Unit (GPU) computing paradigm, an efficient and economical way to do parallel computing. To that end, we propose a method called cell charge approximation, which treats the electrostatic interactions in MD simulations. This method reduces the complexity of force calculations.
Median based parallel steering kernel regression for image reconstruction csandit
Image reconstruction is the process of obtaining the original image from corrupted data. Applications of image reconstruction include computed tomography, radar imaging, and weather forecasting. Recently, the steering kernel regression method has been applied to image reconstruction [1]. This technique has two major drawbacks. First, it is computationally intensive. Second, the output of the algorithm suffers from spurious edges (especially in the case of denoising). We propose a modified version of Steering Kernel Regression called the Median Based Parallel Steering Kernel Regression Technique. In the proposed algorithm, the first problem is overcome by implementing it on GPUs and multi-cores. The second problem is addressed by a gradient-based suppression in which a median filter is used. Our algorithm gives better output than Steering Kernel Regression. The results are compared using Root Mean Square Error (RMSE). Our algorithm has also shown a speedup of 21x using GPUs and a speedup of 6x using multi-cores.
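The abstract above compares reconstructions by Root Mean Square Error. As a minimal sketch of that metric only (not the authors' code), assuming images are given as flat lists of pixel intensities and using the hypothetical function name `rmse`:

```python
import math


def rmse(reference, estimate):
    """Root Mean Square Error between two equal-length pixel sequences.

    Assumption: images are flattened to plain lists of numbers; the
    abstract does not say how its images are stored.
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean of the squared per-pixel differences, then the square root.
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    original = [10, 20, 30, 40]
    denoised = [11, 19, 31, 39]
    print(rmse(original, denoised))
```

A lower value means the reconstruction is closer to the reference image, which is how two methods could be ranked against the same ground truth.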
Network anomaly detection technology based on the support vector machine (SVM) can efficiently detect unknown attacks or variants of known attacks. However, it cannot be used to detect large-scale intrusion scenarios because of the computational time it demands. The graphics processing unit (GPU) offers multi-threading and powerful parallel processing capability. Hence, a parallel computing framework is used to accelerate the SVM-based classification.
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI ijtsrd
Matrix multiplication is a concept used in technology applications such as digital image processing, digital signal processing, and graph problem solving. Multiplication of huge matrices requires a lot of computing time, as its complexity is O(n^3). Because most engineering science applications require higher computational throughput in minimum time, many sequential and parallel algorithms have been developed. In this paper, methods of matrix multiplication are selected, implemented, and analyzed. A performance analysis is carried out, and some recommendations are given for using the OpenMP and MPI methods of parallel computing. Adamu Abubakar I | Oyku A | Mehmet K | Amina M. Tako "Comprehensive Performance Evaluation on Multiplication of Matrices using MPI"
Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-2 , February 2020,
URL: https://www.ijtsrd.com/papers/ijtsrd30015.pdf
Paper Url : https://www.ijtsrd.com/engineering/electrical-engineering/30015/comprehensive-performance-evaluation-on-multiplication-of-matrices-using-mpi/adamu-abubakar-i
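To make the O(n^3) cost mentioned in the abstract above concrete, here is a minimal sequential sketch in Python. It is an illustration only, not the paper's OpenMP or MPI code; the function name `matmul_counted` and the multiplication counter are assumptions added for this example.

```python
def matmul_counted(a, b):
    """Naive matrix product of a (n x m) and b (m x p).

    Returns the product and the number of scalar multiplications,
    which is n * m * p (n^3 for square matrices).
    """
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0] * p for _ in range(n)]
    multiplications = 0
    for i in range(n):          # each row of a
        for j in range(p):      # each column of b
            for k in range(m):  # dot product along the shared dimension
                c[i][j] += a[i][k] * b[k][j]
                multiplications += 1
    return c, multiplications


if __name__ == "__main__":
    product, count = matmul_counted([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, count)
```

The outer loop over rows carries no dependency between iterations, which is why row-wise distribution across threads or processes is a common starting point for parallel versions.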
Hardware Architecture for Calculating LBP-Based Image Region Descriptors Marek Kraft
In this paper, an efficient hardware architecture enabling the computation of LBP-based image region descriptors is presented. The complete region descriptor is formed by combining individual local descriptors and arranging them into a grid, as typically used in object detection and recognition. The proposed solution performs massively parallel, pipelined computations, facilitating the processing of over two hundred VGA frames per second, and can easily be adapted to different window and grid sizes for use with other descriptors.
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Comparative study to realize an automatic speaker recognition system IJECEIAES
In this research, we present an automatic speaker recognition system based on adaptive orthogonal transformations. To obtain the informative features with a minimum dimension from the input signals, we created an adaptive operator, which helped to identify the speaker’s voice in a fast and efficient manner. We test the efficiency and the performance of our method by comparing it with another approach, mel-frequency cepstral coefficients (MFCCs), which is widely used by researchers as their feature extraction method. The experimental results show the importance of creating the adaptive operator, which gives added value to the proposed approach. The performance of the system achieved 96.8% accuracy using Fourier transform as a compression method and 98.1% using Correlation as a compression method.
A Review on Image Compression in Parallel using CUDA IJERD Editor
Nowadays images are very large, and images of this size do not fit easily into applications, so image compression is required. Image compression algorithms are resource-intensive and take a long time to complete the task of compression. A parallel implementation of the compression algorithm can overcome this problem. CUDA (Compute Unified Device Architecture), NVIDIA's parallel computing platform, provides parallel execution of algorithms through multi-threading. CUDA uses the GPU (Graphics Processing Unit), which has a large number of cores, for parallel execution. Image compression can also be implemented in parallel using CUDA. There are a number of algorithms for image compression. Among them, the DWT (Discrete Wavelet Transform) is best suited to parallel implementation because it is computation-heavy and gives good compression results compared with other methods. This paper covers different parallel techniques for image compression. Implementing the image compression algorithm on the GPU using CUDA lets its operations run in parallel. In this way, a large reduction in processing time is possible, and the performance of image compression algorithms can be improved.
Performance analysis of real-time and general-purpose operating systems for p... IJECEIAES
In general, modern operating systems can be divided into two main categories: real-time operating systems (RTOS) and general-purpose operating systems (GPOS). The main difference between a GPOS and an RTOS is whether the system is time-critical. In a GPOS, a high-priority thread cannot preempt a kernel call, whereas in an RTOS a low-priority task is preempted by a high-priority task if necessary, even while it is executing a kernel call. Most Linux distributions can be used as either a GPOS or an RTOS with kernel modifications. In this study, two Linux distributions, Ubuntu and Pardus, were analyzed and their performance was compared, both as GPOS and as RTOS, for path planning of multi-robot systems. Robot groups with different numbers of members performed path tracking tasks using both Ubuntu and Pardus as GPOS and RTOS. In this way, the performance of the two Linux distributions in robotic applications was observed and compared in both forms, GPOS and RTOS.
Real-time traffic sign detection and recognition using Raspberry Pi IJECEIAES
The number of road accidents in Malaysia is increasing rapidly. One way to reduce the number of road accidents is through the development of advanced driving assistance systems (ADAS) by professional engineers. Several ADAS systems have been proposed that take into consideration the delay tolerance and the accuracy of the system itself. In this work, a traffic sign recognition system has been developed to increase the safety of road users by installing the system inside the car for the driver’s awareness. TensorFlow has been used in this work for object recognition through machine learning because of its high accuracy. The algorithm is embedded in a Raspberry Pi 3 for processing and analysis to detect traffic signs in real-time video from a Raspberry Pi camera NoIR. This work aims to study the accuracy, delay, and reliability of the developed system on a Raspberry Pi 3 processor across several scenarios related to the state of the environment and the condition of the traffic signs. A real-time testbed implementation has been conducted with twenty different traffic signs, and the results show that the system has more than 90% accuracy and is reliable with an acceptable delay.
Parallel implementation of pulse compression method on a multi-core digital ... IJECEIAES
The pulse compression algorithm is widely used in radar applications. It requires considerable processing power to be executed in real time. Therefore, its processing must be distributed across multiple processing units. The present paper proposes a real-time platform based on the multi-core digital signal processor (DSP) C6678 from Texas Instruments (TI). The objective of this paper is to optimize the parallel implementation of the pulse compression algorithm over the eight cores of the C6678 DSP. Two parallelization approaches were implemented. The first approach is based on the open multi-processing (OpenMP) programming interface, a software interface that helps to execute different sections of a program on a multi-core processor. The second approach is an optimized method that we have proposed to distribute the processing and to synchronize the eight cores of the C6678 DSP. The proposed method gives the best performance: a parallel efficiency of 94% was obtained when the eight cores were activated.
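The efficiency figure in the abstract above can be related to speedup with the usual definition, efficiency = speedup / number of cores. The sketch below assumes that definition (the abstract does not state which one the authors used); both function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Efficiency as speedup divided by core count (assumed definition)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


def implied_speedup(efficiency, cores):
    """Speedup implied by a given efficiency on a given number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return efficiency * cores


if __name__ == "__main__":
    # 94% efficiency on eight cores, as reported in the abstract.
    print(round(implied_speedup(0.94, 8), 2))
```

Under that assumed definition, 94% efficiency on eight cores corresponds to a speedup of roughly 7.5 over a single core.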
CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR... ijassn
Macro-programming is a new-generation method of using Wireless Sensor Networks (WSNs), in which application developers can extract data from sensor nodes through a high-level abstraction of the system. Instead of developing the entire application, a task graph representation of the WSN model offers a simplified approach to data collection. However, mapping tasks onto sensor nodes raises several problems in energy consumption and routing delay. In this paper, we present an efficient hybrid approach to task mapping for WSNs, the Hybrid Genetic Algorithm, which considers multiple optimization objectives: energy consumption, routing delay, and soft real-time requirements. We also present a method to configure the algorithm according to the user's needs by changing the heuristics used for optimization. A trade-off analysis between energy consumption and delivery delay was performed, and simulation results are presented. The algorithm is applicable during macro-programming, enabling developers to choose a better mapping according to their application requirements.
The Cerebellar Model Articulation Controller (CMAC) is an influential brain-inspired computing model in numerous relevant fields. Much research has applied CMAC in areas such as facial expression recognition and pattern recognition, taking advantage of its easy implementation and good results. In this paper we present some methods of using CMAC and their results.
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY csandit
This paper presents a parallel approach to address the time complexity problem associated with sequential algorithms. An image steganography algorithm in the transform domain is considered for implementation. Image steganography is a technique for hiding a secret message in an image. With the parallel implementation, a large message can be hidden in a large image, since it does not take much processing time. It is implemented on GPU systems. Parallel programming is done using OpenCL on CUDA cores from NVIDIA. The speed-up obtained is very good, with reasonably good output signal quality, when a large amount of data is processed.
Performance comparison of row per slave and rows set per slave method in pvm ... eSAT Journals
Abstract: Parallel computing operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently to save time by taking advantage of non-local resources and overcoming memory constraints. Multiplication of larger matrices requires a lot of computation time. This paper deals with two methods for handling parallel matrix multiplication. The first divides the rows of one of the input matrices into sets of rows based on the number of slaves and assigns one row set to each slave for computation. The second assigns one row of one of the input matrices at a time to each slave, starting with the first row to the first slave, the second row to the second slave, and so on, looping back to the first slave after the last slave has been assigned a row, and repeating until all rows have been assigned. These two methods are implemented using Parallel Virtual Machine, and the computation is performed for different sizes of matrices over different numbers of nodes. The results show that the row per slave method gives the optimal computation time in PVM-based parallel matrix multiplication. Keywords: Parallel Execution, Cluster Computing, MPI (Message Passing Interface), PVM (Parallel Virtual Machine), RAM (Random Access Memory).
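The two row-distribution schemes described in the abstract above can be sketched as index assignments. This is an illustrative Python sketch rather than the paper's PVM code; the function names are hypothetical, and giving leftover rows to the lowest-numbered slaves in the block scheme is an assumption the abstract does not specify.

```python
def rows_set_per_slave(n_rows, n_slaves):
    """Contiguous blocks: each slave gets one set of consecutive rows.

    Assumption: when n_rows is not divisible by n_slaves, the first
    (n_rows % n_slaves) slaves receive one extra row each.
    """
    base, extra = divmod(n_rows, n_slaves)
    assignment, start = [], 0
    for slave in range(n_slaves):
        size = base + (1 if slave < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment


def row_per_slave(n_rows, n_slaves):
    """Round-robin: row 0 to slave 0, row 1 to slave 1, and so on,
    wrapping back to slave 0 after the last slave."""
    return [list(range(slave, n_rows, n_slaves)) for slave in range(n_slaves)]


if __name__ == "__main__":
    print(rows_set_per_slave(7, 3))
    print(row_per_slave(7, 3))
```

Both schemes give every slave roughly the same number of rows; they differ in which rows end up together and in how many separate assignments are made, which is the kind of difference the paper's timing comparison measures.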
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
An Adaptive Load Balancing Middleware for Distributed Simulation Gabriele D'Angelo
Simulation is useful for supporting the design and performance evaluation of complex systems, possibly composed of a massive number of interacting entities. For this reason, the simulation of such systems may need the aggregate computation and memory resources of clusters of parallel and distributed execution units. Shared computer clusters composed of available Commercial-Off-the-Shelf hardware are preferable to dedicated systems, mainly for cost reasons. The performance of distributed simulations is influenced by the heterogeneity of the execution units and by their respective background CPU load. Adaptive load balancing mechanisms could improve resource utilization and simulation execution by dynamically tuning the simulation load with an eye to reducing synchronization and communication overheads. This work presents the GAIA+ framework, a new load balancing mechanism for distributed simulation. The framework has been evaluated by performing testbed simulations of a wireless ad hoc network model. Results confirm the effectiveness of the proposed solutions.
Deep Learning Fast MRI Using Channel Attention in Magnitude Domain Joonhyung Lee
My presentation on how we participated in the fastMRI Challenge in 2019.
Aside from theoretical considerations, it also explains key implementation issues that arise in all deep learning for MRI such as disk I/O and CPU/GPU load balancing.
Used for presentation at ISBI 2020 Oral session.
Accidentally wrote the title as "Deep Learning Sum-of-Squares Images in Accelerated Parallel MRI". Sorry for the mistake!
Black-box modeling of nonlinear system using evolutionary neural NARX model IJECEIAES
Nonlinear systems with uncertainty and disturbance are very difficult to model using a mathematical approach. Therefore, a black-box modeling approach that requires no prior knowledge is necessary. Several modeling approaches have been used to develop black-box models, such as fuzzy logic, neural networks, and evolutionary algorithms. In this paper, an evolutionary neural network that combines a neural network with a modified differential evolution algorithm is applied to model a nonlinear system. The feasibility and effectiveness of the proposed modeling are tested on a piezoelectric actuator SISO system and an experimental quadruple tank MIMO system.
Research Inventy : International Journal of Engineering and Science is published by the group of young academic and industrial researchers with 12 Issues per year. It is an online as well as print version open access journal that provides rapid publication (monthly) of articles in all areas of the subject such as: civil, mechanical, chemical, electronic and computer engineering as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by rapid process within 20 days after acceptance and peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
Comparative study to realize an automatic speaker recognition system IJECEIAES
In this research, we present an automatic speaker recognition system based on adaptive orthogonal transformations. To obtain the informative features with a minimum dimension from the input signals, we created an adaptive operator, which helped to identify the speaker’s voice in a fast and efficient manner. We test the efficiency and the performance of our method by comparing it with another approach, mel-frequency cepstral coefficients (MFCCs), which is widely used by researchers as their feature extraction method. The experimental results show the importance of creating the adaptive operator, which gives added value to the proposed approach. The performance of the system achieved 96.8% accuracy using Fourier transform as a compression method and 98.1% using Correlation as a compression method.
A Review on Image Compression in Parallel using CUDAIJERD Editor
Now a days images are prodigiously and sizably voluminous in size. So, this size is not facilely fits in applications. For that image compression is require. Image Compression algorithms are more resource conserving. It takes more time to consummate the task of compression. Utilizing Parallel implementation of the compression algorithm this quandary can be overcome. CUDA (Compute Unified Device Architecture) Provides parallel execution for algorithm utilizing the multi-threading. CUDA is NVIDIA`s parallel computing platform. CUDA uses GPU (Graphical Processing Unit) for the parallel execution. GPU have the number of the cores for parallel execution support. Image compression can additionally implemented in parallel utilizing CUDA. There are number of algorithms for image compression. Among them DWT (Discrete Wavelet Transform) is best suited for parallel implementation due to its more mathematical calculation and good compression result compare to other methods. In this paper included different parallel techniques for image compression. With the actualizing this image compression algorithm over the GPU utilizing CUDA it will perform the operations in parallel. In this way, vast diminish in processing time is conceivable. Furthermore it is conceivable to enhance the execution of image compression algorithms.
Performance analysis of real-time and general-purpose operating systems for p...IJECEIAES
In general, modern operating systems can be divided into two essential parts, real-time operating systems (RTOS) and general-purpose operating systems (GPOS). The main difference between GPOS and RTOS is the system is time-critical or not. It means that; in GPOS, a high-priority thread cannot preempt a kernel call. But, in RTOS, a low-priority task is preempted by a high-priority task if necessary, even if it’s executing a kernel call. Most Linux distributions can be used as both GPOS and RTOS with kernel modifications. In this study, two Linux distributions, Ubuntu and Pardus, were analyzed and their performances were compared both as GPOS and RTOS for path planning of the multi-robot systems. Robot groups with different numbers of members were used to perform the path tracking tasks using both Ubuntu and Pardus as GPOS and RTOS. In this way, both the performance of two different Linux distributions in robotic applications were observed and compared in two forms, GPOS, and RTOS.
Real-time traffic sign detection and recognition using Raspberry Pi IJECEIAES
Nowadays, the number of road accident in Malaysia is increasing expeditiously. One of the ways to reduce the number of road accident is through the development of the advanced driving assistance system (ADAS) by professional engineers. Several ADAS system has been proposed by taking into consideration the delay tolerance and the accuracy of the system itself. In this work, a traffic sign recognition system has been developed to increase the safety of the road users by installing the system inside the car for driver’s awareness. TensorFlow algorithm has been considered in this work for object recognition through machine learning due to its high accuracy. The algorithm is embedded in the Raspberry Pi 3 for processing and analysis to detect the traffic sign from the real-time video recording from Raspberry Pi camera NoIR. This work aims to study the accuracy, delay and reliability of the developed system using a Raspberry Pi 3 processor considering several scenarios related to the state of the environment and the condition of the traffic signs. A real-time testbed implementation has been conducted considering twenty different traffic signs and the results show that the system has more than 90% accuracy and is reliable with an acceptable delay.
Parallel implementation of pulse compression method on a multi-core digital ...IJECEIAES
Pulse compression algorithm is widely used in radar applications. It requires a huge processing power in order to be executed in real time. Therefore, its processing must be distributed along multiple processing units. The present paper proposes a real time platform based on the multi-core digital signal processor (DSP) C6678 from Texas Instruments (TI). The objective of this paper is the optimization of the parallel implementation of pulse compression algorithm over the eight cores of the C6678 DSP. Two parallelization approaches were implemented. The first approach is based on the open multi processing (OpenMP) programming interface, which is a software interface that helps to execute different sections of a program on a multi core processor. The second approach is an optimized method that we have proposed in order to distribute the processing and to synchronize the eight cores of the C6678 DSP. The proposed method gives the best performance. Indeed, a parallel efficiency of 94% was obtained when the eight cores were activated.
CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR...ijassn
Macro-programming is a new-generation, advanced method of using Wireless Sensor Networks (WSNs), in which application developers can extract data from sensor nodes through a high-level abstraction of the system. Instead of developing the entire application, a task graph representation of the WSN model offers a simplified approach to data collection. However, mapping tasks onto sensor nodes raises several problems in energy consumption and routing delay. In this paper, we present an efficient hybrid approach to task mapping for WSNs, the Hybrid Genetic Algorithm, which considers multiple optimization objectives: energy consumption, routing delay and soft real-time requirements. We also present a method to configure the algorithm according to the user's needs by changing the heuristics used for optimization. A trade-off analysis between energy consumption and delivery delay was performed, and simulation results are presented. The algorithm is applicable during macro-programming, enabling developers to choose a better mapping according to their application requirements.
The Cerebellar Model Articulation Controller (CMAC) is an influential brain-inspired computing model in
many relevant fields. CMAC has been studied in many applications because it is easy to implement
and gives good results, for example in facial expression recognition and pattern
recognition. In this paper we present some methods of using CMAC and their results.
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHYcsandit
This paper presents a parallel approach to improve the time complexity problem associated
with sequential algorithms. An image steganography algorithm in the transform domain is
considered for implementation. Image steganography is a technique to hide a secret message in
an image. With the parallel implementation, a large message can be hidden in a large image since
it does not take much processing time. It is implemented on GPU systems. Parallel
programming is done using OpenCL on CUDA cores from NVIDIA. The speed-up improvement
obtained is very good, with reasonably good output signal quality, when a large amount of data is
processed.
Performance comparison of row per slave and rows set per slave method in pvm ...eSAT Journals
Abstract: Parallel computing operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently to save time by taking advantage of non-local resources and overcoming memory constraints. Multiplication of larger matrices requires a lot of computation time. This paper deals with two methods for handling parallel matrix multiplication. The first divides the rows of one of the input matrices into sets of rows based on the number of slaves and assigns one set of rows to each slave for computation. The second assigns one row of one of the input matrices at a time to each slave, with the first row going to the first slave, the second row to the second slave and so on, looping back to the first slave after the last slave has been assigned a row, until all rows have been assigned. These two methods are implemented using Parallel Virtual Machine, and the computation is performed for different sizes of matrices over different numbers of nodes. The results show that the row per slave method gives the optimal computation time in PVM-based parallel matrix multiplication. Keywords: Parallel Execution, Cluster Computing, MPI (Message Passing Interface), PVM (Parallel Virtual Machine), RAM (Random Access Memory).
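The two row-distribution schemes described above can be sketched as follows. This is an illustration only, not the paper's PVM implementation: the slaves are simulated sequentially in one process, all function names are hypothetical, and the way leftover rows are spread across slaves in the rows-set scheme is an assumption, since the abstract does not say how remainders are handled.

```python
def rows_set_per_slave(n_rows, n_slaves):
    """Contiguous blocks: each slave receives one set of consecutive rows.

    Assumption: leftover rows go to the first slaves, one extra row each.
    """
    base, extra = divmod(n_rows, n_slaves)
    assignment, start = [], 0
    for k in range(n_slaves):
        size = base + (1 if k < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment


def row_per_slave(n_rows, n_slaves):
    """Round-robin: row i goes to slave i mod n_slaves."""
    return [list(range(k, n_rows, n_slaves)) for k in range(n_slaves)]


def multiply_rows(a, b, rows):
    """Work of one simulated slave: compute the given rows of a * b."""
    cols = list(zip(*b))  # columns of b
    return {i: [sum(x * y for x, y in zip(a[i], col)) for col in cols]
            for i in rows}


def parallel_matmul(a, b, assignment):
    """Gather the partial results of every simulated slave into the product."""
    result = [None] * len(a)
    for rows in assignment:
        for i, row in multiply_rows(a, b, rows).items():
            result[i] = row
    return result


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[7, 8], [9, 10]]
    print(rows_set_per_slave(5, 2), row_per_slave(5, 2))
    print(parallel_matmul(a, b, row_per_slave(len(a), 2)))
```

Both schemes produce the same product; they differ only in which rows each slave handles, which is what changes the communication pattern and load balance measured in the paper.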
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
An Adaptive Load Balancing Middleware for Distributed SimulationGabriele D'Angelo
Simulation is useful to support the design and performance evaluation of complex systems, possibly composed of a massive number of interacting entities. For this reason, the simulation of such systems may need aggregate computation and memory resources obtained from clusters of parallel and distributed execution units. Shared computer clusters composed of available Commercial-Off-the-Shelf hardware are preferable to dedicated systems, mainly for cost reasons. The performance of distributed simulations is influenced by the heterogeneity of execution units and by their respective background CPU load. Adaptive load balancing mechanisms could improve resource utilization and the execution of the simulation process by dynamically tuning the simulation load with an eye to reducing synchronization and communication overheads. This work presents the GAIA+ framework, a new load balancing mechanism for distributed simulation. The framework has been evaluated by performing testbed simulations of a wireless ad hoc network model. Results confirm the effectiveness of the proposed solutions.
Deep Learning Fast MRI Using Channel Attention in Magnitude DomainJoonhyung Lee
My presentation on how we participated in the fastMRI Challenge in 2019.
Aside from theoretical considerations, it also explains key implementation issues that arise in all deep learning for MRI such as disk I/O and CPU/GPU load balancing.
Used for presentation at ISBI 2020 Oral session.
Accidentally wrote the title as "Deep Learning Sum-of-Squares Images in Accelerated Parallel MRI". Sorry for the mistake!
Black-box modeling of nonlinear system using evolutionary neural NARX modelIJECEIAES
Nonlinear systems with uncertainty and disturbance are very difficult to model using a mathematical approach. Therefore, a black-box modeling approach that needs no prior knowledge is necessary. Several approaches have been used to develop black-box models, such as fuzzy logic, neural networks, and evolutionary algorithms. In this paper, an evolutionary neural network that combines a neural network with a modified differential evolution algorithm is applied to model a nonlinear system. The feasibility and effectiveness of the proposed modeling approach are tested on a piezoelectric actuator SISO system and an experimental quadruple tank MIMO system.
Research Inventy : International Journal of Engineering and Science is published by the group of young academic and industrial researchers with 12 Issues per year. It is an online as well as print version open access journal that provides rapid publication (monthly) of articles in all areas of the subject such as: civil, mechanical, chemical, electronic and computer engineering as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by rapid process within 20 days after acceptance and peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
Cell Charge Approximation for Accelerating Molecular Simulation on CUDA-Enabl...ijcax
Methods for Molecular Dynamics (MD) simulations are investigated. MD simulation is a widely used computer simulation approach to study the properties of molecular systems. Force calculation in MD is computationally intensive. Parallel programming techniques can be applied to improve those calculations. The major aim of this paper is to speed up MD simulation calculations by using the General Purpose
Graphics Processing Unit (GPU) computing paradigm, an efficient and economical way of parallel computing. To that end, we propose a method called cell charge approximation, which treats the electrostatic interactions in MD simulations. This method reduces the complexity of force calculations.
Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...IJECEIAES
Cloud Computing is the most powerful computing model of our time. While the major IT providers and consumers are competing to exploit the benefits of this computing model in order to grow their profits, most cloud computing platforms are still built on operating systems that use basic CPU (Central Processing Unit) scheduling algorithms, which lack the intelligence needed for such an innovative computing model. Correspondingly, this paper presents the benefits of applying Artificial Neural Network algorithms to enhance CPU scheduling for the Cloud Computing model. Furthermore, a set of characteristics and theoretical metrics is proposed for comparing the different Artificial Neural Network algorithms and finding the most accurate algorithm for Cloud Computing CPU scheduling.
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...IJCNCJournal
Because of the rapid development of diverse computer architectures and hardware accelerators, the design of parallel systems faces new problems resulting from their heterogeneity. Our implementation of a parallel
system called KernelHive makes it possible to efficiently run applications in a heterogeneous environment consisting
of multiple collections of nodes with different types of computing devices. The execution engine of the
system is open to optimizer implementations focusing on various criteria. In this paper, we propose a new
optimizer for KernelHive that utilizes distributed databases and performs data prefetching to optimize the
execution time of applications that process large input data. Employing a versatile data management
scheme, which allows various distributed data providers to be combined, we propose using NoSQL databases
for our purposes. We support our solution with results of experiments with real executions of our OpenCL
implementation of a regular expression matching application in various hardware configurations.
Additionally, we propose a network-aware scheduling scheme for selecting hardware for the proposed
optimizer and present simulations that demonstrate its advantages.
Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training...CSCJournals
The Internet paved the way for information sharing all over the world decades ago, and its popularity for the distribution of data has spread like wildfire ever since. Data in the form of images, sounds, animations and videos is gaining users' preference over plain text all across the globe. Despite unprecedented progress in the fields of data storage, computing speed and data transmission speed, the demand for available data and its size (due to the increase in both quality and quantity) continues to outstrip the supply of resources. One of the reasons for this may be how uncompressed data is compressed in order to send it across the network. This paper compares the two most widely used training algorithms for multilayer perceptron (MLP) image compression: the Levenberg-Marquardt algorithm and the Scaled Conjugate Gradient algorithm. We test the performance of the two training algorithms, in terms of accuracy and speed, by compressing the standard test image (Lena or Lenna). Based on our results, we conclude that both algorithms were comparable in terms of speed and accuracy. However, the Levenberg-Marquardt algorithm showed slightly better performance in terms of accuracy (as found in the average training accuracy and mean squared error), whereas the Scaled Conjugate Gradient algorithm fared better in terms of speed (as found in the average training iteration) on a simple MLP structure (2 hidden layers).
Hardback solution to accelerate multimedia computation through mgp in cmpeSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Comparison of Neural Network Training Functions for Hematoma Classification i...IOSR Journals
Classification is one of the most important tasks in the application areas of artificial neural networks
(ANN). Training neural networks is a complex task in the supervised learning field of research. The main
difficulty in adopting ANN is to find the most appropriate combination of learning, transfer and training
functions for the classification task. We compared the performance of three types of training algorithms in a feed
forward neural network for brain hematoma classification. In this work we have selected the Gradient Descent
based backpropagation, Gradient Descent with momentum, and Resilient backpropagation algorithms. Under
conjugate based algorithms, we selected Scaled Conjugate backpropagation, Conjugate Gradient backpropagation with
Polak-Ribière updates (CGP) and Conjugate Gradient backpropagation with Fletcher-Reeves updates (CGF). The
last category is Quasi-Newton based algorithms, under which the BFGS and Levenberg-Marquardt algorithms are
selected. The proposed work compared the training algorithms on the basis of mean square error, accuracy, rate of
convergence and correctness of the classification. Our conclusion about the training functions is based on the
simulation results.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...ijdpsjournal
In this paper, a new progressive mesh algorithm is introduced in order to perform fast physical simulations using a lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. This algorithm is able to automatically mesh the simulation domain according to the propagation of fluids. This method can also be useful for performing several types of physical simulations. In this paper, we associate this
algorithm with a multiphase and multicomponent lattice Boltzmann model (MPMC–LBM) because it is
able to perform various types of simulations on complex geometries. The use of this algorithm combined
with the massive parallelism of GPUs [5] yields very good performance in comparison with the
static mesh method used in the literature. Several simulations are shown in order to evaluate the algorithm.
PROBABILISTIC DIFFUSION IN RANDOM NETWORK G...ijfcstjournal
In this paper, we consider a random network such that there could be a link between any two nodes in the network with a certain probability (plink). Diffusion is the phenomenon of spreading information throughout the network, starting from one or more initial set of nodes (called the early adopters). Information spreads along the links with a certain probability (pdiff). Diffusion happens in rounds with the first round involving the early adopters. The nodes that receive the information for the first time are said to be covered and
become candidates for diffusion in the subsequent round. Diffusion continues until all the nodes in the network have received the information (successful diffusion) or there are no more candidate nodes to spread the information but one or more nodes are yet to receive the information (diffusion failure). On the basis of exhaustive simulations conducted in this paper, we observe that for a given plink and pdiff values, the fraction of successful diffusion attempts does not appreciably change with increase in the number of early
adopters; whereas, the average number of rounds per successful diffusion attempt decreases with increase
in the number of early adopters. The invariant nature of the fraction of successful diffusion attempts with increase in the number of early adopters for a random network (for fixed plink and pdiff values) is an interesting and noteworthy observation (for further research) and it has not been hitherto reported in the literature.
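The round-based diffusion process described above can be sketched as a small simulation. This is a minimal illustration, not the authors' simulator: the function names and the `rng` parameter are hypothetical, and it assumes that each newly covered node makes exactly one attempt per uncovered neighbor in the round after it is covered.

```python
import random


def random_network(n, p_link, rng):
    """Build an undirected random network: each pair is linked with probability p_link."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p_link:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def diffuse(adj, early_adopters, p_diff, rng):
    """Run diffusion in rounds; return (success, number_of_rounds)."""
    covered = set(early_adopters)
    candidates = set(early_adopters)  # nodes that spread in the next round
    rounds = 0
    while candidates and len(covered) < len(adj):
        rounds += 1
        newly_covered = set()
        for u in candidates:
            for v in adj[u]:
                # Information crosses each link with probability p_diff.
                if v not in covered and rng.random() < p_diff:
                    newly_covered.add(v)
        covered |= newly_covered
        candidates = newly_covered  # only first-time receivers spread next
    return len(covered) == len(adj), rounds


if __name__ == "__main__":
    rng = random.Random(42)
    trials = 200
    wins = sum(
        diffuse(random_network(50, 0.1, rng), [0, 1], 0.5, rng)[0]
        for _ in range(trials)
    )
    print(f"fraction of successful diffusion attempts: {wins / trials:.2f}")
```

Repeating such trials while varying the number of early adopters is one way to reproduce the kind of measurement the abstract reports (fraction of successful attempts and average rounds per success); the parameter values in the usage example are arbitrary.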
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY cscpconf
This paper presents a parallel approach to improve the time complexity problem associated with sequential algorithms. An image steganography algorithm in the transform domain is considered for implementation. Image steganography is a technique to hide a secret message in an image. With the parallel implementation, a large message can be hidden in a large image since it does not take much processing time. It is implemented on GPU systems. Parallel programming is done using OpenCL on CUDA cores from NVIDIA. The speed-up improvement
obtained is very good, with reasonably good output signal quality, when a large amount of data is processed.
EFFICIENT USE OF HYBRID ADAPTIVE NEURO-FUZZY INFERENCE SYSTEM COMBINED WITH N...csandit
This research study proposes a novel method for automatic fault prediction from foundry data
introducing the so-called Meta Prediction Function (MPF). Kernel Principal Component
Analysis (KPCA) is used for dimension reduction. Different algorithms are used for building the
MPF such as Multiple Linear Regression (MLR), Adaptive Neuro Fuzzy Inference System
(ANFIS), Support Vector Machine (SVM) and Neural Network (NN). We used classical
machine learning methods such as ANFIS, SVM and NN for comparison with our proposed
MPF. Our empirical results show that the MPF consistently outperforms the classical methods.
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...csandit
A Computational Grid (CG) creates a large, heterogeneous and distributed paradigm to manage and execute computationally intensive applications. In grid scheduling, tasks are assigned to suitable processors in the grid system for execution, taking into account the execution policy and the optimization objectives. In this paper, the makespan and the fault tolerance of the computational nodes of the grid, two important parameters for task execution, are considered and optimized. As grid scheduling is considered to be NP-hard, meta-heuristic evolutionary techniques are often used to find a solution. We propose an NSGA II approach for this purpose. The performance of the proposed Fault-Tolerance Aware NSGA II (FTNSGA II) has been estimated with a program written in Matlab. The simulation results evaluate the performance of the proposed algorithm, and a comparison with the existing Min-Min and Max-Min algorithms demonstrates the effectiveness of the model.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I was wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already got working for real.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
A0270107
Research Inventy: International Journal of Engineering and Science
ISSN: 2278-4721, Vol. 2, Issue 7 (March 2013), pp. 1-7
www.researchinventy.com
Graphics Processor Unit Hardware Acceleration of Levenberg-Marquardt Artificial Neural Network Training
David Scanlan¹, David Mulvaney¹
¹School of Electronic, Electrical and Systems Engineering, Loughborough University LE11 3TU, UK
Abstract - This paper makes two principal contributions. The first is a description of an artificial neural
network implementation on a graphics processor unit (GPU) that uses the Levenberg-Marquardt (LM) training
method, of which there appears to be no previous account in the research literature. The second is an initial
attempt at determining when it is computationally beneficial to exploit a GPU’s parallel nature in preference to
the traditional implementation on a central processing unit (CPU). The paper describes the approach taken to
successfully implement the LM method, discusses the advantages of this approach for GPU implementation and
presents results that compare GPU and CPU performance on two test data sets.
Keywords - Artificial Neural Networks, Graphics Processor Unit, Levenberg-Marquardt Networks
I. INTRODUCTION
All desktop computers contain some form of graphics processing unit (GPU) and it is becoming
increasingly common for manufacturers to provide the user with access to its programmable operations. The
inherent parallelism, high data bandwidth and favourable cost to performance ratio that are features of modern
GPUs, have made them an attractive choice for the computational acceleration of many applications, including
fluid flow simulation [1], finite-element simulation [2] and ice crystal growth [3].
The ease of use and semi-automatic adaptation of neural networks have made them a very desirable option for many
applications, such as handwriting identification [4] and speech recognition [5]. A significant drawback is long
calculation time, as, not only does the realisation of artificial neural networks (ANNs) often require the
application of training data over many epochs, but also the repeated application of that data with different
training parameters is needed to generate a solution with good classification performance. A number of
alternative solutions have been developed to reduce this computational overhead, such as automatically tuning
parameters to reduce the number of alternative ANN solutions that need be generated [6], novel training
approaches that require fewer epochs [7] or accelerating the computations by using bespoke electronic hardware
[8]. As this third approach is taken in this paper, it is important to mention that previous researchers looking to
hardware acceleration have investigated a number of novel approaches. These include central processing units
(CPUs) tailored to perform the calculations needed during training [9], mesh connected machines of architecture
similar to that of interconnected neurons [10], or GPUs adapted to mirror the parallel nature of neural
calculations [11].
GPUs have previously been used by a number of researchers to accelerate ANN classification, for example [12],
[13] and [14]. In the literature, no previous GPU solution using the Levenberg-Marquardt (LM) training method
has been described. This is probably due to the fact that its calculation involves a matrix inversion operation that
appears to be computationally expensive even for a parallel solution. This paper has adopted a solution for the
matrix inversion operation that allows the LM algorithm to be implemented efficiently on a GPU. Note that a
commercial LM solution exists for which the operational details have not been published, but for which it is
claimed that a calculation speed improvement of up to 65 times can be obtained by choosing the GPU rather
than the CPU implementation [15]. For the examples in this paper, the time taken to train the network using the
GPU is shown to be up to ten times faster than a similar implementation run solely on the machine’s CPU. In
practice, the measured difference in performance will depend significantly on the specific GPU and CPU used in
the comparison and these should always be specified alongside the quoted figures.
This paper briefly describes the LM algorithm, the general architecture of modern GPUs and the implementation
of the ANN on the selected GPU. Finally, results are presented to compare the training times of the GPU and
CPU on two test data sets.
II. LEVENBERG-MARQUARDT ARTIFICIAL NEURAL NETWORKS
In an ANN, each neuron will typically apply an activation function to a weighted sum of inputs and
provide a single output. During supervised learning, a set of training vectors with known outputs is repeatedly
applied over a number of epochs and the weight values are altered in such a way as to improve the overall
classification performance of the ANN. Such a training process can be performed by one of a number of
algorithms, the most popular being backpropagation [16], but LM [17] and Conjugate Gradient Descent [18] are
also in common use. Unsupervised learning methods are also available, but these are beyond the scope of this
paper. When the weights are updated after each input vector is presented to the network, this is known as online
training; the alternative is batch training, in which all the training vectors are applied before the weights are
updated. There is also a hybrid of the two that applies mini-batches drawn from the total data set before each
weight update.
The neurons themselves can be interconnected in a number of ways, but, due to its simplicity and regular
structure, the most commonly used architecture is the multi-layer perceptron (MLP) feed-forward network. An
example MLP network with 11 input neurons, four hidden neurons and two output neurons is shown in Fig. 1.
In the general case, for M input neurons $i_m$, P hidden neurons $h_p$ and one output neuron o, the weights on the
edges between the input and hidden layers can be represented by $W_{pm}$ and those between the hidden and output
layer (assuming a single output neuron) by $w_p$. Given k input vectors, input m is given the value $i_m^{\gamma}$ when
presented with vector γ, where γ = {1, 2, …, k}.
[Figure: an MLP with input values i1 to i11, hidden neurons h1 to h4 connected to the inputs by weights Wpm, and a single output value o connected to the hidden neurons by weights w1 to w4]
Fig. 1. Example of an MLP ANN with a single output
The LM algorithm has recently become increasingly popular as its second-order optimisation techniques allow a
very efficient batch update method. A drawback of the LM approach is that the ANN must have only a single
output, but this can be overcome by implementing multiple networks [19]. For the detailed mathematics
underlying the LM algorithm, the reader is referred to Marquardt [20]; the general LM algorithm is shown in
Fig. 2 and is briefly explained below.
[Flow diagram: (1) compute network outputs and LMS error for a batch; (2) formulate the Jacobian J(w), where w represents the network weights; (3) use the Jacobian to update the Levenberg-Marquardt weights; (4) re-calculate the error using the new weights and adjust µ accordingly; (5) if the stopping condition is met (yes), the network is trained, otherwise (no) the loop repeats]
Fig. 2. Flow diagram outlining the procedure for using the Levenberg-Marquardt training algorithm
Each batch of data is fed forward through the network, as described previously, to obtain a vector of output
values each calculated using equation (1) below, where z is the activation function.
$$o^{\gamma} = z\left(\sum_{p=1}^{P} z\left(\sum_{m=1}^{M} i_{m}^{\gamma} W_{pm}\right) w_{p}\right) \qquad (1)$$
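As a minimal sketch of the feed-forward calculation in equation (1), the following assumes a logistic sigmoid for the activation function z (the paper does not name the activation it used) and omits the hidden-neuron biases, as equation (1) does. The function names and the example values are hypothetical.

```python
import math

def z(x):
    """Logistic sigmoid, assumed here as the activation function z."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, W, w):
    """Output o of a single-output MLP for one input vector, as in equation (1).

    inputs: M input values i_m
    W:      P rows, each holding the M input-to-hidden weights W_pm
    w:      P hidden-to-output weights w_p
    """
    hidden = [z(sum(W_pm * i_m for W_pm, i_m in zip(row, inputs))) for row in W]
    return z(sum(w_p * h_p for w_p, h_p in zip(w, hidden)))

# Example with M = 3 inputs and P = 2 hidden neurons (arbitrary values).
W = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]
o = forward([0.5, -1.0, 2.0], W, [0.7, -0.4])
```

With a sigmoid output the result always lies strictly between 0 and 1, which suits the one-network-per-class arrangement described later in the paper.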
The least mean-square (LMS) error can then be obtained using
$$E[\mathbf{w}] = \frac{1}{2}\sum_{\gamma=1}^{k}\left(R^{\gamma} - o^{\gamma}\right)^{2}, \qquad (2)$$
where $R^{\gamma}$ is the desired output from the ANN for a specific input vector γ. The Jacobian matrix used in LM
requires a vector of all the weights contained within the network to calculate a matrix of partial derivatives (with
respect to each weight individually) for each input pattern in the batch. The Jacobian is given by
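$$J(\mathbf{w}) = \begin{bmatrix}
\dfrac{\partial e^{1}(\mathbf{w})}{\partial w_{1}} & \cdots & \dfrac{\partial e^{1}(\mathbf{w})}{\partial w_{v}} \\
\vdots & & \vdots \\
\dfrac{\partial e^{\gamma}(\mathbf{w})}{\partial w_{1}} & \cdots & \dfrac{\partial e^{\gamma}(\mathbf{w})}{\partial w_{v}} \\
\vdots & & \vdots \\
\dfrac{\partial e^{k}(\mathbf{w})}{\partial w_{1}} & \cdots & \dfrac{\partial e^{k}(\mathbf{w})}{\partial w_{v}}
\end{bmatrix}, \qquad (3)$$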
where v = MP + 2P and w is a vector of weights $\mathbf{w} = [W_{11}, \ldots, W_{PM}, B_{1}, \ldots, B_{P}, w_{1}, \ldots, w_{P}]^{T}$, where the $B_p$ values are the
bias values of the hidden neurons. To update the weights during training the LM algorithm determines a weight
update vector Δw, calculated by
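$$\Delta\mathbf{w} = -\left[J^{T}(\mathbf{w})J(\mathbf{w}) + \mu I\right]^{-1} J^{T}(\mathbf{w})\,\mathbf{e}, \qquad (4)$$
where e is the vector containing errors for each input vector in the batch and I is the identity matrix of
dimension v. The new weights can now be calculated by
$$\mathbf{w}_{new} = \mathbf{w}_{old} + \Delta\mathbf{w}. \qquad (5)$$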
The ANN’s LMS error with the new weights is now computed. If the new error is smaller than the previous error
then μ is reduced by a factor $\mu^{-}$, but if the error is larger than the previous error μ is increased by a factor of $\mu^{+}$.
The values of μ, $\mu^{-}$ and $\mu^{+}$ are all training parameters that must be selected before training the network.
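The loop of Fig. 2 can be sketched sequentially in plain Python. This is an illustration under stated assumptions rather than the authors' GPU code: the activation is taken to be a logistic sigmoid, the error is taken as desired output minus network output, the Jacobian is approximated by forward differences instead of analytic derivatives, μ is multiplied by `mu_dec` or `mu_inc` (hypothetical names for μ⁻ and μ⁺), a step that increases the error is discarded, and μ is kept above a small floor for numerical safety. The toy data set and the initial weight range are also arbitrary.

```python
import math
import random

def z(x):
    # Logistic sigmoid (an assumed activation), written with tanh to avoid overflow.
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def output(w, x, M, P):
    """Network output for w = [W_11..W_PM, B_1..B_P, w_1..w_P]."""
    total = 0.0
    for p in range(P):
        s = w[M * P + p] + sum(w[p * M + m] * x[m] for m in range(M))
        total += w[M * P + P + p] * z(s)
    return z(total)

def errors(w, X, R, M, P):
    # Assumed sign convention: e = desired output minus network output.
    return [r - output(w, x, M, P) for x, r in zip(X, R)]

def lms(e):
    return 0.5 * sum(ei * ei for ei in e)

def jacobian(w, X, R, M, P, h=1e-6):
    """k x v matrix of d e / d w_j, approximated by forward differences."""
    e0 = errors(w, X, R, M, P)
    J = [[0.0] * len(w) for _ in X]
    for j in range(len(w)):
        shifted = list(w)
        shifted[j] += h
        ej = errors(shifted, X, R, M, P)
        for g in range(len(X)):
            J[g][j] = (ej[g] - e0[g]) / h
    return J, e0

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for cc in range(c, n + 1):
                aug[r][cc] -= f * aug[c][cc]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def train_lm(X, R, M, P, mu=0.01, mu_dec=0.1, mu_inc=10.0, epochs=30, seed=0):
    rng = random.Random(seed)
    v = M * P + 2 * P
    w = [rng.uniform(-0.5, 0.5) for _ in range(v)]
    history = [lms(errors(w, X, R, M, P))]
    k = len(X)
    for _ in range(epochs):
        J, e = jacobian(w, X, R, M, P)
        # A = J^T J + mu I and g = J^T e, the two ingredients of equation (4).
        A = [[sum(J[t][a] * J[t][b] for t in range(k)) + (mu if a == b else 0.0)
              for b in range(v)] for a in range(v)]
        g = [sum(J[t][a] * e[t] for t in range(k)) for a in range(v)]
        dw = [-d for d in solve(A, g)]
        w_new = [wi + di for wi, di in zip(w, dw)]
        err_new = lms(errors(w_new, X, R, M, P))
        if err_new < history[-1]:
            w, mu = w_new, max(mu * mu_dec, 1e-6)  # accept step, reduce mu
        else:
            err_new, mu = history[-1], mu * mu_inc  # reject step, increase mu
        history.append(err_new)
    return w, history

# Toy data: M = 1 input, P = 2 hidden neurons, k = 21 training vectors.
X = [[-1.0 + 0.1 * t] for t in range(21)]
R = [0.5 + 0.4 * math.sin(2.0 * x[0]) for x in X]
weights, history = train_lm(X, R, M=1, P=2)
```

Note that the sketch solves the linear system for Δw directly rather than forming the inverse explicitly; the two are mathematically equivalent, and the solve is the step that dominates the cost as v grows.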
III. GRAPHICS PROCESSOR UNIT IMPLEMENTATION
A CPU’s pipeline is instruction flow-driven, whereas a GPU uses a data-dependent control flow that is
better suited to constructing and outputting graphics to a monitor. In GPUs, all data are represented as a stream of a
common data type that passes through the pipeline as a single entity, with each element in the stream being
operated upon in an identical manner. The main components that a stream will pass through in the GPU are the
vertex processor, the fragment processor and the memory system.
The vertex processor receives vertices from an
application and operates on these primitives to generate a screen position, a colour and a texture coordinate.
Fixed-function hardware operations, such as clip and cull, are then applied. In the fragment processor, each
texture element or texel (similar to a single pixel displayed on the screen) is processed by a shader program.
A texel is made up of four floating point components, namely red, green, blue and alpha (opacity). After
leaving the fragment processor, a texel passes through some tests, such as a depth test (or z-cull), as well as
other fixed procedures, such as alpha blending. Both the vertex processor and the fragment processor are
programmable in most modern GPUs, giving the programmer considerable flexibility when deploying a
graphics application.
The type of memory, bus width and memory clock frequency used in a GPU, as well as the
interface the GPU has with the system memory, determine the speed at which data can be transferred between
the two. Most graphics cards utilise a form of double-data rate (DDR) memory such as Graphics DDR (GDDR)
providing data bandwidths above 3 Gbit/s. The speed of transfers between GPU and CPU is determined by the graphics
card interface bus employed, with the current standard, PCI Express 2.0, having a maximum bandwidth of 8
GB/s.
The majority of graphics programs are developed to communicate with either SGI’s OpenGL [21] or
Microsoft’s Direct3D [22] drivers. Alternatively, at a lower level somewhat akin to assembler languages,
programs known as ‘shaders’ can be loaded and run on each of the vertex and fragment programmable
processors.
In the current work, the Brook GPU programming language was used [23]. Brook extends the
functionality of the C or C++ language, allowing extra data types that define structures of floating point
numbers to match the native architecture of GPUs. For example, the float4 data type is simply a structure of four
floating point values that matches the texel representation. By creating a standard texture on the GPU it is possible to
create a 2D array of float4 structures. The major advantage of such an implementation is that the vector pipeline
will be able to operate on each component independently. In contrast, if an array of single floats was required,
then, when mapped to a single stream, only 25% of the fragment processor pipeline would be utilised. Hence,
mapping data to make best use of a GPU’s parallel processing capabilities is important in achieving the best
acceleration of a given application.
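The 25% figure follows from using one of the four components of each texel. The idea can be sketched in Python, with tuples standing in for Brook's float4; the helper names are hypothetical and no Brook API is being reproduced here.

```python
def pack_float4(values, pad=0.0):
    """Group a flat list of floats into 4-component tuples, padding the last one."""
    padded = list(values) + [pad] * (-len(values) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]

def utilisation(components_used):
    """Fraction of a four-wide vector pipeline doing useful work."""
    return components_used / 4

texels = pack_float4([1.0, 2.0, 3.0, 4.0, 5.0])
single_float_use = utilisation(1)   # one scalar per texel
full_use = utilisation(4)           # all four components carry data
```

Five scalars therefore occupy two texels when packed, compared with five texels if each scalar were stored in its own stream element.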
A further consideration in making best use of a GPU’s capabilities is the quantity of data that needs to be
transferred between the computer’s main memory and the GPU’s local memory. In general, applications that
adapt well to GPU implementation are those that are computationally intensive yet can principally operate on
data that resides in local memory. Sharing data with the rest of the computer system requires them to be transferred
to and from the GPU’s streams, incurring time penalties which, if they accumulate, would be detrimental to
performance. In any given application, some data will need to be shared with the CPU and, to achieve acceleration,
the improvement in performance that results from moving stream operations to the GPU should outweigh the
overheads associated with the transfers involved.
IV. NEURAL NETWORK IMPLEMENTATION
The GPU used to generate the results in this paper is an NVIDIA GeForce 6800 GS with GDDR3
memory that operates on the PCI-Express x16 bus, includes 512 MB of memory and provides a memory bus width
of 256 bits. The CPU used for comparison purposes is an AMD Athlon64 3.5GHz with 1 GB of DDR system
memory.
4.1 Design Overview
The implementation described in this paper concentrates on multiple-input single-output networks with
a single hidden layer as this architecture meets the requirements of the LM algorithm. Although the ANN is
limited to one hidden layer, the number of neurons in this layer can be chosen by the user. In order to make
good use of the GPU’s parallel pipelines, up to four networks are created at a time and trained concurrently. For
networks with more than four outputs, the process is repeated once the first four have been calculated. Fig. 3
shows how a network with multiple outputs is treated as multiple networks with single outputs to utilise the
GPU’s vector pipeline to best effect.
Fig. 3. Multiple single-output neural networks running in parallel on the GPU, using all four components of the
float4 data type.
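A small sketch of how single-output networks might be assigned to float4 channels, assuming they are simply taken in output order, four at a time (the paper does not state the ordering). The function name is hypothetical; the figure of 11 outputs is the one used for the data sets in the results section.

```python
def group_outputs(num_outputs, channels=4):
    """Split output indices into groups that are trained together, one per channel."""
    return [list(range(start, min(start + channels, num_outputs)))
            for start in range(0, num_outputs, channels)]

groups = group_outputs(11)
passes = len(groups)
idle_channels_last_pass = 4 - len(groups[-1])
```

For 11 outputs this gives three training passes, the last of which leaves one of the four channels unused.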
The inputs are replicated into each of the streams’ four components and the outputs split into groups of four and
assigned one channel each. The weights are randomly assigned values in the range
NN
1,1 , where N is the
total number of inputs of the neuron that is connected to the weight’s output. This method of weight
initialisation has been shown in [24] to lead to better convergence than initialising the weights randomly in the
range [-0.5,0.5].
Batch training methods are better suited to GPU implementation than single pattern online methods as they
operate on a single large data set for each calculation. This means that the data set can be loaded into the GPU
memory for training rather than loading each training vector individually on each cycle through the loop. The
LM algorithm is a batch training method known to outperform most other training methods in terms of
calculation time for medium-sized ANNs with a few hundred weights or more, but has been little used in GPU
implementations as it requires a matrix inversion operation that appears not to be well supported by the GPU
architecture. However, it has been shown by Galoppo et al. [25] that algorithms such as Gaussian elimination
and LU decomposition (where L and U refer to the lower and upper triangular matrices) that are able to
calculate a matrix inverse can be adapted to run efficiently on a parallel GPU implementation.
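To show what such a routine computes, here is a plain sequential sketch of LU decomposition with partial pivoting followed by forward and back substitution. It is not the GPU algorithm of Galoppo et al. [25], whose parallel mapping is not reproduced here; the function names and the 2 x 2 example are illustrative only.

```python
def lu_decompose(A):
    """Return (LU, perm): unit-lower L and upper U packed in one matrix, plus row order."""
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(LU[r][c]))
        if LU[piv][c] == 0.0:
            raise ValueError("matrix is singular")
        LU[c], LU[piv] = LU[piv], LU[c]
        perm[c], perm[piv] = perm[piv], perm[c]
        for r in range(c + 1, n):
            LU[r][c] /= LU[c][c]                  # multiplier, stored in place of L
            for cc in range(c + 1, n):
                LU[r][cc] -= LU[r][c] * LU[c][cc]  # update the trailing submatrix
    return LU, perm

def lu_solve(LU, perm, b):
    """Solve A x = b using the factors returned by lu_decompose."""
    n = len(b)
    y = [0.0] * n
    for r in range(n):                             # forward substitution with L
        y[r] = b[perm[r]] - sum(LU[r][c] * y[c] for c in range(r))
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution with U
        x[r] = (y[r] - sum(LU[r][c] * x[c] for c in range(r + 1, n))) / LU[r][r]
    return x

A = [[4.0, 3.0], [6.0, 3.0]]
LU, perm = lu_decompose(A)
x = lu_solve(LU, perm, [10.0, 12.0])
# Columns of the inverse, if it is needed explicitly, come from solving against
# the columns of the identity matrix.
inverse_columns = [lu_solve(LU, perm, [1.0, 0.0]), lu_solve(LU, perm, [0.0, 1.0])]
```

The trailing-submatrix update in the inner loops applies the same operation to many elements at once, which is the kind of regular work a stream processor handles well.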
4.2 Software
To compare the time taken to train the neural network on the GPU and on the CPU, two versions of the program were
required. The first used solely CPU code and the second principally used the streams on the GPU. For the GPU
version, initial data such as the input patterns, weights and known outputs are first loaded into CPU memory and
then transferred into streams. It should be noted that this costly data transfer is only performed once per training
of a set of four networks. The program utilises the float4 data type throughout and is able to train four single
output networks simultaneously, keeping all required data in streams in the fast on-chip memory. Only once the
set of networks has been trained are the final results and weights read back from the GPU to the system memory
to be saved. Once a network has been trained for each output, a set of test data may be used to verify the
success of the training. The test data are provided in a similar manner to the training data and a log file is produced
detailing the results. Using the saved weights for all the networks allows them to be used collectively for
predicting the output of any input test vector.
4.3 Limitations
The size of the data sets used in any part of the calculation or training of the neural network is strictly
limited to the texture size available from the GPU. In the case of the 6800 GS this is 1024 x 1024. In most cases
this is quite adequate, but if a larger batch size were required, a hybrid semi-batch method could be
implemented in the software. The LM algorithm requires an approximation to the 2D Hessian matrix of
dimension v x v. As there is only a single output per network, v depends only on the numbers of input and
hidden neurons, which effectively means that ANNs with large numbers of hidden and input neurons cannot be
supported. The need to train four single-output networks in
parallel does restrict the stopping criterion options that can be made available. If the user requires a stopping
condition that relates to the output error of the network, the training of all the streams of ANNs would need to
continue until all four networks had reached the desired error, effectively wasting computational effort.
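As a worked example of the size constraint, the sketch below evaluates v = MP + 2P and checks it against the texture side. It assumes the v x v matrix is held in a single square texture with one matrix element per texel, which is one plausible reading of the limitation rather than a detail given in the paper; the function names are hypothetical.

```python
def num_weights(M, P):
    """v = MP + 2P: input-to-hidden weights, hidden biases and hidden-to-output weights."""
    return M * P + 2 * P

def fits_texture(M, P, texture_side=1024):
    """True if a v x v matrix fits within one square texture of the given side."""
    return num_weights(M, P) <= texture_side

v = num_weights(26, 15)             # 26 inputs, 15 hidden neurons
largest_P = max(P for P in range(1, 200) if fits_texture(26, P))
```

Under this assumption, a network with 26 inputs and 15 hidden neurons has v = 420, and 26 inputs would allow at most 36 hidden neurons.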
V. RESULTS
The timing results were obtained using relevant routines available in the Brook compiler and the
OpenGL libraries to provide performance counters with millisecond accuracy.
Two data sets obtained from
ultrasonic reflections from obstacles were used to generate the results; further information on the training data
can be found in [26]. Both data sets had 26 continuous inputs and 11 discrete outputs; the first data set
contained 107 training vectors and the second 2562 vectors. Stop sets and test sets were also available. To
demonstrate the importance of the numbers of neurons in the networks in this study, a single ANN with four
inputs and a single output was trained on the first data set to classify a single output class only and the timing
results are shown in Fig. 4. The CPU outperforms the GPU not only because together the network and training
data are small enough to fit in the CPU’s cache and can therefore be accessed quickly, but also because the single
output means the GPU is essentially working at only a quarter of its potential throughput capacity.
[Plot: training time (s) against number of hidden neurons (2 to 15) for the GPU and the CPU]
Fig. 4. Relative times taken to train a single ANN using the ultrasonic data with different numbers of hidden
neurons.
For the second data set, 11 separate neural networks were trained, one for each output class. The GPU now
outperforms the CPU, as shown in Fig. 5. This is in spite of the fact that, due to the texture size limitation, only
a subset of these vectors could be applied in a single batch and consequently copying of data from the main
computer memory to the GPU’s on-chip memory was needed during training.
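The batch counts implied by the texture limit can be worked out directly. The sketch below splits the 2562 training vectors of the second data set into consecutive batches of 512 and of 1024 vectors, the two batch sizes used in Fig. 5; how the authors treated the final partial batch is not stated, so the simple remainder batch here is an assumption, as is the function name.

```python
def split_batches(num_vectors, batch_size):
    """Return (start, end) index ranges that cover all vectors in consecutive batches."""
    return [(start, min(start + batch_size, num_vectors))
            for start in range(0, num_vectors, batch_size)]

batches_512 = split_batches(2562, 512)
batches_1024 = split_batches(2562, 1024)
```

This gives six batches at a size of 512 (the last holding only two vectors) and three batches at a size of 1024 (the last holding 514), each of which has to be copied to the GPU in turn.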
[Plots: training time (s) against number of hidden neurons (2 to 15) for the GPU and the CPU; (a) batch size of 512 data vectors, (b) batch size of 1024 data vectors]
Fig. 5. Times taken to train a set of 11 ANNs using the ultrasonic data with different numbers of hidden
neurons.
The results show that there is an increase in time required to train larger networks and, in general, this may
either result from including more hidden neurons (as here) or be due to the fact that a larger batch size is
available. For other larger ANNs tested, the GPU training was found to outperform the CPU by a factor of
between three and ten.
VI. CONCLUSION
Consumer demand for increasingly powerful graphics performance in applications such as high-definition
television and gaming has led to substantial improvements in the computational power of GPUs,
leading to a continual fall in the cost-performance ratio. Many scientific applications have already begun to use
the GPU as a coprocessor for computationally intensive work. However, GPUs are by no means a suitable
platform for all applications, as the reconfiguration of an application requires that careful thought be given to
many aspects of the design, including data-driven control flow, identification of appropriate parallelisms in the
code and efficient memory usage in data storage. The accessibility of GPU hardware is continually improving
and both NVIDIA [27] and ATI [28] have not only made development tools available at a reasonable cost, but
have also produced excellent supporting documents to ease the learning process.
The work described in this paper has shown the methods and concepts that are required to map ANNs trained using the
LM algorithm onto a GPU. A number of modifications could be made to the network architecture, such as
the implementation of multiple layers of ANNs [19]. Although the LM training method is notorious for its long
computation time, its overall training time and classification performance are generally favourable in
comparison with other approaches. As the size of a batch is restricted by the texture size offered by the GPU, a
modification that allows the software to operate in a semi-batch mode could allow larger training data sets to be
used. Very large ANNs, trained using large data sets that need to be streamed to the GPU, would yield the
greatest benefits in terms of reduced calculation time. A further goal could be the development of a distributed
system, essentially creating a GPU cluster.
REFERENCES
[1] Z. Fan, F. Qiu, A. Kaufman and S. Yoakum-Stover, GPU cluster for high performance computing, Proc. ACM/IEEE Conf. on
Supercomputing, Pittsburgh, PA, 2004, 47-58.
[2] M. Rumpf and R. Strzodka, Using graphics cards for quantized FEM computations, Proc. Visualization, Imaging and Image
Processing Conf., Marbella, Spain, 2001, 193-202.
[3] T. Kim and M. Lin, Visual simulation of ice crystal growth, Proc. ACM SIGGRAPH Eurographics Symp. on Computer
Animation, San Diego, CA, 2003, 86-97.
[4] I.-S. Oh and C.Y. Suen, Distance features for neural network-based recognition of handwritten characters, Int. J. of Document
Analysis and Recognition, 1(2), 1998, 73-88.
[5] E. Trentin, M. Gori, Robust combination of neural networks and hidden Markov models for speech recognition, IEEE Trans.
Neural Networks, 14(6), 2003, 1519-1531.
[6] L. Behera, S. Kumar and A. Patnaik, On adaptive learning rate that guarantees convergence in feedforward networks, IEEE
Trans. Neural Networks, 17(5), 2006, 1116-1125.
[7] X. Liang, Removal of hidden neurons in multilayer perceptrons by orthogonal projection and weight crosswise propagation,
Neural Computing and Applications, 16(1), 2007, 57-68.
[8] J. Misra and I. Saha, Artificial neural networks in hardware: A survey of two decades of progress, Neurocomputing, 74(1–3),
2010, 239-255.
[9] R. F. Lyon and L. S. Yaeger, On-line hand-printing recognition with neural networks, Proc. 5th Int. Conf. on Microelectronics
for Neural Networks and Fuzzy Systems, Torino, Italy, 1996, 201-212.
[10] R. A. Ayoubi and M. A. Bayoumi, Efficient mapping algorithm of multilayer neural network on Torus architecture, IEEE Trans.
Parallel and Distributed Systems, 14(9), 2003, 932-943.
[11] K. Oh and K. Jung, GPU implementation of neural networks, Pattern Recognition, 37(6), 2004, 1311-1314.
[12] R. Dolan and G. DeSouza, GPU-based simulation of cellular neural networks for image processing, Proc. 2009 Int. Joint Conf.
on Neural Networks, Atlanta, GA, 2009, 2712-2717.
[13] H. Jang, A. Park, K. Jung, Neural network implementation using CUDA and OpenMP, Proc. Digital Image Computing:
Techniques and Applications, Canberra, Australia, 2008, 155-161.
[14] T.-Y. Ho, P.-M. Lam and C.-S. Leung, Parallelization of cellular neural networks on GPU, Pattern Recognition, 41(8), 2008,
2684-2692.
[15] Neurosolutions, http://www.neurosolutions.com/products/cuda, accessed 17 December 2012.
[16] J. Hertz, A. Krogh and R. G. Palmer, Introduction to the theory of neural computation, (Reading, MA: Addison-Wesley, 1991),
115-120.
[17] M. T. Hagan and M. Menhaj, Training feed-forward networks with the Marquardt algorithm, IEEE Trans. Neural Networks,
5(6), 1994, 989-993.
[18] C. Charalambous, Conjugate gradient algorithm for efficient training of artificial neural networks, IEE Proc. G Circuits, Devices
and Systems, 139(3), 1992, 301-310.
[19] D. J. Mulvaney and I. P. W. Sillitoe, The classification of ultrasonic signals using novel neural network approaches, Int. J. Robotics and
Automation, 14(1), 1999, 15-22.
[20] D. W. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, J. Soc. Industrial and Applied Mathematics,
11(2), 1963, 431-441.
[21] OpenGL, http://www.sgi.com/products/software/opengl, accessed 17 December 2012.
[22] DirectX, www.microsoft.com/en-us/download/details.aspx?id=35, accessed 17 December 2012.
[23] Brook GPU, http://graphics.stanford.edu/projects/brookgpu/index.html, accessed 17 December 2012.
[24] G. Thimm and E. Fiesler, High-order and multilayer perceptron initialization, IEEE Trans. Neural Networks, 8(2), 1997,
349-359.
[25] N. Galoppo, N. K. Govindaraju, M. Henson and D. Manocha, LU-GPU: Efficient algorithms for solving dense linear systems on
graphics hardware, Proc. ACM/IEEE Conf. On Supercomputing, Seattle, WA, 2005, 3.
[26] I.P.W. Sillitoe, L. C. Axbrink and D. J. Mulvaney, Extensible sonar classifiers for local environment mapping, Proc. Int. Conf.
Applied Informatics, Innsbruck, Austria, 2001, 149-154.
[27] NVIDIA CUDA, https://developer.nvidia.com/category/zone/cuda-zone, accessed 17 December 2012.
[28] ATI development tools, http://developer.amd.com/tools/heterogeneous-computing/, accessed 17 December 2012.