This paper presents a novel approach to real-time face detection using heterogeneous
computing. The algorithm uses local binary patterns (LBP) as the feature vector for face detection.
OpenCL is used to accelerate the code on the GPU [1]. Illumination invariance is achieved using
gamma correction and a Difference of Gaussians (DoG) filter to make the algorithm robust against
varying lighting conditions. This implementation is compared with a previous parallel
implementation and is found to be faster.
Real Time Face Detection on GPU Using OpenCL (csandit)
This document summarizes a research paper that presents a real-time face detection algorithm using local binary patterns (LBP) as features. The algorithm is implemented on a GPU using OpenCL to achieve faster processing speeds compared to CPU implementations. LBP features are extracted from divided image blocks in parallel on the GPU. Histograms of the LBP codes from each block are concatenated to form an overall feature vector for classification. Experimental results show the GPU implementation achieves a 5x speed improvement over CPU for 640x480 images.
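The block-wise LBP pipeline this summary describes can be sketched on the CPU with NumPy; the helper names below are hypothetical and the paper's actual version runs one kernel per block on the GPU:

```python
import numpy as np

def lbp_codes(img):
    """Basic 3x3 LBP: compare each pixel's 8 neighbours against the centre."""
    c = img[1:-1, 1:-1]
    # Neighbour offsets in clockwise order; each comparison contributes one bit.
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes |= ((nb >= c).astype(np.uint8) << bit)
    return codes

def lbp_feature_vector(img, block=16):
    """Split the code map into blocks and concatenate per-block histograms."""
    codes = lbp_codes(img)
    h, w = codes.shape
    hists = []
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            hist, _ = np.histogram(codes[y:y + block, x:x + block],
                                   bins=256, range=(0, 256))
            hists.append(hist)
    return np.concatenate(hists)
```

On the GPU, each block's histogram is independent, which is what makes the per-block extraction embarrassingly parallel.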
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe novel implementations of depth estimation from stereo
images using feature extraction algorithms that run on the graphics processing unit (GPU), which is
suitable for real-time applications such as analyzing video in real-time vision systems. Modern graphics
cards contain a large number of parallel processors and high-bandwidth memory for accelerating data
computation. In this paper we give a general idea of how to accelerate real-time applications using
heterogeneous platforms. We propose using additional resources to handle more computationally
involved optimization methods. This proposed approach will indirectly accelerate a database by
producing better plan quality.
The objective of this paper is to present a hybrid approach to edge detection. Under this technique, edge
detection is performed in two phases. In the first phase, the Canny algorithm is applied for image smoothing, and in
the second phase a neural network detects the actual edges. A neural network is a good tool for edge
detection, as it is a non-linear network with built-in thresholding capability. A neural network can be trained
with the backpropagation technique using few training patterns, but the most important and difficult part is to
identify a correct and proper training set.
COUPLED FPGA/ASIC IMPLEMENTATION OF ELLIPTIC CURVE CRYPTO-PROCESSOR (IJNSA Journal)
In this paper, we propose an elliptic curve key generation processor over GF(2^163) based on the Montgomery scalar multiplication algorithm. The new architecture uses a polynomial basis. The finite field operations use a cellular automata multiplier and Fermat's algorithm for inversion. For real-time implementation, the architecture has been tested with ISE 9.1 software on a Xilinx Virtex II Pro FPGA and on an ASIC CMOS 45 nm technology as well. The proposed implementation achieves a time of 2.07 ms and uses 38 percent of the slices of the Xilinx Virtex II Pro FPGA. These features reveal the high efficiency of this design.
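The polynomial-basis field arithmetic underlying such a processor can be modelled in software. The sketch below is a plain-Python model of GF(2^m) multiplication with reduction; the NIST B-163 pentanomial is assumed for illustration (the abstract does not state its reduction polynomial), and this models only the math, not the cellular-automata hardware:

```python
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (NIST B-163/K-163 style;
# assumed here for illustration, the paper does not specify its polynomial).
MOD_163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf2_mul(a, b, mod):
    """Multiply two field elements in polynomial basis, then reduce."""
    deg = mod.bit_length() - 1
    p = 0
    while b:                        # carry-less "schoolbook" multiplication
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    while p.bit_length() > deg:     # reduce modulo the field polynomial
        p ^= mod << (p.bit_length() - 1 - deg)
    return p
```

Inversion via Fermat's little theorem, as the abstract mentions, amounts to repeated squaring and multiplication: a^(2^m - 2) = a^(-1) in GF(2^m).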
Molecular dynamics (MD) is a very useful tool for understanding various phenomena in atomistic detail. In MD, we can overcome the size- and time-scale problems through efficient parallelization. In this lecture, I'll explain various parallelization methods for MD, with some examples of GENESIS MD software optimization on Fugaku.
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY (csandit)
This paper presents a parallel approach to address the time complexity problem associated
with sequential algorithms. An image steganography algorithm in the transform domain is
considered for implementation. Image steganography is a technique to hide a secret message in
an image. With the parallel implementation, a large message can be hidden in a large image,
since it does not take much processing time. It is implemented on GPU systems. Parallel
programming is done using OpenCL on CUDA cores from NVIDIA. The speed-up improvement
obtained is very good, with reasonably good output signal quality, when a large amount of data is
processed.
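To illustrate the embed/extract round trip at the heart of image steganography, here is a minimal sketch; note it hides bits in pixel least-significant bits for simplicity, whereas the paper works on transform-domain coefficients, and the helper names are hypothetical:

```python
import numpy as np

def embed(cover, bits):
    """Hide a bit list in the least significant bits of the first len(bits) pixels."""
    flat = cover.flatten()          # flatten() returns a copy; cover is untouched
    if len(bits) > flat.size:
        raise ValueError("message too large for this cover image")
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | bit
    return flat.reshape(cover.shape)

def extract(stego, n):
    """Read back the first n hidden bits."""
    return [int(p) & 1 for p in stego.flatten()[:n]]
```

Because only the lowest bit of each carrier value changes, the distortion per pixel is at most 1, which is why the output signal quality stays high; the same round-trip structure applies when the carriers are quantized transform coefficients, and each embedding position is independent, which is what the GPU parallelizes.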
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/auvizsystems/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nagesh Gupta, Founder and CEO of Auviz Systems, presents the "Semantic Segmentation for Scene Understanding: Algorithms and Implementations" tutorial at the May 2016 Embedded Vision Summit.
Recent research in deep learning provides powerful tools that begin to address the daunting problem of automated scene understanding. Modifying deep learning methods, such as CNNs, to classify pixels in a scene with the help of the neighboring pixels has provided very good results in semantic segmentation. This technique provides a good starting point towards understanding a scene. A second challenge is how such algorithms can be deployed on embedded hardware at the performance required for real-world applications. A variety of approaches are being pursued for this, including GPUs, FPGAs, and dedicated hardware.
This talk provides insights into deep learning solutions for semantic segmentation, focusing on current state-of-the-art algorithms and implementation choices. Gupta discusses the effect of porting these algorithms to fixed-point representation and the pros and cons of implementing them on FPGAs.
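"Porting to fixed-point representation" typically means quantizing float weights and activations to small integers. As a rough sketch of one common scheme (symmetric linear int8 quantization, not necessarily the exact method discussed in the talk):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization: map floats onto [-127, 127] integers."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# The round-trip error is bounded by half a quantization step (scale / 2),
# which is the accuracy trade-off weighed against fixed-point hardware savings.
```

On FPGAs, int8 multipliers are far cheaper than floating-point units, which is the main "pro" in the pros-and-cons discussion; the "con" is the quantization error budget.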
This document discusses parallelizing graph algorithms on GPUs for optimization. It summarizes previous work on parallel Breadth-First Search (BFS), All Pair Shortest Path (APSP), and Traveling Salesman Problem (TSP) algorithms. It then proposes implementing BFS, APSP, and TSP on GPUs using optimization techniques like reducing data transfers between CPU and GPU and modifying the algorithms to maximize GPU computing power and memory usage. The paper claims this will improve performance and speedup over CPU implementations. It focuses on optimizing graph algorithms for parallel GPU processing to accelerate applications involving large graph analysis and optimization problems.
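The GPU BFS work summarized above proceeds level by level: each frontier is expanded in one parallel step. A minimal CPU sketch of that structure (the GPU version assigns one work-item per frontier vertex):

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS; each frontier maps to one parallel GPU kernel launch."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:              # on a GPU, these expand concurrently
            for v in adj.get(u, ()):
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

Keeping the frontier and level arrays resident on the GPU between iterations is exactly the "reduce data transfers between CPU and GPU" optimization the paper proposes.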
HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardin... (Sunny Kr)
Cardinality estimation has a wide range of applications and is of particular importance in database systems. Various algorithms have been proposed in the past, and the HyperLogLog algorithm is one of them.
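A bare-bones HyperLogLog estimator can be sketched as follows. This shows only the core idea (registers tracking leading-zero runs of hashed items) plus the small-range linear-counting correction; the paper's engineering refinements, such as the sparse representation and bias correction, are omitted:

```python
import hashlib
import math

P = 10                  # 2^10 = 1024 registers; relative error ~ 1.04/sqrt(1024)
M = 1 << P

def _hash64(item):
    """Deterministic 64-bit hash (SHA-1 truncation, chosen here for portability)."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

def hll_estimate(items):
    regs = [0] * M
    for item in items:
        h = _hash64(item)
        idx = h & (M - 1)                        # low bits select a register
        rest = h >> P
        rank = (64 - P) - rest.bit_length() + 1  # leading-zero run + 1
        regs[idx] = max(regs[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:                 # small-range (linear counting) fix
        return M * math.log(M / zeros)
    return raw
```

With 1024 one-byte registers, the whole state is about 1 KB regardless of how many distinct items pass through, which is the property that makes HLL attractive in database systems.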
Iaetsd implementation of power efficient iterative logarithmic multiplier usi... (Iaetsd)
This document describes the design and implementation of a power efficient iterative logarithmic multiplier using Mitchell's algorithm and reversible logic. It involves converting multiplication to addition using logarithmic numbers. The proposed design implements a basic block consisting of leading one detectors, encoders, barrel shifters and a decoder to calculate an approximate product. Error correction circuits are then cascaded with the basic blocks to improve accuracy. The 4x4 reversible logarithmic multiplier is designed and simulated using Xilinx tools, demonstrating lower power consumption through the use of reversible logic.
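Mitchell's approximation, the basis of this multiplier, replaces multiplication with addition in the log domain. A software model of the basic block (the reversible-logic hardware and the cascaded error-correction stages are not modelled here):

```python
def mitchell_mul(a, b):
    """Approximate a*b via Mitchell's log-based method (a, b positive integers).

    Write n = 2^k * (1 + f) with 0 <= f < 1, so log2(n) ~ k + f; the leading-one
    detector finds k and the remaining bits give f. Then the logs are added.
    """
    k1, k2 = a.bit_length() - 1, b.bit_length() - 1
    f1 = (a - (1 << k1)) / (1 << k1)
    f2 = (b - (1 << k2)) / (1 << k2)
    if f1 + f2 < 1:
        return (1 << (k1 + k2)) * (1 + f1 + f2)
    return (1 << (k1 + k2 + 1)) * (f1 + f2)
```

The result is exact when either operand is a power of two and is at most about 11.1% below the true product otherwise, which is the error the cascaded correction circuits reduce.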
CycleGAN is a framework that uses generative adversarial networks to perform image-to-image translation without paired training examples. It consists of two generators that map images from one domain to another and back, along with two discriminators that classify images as real or fake. The generators are trained to translate images such that they are classified as real by the discriminators, while also remaining consistent when translated back to the original domain. The authors demonstrate it for the task of colorizing grayscale images without paired color-grayscale image samples.
Climb is a generic image processing library written in Common Lisp that provides algorithms like thresholding, mathematical morphology, and watershed segmentation. It aims to be fully generic by handling different image formats, pixel types, and storage schemes. Climb uses morphers to allow algorithms to operate on different image types transparently and property descriptions to select algorithm implementations based on image structure details. The library also provides tools for chaining algorithms together into complex image processing pipelines.
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME... (ijdpsjournal)
In this paper, a new progressive mesh algorithm is introduced in order to perform fast physical simulations using a lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. This algorithm is able to mesh the simulation domain automatically according to the propagation of fluids. The method can also be useful for performing several other types of physical simulations. In this paper, we associate this
algorithm with a multiphase and multicomponent lattice Boltzmann model (MPMC-LBM) because it is
able to perform various types of simulations on complex geometries. The use of this algorithm combined
with the massive parallelism of GPUs [5] yields very good performance in comparison with the
static mesh method used in the literature. Several simulations are shown in order to evaluate the algorithm.
Manifold learning with application to object recognition (zukun)
This document discusses manifold learning techniques for dimensionality reduction that can uncover the intrinsic structure of high-dimensional data. It introduces Isomap and Locally Linear Embedding (LLE) as two popular manifold learning algorithms. Isomap uses graph-based distances to preserve global structure, while LLE aims to preserve local linear relationships between neighbors. Both techniques find low-dimensional embeddings that best represent the high-dimensional data. Manifold learning provides data compression and enables techniques like object recognition by discovering the underlying manifold structure.
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB (Journal For Research)
Image compression is used in many applications, for example satellite imaging, medical imaging, and video, where the size of the image requires more space to store; in such applications image compression can be used effectively. There are two types of image compression techniques, lossy and lossless compression. Both techniques are used for compression of images, but they are not fast: they take more time for compression and decompression. For fast and efficient image compression, a parallel computing technique is used in MATLAB. MATLAB is used in this project for parallel computing on images. In this paper we discuss the regular image compression technique, three alternatives for parallel computing using MATLAB, and a comparison of image compression with and without parallel computing.
This document describes an FPGA-based human detection system with an embedded platform. Key points:
- The system uses HOG features, SVM classification, and AdaBoost algorithms for human detection in images and video.
- FPGA circuits are designed to accelerate the computationally intensive HOG feature extraction, including modules for gradient calculation, histogram accumulation, and more.
- The full system is implemented on an embedded platform to achieve a real-time human detection system running at 15 frames per second.
- Experimental results show the FPGA-based system has similar detection accuracy to a PC-based software implementation but significantly faster speed, suitable for real-time embedded applications.
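The HOG stage the FPGA modules accelerate (gradient calculation, then histogram accumulation) can be sketched for a single cell with NumPy; this is a simplified illustration, not the paper's exact circuit:

```python
import numpy as np

def hog_cell_histogram(cell, bins=9):
    """Gradient magnitudes binned by unsigned orientation for one HOG cell."""
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]      # central differences
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation, [0, 180)
    hist = np.zeros(bins)
    idx = (ang / (180.0 / bins)).astype(int) % bins
    np.add.at(hist, idx.ravel(), mag.ravel())     # accumulate magnitude per bin
    return hist
```

The gradient and accumulation steps are independent per pixel, which is why they map well to dedicated pipelined FPGA modules.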
IJCER (www.ijceronline.com) International Journal of computational Engineerin... (ijceronline)
The document discusses video compression using the Set Partitioning in Hierarchical Trees (SPIHT) algorithm and neural networks. It presents the principles of SPIHT coding and the backpropagation algorithm for neural networks. Various neural network training algorithms are tested for compressing video frames, including gradient descent with momentum and adaptive learning. The results show the compressed frames with different algorithms, and gradient descent with momentum and adaptive learning achieved the best compression ratio of 1.1737089:1 while maintaining image clarity.
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
This document discusses optimizations for implementing an N-body simulation algorithm on GPUs using OpenCL. It begins with an overview of the basic N-body algorithm and its parallel implementation. Two key optimizations are explored: using local memory to enable data reuse across work items, and unrolling the computation loop. Performance results on AMD and Nvidia GPUs show that data reuse provides significant speedup, and loop unrolling further improves performance on the AMD GPU. An example N-body application is provided to experiment with these optimization techniques.
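The baseline all-pairs N-body computation the OpenCL optimizations start from can be written directly in NumPy; the local-memory tiling and loop unrolling change how the work is scheduled, not the math:

```python
import numpy as np

def nbody_accel(pos, mass, eps=1e-3):
    """All-pairs gravitational acceleration (G = 1) with Plummer softening eps."""
    d = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]   # d[i, j] = pos[j] - pos[i]
    r2 = (d ** 2).sum(-1) + eps ** 2
    inv_r3 = r2 ** -1.5
    np.fill_diagonal(inv_r3, 0.0)                       # no self-interaction
    return (d * (mass[np.newaxis, :] * inv_r3)[:, :, np.newaxis]).sum(axis=1)
```

Each body reads every other body's position, so on a GPU a work-group can stage a tile of positions in local memory and let all its work-items reuse it, which is the data-reuse optimization the document measures.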
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document summarizes a research paper that implemented Levenberg-Marquardt artificial neural network training using graphics processing unit (GPU) hardware acceleration. The key points are:
1) This appears to be the first description of implementing artificial neural networks using the Levenberg-Marquardt training method on a GPU.
2) The paper describes their approach for implementing the Levenberg-Marquardt algorithm on a GPU, which involves solving the matrix inversion operation that is typically computationally expensive.
3) Results show that training networks using the GPU implementation can be up to 10 times faster than using a CPU-only implementation on the same hardware.
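The matrix operation the paper moves to the GPU sits inside each Levenberg-Marquardt update; a minimal NumPy sketch of one step (a generic LM step, not the paper's specific network code):

```python
import numpy as np

def lm_step(J, r, lam):
    """One Levenberg-Marquardt update: solve (J^T J + lam*I) delta = J^T r.

    J is the Jacobian of the residuals r with respect to the parameters;
    lam is the damping term that blends Gauss-Newton with gradient descent.
    """
    A = J.T @ J + lam * np.eye(J.shape[1])
    return np.linalg.solve(A, J.T @ r)
```

For neural network training, forming and solving this system dominates the cost, which is why offloading the dense linear algebra to the GPU yields the reported speedup.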
This document summarizes a research paper about developing a new set of low-complexity features for detecting steganography in JPEG images. The proposed features, called DCTR features, are computed by taking the discrete cosine transform (DCT) of non-overlapping 8x8 blocks of the image, resulting in 64 feature maps. Histograms are formed from the quantized noise residuals in these feature maps. This approach has lower computational complexity than previous rich models used for steganalysis and provides competitive detection accuracy across different steganographic algorithms while using fewer features. The paper introduces the concept of an undecimated DCT and explains how it relates to previous work in JPEG steganalysis.
Wilmott Nyc Jul2012 Nag Talk (John Holden)
“Latest releases and news from NAG”
Key new mathematical functionality and technology developments will be shared. Highlights include the launch of the latest release of the NAG C Library and the NAG Toolbox for MATLAB®.
Senior quant from a Tier 1 investment bank commenting on the Mark 23 releases of the NAG Toolbox for MATLAB and the NAG C Library:
“We deploy production code in C++ embedding NAG C Library functions wherever we can, but often prototype new models in MATLAB before writing our C++ code; having the same NAG algorithms in MATLAB via the NAG Toolbox for MATLAB is a real win for us. Thank you, thank you.”
“Obvious finance highlights from my point of view: i) matrix functions (especially exponential) ii) additional nearest correlation matrix functions iii) skip ahead for Mersenne Twister as well as new local and global optimisation functions”
Performance analysis of sobel edge filter on heterogeneous system using opencl (eSAT Publishing House)
This document discusses performance analysis of the Sobel edge detection filter on heterogeneous systems using OpenCL. It begins with an introduction to OpenCL and describes its architecture, including the platform model, execution model, memory model, and programming model. It then provides an overview of GPUs and CPUs, comparing their architectures and number of cores. It also gives mathematical representations of image convolution and describes how the Sobel filter works. The document analyzes the performance of implementing Sobel edge detection using OpenCL on CPUs and GPUs and finds that GPUs provide much higher performance compared to CPUs.
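The Sobel computation whose CPU/GPU performance the document compares is two small correlations plus a magnitude; a reference NumPy version (the OpenCL kernel computes the same thing, one output pixel per work-item):

```python
import numpy as np

KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # x-gradient kernel
KY = KX.T                                                         # y-gradient kernel

def conv2_valid(img, k):
    """3x3 correlation, 'valid' mode (output shrinks by the border)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * img[dy:h - 2 + dy, dx:w - 2 + dx]
    return out

def sobel_magnitude(img):
    return np.hypot(conv2_valid(img, KX), conv2_valid(img, KY))
```

Every output pixel depends only on a 3x3 neighbourhood, so the stencil is embarrassingly parallel, which is why GPUs dominate CPUs on this filter in the document's measurements.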
The complexity of medical image reconstruction requires tens to hundreds of billions of computations per second. Until a few years ago, special-purpose processors designed especially for such applications were used. Such processors require significant design effort, are thus difficult to change as new reconstruction algorithms evolve, and have limited parallelism. Hence the demand for flexibility in medical applications motivated the use of stream processors with massively parallel architectures. Stream processing architectures offer data-parallel parallelism.
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture (Waqas Tariq)
In this paper, we present a comparative performance analysis of different parallel sorting algorithms: bitonic sort and parallel radix sort. In order to study the interaction between the algorithms and the architecture, we implemented both algorithms in OpenCL and compared their performance with that of the quicksort algorithm, one of the fastest sequential algorithms. In our simulation, we used an Intel Core 2 Duo CPU at 2.67 GHz and an NVIDIA Quadro FX 3800 as the graphics processing unit.
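A Python model of the bitonic sorting network studied in the paper makes its GPU-friendliness visible: within each inner pass, every compare-exchange touches a disjoint pair of elements, so a GPU can run them all concurrently. This is a CPU sketch of the network, not the paper's OpenCL kernel:

```python
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two."""
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:                       # size of bitonic sequences being merged
        j = k // 2
        while j > 0:                    # compare-exchange distance
            for i in range(n):          # all exchanges in this pass are independent
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[l]) == ascending:
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2
    return a
```

The network always performs O(n log^2 n) comparisons regardless of input order; it loses to quicksort sequentially but wins on a GPU because its fixed, data-independent pattern maps directly onto parallel hardware.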
“Obvious finance highlights from my point of view: i) matrix functions (especially exponential) ii) additional nearest correlation matrix functions iii) skip ahead for Mersenne Twister as well as new local and global optimisation functions”
Performance analysis of sobel edge filter on heterogeneous system using opencleSAT Publishing House
This document discusses performance analysis of the Sobel edge detection filter on heterogeneous systems using OpenCL. It begins with an introduction to OpenCL and describes its architecture, including the platform model, execution model, memory model, and programming model. It then provides an overview of GPUs and CPUs, comparing their architectures and number of cores. It also gives mathematical representations of image convolution and describes how the Sobel filter works. The document analyzes the performance of implementing Sobel edge detection using OpenCL on CPUs and GPUs and finds that GPUs provide much higher performance compared to CPUs.
The complexity of Medical image reconstruction requires tens to hundreds of billions of computations per second. Until few years ago, special purpose processors designed especially for such applications were used. Such processors require significant design effort and are thus difficult to change as new algorithms in reconstructions evolve and have limited parallelism. Hence the demand for flexibility in medical applications motivated the use of stream processors with massively parallel architecture. Stream processing architectures offers data parallel kind of parallelism.
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY cscpconf
This paper presents a parallel approach to improve the time complexity problem associated with sequential algorithms. An image steganography algorithm in transform domain is considered for implementation. Image steganography is a technique to hide secret message in an image. With the parallel implementation, large message can be hidden in large image since it does not take much processing time. It is implemented on GPU systems. Parallel programming is done using OpenCL in CUDA cores from NVIDIA. The speed-up improvement
obtained is very good with reasonably good output signal quality, when large amount of data is processed
An OpenCL Method of Parallel Sorting Algorithms for GPU ArchitectureWaqas Tariq
In this paper, we present a comparative performance analysis of different parallel sorting algorithms: Bitonic sort and Parallel Radix Sort. In order to study the interaction between the algorithms and architecture, we implemented both the algorithms in OpenCL and compared its performance with Quick Sort algorithm, the fastest algorithm. In our simulation, we have used Intel Core2Duo CPU 2.67GHz and NVidia Quadro FX 3800 as graphical processing unit.
The document describes the design and development of a floating-point co-processor to accelerate graphics functions such as geometric transformations in OpenGL. It details the optimization of the architecture for executing nested loop programs common in graphics, including matrix-vector multiplication. The controller was optimized for executing typical geometric transformation algorithms through techniques like strength reduction and exploiting induction variables. The complete processor was tested hierarchically and demonstrated accelerating the model-view transformation in OpenGL.
International Journal of Computational Engineering Research (IJCER) ijceronline
This document describes a system for implementing an artificial neuron using an FPGA. The system first converts analog signals from electrochemical sensors to digital signals using a 12-bit analog-to-digital converter (ADC). It then implements the mathematical operations of a neuron in digital logic on the FPGA, including multiplication, accumulation, and an activation function. Simulation and chipscope results are presented which verify the design and operation of the artificial neuron on the FPGA board. The system provides a modular design that could be expanded to create a complete artificial neural network for processing electrochemical sensor data.
This project required me to find the optimum architecture configuration that would run the 'eeg' benchmark application using the SimpleScalar simulator.
On Implementation of Neuron Network(Back-propagation)Yu Liu
This document outlines Yu Liu's work implementing and comparing different parallel versions of a neural network using backpropagation. It discusses motivations for parallel programming practice and library study. It provides an introduction to neural networks and backpropagation algorithms. Three implementations are compared: sequential C++ STL, Skelton library, and Intel TBB. Benchmark results show improved speedups from parallel versions. Remaining challenges are also noted, like addressing local minima problems and testing on larger data.
IRJET- Latin Square Computation of Order-3 using Open CLIRJET Journal
This document discusses using OpenCL parallel programming to compute Latin squares of order 3 more efficiently than sequential algorithms. It proposes dividing the input matrix into sub-matrices that are processed concurrently by multiple processing elements in the GPU. This parallel approach reduces the computation time compared to performing the operations sequentially on the CPU. First, the input matrix is divided based on task or data parallelism. Then the sub-matrices are computed simultaneously by different processing elements. The results are combined and stored in GPU memory before being transferred to CPU memory and output. Implementing the Latin square computation with OpenCL exploits parallelism to improve efficiency over the traditional sequential approach.
International Refereed Journal of Engineering and Science (IRJES)irjes
International Refereed Journal of Engineering and Science (IRJES) is a leading international journal for publication of new ideas, the state of the art research results and fundamental advances in all aspects of Engineering and Science. IRJES is a open access, peer reviewed international journal with a primary objective to provide the academic community and industry for the submission of half of original research and applications
International Refereed Journal of Engineering and Science (IRJES)irjes
International Refereed Journal of Engineering and Science (IRJES) is a leading international journal for publication of new ideas, the state of the art research results and fundamental advances in all aspects of Engineering and Science. IRJES is a open access, peer reviewed international journal with a primary objective to provide the academic community and industry for the submission of half of original research and applications
Real-Time Implementation and Performance Optimization of Local Derivative Pat...IJECEIAES
Pattern based texture descriptors are widely used in Content Based Image Retrieval (CBIR) for efficient retrieval of matching images. Local Derivative Pattern (LDP), a higher order local pattern operator, originally proposed for face recognition, encodes the distinctive spatial relationships contained in a local region of an image as the feature vector. LDP efficiently extracts finer details and provides efficient retrieval however, it was proposed for images of limited resolution. Over the period of time the development in the digital image sensors had paid way for capturing images at a very high resolution. LDP algorithm though very efficient in content-based image retrieval did not scale well when capturing features from such high-resolution images as it becomes computationally very expensive. This paper proposes how to efficiently extract parallelism from the LDP algorithm and strategies for optimally implementing it by exploiting some inherent General-Purpose Graphics Processing Unit (GPGPU) characteristics. By optimally configuring the GPGPU kernels, image retrieval was performed at a much faster rate. The LDP algorithm was ported on to Compute Unified Device Architecture (CUDA) supported GPGPU and a maximum speed up of around 240x was achieved as compared to its sequential counterpart.
This document discusses single-GPU and multi-GPU implementations of the MAD IQA algorithm to improve its computational performance. A single-GPU implementation achieved a 24x speedup over the CPU version, bringing the runtime down to 40ms. A multi-GPU implementation using 3 GPUs achieved an additional speedup of 33x over the CPU version, bringing the runtime to 28.9ms, but required many more data transfers between GPUs and system memory. While parallelizing tasks across GPUs improved performance, latency from data transfers between devices limited gains.
OpenGL Based Testing Tool Architecture for Exascale ComputingCSCJournals
1) The document proposes an OpenGL based testing tool architecture for exascale computing to improve performance and accuracy of OpenGL programs.
2) It identifies common errors that occur when programming shaders in OpenGL Shading Language (GLSL) such as errors in file reading, compilation, linking, and rendering.
3) The proposed testing architecture divides the GLSL programming process into four stages - file reading, compilation, pre-linking/linking, and rendering - and validates each stage to detect errors and enforce error-free code.
This document describes a project to create a dynamic sorting algorithm visualizer using OpenGL. It includes an abstract, introduction on OpenGL and libraries used. It describes the implementation including functions for initializing circles, drawing circles, swapping circles, and handling keyboard input and window resizing. It lists the hardware and software requirements. It includes screenshots of the project and concludes with discussing expanding the project with additional sorting algorithms and graphical effects. It provides references used in the project development.
Highlighted notes on Hybrid Multicore Computing
While doing research work under Prof. Dip Banerjee, Prof. Kishore Kothapalli.
In this comprehensive report, Prof. Dip Banerjee describes about the benefit of utilizing both multicore systems, CPUs with vector instructions, and manycore systems, GPUs with large no. of low speed ALUs. Such hybrid systems are beneficial to several algorithms as an accelerator cant optimize for all parts of an algorithms (some computations are very regular, while some very irregular).
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...ijcsit
In Network on Chip (NoC) rooted system, energy consumption is affected by task scheduling and allocation
schemes which affect the performance of the system. In this paper we test the pre-existing proposed
algorithms and introduced a new energy skilled algorithm for 3D NoC architecture. An efficient dynamic
and cluster approaches are proposed along with the optimization using bio-inspired algorithm. The
proposed algorithm has been implemented and evaluated on randomly generated benchmark and real life
application such as MMS, Telecom and VOPD. The algorithm has also been tested with the E3S benchmark
and has been compared with the existing mapping algorithm spiral and crinkle and has shown better
reduction in the communication energy consumption and shows improvement in the performance of the
system. On performing experimental analysis of proposed algorithm results shows that average reduction
in energy consumption is 49%, reduction in communication cost is 48% and average latency is 34%.
Cluster based approach is mapped onto NoC using Dynamic Diagonal Mapping (DDMap), Crinkle and
Spiral algorithms and found DDmap provides improved result. On analysis and comparison of mapping of
cluster using DDmap approach the average energy reduction is 14% and 9% with crinkle and spiral.
OpenGL is a standard graphics library used to render 2D and 3D graphics. It provides basic drawing primitives like points, lines, and polygons. OpenGL follows a graphics pipeline where commands are processed through various stages including transformations, rasterization, and finally writing to the framebuffer. Programmers use OpenGL by issuing commands to specify geometry and settings, which are then rendered to the screen independent of hardware.
Design and implementation of complex floating point processor using fpgaVLSICS Design
This paper presents complete processor hardware with three arithmetic units. The first arithmetic unit can
perform 32-bit integer arithmetic operations. The second unit can perform arithmetic operations such as
addition, subtraction, multiplication, division, and square root on 32-bit floating point numbers. The third
unit can perform arithmetic operations such as addition, subtraction, multiplication on complex numbers.
The specific advancement in this processor is the new architecture introduced for complex arithmetic unit.
In general complex floating point arithmetic hardware consists of floating to fixed and fixed to floating
conversions. But using such hardware will lead to compromise between accuracy and number of bits used
to represent the fixed point equivalent of floating point numbers. The proposed architecture avoids that
compromise and it is implemented with less number of look-up tables to save around 5500 logic gates. The
complex numbers are represented using a subset of IEEE754 standard floating point format, 16-bits for
real part and 16-bits for imaginary part. The floating point arithmetic unit works on 32-bit IEEE754 single
precision numbers. The instruction set is specially designed to support integer, floating point and complex
floating point arithmetic operations. The on-chip RAM is 8kBytes and is extendable up to 64kBytes. As the
processor is designed to implement on FPGA, the embedded block RAMs are utilized as RAM.
IRJET- Digital Image Forgery Detection using Local Binary Patterns (LBP) and ...IRJET Journal
This document proposes a method to detect digital image forgeries using local binary patterns (LBP) and histogram of oriented gradients (HOG). It extracts LBP features from the input image, then applies HOG to the LBP features. These combined features are classified using a support vector machine (SVM) as authentic or tampered. Testing on CASIA datasets achieved detection rates of 92.3% for CASIA-1 and 96.1% for CASIA-2, outperforming other existing methods. The method is effective at forgery detection while having reduced time complexity.
Similar to REAL TIME FACE DETECTION ON GPU USING OPENCL (20)
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfTechgropse Pvt.Ltd.
In this blog post, we'll delve into the intersection of AI and app development in Saudi Arabia, focusing on the food delivery sector. We'll explore how AI is revolutionizing the way Saudi consumers order food, how restaurants manage their operations, and how delivery partners navigate the bustling streets of cities like Riyadh, Jeddah, and Dammam. Through real-world case studies, we'll showcase how leading Saudi food delivery apps are leveraging AI to redefine convenience, personalization, and efficiency.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
CAKE: Sharing Slices of Confidential Data on BlockchainClaudio Di Ciccio
Presented at the CAiSE 2024 Forum, Intelligent Information Systems, June 6th, Limassol, Cyprus.
Synopsis: Cooperative information systems typically involve various entities in a collaborative process within a distributed environment. Blockchain technology offers a mechanism for automating such processes, even when only partial trust exists among participants. The data stored on the blockchain is replicated across all nodes in the network, ensuring accessibility to all participants. While this aspect facilitates traceability, integrity, and persistence, it poses challenges for adopting public blockchains in enterprise settings due to confidentiality issues. In this paper, we present a software tool named Control Access via Key Encryption (CAKE), designed to ensure data confidentiality in scenarios involving public blockchains. After outlining its core components and functionalities, we showcase the application of CAKE in the context of a real-world cyber-security project within the logistics domain.
Paper: https://doi.org/10.1007/978-3-031-61000-4_16
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Things to Consider When Choosing a Website Developer for your Website | FODUUFODUU
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
442 Computer Science & Information Technology (CS & IT)
Section 4 discusses the implementation details on the GPU. Experimental results are shown in Section 5. In Section 6, we give a brief conclusion of the paper.
2. HETEROGENEOUS COMPUTING WITH OPENCL
OpenCL[3] is an industry standard for writing parallel programs targeting heterogeneous platforms. This section gives a brief overview of heterogeneous computing with the OpenCL programming model. OpenCL is described by four models[4]: the platform model, the execution model, the memory model, and the programming model.
2.1. Platform model
The OpenCL platform model defines a high-level representation of any heterogeneous platform used with OpenCL. This model is shown in Fig. 1. The host can be connected to one or more OpenCL devices (DSP, FPGA, GPU, CPU, etc.); the device is where the kernels execute. OpenCL devices are further divided into compute units, which are in turn divided into processing elements (PEs), and computation occurs in these PEs. Each PE executes in SIMD fashion.
Fig. 1. OpenCL Platform Model.
2.2. Execution model
OpenCL execution consists of two parts: a host program and a collection of kernels. OpenCL abstracts away the exact steps for processing a kernel on the various platforms (CPU, GPU, etc.). Kernels execute on OpenCL devices; they are simple functions that transform input memory objects into output memory objects. OpenCL defines two types of kernels, OpenCL kernels and native kernels.
Execution of a kernel on an OpenCL device:
1. The kernel is defined in the host program.
2. The host issues a command for execution of the kernel on an OpenCL device.
3. As a result, the OpenCL runtime creates an index space.
4. An instance of the kernel, called a work item, is defined by its coordinates in the index space (NDRange), as shown in Fig. 2.
Fig. 2. Block diagram of NDRange.
2.3. Memory model
OpenCL defines two types of memory objects, buffer objects and image objects. A buffer object is a contiguous block of memory made available to the kernel, whereas image objects are restricted to holding images. To use the image format, the OpenCL device must support it. In this paper buffer objects are used for face detection. The OpenCL memory model defines five memory regions:
• Host memory: This memory is visible to the host only.
• Global memory: This memory region permits read/write access to all work items in all work groups.
• Local memory: This memory region, local to a work group, can be accessed by work items within that work group.
• Constant memory: This region of global memory remains constant during execution of the kernel. Work items have read-only access to these objects.
• Private memory: This memory region is private to a work item, i.e., variables defined as private in one work item are not visible to other work items.
The block diagram of the memory model is shown in Fig. 3.
Fig. 3. Block diagram for memory model
2.4. Programming model
Programming model is where the programmer will parallelize the algorithm. OpenCL is designed
both for data and task parallelism. In this paper we have used data parallelism which will be
discussed in section 4.
The basic workflow of an application in the OpenCL framework is shown in the block diagram in Fig. 4.
Fig. 4. Workflow diagram of OpenCL.
Here we start with the host program, which defines the context. The context contains two OpenCL devices, a CPU and a GPU. Next, the command queues are defined, one for the GPU and the other for the CPU. The host program then defines the program object, which is compiled to generate the kernel objects for both OpenCL devices. After that, the host program defines the memory objects required by the program and maps them to the arguments of the kernel. Finally, the host program enqueues commands to the command queues to execute the kernels, and the results are read back.
3. OVERVIEW OF THE ALGORITHM
This section discusses how the image is captured from the camera and converted to grey scale, the gamma correction and Difference of Gaussian (DoG) operations used in preprocessing[1], and the LBP feature extraction, histogram computation, and classifier.
3.1. Face detection using LBP
The LBP operator labels each pixel by thresholding the 3x3 neighbourhood of that pixel with the center value. This generic formulation of the operator puts no limitation on the size of the neighbourhood [5]. Here each pixel is compared to its 8 neighbours (top-left, top-middle, top-right, right-middle, bottom-right, bottom-middle, bottom-left, left-middle), visited in clockwise order. Wherever the neighbour value is greater than or equal to the center pixel value, a 1 is written at the corresponding neighbour position, otherwise a 0. This gives an 8-digit binary number, as shown in Fig. 5. This 8-digit binary number is then converted to a decimal value.
Fig. 5. LBP Thresholding
After getting the LBP of each block, the histogram of each block is calculated in parallel and the histograms are concatenated, as shown in Fig. 6. This gives the feature vector used for training the classifier. The LBP operator[6] is defined as

LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c) 2^p, where s(z) = 1 if z >= 0 and 0 otherwise,

where g_p is the intensity of the image at the p-th sample point, g_c is the intensity at the center, and P is the total number of sample points at a radius R, denoted by (P, R). The P equally spaced sampling points of the window are used to calculate the difference between the center g_c and its surrounding pixels[5]. The feature vector of the image obtained after cascading the histograms is

H = (H_1, H_2, ..., H_K), with H_k(i) = \sum_{x,y} I\{f(x, y) = i\},

where k = 1, 2, ..., K indexes the sub-histogram obtained from each block, K is the total number of histograms, I{.} is the indicator function, and f(x, y) is the LBP value calculated at pixel (x, y).
Fig. 6. LBP in Face detection
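The per-block histogram concatenation described above can be illustrated with a vectorized NumPy sketch. This is not the authors' implementation; the function names, the neighbour >= centre comparison convention, and the 16x16 block size used here are assumptions for illustration.

```python
import numpy as np

def lbp_image(img):
    """LBP code for every interior pixel (the one-pixel border is ignored)."""
    c = img[1:-1, 1:-1]
    # The eight shifted views correspond to the eight neighbours of each pixel.
    shifts = [img[:-2, :-2], img[:-2, 1:-1], img[:-2, 2:], img[1:-1, 2:],
              img[2:, 2:], img[2:, 1:-1], img[2:, :-2], img[1:-1, :-2]]
    codes = np.zeros(c.shape, dtype=np.int32)
    for p, n in enumerate(shifts):
        codes |= (n >= c).astype(np.int32) << p
    return codes

def feature_vector(codes, block=16):
    """Concatenate the 256-bin histogram of each block x block tile."""
    h, w = codes.shape
    feats = []
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            tile = codes[by:by + block, bx:bx + block]
            feats.append(np.bincount(tile.ravel(), minlength=256))
    return np.concatenate(feats)
```

Each sub-histogram sums to the number of pixels in its block, so the full feature vector sums to the number of labelled pixels; this invariant is a useful check on the block partitioning.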
3.2. Classifier
There are many methods to determine the dissimilarity between LBP patterns; here the chi-square
method is used, and work is ongoing on using an SVM for training and classifying the LBP
features. The chi-square distance used to measure the dissimilarity between two LBP images S and
M is given by

\chi^2(S, M) = \sum_{x=1}^{L} \frac{(S_x - M_x)^2}{S_x + M_x},

where L is the length of the feature vector of the image and S_x and M_x are the values of the
sample and model image histograms in bin x, respectively.
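The chi-square distance above is straightforward to implement; a minimal sketch follows. The small `eps` guard against bins that are empty in both histograms is an added safeguard, not part of the formula in the paper.

```python
import numpy as np

def chi_square(S, M, eps=1e-10):
    """Chi-square distance between two histogram feature vectors S and M.

    eps avoids division by zero when a bin is empty in both histograms
    (its numerator is then also zero, so the term contributes ~0).
    """
    S = np.asarray(S, dtype=np.float64)
    M = np.asarray(M, dtype=np.float64)
    return np.sum((S - M) ** 2 / (S + M + eps))
```

Identical histograms have distance 0, and the distance grows as mass moves between bins, which makes it usable with a nearest-neighbour rule over stored model histograms.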
4. IMPLEMENTATION
The goal of this paper is to implement the LBP algorithm on a CPU-GPU heterogeneous
platform using OpenCL, in order to reduce computation time. LBP feature extraction and
histogram calculation over the image is computationally expensive (of order N^2 × W^2, where
N×N is the size of the image and W×W is the size of an LBP block), and, as discussed in
Section 3, the extraction in different parts of the image is independent. Thus it can be
efficiently parallelized [7].
For the real-time implementation, the image is first captured using OpenCV and converted to a
grey-scale image, and the preprocessing described above is then applied. Various feature-extraction
algorithms are then run on the image. For parallel processing, the image is subdivided into smaller
parts; in this case it is divided into blocks of 16 x 16 pixels. The task of calculating the LBP of
each block is assigned to a work item, so different work items process different blocks in parallel.
Each work item processes 256 pixels, and because the work items execute concurrently, the processing
time is reduced significantly. To calculate the LBP of the 16 x 16 blocks, the image is converted to
a one-dimensional array. Each work item uses a global reference to the image to build an 18 x 18
two-dimensional matrix from which the LBP of its block is found. From this 18 x 18 matrix, the LBP
is calculated as discussed in Section 3 for the 16 x 16 interior pixels, ignoring the boundary
pixels of the 18 x 18 matrix. The calculated LBP values are then accumulated into a histogram with
bins ranging from 0 to 255. Each work item thus builds the histogram for one block, and the
individual histograms are cascaded to form the histogram of the complete image, giving the LBP
feature vector shown in Fig. 7. This feature vector is then classified using the nearest-neighbour
method.
Fig. 7. Histogram Calculation on GPU
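The data layout described above (flattened image, one work item per 16 x 16 block, each reading an 18 x 18 bordered tile) can be sketched as a sequential NumPy simulation. This is not the authors' OpenCL kernel: the work items here run in a Python loop rather than in parallel, and the edge-replication padding used for image-border tiles is an assumption, since the paper only says the tile boundary is not labelled.

```python
import numpy as np

BLOCK = 16  # each work item handles one 16 x 16 block

def work_item(flat, width, height, bx, by):
    """One simulated work-item: build the 18 x 18 tile for block (bx, by),
    compute LBP codes on its 16 x 16 interior, and return the 256-bin histogram."""
    img = flat.reshape(height, width)
    # Pad by one pixel so image-border blocks also get a full 18 x 18 tile
    # (edge replication is an assumed choice, not specified in the paper).
    padded = np.pad(img, 1, mode="edge")
    tile = padded[by * BLOCK: by * BLOCK + BLOCK + 2,
                  bx * BLOCK: bx * BLOCK + BLOCK + 2]
    c = tile[1:-1, 1:-1]
    shifts = [tile[:-2, :-2], tile[:-2, 1:-1], tile[:-2, 2:], tile[1:-1, 2:],
              tile[2:, 2:], tile[2:, 1:-1], tile[2:, :-2], tile[1:-1, :-2]]
    codes = np.zeros(c.shape, dtype=np.int32)
    for p, n in enumerate(shifts):
        codes |= (n >= c).astype(np.int32) << p
    return np.bincount(codes.ravel(), minlength=256)

def lbp_feature(img):
    """Host side: flatten the image, run one work-item per block,
    and cascade the per-block histograms into the feature vector."""
    h, w = img.shape
    flat = img.ravel()  # image passed to work items as a one-dimensional array
    hists = [work_item(flat, w, h, bx, by)
             for by in range(h // BLOCK) for bx in range(w // BLOCK)]
    return np.concatenate(hists)
```

On the GPU each call to `work_item` would instead be one kernel invocation indexed by `get_global_id`, with the concatenation happening in the output buffer read back by the host.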
The block diagram of the overall method is shown in Fig. 8.
Fig. 8. Block diagram of the overall method
5. RESULTS
In the proposed implementation, each block is computed in a different compute unit. Since the
calculation of a histogram depends on all the pixels within a block, it is better to do the whole
calculation within one compute unit. Additionally, the amount of computation per compute unit
should not be too small, otherwise the overhead of managing the compute unit exceeds the
actual computation. Since the whole computation is done on the GPU and only the input image
and the final histogram are transferred between the CPU and GPU, the data-transfer overhead is
minimal. As a result, the computation time is 20 ms. The performance of the implementation is
shown in Table 1.
Table 1. Performance Table

Input resolution           640x480
Sub-histograms             256
CPU (i5, 3rd generation)   109 ms
GPU (AMD 7670M)            20 ms
Table 2. Comparison With Previous Work

                     Previous work [8]   Our work
Image size           512x512             640x480
Sub-histograms       256                 256
Feature extraction   36.3 ms             20 ms
Compared to the previous work [8], the image is grabbed from a camera and processed in real time.
As can be seen from Table 2, the computation time of our implementation is lower. The performance
of the proposed method was tested on a system with an AMD 7670M GPU and a 3rd-generation Intel i5
CPU. The total time for feature extraction was 109 ms on the CPU and 20 ms on the GPU for a
640x480 input image. Thus we get a roughly 5x improvement in speed using the GPU implementation.
6. CONCLUSION
In this paper, real-time face detection using LBP feature extraction is performed, with
classification by the nearest-neighbour method. We have parallelized the existing LBP algorithm to
make it suitable for implementation on SIMD architectures such as GPGPUs. A performance gain has
been achieved over previous implementations.
REFERENCES
[1] X. Tan and B. Triggs, "Enhanced Local Texture Feature Sets for Face Recognition Under
Difficult Lighting Conditions," IEEE Transactions on Image Processing, vol. 19, no. 6,
pp. 1635-1650, June 2010. doi:10.1109/TIP.2010.2042645
[2] M. Pietikäinen, A. Hadid, G. Zhao, and T. Ahonen, Computer Vision Using Local Binary
Patterns, Computational Imaging and Vision, Vol. 40, Springer, 2011.
[3] Khronos Group, OpenCL overview webpage, http://www.khronos.org/opencl/, 2009.
[4] A. Munshi, B. Gaster, T. G. Mattson, J. Fung, and D. Ginsburg, OpenCL Programming Guide,
Addison-Wesley. ISBN: 978-0-321-74964-2
[5] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on Local Binary
Patterns: A comprehensive study," Image and Vision Computing, vol. 27, no. 6, pp. 803-816,
2009. doi:10.1016/j.imavis.2008.08.005
[6] C. Shan, S. Gong, and P. W. McOwan, "Robust facial expression recognition using local
binary patterns," in Proc. IEEE International Conference on Image Processing (ICIP 2005),
vol. 2, pp. II-370-3, 11-14 Sept. 2005. doi:10.1109/ICIP.2005.1530069
[7] M. Bordallo López, H. Nykänen, J. Hannuksela, O. Silvén, and M. Vehviläinen,
"Accelerating image recognition on mobile devices using GPGPU," Proc. SPIE 7872, Parallel
Processing for Imaging Applications, 78720R, January 25, 2011. doi:10.1117/12.872860
[8] C. Y. N. Dwith and G. N. Rathna, "Parallel Implementation of LBP Based Face Recognition
on GPU Using OpenCL," in Proc. 13th International Conference on Parallel and Distributed
Computing, Applications and Technologies (PDCAT), 2012, pp. 755-760.
doi:10.1109/PDCAT.2012.107