Pattern-based texture descriptors are widely used in Content-Based Image Retrieval (CBIR) for efficient retrieval of matching images. The Local Derivative Pattern (LDP), a higher-order local pattern operator originally proposed for face recognition, encodes the distinctive spatial relationships in a local region of an image as a feature vector. LDP extracts fine detail efficiently and provides effective retrieval; however, it was proposed for images of limited resolution. Over time, developments in digital image sensors have paved the way for capturing images at very high resolution. The LDP algorithm, though very effective in content-based image retrieval, did not scale well to feature extraction from such high-resolution images, as it becomes computationally very expensive. This paper proposes how to efficiently extract parallelism from the LDP algorithm, along with strategies for implementing it optimally by exploiting inherent General-Purpose Graphics Processing Unit (GPGPU) characteristics. By optimally configuring the GPGPU kernels, image retrieval was performed at a much faster rate. The LDP algorithm was ported to a Compute Unified Device Architecture (CUDA) supported GPGPU, and a maximum speedup of around 240x was achieved compared to its sequential counterpart.
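As a rough illustration of the encoding LDP builds on, the sketch below computes a second-order local derivative pattern along the 0° direction for one pixel. This is a minimal pure-Python reference, not the paper's CUDA implementation, and the function names are illustrative.

```python
def deriv0(img, x, y):
    # first-order derivative along the 0-degree direction: I(x, y) - I(x, y + 1)
    return img[x][y] - img[x][y + 1]

def ldp2_code(img, x, y):
    """Second-order Local Derivative Pattern (0-degree direction) for one pixel.

    Compares the sign of the centre pixel's first-order derivative with those
    of its 8 neighbours; opposite or zero-product signs encode a 1 bit.
    """
    center = deriv0(img, x, y)
    # 8 neighbours, clockwise from top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dx, dy in offsets:
        bit = 1 if deriv0(img, x + dx, y + dy) * center <= 0 else 0
        code = (code << 1) | bit
    return code
```

Since every pixel's code is independent of the others, computing one code per GPU thread is the natural parallel decomposition.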
HARDWARE SOFTWARE CO-SIMULATION FOR TRAFFIC LOAD COMPUTATION USING MATLAB SIM... (ijcsity)
Due to the increase in the number of vehicles, traffic is a major problem faced in urban areas throughout the world. This document presents a newly developed MATLAB Simulink model to compute traffic load for real-time traffic signal control. Signal processing, video and image processing, and the Xilinx Blockset have been used extensively for traffic load computation. The approach used is an edge detection operation, wherein edges are extracted to identify the number of vehicles. The developed model computes the results with a high degree of accuracy and can be used to set the green signal duration so as to release the traffic dynamically at traffic junctions.
Xilinx System Generator (XSG) provides a Simulink Blockset for several hardware operations that can be implemented on various Xilinx Field Programmable Gate Arrays (FPGAs). The method described in this paper involves object feature identification and detection. Xilinx System Generator provides blocks to transfer data from the software side of the simulation environment to the hardware side; in our case, from MATLAB Simulink to the System Generator blocks. This is an important concept to understand in the design process using Xilinx System Generator. The Xilinx System Generator, embedded in MATLAB Simulink, is used to program the model, which is then tested on the FPGA board using the hardware co-simulation tools.
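The edge-detection idea behind the traffic-load computation can be sketched in software. The following is a minimal Sobel-based approximation; the threshold and the load metric are illustrative assumptions, not the Simulink/XSG model itself.

```python
def sobel_edges(img, thresh):
    """Return a binary edge map via Sobel gradient magnitude.

    A rough proxy for traffic load: more edge pixels in the road region
    usually means more vehicles. The threshold value is an assumption.
    """
    h, w = len(img), len(img[0])
    gx_k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gy_k = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    edges = [[0] * w for _ in range(h)]
    for x in range(1, h - 1):
        for y in range(1, w - 1):
            gx = sum(gx_k[i][j] * img[x + i - 1][y + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(gy_k[i][j] * img[x + i - 1][y + j - 1]
                     for i in range(3) for j in range(3))
            edges[x][y] = 1 if (gx * gx + gy * gy) ** 0.5 >= thresh else 0
    return edges

def traffic_load(edges):
    # fraction of edge pixels, which could scale the green-signal duration
    h, w = len(edges), len(edges[0])
    return sum(map(sum, edges)) / float(h * w)
```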
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ... (ijcsit)
In a Network-on-Chip (NoC) based system, energy consumption is affected by the task scheduling and allocation schemes, which in turn affect the performance of the system. In this paper we test pre-existing algorithms and introduce a new energy-efficient algorithm for 3D NoC architectures. Efficient dynamic and cluster-based approaches are proposed, along with optimization using a bio-inspired algorithm. The proposed algorithm has been implemented and evaluated on randomly generated benchmarks and real-life applications such as MMS, Telecom and VOPD. It has also been tested with the E3S benchmark and compared with the existing Spiral and Crinkle mapping algorithms, showing a larger reduction in communication energy consumption and an improvement in system performance. Experimental analysis of the proposed algorithm shows that the average reduction in energy consumption is 49%, the reduction in communication cost is 48%, and the reduction in average latency is 34%. The cluster-based approach is mapped onto the NoC using the Dynamic Diagonal Mapping (DDMap), Crinkle and Spiral algorithms, and DDMap is found to provide improved results. On analysis and comparison of cluster mapping, the DDMap approach achieves average energy reductions of 14% and 9% relative to Crinkle and Spiral, respectively.
The document describes a method for speeding up cycle-based logic simulation using graphics processing units (GPUs). The authors develop a parallel cycle-based logic simulation algorithm that uses and-inverter graphs (AIGs) as the design representation. They create two clustering algorithms that partition circuit gates into independent blocks that can be simulated concurrently. Experimental results show speedups of up to 5x using the first clustering algorithm and up to 21x using the second algorithm on several benchmarks. The work demonstrates that GPUs can significantly accelerate logic simulation through parallelization.
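The per-cycle evaluation that such a simulator parallelizes can be sketched as follows. This sequential pure-Python version evaluates an AIG in topological order; the tuple layout is an illustrative choice, not the authors' data structure.

```python
def simulate_aig(nodes, inputs):
    """Evaluate an and-inverter graph for one cycle.

    nodes: list of (a, inv_a, b, inv_b) tuples in topological order;
    each fanin index refers to an earlier value (primary inputs first).
    This mirrors the per-cycle work a GPU thread block would do for one
    independent partition produced by the clustering step.
    """
    values = list(inputs)                 # primary input values, 0 or 1
    for a, inv_a, b, inv_b in nodes:
        fa = values[a] ^ inv_a            # apply edge inversion
        fb = values[b] ^ inv_b
        values.append(fa & fb)            # every internal node is an AND
    return values
```

For example, XOR is expressible with three AND nodes and inverted edges, since x ^ y = NOT(NOT(x AND NOT y) AND NOT(NOT x AND y)).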
Highlighted notes on Hybrid Multicore Computing
Notes taken while doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
In this comprehensive report, Prof. Dip Banerjee describes the benefit of utilizing both multicore systems (CPUs with vector instructions) and manycore systems (GPUs with a large number of low-speed ALUs). Such hybrid systems benefit several algorithms, since a single accelerator cannot be optimal for all parts of an algorithm (some computations are very regular, while others are very irregular).
Realtime Energy Efficient Digital Image Watermarking on Mobile Devices using ... (CSCJournals)
This document summarizes a research paper that proposes a real-time and energy efficient image watermarking scheme for mobile devices using Android. The method uses a hybrid of discrete cosine transform (DCT) and discrete wavelet transform (DWT) along with extreme learning machine (ELM), a single layer feedforward neural network. Low frequency DCT coefficients of the Y component of captured color images are selected to generate a dataset for ELM, which produces a key for embedding watermark data like capture time, device ID, etc. Experiments show the process embeds and extracts watermarks in under a second, making it suitable for a mobile app. Analysis found the overall and LCD energy usage was 34.4J and 31.
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY (csandit)
This paper presents a parallel approach to address the time complexity problem associated with sequential algorithms. An image steganography algorithm in the transform domain is considered for implementation. Image steganography is a technique for hiding a secret message in an image. With the parallel implementation, a large message can be hidden in a large image without excessive processing time. The algorithm is implemented on GPU systems, with parallel programming done using OpenCL on NVIDIA CUDA cores. The speed-up obtained is substantial, with reasonably good output signal quality, when a large amount of data is processed.
IRJET- Saliency based Image Co-Segmentation (IRJET Journal)
The document describes a saliency-based image co-segmentation method. It proposes to first extract inter-image information through co-saliency maps generated from multiple saliency extraction methods. Then, it performs single image segmentation on each individual image. A key step is fusing the diverse saliency maps through a saliency co-fusion process at the superpixel level. This exploits inter-image information to boost common foreground regions and suppress background regions. Experiments show the co-fusion based approach achieves competitive performance compared to state-of-the-art methods without parameter tuning.
Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ... (acijjournal)
Recent advances in computing, such as massively parallel GPUs (Graphics Processing Units), coupled with the need to store and deliver large quantities of digital data, especially images, have brought a number of challenges for computer scientists, the research community and other stakeholders. These challenges, such as the prohibitively large cost of manipulating digital data, have been the focus of the research community in recent years and have led to the investigation of image compression techniques that can achieve excellent results. One such technique is the Discrete Cosine Transform (DCT), which separates an image into parts of differing frequencies and has the advantage of excellent energy compaction.
This paper investigates the use of the Compute Unified Device Architecture (CUDA) programming model to implement the Cordic-based Loeffler DCT algorithm for efficient image compression. The computational efficiency is analyzed and evaluated on both the CPU and the GPU. The PSNR (Peak Signal-to-Noise Ratio) is used to evaluate image reconstruction quality. The results are presented and discussed.
The document discusses grid computing in remote sensing data processing. It describes how grid computing can help process huge amounts of remote sensing data in real-time by distributing processing across networked computers. Key requirements for a grid environment for remote sensing include sharing computational and software resources, managing resources, and supporting different data formats. Case studies demonstrate how grid middleware can improve the efficiency of tasks like image deblurring and generating lookup tables for aerosol analysis.
A Review on Scheduling in Cloud Computing (ijujournal)
This document reviews scheduling techniques in cloud computing. It discusses key concepts like virtualization and different scheduling algorithms. The review surveys various scheduling algorithms for tasks, workflows, real-time applications and energy efficiency. It analyzes the algorithms using parameters such as makespan, cost and energy consumption, and concludes that many of them can improve resource utilization and performance while reducing energy costs.
Video Compression Algorithm Based on Frame Difference Approaches (ijsc)
The huge usage of digital multimedia over wireless communications, the Internet, intranets and cellular mobile networks leads to enormous growth of the data flowing through these media. Researchers have gone deep into developing efficient techniques in these fields, such as compression of data, images and video. Recently, video compression techniques and their applications in many areas (educational, agricultural, medical, ...) have made this one of the most active research fields. The wavelet transform is an efficient method that can be used to build an effective compression technique. This work deals with the development of an efficient video compression approach based on frame-difference approaches, concentrating on the calculation of the frame near distance (the difference between frames). The selection of the meaningful frame depends on many factors such as compression performance, frame details, frame size and the near distance between frames. Three different approaches are applied for removing the frames with the lowest frame difference. In this paper, many videos are tested to ensure the efficiency of this technique; in addition, good performance results have been obtained.
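The frame near-distance idea can be sketched as a mean absolute difference between frames. The drop-threshold policy below is an illustrative assumption, not the paper's exact selection rule.

```python
def frame_distance(f1, f2):
    """Mean absolute difference between two frames: the 'near distance'
    used to decide whether a frame adds little new information."""
    h, w = len(f1), len(f1[0])
    return sum(abs(f1[i][j] - f2[i][j])
               for i in range(h) for j in range(w)) / float(h * w)

def keep_frames(frames, thresh):
    # drop a frame when it is nearly identical to the last kept frame
    kept = [frames[0]]
    for f in frames[1:]:
        if frame_distance(kept[-1], f) >= thresh:
            kept.append(f)
    return kept
```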
In this work, we define a new method for indexing and retrieving non-geotagged video sequences based on visual content only, using the Local Binary Pattern (LBP) and Singular Value Decomposition (SVD) techniques. The main question for our system is: is it possible to determine the geographic location of a video on the GIS map from just the pixels of its frames? The proposed system is introduced to answer such questions. The GIS database was constructed by storing reference images at the intersections between road segments on the map. The Local Binary Pattern (LBP) is used to extract features from images. The Singular Value Decomposition (SVD) technique is used to compress the feature vectors and index the images in the database. The input to the system is a video taken from a forward-facing camera mounted on a vehicle. The output of the proposed system is the geolocation of the video keyframes corresponding to the geotagged images retrieved from the GIS database.
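The LBP feature-extraction step can be sketched as follows; this is a minimal pure-Python version, and the paper's exact parameters (radius, sampling, normalization) are not specified here.

```python
def lbp_code(img, x, y):
    """Classic 8-neighbour Local Binary Pattern for one pixel:
    each neighbour >= centre contributes a 1 bit."""
    c = img[x][y]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dx, dy in offsets:
        code = (code << 1) | (1 if img[x + dx][y + dy] >= c else 0)
    return code

def lbp_histogram(img):
    # 256-bin histogram over interior pixels: the per-image feature vector
    hist = [0] * 256
    for x in range(1, len(img) - 1):
        for y in range(1, len(img[0]) - 1):
            hist[lbp_code(img, x, y)] += 1
    return hist
```

Retrieval then reduces to comparing histogram vectors, which is where the SVD-based compression and indexing would come in.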
This document presents a novel approach for jointly optimizing spatial prediction and transform coding in video compression, aiming to improve performance and reduce complexity compared to existing techniques. The proposed method uses singular value decomposition (SVD) to compress images. SVD decomposes an image matrix into three matrices, allowing the image to be approximated using only a few singular values; this achieves compression by removing redundant information. The document outlines the SVD approach for image compression and measures compression performance using the compression ratio and the mean squared error between the original and compressed images. It then discusses trends in image and video coding, including the combination of natural and synthetic content. Finally, it provides a block diagram of the proposed system and compares its compression performance to existing discrete cosine transform-based techniques.
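The low-rank idea behind SVD compression can be sketched without a linear-algebra library by extracting just the dominant singular triplet with power iteration (k = 1 for brevity; a real codec would keep several triplets).

```python
def rank1_approx(A, iters=50):
    """Best rank-1 approximation of matrix A via power iteration on A^T A.

    Storing the top-k triplets (sigma, u, v) instead of the full matrix
    is the SVD compression idea; k = 1 here keeps the sketch short.
    """
    m, n = len(A), len(A[0])
    v = [1.0] * n                                            # start vector
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = A v
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]  # v = A^T u
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = sum(x * x for x in u) ** 0.5                     # singular value = |A v|
    u = [x / sigma for x in u]
    # reconstruct the rank-1 approximation sigma * u * v^T
    return [[sigma * u[i] * v[j] for j in range(n)] for i in range(m)]
```

On a matrix that is already rank 1, the approximation reproduces it exactly; for general images, the reconstruction error shrinks as more singular triplets are kept.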
IRJET-Survey of Image Object Recognition Techniques (IRJET Journal)
This document discusses various techniques for object recognition in digital images. It begins by defining object recognition and describing its goals. It then outlines several important techniques for object recognition, including spatial relations, temporal relations, data retrieval in conventional databases, image extraction through mining, and content-based retrieval. For each technique, it provides examples and discusses how the technique can be used to recognize objects in images. The document concludes that object recognition can be improved by using contextual information and a knowledge base to classify segmented image regions.
Reversible Encryption and Information Concealment (IJERA Editor)
Recently, much attention has been paid to reversible data hiding (RDH) in encrypted images, since it maintains the valuable property that the original cover image can be losslessly recovered after the embedded data is extracted, while protecting image content that needs to be kept confidential. Earlier techniques embed data by reversibly vacating room from the encrypted images, which may cause errors in data extraction or image restoration. In this paper, we propose a novel method that reserves room before encryption using a conventional RDH algorithm, which makes it straightforward for the data hider to reversibly embed data in the encrypted image. The proposed method achieves real reversibility: data extraction and image recovery are free of any error. It also embeds larger payloads for the same image quality than the previously used techniques, e.g. for PSNR = 40 dB.
Reversible encrypted data concealment in images by reserving room approach (IAEME Publication)
The document summarizes a novel method for reversible encrypted data concealment in images. The proposed method reserves room in the original image before encryption using a traditional reversible data hiding algorithm like lifting wavelet transform. This makes it easy for a data hider to reversibly embed data in the encrypted image by simply filling the pre-reserved space. The method is compared to previous "vacating room after encryption" methods and shows it can embed over 10 times as large payloads with the same image quality, as well as achieve better performance in terms of PSNR and MSE. Experiments on test images demonstrate the benefits of the proposed "reserving room before encryption" approach.
Internet data almost doubles every year. Multimedia communication needs less storage space and fast transmission, so the large volume of video data has become the motivation for video compression. The aim of this paper is to achieve temporal compression for three-dimensional (3D) videos using motion estimation/compensation and wavelets. Instead of performing a two-dimensional (2D) motion search, as is common in conventional video codecs, the use of a 3D motion search has been proposed, which is able to better exploit the temporal correlations of 3D content. This leads to more accurate motion prediction and a smaller residual. A discrete wavelet transform (DWT) compression scheme has been added for a better compression ratio; the DWT has a high energy-compaction property and has greatly influenced the field of compression. The quality parameters peak signal-to-noise ratio (PSNR) and mean square error (MSE) have been calculated. The simulation results show that the proposed work improves the PSNR over existing work.
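The 2D motion search that the paper generalizes to 3D can be sketched as an exhaustive block-matching search. The block size, search radius and SAD cost below are illustrative choices, not the paper's exact configuration.

```python
def sad(cur, ref, bx, by, dx, dy, bs):
    """Sum of absolute differences between a block of the current frame
    and a displaced block of the reference frame."""
    return sum(abs(cur[bx + i][by + j] - ref[bx + dx + i][by + dy + j])
               for i in range(bs) for j in range(bs))

def motion_search(cur, ref, bx, by, bs, radius):
    """Full-search 2-D block matching: returns the displacement with
    minimal SAD; the residual after compensation is what gets coded."""
    best = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, bs)
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            # skip displacements that fall outside the reference frame
            if 0 <= bx + dx and bx + dx + bs <= len(ref) and \
               0 <= by + dy and by + dy + bs <= len(ref[0]):
                cost = sad(cur, ref, bx, by, dx, dy, bs)
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
    return best, best_cost
```

A 3D search extends the displacement to include a depth (or inter-view) component, searching over (dx, dy, dz) instead of (dx, dy).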
Present-day digital signal processing demands a MAC (multiply-accumulate) unit, which performs the functions of multiplication and addition. A MAC operates in two stages: first, the multiplier computes the product of the given operands, and the result is forwarded to the second stage, where addition/accumulation takes place. The speed of the multiplier is important in a MAC unit, as it determines the critical path, and area is also of great importance in MAC design. Multipliers play an important role in many digital signal processing (DSP) applications, such as convolution, digital filters and other data processing units. Much research has been performed on MAC implementation. This paper provides an analysis of the research and investigations carried out so far.
Intensity Enhancement in Gray Level Images using HSV Color Coding Technique (IRJET Journal)
This document discusses techniques for enhancing the intensity of gray scale images using HSV color space coding. It begins with an abstract discussing the motivation to increase image clarity and reduce errors from fatigue. Section 1 provides an introduction to image processing and enhancement. Section 1.1 discusses digital images, including types such as black and white, color, binary, and indexed color images. Section 2 covers hardware used in image processing like lights. Section 3 discusses linear filters that can perform operations like smoothing and sharpening through convolution.
IRJET- Transformation of Realistic Images and Videos into Cartoon Images and ... (IRJET Journal)
This document summarizes research on using a Generative Adversarial Network (GAN) called Cartoon GAN to transform real-world images and videos into cartoon images and videos. The researchers trained Cartoon GAN on 3000 real-world images to learn to generate cartoon images using content and adversarial loss functions. They were able to successfully convert both individual images and video clips into cartoon/animated versions. For video, they used the OpenCV library to divide videos into frames, pass each frame through the trained Cartoon GAN model, and then recombine the cartoonized frames into an output cartoon video. The researchers concluded that Cartoon GAN is an effective method for automatically transforming real media into cartoons, and they aim to further improve output quality and resolution.
Reversible Image Data Hiding with Contrast Enhancement (IRJET Journal)
This document proposes a reversible image data hiding technique with contrast enhancement. It aims to embed data into a cover image in a reversible manner while also enhancing the contrast of the cover image. The technique first calculates prediction errors of pixel values in the cover image. It then generates a histogram of the prediction errors and selects carriers for data embedding from the peaks in the histogram. Binary secret data is embedded into the carriers by dynamically shifting the prediction-error histogram. This allows data to be embedded while increasing cover image quality compared to other reversible data hiding methods. The original cover image can be recovered by extracting the embedded data and reversing the histogram shifts. The technique is meant to achieve a higher peak signal-to-noise ratio than the original cover image after data embedding.
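The histogram-shifting mechanism can be sketched on a plain pixel histogram; the paper shifts a prediction-error histogram instead, and the selection of the `peak` and `zero` bins is assumed done beforehand.

```python
def hs_embed(pixels, bits, peak, zero):
    """Histogram-shifting embedding (simplified: plain pixel values rather
    than the paper's prediction errors). Assumes peak < zero and that no
    pixel equals `zero`, so shifting creates no overflow."""
    out, k = [], 0
    for p in pixels:
        if peak < p < zero:
            out.append(p + 1)            # shift to make room next to the peak
        elif p == peak and k < len(bits):
            out.append(p + bits[k])      # embed one bit: 0 stays, 1 moves up
            k += 1
        else:
            out.append(p)
    return out

def hs_extract(marked, peak, zero):
    """Recover the bits and the original pixels exactly (reversibility).
    Assumes the payload covered every peak-valued pixel during embedding."""
    bits, restored = [], []
    for p in marked:
        if p == peak:
            bits.append(0); restored.append(peak)
        elif p == peak + 1:
            bits.append(1); restored.append(peak)
        elif peak + 1 < p <= zero:
            restored.append(p - 1)       # undo the shift
        else:
            restored.append(p)
    return bits, restored
```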
Adaptive Image Resizing using Edge Contrasting (ijtsrd)
Zooming is an important image processing operation: the process of enlarging or magnifying an image by a given factor. Indiscriminate application of a resampling function to an image produces aliasing and edge blurring, so the objective is to reduce these artifacts. This paper considers interpolation systems and adaptive techniques with the ability to preserve sharp edges and details, and presents an adaptive resampling algorithm for zooming up images: an adaptive edge-enhancement procedure for two-dimensional (2D) image scaling. The proposed image scaling algorithm consists of an edge detector, bilinear interpolation and a Sobel filter. Bilinear interpolation determines the intensity of a scaled pixel as the weighted average of the four neighbouring pixels; interpolated images become smooth, with loss of edge information. The adaptive edge-enhancement procedure is used to preserve edge features effectively, achieve better image quality and avoid losing edge information, while the Sobel filter attempts to reduce the noise in the blurred and distorted edges produced by bilinear interpolation. A mathematical-control and hardware-sharing strategy are used to reduce the computing resources required by the bilinear interpolation. The analysis shows that edges are well preserved and that interpolation artifacts (blurring, jaggies) are reduced. Comparing the existing algorithms with the proposed method on original images leads to the conclusion that the proposed algorithm is superior to the current ones. Images were compared in two ways: Mean Square Error (MSE) and Peak Signal-to-Noise Ratio (PSNR). Mohd Sadiq Abdul Aziz | Dr. Bharti Chourasia, "Adaptive Image Resizing using Edge Contrasting", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-6, October 2020. URL: https://www.ijtsrd.com/papers/ijtsrd35789.pdf Paper URL: https://www.ijtsrd.com/engineering/electronics-and-communication-engineering/35789/adaptive-image-resizing-using-edge-contrasting/mohd-sadiq-abdul-aziz
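The bilinear interpolation step this paper builds on can be sketched as follows (pure Python, without the edge-enhancement and hardware-sharing parts).

```python
def bilinear_zoom(img, factor):
    """Bilinear image up-scaling: each output pixel is the weighted
    average of its four nearest source pixels. This is the smoothing
    (and edge-blurring) baseline the adaptive step then corrects."""
    h, w = len(img), len(img[0])
    H, W = int(h * factor), int(w * factor)
    out = [[0.0] * W for _ in range(H)]
    for X in range(H):
        for Y in range(W):
            x, y = X / factor, Y / factor        # source coordinates
            x0 = min(int(x), h - 2)              # clamp so x0 + 1 is valid
            y0 = min(int(y), w - 2)
            dx, dy = x - x0, y - y0
            out[X][Y] = (img[x0][y0] * (1 - dx) * (1 - dy)
                         + img[x0][y0 + 1] * (1 - dx) * dy
                         + img[x0 + 1][y0] * dx * (1 - dy)
                         + img[x0 + 1][y0 + 1] * dx * dy)
    return out
```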
Simple and Fast Implementation of Segmented Matrix Algorithm for Haar DWT on ... (IDES Editor)
The Haar discrete wavelet transform (DWT), the simplest of all DWTs, has diverse applications in signal and image processing. A traditional approach to the 2D Haar DWT is a 1D row operation followed by a 1D column operation. In 2002, Chen and Liao presented a fast algorithm for the 2D Haar DWT based on a segmented matrix. However, this method is infeasible for large images because of its high computational requirements. In this paper, we have implemented the segmented matrix algorithm on a low-cost NVIDIA GPU to achieve a computational speedup. The efficiency of our GPU-based implementation is measured and compared with CPU-based algorithms. Our experimental results show a performance improvement by a factor of over 28.5 compared with Chen and Liao's CPU-based segmented matrix algorithm, and by a factor of 8 compared with MATLAB's wavelet function, for an image of size 2560×2560.
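The traditional row-then-column 2D Haar DWT that the segmented-matrix algorithm reworks can be sketched as:

```python
def haar_1d(v):
    # one Haar level: pairwise averages (approximation) then differences (detail)
    avg = [(v[i] + v[i + 1]) / 2.0 for i in range(0, len(v), 2)]
    dif = [(v[i] - v[i + 1]) / 2.0 for i in range(0, len(v), 2)]
    return avg + dif

def haar_2d(img):
    """Traditional 2-D Haar DWT: a 1-D transform on every row, then on
    every column. Rows (and then columns) are mutually independent,
    which is what makes a GPU implementation attractive."""
    rows = [haar_1d(r) for r in img]
    h, w = len(rows), len(rows[0])
    cols = [haar_1d([rows[i][j] for i in range(h)]) for j in range(w)]
    # transpose back so the result has the original orientation
    return [[cols[j][i] for j in range(w)] for i in range(h)]
```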
IRJET- Design and Implementation of ATM Security System using Vibration Senso... (IRJET Journal)
This document discusses implementing bi-histogram equalization for contrast enhancement on the Android platform. It begins with an introduction to histogram equalization and its drawback of changing image brightness. It then presents bi-histogram equalization as an approach to overcome this by decomposing the image into two sub-images based on the mean and equalizing them independently to preserve the mean brightness. The paper outlines implementing various steps like image acquisition, preprocessing, and bi-histogram equalization on Android. It shows output images with enhanced visibility compared to the originals, avoiding the flattening property of standard histogram equalization.
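The bi-histogram equalization step can be sketched as follows. The equalization mapping below is a simple integer CDF scaling, an illustrative choice rather than the paper's exact formulation.

```python
def equalize(vals, lo, hi):
    # ordinary histogram equalization restricted to the range [lo, hi]
    hist = {}
    for v in vals:
        hist[v] = hist.get(v, 0) + 1
    cdf, total, mapping = 0, len(vals), {}
    for level in sorted(hist):
        cdf += hist[level]
        mapping[level] = lo + (hi - lo) * cdf // total
    return mapping

def bbhe(img):
    """Bi-histogram equalization: split the image at its mean and
    equalize each sub-image within its own sub-range, so the overall
    mean brightness is roughly preserved."""
    flat = [p for row in img for p in row]
    mean = sum(flat) // len(flat)
    low = [p for p in flat if p <= mean]
    high = [p for p in flat if p > mean]
    m_low = equalize(low, 0, mean) if low else {}
    m_high = equalize(high, mean + 1, 255) if high else {}
    return [[m_low[p] if p <= mean else m_high[p] for p in row] for row in img]
```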
The complexity of medical image reconstruction requires tens to hundreds of billions of computations per second. Until a few years ago, special-purpose processors designed specifically for such applications were used. Such processors require significant design effort, are difficult to change as new reconstruction algorithms evolve, and have limited parallelism. Hence the demand for flexibility in medical applications has motivated the use of stream processors with massively parallel architectures, which offer data-parallel execution.
Image Processing Application on Graphics processors (CSCJournals)
In this work, we introduce real-time image processing techniques using modern programmable Graphics Processing Units (GPUs). The GPU is a SIMD (Single Instruction, Multiple Data) device that is inherently data-parallel. By utilizing NVIDIA's GPU programming framework, Compute Unified Device Architecture (CUDA), as a computational resource, we realize significant acceleration of image processing algorithm computations. We show that a range of computer vision algorithms map readily to CUDA with significant performance gains. Specifically, we demonstrate the efficiency of our approach through the parallelization and optimization of image processing, morphology applications and the image integral.
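One of the demonstrated kernels, the image integral (summed-area table), has a simple reference form. This sequential pure-Python version shows the recurrence a CUDA kernel would parallelize, typically via row-wise and column-wise prefix sums.

```python
def integral_image(img):
    """Summed-area table: ii[x][y] holds the sum of img over the
    rectangle [0..x-1, 0..y-1], with a zero first row/column so that
    rectangle queries need no boundary special-casing."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for x in range(h):
        for y in range(w):
            ii[x + 1][y + 1] = (img[x][y] + ii[x][y + 1]
                                + ii[x + 1][y] - ii[x][y])
    return ii

def box_sum(ii, x0, y0, x1, y1):
    # sum of img[x0..x1][y0..y1] from four table lookups, in O(1)
    return ii[x1 + 1][y1 + 1] - ii[x0][y1 + 1] - ii[x1 + 1][y0] + ii[x0][y0]
```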
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS (cseij)
This document summarizes a survey on GPU systems and their performance on different applications. It discusses how GPUs can be used for general-purpose computing due to their high parallel processing capabilities. Several computationally intensive applications that achieve speedups when implemented on GPUs are described, including video decoding, matrix multiplication, parallel AES encryption, and password recovery for MS Office documents. The GPU architecture and Nvidia's CUDA programming model are also summarized. While GPUs provide significant performance benefits, some limitations for non-graphics applications are noted. The conclusion is that GPUs are a good alternative for computationally intensive tasks, reducing CPU load and improving performance compared to CPU-only implementations.
Cuda Based Performance Evaluation Of The Computational Efficiency Of The Dct ... (acijjournal)
Recent advances in computing, such as massively parallel GPUs (Graphical Processing Units), coupled with the need to store and deliver large quantities of digital data, especially images, have brought a number of challenges for computer scientists, the research community and other stakeholders. These challenges, such as prohibitively large costs to manipulate the digital data, have been the focus of the research community in recent years and have led to the investigation of image compression techniques that can achieve excellent results. One such technique is the Discrete Cosine Transform (DCT), which separates an image into parts of differing frequencies and has the advantage of excellent energy compaction. This paper investigates the use of the Compute Unified Device Architecture (CUDA) programming model to implement the Cordic-based Loeffler DCT algorithm for efficient image compression. The computational efficiency is analyzed and evaluated on both the CPU and the GPU. The PSNR (Peak Signal to Noise Ratio) is used to evaluate image reconstruction quality. The results are presented and discussed.
The document discusses grid computing in remote sensing data processing. It describes how grid computing can help process huge amounts of remote sensing data in real-time by distributing processing across networked computers. Key requirements for a grid environment for remote sensing include sharing computational and software resources, managing resources, and supporting different data formats. Case studies demonstrate how grid middleware can improve the efficiency of tasks like image deblurring and generating lookup tables for aerosol analysis.
A Review on Scheduling in Cloud Computing (ijujournal)
This document reviews scheduling techniques in cloud computing. It discusses key concepts like virtualization and different scheduling algorithms. The review surveys various scheduling algorithms for tasks, workflows, real-time applications and energy efficiency. It analyzes algorithms based on parameters like makespan, cost, energy consumption and concludes many algorithms can improve resource utilization and performance while reducing energy costs.
Video Compression Algorithm Based on Frame Difference Approaches (ijsc)
The huge usage of digital multimedia via communications, wireless communications, Internet, Intranet and cellular mobile leads to enormous growth of data flow through these media. Researchers have gone deep into developing efficient techniques in these fields, such as compression of data, images and video. Recently, video compression techniques and their applications in many areas (educational, agricultural, medical, ...) have made this one of the most active fields. The wavelet transform is an efficient method that can be used to build an efficient compression technique. This work deals with the development of an efficient video compression approach based on frame difference approaches that concentrate on the calculation of the frame near distance (the difference between frames). The selection of the meaningful frame depends on many factors such as compression performance, frame details, frame size and the near distance between frames. Three different approaches are applied for removing the frames with the lowest frame difference. In this paper, many videos are tested to ensure the efficiency of this technique; in addition, good performance results have been obtained.
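The "near distance" idea can be illustrated with a toy sketch. The abstract does not define the distance, so mean absolute pixel difference is assumed here, and dropping the single most redundant frame stands in for the paper's three removal strategies:

```python
def frame_distance(a, b):
    """Assumed 'near distance': mean absolute pixel difference
    between two frames given as 2D lists of intensities."""
    n = len(a) * len(a[0])
    return sum(abs(pa - pb)
               for ra, rb in zip(a, b)
               for pa, pb in zip(ra, rb)) / n

def drop_most_redundant(frames):
    """Remove the frame whose distance to its predecessor is lowest,
    i.e. the frame contributing the least new information."""
    dists = [frame_distance(frames[i - 1], frames[i])
             for i in range(1, len(frames))]
    worst = dists.index(min(dists)) + 1
    return frames[:worst] + frames[worst + 1:]
```

Repeating this until a target compression ratio or a minimum-distance threshold is reached gives the temporal-compression pass the abstract describes.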
In this work, we define a new method for indexing and retrieving non-geotagged video sequences based on visual content only, using the Local Binary Pattern (LBP) and Singular Value Decomposition (SVD) techniques. The main question for our system is: is it possible to determine the geographic location of a video on the GIS map from just the pixels of its frames? The proposed system is introduced to answer such questions. The GIS database was constructed by storing reference images at the intersections between road segments on the map. The Local Binary Pattern (LBP) is used to extract features from the images. The Singular Value Decomposition (SVD) technique is used to compress the feature vectors and index the images in the database. The input to the system is a video taken from a forward-facing camera mounted on a vehicle. The output of the proposed system is the geolocation of the video keyframes that correspond to the geo-tagged images retrieved from the GIS database.
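The LBP features used above come from a simple per-pixel code (the first-order pattern that the LDP operator in the main paper generalizes). A minimal sketch of the basic 8-neighbour operator follows; the clockwise bit ordering starting at the top-left is an arbitrary convention:

```python
def lbp_code(img, r, c):
    """Basic 8-neighbour Local Binary Pattern: threshold each neighbour
    against the centre pixel and pack the resulting bits into one byte."""
    center = img[r][c]
    neighbours = [img[r-1][c-1], img[r-1][c], img[r-1][c+1], img[r][c+1],
                  img[r+1][c+1], img[r+1][c], img[r+1][c-1], img[r][c-1]]
    code = 0
    for bit, n in enumerate(neighbours):
        if n >= center:          # neighbour at least as bright -> bit = 1
            code |= 1 << bit
    return code
```

A histogram of these codes over an image (or image block) is the texture feature vector that SVD then compresses for indexing.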
This document presents a novel approach for jointly optimizing spatial prediction and transform coding in video compression. It aims to improve performance and reduce complexity compared to existing techniques. The proposed method uses singular value decomposition (SVD) to compress images. SVD decomposes an image matrix into three matrices, allowing the image to be approximated using only a few singular values. This achieves compression by removing redundant information. The document outlines the SVD approach for image compression and measures compression performance using compression ratio and mean squared error between the original and compressed images. It then discusses trends in image and video coding, including combining natural and synthetic content. Finally, it provides a block diagram of the proposed system and compares its compression performance to existing discrete cosine transform-
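The rank-k approximation described above is direct to express with NumPy. This is an illustrative sketch, not the paper's implementation; the storage-count compression-ratio formula is one common convention:

```python
import numpy as np

def svd_compress(image, k):
    """Rank-k SVD approximation of a 2D image matrix. Keeping only the k
    largest singular values discards redundant information; storing the
    truncated factors instead of the full matrix gives the compression."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    m, n = image.shape
    # Original stores m*n values; truncated SVD stores k*(m + n + 1).
    ratio = (m * n) / (k * (m + n + 1))
    mse = float(np.mean((image - approx) ** 2))
    return approx, ratio, mse
```

Small k gives high compression with rising MSE; a rank-1 image (e.g. an outer product) reconstructs exactly from a single singular value.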
IRJET - Survey of Image Object Recognition Techniques (IRJET Journal)
This document discusses various techniques for object recognition in digital images. It begins by defining object recognition and describing its goals. It then outlines several important techniques for object recognition, including spatial relations, temporal relations, data retrieval in conventional databases, image extraction through mining, and content-based retrieval. For each technique, it provides examples and discusses how the technique can be used to recognize objects in images. The document concludes that object recognition can be improved by using contextual information and a knowledge base to classify segmented image regions.
Reversible Encryption and Information Concealment (IJERA Editor)
Recently, much attention has been paid to reversible data hiding (RDH) in encrypted images, since it maintains the excellent property that the original cover image can be losslessly recovered after the embedded data is extracted, while protecting image content that needs to be kept confidential. Previous techniques embed data by reversibly vacating room from the encrypted images, which may cause errors in data extraction or image restoration. In this paper, we propose a novel method that reserves room before encryption using a conventional RDH algorithm, so it becomes straightforward for the data hider to reversibly embed data in the encrypted image. The proposed method achieves real reversibility, that is, data extraction and image recovery are free of any error. This method embeds larger payloads than the previously used techniques for the same image quality, e.g., for PSNR = 40 dB.
Reversible encrypted data concealment in images by reserving room approach (IAEME Publication)
The document summarizes a novel method for reversible encrypted data concealment in images. The proposed method reserves room in the original image before encryption using a traditional reversible data hiding algorithm like lifting wavelet transform. This makes it easy for a data hider to reversibly embed data in the encrypted image by simply filling the pre-reserved space. The method is compared to previous "vacating room after encryption" methods and shows it can embed over 10 times as large payloads with the same image quality, as well as achieve better performance in terms of PSNR and MSE. Experiments on test images demonstrate the benefits of the proposed "reserving room before encryption" approach.
Internet data almost doubles every year. Multimedia communication needs less storage space and fast transmission, so the large volume of video data has become the motivation for video compression. The aim of this paper is to achieve temporal compression for three-dimensional (3D) videos using motion estimation-compensation and wavelets. Instead of performing a two-dimensional (2D) motion search, as is common in conventional video codecs, the use of a 3D motion search is proposed, which is able to better exploit the temporal correlations of 3D content. This leads to more accurate motion prediction and a smaller residual. A discrete wavelet transform (DWT) compression scheme has been added for a better compression ratio; the DWT has a high energy-compaction property and has thus greatly impacted the field of compression. The quality parameters peak signal to noise ratio (PSNR) and mean square error (MSE) have been calculated. The simulation results show that the proposed work improves the PSNR over existing work.
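Conventional 2D block matching, which the 3D search above generalizes, can be sketched as an exhaustive sum-of-absolute-differences (SAD) minimisation. Block size, search range, and the SAD criterion here are illustrative assumptions:

```python
def block_match(ref, cur, by, bx, bs, search):
    """Exhaustive block-matching motion estimation: find the displacement
    (dy, dx) into the reference frame that minimises the SAD for the
    bs-by-bs block anchored at (by, bx) in the current frame."""
    best, best_mv = None, (0, 0)
    h, w = len(ref), len(ref[0])
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bs > h or x + bs > w:
                continue  # candidate block falls outside the frame
            sad = sum(abs(cur[by + i][bx + j] - ref[y + i][x + j])
                      for i in range(bs) for j in range(bs))
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv, best
```

The motion-compensated residual (current block minus the best-matching reference block) is what the DWT stage then compresses; a 3D search simply extends the candidate set across more than one reference frame.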
Present-day digital signal processing (DSP) demands a multiply-accumulate (MAC) unit, which performs the functions of multiplication and addition. A MAC operates in two stages: first, the multiplier computes the product of the given numbers, and the result is forwarded to the second stage, where addition/accumulation takes place. The speed of the multiplier is important in a MAC unit, as it determines the critical path; area is also of great importance in MAC unit design. The multiplier plays an important role in many DSP applications such as convolution, digital filters and other data processing units. Much research has been performed on MAC implementation, and this paper provides an analysis of the research and investigations held so far.
Intensity Enhancement in Gray Level Images using HSV Color Coding Technique (IRJET Journal)
This document discusses techniques for enhancing the intensity of gray scale images using HSV color space coding. It begins with an abstract discussing the motivation to increase image clarity and reduce errors from fatigue. Section 1 provides an introduction to image processing and enhancement. Section 1.1 discusses digital images, including types such as black and white, color, binary, and indexed color images. Section 2 covers hardware used in image processing like lights. Section 3 discusses linear filters that can perform operations like smoothing and sharpening through convolution.
IRJET - Transformation of Realistic Images and Videos into Cartoon Images and ... (IRJET Journal)
This document summarizes research on using a Generative Adversarial Network (GAN) called Cartoon GAN to transform real-world images and videos into cartoon images and videos. The researchers trained Cartoon GAN on 3000 real-world images to learn how to generate cartoon images by using content and adversarial loss functions. They were able to successfully convert both individual images and video clips into cartoon/animated versions. For video, they used the OpenCV library to divide videos into frames, pass each frame through the trained Cartoon GAN model, and then recombine the cartoonized frames into an output cartoon video. The researchers concluded that Cartoon GAN is an effective method for automatically transforming real media into cartoons and aims to improve the quality and resolution
Reversible Image Data Hiding with Contrast Enhancement (IRJET Journal)
This document proposes a reversible image data hiding technique with contrast enhancement. It aims to embed data into a cover image in a reversible manner while also enhancing the contrast of the cover image. The technique first calculates prediction errors of pixel values in the cover image. It then generates a histogram of the prediction errors and selects carriers for data embedding from peaks in the histogram. Binary secret data is embedded into the carriers by dynamically shifting the prediction error histogram. This allows data to be embedded while increasing cover image quality compared to other reversible data hiding methods. The original cover image can be recovered by extracting the embedded data and reversing the histogram shifts. The technique is meant to achieve a higher peak signal-to-noise ratio than the original cover image after data
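The histogram-shifting step described above can be shown on a bare error sequence. This toy sketch makes simplifying assumptions the paper does not (a single known peak bin, binary payload, no overflow handling):

```python
def embed_bits(errors, bits, peak):
    """Histogram-shifting embedding: errors equal to the peak bin carry
    one payload bit each (peak -> peak or peak+1); errors above the peak
    shift up by one to make room. The shift widens the histogram, which
    is what raises contrast in the marked image."""
    out, it = [], iter(bits)
    for e in errors:
        if e == peak:
            out.append(e + next(it))
        elif e > peak:
            out.append(e + 1)
        else:
            out.append(e)
    return out

def extract_bits(marked, peak):
    """Inverse pass: recover the payload bits and the original errors,
    making the scheme reversible."""
    bits, errors = [], []
    for e in marked:
        if e in (peak, peak + 1):
            bits.append(e - peak)
            errors.append(peak)
        elif e > peak + 1:
            errors.append(e - 1)
        else:
            errors.append(e)
    return bits, errors
```

Because extraction reproduces the exact error sequence, adding the predictions back yields the original cover image bit-for-bit.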
Adaptive Image Resizing using Edge Contrasting (ijtsrd)
Zooming is an important image processing operation: it can be described as the process of enlarging or magnifying an image by a given factor. Indiscriminately applying a resampling function to an image produces aliasing and edge blurring, so the objective is to reduce these artifacts. This paper considers distinctive interpolation systems related to adaptive techniques with inherent abilities to preserve sharp edges and details, giving an adaptive resampling algorithm for zooming up images. In this work, an adaptive edge-enhancement procedure is proposed for two-dimensional (2D) image scaling. The proposed image scaling algorithm comprises an edge detector, bilinear interpolation and a Sobel filter. Bilinear interpolation defines the intensity of the scaled pixel as the weighted average of the four neighboring pixels, so interpolated images become smooth and lose edge information. The adaptive edge-enhancement procedure is used to preserve edge features effectively, to achieve better image quality and to avoid the loss of edge information. The Sobel filter attempts to reduce the noise in the blurred and distorted edges produced by bilinear interpolation. An algebraic manipulation and hardware-sharing strategy are used to reduce the computing resources of the bilinear interpolation. The analysis shows that edges are well preserved and that interpolation artifacts such as blurring and jaggies are reduced. To compare existing algorithms with the proposed method, we have taken original images and discussed the results, and we have come to the conclusion that the proposed algorithm is superior to the existing ones. We have compared the images in two ways: Mean Square Error (MSE) and Peak Signal to Noise Ratio (PSNR). Mohd Sadiq Abdul Aziz | Dr. Bharti Chourasia "Adaptive Image Resizing using Edge Contrasting" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-6, October 2020, URL: https://www.ijtsrd.com/papers/ijtsrd35789.pdf Paper Url: https://www.ijtsrd.com/engineering/electronics-and-communication-engineering/35789/adaptive-image-resizing-using-edge-contrasting/mohd-sadiq-abdul-aziz
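The bilinear kernel that the paper builds on (smooth but edge-blurring) is easy to state. A plain-Python sketch of bilinear resampling follows, without the paper's edge detector or Sobel stage; clamping at the border is one of several possible conventions:

```python
def bilinear_sample(img, y, x):
    """Intensity at fractional position (y, x) as the weighted average
    of the four surrounding pixels, clamped at the image border."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(img) - 1)
    x1 = min(x0 + 1, len(img[0]) - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0][x0] + (1 - dy) * dx * img[y0][x1]
            + dy * (1 - dx) * img[y1][x0] + dy * dx * img[y1][x1])

def zoom(img, factor):
    """Enlarge img by the given factor using bilinear resampling only."""
    h, w = len(img), len(img[0])
    return [[bilinear_sample(img, r / factor, c / factor)
             for c in range(int(w * factor))]
            for r in range(int(h * factor))]
```

The mid-point values (e.g. 5.0 between 0 and 10) show exactly the smoothing that the adaptive edge-enhancement stage is designed to counteract at edges.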
Simple and Fast Implementation of Segmented Matrix Algorithm for Haar DWT on ... (IDES Editor)
The Haar discrete wavelet transform (DWT), the simplest among all DWTs, has diverse applications in signal and image processing. The traditional approach to the 2D Haar DWT is a 1D row operation followed by a 1D column operation. In 2002, Chen and Liao presented a fast algorithm for the 2D Haar DWT based on a segmented matrix. However, this method is infeasible for large images because of its high computational requirements. In this paper, we have implemented the segmented matrix algorithm on a low-cost NVIDIA GPU to achieve a computational speedup. The efficiency of our GPU-based implementation is measured and compared with CPU-based algorithms. Our experimental results show a performance improvement by a factor of over 28.5 compared with Chen and Liao's CPU-based segmented matrix algorithm, and by a factor of 8 compared with MATLAB's wavelet function, for an image of size 2560×2560.
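The traditional row-then-column decomposition mentioned above can be sketched directly. One level is shown, using the averaging/differencing convention; the orthonormal variant would scale by 1/√2 instead of 1/2:

```python
def haar_1d(v):
    """One level of the 1D Haar transform on an even-length sequence:
    pairwise averages (approximation) followed by pairwise differences
    (detail)."""
    half = len(v) // 2
    avg = [(v[2 * i] + v[2 * i + 1]) / 2 for i in range(half)]
    diff = [(v[2 * i] - v[2 * i + 1]) / 2 for i in range(half)]
    return avg + diff

def haar_2d(img):
    """Traditional 2D Haar DWT: 1D transform on every row, then on
    every column, yielding the LL, LH, HL and HH quadrants."""
    rows = [haar_1d(r) for r in img]
    cols = [haar_1d(list(c)) for c in zip(*rows)]   # transpose, transform
    return [list(r) for r in zip(*cols)]            # transpose back
```

The segmented-matrix algorithm referenced in the abstract computes the same quadrants but reorganises the arithmetic; this row/column form is the baseline it is compared against.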
IRJET - Design and Implementation of ATM Security System using Vibration Senso... (IRJET Journal)
Image processing has come under the spotlight recently and has led to significant shifts in various fields such as biomedicine, satellite imagery and graphical applications. Nevertheless, poor image quality is one of the noticeable limitations of image processing, as it restricts efficient data extraction. Conventionally, images were processed via software applications such as MATLAB. Although such software can cater to data extraction from low-quality images, it still suffers from being time-consuming. As the ability to obtain rapid results is a desirable feature of efficient image processing, using hardware for image processing is deemed to keep this issue at bay. Thus, hardware-based image enhancement techniques, with approaches such as the field-programmable gate array (FPGA), have attracted gradually rising interest among researchers. In this study, 25 research papers published from 2016 to 2021 are studied and analyzed, focusing on the performance of FPGAs as hardware implementations of image processing techniques.
The Computation Complexity Reduction of 2-D Gaussian Filter (IRJET Journal)
This document discusses reducing the computational complexity of a 2D Gaussian filter for image smoothing. It begins with an abstract that notes 2D Gaussian filters are commonly used for image smoothing but require heavy computational resources. It then proposes using fixed-point arithmetic rather than floating point to implement the filter on an FPGA, which can increase efficiency and decrease area and complexity. The document is divided into sections that cover the theory behind image filtering, image smoothing and sharpening, quality metrics for evaluation, and an energy scalable Gaussian smoothing filter architecture. It concludes by discussing results and benefits of implementing the filter using fixed-point arithmetic on an FPGA.
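The fixed-point idea can be illustrated by quantising the kernel weights to integers with a power-of-two scale, as an FPGA datapath would. Kernel size, sigma, and the choice of 8 fractional bits below are assumptions, not the paper's parameters:

```python
import math

def fixed_point_gaussian_kernel(size=5, sigma=1.0, frac_bits=8):
    """Build a normalized 2D Gaussian kernel, then quantise each weight
    by scaling with 2**frac_bits and rounding to the nearest integer.
    Filtering then needs only integer multiply-adds plus a final right
    shift by frac_bits, instead of floating-point arithmetic."""
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(map(sum, kernel))
    scale = 1 << frac_bits
    return [[round(w / total * scale) for w in row] for row in kernel]
```

More fractional bits reduce quantisation error at the cost of wider multipliers, which is the area/accuracy trade-off the document analyses.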
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY (cscpconf)
This paper presents a parallel approach to address the time-complexity problem associated with sequential algorithms. An image steganography algorithm in the transform domain is considered for implementation. Image steganography is a technique to hide a secret message in an image. With the parallel implementation, a large message can be hidden in a large image, since it does not take much processing time. It is implemented on GPU systems, with parallel programming done using OpenCL on CUDA cores from NVIDIA. The speed-up improvement obtained is very good, with reasonably good output signal quality, when a large amount of data is processed.
The document discusses the evolution of GPU architecture and capabilities over time. It describes how GPUs have become massively parallel processors with programmable capabilities beyond just graphics. The document outlines the core components of a GPU including the graphics pipeline and programming model. It also discusses how GPUs are well suited for parallel, data-intensive applications and how their capabilities have expanded into general purpose computing through technologies like CUDA.
An Investigation towards Effectiveness in Image Enhancement Process in MPSoC (IJECEIAES)
The document discusses image enhancement techniques for use in multiprocessor systems-on-chip (MPSoC). It reviews existing image enhancement methods implemented on FPGA and identifies limitations like low accuracy. The paper proposes using advanced bus architectures like AMBA/AXI in MPSoC to improve communication between memory and input medical images for enhancement, reducing heterogeneity effects. It summarizes various image enhancement algorithms that could be implemented on reconfigurable FPGA hardware for use in MPSoC, including brightness control, contrast stretching, histogram equalization, and edge detection techniques.
A Graphics Processing Unit (GPU) is a processor or electronic chip for graphics. GPUs are massively parallel processors widely used for 3D graphics and many non-graphics applications. As the demand for graphics applications increases, the GPU has become indispensable. The use of GPUs has now matured to a point where there are countless industrial applications. This paper provides a brief introduction to GPUs, their properties, and their applications. Matthew N. O. Sadiku | Adedamola A. Omotoso | Sarhan M. Musa "Graphics Processing Unit: An Introduction" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-1, December 2019, URL: https://www.ijtsrd.com/papers/ijtsrd29647.pdf Paper URL: https://www.ijtsrd.com/engineering/electrical-engineering/29647/graphics-processing-unit-an-introduction/matthew-n-o-sadiku
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS (IRJET Journal)
The document discusses face counting using OpenCV and Python by analyzing unusual events in crowds. It proposes using the Haar cascade algorithm for face detection and counting. Feature extraction is performed using gray-level co-occurrence matrix (GLCM) to extract texture and edge features. Discriminant analysis is then used to differentiate between samples accurately. The system aims to correctly detect and count faces in images using Python tools like OpenCV for digital image processing tasks and feature extraction algorithms like GLCM and discrete wavelet transform (DWT). It is intended to have good recognition accuracy compared to previous methods.
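The GLCM feature-extraction step mentioned above works by counting co-occurring grey levels at a fixed spatial offset. A minimal sketch follows (a horizontal one-pixel offset is assumed; texture statistics such as contrast and homogeneity are then derived from this matrix):

```python
def glcm(img, levels, dy=0, dx=1):
    """Gray-level co-occurrence matrix for one offset: entry [i][j]
    counts how often grey level i occurs with grey level j at
    displacement (dy, dx) in a quantised image."""
    m = [[0] * levels for _ in range(levels)]
    rows, cols = len(img), len(img[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                m[img[r][c]][img[r2][c2]] += 1
    return m
```

In practice several offsets (0°, 45°, 90°, 135°) are accumulated and the matrix is normalised before computing texture descriptors.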
HARDWARE/SOFTWARE CO-DESIGN OF A 2D GRAPHICS SYSTEM ON FPGA (ijesajournal)
This document describes the hardware/software co-design of a 2D graphics system implemented on an FPGA. It discusses the hardware design which includes developing Bresenham and BitBLT IP cores to accelerate computationally intensive 2D graphics operations. It also discusses the software design which includes graphics drivers and APIs running on a CPU core to initialize and manage the graphics creation process by driving the IP cores. The system is aimed to benefit low-end embedded applications by providing reconfigurable 2D graphics capabilities on FPGA.
IRJET - Different Approaches for Implementation of Fractal Image Compressi... (IRJET Journal)
This document discusses different approaches for implementing fractal image compression on medical images using parallel processing on GPUs. It analyzes fractal image compression algorithms in MATLAB to compress medical images with very low loss in image quality. The key aspects covered are:
1. Fractal image compression uses self-similarity within images to achieve high compression ratios while preserving image quality.
2. Implementing fractal compression in parallel on GPUs can significantly reduce computational time compared to CPU implementations, as the redundant processing can be distributed across multiple processors.
3. The document implements and evaluates different fractal compression algorithms on MATLAB to compress medical images with low signal-to-noise ratio, high compression ratio, and reduced encoding time.
HARDWARE SOFTWARE CO-SIMULATION OF MOTION ESTIMATION IN H.264 ENCODER (cscpconf)
This paper addresses motion estimation in the H.264/AVC encoder. Compared with standards such as MPEG-2 and MPEG-4 Visual, H.264 can deliver better image quality at the same compressed bit rate, or the same quality at a lower bit rate. The increase in compression efficiency comes at the expense of increased complexity, which must be overcome. An efficient co-design methodology is required, in which the encoder software application is highly optimized and structured in a very modular and efficient manner, so as to allow its most complex and time-consuming operations to be offloaded to dedicated hardware accelerators. The motion estimation algorithm, the most computationally intensive part of the encoder, is simulated using MATLAB. The hardware/software co-simulation is done using the System Generator tool and implemented on a Xilinx Spartan-3E FPGA for different scanning methods.
Graphics processing units (GPUs) are increasingly being used for general-purpose computing applications due to their highly parallel and programmable nature. GPU computing uses the GPU alongside the CPU in a heterogeneous model, with the sequential CPU portion handling control flow and passing data to the GPU for parallel intensive computations. GPUs have evolved from fixed-function processors into fully programmable parallel processors. Many applications that require large amounts of parallelism and throughput can benefit from offloading work to the GPU. GPU architectures provide a high degree of parallelism through multiple stream processors that can execute the same instructions on different data sets. Software environments like CUDA and OpenCL allow general-purpose programming of GPUs for applications beyond graphics. Future improvements may include
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe novel implementations of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), which is suitable for real-time applications like analyzing video in real-time vision systems. Modern graphics cards contain a large number of parallel processors and high-bandwidth memory for accelerating data computation operations. We give a general idea of how to accelerate real-time applications using heterogeneous platforms, and we propose using the added resources to run more computationally involved optimization methods. This proposed approach will indirectly accelerate a database by producing better plan quality.
Orthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives (Matt Simons)
This document summarizes a project that implemented the Orthogonal Matching Pursuit algorithm in two dimensions (OMP2D) in Java and created an ImageJ plugin to apply it. It discusses the algorithm, details the Java implementation and optimizations, and proposes methods for accelerating it using GPUs. The author created a fully functional OMP2D ImageJ plugin with good performance compared to other implementations. The open source software and documentation are publicly available. The document outlines how further speed improvements could be achieved through mass parallelization on GPUs.
Study of Energy Efficient Images with Just Noticeable Difference Threshold Ba... (ijtsrd)
This document presents a novel method for producing energy-efficient images using a Feature Transform based Just Noticeable Difference Threshold (FTJNDT) model. The proposed method aims to reduce image energy consumption on displays like OLED by lowering pixel luminance below the just-noticeable difference threshold while maintaining perceptual quality. The FTJNDT model determines individual luminance thresholds for each image block based on visual saliency and non-linear modulation functions. An optimization framework is used to estimate modulation parameters and feature values using an objective image quality assessment. Experimental results showed the method reduced image energy consumption by an average of 4.31% compared to original images.
This document discusses accelerating the seam carving algorithm for image resizing using CUDA (Compute Unified Device Architecture) on a GPU (graphics processing unit). Seam carving is a content-aware image resizing technique that identifies paths of least importance (seams) through an image that can be removed or inserted to change the image size. The document explains that seam carving involves large matrix calculations that can be significantly accelerated by implementing them in parallel on a CUDA-enabled GPU. It presents the seam carving algorithm, describes CUDA and GPU architecture, proposes implementing seam carving using CUDA to achieve speed-ups, and concludes that parallelizing seam carving calculations on a GPU exploits its massive parallelism for faster execution compared to
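The "large matrix calculations" at the heart of seam carving are a dynamic-programming recurrence over the energy map. A sequential sketch of finding one minimal vertical seam follows; the per-row cost update, where each cell depends only on the previous row, is the part that maps naturally onto CUDA threads:

```python
def min_vertical_seam(energy):
    """Find the vertical path of least cumulative energy: one pixel per
    row, moving at most one column left or right between rows.
    Returns the seam's column index for each row, top to bottom."""
    rows, cols = len(energy), len(energy[0])
    cost = [energy[0][:]]
    for r in range(1, rows):
        prev = cost[-1]
        # Each cell's cost depends only on its three upper neighbours,
        # so every column in a row can be computed in parallel.
        cost.append([energy[r][c] + min(prev[max(c - 1, 0):c + 2])
                     for c in range(cols)])
    # Backtrack from the cheapest bottom cell.
    c = min(range(cols), key=lambda j: cost[-1][j])
    seam = [c]
    for r in range(rows - 2, -1, -1):
        lo = max(c - 1, 0)
        c = min(range(lo, min(c + 2, cols)), key=lambda j: cost[r][j])
        seam.append(c)
    return seam[::-1]
```

Removing the returned column from each row shrinks the image by one pixel in width while avoiding high-energy (visually important) content; repeating the pass resizes further.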
Similar to Real-Time Implementation and Performance Optimization of Local Derivative Pattern Algorithm on GPUs (20)
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to
precisely delineate tumor boundaries from magnetic resonance imaging (MRI)
scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating
the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The
model is rigorously trained and evaluated, exhibiting remarkable performance
metrics, including an impressive global accuracy of 99.286%, a high-class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted
IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of
our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical
image analysis and enhance healthcare outcomes. This research paves the way
for future exploration and optimization of advanced CNN models in medical
imaging, emphasizing addressing false positives and resource efficiency.
Embedded machine learning-based road conditions and driving behavior monitoringIJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Neural network optimizer of proportional-integral-differential controller par...IJECEIAES
Wide application of proportional-integral-differential (PID)-regulator in industry requires constant improvement of methods of its parameters adjustment. The paper deals with the issues of optimization of PID-regulator parameters with the use of neural network technology methods. A methodology for choosing the architecture (structure) of neural network optimizer is proposed, which consists in determining the number of layers, the number of neurons in each layer, as well as the form and type of activation function. Algorithms of neural network training based on the application of the method of minimizing the mismatch between the regulated value and the target value are developed. The method of back propagation of gradients is proposed to select the optimal training rate of neurons of the neural network. The neural network optimizer, which is a superstructure of the linear PID controller, allows increasing the regulation accuracy from 0.23 to 0.09, thus reducing the power consumption from 65% to 53%. The results of the conducted experiments allow us to conclude that the created neural superstructure may well become a prototype of an automatic voltage regulator (AVR)-type industrial controller for tuning the parameters of the PID controller.
An improved modulation technique suitable for a three level flying capacitor ...IJECEIAES
This research paper introduces an innovative modulation technique for controlling a 3-level flying capacitor multilevel inverter (FCMLI), aiming to streamline the modulation process in contrast to conventional methods. The proposed
simplified modulation technique paves the way for more straightforward and
efficient control of multilevel inverters, enabling their widespread adoption and
integration into modern power electronic systems. Through the amalgamation of
sinusoidal pulse width modulation (SPWM) with a high-frequency square wave
pulse, this controlling technique attains energy equilibrium across the coupling
capacitor. The modulation scheme incorporates a simplified switching pattern
and a decreased count of voltage references, thereby simplifying the control
algorithm.
A review on features and methods of potential fishing zoneIJECEIAES
This review focuses on the importance of identifying potential fishing zones in seawater for sustainable fishing practices. It explores features like sea surface temperature (SST) and sea surface height (SSH), along with classification methods such as classifiers. The features like SST, SSH, and different classifiers used to classify the data, have been figured out in this review study. This study underscores the importance of examining potential fishing zones using advanced analytical techniques. It thoroughly explores the methodologies employed by researchers, covering both past and current approaches. The examination centers on data characteristics and the application of classification algorithms for classification of potential fishing zones. Furthermore, the prediction of potential fishing zones relies significantly on the effectiveness of classification algorithms. Previous research has assessed the performance of models like support vector machines, naïve Bayes, and artificial neural networks (ANN). In the previous result, the results of support vector machine (SVM) were 97.6% more accurate than naive Bayes's 94.2% to classify test data for fisheries classification. By considering the recent works in this area, several recommendations for future works are presented to further improve the performance of the potential fishing zone models, which is important to the fisheries community.
Electrical signal interference minimization using appropriate core material f...IJECEIAES
As demand for smaller, quicker, and more powerful devices rises, Moore's law is strictly followed. The industry has worked hard to make little devices that boost productivity. The goal is to optimize device density. Scientists are reducing connection delays to improve circuit performance. This helped them understand three-dimensional integrated circuit (3D IC) concepts, which stack active devices and create vertical connections to diminish latency and lower interconnects. Electrical involvement is a big worry with 3D integrates circuits. Researchers have developed and tested through silicon via (TSV) and substrates to decrease electrical wave involvement. This study illustrates a novel noise coupling reduction method using several electrical involvement models. A 22% drop in electrical involvement from wave-carrying to victim TSVs introduces this new paradigm and improves system performance even at higher THz frequencies.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Bibliometric analysis highlighting the role of women in addressing climate ch...IJECEIAES
Fossil fuel consumption increased quickly, contributing to climate change
that is evident in unusual flooding and draughts, and global warming. Over
the past ten years, women's involvement in society has grown dramatically,
and they succeeded in playing a noticeable role in reducing climate change.
A bibliometric analysis of data from the last ten years has been carried out to
examine the role of women in addressing the climate change. The analysis's
findings discussed the relevant to the sustainable development goals (SDGs),
particularly SDG 7 and SDG 13. The results considered contributions made
by women in the various sectors while taking geographic dispersion into
account. The bibliometric analysis delves into topics including women's
leadership in environmental groups, their involvement in policymaking, their
contributions to sustainable development projects, and the influence of
gender diversity on attempts to mitigate climate change. This study's results
highlight how women have influenced policies and actions related to climate
change, point out areas of research deficiency and recommendations on how
to increase role of the women in addressing the climate change and
achieving sustainability. To achieve more successful results, this initiative
aims to highlight the significance of gender equality and encourage
inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter...IJECEIAES
The active and reactive load changes have a significant impact on voltage
and frequency. In this paper, in order to stabilize the microgrid (MG) against
load variations in islanding mode, the active and reactive power of all
distributed generators (DGs), including energy storage (battery), diesel
generator, and micro-turbine, are controlled. The micro-turbine generator is
connected to MG through a three-phase to three-phase matrix converter, and
the droop control method is applied for controlling the voltage and
frequency of MG. In addition, a method is introduced for voltage and
frequency control of micro-turbines in the transition state from gridconnected mode to islanding mode. A novel switching strategy of the matrix
converter is used for converting the high-frequency output voltage of the
micro-turbine to the grid-side frequency of the utility system. Moreover,
using the switching strategy, the low-order harmonics in the output current
and voltage are not produced, and consequently, the size of the output filter
would be reduced. In fact, the suggested control strategy is load-independent
and has no frequency conversion restrictions. The proposed approach for
voltage and frequency regulation demonstrates exceptional performance and
favorable response across various load alteration scenarios. The suggested
strategy is examined in several scenarios in the MG test systems, and the
simulation results are addressed.
Enhancing battery system identification: nonlinear autoregressive modeling fo...IJECEIAES
Precisely characterizing Li-ion batteries is essential for optimizing their
performance, enhancing safety, and prolonging their lifespan across various
applications, such as electric vehicles and renewable energy systems. This
article introduces an innovative nonlinear methodology for system
identification of a Li-ion battery, employing a nonlinear autoregressive with
exogenous inputs (NARX) model. The proposed approach integrates the
benefits of nonlinear modeling with the adaptability of the NARX structure,
facilitating a more comprehensive representation of the intricate
electrochemical processes within the battery. Experimental data collected
from a Li-ion battery operating under diverse scenarios are employed to
validate the effectiveness of the proposed methodology. The identified
NARX model exhibits superior accuracy in predicting the battery's behavior
compared to traditional linear models. This study underscores the
importance of accounting for nonlinearities in battery modeling, providing
insights into the intricate relationships between state-of-charge, voltage, and
current under dynamic conditions.
Smart grid deployment: from a bibliometric analysis to a surveyIJECEIAES
Smart grids are one of the last decades' innovations in electrical energy.
They bring relevant advantages compared to the traditional grid and
significant interest from the research community. Assessing the field's
evolution is essential to propose guidelines for facing new and future smart
grid challenges. In addition, knowing the main technologies involved in the
deployment of smart grids (SGs) is important to highlight possible
shortcomings that can be mitigated by developing new tools. This paper
contributes to the research trends mentioned above by focusing on two
objectives. First, a bibliometric analysis is presented to give an overview of
the current research level about smart grid deployment. Second, a survey of
the main technological approaches used for smart grid implementation and
their contributions are highlighted. To that effect, we searched the Web of
Science (WoS), and the Scopus databases. We obtained 5,663 documents
from WoS and 7,215 from Scopus on smart grid implementation or
deployment. With the extraction limitation in the Scopus database, 5,872 of
the 7,215 documents were extracted using a multi-step process. These two
datasets have been analyzed using a bibliometric tool called bibliometrix.
The main outputs are presented with some recommendations for future
research.
Use of analytical hierarchy process for selecting and prioritizing islanding ...IJECEIAES
One of the problems that are associated to power systems is islanding
condition, which must be rapidly and properly detected to prevent any
negative consequences on the system's protection, stability, and security.
This paper offers a thorough overview of several islanding detection
strategies, which are divided into two categories: classic approaches,
including local and remote approaches, and modern techniques, including
techniques based on signal processing and computational intelligence.
Additionally, each approach is compared and assessed based on several
factors, including implementation costs, non-detected zones, declining
power quality, and response times using the analytical hierarchy process
(AHP). The multi-criteria decision-making analysis shows that the overall
weight of passive methods (24.7%), active methods (7.8%), hybrid methods
(5.6%), remote methods (14.5%), signal processing-based methods (26.6%),
and computational intelligent-based methods (20.8%) based on the
comparison of all criteria together. Thus, it can be seen from the total weight
that hybrid approaches are the least suitable to be chosen, while signal
processing-based methods are the most appropriate islanding detection
method to be selected and implemented in power system with respect to the
aforementioned factors. Using Expert Choice software, the proposed
hierarchy model is studied and examined.
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...IJECEIAES
The power generated by photovoltaic (PV) systems is influenced by
environmental factors. This variability hampers the control and utilization of
solar cells' peak output. In this study, a single-stage grid-connected PV
system is designed to enhance power quality. Our approach employs fuzzy
logic in the direct power control (DPC) of a three-phase voltage source
inverter (VSI), enabling seamless integration of the PV connected to the
grid. Additionally, a fuzzy logic-based maximum power point tracking
(MPPT) controller is adopted, which outperforms traditional methods like
incremental conductance (INC) in enhancing solar cell efficiency and
minimizing the response time. Moreover, the inverter's real-time active and
reactive power is directly managed to achieve a unity power factor (UPF).
The system's performance is assessed through MATLAB/Simulink
implementation, showing marked improvement over conventional methods,
particularly in steady-state and varying weather conditions. For solar
irradiances of 500 and 1,000 W/m2
, the results show that the proposed
method reduces the total harmonic distortion (THD) of the injected current
to the grid by approximately 46% and 38% compared to conventional
methods, respectively. Furthermore, we compare the simulation results with
IEEE standards to evaluate the system's grid compatibility.
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...IJECEIAES
Photovoltaic systems have emerged as a promising energy resource that
caters to the future needs of society, owing to their renewable, inexhaustible,
and cost-free nature. The power output of these systems relies on solar cell
radiation and temperature. In order to mitigate the dependence on
atmospheric conditions and enhance power tracking, a conventional
approach has been improved by integrating various methods. To optimize
the generation of electricity from solar systems, the maximum power point
tracking (MPPT) technique is employed. To overcome limitations such as
steady-state voltage oscillations and improve transient response, two
traditional MPPT methods, namely fuzzy logic controller (FLC) and perturb
and observe (P&O), have been modified. This research paper aims to
simulate and validate the step size of the proposed modified P&O and FLC
techniques within the MPPT algorithm using MATLAB/Simulink for
efficient power tracking in photovoltaic systems.
Adaptive synchronous sliding control for a robot manipulator based on neural ...IJECEIAES
Robot manipulators have become important equipment in production lines, medical fields, and transportation. Improving the quality of trajectory tracking for
robot hands is always an attractive topic in the research community. This is a
challenging problem because robot manipulators are complex nonlinear systems
and are often subject to fluctuations in loads and external disturbances. This
article proposes an adaptive synchronous sliding control scheme to improve trajectory tracking performance for a robot manipulator. The proposed controller
ensures that the positions of the joints track the desired trajectory, synchronize
the errors, and significantly reduces chattering. First, the synchronous tracking
errors and synchronous sliding surfaces are presented. Second, the synchronous
tracking error dynamics are determined. Third, a robust adaptive control law is
designed,the unknown components of the model are estimated online by the neural network, and the parameters of the switching elements are selected by fuzzy
logic. The built algorithm ensures that the tracking and approximation errors
are ultimately uniformly bounded (UUB). Finally, the effectiveness of the constructed algorithm is demonstrated through simulation and experimental results.
Simulation and experimental results show that the proposed controller is effective with small synchronous tracking errors, and the chattering phenomenon is
significantly reduced.
Remote field-programmable gate array laboratory for signal acquisition and de...IJECEIAES
A remote laboratory utilizing field-programmable gate array (FPGA) technologies enhances students’ learning experience anywhere and anytime in embedded system design. Existing remote laboratories prioritize hardware access and visual feedback for observing board behavior after programming, neglecting comprehensive debugging tools to resolve errors that require internal signal acquisition. This paper proposes a novel remote embeddedsystem design approach targeting FPGA technologies that are fully interactive via a web-based platform. Our solution provides FPGA board access and debugging capabilities beyond the visual feedback provided by existing remote laboratories. We implemented a lab module that allows users to seamlessly incorporate into their FPGA design. The module minimizes hardware resource utilization while enabling the acquisition of a large number of data samples from the signal during the experiments by adaptively compressing the signal prior to data transmission. The results demonstrate an average compression ratio of 2.90 across three benchmark signals, indicating efficient signal acquisition and effective debugging and analysis. This method allows users to acquire more data samples than conventional methods. The proposed lab allows students to remotely test and debug their designs, bridging the gap between theory and practice in embedded system design.
Detecting and resolving feature envy through automated machine learning and m...IJECEIAES
Efficiently identifying and resolving code smells enhances software project quality. This paper presents a novel solution, utilizing automated machine learning (AutoML) techniques, to detect code smells and apply move method refactoring. By evaluating code metrics before and after refactoring, we assessed its impact on coupling, complexity, and cohesion. Key contributions of this research include a unique dataset for code smell classification and the development of models using AutoGluon for optimal performance. Furthermore, the study identifies the top 20 influential features in classifying feature envy, a well-known code smell, stemming from excessive reliance on external classes. We also explored how move method refactoring addresses feature envy, revealing reduced coupling and complexity, and improved cohesion, ultimately enhancing code quality. In summary, this research offers an empirical, data-driven approach, integrating AutoML and move method refactoring to optimize software project quality. Insights gained shed light on the benefits of refactoring on code quality and the significance of specific features in detecting feature envy. Future research can expand to explore additional refactoring techniques and a broader range of code metrics, advancing software engineering practices and standards.
Smart monitoring technique for solar cell systems using internet of things ba...IJECEIAES
Rapidly and remotely monitoring and receiving the solar cell systems status parameters, solar irradiance, temperature, and humidity, are critical issues in enhancement their efficiency. Hence, in the present article an improved smart prototype of internet of things (IoT) technique based on embedded system through NodeMCU ESP8266 (ESP-12E) was carried out experimentally. Three different regions at Egypt; Luxor, Cairo, and El-Beheira cities were chosen to study their solar irradiance profile, temperature, and humidity by the proposed IoT system. The monitoring data of solar irradiance, temperature, and humidity were live visualized directly by Ubidots through hypertext transfer protocol (HTTP) protocol. The measured solar power radiation in Luxor, Cairo, and El-Beheira ranged between 216-1000, 245-958, and 187-692 W/m 2 respectively during the solar day. The accuracy and rapidity of obtaining monitoring results using the proposed IoT system made it a strong candidate for application in monitoring solar cell systems. On the other hand, the obtained solar power radiation results of the three considered regions strongly candidate Luxor and Cairo as suitable places to build up a solar cells system station rather than El-Beheira.
An efficient security framework for intrusion detection and prevention in int...IJECEIAES
Over the past few years, the internet of things (IoT) has advanced to connect billions of smart devices to improve quality of life. However, anomalies or malicious intrusions pose several security loopholes, leading to performance degradation and threat to data security in IoT operations. Thereby, IoT security systems must keep an eye on and restrict unwanted events from occurring in the IoT network. Recently, various technical solutions based on machine learning (ML) models have been derived towards identifying and restricting unwanted events in IoT. However, most ML-based approaches are prone to miss-classification due to inappropriate feature selection. Additionally, most ML approaches applied to intrusion detection and prevention consider supervised learning, which requires a large amount of labeled data to be trained. Consequently, such complex datasets are impossible to source in a large network like IoT. To address this problem, this proposed study introduces an efficient learning mechanism to strengthen the IoT security aspects. The proposed algorithm incorporates supervised and unsupervised approaches to improve the learning models for intrusion detection and mitigation. Compared with the related works, the experimental outcome shows that the model performs well in a benchmark dataset. It accomplishes an improved detection accuracy of approximately 99.21%.
Gas agency management system project report.pdfKamal Acharya
The project entitled "Gas Agency" is done to make the manual process easier by making it a computerized system for billing and maintaining stock. The Gas Agencies get the order request through phone calls or by personal from their customers and deliver the gas cylinders to their address based on their demand and previous delivery date. This process is made computerized and the customer's name, address and stock details are stored in a database. Based on this the billing for a customer is made simple and easier, since a customer order for gas can be accepted only after completing a certain period from the previous delivery. This can be calculated and billed easily through this. There are two types of delivery like domestic purpose use delivery and commercial purpose use delivery. The bill rate and capacity differs for both. This can be easily maintained and charged accordingly.
Software Engineering and Project Management - Introduction, Modeling Concepts...Prakhyath Rai
Introduction, Modeling Concepts and Class Modeling: What is Object orientation? What is OO development? OO Themes; Evidence for usefulness of OO development; OO modeling history. Modeling
as Design technique: Modeling, abstraction, The Three models. Class Modeling: Object and Class Concept, Link and associations concepts, Generalization and Inheritance, A sample class model, Navigation of class models, and UML diagrams
Building the Analysis Models: Requirement Analysis, Analysis Model Approaches, Data modeling Concepts, Object Oriented Analysis, Scenario-Based Modeling, Flow-Oriented Modeling, class Based Modeling, Creating a Behavioral Model.
Build the Next Generation of Apps with the Einstein 1 Platform.
Rejoignez Philippe Ozil pour une session de workshops qui vous guidera à travers les détails de la plateforme Einstein 1, l'importance des données pour la création d'applications d'intelligence artificielle et les différents outils et technologies que Salesforce propose pour vous apporter tous les bénéfices de l'IA.
Discover the latest insights on Data Driven Maintenance with our comprehensive webinar presentation. Learn about traditional maintenance challenges, the right approach to utilizing data, and the benefits of adopting a Data Driven Maintenance strategy. Explore real-world examples, industry best practices, and innovative solutions like FMECA and the D3M model. This presentation, led by expert Jules Oudmans, is essential for asset owners looking to optimize their maintenance processes and leverage digital technologies for improved efficiency and performance. Download now to stay ahead in the evolving maintenance landscape.
Software Engineering and Project Management - Software Testing + Agile Method...Prakhyath Rai
Software Testing: A Strategic Approach to Software Testing, Strategic Issues, Test Strategies for Conventional Software, Test Strategies for Object -Oriented Software, Validation Testing, System Testing, The Art of Debugging.
Agile Methodology: Before Agile – Waterfall, Agile Development.
Digital Twins Computer Networking Paper Presentation.pptxaryanpankaj78
A Digital Twin in computer networking is a virtual representation of a physical network, used to simulate, analyze, and optimize network performance and reliability. It leverages real-time data to enhance network management, predict issues, and improve decision-making processes.
Applications of artificial Intelligence in Mechanical Engineering.pdfAtif Razi
Historically, mechanical engineering has relied heavily on human expertise and empirical methods to solve complex problems. With the introduction of computer-aided design (CAD) and finite element analysis (FEA), the field took its first steps towards digitization. These tools allowed engineers to simulate and analyze mechanical systems with greater accuracy and efficiency. However, the sheer volume of data generated by modern engineering systems and the increasing complexity of these systems have necessitated more advanced analytical tools, paving the way for AI.
AI offers the capability to process vast amounts of data, identify patterns, and make predictions with a level of speed and accuracy unattainable by traditional methods. This has profound implications for mechanical engineering, enabling more efficient design processes, predictive maintenance strategies, and optimized manufacturing operations. AI-driven tools can learn from historical data, adapt to new information, and continuously improve their performance, making them invaluable in tackling the multifaceted challenges of modern mechanical engineering.
ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 8, No. 6, December 2018 : 5457 - 5471
very large feature vector, which makes the algorithm computationally very expensive. Currently, with the advancements in digital sensor technology, images are being captured at very high resolutions, and extracting LDP features from large databases of such high-resolution images using general-purpose hardware becomes very challenging. This limits the applicable domains of this powerful, highly discriminative algorithm. LDP, like most image processing algorithms, performs the same computation on a large number of pixels, which is a typical data-parallel operation. Therefore, computing the LDP features from different regions of a high-resolution image in parallel reduces the computation time of the algorithm, making it efficient for real-time applications.
In recent years, the Graphics Processing Unit (GPU) has entered the general-purpose computing domain (GPGPU). In comparison to single-processor CPUs, GPGPUs offer very high computational power while being inexpensive and more scalable. Highly data-parallel image processing algorithms exploit the Single Instruction Multiple Data (SIMD) architecture and can be efficiently parallelized on a GPGPU [3]. However, traditional GPGPU development was based on graphics function libraries like OpenGL and Direct3D, which restricted its use to professionals familiar with graphics APIs. The emergence of the Compute Unified Device Architecture (CUDA), NVIDIA's parallel architecture implementation, removes this inconvenience, as it provides APIs for programmers to develop parallel applications using the C programming language. GPUs are now widely used for accelerating many efficient CBIR algorithms [4], [5] in the fields of image vision, medical imaging [6], remote sensing [7], etc. This paper proposes a
parallel implementation of the LDP algorithm on NVIDIA GPUs. The implementation exploits features such as the global memory, shared memory, L1 cache, CUDA streams and the optimal use of registers. We have measured the GPU implementations on the Fermi as well as the later Kepler architecture. To the best of our knowledge this is the first reported effort to parallelize the LDP algorithm using GPUs. The main contributions of the paper are summarized as follows: (1) We present an efficient strategy for parallel calculation of the LDP feature vector of an image by optimally using the GPU global memory and the faster shared memory. (2) We have further tuned the speedup of the LDP computations by using the L1 data cache. (3) We have accelerated the GPU operations using the concept of CUDA streams. (4) We have exploited the asynchronous nature of CUDA kernel calls by deploying calculations on the host while the device executes the GPU kernels, thereby extracting enhanced parallelism.
The rest of the paper is organized as follows. We begin with a discussion on the related work in
section 2. Section 3 reviews the mathematical foundation of the LDP algorithm and describes the main
concepts of programming GPUs using NVIDIA’s CUDA model. In section 4, we present our approach for
computing the LDP using the GPU. Finally, section 5 presents the experiments done and a detailed analysis
of the results that demonstrates the efficiency of our GPU algorithm and section 6 draws some conclusions
about the presented work and suggests some optimizations as future work.
2. RELATED WORK
The most computationally expensive part of the CBIR algorithms is the feature extraction. Color,
texture or shape features or a combination of these are generally extracted from the images for efficient
image matching and retrieval. GPGPUs are now being efficiently used for implementing this computationally
expensive feature extraction task as well as the matching task in CBIR algorithms. It is a well-known fact
that the GPUs perform much faster than CPUs and attain excellent speedup for image processing applications
which are specifically tailored for parallel programming. Several notable works showing the advantage of
using a GPU and the substantial amount of speed up obtained over the sequential version, for such
applications are available in the literature. Basic image processing algorithms like histogram equalization, Discrete Cosine Transform (DCT) encoding and decoding, cloud removal, etc. have been efficiently parallelized using CUDA and are reported to have attained excellent speedups [8]. In Ref. [9] a parallel version of the fractal image encoding algorithm is built on the GPU, and the results indicate that encoding was achieved in milliseconds without compromising the quality of the decoded image. Tek et al. [10] efficiently
parallelized a face recognition framework using CUDA and reported a speedup of 29x. Gangodkar et al. in
[11] implemented a parallel version of fast normalized cross correlation for real time template matching
which attained a speedup of 17x compared to the sequential execution. In Ref. [12] a parallel algorithm for
accelerating the image inpainting process is proposed. The intense computational task of processing the
fourth order PDE equations is accelerated by 36x using the NVIDIA GEFORCE GTX 670 GPU card. A
parallel implementation of content-based video search and retrieval of videos is proposed in Ref. [13]. The
Real-Time Implementation and Performance Optimization of Local Derivative … (Nisha Chandran S)
authors achieved around 80% accuracy in video search and retrieval and a speedup of 5x compared to the
sequential version of the algorithm. In Ref. [14] a parallel implementation of the audio compression
algorithm Apple Lossless Audio Codec (ALAC) is presented and a speedup of 80-90% is achieved. Heidari
et al. [15] proposed a parallel implementation of color moments and texture feature and achieved a speed up
of 144x over the sequential implementation. Zhu et al. [16] have parallelized the SURF algorithm and have
achieved a speedup of 15x over its sequential counterpart. In [17] Priscariu and Reid have implemented
Histogram of Oriented Gradients (HOG) in GPU and have achieved a speedup of 67x. In [18] Park et al.
explored the design and implementation issues of image processing algorithms on GPUs with CUDA
framework. They have selected multi view stereo matching, linear feature extraction, JPEG2000 image
encoding non-photorealistic rendering as example applications and efficiently parallelized them on the GPU.
In [19] the authors proposed an implementation of Ising simulation work using the CUDA framework. They
have attained a performance improvement of 20-40 times compared to the conventional CPU
implementation. In [20] the authors have implemented a parallel algorithm in GPU for value prediction and
speculative execution. They have observed that software value prediction techniques did not attain much
performance gain whereas the software speculation techniques attained considerable performance gain.
However, optimizing a parallel GPU code is not a trivial task. To achieve good performance the programmer
should fine tune the code taking into account the underlying architecture.
Several tuning strategies using different aspects of the GPU architectures have been discussed in the
standard CUDA books [21]-[23]. In [24] authors give insights into the relationship between occupancy,
thread block size and shape, cache hierarchy configuration and thread access pattern to global memory. By
taking simple matrix multiplication as an example they relate the thread block size and shape to both cache
misses and occupancy. Lobeiras et al. in [25], by taking Fast Fourier Transform (FFT) algorithm as an
example, have analyzed different aspects of the GPU architectures, like the influence of memory access
patterns and register pressure on performance. They have shown that the performance of register-based
solutions is much better than the shared memory due to the higher register bandwidth. In [26] Liang et al.
have fine tuned their 3D localization application in GPU by optimizing the number of threads per thread
block and registers per thread. By experimenting with different values of the thread block size and the
number of registers they have found an optimal setting which gives the best performance. Torres et al. in [27]
have studied the relation between different thread block sizes and shapes with the global memory and the
cache memory access patterns. In [28] Lee et al. have discussed optimization strategies for compute bound as
well as memory bound algorithms. They have optimized the compute bound algorithms by reducing the
number of registers and by increasing the thread block size whereas memory bound algorithms are optimized
by storing the data in limited GPU resources like constant or shared memory in an optimized way.
3. LDP DESCRIPTOR AND USAGE FOR IMAGE RETRIEVAL
The LDP algorithm captures very detailed discriminative features of an image as it encodes higher-order derivative information. For an image I(Z), the first-order derivatives along the 0°, 45°, 90° and 135° directions are denoted as I'_α(Z), where α = 0°, 45°, 90°, 135°. Let us consider a pixel Z_0 in I(Z) and its eight neighbors Z_i, i = 1, …, 8, as shown in Figure 1a. The four first-order derivatives [2] for the pixel Z_0 along the four directions are given in (1) to (4).
I'_{0°}(Z_0) = I(Z_0) - I(Z_4)    (1)

I'_{45°}(Z_0) = I(Z_0) - I(Z_3)    (2)

I'_{90°}(Z_0) = I(Z_0) - I(Z_2)    (3)

I'_{135°}(Z_0) = I(Z_0) - I(Z_1)    (4)
The directional differential channels for the four directions are as shown in Figure 1b.
The second-order directional LDP [2] for the pixel Z_0, LDP²_α(Z_0), in the direction α is given in (5).

LDP²_α(Z_0) = { f(I'_α(Z_0), I'_α(Z_1)), f(I'_α(Z_0), I'_α(Z_2)), …, f(I'_α(Z_0), I'_α(Z_8)) }    (5)
4. ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 8, No. 6, December 2018 : 5457 - 5471
5460
where f(·,·) is a binary encoding function that determines the transitions in the local pattern, '·' denotes multiplication, and f is given in (6).

f(I'_α(Z_0), I'_α(Z_i)) = { 0, if I'_α(Z_i) · I'_α(Z_0) > 0;  1, if I'_α(Z_i) · I'_α(Z_0) ≤ 0 },   i = 1, 2, …, 8    (6)

The third-order local derivative pattern is computed using the second-order derivatives. The third-order directional LDP [2] for the pixel Z_0, LDP³_α(Z_0), in the direction α is given in (7).

LDP³_α(Z_0) = { f(I''_α(Z_0), I''_α(Z_1)), f(I''_α(Z_0), I''_α(Z_2)), …, f(I''_α(Z_0), I''_α(Z_8)) }    (7)
In general, the nth-order LDP is computed using the (n-1)th-order derivatives. The nth-order directional LDP [2] for the pixel Z_0, LDPⁿ_α(Z_0), in the direction α is given in (8).

LDPⁿ_α(Z_0) = { f(I^(n-1)_α(Z_0), I^(n-1)_α(Z_1)), f(I^(n-1)_α(Z_0), I^(n-1)_α(Z_2)), …, f(I^(n-1)_α(Z_0), I^(n-1)_α(Z_8)) }    (8)
Figure 1. (a) Schematic showing the location of the pixel Z_0 and the neighboring pixels Z_1…Z_8 used in the LDP algorithm [2]; (b) Schematic showing the four directional differential channels
Higher-order LDPs extract higher spatial frequencies that contain the finer details of an image. This is very useful for image retrieval and pattern matching operations. However, if the image contains noise, the higher-order LDPs also enhance the noise, especially from the fourth order onwards. LDP thus represents the higher-order pattern information by labeling the pixels of an image, comparing the derivatives along two directions at neighboring pixels and concatenating the results into a 32-bit binary sequence. Figures 2b and 2c show the visualization of the second-order and third-order LDP in the zero direction, respectively, for the example image shown in Figure 2a. As the figures show, the third-order LDP gives more detailed information than the second order. We have used the third-order LDP in our experiments as it extracts higher spatial frequencies and thereby captures the finer details of an image. As preliminary work, we implemented the LDP algorithm for medium-sized images using only the GPU global memory [29]. The pseudo code of the LDP algorithm is given in Table 1 and its implementation using the global memory space can be found in [29].
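As a concrete illustration of the encoding above, the following plain-C sketch (our own illustrative reference code, not the implementation from [2] or [29]) computes the 8-bit second-order LDP pattern of one interior pixel along the 0° direction. The neighbor ordering Z_1…Z_8 (top-left, then clockwise) and the bit order are assumptions based on Figure 1a.

```c
#include <assert.h>  /* used only by the accompanying checks */

#define W 8  /* illustrative image width and height */

/* First-order derivative along 0 degrees, eq. (1):
   I'(Z) = I(Z) - I(right neighbor). */
static int deriv0(const unsigned char img[W][W], int r, int c) {
    return (int)img[r][c] - (int)img[r][c + 1];
}

/* Second-order LDP of pixel (r, c) along 0 degrees, eqs. (5) and (6):
   bit i is 0 if I'(Z0)*I'(Zi) > 0 (same sign), else 1, for neighbors Z1..Z8.
   Assumed neighbor order: top-left, top, top-right, right, bottom-right,
   bottom, bottom-left, left. Valid for interior pixels only. */
unsigned ldp2_0deg(const unsigned char img[W][W], int r, int c) {
    static const int dr[8] = {-1, -1, -1, 0, 1, 1,  1,  0};
    static const int dc[8] = {-1,  0,  1, 1, 1, 0, -1, -1};
    int d0 = deriv0(img, r, c);
    unsigned pattern = 0;
    for (int i = 0; i < 8; ++i) {
        int di = deriv0(img, r + dr[i], c + dc[i]);
        pattern = (pattern << 1) | (unsigned)((d0 * di > 0) ? 0 : 1); /* eq. (6) */
    }
    return pattern;
}
```

The full descriptor repeats this for all four directions, and for the third order on the second-order derivatives, before histogramming the patterns per sub-block.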
3.1. GPU and NVIDIA CUDA
This section presents an overview of the basic GPGPU architecture. A GPGPU generally consists of hundreds of streaming processors (SPs), called cores, which are grouped into Streaming Multiprocessors (SMPs) [21]. The cores perform data processing in the Single Instruction Multiple Thread (SIMT) model. An algorithm is mapped to the GPU by first identifying its highly parallel phases. These parallel phases, called CUDA kernels, are offloaded to the GPU, while the remaining phases are mapped onto the CPU. On completion of a CUDA kernel, the CPU collects the execution results and the GPU is scheduled with a new kernel for execution. CUDA exploits data-level parallelism by making every thread work on a different set of data. The lightweight CUDA threads combine to form thread blocks, and the thread blocks combine to form a computational grid. Grids and thread blocks can be arranged in one, two or three dimensions. Each thread and thread block can be identified by a unique id, and a kernel is launched as a grid of thread blocks.
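As an illustration of this mapping (a sketch only, with kernel and variable names of our own choosing, not the paper's actual code), a per-pixel CUDA kernel and its launch as a grid of 16x16 thread blocks might look like:

```cuda
// One thread per pixel; a 2-D grid of 2-D thread blocks covers the image.
__global__ void per_pixel_kernel(const unsigned char *img, unsigned char *out,
                                 int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (x < width && y < height)                    // guard for partial edge blocks
        out[y * width + x] = img[y * width + x];    // placeholder per-pixel work
}

// Host side: launch the kernel over the whole image.
void launch(const unsigned char *d_img, unsigned char *d_out, int w, int h) {
    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((w + block.x - 1) / block.x,           // enough blocks to cover w x h
              (h + block.y - 1) / block.y);
    per_pixel_kernel<<<grid, block>>>(d_img, d_out, w, h);
}
```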
Figure 2. (a) Sample query image from the Corel database; (b) Second-order LDP of the query image in the zero direction (decimal equivalent of the binary pattern shown); (c) Third-order LDP of the query image in the zero direction (decimal equivalent of the binary pattern shown)
Fermi [30], the world's first complete GPU computing architecture, was a significant leap forward from the G80 chipset series. The SMPs in the Fermi architecture consist of 32 cores, each of which can execute one floating-point or integer instruction per clock. The Kepler architecture, launched in 2012, is a later NVIDIA generation of GPU architectures built on the foundation of Fermi. Each SMP in the Kepler architecture has 192 single-precision CUDA cores, each with floating-point and integer arithmetic logic units. The Tesla K40c, launched in 2013, is NVIDIA's flagship compute GPU, based on the Kepler GK110 architecture [31]. With 12GB of GDDR5 memory, it provides up to 4.29 TFLOPS of single-precision and 1.43 TFLOPS of double-precision floating-point performance. To hide the long off-chip memory access
latencies, NVIDIA Fermi and Kepler GPUs include a software-managed cache, the shared memory, and a hardware-managed cache, the L1 data cache. Shared memory is on chip, and the CUDA C compiler treats variables in shared memory differently from typical variables: it creates a copy of the variable for each block launched on the GPU. Every thread in a block shares this memory and can thus communicate and collaborate on the computations. Furthermore, the shared memory and the L1 data cache on a GPU utilize the same physical storage, and their capacities can be configured at runtime [32]. Each SM in the Fermi and Kepler architectures has 64KB of on-chip memory, divided into shared memory and L1 cache. Programmers may choose between two configurations: 48KB of shared memory with 16KB of L1 cache, or 16KB of shared memory with 48KB of L1 cache. However, increasing the cache reduces the shared memory to 16KB, which can lead to a reduction in the occupancy of the SMs.
In addition to shared memory and L1 cache, CUDA streams [33], [34] also play an important role in accelerating the computations done on the GPU. A CUDA stream represents a queue of GPU operations that are executed in a specific order. Both kernel launches and memory copies can be added to a stream, and the order in which the operations are added specifies the order in which they will be executed. When using streams, instead of copying the input buffers entirely to the GPU, they are split into smaller chunks, and the memory copies to and from the GPU as well as the kernel launches are performed on each chunk.
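The chunked, overlapped execution described above might be organized as follows (an illustrative two-stream sketch; buffer and kernel names are ours, host buffers are assumed pinned via cudaHostAlloc, and error checking is omitted):

```cuda
cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

for (int c = 0; c < nChunks; ++c) {
    cudaStream_t st = s[c % 2];        // alternate between the two streams
    size_t off = (size_t)c * chunkBytes;
    // Copy one input chunk asynchronously; it can overlap the previous
    // chunk's kernel, which is running in the other stream.
    cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                    cudaMemcpyHostToDevice, st);
    ldp_kernel<<<gridPerChunk, threadsPerBlock, 0, st>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                    cudaMemcpyDeviceToHost, st);
}
for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);
```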
Table 1. The pseudo code for the LDP calculation [29]
Algorithm: Pseudo code for the LDP calculation
Input: Query image; Output: Retrieval result
Step 1: Divide the input image of N × N pixels into k × k sub-images. Each sub-image has (N/k) × (N/k) pixels.
Step 2: For i = 1 to k² do steps 3 to 5.
Step 3: For the image pixel I(Z_0), the location of which is shown in Figure 1, calculate the first-order derivatives along the 0°, 45°, 90° and 135° directions as given in (1) to (4).
Step 4: Calculate the binary encoding function as given in (6).
Step 5: Calculate the histograms of the binary patterns.
Step 6: Concatenate the histograms of all the sub-images to construct the feature vector of the image.
4. PARALLEL IMPLEMENTATION OF THE LDP ALGORITHM ON THE GPU
The approach adopted to parallelize the LDP algorithm is to exploit the data-level parallelism inherent in the algorithm. The algorithm is divided into two parts, CPU (host) and GPU (device) processing, as shown in Figure 3. The image is read by the host and copied to the device memory; the spatial histograms of the image are computed on the device and copied back to the host memory. The images are divided into 16x16 sub-blocks and the blocks are processed by the GPU. To show the computation of the LDP, a sample 7x7 block is taken as shown in Figure 4. A block of 9 threads is required to calculate the third-order LDP of the black square region in Figure 4. The LDP patterns for the block are calculated using the binary encoding functions given
in Equation (7). The binary encoding functions for the pixel Z_16, the position of which is shown in Figure 4, for the four directions 0°, 45°, 90° and 135° are given in (9).

LDP³_α(Z_16) = { f(I''_α(Z_16), I''_α(Z_8)), f(I''_α(Z_16), I''_α(Z_9)), f(I''_α(Z_16), I''_α(Z_10)), f(I''_α(Z_16), I''_α(Z_17)), f(I''_α(Z_16), I''_α(Z_24)), f(I''_α(Z_16), I''_α(Z_23)), f(I''_α(Z_16), I''_α(Z_22)), f(I''_α(Z_16), I''_α(Z_15)) },  where α = 0°, 45°, 90°, 135°    (9)
4.1. Parallelization of the LDP Algorithm
This section describes how our implementation parallelizes the LDP algorithm. The algorithm is divided into two main steps: the LDP histogram calculation and the histogram intersection. Of the two, the step most critical to the total execution time is the LDP histogram calculation, and therefore all optimizations are done to speed up this kernel. The GPU algorithm thus consists of two kernels that execute the operations of the two basic steps of the LDP algorithm.
The LDP_compute kernel computes the LDP values of all the pixels in a 16x16 block of the image along all four directions 0°, 45°, 90° and 135°. The sum of the decimal equivalents of the patterns is then used to form the LDP histogram. A thread is allocated for each pixel of the image block, and each thread calculates and stores its LDP histogram values. The histogram values are then transferred to the host for normalization.
Figure 3. Steps in the computation of LDP
Figure 4. Image block of size 7x7 to demonstrate the computation of a thread block.
LDP of the region in black rectangle is to be computed
The Hist_intersect kernel computes the histogram intersection of the normalized query image LDP histogram and a database image histogram. Histogram intersection [2] is used to measure the similarity between two histograms and is given in (10).

S_HI(H, S) = Σ_{i=1}^{B} min(H_i, S_i)    (10)

where S_HI(H, S) is the histogram intersection statistic and B is the number of bins. This measure calculates the common parts of two histograms; histogram intersection thus finds the common parts of the query image histogram and the database image histograms. Depending upon the size of the histograms, threads are allocated for comparing the elements of the query and database image histograms.
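The per-bin minimum of eq. (10) is simple to state in code. The following plain-C reference (our illustrative version; the GPU kernel distributes the bins over threads instead of looping) computes the intersection of two B-bin histograms:

```c
#include <assert.h>  /* used only by the accompanying checks */

/* Histogram intersection, eq. (10): S_HI(H, S) = sum over i of min(H_i, S_i).
   For normalized histograms the result lies in [0, 1], reaching 1 for
   identical histograms. B is the number of bins. */
double hist_intersect(const double *H, const double *S, int B) {
    double sim = 0.0;
    for (int i = 0; i < B; ++i)
        sim += (H[i] < S[i]) ? H[i] : S[i];  /* per-bin minimum */
    return sim;
}
```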
4.2. Design Space Exploration and Optimization
In the following sections we discuss in detail the architecture related design space exploration and
optimization focusing mainly on the CUDA thread block sizes, registers, memory spaces and CUDA streams.
4.2.1 Thread Block Size
The general methodology adopted when optimizing a GPU-accelerated application is to increase the occupancy of the SMPs. It is essential to run as many threads as possible on the SMPs to hide the long memory latency. To increase occupancy, either the number of registers per thread has to be decreased or the number of threads per thread block has to be adjusted according to the number of registers per thread. The occupancy of a GPU of compute capability 2.1 for 42 registers per thread is more than 68% using 128, 192 or 256 threads, whereas with 64 threads it is 33%. We explore the optimum thread block size for different image sizes that gives the maximum occupancy as well as the least execution time.
4.2.2 Registers
Reducing the number of registers used by the threads is another option to increase occupancy. The CUDA nvcc compiler performs large-scale register allocation to remove instruction dependencies and to maximize single-thread performance [27], [35]. Registers being a limited resource per block, an incremental increase in register allocation will lead to fewer running thread blocks. Thus, there is a tradeoff between single-thread performance and the number of simultaneously active threads. In [35] the authors use a metric, RegRatio, which is the ratio of the registers used per thread to the maximum allowed registers per thread. Using this metric we have optimized the usage of the registers: we keep the RegRatio value between 0 and 1 and choose a high value of RegRatio so that there is a good tradeoff between single-thread performance and the number of simultaneously active threads. For a particular kernel, the maximum number of registers per thread can be specified to the nvcc compiler at compile time using the maxrregcount option; the actual register use may be less than this limit. We have also verified the performance gain achieved using RegRatio with a heuristic approach in which, for each image size, thread block size and register count, the combination giving the maximum occupancy and minimum kernel execution time is found.
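For instance, a register cap could be imposed at compile time along the following lines (an illustrative command line; the file name ldp_gpu.cu is our placeholder, and 23 is the optimum we later report for 256x256 images):

```shell
# Limit every kernel in this compilation unit to 23 registers per thread.
nvcc -O3 -maxrregcount=23 -o ldp_gpu ldp_gpu.cu

# Report the register count actually chosen by ptxas for each kernel.
nvcc --ptxas-options=-v -c ldp_gpu.cu
```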
4.2.3 Shared Memory and L1 Cache
To further improve the performance of the LDP algorithm, in the next step we make use of the faster on-chip shared memory. The number of registers and the thread block size are kept at the optimum values obtained from the earlier study. Further, we choose between the two configurations, 48KB of shared memory with 16KB of L1 cache or vice versa, to study the performance improvement obtained by using the L1 cache instead of shared memory.
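On Fermi and Kepler this choice can be made per kernel through the CUDA runtime API, for example (the kernel names here are our own placeholders):

```cuda
// Favor 48KB shared memory / 16KB L1 for the shared-memory kernel version:
cudaFuncSetCacheConfig(ldp_compute_shared, cudaFuncCachePreferShared);

// Favor 48KB L1 / 16KB shared memory for the cache-reliant version:
cudaFuncSetCacheConfig(ldp_compute_cached, cudaFuncCachePreferL1);
```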
4.2.4 CUDA Streams
Fermi and Kepler architecture-based GPUs support device overlap, which is the capacity to execute a CUDA kernel while simultaneously performing a memory copy between device and host memory. This is facilitated using CUDA streams. The use of streams is very profitable in applications where the input data instances are independent. A stream is a sequence of commands that is executed in order. By default, a CUDA program has only one stream, called the default stream [22], [23]. When multiple streams are involved, the order in which the operations are issued becomes an important issue. There are two basic approaches to ordering the operations: breadth first and depth first. In the depth-first approach all operations of a given stream are grouped together, and in the breadth-first approach similar operations of different streams are grouped together. Correct results are obtained with both approaches, though they perform differently on specific generations of GPUs [34]. In [35] the authors have shown that for an application which uses D input data instances and defines B blocks of threads for kernel execution, the data can be broken into nStream streams. Each stream then works with D/nStream data instances and B/nStream blocks, and the memory copy of one stream overlaps the kernel execution of another, achieving a performance improvement. In our application the value of nStream varies with the size of the image.
5. EXPERIMENTS AND ANALYSIS OF RESULTS
All experiments in this paper are done on two different machines. Preliminary experiments are done on an Intel Core i3-2375M with 4GB RAM and a GeForce 720M (GPU 1) and are further repeated on an Intel Xeon CPU E5-1650 with 64GB RAM and a Tesla K40c (GPU 2). The authors of the original paper [2] reported the CPU execution time for the sequential LDP algorithm as 0.18 sec for an image of size 88x88 pixels on a PC with a 3 GHz CPU and 2 GB RAM. In our experiments we have used our own sequential implementation of
the algorithm. Using our implementation, the time taken for an 88x88-pixel image on GPU 1 is 0.078 sec and on GPU 2 is 0.036 sec. We have used high-resolution images ranging in size from 256x256 to 4096x4096 pixels for our experiments. The experiments for medium image sizes using the global memory on the GeForce 720M are presented in [29]. In this paper the LDP algorithm is implemented on large images using the fast shared memory and further improved using the L1 data cache and CUDA streams.
The GeForce 720M is a card of compute capability 2.1, in which a thread block can contain a maximum of 1024 threads and every SMP can process a maximum of 1536 threads, whereas the Tesla K40c is a card of compute capability 3.5, in which a thread block can contain a maximum of 1024 threads and every SMP can process a maximum of 2048 threads. The test images are taken from the Corel database [36] and the Essex face database [37]. The Corel database consists of various categories such as animals, cars, distractions and transport; for our experiments we have collected 1000 images from its various categories to form the dataset. The Essex face database consists of 20 images each of 153 persons, of which we have taken 10 images per person for our experiments. The approach followed in our parallel implementation is to achieve almost the same retrieval results as the sequential implementation. The images are divided into non-overlapping
image blocks of size 16x16. The experiment is repeated for different image sizes and the performance gain of
the kernel is analyzed. CUDA Visual Profiler is used to analyze the CUDA kernel performance [38]. We have
developed two kernels for image retrieval, one for LDP histogram computation and another for the similarity
matching using histogram intersection. For the implementation the query image is copied to the GPU global
memory. The image is divided into 16x16 blocks and the LDP histogram values of each block are calculated.
We have chosen a block size of 256 threads so that every SMP in GPU1 with compute capability 2.1 can
schedule six blocks whereas GPU2 with compute capability 3.5 can schedule eight blocks by optimum
resource utilization. The kernel is launched with the necessary threads to calculate the LDP histogram values.
The computed LDP histogram is then transferred back to the CPU for normalization. The matching is then
done by the second kernel using the histogram intersection method.
To avoid the CPU being idle while the GPU is calculating, we divide the task of histogram intersection between the two in such a way that the calculation is completed in minimum time. This is done by launching the kernel asynchronously, i.e. while the kernel is executing on the GPU, the CPU is not blocked and is free to perform its own computations. Our strategy for balancing the workload between CPU and GPU is as follows. Let R be the ratio of the time taken by the CPU to perform the task to the time taken by the GPU to perform the task, i.e. R = t_CPU / t_GPU. If T is the total number of images to be processed, then the number of images assigned to the GPU is T·R/(1+R) and the number of images assigned to the CPU is T/(1+R).
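A minimal sketch of this split (our own illustrative code; rounding to whole images is our choice) is:

```c
#include <assert.h>  /* used only by the accompanying checks */

/* Split T images between GPU and CPU so that both finish at about the same
   time. R = t_CPU / t_GPU for one image: the GPU gets T*R/(1+R) images and
   the CPU gets the remaining T/(1+R). */
void split_workload(int T, double R, int *to_gpu, int *to_cpu) {
    *to_gpu = (int)(T * R / (1.0 + R) + 0.5);  /* rounded to a whole image */
    *to_cpu = T - *to_gpu;                     /* remainder = T/(1+R) */
}
```

With T = 1000 and a speedup ratio R of 19, this reproduces the 950/50 allocation used below for the 256x256 images.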
Figure 5 shows the percentage allocation of the images to the CPU and GPU. The ratios CPU/T = 1/(1+R) and GPU/T = R/(1+R) are plotted against the logarithm of the speedup. From Figure 5 it can be seen that if the execution times taken by the CPU and the GPU are almost the same, the images can be divided equally between the CPU and the GPU. In our experiments we allotted 50 images of size 256x256 pixels to the CPU and 950 images to the GPU. For images of size 4096x4096 we allotted 30 images to the CPU and 970 to the GPU. As GPU 2, the Tesla machine, is more powerful and more suitable for scientific calculations, we have done all the computations on the GPU. In the following section we discuss the performance improvement achieved by our design space exploration.
Figure 5. Optimal work load balance between CPU and GPU (image allocation in % plotted against log10 of the speedup, with curves for the CPU and GPU allocations)
5.1. Optimizing the Number of Threads and Registers
We have found optimized values for the registers using the method discussed in section 4.2.2. We consider the number of threads per thread block (N) and the register allocation per thread (R) as input variables. The default register use (without a register limit) of our LDP implementation is 42. The maximum value for N is 768 due to the constraint of register use. We explore various combinations of N and R values and analyze the performance improvement; only a subset of the results is displayed here. As shown in Figures 6a and 6b, the design space exhibits high variation in terms of execution time, and the overall performance critically depends on both N and R. For example, for an image of size 256x256 the optimal setting is N=128 and R=23, whereas for an image of size 1024x1024 the optimal setting is N=256 and R=32. We have made the optimal choice by taking into consideration the maximum occupancy and the minimum time. With the default number of registers, 42 in our case, the maximum number of active threads per multiprocessor is limited to 768 on a device of compute capability 2.1. By decreasing the number of registers to the optimal value, i.e. 23, with a thread block size of 128, the number of active threads per multiprocessor is increased to 1024. On a device of compute capability 3.5 the number of threads is increased from 1280 to 2048 with the optimal choice of the number of registers per thread and the number of threads per thread block.
Figure 6. (a) Optimal no. of registers and thread block size for an image of size 256x256; (b) Optimal no. of registers and thread block size for an image of size 1024x1024
5.2. Implementation using Global Memory
The first version of the algorithm was implemented using the global memory space. The detailed
implementation procedure of the LDP algorithm using the global memory and the time comparison between
the sequential and parallel version for medium image sizes on GPU1 was presented in [29]. Optimized
number of registers and threads were considered for each image size. Figure 7 shows the time comparison of
LDP_compute on GPU2 for image sizes 256x256, 512x512, 1024x1024, 2048x2048 and 4096x4096. As
shown in Figure 7, the CUDA kernel gives a larger performance gain for larger images than for smaller
ones. This is because the host-to-device memory copy takes a considerable amount of time, and the speedup
obtained must pay off this copy overhead. For smaller images such as 256x256 or 512x512 pixels, the
speedup obtained is not enough to offset the copy overhead. However, as the image size increases to
2048x2048 and 4096x4096 pixels, the overhead is amortized and the performance gain increases.
Figure 7. Time comparison between sequential version and parallel version in Tesla K40c
when global memory is used. Logarithmic scales are used for plotting
Int J Elec & Comp Eng, Vol. 8, No. 6, December 2018: 5457-5471, ISSN: 2088-8708
This shows that GPUs deliver excellent performance for large, highly parallel, throughput-oriented
workloads. However, a large number of registers required per thread limits the number of active threads per
multiprocessor. By reducing the registers per thread and selecting a suitable thread block size, the number of
active threads per multiprocessor can be increased, which in turn improves the overall performance.
5.3. Implementation using Shared Memory and L1 Cache
The next optimization is to use the software-managed, faster on-chip shared memory instead of the
global memory. Shared memory enables inter-thread data reuse, as the threads in a thread block share the
data stored in it. In our implementation we have carefully divided the images so that the 48KB-per-block
shared memory constraint is never violated, while still using the maximum shared memory available for
processing. The image is divided into 16x16 tiles, and seven such tiles are copied to the shared memory for
LDP computation. The sums of the LDP values in the four directions are then written back to the shared
memory for further use. The total shared memory used per block is 7KB. In GPU1 we have taken the thread
block size as 128 so that 6 blocks are active in one SMP and the total shared memory used per SMP is
42KB; this stays within the 48KB limit while scheduling the maximum possible number of threads. The
speedup obtained in GPU1 when shared memory is used is shown in Table 2a. In GPU2 the thread block
size is taken as 256, 6 blocks are active in one SMP, and the total shared memory used per SMP is again
42KB. The speedup obtained in GPU2 when shared memory is used is shown in Table 2b.
The time taken for the computation when shared memory is used is much less than that of the
parallel version using global memory, because shared memory is on-chip and therefore much faster than
global memory. However, to ensure correct results, the threads that share the memory must be synchronized.
CUDA's simple barrier synchronization primitive __syncthreads() is used for this. It avoids race
conditions because a thread's execution proceeds past a __syncthreads() barrier only after all the
threads in the block have reached it.
We have further studied the option of using a larger L1 cache in place of shared memory and its
effect on the performance of the LDP_compute kernel. The default L1 size is 16KB, but when register
spilling is problematic, increasing L1 to 48KB improves performance. The cudaDeviceSetCacheConfig
function is used to set the preference between shared memory and L1 cache. As shown in Tables 3a and 3b,
the L1 cache version achieves only a slight performance improvement over the shared memory version on
both GPUs.
5.4. Implementation using Streams
The LDP_compute kernel is further optimized using streams. To distribute operations across
streams, the input image is divided into small chunks and the required operations are performed on each
chunk. The three main operations in the LDP computation are the memory copy from host to device, the
execution of the kernels, and the copy of the result back from device to host. For an input image of, say,
N = 256x256 pixels, the staged execution with nStreams streams transfers chunks of N/nStreams pixels, i.e.
128x128 pixels when nStreams is 4. If B blocks are used to compute the whole image in the non-staged
version, then B/nStreams blocks are used per chunk in the staged execution. While the first chunk is being
computed by the B/nStreams blocks, the second chunk is being transferred, thereby improving performance.
Figure 8a shows this concurrency between transfer and computation with nStreams set to 4. The host
memories are allocated as page-locked. We have assumed that the kernel execution and the memory copies
take roughly the same time. The execution timeline is shown in Figure 8b, with the inter-engine
dependencies highlighted by arrows. The GeForce 720m has only a single copy engine and a single kernel
engine; the depth-first issue order gives no speedup, whereas the breadth-first order gives the speedup
shown in Table 4a. The K40c has two copy engines, one for host-to-device and one for device-to-host
transfers, and a single kernel engine; both depth-first and breadth-first issue orders achieve the same
speedup. The operations are queued across the streams in such a way that an operation in one stream does
not block operations in the other streams. This allows the GPU to execute copies and kernels in parallel,
making the application run faster, as shown in Table 4b.
The image retrieval results using the parallel implementation of LDP are given in Table 5. The
performance was evaluated using two parameters, the true positive rate (TPR) and the false positive rate
(FPR), given by TPR = TP/(TP + FN) and FPR = FP/(FP + TN), where TP is True Positive, FN is False
Negative, FP is False Positive, and TN is True Negative. These four counts are evaluated as follows.
First, the Ground Truth (GT) for each image in the database is recorded as TRUE or FALSE depending on
whether the image belongs to the TRUE class or the FALSE class. In the second
step, the histogram intersection of the LDP of the query image and each database image is calculated using
Equation (10). The histogram intersection values fall within the range 0 to 1. If the value is above a
pre-determined threshold, the Test Outcome (TO) is deemed TRUE; otherwise the TO is FALSE. If both the
TO and the GT are TRUE, the TP count is incremented by 1. If both are FALSE, the TN count is
incremented by 1. If the TO is TRUE and the GT is FALSE, the FP count is incremented by 1. If the TO is
FALSE and the GT is TRUE, the FN count is incremented by 1. The TPR and FPR values for the two
databases are calculated, and the Receiver Operating Characteristic (ROC) curves obtained from these
values are shown in Figures 9a and 9b. The threshold of the point closest to the top-left corner (0, 1) is
chosen, as it gives the maximum TPR and minimum FPR (circled in Figures 9a and 9b). These optimum
threshold values are used to obtain the results shown in Table 5. Representative images from the two
databases are shown in Figures 10a and 10b. Example retrieval results for a given query image from the
Corel and Essex databases using LDP are shown in Figures 11 and 12, respectively.
Table 2. (a) The time comparison for different image sizes when shared memory is used (GeForce 720m)
Image size CPU time (sec) GPU time (sec) Speedup
256x256 0.209 0.003318 62.98x
512x512 0.834 0.013184 63.25x
1024x1024 3.574 0.05238 68.23x
2048x2048 14.988 0.209413 71.57x
4096x4096 61.459 0.838844 73.25x
Table 2. (b) The time comparison for different image sizes when shared memory is used (Tesla K40c)
Image size CPU time (sec) GPU time (sec) Speedup
256x256 0.092 0.00048 190.95x
512x512 0.35 0.00178 195.75x
1024x1024 1.4 0.00696 201.17x
2048x2048 5.6781 0.02759 205.76x
4096x4096 22.944 0.11004 208.51x
Table 3. (a) Time comparison for different image sizes when L1 cache is used (GeForce 720m)
Image size CPU time (sec) GPU time (sec) Speedup
256x256 0.209 0.003233 64.64x
512x512 0.834 0.012733 65.49x
1024x1024 3.574 0.05 70.46x
2048x2048 14.988 0.202673 73.957x
4096x4096 61.459 0.810159 75.86x
Table 3. (b) Time comparison for different image sizes when L1 cache is used (Tesla K40c)
Image size CPU time (sec) GPU time (sec) Speedup
256x256 0.092 0.00048 192x
512x512 0.35 0.001648 212.378x
1024x1024 1.4 0.006275 223.107x
2048x2048 5.6781 0.024649 230.358x
4096x4096 22.944 0.098124 233.826x
Table 4. (a) Time comparison for different image sizes when streams are used (GeForce 720m)
Image size CPU time (sec) GPU time (sec) Speedup
256x256 0.209 0.003175 65.836x
512x512 0.834 0.012572 66.337x
1024x1024 3.574 0.049687 71.93x
2048x2048 14.988 0.19974 75.0375x
4096x4096 61.459 0.7973 77.0839x
Table 4. (b) Time comparison for different image sizes when streams are used (Tesla K40c)
Image size CPU time (sec) GPU time (sec) Speedup
256x256 0.092 0.000479 192.067x
512x512 0.350 0.001772 197.52x
1024x1024 1.4 0.006927 202.11x
2048x2048 5.6781 0.02734 207.684x
4096x4096 22.944 0.1082 211.052x
Figure 8. (a) Computation on six image blocks for non-streamed and streamed execution, (b) Execution
timeline using multiple CUDA streams, with arrows indicating inter-engine dependencies
Table 5. Retrieval results for the two databases
Databases TP FN FP TN TPR FPR
Corel 41 11 322 626 0.79 0.34
Essex 10 0 0 1520 1 0
Figure 9. (a) ROC curves for the Corel database; the point with the optimum threshold is circled,
(b) ROC curves for the Essex face database; the point with the optimum threshold is circled
6. CONCLUSION
In this paper, we presented the design and implementation of a texture-based CBIR algorithm, the
Local Derivative Pattern, on a GPGPU. We implemented a parallel version of the LDP algorithm for high-
resolution images and demonstrated the performance gain achieved. By using the fast on-chip shared
memory, a speedup of around 240x is achieved without compromising the retrieval performance of the LDP
algorithm. The parallel version of the algorithm was further revised to exploit additional parallelism in an
attempt to improve the performance. We achieved a slight improvement in speedup when the hardware-
managed L1 data cache is used instead of the software-managed shared memory. The synchronous parallel
version was further revised to use asynchronous programming with CUDA streams, and a significant
performance improvement was achieved over the sequential version. Our future work includes further
optimization of the algorithm by processing one image per SMP. We will also focus on an in-depth study of
the tradeoffs between the shared memory and the L1 cache.
Real-Time Implementation and Performance Optimization of Local Derivative … (Nisha Chandran S)
Figure 10. (a) Representative images from Corel database,
(b) Representative images from Essex face database
Figure 11. Top 5 images retrieved from Corel database using LDP, top leftmost is the query image
Figure 12. Top 5 images retrieved from Essex face database using LDP, top leftmost is the query image
REFERENCES
[1] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face
recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28(12), pp. 2037-2041, Dec 2006.
[2] Zhang, B., Gao,Y., Zhao S. and Liu, J., "Local derivative pattern versus local binary pattern: face recognition with
high-order local pattern descriptor," IEEE Trans. Image Processing, vol. 19(2), pp.533-544, 2010.
[3] "NVIDIA CUDA C Programming Guide," NVIDIA Corporation, vol.120 ,2011.
[4] Kusamura, Y., Amagasa, T., Kitagawa, H. and Kozawa, Y., "Efficient Content-based Image Retrieval for Position
Estimation on GPU," Proc. 15th ACM Int. Conf. on Advances in Mobile Computing & Multimedia, Dec 04-06,
Salzburg, Austria, pp. 58-66, 2017.
[5] Kusamura, Y., Kozawa, Y., Amagasa, T. and Kitagawa, H., "GPU Acceleration of Content-Based Image Retrieval
Based on SIFT Descriptors," Proc. 19th IEEE Int. Conf. on Network-Based Information Systems (NBiS), Sep 07-09,
2016, Ostrava, Czech Republic, pp. 342-347.
[6] Khor, H.L., Liew, S.C. and Zain, J.M., "A review on parallel medical image processing on GPU," Proc. 4th IEEE
Int. Conf. on Software Engineering and Computer Systems (ICSECS), Malaysia, pp. 45-48, Aug 2015.
[7] Sevilla, J., Jiménez, L.I. and Plaza, A., "Sparse unmixing-based content retrieval of hyperspectral images on
graphics processing units," IEEE Geoscience and Remote Sensing Letters, vol.12(12), pp. 2443-2447, Oct 2015.
[8] Yang, Z., Zhu, Y., and Pu, Y., "Parallel image processing based on CUDA," in Proc. IEEE Int. Conf. on Computer
Science and Software Engineering (CSSE), Dec 12-14, Wuhan, Hubei, China, pp. 198-201, 2008.
[9] Guo, H. and He, J., "A fast fractal image compression algorithm combined with graphic processor unit,"
TELKOMNIKA (Telecommunication Computing Electronics and Control), vol.13(3), pp.1089-1096, 2015.
[10] Tek, S.C. and Gökmen, M., "CUDA Accelerated Face Recognition Using Local Binary Patterns," Threshold, vol.
10101001(2), pp. 169, 2012.
[11] Gangodkar, D., Gupta, S., Gill, G.S., Kumar, P. and Mittal, A., "Efficient variable size template matching using fast
normalized cross correlation on multicore processors," in Advanced Computing, Networking and
Security(ADCONS), Springer, vol. 7135, pp. 218-227, 2012.
[12] Prananta, E., Pranowo, P. and Budianto, D., "GPU CUDA accelerated image inpainting using fourth order PDE
equation," TELKOMNIKA (Telecommun. Computing Electronics and Control), vol.14(3), pp. 1009-1015, 2016.
[13] Nasreen, A. and Shobha, G., "Parallelizing Multi-featured Content Based Search and Retrieval of Videos through
High Performance Computing," Indonesian Journal of Electrical Engineering and Computer Science, vol.5(1),
pp.214-219, 2017.
[14] Ahmed, R., Islam, M.S. and Uddin, J., "Optimizing Apple lossless audio codec algorithm using NVIDIA CUDA
architecture," International Journal of Electrical and Computer Engineering (IJECE), vol.8(1), pp.70-75, 2018.
[15] Heidari, H., Chalechale, A., and Mohammadabadi, A.A., "Accelerating of color moments and texture features
extraction using GPU based parallel computing," in Proc. 8th Iranian Conf. on Machine Vision and Image
Processing (MVIP), Zanjan, Iran, pp. 430-435, 2013.
[16] Zhu, F., Chen, P., Yang, D., Zhang, W., Chen, H. and Zang, B., "A GPU-based high-throughput image retrieval
algorithm," in Proc. 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, ACM,
London, pp. 30-37, Aug 2012.
[17] Prisacariu, V. and Reid, I., "fastHOG-a real-time GPU implementation of HOG," Technical Report, vol. 2310,
Department of Engineering Science, Oxford Univ, Jul 2009.
[18] Park, I.K., Singhal, N., Lee, M.H., Cho, S.,& Kim, C. W., "Design and performance evaluation of image processing
algorithms on GPUs," IEEE Trans. Parallel and Distributed Systems, vol. 22, pp. 91-104, 2011.
[19] Hawick, K.A.,Leist, A.,and Playne, D.P.,"Regular Lattice and small-world spin model simulations using CUDA
and GPUs," Int. Journal of Parallel Programming, vol.39(2), pp.183-201, 2011.
[20] Liu, S., Eisenbeis, C., and Gaudiot, J.L., "Value prediction and speculative execution on GPU," Int. Journal of
Parallel Programming, vol.39(5), pp.533-552, 2011.
[21] Sanders, J. and Kandrot, E., "CUDA by example: an introduction to general-purpose GPU programming," Addison-
Wesley Professional, 2010.
[22] Wen-Mei, W.H., "GPU Computing Gems Emerald Edition," Elsevier, 2011.
[23] Kirk, D.B. and WenMei, W.H., "Programming massively parallel processors: a hands-on approach," Newnes, 2012.
[24] Cui, X., Chen, Y., Zhang, C., and Mei, H., "Auto-tuning dense matrix multiplication for GPGPU with cache," in
Proc. IEEE Int. Conf. on Parallel and Distributed Systems (ICPADS), China, pp. 237-242, Dec 2010.
[25] Lobeiras, J., Amor, M., and Doallo, R., "Performance evaluation of GPU memory hierarchy using the FFT," in
Proc. 11th Int. Conf. on Computational and Mathematical Methods in Science and Engineering (CMMSE 2011),
Benidorm, Alicante, Spain, pp. 750-761, Jun 2011.
[26] Liang, Y.,Cui, Z., Zhao, S., Rupnow, K., Zhang,Y., Jones, D.L. and Chen. D., "Real-time implementation and
performance optimization of 3D sound localization on GPUs," in Proc. Design, Automation&Test in Europe
Conference & Exhibition (DATE), Dresden, pp. 832-835 , Mar 2012.
[27] Torres, Y., Gonzalez-Escribano, A., and Llanos, D.R., "Understanding the impact of CUDA tuning techniques for
Fermi," in Proc. IEEE Int. Conf. on High Performance Computing and Simulation, Istanbul, pp. 631-639, Jul 2011.
[28] Lee, D., Dinov, I., Dong, B., Gutman, B., Yanovsky, I., and Toga, A.W., "CUDA optimization strategies for
compute and memory bound Neuroimaging algorithms," Computer Methods and Programs in Biomedicine,
vol. 106(3), pp. 175-187, 2012.
[29] Chandran, S.N., Gangodkar, D. and Mittal, A., "Parallel Implementation of Local Derivative Pattern Algorithm for
Fast Image Retrieval," in Proc. IEEE Int. Conf. on Computing, Communication & Automation (ICCCA), Greater
Noida, India, pp.1132-1137, May 2015.
[30] Wittenbrink, C.M., Kilgariff, E., and Prabhu, A., "Fermi GF100 GPU architecture", IEEE Micro, vol.2, pp. 50-59,
2011.
[31] "NVIDIA Corporation: Whitepaper NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110,"
[Online] Available: www.nvidia.in/content/PDF/kepler/NVIDIA-Kepler-G110-Architecture-Whitepaper.pdf.
[32] Li, C., Yang, Y., Dai, H., Yan, S., Mueller, F., and Zhou, H., "Understanding the tradeoffs between software-
managed vs hardware-managed caches in GPUs," in Proc. IEEE Int. Symposium on Performance Analysis of
Systems and Software (ISPASS), Monterey, CA, pp. 231-242, Mar 2014.
[33] Gomez-Luna,J.,Gonzalez-Linares,J.M., Benavides, J.I., Guil,N., "Performance models for CUDA streams on
NVIDIA GeForce series," Parallel and Distributed Computing , vol.72(9), pp.1117-1126, 2012.
[34] "NVIDIA Parallel Forall," [Online] Available: https://devblogs.nvidia.com/parallelforall/
[35] Liang, Y., Cui, Z., Rupnow, K., and Chen, D., "Register and thread structure optimization for GPUs," in Proc. 18th
Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, pp. 461-466, Jan 2013.
[36] Corel 1000 and Corel 10000 image databases [Online]Available: http://wang.ist.psu.edu/docs/related.html.
[37] Face database [Online] Available: http://cswww.essex.ac.uk/mv/allfaces/faces94.html.
[38] CUDA Visual Profiler [Online] Available: http://developer.nvidia.com/nvidia-visual-profiler.
BIOGRAPHIES OF AUTHORS
Nisha Chandran S. received her M.Tech degree in Computer Science and Engineering from
Graphic Era University. She is currently working towards the Ph.D. degree with the Department of
Computer Science and Engineering, Graphic Era University, Dehradun, India. Her research
interests include image and video processing, content-based retrieval, data mining and high
performance computing.
Durgaprasad Gangodkar received the Ph.D. degree from Indian Institute of Technology (IIT),
Roorkee, India. For around six years, he was with Siemens Ltd., Mumbai, India, where he worked
on many prestigious projects related to industrial automation. He is currently with the Department
of Computer Science and Engineering, Graphic Era University, Dehradun, India. His research
interests include high-performance computing, computer vision, video analytics, and mobile agents.
Ankush Mittal received the Ph.D. degree in electrical and computer engineering from the National
University of Singapore, Singapore, in 2001. He was a Faculty Member with National University of
Singapore and IIT, Roorkee, India. He is currently the Director of Research with Graphic Era
University, Dehradun, India. He has been the author or a coauthor of more than 240 research
papers. His research interests include image and video processing, content-based retrieval, e-
learning, and bioinformatics.