As the majority of compression algorithms are implemented for CPU architectures, the primary focus of our work was to exploit the opportunities of GPU parallelism in audio compression. This paper presents an implementation of the Apple Lossless Audio Codec (ALAC) algorithm using NVIDIA's Compute Unified Device Architecture (CUDA) framework. The core idea was to identify the areas where data parallelism could be applied and to execute the identified parallel components on CUDA's Single Instruction Multiple Thread (SIMT) model. The dataset was retrieved from the European Broadcasting Union's Sound Quality Assessment Material (SQAM). Faster execution of the algorithm reduced coding time for large audio files. This paper also presents the reduction in power usage from running the parallel components on the GPU. Experimental results reveal a speedup of about 80-90% through CUDA on the identified components over the CPU implementation, while saving CPU power.
Lightweight DNN Processor Design (based on NVDLA), by Shien-Chun Luo
https://sites.google.com/view/itri-icl-dla/
(Public Information Share) This is our lightweight DNN inference processor presentation, including a system solution (from Caffe prototxt to HW control files), hardware features, and an example of object detection (Tiny YOLO) RTL simulation results. We modified the open-source NVDLA (small configuration) and developed a RISC-V MCU for this accelerating system.
A graphics processing unit, or GPU (occasionally called a visual processing unit, or VPU), is a specialized microprocessor that offloads and accelerates graphics rendering from the central processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip performs computation, whereas a GPU devotes more of its transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
We updated the DLA system introductions here, covering design, add-on functions, and applications. During 2018-2019, we developed the tools needed for IC simulation and verification, constructed a quantization-aware and HW-aware training flow, and improved the automation of the verification. We have verified this system through FPGA prototyping and a fabricated SoC.
CC-4005, Performance analysis of 3D Finite Difference computational stencils on Seamicro fabric compute systems, by AMD Developer Central
Presentation CC-4005, Performance analysis of 3D Finite Difference computational stencils on Seamicro fabric compute systems, by Joshua Mora from the AMD Developer Summit (APU13) November 2013.
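For readers unfamiliar with the workload, the computational pattern analyzed in such talks can be sketched in miniature. The following is an illustrative pure-Python 7-point stencil sweep, not code from the presentation; real implementations tile the grid for cache and fabric locality.

```python
def stencil_7pt(u, n, c0=-6.0, c1=1.0):
    """Apply one 7-point finite-difference stencil sweep to an n*n*n grid
    stored as a flat list; boundary points are left unchanged."""
    idx = lambda i, j, k: (i * n + j) * n + k
    out = list(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                # center point plus its six axis-aligned neighbors
                out[idx(i, j, k)] = (c0 * u[idx(i, j, k)]
                                     + c1 * (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                                             + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                                             + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]))
    return out

n = 4
u = [1.0] * (n * n * n)        # constant field: interior Laplacian is zero
v = stencil_7pt(u, n)
print(v[(1 * n + 1) * n + 1])  # interior point: -6 + 6 = 0.0
```

The per-point arithmetic intensity is low (one multiply-add per neighbor), which is why memory bandwidth, not compute, typically bounds such stencils on fabric systems.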
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems, and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers, and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption, and the execution time of an application executed by the final user, it is important for these tools and methodologies to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
The MERIC tool, which implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime. Manual instrumentation, which can be combined with automatic instrumentation, allows the code developer to annotate regions of particular interest."
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
Sign up for our insideHPC Newsletter: http://insidehpc.com/newslett
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED..., by Ilango Jeyasubramanian
• Improved speedup by 3.2% and CPI by 7.68% over the regular LRU policy with a new Dynamic SLRU policy implemented on the Gem5 simulator.
• Dynamic segmentation of each cache set was done with a cache-line access-probability-based LRU insertion policy, and block replacement was performed by the normal SLRU policy.
• Achieved a hit-rate improvement of 2.97% and a power reduction of 11% with a modified Gem5 Ruby memory system, by adding instructions to the X86 ISA and additional states to the MOESI protocol for coherence in round-robin ordered, ring-connected filter caches on an Intel X86 processor.
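To make the SLRU idea concrete, here is a minimal textbook sketch of a segmented LRU with probationary and protected segments; this is a generic illustration, not the paper's Gem5 implementation or its dynamic segmentation scheme.

```python
from collections import OrderedDict

class SLRU:
    """Minimal Segmented LRU sketch: new lines enter a probationary segment;
    a hit promotes the line to the protected segment; eviction takes the
    probationary LRU line first."""
    def __init__(self, prob_size, prot_size):
        self.prob = OrderedDict()   # probationary segment, LRU order
        self.prot = OrderedDict()   # protected segment, LRU order
        self.prob_size, self.prot_size = prob_size, prot_size

    def access(self, tag):
        if tag in self.prot:                      # hit in protected: refresh
            self.prot.move_to_end(tag)
            return True
        if tag in self.prob:                      # hit in probationary: promote
            del self.prob[tag]
            if len(self.prot) >= self.prot_size:  # demote protected LRU line
                old, _ = self.prot.popitem(last=False)
                self._insert_prob(old)
            self.prot[tag] = None
            return True
        self._insert_prob(tag)                    # miss: insert probationary
        return False

    def _insert_prob(self, tag):
        if len(self.prob) >= self.prob_size:
            self.prob.popitem(last=False)         # evict probationary LRU line
        self.prob[tag] = None

cache = SLRU(prob_size=2, prot_size=2)
print(cache.access('A'))   # False: cold miss
print(cache.access('A'))   # True: hit, promoted to protected
```

The segmentation protects frequently reused lines from being swept out by one-shot scans, which is the behavior the hit-rate improvement above exploits.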
Early Benchmarking Results for Neuromorphic Computing, by DESMOND YUEN
An update on the Intel Neuromorphic Research Community’s growth and benchmark results, including the addition of new corporate members and numerous new benchmarking updates computed on Intel’s neuromorphic test chip, Loihi.
ASIC DESIGN OF MINI-STEREO DIGITAL AUDIO PROCESSOR UNDER SMIC 180NM TECHNOLOGY, by Ilango Jeyasubramanian
- Designed and analyzed a complete MSDAP with optimized convolution computation using only shifts and adds with power-of-2 coefficients. Synthesized the chip through high-level architecture design (C program), logic synthesis (Synopsys Design Compiler), and physical synthesis (Synopsys IC Compiler).
- Achieved a low power consumption of 3.1438 mW at a 29.186 MHz clock frequency, with 70% core utilization and a chip area of 1.29 mm².
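The shifts-and-adds trick above can be illustrated in a few lines: a coefficient restricted to a sum of signed powers of two turns each multiply into shift and add operations, eliminating hardware multipliers. This is illustrative only; the actual MSDAP coefficient set is not reproduced here.

```python
def mul_by_pow2_terms(x, terms):
    """Multiply integer sample x by a coefficient expressed as a list of
    (sign, shift) pairs, i.e. coeff = sum(sign * 2**shift)."""
    acc = 0
    for sign, shift in terms:
        acc += sign * (x << shift)   # one barrel shift and one add per term
    return acc

# coefficient 6 = 2**2 + 2**1
print(mul_by_pow2_terms(5, [(1, 2), (1, 1)]))   # 5 * 6 = 30
# coefficient 7 = 2**3 - 2**0 (a subtraction saves a term)
print(mul_by_pow2_terms(5, [(1, 3), (-1, 0)]))  # 5 * 7 = 35
```

In silicon each `(sign, shift)` term maps to wiring plus one adder/subtractor, which is where the power savings of multiplier-free convolution come from.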
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS, by cscpconf
The main aim of our research is to find the limit of Amdahl's Law for multicore processors, that is, the number of cores beyond which adding more no longer improves the efficiency of the overall CMP (Chip Multi-Processor, a.k.a. multicore processor) architecture. We expect this limit to lie either in the architecture of the multicore processor or in the programming. We surveyed the multicore processor architectures of various chip manufacturers, namely Intel™, AMD™, IBM™, etc., and the various techniques they follow to improve multicore processor performance. We conducted cluster experiments to find this limit. In this paper we propose an alternate multicore processor design based on the results of our cluster experiments.
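The limit the abstract refers to follows directly from Amdahl's Law: with parallel fraction p, speedup on n cores is 1 / ((1 - p) + p/n), bounded above by 1 / (1 - p) no matter how many cores are added. A quick numeric check with illustrative values (not the paper's data):

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup on n cores with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.9                                 # 90% of the work parallelizes
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(p, n), 2))
print("limit:", round(1.0 / (1.0 - p), 2))   # 10.0 regardless of core count
```

Even at 1024 cores the speedup stays just under 10x, which is why the serial fraction, not the core count, becomes the architectural bottleneck the paper investigates.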
Seven years ago at LCA, Van Jacobson introduced the concept of net channels, but since then the concept of user-mode networking has not hit the mainstream. There are several different user-mode networking environments: Intel DPDK, BSD netmap, and Solarflare OpenOnload. Each of these provides higher performance than standard Linux kernel networking but also creates new problems. This talk will explore the issues created by user-space networking, including performance, internal architecture, security, and licensing.
Best Practices and Performance Studies for High-Performance Computing Clusters, by Intel® Software
This session focuses on key system tunables for maximizing application performance of high-performance computing (HPC) workloads, and addresses porting, optimizing, and running applications to maximize performance. We present practical tips and techniques for building and running applications on multicore processors. We analyze sample performance and scaling data from various applications, and identify the best options.
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS, by cseij
In this paper we study the NVIDIA graphics processing unit (GPU) along with its computational power and applications. Although these units are specially designed for graphics applications, we can employ their computational power for non-graphics applications too. The GPU has high parallel processing power, low computation cost, and low time utilization, and it gives a good performance-per-energy ratio. This GPU property of handling excessive computation over similar small instruction sets has played a significant role in reducing CPU overhead. The GPU has several key advantages over CPU architecture: it provides high parallelism, intensive computation, and significantly higher throughput. It consists of thousands of hardware threads that execute programs in a SIMD fashion; hence the GPU can be an alternative to the CPU in high-performance and supercomputing environments. The bottom line is that GPU-based general-purpose computing is a hot topic of research, and there is much to explore beyond graphics processing applications.
SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. For efficient construction of large maps, searching for the best-matching unit is usually the computationally heaviest operation in the SOM. The parallel nature of the algorithm and the huge computations involved make it a good target for a GPU-based parallel implementation. This paper presents an overall idea of the optimization strategies used for the parallel implementation of the basic SOM on a GPU using the CUDA programming paradigm.
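The best-matching-unit search that dominates SOM construction can be sketched as a plain serial reference version of the operation the paper parallelizes on the GPU (illustrative, not the paper's CUDA code):

```python
def find_bmu(weights, x):
    """Return the index of the map unit whose weight vector is closest to
    input x under squared Euclidean distance (the SOM best-matching unit)."""
    best_i, best_d = 0, float('inf')
    for i, w in enumerate(weights):
        d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

units = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(find_bmu(units, [0.9, 1.1]))   # unit 1 is nearest
```

Each unit's distance is independent of the others, so on a GPU one thread per unit computes its distance and a parallel reduction finds the minimum, which is the essence of the optimization strategies the paper describes.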
Distributed Video Coding (DVC) has become increasingly popular in recent times among researchers in video coding due to its attractive and promising features. DVC primarily has a modified complexity balance between the encoder and decoder, in contrast to conventional video codecs. However, most of the reported DVC schemes have a high time delay in the decoder, which hinders their practical application in real-time systems. In this work, we focus on speeding up the Side Information (SI) generation module in DVC, which is a major function in the DVC coding algorithm and one of the time-consuming factors at the decoder. By implementing it through the Compute Unified Device Architecture (CUDA) on a General-Purpose Graphics Processing Unit (GPGPU), the experimental results show that a considerable speedup can be obtained using the proposed parallelized SI generation algorithm.
Efficient algorithm for RSA text encryption using CUDA C, by csandit
Modern-day computer security relies heavily on cryptography as a means to protect the data that we have become increasingly reliant on. The main research question in the computer security domain is how to enhance the speed of the RSA algorithm. The computing capability of the Graphics Processing Unit as a co-processor of the CPU can leverage massive parallelism. This paper presents a novel algorithm for calculating modulo values that can process large powers of numbers which are otherwise not supported by built-in data types. First, the traditional algorithm is studied. Secondly, the parallelized RSA algorithm is designed using the CUDA framework. Thirdly, the designed algorithm is realized for small prime numbers and large prime numbers. As a result, the fundamental problems of the RSA algorithm, such as speed and the use of poor or small prime numbers that have led to significant security holes despite the RSA algorithm's mathematical soundness, can be alleviated by this algorithm.
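The modular-exponentiation core of RSA can be computed without ever forming the huge power by the standard square-and-multiply method, keeping every intermediate reduced modulo n. The sketch below uses the well-known toy parameters p=61, q=53 (so n=3233, e=17, d=2753) and is a serial illustration, not the paper's CUDA kernel.

```python
def mod_pow(base, exp, mod):
    """Compute (base ** exp) % mod by binary square-and-multiply,
    so no intermediate ever exceeds mod**2."""
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:                      # current exponent bit set
            result = (result * base) % mod
        base = (base * base) % mod       # square for the next bit
        exp >>= 1
    return result

n, e, d = 3233, 17, 2753                 # toy RSA keypair from p=61, q=53
cipher = mod_pow(65, e, n)
print(cipher)                            # 2790
print(mod_pow(cipher, d, n))             # 65, the original message
```

The per-bit squaring chain is inherently sequential, which is why GPU RSA work typically parallelizes across many messages or across limbs of multi-precision arithmetic rather than within a single exponentiation.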
The complexity of medical image reconstruction requires tens to hundreds of billions of computations per second. Until a few years ago, special-purpose processors designed especially for such applications were used. Such processors require significant design effort, are thus difficult to change as new reconstruction algorithms evolve, and have limited parallelism. Hence the demand for flexibility in medical applications motivated the use of stream processors with massively parallel architectures. Stream processing architectures offer data parallelism.
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA, by VLSICS Design
Video compression is essential to meet technological demands such as low power, less memory, and fast transfer rates for a wide range of devices and for various multimedia applications. Video compression is primarily achieved by the Motion Estimation (ME) process in any video encoder, which contributes significant compression gain. The Sum of Absolute Differences (SAD) is used as the distortion metric in the ME process. In this paper, an efficient Absolute Difference (AD) circuit is proposed which uses a Brent-Kung Adder (BKA) and a comparator based on a modified 1's complement principle and a conditional sum adder scheme. Results show that the proposed architecture reduces delay by 15% and the number of slice LUTs by 42% compared to the conventional architecture. Simulation and synthesis are done on Xilinx ISE 14.2 using a Virtex 7 FPGA.
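In its simplest software form, the SAD distortion metric used in motion estimation is just a sum of absolute pixel differences between a current block and a candidate reference block. This is an illustrative sketch; the paper's contribution is the hardware AD circuit, not this software loop.

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equal-sized 2D pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

cur = [[10, 12], [14, 16]]   # current 2x2 block
ref = [[11, 12], [13, 20]]   # candidate block from the reference frame
print(sad(cur, ref))         # |10-11| + |12-12| + |14-13| + |16-20| = 6
```

A motion estimator evaluates this metric over many candidate offsets and keeps the one with the smallest SAD, which is why a faster AD circuit directly speeds up the encoder's inner loop.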
HIGH PERFORMANCE ETHERNET PACKET PROCESSOR CORE FOR NEXT GENERATION NETWORKS, by ijngnjournal
As the demand for high-speed Internet increases significantly to meet the requirements of large data transfers, real-time communication, and High Definition (HD) multimedia transfer over IP, the architecture of IP-based network products must evolve and change. Application-specific processors require high performance, low power, and a high degree of programmability, which is a limitation of many general-processor-based applications. This paper describes the design of an Ethernet packet processor for a system-on-chip (SoC) which performs all core packet processing functions, including segmentation and reassembly, packetization, classification, and route and queue management, which will speed up switching/routing performance, making it more suitable for Next Generation Networks (NGN). The Ethernet packet processor design can be configured for use with multiple projects targeted to an FPGA device; the system is designed to support 1/10/20/40/100 Gigabit links with a speed and performance advantage. VHDL has been used to implement and simulate the required functions in an FPGA.
Image Processing Application on Graphics Processors, by CSCJournals
In this work, we introduce real-time image processing techniques using modern programmable Graphics Processing Units (GPUs). GPUs are SIMD (Single Instruction, Multiple Data) devices that are inherently data-parallel. By utilizing NVIDIA's GPU programming framework, Compute Unified Device Architecture (CUDA), as a computational resource, we realize significant acceleration in image processing algorithm computations. We show that a range of computer vision algorithms map readily to CUDA with significant performance gains. Specifically, we demonstrate the efficiency of our approach through the parallelization and optimization of image processing, morphology applications, and image integrals.
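The image-integral computation mentioned above can be sketched as follows: each entry of the integral image holds the sum of all pixels above and to the left, so any rectangle sum afterwards costs only four lookups. This is a generic textbook version, not the paper's CUDA implementation.

```python
def integral_image(img):
    """Build the integral image of a 2D pixel list, with a one-entry zero
    border so rectangle queries need no edge cases."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            # running sum: pixel + above + left - above-left (counted twice)
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

img = [[1, 2], [3, 4]]
ii = integral_image(img)
print(ii[2][2])   # 10: sum of the whole 2x2 image
```

On a GPU this is typically built from parallel prefix sums along rows and then columns, which is what makes it a natural CUDA showcase.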
Bibliometric analysis highlighting the role of women in addressing climate ch..., by IJECEIAES
Fossil fuel consumption increased quickly, contributing to climate change that is evident in unusual flooding, droughts, and global warming. Over the past ten years, women's involvement in society has grown dramatically, and they have succeeded in playing a noticeable role in reducing climate change. A bibliometric analysis of data from the last ten years has been carried out to examine the role of women in addressing climate change. The analysis's findings are discussed in relation to the sustainable development goals (SDGs), particularly SDG 7 and SDG 13. The results considered contributions made by women in various sectors while taking geographic dispersion into account. The bibliometric analysis delves into topics including women's leadership in environmental groups, their involvement in policymaking, their contributions to sustainable development projects, and the influence of gender diversity on attempts to mitigate climate change. This study's results highlight how women have influenced policies and actions related to climate change, point out areas of research deficiency, and give recommendations on how to increase the role of women in addressing climate change and achieving sustainability. To achieve more successful results, this initiative aims to highlight the significance of gender equality and encourage inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter..., by IJECEIAES
Active and reactive load changes have a significant impact on voltage and frequency. In this paper, in order to stabilize the microgrid (MG) against load variations in islanding mode, the active and reactive power of all distributed generators (DGs), including energy storage (battery), a diesel generator, and a micro-turbine, are controlled. The micro-turbine generator is connected to the MG through a three-phase to three-phase matrix converter, and the droop control method is applied for controlling the voltage and frequency of the MG. In addition, a method is introduced for voltage and frequency control of micro-turbines in the transition from grid-connected mode to islanding mode. A novel switching strategy for the matrix converter is used to convert the high-frequency output voltage of the micro-turbine to the grid-side frequency of the utility system. Moreover, using this switching strategy, low-order harmonics in the output current and voltage are not produced, and consequently the size of the output filter is reduced. In fact, the suggested control strategy is load-independent and has no frequency conversion restrictions. The proposed approach for voltage and frequency regulation demonstrates exceptional performance and favorable response across various load alteration scenarios. The suggested strategy is examined in several scenarios in the MG test systems, and the simulation results are discussed.
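The droop control method named above relates frequency to active power and voltage to reactive power through linear characteristics. A minimal sketch, with arbitrary illustrative coefficients rather than values from the paper:

```python
def droop(P, Q, f0=50.0, V0=1.0, kp=0.001, kq=0.05):
    """P-f and Q-V droop characteristics: frequency falls linearly with
    active power P, voltage with reactive power Q (illustrative gains)."""
    f = f0 - kp * P   # frequency droop
    V = V0 - kq * Q   # voltage droop (per-unit)
    return f, V

f, V = droop(P=100.0, Q=0.5)
print(round(f, 2), round(V, 3))   # 49.9 0.975
```

Because each DG measures only its own output power, droop lets several generators share load without communication, which is what makes it attractive for islanded microgrids.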
Enhancing battery system identification: nonlinear autoregressive modeling fo..., by IJECEIAES
Precisely characterizing Li-ion batteries is essential for optimizing their performance, enhancing safety, and prolonging their lifespan across various applications, such as electric vehicles and renewable energy systems. This article introduces an innovative nonlinear methodology for system identification of a Li-ion battery, employing a nonlinear autoregressive with exogenous inputs (NARX) model. The proposed approach integrates the benefits of nonlinear modeling with the adaptability of the NARX structure, facilitating a more comprehensive representation of the intricate electrochemical processes within the battery. Experimental data collected from a Li-ion battery operating under diverse scenarios are employed to validate the effectiveness of the proposed methodology. The identified NARX model exhibits superior accuracy in predicting the battery's behavior compared to traditional linear models. This study underscores the importance of accounting for nonlinearities in battery modeling, providing insights into the intricate relationships between state-of-charge, voltage, and current under dynamic conditions.
Smart grid deployment: from a bibliometric analysis to a survey, by IJECEIAES
Smart grids are one of the last decades' innovations in electrical energy. They bring relevant advantages compared to the traditional grid and have drawn significant interest from the research community. Assessing the field's evolution is essential to propose guidelines for facing new and future smart grid challenges. In addition, knowing the main technologies involved in the deployment of smart grids (SGs) is important to highlight possible shortcomings that can be mitigated by developing new tools. This paper contributes to the research trends mentioned above by focusing on two objectives. First, a bibliometric analysis is presented to give an overview of the current research level on smart grid deployment. Second, a survey of the main technological approaches used for smart grid implementation and their contributions is presented. To that effect, we searched the Web of Science (WoS) and Scopus databases. We obtained 5,663 documents from WoS and 7,215 from Scopus on smart grid implementation or deployment. Given the extraction limitation in the Scopus database, 5,872 of the 7,215 documents were extracted using a multi-step process. These two datasets have been analyzed using a bibliometric tool called bibliometrix. The main outputs are presented with some recommendations for future research.
Use of analytical hierarchy process for selecting and prioritizing islanding ...IJECEIAES
One of the problems that are associated to power systems is islanding
condition, which must be rapidly and properly detected to prevent any
negative consequences on the system's protection, stability, and security.
This paper offers a thorough overview of several islanding detection
strategies, which are divided into two categories: classic approaches,
including local and remote approaches, and modern techniques, including
techniques based on signal processing and computational intelligence.
Additionally, each approach is compared and assessed based on several
factors, including implementation costs, non-detected zones, declining
power quality, and response times using the analytical hierarchy process
(AHP). The multi-criteria decision-making analysis shows that the overall
weight of passive methods (24.7%), active methods (7.8%), hybrid methods
(5.6%), remote methods (14.5%), signal processing-based methods (26.6%),
and computational intelligent-based methods (20.8%) based on the
comparison of all criteria together. Thus, it can be seen from the total weight
that hybrid approaches are the least suitable to be chosen, while signal
processing-based methods are the most appropriate islanding detection
method to be selected and implemented in power system with respect to the
aforementioned factors. Using Expert Choice software, the proposed
hierarchy model is studied and examined.
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...IJECEIAES
The power generated by photovoltaic (PV) systems is influenced by
environmental factors. This variability hampers the control and utilization of
solar cells' peak output. In this study, a single-stage grid-connected PV
system is designed to enhance power quality. Our approach employs fuzzy
logic in the direct power control (DPC) of a three-phase voltage source
inverter (VSI), enabling seamless integration of the PV connected to the
grid. Additionally, a fuzzy logic-based maximum power point tracking
(MPPT) controller is adopted, which outperforms traditional methods like
incremental conductance (INC) in enhancing solar cell efficiency and
minimizing the response time. Moreover, the inverter's real-time active and
reactive power is directly managed to achieve a unity power factor (UPF).
The system's performance is assessed through MATLAB/Simulink
implementation, showing marked improvement over conventional methods,
particularly in steady-state and varying weather conditions. For solar
irradiances of 500 and 1,000 W/m2
, the results show that the proposed
method reduces the total harmonic distortion (THD) of the injected current
to the grid by approximately 46% and 38% compared to conventional
methods, respectively. Furthermore, we compare the simulation results with
IEEE standards to evaluate the system's grid compatibility.
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...IJECEIAES
Photovoltaic systems have emerged as a promising energy resource that
caters to the future needs of society, owing to their renewable, inexhaustible,
and cost-free nature. The power output of these systems relies on solar cell
radiation and temperature. In order to mitigate the dependence on
atmospheric conditions and enhance power tracking, a conventional
approach has been improved by integrating various methods. To optimize
the generation of electricity from solar systems, the maximum power point
tracking (MPPT) technique is employed. To overcome limitations such as
steady-state voltage oscillations and improve transient response, two
traditional MPPT methods, namely fuzzy logic controller (FLC) and perturb
and observe (P&O), have been modified. This research paper aims to
simulate and validate the step size of the proposed modified P&O and FLC
techniques within the MPPT algorithm using MATLAB/Simulink for
efficient power tracking in photovoltaic systems.
Adaptive synchronous sliding control for a robot manipulator based on neural ...IJECEIAES
Robot manipulators have become important equipment in production lines, medical fields, and transportation. Improving the quality of trajectory tracking for
robot hands is always an attractive topic in the research community. This is a
challenging problem because robot manipulators are complex nonlinear systems
and are often subject to fluctuations in loads and external disturbances. This
article proposes an adaptive synchronous sliding control scheme to improve trajectory tracking performance for a robot manipulator. The proposed controller
ensures that the positions of the joints track the desired trajectory, synchronize
the errors, and significantly reduces chattering. First, the synchronous tracking
errors and synchronous sliding surfaces are presented. Second, the synchronous
tracking error dynamics are determined. Third, a robust adaptive control law is
designed,the unknown components of the model are estimated online by the neural network, and the parameters of the switching elements are selected by fuzzy
logic. The built algorithm ensures that the tracking and approximation errors
are ultimately uniformly bounded (UUB). Finally, the effectiveness of the constructed algorithm is demonstrated through simulation and experimental results.
Simulation and experimental results show that the proposed controller is effective with small synchronous tracking errors, and the chattering phenomenon is
significantly reduced.
Remote field-programmable gate array laboratory for signal acquisition and de...IJECEIAES
A remote laboratory utilizing field-programmable gate array (FPGA) technologies enhances students’ learning experience anywhere and anytime in embedded system design. Existing remote laboratories prioritize hardware access and visual feedback for observing board behavior after programming, neglecting comprehensive debugging tools to resolve errors that require internal signal acquisition. This paper proposes a novel remote embeddedsystem design approach targeting FPGA technologies that are fully interactive via a web-based platform. Our solution provides FPGA board access and debugging capabilities beyond the visual feedback provided by existing remote laboratories. We implemented a lab module that allows users to seamlessly incorporate into their FPGA design. The module minimizes hardware resource utilization while enabling the acquisition of a large number of data samples from the signal during the experiments by adaptively compressing the signal prior to data transmission. The results demonstrate an average compression ratio of 2.90 across three benchmark signals, indicating efficient signal acquisition and effective debugging and analysis. This method allows users to acquire more data samples than conventional methods. The proposed lab allows students to remotely test and debug their designs, bridging the gap between theory and practice in embedded system design.
Detecting and resolving feature envy through automated machine learning and m...IJECEIAES
Efficiently identifying and resolving code smells enhances software project quality. This paper presents a novel solution, utilizing automated machine learning (AutoML) techniques, to detect code smells and apply move method refactoring. By evaluating code metrics before and after refactoring, we assessed its impact on coupling, complexity, and cohesion. Key contributions of this research include a unique dataset for code smell classification and the development of models using AutoGluon for optimal performance. Furthermore, the study identifies the top 20 influential features in classifying feature envy, a well-known code smell, stemming from excessive reliance on external classes. We also explored how move method refactoring addresses feature envy, revealing reduced coupling and complexity, and improved cohesion, ultimately enhancing code quality. In summary, this research offers an empirical, data-driven approach, integrating AutoML and move method refactoring to optimize software project quality. Insights gained shed light on the benefits of refactoring on code quality and the significance of specific features in detecting feature envy. Future research can expand to explore additional refactoring techniques and a broader range of code metrics, advancing software engineering practices and standards.
Smart monitoring technique for solar cell systems using internet of things ba...IJECEIAES
Rapidly and remotely monitoring and receiving the solar cell systems status parameters, solar irradiance, temperature, and humidity, are critical issues in enhancement their efficiency. Hence, in the present article an improved smart prototype of internet of things (IoT) technique based on embedded system through NodeMCU ESP8266 (ESP-12E) was carried out experimentally. Three different regions at Egypt; Luxor, Cairo, and El-Beheira cities were chosen to study their solar irradiance profile, temperature, and humidity by the proposed IoT system. The monitoring data of solar irradiance, temperature, and humidity were live visualized directly by Ubidots through hypertext transfer protocol (HTTP) protocol. The measured solar power radiation in Luxor, Cairo, and El-Beheira ranged between 216-1000, 245-958, and 187-692 W/m 2 respectively during the solar day. The accuracy and rapidity of obtaining monitoring results using the proposed IoT system made it a strong candidate for application in monitoring solar cell systems. On the other hand, the obtained solar power radiation results of the three considered regions strongly candidate Luxor and Cairo as suitable places to build up a solar cells system station rather than El-Beheira.
An efficient security framework for intrusion detection and prevention in int...IJECEIAES
Over the past few years, the internet of things (IoT) has advanced to connect billions of smart devices to improve quality of life. However, anomalies or malicious intrusions pose several security loopholes, leading to performance degradation and threat to data security in IoT operations. Thereby, IoT security systems must keep an eye on and restrict unwanted events from occurring in the IoT network. Recently, various technical solutions based on machine learning (ML) models have been derived towards identifying and restricting unwanted events in IoT. However, most ML-based approaches are prone to miss-classification due to inappropriate feature selection. Additionally, most ML approaches applied to intrusion detection and prevention consider supervised learning, which requires a large amount of labeled data to be trained. Consequently, such complex datasets are impossible to source in a large network like IoT. To address this problem, this proposed study introduces an efficient learning mechanism to strengthen the IoT security aspects. The proposed algorithm incorporates supervised and unsupervised approaches to improve the learning models for intrusion detection and mitigation. Compared with the related works, the experimental outcome shows that the model performs well in a benchmark dataset. It accomplishes an improved detection accuracy of approximately 99.21%.
Developing a smart system for infant incubators using the internet of things ...IJECEIAES
This research is developing an incubator system that integrates the internet of things and artificial intelligence to improve care for premature babies. The system workflow starts with sensors that collect data from the incubator. Then, the data is sent in real-time to the internet of things (IoT) broker eclipse mosquito using the message queue telemetry transport (MQTT) protocol version 5.0. After that, the data is stored in a database for analysis using the long short-term memory network (LSTM) method and displayed in a web application using an application programming interface (API) service. Furthermore, the experimental results produce as many as 2,880 rows of data stored in the database. The correlation coefficient between the target attribute and other attributes ranges from 0.23 to 0.48. Next, several experiments were conducted to evaluate the model-predicted value on the test data. The best results are obtained using a two-layer LSTM configuration model, each with 60 neurons and a lookback setting 6. This model produces an R 2 value of 0.934, with a root mean square error (RMSE) value of 0.015 and a mean absolute error (MAE) of 0.008. In addition, the R 2 value was also evaluated for each attribute used as input, with a result of values between 0.590 and 0.845.
A review on internet of things-based stingless bee's honey production with im...IJECEIAES
Honey is produced exclusively by honeybees and stingless bees which both are well adapted to tropical and subtropical regions such as Malaysia. Stingless bees are known for producing small amounts of honey and are known for having a unique flavor profile. Problem identified that many stingless bees collapsed due to weather, temperature and environment. It is critical to understand the relationship between the production of stingless bee honey and environmental conditions to improve honey production. Thus, this paper presents a review on stingless bee's honey production and prediction modeling. About 54 previous research has been analyzed and compared in identifying the research gaps. A framework on modeling the prediction of stingless bee honey is derived. The result presents the comparison and analysis on the internet of things (IoT) monitoring systems, honey production estimation, convolution neural networks (CNNs), and automatic identification methods on bee species. It is identified based on image detection method the top best three efficiency presents CNN is at 98.67%, densely connected convolutional networks with YOLO v3 is 97.7%, and DenseNet201 convolutional networks 99.81%. This study is significant to assist the researcher in developing a model for predicting stingless honey produced by bee's output, which is important for a stable economy and food security.
A trust based secure access control using authentication mechanism for intero...IJECEIAES
The internet of things (IoT) is a revolutionary innovation in many aspects of our society including interactions, financial activity, and global security such as the military and battlefield internet. Due to the limited energy and processing capacity of network devices, security, energy consumption, compatibility, and device heterogeneity are the long-term IoT problems. As a result, energy and security are critical for data transmission across edge and IoT networks. Existing IoT interoperability techniques need more computation time, have unreliable authentication mechanisms that break easily, lose data easily, and have low confidentiality. In this paper, a key agreement protocol-based authentication mechanism for IoT devices is offered as a solution to this issue. This system makes use of information exchange, which must be secured to prevent access by unauthorized users. Using a compact contiki/cooja simulator, the performance and design of the suggested framework are validated. The simulation findings are evaluated based on detection of malicious nodes after 60 minutes of simulation. The suggested trust method, which is based on privacy access control, reduced packet loss ratio to 0.32%, consumed 0.39% power, and had the greatest average residual energy of 0.99 mJoules at 10 nodes.
Fuzzy linear programming with the intuitionistic polygonal fuzzy numbersIJECEIAES
In real world applications, data are subject to ambiguity due to several factors; fuzzy sets and fuzzy numbers propose a great tool to model such ambiguity. In case of hesitation, the complement of a membership value in fuzzy numbers can be different from the non-membership value, in which case we can model using intuitionistic fuzzy numbers as they provide flexibility by defining both a membership and a non-membership functions. In this article, we consider the intuitionistic fuzzy linear programming problem with intuitionistic polygonal fuzzy numbers, which is a generalization of the previous polygonal fuzzy numbers found in the literature. We present a modification of the simplex method that can be used to solve any general intuitionistic fuzzy linear programming problem after approximating the problem by an intuitionistic polygonal fuzzy number with n edges. This method is given in a simple tableau formulation, and then applied on numerical examples for clarity.
The performance of artificial intelligence in prostate magnetic resonance im...IJECEIAES
Prostate cancer is the predominant form of cancer observed in men worldwide. The application of magnetic resonance imaging (MRI) as a guidance tool for conducting biopsies has been established as a reliable and well-established approach in the diagnosis of prostate cancer. The diagnostic performance of MRI-guided prostate cancer diagnosis exhibits significant heterogeneity due to the intricate and multi-step nature of the diagnostic pathway. The development of artificial intelligence (AI) models, specifically through the utilization of machine learning techniques such as deep learning, is assuming an increasingly significant role in the field of radiology. In the realm of prostate MRI, a considerable body of literature has been dedicated to the development of various AI algorithms. These algorithms have been specifically designed for tasks such as prostate segmentation, lesion identification, and classification. The overarching objective of these endeavors is to enhance diagnostic performance and foster greater agreement among different observers within MRI scans for the prostate. This review article aims to provide a concise overview of the application of AI in the field of radiology, with a specific focus on its utilization in prostate MRI.
Seizure stage detection of epileptic seizure using convolutional neural networksIJECEIAES
According to the World Health Organization (WHO), seventy million individuals worldwide suffer from epilepsy, a neurological disorder. While electroencephalography (EEG) is crucial for diagnosing epilepsy and monitoring the brain activity of epilepsy patients, it requires a specialist to examine all EEG recordings to find epileptic behavior. This procedure needs an experienced doctor, and a precise epilepsy diagnosis is crucial for appropriate treatment. To identify epileptic seizures, this study employed a convolutional neural network (CNN) based on raw scalp EEG signals to discriminate between preictal, ictal, postictal, and interictal segments. The possibility of these characteristics is explored by examining how well timedomain signals work in the detection of epileptic signals using intracranial Freiburg Hospital (FH), scalp Children's Hospital Boston-Massachusetts Institute of Technology (CHB-MIT) databases, and Temple University Hospital (TUH) EEG. To test the viability of this approach, two types of experiments were carried out. Firstly, binary class classification (preictal, ictal, postictal each versus interictal) and four-class classification (interictal versus preictal versus ictal versus postictal). The average accuracy for stage detection using CHB-MIT database was 84.4%, while the Freiburg database's time-domain signals had an accuracy of 79.7% and the highest accuracy of 94.02% for classification in the TUH EEG database when comparing interictal stage to preictal stage.
Analysis of driving style using self-organizing maps to analyze driver behaviorIJECEIAES
Modern life is strongly associated with the use of cars, but the increase in acceleration speeds and their maneuverability leads to a dangerous driving style for some drivers. In these conditions, the development of a method that allows you to track the behavior of the driver is relevant. The article provides an overview of existing methods and models for assessing the functioning of motor vehicles and driver behavior. Based on this, a combined algorithm for recognizing driving style is proposed. To do this, a set of input data was formed, including 20 descriptive features: About the environment, the driver's behavior and the characteristics of the functioning of the car, collected using OBD II. The generated data set is sent to the Kohonen network, where clustering is performed according to driving style and degree of danger. Getting the driving characteristics into a particular cluster allows you to switch to the private indicators of an individual driver and considering individual driving characteristics. The application of the method allows you to identify potentially dangerous driving styles that can prevent accidents.
Hyperspectral object classification using hybrid spectral-spatial fusion and ...IJECEIAES
Because of its spectral-spatial and temporal resolution of greater areas, hyperspectral imaging (HSI) has found widespread application in the field of object classification. The HSI is typically used to accurately determine an object's physical characteristics as well as to locate related objects with appropriate spectral fingerprints. As a result, the HSI has been extensively applied to object identification in several fields, including surveillance, agricultural monitoring, environmental research, and precision agriculture. However, because of their enormous size, objects require a lot of time to classify; for this reason, both spectral and spatial feature fusion have been completed. The existing classification strategy leads to increased misclassification, and the feature fusion method is unable to preserve semantic object inherent features; This study addresses the research difficulties by introducing a hybrid spectral-spatial fusion (HSSF) technique to minimize feature size while maintaining object intrinsic qualities; Lastly, a soft-margins kernel is proposed for multi-layer deep support vector machine (MLDSVM) to reduce misclassification. The standard Indian pines dataset is used for the experiment, and the outcome demonstrates that the HSSF-MLDSVM model performs substantially better in terms of accuracy and Kappa coefficient.
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Quality defects in TMT Bars, Possible causes and Potential Solutions.PrashantGoswami42
Maintaining high-quality standards in the production of TMT bars is crucial for ensuring structural integrity in construction. Addressing common defects through careful monitoring, standardized processes, and advanced technology can significantly improve the quality of TMT bars. Continuous training and adherence to quality control measures will also play a pivotal role in minimizing these defects.
Automobile Management System Project Report.pdfKamal Acharya
The proposed project is developed to manage the automobile in the automobile dealer company. The main module in this project is login, automobile management, customer management, sales, complaints and reports. The first module is the login. The automobile showroom owner should login to the project for usage. The username and password are verified and if it is correct, next form opens. If the username and password are not correct, it shows the error message.
When a customer search for a automobile, if the automobile is available, they will be taken to a page that shows the details of the automobile including automobile name, automobile ID, quantity, price etc. “Automobile Management System” is useful for maintaining automobiles, customers effectively and hence helps for establishing good relation between customer and automobile organization. It contains various customized modules for effectively maintaining automobiles and stock information accurately and safely.
When the automobile is sold to the customer, stock will be reduced automatically. When a new purchase is made, stock will be increased automatically. While selecting automobiles for sale, the proposed software will automatically check for total number of available stock of that particular item, if the total stock of that particular item is less than 5, software will notify the user to purchase the particular item.
Also when the user tries to sale items which are not in stock, the system will prompt the user that the stock is not enough. Customers of this system can search for a automobile; can purchase a automobile easily by selecting fast. On the other hand the stock of automobiles can be maintained perfectly by the automobile shop manager overcoming the drawbacks of existing system.
Water scarcity is the lack of fresh water resources to meet the standard water demand. There are two type of water scarcity. One is physical. The other is economic water scarcity.
Vaccine management system project report documentation..pdfKamal Acharya
The Division of Vaccine and Immunization is facing increasing difficulty monitoring vaccines and other commodities distribution once they have been distributed from the national stores. With the introduction of new vaccines, more challenges have been anticipated with this additions posing serious threat to the already over strained vaccine supply chain system in Kenya.
Final project report on grocery store management system..pdfKamal Acharya
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner. If your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing various products available enables registered users to purchase desired products instantly using Paytm, UPI payment processor (Instant Pay) and also can place order by using Cash on Delivery (Pay Later) option. This project provides an easy access to Administrators and Managers to view orders placed using Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Event Management System Vb Net Project Report.pdfKamal Acharya
In present era, the scopes of information technology growing with a very fast .We do not see any are untouched from this industry. The scope of information technology has become wider includes: Business and industry. Household Business, Communication, Education, Entertainment, Science, Medicine, Engineering, Distance Learning, Weather Forecasting. Carrier Searching and so on.
My project named “Event Management System” is software that store and maintained all events coordinated in college. It also helpful to print related reports. My project will help to record the events coordinated by faculties with their Name, Event subject, date & details in an efficient & effective ways.
In my system we have to make a system by which a user can record all events coordinated by a particular faculty. In our proposed system some more featured are added which differs it from the existing system such as security.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Courier management system project report.pdfKamal Acharya
It is now-a-days very important for the people to send or receive articles like imported furniture, electronic items, gifts, business goods and the like. People depend vastly on different transport systems which mostly use the manual way of receiving and delivering the articles. There is no way to track the articles till they are received and there is no way to let the customer know what happened in transit, once he booked some articles. In such a situation, we need a system which completely computerizes the cargo activities including time to time tracking of the articles sent. This need is fulfilled by Courier Management System software which is online software for the cargo management people that enables them to receive the goods from a source and send them to a required destination and track their status from time to time.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Int J Elec & Comp Eng ISSN: 2088-8708
Optimizing Apple Lossless Audio Codec algorithm using NVIDIA CUDA architecture (Rafid Ahmed)
In comparison with traditional GPGPU techniques, CUDA offers several advantages, such as shared memory, faster downloads, and full support for integer and bitwise operations; CUDA code is also easier to understand than OpenCL [9]. The process flow of CUDA programming is described below [10]:
a. Data is copied from main memory to GPU global memory
b. The CPU sends processing instructions to the GPU
c. The GPU cores execute the data in parallel
d. The result is copied from GPU global memory back to main memory
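The four steps above can be sketched as follows. This is an illustrative stand-in, not a real CUDA API: `device_memory` models GPU global memory, and the comprehension models one thread per element.

```python
# Sketch of the CUDA process flow (steps a-d above). Names are illustrative.

def cuda_process_flow(host_data, kernel):
    device_memory = list(host_data)                      # a. host -> GPU global memory
    device_result = [kernel(x) for x in device_memory]   # b./c. each "thread" runs the kernel on one element
    return list(device_result)                           # d. GPU global memory -> host

# Example: a trivial per-sample kernel that doubles each value
doubled = cuda_process_flow([1, 2, 3], lambda x: 2 * x)
```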
In this paper, we propose an implementation of ALAC audio compression on NVIDIA GPUs. The ALAC algorithm follows the same basic principle as other lossless audio compressors, illustrated in Figure 1 [11]. It is a highly serialized algorithm, which makes it inefficient to run on a GPU as-is. For example, as the input audio file is separated into packets, the encoding of the next packet depends on the results of the previously encoded packet. Our redesigned implementation for the CUDA framework aims to parallelize the parts of the algorithm where the computation of a method is not serialized by previous computations.
The paper is organized as follows. Section 2 describes the implementation of the CUDA model on the ALAC encoding and decoding process. The experimental results, with hardware and software setups, are discussed in Section 3, and finally conclusions are drawn in Section 4.
2. RESEARCH METHOD
2.1. SIMT Model
The GPU contains a number of Streaming Multiprocessors (SMs). On each SM, instruction execution follows a SIMD-like model that NVIDIA calls SIMT. In SIMT, the same instruction is issued to all threads in the chosen warp, although not every thread needs to execute it. As a result, threads in a warp that diverge across different paths of a branch cause a loss of parallelism [12]. In our implementation, the branching factor is handled on the CPU before calling the kernel in order to preserve full parallelism. The kernel block size must be chosen less than or equal to the tile size, so that each thread in a block loads one or more elements of a tile into shared memory. The device then performs one instruction fetch for a whole block of threads, in SIMT manner, which reduces the instruction-fetch and processing overhead of load instructions [13]. In our tests, we determined that a configuration of 512 threads per block gives the best performance. In the CUDA implementation of the ALAC algorithm, we decided to exploit the framing and mixing phases of encoding. Decoding an ALAC audio stream to PCM (pulse code modulation) data follows the steps of Figure 1 in reverse, so for the decoding section we parallelize the un-mixing and concatenating phases.
Figure 1. The basic operation of ALAC encoder
2.1.1. SIMT Model in Encoding
In the serial CPU implementation of ALAC, the encoding phase divides the input data into several frames, and each frame is split into even smaller pieces to carry out the mixing operation, as shown in Figure 2. A way to remove this serialization of framing and mixing when encoding stereo audio is to batch all the input packets into CUDA global memory and mix the data in a single parallel operation, as shown in Figure 3, taking advantage of the SIMT nature.
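A minimal sketch of this batching idea follows, in Python for clarity; the actual implementation is a CUDA kernel, and `MIXBITS` here is a fixed simplification of the mixing parameters that ALAC derives per frame.

```python
MIXBITS = 1  # simplified; the real encoder derives mixing parameters per frame

def mix_stereo(left, right):
    """Mid/side-style mix of one sample pair (simplified, not the exact ALAC math)."""
    u = (left + right) >> MIXBITS   # "mid" channel
    v = left - right                # "side" channel
    return u, v

def mix_batched(interleaved):
    # All sample pairs are independent, so each pair can map to one GPU
    # thread; here a comprehension stands in for the CUDA grid.
    return [mix_stereo(l, r)
            for l, r in zip(interleaved[0::2], interleaved[1::2])]
```

Because no pair depends on another, the whole batch can be mixed in one kernel launch instead of frame by frame.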
Int J Elec & Comp Eng, Vol. 8, No. 1, February 2018: 70–75
Figure 2. CPU implementation of Framing and Mixing phase
Figure 3. GPU implementation of Framing and Mixing phase
2.1.2. SIMT Model in Decoding
The decoding process is the reverse of the encoding process. First, the ALAC audio data is decompressed. Then the predictors are run over the data to convert it to PCM data. Finally, for stereo audio, the un-mix function is carried out to concatenate the two channels into a single output buffer. Since the same independent behavior exists in the decoding process, we exploit CUDA's data parallelism by distributing the work at the end of the decoding process across the GPU.
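To illustrate why un-mixing parallelizes the same way, the following sketch shows a simplified mid/side mix and its exact integer inverse (fixed `MIXBITS`/`MIXRES` for illustration, whereas ALAC adapts these per frame):

```python
MIXBITS, MIXRES = 1, 1  # fixed simplification; ALAC chooses these adaptively

def mix(l, r):
    u = (MIXRES * l + ((1 << MIXBITS) - MIXRES) * r) >> MIXBITS
    v = l - r
    return u, v

def unmix(u, v):
    # Exact inverse: (r << MIXBITS) has zero low bits, so the floor shift
    # in mix() loses nothing that cannot be recovered here.
    r = u - ((MIXRES * v) >> MIXBITS)
    return r + v, r  # (left, right)

# Lossless roundtrip over a small sample range; each pair is independent,
# so un-mixing can be distributed one thread per pair on the GPU.
assert all(unmix(*mix(l, r)) == (l, r)
           for l in range(-8, 9) for r in range(-8, 9))
```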
2.2. Memory Coalescing
According to Jang et al., GPU memory subsystems are designed to deliver high bandwidth rather than low-latency access. To attain the highest throughput, a large number of small memory accesses should be buffered, reordered, and coalesced into a small number of large requests [14].
For strided global memory access, the effective bandwidth is always poor: when concurrent threads simultaneously access memory addresses that are located far apart in physical memory, there is no possibility of coalescing the accesses [15]. We therefore restructured the buffer so that the global memory loads issued by the threads of a warp are coalesced in the encoding process, as shown in Figure 4.
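The restructuring can be sketched as converting the interleaved stereo layout into a planar one, so that consecutive threads read consecutive addresses (illustrative Python; the real buffers live in GPU global memory):

```python
# Interleaved layout [L0, R0, L1, R1, ...]: thread t reads index 2*t, so a
# warp touches addresses 0, 2, 4, ... (stride 2, poorly coalesced).
# Planar layout [L0, L1, ..., R0, R1, ...]: thread t reads index t, so a
# warp touches consecutive addresses (coalesced).

def to_planar(interleaved):
    left = interleaved[0::2]
    right = interleaved[1::2]
    return left + right

buf = [10, 20, 11, 21, 12, 22]            # L0 R0 L1 R1 L2 R2
planar = to_planar(buf)                   # [10, 11, 12, 20, 21, 22]
strided_addrs = [2 * t for t in range(3)]  # interleaved left channel: 0, 2, 4
coalesced_addrs = list(range(3))           # planar left channel: 0, 1, 2
```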
Figure 4. Uncoalesced and coalesced memory access
2.3. Pinned Memory
By default, CPU data allocations are pageable. As the CUDA driver uses page-locked (pinned) memory, the GPU cannot directly access data in pageable host memory. Consequently, the driver first allocates a temporary pinned buffer, copies the host data into it, and then transfers the data to device memory. It is therefore better to allocate pinned memory beforehand rather than use pageable memory, in order to reduce copying cost [16]. Since we have to transfer channel buffers from GPU to CPU on each iteration while encoding frames, we keep the CPU channel buffer in pinned memory for fast data transfer.
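The saving can be expressed with a purely illustrative copy-count model (actual transfer times depend on the driver and the PCIe bus):

```python
# A pageable host-to-device transfer needs a staging copy into a
# driver-allocated pinned buffer plus the DMA transfer; a pre-pinned
# buffer needs only the DMA transfer.

def copies_needed(pinned):
    staging = 0 if pinned else 1   # host -> temporary pinned buffer
    dma = 1                        # pinned buffer -> device memory
    return staging + dma

# One host-side copy is saved on every frame iteration
per_frame_saving = copies_needed(pinned=False) - copies_needed(pinned=True)
```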
3. RESULTS AND ANALYSIS
To analyze the performance of the CUDA implementation, we used a GeForce GTX 960 card with CUDA version 7.5, installed on a machine with an Intel(R) Core(TM) i5-4590 CPU running at 3.30 GHz. The
CPU implementation of ALAC was also tested on the same testbed. To evaluate the performance of our GPU implementation of the ALAC algorithm, we selected five representative audio files for testing. All these files have the CD format, i.e., 44.1 kHz, 16 bits.
a. Track 5 of [17] (28 s): Electronic Gong 5kHz (Mono)
b. Track 20 of [17] (39 s): Saxophone
c. Track 50 of [17] (22 s): Male speech (English)
d. Track 68 of [17] (2 min 44 s): Orchestra
e. Track 70 of [17] (21 s): Song by Eddie Rabbitt
The audio files are collected from the Sound Quality Assessment Material (SQAM) recordings for subjective tests [17]. The files were converted from FLAC to WAV to work with PCM data. To measure the results, we ran each test 10 times on each dataset for each reading and report the average running results in this section.
3.1. Encoding Results
3.1.1. Execution Speed Gain
The results in Table 1 show the average running speed of both mono and stereo audio (Track 68) for CPU and GPU separately. For 16-bit audio we achieve a speed gain of about 67% for mono and 85% for stereo. For 24-bit audio the speedup is around 90% for both audio types, and for 32-bit audio it is around 81%. We can infer that 24-bit audio conversion yields the fastest encoding on the GPU, while 16-bit audio is the slowest.
In Figure 5, we see that for the first test file (Track 5) we achieve a 3x speedup, for the second and third test files (Tracks 20 and 50) a 7x speedup, for the fourth test file (Track 68) a 6.5x speedup, and lastly for the fifth file (Track 70) a 5x speedup.
Table 1. Mixing/Converting phase execution time comparison for encoding process
Bit Depth   CPU mono (ms)   GPU mono (ms)   CPU stereo (ms)   GPU stereo (ms)
16          2.85            0.94            20.3              3.0259
24          25.68           1.62            63.5              5.246
32          9.32            1.71            38                4.8266
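The percentages quoted in the text are relative reductions in execution time, computed from the rows of Table 1; for example:

```python
def pct_speedup(cpu_ms, gpu_ms):
    # Relative execution-time reduction: (CPU - GPU) / CPU, in percent
    return 100.0 * (cpu_ms - gpu_ms) / cpu_ms

# 16-bit stereo row of Table 1: 20.3 ms (CPU) vs 3.0259 ms (GPU)
print(round(pct_speedup(20.3, 3.0259), 1))   # 85.1, the "85% for stereo" figure
# 24-bit mono row: 25.68 ms vs 1.62 ms
print(round(pct_speedup(25.68, 1.62), 1))    # 93.7, the "around 90%" figure
```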
Figure 5. Mixing/Converting phase speedup for encoding and decoding processes against CPU using the dataset
3.1.2. Power Consumption Saving
Looking at Table 2, the power consumption saving is lowest for 16-bit audio for both mono and stereo files, while for 24-bit audio it is the highest.
Table 2. CPU power saving in encoding process
Bit Depth   Saving for mono (Watt)   Saving for stereo (Watt)
16          0.66792                  3.636927
24          5.49444                  9.209082
32          2.0544                   7.257163
3.2. Decoding Results
3.2.1. Execution Speed Gain
Decoding results for both mono and stereo files (Track 68) are shown in Table 3. According to the results, we obtain around an 85% speed increase for 16-bit mono audio and 90% for 16-bit stereo. For 24-bit audio we achieve more than a 90% speedup, and for 32-bit audio around 89%. From this table we can also state that 24-bit audio decoding is fastest on the GPU, while 16-bit audio conversion is the slowest.
Table 3. Un-mixing/Converting phase execution time comparison for decoding process
Bit Depth   CPU mono (ms)   GPU mono (ms)   CPU stereo (ms)   GPU stereo (ms)
16          4.02            0.5892          10.25             1.0764
24          10.884          0.8352          28.032            1.8192
32          7.093           0.864           20.354            1.715
3.2.2. Power Consumption Saving
For the decoding stage, we obtain less power saving than for the encoding stage. Here also, 16-bit audio has the least power saving and 24-bit audio the highest, as shown in Table 4.
Table 4. CPU power saving in decoding process
Bit Depth   Saving for mono (Watt)   Saving for stereo (Watt)
16          0.690414                 1.577814
24          1.85396                  4.379118
32          1.124886                 3.087194
4. CONCLUSION
In this paper, we analyzed the feasibility of using the CUDA framework for the Apple Lossless Audio Codec compression algorithm. Our primary focus was on outperforming the mixing/un-mixing speed of the CPU-based ALAC implementation by using NVIDIA GPUs without losing any compression ratio. We tested our implementation on several datasets and made comparisons. Experimental results demonstrate that we achieve an average of 80–95% speedup for mixing/un-mixing audio data.
REFERENCES
[1] Firmansah L, Setiawan EB, “Data audio compression lossless FLAC format to lossy audio MP3 format with
Huffman shift coding algorithm,” 2016 4th International Conference on Information and Communication
Technology (ICoICT). 2016 May.
[2] Salomon D. A concise introduction to data compression. London: Springer London; 2008 Jan 14. 227–44 p.
ISBN: 9781848000711.
[3] Waggoner B, “Compression for great video and audio: Master tips and common sense,” 2nd ed. Oxford, United
Kingdom: Elsevier Science; 2009 Nov 23. 341–7 p. ISBN: 9780240812137.
[4] Yu R, Rahardja S, Xiao L, Ko CC, “A fine granular scalable to lossless audio coder,” IEEE Transactions on
Audio, Speech and Language Processing. 2006 Jul; 14(4):1352–63.
[5] Salomon D, Motta G, Bryant D, “Data compression: The complete reference,” 4th ed. London: Springer London;
2006 Dec 19. 773–83 p. ISBN: 9781846286025.
[6] Gunawan T, Zain M, Muin F, Kartiwi M, “Investigation of Lossless Audio Compression using IEEE 1857.2
Advanced Audio Coding,” Indonesian Journal of Electrical Engineering and Computer Science, 2017;6:
422 – 430
[7] Apple Lossless [Internet]. Applelossless.com. 2017 [cited 23 September 2017]. Available from:
http://www.applelossless.com/
[8] Waggoner B, “Compression for great video and audio: Master tips and common sense,” 2nd ed. Oxford, United
Kingdom: Elsevier Science; 2009 Nov 23. 215–21 p. ISBN: 9780240812137.
[9] Guo S, Chen S, Liang Y, “A Novel Architecture of Multi-GPU Computing Card,” TELKOMNIKA
(Telecommunication Computing Electronics and Control), 2013;11(8)
[10] Nasreen A, G S, “Parallelizing Multi-featured Content Based Search and Retrieval of Videos through High
Performance Computing,” Indonesian Journal of Electrical Engineering and Computer Science, 2017;5(1):214
[11] Hans M, Schafer RW, “Lossless compression of digital audio,” IEEE Signal Processing Magazine. 2001 Jul;18(4):5–15.
[12] Tsutsui S, Collet P, editors, “Massively parallel evolutionary computation on GPGPUs,” Berlin: Springer-Verlag
Berlin and Heidelberg GmbH & Co. KG; 2013 Dec 6. 149–56 p. ISBN: 9783642379581.
[13] Al-Mouhamed M, Khan A ul H, “Exploration of automatic optimization for CUDA programming,” International
Journal of Parallel, Emergent and Distributed Systems, 2014 Sep 16;30(4):309–24.
[14] Jang B, Schaa D, Mistry P, Kaeli D, “Exploiting memory access patterns to improve memory performance in
data-parallel architectures,” IEEE Transactions on Parallel and Distributed Systems, 2011 Jan;22(1):105–18.
[15] Harris M. [Internet]: Parallel For all. How to access global memory efficiently in CUDA C/C++ kernels; 2013
Jan 7 [cited 2016 Nov 14]. Available from: https://devblogs.nvidia.com/parallelforall/how-access-global-
memory-efficiently-cuda-c-kernels/
[16] Harris M. [Internet]: Parallel For all. How to optimize data transfers in CUDA C/C++; 2012 Dec 4 [cited 2016
Nov 14]. Available from: https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/
[17] Sound quality assessment material: recordings for subjective tests; user’s handbook for the EBU - SQAM
compact disc. Geneva: Technical Centre; 2008. [cited 2016 Nov 14]. Available from:
https://tech.ebu.ch/docs/tech/tech3253.pdf.