This document summarizes a survey on GPU systems and their performance on different applications. It discusses how GPUs can be used for general-purpose computing due to their high parallel processing capabilities. Several computationally intensive applications that achieve speedups when implemented on GPUs are described, including video decoding, matrix multiplication, parallel AES encryption, and password recovery for MS Office documents. The GPU architecture and Nvidia's CUDA programming model are also summarized. While GPUs provide significant performance benefits, some limitations for non-graphics applications are noted. The conclusion is that GPUs are a strong alternative for computationally intensive tasks, reducing CPU load and improving performance compared to CPU-only implementations.
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
Computer Science & Engineering: An International Journal (CSEIJ), Vol. 3, No. 4, August 2013
DOI: 10.5121/cseij.2013.3402
Dattatraya Londhe (1), Praveen Barapatre (2), Nisha Gholap (3), Soumitra Das (4)

(1) Department of Computer Engineering, University of Mumbai, Gharda Institute of Technology, Lavel, Maharashtra, India. londhedn@gmail.com
(2) Department of Computer Engineering, University of Pune, SKNSITS, Lonavala, Maharashtra, India. pravinbarapatre@hotmail.com
(3) Department of Computer Engineering, KJ College of Engineering, Pune, Maharashtra, India. golap.nish@gmail.com
(4) Department of Computer Engineering, University of Pune, KJ College of Engineering, Pune, Maharashtra, India. soumitra.das@gmail.com
ABSTRACT
In this paper we study the NVIDIA graphics processing unit (GPU), its computational power, and its applications. Although these units are specially designed for graphics applications, their computational power can be employed for non-graphics applications as well. The GPU offers high parallel processing power, low cost of computation, and short execution times, giving a good performance-per-energy ratio. Deploying the GPU for the excessive computation of similar small sets of instructions has played a significant role in reducing CPU overhead. The GPU has several key advantages over the CPU architecture: it provides high parallelism, intensive computation, and significantly higher throughput. It consists of thousands of hardware threads that execute programs in a SIMD fashion, so the GPU can be an alternative to the CPU in high-performance and supercomputing environments. The bottom line is that GPU-based general-purpose computing is a hot topic of research, and there is much to explore beyond graphics processing applications.
KEYWORDS
Graphics processing, Hardware threads, Supercomputing, Parallel processing, SIMD
1. INTRODUCTION
Inventions and research in technology have always increased human comfort and reduced human effort. These implicit aims have always motivated researchers to explore different dimensions in technology and science. Computer technology now plays a great role when excessive computation is needed to solve a particular problem. GPUs have been widely used as components of complex graphics applications. Nowadays these graphics processing units are gradually making their way into cluster computing systems as high-performance computing units, owing to their prominent computational power.
When the CPU was the only unit for computation, many tasks had to wait for completion; gradually the idea of processor clustering came to market, which not only increased performance but also made complex computing easier. Clustering of processors proved beneficial for complex computation, but along with its benefits came some unwanted features, such as a high amount of investment and high cost when the computation is less complex.

The invention of the GPU proved to be a boon not only for graphics-related applications but also for other computationally excessive SIMD (Single Instruction, Multiple Data) tasks. Over a few years the GPU has evolved from a fixed-function special-purpose processor into a full-fledged parallel programmable processor with additional fixed-function special-purpose functionality [1]. GPGPU (General-Purpose computing on GPUs) is the study of how to use the GPU for more general application computation, and it is gradually growing [2]. In 2006 Nvidia announced its CUDA (Compute Unified Device Architecture) system, a development platform specifically designed for programming non-graphics applications on the GPU. CUDA provides a C-like syntax for executing on the GPU and compiles offline, which has won the favor of many programmers [1]. CUDA allows programmers to design highly parallel computation programs with ease on the GPU. A CUDA program is mixed code for the GPU and CPU: the main routine, compiled by the standard C compiler, is generally executed on the CPU, while the parallel computing portion is compiled into GPU code and then transferred to the GPU [3]. A function executed by CUDA threads is called a kernel; n such CUDA threads will perform this kernel n times in parallel.
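To make the kernel concept concrete, the following minimal sketch (our illustration, not code from any of the surveyed papers) shows this CPU/GPU split in CUDA C: the main routine runs on the CPU, the data is transferred to the GPU, and n threads execute the vecAdd kernel in parallel.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: each of the n threads executes this body once, in parallel.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) data.
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes),
          *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device (GPU) data: the parallel portion's inputs are transferred to the GPU.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);     // n threads run the kernel

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);                   // prints 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```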
2. STUDY OF GPU
The first-generation NVIDIA unified visual computing architecture in the GeForce 8 and 9 series GPUs was based on a scalable processor array (SPA) framework. The second-generation architecture in the GeForce GTX 200 GPUs is based on a re-engineered, extended SPA architecture [4]. The SPA architecture consists of a number of TPCs, which stands for "Texture Processing Clusters" in graphics processing mode and "Thread Processing Clusters" in parallel computational mode. Each TPC is in turn made up of a number of streaming multiprocessors (SMs), and each SM contains eight processor cores, also called streaming processors (SPs) or thread processors [4]. An example is the NVIDIA G80 GPU, which includes 128 streaming processors; since an SM consists of eight streaming processors, the G80 GPU contains 16 SMs. The SM is responsible for the creation, management, and execution of concurrent threads in hardware with no overhead, and it supports very fine-grained parallelism. The GPU's architecture is featured for parallel computing: the difference between the computation modes of the CPU and the GPU is that the GPU is specialized for compute-intensive and highly parallel computation.
For parallel computing the user can define threads which run on the GPU in parallel using standard instructions. The user can declare the number of threads that run on a single SM by specifying a block size, and can state the number of thread blocks by declaring a grid size. A grid of threads makes up a single kernel of work which can be sent to the GPU, as the sketch below illustrates.
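As an illustration of block and grid declaration (our sketch; the fill kernel is an assumed example, not code from the paper), the dim3 type in CUDA C declares the block size and the grid size so that the grid of blocks covers the whole problem:

```cuda
// Each thread writes one element of a width x height array.
__global__ void fill(float *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 1.0f;
}

void launch(float *d_out, int width, int height) {
    dim3 block(16, 16);                          // block size: 256 threads per SM-resident block
    dim3 grid((width  + block.x - 1) / block.x,  // grid size: enough blocks to
              (height + block.y - 1) / block.y); // cover the whole problem
    fill<<<grid, block>>>(d_out, width, height); // one kernel = one grid of thread blocks
}
```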
GeForce GTX 200 GPUs include two different architectures: graphics and computing.
Fig.1: GeForce GTX 280 GPU Graphics Processing Architecture
Fig.2: GeForce GTX 280 GPU Parallel Computing Architecture
3. LITTLE ABOUT CUDA
One of CUDA's characteristics is that it is an extension of the C language. CUDA allows the developer to create special C functions, called kernels, which execute on n different CUDA threads. A kernel call is a single invocation of the code which runs until completion. The GPU follows the SIMD / Single Instruction, Multiple Threads (SIMT) model, and all the threads are supposed to execute before the kernel finishes [5]. The CUDA API helps the user define the number of threads and thread blocks.
Each thread block, called a CUDA block, runs on a single SM. The threads in a block are synchronized using a synchronization barrier, and are grouped together into units called CUDA warps [5]. The memory architecture of CUDA threads is as follows.
Fig.3. Memory Architecture
Here each thread has private local memory. Each thread block has a shared memory visible to all threads of the block, with the same lifetime as the block. Finally, all thread blocks form grids, as shown, which have access to the same global memory [5]. The sketch below walks through these three levels.
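The following sketch (ours, assuming a block size of 256 threads) shows all three memory levels in one kernel: per-thread local variables, a per-block __shared__ array with barrier synchronization, and reads and writes to global memory:

```cuda
// Block-wide sum: each block reduces 256 input elements to one partial sum.
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float partial[256];          // shared: visible to all threads of this block
    int tid = threadIdx.x;                  // tid and i live in per-thread local storage
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;  // read from global memory
    __syncthreads();                        // barrier: wait for the whole block

    // Tree reduction in shared memory (assumes blockDim.x == 256, a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = partial[0]; // one result per block, back to global memory
}
```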
4. COMPUTATIONALLY INTENSIVE APPLICATIONS AND THEIR PERFORMANCE ON THE GPU
4.1 Video Decoding:
When it comes to video or any multimedia application, quality of service becomes the main issue to be handled, and people are becoming more and more concerned about the quality of video/visual appliances. GPU units were specifically designed for work such as faster graphics rendering and better graphics effects rather than video decoding. In spite of this, the GPU still proved beneficial in partially handling the video decoding task: it could be used to perform operations that are concerned only with per-vertex and per-pixel work. If a block is a regular shape, its vertices can be handled efficiently by the vertex shader; "per pixel" means all the pixels in a block go through the same processing. Video decoding is highly complex and computationally intensive due to the huge amount of video data and the complex conversion and filtering processes involved. The most computational parts of video decoding are Color Space Conversion (CSC), Motion Compensation (MC), inverse DCT (IDCT), Inverse Quantization (IQ), and Variable Length Decoding (VLD). In the CSC process every pixel is translated from YUV space to RGB space using the same equation, while for the IDCT every pixel is transformed using different DCT bases determined by its position [6]. Clearly, the most computationally complex stages, MC and CSC, are well suited for the GPU to process, since both are block-wise, per-pixel operations, while IQ, IDCT, and VLD are handled by the CPU.
The CPU and GPU work in a pipelined manner. The CPU handles the operational tasks which are sequential, are not per-pixel, or would cause more memory traffic between CPU and GPU; so the CPU handles operations like VLD, IDCT, and IQ, whereas the GPU handles MC and CSC along with display. The experiment establishes CPU/GPU load balance by accommodating a large buffer between the CPU and GPU; the intermediate buffer effectively absorbed most decoding jitter of both CPU and GPU and contributed significantly to the overall speed-up [6].
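The cited experiment predates CUDA and used programmable shaders, but the per-pixel, same-equation structure of CSC can be re-expressed as a CUDA sketch. The BT.601 conversion coefficients and the planar 4:4:4 layout below are our assumptions for illustration, not details from the paper:

```cuda
// Per-pixel YUV -> RGB color space conversion, one thread per pixel.
__global__ void csc(const unsigned char *y, const unsigned char *u,
                    const unsigned char *v, unsigned char *rgb,
                    int width, int height) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    int i = py * width + px;                 // assumes planar 4:4:4 layout
    float Y = y[i] - 16.0f;
    float U = u[i] - 128.0f;
    float V = v[i] - 128.0f;

    // Same equation for every pixel (BT.601 coefficients).
    float r = 1.164f * Y + 1.596f * V;
    float g = 1.164f * Y - 0.392f * U - 0.813f * V;
    float b = 1.164f * Y + 2.017f * U;

    rgb[3 * i + 0] = (unsigned char)fminf(fmaxf(r, 0.0f), 255.0f);
    rgb[3 * i + 1] = (unsigned char)fminf(fmaxf(g, 0.0f), 255.0f);
    rgb[3 * i + 2] = (unsigned char)fminf(fmaxf(b, 0.0f), 255.0f);
}
```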
We show experimental results of GPU-assisted video decoding on a PC with an Intel Pentium III 667-MHz CPU, 256 MB of memory, and an NVIDIA GeForce3 Ti200 GPU. This experiment was carried out by Guobin Shen, Guang-Ping Gao, Shipeng Li, Heung-Yeung Shum, and Ya-Qin Zhang in paper [6].
Table 1: Experimental results of GPU-assisted video decoding on a PC with an Intel Pentium III 667-MHz CPU, 256 MB memory, and an NVIDIA GeForce3 Ti200 GPU

Sequence | Format               | Bit rate | Frame rate (CPU only) | Frame rate (CPU + GPU) | Speed-up
Football | SIF (320 x 240)      | 2 Mbps   | 81.0 fps              | 135.4 fps              | 1.67
Total    | CIF (352 x 288)      | 2 Mbps   | 84.7 fps              | 186.7 fps              | 2.2
Trap     | HD 720p (1280 x 720) | 5 Mbps   | 9.9 fps               | 31.3 fps               | 3.16
Thus video decoding with a generic GPU efficiently increases performance.
4.2 Matrix Multiplication
Some mathematical operations are not practical to solve with pen and paper; the solution is to use the CPU as a computational device. But mathematical operations like the multiplication of huge matrices overload the CPU, degrading performance. The solution now is to use either a multi-core CPU architecture or the GPU. The advantage of the GPU over the CPU architecture is that the GPU is best suited for SIMD operations, and matrix multiplication is a prime example of SIMD. In this application a kernel makes up the computation of the matrix multiplication on the GPU. Along with the multiplication, other initialization is needed to prepare the GPU for this computation, including declaring the threads and blocks in which the values will be stored [5]. A sketch of this structure follows.
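A minimal sketch of this structure (ours, not the authors' code): each GPU thread independently computes one element of the product, which is exactly the SIMD pattern described above:

```cuda
// C = A * B for n x n matrices; one thread per output element.
__global__ void matMul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)            // dot product of row and column
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Host-side launch: a 2D grid of blocks covering the n x n output matrix.
void multiply(const float *dA, const float *dB, float *dC, int n) {
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    matMul<<<grid, block>>>(dA, dB, dC, n);
}
```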
We considered the experiment performed by Fan Wu, Miguel Cabral, and Jessica Brazelton in paper [5]. They organize the program in three parts: the main file recognized by the compiler as the starting point of the program, the matrix multiplication algorithm on the CPU, and the matrix multiplication algorithm on the GPU. After executing the proposed program, the results show that the GPU is much faster than the CPU for matrix multiplication; increasing the size of the matrix did not have as great an impact on the GPU as it had on the CPU. The result of this experiment is shown as a graph comparing the performance of the CPU- and GPU-based algorithms.
Fig.4: Performance Comparison between CPU and GPU based Algorithms.
4.3 Parallel AES algorithm
Information security has gained much researcher attention due to the increase in threats to companies' important information assets. Data encryption plays an important role in data security, and security grows with the complexity of the encryption algorithm. AES, i.e. the Rijndael algorithm [7], is a symmetric-key cryptography algorithm widely used in data encryption. The traditional CPU-based AES implementation shows poor performance and cannot meet the demands of fast data encryption [8]. AES is a block cipher, which divides the plaintext into fixed-size blocks; without considering any block cipher mode of operation, the computation of each block is independent of the others. When we use the CPU for AES encryption, each block is encrypted serially, leading to long execution times for plaintext encryption. The GPU, on the other hand, processes the plaintext blocks in parallel, reducing the encryption time.
We studied the parallel AES algorithm by Deguang Le, Jinyi Chang, Xingdou Gou, Ankang Zhang, and Conglan Lu in paper [8]. In this experiment the hardware used was an Intel Core 2 Duo E8200 CPU, 1 GB of memory, and an NVIDIA GeForce GTS250 GPU graphics card. The software was the parallel AES algorithm running on Windows XP. The result of this experiment is presented as a graph comparing the speed of the serial and parallel AES algorithms, as follows.
Fig.5: Comparisons of AES algorithms
The speedup is calculated using the following formula:

Speedup = AES_CPU_Time / AES_GPU_Time
This experiment achieved a 7x speedup over the implementation of AES on a comparable CPU [8].
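The cited paper's kernel is not reproduced here, but the parallelization it describes, one independent plaintext block per thread, can be sketched as follows. aes_encrypt_block is a hypothetical device function standing in for the Rijndael round transformations on a single block:

```cuda
// Skeleton only: in ECB-style operation each 16-byte block is independent,
// so one GPU thread encrypts one block.
__device__ void aes_encrypt_block(const unsigned char *in, unsigned char *out,
                                  const unsigned char *roundKeys);  // assumed helper

__global__ void aesEncrypt(const unsigned char *plaintext,
                           unsigned char *ciphertext,
                           const unsigned char *roundKeys,
                           int numBlocks) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per 16-byte block
    if (b < numBlocks)
        aes_encrypt_block(plaintext + 16 * b, ciphertext + 16 * b, roundKeys);
}
```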
4.4 Password Recovery for MS Office 2003 Documents
Digital data has been increasing rapidly in recent years, so the proper organization of data has become an important concern. This is where MS Office comes into the picture: it helps organize data files properly and also provides security by means of encryption and passwords. MS Office 2003 and previous versions organize documents in the CFB (Compound File Binary) structure [9], which contains independent data files organized in a hierarchy of storages. Three kinds of encryption scheme are available in Office 2003: XOR obfuscation, 40-bit RC4 encryption, and CryptoAPI RC4 encryption [11]. Excel puts its encryption information in the 'Workbook Globals Substream' [11]; PPT holds its encryption information in the 'CryptSession10Container' in the 'PowerPoint Document' stream [12].
We considered the experiment performed by Xiaojing Zhan and Jingxin Hong in paper [10]. The Office document is first analyzed and its encryption information extracted, which is done by the CPU; after that, password verification is performed, which involves exhaustive calculation consisting of plenty of SIMD tasks, so the GPU is used for password verification (a sketch of this thread-per-candidate structure follows the tables below). The results of the experiment are shown in tabular form below.
Table 2: Comparison of time cost between CPU and GPU

Encryption Scheme | CPU (Intel Core i5 650 @ 3.20 GHz, 2.00 GB RAM, 32-bit Windows 7) | GPU (GeForce GTX 470)
XOR Obfuscation   | <12 min  | N/A
40-bit RC4        | <64.6 h  | <4.4 h
CryptoAPI RC4     | <47.4 h  | <4.6 h
Table 3: Time cost on GPU with different password lengths

Encryption Scheme | Length 6 | Length 7 | Length 8 | Length 9
40-bit RC4        | <4.4 h   | <11.4 d  | <705.1 d | <119.8 y
CryptoAPI RC4     | <4.6 h   | <11.9 d  | <740.4 d | <125.8 y
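The thread-per-candidate structure of the GPU verification stage can be sketched as follows (our illustration, not the cited implementation; derive_key_and_check is a hypothetical device function standing in for the RC4 key derivation and verifier comparison, and candidates are assumed to be enumerable by index):

```cuda
// Skeleton: exhaustive password verification, one candidate per thread.
__device__ bool derive_key_and_check(unsigned long long candidateIndex,
                                     const unsigned char *verifier);  // assumed helper

__global__ void searchPasswords(unsigned long long start,
                                unsigned long long count,
                                const unsigned char *verifier,
                                unsigned long long *found) {
    unsigned long long idx = start
        + blockIdx.x * (unsigned long long)blockDim.x
        + threadIdx.x;                              // one candidate per thread
    if (idx < start + count && derive_key_and_check(idx, verifier))
        *found = idx;   // report the matching candidate (at most one match expected)
}
```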
5. GPU LIMITATIONS

Along with all the GPU's advantages come some limitations which should be considered when designing any GPU application. The major limitations which can directly or indirectly affect the performance or quality of an application are as follows.

1. The memory bandwidth between CPU and GPU is limited.
2. Read-back from GPU memory to main memory is costly.
3. The GPU has a small instruction set and is mainly designed for graphics applications.
4. Bitwise operations are not well supported.
5. The GPU still lacks some features; one example is big-number support.
6. The GPU is not optimized for serial operations.
6. CONCLUSION
From the above study we conclude that the GPU is a strong alternative for exhaustive computational tasks. Employing the GPU not only increases the speed of execution but also frees the CPU from the load of performing serially executable tasks. A combination of CPU and GPU can render high performance in many applications at low cost compared to a cluster of CPUs.

The best programming language for programming the GPU is CUDA. It is very efficient and, as an extension of the C language, easily understandable to most programmers. CUDA programming helps in designing heterogeneous computational code, a combination of serial and parallel execution tasks performed by the CPU and GPU respectively.

Many computationally intensive applications have gained benefits from the use of the GPU in their computation, and there are many more applications under study where researchers are trying to deploy GPU units to obtain the best results.
REFERENCES
[1] John D. Owens, Mike Houston, David Luebke, et al., "GPU Computing", Proceedings of the IEEE, 2008, 96(5): 879-899.
[2] Zhang Hao, Li Lijun, Li Lan, "General Purpose Computation on Graphics Processors", Computer and Digital Engineering, 2005, 33(12): 60-62, 98.
[3] Xiaojing Zhan, Jingxin Hong, "Study on GPU-based Password Recovery for MS Office 2003 Documents", The 7th International Conference on Computer Science and Education (ICCSE 2012), July 14-17, 2012, 978-1-4673-242-5/12 @ 2012 IEEE.
[4] NVIDIA Technical Brief, "NVIDIA GeForce GTX 200 GPU Architecture Overview: Second-Generation Unified GPU Architecture for Visual Computing", May 2008.
[5] Fan Wu et al., "High Performance Matrix Multiplication on General Purpose Graphics Processing Units", 2010 International Conference on Computational Intelligence and Software Engineering (CiSE), 978-1-4244-5392-4/10 @ 2010 IEEE.
[6] Guobin Shen, Guang-Ping Gao, Shipeng Li, Heung-Yeung Shum, and Ya-Qin Zhang, "Accelerate Video Decoding with Generic GPU", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 5, May 2005.
[7] J. Daemen and V. Rijmen, The Design of Rijndael: AES, The Advanced Encryption Standard. New York, USA: Springer-Verlag, 2002.
[8] Deguang Le et al., "Parallel AES Algorithm for Fast Data Encryption on GPU", 2010 2nd International Conference on Computer Engineering and Technology (ICCET), 978-1-4244-6349-7/10 @ 2010 IEEE.
[9] Compound File Binary File Format [S]. Microsoft Corporation, 2010. Available: http://download.microsoft.com/download/a/e/6/ae6e4142-aa58-45c6-8dcf-a657e5900cd3/%5BMS-CFB%5D.pdf
[10] Xiaojing Zhan and Jingxin Hong, "Study on GPU-based Password Recovery for MS Office 2003 Documents", 7th International Conference on Computer Science & Education (ICCSE 2012), July 14-17, 2012, Melbourne, Australia.
[11] Excel Binary File Format (.xls) Structure Specification [S]. Microsoft Corporation, 2010. Available: http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel97-2007BinaryFileFormat(xls)Specification.pdf
[12] PowerPoint Binary File Format (.ppt) Structure Specification [S]. Microsoft Corporation, 2010. Available: http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/PowerPoint97-2007BinaryFileFormat(ppt)Specification.pdf
AUTHORS
Mr. Dattatraya N Londhe received his Bachelor's Degree in Computer Engineering from VPCOE Baramati, Maharashtra, India and is pursuing a Master's Degree in Computer Engineering from SKNSITS Lonavala, Pune University, India. He is currently working as an Assistant Professor at Gharda Institute of Technology, Lavel, affiliated to Mumbai University. His areas of interest are parallel computing and information security.

Mr. Praveen R Barapatre received his Bachelor's Degree in Computer Science and Engineering from RGTU, Bhopal and a Master's Degree in Remote Sensing and GIS from MANIT, Bhopal. He is currently working as an Assistant Professor and HOD IT at SKNSITS Lonavala, affiliated to Pune University, Maharashtra, India. His areas of interest are image processing and parallel computing.

Miss. Nisha P Gholap received her Bachelor's Degree in Information Technology from Trinity College of Engineering & Research, Pune, and is pursuing a Master's Degree in Computer Engineering from KJ College of Engineering & Research, Pune. She is currently working as an Assistant Professor at Gharda Institute of Technology, Lavel, affiliated to Mumbai University. Her areas of interest are information security and parallel computing.

Mr. Saumitra S Das received his Bachelor's Degree in Computer Engineering from North Maharashtra University and a Master's Degree in Computer Engineering from DY Patil College of Engineering, Pune. He is currently pursuing a PhD in energy efficiency in WSN and is working as an Associate Professor at KJ College of Engineering & Research, Pune, affiliated to Pune University, India. His areas of interest are WSN and parallel computing.