Highlighted notes on Hybrid Multicore Computing
While doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
In this comprehensive report, Prof. Dip Banerjee describes the benefits of utilizing both multicore systems (CPUs with vector instructions) and manycore systems (GPUs with a large number of low-speed ALUs). Such hybrid systems benefit several algorithms, as a single accelerator cannot be optimal for all parts of an algorithm (some computations are very regular, while others are very irregular).
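To illustrate the regular vs irregular split above, here is a small Python sketch (my own illustration, not from the report): a regular, uniform computation that suits wide vector/GPU execution, next to an irregular, data-dependent one that suits a multicore CPU better.

```python
import numpy as np

# Sketch: regular vs irregular phases of one algorithm. The regular phase is a
# uniform elementwise update (maps well to vector units / GPU threads); the
# irregular phase does data-dependent, pointer-chasing work (branchy, divergent).

def regular_phase(values):
    """Same operation on every element: ideal for SIMD/GPU."""
    return values * 2.0 + 1.0

def irregular_phase(values, next_index):
    """Data-dependent traversal: each step's memory access depends on the last."""
    total, i = 0.0, 0
    for _ in range(len(values)):
        total += values[i]
        i = next_index[i]        # unpredictable access pattern
    return total

vals = np.array([1.0, 2.0, 3.0, 4.0])
print(regular_phase(vals))                  # [3. 5. 7. 9.]
print(irregular_phase(vals, [2, 0, 3, 1]))  # visits 1 -> 3 -> 4 -> 2, total 10.0
```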
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ... (ijcsit)
In Network-on-Chip (NoC) based systems, energy consumption is affected by task scheduling and allocation schemes, which in turn affect the performance of the system. In this paper we test pre-existing algorithms and introduce a new energy-efficient algorithm for 3D NoC architectures. Efficient dynamic and cluster-based approaches are proposed, along with optimization using a bio-inspired algorithm. The proposed algorithm has been implemented and evaluated on randomly generated benchmarks and real-life applications such as MMS, Telecom and VOPD. The algorithm has also been tested with the E3S benchmark and compared with the existing mapping algorithms Spiral and Crinkle, showing better reduction in communication energy consumption and improvement in system performance. Experimental analysis of the proposed algorithm shows an average reduction in energy consumption of 49%, a reduction in communication cost of 48%, and a reduction in average latency of 34%. The cluster-based approach is mapped onto the NoC using the Dynamic Diagonal Mapping (DDMap), Crinkle and Spiral algorithms, and DDMap was found to provide improved results. On analysis and comparison of cluster mapping using the DDMap approach, the average energy reduction is 14% and 9% relative to Crinkle and Spiral, respectively.
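A common cost model behind such NoC mapping work estimates communication energy from data volume times hop distance on the mesh. A minimal sketch of that idea (the task graph, tile coordinates, and per-hop energy below are illustrative assumptions, not values from the paper):

```python
# Sketch (assumed model): communication energy of a task mapping on a 2D mesh
# NoC, as sum over communicating task pairs of (data volume x Manhattan hops).

def hops(a, b):
    """Manhattan distance between two mesh tiles (x, y)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def comm_energy(task_graph, mapping, e_per_bit_hop=1.0):
    """task_graph: {(src_task, dst_task): bits}; mapping: {task: (x, y) tile}."""
    return sum(bits * hops(mapping[s], mapping[d]) * e_per_bit_hop
               for (s, d), bits in task_graph.items())

tg = {("t0", "t1"): 100, ("t1", "t2"): 50}
near = {"t0": (0, 0), "t1": (0, 1), "t2": (1, 1)}   # neighbouring tiles: 1 hop each
far  = {"t0": (0, 0), "t1": (2, 2), "t2": (0, 3)}   # tasks spread across the mesh
print(comm_energy(tg, near))  # 150.0
print(comm_energy(tg, far))   # 550.0
```

A mapping algorithm such as DDMap, Crinkle or Spiral is essentially searching for a mapping that minimizes this kind of cost.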
The complexity of Medical image reconstruction requires tens to hundreds of billions of computations per second. Until few years ago, special purpose processors designed especially for such applications were used. Such processors require significant design effort and are thus difficult to change as new algorithms in reconstructions evolve and have limited parallelism. Hence the demand for flexibility in medical applications motivated the use of stream processors with massively parallel architecture. Stream processing architectures offers data parallel kind of parallelism.
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe a novel implementation of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), which is suitable for real-time applications such as analyzing video in real-time vision systems. Modern graphics cards contain a large number of parallel processors and high-bandwidth memory for accelerating data computation operations. In this paper we give a general idea of how to accelerate real-time applications using heterogeneous platforms. We propose to use some added resources to handle more computationally involved optimization methods. This proposed approach will indirectly accelerate a database by producing better plan quality.
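Stereo depth estimation ultimately rests on triangulation: depth is inversely proportional to disparity between the two views. A minimal sketch of that final conversion step (the focal length, baseline, and disparity values are made-up illustrations, not from the paper):

```python
# Sketch: converting a stereo disparity (pixels) to depth via Z = f * B / d,
# where f is focal length in pixels and B the camera baseline in metres.

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    if disparity_px <= 0:
        return float("inf")  # zero disparity => point at infinity
    return focal_px * baseline_m / disparity_px

f, B = 700.0, 0.12           # assumed: 700 px focal length, 12 cm baseline
for d in (70.0, 35.0, 7.0):  # larger disparity => nearer object
    print(d, depth_from_disparity(d, f, B))
```

The GPU-heavy part of such systems is computing the disparity map itself (matching features between left and right images); this conversion is the cheap last step.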
Real-Time Implementation and Performance Optimization of Local Derivative Pat... (IJECEIAES)
Pattern based texture descriptors are widely used in Content Based Image Retrieval (CBIR) for efficient retrieval of matching images. Local Derivative Pattern (LDP), a higher-order local pattern operator originally proposed for face recognition, encodes the distinctive spatial relationships contained in a local region of an image as the feature vector. LDP efficiently extracts finer details and provides efficient retrieval; however, it was proposed for images of limited resolution. Over time, developments in digital image sensors have paved the way for capturing images at very high resolution. The LDP algorithm, though very efficient in content-based image retrieval, does not scale well when extracting features from such high-resolution images, as it becomes computationally very expensive. This paper proposes how to efficiently extract parallelism from the LDP algorithm, and strategies for optimally implementing it by exploiting some inherent General-Purpose Graphics Processing Unit (GPGPU) characteristics. By optimally configuring the GPGPU kernels, image retrieval was performed at a much faster rate. The LDP algorithm was ported onto a Compute Unified Device Architecture (CUDA) supported GPGPU, and a maximum speedup of around 240x was achieved compared to its sequential counterpart.
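To make the operator concrete, here is a rough sketch of a second-order LDP along one (0-degree) direction: the first-order derivative is the difference with the right neighbour, and each of the 8 surrounding pixels contributes a bit depending on whether its derivative agrees in sign with the centre's. Window handling and bit ordering are my assumptions, not the paper's exact formulation:

```python
import numpy as np

# Sketch of a second-order Local Derivative Pattern along the 0-degree
# direction: I'(p) = I(p) - I(right neighbour), and neighbour i contributes
# bit 1 when I'(centre) * I'(neighbour_i) <= 0 (sign disagreement).

def ldp_0deg(img):
    d = img[:, :-1].astype(int) - img[:, 1:].astype(int)  # first-order derivative
    h, w = d.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]             # 8 neighbours, clockwise
    c = d[1:h-1, 1:w-1]
    for bit, (dy, dx) in enumerate(offs):
        n = d[1+dy:h-1+dy, 1+dx:w-1+dx]
        out |= ((c * n <= 0).astype(np.uint8) << bit)
    return out

img = np.arange(25).reshape(5, 5) % 7  # tiny synthetic image
print(ldp_0deg(img).shape)             # (3, 2): derivative map is 5x4, border dropped
```

Each output pixel is independent of the others, which is exactly the parallelism a GPGPU kernel exploits (one thread per pixel).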
Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs are designed from the ground up with the benefit of Google's deep experience and leadership in machine learning.
ASSESSING THE PERFORMANCE AND ENERGY USAGE OF MULTI-CPUS, MULTI-CORE AND MANY... (ijdpsjournal)
This paper studies the performance and energy consumption of several multi-core, multi-CPU and manycore hardware platforms and software stacks for parallel programming. It uses the Multimedia Multiscale Parser (MMP), a computationally demanding image encoder application which was ported to several hardware and software parallel environments, as a benchmark. Hardware-wise, the study assesses NVIDIA's Jetson TK1 development board, the Raspberry Pi 2, and a dual Intel Xeon E5-2620/v2 server, as well as NVIDIA's discrete GPUs GTX 680, Titan Black Edition and GTX 750 Ti. The assessed parallel programming paradigms are OpenMP, Pthreads and CUDA, plus a single-thread sequential version, all running in a Linux environment. While the CUDA-based implementation delivered the fastest execution, the Jetson TK1 proved to be the most energy-efficient platform, regardless of the parallel software stack used. Although it has the lowest power demand, the Raspberry Pi 2's energy efficiency is hindered by its lengthy execution times, so it effectively consumes more energy than the Jetson TK1. Surprisingly, OpenMP delivered twice the performance of the Pthreads-based implementation, proving the maturity of the tools and libraries supporting OpenMP.
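The Raspberry Pi 2 observation above follows from energy being power integrated over time: the lowest-power platform is not the lowest-energy one if it runs much longer. A quick sketch with made-up power and runtime figures (not the paper's measurements):

```python
# Sketch: energy (joules) = average power (watts) x execution time (seconds).
# The power/runtime numbers below are illustrative, NOT from the paper.

def energy_j(avg_power_w, runtime_s):
    return avg_power_w * runtime_s

low_power_slow = energy_j(4.0, 300.0)    # e.g. a low-power board taking 5 minutes
high_power_fast = energy_j(10.0, 60.0)   # e.g. a faster board taking 1 minute
print(low_power_slow, high_power_fast)   # 1200.0 600.0: the faster board wins on energy
```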
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
The amount of digital data being generated and stored is increasing at an alarming rate. This data is classified and processed to distill and deliver information to users across various industries, for example finance, social networking, gaming and so on. This class of workloads is referred to as throughput computing applications. Multi-core CPUs have been viewed as suitable for processing data in such workloads. However, driven by high computational throughput and energy efficiency, there has been rapid adoption of Graphics Processing Units (GPUs) as computing engines in recent years. GPU computing has emerged as a viable execution platform for throughput-oriented applications or regions of code. GPUs started as independent units for program execution, but there are clear trends towards tightly knit CPU-GPU integration. In this paper, we seek to understand the state-of-the-art Heterogeneous System Architecture (HSA) and examine several key components that make it stand out from other architecture designs, by analyzing existing research, articles and reports, as well as directions and future opportunities for HSA systems.
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA (VLSICS Design)
Video compression is essential to meet technological demands such as low power, less memory and fast transfer rates for a wide range of devices and for various multimedia applications. Video compression is primarily achieved by the Motion Estimation (ME) process in any video encoder, which contributes significant compression gain. Sum of Absolute Differences (SAD) is used as the distortion metric in the ME process. In this paper, an efficient Absolute Difference (AD) circuit is proposed which uses a Brent-Kung Adder (BKA) and a comparator based on a modified 1's complement principle and a conditional sum adder scheme. Results show that the proposed architecture reduces delay by 15% and the number of slice LUTs by 42% compared to the conventional architecture. Simulation and synthesis were done on Xilinx ISE 14.2 using a Virtex 7 FPGA.
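The SAD metric itself is simple; the hardware effort goes into computing it fast. A minimal software sketch of what the circuit computes (block contents are illustrative):

```python
import numpy as np

# Sketch: Sum of Absolute Differences (SAD), the block-matching distortion
# metric used in motion estimation. Lower SAD = better candidate block.

def sad(block_a, block_b):
    return int(np.abs(block_a.astype(int) - block_b.astype(int)).sum())

cur = np.array([[10, 20], [30, 40]], dtype=np.uint8)   # block in current frame
ref1 = np.array([[10, 21], [29, 40]], dtype=np.uint8)  # close match in reference frame
ref2 = np.array([[90, 90], [90, 90]], dtype=np.uint8)  # poor match
print(sad(cur, ref1))  # 2   -> the encoder would pick this candidate
print(sad(cur, ref2))  # 260
```

The proposed AD circuit accelerates the |a - b| step that this sketch expresses with `np.abs`.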
Parallel algorithms for multi-source graph traversal and its applications (Subhajit Sahu)
Highlighted notes on Parallel algorithms for multi-source graph traversal and its applications.
While doing research work under Prof. Kishore Kothapalli.
Seema is working on Multi-source BFS with hybrid-CSR, with applications in APSP, diameter, centrality, reachability.
BFS can be either top-down (from visited frontier nodes, mark unvisited neighbours as the next frontier) or bottom-up (each unvisited node checks whether it has a neighbour in the frontier, and if so joins the next frontier). She mentioned that a hybrid (direction-optimizing) approach is more efficient. EtaGraph uses unified degree cut (UDC) graph partitioning, and also overlaps data transfer with kernel execution. iCENTRAL uses biconnected components for betweenness centrality on dynamic graphs.
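The two BFS directions from the note above can be sketched as follows (my own illustration on a tiny graph; a direction-optimizing BFS switches between the two steps based on frontier size):

```python
# Sketch: top-down vs bottom-up BFS frontier expansion on an adjacency list.

def top_down_step(adj, frontier, visited):
    """Scan edges of frontier vertices; mark unvisited neighbours."""
    nxt = set()
    for u in frontier:
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                nxt.add(v)
    return nxt

def bottom_up_step(adj, frontier, visited):
    """Scan unvisited vertices; join next frontier if any neighbour is in it."""
    nxt = set()
    for v in range(len(adj)):
        if v not in visited and any(u in frontier for u in adj[v]):
            visited.add(v)
            nxt.add(v)
    return nxt

adj = [[1, 2], [0, 3], [0, 3], [1, 2]]  # a 4-cycle: 0-1, 0-2, 1-3, 2-3
visited, frontier = {0}, {0}
f1 = top_down_step(adj, frontier, visited)  # {1, 2}
f2 = bottom_up_step(adj, f1, visited)       # {3}
print(f1, f2)
```

Top-down is cheap when the frontier is small; bottom-up wins when most vertices are already visited, since each unvisited vertex can stop at its first frontier neighbour.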
Hybrid CSR uses an additional value array for storing packed "has edge/neighbour" bits. This can give a better memory access pattern if many bits are set, but can cause many threads to wait if many bits are zero. She mentioned that the Volta architecture has an independent PC and stack per thread (similar to a CPU?). Does it then not matter if the threads in a block diverge?
(BFS = G*v, Multi-source BFS = G*vs)
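The note above views BFS as multiplying the graph by a frontier vector; multi-source BFS multiplies by several vectors at once, which can be done bit-parallel by packing one frontier bit per source (the MS-BFS idea). A minimal sketch of this, with bit layout as my assumption:

```python
# Sketch: multi-source BFS with one bit per source packed into an integer, so
# one pass over the edges advances all sources simultaneously.

def ms_bfs_levels(adj, sources):
    n = len(adj)
    seen = [0] * n          # bitmask: which sources have reached vertex v
    frontier = [0] * n
    for i, s in enumerate(sources):
        seen[s] = frontier[s] = 1 << i
    levels = {(s, s): 0 for s in sources}
    depth = 0
    while any(frontier):
        depth += 1
        nxt = [0] * n
        for u in range(n):
            if frontier[u]:
                for v in adj[u]:
                    new = frontier[u] & ~seen[v]  # sources reaching v just now
                    if new:
                        seen[v] |= new
                        nxt[v] |= new
                        for i, s in enumerate(sources):
                            if new >> i & 1:
                                levels[(s, v)] = depth
        frontier = nxt
    return levels

adj = [[1], [0, 2], [1, 3], [2]]  # a path 0-1-2-3
lv = ms_bfs_levels(adj, [0, 3])
print(lv[(0, 3)], lv[(3, 0)])     # 3 3
```

Running many sources per edge pass is what makes multi-source BFS attractive for APSP, diameter, centrality and reachability, as the notes mention.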
Nowadays, a MAC (multiply-accumulate) unit is required in most digital signal processing applications. The functions of addition and multiplication are performed by the MAC unit. The MAC operates in two stages: first, the multiplier computes the product, and the result is forwarded to the second stage, where addition/accumulation takes place. The speed of the multiplier is important in a MAC unit, as it determines the critical path; area is also of great importance in the design of a MAC unit. Multipliers play an important role in many digital signal processing (DSP) applications such as convolution, digital filters and other data processing units. Much research has been performed on MAC implementation. This paper provides an analysis of the research and investigations held so far.
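The two-stage MAC operation described above can be sketched in a few lines (a functional model of the dataflow, not a hardware description; the filter taps are illustrative):

```python
# Sketch: the two-stage multiply-accumulate (MAC) at the heart of DSP kernels
# such as FIR filters: stage 1 multiplies, stage 2 accumulates.

def mac(acc, a, b):
    product = a * b        # stage 1: multiplier (the critical path in hardware)
    return acc + product   # stage 2: adder/accumulator

def fir(samples, coeffs):
    """One FIR filter output: a dot product computed by repeated MAC."""
    acc = 0
    for x, c in zip(samples, coeffs):
        acc = mac(acc, x, c)
    return acc

print(fir([1, 2, 3, 4], [4, 3, 2, 1]))  # 1*4 + 2*3 + 3*2 + 4*1 = 20
```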
A hybrid approach for scheduling applications in cloud computing environment (IJECEIAES)
Cloud computing plays an important role in our daily life. It has a direct and positive impact on sharing and updating data, knowledge, storage and scientific resources between various regions. Cloud computing performance depends heavily on the job scheduling algorithms used to manage queue waiting in modern scientific applications. Researchers consider cloud computing a popular platform for new applications. These scheduling algorithms help design efficient queue lists in the cloud, and they play a vital role in reducing waiting and processing time in cloud computing. A novel job scheduling algorithm is proposed in this paper to enhance the performance of cloud computing and reduce the delay time of jobs waiting in the queue. The proposed algorithm tries to avoid some significant challenges that hold back the development of cloud computing applications. A smart scheduling technique is proposed in our paper to improve processing performance in cloud applications. Our experimental results show that the proposed scheme achieves outstanding improvement rates, with a reduction in the waiting time for jobs in the queue list.
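Why the scheduling order of queued jobs changes waiting time can be seen in a toy comparison of first-come-first-served against shortest-job-first (this illustrates the general goal, not the paper's specific algorithm; job lengths are made up):

```python
# Sketch: average waiting time of queued jobs under two orderings.
# FCFS runs jobs as they arrive; SJF runs the shortest remaining job first.

def avg_wait(burst_times):
    """Average waiting time when jobs run sequentially in the given order."""
    wait, elapsed = 0, 0
    for t in burst_times:
        wait += elapsed       # this job waited for everything before it
        elapsed += t
    return wait / len(burst_times)

jobs = [8, 1, 2, 4]
print(avg_wait(jobs))          # FCFS: (0 + 8 + 9 + 11) / 4 = 7.0
print(avg_wait(sorted(jobs)))  # SJF:  (0 + 1 + 3 + 7)  / 4 = 2.75
```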
A task is a distinct unit of code, or a program, that can be separated and executed in a remote runtime environment.
Task computing provides distribution by harnessing the compute power of several computing nodes.
Task computing is categorized into: High-performance computing (HPC), High-throughput computing (HTC) and Many-task computing (MTC).
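The task-computing idea of farming out independent units of work can be sketched locally with a worker pool standing in for remote computing nodes (my own illustration of the concept):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: distributing independent tasks over a pool of workers, which stand
# in for the remote computing nodes of a task-computing system.

def task(n):
    """A distinct unit of work that could equally run in a remote runtime."""
    return sum(i * i for i in range(n))

inputs = [10, 100, 1000]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(task, inputs))
print(results)  # [285, 328350, 332833500]
```

HTC and MTC differ mainly in scale and coupling: many loosely coupled tasks over long periods (HTC) vs very many, possibly short, possibly interdependent tasks (MTC).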
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
A Review on Scheduling in Cloud Computing (ijujournal)
Cloud computing provides software, infrastructure and platform as a service, based on client requirements, on a pay-per-use basis. The main goal of scheduling is to achieve accuracy and correctness in task completion. Scheduling in the cloud environment enables the various cloud services to support framework implementation. This paper presents a wide-ranging survey of the different types of scheduling algorithms in the cloud computing environment, including workflow scheduling and grid scheduling. The survey gives an elaborate idea of grid, cloud and workflow scheduling aimed at minimizing energy cost and improving the efficiency and throughput of the system.
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS (cseij)
In this paper we study the NVIDIA graphics processing unit (GPU) along with its computational power and applications. Although these units are specially designed for graphics applications, we can employ their computational power for non-graphics applications too. The GPU has high parallel processing power, low cost of computation and low time utilization; it gives a good performance-per-energy ratio. This GPU property of handling massive computation over similar small sets of instructions has played a significant role in reducing CPU overhead. The GPU has several key advantages over CPU architectures, as it provides high parallelism, intensive computation and significantly higher throughput. It consists of thousands of hardware threads that execute programs in a SIMD fashion, hence the GPU can be an alternative to the CPU in high-performance and supercomputing environments. The bottom line is that GPU-based general-purpose computing is a hot topic of research, and there is much to explore beyond graphics processing applications.
The graphics processing unit (GPU) is a computer chip that performs rapid mathematical calculations. The GPU is a ubiquitous device which appears in every computing system, such as PCs, laptops, desktops and workstations. It is a manycore, multithreaded multiprocessor that excels at both graphics and non-graphics applications. GPU computing means using a GPU as a co-processor to accelerate CPU-based scientific and engineering computing. The paper provides a brief introduction to GPU computing. Matthew N. O. Sadiku | Adedamola A. Omotoso | Sarhan M. Musa, "GPU Computing: An Introduction", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 4, Issue 1, December 2019. URL: https://www.ijtsrd.com/papers/ijtsrd29648.pdf Paper URL: https://www.ijtsrd.com/engineering/electrical-engineering/29648/gpu-computing-an-introduction/matthew-n-o-sadiku
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month's highlights cover working on applications for the new Frontier supercomputer, using OpenACC for weather forecasting, upcoming GPU Hackathons and Bootcamps, and new resources!
Cuda Based Performance Evaluation Of The Computational Efficiency Of The DCT ... (acijjournal)
Recent advances in computing, such as massively parallel GPUs (Graphics Processing Units), coupled with the need to store and deliver large quantities of digital data, especially images, have brought a number of challenges for computer scientists, the research community and other stakeholders. These challenges, such as prohibitively large costs to manipulate the digital data, among others, have been the focus of the research community in recent years and have led to the investigation of image compression techniques that can achieve excellent results. One such technique is the Discrete Cosine Transform (DCT), which helps separate an image into parts of differing frequencies and has the advantage of excellent energy compaction. This paper investigates the use of the Compute Unified Device Architecture (CUDA) programming model to implement the DCT-based, CORDIC-based Loeffler algorithm for efficient image compression. The computational efficiency is analyzed and evaluated on both the CPU and the GPU. The PSNR (Peak Signal to Noise Ratio) is used to evaluate image reconstruction quality in this paper. The results are presented and discussed.
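The energy-compaction property mentioned above is easy to demonstrate with a direct DCT-II on a smooth signal (this is a plain textbook DCT, not the Loeffler/CORDIC factorization the paper implements):

```python
import numpy as np

# Sketch: the DCT's energy compaction on a smooth 1-D signal. For image-like
# (smooth) data, most signal energy lands in the first few coefficients.

def dct2(x):
    """Orthonormal DCT-II of a 1-D signal (direct O(n^2) evaluation)."""
    n = len(x)
    k = np.arange(n)
    out = np.array([np.sum(x * np.cos(np.pi * (2 * k + 1) * u / (2 * n)))
                    for u in range(n)])
    out *= np.sqrt(2.0 / n)
    out[0] /= np.sqrt(2.0)   # DC term scaling for orthonormality
    return out

x = np.linspace(0.0, 1.0, 8)           # smooth ramp, like a gradient image row
c = dct2(x)
energy = c ** 2
low4 = energy[:4].sum() / energy.sum()  # fraction of energy in the low half
print(round(low4, 4))                   # close to 1.0
```

Compression exploits exactly this: the high-frequency coefficients carry little energy and can be quantized aggressively or dropped.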
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month's highlights cover pseudo-random number generation, the first-ever MONAI Bootcamp, upcoming GPU Hackathons and Bootcamps, and new resources!
HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES) (Subhajit Sahu)
Highlighted notes on A Parallel Algorithm Template for Updating Single-Source Shortest Paths in Large-Scale Dynamic Networks.
While doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
In Hybrid PageRank the vertices are divided into 3 groups: V_old, V_border, V_new. Scaling for old and border vertices is N/N_new, and 1/N_new for V_new (I do this too). Then PR is run only on V_border and V_new.
"V_border which is the set of nodes which have edges in Bi connecting V_old and V_new and is reachable using a breadth first traversal."
Does that mean V_border = V_batch(i) ∩ V_old? BFS from where?
"We can assume that the new batch of updates is topologically sorted since the PR scores of the new nodes in Bi is guaranteed to be lower than those in Co."
Is sum(PR) in V_old > sum(PR) in V_new always?
"For performing the comparisons with GPMA and GPMA+, we configure the experiment to run HyPR on the same platform as used in [1] which is a Intel Xeon CPU connected to a Titan X Pascal GPU, and also the same datasets."
Old GPUs are going to be slower ...
Like we were discussing last time, it is not possible to scale old ranks and skip the unchanged components (or here, V_old). Please check this simple counterexample that shows skipping leads to incorrect ranks.
https://github.com/puzzlef/pagerank-levelwise-skip-unchanged-components
Another omission in the paper is that Hybrid PR (just like STICD) won't work for graphs which have dead ends. The absence of dead ends is a precondition for the algorithm.
Image Processing Application on Graphics processorsCSCJournals
In this work, we introduce real-time image processing techniques using modern programmable Graphics Processing Units (GPUs). GPUs are SIMD (Single Instruction, Multiple Data) devices that are inherently data-parallel. By utilizing NVIDIA's new GPU programming framework, "Compute Unified Device Architecture" (CUDA), as a computational resource, we realize significant acceleration in image processing algorithm computations. We show that a range of computer vision algorithms map readily to CUDA with significant performance gains. Specifically, we demonstrate the efficiency of our approach through the parallelization and optimization of image processing, morphology applications, and integral image computation.
Top 10 Supercomputers With Descriptive Information & AnalysisNomanSiddiqui41
Top 10 Supercomputers Report
What is Supercomputer?
A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, there have been supercomputers which can perform over 10^17 FLOPS (a hundred quadrillion FLOPS, 100 petaFLOPS or 100 PFLOPS).
Supercomputers play an important role in the field of computational science, and are used for a wide range of computationally intensive tasks in various fields, including quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modeling (computing the structures and properties of chemical compounds, biological macromolecules, polymers, and crystals), and physical simulations (such as simulations of the early moments of the universe, airplane and spacecraft aerodynamics, the detonation of nuclear weapons, and nuclear fusion). They have been essential in the field of cryptanalysis.
1. The Fugaku Supercomputer
Introduction:
Fugaku is a petascale supercomputer (petascale on the mainstream benchmark, though exascale on HPL-AI) at the RIKEN Center for Computational Science in Kobe, Japan. It started development in 2014 as the successor to the K computer, and started operating in 2021. Fugaku made its debut in 2020, and became the fastest supercomputer in the world in the June 2020 TOP500 list, as well as becoming the first ARM architecture-based computer to achieve this. In June 2020, it achieved 1.42 exaFLOPS on the HPL-AI benchmark, making it the first supercomputer ever to achieve 1 exaFLOPS. As of November 2021, Fugaku is the fastest supercomputer in the world. It is named after an alternative name for Mount Fuji.
Block Diagram:
Functional Units:
Functional Units, Co-Design and System for the Supercomputer “Fugaku”
1. Performance estimation tool: This tool, taking Fujitsu FX100 (FX100 is the previous Fujitsu supercomputer) execution profile data as an input, enables the performance projection by a given set of architecture parameters. The performance projection is modeled according to the Fujitsu microarchitecture. This tool can also estimate the power consumption based on the architecture model.
2. Fujitsu in-house processor simulator: We used an extended FX100 SPARC instruction-set simulator and compiler, developed by Fujitsu, for preliminary studies in the initial phase, and an Armv8+SVE simulator and compiler afterward.
3. Gem5 simulator for the Post-K processor: The Post-K processor simulator, based on an open-source system-level processor simulator, Gem5, was developed by RIKEN during the co-design for architecture verification and performance tuning. A fundamental problem is the scale of scientific applications that are expected to be run on Post-K. Even our target applications are thousands of lines of code and are written to use complex algorithms and data structures. Altho…
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY cscpconf
This paper presents a parallel approach to improve the time complexity problem associated with sequential algorithms. An image steganography algorithm in the transform domain is considered for implementation. Image steganography is a technique to hide a secret message in an image. With the parallel implementation, a large message can be hidden in a large image since it does not take much processing time. It is implemented on GPU systems. Parallel programming is done using OpenCL on CUDA cores from NVIDIA. The speed-up improvement
obtained is very good, with reasonably good output signal quality, when a large amount of data is processed.
Similar to Hybrid Multicore Computing : NOTES (20)
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...Subhajit Sahu
TrueTime is a service that enables the use of globally synchronized clocks, with bounded error. It returns a time interval that is guaranteed to contain the clock’s actual time for some time during the call’s execution. If two intervals do not overlap, then we know calls were definitely ordered in real time. In general, synchronized clocks can be used to avoid communication in a distributed system.
The underlying source of time is a combination of GPS receivers and atomic clocks. As there are “time masters” in every datacenter (redundantly), it is likely that both sides of a partition would continue to enjoy accurate time. Individual nodes however need network connectivity to the masters, and without it their clocks will drift. Thus, during a partition their intervals slowly grow wider over time, based on bounds on the rate of local clock drift. Operations depending on TrueTime, such as Paxos leader election or transaction commits, thus have to wait a little longer, but the operation still completes (assuming the 2PC and quorum communication are working).
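The interval-based ordering rule can be sketched as follows (the `TTInterval` type and `definitely_before` helper are my own names, not Spanner's actual API):

```python
from collections import namedtuple

# TT.now() returns an interval [earliest, latest] guaranteed to contain
# the clock's actual time at some point during the call.
TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

def definitely_before(a, b):
    """a happened before b in real time iff the two intervals do not overlap."""
    return a.latest < b.earliest

# Commit-wait: a transaction waits until its interval's `latest` has passed
# before releasing locks, so a later transaction's interval cannot overlap it.
t1 = TTInterval(10.0, 14.0)   # uncertainty of +/- 2 around 12
t2 = TTInterval(15.0, 19.0)   # a later call, after commit-wait
t3 = TTInterval(13.0, 17.0)   # overlaps t1: real-time order is unknown

print(definitely_before(t1, t2))  # True
print(definitely_before(t1, t3))  # False
```

During a partition the intervals widen, so `definitely_before` returns True less often and operations must wait longer, exactly as the note describes.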
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting Bitset for graph : SHORT REPORT / NOTESSubhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. Unfortunately, using CSR for dynamic graphs is impractical since addition/deletion of a single edge can require on average (N+M)/2 memory accesses, in order to update source-offsets and destination-indices. A common approach is therefore to store edge-lists/destination-indices as an array of arrays, where each edge-list is an array belonging to a vertex. While this is good enough for small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for an edge requires about log(E) memory accesses, but adding an edge on average requires E/2 accesses, where E is the number of edges of a given vertex. Note that both addition and deletion of edges in a dynamic graph require checking for an existing edge, before adding or deleting it. If edge lists are unsorted, checking for an edge requires around E/2 memory accesses, but adding an edge requires only 1 memory access.
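The sorted-CSR trade-off described above can be sketched minimally (the toy graph and helper names are mine):

```python
import bisect

# CSR: offsets[v]..offsets[v+1] slices one flat, per-vertex-sorted
# destination-index array. Graph: 0->1, 0->2, 1->2, 2->0 (N=3, M=4).
offsets = [0, 2, 3, 4]
destinations = [1, 2, 2, 0]

def has_edge(u, v):
    """~log(E) probes into u's sorted edge list, E = degree of u."""
    lo, hi = offsets[u], offsets[u + 1]
    i = bisect.bisect_left(destinations, v, lo, hi)
    return i < hi and destinations[i] == v

print(has_edge(0, 2))  # True
print(has_edge(1, 0))  # False

# By contrast, adding edge 1->0 would mean shifting every entry at or after
# offsets[2] and incrementing offsets[2:], i.e. O(N + M) work on average.
```

The lookup is cheap precisely because the layout is static; the same flat layout is what makes single-edge updates expensive.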
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
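For reference, the baseline that these techniques optimize can be sketched as a plain power-iteration PageRank (pure Python; it assumes no dangling vertices, and the toy graph and tolerance are illustrative, not from STICD):

```python
def pagerank(adj, alpha=0.85, tol=1e-6, max_iter=100):
    """Plain power-iteration PageRank; assumes every vertex has out-edges."""
    n = len(adj)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [(1 - alpha) / n] * n       # teleport contribution
        for u, outs in enumerate(adj):
            share = alpha * r[u] / len(outs)
            for v in outs:                 # push u's rank to its out-neighbours
                nxt[v] += share
        done = sum(abs(a - b) for a, b in zip(nxt, r)) < tol
        r = nxt
        if done:
            break
    return r

# 0 -> 1, 1 -> {0, 2}, 2 -> 0: no dangling vertices.
adj = [[1], [0, 2], [0]]
r = pagerank(adj)
print(round(sum(r), 6))  # 1.0
```

Each optimization in the paragraph above (skipping converged vertices, short-circuiting chains, processing components in topological order) prunes work from exactly this inner loop.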
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, commonly use Compressed Sparse Row (CSR), an adjacency-list based graph representation that is …
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
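The in-place reduction strategy can be mimicked in a short sketch (a pure-Python stand-in for the CUDA kernel; each pass halves the number of active "threads" until element 0 holds the total):

```python
def sum_inplace(x):
    """Pairwise in-place reduction: at stride s, position i absorbs i+s."""
    x = list(x)          # work on a copy, as a kernel would on device memory
    n = len(x)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            x[i] += x[i + stride]   # in CUDA, one thread per active i
        stride *= 2
    return x[0]

print(sum_inplace([1, 2, 3, 4, 5]))  # 15
```

The launch-config comparisons above amount to choosing how many of these strided passes run per kernel launch and how many elements each thread handles.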
Experiments with Primitive operations : SHORT REPORT / NOTESSubhajit Sahu
This includes:
- Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
- Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
- Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
- Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...Subhajit Sahu
Below are the important points I note from the 2020 paper by Martin Grohe:
- 1-WL distinguishes almost all graphs, in a probabilistic sense
- Classical WL is two dimensional Weisfeiler-Leman
- DeepWL is an unlimited version of WL that runs in polynomial time.
- Knowledge graphs are essentially graphs with vertex/edge attributes
ABSTRACT:
Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view.
Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESSubhajit Sahu
https://gist.github.com/wolfram77/54c4a14d9ea547183c6c7b3518bf9cd1
There exist a number of dynamic graph generators. The Barabási–Albert model iteratively attaches new vertices to pre-existing vertices in the graph using preferential attachment (edges to high-degree vertices are more likely: rich get richer, the Pareto principle). However, the graph size increases monotonically, and the density of the graph keeps increasing (sparsity decreasing).
Görke's model uses a defined clustering to uniformly add vertices and edges. Purohit's model uses motifs (e.g. triangles) to mimic properties of existing dynamic graphs, such as growth rate, structure, and degree distribution. Kronecker graph generators are used to increase the size of a given graph, with power-law distribution.
To generate dynamic graphs, we must choose a metric to compare two graphs. Common metrics include diameter, clustering coefficient (modularity?), triangle counting (triangle density?), and degree distribution.
In this paper, the authors propose DyGraph, a dynamic graph generator that uses degree distribution as the only metric. The authors observe that many real-world graphs differ from the power-law distribution at the tail end. To address this issue, they propose binning, where the vertices beyond a certain degree (minDeg = min(deg) s.t. |V(deg)| < H, where H~10 is the number of vertices with a given degree below which are binned) are grouped into bins of degree-width binWidth, max-degree localMax, and number of degrees in bin with at least one vertex binSize (to keep track of sparsity). This helps the authors generate graphs with a more realistic degree distribution.
The process of generating a dynamic graph is as follows. First, the difference between the desired and the current degree distribution is calculated. The authors then create an edge-addition set where each vertex is present as many times as the number of additional incident edges it must receive. Edges are then created by connecting two vertices randomly from this set, and removing both from the set once connected. Currently, the authors reject self-loops and duplicate edges. Removal of edges is done in a similar fashion.
Authors observe that adding edges with power-law properties dominates the execution time, and consider parallelizing DyGraph as part of future work.
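The edge-addition step described above might be sketched as follows (the function and its rejection policy are my own simplification, not DyGraph's actual code):

```python
import random

def add_edges(edges, deficit, seed=42):
    """Pair vertices from the edge-addition set at random, rejecting
    self-loops and duplicate edges (rejected pairs are simply dropped here)."""
    rng = random.Random(seed)
    # Each vertex appears once per additional incident edge it must receive.
    pool = [v for v, d in deficit.items() for _ in range(d)]
    rng.shuffle(pool)
    edges = set(edges)
    while len(pool) >= 2:
        u, v = pool.pop(), pool.pop()
        if u != v and (u, v) not in edges and (v, u) not in edges:
            edges.add((u, v))
    return edges

# Vertices 0..3 each need one more incident edge; (0, 1) already exists.
g = add_edges([(0, 1)], {0: 1, 1: 1, 2: 1, 3: 1})
print(sorted(g))
```

Edge removal works analogously, with a set built from the surplus rather than the deficit of incident edges.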
My notes on shared memory parallelism.
Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory [REF].
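A minimal illustration with threads sharing a single address space (the counter example is mine; the lock is what prevents lost updates when multiple threads write the shared location):

```python
import threading

counter = [0]                 # shared state: every thread sees the same list
lock = threading.Lock()       # shared writes race without mutual exclusion

def work(n):
    for _ in range(n):
        with lock:            # read-modify-write must be atomic
            counter[0] += 1

threads = [threading.Thread(target=work, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # 4000: no copies were made, all threads updated one counter
```

Contrast this with message passing, where each worker would hold its own copy and the partial counts would have to be communicated and summed at the end.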
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESSubhajit Sahu
**Community detection methods** can be *global* or *local*. **Global community detection methods** divide the entire graph into groups. Existing global algorithms include:
- Random walk methods
- Spectral partitioning
- Label propagation
- Greedy agglomerative and divisive algorithms
- Clique percolation
https://gist.github.com/wolfram77/b4316609265b5b9f88027bbc491f80b6
There is a growing body of work in *detecting overlapping communities*. **Seed set expansion** is a **local community detection method** where relevant *seed vertices* of interest are picked and *expanded to form communities* surrounding them. The quality of each community is measured using a *fitness function*.
**Modularity** is a *fitness function* which compares the number of intra-community edges to the expected number in a random-null model. **Conductance** is another popular fitness score that measures the community cut or inter-community edges. Many *overlapping community detection* methods **use a modified ratio** of intra-community edges to all edges with at least one endpoint in the community.
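Both fitness functions can be computed directly for a toy graph (the function names and example graph are mine; modularity here is the standard Newman-Girvan form over an undirected, unweighted graph):

```python
def modularity(communities, edges):
    """Q = sum over communities of (intra-edges / m) - (degree-sum / 2m)^2."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    q = 0.0
    for c in communities:
        mc = sum(1 for u, v in edges if u in c and v in c)   # intra edges
        dc = sum(deg.get(u, 0) for u in c)                   # degree sum
        q += mc / m - (dc / (2 * m)) ** 2
    return q

def conductance(c, edges):
    """Cut edges over the smaller side's volume (lower is better)."""
    cut = sum(1 for u, v in edges if (u in c) != (v in c))
    vol = sum((u in c) + (v in c) for u, v in edges)
    total = 2 * len(edges)
    return cut / min(vol, total - vol)

# Two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(modularity([{0, 1, 2}, {3, 4, 5}], edges))  # 5/14 ~ 0.357
print(conductance({0, 1, 2}, edges))              # 1/7  ~ 0.143
```

A good community scores high on modularity and low on conductance, as the two triangles do here.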
Andersen et al. use a **Spectral PageRank-Nibble method** which minimizes conductance and is formed by adding vertices in order of decreasing PageRank values. Andersen and Lang develop a **random walk approach** in which some vertices in the seed set may not be placed in the final community. Clauset gives a **greedy method** that *starts from a single vertex* and then iteratively adds neighboring vertices *maximizing the local modularity score*. Riedy et al. **expand multiple vertices** via maximizing modularity.
Several algorithms for **detecting global, overlapping communities** use a *greedy*, *agglomerative approach* and run *multiple separate seed set expansions*. Lancichinetti et al. run **greedy seed set expansions**, each with a *single seed vertex*. Overlapping communities are produced by a sequentially running expansions from a node not yet in a community. Lee et al. use **maximal cliques as seed sets**. Havemann et al. **greedily expand cliques**.
The authors of this paper discuss a dynamic approach for **community detection using seed set expansion**. Simply marking the neighbours of changed vertices is a **naive approach**, and has *severe shortcomings*. This is because *communities can split apart*. The simple updating method *may fail even when it outputs a valid community* in the graph.
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESSubhajit Sahu
A **community** (in a network) is a subset of nodes which are _strongly connected among themselves_, but _weakly connected to others_. Neither the number of output communities nor their size distribution is known a priori. Community detection methods can be divisive or agglomerative. **Divisive methods** use _betweenness centrality_ to **identify and remove bridges** between communities. **Agglomerative methods** greedily **merge two communities** that provide maximum gain in _modularity_. Newman and Girvan have introduced the **modularity metric**. The problem of community detection is then reduced to the problem of modularity maximization which is **NP-complete**. The **Louvain method** is a variant of the _agglomerative strategy_, in that it is a _multi-level heuristic_.
https://gist.github.com/wolfram77/917a1a4a429e89a0f2a1911cea56314d
In this paper, the authors discuss **four heuristics** for Community detection using the _Louvain algorithm_ implemented upon recently developed **Grappolo**, which is a parallel variant of the Louvain algorithm. They are:
- Vertex following and Minimum label
- Data caching
- Graph coloring
- Threshold scaling
With the **Vertex following** heuristic, the _input is preprocessed_ and all single-degree vertices are merged with their corresponding neighbours. This helps reduce the number of vertices considered in each iteration, and also helps initial seeds of communities to be formed. With the **Minimum label heuristic**, when a vertex is making the decision to move to a community and multiple communities provide the same modularity gain, the community with the smallest id is chosen. This helps _minimize or prevent community swaps_. With the **Data caching** heuristic, community information is stored in a vector instead of a map, and is reused in each iteration, but with some additional cost. With the **Vertex ordering via Graph coloring** heuristic, _distance-k coloring_ of graphs is performed in order to group vertices into colors. Then, each set of vertices (by color) is processed _concurrently_, and synchronization is performed after that. This enables us to mimic the behaviour of the serial algorithm. Finally, with the **Threshold scaling** heuristic, _successively smaller values of modularity threshold_ are used as the algorithm progresses. This allows the algorithm to converge faster, and has been observed to achieve a good modularity score as well.
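The minimum-label tie-break can be sketched in a few lines (a hypothetical helper; `gains` maps community id to the modularity gain of moving the vertex there):

```python
def best_community(gains):
    """Highest modularity gain wins; ties go to the smallest community id,
    so concurrent threads make consistent choices and avoid community swaps."""
    return max(gains, key=lambda c: (gains[c], -c))

# Communities 7 and 2 offer the same gain: the tie goes to id 2.
print(best_community({7: 0.3, 2: 0.3, 5: 0.1}))  # 2
```

Without a deterministic tie-break, two vertices can endlessly swap between each other's communities under parallel execution.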
From the results, it appears that _graph coloring_ and _threshold scaling_ heuristics do not always provide a speedup and this depends upon the nature of the graph. It would be interesting to compare the heuristics against baseline approaches. Future work can include _distributed memory implementations_, and _community detection on streaming graphs_.
Application Areas of Community Detection: A Review : NOTESSubhajit Sahu
This is a short review of community detection methods (on graphs), and their applications. A **community** is a subset of a network whose members are *highly connected*, but *loosely connected* to others outside their community. Different community detection methods *can return differing communities*, as these algorithms are **heuristic-based**. **Dynamic community detection** involves tracking the *evolution of community structure* over time.
https://gist.github.com/wolfram77/09e64d6ba3ef080db5558feb2d32fdc0
Communities can be of the following **types**:
- Disjoint
- Overlapping
- Hierarchical
- Local.
The following **static** community detection **methods** exist:
- Spectral-based
- Statistical inference
- Optimization
- Dynamics-based
The following **dynamic** community detection **methods** exist:
- Independent community detection and matching
- Dependent community detection (evolutionary)
- Simultaneous community detection on all snapshots
- Dynamic community detection on temporal networks
**Applications** of community detection include:
- Criminal identification
- Fraud detection
- Criminal activities detection
- Bot detection
- Dynamics of epidemic spreading (dynamic)
- Cancer/tumor detection
- Tissue/organ detection
- Evolution of influence (dynamic)
- Astroturfing
- Customer segmentation
- Recommendation systems
- Social network analysis (both)
- Network summarization
- Privacy, group segmentation
- Link prediction (both)
- Community evolution prediction (dynamic, hot field)
<br>
<br>
## References
- [Application Areas of Community Detection: A Review : PAPER](https://ieeexplore.ieee.org/document/8625349)
This paper discusses a GPU implementation of the Louvain community detection algorithm. The Louvain algorithm obtains hierarchical communities as a dendrogram through modularity optimization. Given an undirected weighted graph, all vertices are first considered to be their own communities. In the first phase, each vertex greedily decides to move to the community of one of its neighbours which gives the greatest increase in modularity. If moving to no neighbour's community leads to an increase in modularity, the vertex chooses to stay with its own community. This is done sequentially for all the vertices. If the total change in modularity is more than a certain threshold, this phase is repeated. Once this local moving phase is complete, all vertices have formed their first hierarchy of communities. The next phase is called the aggregation phase, where all the vertices belonging to a community are collapsed into a single super-vertex, such that edges between communities are represented as edges between respective super-vertices (edge weights are combined), and edges within each community are represented as self-loops in respective super-vertices (again, edge weights are combined). Together, the local moving and the aggregation phases constitute a stage. This super-vertex graph is then used as input for the next stage. This process continues until the increase in modularity is below a certain threshold. As a result, from each stage we have a hierarchy of community memberships for each vertex as a dendrogram.
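The local-moving phase described above might be sketched as follows (a simplified sequential version of my own; the gain formula drops constant factors, which does not change which community wins, and aggregation into super-vertices is omitted):

```python
def local_moving(adj, m):
    """One Louvain local-moving phase (sketch).
    adj: dict vertex -> {neighbour: edge weight}; m: total edge weight."""
    comm = {u: u for u in adj}                    # every vertex starts alone
    deg = {u: sum(adj[u].values()) for u in adj}  # weighted degree
    tot = dict(deg)                               # total degree per community
    improved = True
    while improved:
        improved = False
        for u in adj:                             # sequential, as in the paper
            c0 = comm[u]
            tot[c0] -= deg[u]                     # remove u from its community
            k_in = {}                             # weight to each neighbour community
            for v, w in adj[u].items():
                k_in[comm[v]] = k_in.get(comm[v], 0.0) + w
            # gain of joining c is proportional to: k_in(c) - deg[u]*tot[c]/(2m)
            best = c0
            best_gain = k_in.get(c0, 0.0) - deg[u] * tot[c0] / (2 * m)
            for c, k in k_in.items():
                gain = k - deg[u] * tot[c] / (2 * m)
                if gain > best_gain:
                    best, best_gain = c, gain
            tot[best] += deg[u]
            if best != c0:
                comm[u] = best
                improved = True
    return comm

# Two triangles joined by a single edge settle into two communities.
adj = {0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1, 3: 1},
       3: {2: 1, 4: 1, 5: 1}, 4: {3: 1, 5: 1}, 5: {3: 1, 4: 1}}
m = sum(sum(n.values()) for n in adj.values()) / 2
comm = local_moving(adj, m)
print(len(set(comm.values())))  # 2
```

A full Louvain run would now collapse each community into a super-vertex and repeat this phase on the aggregated graph.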
Approaches to perform the Louvain algorithm can be divided into coarse-grained and fine-grained. Coarse-grained approaches process a set of vertices in parallel, while fine-grained approaches process all vertices in parallel. A coarse-grained hybrid-GPU algorithm using multiple GPUs has been implemented by Cheong et al., which grabbed my attention. In addition, their algorithm does not use hashing for the local moving phase, but instead sorts each neighbour list based on the community id of each vertex.
https://gist.github.com/wolfram77/7e72c9b8c18c18ab908ae76262099329
Survey for extra-child-process package : NOTESSubhajit Sahu
Useful additions to inbuilt child_process module.
📦 Node.js, 📜 Files, 📰 Docs.
Please see attached PDF for literature survey.
https://gist.github.com/wolfram77/d936da570d7bf73f95d1513d4368573e
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERSubhajit Sahu
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/692d263f463fd49be6eb5aa65dd4d0f9
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Subhajit Sahu
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/1c1f730d20b51e0d2c6d477fd3713024
Fast Incremental Community Detection on Dynamic Graphs : NOTESSubhajit Sahu
In this paper, the authors describe two approaches for dynamic community detection using the CNM algorithm. CNM is a hierarchical, agglomerative algorithm that greedily maximizes modularity. They define two approaches: BasicDyn and FastDyn. BasicDyn backtracks merges of communities until each marked (changed) vertex is its own singleton community. FastDyn undoes a merge only if the quality of merge, as measured by the induced change in modularity, has significantly decreased compared to when the merge initially took place. FastDyn also allows more than two vertices to contract together if in the previous time step these vertices eventually ended up contracted in the same community. In the static case, merging several vertices together in one contraction phase could lead to deteriorating results. FastDyn is able to do this, however, because it uses information from the merges of the previous time step. Intuitively, merges that previously occurred are more likely to be acceptable later.
https://gist.github.com/wolfram77/1856b108334cc822cdddfdfa7334792a
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. It’s here, custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Hybrid Multicore Computing
Dip Sankar Banerjee
Advisor : Kishore Kothapalli
Center for Security, Theory, and Algorithmic Research (C-STAR)
International Institute of Information Technology, Hyderabad
Gachibowli, Hyderabad, India – 500 032.
dipsankar.banerjee@research.iiit.ac.in
1 Introduction
The multicore revolution was fueled by the fact that the standard processor designs collided with the
power wall, the memory wall, and the ILP wall. Hence, the architectural designs progressed in the
multi-core direction, where several processing cores are packed into a single chip. Each of these cores runs at a lower clock rate than a single-core chip, but the net throughput continues to rise. Another recent phenomenon is the availability of many-core accelerators such as GPUs. Their high performance-to-cost and performance-to-power ratios have enabled GPUs to become the devices of desktop supercomputing. Figure 1 shows the increase in the theoretical peak performance of many-core GPUs and multi-core CPUs over the last decade. Another similar architecture is the Cell Broadband Engine, designed by IBM together with Sony and Toshiba in the early 2000s. The Cell BE was fundamentally an accelerator with many heterogeneous features. Architectures along the lines of accelerators continue to be released, the Intel Many Integrated Core (MIC) being one of the significant ones. However, the road ahead for HPC systems leads towards on-die hybrid systems, at the level of supercomputing as well as commodity consumers. It is evident from these developments that the combination of general-purpose CPU cores and massively parallel accelerator cores provides higher performance benefits.
Figure 1: Increase in GFLOPS of CPUs and GPUs over the last ten years.
The advent of multicore and manycore architectures saw them being deployed to speed up computations across several disciplines and application areas. Prominent examples include semi-numerical algorithms such as sorting [4], graph algorithms [14], image processing [6], scientific computations [7, 3], and several irregular algorithms [21, 13, 18]. Using varied architectures such as multicore processors, graphics processing units (GPUs), and the IBM Cell, researchers have been able to achieve substantial speed-ups compared to corresponding or best-known sequential implementations. In particular, using GPUs for general-purpose computations has attracted a lot of attention, given that GPUs can deliver 4 TFLOPS or more of computing power at very low prices and low power consumption.
When using accelerators such as GPUs or the SPUs in the IBM Cell, there is typically a control device that has to explicitly invoke the accelerator. Data and program are transferred to the accelerator, the accelerator computes, and the results are returned to the control device. This approach, however, has the drawback that the computational resources of the control device, a CPU in most cases, are mostly underutilized. As multicore CPUs containing tens of cores are on the horizon, this underutilization of CPUs is a severe wastage of compute power. More recently, in [9], it is also argued that on a broad class of throughput-oriented applications, the GPGPU advantage is only of the order of 3x on average and 15x at best.
Figure 2: The case of hybrid computing
Figure 2 illustrates the fundamental case of hybrid computing. In the conventional accelerator-based method, shown in Figure 2(a), the host machine initially sends the data and code to the device, which then computes the results and sends them back. In the hybrid case, demonstrated in Figure 2(b), both the accelerator and the host execute in an interleaved fashion, with an exchange of data at regular intervals. Accelerator-based computing is here to stay, and there is sufficient evidence to show that heterogeneous computing is better than pure parallel computing in many ways. So, in the near future we will see pure multicore computing shift towards hybrid multicore computing. In this thesis, our main objective is to study the implications of accelerator-based computing and to provide parallel primitives for the hybrid platform. Some of the questions we will seek to answer include: how to arrive at efficient algorithms in the hybrid computing model for fundamental problems in parallel computing such as list ranking and sorting, how to analyze a hybrid algorithm for its efficiency, what parameters are important for a hybrid algorithm, and the like. In addition, we will focus on optimization and tuning of hybrid systems. The research will also attempt to propose a new programming model of hybrid computation.
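The overlapped execution of Figure 2(b) can be sketched in miniature. The snippet below is illustrative only: `host_compute` and `device_compute` are hypothetical stand-ins of ours, with the accelerator simulated by a second thread rather than a real GPU. It shows the three ingredients discussed above: a static work partition, overlapped execution, and a final consolidation of partial results.

```python
# Illustrative sketch only: device_compute stands in for an accelerator
# kernel; real hybrid code would launch a CUDA/OpenCL kernel instead.
from concurrent.futures import ThreadPoolExecutor

def host_compute(chunk):
    # Portion of the work kept on the multicore CPU.
    return sum(x * x for x in chunk)

def device_compute(chunk):
    # Portion that would be offloaded to the GPU in a real system.
    return sum(x * x for x in chunk)

def hybrid_sum_of_squares(data, split=0.5):
    data = list(data)
    cut = int(len(data) * split)            # static work partition
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_host = pool.submit(host_compute, data[:cut])
        f_dev = pool.submit(device_compute, data[cut:])
        # Both sides run in an overlapped fashion; the final line is
        # the consolidation step that merges the partial results.
        return f_host.result() + f_dev.result()

print(hybrid_sum_of_squares(range(10)))     # prints 285
```

In a real hybrid primitive the `split` parameter is exactly the tuning knob the thesis proposes to study: it must balance the relative throughputs of the two devices against the cost of the data exchange.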
The main goal of this thesis work is to understand hybrid computing solutions in general. This includes proposing newer models of computation, designing algorithms, and arriving at guidelines for developing general-purpose hybrid solutions. Issues like optimum resource utilization, power management, and speeding up known parallel applications will be undertaken. The thesis therefore argues that future parallel computing solutions will necessarily have to include elements of hybrid multicore computing so as to be fast, efficient, and scalable. It has been argued in the literature [9] that, in the case of primitive-driven computation, a single multicore device such as a standalone GPU or a CPU does not suffice for a great level of speedup; when averaged over several cases, the net speedup obtained is not that significant.
2 Organization of this report
In this report we make an initial attempt at establishing hybrid multicore computing as a possible way ahead in parallel computing. We start off by discussing some of the good results that have been obtained in recent times using accelerators such as the Cell BE and the GPU. This is followed by a generic understanding of why hybrid computing is important on the road ahead. We support our claim by discussing results from both architectural and theoretical points of view. Some good recent results in hybrid computing are then discussed, along with other work in the area, citing examples. The report culminates in a discussion of why some earlier theories proposed in parallel computing hold good meaning in the hybrid scenario. Finally, we discuss programming models that can provide seamless usage of hybrid systems.
3 Accelerator based computing
We first discuss how accelerator-based computing became a successful and paradigm-shifting phenomenon. The whole world of parallel computing underwent a change when graphics practitioners found that GPUs can be used for many general-purpose computations. The term General-Purpose Computing on Graphics Processing Units (GPGPU) was coined during that time. Although NVIDIA led the initial development of the GPGPU program, significant contributions were soon made by industry houses and scientists from several fields of research. Today, OpenCL is the dominant open standard for general-purpose computing on accelerators, although the dominant proprietary framework is still NVIDIA's Compute Unified Device Architecture (CUDA).
One of the most significant initial results on the GPGPU front was the scan primitive proposed by Sengupta et al. [16]. In this work, the authors demonstrated the power of GPU computing by efficiently implementing a heavily used computing primitive. Scan is an operation in which an associative operator is applied across a set of contiguous operands to produce all prefix results. The authors employed the O(log n) algorithm to compute the scan on an array by performing an up-sweep and a down-sweep. This technique was subsequently aided by applying well-known caching and blocking techniques at each level of the computation. The authors demonstrated their implementation on an early GeForce 8800 GPU and showed that a scan on 1 million elements can be done in less than a millisecond. This was achieved by running 128 threads on each core of the GPU, which was previously impossible on any other parallel architecture. The authors also demonstrated the applicability of their scan primitive by using it in a tri-diagonal solver and a sorting routine.
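As a rough illustration of the up-sweep/down-sweep structure, here is a sequential Python sketch of the work-efficient exclusive scan; on a GPU each inner-loop iteration would be executed by a separate thread. The function name and the power-of-two restriction are simplifications of ours, not details from [16].

```python
def exclusive_scan(a, op=lambda x, y: x + y, identity=0):
    """Work-efficient exclusive scan: up-sweep then down-sweep over an
    implicit binary tree. Sequential sketch of the parallel algorithm."""
    n = len(a)
    assert n and (n & (n - 1)) == 0, "sketch assumes power-of-two length"
    t = list(a)
    # Up-sweep (reduce): build partial sums up the tree. All iterations
    # of the inner loop are independent and run as parallel threads.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            t[i + 2 * d - 1] = op(t[i + d - 1], t[i + 2 * d - 1])
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    t[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            left = t[i + d - 1]
            t[i + d - 1] = t[i + 2 * d - 1]
            t[i + 2 * d - 1] = op(left, t[i + 2 * d - 1])
        d //= 2
    return t

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# prints [0, 3, 4, 11, 11, 15, 16, 22]
```

The two sweeps together perform O(n) work in O(log n) parallel steps, which is why the primitive maps so well onto thousands of GPU threads.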
In parallel to the GPUs, the Cell BE initiative from IBM also made a big impact in this new world. The Cell, built around an IBM Power-architecture core, was designed primarily for gaming and graphics workloads but can also perform general-purpose computations. It was heavily used in the Sony PlayStation gaming consoles and also contributed to petaflop supercomputers such as the IBM Roadrunner. An initial work on the Cell BE was the implementation of a breadth-first search routine [14]. This paper employed a famous technique known as the Bulk Synchronous Parallel (BSP) model to algorithmically re-design BFS to suit the Cell BE architecture, using SIMD operations in each of the Synergistic Processing Elements (SPEs). Proper work distribution from the Power Processing Element (PPE) and synchronization between the two led to a 4x performance benefit over conventional sequential implementations on Pentium machines. The authors demonstrated a performance of around 100 million edges per second.
In a more recent success story, accelerator-based implementations, especially on the latest generation of GPUs, also demonstrate the prowess of a wide array of throughput cores. Sorting has always been an area of high interest for computer scientists and hence is also highly pursued in the domain of massive parallelism. In [10], the authors demonstrated a new randomized sorting technique in which the elements are divided in a highly efficient pattern onto a certain number of bins. This efficient binning helped in a better utilization of the shared resources that are present in a GPU, so all the bins can be concurrently scheduled onto the several streaming multiprocessors (SMs) that are available for sorting. The authors showed a performance of around 120 million elements per second on the GTX 280 GPU, which was released in 2008. Along similar lines, another comparison sort paper [4] appeared in 2011, which used the then-latest Fermi line of GPUs for sorting 32-bit integers. In this paper, the authors showed the optimal usage of not only the shared memory that is available to all the SMs, but also the registers that are available to each of the cores. The major contributions of the paper were the memory access optimizations, the avoidance of conflicts, and the efficient usage of the available hardware. The merge sort achieved around 140 million elements per second on the GTX 580 for an input list of 2^20 elements. The authors also demonstrated the applicability of their technique to variable-length keys such as strings. Both of these papers very efficiently describe the use of the massive parallelism offered by accelerators. We learn from them that in order to extract the highest possible performance, we need to run as many threads as possible to achieve bandwidth saturation, remove irregularities in data access, and maximize shared cache usage.
4 Need for Hybrid Computing
In the world of parallel computation, we can always refer to the no-free-lunch theorem. In particular, Prof. Jack Dongarra states [5]:
GPUs have evolved to the point where many real-world applications are easily implemented
on them and run significantly faster than on multi-core systems. Future computing archi-
tectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core
CPUs.
In this context we can easily contrast many-core and multi-core computing: a many-core is a chip that bundles many low-compute cores, while a multi-core is a chip that packs a small number of high-power cores. There are several scientific computations that have high compute requirements but are aided by graphical simulations, which require low compute but a higher number of working threads. Hence, we can intuitively assume that there are many large-scale applications that are not entirely suitable for accelerator-type computation. Some kind of trade-off must exist in order to cater to all types of applications. In particular, we may look at cases where there has been sufficient proof that some operations, such as atomic operations, are not well suited to accelerators. Computations such as histograms, which are not time consuming on the CPU, often suffer on GPUs. In [15], the authors presented an implementation of histograms on GPUs and dealt with the complications of computing them efficiently, showing the complexity involved in the accumulation stage across all the bins and the limited precision available. CPUs, however, have proved time and again their efficiency in performing operations such as atomics, gather, and scatter. It is the development of dedicated hardware units for these operations that enables the CPU to excel in their implementations.
In 2010, Intel researchers published a paper attempting to debunk the GPU computing myth [9]. The paper dealt with 14 computing primitives that are commonly used in large applications. The authors implemented each of the workloads on the then-current generation of multi-core CPUs as well as on GPUs. They showed that, with proper optimizations and hardware utilization on the CPU, GPU results do not out-perform the CPU by the several orders of magnitude reported in many recent works.

The authors primarily revisited previous works on GPUs that claimed improvements ranging from 10x to 100x. However, they found that with proper tuning of code on the CPU and the GPU, the gap narrows to only about 2.5x. This is a very significant result that was not expected after the success of the accelerators. The authors mainly focused on workloads that find high application in scientific and financial computations. Workloads like SGEMM, bilateral filtering, FFT, convolution, Monte Carlo simulations, and sorting are widely known and worked upon. The authors focused on re-designing each of these workloads to extract the maximum performance out of a multi-core CPU. Techniques such as multi-level caching, data reuse, and vector SIMD operations were applied wherever possible. These tunings make complete sense, as the simple assumption that the compiler takes care of all internal optimizations is wrong. Hence, we need to look into the working of each and every instruction on a parallel CPU just as we do on an accelerator such as a GPU. So, in a nutshell, the authors arrived at an essential truth: accelerators are not the only way ahead, and multi-core CPUs can in no way be ignored from the whole picture.
From this work we eventually arrive at the idea of putting both the multi-core and the many-core processors to work together. A platform where these devices can communicate among themselves was needed to make this possible. With the advent of PCI Express, this problem has been solved to a large extent, as many accelerators and host processors can now communicate among themselves, albeit bound by bandwidth. This provides fertile ground for an attempt at hybrid multicore computing.
4.1 Current Results in Hybrid Computing
The Matrix Algebra on GPU and Multicore Architectures (MAGMA) initiative at the University of Tennessee is one of the earliest works on hybrid computation. The group worked on familiar numerical techniques on hybrid CPU+GPU platforms. One of the relevant works from this group is [19], in which the authors primarily focus on developing a Dense Linear Algebra (DLA) benchmark suite analogous to the LAPACK benchmark for conventional parallel systems. This DLA suite is proposed purely to be run on hybrid multicore systems. The main motivation of the work stems from the fact that linear algebra computations are highly compute-intensive kernels involving many double-precision operations on real numbers. The solution is based on data parallelism, where the CPU acts as the master device and runs the mother thread. Smaller kernels that are inefficient to run on the GPU remain on the CPU, and the rest are offloaded to the GPU. The asynchrony between the two devices is also exploited at regular intervals to offload data and code. The authors demonstrate the performance of their solution strategy by designing hybrid code for LU factorization of matrices. The results are nearly 30% better than pure GPU implementations.
4.2 Our attempts at hybrid computation
In an early attempt at hybrid computation, we worked on the problem of list ranking, itself another popular computing primitive. A very early work on parallel list ranking was proposed by Wyllie in his Ph.D. thesis [22]. The biggest contribution of Wyllie's work was the Parallel Random Access Machine (PRAM) model, in which the author proposed the use of a shared memory that can be accessed by several processors and used for sharing data as well as for synchronization. In his thesis he applied the PRAM model to solve the list ranking problem using a technique called "pointer jumping". Pointer jumping is a very fundamental operation in parallel computation, where each node is repeatedly made to point to the successor of its successor until convergence. This operation finds application in a wide variety of problems. In [22], Wyllie also discusses several other problems that can be solved in the PRAM model by adapting conventional sequential algorithms.
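A minimal sketch of pointer jumping for list ranking is given below; it is a sequential Python rendering of the PRAM rounds (our own illustrative formulation, not Wyllie's original code). Here `succ[i]` holds the successor of node `i`, and the tail points to itself.

```python
def list_rank(succ):
    """Wyllie-style pointer-jumping list ranking, written as a
    sequential sketch of the PRAM algorithm. succ[i] is the successor
    of node i, with the tail pointing to itself; the result is each
    node's distance from the tail."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    succ = list(succ)
    # O(log n) rounds; on a PRAM every node performs one jump per
    # round in parallel, so each round is computed from a snapshot.
    for _ in range(n.bit_length()):
        rank = [rank[i] + rank[succ[i]] for i in range(n)]  # accumulate
        succ = [succ[succ[i]] for i in range(n)]            # jump pointers
    return rank

print(list_rank([1, 2, 3, 3]))  # prints [3, 2, 1, 0]
```

Each round halves the remaining pointer distances, so O(log n) rounds suffice, at the cost of O(n log n) total work compared with the O(n) sequential algorithm.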
Wei and JaJa published a work on list ranking [21] that implemented list ranking on GTX 200 series GPUs. This result was in turn an improvement over an earlier work on list ranking [13]. Wei and JaJa employed a randomized algorithm in which elements are binned into a certain number of bins, which are in turn allocated to the SMs of a GPU. Each thread block then completes its ranking process independently of the others, and the results are finally consolidated. The authors showed a performance of around 300 ms for a random list of 64 million elements.
This work was later improved by us [1] using a hybrid mechanism in which we employed fractional independent sets to engage the CPU and the GPU at the same time. The algorithm utilized asynchronous data transfer between the GPU global memory and the CPU memory. The iterative algorithm took a finite number of steps to reduce the list to n/log(n) nodes; the remaining nodes were removed with proper book-keeping. The smaller list was then ranked on the GPU using a popular GPU algorithm for list ranking, which took a very small fraction of the total time. After this step, the initially removed nodes were re-inserted into the list in the reverse order of their removal. Overall, the algorithm performed almost 25% better than the conventional GPU algorithm. Figure 3 below plots the results and compares them with other implementations. This result gives a clear indication that the hybrid approach is significantly better than the homogeneous accelerator-based approach.
Figure 3: List Ranking Results (running time in ms vs. list size in millions of elements, comparing WJ10, Pure GPU-MD5, Pure GPU-MT, and the hybrid implementation).
Figure 4: Connected Components Results (running time in ms vs. partitioning threshold in %, comparing hybrid and pure GPU times for 1.5 million and 3 million nodes).
Also in [1], we worked on the graph connected components problem. This is an extension of the work done in [18], where the authors used the O(n log n) Shiloach-Vishkin algorithm [17] to compute the connected components of a random graph. Here we experiment with the work-partitioning path of hybridization: we first partition the graph statically using a certain threshold and then run the best known algorithm to compute connected components on each device. These algorithms execute in an overlapped fashion and then synchronize. This is followed by a consolidation step in which the results from both devices are combined to produce the final result. Due to this non-blocking execution of the kernels on the two devices, it is pragmatic to say that the hybrid approach is always a better option where there is ample data parallelism. Figure 4 shows the results of our implementation on random graphs with 1.5 million and 3 million vertices.
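The partition-compute-consolidate pattern described above can be sketched as follows. This is an illustration of ours, using union-find in place of the actual Shiloach-Vishkin kernels; the two per-partition passes stand in for the overlapped CPU and GPU executions.

```python
class DSU:
    """Union-find, used for the per-device passes and the consolidation."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def components(n, edges):
    # Placeholder for the best known per-device algorithm: label every
    # vertex with a representative of its component in this edge set.
    dsu = DSU(n)
    for u, v in edges:
        dsu.union(u, v)
    return [dsu.find(v) for v in range(n)]

def hybrid_components(n, edges, threshold=0.5):
    cut = int(len(edges) * threshold)         # static partition of the edges
    cpu_labels = components(n, edges[:cut])   # would run on the CPU
    gpu_labels = components(n, edges[cut:])   # would run, overlapped, on the GPU
    # Consolidation step: merge the two partial labellings into one.
    dsu = DSU(n)
    for v in range(n):
        dsu.union(v, cpu_labels[v])
        dsu.union(v, gpu_labels[v])
    return [dsu.find(v) for v in range(n)]
```

The `threshold` parameter here plays the same role as the threshold on the x-axis of Figure 4: it decides how much of the edge set each device is responsible for before the consolidation.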
4.3 Other work in hybrid computing
Another interesting recent work on hybrid systems was proposed in [8], in which the authors propose a breadth-first search technique for hybrid platforms. The basic technique applied in this paper is architectural optimization. The authors first present a CPU implementation of graph BFS, followed by a hybrid implementation. The hybrid implementation is not of an overlapping nature: the application chooses the CPU or the GPU code optimally depending on the graph type, so this is more of an interleaved execution than an overlapping data-parallel execution. The authors also show that a fully optimized quad-core CPU can match a GPU in the case of graph traversals, which are usually irregular in nature. Results show that the hybrid implementation can process nearly 250 million edges per second using 16 threads on a quad-core CPU, while the GPU results are in the range of 300 million edges per second. They experiment on random graphs generated using the R-MAT model and the uniform model.
5 Future Directions
There are several interesting areas that can be investigated during the course of this work. As hybrid computing is a relatively new topic in the world of parallel processing, several topics can be re-examined in the hybrid context. Some of the interesting ones are discussed below.
5.1 Scheduling and Optimization
In future, we plan to work on scheduling mechanisms for the hybrid platform. Thread scheduling,
safety, and communication latency remain important aspects of parallel programming, and these
issues tend to dominate the analysis of hybrid algorithms. Some work along these lines has appeared
in [11, 20]. A technique for list scheduling appears in [11], where the author analyzes the
performance of thread scheduling on heterogeneous systems. However, the work of [11] completely
ignores the communication cost incurred in transferring data across the devices of the
computing platform. Further, hybrid computing platforms are more tightly coupled than heterogeneous
computing systems. It would therefore be interesting to analyze the thread scheduling problem on a
tightly coupled hybrid platform, taking communication latency into account.
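The effect of communication cost on scheduling decisions can be seen in a minimal greedy list scheduler. This is a toy sketch under assumed costs, far simpler than the models of [11, 20]: assigning a task to the faster device can lose overall if its input must first be copied across the bus.

```python
# Minimal greedy list scheduler with a cross-device transfer penalty.
# exec_cost maps task -> {device: time}; parent maps task -> predecessor.

def schedule(tasks, exec_cost, transfer_cost, parent):
    ready = {"cpu": 0.0, "gpu": 0.0}      # earliest free time per device
    placed = {}                           # task -> (device, finish_time)
    for t in tasks:                       # tasks arrive in priority order
        best = None
        for dev in ("cpu", "gpu"):
            start = ready[dev]
            p = parent.get(t)
            if p is not None and placed[p][0] != dev:
                start = max(start, placed[p][1] + transfer_cost)  # copy cost
            elif p is not None:
                start = max(start, placed[p][1])
            finish = start + exec_cost[t][dev]
            if best is None or finish < best[2]:
                best = (dev, start, finish)
        dev, _, finish = best
        ready[dev] = finish
        placed[t] = (dev, finish)
    return placed

exec_cost = {"a": {"cpu": 4.0, "gpu": 1.0}, "b": {"cpu": 0.5, "gpu": 1.0}}
plan = schedule(["a", "b"], exec_cost, transfer_cost=5.0, parent={"b": "a"})
print(plan)   # "b" stays on the GPU next to its parent despite a faster CPU time
```

Here `b` runs faster on the CPU in isolation, but the transfer penalty for moving its input off the GPU keeps it co-located with its parent, which is exactly the effect a model without communication cost cannot capture.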
Optimization is a principal aspect of programming multithreaded devices. Optimization principles
based on workload distribution, scheduling, memory access, and thread overheads can drastically
increase the performance of an otherwise unoptimized routine. Optimization is a large area of research
in itself and offers a number of topics for investigation in parallel processing. A good design of a
primitive depends both on its algorithmic foundations and on the optimizations that have been applied.
5.2 Randomized Approaches
Randomized approaches is parallel computing is also an interesting area of research. We plan on de-
ploying the random number generator which we have devised for Monte Carlo simulations. These sim-
ulations usually require a good and reliable source of randomness. Randomized algorithms have been
proved to perform than deterministic ones in many problems of computer science [12]. High quality ran-
dom number source is essential for designing cryptographic applications. These applications are of high
computational intensity and often require good systems knowledge to provide good hardware implemen-
tations. It will be interesting to see if the randomized mechanism perform better in a hybrid setting rather
than the homogeneous solutions.
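The Monte Carlo simulations mentioned above reduce to drawing many samples from a reliable random source and averaging a predicate. As a sketch, the standard library generator stands in for a device-side RNG; the example estimates pi from points in the unit square.

```python
import random

# Monte Carlo estimate of pi: the fraction of random points in the unit
# square that fall inside the quarter circle approaches pi/4.

def estimate_pi(n, seed=42):
    rng = random.Random(seed)             # reproducible source of randomness
    inside = sum(1 for _ in range(n)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / n

print(estimate_pi(100_000))   # close to 3.14
```

The quality of the estimate depends directly on the quality of the underlying generator, which is why a good randomness source matters for these workloads.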
5.3 Synchronization
Synchronization of devices is also an important issue at this point of time. Suitable programming support
from the vendors would enable one to synchronize all the devices in a hybrid computing platform. Cur-
rent techniques usually involve a high overhead which affects the overall performance. Dynamic load
balancing and distribution are dependent entirely on good synchronization methods. As the bus band-
width between the devices are orders of magnitude slower than shared memory accesses, it is difficult to
exchange messages between devices while the program is under execution. This is a severe bottleneck
to hybrid computations which we plan on addressing at the algorithmic level by designing algorithms so
as to minimize the need for synchronization.
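One algorithmic remedy of the kind suggested above is batching: each device accumulates work privately and results are exchanged only once per batch. The sketch below is an assumption-laden toy, with a counting callback standing in for an expensive cross-device synchronization, but it shows the frequency reduction.

```python
# Batched synchronization: accumulate locally, synchronize once per batch
# instead of once per item.

def process(data, batch_size, sync):
    total, pending = 0, 0
    for i, x in enumerate(data, 1):
        pending += x                      # device-local accumulation
        if i % batch_size == 0:
            total = sync(total, pending)  # one expensive synchronization
            pending = 0
    return sync(total, pending)           # flush the final partial batch

syncs = []
result = process(range(10), batch_size=4,
                 sync=lambda a, b: (syncs.append(1), a + b)[1])
print(result, len(syncs))   # 45 3  -> three syncs instead of ten
```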
5.4 Design of Primitives
In modern-day parallel computing, writing parallel software for multicore devices is a challenge that
needs to be addressed immediately. Truly parallel software, fine-tuned to run on commodity
computing devices, can fully utilize the computing power that multicore devices offer. However,
it is difficult to move away from sequential thinking and devise parallel solutions. One mechanism
that has evolved is primitive-based parallel computing. In this technique, the first job is to design
some of the well-known primitives that software relies on; common examples are searching,
sorting, ranking, graph-theoretic operations, and matrix operations. The primitive-based approach
designs software as a set of calls to these parallel primitives, whose results are then condensed to
produce the final output.
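The primitive-based style can be illustrated with a short sketch in which standard library routines stand in for tuned device kernels; the primitive names and composition are chosen for illustration only.

```python
from itertools import accumulate

# An application written as a pipeline of calls to parallel primitives:
# here "sort" and "scan" (prefix sums), plus a "rank" built on top of sort.

def prefix_sums(xs):                      # the scan primitive
    return list(accumulate(xs))

def rank(xs, key):                        # ranking via the sort primitive
    return sorted(xs).index(key)

data = [5, 3, 8, 1]
print(prefix_sums(sorted(data)))   # [1, 4, 9, 17]
print(rank(data, 5))               # 2
```

The application author composes such calls and never touches device code; only the primitives themselves need to be tuned per platform.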
5.5 Programming Models
The immediate interest of this work is to propose an efficient programming model for the hybrid
platform. The accelerator-based parallel computing community is constantly looking to develop
devices with good programmability. In [2], Blelloch proposes the VRAM model of parallel
computation, a successor to the earlier PRAM model, and shows how vector multiprocessors can
compute several fundamental primitives. This work remains relevant because current-generation
multiprocessors all have SIMD and vector processing units that require vector instructions for
performance. In [2], the author also shows how a group of primitives can be combined to design a
parallel application; the grouping of several primitives as a linear sequence of parallel kernels
provides a seamless approach to designing parallel programs. Along these lines, we will work
towards a parallel programming model for hybrid systems in which the practitioner writes a
sequential array of hybrid kernels that execute across all the available devices and accelerators. All
the tuning parameters, such as data partitioning, degree of task parallelism, and synchronization,
can be included as part of the kernel calls.
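A rough sketch of the envisioned model follows. All names (`HybridKernel`, `run_pipeline`, the `gpu_fraction` and `sync_after` parameters) are hypothetical; both data slices run on the host here, where a real runtime would dispatch them to the GPU and CPU.

```python
# A sequential array of "hybrid kernels", each carrying its tuning
# parameters (data split, synchronization) as part of the call.

class HybridKernel:
    def __init__(self, fn, gpu_fraction=0.5, sync_after=True):
        self.fn = fn
        self.gpu_fraction = gpu_fraction  # data-partitioning parameter
        self.sync_after = sync_after      # synchronization parameter

    def run(self, data):
        split = int(len(data) * self.gpu_fraction)
        # a real runtime would dispatch the slices to GPU and CPU
        return self.fn(data[:split]) + self.fn(data[split:])

def run_pipeline(kernels, data):
    for k in kernels:                     # the sequential array of kernels
        data = k.run(data)
    return data

pipeline = [HybridKernel(lambda xs: [x * 2 for x in xs], gpu_fraction=0.25),
            HybridKernel(lambda xs: [x + 1 for x in xs])]
print(run_pipeline(pipeline, [1, 2, 3, 4]))   # [3, 5, 7, 9]
```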
Massively multithreaded computing has entirely changed the horizon of parallel computing and
through this thesis we want to further this idea by positioning hybrid computing as the future of par-
allel computing.
References
[1] BANERJEE, D. S., AND KOTHAPALLI, K. Hybrid Algorithms for List Ranking and Graph Con-
nected Components. In Proc. of 18th Annual International Conference on High Performance Com-
puting (HiPC) (2011).
[2] BLELLOCH, G. E. Vector models for data-parallel computing. MIT Press, Cambridge, MA, USA,
1990.
[3] COLE, R., AND VISHKIN, U. Faster Optimal Parallel Prefix sums and List Ranking. Information
and Computation 81, 3 (1989), 334–352.
[4] DAVIDSON, A., TARJAN, D., GARLAND, M., AND OWENS, J. D. Efficient Parallel Merge Sort
for Fixed and Variable Length Keys. In Proceedings of Innovative Parallel Computing (InPar ’12)
(May 2012).
[5] DONGARRA, J. http://www.nvidia.com/object/GPU_Computing.html.
[6] FIALKA, O., AND CADIK, M. FFT and Convolution Performance in Filtering on GPU. In Pro-
ceedings of the Conference on Information Visualization (2006).
[7] GOVINDARAJU, N., AND MANOCHA, D. Cache-efficient Numerical Algorithms using Graphics
Hardware. Parallel Computing 33, 10-11 (2007), 663–684.
[8] HONG, S., OGUNTEBI, T., AND OLUKOTUN, K. Efficient parallel graph exploration for multi-
core CPU and GPU. In IEEE Parallel Architectures and Compilation Techniques (PACT) (2011).
[9] LEE, V. W., KIM, C., CHHUGANI, J., DEISHER, M., KIM, D., NGUYEN, A. D., SATISH, N.,
SMELYANSKIY, M., CHENNUPATY, S., HAMMARLUND, P., SINGHAL, R., AND DUBEY, P. De-
bunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU.
In Proceedings of the 37th Annual International Symposium on Computer Architecture (2010),
ISCA '10, ACM.
[10] LEISCHNER, N., OSIPOV, V., AND SANDERS, P. GPU sample sort. In 24th IEEE International
Symposium on Parallel and Distributed Processing, IPDPS 2010 (2010), IEEE, pp. 1–10.
[11] LI, K. Performance analysis of list scheduling in heterogeneous computing systems. International
Journal of Electrical and Electronics Engineering (June 2010).
[12] MOTWANI, R., AND RAGHAVAN, P. Randomized algorithms. Cambridge University Press, New
York, NY, USA, 1995.
[13] REHMAN, M. S., KOTHAPALLI, K., AND NARAYANAN, P. J. Fast and Scalable List Ranking on
the GPU. In Proc. of ACM ICS (2009).
[14] SCARPAZZA, D. P., VILLA, O., AND PETRINI, F. Efficient Breadth-First Search on the Cell/BE
Processor. IEEE Trans. Parallel Distrib. Syst. 19, 10 (2008), 1381–1395.
[15] SCHEUERMANN, T., AND HENSLEY, J. Efficient histogram generation using scattering on GPUs.
In Proceedings of the 2007 symposium on Interactive 3D graphics and games (New York, NY,
USA, 2007), I3D ’07, ACM, pp. 33–37.
[16] SENGUPTA, S., HARRIS, M., ZHANG, Y., AND OWENS, J. D. Scan primitives for GPU computing.
In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware
(2007).
[17] SHILOACH, Y., AND VISHKIN, U. An O(log n) parallel connectivity algorithm. J. Algorithms
(1982), 57–67.
[18] SOMAN, J., KOTHAPALLI, K., AND NARAYANAN, P. Some GPU Algorithms for Graph Con-
nected Components and Spanning Tree. In Parallel Processing Letters (2010), vol. 20, pp. 325–
339.
[19] TOMOV, S., DONGARRA, J., AND BABOULIN, M. Towards dense linear algebra for hybrid GPU
accelerated manycore systems. Parallel Computing 12 (Dec. 2009), 10–16.
[20] TOPCUOGLU, H., HARIRI, S., AND WU, M.-Y. Performance-effective and low-complexity task
scheduling for heterogeneous computing. Parallel and Distributed Systems, IEEE Transactions on
13, 3 (Mar. 2002), 260–274.
[21] WEI, Z., AND JAJA, J. Optimization of linked list prefix computations on multithreaded gpus using
cuda. In The 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS)
(April 2010).
[22] WYLLIE, J. C. The complexity of parallel computations. PhD thesis, Cornell University, Ithaca,
NY, 1979.