The document describes a method for speeding up cycle-based logic simulation using graphics processing units (GPUs). The authors develop a parallel cycle-based logic simulation algorithm that uses and-inverter graphs (AIGs) as the design representation. They create two clustering algorithms that partition circuit gates into independent blocks that can be simulated concurrently. Experimental results show speedups of up to 5x using the first clustering algorithm and up to 21x using the second algorithm on several benchmarks. The work demonstrates that GPUs can significantly accelerate logic simulation through parallelization.
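The core of cycle-based simulation on an AIG is evaluating every node in topological order once per clock cycle. A minimal sketch of that evaluation loop is below; the node encoding and the tiny example netlist are illustrative, not taken from the paper.

```python
# Hedged sketch: one cycle of and-inverter graph (AIG) evaluation.
# Every internal node is a 2-input AND; inversions live on the edges.

def simulate_aig(nodes, inputs):
    """Evaluate an AIG in topological order for one cycle.

    nodes:  list of (fanin0, inv0, fanin1, inv1) tuples; fanin indices
            refer to positions in the growing value list (primary inputs
            first, then earlier nodes).
    inputs: list of 0/1 primary-input values.
    """
    values = list(inputs)
    for fan0, inv0, fan1, inv1 in nodes:
        a = values[fan0] ^ inv0      # apply optional inverter on edge 0
        b = values[fan1] ^ inv1      # apply optional inverter on edge 1
        values.append(a & b)         # 2-input AND node
    return values

# XOR built from three AND nodes: x ^ y = NOT(NOT(x AND NOT y) AND NOT(NOT x AND y))
nodes = [(0, 0, 1, 1),   # node 2 = x AND NOT y
         (0, 1, 1, 0),   # node 3 = NOT x AND y
         (2, 1, 3, 1)]   # node 4 = NOT n2 AND NOT n3 (output is inverted)
vals = simulate_aig(nodes, [1, 0])
print(vals[4] ^ 1)       # XOR(1, 0) -> 1
```

In a GPU implementation along the lines the summary describes, each clustered block of such nodes would be evaluated by one thread block, with clusters chosen so they can run concurrently.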
Highlighted notes on Hybrid Multicore Computing
Notes taken while doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
In this comprehensive report, Prof. Dip Banerjee describes the benefit of utilizing both multicore systems (CPUs with vector instructions) and manycore systems (GPUs with a large number of low-speed ALUs). Such hybrid systems benefit several algorithms, since a single accelerator cannot be optimal for all parts of an algorithm (some computations are very regular, while others are very irregular).
International Journal of Engineering Research and Development - IJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture Engineering,
Aerospace Engineering.
AN EFFICIENT FPGA IMPLEMENTATION OF MRI IMAGE FILTERING AND TUMOUR CHARACTERI... - VLSICS Design
This paper presents an efficient architecture for various image filtering algorithms and tumor characterization using Xilinx System Generator (XSG). This architecture offers an alternative through a graphical user interface that combines MATLAB, Simulink, and XSG, and explores important aspects concerned with hardware implementation. The performance of this architecture, implemented on a SPARTAN-3E Starter Kit (XC3S500E-FG320), exceeds that of architectures with similar or greater resources. The proposed architecture reduces utilization of the resources available on the target device by 50%.
Hardware simulation for exponential blind equal throughput algorithm using sy... - IJECEIAES
The scheduling mechanism is the process of allocating radio resources to User Equipment (UE) transmitting different flows at the same time. It is performed by the scheduling algorithm implemented in the Long Term Evolution base station, the Evolved Node B. Most proposed algorithms do not handle real-time and non-real-time traffic simultaneously; thus, a UE with bad channel quality may starve because no resources are allocated to it for a long time. To solve this problem, the Exponential Blind Equal Throughput (EXP-BET) algorithm is proposed: the user with the highest priority metric, calculated using the EXP-BET metric equation, is allocated resources first. This study investigates the implementation of the EXP-BET scheduling algorithm on an FPGA platform. The metric equation of EXP-BET is modelled and simulated using System Generator, and the design utilizes only 10% of the available FPGA resources. Fixed numbers are used for all inputs to the scheduler. System verification is performed through hardware co-simulation of the EXP-BET metric values. The hardware co-simulation output showed that the EXP-BET metric values match those of the Simulink environment; thus, the algorithm is ready for prototyping, and a Virtex-6 FPGA is chosen as the platform.
FPGA based efficient multiplier for image processing applications using recur... - VLSICS Design
Digital image processing applications like medical imaging, satellite imaging, and biometric trait images rely on multipliers to improve image quality. However, existing multiplication techniques introduce errors in the output and consume more time; hence, error-free high-speed multipliers have to be designed. In this paper we propose an FPGA-based Recursive Error-Free Mitchell Log Multiplier (REFMLM) for image filters. The 2x2 error-free Mitchell log multiplier, designed with zero error by introducing an error correction term, is used in higher-order Karatsuba-Ofman Multiplier (KOM) architectures. The higher-order KOM multipliers are decomposed, using radix 2, into lower-order multipliers down to the basic 2x2 multiplier block, which is implemented with the error-free Mitchell log multiplier. The 8x8 REFMLM is tested with a Gaussian filter to remove noise from a fingerprint image. The multiplier is synthesized using the Spartan-3 FPGA family device XC3S1500-5fg320. It is observed that performance parameters such as area utilization, speed, error, and PSNR are better for the proposed architecture compared to existing architectures.
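For context, a minimal sketch of the plain Mitchell logarithmic multiplication that the error-free variant builds on is shown below; the paper's correction term and Karatsuba-Ofman decomposition are not reproduced here, and this floating-point model is only illustrative of the hardware arithmetic.

```python
# Hedged sketch of Mitchell's approximation: for N = 2^k * (1 + f),
# log2(N) ~ k + f, so a multiply becomes an add of approximate logs.

def mitchell_multiply(a, b):
    """Approximate a*b via Mitchell's log/antilog approximation."""
    assert a > 0 and b > 0
    ka, kb = a.bit_length() - 1, b.bit_length() - 1   # characteristics k
    fa = a / (1 << ka) - 1                            # fractional part f
    fb = b / (1 << kb) - 1                            # so log2(x) ~ k + f
    s = ka + fa + kb + fb                             # add the logs
    k, f = int(s), s - int(s)
    return (1 << k) * (1 + f)                         # antilog: 2^k * (1 + f)

print(mitchell_multiply(13, 9))   # -> 112.0 (exact product is 117)
```

Mitchell's approximation always underestimates the true product (with up to about 11% error), which is exactly the error the REFMLM's correction term is designed to cancel at the 2x2 level.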
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA - VLSICS Design
Video compression is essential to meet technological demands such as low power, less memory, and fast transfer rates for a wide range of devices and multimedia applications. Video compression is primarily achieved by the Motion Estimation (ME) process in any video encoder, which contributes significant compression gain. Sum of Absolute Differences (SAD) is used as the distortion metric in the ME process. In this paper, an efficient Absolute Difference (AD) circuit is proposed which uses a Brent-Kung Adder (BKA) and a comparator based on the modified 1's complement principle and the conditional sum adder scheme. Results show that the proposed architecture reduces delay by 15% and the number of slice LUTs by 42% compared to the conventional architecture. Simulation and synthesis are done on Xilinx ISE 14.2 using a Virtex-7 FPGA.
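The SAD metric itself is simple: for a candidate motion vector, sum the absolute pixel differences between the current block and the displaced reference block. A sketch follows; the block size, frames, and function name are illustrative, not from the paper.

```python
# Hedged sketch: Sum of Absolute Differences for one candidate
# motion vector (dx, dy) in block-based motion estimation.

def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block at (bx, by) in the current frame
    and the block displaced by (dx, dy) in the reference frame."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total

cur = [[5, 5], [5, 5]]
ref = [[1, 2], [3, 4]]
print(sad(cur, ref, 0, 0, 0, 0, n=2))   # |5-1|+|5-2|+|5-3|+|5-4| = 10
```

The encoder evaluates this for many candidate vectors and keeps the minimum; the paper's contribution is making the inner `abs(a - b)` hardware cheap and fast.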
Accelerating Real Time Applications on Heterogeneous Platforms - IJMER
In this paper we describe novel implementations of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), which is suitable for real-time applications like analyzing video in real-time vision systems. Modern graphics cards contain a large number of parallel processors and high-bandwidth memory for accelerating data computation operations. In this paper we give a general idea of how to accelerate real-time applications using heterogeneous platforms. We propose to use some added resources to employ more computationally involved optimization methods. This proposed approach will indirectly accelerate a database by producing better plan quality.
A Survey of Machine Learning Methods Applied to Computer ... - butest
This document discusses various machine learning methods that have been applied to computer architecture problems. It begins by introducing k-means clustering and how it is used in SimPoint to reduce architecture simulation time. It then discusses how machine learning can be used for design space exploration in multi-core processors and for coordinated resource management on multiprocessors. Finally, it provides an example of using artificial neural networks to build performance models to inform resource allocation decisions.
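The k-means step that SimPoint uses can be sketched in a few lines of Lloyd's algorithm; the 1-D data and initial centers below are illustrative stand-ins for SimPoint's basic-block vectors.

```python
# Hedged sketch of Lloyd's k-means iteration (1-D for brevity):
# alternate an assignment step and a center-update step.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# two obvious groups around 1.0 and 9.0
print(sorted(kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 5.0])))
```

SimPoint applies this idea to whole program intervals: intervals landing in the same cluster are assumed to behave similarly, so simulating one representative per cluster stands in for the full run.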
Inference at the edge is of ever-increasing importance for companies, so it is crucial to be able to make models smaller. Compressing models can be loss-less or can result in a loss of accuracy. This presentation provides a survey of compression techniques for deep learning models. It then describes different architectures of AWS IoT Greengrass to combine on-device inference and GPU inference in a hub model. Additionally, the presentation introduces MXNet, which has a small footprint and is efficient both for inference and training in distributed settings.
HOMOGENEOUS MULTISTAGE ARCHITECTURE FOR REAL-TIME IMAGE PROCESSING - cscpconf
The document describes a homogeneous multistage architecture for real-time image processing. It proposes a parallel architecture using multiple identical processing elements connected by different communication links. As an example application, it discusses a multi-hypothesis approach for road recognition, which uses multiple hypotheses to detect and track road edges in video in real-time. Experimental results using an FPGA demonstrate the architecture can detect roadsides in images within 60 milliseconds.
This document describes an FPGA-based human detection system with an embedded platform. Key points:
- The system uses HOG features, SVM classification, and AdaBoost algorithms for human detection in images and video.
- FPGA circuits are designed to accelerate the computationally intensive HOG feature extraction, including modules for gradient calculation, histogram accumulation, and more.
- The full system is implemented on an embedded platform to achieve a real-time human detection system running at 15 frames per second.
- Experimental results show the FPGA-based system has similar detection accuracy to a PC-based software implementation but significantly faster speed, suitable for real-time embedded applications.
IRJET - Wavelet based Image Fusion using FPGA for Biomedical Application - IRJET Journal
This document describes a wavelet-based image fusion system implemented on an FPGA for biomedical applications. The system takes two input images, applies discrete wavelet transforms to both, then fuses the wavelet coefficients to create a single output image. It uses MATLAB and Xilinx System Generator to simulate the design in Simulink and implement it on a Virtex6 FPGA. The results show that wavelet-based fusion can combine the spatial and spectral information from multiple input images into a higher quality fused output image suitable for medical applications like fusing MRI and CT scans.
Machine learning methods can be applied in several areas of computer architecture including reducing simulation time, exploring large design spaces, and optimizing resource management and hardware predictors. Supervised and unsupervised learning algorithms are discussed as well as applications like basic block instruction scheduling where machine learning has shown improvements over traditional heuristic approaches.
The document provides an overview of the Dynamic Reconfigurability in Embedded System Design (DRESD) project. It describes the MicroLAB organization involved in DRESD, various motivations for reconfiguration, definitions of reconfiguration, examples of reconfiguration in everyday life, new frontiers for reconfigurable computing, the DRESD philosophy, involvement of DRESD at Politecnico di Milano and internationally, main DRESD projects, and opportunities to get involved.
Parallel algorithms for multi-source graph traversal and its applications - Subhajit Sahu
Highlighted notes on Parallel algorithms for multi-source graph traversal and its applications.
Notes taken while doing research work under Prof. Kishore Kothapalli.
Seema is working on Multi-source BFS with hybrid-CSR, with applications in APSP, diameter, centrality, reachability.
BFS can be either top-down (from visited nodes, mark the next frontier) or bottom-up (from unvisited nodes, mark the next frontier). She mentioned that the hybrid approach is more efficient. EtaGraph uses unified degree cut (UDC) graph partitioning, and also overlaps data transfer with kernel execution. iCENTRAL uses biconnected components for betweenness centrality on dynamic graphs.
Hybrid CSR uses an additional value array for storing packed "has edge/neighbour" bits. This can give a better memory access pattern if many bits are set, but can cause many threads to wait if many bits are zero. She mentioned that the Volta architecture has an independent PC and stack per thread (similar to a CPU?). Does it then not matter if the threads in a block diverge?
(BFS = G*v, Multi-source BFS = G*vs)
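The top-down/bottom-up distinction from the notes can be sketched as two frontier-expansion steps over the same graph; the example graph and function names are illustrative, and a real hybrid BFS would switch between the two based on frontier size.

```python
# Hedged sketch: the two BFS frontier-expansion directions.

def top_down_step(adj, frontier, visited):
    """From visited frontier nodes, mark their unvisited neighbours."""
    nxt = set()
    for u in frontier:
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                nxt.add(v)
    return nxt

def bottom_up_step(adj, frontier, visited):
    """From unvisited nodes, check whether any neighbour is in the frontier."""
    nxt = set()
    for v in range(len(adj)):
        if v not in visited and any(u in frontier for u in adj[v]):
            visited.add(v)
            nxt.add(v)
    return nxt

# 4-node cycle 0-1-3-2-0: both directions find the same next frontier
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
assert top_down_step(adj, {0}, {0}) == bottom_up_step(adj, {0}, {0}) == {1, 2}
```

Bottom-up wins when the frontier is large (most unvisited nodes find a frontier neighbour quickly); top-down wins when it is small, which is why the hybrid switches between them.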
The document outlines a talk on spatial computation and compiler techniques for configurable architectures. It discusses a high-level compilation layer (HCL) and architecture specific compilation layer (ASCL) that work together to map programs onto target architectures. It also presents BME, a bio-molecular engine simulation environment used to experiment with architectural solutions at the molecular scale.
This document contains Sai Dheeraj Polagani's resume. It includes his contact information, objective, education history, industrial experience at Intel as a physical design engineer and layout engineer, previous experience at Wipro Technologies as an ASIC physical design and verification engineer, technical skills, projects completed, and academic projects during his Master's program. His experience includes full-chip static timing analysis, noise analysis, cross-talk analysis, place and route, timing closure, and physical verification. He has a Master's in Electrical Engineering from San Jose State University and a Bachelor's in Information and Communication Technology from DA-IICT, India.
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil... - IRJET Journal
This document describes hardware co-simulation of classical edge detection algorithms using Xilinx System Generator. It presents designs for the Roberts, Prewitt, and Sobel edge detection operators using minimal FPGA resources. The designs were tested on a Spartan-3E FPGA board using JTAG hardware co-simulation. Results show the Sobel operator provides the best edge detection, followed by Prewitt and Roberts. Compared to prior work, the proposed designs utilize significantly fewer FPGA slices, flip-flops, and lookup tables, demonstrating a more efficient implementation of the classical edge detection algorithms.
The complexity of medical image reconstruction requires tens to hundreds of billions of computations per second. Until a few years ago, special-purpose processors designed especially for such applications were used. Such processors require significant design effort, are thus difficult to change as new reconstruction algorithms evolve, and have limited parallelism. Hence the demand for flexibility in medical applications motivated the use of stream processors with massively parallel architectures, which offer data parallelism.
Software effort estimation through clustering techniques of RBFN network - IOSR Journals
This document discusses using a radial basis function neural network (RBFN) to estimate software development effort based on the COCOMO II model. The RBFN uses COCOMO II data for training and has three layers: an input layer with COCOMO II parameters like size and scale factors, a hidden middle layer with Gaussian activation functions, and an output layer that calculates effort. Two clustering algorithms, K-means and APC-III, are used to determine the receptive fields of the hidden layer neurons. The K-means algorithm partitions the COCOMO II data into clusters and finds cluster centers that minimize within-cluster distance. The RBFN is trained and tested on the COCOMO II data to evaluate its ability to accurately estimate software development effort.
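The forward pass of such a network is short enough to sketch: Gaussian hidden units centred on the cluster centres, followed by a linear output layer. The centres, widths, and weights below are illustrative placeholders, not COCOMO II values.

```python
# Hedged sketch of an RBFN forward pass (1-D input for brevity).
import math

def rbfn(x, centers, widths, weights, bias=0.0):
    """Gaussian hidden layer followed by a linear output layer."""
    hidden = [math.exp(-((x - c) ** 2) / (2 * s ** 2))   # receptive fields
              for c, s in zip(centers, widths)]
    return bias + sum(w * h for w, h in zip(weights, hidden))

# an input sitting exactly on the first centre activates it fully
print(rbfn(2.0, centers=[2.0, 6.0], widths=[1.0, 1.0], weights=[3.0, 5.0]))
```

Clustering (K-means or APC-III, as in the paper) supplies the `centers`; the output `weights` are then fit to the training effort values, typically by least squares.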
Optimization of latency of temporal key Integrity protocol (tkip) using graph... - ijcseit
The document discusses optimization of latency in the Temporal Key Integrity Protocol (TKIP) using hardware-software co-design and graph theory. It presents a mathematical model to partition TKIP algorithms between hardware and software blocks to minimize latency. Simulation results showed the proposed technique achieved lower latency than a hardware-only implementation, reducing latency from 10us to 8us. The technique models TKIP modules as a graph and uses algorithms to assign modules to hardware or software based on latency calculations.
(Im2col) Accelerating deep neural networks on low power heterogeneous architec... - Bomm Kim
This document discusses accelerating deep neural networks on low power heterogeneous architectures. Specifically, it focuses on accelerating the inference time of the VGG-16 neural network on the ODROID-XU4 board, which contains an ARM CPU and Mali GPU. The authors develop parallel versions of VGG-16 using OpenMP for the CPU and OpenCL for the GPU. Several optimizations are explored in OpenCL, including work groups, vector data types, and the CLBlast library. The best OpenCL implementation achieves a 9.4x speedup over the original serial version.
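The im2col transformation the title refers to unrolls every convolution window into a row of a matrix, turning convolution into a single matrix multiply that maps well onto GPU BLAS routines. A minimal sketch, with an illustrative 3x3 image:

```python
# Hedged sketch: im2col for a single-channel 2-D image, stride 1,
# no padding. Each k x k window becomes one row of the output.

def im2col(img, k):
    """Unroll all k x k windows of a 2-D image into rows."""
    h, w = len(img), len(img[0])
    rows = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            rows.append([img[y + dy][x + dx]
                         for dy in range(k) for dx in range(k)])
    return rows  # one row per output pixel, k*k values each

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(im2col(img, 2))
# -> [[1, 2, 4, 5], [2, 3, 5, 6], [4, 5, 7, 8], [5, 6, 8, 9]]
```

Multiplying this matrix by flattened filter weights then yields all convolution outputs at once, which is why im2col-plus-GEMM is a common way to run networks like VGG-16 on CPU and GPU backends.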
The computing continuum extends the high-performance cloud data centers with energy-efficient and low-latency devices close to the data sources located at the edge of the network. However, the heterogeneity of the computing continuum raises multiple challenges related to application and data management. These include (i) how to efficiently provision compute and storage resources across multiple control domains across the computing continuum, (ii) how to decompose and schedule an application, and (iii) where to store an application source and the related data. To support these decisions, we explore in this thesis, novel approaches for (i) resource characterization and provisioning with detailed performance, mobility, and carbon footprint analysis, (ii) application and data decomposition with increased reliability, and (iii) optimization of application storage repositories. We validate our approaches based on a selection of use case applications with complementary resource requirements across the computing continuum over a real-life evaluation testbed.
This document provides a user guide for the Wishbone serializer core, which establishes a transparent Wishbone bridge between two FPGAs using high-speed serial transceivers. The core supports simultaneous communication between a master on one FPGA and a slave on the other FPGA. It contains a Wishbone control logic block, asynchronous FIFOs to handle the two clock domains, and uses an Aurora 8B/10B core for serial transmission. The guide describes the files and architecture of the core.
Inference on edge has an ever increasing performance for companies and thus it is crucial to be able to make models smaller. Compressing models can be loss-less or can result in loss of accuracy. This presentation provides a survey of compression techniques for deep learning models. It then describes different architectures of AWS IoT/Green Grass to combine on-device inference and GPU inference in a hub model. Additionally the presentation introduces MXNet, which has small footprint and efficient both for inference and training in distributed settings.
HOMOGENEOUS MULTISTAGE ARCHITECTURE FOR REAL-TIME IMAGE PROCESSINGcscpconf
The document describes a homogeneous multistage architecture for real-time image processing. It proposes a parallel architecture using multiple identical processing elements connected by different communication links. As an example application, it discusses a multi-hypothesis approach for road recognition, which uses multiple hypotheses to detect and track road edges in video in real-time. Experimental results using a FPGA demonstrate the architecture can detect roadsides in images within 60 milliseconds.
This document describes an FPGA-based human detection system with an embedded platform. Key points:
- The system uses HOG features, SVM classification, and AdaBoost algorithms for human detection in images and video.
- FPGA circuits are designed to accelerate the computationally intensive HOG feature extraction, including modules for gradient calculation, histogram accumulation, and more.
- The full system is implemented on an embedded platform to achieve a real-time human detection system running at 15 frames per second.
- Experimental results show the FPGA-based system has similar detection accuracy to a PC-based software implementation but significantly faster speed, suitable for real-time embedded applications.
IRJET - Wavelet based Image Fusion using FPGA for Biomedical ApplicationIRJET Journal
This document describes a wavelet-based image fusion system implemented on an FPGA for biomedical applications. The system takes two input images, applies discrete wavelet transforms to both, then fuses the wavelet coefficients to create a single output image. It uses MATLAB and Xilinx System Generator to simulate the design in Simulink and implement it on a Virtex6 FPGA. The results show that wavelet-based fusion can combine the spatial and spectral information from multiple input images into a higher quality fused output image suitable for medical applications like fusing MRI and CT scans.
Machine learning methods can be applied in several areas of computer architecture including reducing simulation time, exploring large design spaces, and optimizing resource management and hardware predictors. Supervised and unsupervised learning algorithms are discussed as well as applications like basic block instruction scheduling where machine learning has shown improvements over traditional heuristic approaches.
The document provides an overview of the Dynamic Reconfigurability in Embedded System Design (DRESD) project. It describes the MicroLAB organization involved in DRESD, various motivations for reconfiguration, definitions of reconfiguration, examples of reconfiguration in everyday life, new frontiers for reconfigurable computing, the DRESD philosophy, involvement of DRESD at Politecnico di Milano and internationally, main DRESD projects, and opportunities to get involved.
Parallel algorithms for multi-source graph traversal and its applicationsSubhajit Sahu
Highlighted notes on Parallel algorithms for multi-source graph traversal and its applications.
While doing research work under Prof. Kishore Kothapalli.
Seema is working on Multi-source BFS with hybrid-CSR, with applications in APSP, diameter, centrality, reachability.
BFS can be either top-down (from visited nodes, mark next frontier), or bottom-up (from unvisited nodes, mark next frontier). She mentioned that hybrid approach is more efficient. EtaGraph uses unified degree cut (UDC) graph partitioning. Also overlaps data transfer with kernel execution. iCENTRAL uses biconnected components for betwenness centrality on dynamic graphs.
Hybrid CSR uses an additional value array for storing packed "has edge/neighbour" bits. This can give better memory access pattern if many bits are set, and cause many threads to wait if many bits are zero. She mentioned Volta architecture has independent PC, stack per thread (similar to CPU?). Does is not matter then if the threads in a block diverge?
(BFS = G*v, Multi-source BFS = G*vs)
The document outlines a talk on spatial computation and compiler techniques for configurable architectures. It discusses a high-level compilation layer (HCL) and architecture specific compilation layer (ASCL) that work together to map programs onto target architectures. It also presents BME, a bio-molecular engine simulation environment used to experiment with architectural solutions at the molecular scale.
This document contains Sai Dheeraj Polagani's resume. It includes his contact information, objective, education history, industrial experience at Intel as a physical design engineer and layout engineer, previous experience at Wipro Technologies as an ASIC physical design and verification engineer, technical skills, projects completed and academic projects during his Master's program. His experience includes full-chip static timing analysis, noise analysis, cross-talk analysis, place and route, timing closure, and physical verification. He has a Master's in Electrical Engineering from San Jose State University and a Bachelors in Information and Communication Technology from DA-IICT, India.
IRJET-Hardware Co-Simulation of Classical Edge Detection Algorithms using Xil...IRJET Journal
This document describes hardware co-simulation of classical edge detection algorithms using Xilinx System Generator. It presents designs for Robert, Prewitt, and Sobel edge detection operators using minimum FPGA resources. The designs were tested on a Spartan-3E FPGA board using JTAG hardware co-simulation. Results show the Sobel operator provides the best edge detection, followed by Prewitt and Robert. Compared to prior work, the proposed designs utilize significantly fewer FPGA slices, flip flops, and lookup tables, demonstrating more efficient implementation of the classical edge detection algorithms.
The complexity of Medical image reconstruction requires tens to hundreds of billions of computations per second. Until few years ago, special purpose processors designed especially for such applications were used. Such processors require significant design effort and are thus difficult to change as new algorithms in reconstructions evolve and have limited parallelism. Hence the demand for flexibility in medical applications motivated the use of stream processors with massively parallel architecture. Stream processing architectures offers data parallel kind of parallelism.
Software effort estimation through clustering techniques of RBFN networkIOSR Journals
This document discusses using a radial basis function neural network (RBFN) to estimate software development effort based on the COCOMO II model. The RBFN uses COCOMO II data for training and has three layers: an input layer with COCOMO II parameters like size and scale factors, a hidden middle layer with Gaussian activation functions, and an output layer that calculates effort. Two clustering algorithms, K-means and APC-III, are used to determine the receptive fields of the hidden-layer neurons. The K-means algorithm partitions the COCOMO II data into clusters and finds cluster centers that minimize the distance between data points and their cluster centers. The RBFN is trained and tested on the COCOMO II data to evaluate its ability to accurately estimate software development effort.
Optimization of latency of temporal key Integrity protocol (tkip) using graph...ijcseit
The document discusses optimization of latency in the Temporal Key Integrity Protocol (TKIP) using hardware-software co-design and graph theory. It presents a mathematical model to partition TKIP algorithms between hardware and software blocks to minimize latency. Simulation results showed the proposed technique achieved lower latency than a hardware-only implementation, reducing latency from 10us to 8us. The technique models TKIP modules as a graph and uses algorithms to assign modules to hardware or software based on latency calculations.
(Im2col)accelerating deep neural networks on low power heterogeneous architec...Bomm Kim
This document discusses accelerating deep neural networks on low power heterogeneous architectures. Specifically, it focuses on accelerating the inference time of the VGG-16 neural network on the ODROID-XU4 board, which contains an ARM CPU and Mali GPU. The authors develop parallel versions of VGG-16 using OpenMP for the CPU and OpenCL for the GPU. Several optimizations are explored in OpenCL, including work groups, vector data types, and the CLBlast library. The best OpenCL implementation achieves a 9.4x speedup over the original serial version.
The computing continuum extends the high-performance cloud data centers with energy-efficient and low-latency devices close to the data sources located at the edge of the network. However, the heterogeneity of the computing continuum raises multiple challenges related to application and data management. These include (i) how to efficiently provision compute and storage resources across multiple control domains across the computing continuum, (ii) how to decompose and schedule an application, and (iii) where to store an application source and the related data. To support these decisions, we explore in this thesis novel approaches for (i) resource characterization and provisioning with detailed performance, mobility, and carbon footprint analysis, (ii) application and data decomposition with increased reliability, and (iii) optimization of application storage repositories. We validate our approaches based on a selection of use-case applications with complementary resource requirements across the computing continuum over a real-life evaluation testbed.
This document provides a user guide for the Wishbone serializer core, which establishes a transparent Wishbone bridge between two FPGAs using high-speed serial transceivers. The core supports simultaneous communication between a master on one FPGA and a slave on the other FPGA. It contains a Wishbone control logic block, asynchronous FIFOs to handle the two clock domains, and uses an Aurora 8B/10B core for serial transmission. The guide describes the files and architecture of the core.
This document provides an overview of a digital design course. It discusses:
1) The course objectives which are for students to learn digital system analysis, Boolean algebra, and how to design combinational and sequential circuits.
2) The course outline which covers an introduction to digital systems, numbering systems, Boolean algebra, combinational circuits, and sequential circuits.
3) Examples of digital systems such as computers, mobile phones, and washing machines which process discrete digital inputs and outputs.
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAVLSICS Design
Video compression is essential to meet technological demands such as low power, small memory footprint, and fast transfer rates across a wide range of devices and multimedia applications. Video compression is primarily achieved by the Motion Estimation (ME) process in any video encoder, which contributes significant compression gain. Sum of Absolute Difference (SAD) is used as the distortion metric in the ME process. In this paper, an efficient Absolute Difference (AD) circuit is proposed which uses a Brent-Kung Adder (BKA) and a comparator based on a modified 1's complement principle and a conditional sum adder scheme. Results show that the proposed architecture reduces delay by 15% and the number of slice LUTs by 42% compared to the conventional architecture. Simulation and synthesis are done on Xilinx ISE 14.2 using a Virtex 7 FPGA.
Parallelization of the LBG Vector Quantization Algorithm for Shared Memory Sy...CSCJournals
This paper proposes a parallel approach for the Vector Quantization (VQ) problem in image processing. VQ deals with codebook generation from the input training data set and replacement of any arbitrary data with the nearest codevector. Most of the efforts in VQ have been directed towards designing parallel search algorithms for the codebook, and little has hitherto been done in evolving a parallelized procedure to obtain an optimum codebook. This parallel algorithm addresses the problem of designing an optimum codebook using the traditional LBG type of vector quantization algorithm for shared memory systems and for the efficient usage of parallel processors. Using the codebook formed from a training set, any arbitrary input data is replaced with the nearest codevector from the codebook. The effectiveness of the proposed algorithm is indicated.
This document summarizes a project that aims to improve computational efficiency in solving large sparse linear systems for modeling ablative thermal protection system surfaces by introducing GPU techniques. Three approaches have been attempted so far: 1) using the AMGX library, 2) using new sparse matrix subclasses to perform matrix-vector products on the GPU, and 3) adjusting the CUDA_ITSOL library. The source code uses the PETSc library and takes multiple rounds to converge. Profiling found most time is spent in the preconditioning and solving phases, which could benefit from GPU acceleration. The next step is to fully integrate AMGX and test different CPU-GPU ratios. Version control tools like Git are used to manage the code and track changes.
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHYcsandit
This paper presents a parallel approach to address the time complexity problem associated with sequential algorithms. An image steganography algorithm in the transform domain is considered for implementation. Image steganography is a technique to hide a secret message in an image. With the parallel implementation, a large message can be hidden in a large image since it does not take much processing time. It is implemented on GPU systems. Parallel programming is done using OpenCL on CUDA cores from NVIDIA. The speed-up obtained is very good, with reasonably good output signal quality, when a large amount of data is processed.
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...ijcsit
In Network-on-Chip (NoC) based systems, energy consumption is affected by the task scheduling and allocation schemes, which in turn affect the performance of the system. In this paper we test pre-existing algorithms and introduce a new energy-efficient algorithm for 3D NoC architectures. Efficient dynamic and cluster-based approaches are proposed, along with optimization using a bio-inspired algorithm. The proposed algorithm has been implemented and evaluated on randomly generated benchmarks and real-life applications such as MMS, Telecom, and VOPD. The algorithm has also been tested with the E3S benchmark and compared with the existing Spiral and Crinkle mapping algorithms, showing better reduction in communication energy consumption and improved system performance. Experimental analysis shows that the average reduction in energy consumption is 49%, the reduction in communication cost is 48%, and the average latency improvement is 34%. The cluster-based approach is mapped onto the NoC using the Dynamic Diagonal Mapping (DDMap), Crinkle, and Spiral algorithms, with DDMap providing the best results: compared with Crinkle and Spiral, mapping clusters using DDMap reduces average energy by 14% and 9%, respectively.
This document reviews parallel computing and compares different parallel programming models. It discusses CPU and GPU architectures, highlighting that GPUs are designed for massive parallelism while CPUs balance computing power and flexibility. The document evaluates programming models based on supported system architectures, programming interfaces, workload partitioning, task assignment, synchronization methods, and communication models.
HARDWARE SOFTWARE CO-SIMULATION OF MOTION ESTIMATION IN H.264 ENCODERcscpconf
This paper addresses motion estimation in the H.264/AVC encoder. Compared with standards such as MPEG-2 and MPEG-4 Visual, H.264 can deliver better image quality at the same compressed bit rate, or the same quality at a lower bit rate. The increase in compression efficiency comes at the expense of increased complexity, which must be overcome. An efficient co-design methodology is required, where the encoder software application is highly optimized and structured in a modular and efficient manner, so as to allow its most complex and time-consuming operations to be offloaded to dedicated hardware accelerators. The Motion Estimation algorithm, the most computationally intensive part of the encoder, is simulated using MATLAB. The hardware/software co-simulation is done using the System Generator tool and implemented on a Xilinx Spartan 3E FPGA for different scanning methods.
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...IRJET Journal
This document presents a review of FPGA-based architectures for image capturing, processing, and display using a VGA monitor. It discusses using the Xilinx AccelDSP tool to develop the system on a Spartan 3E FPGA. The AccelDSP tool allows converting a MATLAB design into HDL for implementation on the FPGA. It summarizes the FPGA-based system architecture, which includes units for initialization, data transfer, image processing, and memory management. It then outlines the Xilinx AccelDSP design flow, which verifies the functionality at each stage of converting the floating-point MATLAB model to a fixed-point hardware implementation on the FPGA. The goal is to accelerate image processing applications using the parallelism of the FPGA.
APPROXIMATE ARITHMETIC CIRCUIT DESIGN FOR ERROR RESILIENT APPLICATIONSVLSICS Design
When the application context can accept different levels of exactness in its results, supported by human perception quality, 'Approximate Computing', a term coined about a decade ago, becomes a first priority. Even though computer hardware and software are designed to generate exact results, approximate results are acceptable whenever the error stays within a predefined, adaptive bound. Approximation reduces power demand and critical path delay and improves other circuit metrics. Traditional arithmetic circuits, which generate correct results at a cost in performance, are rapidly being replaced by approximate arithmetic circuits, and this paper discusses their design.
The document discusses the evolution of GPU architecture and capabilities over time. It describes how GPUs have become massively parallel processors with programmable capabilities beyond just graphics. The document outlines the core components of a GPU including the graphics pipeline and programming model. It also discusses how GPUs are well suited for parallel, data-intensive applications and how their capabilities have expanded into general purpose computing through technologies like CUDA.
The document discusses the performance implications of two types of processing-in-memory (PIM) designs - fixed-functional PIM and programmable PIM - on data-intensive applications. It explores these implications through three benchmarks, including a real data analytics application involving gradient computation. The results show that PIMs provide speedups ranging from 2.09x to 91.4x over non-PIM designs. However, fixed-functional and programmable PIMs perform differently across applications, with up to a 90% performance difference. Neither PIM type is optimal for all cases. The best choice depends on workload and PIM characteristics as well as PIM overhead.
A high frame-rate of cell-based histogram-oriented gradients human detector a...IAESIJAI
In terms of accuracy, one of the well-known techniques for human detection is the histogram of oriented gradients (HOG) method. Unfortunately, the HOG feature calculation is highly complex and computationally intensive. Thus, in this research, we aim to achieve a resource-efficient and low-power HOG hardware architecture while maintaining high frame-rate performance for real-time processing. A hardware architecture for human detection in 2D images using a simplified HOG algorithm is introduced in this paper. To increase the frame rate, we simplify the HOG computation while maintaining detection quality. In the hardware architecture, we design a cell-based processing method instead of a window-based method. Moreover, 64 parallel and pipelined architectures are used to increase the processing speed. Our pipeline architecture can significantly reduce memory bandwidth and avoid any external memory utilization. An Altera field-programmable gate array (FPGA) E2-115 board was employed to evaluate the design. The evaluation results show that our design achieves performance of up to 86.51 frames per second (fps) with a relatively low operating frequency (27 MHz). It consumes 48,360 logic elements (LEs) and 4,363 registers. The performance test results reveal that the proposed solution exhibits a trade-off between fps, clock frequency, register usage, and the fps-to-clock ratio.
Matrix Multiplication with Ateji PX for JavaPatrick Viry
Matrix multiplication is a standard benchmark for evaluating the performance of intensive data-parallel operations on recent multi-core processors. This whitepaper shows how to use Ateji PX for Java to achieve state-of-the-art parallel performance by adding a single operator to your existing code.
This document discusses introducing autonomic capabilities to algorithmic skeletons using event-driven programming. It presents a novel way to add self-configuration and self-optimization to skeletons by monitoring their execution status through events. Events allow precisely tracking a skeleton's execution and implementing non-functional concerns like optimizing resource allocation to guarantee a certain execution time. The approach is implemented in the Skandium skeleton framework. It focuses on autonomically optimizing thread allocation to meet a given execution time goal.
Int J Parallel Prog
1 Introduction
Complexity of electronic designs have been rapidly growing. Billions of transistors
are commonly placed in designs to generate higher performance. Multicore and many
core systems provide the performance by handling more work in parallel. However,
the complexity in electronic designs poses a great challenge for building such systems,
where designs continue to be released with latent bugs. Functional design verification
is the task of establishing that a given design accurately implements the intended
functional behavior (specification). Today, design verification has grown to dominate
the cost of electronic system design, in fact, consuming about 60% of design effort
[5]. Among several verification techniques, logic simulation remains the major
verification technique due to its applicability to real designs and its relative ease of
use. However, logic simulation of designs with millions of components is time
consuming and has become a bottleneck in the design process. Any means of speeding up
logic simulation results in productivity gains and faster time-to-market.
We observe that electronic designs exhibit a lot of parallelism that can be exploited
by parallel algorithms. In fact, there are parallel electronic design automation algo-
rithms for almost all stages in the implementation of an electronic design such as logic
optimization, floor-planning, routing, and physical verification stages. In this paper,
we focus on parallelization of logic simulation of electronic designs using Graphics
Processing Units (GPUs).
In the past, Graphics Processing Units (GPUs) were special-purpose application
accelerators, suitable only for conventional graphics applications. Nowadays, GPUs
are routinely used in applications other than graphics such as computational biol-
ogy, computational finance and electronic designs, where huge speedups have been
achieved [22]. This is a result of the availability of extremely high performance hard-
ware (for example, a GTX 480 GPU provides 1.35 Tflops with 480 cores and is under
$500), and the availability of general purpose programming models such as Com-
pute Unified Device Architecture (CUDA) by NVIDIA [13]. CUDA is an extension
to C language and is based on a few abstractions for parallel programming. CUDA
has accelerated the development of parallel applications beyond that of the original
purpose of graphics processing.
There are two main types of logic simulation: Cycle-Based Simulation (CBS) and
Event-Based Simulation (EBS). In CBS, the evaluation schedule of the gates in the design
for each simulation step is determined once, at the compilation time of the simulator.
EBS has a more complicated scheduling policy, where a gate is simulated only if at least
one of its input values has changed. Both CBS and EBS are commonly used in the
industry. In this paper, we work with CBS since it has a less complicated static sched-
uling policy that is amenable to better parallelization. However, we also note that logic
simulation has been classified as the most challenging Computer Aided Design pattern
to parallelize [7]. The ratio of communication (between gate and gate) to Boolean
evaluation is high. Since the communication pattern of logic simulation often defies
regularity, it is difficult to assign elements to processors so that communicating
elements are always close, and high performance losses due to communication overhead
have been reported [26].
Logic simulation proceeds in two phases: compilation and simulation. The compilation
phase is part of any modern simulator and is required to convert the design into
an appropriate form for the simulation phase. Our GPU-based solution is also composed
of these two phases. In the compilation phase, we perform several operations, namely
combinational logic extraction, levelization, clustering, and balancing. This phase
has several GPU-architecture-dependent optimizations, such as the
efficient usage of shared memory and memory coalescing. Levelization helps deter-
mine the dependency between gates, where gates in the same level can be simulated in
parallel. Clustering helps partition the design into a collection of smaller parts where
each part can be simulated independent of other parts. Balancing helps optimize the
usage of threads for a SIMD instruction. These operations are necessary because in
a gate level design, certain gates could have hundreds or thousands of fanouts while
most will have a few fanouts resulting in irregular data access patterns. Clustering
and balancing help organize access patterns for effective simulation. The simulation
phase is where thousands of threads are available to simulate the compiled design
in parallel. We use several optimizations in simulation phase in order to reduce the
communication overhead between the GPU and the CPU. In both phases, we exploit
the GPU memory resources as efficiently as possible in order to have low latency.
Our design representation for logic simulation is different from the other design
representations. We use And-Inverter Graphs (AIGs), in particular we use the AIGER
format [2]. An AIG is an efficient representation for the manipulation of Boolean
functions. It is increasingly used in logic synthesis, technology mapping, verification,
and Boolean satisfiability checking [6,10,20,21,28]. However, to the best of our knowledge, our
work is the first GPU based electronic design automation solution using AIGs. Since
we use AIGs with only a single type of combinational gate (and-gate), our algorithms
can efficiently use the limited low latency memory spaces provided by the GPU.
We developed two clustering algorithms. The first clusters gates using a given
threshold; the second improves on the first with merging and balancing
steps and incorporates memory coalescing and efficient utilization of shared
memory. We validated the effectiveness of our parallel CBS algorithm for both types of
clustering with several benchmarks from IWLS, and AIGER [2,17,24]. We compared
our parallel CBS algorithm with that of a sequential CBS algorithm. Our experiments
show that parallel CBS can speed up the simulation of designs over the sequential
algorithm. In particular, we obtained up to 5x speedup with the first clustering algorithm
and up to 21x with the second clustering algorithm. We obtain better speedups with
larger circuits.
This paper is organized as follows. In the next section, we give an overview of related
work in logic simulation and General Purpose computation on GPUs (GPGPU). We
then describe background in CUDA, AIG format and logic simulation in Sect. 3. In
Sect. 4, we describe our parallel CBS algorithm together with two clustering algo-
rithms and CUDA optimizations. Our experiments are in Sect. 6. Finally, conclusions
and future work are described.
2 Related Work
Catanzaro et al. [7] explain design patterns for parallelization in Computer Aided
Design (CAD). The authors consider 17 different CAD algorithms and partition
these algorithms into three categories: graph algorithms, branch and bound search,
and linear algebra. They also state that graph algorithms in CAD (of which logic
simulation is a member) are the hardest to parallelize among these categories. Sim-
ilarly, CAD case studies of GPU acceleration can be found in [12]. Some of these
case studies are SPICE simulation, fault simulation, static timing analysis, Boolean
satisfiability, fault dictionary computation, and power grid analysis. The authors
describe optimization techniques for irregular EDA applications on GPUs in [14].
In particular, they make use of memory coalescing and shared memory utilization
to improve the speedup of sparse matrix-vector product and breadth-first
search.
There has been a lot of work on parallel logic simulation using architectures other
than GPUs. There are several surveys on parallel logic simulation [4,19]. In particular,
cycle-based simulation approaches are used by IBM and others [15]. The simulation
algorithms in these works are aimed at loosely coupled processor systems.
Different partitioning algorithms for electronic designs are described in [3,18].
Some of these algorithms are based on performance, layout, clustering, network
flow, and spectral methods. Our partitioning approach is similar to cone clustering
described in [16], where a fanin cone of a circuit element embodies an area of com-
binational logic that has the potential to influence signal values provided by that
element.
Several general purpose GPU applications can be found in [13,22]. The application
domain ranges from physics to finance and the medical field. The work in [11] gives a
performance study of general purpose applications using CUDA and compares them
with that of applications written using OpenMP.
There is another general purpose programming environment for GPUs named
OpenCL. OpenCL [23] is a relatively new standard that is very similar to CUDA in
that it is also an extension of the C language. CUDA is specific to NVIDIA GPUs, whereas
OpenCL can run on different architectures, providing portability at the expense
of potentially sub-optimal performance on any specific platform. Also, CUDA is a
more mature environment with high-performance libraries and accompanying tools
like debuggers and profilers.
Our CBS algorithm is most similar to the work by Chatterjee et al. [9]. However,
there are several differences. We use AIGs as gate level representation whereas they
support a generic library of gates. AIGs allow us to efficiently use the limited low
latency memory spaces. We use a threshold value in our first clustering algorithm,
whereas they start clusters from the primary outputs. We fix the number of blocks
in our second clustering algorithm in order to maximize parallel thread execution.
We encode variables in order to better utilize shared memory and we explicitly use
memory coalescing, whereas these are not used in [9].
There is an earlier logic simulation algorithm using GPUs by Perinkulam [25].
However, this algorithm does not provide performance benefits since they do not opti-
mize data transfer between GPU and CPU, use a different partitioning approach, and
do not use the general purpose programming language CUDA. There is also a recent
work on event-based simulation algorithm, which is also a commonly used simulation
technique in the industry, using CUDA [8].
3 Background
In this section, we present background on CUDA programming, AIGs,
and sequential cycle-based logic simulation of electronic designs.
3.1 CUDA Programming
Compute Unified Device Architecture (CUDA) is a small C library extension devel-
oped by NVIDIA to expose the computational horsepower of NVIDIA GPUs [13].
The GPU is a compute device that serves as a co-processor for the host CPU. CUDA
can be considered an instance of the widely used Single Program Multiple Data
(SPMD) parallel programming models. A CUDA program supplies a single source
code encompassing both host and device code. Execution of this code consists of
one or more phases that are executed either on the host or the device. Phases that
exhibit little parallelism are executed on the host, and phases with rich
parallelism are executed on the device. The device code is referred to as kernel
code.
The smallest execution units in CUDA are threads. Thousands of threads can execute
concurrently. The GPU has its own device memory and provides different types
of memory spaces available to threads during their execution. Each thread has a private
local memory in addition to the registers allocated to it. A group of threads forms a
CUDA block. A CUDA block can have at most 512 threads, where each thread has
a unique id. A group of blocks forms a CUDA grid. Each thread block has a shared
memory visible to all threads of the block within the lifetime of the block. Access to
shared memory is nearly as fast as register access. Threads in the same block can
synchronize using a barrier, whereas threads from different blocks cannot synchronize. Each
thread has access to the global device memory throughout the application. Access to
global memory is slow, taking around 400–600 cycles. There are also special memory
spaces such as texture, constant, and page-locked (pinned) memories. All types of
memory spaces are limited in size and should therefore be handled carefully in the
program.
Figure 1 displays the general NVIDIA CUDA architecture with N streaming
multiprocessors, each composed of M streaming processors. For an NVIDIA FX3800 GPU,
N = 24 and M = 8. Each multiprocessor can execute 768 threads in parallel, and
shared memory is limited to 16 KB per multiprocessor.
3.2 And-Inverter Graph (AIG)
And-Inverter Graph (AIG) is a directed, acyclic graph that represents a structural
implementation of the logical functionality of a design or circuit. AIGER format is an
implementation of AIGs [2]. An AIG is composed of two-input and-nodes (combi-
national elements) representing logical conjunction, single input nodes representing
memory elements (latches, sequential elements), nodes labeled with variable names
Fig. 1 CUDA hardware and memory architecture (Source: NVIDIA CUDA programming Guide [13])
representing inputs and outputs, and edges optionally containing markers indicating
logical negation. We refer to and-nodes and latches as gates. AIG is an efficient rep-
resentation for manipulation of boolean functions.
The combinational logic of an arbitrary Boolean network can be factored and
transformed into an AIG using De Morgan's rule. The following properties of
AIGs facilitate development of robust applications in synthesis, mapping, and
formal verification. Structural hashing ensures that AIGs do not contain struc-
turally identical nodes. Inverters are represented as edge attributes. As a result,
single-input nodes representing inverters and buffers do not have to be created.
This saves memory and allows applying De Morgan's rule on-the-fly, which
increases logic sharing. The AIG representation is uniform and fine-grain, result-
ing in a small, fixed amount of memory per node. The nodes are stored in
one memory array in a topological order, resulting in fast, CPU-cache-friendly
traversals.
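The literal encoding behind these edge attributes can be illustrated with a small sketch. In AIGER, a variable v appears as literal 2v and its negation as 2v + 1, so an inverter is a single bit on the edge; the helper names below are ours, not part of the AIGER library.

```python
# A minimal sketch of AIGER-style literal encoding (function names are ours).
# A variable v is encoded as literal 2*v; the negated phase is 2*v + 1, so
# an inverter is carried as an edge attribute rather than a separate node.

def make_literal(var: int, negated: bool = False) -> int:
    """Encode variable `var` (>= 1) as an AIGER literal."""
    return 2 * var + (1 if negated else 0)

def literal_value(lit: int, values: dict) -> int:
    """Evaluate a literal; literals 0 and 1 are the Boolean constants."""
    if lit < 2:
        return lit
    return values[lit // 2] ^ (lit & 1)  # XOR applies the negation bit

def eval_and(lit_a: int, lit_b: int, values: dict) -> int:
    """Evaluate a two-input and-node whose fanins are literals."""
    return literal_value(lit_a, values) & literal_value(lit_b, values)
```

For example, with variable 1 true and variable 2 false, the and-node over literal 2 (variable 1) and literal 5 (variable 2 negated) evaluates to 1, with no inverter node ever materialized.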
There has been a growing interest in AIGs as a functional representation for a
variety of tasks in Electronic Design Automation (EDA) such as logic synthesis,
technology mapping, verification, and equivalence checking [6,10,20,21,28]. Especially,
the recent emergence of efficient Boolean satisfiability (SAT) solvers that use
AIGs instead of Binary Decision Diagrams (BDDs) has made AIGs popular in
EDA. A tool called ABC [6] features an AIG package, several AIG-based synthe-
sis and equivalence-checking techniques, as well as an implementation of sequential
synthesis.
Figure 2 displays an example gate level design. In this figure, and-gates are iden-
tified as G#, where # stands for the unique gate number; and latches are identified as
L#, where # stands for the unique latch number. i# and o# stand for (primary) inputs
and (primary) outputs, respectively.
Fig. 2 An example gate-level design
3.3 Sequential Cycle-Based Simulation
Cycle-based simulation (CBS) is a commonly used logic design simulation technique.
Given a sequence of input vectors, the goal is to generate the output vectors of the
design. CBS is a form of time driven or compiled mode simulation because the evalu-
ation schedule of gates for each simulation step is determined once at the compilation
time of the simulator. Time in CBS is not physical time but rather an integer that
denotes the current cycle. CBS represents a fast alternative to event-based simulation.
In CBS, logic values of all gates are calculated at clock defined cycle boundaries,
whereas event-based simulators calculate logic values of a gate if a change occurs at
the inputs of that gate. This property of CBS may introduce some redundancy, since the
inputs of a combinational element may not change at every cycle. However, it eliminates
the costly management of event scheduling found in event-based simulation, resulting in
efficient implementations and better performance.
Algorithm 1 displays a pseudocode for a general sequential CBS algorithm. In our
case, since we use AIGs, the only combinational elements are and-gates. At each step,
both combinational and sequential elements are evaluated. The primary output val-
ues are calculated at every cycle using the current values of sequential elements and
primary inputs. The combinational elements are evaluated generating the next latch
values. The combinational elements can be simulated in any order as long as the inputs
of an and-gate are ready for the current cycle. An optimization at this step would be to
levelize the combinational elements and simulate them level by level, which we will
explain in detail in the next sections. Next, sequential elements are simulated. The next
latch values are stored in a second set of latch variables that will take the role of the
Algorithm 1 Sequential Cycle-Based Simulation Algorithm
Input: circuit, number of simulation cycles num, test-bench
Output: output values
1: for i = 0 to num do
2: execute test-bench and generate primary input values;
3: simulate all combinational elements;
4: simulate all sequential elements;
5: generate primary output values using primary input values and the current values of sequential ele-
ments;
6: end for
Table 1 Cycle-based simulation of the design in Fig. 2

Cycle   Inputs (i1–i8)   Latches (L1 L2 L3)   Outputs (o1 o2)
0       11111111         000                  00
1       11111111         110                  01
2       11111111         111                  11
present (current) latch variables in the next cycle. For example, assuming that the latch
values are initially all zero, the outputs of the design in Fig. 2 with respect to a given
input sequence for three cycles is displayed in Table 1. In cycle 0, primary outputs are
00, whereas the next latch values are 110 based on the given primary inputs. In cycle
1, the next latch values of cycle 0 become the current latch values of cycle 1, that is,
current latch values are 110, hence the primary outputs become 01. The next latch
values become 111 based on the given primary inputs. In cycle 2, the next latch values
of cycle 1 become the current latch values of cycle 2, that is, the current latch values are
111, hence the primary outputs become 11.
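The per-cycle flow of Algorithm 1, including the double-buffered latch update illustrated in Table 1, can be sketched in Python. This is an illustrative toy model under our own netlist conventions, not the paper's implementation.

```python
# A toy model of Algorithm 1 (illustrative; the netlist format is ours).
# Latches are double-buffered: next-state values are collected separately
# and only become current at the end of the cycle.

def simulate_cbs(and_gates, latches, outputs, input_seq):
    """and_gates: (gate, fanin1, fanin2) triples in levelized (topological)
    order; latches: (latch, next_state_source) pairs; outputs: signal names;
    input_seq: one {input: value} dict per simulated cycle."""
    state = {latch: 0 for latch, _ in latches}     # latches start at 0
    results = []
    for vec in input_seq:                          # one iteration per cycle
        values = dict(vec)                         # primary input values
        values.update(state)                       # current latch values
        for g, a, b in and_gates:                  # combinational elements
            values[g] = values[a] & values[b]
        state = {latch: values[src] for latch, src in latches}  # next latches
        results.append([values[o] for o in outputs])
    return results
```

For a single and-gate feeding a latch, the latch output lags the gate value by one cycle, which reproduces the current/next latch behavior walked through above.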
4 Parallel Cycle-Based Simulation Using GPUs
Algorithm 2 displays the pseudocode for our parallel CBS algorithm. Our algorithm
has compilation and simulation phases. In the compilation phase, we extract the combinational
elements of a design, since we can simulate combinational and sequential elements
separately as described in Algorithm 1. Next, we levelize the extracted combinational
design in order to simulate the elements at a level in parallel. Then, we partition the
levelized design into clusters, which are sets of gates that we will define later. We then
balance these clusters in order to maximize CUDA thread usage and map these bal-
anced clusters to CUDA blocks. We developed two different clustering algorithms
that we will explain below. After obtaining CUDA blocks, we are ready for parallel
simulation. In the parallel simulation phase, we execute the test-bench to generate the
primary input values at each cycle, then transfer these values to the GPU. Then, we
simulate in parallel CUDA blocks for combinational and sequential elements. Below
we provide the details of our algorithm.
Algorithm 2 Parallel Cycle-Based Simulation Algorithm
Input: circuit, number of simulation cycles num, test-bench
Output: output values
// compilation phase
1: extract combinational elements;
2: levelize combinational elements;
3: cluster combinational and sequential elements;
4: balance clusters;
// parallel simulation phase
5: for i = 0 to num do
6: execute test-bench on the host and generate primary input values;
7: transfer inputs from the host to the device;
8: simulate all clusters on the device and generate primary output values;
9: transfer outputs from the device to the host;
10: end for
Fig. 3 Levelization of the circuit in Fig. 2. The upper levels are dependent on the lower levels, whereas
the gates in the same level are independent of each other
4.1 Levelization
In CBS, we can simulate combinational and sequential elements separately. A speed-
up technique for the simulation of combinational elements is to simulate them level
by level, where the level of a gate is defined as the largest distance from the primary
inputs and the sequential elements of the design. For example, the level of gate G7
in Fig. 2 is 3. The level of a gate also describes its evaluation order with respect to
other gates, that is, a lower level gate is simulated before a higher level gate. Hence, a
level encodes the dependency relation between gates in the design where the input of
a gate can be the output of another gate from a lower level. This levelization enables
parallelization of CBS, since simulations of gates in the same level are independent
of each other and can be done in parallel.
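The levelization step can be sketched as a longest-path computation; this is an illustrative Python sketch, and the netlist encoding is our own assumption.

```python
# Sketch of levelization (illustrative encoding): the level of a gate is the
# longest distance from the primary inputs and latch outputs, so gates in the
# same level have no dependencies among themselves.

def levelize(gates):
    """gates maps gate -> (fanin1, fanin2); any fanin that is not itself a
    key is a primary input or latch output and sits at level 0."""
    levels = {}

    def level_of(sig):
        if sig not in gates:                 # primary input or latch output
            return 0
        if sig not in levels:                # memoized longest-path recursion
            a, b = gates[sig]
            levels[sig] = 1 + max(level_of(a), level_of(b))
        return levels[sig]

    for g in gates:
        level_of(g)
    return levels
```

A gate fed by two level-0 signals lands in level 1, and so on; gates sharing a level can then be assigned to concurrent threads.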
Figure 3 displays the design in Fig. 2 after levelization step. Note that there are no
latches displayed in the figure since levelization is done on the combinational extrac-
tion of the design. We also assume that all latch values and primary input values are
given initially. Hence, latches and inputs can be considered to be in level 0. In the
figure, we denote the present and next latch values by PL# and NL#, respectively.
4.2 Clustering and Balancing of Gates
Once the design is levelized, we can partition both the combinational and sequential
elements into sets. We call each such set a cluster. Our goal is to generate clusters
where each cluster can be simulated independently of other clusters. Then we can
simulate each gate in a level of a cluster by a separate thread, since simulations of gates
in the same level are independent of each other.
We used several heuristics for clustering while exploiting the GPU architectural
properties. In particular, we want to maximize the usage of available CUDA threads.
Furthermore, CUDA allows thread synchronization, that is, threads within a block
can be synchronized; however, threads in different blocks cannot be synchronized.
An attempt to distribute all gates such that each level is simulated by a separate block
would require the synchronization of different blocks since each level has to be com-
pletely simulated before the next level. Hence, this approach is not possible.
4.2.1 First Clustering Algorithm
Figure 4 is a conceptual view of the result of our first clustering algorithm. Each
triangle represents a cluster. We use a threshold value, explained below, in order to
control the number of CUDA blocks. The intersections of clusters represent the design
elements that are duplicated. This duplication is necessary to ensure the simulation
independence between the clusters.
Algorithm 3 displays the pseudocode of our first clustering algorithm. In this algo-
rithm we assign each cluster to a CUDA block. First, starting from the gates at the
highest level, we search for the level where the number of gates in that level is greater
Fig. 4 Visualizing clustering Algorithm 1. The threshold level is the level where the number of gates is at least the number specified by the clustering threshold. The top corner of each triangle represents a gate in the threshold level
Algorithm 3 Pseudocode for First Clustering Algorithm
Input: levelized circuit, threshold value threshold
Output: cluster of gates
1: level = the highest level of gates in circuit;
2: repeat
3: if number of gates in level ≥ threshold then
4: add the gates from the highest level down to, but not including, the current level to cluster H;
5: break; // found threshold level
6: end if
7: level = level − 1;
8: until level = 0;
9: if level = 0 then
10: update threshold value and rerun above steps;
11: end if
12: for i = 1 to number of gates in level do
13: add the cone of logic for gate Gi to cluster Gi;
14: end for
15: cluster Rem = remaining combinational gates;
16: cluster Seq = sequential gates;
than or equal to a given threshold value. We call this level the threshold level. This
threshold value is crucial in obtaining a good partitioning of the gates. We add all the
gates from the highest level up to, but not including, the threshold level into a single
cluster (cluster H). Once we reach the threshold level, we determine the cone of logic
of every gate in that level, where the cone of logic of a gate is the set of elements encoun-
tered during a backtrace from a gate to inputs or sequential elements. For example,
the cone of logic of gate G5 in Fig. 3 includes G5, G4, G3, G2. Each cone of logic is
a cluster for our purposes, named cluster Gi for each gate Gi in threshold level. Note
that since gates can be in the cone of logic of more than one cluster they may need
to be duplicated for each cluster. This duplication allows us to simulate each cluster
independent of other clusters. After these steps, there may still be some gates that are
not part of any cluster. In particular, there may be combinational gates that are not in
the cone of logic of any gate at the threshold level. For example, gate G8 is one such
gate. Hence, we add such gates to cluster Rem. Also, we add all sequential elements
to a separate cluster Seq. At the end of the clustering algorithm we have the following
clusters, cluster H, cluster Gi for each gate Gi in threshold level, cluster Rem, and
cluster Seq.
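A simplified sketch of Algorithm 3 follows; the data layout is our own, and we use the "at least threshold" condition stated in the text. Sequential elements are omitted here, since cluster Seq is simply the set of all latches.

```python
# Illustrative sketch of Algorithm 3 (names follow the paper's cluster H,
# cluster Gi, cluster Rem convention; the encoding is our assumption).

def cone_of_logic(gate, gates):
    """Backward traversal from `gate` collecting every gate that feeds it;
    traversal stops at primary inputs and latch outputs (not in `gates`)."""
    cone, stack = set(), [gate]
    while stack:
        g = stack.pop()
        if g in gates and g not in cone:
            cone.add(g)
            stack.extend(gates[g])
    return cone

def first_clustering(gates, levels, threshold):
    """gates: gate -> (fanin1, fanin2); levels: gate -> level number."""
    top = max(levels.values())
    # search downward for the first level holding at least `threshold` gates
    t_level = next((lv for lv in range(top, 0, -1)
                    if sum(1 for g in gates if levels[g] == lv) >= threshold), 0)
    cluster_H = {g for g in gates if levels[g] > t_level}
    cones = {g: cone_of_logic(g, gates) for g in gates if levels[g] == t_level}
    # gates covered by no cone and not above the threshold level
    cluster_Rem = set(gates) - cluster_H.union(*cones.values())
    return cluster_H, cones, cluster_Rem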
Although most clusters can be simulated independently of each other, there is a
dependency among some clusters. In particular, cluster H can be simulated only after all
cluster Gi have been simulated. This is because all cluster Gi are in the cone of logic
of the gates in cluster H; hence cluster H depends on all cluster Gi. Furthermore,
cluster Seq for sequential elements can be simulated only after all combinational
clusters have been simulated since the next latch values can only be calculated after
all combinational clusters have been simulated.
In Fig. 5, we apply the first clustering algorithm to the design in Fig. 3. We assume that the
threshold value is 2, in which case the top level (level 3) automatically becomes the
threshold level. Clusters 1 and 2 are obtained by finding the cones of logic of the gates in
level 3, whereas cluster 3 is a cluster of the remaining combinational elements. Similarly,
Fig. 5 Clustering Algorithm 1 on the levelized design in Fig. 3. Distinct clusters are independent of each
other. However, there is a dependency between levels within a cluster
Fig. 6 Visualizing clustering
Algorithm 2
there is a cluster 4 that includes all latches. Finally each cluster is assigned to a CUDA
block.
It is clear that we can control the number of CUDA blocks using the threshold
value with this clustering algorithm. However, choosing the threshold is a manual effort,
and some designs may have far fewer or far more gates per level than the threshold
value. Furthermore, clustering the remaining gates requires the extra effort of finding
the gates that are not part of any cluster. In such cases, the first clustering approach may
not be useful. Hence, we developed an optimized approach that allows us to better
control the number of CUDA blocks and the shared memory, and ultimately obtain
better speedups.
4.2.2 Second Clustering Algorithm
In this algorithm, we build clusters starting from the primary outputs and latches
instead of from the user-given threshold level as in the first clustering algorithm
(Fig. 6). As a result, there are no remaining gates, as was the case above.
Also, since designs can have varying numbers of outputs and latches, the number of
clusters and their sizes can vary. This necessitates further steps to optimize
thread usage. For this purpose, we develop new steps for merging and balancing
clusters. Instead of assigning every cluster to a CUDA block as in the previous
algorithm, we reshape clusters into an optimized form and then assign them
to CUDA blocks.
Algorithm 4 shows our second clustering algorithm. We next describe the steps in
the algorithm in greater detail.
– Step 1. Obtain clusters In this algorithm we do not assign each cluster to a CUDA
block as was done in the previous clustering algorithm. Rather, our goal is to build
many clusters starting from the primary outputs and latches. We add the cone of logic
of each primary output and latch to form a cluster, named cluster OL. However,
when tracing back from a gate, we stop at latches. At the end of this step, the
number of clusters is the sum of the number of primary outputs and the number of
latches. Figure 7 displays the application of this step to the levelized design in Fig. 3.
– Step 2. Merge clusters After the above step is completed, there may be clusters
with an imbalance in distribution of gates to clusters. Hence, we need to merge
Algorithm 4 Pseudocode for Second Clustering Algorithm
Input: levelized circuit, number of CUDA blocks numBlocks
Output: merged and balanced clusters of gates
// Step 1. obtain clusters
1: for each primary output or latch OL do
2: add the cone of logic for OL to cluster OL;
3: end for
// Step 2. merge clusters
4: gate_limit = total number of gates in all cluster OL / number of blocks;
5: compute overlaps of all possible pairs of clusters;
6: for i = 1 to numBlocks do
7: let clusterCL be an arbitrary unmarked cluster and mergedCL[i] = ∅;
8: mergedCL[i] = mergedCL[i] ∪ clusterCL;
9: mark clusterCL;
10: repeat
11: clusterCLi = an unmarked cluster with maximum overlap with clusterCL;
12: mergedCL[i] = mergedCL[i] ∪ clusterCLi;
13: mark clusterCLi;
14: until number of gates in mergedCL[i] reaches gate_limit;
15: delete duplicate gates in each mergedCL[i];
16: end for
// Step 3. balance merged clusters
17: level_width = average width of all merged clusters;
18: for each merged cluster mergedCL[i] do
19: move gates in mergedCL[i] such that the cluster has a level_width wide rectangular shape;
20: end for
Fig. 7 Clustering Algorithm 2 before merging and balancing steps
clusters in order to reduce this imbalance in the sizes of the clusters. This step will
also allow us to optimize the usage of shared memory as we describe below. We
respect the following conditions for merging clusters.
1. The number of merged clusters is a given preset number of CUDA blocks.
2. The size of a merged cluster may not exceed a certain limit, called gate_limit.
The gate_limit is obtained by dividing the total number of gates in all clusters
(including duplicate gates) by the number of CUDA blocks.
3. We merge clusters with maximum number of common gates (overlap).
We choose the preset number of CUDA blocks according to the properties of
the circuit and the CUDA device at hand. In our case, we chose to generate 48
CUDA blocks for large circuits. This is because each multiprocessor can execute
768 threads in parallel and, given that a CUDA block cannot have more than 512
threads, we allocate two blocks of 384 threads to each multiprocessor for maximum
thread utilization. For a CUDA architecture with 24 multiprocessors, this
results in 48 blocks in total. This preset number of blocks is fine for most industrial
designs, as can be seen from the experiments. However, for designs with fewer
gates, fewer blocks can be allocated, such as 12 or 24.
Next, we use a simple heuristic to merge clusters. We compute the overlaps of
all possible pairs of clusters in a two-dimensional matrix. Then, for each CUDA
block, we start with an arbitrary cluster and add the cluster with the maximum
overlap to it. This addition procedure continues until we reach a given gate_limit
for each merged cluster.
After merging the clusters based on the above heuristic, there may be duplicate
gates in merged clusters, so we delete these duplications. Figure 8 displays the
effect of the merging step.
– Step 3. Balance merged clusters At the end of the above step, we obtain merged
clusters that are of triangular shape. In a triangular cluster, more threads
become idle as the level increases during execution, since there are more gates at
the lower levels. Our goal is to utilize the available threads in the most efficient way
such that all threads have approximately the same computation burden at
each level. A uniform shape such as a rectangle allows us to have the same number
of gates at each level. Figure 8 displays the effect of the balancing step.
Int J Parallel Prog
Fig. 8 Merging and balancing clusters results in using all available threads efficiently at all levels
In order to balance merged clusters, we compute the average width of all merged
clusters, called level_width. This width gives us the number of threads that
should be allocated to each CUDA block. We make sure that it is a multiple of
16 since, for coalesced memory access, a half warp (16 threads) accesses memory
at the same time. If this width is greater than 384, which is the maximum number
of threads that we allow for a block, we set the width to 384. Since level_width is
determined as an average, there may be levels within the blocks that have more gates
than this width. We reshape the merged clusters such that no level has more
gates than this width value. When a level has more gates than the width value,
the extra gates are moved one level up. To preserve the dependencies between
gates, the levels of all gates that depend on the moved gates are also incremented.
At the end of this iterative procedure, we have rectangular merged clusters with
the same width but with differing heights. Note that a given level may have fewer
gates than the width. For such levels, we add dummy gates that have no
operational burden.
Finally, each merged and balanced cluster is assigned to a CUDA block.
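The width computation and reshaping step can be sketched as follows. This is a simplified illustration with our own names; rounding the width up to a multiple of 16 is our assumption, and the sketch omits incrementing the levels of dependent gates, which the full algorithm performs to preserve gate dependencies:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Round the average level width up to a multiple of 16 (a half warp),
// capped at 384 threads per block.
int levelWidth(int totalGates, int totalLevels) {
    int avg = (totalGates + totalLevels - 1) / totalLevels; // ceiling average
    int w = ((avg + 15) / 16) * 16;                         // multiple of 16
    return std::min(w, 384);
}

// Reshape one merged cluster (a vector of levels, each a vector of gate
// ids) so that no level holds more than `width` gates: extra gates move
// one level up; short levels are padded with dummy gates (id -1).
void reshape(std::vector<std::vector<int>>& levels, int width) {
    for (std::size_t l = 0; l < levels.size(); ++l) {
        if (l + 1 == levels.size() && (int)levels[l].size() > width)
            levels.push_back({});                  // grow the cluster upward
        while ((int)levels[l].size() > width) {    // push overflow up a level
            levels[l + 1].push_back(levels[l].back());
            levels[l].pop_back();
        }
    }
    for (auto& lv : levels)                        // pad with dummy gates
        while ((int)lv.size() < width) lv.push_back(-1);
}
```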
4.3 Parallel Simulation Phase
The parallel simulation phase consists of several steps, as can be seen in Algorithm 2.
During each cycle, input values are transferred from host to device. The inputs are
essentially generated by executing the test-bench on the host. In our experiments,
Algorithm 5 Parallel Block Simulation Kernel Pseudocode
__global__ void simulationKernel(...) {
  __shared__ unsigned char vars[8000]; // keep encoded intermediate values
  numberOfThreads = blockDim.x;
  index = threadIdx.x; start = blockIdx.x * numberOfThreads;
  vars = coalesced read of encoded values from global device memory;
  for i = 0; i < level; i++ do
    // coalesced read of gate variables from global device memory
    i1 = firstInput[start + index];
    i2 = secondInput[start + index];
    o = output[start + index];
    decode variable i1 as v1, i2 as v2;
    val = v1 ∧ v2;
    write encoded val to vars array at position o;
    index += numberOfThreads;
    __syncthreads();
  end for
  coalesced write vars to global device memory;
}
we mainly used random test-benches, hence the input generation could be optimized
as described in the next section. Once the inputs are ready, all CUDA blocks can
be executed by launching a kernel function. The execution of a block with combina-
tional elements proceeds level by level, and after each level is simulated the threads
of the block synchronize using a barrier. This process continues until all levels are
completed. We used coalesced memory access for reading and writing gate variables
between the global memory and the CUDA processors. We increased the number of
gates that can be efficiently simulated in a block by storing the intermediate gate
output values in shared memory in an encoded fashion, which we describe in the next
section. At the beginning of a cycle, blocks fetch relevant design data (gate variables)
and input values from device memory in a coalesced manner. At the end of the cycle,
output and latch values are transferred from device to host. In particular, we stored
the design structure, primary inputs, primary outputs and latches in the device mem-
ory, and frequently accessed data structures, such as the intermediate and-gate values
of a block, in the shared memory. Our simulation kernel function is shown in
Algorithm 5. We next describe our CUDA optimizations in more detail.
5 CUDA Optimizations
We applied several optimization methods to increase the performance of our algorithms.
To develop an effective CUDA program, one needs knowledge of the GPU
architecture, such as shared memory, memory coalescing and memory transfer over-
head. Communication overhead is a major cost, so we exploited CUDA’s faster shared,
register and pinned memories in our implementation.
In CUDA, one can allocate host memory dynamically by using the
cudaMallocHost function instead of the ordinary C malloc function. This uses
pinned (page-locked) memory. Allocating memory with cudaMallocHost
increases the performance of cudaMemcpy operations. We used these functions to
allocate the data structures for our designs.
One of the most important optimizations for CUDA programming is to make effec-
tive use of shared memory. Fetching a variable from device memory requires 400–600
cycles, whereas fetching it from the shared memory requires only a few cycles. During
the execution of a CUDA block, we copied the most frequently accessed data struc-
tures, such as the intermediate and-gate output values obtained after simulating one
level of a block, to the shared memory. Using the global memory for this purpose
would result in a huge time penalty, hence we write these intermediate values to the
shared memory instead.
For designs with random input generation (random test-benches), we moved the
test-bench code onto the GPU and enabled parallel input generation. This removes the
need for communication between the CPU and GPU. For designs without random input
generation, we used another optimization where the inputs and outputs of the design
are not transferred between GPU and CPU during every cycle;
rather, these data transfers are done in bulk, such as after every 10,000 cycles. Although
this optimization increases the data size to be copied between the host and the device,
our experiments showed that decreasing the number of cudaMemcpy operations
matters more than the increase in data size.
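A back-of-the-envelope model illustrates why batching wins; the latency and bandwidth constants below are assumptions for illustration, not measurements from our setup:

```cpp
#include <cassert>

// Each cudaMemcpy pays a fixed per-call latency on top of the
// bandwidth-limited transfer time, so fewer, larger copies are cheaper
// for the same total data volume. Constants are assumed, not measured.
double transferTimeUs(double totalBytes, double numTransfers,
                      double latencyUsPerCall = 10.0,
                      double bytesPerUs = 6000.0) {
    return numTransfers * latencyUsPerCall + totalBytes / bytesPerUs;
}
```

For the same 100 MB of data, one copy per cycle over 100,000 cycles is dominated by per-call latency, while a copy every 10,000 cycles is not.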
Also, since we use the AIGER format with only a single type of combinational gate
(the and-gate), we do not need to store truth tables for other, more complex gate types
in the shared memory, which would waste valuable space.
In addition to the above optimizations, we incorporated the following optimizations
with our second clustering algorithm.
– We developed a new clustering and balancing algorithm that makes maximum use
of all available threads such that every thread is doing the same task without diver-
gent paths. We accomplished this by first merging the clusters and then balancing
them to reshape into rectangular uniform structures as described in Sect. 4.2.2.
– We made heavy use of coalesced memory access for reading and writing variables
between the global memory and the CUDA processors.
– We increased the number of gates that can be efficiently simulated by encoding
the intermediate gate outputs and storing these in the shared memory.
We now describe these optimizations in more detail.
During parallel simulation, we need to access the inputs and outputs of the gates.
To access this information inside the kernel in a time-efficient manner, we put gate
information in device global memory and access it in a coalesced
manner. When accessing device global memory, peak performance occurs
when all threads in a half warp (16 threads) access contiguous memory locations. For
coalesced access, we have to adjust the data structures that hold gate information.
Therefore, we allocate three arrays of gate indices: one for the first input, one for the
second input and one for the output. These arrays are laid out such that all gates in a
level are adjacent to each other, so that when the threads access device global memory
for gate information the accesses are coalesced. Figure 9 shows this
coalesced memory access.
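The layout can be sketched as follows (our own names; gates are sorted by level so that consecutive threads read consecutive array entries):

```cpp
#include <algorithm>
#include <vector>

struct Gate { int level, in1, in2, out; };

struct GateArrays {
    std::vector<int> firstInput, secondInput, output;
};

// Structure-of-arrays layout: sort gates by level so that thread t of a
// level reads entry base + t, i.e. consecutive threads of a half warp
// touch consecutive memory locations (coalesced access).
GateArrays layoutByLevel(std::vector<Gate> gates) {
    std::stable_sort(gates.begin(), gates.end(),
                     [](const Gate& a, const Gate& b) { return a.level < b.level; });
    GateArrays g;
    for (const Gate& gt : gates) {
        g.firstInput.push_back(gt.in1);
        g.secondInput.push_back(gt.in2);
        g.output.push_back(gt.out);
    }
    return g;
}
```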
Fig. 9 Coalesced memory access
Above we described how we used shared memory for fast access. However, shared
memory size is limited. Each multiprocessor has 16 KB of shared memory; when
we use two blocks per multiprocessor, each block gets 8 KB of shared memory.
If we use one char variable for each gate output value, we can have at most 8,000
variables for a block (8 KB max shared memory for a block), and therefore the total
number of variables in a design can be at most 48 blocks * 8,000 variables = 384,000
variables. This number may be too small for large designs. Therefore, we developed an
efficient representation that keeps 8 gate outputs in one unsigned char variable
(8 bits), since a gate output can only be 0 or 1. At the start of each
block simulation, we bring these encoded variables from device global memory in
a coalesced manner as well. With this representation, we can have 64,000 variables
for a block and in total 48 blocks * 64,000 variables = 3,072,000 variables, where the
number of variables in a design equals the sum of its and-gates, latches, primary inputs
and outputs. Assuming 30% duplication of gates among blocks, we can then simulate
designs with up to approximately 2 million gates. The encoded variables are likewise
read and written in a coalesced manner.
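The encoding can be sketched as follows (a minimal illustration with our own helper names):

```cpp
#include <cstdint>
#include <vector>

// vars[i / 8] holds the outputs of gates i, i+1, ..., packed 8 per byte,
// since a gate output can only be 0 or 1.
inline void setOutput(std::vector<std::uint8_t>& vars, int i, bool v) {
    if (v) vars[i / 8] |=  (std::uint8_t)(1u << (i % 8));
    else   vars[i / 8] &= (std::uint8_t)~(1u << (i % 8));
}

inline bool getOutput(const std::vector<std::uint8_t>& vars, int i) {
    return (vars[i / 8] >> (i % 8)) & 1u;
}
```

With this packing, the 8,000 bytes of a block's shared-memory budget hold 64,000 gate outputs.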
6 Experimental Results
We validated the effectiveness of our two GPU-based parallel CBS algorithms on
several test cases. We used test cases from IWLS [17], OpenCores [24] and AIGER
[2]. These test cases include ldpc-encoder (low density parity check encoder), des-perf
(triple DES optimized for performance), wb-conmax (wishbone conmax IP core), aes-
core (AES Encryption/Decryption IP core), pci-bridge32, ethernet, vga-lcd (wishbone
compliant enhanced vga/lcd controller), and several other designs. These test cases
either had their own test benches or we generated random test benches for them.
Table 2 displays the list of test cases that we used in our experiments and their design
characteristics including the number of levels and the number of CUDA blocks used
for their simulation for both clustering algorithms. For some test cases, we generated
the AIGER format of the designs using the ABC tool [1].
We performed our experiments on an Intel Xeon CPU with two 2.27 GHz
processors, 32 GB of memory and a CUDA-enabled NVIDIA Quadro FX3800 GPU
with 1 GB device memory and 24 streaming multiprocessors, each with 8 streaming
processors.
Table 2 Experimental test cases
Design Vars Inputs Latches Outputs And-Gates Levels Blocks1 Blocks2
ldpc-encoder 220,684 1,723 0 2,048 218,961 19 288 48
vga-lcd 143,879 89 17,079 109 126,711 23 145 48
des-perf 100,500 17,850 0 9,038 82,650 19 373 48
ethernet 80,326 98 10,544 115 69,684 31 364 48
wb-conmax 49,753 1,130 770 1,416 47,853 26 261 24
tv80 7,665 14 359 32 7,292 39 34 24
aes-core 22,841 1,319 0 668 21,522 25 126 24
ac97-ctrl 12,624 84 2,199 48 10,341 8 495 24
pci-bridge32 26,305 162 3,359 207 22,784 29 146 24
system-cdes 2,813 132 190 65 2,491 22 48 24
pci-spoci-ctrl 880 25 60 13 795 14 29 24
sasc 740 16 117 12 607 7 49 24
Table 3 Experimental results for 100,000 cycles
Design SEQ(sec) PAR1(sec) PAR2(sec) Speedup1 Speedup2
ldpc-encoder 382.21 74.79 24.32 5.11 15.72
vga-lcd 223.15 45.91 25.45 4.86 8.77
des-perf 180.62 65.68 8.53 2.75 21.17
ethernet 155.51 52.89 14.96 2.94 10.40
wb-conmax 94.81 33.96 10.8 2.79 8.78
tv80 92.59 23.31 10.63 3.97 8.71
aes-core 83.64 35.63 7.97 2.35 10.49
ac97-ctrl 58.66 18.87 4.74 3.11 12.38
pci-bridge32 50.12 21.7 9.39 2.31 5.34
system-cdes 30.44 11.55 6.92 2.64 4.40
pci-spoci-ctrl 7.04 10.44 5.13 0.67 1.37
sasc 6.26 6.34 3.75 0.99 1.6
In Table 3, we demonstrate our experimental results. Column SEQ denotes the
results for sequential simulation using the default simulator that is available with AI-
GER [2]. Columns PAR1 and PAR2 denote the results for our parallel simulation
algorithms using the first and the second clustering algorithms, respectively. Col-
umns Speedup1 and Speedup2 denote the speedups of the parallel algorithms over the
sequential algorithm. The times do not include the compilation phase in either case, but
include the time required to transfer data between the GPU and the CPU. Figure 10
graphically displays the speedups.
We simulated the designs for different numbers of cycles: 100 K, 500 K and 1 M.
Figure 11 shows that the speedups are similar across cycle counts for
Fig. 10 Parallel simulation speedups over sequential simulation for 100,000 cycles
Fig. 11 Parallel simulation speedups versus number of cycles (100 K, 500 K, 1 M) for ldpc-encoder and vga-lcd, PAR1 and PAR2
ldpc-encoder and vga-lcd. We obtained similar results for the other test cases. We
present results for 100 K cycles in the rest of the experiments.
Different speedups are expected for logic simulation due to the fact that logic sim-
ulation is heavily influenced by the circuit structure. We obtained speedups of up to 5×
for PAR1 and up to 21× for PAR2. Both PAR1 and PAR2 outperform SEQ except for
two cases where PAR1 performs worse than SEQ. In all test cases, PAR2 significantly
outperforms both SEQ and PAR1.
In general, arbitrary circuit structures make it difficult to analyze designs and the
corresponding speedups. However, Fig. 12 shows that both PAR1 and PAR2 speedups
scale with the design size (the number of variables). Hence, further studies
with larger designs hold promise. In fact, for smaller designs such as pci-spoci-ctrl
and sasc, PAR1 performs worse than the sequential algorithm. Figure 13 shows that
Fig. 12 Parallel simulation speedups versus design size
Fig. 13 Parallel simulation speedup using second clustering versus number of CUDA blocks
the speedup is better when the number of blocks is increased from 24 to 48 for PAR2.
This is related to the fact that a higher number of blocks is chosen for larger designs
in PAR2. We also observe from Fig. 10 that three of the top four speedups for PAR2
are obtained for designs with no latches, such as ldpc-encoder, des-perf, and aes-core.
This is expected since there is no need to start another GPU kernel for latches after
and-gate simulation.
For PAR1, we experimented with the threshold values. We observed that the thresh-
old value is dependent on the circuit structure as well. For example, for most test cases,
including sasc, the execution time changes almost linearly with the threshold value.
We observed that the number of gates in the designs can be higher with AIGs than
with other gate-level formats. Although our experiments showed that AIGs are effective
in reducing the shared memory size, AIGs may result in a higher number of levels, which
may lead to execution overheads. Further synthesis steps in the ABC tool can help
reduce the number of levels.
Our parallel algorithm's compilation phase may take more time than the sequential
algorithm's. Once the compilation is done, its result is saved so that simulations with
different test benches, parameters or numbers of cycles can be run without incurring
the compilation cost again. In practice, once the design matures, simulations are run
until the design product is taped out.
7 Conclusions
We introduced two novel parallel cycle-based simulation algorithms for digital
designs using GPUs. This work further advances the state of the art in logic
simulation. Our algorithms result in faster, more efficient parallel logic simula-
tors that run on commodity graphics cards, enabling design verification with a
significant reduction in the overall design cycle. Our approach leverages
the GPU architecture by exploiting low-latency memory spaces and by reducing
host-device communication during the compilation and simulation phases of a
cycle-based simulation algorithm. Our approach is unique in that we use the AIG
representation for electronic designs, which proves to be very efficient for Boolean
functions.
We obtained speedups of up to 5× and 21× with our first and second algorithms,
respectively, for various benchmarks. Our first clustering algorithm uses a given thresh-
old to generate independent CUDA blocks for parallel simulation, whereas our second
clustering algorithm automatically generates CUDA blocks while using sophisticated
merging and balancing steps in order to maximize the number of parallel threads that
can be executed on the GPU. Our second algorithm also improves on the first through
better utilization of shared memory and memory coalescing, as well as by supporting
larger design sizes, which is crucial for practical applications. Since logic simulation is
an activity that continues nonstop until the design is productized, such speedups have
a big impact on the verification and the overall design cycle. Our results show that
we obtain better speedups with larger circuits. Hence, further studies with industrial
designs hold promise.
As future work, we want to investigate other gate-level design formats and develop
event-based simulation algorithms, which are also commonly used in industry. We
also want to experiment with complex industrial test cases that may not fit on a single
GPU device.
Acknowledgments We would like to thank Alan Mishchenko from the University of California, Berkeley
for suggesting and providing AIG benchmarks. This research was supported by a Marie Curie European
Reintegration Grant within the 7th European Community Framework Programme and BU Research Fund
5483.
References
1. ABC web site. http://www.eecs.berkeley.edu/~alanmi/abc/
2. AIGER Format web site. http://fmv.jku.at/aiger/
3. Alpert, C.J., Kahng, A.B.: Recent directions in netlist partitioning: a survey. Int. VLSI J. 19(1–2), 1–81
(1995)
4. Bailey, M.L., Briner, J.V. Jr., Chamberlain, R.D.: Parallel logic simulation of VLSI systems. ACM
Comput. Surv. 26(3), 255–294 (1994)
5. Bergeron, J.: Writing Testbenches—Functional Verification of HDL Models. Springer, Berlin (2003)
6. Brayton, R., Mishchenko, A.: ABC: an academic industrial-strength verification tool. In: Proceedings
of the International Conference on Computer-Aided Verification (CAV) (2010)
7. Catanzaro, B., Keutzer, K., Su, B.Y.: Parallelizing CAD: a timely research agenda for EDA. In: Pro-
ceedings of the Design Automation Conference (DAC), pp. 12–17. ACM (2008)
8. Chatterjee, D., DeOrio, A., Bertacco, V.: Event-driven gate-level simulation with GP-GPUs. In: Pro-
ceedings of the Design Automation Conference (DAC), pp. 557–562 (2009)
9. Chatterjee, D., DeOrio, A., Bertacco, V.: GCS: High-performance gate-Level simulation with GPGPUs.
In: Proceedings of the Conference on Design Automation and Test in Europe (DATE), pp. 1332–1337
(2009)
10. Chatterjee, S., Mishchenko, A., Brayton, R.K., Wang, X., Kam, T.: Reducing structural bias in tech-
nology mapping. IEEE TCAD 25(12), 2894–2903 (2006)
11. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Skadron, K.: A performance study of general-
purpose applications on graphics processors using CUDA. J. Parallel Distrib. Comput. 68(10), 1370–
1380 (2008)
12. Croix, J.F., Khatri, S.P.: Introduction to GPU programming for EDA. In: Proceedings of the Interna-
tional Conference on Computer Aided Design (ICCAD), pp. 276–280. ACM (2009)
13. NVIDIA CUDA web site. http://www.nvidia.com/CUDA
14. Deng, Y.S., Wang, B.D., Mu, S.: Taming irregular EDA applications on GPUs. In: Proceedings of
the International Conference on Computer Aided Design (ICCAD), pp. 539–546. ACM (2009)
15. Hering, K.: A Parallel LCC Simulation System. In: Proceedings of the International Parallel and
Distributed Processing Symposium (IPDPS) (2002)
16. Hering, K., Reilein, R., Trautmann, S.: Cone clustering principles for parallel logic simulation. In:
International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication
Systems, pp. 93–100 (2002)
17. IWLS 2005 Benchmarks. http://www.iwls.org/iwls2005/benchmarks.html
18. Johannes, F.M.: Partitioning of VLSI circuits and systems. In: Proceedings of the Design Automation
Conference (DAC), pp. 83–87. ACM (1996)
19. Meister, G.: A survey on parallel logic simulation. Technical report, Department of Computer Engi-
neering, University of Saarland, Saarland (1993)
20. Mishchenko, A., Brayton, R., Jang, S.: Global delay optimization using structural choices. In: FPGA
’10: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable
Gate Arrays, pp. 181–184. ACM (2010)
21. Mishchenko, A., Chatterjee, S., Brayton, R.: DAG-aware AIG rewriting: a fresh look at combinational
logic synthesis. In: Proceedings of the Design Automation Conference (DAC) (2006)
22. Nguyen, H.: GPU Gems 3. Addison-Wesley Professional, Reading (2007)
23. OpenCL web site. http://www.khronos.org/opencl/
24. Opencores Benchmarks. http://www.opencores.org
25. Perinkulam, A.: Logic Simulation Using Graphics Processors. Master’s thesis, University of Massa-
chusetts Amherst (2007)
26. Pfister, G.: The Yorktown simulation engine: introduction. In: Proceedings of the Design Automation
Conference (DAC) (1982)
27. Sen, A., Aksanli, B., Bozkurt, M., Mert, M.: Parallel cycle based logic simulation using graphics pro-
cessing units. In: Proceedings of the International Symposium on Parallel and Distributed Computing
(ISPDC) (2010)
28. Zhu, Q., Kitchen, N., Kuehlmann, A., Sangiovanni-Vincentelli, A.: SAT sweeping with local observ-
ability don’t-cares. In: Proceedings of the Design Automation Conference (DAC) (2006)