The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.
Today's fast paced product market has shorter lifecycles and tighter budgetary concerns. Tolerance analysis software provides an ideal solution to reduce the number of crucial steps needed to optimize a product at the design step itself. 3DCS Variation Analyst is the world's most used tolerance analysis software that is fully integrated into NX/ CATIA V5/ Creo and CAD Neutral Multi-CAD. 3DCS Variation Analyst is designed to use a consistent format and set of mathematical formulae that create reliable results, enabling engineers to gain a complete insight into their design. The software empowers design engineers to control variation and optimize their designs to account for inherent process and part variation, which in turn reduces non-conformance, scrap, rework and other associated costs.
3DCS Variation Analyst
Used by the world’s leading manufacturing OEM’s to reduce the cost of quality, 3DCS Variation Analyst comes in two flavours:
1) 3DCS Variation Analyst (NX / CAA V5 or Creo Based) is an integrated solution for NX / CATIA V5 or Creo. Since it is an integrated solution, users can not only activate 3DCS workbenches from within the modelling solution, they can use many of its inbuilt functionality to support their modelling.
3DCS Variation Analyst provides three analysis methods:
Monte Carlo Analysis
High-Low-Mean (Sensitivity Analysis) and
Geofactor Analysis (Relationship)
Fast neural network training on fpga using quasi newton optimization methodNxfee Innovation
In this brief, a customized and pipelined hardware implementation of the quasi-Newton (QN) method on field-programmable gate array (FPGA) is proposed for fast artificial neural networks onsite training, targeting at the embedded applications. The architecture is scalable to cope with different neural network sizes while it supports batch-mode training. Experimental results demonstrate the superior performance and power efficiency of the proposed implementation over CPU, graphics processing unit, and FPGA QN implementations.
Design of an area efficient million-bit integer multiplier using double modul...Nxfee Innovation
This brief proposes a double modulus number theoretical transform (NTT) method for million-bit integer multiplication in fully homomorphic encryption. In our method, each NTT point is processed simultaneously under two moduli, and the final result is generated through the Chinese reminder theorem. The employment of double modulus enlarges the permitted NTT sample size from 24 to 32 bits and thus improves the transform efficiency. Based on the proposed double modulus method, we accomplish a VLSI design of million-bit integer multiplier with the Schönhage–Strassen algorithm. Implementation results on Altera Stratix-V FPGA show that this brief is able to compute a product of two 1024k-bit integers every 4.9 ms at the cost of only 7.9k ALUTs and 3.6k registers, which is more area-efficient when compared with the current competitors.
Multilevel half rate phase detector for clock and data recovery circuitsNxfee Innovation
In this brief, a half-rate (HR) bang-bang (BB) phase detector (PD) with multiple decision levels is proposed for clock and data recovery (CDR) circuits. The combination allows the oscillator to run at half the input data rate while providing information about the sign and magnitude of the phase shift between the PD inputs. This allows a finer control of the frequency of the oscillator in the phase-locked loop (PLL) of the CDR circuit, which results in up to 30% less output clock jitter than with a conventional two-levels HR BB PD. Thanks to this, the bit error rate can be decreased by up to 5× in a 5-Gb/s CDR circuit. The proposed topology was implemented in a 28-nm FDSOI CMOS technology providing average power consumption below 76 µW with a supply voltage of 1 V. Although multilevel (ML) BB PDs have already been proposed in some PLL-based CDR with very interesting results, a specific design of the PD has to be implemented for an HR system. This brief provides the first ML-HR-BBPD.
An energy efficient programmable many core accelerator for personalized biome...Nxfee Innovation
Wearable personalized health monitoring systems can offer a cost-effective solution for human health care. These systems must constantly monitor patients’ physiological signals and provide highly accurate, and quick processing and delivery of the vast amount of data within a limited power and area footprint. These personalized biomedical applications require sampling and processing multiple streams of physiological signals with a varying number of channels and sampling rates. The processing typically consists of feature extraction, data fusion, and classification stages that require a large number of digital signal processing (DSP) and machine learning (ML) kernels. In response to these requirements, in this paper, a tiny, energy efficient, and domain-specific manycore accelerator referred to as power-efficient nano clusters (PENC) is proposed to map and execute the kernels of these applications. Simulation results show that the PENC is able to reduce energy consumption by up to 80% and 25% for DSP and ML kernels, respectively, when optimally parallelized. In addition, we fully implemented three compute-intensive personalized biomedical applications, namely, multichannel seizure detection, multi physiological stress detection, and standalone tongue drive system (sTDS), to evaluate the proposed manycore performance relative to commodity embedded CPU, graphical processing unit (GPU), and field programmable gate array (FPGA)-based implementations. For these three case studies, the energy consumption and the performance of the proposed PENC manycore, when acting as an accelerator along with an Intel Atom processor as a host, are compared with the existing commercial off-the-shelf general purpose, customizable, and programmable embedded platforms, including Intel Atom, Xilinx Artix-7 FPGA, and NVIDIA TK1 advanced RISC machine -A15 and K1 GPU system on a chip. For these applications, the PENC manycore is able to significantly improve throughput and energy efficiency by up to 1872× and 276×, respectively. For the most computational intensive application of seizure detection, the PENC manycore is able to achieve a throughput of 15.22 giga-operations-per-second (GOPs), which is a 14× improvement in throughput over custom FPGA solution. For stress detection, the PENC achieves a throughput of 21.36 GOPs and an energy efficiency of 4.23 GOP/J, which is 14.87× and 2.28× better over FPGA implementation, respectively. For the sTDS application, the PENC improves a through put by 5.45× and an energy efficiency by 2.37× over FPGA implementation.
Design and fpga implementation of a reconfigurable digital down converter for...Nxfee Innovation
This brief presents a field-programmable gate array-based implementation of a reconfigurable digital down converter (DDC) that can process input bandwidth of up to 3.6 GHz and provide a flexible down-converted output. The proposed DDC consists of a mixer and a resampling filter. The resampling filter can work at much higher clock rate. The reason is that all the single-cycle recursive loops in the re sampling filter are pipelined by using either real/imaginary part-time multiplexing or parallel processing technique. With features like arbitrary sampling rate conversion, and dynamic configuration, the proposed design is highly flexible, so that it can generate a down-converted output with sampling rate, selectable within the range of 1 kS/s–225 MS/s. Moreover, the flexibility is further improved by being able to specify the output sampling rate and center frequency to a resolution of less than 1 S/s. The experimental results show that the proposed design can achieve the same functionality as the existing work but with fewer hardware resources.
A fast and low complexity operator for the computation of the arctangent of a...Nxfee Innovation
The computation of the arctangent of a complex number, i.e., the atan2 function, is frequently needed in hardware systems that could profit from an optimized operator. In this brief, we present a novel method to compute the atan2 function and a hardware architecture for its implementation. The method is based on a first stage that performs a coarse approximation of the atan2 function and a second stage that improves the output accuracy by means of a lookup table. We present results for fixed-point implementations in a field-programmable gate array device, all of them guaranteeing last-bit accuracy, which provide an advantage in latency, speed, and use of resources, when compared with well-established fixed-point options.
A high accuracy programmable pulse generator with a 10-ps timing resolutionNxfee Innovation
Automatic test equipment must have high-precision and low-power pulse generators (PGs) for testing memory and device-under-test ICs. This paper describes a high-accuracy and wide-data-rate-range PG with a 10-ps time resolution. The PG comprises an edge combiner (EC) and a multiphase clock generator (MPCG). The EC can produce an arbitrary waveform through 32 phase outputs of the MPCG. The EC adopts a one/zero detector and phase selection logic to define an operational data rate range and a timing resolution, respectively. Therefore, the EC uses the phase selection logic to combine the period window of the one/zero detector with the MPCG output phases. The EC also uses a countdown counter for a wide operational range. In the MPCG, a multiphase oscillator (MPO) adopts a ring oscillator scheme with sub feedback loops to extend its maximum operational frequency. The MPO also uses a phase error corrector to reduce the output phase error resulting from process and layout mismatches. Thus, the PG can obtain high accuracy waveforms owing to small phase errors. The test chip was implemented using a 0.13-µm CMOS process. The core area and power consumption of the PG were measured to be 250 × 300 µm2 and 18.7 mW, respectively. The data rate range of the PG was determined to be from 3.2 kHz to 893 MHz. The time resolution and average accuracy of the PG were measured to be 10 ps and ±0.3 LSB, respectively.
Today's fast paced product market has shorter lifecycles and tighter budgetary concerns. Tolerance analysis software provides an ideal solution to reduce the number of crucial steps needed to optimize a product at the design step itself. 3DCS Variation Analyst is the world's most used tolerance analysis software that is fully integrated into NX/ CATIA V5/ Creo and CAD Neutral Multi-CAD. 3DCS Variation Analyst is designed to use a consistent format and set of mathematical formulae that create reliable results, enabling engineers to gain a complete insight into their design. The software empowers design engineers to control variation and optimize their designs to account for inherent process and part variation, which in turn reduces non-conformance, scrap, rework and other associated costs.
3DCS Variation Analyst
Used by the world’s leading manufacturing OEM’s to reduce the cost of quality, 3DCS Variation Analyst comes in two flavours:
1) 3DCS Variation Analyst (NX / CAA V5 or Creo Based) is an integrated solution for NX / CATIA V5 or Creo. Since it is an integrated solution, users can not only activate 3DCS workbenches from within the modelling solution, they can use many of its inbuilt functionality to support their modelling.
3DCS Variation Analyst provides three analysis methods:
Monte Carlo Analysis
High-Low-Mean (Sensitivity Analysis) and
Geofactor Analysis (Relationship)
Fast neural network training on fpga using quasi newton optimization methodNxfee Innovation
In this brief, a customized and pipelined hardware implementation of the quasi-Newton (QN) method on field-programmable gate array (FPGA) is proposed for fast artificial neural networks onsite training, targeting at the embedded applications. The architecture is scalable to cope with different neural network sizes while it supports batch-mode training. Experimental results demonstrate the superior performance and power efficiency of the proposed implementation over CPU, graphics processing unit, and FPGA QN implementations.
Design of an area efficient million-bit integer multiplier using double modul...Nxfee Innovation
This brief proposes a double modulus number theoretical transform (NTT) method for million-bit integer multiplication in fully homomorphic encryption. In our method, each NTT point is processed simultaneously under two moduli, and the final result is generated through the Chinese reminder theorem. The employment of double modulus enlarges the permitted NTT sample size from 24 to 32 bits and thus improves the transform efficiency. Based on the proposed double modulus method, we accomplish a VLSI design of million-bit integer multiplier with the Schönhage–Strassen algorithm. Implementation results on Altera Stratix-V FPGA show that this brief is able to compute a product of two 1024k-bit integers every 4.9 ms at the cost of only 7.9k ALUTs and 3.6k registers, which is more area-efficient when compared with the current competitors.
Multilevel half rate phase detector for clock and data recovery circuitsNxfee Innovation
In this brief, a half-rate (HR) bang-bang (BB) phase detector (PD) with multiple decision levels is proposed for clock and data recovery (CDR) circuits. The combination allows the oscillator to run at half the input data rate while providing information about the sign and magnitude of the phase shift between the PD inputs. This allows a finer control of the frequency of the oscillator in the phase-locked loop (PLL) of the CDR circuit, which results in up to 30% less output clock jitter than with a conventional two-levels HR BB PD. Thanks to this, the bit error rate can be decreased by up to 5× in a 5-Gb/s CDR circuit. The proposed topology was implemented in a 28-nm FDSOI CMOS technology providing average power consumption below 76 µW with a supply voltage of 1 V. Although multilevel (ML) BB PDs have already been proposed in some PLL-based CDR with very interesting results, a specific design of the PD has to be implemented for an HR system. This brief provides the first ML-HR-BBPD.
An energy efficient programmable many core accelerator for personalized biome...Nxfee Innovation
Wearable personalized health monitoring systems can offer a cost-effective solution for human health care. These systems must constantly monitor patients’ physiological signals and provide highly accurate, and quick processing and delivery of the vast amount of data within a limited power and area footprint. These personalized biomedical applications require sampling and processing multiple streams of physiological signals with a varying number of channels and sampling rates. The processing typically consists of feature extraction, data fusion, and classification stages that require a large number of digital signal processing (DSP) and machine learning (ML) kernels. In response to these requirements, in this paper, a tiny, energy efficient, and domain-specific manycore accelerator referred to as power-efficient nano clusters (PENC) is proposed to map and execute the kernels of these applications. Simulation results show that the PENC is able to reduce energy consumption by up to 80% and 25% for DSP and ML kernels, respectively, when optimally parallelized. In addition, we fully implemented three compute-intensive personalized biomedical applications, namely, multichannel seizure detection, multi physiological stress detection, and standalone tongue drive system (sTDS), to evaluate the proposed manycore performance relative to commodity embedded CPU, graphical processing unit (GPU), and field programmable gate array (FPGA)-based implementations. For these three case studies, the energy consumption and the performance of the proposed PENC manycore, when acting as an accelerator along with an Intel Atom processor as a host, are compared with the existing commercial off-the-shelf general purpose, customizable, and programmable embedded platforms, including Intel Atom, Xilinx Artix-7 FPGA, and NVIDIA TK1 advanced RISC machine -A15 and K1 GPU system on a chip. For these applications, the PENC manycore is able to significantly improve throughput and energy efficiency by up to 1872× and 276×, respectively. For the most computational intensive application of seizure detection, the PENC manycore is able to achieve a throughput of 15.22 giga-operations-per-second (GOPs), which is a 14× improvement in throughput over custom FPGA solution. For stress detection, the PENC achieves a throughput of 21.36 GOPs and an energy efficiency of 4.23 GOP/J, which is 14.87× and 2.28× better over FPGA implementation, respectively. For the sTDS application, the PENC improves a through put by 5.45× and an energy efficiency by 2.37× over FPGA implementation.
Design and fpga implementation of a reconfigurable digital down converter for...Nxfee Innovation
This brief presents a field-programmable gate array-based implementation of a reconfigurable digital down converter (DDC) that can process input bandwidth of up to 3.6 GHz and provide a flexible down-converted output. The proposed DDC consists of a mixer and a resampling filter. The resampling filter can work at much higher clock rate. The reason is that all the single-cycle recursive loops in the re sampling filter are pipelined by using either real/imaginary part-time multiplexing or parallel processing technique. With features like arbitrary sampling rate conversion, and dynamic configuration, the proposed design is highly flexible, so that it can generate a down-converted output with sampling rate, selectable within the range of 1 kS/s–225 MS/s. Moreover, the flexibility is further improved by being able to specify the output sampling rate and center frequency to a resolution of less than 1 S/s. The experimental results show that the proposed design can achieve the same functionality as the existing work but with fewer hardware resources.
A fast and low complexity operator for the computation of the arctangent of a...Nxfee Innovation
The computation of the arctangent of a complex number, i.e., the atan2 function, is frequently needed in hardware systems that could profit from an optimized operator. In this brief, we present a novel method to compute the atan2 function and a hardware architecture for its implementation. The method is based on a first stage that performs a coarse approximation of the atan2 function and a second stage that improves the output accuracy by means of a lookup table. We present results for fixed-point implementations in a field-programmable gate array device, all of them guaranteeing last-bit accuracy, which provide an advantage in latency, speed, and use of resources, when compared with well-established fixed-point options.
A high accuracy programmable pulse generator with a 10-ps timing resolutionNxfee Innovation
Automatic test equipment must have high-precision and low-power pulse generators (PGs) for testing memory and device-under-test ICs. This paper describes a high-accuracy and wide-data-rate-range PG with a 10-ps time resolution. The PG comprises an edge combiner (EC) and a multiphase clock generator (MPCG). The EC can produce an arbitrary waveform through 32 phase outputs of the MPCG. The EC adopts a one/zero detector and phase selection logic to define an operational data rate range and a timing resolution, respectively. Therefore, the EC uses the phase selection logic to combine the period window of the one/zero detector with the MPCG output phases. The EC also uses a countdown counter for a wide operational range. In the MPCG, a multiphase oscillator (MPO) adopts a ring oscillator scheme with sub feedback loops to extend its maximum operational frequency. The MPO also uses a phase error corrector to reduce the output phase error resulting from process and layout mismatches. Thus, the PG can obtain high accuracy waveforms owing to small phase errors. The test chip was implemented using a 0.13-µm CMOS process. The core area and power consumption of the PG were measured to be 250 × 300 µm2 and 18.7 mW, respectively. The data rate range of the PG was determined to be from 3.2 kHz to 893 MHz. The time resolution and average accuracy of the PG were measured to be 10 ps and ±0.3 LSB, respectively.
Analysis and design of cost effective, high-throughput ldpc decodersNxfee Innovation
This paper introduces a new approach to cost effective, high-throughput hardware designs for low-density parity-check (LDPC) decoders. The proposed approach, called nonsurjective finite alphabet iterative decoders (NS-FAIDs), exploits the robustness of message-passing LDPC decoders to inaccuracies in the calculation of exchanged messages, and it is shown to provide a unified framework for several designs previously proposed in the literature. NS-FAIDs are optimized by density evolution for regular and irregular LDPC codes, and are shown to provide different tradeoffs between hardware complexity and decoding performance. Two hardware architectures targeting high-throughput applications are also proposed, integrating both Min-Sum (MS) and NS-FAID decoding kernels. ASIC post synthesis implementation results on 65-nm CMOS technology show that NS-FAIDs yield significant improvements in the throughput to area ratio, by up to 58.75% with respect to the MS decoder, with even better or only slightly degraded error correction performance.
Efficient fpga mapping of pipeline sdf fft coresNxfee Innovation
In this paper, an efficient mapping of the pipeline single-path delay feedback (SDF) fast Fourier transform (FFT) architecture to field-programmable gate arrays (FPGAs) is proposed. By considering the architectural features of the target FPGA, significantly better implementation results are obtained. This is illustrated by mapping an R22SDF 1024-point FFT core toward both Xilinx Virtex-4 and Virtex-6 devices. The optimized FPGA mapping is explored in detail. Algorithmic transformations that allow a better mapping are proposed, resulting in implementation achievements that by far outperforms earlier published work. For Virtex-4, the results show a 350% increase in throughput per slice and 25% reduction in block RAM (BRAM) use, with the same amount of DSP48 resources, compared with the best earlier published result. The resulting Virtex-6 design sees even larger increases in throughput per slice compared with Xilinx FFT IP core, using half as many DSP48E1 blocks and less BRAM resources. The results clearly show that the FPGA mapping is crucial, not only the architecture and algorithm choices.
Approximate hybrid high radix encoding for energy efficient inexact multipliersNxfee Innovation
Approximate computing forms a design alternative that exploits the intrinsic error resilience of various applications and produces energy-efficient circuits with small accuracy loss. In this paper, we propose an approximate hybrid high radix encoding for generating the partial products in signed multiplications that encodes the most significant bits with the accurate radix-4 encoding and the least significant bits with an approximate higher radix encoding. The approximations are performed by rounding the high radix values to their nearest power of two. The proposed technique can be configured to achieve the desired energy–accuracy tradeoffs. Compared with the accurate radix-4 multiplier, the proposed multipliers deliver up to 56% energy and 55% area savings, when operating at the same frequency, while the imposed error is bounded by a Gaussian distribution with near-zero average. Moreover, the proposed multipliers are compared with state-of-the-art inexact multipliers, outperforming them by up to 40% in energy consumption, for similar error values. Finally, we demonstrate the scalability of our technique.
Feedback based low-power soft-error-tolerant design for dual-modular redundancyNxfee Innovation
Triple-modular redundancy (TMR), which consists of three identical modules and a voting circuit, is a common architecture for soft-error tolerance. However, the original TMR suffers from two major drawbacks: the large area overhead and the vulnerability of the voter. In order to overcome these drawbacks, we propose a new complementary dual-modular redundancy (CDMR) scheme for mitigating the effect of soft errors. Inspired by the Markov random field (MRF) theory, a two-stage voting system is implemented in CDMR, including a first stage optimal MRF structure and a second-stage high-performance merging unit. The CDMR scheme can reduce the voting circuit area by 20% while saving the area of one redundant module, achieving at least 26% error-rate reduction at an ultralow supply voltage of 0.25 V with 8.33% faster timing compared to previous voter designs.
A reconfigurable ldpc decoder optimized applicationsNxfee Innovation
This paper presents a high data-rate low-density parity-check (LDPC) decoder, suitable for the 802.11n/ac (WiFi) standard. The innovative features of the proposed decoder relate to the decoding algorithms and the interconnection between the processing elements. The reduction of the hardware complexity of decoders based on the min-sum (MS) algorithms comes at the cost of performance degradation, especially at high-noise regions. We introduce more accurate approximations of the log sum-product algorithm that also operate well for low signal-to noise ratio values. Telecommunication standards, including WiFi, support more than one quasi-cyclic LDPC codes of different characteristics, such as codeword length and code rate. A proposed design technique derives networks, capable of supporting a variety of codes and efficiently realizing connectivity between a variable number of processing units, with a relatively small hardware overhead over the single-code case. As a demonstration of the proposed technique, we implemented a reconfigurable network based on barrel rotators, suitable for LDPC decoders compatible with WiFi standard. Our approach achieves low complexity and high clock frequency, compared with related prior works. A 90-nm application-specified integrated circuit implementation of the proposed high-parallel WiFi decoder occupies 4.88 mm2 and achieves an information throughput rate of 4.5 G bit/s at a clock frequency of 555 MHz.
An efficient fault tolerance design for integer parallel matrix vectorNxfee Innovation
Parallel matrix processing is a typical operation in many systems, and in particular matrix–vector multiplication (MVM) is one of the most common operations in the modern digital signal processing and digital communication systems. This paper proposes a fault tolerant design for integer parallel MVMs. The scheme combines ideas from error correction codes with the self-checking capability of MVM. Field-programmable gate array evaluation shows that the proposed scheme can significantly reduce the overheads compared to the protection of each MVM on its own. Therefore, the proposed technique can be used to reduce the cost of providing fault tolerance in practical implementations.
Fpga based 128 bit customised vliw processor for executing dual scalarvector ...eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...IRJET Journal
P.V.S.S.Gangadhar, A.K.Shrivastava, Ragini Shukla "To Implement Cloud Computing by using Agile Methodology in Indian E-Governance ",International Research Journal of Engineering and Technology (IRJET), Volume2,issue-01 April 2015.e-ISSN:2395-0056, p-ISSN:2395-0072. www.irjet.net .published by Fast Track Publications
Abstract
Many developed countries in the world, uses Information and communication technology to deliver public services in a more efficient & easy way. The benefits of the Electronic Governance are huge and day by day increasing from one to many public services. But implementation of the governance is constrained due to 1) high costs of investment 2) shortage of domain experts 3) diverse and irreconcilable systems and 4) Security and privacy issues. With the Invent & rise of Cloud computing, i am looking various aspects of use of cloud computing in e-governance is emerged. Cloud computing can solve many of the above mentioned hurdles and provide better way to e-gov expansion, but it has some risks also. Agile development treats optimize the chance provided by cloud computing by doing software relinquishes iteratively and getting end user feedback more frequently and quickly.
Cloudify: Open vCPE Design Concepts and Multi-Cloud OrchestrationCloudify Community
See how open vCPE can be achieved in the real world and in action, while integrating other VNFs into the service chain, while easily instantiating and managing on any cloud, leveraging open orchestration design concepts. More and more vendors are looking to not only easily onboard their VNFs to the cloud, but also build a stack that is versatile and not locked into one cloud provider or vendor. Join this webinar and learn how Datavision and Cloudify are helping deliver this end-to-end solution across the globe
Securing the present block cipher against combined side channel analysis and ...Nxfee Innovation
In this paper, we present and evaluate a hardware implementation of the PRESENT block cipher secured against both side-channel analysis and fault attacks (FAs). The side-channel security is provided by the first-order threshold implementation masking scheme of the serialized PRESENT proposed by Poschmann et al. For the FA resistance, we employ the Private Circuits II countermeasure presented by Ishai et al. at Eurocrypt 2006, which we tailor to resist arbitrary 1-bit faults. We perform a side-channel evaluation using the state-of-the-art leakage detection tests, quantify the resource overhead of the Private Circuits II countermeasure, subdue the implementation to established differential FAs against the PRESENT block cipher, and contemplate on the structural resistance of the countermeasure. This paper provides the detailed instructions on how to successfully achieve a secure Private Circuits II implementation for the data path as well as the control logic.
Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...AM Publications
Present manufacturing systems use Manual migration of essential manufacturing related data wherein they identify servers manually for placement of application executable files and includes subject matter expertise to manually select the server and create, configure website and authorization and repeat migration process in other servers. This involves migration from source-server to destination-servers. The concept of cloud computing is not been introduced in both the case i.e. source server and destination server. In case of failures they perform manual back-up and restore. This manual migration process is time consuming, error prone, repetitive operation. Proposed Automatic migration system uses cloud migration automation where in both the source server and destination servers are placed in any of the deployment model such as private cloud, community cloud or public cloud. Provide a Rules engine core for data encryption, Intuitive UI for client side usage. Universal configuration management database solution (UCNDB) for daily refresh of data and migration database which execute using impersonated EPR specific user. Resource planning is also required to do random testing in case there are few requests or quality of the product is critical and this will help in identifying how much testing is done.
VMworld 2013: Architecting the Software-Defined Data Center VMworld
VMworld 2013
Aidan Dalgleish, VMware
David Hill, VMware
Kamau Wanguhu, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...Nxfee Innovation
In this brief, we propose a supply noise-insensitive charge pump phase-locked loop (PLL) using a source-follower (SF) regulator and noise cancellation. In order to minimize the voltage drop of the SF regulator while improving supply rejection, a gate-voltage-boosting technique and the body-controlled noise cancellation are proposed. To suppress the phase noise from the ring oscillator, a reference multiplier is employed to maximize the PLL loop bandwidth. Implemented in 65-nm CMOS, a prototype PLL at 3.2 GHz achieves supply noise spur of less than −33 dBc for a 50-mVpp supply noise around the loop bandwidth while consuming 3.12 mW from a 1-V supply.
More Related Content
Similar to Vector processing aware advanced clock-gating techniques for low-power fused multiply-add
Analysis and design of cost effective, high-throughput ldpc decodersNxfee Innovation
This paper introduces a new approach to cost effective, high-throughput hardware designs for low-density parity-check (LDPC) decoders. The proposed approach, called nonsurjective finite alphabet iterative decoders (NS-FAIDs), exploits the robustness of message-passing LDPC decoders to inaccuracies in the calculation of exchanged messages, and it is shown to provide a unified framework for several designs previously proposed in the literature. NS-FAIDs are optimized by density evolution for regular and irregular LDPC codes, and are shown to provide different tradeoffs between hardware complexity and decoding performance. Two hardware architectures targeting high-throughput applications are also proposed, integrating both Min-Sum (MS) and NS-FAID decoding kernels. ASIC post synthesis implementation results on 65-nm CMOS technology show that NS-FAIDs yield significant improvements in the throughput to area ratio, by up to 58.75% with respect to the MS decoder, with even better or only slightly degraded error correction performance.
Efficient fpga mapping of pipeline sdf fft coresNxfee Innovation
In this paper, an efficient mapping of the pipeline single-path delay feedback (SDF) fast Fourier transform (FFT) architecture to field-programmable gate arrays (FPGAs) is proposed. By considering the architectural features of the target FPGA, significantly better implementation results are obtained. This is illustrated by mapping an R22SDF 1024-point FFT core toward both Xilinx Virtex-4 and Virtex-6 devices. The optimized FPGA mapping is explored in detail. Algorithmic transformations that allow a better mapping are proposed, resulting in implementation achievements that by far outperforms earlier published work. For Virtex-4, the results show a 350% increase in throughput per slice and 25% reduction in block RAM (BRAM) use, with the same amount of DSP48 resources, compared with the best earlier published result. The resulting Virtex-6 design sees even larger increases in throughput per slice compared with Xilinx FFT IP core, using half as many DSP48E1 blocks and less BRAM resources. The results clearly show that the FPGA mapping is crucial, not only the architecture and algorithm choices.
Approximate hybrid high radix encoding for energy efficient inexact multipliersNxfee Innovation
Approximate computing forms a design alternative that exploits the intrinsic error resilience of various applications and produces energy-efficient circuits with small accuracy loss. In this paper, we propose an approximate hybrid high radix encoding for generating the partial products in signed multiplications that encodes the most significant bits with the accurate radix-4 encoding and the least significant bits with an approximate higher radix encoding. The approximations are performed by rounding the high radix values to their nearest power of two. The proposed technique can be configured to achieve the desired energy–accuracy tradeoffs. Compared with the accurate radix-4 multiplier, the proposed multipliers deliver up to 56% energy and 55% area savings, when operating at the same frequency, while the imposed error is bounded by a Gaussian distribution with near-zero average. Moreover, the proposed multipliers are compared with state-of-the-art inexact multipliers, outperforming them by up to 40% in energy consumption, for similar error values. Finally, we demonstrate the scalability of our technique.
Feedback based low-power soft-error-tolerant design for dual-modular redundancyNxfee Innovation
Triple-modular redundancy (TMR), which consists of three identical modules and a voting circuit, is a common architecture for soft-error tolerance. However, the original TMR suffers from two major drawbacks: the large area overhead and the vulnerability of the voter. In order to overcome these drawbacks, we propose a new complementary dual-modular redundancy (CDMR) scheme for mitigating the effect of soft errors. Inspired by the Markov random field (MRF) theory, a two-stage voting system is implemented in CDMR, including a first stage optimal MRF structure and a second-stage high-performance merging unit. The CDMR scheme can reduce the voting circuit area by 20% while saving the area of one redundant module, achieving at least 26% error-rate reduction at an ultralow supply voltage of 0.25 V with 8.33% faster timing compared to previous voter designs.
A reconfigurable ldpc decoder optimized applicationsNxfee Innovation
This paper presents a high data-rate low-density parity-check (LDPC) decoder, suitable for the 802.11n/ac (WiFi) standard. The innovative features of the proposed decoder relate to the decoding algorithms and the interconnection between the processing elements. The reduction of the hardware complexity of decoders based on the min-sum (MS) algorithms comes at the cost of performance degradation, especially at high-noise regions. We introduce more accurate approximations of the log sum-product algorithm that also operate well for low signal-to noise ratio values. Telecommunication standards, including WiFi, support more than one quasi-cyclic LDPC codes of different characteristics, such as codeword length and code rate. A proposed design technique derives networks, capable of supporting a variety of codes and efficiently realizing connectivity between a variable number of processing units, with a relatively small hardware overhead over the single-code case. As a demonstration of the proposed technique, we implemented a reconfigurable network based on barrel rotators, suitable for LDPC decoders compatible with WiFi standard. Our approach achieves low complexity and high clock frequency, compared with related prior works. A 90-nm application-specified integrated circuit implementation of the proposed high-parallel WiFi decoder occupies 4.88 mm2 and achieves an information throughput rate of 4.5 G bit/s at a clock frequency of 555 MHz.
An efficient fault tolerance design for integer parallel matrix vectorNxfee Innovation
Parallel matrix processing is a typical operation in many systems, and in particular matrix–vector multiplication (MVM) is one of the most common operations in the modern digital signal processing and digital communication systems. This paper proposes a fault tolerant design for integer parallel MVMs. The scheme combines ideas from error correction codes with the self-checking capability of MVM. Field-programmable gate array evaluation shows that the proposed scheme can significantly reduce the overheads compared to the protection of each MVM on its own. Therefore, the proposed technique can be used to reduce the cost of providing fault tolerance in practical implementations.
Fpga based 128 bit customised vliw processor for executing dual scalarvector ...eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...IRJET Journal
P.V.S.S.Gangadhar, A.K.Shrivastava, Ragini Shukla "To Implement Cloud Computing by using Agile Methodology in Indian E-Governance ",International Research Journal of Engineering and Technology (IRJET), Volume2,issue-01 April 2015.e-ISSN:2395-0056, p-ISSN:2395-0072. www.irjet.net .published by Fast Track Publications
Abstract
Many developed countries in the world, uses Information and communication technology to deliver public services in a more efficient & easy way. The benefits of the Electronic Governance are huge and day by day increasing from one to many public services. But implementation of the governance is constrained due to 1) high costs of investment 2) shortage of domain experts 3) diverse and irreconcilable systems and 4) Security and privacy issues. With the Invent & rise of Cloud computing, i am looking various aspects of use of cloud computing in e-governance is emerged. Cloud computing can solve many of the above mentioned hurdles and provide better way to e-gov expansion, but it has some risks also. Agile development treats optimize the chance provided by cloud computing by doing software relinquishes iteratively and getting end user feedback more frequently and quickly.
Cloudify: Open vCPE Design Concepts and Multi-Cloud OrchestrationCloudify Community
See how open vCPE can be achieved in the real world and in action, while integrating other VNFs into the service chain, while easily instantiating and managing on any cloud, leveraging open orchestration design concepts. More and more vendors are looking to not only easily onboard their VNFs to the cloud, but also build a stack that is versatile and not locked into one cloud provider or vendor. Join this webinar and learn how Datavision and Cloudify are helping deliver this end-to-end solution across the globe
Securing the present block cipher against combined side channel analysis and ...Nxfee Innovation
In this paper, we present and evaluate a hardware implementation of the PRESENT block cipher secured against both side-channel analysis and fault attacks (FAs). The side-channel security is provided by the first-order threshold implementation masking scheme of the serialized PRESENT proposed by Poschmann et al. For the FA resistance, we employ the Private Circuits II countermeasure presented by Ishai et al. at Eurocrypt 2006, which we tailor to resist arbitrary 1-bit faults. We perform a side-channel evaluation using the state-of-the-art leakage detection tests, quantify the resource overhead of the Private Circuits II countermeasure, subdue the implementation to established differential FAs against the PRESENT block cipher, and contemplate on the structural resistance of the countermeasure. This paper provides the detailed instructions on how to successfully achieve a secure Private Circuits II implementation for the data path as well as the control logic.
Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...AM Publications
Present manufacturing systems use Manual migration of essential manufacturing related data wherein they identify servers manually for placement of application executable files and includes subject matter expertise to manually select the server and create, configure website and authorization and repeat migration process in other servers. This involves migration from source-server to destination-servers. The concept of cloud computing is not been introduced in both the case i.e. source server and destination server. In case of failures they perform manual back-up and restore. This manual migration process is time consuming, error prone, repetitive operation. Proposed Automatic migration system uses cloud migration automation where in both the source server and destination servers are placed in any of the deployment model such as private cloud, community cloud or public cloud. Provide a Rules engine core for data encryption, Intuitive UI for client side usage. Universal configuration management database solution (UCNDB) for daily refresh of data and migration database which execute using impersonated EPR specific user. Resource planning is also required to do random testing in case there are few requests or quality of the product is critical and this will help in identifying how much testing is done.
VMworld 2013: Architecting the Software-Defined Data Center VMworld
VMworld 2013
Aidan Dalgleish, VMware
David Hill, VMware
Kamau Wanguhu, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...Nxfee Innovation
In this brief, we propose a supply noise-insensitive charge pump phase-locked loop (PLL) using a source-follower (SF) regulator and noise cancellation. In order to minimize the voltage drop of the SF regulator while improving supply rejection, a gate-voltage-boosting technique and the body-controlled noise cancellation are proposed. To suppress the phase noise from the ring oscillator, a reference multiplier is employed to maximize the PLL loop bandwidth. Implemented in 65-nm CMOS, a prototype PLL at 3.2 GHz achieves supply noise spur of less than −33 dBc for a 50-mVpp supply noise around the loop bandwidth while consuming 3.12 mW from a 1-V supply.
The implementation of the improved omp for aic reconstruction based on parall...Nxfee Innovation
Sparse signal recovery becomes extremely challenging for a variety of real-time applications. In this paper, we improve the orthogonal matching pursuit (OMP) algorithm based on parallel correlation indices selection mechanism in each iteration and Goldschmidt algorithm. Simulation results show that the improved OMP algorithm with a reduced number of iterations and low hardware complexity of matrix operations has higher success rate and recovery signal-to-noise-ratio (RSNR) for sparse signal recovery. This paper presents an efficient complex valued system hardware architecture of the recovery algorithm for analog-to-information structure based on compressive sensing. The proposed architecture is implemented and validated on the Xilinx Virtex6 field-programmable gate array (FPGA) for signal reconstruction with N = 1024, K = 36, and M = 256. The implementation results showed that the improved OMP algorithm achieved a higher RSNR of 31.04 dB compared with the original OMP algorithm. This synthesized design consumes a few percentages of the hardware resources of the FPGA chip with the clock frequency of 135.4 MHZ and reconstruction time of 170 µs, which is faster than the existing design.
Low complexity methodology for complex square-root computationNxfee Innovation
In this brief, we propose a low-complexity methodology to compute a complex square root using only a circular coordinate rotation digital computer (CORDIC) as opposed to the state-of-the-art techniques that need both circular as well as hyperbolic CORDICs. Subsequently, an architecture has been designed based on the proposed methodology and implemented on the ASIC platform using the UMC 180-nm Technology node with 1.0 V at 5 MHz. Field programmable gate array (FPGA) prototyping using Xilinx’ Virtex-6 (XC6v1x240t) has also been carried out. After thorough theoretical analysis and experimental validations, it can be inferred that the proposed methodology reduces 21.15% slice look up tables (on FPGA platform) and saves 20.25% silicon area overhead and decreases 19% power consumption (on ASIC platform) when compared with the state-of-the-art method without compromising the computational speed, throughput, and accuracy.
Combating data leakage trojans in commercial and asic applications with time ...Nxfee Innovation
Globalization of microchip fabrication opens the possibility for an attacker to insert hardware Trojans into a chip during the manufacturing process. While most defensive methods focus on detection or prevention, a recent method, called Randomized Encoding of Combinational Logic for Resistance to Data Leakage (RECORD), uses data randomization to prevent hardware Trojans from leaking meaningful information even when the entire design is known to the attacker. Both RECORD and its sequential variant require significant area and power overhead. In this paper, a Time-Division Multiplexed version of the RECORD design process is proposed which reduces area overhead by 63% and power by 56%. This time-division multiplexing (TDM) concept is further refined to allow commercial off the shelf (COTS) products and IP cores to be safely operated from a separate chip. These new methods tradeoff latency (5.3× for TDM and 3.9× for COTS) and energy use to accomplish area and power savings and achieve greater security than the original RECORD process.
Approximate sum of-products designs based on distributed arithmeticNxfee Innovation
Approximate circuits provide high performance and require low power. Sum-of-products (SOP) units are key elements in many digital signal processing applications. In this brief, three approximate SOP (ASOP) models which are based on the distributed arithmetic are proposed. They are designed for different levels of accuracy. First model of ASOP achieves an improvement up to 64% on area and 70% on power, when compared with conventional unit. Other two models provide an improvement of 32% and 48% on area and 54% and 58% on power, respectively, with a reduced error rate compared with the first model. Third model achieves the mean relative error and normalized error distance as low as 0.05% and 0.009%, respectively. Performance of approximate units is evaluated with a noisy image smoothing application, where the proposed models are capable of achieving higher peak signal to-noise ratio than the existing state-of-the-art techniques. It is shown that the proposed approximate models achieve higher processing accuracy than existing works but with significant improvements in power and performance.
Algorithm and vlsi architecture design of proportionate type lms adaptive fil...Nxfee Innovation
Proportionate-type normalized LMS (Pt-NLMS) family of adaptive filtering algorithms for sparse system identification pose significant implementation challenges due to their high computational complexity especially for real-time applications like network echo cancelation. In this paper, we make the first attempt to implement Pt-NLMS algorithms in hardware. Several reformulations are proposed to simplify the original Pt-NLMS algorithms, thereby making them amenable to real time VLSI implementations and the reformulated algorithms referred as delayed µ-law proportionate LMS (DMPLMS) algorithm for white input and delayed wavelet MPLMS (DWMPLMS) for colored input are then implemented in hardware. Simulation studies demonstrate that the performance loss is very small for the proposed reformulations. We implemented the proposed designs considering 16-bit fixed point representation in hardware, and synthesis results show that the DMPLMS architecture with ≈30% increase in hardware over the state-of-the-art conventional delayed LMS architecture achieves 3× improvement in convergence rate for white input and the DWMPLMS architecture with ≈70% increase in hardware achieves 10× improvement in convergence rate for correlated input conditions.
A flexible wildcard pattern matching accelerator via simultaneous discrete fi...Nxfee Innovation
Regular expression matching becomes indispensable elements of Internet of Things network security. However, traditional ternary content addressable memory (TCAM) search engine is unable to handle patterns with wildcards, as it precisely tracks only one active state with single transition. This paper proposes a promising simultaneous pattern matching methodology for wildcard patterns by two separated engines to represent discrete finite automata. A key preprocessing to encode possible postfix pattern by a unique key ensures that follow-up patterns can accurately traverse all possible matches with limited hardware resources. This approach is practical and scalable for achieving good performance and low space consumption in network security, and it can be applicable to any regular expressions even with multi wildcard patterns. The experimental results demonstrate that this scheme can efficiently and accurately recognize wildcard patterns by simultaneously tracking only two active states. By adopting SRAM TCAM in the proposed architecture, the energy consumption is reduced to around 39%, compared with the energy consumption using a computing system that contains a large memory lookup and comparison overhead.
A closed form expression for minimum operating voltage of cmos d flip-flopNxfee Innovation
In this paper, a closed-form expression for estimating the minimum operating voltage (VDDmin) of D flip-flops (FFs) is proposed. VDDmin is defined as the minimum supply voltage at which the FFs are functional without errors. The proposed expression indicates that VDDmin of FFs is a linear function of the square root of logarithm of the number of FFs, and its slope depends on the within-die variation of the threshold voltage (VTH) and its intercept depends on the balance between PMOS and NMOS, which is mainly due to the die-to-die VTH variation. The proposed expression of VDDmin is validated by the simulation results as well as the silicon measurements. Finally, we discuss the dependence of VDDmin on the device parameters..
A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra...Nxfee Innovation
A configurable-bandwidth (BW) filter is presented in this paper for pulsed radar applications. To eliminate dispersion effects in the received waveform, a finite impulse response (FIR) topology is proposed, which has a measured standard deviation of an in-band group delay of 11 ns that is primarily dominated by the inherent, fully predictable delay introduced by the sample-and-hold. The filter operates at an IF of 20 MHz, and is tunable in BW from 1.5 to 15 MHz, which makes it optimal to be used with varying pulse widths in the radar. Employing a total of 128 taps, the FIR filter provides greater than 50-dB sharp attenuation in the stop band in order to minimize all out-of-band noise in the low signal-to-noise received radar signal. Fabricated in a 0.18-µm silicon on insulator CMOS process, the proposed filter consumes approximately 3.5mW/tap with a 1.8-V supply. A 20-MHz two-tone measurement with 200-kHz tone separation shows IIP3 greater than 8.5dBm.
A 12 bit 40-ms s sar adc with a fast-binary-window dac switching schemeNxfee Innovation
This paper presents a 12-bit 40-MS/s successive approximation register analog-to-digital converter (ADC) for ultrasound imaging systems. By incorporating a fast binary window digital-to-analog converter (DAC) switching technique, the problematic most significant bit transition glitch was removed to improve linearity without increasing the input capacitance or using a calibration scheme. A hybrid DAC was also developed to overcome the yield problem that occurs when a tiny unit capacitance is used in the DAC. Moreover, a reference buffer was used to accelerate the DAC settling to achieve high speed conversion. The prototype ADC was fabricated using a 130-nm CMOS technology. The ADC core occupied an active area of 0.1 mm 2 and consumed a total power of 1.32 mW when a 1.2-V supply was used at a conversion rate of 40 MS/s. The measured peak signal-to-noise-and-distortion ratio and spurious free dynamic range were 64 and 77.5 dB, respectively. The peak effective number of bits was 10.33, which is equivalent to a Walden figure-of-merit of 25.6 fJ/conversion step.
NXFEE Innovation is the Industry of Semiconductor IP Development, IP Designs, and services of developing solution to provide core products and application to customers with a wide range of solution that include custom ASIC/ FPGA/ DSP/ EMBEDDED System/ Wireless Technologies. Having lustrum of expertise and satisfied customers, NXFEE have the capability to deliver solution that is fully meshed with customer’s business requirement, meeting the highest standards.
NXFEE will Provide cost effective outsourcing services for secure and turn key product development in the areas of Bio-Medical/ Wireless/ Robotics/ VLSI/ DSP/ Embedded design & Development from conceptualization to production. Our sound technology and knowledge base have helped us to create products using emerging technology that include FPGA, VHDL, VERILOG HDL, SYSTEM VERILOG HDL, UVM, OVM, VVM, DSP, RTOS, DSP, Bluetooth, WI-FI, RF, CDMA, AXI, AHP, APB, and other related technologies in the area of industrial automation, telecommunications, consumer electronics and automotive applications.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
Hierarchical Digital Twin of a Naval Power SystemKerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Vector processing aware advanced clock-gating techniques for low-power fused multiply-add
1. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Vector Processing-Aware Advanced Clock-Gating Techniques for Low-
Power Fused Multiply-Add
Abstract:
The need for power efficiency is driving a rethink of design decisions in processor
architectures. While vector processors succeeded in the high-performance market in the
past, they need a retailoring for the mobile market that they are entering now. Floating-
point (FP) fused multiply-add (FMA), being a functional unit with high power
consumption, deserves special attention. Although clock gating is a well-known method
to reduce switching power in synchronous designs, there are unexplored opportunities for
its application to vector processors, especially when considering active operating mode.
In this research, we comprehensively identify, propose, and evaluate the most suitable
clock-gating techniques for vector FMA units (VFUs). These techniques ensure power
savings without jeopardizing the timing. We evaluate the proposed techniques using both
synthetic and ―real-world‖ application-based benchmarking. Using vector masking and
vector multilane-aware clock gating, we report power reductions of up to 52%, assuming
active VFU operating at the peak performance. Among other findings, we observe that
vector instruction-based clock-gating techniques achieve power savings for all vector FP
instructions. Finally, when evaluating all techniques together, using ―real-world‖
benchmarking, the power reductions are up to 80%. Additionally, in accordance with
processor design trends, we perform this research in a fully parameterizable and
automated fashion.
Software Implementation:
Modelsim
Xilinx 14.2
Existing System:
2. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Power and energy efficiency have become the dominant limiting factor to processor
performance and have increased significantly processor design complexity, especially
when considering the mobile market. Being able to exploit high degrees of data-level
parallelism (DLP) at low cost in a power- and energy-efficient way, vector processors are
an attractive architectural-level solution. Undoubtedly, the design goals for mobile vector
processors clearly differ from the performance-driven designs of traditional vector
machines. Therefore, mobile vector processors require a redesign of their functional units
(FUs) in a power-efficient manner.
Clock gating is a common method to reduce switching power in synchronous pipelines. It
is practically a standard in low-power design. The goal is to ―gate‖ the clock of any
component whenever it does not perform useful work. In that way, the power spent in the
associated clock tree, registers, and the logic between the registers is reduced. It is the
most efficient power reduction technique for active operating mode.1 Therefore, the
conditions under which clock gating can be applied should be extensively studied and
identified. A widely used approach is to clock gate a whole FU when it is idle. A
complementary and more challenging approach is clock gating the FU or its sub blocks
when it is active, i.e., operating at peak performance. Furthermore, there are
characteristics of vector processors that provide Since fused multiply-add (FMA) units
usually dissipate the most power of all FUs, their design requires special attention.
Abundant floating-point (FP) FMA is typically found in vector workloads, such as
multimedia, computer graphics, or deep learning workloads. Although in the past FMA
has been used for high performance, it recently has been included in mobile processors as
well. In contrast to high performance vector processors (e.g., NEC SX-series and
Tarantula ) that have separated units for each FP operation, mobile vector processors’
resources are limited; thus, we typically have a single unit per vector lane capable of
performing multiple FP operations rather than separate FP units. Apart from that,
additional advantages of using FMA over separate FP adder and multiplier are as follows.
3. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
1) Computation localization inside the same unit reduces the number of interconnections
(power and energy efficiency). 2) Higher accuracy (single, instead of two
round/normalize steps). 3) Improved performance (shorter latency).
We present three kinds of techniques: 1) novel ideas to exploit unique characteristics of
vector architectures for clock gating during active periods of execution (e.g., vector
instructions with a scalar operand or vector masking); 2) novel ideas for clock gating
during active periods of execution that are also applicable to scalar architectures but
especially beneficial to vector processors (e.g., gating internal blocks depending on the
values of input data); 3) ideas that are already used in other architectures and that we
present as its application is beneficial to vector processors, and for the sake of
completeness (e.g., idle VFU).
Regarding the second and third groups of ideas, an advantage of vector processing that
extends the applicability of clock gating is that vector instructions last many cycles, so
the state of the clock-gating and bypassing logic remains the same during the whole
instruction. As a result, power savings typically overcome the switching overhead of the
added hardware (which is often not a case in scalar processors.
To fulfill current trends in digital design that promote building generators rather than
instances, we perform this research in a fully parameterizable, scalable, and automated
manner. We developed an integrated architecture-circuit framework that consists of
several generators, simulators, and other tools, in order to join architectural-level
information (e.g., vector length or benchmark configuration) with circuit-level outputs
(e.g., VFU power and timing measurements). We implement our clock-gating techniques
and generate hardware VFU models using a fully parameterizable Chisel based FMA
generator (FM Agen) and a 40-nm low-power technology.
We discuss the related work individually for each of our clock-gating techniques together
with the description of the technique in Section IV. Besides, in the context of alternative
4. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
low-power techniques for FP units, interesting approaches have been proposed (caching
results that can be reused) and byte encoding (computation performed over significant
bytes). However, detailed and accurate evaluation reveals that the actual savings are often
low and with an unaffordable area overhead .
Fig. 1. Two-lane, four-stage VFU (MVL = EVL = 64) executing FPFMAV V3<-V0,V1,V2.
Disadvantages:
Inaccuracy in operation
Power savings is not efficient
Proposed System:
Proposed clock-gating Techniques
Scalar Operand Clock-Gating
We propose this technique to tackle the cases in which one or two operands do not
change during the vector instruction. Table III lists the types of instructions during which
5. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Fig. 2. Simplified block diagram of a one-lane, four-stage VFU with all clock-gating techniques applied
(AllCG technique). Input signals for the baseline without clock-gating are multiplicands (A and B),
addend (C), rounding mode, and operation sign (opsign), while output signals are result (Out) and
exception flags.
Table I
6. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Types of instructions where ScalarCG AND ImplCG
scalar operand clock-gating (ScalarCG) is active. As only one of all the supported vector
instructions has all three vector operands, often at least one operand is scalar. Only the
FPFMAV instruction, in which all operands are vectors, does not benefit from this
technique.
During these instructions, the corresponding input register(s) of scalar operand(s) should
latch a new value only on the first clock edge of the execution of the instruction, while
during the rest of the instruction, they can be clock-gated. To implement this, we
introduce the signals VS[2..0] (Fig. 2), where VS[i] = 0 means that the ith operand is
gated after the mentioned first cycle. VS signals are derived from the instruction
OPCODE. Deriving VS signals from the OPCODE is done before the first pipeline stage
(as shown in Fig. 2). This generation (decoding) requires regular comparators, and they
are not on the critical path as the OPCODE is available at least one cycle in advance.
Table I shows corresponding VS signals for all the instructions.
Implicit Scalar Operand Clock-Gating (ImplCG)
7. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
This technique is an additional optimization of ScalarCG and aims to exploit further the
information given through the instruction OPCODE for clock-gating, operand isolation,
and computation bypassing. In the case of addition and subtraction instructions, such as
FPADDV and FPSUBV, the 53 × 53 mantissa multiplier is not needed as it is known that
one of the multiplicands is ―1,‖ and thus, we can bypass, isolate, and clock-gate it
providing the value of the other multiplicand directly to the adder. There is an analogous
situation for FPMULV since the addend is known to be ―0.‖ In this case, the 162-bit wide
adder, leading zero anticipation, and the aligning part are not needed.
To control bypassing, isolation, and clock-gating of the mentioned sub modules, we
introduce signal INSTYP (see Fig. 2 and Table I), generated from the instruction
OPCODE, which indicates whether an FPFMAV or an FPADDV/FPSUBV/FPMULV
instruction is executed. INSTYP together with VS signals provide information of the
instruction type. For example, INSTYP = 1, VS[0] = 1, and VS[2] = 0 indicate that we
have an FPMULV instruction while INSTYP = 1, and VS[1] = 0 indicates that we have
an FPADDV/FPSUBV instruction. Fig. 3 shows the simplified block diagram of gated
FMA sub modules when the aforementioned instructions are executed. Circuitry added
for implementing ImplCG mostly consists of clock-gating cells and MUXs.
In the context of instruction-dependent techniques, there is interesting research done in
the past for scalar processors [8]. The main advantages of our ImplCG proposal over the
mentioned research are: 1) we apply the technique for a variable number of pipeline
stages; 2) we evaluate power, timing, and area; and 3) we propose the technique for
vector processors. The advantage of applying this technique on a vector processor over
other models (e.g., scalar) is that vector instructions last many cycles, so the state of the
related hardware (clock-gating logic and MUXs) maintains the same
8. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Fig. 3. Gated FMA sub modules during FPMULV (dark gray) and FPADDV (light gray) instructions in
case of ImplCG.
state during the whole instruction. Thus, there will be less switching overhead than in the
scalar case.
Vector Masking and Vector Multilane-Aware Clock-Gating (MaskCG)
Here, we target cases in which there are idle cycles during the vector mask instructions
(e.g., FPFMAV_MASK). Common cases in which vector mask control is used are: 1)
sparse matrix operations, and 2) conditional statements inside a vectorized loop.
Additionally, we assume that the same mechanism is also used to reduce the EVL to less
than the MVL . We assume that the control logic will detect and optimize this case,
skiping the last elements of the vector corresponding to the trailing 0s of the mask.
However, in vector designs with nL lanes, there will still be mod(EVL, nL ) idle lanes in
the last cycle of the operation. The VMR directly controls the clock-gating of the whole
arithmetic unit during these idle cycles (see Fig. 2).
9. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Regarding the internal implementation, we perform clock-gating at pipeline stage
granularity [25], so we prevent useless cycles inside the unit, i.e., the data are latched in
subsequent stages only if necessary. Once the Enable signal of the first pipeline stage’s
register gets the value ―1,‖ this Enable signal propagates to the end of the pipeline, one
stage per cycle (see Fig. 2). In other words, the Enable signal of the nth stage is actually
the first stage’s Enable signal delayed by n−1 cycles. This is implemented by adding a 1-
bit-wide, (nS−1)-long shift register that drives clock-gating cells. To the best of our
knowledge, there is no related work that aims to exploit vector conditional execution with
VMR to lower the power of vector processors.
Input Data Aware Clock-Gating (InputCG)
Here, we identify the scenarios in which, depending on the input data, a part of mantissa
processing is not needed for the correct result and, thus, can be bypassed. We use a
recoded format for internal representation that allows us to detect special cases (explained
in Section III-A) and zeros with an negligible hardware overhead; it requires inspection
of only three most significant bits of the exponent (fourth column in Table II). Table II
presents the identified scenarios (conditions) in which a hardware block of mantissa
TABLE II
InputCG—Conditions under which a hardware block of mantissa arithmetic computations and
corresponding input registers can be bypassed
10. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
arithmetic computations and the corresponding input registers can be bypassed, isolated,
and clock-gated. The recoded format allows detection of relevant scenarios by using
simple 3-bit comparators. They are located at the inputs of VFU (A, B, and C processing
block on Fig. 2). In that way, we assure that the mentioned detection comparators are not
on the VFU’s critical path, i.e., gating information is available in time. The added internal
hardware is similar as for ImplCG. Having zero addend is analog to FPMULV instruction
case (see Fig. 3). Zero multiplicand allows gating and bypassing all the modules from
Fig. 3 except the registers that hold operand C value as in that case the final result is
operand C. In the case of NaN and infinity, there is no need for any computation as the
result that has to be at the VFU output is already known (explained in Section III-A), so
we can gate/bypass/isolate vast majority of FMA sub modules. There are many
workloads whose data contain a lot of zero values, thus can fairly benefit from the last
two sub techniques presented in Table II. Although these techniques are applicable to
other architectures as well, their application to vector processors is more efficient since
the recurrent values are common within the vector data, thus lowering the switching
overhead in added hardware (clock gating logic and MUXs). While both ImplCG and
InputCG techniques aim to exploit cases when the addend is ―0,‖ in this case, there is no
external information of ―0‖ existence via VS signals, but it has to be detected, and the
gating has to be done on time. As in the case of ImplCG, the research done in presents a
related data-driven technique for scalar processors. The main advantages (that enable
additional savings) of our InputCG technique over the mentioned research are: 1)
detection of zero operands and 2) gating the mantissa multiplier when processing NaNs.
Advantages:
High accuracy
Improved performance
Power saving is efficient
11. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
References:
[1] I. Ratkovi´c, O. Palomar, M. Stani´c, O. Unsal, A. Cristal, and M. Valero, ―A fully parameterizable
low power design of vector fused multiply-add using active clock-gating techniques,‖ in Proc. Int.
Symp. Low Power Electron. Design (ISLPED), 2016, pp. 362–367.
[2] K. Asanovi´c, ―Vector microprocessor,‖ Ph.D. dissertation, Dept. Comput. Sci., Univ. California,
Berkeley, Berkeley, CA, USA, 1998.
[3] Y. Lee et al., ―Exploring the tradeoffs between programmability and efficiency in data-parallel
accelerators,‖ in Proc. 38th Annu. Int. Symp. Comput. Archit. (ISCA), 2011, pp. 129–140.
[4] B. Zimmer et al., ―A RISC-V vector processor with simultaneousswitching switched-capacitor DC–
DC converters in 28 nm FDSOI,‖ IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 930–942, Apr. 2016.
[5] R. Espasa, M. Valero, and J. E. Smith, ―Vector architectures: Past, present and future,‖ in Proc. 12th
Int. Conf. Supercomput. (SC), 1998, pp. 425–432.
[6] H. Li, S. Bhunia, Y. Chen, T. N. Vijaykumar, and K. Roy, ―Deterministic clock gating for
microprocessor power reduction,‖ in Proc. 9th Int. Symp. High-Perform. Comput. Archit. (ISCA), Feb.
2003, pp. 113–122.
[7] N. Mohyuddin, K. Patel, and M. Pedram, ―Deterministic clock gating to eliminate wasteful activity
due to wrong-path instructions in outof-order superscalar processors,‖ in Proc. IEEE Int. Conf. Comput.
Design (ICCD), Oct. 2009, pp. 166–172.
[8] J. Preiss, M. Boersma, and S. M. Mueller, ―Advanced clockgating schemes for fused-multiply-add-
type floating-point units,‖ in Proc. 19th IEEE Symp. Comput. Arithmetic (ARITH), Jun. 2009, pp. 48–
56.
[9] I. Ratkovi´c, O. Palomar, M. Stani´c, O. S. Ünsal, A. Cristal, and M. Valero, ―On the selection of
adder unit in energy efficient vector processing,‖ in Proc. 14th Int. Symp. Quality Electron. Design
(ISQED), Mar. 2013, pp. 143–150.
12. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
[10] I. Ratkovi´c et al., ―Joint circuit-system design space exploration of multiplier unit structure for
energy-efficient vector processors,‖ in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul.
2015, pp. 19–26.
[11] M. Stanic, O. Palomar, I. Ratkovi´c, M. Duric, O. Unsal, and A. Cristal, ―VALib and SimpleVector:
Tools for rapid initial research on vector architectures,‖ in Proc. 11th ACM Conf. Comput. Frontiers
(CS), 2014, p. 7.