SlideShare a Scribd company logo
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Vector Processing-Aware Advanced Clock-Gating Techniques for Low-
Power Fused Multiply-Add
Abstract:
The need for power efficiency is driving a rethink of design decisions in processor
architectures. While vector processors succeeded in the high-performance market in the
past, they need a retailoring for the mobile market that they are entering now. Floating-
point (FP) fused multiply-add (FMA), being a functional unit with high power
consumption, deserves special attention. Although clock gating is a well-known method
to reduce switching power in synchronous designs, there are unexplored opportunities for
its application to vector processors, especially when considering active operating mode.
In this research, we comprehensively identify, propose, and evaluate the most suitable
clock-gating techniques for vector FMA units (VFUs). These techniques ensure power
savings without jeopardizing the timing. We evaluate the proposed techniques using both
synthetic and ―real-world‖ application-based benchmarking. Using vector masking and
vector multilane-aware clock gating, we report power reductions of up to 52%, assuming
active VFU operating at the peak performance. Among other findings, we observe that
vector instruction-based clock-gating techniques achieve power savings for all vector FP
instructions. Finally, when evaluating all techniques together, using ―real-world‖
benchmarking, the power reductions are up to 80%. Additionally, in accordance with
processor design trends, we perform this research in a fully parameterizable and
automated fashion.
Software Implementation:
 Modelsim
 Xilinx 14.2
Existing System:
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Power and energy efficiency have become the dominant limiting factor to processor
performance and have increased significantly processor design complexity, especially
when considering the mobile market. Being able to exploit high degrees of data-level
parallelism (DLP) at low cost in a power- and energy-efficient way, vector processors are
an attractive architectural-level solution. Undoubtedly, the design goals for mobile vector
processors clearly differ from the performance-driven designs of traditional vector
machines. Therefore, mobile vector processors require a redesign of their functional units
(FUs) in a power-efficient manner.
Clock gating is a common method to reduce switching power in synchronous pipelines. It
is practically a standard in low-power design. The goal is to ―gate‖ the clock of any
component whenever it does not perform useful work. In that way, the power spent in the
associated clock tree, registers, and the logic between the registers is reduced. It is the
most efficient power reduction technique for active operating mode.1 Therefore, the
conditions under which clock gating can be applied should be extensively studied and
identified. A widely used approach is to clock gate a whole FU when it is idle. A
complementary and more challenging approach is clock gating the FU or its sub blocks
when it is active, i.e., operating at peak performance. Furthermore, there are
characteristics of vector processors that provide Since fused multiply-add (FMA) units
usually dissipate the most power of all FUs, their design requires special attention.
Abundant floating-point (FP) FMA is typically found in vector workloads, such as
multimedia, computer graphics, or deep learning workloads. Although in the past FMA
has been used for high performance, it recently has been included in mobile processors as
well. In contrast to high performance vector processors (e.g., NEC SX-series and
Tarantula ) that have separated units for each FP operation, mobile vector processors’
resources are limited; thus, we typically have a single unit per vector lane capable of
performing multiple FP operations rather than separate FP units. Apart from that,
additional advantages of using FMA over separate FP adder and multiplier are as follows.
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
1) Computation localization inside the same unit reduces the number of interconnections
(power and energy efficiency). 2) Higher accuracy (single, instead of two
round/normalize steps). 3) Improved performance (shorter latency).
We present three kinds of techniques: 1) novel ideas to exploit unique characteristics of
vector architectures for clock gating during active periods of execution (e.g., vector
instructions with a scalar operand or vector masking); 2) novel ideas for clock gating
during active periods of execution that are also applicable to scalar architectures but
especially beneficial to vector processors (e.g., gating internal blocks depending on the
values of input data); 3) ideas that are already used in other architectures and that we
present as its application is beneficial to vector processors, and for the sake of
completeness (e.g., idle VFU).
Regarding the second and third groups of ideas, an advantage of vector processing that
extends the applicability of clock gating is that vector instructions last many cycles, so
the state of the clock-gating and bypassing logic remains the same during the whole
instruction. As a result, power savings typically overcome the switching overhead of the
added hardware (which is often not a case in scalar processors.
To fulfill current trends in digital design that promote building generators rather than
instances, we perform this research in a fully parameterizable, scalable, and automated
manner. We developed an integrated architecture-circuit framework that consists of
several generators, simulators, and other tools, in order to join architectural-level
information (e.g., vector length or benchmark configuration) with circuit-level outputs
(e.g., VFU power and timing measurements). We implement our clock-gating techniques
and generate hardware VFU models using a fully parameterizable Chisel based FMA
generator (FM Agen) and a 40-nm low-power technology.
We discuss the related work individually for each of our clock-gating techniques together
with the description of the technique in Section IV. Besides, in the context of alternative
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
low-power techniques for FP units, interesting approaches have been proposed (caching
results that can be reused) and byte encoding (computation performed over significant
bytes). However, detailed and accurate evaluation reveals that the actual savings are often
low and with an unaffordable area overhead .
Fig. 1. Two-lane, four-stage VFU (MVL = EVL = 64) executing FPFMAV V3<-V0,V1,V2.
Disadvantages:
 Inaccuracy in operation
 Power savings is not efficient
Proposed System:
Proposed clock-gating Techniques
Scalar Operand Clock-Gating
We propose this technique to tackle the cases in which one or two operands do not
change during the vector instruction. Table III lists the types of instructions during which
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Fig. 2. Simplified block diagram of a one-lane, four-stage VFU with all clock-gating techniques applied
(AllCG technique). Input signals for the baseline without clock-gating are multiplicands (A and B),
addend (C), rounding mode, and operation sign (opsign), while output signals are result (Out) and
exception flags.
Table I
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Types of instructions where ScalarCG AND ImplCG
scalar operand clock-gating (ScalarCG) is active. As only one of all the supported vector
instructions has all three vector operands, often at least one operand is scalar. Only the
FPFMAV instruction, in which all operands are vectors, does not benefit from this
technique.
During these instructions, the corresponding input register(s) of scalar operand(s) should
latch a new value only on the first clock edge of the execution of the instruction, while
during the rest of the instruction, they can be clock-gated. To implement this, we
introduce the signals VS[2..0] (Fig. 2), where VS[i] = 0 means that the ith operand is
gated after the mentioned first cycle. VS signals are derived from the instruction
OPCODE. Deriving VS signals from the OPCODE is done before the first pipeline stage
(as shown in Fig. 2). This generation (decoding) requires regular comparators, and they
are not on the critical path as the OPCODE is available at least one cycle in advance.
Table I shows corresponding VS signals for all the instructions.
Implicit Scalar Operand Clock-Gating (ImplCG)
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
This technique is an additional optimization of ScalarCG and aims to exploit further the
information given through the instruction OPCODE for clock-gating, operand isolation,
and computation bypassing. In the case of addition and subtraction instructions, such as
FPADDV and FPSUBV, the 53 × 53 mantissa multiplier is not needed as it is known that
one of the multiplicands is ―1,‖ and thus, we can bypass, isolate, and clock-gate it
providing the value of the other multiplicand directly to the adder. There is an analogous
situation for FPMULV since the addend is known to be ―0.‖ In this case, the 162-bit wide
adder, leading zero anticipation, and the aligning part are not needed.
To control bypassing, isolation, and clock-gating of the mentioned sub modules, we
introduce signal INSTYP (see Fig. 2 and Table I), generated from the instruction
OPCODE, which indicates whether an FPFMAV or an FPADDV/FPSUBV/FPMULV
instruction is executed. INSTYP together with VS signals provide information of the
instruction type. For example, INSTYP = 1, VS[0] = 1, and VS[2] = 0 indicate that we
have an FPMULV instruction while INSTYP = 1, and VS[1] = 0 indicates that we have
an FPADDV/FPSUBV instruction. Fig. 3 shows the simplified block diagram of gated
FMA sub modules when the aforementioned instructions are executed. Circuitry added
for implementing ImplCG mostly consists of clock-gating cells and MUXs.
In the context of instruction-dependent techniques, there is interesting research done in
the past for scalar processors [8]. The main advantages of our ImplCG proposal over the
mentioned research are: 1) we apply the technique for a variable number of pipeline
stages; 2) we evaluate power, timing, and area; and 3) we propose the technique for
vector processors. The advantage of applying this technique on a vector processor over
other models (e.g., scalar) is that vector instructions last many cycles, so the state of the
related hardware (clock-gating logic and MUXs) maintains the same
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Fig. 3. Gated FMA sub modules during FPMULV (dark gray) and FPADDV (light gray) instructions in
case of ImplCG.
state during the whole instruction. Thus, there will be less switching overhead than in the
scalar case.
Vector Masking and Vector Multilane-Aware Clock-Gating (MaskCG)
Here, we target cases in which there are idle cycles during the vector mask instructions
(e.g., FPFMAV_MASK). Common cases in which vector mask control is used are: 1)
sparse matrix operations, and 2) conditional statements inside a vectorized loop.
Additionally, we assume that the same mechanism is also used to reduce the EVL to less
than the MVL . We assume that the control logic will detect and optimize this case,
skiping the last elements of the vector corresponding to the trailing 0s of the mask.
However, in vector designs with nL lanes, there will still be mod(EVL, nL ) idle lanes in
the last cycle of the operation. The VMR directly controls the clock-gating of the whole
arithmetic unit during these idle cycles (see Fig. 2).
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Regarding the internal implementation, we perform clock-gating at pipeline stage
granularity [25], so we prevent useless cycles inside the unit, i.e., the data are latched in
subsequent stages only if necessary. Once the Enable signal of the first pipeline stage’s
register gets the value ―1,‖ this Enable signal propagates to the end of the pipeline, one
stage per cycle (see Fig. 2). In other words, the Enable signal of the nth stage is actually
the first stage’s Enable signal delayed by n−1 cycles. This is implemented by adding a 1-
bit-wide, (nS−1)-long shift register that drives clock-gating cells. To the best of our
knowledge, there is no related work that aims to exploit vector conditional execution with
VMR to lower the power of vector processors.
Input Data Aware Clock-Gating (InputCG)
Here, we identify the scenarios in which, depending on the input data, a part of mantissa
processing is not needed for the correct result and, thus, can be bypassed. We use a
recoded format for internal representation that allows us to detect special cases (explained
in Section III-A) and zeros with an negligible hardware overhead; it requires inspection
of only three most significant bits of the exponent (fourth column in Table II). Table II
presents the identified scenarios (conditions) in which a hardware block of mantissa
TABLE II
InputCG—Conditions under which a hardware block of mantissa arithmetic computations and
corresponding input registers can be bypassed
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
arithmetic computations and the corresponding input registers can be bypassed, isolated,
and clock-gated. The recoded format allows detection of relevant scenarios by using
simple 3-bit comparators. They are located at the inputs of VFU (A, B, and C processing
block on Fig. 2). In that way, we assure that the mentioned detection comparators are not
on the VFU’s critical path, i.e., gating information is available in time. The added internal
hardware is similar as for ImplCG. Having zero addend is analog to FPMULV instruction
case (see Fig. 3). Zero multiplicand allows gating and bypassing all the modules from
Fig. 3 except the registers that hold operand C value as in that case the final result is
operand C. In the case of NaN and infinity, there is no need for any computation as the
result that has to be at the VFU output is already known (explained in Section III-A), so
we can gate/bypass/isolate vast majority of FMA sub modules. There are many
workloads whose data contain a lot of zero values, thus can fairly benefit from the last
two sub techniques presented in Table II. Although these techniques are applicable to
other architectures as well, their application to vector processors is more efficient since
the recurrent values are common within the vector data, thus lowering the switching
overhead in added hardware (clock gating logic and MUXs). While both ImplCG and
InputCG techniques aim to exploit cases when the addend is ―0,‖ in this case, there is no
external information of ―0‖ existence via VS signals, but it has to be detected, and the
gating has to be done on time. As in the case of ImplCG, the research done in presents a
related data-driven technique for scalar processors. The main advantages (that enable
additional savings) of our InputCG technique over the mentioned research are: 1)
detection of zero operands and 2) gating the mantissa multiplier when processing NaNs.
Advantages:
 High accuracy
 Improved performance
 Power saving is efficient
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
References:
[1] I. Ratkovi´c, O. Palomar, M. Stani´c, O. Unsal, A. Cristal, and M. Valero, ―A fully parameterizable
low power design of vector fused multiply-add using active clock-gating techniques,‖ in Proc. Int.
Symp. Low Power Electron. Design (ISLPED), 2016, pp. 362–367.
[2] K. Asanovi´c, ―Vector microprocessor,‖ Ph.D. dissertation, Dept. Comput. Sci., Univ. California,
Berkeley, Berkeley, CA, USA, 1998.
[3] Y. Lee et al., ―Exploring the tradeoffs between programmability and efficiency in data-parallel
accelerators,‖ in Proc. 38th Annu. Int. Symp. Comput. Archit. (ISCA), 2011, pp. 129–140.
[4] B. Zimmer et al., ―A RISC-V vector processor with simultaneousswitching switched-capacitor DC–
DC converters in 28 nm FDSOI,‖ IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 930–942, Apr. 2016.
[5] R. Espasa, M. Valero, and J. E. Smith, ―Vector architectures: Past, present and future,‖ in Proc. 12th
Int. Conf. Supercomput. (SC), 1998, pp. 425–432.
[6] H. Li, S. Bhunia, Y. Chen, T. N. Vijaykumar, and K. Roy, ―Deterministic clock gating for
microprocessor power reduction,‖ in Proc. 9th Int. Symp. High-Perform. Comput. Archit. (ISCA), Feb.
2003, pp. 113–122.
[7] N. Mohyuddin, K. Patel, and M. Pedram, ―Deterministic clock gating to eliminate wasteful activity
due to wrong-path instructions in outof-order superscalar processors,‖ in Proc. IEEE Int. Conf. Comput.
Design (ICCD), Oct. 2009, pp. 166–172.
[8] J. Preiss, M. Boersma, and S. M. Mueller, ―Advanced clockgating schemes for fused-multiply-add-
type floating-point units,‖ in Proc. 19th IEEE Symp. Comput. Arithmetic (ARITH), Jun. 2009, pp. 48–
56.
[9] I. Ratkovi´c, O. Palomar, M. Stani´c, O. S. Ünsal, A. Cristal, and M. Valero, ―On the selection of
adder unit in energy efficient vector processing,‖ in Proc. 14th Int. Symp. Quality Electron. Design
(ISQED), Mar. 2013, pp. 143–150.
NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
[10] I. Ratkovi´c et al., ―Joint circuit-system design space exploration of multiplier unit structure for
energy-efficient vector processors,‖ in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul.
2015, pp. 19–26.
[11] M. Stanic, O. Palomar, I. Ratkovi´c, M. Duric, O. Unsal, and A. Cristal, ―VALib and SimpleVector:
Tools for rapid initial research on vector architectures,‖ in Proc. 11th ACM Conf. Comput. Frontiers
(CS), 2014, p. 7.

More Related Content

Similar to Vector processing aware advanced clock-gating techniques for low-power fused multiply-add

Analysis and design of cost effective, high-throughput ldpc decoders
Analysis and design of cost effective, high-throughput ldpc decodersAnalysis and design of cost effective, high-throughput ldpc decoders
Analysis and design of cost effective, high-throughput ldpc decoders
Nxfee Innovation
 
Efficient fpga mapping of pipeline sdf fft cores
Efficient fpga mapping of pipeline sdf fft coresEfficient fpga mapping of pipeline sdf fft cores
Efficient fpga mapping of pipeline sdf fft cores
Nxfee Innovation
 
Approximate hybrid high radix encoding for energy efficient inexact multipliers
Approximate hybrid high radix encoding for energy efficient inexact multipliersApproximate hybrid high radix encoding for energy efficient inexact multipliers
Approximate hybrid high radix encoding for energy efficient inexact multipliers
Nxfee Innovation
 
Feedback based low-power soft-error-tolerant design for dual-modular redundancy
Feedback based low-power soft-error-tolerant design for dual-modular redundancyFeedback based low-power soft-error-tolerant design for dual-modular redundancy
Feedback based low-power soft-error-tolerant design for dual-modular redundancy
Nxfee Innovation
 
A reconfigurable ldpc decoder optimized applications
A reconfigurable ldpc decoder optimized applicationsA reconfigurable ldpc decoder optimized applications
A reconfigurable ldpc decoder optimized applications
Nxfee Innovation
 
Deep learning enhanced digital twin for Closed-loop In-Process Quality Improv...
Deep learning enhanced digital twin for Closed-loop In-Process Quality Improv...Deep learning enhanced digital twin for Closed-loop In-Process Quality Improv...
Deep learning enhanced digital twin for Closed-loop In-Process Quality Improv...
WMG centre High Value Manufacturing Catapult
 
An efficient fault tolerance design for integer parallel matrix vector
An efficient fault tolerance design for integer parallel matrix vectorAn efficient fault tolerance design for integer parallel matrix vector
An efficient fault tolerance design for integer parallel matrix vector
Nxfee Innovation
 
Why Network Functions Virtualization sdn?
Why Network Functions Virtualization sdn?Why Network Functions Virtualization sdn?
Why Network Functions Virtualization sdn?
idrajeev
 
Fpga based 128 bit customised vliw processor for executing dual scalarvector ...
Fpga based 128 bit customised vliw processor for executing dual scalarvector ...Fpga based 128 bit customised vliw processor for executing dual scalarvector ...
Fpga based 128 bit customised vliw processor for executing dual scalarvector ...
eSAT Publishing House
 
IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...
IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...
IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...
IRJET Journal
 
Cloudify: Open vCPE Design Concepts and Multi-Cloud Orchestration
Cloudify: Open vCPE Design Concepts and Multi-Cloud OrchestrationCloudify: Open vCPE Design Concepts and Multi-Cloud Orchestration
Cloudify: Open vCPE Design Concepts and Multi-Cloud Orchestration
Cloudify Community
 
Securing the present block cipher against combined side channel analysis and ...
Securing the present block cipher against combined side channel analysis and ...Securing the present block cipher against combined side channel analysis and ...
Securing the present block cipher against combined side channel analysis and ...
Nxfee Innovation
 
Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...
Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...
Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...
AM Publications
 
VMworld 2013: Architecting the Software-Defined Data Center
VMworld 2013: Architecting the Software-Defined Data Center VMworld 2013: Architecting the Software-Defined Data Center
VMworld 2013: Architecting the Software-Defined Data Center
VMworld
 
NaveadKazi_resume
NaveadKazi_resumeNaveadKazi_resume
NaveadKazi_resumeNavead Kazi
 
Elecworks - electrical and automation CAD software - ECAD
Elecworks - electrical and automation CAD software - ECADElecworks - electrical and automation CAD software - ECAD
Elecworks - electrical and automation CAD software - ECAD
Guillem Fiter
 

Similar to Vector processing aware advanced clock-gating techniques for low-power fused multiply-add (20)

Analysis and design of cost effective, high-throughput ldpc decoders
Analysis and design of cost effective, high-throughput ldpc decodersAnalysis and design of cost effective, high-throughput ldpc decoders
Analysis and design of cost effective, high-throughput ldpc decoders
 
Efficient fpga mapping of pipeline sdf fft cores
Efficient fpga mapping of pipeline sdf fft coresEfficient fpga mapping of pipeline sdf fft cores
Efficient fpga mapping of pipeline sdf fft cores
 
Approximate hybrid high radix encoding for energy efficient inexact multipliers
Approximate hybrid high radix encoding for energy efficient inexact multipliersApproximate hybrid high radix encoding for energy efficient inexact multipliers
Approximate hybrid high radix encoding for energy efficient inexact multipliers
 
Feedback based low-power soft-error-tolerant design for dual-modular redundancy
Feedback based low-power soft-error-tolerant design for dual-modular redundancyFeedback based low-power soft-error-tolerant design for dual-modular redundancy
Feedback based low-power soft-error-tolerant design for dual-modular redundancy
 
Md CV
Md CVMd CV
Md CV
 
Md CV
Md CVMd CV
Md CV
 
A reconfigurable ldpc decoder optimized applications
A reconfigurable ldpc decoder optimized applicationsA reconfigurable ldpc decoder optimized applications
A reconfigurable ldpc decoder optimized applications
 
Deep learning enhanced digital twin for Closed-loop In-Process Quality Improv...
Deep learning enhanced digital twin for Closed-loop In-Process Quality Improv...Deep learning enhanced digital twin for Closed-loop In-Process Quality Improv...
Deep learning enhanced digital twin for Closed-loop In-Process Quality Improv...
 
An efficient fault tolerance design for integer parallel matrix vector
An efficient fault tolerance design for integer parallel matrix vectorAn efficient fault tolerance design for integer parallel matrix vector
An efficient fault tolerance design for integer parallel matrix vector
 
Why Network Functions Virtualization sdn?
Why Network Functions Virtualization sdn?Why Network Functions Virtualization sdn?
Why Network Functions Virtualization sdn?
 
Fpga based 128 bit customised vliw processor for executing dual scalarvector ...
Fpga based 128 bit customised vliw processor for executing dual scalarvector ...Fpga based 128 bit customised vliw processor for executing dual scalarvector ...
Fpga based 128 bit customised vliw processor for executing dual scalarvector ...
 
Vinay Resume
Vinay ResumeVinay Resume
Vinay Resume
 
IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...
IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...
IRJET-To Implement Cloud Computing by using Agile Methodology in Indian E-Gov...
 
Cloudify: Open vCPE Design Concepts and Multi-Cloud Orchestration
Cloudify: Open vCPE Design Concepts and Multi-Cloud OrchestrationCloudify: Open vCPE Design Concepts and Multi-Cloud Orchestration
Cloudify: Open vCPE Design Concepts and Multi-Cloud Orchestration
 
Securing the present block cipher against combined side channel analysis and ...
Securing the present block cipher against combined side channel analysis and ...Securing the present block cipher against combined side channel analysis and ...
Securing the present block cipher against combined side channel analysis and ...
 
Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...
Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...
Optimize Utility in Computing-Based Manufacturing Systems Using Service Model...
 
VMworld 2013: Architecting the Software-Defined Data Center
VMworld 2013: Architecting the Software-Defined Data Center VMworld 2013: Architecting the Software-Defined Data Center
VMworld 2013: Architecting the Software-Defined Data Center
 
NaveadKazi_resume
NaveadKazi_resumeNaveadKazi_resume
NaveadKazi_resume
 
MR_Resume
MR_ResumeMR_Resume
MR_Resume
 
Elecworks - electrical and automation CAD software - ECAD
Elecworks - electrical and automation CAD software - ECADElecworks - electrical and automation CAD software - ECAD
Elecworks - electrical and automation CAD software - ECAD
 

More from Nxfee Innovation

VLSI IEEE Transaction 2018 - IEEE Transaction
VLSI IEEE Transaction 2018 - IEEE Transaction VLSI IEEE Transaction 2018 - IEEE Transaction
VLSI IEEE Transaction 2018 - IEEE Transaction
Nxfee Innovation
 
Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...
Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...
Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...
Nxfee Innovation
 
The implementation of the improved omp for aic reconstruction based on parall...
The implementation of the improved omp for aic reconstruction based on parall...The implementation of the improved omp for aic reconstruction based on parall...
The implementation of the improved omp for aic reconstruction based on parall...
Nxfee Innovation
 
Low complexity methodology for complex square-root computation
Low complexity methodology for complex square-root computationLow complexity methodology for complex square-root computation
Low complexity methodology for complex square-root computation
Nxfee Innovation
 
Combating data leakage trojans in commercial and asic applications with time ...
Combating data leakage trojans in commercial and asic applications with time ...Combating data leakage trojans in commercial and asic applications with time ...
Combating data leakage trojans in commercial and asic applications with time ...
Nxfee Innovation
 
Approximate sum of-products designs based on distributed arithmetic
Approximate sum of-products designs based on distributed arithmeticApproximate sum of-products designs based on distributed arithmetic
Approximate sum of-products designs based on distributed arithmetic
Nxfee Innovation
 
Algorithm and vlsi architecture design of proportionate type lms adaptive fil...
Algorithm and vlsi architecture design of proportionate type lms adaptive fil...Algorithm and vlsi architecture design of proportionate type lms adaptive fil...
Algorithm and vlsi architecture design of proportionate type lms adaptive fil...
Nxfee Innovation
 
A flexible wildcard pattern matching accelerator via simultaneous discrete fi...
A flexible wildcard pattern matching accelerator via simultaneous discrete fi...A flexible wildcard pattern matching accelerator via simultaneous discrete fi...
A flexible wildcard pattern matching accelerator via simultaneous discrete fi...
Nxfee Innovation
 
A closed form expression for minimum operating voltage of cmos d flip-flop
A closed form expression for minimum operating voltage of cmos d flip-flopA closed form expression for minimum operating voltage of cmos d flip-flop
A closed form expression for minimum operating voltage of cmos d flip-flop
Nxfee Innovation
 
A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra...
A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra...A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra...
A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra...
Nxfee Innovation
 
A 12 bit 40-ms s sar adc with a fast-binary-window dac switching scheme
A 12 bit 40-ms s sar adc with a fast-binary-window dac switching schemeA 12 bit 40-ms s sar adc with a fast-binary-window dac switching scheme
A 12 bit 40-ms s sar adc with a fast-binary-window dac switching scheme
Nxfee Innovation
 
Nxfee Innovation Brochure
Nxfee Innovation BrochureNxfee Innovation Brochure
Nxfee Innovation Brochure
Nxfee Innovation
 

More from Nxfee Innovation (12)

VLSI IEEE Transaction 2018 - IEEE Transaction
VLSI IEEE Transaction 2018 - IEEE Transaction VLSI IEEE Transaction 2018 - IEEE Transaction
VLSI IEEE Transaction 2018 - IEEE Transaction
 
Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...
Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...
Noise insensitive pll using a gate-voltage-boosted source-follower regulator ...
 
The implementation of the improved omp for aic reconstruction based on parall...
The implementation of the improved omp for aic reconstruction based on parall...The implementation of the improved omp for aic reconstruction based on parall...
The implementation of the improved omp for aic reconstruction based on parall...
 
Low complexity methodology for complex square-root computation
Low complexity methodology for complex square-root computationLow complexity methodology for complex square-root computation
Low complexity methodology for complex square-root computation
 
Combating data leakage trojans in commercial and asic applications with time ...
Combating data leakage trojans in commercial and asic applications with time ...Combating data leakage trojans in commercial and asic applications with time ...
Combating data leakage trojans in commercial and asic applications with time ...
 
Approximate sum of-products designs based on distributed arithmetic
Approximate sum of-products designs based on distributed arithmeticApproximate sum of-products designs based on distributed arithmetic
Approximate sum of-products designs based on distributed arithmetic
 
Algorithm and vlsi architecture design of proportionate type lms adaptive fil...
Algorithm and vlsi architecture design of proportionate type lms adaptive fil...Algorithm and vlsi architecture design of proportionate type lms adaptive fil...
Algorithm and vlsi architecture design of proportionate type lms adaptive fil...
 
A flexible wildcard pattern matching accelerator via simultaneous discrete fi...
A flexible wildcard pattern matching accelerator via simultaneous discrete fi...A flexible wildcard pattern matching accelerator via simultaneous discrete fi...
A flexible wildcard pattern matching accelerator via simultaneous discrete fi...
 
A closed form expression for minimum operating voltage of cmos d flip-flop
A closed form expression for minimum operating voltage of cmos d flip-flopA closed form expression for minimum operating voltage of cmos d flip-flop
A closed form expression for minimum operating voltage of cmos d flip-flop
 
A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra...
A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra...A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra...
A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra...
 
A 12 bit 40-ms s sar adc with a fast-binary-window dac switching scheme
A 12 bit 40-ms s sar adc with a fast-binary-window dac switching schemeA 12 bit 40-ms s sar adc with a fast-binary-window dac switching scheme
A 12 bit 40-ms s sar adc with a fast-binary-window dac switching scheme
 
Nxfee Innovation Brochure
Nxfee Innovation BrochureNxfee Innovation Brochure
Nxfee Innovation Brochure
 

Recently uploaded

Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
BrazilAccount1
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 

Recently uploaded (20)

Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 

Vector processing aware advanced clock-gating techniques for low-power fused multiply-add

  • 1. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ Vector Processing-Aware Advanced Clock-Gating Techniques for Low- Power Fused Multiply-Add Abstract: The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating- point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and ―real-world‖ application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using ―real-world‖ benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion. Software Implementation:  Modelsim  Xilinx 14.2 Existing System:
  • 2. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ Power and energy efficiency have become the dominant limiting factor to processor performance and have increased significantly processor design complexity, especially when considering the mobile market. Being able to exploit high degrees of data-level parallelism (DLP) at low cost in a power- and energy-efficient way, vector processors are an attractive architectural-level solution. Undoubtedly, the design goals for mobile vector processors clearly differ from the performance-driven designs of traditional vector machines. Therefore, mobile vector processors require a redesign of their functional units (FUs) in a power-efficient manner. Clock gating is a common method to reduce switching power in synchronous pipelines. It is practically a standard in low-power design. The goal is to ―gate‖ the clock of any component whenever it does not perform useful work. In that way, the power spent in the associated clock tree, registers, and the logic between the registers is reduced. It is the most efficient power reduction technique for active operating mode.1 Therefore, the conditions under which clock gating can be applied should be extensively studied and identified. A widely used approach is to clock gate a whole FU when it is idle. A complementary and more challenging approach is clock gating the FU or its sub blocks when it is active, i.e., operating at peak performance. Furthermore, there are characteristics of vector processors that provide Since fused multiply-add (FMA) units usually dissipate the most power of all FUs, their design requires special attention. Abundant floating-point (FP) FMA is typically found in vector workloads, such as multimedia, computer graphics, or deep learning workloads. Although in the past FMA has been used for high performance, it recently has been included in mobile processors as well. In contrast to high performance vector processors (e.g., NEC SX-series and Tarantula ) that have separated units for each FP operation, mobile vector processors’ resources are limited; thus, we typically have a single unit per vector lane capable of performing multiple FP operations rather than separate FP units. Apart from that, additional advantages of using FMA over separate FP adder and multiplier are as follows.
  • 3. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ 1) Computation localization inside the same unit reduces the number of interconnections (power and energy efficiency). 2) Higher accuracy (single, instead of two round/normalize steps). 3) Improved performance (shorter latency). We present three kinds of techniques: 1) novel ideas to exploit unique characteristics of vector architectures for clock gating during active periods of execution (e.g., vector instructions with a scalar operand or vector masking); 2) novel ideas for clock gating during active periods of execution that are also applicable to scalar architectures but especially beneficial to vector processors (e.g., gating internal blocks depending on the values of input data); 3) ideas that are already used in other architectures and that we present as its application is beneficial to vector processors, and for the sake of completeness (e.g., idle VFU). Regarding the second and third groups of ideas, an advantage of vector processing that extends the applicability of clock gating is that vector instructions last many cycles, so the state of the clock-gating and bypassing logic remains the same during the whole instruction. As a result, power savings typically overcome the switching overhead of the added hardware (which is often not a case in scalar processors. To fulfill current trends in digital design that promote building generators rather than instances, we perform this research in a fully parameterizable, scalable, and automated manner. We developed an integrated architecture-circuit framework that consists of several generators, simulators, and other tools, in order to join architectural-level information (e.g., vector length or benchmark configuration) with circuit-level outputs (e.g., VFU power and timing measurements). We implement our clock-gating techniques and generate hardware VFU models using a fully parameterizable Chisel based FMA generator (FM Agen) and a 40-nm low-power technology. We discuss the related work individually for each of our clock-gating techniques together with the description of the technique in Section IV. Besides, in the context of alternative
  • 4. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ low-power techniques for FP units, interesting approaches have been proposed (caching results that can be reused) and byte encoding (computation performed over significant bytes). However, detailed and accurate evaluation reveals that the actual savings are often low and with an unaffordable area overhead . Fig. 1. Two-lane, four-stage VFU (MVL = EVL = 64) executing FPFMAV V3<-V0,V1,V2. Disadvantages:  Inaccuracy in operation  Power savings is not efficient Proposed System: Proposed clock-gating Techniques Scalar Operand Clock-Gating We propose this technique to tackle the cases in which one or two operands do not change during the vector instruction. Table III lists the types of instructions during which
  • 5. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ Fig. 2. Simplified block diagram of a one-lane, four-stage VFU with all clock-gating techniques applied (AllCG technique). Input signals for the baseline without clock-gating are multiplicands (A and B), addend (C), rounding mode, and operation sign (opsign), while output signals are result (Out) and exception flags. Table I
  • 6. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ Types of instructions where ScalarCG AND ImplCG scalar operand clock-gating (ScalarCG) is active. As only one of all the supported vector instructions has all three vector operands, often at least one operand is scalar. Only the FPFMAV instruction, in which all operands are vectors, does not benefit from this technique. During these instructions, the corresponding input register(s) of scalar operand(s) should latch a new value only on the first clock edge of the execution of the instruction, while during the rest of the instruction, they can be clock-gated. To implement this, we introduce the signals VS[2..0] (Fig. 2), where VS[i] = 0 means that the ith operand is gated after the mentioned first cycle. VS signals are derived from the instruction OPCODE. Deriving VS signals from the OPCODE is done before the first pipeline stage (as shown in Fig. 2). This generation (decoding) requires regular comparators, and they are not on the critical path as the OPCODE is available at least one cycle in advance. Table I shows corresponding VS signals for all the instructions. Implicit Scalar Operand Clock-Gating (ImplCG)
  • 7. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ This technique is an additional optimization of ScalarCG and aims to exploit further the information given through the instruction OPCODE for clock-gating, operand isolation, and computation bypassing. In the case of addition and subtraction instructions, such as FPADDV and FPSUBV, the 53 × 53 mantissa multiplier is not needed as it is known that one of the multiplicands is ―1,‖ and thus, we can bypass, isolate, and clock-gate it providing the value of the other multiplicand directly to the adder. There is an analogous situation for FPMULV since the addend is known to be ―0.‖ In this case, the 162-bit wide adder, leading zero anticipation, and the aligning part are not needed. To control bypassing, isolation, and clock-gating of the mentioned sub modules, we introduce signal INSTYP (see Fig. 2 and Table I), generated from the instruction OPCODE, which indicates whether an FPFMAV or an FPADDV/FPSUBV/FPMULV instruction is executed. INSTYP together with VS signals provide information of the instruction type. For example, INSTYP = 1, VS[0] = 1, and VS[2] = 0 indicate that we have an FPMULV instruction while INSTYP = 1, and VS[1] = 0 indicates that we have an FPADDV/FPSUBV instruction. Fig. 3 shows the simplified block diagram of gated FMA sub modules when the aforementioned instructions are executed. Circuitry added for implementing ImplCG mostly consists of clock-gating cells and MUXs. In the context of instruction-dependent techniques, there is interesting research done in the past for scalar processors [8]. The main advantages of our ImplCG proposal over the mentioned research are: 1) we apply the technique for a variable number of pipeline stages; 2) we evaluate power, timing, and area; and 3) we propose the technique for vector processors. The advantage of applying this technique on a vector processor over other models (e.g., scalar) is that vector instructions last many cycles, so the state of the related hardware (clock-gating logic and MUXs) maintains the same
  • 8. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ Fig. 3. Gated FMA sub modules during FPMULV (dark gray) and FPADDV (light gray) instructions in case of ImplCG. state during the whole instruction. Thus, there will be less switching overhead than in the scalar case. Vector Masking and Vector Multilane-Aware Clock-Gating (MaskCG) Here, we target cases in which there are idle cycles during the vector mask instructions (e.g., FPFMAV_MASK). Common cases in which vector mask control is used are: 1) sparse matrix operations, and 2) conditional statements inside a vectorized loop. Additionally, we assume that the same mechanism is also used to reduce the EVL to less than the MVL . We assume that the control logic will detect and optimize this case, skiping the last elements of the vector corresponding to the trailing 0s of the mask. However, in vector designs with nL lanes, there will still be mod(EVL, nL ) idle lanes in the last cycle of the operation. The VMR directly controls the clock-gating of the whole arithmetic unit during these idle cycles (see Fig. 2).
  • 9. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ Regarding the internal implementation, we perform clock-gating at pipeline stage granularity [25], so we prevent useless cycles inside the unit, i.e., the data are latched in subsequent stages only if necessary. Once the Enable signal of the first pipeline stage’s register gets the value ―1,‖ this Enable signal propagates to the end of the pipeline, one stage per cycle (see Fig. 2). In other words, the Enable signal of the nth stage is actually the first stage’s Enable signal delayed by n−1 cycles. This is implemented by adding a 1- bit-wide, (nS−1)-long shift register that drives clock-gating cells. To the best of our knowledge, there is no related work that aims to exploit vector conditional execution with VMR to lower the power of vector processors. Input Data Aware Clock-Gating (InputCG) Here, we identify the scenarios in which, depending on the input data, a part of mantissa processing is not needed for the correct result and, thus, can be bypassed. We use a recoded format for internal representation that allows us to detect special cases (explained in Section III-A) and zeros with an negligible hardware overhead; it requires inspection of only three most significant bits of the exponent (fourth column in Table II). Table II presents the identified scenarios (conditions) in which a hardware block of mantissa TABLE II InputCG—Conditions under which a hardware block of mantissa arithmetic computations and corresponding input registers can be bypassed
  • 10. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ arithmetic computations and the corresponding input registers can be bypassed, isolated, and clock-gated. The recoded format allows detection of relevant scenarios by using simple 3-bit comparators. They are located at the inputs of VFU (A, B, and C processing block on Fig. 2). In that way, we assure that the mentioned detection comparators are not on the VFU’s critical path, i.e., gating information is available in time. The added internal hardware is similar as for ImplCG. Having zero addend is analog to FPMULV instruction case (see Fig. 3). Zero multiplicand allows gating and bypassing all the modules from Fig. 3 except the registers that hold operand C value as in that case the final result is operand C. In the case of NaN and infinity, there is no need for any computation as the result that has to be at the VFU output is already known (explained in Section III-A), so we can gate/bypass/isolate vast majority of FMA sub modules. There are many workloads whose data contain a lot of zero values, thus can fairly benefit from the last two sub techniques presented in Table II. Although these techniques are applicable to other architectures as well, their application to vector processors is more efficient since the recurrent values are common within the vector data, thus lowering the switching overhead in added hardware (clock gating logic and MUXs). While both ImplCG and InputCG techniques aim to exploit cases when the addend is ―0,‖ in this case, there is no external information of ―0‖ existence via VS signals, but it has to be detected, and the gating has to be done on time. As in the case of ImplCG, the research done in presents a related data-driven technique for scalar processors. The main advantages (that enable additional savings) of our InputCG technique over the mentioned research are: 1) detection of zero operands and 2) gating the mantissa multiplier when processing NaNs. Advantages:  High accuracy  Improved performance  Power saving is efficient
  • 11. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ References: [1] I. Ratkovi´c, O. Palomar, M. Stani´c, O. Unsal, A. Cristal, and M. Valero, ―A fully parameterizable low power design of vector fused multiply-add using active clock-gating techniques,‖ in Proc. Int. Symp. Low Power Electron. Design (ISLPED), 2016, pp. 362–367. [2] K. Asanovi´c, ―Vector microprocessor,‖ Ph.D. dissertation, Dept. Comput. Sci., Univ. California, Berkeley, Berkeley, CA, USA, 1998. [3] Y. Lee et al., ―Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators,‖ in Proc. 38th Annu. Int. Symp. Comput. Archit. (ISCA), 2011, pp. 129–140. [4] B. Zimmer et al., ―A RISC-V vector processor with simultaneousswitching switched-capacitor DC– DC converters in 28 nm FDSOI,‖ IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 930–942, Apr. 2016. [5] R. Espasa, M. Valero, and J. E. Smith, ―Vector architectures: Past, present and future,‖ in Proc. 12th Int. Conf. Supercomput. (SC), 1998, pp. 425–432. [6] H. Li, S. Bhunia, Y. Chen, T. N. Vijaykumar, and K. Roy, ―Deterministic clock gating for microprocessor power reduction,‖ in Proc. 9th Int. Symp. High-Perform. Comput. Archit. (ISCA), Feb. 2003, pp. 113–122. [7] N. Mohyuddin, K. Patel, and M. Pedram, ―Deterministic clock gating to eliminate wasteful activity due to wrong-path instructions in outof-order superscalar processors,‖ in Proc. IEEE Int. Conf. Comput. Design (ICCD), Oct. 2009, pp. 166–172. [8] J. Preiss, M. Boersma, and S. M. Mueller, ―Advanced clockgating schemes for fused-multiply-add- type floating-point units,‖ in Proc. 19th IEEE Symp. Comput. Arithmetic (ARITH), Jun. 2009, pp. 48– 56. [9] I. Ratkovi´c, O. Palomar, M. Stani´c, O. S. Ünsal, A. Cristal, and M. Valero, ―On the selection of adder unit in energy efficient vector processing,‖ in Proc. 14th Int. Symp. Quality Electron. Design (ISQED), Mar. 2013, pp. 143–150.
  • 12. NXFEE INNOVATION (SEMICONDUCTOR IP &PRODUCT DEVELOPMENT) (ISO : 9001:2015Certified Company), # 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam, Pondicherry– 605004, India. Buy Project on Online :www.nxfee.com | contact : +91 9789443203 | email : nxfee.innovation@gmail.com _________________________________________________________________ [10] I. Ratkovi´c et al., ―Joint circuit-system design space exploration of multiplier unit structure for energy-efficient vector processors,‖ in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul. 2015, pp. 19–26. [11] M. Stanic, O. Palomar, I. Ratkovi´c, M. Duric, O. Unsal, and A. Cristal, ―VALib and SimpleVector: Tools for rapid initial research on vector architectures,‖ in Proc. 11th ACM Conf. Comput. Frontiers (CS), 2014, p. 7.