Design and Implementation of Single Precision
Pipelined Floating Point Co-Processor
Manisha Sangwan
PG Student, M.Tech VLSI Design
SENSE, VIT University
Chennai, India - 600048
manishasangwan47@gmail.com
A Anita Angeline
Professor
SENSE, VIT University
Chennai, India -600048
Abstract—Floating point numbers are used in various
applications such as medical imaging, radar and
telecommunications. This paper compares various arithmetic
modules and presents the implementation of an optimized
floating point ALU. A pipelined architecture is used to increase
performance, and the design raises the operating frequency by a
factor of 1.62. The logic is designed in Verilog HDL, and
synthesis is done on Encounter by Cadence after timing and
logic simulation.
Keywords—CLA; clock cycles; GDM; HDL; IEEE 754;
pipelining; Verilog
I. INTRODUCTION
These days computers are used in many applications such as
medical imaging, radar, audio system design, signal processing,
industrial control and telecommunications. Several key factors
are considered before choosing a number system: the
computational capabilities required by the application, processor
and system cost, accuracy, complexity and performance. Over
the years, designers have moved from fixed point to floating
point arithmetic because of its wide dynamic range and its
ability to represent both very small and very large numbers. At
the same time, floating point numbers offer limited accuracy, so
a trade-off has to be made to obtain an optimized architecture.
The IEEE 754 standard for floating point numbers was
adopted almost twenty years ago. The single precision format
is a 32-bit number and the double precision format is a 64-bit
number.
The storage layout consists of three components: the sign,
the exponent and the mantissa. The mantissa includes an
implicit leading bit and the fractional part.
TABLE I. FLOATING POINT REPRESENTATION

                   Sign     Exponent     Fraction      Bias
Single Precision   1 [31]   8 [30-23]    23 [22-00]    127
Double Precision   1 [63]   11 [62-52]   52 [51-00]    1023
Sign Bit: It defines whether the number is positive or
negative. If it is 0 the number is positive; otherwise it is
negative.
Exponent: Both positive and negative values are
represented by this field. To do this, a bias is added to the
actual exponent to obtain the stored exponent [10]. For single
precision the bias is 127 and for double precision it is 1023.
Mantissa: The mantissa consists of an implicit leading bit
and the fractional part, represented in the form 1.f, where 1 is
implicit and f is the fractional part. The mantissa is also known
as the significand.
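As an illustrative sketch (software, not part of the paper's design), the single-precision layout and bias described above can be checked by unpacking a float with Python's struct module:

```python
import struct

def decompose(x):
    """Split a float into IEEE 754 single-precision sign, unbiased
    exponent and 23-bit fraction (normalized numbers only)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # stored (biased) exponent, 8 bits
    fraction = bits & 0x7FFFFF       # 23-bit fractional part of 1.f
    return sign, exponent - 127, fraction   # subtract the bias of 127

# -6.5 = -1.101b * 2^2: sign 1, unbiased exponent 2, fraction bits 101...
print(decompose(-6.5))
```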
II. IMPLEMENTATION
A. Adder and Subtractor
Algorithm
Fig. 1. Block diagram of Floating Point Adder and Subtractor
In a ripple adder, the propagation of the carry from one
block to the next consumes a lot of time. A Carry Look Ahead
(CLA) adder saves this propagation time by computing
generate and propagate signals so that the carries for
consecutive blocks are produced simultaneously. For faster
operation, a carry look ahead adder is therefore used.
2013 International Conference on Advanced Electronic Systems (ICAES)
S[i]   = X[i] ⊕ Y[i] ⊕ C[i]
G[i]   = X[i] · Y[i]
P[i]   = X[i] + Y[i]
C[i+1] = X[i] · Y[i] + X[i] · C[i] + Y[i] · C[i]
       = G[i] + P[i] · C[i]
       = G[i] + P[i] · G[i-1] + P[i] · P[i-1] · C[i-1]
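As a sketch in Python rather than the paper's Verilog, a 4-bit CLA built directly from these equations expresses every carry in terms of G, P and the input carry alone, so in hardware all carries can be evaluated in parallel:

```python
def cla_4bit(x, y, c0=0):
    """4-bit carry look-ahead adder; returns (sum, carry_out)."""
    g = [(x >> i & 1) & (y >> i & 1) for i in range(4)]  # generate
    p = [(x >> i & 1) | (y >> i & 1) for i in range(4)]  # propagate
    # Each carry is a two-level AND-OR of g, p and c0 (no ripple chain):
    c1 = g[0] | p[0] & c0
    c2 = g[1] | p[1] & g[0] | p[1] & p[0] & c0
    c3 = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c0
    c4 = (g[3] | p[3] & g[2] | p[3] & p[2] & g[1]
          | p[3] & p[2] & p[1] & g[0] | p[3] & p[2] & p[1] & p[0] & c0)
    c = [c0, c1, c2, c3]
    s = sum((((x >> i & 1) ^ (y >> i & 1) ^ c[i]) << i) for i in range(4))
    return s, c4

print(cla_4bit(9, 7))   # 9 + 7 = 16: sum 0000 with carry out
```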
B. Multiplication
Algorithm
Fig. 2. Block diagram of Floating Point Multiplier
Multiplication is an important block of the ALU. High
speed and low power multipliers come with added complexity,
so a trade-off is needed to obtain an optimized algorithm with
a regular layout. Different multiplication algorithms are
available, such as Booth, modified Booth, Wallace,
Baugh-Wooley and Braun multipliers. The main issues with
multipliers are speed and layout regularity; keeping both
parameters in mind, the modified Booth algorithm was chosen.
It is a powerful algorithm for signed-number multiplication
that treats positive and negative numbers uniformly.
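As a hedged software illustration of the recoding step (not the paper's hardware), radix-4 modified Booth recoding scans overlapping 3-bit groups of the multiplier and maps each to a digit in {-2, -1, 0, 1, 2}, which is why signed operands are handled uniformly:

```python
def booth_digits(y, n=8):
    """Radix-4 (modified Booth) recoding of an n-bit two's-complement
    multiplier into digits in {-2, -1, 0, 1, 2}; n must be even."""
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    y &= (1 << n) - 1    # view y as an n-bit two's-complement pattern
    ybits = y << 1       # append the implicit 0 to the right of bit 0
    # overlapping groups (b[i+1], b[i], b[i-1]), stepping two bits at a time
    return [table[(ybits >> i) & 0b111] for i in range(0, n, 2)]

def booth_multiply(x, y, n=8):
    """Signed product from the recoded digits; digit i has weight 4**i,
    so only n/2 partial products are needed."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth_digits(y, n)))
```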
C. Division
Algorithm
Fig. 3. Block diagram of Floating Point Division
The Goldschmidt (GDM) algorithm is used for division. It
requires both inputs to be normalized first. The two
multiplications in each iteration are independent of each other
and can therefore be executed in parallel, which reduces the
latency.
Algorithm for GDM computing Q = A/B using k iterations:
• Require B ≠ 0 and |e0| < 1
• Initialize N = A, D = B, R = (1 − e0) / B
• for i = 0 to k
•     N = N · R
•     D = D · R
•     R = 2 − D
• end for
• Q = N
• return Q
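The iteration above can be sketched in software. The constant seed 0.75 below is an illustrative assumption (any starting R with |1 − B·R| < 1 converges), and the divisor is assumed pre-normalized to [1, 2) as the algorithm requires:

```python
def goldschmidt_divide(a, b, k=6):
    """Goldschmidt iteration for Q = A/B; b pre-normalized to [1, 2)."""
    assert 1.0 <= b < 2.0
    n, d = a, b
    r = 0.75            # crude seed for 1/b: |e0| = |1 - b*r| <= 0.5 here
    for _ in range(k):
        n *= r          # N = N * R  (independent of D * R: parallel in HW)
        d *= r          # D = D * R
        r = 2.0 - d     # R = 2 - D
    return n            # D converges to 1, so N converges to A/B

print(goldschmidt_divide(1.0, 1.6))   # approximately 0.625
```

The error is squared each round, so a handful of iterations reaches double precision; in hardware the multiplier count per iteration, not the iteration count, dominates area.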
D. Pipelining
The speed of execution of instructions can be improved in
a number of ways: by using a faster circuit technology to build
the processor, or by arranging the hardware so that multiple
operations can be performed at the same time [11]. With
pipelining, multiple operations are performed simultaneously
without changing the execution time of a single instruction. As
shown in the example below, where F is the fetch stage and E
is the execute stage, in sequential execution the third
instruction completes at the sixth clock cycle, whereas in the
pipelined architecture it completes at the fourth clock cycle,
saving two clock cycles. As the instruction count increases,
more clock cycles are saved.
Clock cycles:   1    2    3    4    5    6
               |F1   E1 | F2   E2 | F3   E3 |

Fig. 4. Sequential Execution

Clock cycles:   1    2    3    4
               F1   E1
                    F2   E2
                         F3   E3

Fig. 5. Pipelined Execution
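The cycle counts in Figs. 4 and 5 follow the standard textbook model (a sketch, not a formula from the paper): n instructions through an s-stage pipeline take s + (n − 1) cycles instead of n · s:

```python
def cycles_sequential(n, stages=2):
    """Each instruction occupies all its stages before the next starts."""
    return n * stages

def cycles_pipelined(n, stages=2):
    """The first instruction fills the pipeline; after that, one
    instruction completes per clock cycle."""
    return stages + (n - 1)

# Three 2-stage instructions: 6 cycles sequentially, 4 pipelined,
# matching Figs. 4 and 5 (two cycles saved).
print(cycles_sequential(3), cycles_pipelined(3))
```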
III. FUNCTIONAL AND TIMING VERIFICATION
Functional verification is done using both Cadence and
Xilinx tools, and the arithmetic results are verified
theoretically. Timing, power and area analyses are also done
using both Cadence and Xilinx.
A. Adder and Subtractor
In addition and subtraction, the sign, exponent and fraction
fields are processed separately. The mantissa of one operand is
first shifted to equate the exponents, and the addition is then
performed on the fractional bits. The fields of the result are
finally combined into a 32-bit output [Fig. 6].
Fig. 6. Simulation Waveform for Adder and Subtractor
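The align-and-add sequence just described can be sketched on unpacked fields. The helper below is a simplified illustration (positive normalized operands, truncating shift, no rounding), not the paper's Verilog; f is the 24-bit significand including the implicit leading 1:

```python
def fp_add_fields(e1, f1, e2, f2):
    """Align exponents, add significands, renormalize on overflow.
    Returns (exponent, 24-bit significand)."""
    if e1 < e2:                  # make operand 1 the larger one
        e1, f1, e2, f2 = e2, f2, e1, f1
    f2 >>= (e1 - e2)             # shift smaller operand to equate exponents
    s = f1 + f2
    if s >> 24:                  # significand overflowed into [2, 4)
        s >>= 1                  # shift right once and bump the exponent
        e1 += 1
    return e1, s

# 1.5 (e=0, f=0xC00000) + 0.5 (e=-1, f=0x800000) = 2.0 (e=1, f=0x800000)
print(fp_add_fields(0, 0xC00000, -1, 0x800000))
```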
B. Multiplier
In the multiplication block, the exponents are added and the
fractional bits are multiplied according to the algorithm. The
sign of the result is obtained by XORing the two input sign
bits. Finally, all the fields are combined to form the final result
[Fig. 7].
Fig. 7. Simulation Waveform for Multiplier
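The multiplier datapath can likewise be sketched on unpacked fields (an illustration under stated assumptions: normalized operands, truncation instead of rounding, no overflow handling; not the paper's implementation):

```python
def fp_mul_fields(s1, e1, f1, s2, e2, f2, bias=127):
    """Sign by XOR, biased exponents added (bias subtracted once),
    24-bit significands multiplied and renormalized."""
    sign = s1 ^ s2
    exp = e1 + e2 - bias
    prod = f1 * f2               # up to 48-bit product of the 1.f values
    if prod >> 47:               # product significand in [2, 4)
        prod >>= 1               # shift right once, bump the exponent
        exp += 1
    frac = (prod >> 23) & 0x7FFFFF   # keep 23 bits, drop the implicit 1
    return sign, exp, frac

# 1.5 * 1.5 = 2.25: sign 0, biased exponent 128, fraction 0x100000
print(fp_mul_fields(0, 127, 0xC00000, 0, 127, 0xC00000))
```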
C. Division
The exponents are subtracted, and the fractional bits are
multiplied and subtracted according to the GDM algorithm,
with successive iterations performed to converge on the result.
The sign bit is obtained by XORing the sign bits of the inputs.
This block consumes the most time and area [Fig. 8].
Fig. 8. Simulation Waveform for Division
D. ALU Layout
In Fig 9, the final layout of the circuit is shown.
Fig. 9. ALU Layout
IV. SYNTHESIS RESULT
The synthesis results are shown in Table II below.

TABLE II. COMPARATIVE ANALYSIS OF EXISTING AND PROPOSED DESIGNS

                    Existing           Proposed
Leakage power       2.880282 µW        3.50267 µW
Dynamic power       11.377751 mW       16.14882 mW
Total power         11.380632 mW       16.15232 mW
Gate count          2881               3712
Frequency           225.65 MHz         367.654 MHz
Critical path       4.43164 ns         2.70 ns
Logic utilization   1% (466/38000)     4% (1780/46560)
I/Os                44% (130/296)      65% (157/240)
Area                75436              97194
V. CONCLUSION
In this paper various arithmetic modules are implemented
and compared. The individual blocks are then combined into a
pipelined floating point ALU so as to minimize power while
increasing the operating frequency. The comparative analyses
are done on both Cadence and Xilinx, and the simulation
results are verified theoretically. The whole ALU block is
designed in Verilog HDL (Hardware Description Language).
The total power of the existing design, 11.380632 mW, is
about 0.70 times that of the proposed design, but the operating
frequency of the proposed design is about 1.62 times that of
the existing one. The gate count and area also increase because
of the number of iterations used in the algorithm.
VI. FUTURE WORK
Optimizing the source code to decrease the area and gate
count will improve reliability, and low power techniques could
be incorporated to obtain a better trade-off.
REFERENCES
[1] Addanki Purna Ramesh and Ch. Pradeep, "FPGA Based Implementation
of Double Precision Floating Point Adder/Subtractor Using Verilog",
International Journal of Emerging Technology and Advanced
Engineering, vol. 2, issue 7, July 2012.
[2] Semih Aslan, Erdal Oruklu and Jafar Saniie, "A High Level Synthesis
and Verification Tool for Fixed to Floating Point Conversion", 55th
IEEE International Midwest Symposium on Circuits and Systems
(MWSCAS 2012), 2012.
[3] Prashanth B.U.V, P. Anil Kumai and G. Sreenivasulu, "Design and
Implementation of Floating Point ALU on a FPGA Processor",
International Conference on Computing, Electronics and Electrical
Technologies (ICCEET 2012), 2012.
[4] Subhajit Banerjee Purnapatra, Siddharth Kumar and Subrata
Bhattacharya, "Implementation of Floating Point Operations on Fixed
Point Processor – An Optimization Algorithm and Comparative
Analysis", IEEE 10th International Conference on Computer and
Information Technology (CIT 2010), 2010.
[5] Ghassem Jaberipur, Behrooz Parhami and Saeid Gorgin, "Redundant-
Digit Floating-Point Addition Scheme Based on a Stored Rounding
Value", IEEE Transactions on Computers, vol. 59, no.
[6] Alexandre F. Tenca, "Multi-operand Floating-point Addition", 19th
IEEE International Symposium on Computer Arithmetic, 2009.
[7] Cornea, "IEEE 754-2008 Decimal Floating-Point for Intel®
Architecture Processors", 19th IEEE International Symposium on
Computer Arithmetic, 2009.
[8] Joy Alinda P. Reyes, Louis P. Alarcon and Luis Alarilla, "A Study of
Floating-Point Architectures for Pipelined RISC Processors", IEEE
International Symposium on Circuits and Systems, 2006.
[9] Peter-Michael Seidel, "High-Radix Implementation of IEEE Floating-
Point Addition", Proceedings of the 17th IEEE Symposium on
Computer Arithmetic, 2005.
[10] Guillermo Marcus, Patricia Hinojosa, Alfonso Avila and Juan
Nolazco-Flores, "A Fully Synthesizable Single-Precision, Floating-Point
Adder/Subtractor and Multiplier in VHDL for General and Educational
Use", Proceedings of the 5th IEEE International Caracas Conference on
Devices, Circuits and Systems, Dominican Republic, Nov. 3-5, 2004.
[11] Carl Hamacher, Zvonko Vranesic and Safwat Zaky, "Computer
Organization", 5th ed., Tata McGraw-Hill Education, 2011.