University of Manchester
School of Computer Science
Project Report 2014
Design and implementation of arithmetic units with Xilinx FPGA
Author: Prakhar Bahuguna
Supervisor: Dr. Vasilis Pavlidis
The aim of the project is to design, implement, test and profile various arithmetic units, such as adders and multipliers, on an FPGA platform. These are algorithms for performing arithmetic efficiently in hardware and are widely used in many applications. This project strives to detail the various types of such arithmetic units, providing example designs and implementations for each and critically evaluating the merits and issues of each design. The designs will then be implemented on a Xilinx Virtex-7 FPGA development board, upon which real-world performance, logic area and power consumption can be measured and analysed.
Contents

1. Introduction
2. Background
   2.1. Adders
      2.1.1. Half Adder and Full Adder
      2.1.2. Ripple-carry
      2.1.3. Carry-lookahead
      2.1.4. Carry-select
      2.1.5. Carry-skip
   2.2. Multipliers
      2.2.1. Array Multiplier
      2.2.2. Modified Booth Encoder
      2.2.3. Wallace Tree
      2.2.4. Dadda Tree
3. Design
   3.1. Requirements
   3.2. Implementation Details
4. Simulation
   4.1. Testing Methodology For 8-bit Units
   4.2. Testing Methodology For Larger Units
5. Implementation
   5.1. The Synthesis Process
   5.2. Synthesis Reports and Statistics
      5.2.1. Adders
      5.2.2. Multipliers
6. Conclusion
A. Unit Source Code
   A.1. Adders
      A.1.1. Half Adder and Full Adder
      A.1.2. Ripple-carry Adder
      A.1.3. Carry-lookahead Adder
      A.1.4. Carry-select Adder
      A.1.5. Carry-skip Adder
   A.2. Multipliers
      A.2.1. Array Multiplier
      A.2.2. Modified Booth Encoder
      A.2.3. Wallace Tree Multiplier
      A.2.4. Dadda Tree Multiplier
B. Testbench Source Code
   B.1. 8-bit Testbenches
      B.1.1. Adders
      B.1.2. Unsigned Multipliers
      B.1.3. Signed Multipliers
   B.2. Testdata Generation Script
   B.3. 32-bit Testbenches
      B.3.1. Adders
      B.3.2. Unsigned Multipliers
      B.3.3. Signed Multipliers
   B.4. 64-bit Testbenches
      B.4.1. Adders
      B.4.2. Unsigned Multipliers
      B.4.3. Signed Multipliers
List of Figures

2.1. Half Adder
2.2. Full Adder
2.3. Ripple-Carry Adder
2.4. Propagate-Generate Full Adder
2.5. Carry-lookahead Adder
2.6. Carry-select Adder
2.7. Carry-skip Adder
2.8. Modified Full Adder
2.9. Array Multiplier
2.10. Wallace Tree Multiplier
2.11. Dadda Tree Multiplier
4.1. Simulation Process
5.1. Area Usage For 8-bit Adders
5.2. Area Usage For 32-bit Adders
5.3. Area Usage For 64-bit Adders
5.4. Delay For 8-bit Adders
5.5. Delay For 32-bit Adders
5.6. Delay For 64-bit Adders
5.7. Area Usage For 8-bit Multipliers
5.8. Delay For 8-bit Multipliers
5.9. Area Usage For Array Multipliers
5.10. Delay For Array Multipliers
List of Tables

2.1. Basic Addition
2.2. Long Multiplication
2.3. MBE Truth Table
2.4. MBE Partial Products
1. Introduction
Arithmetic units are hardware circuits designed to perform some type of arithmetic operation on binary numbers. They are typically integrated into some form of processor, which relies on its arithmetic units for much of its operation. This project aims to design, simulate, implement and evaluate various types of arithmetic units using an electronic design workflow in conjunction with an FPGA development platform. Each unit will be benchmarked for characteristics such as area usage, delay and power consumption, all of which are important considerations in hardware design.
This project in particular will focus on adders and multipliers. Addition and multiplication
are very common operations that are utilised heavily by even the most basic of processors,
both for internal operations and for processing data inputs. Both arithmetic operations have
a variety of designs that can efficiently perform arithmetic in hardware, and each design has
its own trade-offs with regard to the key characteristics. The project strives to detail these
various types of arithmetic units, with example designs and implementations for each. The
merits and issues of each design will then be critically evaluated and compared with each
other.
The next chapter gives a complete overview of the arithmetic units that will be implemented in the course of this project. This includes the approach taken by each design and the logic behind its functioning, its gate/block-level schematic and an estimate of the number of gates required, as well as the critical path delay. The design chapter then covers the details of designing the units and the considerations that need to be taken into account for their development. The units also need to be simulated and verified to ensure correctness, preferably in an automated manner that can provide guarantees of correct operation; this important stage is addressed by the simulation chapter. The designs are then implemented in hardware, targeting a Xilinx Virtex-7 FPGA development board. Finally, the implementation chapter covers the synthesis process and evaluates the characteristics of the implemented hardware units, comparing and contrasting the different varieties of adders and multipliers with each other.
2. Background
Arithmetic units as used in computers are digital circuits that perform elementary arithmetic
operations on numbers. They are based on binary arithmetic, taking operands as input and
giving a result as output, both as binary numbers. Typically, these operations are simple
mathematical arithmetic such as addition, subtraction, multiplication and division, though
more complex units may implement more complicated mathematical operations.
Arithmetic units can be classified into two main groups: integer units and floating-point units. Integer units operate exclusively with integers and typically implement operations such as addition and multiplication, where the result for two integer operands is always an integer. As fractional values do not need to be considered, integer units are typically smaller and faster, and require less power. Floating-point units operate with numbers that have a floating radix point. Like integer units, they can perform additions, subtractions and multiplications, but with floating-point operands. They are also able to perform operations such as division, exponentiation and square root calculation that often give a floating-point result even for integer operands. As the logic to handle floating-point numbers is more complex, floating-point units are usually larger and more power-hungry.
Typically, arithmetic units are implemented in hardware within Application-Specific Integrated Circuits (ASICs). They are usually grouped together to form the complete Arithmetic Logic Unit (ALU) of a microprocessor, enabling the processor to perform useful work.[1] ALUs are also found in other types of processors: Graphics Processing Units (GPUs) use a large number of ALUs to perform complex graphics calculations in parallel,[2] and Digital Signal Processors (DSPs) are largely built around their ALUs to process a continuous data stream in a pipelined fashion.[3] In this project, the arithmetic units will be implemented on an FPGA, as an ASIC implementation is simply too costly and time-consuming to consider for this purpose and is irrelevant to the learning objectives of this project. However, the same principles of hardware design still apply, and the underlying difference in implementation can be abstracted away for the purposes of this project.

The rest of this chapter provides a complete overview of the arithmetic units that will be implemented in this project. A variety of adders, followed by multipliers, will be introduced along with their design details. In addition, each will have an estimate of delay and area.

[1] Terms, ALU (Arithmetic Logic Unit) Definition.
[2] Nvidia, What is GPU Accelerated Computing?
[3] Yovits, Advances in Computers.
2.1. Adders
Addition is one of the most basic mathematical operations needed in modern computer systems. It is performed bitwise on two binary operands in a similar fashion to traditional base-10 addition. The least significant bits are added together first; in binary, this can be performed by XORing the bits together. If the result is greater than 1, a carry is passed to the next most significant bit and incorporated into that addition. This process is repeated column by column up to the most significant bit, as shown in Table 2.1.
Carry:    1 1
        1 0 0 1
      + 0 0 1 1
        1 1 0 0

Table 2.1.: The basic addition process.
The method just described is the most straightforward and natural way for a human to
add two binary numbers together. However, in computer hardware, there are multiple approaches to solve the problem of adding two numbers together efficiently, and each presents
its own set of advantages and disadvantages. The primary concern in digital design is the
critical path. This is the path in the circuit which has the longest delay between the input
being fed to the unit and the correct output being obtained from it. As the delay in the critical
path defines the maximum speed at which the adder can operate, minimising this delay is
crucial to improve performance. Other concerns in adder design are the power consumption
of the circuit and the area needed to implement the circuit, which depends directly on the
number of logic gates that are used. These are usually at odds with minimising the critical
delay as more sophisticated logic demands higher power consumption and more logic gates.
Given this situation, there are various designs that trade off speed against area and power in different ways, depending on the requirements of the hardware being developed. The pros and cons of each design are analysed and evaluated in the following subsections.
2.1.1. Half Adder and Full Adder
The most basic building blocks of any adder are the 1-bit half adder and full adder. A half adder simply takes two operands A and B. It calculates the sum S by XORing A and B together, denoted as S = A ⊕ B. A carry output cout can also be evaluated by ANDing the two operands together as cout = A · B.[4] This is shown in Figure 2.1. Thus, the half adder is sufficient for calculating the sum of two 1-bit operands.
Figure 2.1.: A half adder with two operands A and B.
However, it is not typical to be adding only 1-bit numbers together. Often, several bits need to be added, with the carry of the previous column needed to correctly compute the result of the current column. The full adder is a complete 1-bit adder, including a carry-in that allows it to be chained to previous bits to compute an n-bit sum.[5] It calculates S = A ⊕ B ⊕ cin and cout = (A · B) + (cin · (A ⊕ B)), and an implementation can be produced by chaining two half adders together, as demonstrated by Figure 2.2.
Figure 2.2.: A full adder with two operands and a carry-in, generated from two half adders.

[4] Vahid, Digital Design.
[5] Ibid.
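To make these equations concrete, a minimal structural sketch in Verilog is shown below. The module and port names are illustrative only and are not taken from the listings in Appendix A.1.1.

// Half adder: S = A xor B, cout = A and B
module half_adder (
    input  wire a,
    input  wire b,
    output wire s,
    output wire cout
);
    xor (s, a, b);
    and (cout, a, b);
endmodule

// Full adder built from two half adders and an OR gate:
// S = A xor B xor cin, cout = (A and B) or (cin and (A xor B))
module full_adder (
    input  wire a,
    input  wire b,
    input  wire cin,
    output wire s,
    output wire cout
);
    wire s1, c1, c2;
    half_adder ha0 (.a(a),  .b(b),   .s(s1), .cout(c1));
    half_adder ha1 (.a(s1), .b(cin), .s(s),  .cout(c2));
    or (cout, c1, c2);
endmodule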
2.1.2. Ripple-carry
The ripple-carry adder (RCA) is the simplest possible type of n-bit adder. The RCA utilises a chain of full adders connected in series, with the carry-out of each full adder feeding into the carry-in of the next, as illustrated by Figure 2.3. It is named as such because the carry from the rightmost column ‘ripples’ through to the leftmost column in a sequential fashion.[6] As this adder uses n full adders with five logic gates each, it only requires 5n logic gates.
Figure 2.3.: A ripple-carry adder resulting from chaining multiple full adders.
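A sketch of this chaining in Verilog is shown below, assuming a full_adder module such as the one sketched in Section 2.1.1; the module name, parameter and generate loop are illustrative rather than the code listed in Appendix A.1.2.

module ripple_carry_adder #(
    parameter N = 8
) (
    input  wire [N-1:0] a,
    input  wire [N-1:0] b,
    input  wire         cin,
    output wire [N-1:0] s,
    output wire         cout
);
    // c[i] is the carry into bit i; c[N] is the final carry-out
    wire [N:0] c;
    assign c[0] = cin;
    assign cout = c[N];

    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : fa_chain
            full_adder fa (.a(a[i]), .b(b[i]), .cin(c[i]), .s(s[i]), .cout(c[i+1]));
        end
    endgenerate
endmodule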
The main issue with the ripple-carry adder is that the nature of the design means that the
final result is not known until the carry has propagated all the way to the leftmost column.
This gives a long critical path between A0/B0 and cout, which makes the adder slow to calculate the result. The delay along this path is three gate delays per full adder, for a total of 3n gate delays (assuming that every gate along the path has a similar delay). Clearly, an alternative design for adding two operands needs to be developed.
2.1.3. Carry-lookahead
The carry-lookahead adder (CLA) attempts to avoid the slow carry ripple of the ripple-carry adder by predicting ahead of time what the carry from the previous column will be. It does this by replacing the carry-out signal of each full adder with two signals: P (propagate) and G (generate). These signals indicate whether each full adder will propagate a carry-in of 1 to its carry-out, or will generate a carry itself.

A full adder will propagate a carry-in of 1 to its carry-out whenever A or B (or both) is 1, so we can set P = A + B. A full adder will generate a carry if both A and B are 1, regardless of the value of cin, as A + B will be greater than one. The generate signal can be set to G = A · B.[7] The full adder can be modified using these results to create a propagate-generate full adder, as shown in Figure 2.4.
The propagate and generate signals from prior columns can now be used to look ahead and evaluate the expected carry-in for each full adder. Suppose the modified full adders are assembled with some lookahead logic that determines cin for each adder, as in Figure 2.5.

[6] Vahid, Digital Design.
[7] Tohoku University, Hardware Algorithms For Arithmetic Modules.
Figure 2.4.: A full adder with propagate and generate signals instead of a carry-out.
Figure 2.5.: Block diagram of a carry-lookahead adder. The carry-in for each full adder is evaluated from the lookahead logic rather than the previous full adder.
We know that c1 will be 1 if G0 is 1, as the first column will generate a carry-out regardless of the value of c0. If G0 is 0, the only other way that c1 can be 1 is if the first adder propagates a carry-in; this propagation occurs when P0 is 1, so c1 will be 1 if both P0 and c0 are 1. This logic can therefore be formulated as c1 = G0 + P0 · c0 and is easily implemented with one OR gate and one AND gate.
This carry-lookahead logic can now be generalised to all full adders as ci+1 = Gi + Pi · ci.[8] Recursive substitution and expansion of ci then gives an expression for every carry-in ci+1, as described in Equation 2.1:

c1 = G0 + P0 · c0
c2 = G1 + P1 · c1 = G1 + P1 · (G0 + P0 · c0) = G1 + P1 · G0 + P1 · P0 · c0
c3 = G2 + P2 · c2 = G2 + P2 · (G1 + P1 · G0 + P1 · P0 · c0)
   = G2 + P2 · G1 + P2 · P1 · G0 + P2 · P1 · P0 · c0
...
cn+1 = Gn + Pn · Gn−1 + Pn · Pn−1 · Gn−2 + . . . + Pn · Pn−1 · · · P0 · c0.   (2.1)
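As an illustration, a hypothetical 4-bit Verilog description of this lookahead logic, written directly from Equation 2.1, might look as follows; it is a sketch rather than the implementation listed in Appendix A.1.3.

module cla_4bit (
    input  wire [3:0] a,
    input  wire [3:0] b,
    input  wire       cin,
    output wire [3:0] s,
    output wire       cout
);
    wire [3:0] p = a | b;   // propagate: P = A + B
    wire [3:0] g = a & b;   // generate:  G = A . B
    wire [4:0] c;

    // Carry-lookahead equations, expanded as in Equation 2.1
    assign c[0] = cin;
    assign c[1] = g[0] | (p[0] & c[0]);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                       | (p[2] & p[1] & p[0] & c[0]);
    assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                       | (p[3] & p[2] & p[1] & g[0])
                       | (p[3] & p[2] & p[1] & p[0] & c[0]);

    // Each sum bit uses the locally predicted carry: S = A xor B xor c
    assign s    = a ^ b ^ c[3:0];
    assign cout = c[4];
endmodule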
The critical path has now been shortened significantly as it now runs between Ai/Bi and
Si. After one gate delay, Pi and Gi are evaluated. The parallel nature of the lookahead logic
requires only two gate delays. Finally, the addition requires two gate delays, so the overall
critical delay is just five gate delays, regardless of the number of bits in the adder.
The main downside is that the carry-lookahead adder needs significantly more logic gates for the lookahead logic. While c1 only needs two gates to compute, c2 requires three gates, c3 requires four gates and so on, with each ci requiring one more gate than the last. The lookahead logic in total therefore needs n(n − 1)/2 gates. When combined with the full adders, a complete n-bit carry-lookahead adder requires n(n + 7)/2 gates. Clearly, an adder design with a quadratic gate count will not scale well to larger sizes.
2.1.4. Carry-select
A carry-select adder (CSLA) attempts to provide a compromise between the simplicity of a
ripple-carry adder and the speed of a carry-lookahead adder. It uses a chain of ripple-carry
adders of a certain width (usually 4-bit), but for each subsequent block after the first, two
adders are placed which simultaneously calculate the sum of the 4-bit operands. One adder
assumes a carry-in of 0, the other assumes a carry-in of 1, allowing for the two possible results
to be precomputed. The correct result is then selected by a 2:1 mux using the carry-out of the previous block.[9] The complete layout is given by Figure 2.6.
Figure 2.6.: Diagram of the first section of a carry-select adder. Each block after the initial block has two ripple-carry adders to compute the two possible results simultaneously.

The critical path still runs from A0/B0 to cout, but the delay is much shorter as each block is computed simultaneously. Assuming that each mux has a gate delay of 3, the overall gate delay of a carry-select adder with 4-bit ripple-carry blocks is 12 + 3(n/4 − 1) = 3n/4 + 9. The increase in delay is still linear, but it is significantly smaller than that of a ripple-carry adder. More logic gates are required than for a ripple-carry adder, but the design does not suffer from the quadratic increase that the carry-lookahead adder has, making the carry-select adder far more scalable.

[8] Vladutiu, Computer Arithmetic.
[9] Ibid.
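As a sketch of this structure, one carry-select block could be described in Verilog as below, reusing the hypothetical ripple_carry_adder module from Section 2.1.2; the names are assumptions made for illustration.

// One 4-bit carry-select block: both possible sums are precomputed and
// the carry-out of the previous block selects the correct pair.
module carry_select_block (
    input  wire [3:0] a,
    input  wire [3:0] b,
    input  wire       cin,   // carry-out of the previous block
    output wire [3:0] s,
    output wire       cout
);
    wire [3:0] s0, s1;
    wire       c0, c1;

    ripple_carry_adder #(.N(4)) rca0 (.a(a), .b(b), .cin(1'b0), .s(s0), .cout(c0));
    ripple_carry_adder #(.N(4)) rca1 (.a(a), .b(b), .cin(1'b1), .s(s1), .cout(c1));

    // 2:1 multiplexers select the precomputed result once cin is known
    assign s    = cin ? s1 : s0;
    assign cout = cin ? c1 : c0;
endmodule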
2.1.5. Carry-skip
The carry-skip adder (CSKA) is a variation on the carry-select adder. It avoids waiting for the carry to ripple through a block when it can conclusively determine that the block will simply propagate its carry-in onwards. In a similar fashion to carry-lookahead, this can be determined by evaluating P for the entire block, which is true if Pi is true for every bit i in the block.[10] The overall expression for P for a 4-bit block is therefore given in Equation 2.2:

P = P0 · P1 · P2 · P3 = (A0 + B0) · (A1 + B1) · (A2 + B2) · (A3 + B3).   (2.2)

[10] Vladutiu, Computer Arithmetic.
The carry-in cini for block i can now be evaluated faster. Assume that both the carry-out couti−2 of block i − 2 and Pi−1 from block i − 1 have been evaluated. If Pi−1 is true, we can simply pass the value of couti−2 on to cini, as block i − 1 will simply propagate it. Otherwise, we have to wait for couti−1 to be evaluated, as its value may differ from couti−2. This can be summed up by Equation 2.3:

cini = couti−1 + Pi−1 · couti−2,   (2.3)

and the entire schematic is depicted in Figure 2.7.
Figure 2.7.: Block diagram of a carry-skip adder. If the current block propagates its carry-in, the carry-in is used directly for the next block.
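One block of this scheme can be sketched in Verilog as follows, again assuming the hypothetical ripple_carry_adder module from earlier; it illustrates Equations 2.2 and 2.3 rather than reproducing the implementation in the appendix.

// One 4-bit carry-skip block: the incoming carry skips ahead to the next
// block whenever the whole block propagates (Equations 2.2 and 2.3).
module carry_skip_block (
    input  wire [3:0] a,
    input  wire [3:0] b,
    input  wire       cin,    // carry arriving at this block
    output wire [3:0] s,
    output wire       cout    // carry passed on to the next block
);
    wire rca_cout;
    wire block_p = &(a | b);  // P = (A0+B0).(A1+B1).(A2+B2).(A3+B3)

    ripple_carry_adder #(.N(4)) rca (.a(a), .b(b), .cin(cin), .s(s), .cout(rca_cout));

    // If the block propagates, forward cin directly; otherwise use the rippled carry
    assign cout = rca_cout | (block_p & cin);
endmodule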
2.2. Multipliers
Multiplication is another common operation found in the ALUs of most modern processors. As with all the logic circuits considered here, an arithmetic multiplier operates on two binary operands, referred to as the multiplicand and the multiplier. A set of partial products is computed, one from each bit of the multiplier, much in the same way that a human performs long multiplication by hand, but in binary. Each partial product is zero if its corresponding multiplier bit is zero, and equal to the multiplicand shifted left by the appropriate number of positions if the multiplier bit is one. These partial products are then summed with multiple adders to compute the final product. This long multiplication method is illustrated in Table 2.2.
          1 0 0 1
        × 0 0 1 1
          1 0 0 1
        1 0 0 1
      0 0 0 0
  + 0 0 0 0
  0 0 0 1 1 0 1 1

Table 2.2.: The long multiplication method for binary operands.
Generating the partial products for a multiplication a × b is extremely easy: each partial product pi is evaluated as pi = a · bi, shifted left by i, for each bit i of the multiplier b. The difficulty lies in summing, or reducing, the partial products to compute the final product in an efficient manner.

As with adders, there is a large variety of designs which can compute this partial-product sum. Each has its own trade-offs between delay and area/power consumption, depending on the requirements of the hardware being developed. As there is a large number of possible multiplier designs, three common designs will be analysed and evaluated in this report, namely the array multiplier, the Wallace tree multiplier and the Dadda tree multiplier.
2.2.1. Array Multiplier
Much like the ripple-carry adder, the array multiplier is the most straightforward implementation of an n-bit multiplier. It uses an array of modified full adders arranged in an n × n grid to evaluate the result, with the carries rippling diagonally through the grid and the sum-outs rippling down. The i-th column of the grid corresponds to the i-th bit of the final product and the j-th row corresponds to the j-th partial product, which is generated by the j-th bit of the multiplier. A final row of regular full adders is then used to sum the remaining carry-outs to compute the upper bits of the final product.[11] A schematic of this modified adder is given in Figure 2.8 and the complete array multiplier in Figure 2.9.
Figure 2.8.: An array multiplier full adder module, modified to compute the product of two bits and sum this with the previous partial product.

Figure 2.9.: An array multiplier, showing the grid structure of the full adder modules to compute the partial products and sum them.

As with the RCA, the longest critical path of the array multiplier is easy to visualise. It runs from the least significant bits of the operands in the top-right, diagonally through the carry-outs to the most significant bit of the final product in the bottom-left. Hence, it traverses n modified full adders (which have a gate delay of four), one half adder (gate delay of one) and n − 1 full adders (gate delay of three). The delay of the array multiplier is thus 4n + 1 + 3n = 7n + 1. This is significantly faster than performing repeated addition to compute the multiplication, which will necessarily have gate delays larger than 7n + 1.

[11] Vahid, Digital Design.
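For illustration, the modified cell of Figure 2.8 can be sketched in Verilog as a full adder fed by an AND gate, assuming the hypothetical full_adder module from Section 2.1.1; this is not the listing given in the appendix.

// One cell of the array multiplier grid: computes the product of one
// multiplicand bit and one multiplier bit, and adds it to the incoming
// partial sum and carry.
module array_mult_cell (
    input  wire a,      // multiplicand bit
    input  wire b,      // multiplier bit
    input  wire sin,    // partial sum from the cell above
    input  wire cin,    // carry from the neighbouring cell
    output wire sout,   // partial sum passed to the cell below
    output wire cout    // carry passed diagonally onwards
);
    wire pp;
    and (pp, a, b);     // partial product bit
    full_adder fa (.a(pp), .b(sin), .cin(cin), .s(sout), .cout(cout));
endmodule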
However, the most apparent problem, as visible in the schematic for the array multiplier, is the large amount of logic required to perform the multiplication. An n-bit array multiplier requires n × n modified adders, so the logic required scales with n^2. An 8-bit multiplier requires 64 full adders while a 32-bit multiplier needs 1024. Clearly, the array multiplier does not scale efficiently for practical applications where 32-bit or 64-bit multipliers would be needed, as the power and area requirements are too large.

Another problem with the array multiplier is its inability to deal with signed integers. With addition, the addition process inherently gives the correct answer whether the operands are interpreted as unsigned or as two's-complement signed values; the array multiplier, by contrast, only produces the correct product for unsigned operands. These problems are addressed by the Wallace tree and Dadda tree multipliers, in conjunction with a Modified Booth Encoder. The latter will be discussed first as it forms a logic sub-block of both tree-based multipliers.
2.2.2. Modified Booth Encoder
The Modified Booth Encoder (MBE) serves two important purposes for more sophisticated
multiplier designs. Firstly, it allows the multiplier to correctly handle signed integers as
part of the partial product reduction process without any additional sign-extension logic.
Secondly, it reduces the number of partial products that need to be computed by half. This
is accomplished by re-encoding the partial products according to the patterns of repeated
1s and 0s in the bits of the multiplier. For instance, a multiplication involving 4-bit integers
such as a × 0011 would normally give the partial product sum of a + 2a + 0 + 0. This can
be re-written as −a + 4a. Similarly, a × 0111 would normally require the calculation of
a + 2a + 4a + 0. This can be formulated as −a + 8a. Hence, the number of partial products
has been reduced from four to two.
This encoding is accomplished by first padding the bits of the multiplier with a zero to the
least significant bit (LSB). If the multiplier has an odd number of bits, two additional zeros
are padded to the most significant bit, otherwise no additional padding is necessary. The bits
of the padded multiplier are then grouped into overlapping groups of three, as illustrated in Equation 2.4.[12]

87 = 01010111 ⇒ 010101110 (with padding)
   ⇒ 010 | 010 | 011 | 110   (2.4)
     (Bit Group 4, Bit Group 3, Bit Group 2, Bit Group 1)
Each of these bit groups corresponds to a partial product that will be generated by the MBE. The value of each partial product is determined by the truth table in Table 2.3. In this table, ∼a means invert all the bits of a, and a ≪ 1 means shift a left by one.

Bit Group   Operation   Partial Product
000         0 × a       0 . . . 0
001         1 × a       a
010         1 × a       a
011         2 × a       a ≪ 1
100         −2 × a      (∼a ≪ 1) + 2
101         −1 × a      ∼a + 1
110         −1 × a      ∼a + 1
111         0 × a       1 . . . 1 + 1

Table 2.3.: Truth table for the MBE partial products.

Given two 8-bit operands a and b, the partial products from an MBE can then be summed by the addition logic of the multiplier, as shown in Table 2.4.[13] The outcome of utilising the MBE is that only four partial products need to be summed instead of eight, saving on the logic required for the multiplier.

[12] Saharan and Kaur, ‘Design and Implementation of an Efficient Modified Booth Multiplier using VHDL’.
a7 a6 a5 a4 a3 a2 a1 a0
× b7 b6 b5 b4 b3 b2 b1 b0
p07 p06 p05 p04 p03 p02 p01 p00
p17 p16 p15 p14 p13 p12 p11 p10 b1
p27 p26 p25 p24 p23 p22 p21 p20 b3
p37 p36 p35 p34 p33 p32 p31 p30 b5
+ b7
Table 2.4.: The partial products and addition tree generated by an MBE.
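A behavioural sketch of the recoding in Table 2.3 is shown below. It only derives the control signals for one bit group, leaving the selection, shifting and negation of the multiplicand to the surrounding partial product logic; the module name and signal encoding are assumptions made for illustration.

// Radix-4 Booth recoder for one bit group (Table 2.3): maps three
// overlapping multiplier bits onto control signals that select
// 0, +/-a or +/-2a as the partial product.
module mbe_recoder (
    input  wire [2:0] group,  // {b(2i+1), b(2i), b(2i-1)}
    output reg        zero,   // partial product is zero
    output reg        two,    // use 2a (a shifted left by one)
    output reg        neg     // negate the selected value
);
    always @(*) begin
        case (group)
            3'b000: begin zero = 1'b1; two = 1'b0; neg = 1'b0; end // 0
            3'b001: begin zero = 1'b0; two = 1'b0; neg = 1'b0; end // +a
            3'b010: begin zero = 1'b0; two = 1'b0; neg = 1'b0; end // +a
            3'b011: begin zero = 1'b0; two = 1'b1; neg = 1'b0; end // +2a
            3'b100: begin zero = 1'b0; two = 1'b1; neg = 1'b1; end // -2a
            3'b101: begin zero = 1'b0; two = 1'b0; neg = 1'b1; end // -a
            3'b110: begin zero = 1'b0; two = 1'b0; neg = 1'b1; end // -a
            3'b111: begin zero = 1'b1; two = 1'b0; neg = 1'b1; end // -0
        endcase
    end
endmodule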
2.2.3. Wallace Tree
The Wallace tree multiplier takes the partial products of a multiplication and groups the constituent bits according to their weight. The weight of a particular bit depends on its position: for instance, the least significant bit has weight 2^0 = 1, while bit 3 has weight 2^3 = 8. These bits are then reduced by layers of half adders and full adders in a tree structure to compute the final product from the partial products.[14]

[13] Punnaiah et al., ‘Design and Evaluation of High Performance Multiplier Using Modified Booth Encoding Algorithm’.
[14] Vladutiu, Computer Arithmetic.
The Wallace structure operates with multiple layers that reduce the number of bits with the same weight at each stage. The operation performed depends on the number of bits of a given weight in the layer:
• One: Pass the bit down to the next layer.
• Two: Add the bits together with a half adder, passing the sum to the same weight in
the next layer and the carry-out to the next weight in the next layer.
• Three or more: Add any three bits together with a full adder. Pass the sum and any remaining bits to the same weight in the next layer. Pass the carry-out to the next weight in the next layer.
Layers are added to the Wallace structure in this fashion until all weights have been reduced to one bit.[15] These remaining bits form the final product of the multiplication. Figure 2.10 shows the structure of a Wallace tree multiplier, where each layer of adders reduces the partial products until the final product has been computed.
Figure 2.10.: A Wallace tree multiplier, showing the tree structure of layers of half and full adders.[16]
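The building block of each reduction layer is the 3:2 (carry-save) compression performed by the full adders. The sketch below shows one such step applied across three equal-weight rows; the module is an illustrative assumption rather than the appendix code. The sum row keeps the same weight, while the carry row must be added one weight higher, exactly as the rules above describe.

// One carry-save (3:2) reduction step used inside a Wallace tree:
// three rows of equal-weight bits are compressed into a sum row
// (same weight) and a carry row (one weight higher).
module csa_3to2 #(
    parameter W = 16
) (
    input  wire [W-1:0] x,
    input  wire [W-1:0] y,
    input  wire [W-1:0] z,
    output wire [W-1:0] sum,    // stays at the same weight
    output wire [W-1:0] carry   // to be added shifted left by one
);
    assign sum   = x ^ y ^ z;
    assign carry = (x & y) | (x & z) | (y & z);
endmodule

Layers of such compressors are applied until the remaining rows can be summed directly, with each carry row consumed one weight higher in the next layer.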
The primary advantage of the Wallace tree is that it has a significantly smaller number of logic gates compared to an array multiplier. The tree structure requires log n reduction layers, with each layer containing at most 2n adders, so no more than 2n log n adders are required, as opposed to the n^2 adders required for an array multiplier. The use of an MBE halves the number of partial products, so that the number of adders required is further reduced to 2n log(n/2) in this instance. The MBE itself requires n/2 partial product logic blocks, with each logic block requiring approximately twelve gates to evaluate the partial product.

[15] Bohsali and Doan, Rectangular Styled Wallace Tree Multipliers.
The disadvantage of the Wallace tree is that, in contrast to an array multiplier, it has an irregular layout and wiring structure. This is because the higher weights have more bits and therefore require more adders than the lower weights. These extra adders also require additional internal wiring to connect them all up correctly. This irregular routing and layout is particularly problematic for FPGAs, which are based around a regular grid of lookup tables and logic blocks. Hence, a fully synthesised Wallace tree multiplier may require more logic blocks than would be expected.
2.2.4. Dadda Tree
A Dadda tree multiplier operates in a very similar manner to a Wallace tree. It receives a set of partial products as inputs, each consisting of bits of different weights, and sums these together using layers of adders to compute the final product. It differs from the Wallace tree in the structure of these adders, reducing the complexity of each reduction layer at the cost of using additional layers. This structure is illustrated by Figure 2.11. The reduction rules for the Dadda structure are as follows:[17]
• One: Pass the bit down to the next layer.
• Two: If all weights in the layer have two or fewer bits, then add the bits together with
a half adder, passing the sum to the same weight in the next layer and the carry-out to
the next weight in the next layer. Otherwise, pass the bits down to the next layer.
• Three or more: Add any three bits together with a full adder, ensuring that the total number of bits remains equal or close to a multiple of three. Pass the sum and any remaining bits to the same weight in the next layer. Pass the carry-out to the next weight in the next layer.
The Dadda tree still gives a similar n log n scaling in logic area to the Wallace tree multiplier due to its tree structure. The actual area used is slightly larger than that of the Wallace
Tree as each reduction layer is less aggressive at summing the partial products, resulting in
more layers needed to compute the sum. One advantage of this slight trade-off in area is
that the complexity of the wiring is reduced. This is useful for FPGA implementation as it is
likely to synthesise with better layout and routing than a Wallace Tree.
[17] EDA Cafe, Datapath Logic Cells.

Figure 2.11.: A Dadda tree multiplier, showing a tree structure that is larger but more regular than a Wallace tree.[18]
3. Design
To develop the arithmetic units discussed in the previous chapter, it is necessary to specify their logic. This is accomplished using Verilog, a hardware description language (HDL). Each arithmetic unit is written in this language to describe its inputs, its outputs and the internal logic that computes the outputs from the inputs. These Verilog source files can then be used by Electronic Design Automation (EDA) tools such as Cadence, Synopsys or Xilinx ISE to simulate and verify the logic, as well as to synthesise it into a bitstream suitable for configuring an FPGA to implement the logic. This chapter addresses the requirements and details of designing the arithmetic units.
3.1. Requirements
An arithmetic unit such as an adder or a multiplier is typically used as a block within a
larger unit, such as an ALU. These ALUs in turn are typically utilised by a processing unit,
most commonly the CPU of a computer system. Thus, an arithmetic unit must satisfy the
requirements of the encompassing unit.
The first requirement is the width of the operands that will be utilised. A processor will
typically use a common word width for its registers, memory addresses and data buses, hence
the arithmetic units it relies on will need to match. Early processors were 8-bit, but most modern processors such as ARM and Intel x86 now use word sizes of 32 bits and 64 bits. Hence,
the arithmetic units in this project each have 8-bit, 32-bit and 64-bit variants. However, due
to time constraints the Wallace tree and Dadda tree multipliers were only produced as 8-bit
variants.
The second requirement for arithmetic units is that the result should be computed within a specific deadline. Processors are based on clocked logic, with operations triggered on the rising edge of a clock cycle. For the processor to operate correctly, the result from the previous operation needs to be computed and latched within a discrete number of clock periods. This imposes timing requirements on the arithmetic units, and their delay plays a part in determining the maximum clock speed of the entire design. The worst-case delays for each arithmetic unit must therefore be analysed and evaluated as part of the development process.
Finally, a significant consideration is the area used by the designs, and by extension their
power consumption. In the context of an FPGA, area usage is determined by the number of
logic slices and look-up tables (LUTs) that are used by the synthesised design. It is important
to ensure that the FPGA has enough logic slices to load the bitstream for the entire design,
so the designer of the arithmetic units must ensure enough area is left for the rest of the
design. In addition, a larger design requires more power to operate as the additional gates
draw more power upon switching. The power draw of the unit influences a device’s current
requirements, thermal constraints and battery lifetime in the case of mobile devices. Hence,
it is important to analyse the power consumption of the arithmetic units.
3.2. Implementation Details
Since Verilog is a hardware description language, a source file simply describes the behavioural logic of the hardware and the state of its outputs given a set of inputs. The EDA toolchain is free to synthesise any gates and wires necessary to ensure the unit behaves according to the source file. This means that, in the case of an adder, it is perfectly valid to write s = a + b, and synthesising this will give a correctly functioning adder. However, since the actual implementation of this adder is completely determined by the synthesis tool, this approach is not useful for this project.

Instead, to design the specific arithmetic units discussed in the previous chapter, it is necessary to be explicit and specify the exact logic of each unit. The arithmetic units developed in this project are designed at the level of basic logic gates such as NOT, AND, OR and XOR, with the wiring between ports specified explicitly. This ensures that the EDA toolchain will not attempt to substitute its own optimised logic for the logic specified in the source file. This approach allows the differences between the types of units to be clearly distinguished for further analysis.
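The contrast can be illustrated with two hypothetical fragments: a behavioural description that leaves the structure entirely to the toolchain, and a structural description that fixes the gates and wiring explicitly. Neither fragment is taken from the project source code.

// Behavioural: the synthesis tool chooses whatever adder structure it likes.
module adder_behavioural (
    input  wire [7:0] a,
    input  wire [7:0] b,
    output wire [8:0] s
);
    assign s = a + b;
endmodule

// Structural: the gates and their wiring are fixed by the designer
// (a single explicit full-adder bit slice, S = A xor B xor cin,
// cout = (A and B) or (cin and (A xor B))).
module full_adder_structural (
    input  wire a,
    input  wire b,
    input  wire cin,
    output wire s,
    output wire cout
);
    wire axb, ab, cab;
    xor (axb, a, b);
    xor (s, axb, cin);
    and (ab, a, b);
    and (cab, cin, axb);
    or  (cout, ab, cab);
endmodule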
However, some optimisations that are difficult to avoid entirely occur during the translation and mapping stages of synthesising a unit. For example, when synthesising an XOR gate, the EDA toolchain has a number of possibilities for configuring an LUT to implement it. In addition, the toolchain can be configured to optimise for particular design goals such as minimising area or delay. The exact algorithms and optimisations for doing this are proprietary and depend on both the synthesis tool and the capabilities and properties of the FPGA hardware being used. Since it is not possible to directly observe what exactly occurs at the synthesis level, it is best to hold this source of variability constant to ensure consistent results. For this project, the synthesis tool used is XST, from the Xilinx ISE 14.5 package. This will be used to synthesise designs targeted at the Xilinx Virtex-7 VC707 evaluation kit. The ISE projects are configured with the default Balanced profile, which aims to strike a balance between compact area usage and short delays when synthesising the units.
The Verilog source files for the 8-bit units are provided in the appendix for reference (the
32-bit and 64-bit units have been omitted due to space constraints, but are straightforward
extensions of the basic logic). The next stage after designing and developing these units is
performing logic simulation and testing to ensure they perform as expected and give the
correct result for any set of inputs. This topic is covered in the next chapter.
4. Simulation
Logic simulation, also known as functional simulation, is a process by which software can
be used to determine the behaviour of a digital circuit. Simulation is performed through
the use of a stimulus testbench and the logic unit being tested, referred to as the device
under test (DUT). The testbench unit sends test data to the DUT’s inputs and the outputs
are observed and recorded in the simulation as a waveform trace. Additionally, the outputs
can be compared to a set of expected results for each test case. This allows for a unit to be
tested and verified for correct operation, as well as allowing the designer to find the source
of any faults in the unit. Figure 4.1 shows an example of a multiplier unit being simulated
with the ISim tool in Xilinx ISE, with the expected product being compared alongside the
actual output value from the multiplier unit. One can observe that this unit is performing as
expected, giving the correct product for every operand combination.
Figure 4.1.: The simulation process for a Wallace tree multiplier in Xilinx ISim.
In the context of the arithmetic units in this project, simulation is essential to ensure that
the correct result is computed for any set of operands. The unit should correctly handle
small and large numbers, corner cases such as one operand set to one or zero, and negative
numbers if appropriate. To achieve good testing coverage of the units, it is necessary to test
using a large number of operand combinations. As performing this many tests manually
would be tedious and impractical, it is desirable to automate the test, allowing for a quick
pass/fail decision to be made for a unit.
4.1. Testing Methodology For 8-bit Units
For the 8-bit arithmetic units, it is feasible to test every single combination of operands and verify the result. This is referred to as exhaustive testing. It is possible because, in the case of an 8-bit multiplier, there are 2^8 possible values for operand a and 2^8 possible values for operand b. Hence there are 2^8 × 2^8 = 2^16 = 65536 test cases to consider. This is a seemingly large number, but it can easily be covered by an automated test running on a computer. The testbench for the 8-bit adders and multipliers simply uses two nested loops to iterate through every possible value of both operands, checking that the output result is equal to that calculated in software. If a discrepancy occurs, it halts and logs the error; otherwise the testbench continues the simulation until the end. The code for this is given in the appendix.
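A minimal sketch of such an exhaustive testbench is shown below; the DUT name and port list are placeholders, and the project's actual testbenches are those listed in Appendix B.1.

`timescale 1ns / 1ps

// Exhaustive testbench for an 8-bit adder: every operand combination is
// applied and the output is checked against the sum computed in software.
module tb_adder_8bit_exhaustive;
    reg  [7:0] a, b;
    wire [8:0] s;
    integer i, j;

    adder_8bit dut (.a(a), .b(b), .s(s));   // hypothetical DUT interface

    initial begin
        for (i = 0; i < 256; i = i + 1) begin
            for (j = 0; j < 256; j = j + 1) begin
                a = i;
                b = j;
                #10;  // allow the combinational logic to settle
                if (s !== i + j) begin
                    $display("FAIL: %0d + %0d gave %0d", i, j, s);
                    $finish;
                end
            end
        end
        $display("PASS: all 65536 cases correct");
        $finish;
    end
endmodule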
4.2. Testing Methodology For Larger Units
The exhaustive testing approach, however, does not work in practice for 32-bit and 64-bit arithmetic units. This is because the number of test cases required scales exponentially as 2^2n. An exhaustive 32-bit multiplier testbench requires 2^64 ≈ 1.8 × 10^19 cases and a 64-bit unit requires 2^128 ≈ 3.4 × 10^38 cases. These are extremely large numbers and it simply is not feasible to test every possibility in a reasonable amount of time. An alternative approach is to test a large but feasible set of test cases, each with a randomly selected combination of operands. If the unit gives the correct result for all of these test cases and all paths through the unit have been exercised at least once, it can be assumed with a high degree of confidence that it will give the correct result in all cases.
The approach taken in this project for testing the 32-bit and 64-bit units is to generate
a set of test data for the testbench. This involves a Python script that selects two numbers
at random (within the bit width constraints) using the system’s random number generator.
It computes their sum, unsigned product and signed product and appends the output as a
formatted string of hexadecimal numbers to a test data file. This process is repeated for one
million cases. The process of generating random numbers occasionally generates duplicate
operand pairs. These duplicate pairs are removed from the test data file using standard UNIX
utilities, allowing for the script to be rerun to generate the remaining test cases until the test
data contains one million unique pairs of operands. The test data file can then be used by the
testbenches, which scan each line of the test data, set the input values according to the two
operands and compare the output with the expected result in the data. If the output differs
from the expected result, the test halts and logs the error, otherwise the test continues until
the last line of the test data is reached. The code for the test data script and the testbench
is given in the appendix, but the actual test data used is omitted from this report due to its
large size.
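The sketch below illustrates how a file-driven testbench of this kind can be structured in Verilog; the file name, line format and DUT interface are assumptions made for illustration, and the project's actual testbenches and generation script are listed in Appendix B.

`timescale 1ns / 1ps

// Random-vector testbench for a 32-bit adder: each line of the (assumed)
// test data file holds the two operands and the expected sum in hexadecimal.
module tb_adder_32bit_random;
    reg  [31:0] a, b;
    reg  [32:0] expected;
    wire [32:0] s;
    integer fd, status;

    adder_32bit dut (.a(a), .b(b), .s(s));   // hypothetical DUT interface

    initial begin
        fd = $fopen("testdata_32bit.txt", "r");
        if (fd == 0) begin
            $display("ERROR: could not open test data file");
            $finish;
        end
        while (!$feof(fd)) begin
            status = $fscanf(fd, "%h %h %h", a, b, expected);
            if (status == 3) begin
                #10;
                if (s !== expected) begin
                    $display("FAIL: %h + %h gave %h, expected %h", a, b, s, expected);
                    $finish;
                end
            end
        end
        $display("PASS: all test cases correct");
        $fclose(fd);
        $finish;
    end
endmodule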
Once a unit has successfully passed all of the test cases in the testbench, it can be assumed with high confidence to be functionally correct. It can then be synthesised as a hardware
unit suitable for implementation on the FPGA hardware. This process is covered in the next
chapter.
5. Implementation
Once an arithmetic unit has been designed and fully verified, it is ready to be implemented. This process involves transforming the Verilog source code for a hardware unit into binary bitstream data suitable for downloading to an FPGA device. The synthesis process also generates various statistics that are useful for analysing and comparing the various types of adders and multipliers. These will be the primary focus of this chapter.
5.1. The Synthesis Process
Synthesis is the process by which a hardware description of a logic unit is used to generate a hardware implementation in logic gates. This implementation can be targeted at the fabrication of a physical ASIC, or at a bitstream for the configuration of an FPGA. In this manner, synthesis is roughly analogous to compiling the source code of some software into a binary executable. The process is performed by the synthesis tool of an EDA toolchain; in the case of this project, the synthesis tool is XST (Xilinx Synthesis Technology), which is integrated into the Xilinx ISE suite. The main stages of synthesis are as follows:
• Translate: The translation stage parses the source file and generates a netlist (a list of the wires in the design) together with the logic gates associated with them. Various optimisations are applied to minimise the specified Boolean logic to a set of logic gates with equivalent truth tables. This assists in reducing the area and delay of the unit.
• Map: The map stage takes the aforementioned gate list and assigns the gates to specific logic slices and inputs/outputs on the FPGA. The LUTs are also configured to reflect the logic required for the design.
• Place-and-route: Once the design has been mapped to the FPGA, the place-and-route stage uses the netlist to decide how the design should be arranged on the chip and how the wires between the logic gates should be routed. A selection of optimisation target profiles can be used to influence this stage; for example, the tool can be directed to minimise area usage or delay, or to strike a balance between the two. This project utilises the Balanced profile for all units.
Once a unit has been synthesised, a bitstream can be generated from it. This requires the
designer to assign the unit’s inputs and outputs to the physical pins on the FPGA chip, along
with any other constraints that may be required. Once this has been completed, the toolchain
generates a complete bitstream of the entire FPGA’s configuration. This bitstream can then
be downloaded to the FPGA, finishing the process of implementing the design into hardware.
5.2. Synthesis Reports and Statistics
An important part of the synthesis process is the reports and statistics that are generated
along with the synthesised unit. These provide important details with regard to the unit such
as the area usage, the number of input/output buffers (IOBs) required, the estimated pin-to-
pin delay for each combination of input and output pin, the estimated power consumption
at a given clock rate, and so forth. This data is important from a development perspective
and here it will be utilised to evaluate each of the arithmetic units developed in the course
of this project.
Unfortunately, due to technical issues with the development software it was not possible
to generate reliable dynamic power consumption reports for the purposes of this project, nor
was it possible to interact directly with the implemented designs on the hardware. However,
since the dynamic power consumption of a digital circuit grows with the number of logic gates that switch, it can be inferred indirectly that a design with a larger area is expected to consume more power, assuming a similar proportion of gates switches with each computation.
5.2.1. Adders
In this project, the four adders discussed earlier were fully implemented, namely the ripple-
carry adder, the carry-lookahead adder, the carry-select adder and the carry-skip adder. These
were run through the synthesis tool, which generated a synthesis report for each. The first criterion for evaluating the adders was the area usage. This was quantified by examining the
number of slice LUTs required for the design. The results for the 8-bit variants are graphed
in Figure 5.1.
As would be expected, the ripple-carry adder's simple design gives it the smallest area usage of the four units, with eight LUTs used. What was not expected was that the carry-select adder also used an equal amount of area; this was most likely due to synthesis optimisations that allowed the adder to be implemented efficiently in the FPGA. The carry-lookahead adder required an additional LUT for its lookahead logic, while the carry-skip adder required the most, with ten LUTs. The scaling of these adders is depicted in Figure 5.2 and Figure 5.3 for the 32-bit and 64-bit adders respectively.

Figure 5.1.: Graph showing the area usage of the 8-bit adder variants, in terms of LUTs used.
From these graphs, it is immediately apparent that the area requirements of the carry-
lookahead adder scale up very quickly in relation to operand width. The 32-bit and 64-bit
carry-lookahead adders are significantly larger than the other variants, which suggests that
large carry-lookahead adders are not suitable for practical use. Again, both the ripple-carry
and carry-select adders use the least area, while the carry-skip adder uses noticeably more
LUTs.
Figure 5.2.: Graph showing the area usage of the 32-bit adder variants.
Figure 5.3.: Graph showing the area usage of the 64-bit adder variants.
The next criterion to analyse is the maximum worst-case delay, a value that must be determined in order to integrate the unit within a larger logic unit. This value is obtained from the pin-to-pin delay report, which details the delay between each bit of every input pin and each bit of every output pin. The largest value in this list is taken as the maximum delay. The results are shown in Figure 5.4.
Figure 5.4.: Graph showing the maximum worst-case delay of the 8-bit adder variants, measured in nanoseconds.
Somewhat surprisingly, the ripple-carry adder comes first, with the shortest delay compared to the other adders. This is likely because, with only eight full adders in the ripple-carry chain, the ripple delay is not yet large enough to be a significant issue; the benefits of the critical path optimisations present in the other adder designs therefore do not yet outweigh their additional cost. Figure 5.5 and Figure 5.6 show the maximum delay of the 32-bit and 64-bit adders respectively.
Here, the other adders now provide a tangible reduction in maximum delay as compared
to the ripple-carry adder. The carry-lookahead adder in particular has the shortest delay,
though this comes at the cost of significantly more area usage as discussed previously. The
32-bit carry-select adder also improves on the delay relative to the ripple-carry adder, while
the 32-bit carry-skip adder proves to be the slowest. This situation is reversed for the 64-bit
adders where the carry-skip adder proves to be faster than the carry-select adder, though
not as fast as the carry-lookahead adder. In fact, in this instance the carry-select adder has a longer delay than the ripple-carry adder. Hence, we can conclude that there is no ‘best’ adder design for all cases; the best choice of adder for each operand width is determined by the designer's requirements and by profiling the individual designs.

Figure 5.5.: Graph showing the maximum worst-case delay of the 32-bit adder variants.
Figure 5.6.: Graph showing the maximum worst-case delay of the 64-bit adder variants.
5.2.2. Multipliers
The array multiplier, Wallace tree multiplier and Dadda tree multiplier were all implemented as 8-bit units in this project. However, due to time constraints, only the array multiplier was implemented as 32-bit and 64-bit units. The majority of this section will therefore focus on the 8-bit units. Firstly, the area usage of these units is given in Figure 5.7.
Figure 5.7.: Graph showing the area usage of the 8-bit multipliers.
It is clear that the Wallace tree multiplier uses the least area with 73 LUTs as compared
to 74 for the array multiplier and 76 for the Dadda multiplier. This is despite the additional
overhead imposed by the MBE, which requires more logic than traditional calculation of the
partial products. In addition, the Wallace tree multiplier’s ability to handle multiplication of
signed numbers places it at a clear advantage against the array multiplier. Meanwhile, the
Dadda multiplier’s expanded design is visible in its increased area usage.
The next criterion is the maximum worst-case delay; the graph for this is given in Figure 5.8. Again, the Wallace tree multiplier emerges as the unequivocal winner with the shortest delay, while also being able to multiply signed numbers. The Dadda tree multiplier has the longest delay, although by a relatively small margin. However, it is entirely possible that different characteristics could be observed with 32-bit and 64-bit implementations of the tree multipliers, as was the case with the carry-select and carry-skip adders.
Figure 5.8.: Graph showing the maximum worst-case delay of the 8-bit multipliers.
For informative purposes, the 8-bit array multiplier was compared to its 32-bit and 64-bit
counterparts to analyse the scaling of the array multiplier with operand width. The graphs
for area and delay are given in Figure 5.9 and Figure 5.10 respectively.
The delay graph shows that the array multiplier scales slightly less than linearly across the three bit widths, with delays of 11 ns, 38 ns and 56 ns for the 8-bit, 32-bit and 64-bit multipliers respectively. However, it is the area graph that shows the enormous effect of the n^2 scaling of area usage. While the 8-bit multiplier only requires 74 LUTs, the 32-bit multiplier requires over 1400, and this balloons to almost 6000 for the 64-bit multiplier. Hence, it is obvious that the array multiplier is a poor choice of design for practical multiplier applications, as the area requirements are simply too large.
Figure 5.9.: Graph showing the scaling of area usage for array multipliers of various widths.
Figure 5.10.: Graph showing the scaling of maximum delay for array multipliers of various
widths.
6. Conclusion
This project has covered the complete development process of a variety of arithmetic units,
from the theory that underpins them through to their design and testing process, before being
implemented as complete hardware units. Additionally, it has covered the design properties
and characteristics that distinguish the units from each other, with the advantages and dis-
advantages of each unit discussed in detail. Each unit was also developed with a selection of
operand widths to investigate the effects of scaling on the characteristics of the units.
In particular, it can be concluded that for the adders, no single design emerged as a clear
best. The choice of adder used in a digital circuit is guided heavily by the requirements
of the designer’s project. For instance, a compact design requiring an 8-bit adder with min-
imal area usage and power consumption is best served by the corresponding ripple-carry
adder. A design that calls for minimal delay in an 8-bit adder is best served by the carry-
lookahead adder. For a 32-bit adder, the area scaling issues of a carry-lookahead adder make
it unsuitable for many applications, so a designer seeking out low delay while keeping area
usage acceptable would select the carry-select adder.
For the multipliers, the situation is more clear-cut. The 8-bit Wallace tree multiplier proved
to be superior to its array multiplier counterpart in both area and delay. Its use of a modified Booth
encoder and a tree structure allowed it to use less area while simultaneously possessing a
shorter delay. It also avoids the n² area scaling issue of the array multiplier, as seen with the
latter’s 32-bit and 64-bit variants. The ability to correctly multiply signed integers cements
its advantage as many digital designs will require the ability to multiply negative numbers
together. Meanwhile, the Dadda tree multiplier was disadvantaged by having a larger area
and delay than the other two multiplier designs. However, since these results were only
obtained for 8-bit multipliers, it is entirely possible that wider variants of the Dadda tree
multiplier may give more favourable characteristics than its Wallace counterpart.
The primary issues that occurred with this project were a lack of time as well as a lack of
prior knowledge and experience in digital hardware design and arithmetic units. In partic-
ular, understanding the logic and theory of the tree-based multiplier units and the modified
Booth encoder was very time-consuming. Possessing a thorough understanding of these
units was essential before development work could begin. Hence, there was only sufficient
time to complete the design, verification and implementation of the 8-bit variants of the tree
multipliers. Given more time, an analysis of the scaling characteristics of the tree multipliers
with 32-bit and 64-bit wide operands could have been undertaken.
Another issue was the inability to obtain data on the power consumption characteristics of
the arithmetic units as was originally intended, from both software-estimated values and actual
values measured from the hardware. Technical issues as well as a lack of prior experience with
the software made it difficult to obtain meaningful dynamic power consumption estimates.
Only estimates of static power consumption (from transistor leakage) were available, which
were not useful for quantifying the power consumed when evaluating a calculation. Ad-
ditionally, there were further issues with using the arithmetic units on the physical FPGA
hardware. While it was possible to generate a bitstream from the synthesised units and
download this to the hardware, there was no clear method of interfacing with the arithmetic
unit. It was not possible to send test data to the unit nor to read its output, severely limit-
ing the usefulness of this approach. With more time available, it would have been easier to
overcome these issues. It would also have been possible to make physical measurements on
the hardware, allowing for a useful analysis of real-world power consumption by the units.
Despite these issues, the project was still successful in that many useful results were ob-
tained. Nearly all of the intended units were fully designed, verified and implemented in the
course of this project. Ultimately, it has been an extremely rewarding experience and has
given a significant amount of in-depth knowledge and practical hands-on experience in the
realm of digital hardware design.
Bibliography
Bohsali, M. and M. Doan. Rectangular Styled Wallace Tree Multipliers. url: http://www.veech.com/index files/Wallace%20Tree.pdf.
EDA Cafe. Datapath Logic Cells. url: http://www10.edacafe.com/book/ASIC/Book/Book/Book/CH02/CH02.6.php.
Nvidia. What is GPU Accelerated Computing? url: http://www.nvidia.com/object/what-is-gpu-computing.html.
Punnaiah, S. et al. ‘Design and Evaluation of High Performance Multiplier Using Modified Booth Encoding Algorithm’. In: International Journal of Engineering and Innovative Technology 1 (6 June 2012), pp. 16–19.
Saharan, K. and J. Kaur. ‘Design and Implementation of an Efficient Modified Booth Multiplier using VHDL’. In: International Journal of Advances in Engineering Sciences 3 (3 July 2013), pp. 78–81.
Terms, Tech. ALU (Arithmetic Logic Unit) Definition. url: http://www.techterms.com/definition/alu.
Tohoku University. Hardware Algorithms For Arithmetic Modules. url: http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html.
Tufts University. 4*4 multiplier. url: http://www.eecs.tufts.edu/~ryun01/vlsi/verilog simulation.htm.
Vahid, Frank. Digital Design. Wiley, 2011. isbn: 978-0-470-53108-2.
Vladutiu, Mircea. Computer Arithmetic. Algorithms and Hardware Implementations. Springer, 2012. isbn: 978-3-642-18315-7.
Yovits, Marshall C. Advances in Computers. Academic Press, 1993, p. 105. isbn: 978-0-470-53108-2.
Appendix A.
Verilog Source Code for the 8-bit Arithmetic Units
A.1. Adders
A.1.1. Half Adder and Full Adder
// Simple half-adder that computes the sum and carry-out of two bits
module halfadder(input wire a,
                 input wire b,
                 output wire s,
                 output wire cout);

    assign s = a ^ b;
    assign cout = a & b;

endmodule

// Simple full-adder that computes the sum and carry-out of two bits, plus a carry-in
module fulladder(input wire a,
                 input wire b,
                 input wire cin,
                 output wire s,
                 output wire cout);

    wire s0, c0, c1;

    // Two half-adders are chained to create a full adder
    halfadder ha0 (.a(a),  .b(b),   .s(s0), .cout(c0));
    halfadder ha1 (.a(s0), .b(cin), .s(s),  .cout(c1));
    assign cout = c0 | c1;

endmodule
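As an illustration of how a small unit such as this can be verified exhaustively, the following sketch drives the full adder with all eight input combinations and compares the outputs against the expected sum. It is only an illustrative sketch: the testbench module name and its messages are arbitrary choices.

// Illustrative sketch only: exhaustively check the full adder against the expected sum.
module fulladder_check;
    reg a, b, cin;
    wire s, cout;
    integer i;

    fulladder dut (.a(a), .b(b), .cin(cin), .s(s), .cout(cout));

    initial begin
        for (i = 0; i < 8; i = i + 1) begin
            {a, b, cin} = i[2:0];   // Apply one of the eight input combinations
            #1;
            if ({cout, s} !== a + b + cin)
                $display("Mismatch for a=%b b=%b cin=%b: got %b%b", a, b, cin, cout, s);
        end
        $display("Full adder check complete");
        $finish;
    end
endmodule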
A.1.2. Ripple-carry Adder
// 4-bit ripple-carry adder with four full-adders
module ripplecarryadd4(input wire [3:0] a,
                       input wire [3:0] b,
                       input wire cin,
                       output wire [3:0] s,
                       output wire cout);

    wire [3:0] c;
    assign cout = c[3];

    fulladder fa0 (.a(a[0]), .b(b[0]), .cin(cin),  .s(s[0]), .cout(c[0]));
    fulladder fa1 (.a(a[1]), .b(b[1]), .cin(c[0]), .s(s[1]), .cout(c[1]));
    fulladder fa2 (.a(a[2]), .b(b[2]), .cin(c[1]), .s(s[2]), .cout(c[2]));
    fulladder fa3 (.a(a[3]), .b(b[3]), .cin(c[2]), .s(s[3]), .cout(c[3]));

endmodule

// 8-bit ripple-carry adder with two 4-bit RCAs
module ripplecarryadd8(input wire [7:0] a,
                       input wire [7:0] b,
                       input wire cin,
                       output wire [7:0] s,
                       output wire cout);

    wire c0;

    ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(cin), .s(s[3:0]), .cout(c0));
    ripplecarryadd4 rca4_1 (.a(a[7:4]), .b(b[7:4]), .cin(c0),  .s(s[7:4]), .cout(cout));

endmodule
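The 32-bit and 64-bit ripple-carry adders follow the same chaining pattern. As an illustrative sketch only, the same structure can be expressed for an arbitrary operand width with a generate loop; the module name ripplecarryaddn and the WIDTH parameter below are placeholders rather than part of the project sources.

// Illustrative sketch only: a width-parameterised ripple-carry adder
// built with a generate loop rather than explicit full-adder instances.
module ripplecarryaddn #(parameter WIDTH = 8)
                        (input  wire [WIDTH-1:0] a,
                         input  wire [WIDTH-1:0] b,
                         input  wire             cin,
                         output wire [WIDTH-1:0] s,
                         output wire             cout);

    wire [WIDTH:0] c;           // Carry chain; c[0] is the carry-in
    assign c[0] = cin;
    assign cout = c[WIDTH];

    genvar i;
    generate
        for (i = 0; i < WIDTH; i = i + 1) begin : chain
            fulladder fa (.a(a[i]), .b(b[i]), .cin(c[i]), .s(s[i]), .cout(c[i+1]));
        end
    endgenerate

endmodule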
A.1.3. Carry-lookahead Adder
// Propagate/Generate adder that computes the P/G signals
// instead of the carry-out as in the full adder
module pgadder(input wire a,
               input wire b,
               input wire cin,
               output wire s,
               output wire p,
               output wire g);

    assign s = a ^ b ^ cin;
    assign p = a | b;
    assign g = a & b;

endmodule

// 8-bit carry-lookahead adder
module carrylookaheadadd8(input wire [7:0] a,
                          input wire [7:0] b,
                          input wire cin,
                          output wire [7:0] s,
                          output wire cout);

    // Propagate, generate and carry output signals for each bit
    wire [7:0] p, g, c;

    // The formula for the lookahead is c[i+1] = g[i] | (p[i] & c[i]), where c[i] is expanded recursively
    assign c[0] = cin;
    assign c[1] = g[0] | (p[0] & c[0]);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0]);
    assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
                | (p[3] & p[2] & p[1] & p[0] & c[0]);
    assign c[5] = g[4] | (p[4] & g[3]) | (p[4] & p[3] & g[2]) | (p[4] & p[3] & p[2] & g[1])
                | (p[4] & p[3] & p[2] & p[1] & g[0]) | (p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
    assign c[6] = g[5] | (p[5] & g[4]) | (p[5] & p[4] & g[3]) | (p[5] & p[4] & p[3] & g[2])
                | (p[5] & p[4] & p[3] & p[2] & g[1]) | (p[5] & p[4] & p[3] & p[2] & p[1] & g[0])
                | (p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
    assign c[7] = g[6] | (p[6] & g[5]) | (p[6] & p[5] & g[4]) | (p[6] & p[5] & p[4] & g[3])
                | (p[6] & p[5] & p[4] & p[3] & g[2]) | (p[6] & p[5] & p[4] & p[3] & p[2] & g[1])
                | (p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & g[0])
                | (p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
    assign cout = g[7] | (p[7] & g[6]) | (p[7] & p[6] & g[5]) | (p[7] & p[6] & p[5] & g[4])
                | (p[7] & p[6] & p[5] & p[4] & g[3]) | (p[7] & p[6] & p[5] & p[4] & p[3] & g[2])
                | (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & g[1])
                | (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & g[0])
                | (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);

    // Propagate/Generate adders which give the P/G signals and use the carries computed by the CLA logic
    pgadder pga0 (.a(a[0]), .b(b[0]), .cin(c[0]), .s(s[0]), .p(p[0]), .g(g[0]));
    pgadder pga1 (.a(a[1]), .b(b[1]), .cin(c[1]), .s(s[1]), .p(p[1]), .g(g[1]));
    pgadder pga2 (.a(a[2]), .b(b[2]), .cin(c[2]), .s(s[2]), .p(p[2]), .g(g[2]));
    pgadder pga3 (.a(a[3]), .b(b[3]), .cin(c[3]), .s(s[3]), .p(p[3]), .g(g[3]));
    pgadder pga4 (.a(a[4]), .b(b[4]), .cin(c[4]), .s(s[4]), .p(p[4]), .g(g[4]));
    pgadder pga5 (.a(a[5]), .b(b[5]), .cin(c[5]), .s(s[5]), .p(p[5]), .g(g[5]));
    pgadder pga6 (.a(a[6]), .b(b[6]), .cin(c[6]), .s(s[6]), .p(p[6]), .g(g[6]));
    pgadder pga7 (.a(a[7]), .b(b[7]), .cin(c[7]), .s(s[7]), .p(p[7]), .g(g[7]));

endmodule
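The expanded assignments above are the flattened form of the recurrence c[i+1] = g[i] | (p[i] & c[i]). For illustration only, the sketch below writes that recurrence directly in a generate loop; the module name is a placeholder. Note that synthesising this loop form simply recreates a rippling chain of AND-OR gates through the carries, so it is the flattening performed above that actually removes the serial carry path.

// Illustration only: the carry recurrence written directly.  Logically
// equivalent to the flattened equations in carrylookaheadadd8, but it
// synthesises to a rippling carry chain rather than lookahead logic.
module clacarrychain8(input wire [7:0] p,
                      input wire [7:0] g,
                      input wire cin,
                      output wire [8:0] c);

    assign c[0] = cin;

    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : chain
            assign c[i+1] = g[i] | (p[i] & c[i]);
        end
    endgenerate

endmodule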
A.1.4. Carry-select Adder
// 8-bit carry-select adder
module carryselectadd8(input wire [7:0] a,
                       input wire [7:0] b,
                       input wire cin,
                       output wire [7:0] s,
                       output wire cout);

    wire cs, cout_0, cout_1;
    wire [3:0] result_0, result_1;

    // The appropriate output for the upper bits is selected by the carry-select signal
    assign {cout, s[7:4]} = (cs) ? {cout_1, result_1} : {cout_0, result_0};

    // Simple RCA adds the lower four bits and emits a carry-select signal
    ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(cin), .s(s[3:0]), .cout(cs));

    // The upper bits are computed twice, for a carry-in of 0 and 1, with the correct answer selected later
    ripplecarryadd4 rca4_1_0 (.a(a[7:4]), .b(b[7:4]), .cin(1'b0), .s(result_0), .cout(cout_0));
    ripplecarryadd4 rca4_1_1 (.a(a[7:4]), .b(b[7:4]), .cin(1'b1), .s(result_1), .cout(cout_1));

endmodule
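Wider carry-select adders can be formed by cascading blocks such as this one, with the carry out of each block driving the carry in of the next. The 16-bit arrangement below is an illustrative sketch only and is not necessarily how the project's 32-bit and 64-bit carry-select variants were structured.

// Illustrative sketch only: one possible way to cascade two 8-bit
// carry-select blocks into a 16-bit adder.  The block-to-block carry
// still ripples between the two blocks.
module carryselectadd16(input wire [15:0] a,
                        input wire [15:0] b,
                        input wire cin,
                        output wire [15:0] s,
                        output wire cout);

    wire c8;    // Carry between the lower and upper 8-bit blocks

    carryselectadd8 lo (.a(a[7:0]),  .b(b[7:0]),  .cin(cin), .s(s[7:0]),  .cout(c8));
    carryselectadd8 hi (.a(a[15:8]), .b(b[15:8]), .cin(c8),  .s(s[15:8]), .cout(cout));

endmodule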
A.1.5. Carry-skip Adder
// A 4-bit RCA that provides a propagate signal for the carry-skip adder
module pgripplecarryadd4(input wire [3:0] a,
                         input wire [3:0] b,
                         input wire cin,
                         output wire [3:0] s,
                         output wire cout,
                         output wire p);

    wire [3:0] c;
    assign cout = c[3];
    assign p = (a[0] | b[0]) & (a[1] | b[1]) & (a[2] | b[2]) & (a[3] | b[3]);

    fulladder fa0 (.a(a[0]), .b(b[0]), .cin(cin),  .s(s[0]), .cout(c[0]));
    fulladder fa1 (.a(a[1]), .b(b[1]), .cin(c[0]), .s(s[1]), .cout(c[1]));
    fulladder fa2 (.a(a[2]), .b(b[2]), .cin(c[1]), .s(s[2]), .cout(c[2]));
    fulladder fa3 (.a(a[3]), .b(b[3]), .cin(c[2]), .s(s[3]), .cout(c[3]));

endmodule

// 8-bit carry-skip adder
module carryskipadd8(input wire [7:0] a,
                     input wire [7:0] b,
                     input wire cin,
                     output wire [7:0] s,
                     output wire cout);

    wire [1:0] rcin, rcout;
    wire p;

    assign rcin[0] = cin;
    assign rcin[1] = rcout[0];

    // If the second RCA will propagate a carry, simply pass rcout[0] to the cout, skipping the second RCA
    assign cout = rcout[1] | (p & rcout[0]);

    ripplecarryadd4   rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(rcin[0]), .s(s[3:0]), .cout(rcout[0]));
    pgripplecarryadd4 rca4_1 (.a(a[7:4]), .b(b[7:4]), .cin(rcin[1]), .s(s[7:4]), .cout(rcout[1]), .p(p));

endmodule
A.2. Multipliers
A.2.1. Array Multiplier
// Array multiplier module that computes a bit product and adds it to a sum-in
module mulmodule(input wire x,
                 input wire y,
                 input wire sin,
                 input wire cin,
                 output wire cout,
                 output wire sout);

    wire b_in;
    assign b_in = x & y;

    fulladder adder (.a(sin), .b(b_in), .cin(cin), .s(sout), .cout(cout));

endmodule

// 8-bit unsigned array multiplier
module arraymultiplier8(input wire [7:0] a,
                        input wire [7:0] b,
                        output wire [15:0] result);

    wire [7:0] c [8:0], s [7:0];    // Intermediate carry and sum wires

    // Partial products of multiplicand with each multiplier bit
    mulmodule mm00_00 (.x(a[0]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][0]), .sout(s[0][0]));
    mulmodule mm00_01 (.x(a[1]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][1]), .sout(s[0][1]));
    mulmodule mm00_02 (.x(a[2]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][2]), .sout(s[0][2]));
    mulmodule mm00_03 (.x(a[3]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][3]), .sout(s[0][3]));
    mulmodule mm00_04 (.x(a[4]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][4]), .sout(s[0][4]));
    mulmodule mm00_05 (.x(a[5]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][5]), .sout(s[0][5]));
    mulmodule mm00_06 (.x(a[6]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][6]), .sout(s[0][6]));
    mulmodule mm00_07 (.x(a[7]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][7]), .sout(s[0][7]));

    mulmodule mm01_00 (.x(a[0]), .y(b[1]), .sin(s[0][1]), .cin(c[0][0]), .cout(c[1][0]), .sout(s[1][0]));
    mulmodule mm01_01 (.x(a[1]), .y(b[1]), .sin(s[0][2]), .cin(c[0][1]), .cout(c[1][1]), .sout(s[1][1]));
    mulmodule mm01_02 (.x(a[2]), .y(b[1]), .sin(s[0][3]), .cin(c[0][2]), .cout(c[1][2]), .sout(s[1][2]));
    mulmodule mm01_03 (.x(a[3]), .y(b[1]), .sin(s[0][4]), .cin(c[0][3]), .cout(c[1][3]), .sout(s[1][3]));
    mulmodule mm01_04 (.x(a[4]), .y(b[1]), .sin(s[0][5]), .cin(c[0][4]), .cout(c[1][4]), .sout(s[1][4]));
    mulmodule mm01_05 (.x(a[5]), .y(b[1]), .sin(s[0][6]), .cin(c[0][5]), .cout(c[1][5]), .sout(s[1][5]));
    mulmodule mm01_06 (.x(a[6]), .y(b[1]), .sin(s[0][7]), .cin(c[0][6]), .cout(c[1][6]), .sout(s[1][6]));
    mulmodule mm01_07 (.x(a[7]), .y(b[1]), .sin(1'b0),    .cin(c[0][7]), .cout(c[1][7]), .sout(s[1][7]));

    mulmodule mm02_00 (.x(a[0]), .y(b[2]), .sin(s[1][1]), .cin(c[1][0]), .cout(c[2][0]), .sout(s[2][0]));
    mulmodule mm02_01 (.x(a[1]), .y(b[2]), .sin(s[1][2]), .cin(c[1][1]), .cout(c[2][1]), .sout(s[2][1]));
    mulmodule mm02_02 (.x(a[2]), .y(b[2]), .sin(s[1][3]), .cin(c[1][2]), .cout(c[2][2]), .sout(s[2][2]));
    mulmodule mm02_03 (.x(a[3]), .y(b[2]), .sin(s[1][4]), .cin(c[1][3]), .cout(c[2][3]), .sout(s[2][3]));
    mulmodule mm02_04 (.x(a[4]), .y(b[2]), .sin(s[1][5]), .cin(c[1][4]), .cout(c[2][4]), .sout(s[2][4]));
    mulmodule mm02_05 (.x(a[5]), .y(b[2]), .sin(s[1][6]), .cin(c[1][5]), .cout(c[2][5]), .sout(s[2][5]));
    mulmodule mm02_06 (.x(a[6]), .y(b[2]), .sin(s[1][7]), .cin(c[1][6]), .cout(c[2][6]), .sout(s[2][6]));
    mulmodule mm02_07 (.x(a[7]), .y(b[2]), .sin(1'b0),    .cin(c[1][7]), .cout(c[2][7]), .sout(s[2][7]));

    mulmodule mm03_00 (.x(a[0]), .y(b[3]), .sin(s[2][1]), .cin(c[2][0]), .cout(c[3][0]), .sout(s[3][0]));
    mulmodule mm03_01 (.x(a[1]), .y(b[3]), .sin(s[2][2]), .cin(c[2][1]), .cout(c[3][1]), .sout(s[3][1]));
    mulmodule mm03_02 (.x(a[2]), .y(b[3]), .sin(s[2][3]), .cin(c[2][2]), .cout(c[3][2]), .sout(s[3][2]));
    mulmodule mm03_03 (.x(a[3]), .y(b[3]), .sin(s[2][4]), .cin(c[2][3]), .cout(c[3][3]), .sout(s[3][3]));
    mulmodule mm03_04 (.x(a[4]), .y(b[3]), .sin(s[2][5]), .cin(c[2][4]), .cout(c[3][4]), .sout(s[3][4]));
    mulmodule mm03_05 (.x(a[5]), .y(b[3]), .sin(s[2][6]), .cin(c[2][5]), .cout(c[3][5]), .sout(s[3][5]));
    mulmodule mm03_06 (.x(a[6]), .y(b[3]), .sin(s[2][7]), .cin(c[2][6]), .cout(c[3][6]), .sout(s[3][6]));
    mulmodule mm03_07 (.x(a[7]), .y(b[3]), .sin(1'b0),    .cin(c[2][7]), .cout(c[3][7]), .sout(s[3][7]));

    mulmodule mm04_00 (.x(a[0]), .y(b[4]), .sin(s[3][1]), .cin(c[3][0]), .cout(c[4][0]), .sout(s[4][0]));
    mulmodule mm04_01 (.x(a[1]), .y(b[4]), .sin(s[3][2]), .cin(c[3][1]), .cout(c[4][1]), .sout(s[4][1]));
    mulmodule mm04_02 (.x(a[2]), .y(b[4]), .sin(s[3][3]), .cin(c[3][2]), .cout(c[4][2]), .sout(s[4][2]));
    mulmodule mm04_03 (.x(a[3]), .y(b[4]), .sin(s[3][4]), .cin(c[3][3]), .cout(c[4][3]), .sout(s[4][3]));
    mulmodule mm04_04 (.x(a[4]), .y(b[4]), .sin(s[3][5]), .cin(c[3][4]), .cout(c[4][4]), .sout(s[4][4]));
    mulmodule mm04_05 (.x(a[5]), .y(b[4]), .sin(s[3][6]), .cin(c[3][5]), .cout(c[4][5]), .sout(s[4][5]));
    mulmodule mm04_06 (.x(a[6]), .y(b[4]), .sin(s[3][7]), .cin(c[3][6]), .cout(c[4][6]), .sout(s[4][6]));
    mulmodule mm04_07 (.x(a[7]), .y(b[4]), .sin(1'b0),    .cin(c[3][7]), .cout(c[4][7]), .sout(s[4][7]));

    mulmodule mm05_00 (.x(a[0]), .y(b[5]), .sin(s[4][1]), .cin(c[4][0]), .cout(c[5][0]), .sout(s[5][0]));
    mulmodule mm05_01 (.x(a[1]), .y(b[5]), .sin(s[4][2]), .cin(c[4][1]), .cout(c[5][1]), .sout(s[5][1]));
    mulmodule mm05_02 (.x(a[2]), .y(b[5]), .sin(s[4][3]), .cin(c[4][2]), .cout(c[5][2]), .sout(s[5][2]));
    mulmodule mm05_03 (.x(a[3]), .y(b[5]), .sin(s[4][4]), .cin(c[4][3]), .cout(c[5][3]), .sout(s[5][3]));
    mulmodule mm05_04 (.x(a[4]), .y(b[5]), .sin(s[4][5]), .cin(c[4][4]), .cout(c[5][4]), .sout(s[5][4]));
    mulmodule mm05_05 (.x(a[5]), .y(b[5]), .sin(s[4][6]), .cin(c[4][5]), .cout(c[5][5]), .sout(s[5][5]));
    mulmodule mm05_06 (.x(a[6]), .y(b[5]), .sin(s[4][7]), .cin(c[4][6]), .cout(c[5][6]), .sout(s[5][6]));
    mulmodule mm05_07 (.x(a[7]), .y(b[5]), .sin(1'b0),    .cin(c[4][7]), .cout(c[5][7]), .sout(s[5][7]));

    mulmodule mm06_00 (.x(a[0]), .y(b[6]), .sin(s[5][1]), .cin(c[5][0]), .cout(c[6][0]), .sout(s[6][0]));
    mulmodule mm06_01 (.x(a[1]), .y(b[6]), .sin(s[5][2]), .cin(c[5][1]), .cout(c[6][1]), .sout(s[6][1]));
    mulmodule mm06_02 (.x(a[2]), .y(b[6]), .sin(s[5][3]), .cin(c[5][2]), .cout(c[6][2]), .sout(s[6][2]));
    mulmodule mm06_03 (.x(a[3]), .y(b[6]), .sin(s[5][4]), .cin(c[5][3]), .cout(c[6][3]), .sout(s[6][3]));
    mulmodule mm06_04 (.x(a[4]), .y(b[6]), .sin(s[5][5]), .cin(c[5][4]), .cout(c[6][4]), .sout(s[6][4]));
    mulmodule mm06_05 (.x(a[5]), .y(b[6]), .sin(s[5][6]), .cin(c[5][5]), .cout(c[6][5]), .sout(s[6][5]));
    mulmodule mm06_06 (.x(a[6]), .y(b[6]), .sin(s[5][7]), .cin(c[5][6]), .cout(c[6][6]), .sout(s[6][6]));
    mulmodule mm06_07 (.x(a[7]), .y(b[6]), .sin(1'b0),    .cin(c[5][7]), .cout(c[6][7]), .sout(s[6][7]));

    mulmodule mm07_00 (.x(a[0]), .y(b[7]), .sin(s[6][1]), .cin(c[6][0]), .cout(c[7][0]), .sout(s[7][0]));
    mulmodule mm07_01 (.x(a[1]), .y(b[7]), .sin(s[6][2]), .cin(c[6][1]), .cout(c[7][1]), .sout(s[7][1]));
    mulmodule mm07_02 (.x(a[2]), .y(b[7]), .sin(s[6][3]), .cin(c[6][2]), .cout(c[7][2]), .sout(s[7][2]));
    mulmodule mm07_03 (.x(a[3]), .y(b[7]), .sin(s[6][4]), .cin(c[6][3]), .cout(c[7][3]), .sout(s[7][3]));
    mulmodule mm07_04 (.x(a[4]), .y(b[7]), .sin(s[6][5]), .cin(c[6][4]), .cout(c[7][4]), .sout(s[7][4]));
    mulmodule mm07_05 (.x(a[5]), .y(b[7]), .sin(s[6][6]), .cin(c[6][5]), .cout(c[7][5]), .sout(s[7][5]));
    mulmodule mm07_06 (.x(a[6]), .y(b[7]), .sin(s[6][7]), .cin(c[6][6]), .cout(c[7][6]), .sout(s[7][6]));
    mulmodule mm07_07 (.x(a[7]), .y(b[7]), .sin(1'b0),    .cin(c[6][7]), .cout(c[7][7]), .sout(s[7][7]));

    // Lower 8 bits can be obtained from the sum out of the last layer
    assign result[0] = s[0][0];
    assign result[1] = s[1][0];
    assign result[2] = s[2][0];
    assign result[3] = s[3][0];
    assign result[4] = s[4][0];
    assign result[5] = s[5][0];
    assign result[6] = s[6][0];
    assign result[7] = s[7][0];

    // Upper 8 bits need to be summed with carry-outs from previous bits
    halfadder ha00 (.a(s[7][1]), .b(c[7][0]),                .s(result[8]),  .cout(c[8][0]));
    fulladder fa01 (.a(s[7][2]), .b(c[7][1]), .cin(c[8][0]), .s(result[9]),  .cout(c[8][1]));
    fulladder fa02 (.a(s[7][3]), .b(c[7][2]), .cin(c[8][1]), .s(result[10]), .cout(c[8][2]));
    fulladder fa03 (.a(s[7][4]), .b(c[7][3]), .cin(c[8][2]), .s(result[11]), .cout(c[8][3]));
    fulladder fa04 (.a(s[7][5]), .b(c[7][4]), .cin(c[8][3]), .s(result[12]), .cout(c[8][4]));
    fulladder fa05 (.a(s[7][6]), .b(c[7][5]), .cin(c[8][4]), .s(result[13]), .cout(c[8][5]));
    fulladder fa06 (.a(s[7][7]), .b(c[7][6]), .cin(c[8][5]), .s(result[14]), .cout(c[8][6]));
    assign result[15] = c[7][7] ^ c[8][6];

endmodule
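In simulation, a structural multiplier such as this can be checked against the behavioural '*' operator. The following sketch is illustrative only; the module name and the choice of 100 random operand pairs are arbitrary.

// Illustrative sketch only: compare the structural array multiplier
// against the behavioural '*' operator for random unsigned operand pairs.
module arraymultiplier8_check;
    reg  [7:0]  a, b;
    wire [15:0] result;
    integer i;

    arraymultiplier8 dut (.a(a), .b(b), .result(result));

    initial begin
        for (i = 0; i < 100; i = i + 1) begin
            a = $random;
            b = $random;
            #1;
            if (result !== a * b)
                $display("Mismatch: %d * %d gave %d", a, b, result);
        end
        $display("Array multiplier check complete");
        $finish;
    end
endmodule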
A.2.2. Modified Booth Encoder
// 8-bit Modified Booth Encoder to generate four partial products to be summed
module boothencoder8(input wire [7:0] a,
                     input wire [7:0] b,
                     output reg [8:0] p00,
                     output reg [8:0] p01,
                     output reg [8:0] p02,
                     output reg [8:0] p03);

    // Group multiplier bits into overlapping groups of three bits, then decide
    // what the partial products should be based on the bits
    always @(a or b)
    begin
        // Equivalent to appending a zero to bits 1 and 0, only need to check four cases
        case (b[1:0])
            2'b00:   p00 <= 9'b000000000;
            2'b01:   p00 <= $signed(a);
            2'b10:   p00 <= {~a, 1'b1};
            2'b11:   p00 <= $signed(~a);
            default: p00 <= 9'bxxxxxxxxx;
        endcase

        case (b[3:1])
            3'b000:  p01 <= 9'b000000000;  // P = 0
            3'b001:  p01 <= $signed(a);    // P = A
            3'b010:  p01 <= $signed(a);    // P = A
            3'b011:  p01 <= {a, 1'b0};     // P = 2A
            3'b100:  p01 <= {~a, 1'b1};    // P = -2A
            3'b101:  p01 <= $signed(~a);   // P = -A
            3'b110:  p01 <= $signed(~a);   // P = -A
            3'b111:  p01 <= 9'b111111111;  // P = 0 (all 1s, plus complement bit gives 0)
            default: p01 <= 9'bxxxxxxxxx;  // Should not occur in normal operation with defined inputs
        endcase

        case (b[5:3])
            3'b000:  p02 <= 9'b000000000;
            3'b001:  p02 <= $signed(a);
            3'b010:  p02 <= $signed(a);
            3'b011:  p02 <= {a, 1'b0};
            3'b100:  p02 <= {~a, 1'b1};
            3'b101:  p02 <= $signed(~a);
            3'b110:  p02 <= $signed(~a);
            3'b111:  p02 <= 9'b111111111;
            default: p02 <= 9'bxxxxxxxxx;
        endcase

        case (b[7:5])
            3'b000:  p03 <= 9'b000000000;
            3'b001:  p03 <= $signed(a);
            3'b010:  p03 <= $signed(a);
            3'b011:  p03 <= {a, 1'b0};
            3'b100:  p03 <= {~a, 1'b1};
            3'b101:  p03 <= $signed(~a);
            3'b110:  p03 <= $signed(~a);
            3'b111:  p03 <= 9'b111111111;
            default: p03 <= 9'bxxxxxxxxx;
        endcase
    end

endmodule
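The sketch below, which is illustrative only, drives the encoder with a single operand pair and prints the four recoded partial products. Note that negative Booth digits are emitted here in one's-complement form; the additional '+1' needed to complete the two's complement is supplied later in the tree multipliers, which can be seen feeding the top bit of each multiplier group (for example b[5] and b[7] at weights 2^4 and 2^6) into the reduction.

// Illustrative sketch only: drive the Booth encoder with one operand
// pair and print the four recoded partial products.
module boothencoder8_view;
    reg  [7:0] a, b;
    wire [8:0] p00, p01, p02, p03;

    boothencoder8 dut (.a(a), .b(b), .p00(p00), .p01(p01), .p02(p02), .p03(p03));

    initial begin
        a = 8'sd13;      // Example operands, chosen arbitrarily
        b = -8'sd11;
        #1 $display("a=%d b=%d -> p00=%b p01=%b p02=%b p03=%b",
                    $signed(a), $signed(b), p00, p01, p02, p03);
        $finish;
    end
endmodule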
A.2.3. Wallace Tree Multiplier
// 8-bit signed Wallace Tree multiplier
module wallacetreemultiplier8(input wire [7:0] a,
                              input wire [7:0] b,
                              output wire [15:0] result);

    wire [8:0] p [3:0];
    wire [2:0] c [15:2];
    wire [2:0] w [15:2];
    wire cla0_cout;

    // Use the modified Booth encoder to generate the partial products
    boothencoder8 pp (.a(a), .b(b), .p00(p[0]), .p01(p[1]), .p02(p[2]), .p03(p[3]));

    // Sum the partial products by wire weight until only two wires are left for each weight

    // Weight 2^2
    halfadder ha02_00 (.a(p[0][2]), .b(p[1][0]), .s(w[2][0]), .cout(c[2][0]));

    // Weight 2^3
    halfadder ha03_00 (.a(p[0][3]), .b(p[1][1]), .s(w[3][0]), .cout(c[3][0]));

    // Weight 2^4
    fulladder fa04_00 (.a(p[0][4]), .b(p[1][2]), .cin(p[2][0]), .s(w[4][0]), .cout(c[4][0]));
    halfadder ha04_01 (.a(w[4][0]), .b(b[5]), .s(w[4][1]), .cout(c[4][1]));

    // Weight 2^5
    fulladder fa05_00 (.a(p[0][5]), .b(p[1][3]), .cin(p[2][1]), .s(w[5][0]), .cout(c[5][0]));
    halfadder ha05_01 (.a(w[5][0]), .b(c[4][0]), .s(w[5][1]), .cout(c[5][1]));

    // Weight 2^6
    fulladder fa06_00 (.a(p[0][6]), .b(p[1][4]), .cin(p[2][2]), .s(w[6][0]), .cout(c[6][0]));
    fulladder fa06_01 (.a(w[6][0]), .b(p[3][0]), .cin(b[7]), .s(w[6][1]), .cout(c[6][1]));
    halfadder fa06_02 (.a(w[6][1]), .b(c[5][0]), .s(w[6][2]), .cout(c[6][2]));

    // Weight 2^7
    fulladder fa07_00 (.a(p[0][7]), .b(p[1][5]), .cin(p[2][3]), .s(w[7][0]), .cout(c[7][0]));
    fulladder fa07_01 (.a(w[7][0]), .b(p[3][1]), .cin(c[6][0]), .s(w[7][1]), .cout(c[7][1]));
    halfadder ha07_02 (.a(w[7][1]), .b(c[6][1]), .s(w[7][2]), .cout(c[7][2]));

    // Weight 2^8
    fulladder fa08_00 (.a(p[0][8]), .b(p[1][6]), .cin(p[2][4]), .s(w[8][0]), .cout(c[8][0]));
    fulladder fa08_01 (.a(w[8][0]), .b(p[3][2]), .cin(c[7][0]), .s(w[8][1]), .cout(c[8][1]));
    halfadder ha08_02 (.a(w[8][1]), .b(c[7][1]), .s(w[8][2]), .cout(c[8][2]));

    // Weight 2^9
    fulladder fa09_00 (.a(p[0][8]), .b(p[1][7]), .cin(p[2][5]), .s(w[9][0]), .cout(c[9][0]));
    fulladder fa09_01 (.a(w[9][0]), .b(p[3][3]), .cin(c[8][0]), .s(w[9][1]), .cout(c[9][1]));
    halfadder fa09_02 (.a(w[9][1]), .b(c[8][1]), .s(w[9][2]), .cout(c[9][2]));

    // Weight 2^10
    fulladder fa10_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][6]), .s(w[10][0]), .cout(c[10][0]));
    fulladder fa10_01 (.a(w[10][0]), .b(p[3][4]), .cin(c[9][0]), .s(w[10][1]), .cout(c[10][1]));
    halfadder ha10_02 (.a(w[10][1]), .b(c[9][1]), .s(w[10][2]), .cout(c[10][2]));

    // Weight 2^11
    fulladder fa11_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][7]), .s(w[11][0]), .cout(c[11][0]));
    fulladder fa11_01 (.a(w[11][0]), .b(p[3][5]), .cin(c[10][0]), .s(w[11][1]), .cout(c[11][1]));
    halfadder ha11_02 (.a(w[11][1]), .b(c[10][1]), .s(w[11][2]), .cout(c[11][2]));

    // Weight 2^12
    fulladder fa12_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][8]), .s(w[12][0]), .cout(c[12][0]));
    fulladder fa12_01 (.a(w[12][0]), .b(p[3][6]), .cin(c[11][0]), .s(w[12][1]), .cout(c[12][1]));
    halfadder ha12_02 (.a(w[12][1]), .b(c[11][1]), .s(w[12][2]), .cout(c[12][2]));

    // Weight 2^13
    fulladder fa13_00 (.a(w[12][0]), .b(p[3][7]), .cin(c[12][0]), .s(w[13][0]), .cout(c[13][0]));
    halfadder ha13_01 (.a(w[13][0]), .b(c[12][1]), .s(w[13][1]), .cout(c[13][1]));

    // Weight 2^14
    fulladder fa14_00 (.a(w[12][0]), .b(p[3][8]), .cin(c[12][0]), .s(w[14][0]), .cout(c[14][0]));
    halfadder ha14_01 (.a(w[14][0]), .b(c[13][0]), .s(w[14][1]), .cout(c[14][1]));

    // Weight 2^15
    assign w[15][0] = w[14][0] ^ c[14][0];

    // Use two chained CLA adders to perform the final addition
    carrylookaheadadd8 cla0 (.a({w[7][2], w[6][2], w[5][1], w[4][1], w[3][0], w[2][0], p[0][1], p[0][0]}),
                             .b({c[6][2], c[5][1], c[4][1], c[3][0], c[2][0], b[3], 1'b0, b[1]}),
                             .cin(1'b0), .s(result[7:0]), .cout(cla0_cout));
    carrylookaheadadd8 cla1 (.a({w[15][0], w[14][1], w[13][1], w[12][2], w[11][2], w[10][2], w[9][2], w[8][2]}),
                             .b({c[14][1], c[13][1], c[12][2], c[11][2], c[10][2], c[9][2], c[8][2], c[7][2]}),
                             .cin(cla0_cout), .s(result[15:8]));

endmodule
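As with the array multiplier, the signed Wallace tree multiplier can be compared against the behavioural signed product in simulation. The check below is an illustrative sketch only; the module name and iteration count are arbitrary.

// Illustrative sketch only: compare the signed Wallace tree multiplier
// against the behavioural signed product for random operand pairs.
module wallacetreemultiplier8_check;
    reg  [7:0]  a, b;
    wire [15:0] result;
    integer i;

    wallacetreemultiplier8 dut (.a(a), .b(b), .result(result));

    initial begin
        for (i = 0; i < 100; i = i + 1) begin
            a = $random;
            b = $random;
            #1;
            if ($signed(result) !== $signed(a) * $signed(b))
                $display("Mismatch: %d * %d gave %d", $signed(a), $signed(b), $signed(result));
        end
        $display("Wallace tree multiplier check complete");
        $finish;
    end
endmodule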
A.2.4. Dadda Tree Multiplier
// 8-bit signed Dadda Tree multiplier
module daddatreemultiplier8(input wire [7:0] a,
                            input wire [7:0] b,
                            output wire [15:0] result);

    wire [8:0] p [3:0];
    wire [2:0] c [15:2];
    wire [2:0] w [15:2];
    wire cla0_cout;

    // Use the modified Booth encoder to generate the partial products
    boothencoder8 pp (.a(a), .b(b), .p00(p[0]), .p01(p[1]), .p02(p[2]), .p03(p[3]));

    // Sum the partial products by wire weight until only two wires are left for each weight

    // Weight 2^2
    halfadder ha02_02 (.a(p[0][2]), .b(p[1][0]), .s(w[2][0]), .cout(c[2][0]));

    // Weight 2^3
    halfadder ha03_02 (.a(p[0][3]), .b(p[1][1]), .s(w[3][0]), .cout(c[3][0]));

    // Weight 2^4
    halfadder ha04_01 (.a(p[0][4]), .b(p[1][2]), .s(w[4][0]), .cout(c[4][0]));
    fulladder fa04_02 (.a(p[2][0]), .b(b[5]), .cin(c[3][0]), .s(w[4][1]), .cout(c[4][1]));

    // Weight 2^5
    halfadder ha05_01 (.a(p[0][5]), .b(p[1][3]), .s(w[5][0]), .cout(c[5][0]));
    fulladder fa05_02 (.a(p[2][1]), .b(c[4][0]), .cin(c[4][1]), .s(w[5][1]), .cout(c[5][1]));

    // Weight 2^6
    halfadder ha06_00 (.a(p[0][6]), .b(p[1][4]), .s(w[6][0]), .cout(c[6][0]));
    fulladder fa06_01 (.a(p[2][2]), .b(p[3][0]), .cin(b[7]), .s(w[6][1]), .cout(c[6][1]));
    fulladder fa06_02 (.a(w[6][0]), .b(c[5][0]), .cin(c[5][1]), .s(w[6][2]), .cout(c[6][2]));

    // Weight 2^7
    fulladder fa07_00 (.a(p[0][7]), .b(p[1][5]), .cin(p[2][3]), .s(w[7][0]), .cout(c[7][0]));
    halfadder ha07_01 (.a(p[3][1]), .b(c[6][0]), .s(w[7][1]), .cout(c[7][1]));
    fulladder fa07_02 (.a(w[7][0]), .b(c[6][1]), .cin(c[6][2]), .s(w[7][2]), .cout(c[7][2]));

    // Weight 2^8
    halfadder ha08_00 (.a(p[0][8]), .b(p[1][6]), .s(w[8][0]), .cout(c[8][0]));
    fulladder fa08_01 (.a(p[2][4]), .b(p[3][2]), .cin(c[7][0]), .s(w[8][1]), .cout(c[8][1]));
    fulladder fa08_02 (.a(w[8][0]), .b(c[7][1]), .cin(c[7][2]), .s(w[8][2]), .cout(c[8][2]));
59
report
report
report
report
report
report
report
report
report
report
report
report
report
report
report

More Related Content

What's hot

Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...ssuserd6b1fd
 
Embedded linux barco-20121001
Embedded linux barco-20121001Embedded linux barco-20121001
Embedded linux barco-20121001Marc Leeman
 
Perl &lt;b>5 Tutorial&lt;/b>, First Edition
Perl &lt;b>5 Tutorial&lt;/b>, First EditionPerl &lt;b>5 Tutorial&lt;/b>, First Edition
Perl &lt;b>5 Tutorial&lt;/b>, First Editiontutorialsruby
 
Ref arch for ve sg248155
Ref arch for ve sg248155Ref arch for ve sg248155
Ref arch for ve sg248155Accenture
 
Pratical mpi programming
Pratical mpi programmingPratical mpi programming
Pratical mpi programmingunifesptk
 
C++ annotations version
C++ annotations versionC++ annotations version
C++ annotations versionPL Sharma
 
Notes of 8085 micro processor Programming for BCA, MCA, MSC (CS), MSC (IT) &...
Notes of 8085 micro processor Programming  for BCA, MCA, MSC (CS), MSC (IT) &...Notes of 8085 micro processor Programming  for BCA, MCA, MSC (CS), MSC (IT) &...
Notes of 8085 micro processor Programming for BCA, MCA, MSC (CS), MSC (IT) &...ssuserd6b1fd
 
Introduction to Arduino
Introduction to ArduinoIntroduction to Arduino
Introduction to ArduinoRimsky Cheng
 
Mongo db replication guide
Mongo db replication guideMongo db replication guide
Mongo db replication guideDeysi Gmarra
 
Reverse engineering for_beginners-en
Reverse engineering for_beginners-enReverse engineering for_beginners-en
Reverse engineering for_beginners-enAndri Yabu
 
Introduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - InglêsIntroduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - InglêsMarcelo Negreiros
 
Algorithms for programmers ideas and source code
Algorithms for programmers ideas and source code Algorithms for programmers ideas and source code
Algorithms for programmers ideas and source code Duy Phan
 

What's hot (20)

thesis
thesisthesis
thesis
 
Bash
BashBash
Bash
 
zend framework 2
zend framework 2zend framework 2
zend framework 2
 
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
 
Javanotes6 linked
Javanotes6 linkedJavanotes6 linked
Javanotes6 linked
 
Embedded linux barco-20121001
Embedded linux barco-20121001Embedded linux barco-20121001
Embedded linux barco-20121001
 
10.1.1.652.4894
10.1.1.652.489410.1.1.652.4894
10.1.1.652.4894
 
Perl &lt;b>5 Tutorial&lt;/b>, First Edition
Perl &lt;b>5 Tutorial&lt;/b>, First EditionPerl &lt;b>5 Tutorial&lt;/b>, First Edition
Perl &lt;b>5 Tutorial&lt;/b>, First Edition
 
Ref arch for ve sg248155
Ref arch for ve sg248155Ref arch for ve sg248155
Ref arch for ve sg248155
 
Pratical mpi programming
Pratical mpi programmingPratical mpi programming
Pratical mpi programming
 
Mat power manual
Mat power manualMat power manual
Mat power manual
 
c programming
c programmingc programming
c programming
 
C++ annotations version
C++ annotations versionC++ annotations version
C++ annotations version
 
Notes of 8085 micro processor Programming for BCA, MCA, MSC (CS), MSC (IT) &...
Notes of 8085 micro processor Programming  for BCA, MCA, MSC (CS), MSC (IT) &...Notes of 8085 micro processor Programming  for BCA, MCA, MSC (CS), MSC (IT) &...
Notes of 8085 micro processor Programming for BCA, MCA, MSC (CS), MSC (IT) &...
 
Akka java
Akka javaAkka java
Akka java
 
Introduction to Arduino
Introduction to ArduinoIntroduction to Arduino
Introduction to Arduino
 
Mongo db replication guide
Mongo db replication guideMongo db replication guide
Mongo db replication guide
 
Reverse engineering for_beginners-en
Reverse engineering for_beginners-enReverse engineering for_beginners-en
Reverse engineering for_beginners-en
 
Introduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - InglêsIntroduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - Inglês
 
Algorithms for programmers ideas and source code
Algorithms for programmers ideas and source code Algorithms for programmers ideas and source code
Algorithms for programmers ideas and source code
 

Similar to report (20)

Programming
ProgrammingProgramming
Programming
 
eclipse.pdf
eclipse.pdfeclipse.pdf
eclipse.pdf
 
test6
test6test6
test6
 
MS_Thesis
MS_ThesisMS_Thesis
MS_Thesis
 
Javanotes5 linked
Javanotes5 linkedJavanotes5 linked
Javanotes5 linked
 
C++ For Quantitative Finance
C++ For Quantitative FinanceC++ For Quantitative Finance
C++ For Quantitative Finance
 
Francois fleuret -_c++_lecture_notes
Francois fleuret -_c++_lecture_notesFrancois fleuret -_c++_lecture_notes
Francois fleuret -_c++_lecture_notes
 
Perltut
PerltutPerltut
Perltut
 
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
 
Thesis_Report
Thesis_ReportThesis_Report
Thesis_Report
 
javanotes5.pdf
javanotes5.pdfjavanotes5.pdf
javanotes5.pdf
 
Aidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_ReportAidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_Report
 
Master_Thesis
Master_ThesisMaster_Thesis
Master_Thesis
 
Jmetal4.5.user manual
Jmetal4.5.user manualJmetal4.5.user manual
Jmetal4.5.user manual
 
Fraser_William
Fraser_WilliamFraser_William
Fraser_William
 
The maxima book
The maxima bookThe maxima book
The maxima book
 
java web_programming
java web_programmingjava web_programming
java web_programming
 
Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016
 
Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016
 
Data over dab
Data over dabData over dab
Data over dab
 

report

  • 1. University of Manchester School of Computer Science Project Report 2014 Design and implementation of arithmetic units with Xilinx FPGA Author: Prakhar Bahuguna Supervisor: Dr. Vasilis Pavlidis
  • 2. Design and implementation of arithmetic units with Xilinx FPGA Author: Prakhar Bahuguna The aim of the project is to design, implement, test and profile various different arithmetic units, such as adders and multipliers on an FPGA platform. These are algorithms for effi- ciently performing arithmetic in hardware that are widely used in various different applic- ations. This project strives to detail the various types of such arithmetic units, providing example designs and implementations for each and critically evaluating the merits and is- sues with each design. The designs will then be implemented on an Xilinx Virtex-7 FPGA development board upon which real-world performance, logic area and power consumption can be measured and analysed. Supervisor: Dr. Vasilis Pavlidis
  • 3. Contents 1. Introduction 7 2. Background 8 2.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1. Half Adder and Full Adder . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.2. Ripple-carry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.3. Carry-lookahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.4. Carry-select . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1.5. Carry-skip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2. Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1. Array Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.2. Modified Booth Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.3. Wallace Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.4. Dadda Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3. Design 23 3.1. Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2. Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4. Simulation 26 4.1. Testing Methodology For 8-bit units . . . . . . . . . . . . . . . . . . . . . . . 27 4.2. Testing Methodology For Larger Units . . . . . . . . . . . . . . . . . . . . . 27 5. Implementation 29 5.1. The Synthesis Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.2. Synthesis Reports and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.2.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.2.2. Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 6. Conclusion 38 3
  • 4. Contents Contents A. Unit Source Code 41 A.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 A.1.1. Half Adder and Full Adder . . . . . . . . . . . . . . . . . . . . . . . . 41 A.1.2. Ripple-carry Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 A.1.3. Carry-lookahead Adder . . . . . . . . . . . . . . . . . . . . . . . . . 43 A.1.4. Carry-select Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 A.1.5. Carry-skip Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 A.2. Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 A.2.1. Array Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 A.2.2. Modified Booth Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 52 A.2.3. Wallace Tree Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . 54 A.2.4. Dadda Tree Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . 58 B. Testbench Source Code 62 B.1. 8-bit Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 B.1.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 B.1.2. Unsigned Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 B.1.3. Signed Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 B.2. Testdata Generation Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 B.3. 32-bit Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 B.3.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 B.3.2. Unsigned Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 B.3.3. Signed Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 B.4. 64-bit Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 B.4.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 B.4.2. Unsigned Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 B.4.3. Signed Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4
  • 5. List of Figures 2.1. Half Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2. Full Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3. Ripple-Carry Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4. Propagate-Generate Full Adder . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5. Carry-lookahead Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.6. Carry-select Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.7. Carry-skip Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.8. Modified Full Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.9. Array Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.10. Wallace Tree Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.11. Dadda Tree Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.1. Simulation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5.1. Area Usage For 8-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.2. Area Usage For 32-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.3. Area Usage For 64-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.4. Delay For 8-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.5. Delay For 32-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.6. Delay For 64-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.7. Area Usage For 8-bit Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.8. Delay For 8-bit Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.9. Area Usage For Array Multipliers . . . . . . . . . . . . . . . . . . . . . . . . 37 5.10. Delay For Array Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5
  • 6. List of Tables 2.1. Basic Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2. Long Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3. MBE Truth Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4. MBE Partial Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 6
  • 7. 1. Introduction Arithmetic units are hardware circuits that are designed to perform some type of arithmetic operation on binary numbers. They are typically integrated into some form of processor, which rely on the arithmetic units for much of their operation. This project aims to design, simulate, implement and evaluate the various different types of arithmetic units using an electronic design workflow in conjunction with an FPGA development platform. Each unit will be benchmarked for characteristics such as area usage, delay and power consumption that are important considerations in hardware design. This project in particular will focus on adders and multipliers. Addition and multiplication are very common operations that are utilised heavily by even the most basic of processors, both for internal operations and for processing data inputs. Both arithmetic operations have a variety of designs that can efficiently perform arithmetic in hardware, and each design has its own trade-offs with regard to the key characteristics. The project strives to detail these various types of arithmetic units, with example designs and implementations for each. The merits and issues of each design will then be critically evaluated and compared with each other. The next chapter will give a complete overview of the arithmetic units that will be im- plemented in the course of this project. This includes the approach taken by the design and the logic behind its functioning, its gate/block-level schematic and an estimate of number of gates required as well as the critical path delay. The design chapter will then cover the details of designing the units and the considerations that need to be taken into account for their development. The units will also need to be simulated and verified to ensure correct- ness, preferably in an automated manner that can provide guarantees of correct operation. This important stage is addressed by the simulation chapter. Finally, the designs will then be implemented in hardware, targeting an Xilinx Virtex-7 FPGA development board. Finally, the implementation chapter will cover the synthesis process and evaluate the characterist- ics of the implemented hardware units, comparing and contrasting the different varieties of adders and multipliers with each other. 7
  • 8. 2. Background Arithmetic units as used in computers are digital circuits that perform elementary arithmetic operations on numbers. They are based on binary arithmetic, taking operands as input and giving a result as output, both as binary numbers. Typically, these operations are simple mathematical arithmetic such as addition, subtraction, multiplication and division, though more complex units may implement more complicated mathematical operations. Arithmetic units can be classified into the two main groups of integer units and floating- point units. Integer units operate exclusively with integers and typically implement opera- tions such as addition and multiplication where the result for two integer operands is always an integer. As decimals do not need to be considered, integer units are typically smaller, faster and require lower power and less area. Floating-point units operate with numbers that have a floating decimal point. Like integer units, they can also perform additions, sub- tractions and multiplications but with floating-point operands. However, they are also able to perform operations such as division, exponentiation and square root calculation that often give a floating-point result even for integer operands. As the logic to handle floating-point numbers is more complex, floating-point units are usually larger and more power-hungry. Typically, arithmetic units are implemented in hardware within Application-Specific In- tegrated Circuits (ASIC). They are usually grouped together to form a complete Arithmetic Logic Unit (ALU) of a microprocessor, enabling the processor to perform useful work.1 ALUs are also found in different types of processors such as Graphics Processing Units (GPUs), which use a large number of complex ALUs to perform complex graphics calculations in parallel.2 Digital Signal Processors (DSPs) are largely based around their ALUs to process a continuous data stream in a pipelined fashion.3 In this project, the arithmetic units will be implemented in an FPGA as an ASIC implementation is simply too costly and time con- suming to consider for this use and is irrelevant to the learning objectives of this project. However, the same principles of hardware design still apply and the underlying difference in implementation can be abstracted away for the purposes of this project. The rest of this chapter will provide a complete overview of the arithmetic units that will be 1 Terms, ALU (Arithmetic Logic Unit) Definition. 2 Nvidia, What is GPU Accelerated Computing? 3 Yovits, Advances in Computers. 8
• 9. implemented in this project. A variety of adders, followed by multipliers, will be introduced along with their design details. In addition, each will have an estimate of delay and area.

2.1. Adders

Addition is one of the most basic mathematical operations needed in modern computer systems. It is performed bitwise on two binary operands in a similar fashion to traditional base-10 addition. The least significant bits are first added together; in binary, this can be performed by XORing the bits together. If the result is greater than 1, the carry is passed to the next set of significant bits and incorporated into the addition. This addition process is repeated to the left up to the most significant bit, as shown in Table 2.1.

    Carry:      1 1
            1 0 0 1
          + 0 0 1 1
            1 1 0 0

Table 2.1.: The basic addition process.

The method just described is the most straightforward and natural way for a human to add two binary numbers together. However, in computer hardware, there are multiple approaches to solve the problem of adding two numbers together efficiently, and each presents its own set of advantages and disadvantages.

The primary concern in digital design is the critical path. This is the path in the circuit which has the longest delay between the input being fed to the unit and the correct output being obtained from it. As the delay in the critical path defines the maximum speed at which the adder can operate, minimising this delay is crucial to improve performance. Other concerns in adder design are the power consumption of the circuit and the area needed to implement the circuit, which depends directly on the number of logic gates that are used. These are usually at odds with minimising the critical delay as more sophisticated logic demands higher power consumption and more logic gates. Given this situation, there are various designs that result in different trade-offs between these two goals, depending on the requirements for the hardware being developed. The pros and cons of each design are analysed and evaluated in the following subsections.

2.1.1. Half Adder and Full Adder

The most basic building blocks of any adder are the 1-bit half adder and full adder. A half adder simply takes two operands A and B. It calculates the sum S by XORing A and B together,
• 10. denoted as S = A ⊕ B. A carry output cout can also be evaluated by ANDing the two operands together as cout = A · B.4 This is shown in Figure 2.1. Thus, the half adder is sufficient for calculating the sum of two 1-bit operands.

Figure 2.1.: A half adder with two operands A and B.

However, it is not typical to be adding 1-bit numbers together. Often, several bits need to be added, with the carry of the previous column needed to correctly compute the result of the current column. The full adder is a complete 1-bit adder, including a carry-in that allows it to be chained to previous bits to compute an n-bit sum.5 It calculates S = A ⊕ B ⊕ cin and cout = (A · B) + (cin · (A ⊕ B)), and an implementation can be produced by chaining two half adders together as demonstrated by Figure 2.2.

Figure 2.2.: A full adder with two operands and a carry-in generated from two half adders.

2.1.2. Ripple-carry

The ripple-carry adder (RCA) is the simplest possible type of n-bit adder. The RCA utilises a chain of full adders connected in series with the carry-outs of each full adder feeding into

4 Vahid, Digital Design.
5 Ibid.
• 11. the carry-in of the next, as illustrated by Figure 2.3. It is named as such because the carry from the rightmost column 'ripples' through to the left column in a sequential fashion.6 As this adder uses n full adders with five logic gates each, it only requires 5n logic gates.

Figure 2.3.: A ripple-carry adder resulting from chaining multiple full adders.

The main issue with the ripple-carry adder is that the nature of the design means that the final result is not known until the carry has propagated all the way to the leftmost column. This situation gives a long critical path between A0/B0 and cout which makes the adder slow to calculate the result. The delay for this path is three gate delays for each full adder, with a total delay of 3n gate delays (assuming that every gate along this path has a similar delay). Clearly, an alternative design to add two operands needs to be developed.

2.1.3. Carry-lookahead

The carry-lookahead adder (CLA) attempts to avoid the slow carry ripple of the ripple-carry adder by predicting ahead of time what the carry from the previous column is likely to be. It does this by replacing the carry-out signal from the full adders with two signals: P (propagate) and G (generate). These signals are based on whether each full adder will propagate a carry-in of 1 to its carry-out, or will generate a carry itself. A full adder will propagate a carry-in when either A or B or both are 1, since a carry-in of 1 will then produce a carry-out of 1. Hence, we can set P = A + B. A full adder will generate a carry if both A and B are 1, regardless of the value of cin, as A + B will be greater than one. The generate signal can be set to G = A · B.7 The full adder can be modified using these results to create a propagate-generate full adder as shown in Figure 2.4.

The propagate and generate signals from prior columns can now be used to look ahead and evaluate the expected carry-in for each full adder. Suppose the modified full adders are assembled as below with some lookahead logic to determine cin for each adder, as in Figure 2.5.

6 Vahid, Digital Design.
7 Tohoku University, Hardware Algorithms For Arithmetic Modules.
• 12.

Figure 2.4.: A full-adder with propagate and generate signals instead of a carry-out.

Figure 2.5.: Block diagram of a carry-lookahead adder. The carry-in for each full adder is evaluated from the lookahead logic rather than the previous full adder.

We know that c1 will definitely be 1 if G0 is 1 as the first column will definitely generate a carry-out regardless of the value of c0. If G0 is 0, the only other way that c1 will be 1 is if the previous adder propagates a carry-in. This propagation will occur when P0 is 1, so c1 will be 1 if both P0 and c0 are 1. This logic can therefore be formulated as c1 = G0 + P0 · c0 and is easily implemented with one OR gate and one AND gate.
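Writing this recurrence out for a short slice of the adder makes the structure of the lookahead logic clear. The following minimal Verilog sketch (module and signal names are illustrative and not taken from the project source; the full 8-bit version used in this project is listed in Appendix A.1.3) produces the carries of a 4-bit slice directly from its propagate and generate signals:

    // Illustrative 4-bit carry-lookahead slice: each carry is produced
    // directly from the P/G signals rather than rippling from bit to bit.
    module cla_slice4 (
        input  wire [3:0] p,   // propagate signals, p[i] = a[i] | b[i]
        input  wire [3:0] g,   // generate signals,  g[i] = a[i] & b[i]
        input  wire       c0,  // carry into the slice
        output wire [4:1] c    // carry into each subsequent bit position
    );
        assign c[1] = g[0] | (p[0] & c0);
        assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
        assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                           | (p[2] & p[1] & p[0] & c0);
        assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                           | (p[3] & p[2] & p[1] & g[0])
                           | (p[3] & p[2] & p[1] & p[0] & c0);
    endmodule

Each expression depends only on the operand bits and c0, which is why the lookahead carries settle after a fixed number of gate delays irrespective of the adder width.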
• 13. This carry-lookahead logic can now be generalised to all full adders as ci+1 = Gi + Pi · ci.8 Recursive substitution and expansion of ci then gives an expression for every carry-in ci+1, as described in Equation 2.1.

    c1 = G0 + P0 · c0
    c2 = G1 + P1 · c1 = G1 + P1 · (G0 + P0 · c0) = G1 + P1 · G0 + P1 · P0 · c0
    c3 = G2 + P2 · c2 = G2 + P2 · (G1 + P1 · G0 + P1 · P0 · c0)
       = G2 + P2 · G1 + P2 · P1 · G0 + P2 · P1 · P0 · c0
    ...
    cn+1 = Gn + Pn · Gn−1 + Pn · Pn−1 · Gn−2 + . . . + Pn · Pn−1 · . . . · P0 · c0          (2.1)

The critical path has now been shortened significantly as it now runs between Ai/Bi and Si. After one gate delay, Pi and Gi are evaluated. The parallel nature of the lookahead logic requires only two gate delays. Finally, the addition requires two gate delays, so the overall critical delay is just five gate delays, regardless of the number of bits in the adder.

The main downside is that the carry-lookahead adder needs significantly more logic gates for the lookahead logic. While c1 only needs two gates to compute, c2 requires three gates, c3 requires four gates and so on, with the gate count growing linearly for each successive carry. The lookahead logic in total therefore needs n(n − 1)/2 gates. When combined with the full adders, a complete n-bit carry-lookahead adder requires n(n + 7)/2 gates. Clearly, an adder design with a quadratic gate count will not scale well to larger sizes.

2.1.4. Carry-select

A carry-select adder (CSLA) attempts to provide a compromise between the simplicity of a ripple-carry adder and the speed of a carry-lookahead adder. It uses a chain of ripple-carry adders of a certain width (usually 4-bit), but for each subsequent block after the first, two adders are placed which simultaneously calculate the sum of the 4-bit operands. One adder assumes a carry-in of 0, the other assumes a carry-in of 1, allowing for the two possible results to be precomputed. The correct result is then selected by a 2:1 mux using the carry-out of the previous block.9 The complete layout is given by Figure 2.6.

The critical path still runs from A0/B0 to cout, but the delay is much shorter as each block is computed simultaneously. Assuming that each mux has a gate delay of 3, the overall gate delay of a carry-select adder with 4-bit ripple-carry blocks is 12 + 3(n/4 − 1) = 3n/4 + 9.

8 Vladutiu, Computer Arithmetic.
9 Ibid.
• 14.

Figure 2.6.: Diagram of the first section of a carry-select adder. Each block after the initial block has two ripple-carry adders to compute the two possible results simultaneously.

The increase in delay is still linear but is significantly smaller as compared to a ripple-carry adder. More logic gates are required than a ripple-carry adder, but the design does not suffer from the quadratic increase that the carry-lookahead adder has, making the carry-select adder far more scalable.

2.1.5. Carry-skip

The carry-skip adder (CSKA) is a variation on the carry-select. It avoids waiting for the carry to ripple through the previous block whenever it can conclusively determine that the previous block will simply propagate its carry-in onwards. In a similar fashion to carry-lookahead, this can be determined by evaluating P for the entire block, which is true if Pi is true for every bit i in the block.10 The overall expression for P for a 4-bit block is therefore given in Equation 2.2.

10 Vladutiu, Computer Arithmetic.
• 15.

    P = P0 · P1 · P2 · P3 = (A0 + B0) · (A1 + B1) · (A2 + B2) · (A3 + B3).          (2.2)

The carry-in cini for block i can now be evaluated faster. Assume that both the carry-out couti−2 of block i − 2 and Pi−1 from block i − 1 have been evaluated. If Pi−1 is true, we can simply pass the value of couti−2 to cini, as the block will simply propagate it. Otherwise, we have to wait for couti−1 to be evaluated, as its value may differ from couti−2. This can be summed up by Equation 2.3

    cini = couti−1 + Pi−1 · couti−2          (2.3)

and the entire schematic is depicted in Figure 2.7.

Figure 2.7.: Block diagram of a carry-skip adder. If the current block propagates its carry-in, the carry-in is used directly for the next block.

2.2. Multipliers

Multiplication is another common operation that is found in the ALUs of most modern processors. As for all electronic logic circuits, an arithmetic multiplier operates on two binary operands, referred to as the multiplicand and the multiplier. A set of partial products is computed from each bit of the multiplier, much in the same way that a human performs long multiplication by hand but in binary. Each partial product is zero if its corresponding multiplier bit is zero, and equal to the multiplicand shifted left by the appropriate number of positions if the multiplier bit is one. These partial products are then summed with multiple adders to compute the final product. This long multiplication method is illustrated in Table 2.2.
• 16.

              1 0 0 1
            × 0 0 1 1
              1 0 0 1
            1 0 0 1 0
          0 0 0 0 0 0
        + 0 0 0 0 0 0 0
              1 1 0 1 1

Table 2.2.: The long multiplication method for binary operands.

Generating the partial products for a multiplication calculation of a × b is extremely easy. Each partial product pi is evaluated as pi = a · bi, shifted left by i for each bit i in the multiplier b. The difficulty arises in summing, or reducing, the partial products to compute the final product in an efficient manner.

As with adders, there is a large variety of designs which can compute this sum of partial products. Each has its own trade-offs between delay and area/power consumption depending on the requirements of the hardware being developed. As there are a large number of possible multiplier designs, three common designs will be analysed and evaluated in this report, namely the array multiplier, the Wallace tree multiplier and the Dadda tree multiplier.

2.2.1. Array Multiplier

Much like the ripple-carry adder, the array multiplier is the most straightforward implementation of an n-bit multiplier. It uses an array of modified full adders arranged in an n × n grid to evaluate the result, with the carries rippling diagonally through the grid and the sum-outs rippling down. The ith column of the grid corresponds to the ith bit of the final product and the jth row corresponds to the jth partial product, which is generated by the jth bit of the multiplier. A final row of regular full adders is then used to sum the remaining carry-outs to compute the upper bits of the final product.11 A schematic of this modified adder is given in Figure 2.8 and the complete array multiplier in Figure 2.9.

As with the RCA, the longest critical path of the array multiplier is easy to visualise. It runs from the least significant bits of the operands in the top-right, diagonally through the carry-outs to the most significant bit of the final product in the bottom-left. Hence, it traverses n modified full adders (which have a gate delay of four), one half adder (gate delay of one) and n − 1 full adders (gate delay of three). The delay of the array multiplier is thus 4n + 1 + 3(n − 1) = 7n − 2. This is significantly faster than performing repeated addition to compute the

11 Vahid, Digital Design.
• 17.

Figure 2.8.: An array multiplier full adder module, modified to compute the product of two bits and sum this with the previous partial product.

Figure 2.9.: An array multiplier, showing the grid structure of the full adder modules to compute the partial products and sum them.

multiplication, which will necessarily have gate delays far larger than this. However, the most apparent problem, as visible in the schematic for the array multiplier, is the large amount of logic required to perform the multiplication. An n-bit array multiplier requires n × n modified adders, so that the logic required scales by n². An 8-bit multiplier
• 18. requires 64 full adders while a 32-bit multiplier will need 1024. Clearly, the array multiplier does not scale efficiently for practical applications where 32-bit or 64-bit width multipliers would be needed, as the power and area requirements are too large.

Another problem with the array multiplier is its inability to deal with signed integers. With addition, the addition process inherently gives the correct answer whether the value is unsigned or signed. These problems are addressed by the Wallace tree and Dadda tree multipliers, in conjunction with a Modified Booth Encoder. The latter will be discussed first as it forms a logic sub-block of both tree-based multipliers.

2.2.2. Modified Booth Encoder

The Modified Booth Encoder (MBE) serves two important purposes for more sophisticated multiplier designs. Firstly, it allows the multiplier to correctly handle signed integers as part of the partial product reduction process without any additional sign-extension logic. Secondly, it reduces the number of partial products that need to be computed by half. This is accomplished by re-encoding the partial products according to the patterns of repeated 1s and 0s in the bits of the multiplier. For instance, a multiplication involving 4-bit integers such as a × 0011 would normally give the partial product sum of a + 2a + 0 + 0. This can be re-written as −a + 4a. Similarly, a × 0111 would normally require the calculation of a + 2a + 4a + 0. This can be formulated as −a + 8a. Hence, the number of partial products has been reduced from four to two.

This encoding is accomplished by first padding the bits of the multiplier with a zero to the right of the least significant bit (LSB). If the multiplier has an odd number of bits, two additional zeros are padded to the most significant bit, otherwise no additional padding is necessary. The bits of the padded multiplier are then grouped into overlapping groups of three, as illustrated in Equation 2.4.12

    87 = 01010111
       ⇒ 010101110 (with padding)
       ⇒ 010 | 010 | 011 | 110          (2.4)
         Bit Group 4 | Bit Group 3 | Bit Group 2 | Bit Group 1

Each of these bit groups corresponds to a partial product that will be generated by the MBE. The value of each partial product is determined by the truth table in Table 2.3. In this table, ∼a means invert all the bits of a, and a ≪ 1 means shift a left by one position. Given two 8-bit operands a and b, the partial products from an MBE can then be summed

12 Saharan and Kaur, 'Design and Implementation of an Efficient Modified Booth Multiplier using VHDL'.
• 19.

    Bit Value | Operation | Partial Product
       000    |   0 × a   | 0...0
       001    |   1 × a   | a
       010    |   1 × a   | a
       011    |   2 × a   | a ≪ 1
       100    |  −2 × a   | (∼a ≪ 1) + 2
       101    |  −1 × a   | ∼a + 1
       110    |  −1 × a   | ∼a + 1
       111    |   0 × a   | 1...1 + 1

Table 2.3.: Truth table for the MBE partial products.

by the addition logic of the multiplier as shown in Table 2.4.13 The outcome of utilising the MBE is that only four partial products need to be summed instead of eight, saving on the logic required for the multiplier.

                                            a7  a6  a5  a4  a3  a2  a1  a0
                                          × b7  b6  b5  b4  b3  b2  b1  b0
                            p07 p06 p05 p04 p03 p02 p01 p00
                p17 p16 p15 p14 p13 p12 p11 p10          b1
        p27 p26 p25 p24 p23 p22 p21 p20          b3
    p37 p36 p35 p34 p33 p32 p31 p30          b5
      +                          b7

Table 2.4.: The partial products and addition tree generated by an MBE.

2.2.3. Wallace Tree

The Wallace Tree multiplier takes the partial products of a multiplication and groups the constituent bits according to their weight. The weight of a particular bit depends on its position; for instance, the least significant bit has weight 2⁰ = 1 while bit 3 has weight 2³ = 8. These bits are then reduced by layers of half adders and full adders in a tree structure to compute the final product from the partial products.14

13 Punnaiah et al., 'Design and Evaluation of High Performance Multiplier Using Modified Booth Encoding Algorithm'.
14 Vladutiu, Computer Arithmetic.
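To make the recoding concrete, the sketch below decodes a single three-bit group into one partial product according to Table 2.3. It is only an illustrative fragment, not part of the project's source: the module and signal names are invented here, the partial product is widened to ten bits so that ±2a always fits, and the '+1' correction needed for the negative encodings is produced as a separate neg output so that it can be added in a later column of the addition tree, as in Table 2.4.

    // Illustrative radix-4 Booth decoder for one bit group (see Table 2.3).
    module booth_group_decode (
        input  wire [7:0] a,      // multiplicand
        input  wire [2:0] group,  // one overlapping 3-bit slice of the multiplier
        output reg  [9:0] pp,     // partial product before the +1 correction
        output reg        neg     // 1 when the +1 correction must be added later
    );
        always @(*) begin
            case (group)
                3'b000, 3'b111: begin pp = 10'd0;           neg = 1'b0; end //  0 x a
                3'b001, 3'b010: begin pp = {{2{a[7]}}, a};  neg = 1'b0; end // +1 x a
                3'b011:         begin pp = {a[7], a, 1'b0}; neg = 1'b0; end // +2 x a
                3'b100:         begin pp = ~{a[7], a, 1'b0}; neg = 1'b1; end // -2 x a
                3'b101, 3'b110: begin pp = ~{{2{a[7]}}, a};  neg = 1'b1; end // -1 x a
            endcase
        end
    endmodule

Table 2.3 writes the 111 case as all-ones plus one; the decoder above simply outputs zero, which is the same value. One such decoder would be instantiated for each of the n/2 bit groups, and the resulting rows are what the reduction tree then sums.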
• 20. The Wallace structure operates with multiple layers that reduce the number of bits with the same weight at each stage. The operation performed depends on the number of bits in the layer:

• One: Pass the bit down to the next layer.
• Two: Add the bits together with a half adder, passing the sum to the same weight in the next layer and the carry-out to the next weight in the next layer.
• Three or more: Add any three bits together with a full adder. Pass the sum and any remaining bits to the same weight in the next layer. Pass the carry-out to the next weight in the next layer.

Layers are added to the Wallace structure in this fashion until all weights have been reduced to one bit.15 These remaining bits form the final product of the multiplication. Figure 2.10 shows the structure of a Wallace tree multiplier, where each layer of adders reduces the partial products until the final product has been computed.

Figure 2.10.: A Wallace tree multiplier, showing the tree structure of layers of half and full adders.16

The primary advantage of the Wallace Tree is that it has a significantly smaller number of logic gates as compared to an array multiplier. The tree structure requires log n reduction layers with each layer containing at most 2n adders, so no more than 2n log n adders are required as opposed to the n² adders required for an array multiplier. The use of an MBE halves the number of partial products, so that the number of adders required is further reduced to

15 Bohsali and Doan, Rectangular Styled Wallace Tree Multipliers.
• 21. 2n log(n/2) in this instance. The MBE itself requires n/2 partial product logic blocks, with each logic block requiring approximately twelve gates to evaluate the partial product.

The disadvantage of the Wallace Tree is that, in contrast to an array multiplier, it has an irregular layout and wiring structure. This is because the higher weights have more bits to reduce and therefore require more wires and adders than the lower weights. These extra adders also require additional internal wiring to connect them all up correctly. This irregular routing and layout is particularly problematic for FPGAs, which are based around utilising a regular grid of lookup tables and logic blocks to implement their functionality. Hence, a fully synthesised Wallace Tree multiplier may require more logic blocks than would be expected.

2.2.4. Dadda Tree

A Dadda Tree multiplier operates in a very similar manner to a Wallace Tree. It receives a set of partial products as inputs, each consisting of bits of different weights, and sums these together using layers of adders to compute the final product. It differs from the Wallace Tree in terms of the structure of these adders, reducing the complexity of each reduction layer at the cost of using additional layers. This structure is illustrated by Figure 2.11. The reduction rules for the Dadda structure are as follows:17

• One: Pass the bit down to the next layer.
• Two: If all weights in the layer have two or fewer bits, then add the bits together with a half adder, passing the sum to the same weight in the next layer and the carry-out to the next weight in the next layer. Otherwise, pass the bits down to the next layer.
• Three or more: Add any three bits together with a full adder, ensuring that the total number of bits remains equal or close to a multiple of three. Pass the sum and any remaining bits to the same weight in the next layer. Pass the carry-out to the next weight in the next layer.

The Dadda Tree still gives a similar n log n scaling in logic area as the Wallace Tree multiplier due to its tree structure. The actual area used is slightly larger than that of the Wallace Tree as each reduction layer is less aggressive at summing the partial products, resulting in more layers being needed to compute the sum. One advantage of this slight trade-off in area is that the complexity of the wiring is reduced. This is useful for FPGA implementation as it is likely to synthesise with better layout and routing than a Wallace Tree.

17 EDA Cafe, Datapath Logic Cells.
  • 22. 2.2. MULTIPLIERS CHAPTER 2. BACKGROUND Figure 2.11.: A Dadda tree multiplier, showing a tree structure that is larger but more regular than a Wallace tree.18 22
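Both tree multipliers are built from the same primitive step: a full adder used as a 3:2 compressor, which turns three bits of equal weight into one bit of the same weight and one bit of the next weight up. The following minimal Verilog sketch (the module name and parameterisation are illustrative and not taken from the project code) applies that step across three rows at once; repeating such layers, under either the Wallace or the Dadda scheduling rules described above, is what reduces the partial products down to a final result.

    // One 3:2 compression layer: three same-width rows are reduced to a
    // sum row (same weights) and a carry row (shifted up by one weight).
    // The carry out of the top column is dropped, on the assumption that
    // the rows have already been padded to the full product width.
    module compress_3to2 #(parameter WIDTH = 16) (
        input  wire [WIDTH-1:0] row0,
        input  wire [WIDTH-1:0] row1,
        input  wire [WIDTH-1:0] row2,
        output wire [WIDTH-1:0] sum,
        output wire [WIDTH-1:0] carry
    );
        // Majority function of each column gives that column's carry bit.
        wire [WIDTH-1:0] maj = (row0 & row1) | (row0 & row2) | (row1 & row2);

        assign sum   = row0 ^ row1 ^ row2;        // full-adder sum, column by column
        assign carry = {maj[WIDTH-2:0], 1'b0};    // carries belong to the next weight
    endmodule

In a Wallace tree the layers apply this compression as aggressively as possible, whereas a Dadda tree defers some of it across additional layers, giving the larger but more regular structure shown in Figure 2.11.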
  • 23. 3. Design To develop the arithmetic units discussed in the previous chapter, it is necessary to specify their logic. This will be accomplished using Verilog, a hardware description language (HDL). Each arithmetic unit is written in this language to describe its inputs, outputs and the in- ternal logic of the unit to compute their outputs based on their inputs. These Verilog source code files can then be used by Electronic Design Automation (EDA) tools such as Cadence, Synopsys or Xilinx ISE to simulate and verify the logic, as well as synthesising it into a bit- stream suitable for configuring an FPGA to implement the logic. This chapter addresses the requirements and details of designing the arithmetic units. 3.1. Requirements An arithmetic unit such as an adder or a multiplier is typically used as a block within a larger unit, such as an ALU. These ALUs in turn are typically utilised by a processing unit, most commonly the CPU of a computer system. Thus, an arithmetic unit must satisfy the requirements of the encompassing unit. The first requirement is the width of the operands that will be utilised. A processor will typically use a common word width for its registers, memory addresses and data buses, hence the arithmetic units it relies on will need to match. Early processors were 8-bit, but most mod- ern processors such as ARM and Intel x86 now use word sizes of 32 bits and 64 bits. Hence, the arithmetic units in this project each have 8-bit, 32-bit and 64-bit variants. However, due to time constraints the Wallace tree and Dadda tree multipliers were only produced as 8-bit variants. The second requirement for arithmetic units is that the result should be computed within a specific deadline. Processors are based on clocked logic, with operations triggered on the rising edge of a clock cycle. For the processor to operate correctly, the result from the previ- ous operation needs to be computed and latched within a discrete number of clock periods. This imposes timing requirements on the arithmetic units and the delay plays a part in de- termining the maximum clock speed of the entire design. The worst-case delays for each arithmetic unit must therefore be analysed and evaluated as part of the development pro- 23
  • 24. 3.2. IMPLEMENTATION DETAILS CHAPTER 3. DESIGN cess. Finally, a significant consideration is the area used by the designs, and by extension their power consumption. In the context of an FPGA, area usage is determined by the number of logic slices and look-up tables (LUTs) that are used by the synthesised design. It is important to ensure that the FPGA has enough logic slices to load the bitstream for the entire design, so the designer of the arithmetic units must ensure enough area is left for the rest of the design. In addition, a larger design requires more power to operate as the additional gates draw more power upon switching. The power draw of the unit influences a device’s current requirements, thermal constraints and battery lifetime in the case of mobile devices. Hence, it is important to analyse the power consumption of the arithmetic units. 3.2. Implementation Details Since Verilog is a hardware description language, a source file simply describes the beha- vioural logic of the hardware and the state of its outputs given a set of inputs. The EDA toolchain is free to synthesise any gates and wires necessary to ensure the unit will behave according to the source file. This means that for the instance of an adder, it is perfectly valid to write s = a + b and synthesising this will give a correctly functioning adder by the tool- chain. However, since the actual implementation of this adder is completely determined by the synthesis tool, this approach is not useful for this project. Instead, to design the specific arithmetic units discussed in the previous chapter, it is ne- cessary to be explicit and specify the exact logic of each unit. The arithmetic units developed in this project will be designed at the level of basic logic gates such as NOT, AND, OR and XOR gates, with the wiring between ports specified explicitly. This ensures that the EDA toolchain will not attempt to generate its own optimised logic to substitute as an equivalent to the logic specified in the source file. This approach allows for the differences between the types of units to be clearly distinguished for further analysis. However, some optimisations that are difficult to avoid entirely occur during the trans- lation and mapping stages of synthesising a unit. For example, when synthesising an XOR gate, the EDA toolchain has a number of possibilities for configuring an LUT to implement this. In addition, the toolchain can be configured to optimise for particular design goals such as minimising area or delay. The exact algorithms and optimisations for doing this are pro- prietary and depend on both the synthesis tool and the capabilities and properties of the FPGA hardware being used. Since it is not possible to directly observe what exactly occurs at the synthesis level, it is best to hold this source of variability constant to ensure consistent results. For this project, the synthesis tool used will be XST, from the Xilinx ISE 14.5 pack- 24
  • 25. 3.2. IMPLEMENTATION DETAILS CHAPTER 3. DESIGN age. This will be used to synthesis designs targeted at the Xilinx Virtex-7 VC707 evaluation kit. The ISE projects will be configured for the default Balanced profile which aims to give a balance between compact area usage and short delays when synthesising the units. The Verilog source files for the 8-bit units are provided in the appendix for reference (the 32-bit and 64-bit units have been omitted due to space constraints, but are straightforward extensions of the basic logic). The next stage after designing and developing these units is performing logic simulation and testing to ensure they perform as expected and give the correct result for any set of inputs. This topic is covered in the next chapter. 25
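To illustrate the distinction drawn in Section 3.2 between letting the toolchain infer the arithmetic and fixing the structure explicitly, the two fragments below describe an adder in each style. They are illustrative sketches rather than excerpts from the project source; the project's actual gate-level units are listed in Appendix A.

    // Behavioural description: the synthesis tool is free to implement the
    // addition however it sees fit, typically using the FPGA's carry chain.
    module add8_behavioural (
        input  wire [7:0] a, b,
        input  wire       cin,
        output wire [7:0] s,
        output wire       cout
    );
        assign {cout, s} = a + b + cin;
    endmodule

    // Structural description: the individual gates and their wiring are
    // fixed by the designer, which is the style used for the units in this
    // project so that the different adder architectures can be compared.
    module add1_structural (
        input  wire a, b, cin,
        output wire s, cout
    );
        wire axb = a ^ b;

        assign s    = axb ^ cin;
        assign cout = (a & b) | (axb & cin);
    endmodule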
  • 26. 4. Simulation Logic simulation, also known as functional simulation, is a process by which software can be used to determine the behaviour of a digital circuit. Simulation is performed through the use of a stimulus testbench and the logic unit being tested, referred to as the device under test (DUT). The testbench unit sends test data to the DUT’s inputs and the outputs are observed and recorded in the simulation as a waveform trace. Additionally, the outputs can be compared to a set of expected results for each test case. This allows for a unit to be tested and verified for correct operation, as well as allowing the designer to find the source of any faults in the unit. Figure 4.1 shows an example of a multiplier unit being simulated with the ISim tool in Xilinx ISE, with the expected product being compared alongside the actual output value from the multiplier unit. One can observe that this unit is performing as expected, giving the correct product for every operand combination. Figure 4.1.: The simulation process for a Wallace tree multiplier in Xilinx ISim. 26
• 27. In the context of the arithmetic units in this project, simulation is essential to ensure that the correct result is computed for any set of operands. The unit should correctly handle small and large numbers, corner cases such as one operand set to one or zero, and negative numbers if appropriate. To achieve good testing coverage of the units, it is necessary to test using a large number of operand combinations. As performing this many tests manually would be tedious and impractical, it is desirable to automate the test, allowing for a quick pass/fail decision to be made for a unit.

4.1. Testing Methodology For 8-bit units

For the 8-bit arithmetic unit, it is feasible to test every single combination of operands and verify the result. This is referred to as exhaustive testing. This testing scenario is possible because in the case of an 8-bit multiplier, there are 2⁸ possible values for operand a and 2⁸ possible values for operand b. Hence there are 2⁸ × 2⁸ = 2¹⁶ = 65536 test cases to consider. This is a seemingly large number but it can be easily performed by a computer with an automated test.

The testbench for the 8-bit adders and multipliers simply uses two nested loops to loop through every possible value for both operands, checking that the output result is equal to that calculated in software. If a discrepancy occurs, it halts and logs the error, otherwise the testbench continues the simulation until the end. The code for this is given in the appendix.

4.2. Testing Methodology For Larger Units

The exhaustive testing approach, however, does not work in practice for 32-bit and 64-bit arithmetic units. This is because the number of test cases required scales exponentially as 2²ⁿ. An exhaustive 32-bit multiplier testbench requires 2⁶⁴ ≈ 1.8 × 10¹⁹ cases and a 64-bit unit requires 2¹²⁸ ≈ 3.4 × 10³⁸ cases. These are extremely large numbers and it simply isn't feasible to test every possibility in a reasonable amount of time.

An alternative approach is to test a large but feasible set of test cases, each with a randomly selected combination of operands. If the unit gives the correct result for all of these test cases and all paths through the unit have been tested at least once, it can be assumed with a high degree of confidence that it will give the correct result in all cases.

The approach taken in this project for testing the 32-bit and 64-bit units is to generate a set of test data for the testbench. This involves a Python script that selects two numbers at random (within the bit width constraints) using the system's random number generator. It computes their sum, unsigned product and signed product and appends the output as a
  • 28. 4.2. TESTING METHODOLOGY FOR LARGER UNITS CHAPTER 4. SIMULATION formatted string of hexadecimal numbers to a test data file. This process is repeated for one million cases. The process of generating random numbers occasionally generates duplicate operand pairs. These duplicate pairs are removed from the test data file using standard UNIX utilities, allowing for the script to be rerun to generate the remaining test cases until the test data contains one million unique pairs of operands. The test data file can then be used by the testbenches, which scan each line of the test data, set the input values according to the two operands and compare the output with the expected result in the data. If the output differs from the expected result, the test halts and logs the error, otherwise the test continues until the last line of the test data is reached. The code for the test data script and the testbench is given in the appendix, but the actual test data used is omitted from this report due to its large size. Once a unit has successfully passed all of the test cases in the testbench, it can be assumed to be functionally correct under all input conditions. It can then be synthesised as a hardware unit suitable for implementation on the FPGA hardware. This process is covered in the next chapter. 28
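As an illustration of the exhaustive approach described in Section 4.1, the sketch below drives every operand pair through an 8-bit adder and checks each result against a reference computed by the simulator. It is a minimal example rather than the project's own testbench (which is given in the appendix): the device under test shown here is the ripple-carry adder from Appendix A.1.2, and the carry-in is held at zero for brevity.

    // Minimal self-checking exhaustive testbench for an 8-bit adder.
    module tb_add8_exhaustive;
        reg  [7:0] a, b;
        wire [7:0] s;
        wire       cout;
        integer    i, j, errors;

        // Device under test: the 8-bit ripple-carry adder from Appendix A.1.2.
        ripplecarryadd8 dut (.a(a), .b(b), .cin(1'b0), .s(s), .cout(cout));

        initial begin
            errors = 0;
            for (i = 0; i < 256; i = i + 1) begin
                for (j = 0; j < 256; j = j + 1) begin
                    a = i[7:0];
                    b = j[7:0];
                    #10; // allow the combinational logic to settle
                    if ({cout, s} !== i + j) begin
                        $display("FAIL: %0d + %0d gave %0d", i, j, {cout, s});
                        errors = errors + 1;
                    end
                end
            end
            if (errors == 0)
                $display("PASS: all 65536 cases correct");
            $finish;
        end
    endmodule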
  • 29. 5. Implementation Once an arithmetic has been designed and fully verified, it is ready to be implemented. This process involves transforming the Verilog source code for a hardware unit into binary bit- stream data suitable for downloading to an FPGA device. The synthesis process also gener- ates various statistics that are useful in analysing and comparing the various types of adders and multipliers. These will be the primary focus of this chapter. 5.1. The Synthesis Process Synthesis is the process by which a hardware description of a logic unit is used to generate a hardware implementation of logic gates. This implementation can be targeted at the fab- rication of a physical ASIC, or a bitstream for the configuration of an FPGA. In this manner, synthesis is roughly analogous to compiling the source code for some software into a binary executable. The process is performed by the synthesis tool of an EDA toolchain. In the case of this project, the synthesis tool is XST (Xilinx Synthesis Tool) which is integrated into the Xilinx ISE suite. The main stages of synthesis are as follows: • Translate: The translation stage parses the source file and generates a netlist (a list of the wires in the design) and the logic gates associated with them. Various optim- isations are utilised to minimise the specified Boolean logic to a set of logic gates with equivalent truth tables. This assists in reducing the area and delay for the unit. • Map: The map stage uses the aforementioned gate list, assigning them to specific logic slices and inputs/outputs on the FPGA. The LUTs are also configured to reflect the logic required for the design. • Place-and-route: Once the design has been mapped to the FPGA, the place-and-route stage uses the netlist to decide on how the design should be arranged on the chip and the routing of wires between the logic gates. There are a selection of optimisation target profiles that can be used to influence the place-and-route stage. For example, the synthesis tool can be directed to minimize area usage or delay, or strike a balance between both goals. This project utilises the Balanced profile for all units. 29
  • 30. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Once a unit has been synthesised, a bitstream can be generated from it. This requires the designer to assign the unit’s inputs and outputs to the physical pins on the FPGA chip, along with any other constraints that may be required. Once this has been completed, the toolchain generates a complete bitstream of the entire FPGA’s configuration. This bitstream can then be downloaded to the FPGA, finishing the process of implementing the design into hardware. 5.2. Synthesis Reports and Statistics An important part of the synthesis process is the reports and statistics that are generated along with the synthesised unit. These provide important details with regard to the unit such as the area usage, the number of input/output buffers (IOBs) required, the estimated pin-to- pin delay for each combination of input and output pin, the estimated power consumption at a given clock rate, and so forth. This data is important from a development perspective and here it will be utilised to evaluate each of the arithmetic units developed in the course of this project. Unfortunately, due to technical issues with the development software it was not possible to generate reliable dynamic power consumption reports for the purposes of this project, nor was it possible to interact directly with the implemented designs on the hardware. However, since power consumption of a digital circuit is directly proportional to the number of logic gates in the circuit, it can be indirectly inferred that a design with a larger area is expected to consume more power, assuming a similar percentage of gates are switched with each computation. 5.2.1. Adders In this project, the four adders discussed earlier were fully implemented, namely the ripple- carry adder, the carry-lookahead adder, the carry-select adder and the carry-skip adder. These were run through the synthesis tool which generated synthesis reports for each. The first criteria for evaluating the adders was the area usage. This was quantified by examining the number of slice LUTs required for the design. The results for the 8-bit variants are graphed in Figure 5.1. As would be expected, the ripple-carry adder’s simple design gives it the smallest area us- age of the four units, with eight LUTs used. What was not expected was that the carry-select adder also used an equal amount of area. This was most likely due to synthesis optimisations that allowed for the adder to be efficiently implemented in the FPGA. The carry-lookahead adder required an additional LUT for its lookahead logic while the carry-skip adder required 30
  • 31. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.1.: Graph showing the area usage of the 8-bit adder variants, in terms of LUTs used. the most with ten LUTs. The scaling of these adders is depicted in Figure 5.2 and Figure 5.3 for the 32-bit and 64-bit adders respectively. From these graphs, it is immediately apparent that the area requirements of the carry- lookahead adder scale up very quickly in relation to operand width. The 32-bit and 64-bit carry-lookahead adders are significantly larger than the other variants, which suggests that large carry-lookahead adders are not suitable for practical use. Again, both the ripple-carry and carry-select adders use the least area, while the carry-skip adder uses noticeably more LUTs. 31
  • 32. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.2.: Graph showing the area usage of the 32-bit adder variants. Figure 5.3.: Graph showing the area usage of the 64-bit adder variants. 32
  • 33. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION The next criteria to analyse is the maximum worst-case delay, a value that is necessary to determine for the purpose of integrating the unit within a larger logic unit. This value is obtained from the pin-to-pin delay report, which details the delays between each bit of every input pin and every bit of every output pin. The largest value from this list is selected as being the maximum delay. The results of this are shown in Figure 5.4. Figure 5.4.: Graph showing the maximum worst-case delay of the 8-bit adder variants, meas- ured in nanoseconds. Slightly surprisingly, the ripple-carry adder comes first with the shortest delay as com- pared to the other adders. This is likely due to the fact that with only eight full adders in a ripple-carry chain, the ripple delay is not yet large enough to be a significant issue. The additional cost of the critical path optimisations present in the other adder designs hence do not outweigh the benefits derived from them. Figure 5.5 and Figure 5.6 show the maximum delay of the 32-bit and 64-bit adders respectively. Here, the other adders now provide a tangible reduction in maximum delay as compared to the ripple-carry adder. The carry-lookahead adder in particular has the shortest delay, though this comes at the cost of significantly more area usage as discussed previously. The 32-bit carry-select adder also improves on the delay relative to the ripple-carry adder, while the 32-bit carry-skip adder proves to be the slowest. This situation is reversed for the 64-bit adders where the carry-skip adder proves to be faster than the carry-select adder, though not as fast as the carry-lookahead adder. In fact, in this instance the carry-select adder has 33
  • 34. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.5.: Graph showing the maximum worst-case delay of the 32-bit adder variants. Figure 5.6.: Graph showing the maximum worst-case delay of the 64-bit adder variants. a longer delay than the ripple-carry adder. Hence, we can conclude that there is no ‘best’ adder design in all cases. The best choice of adder for each operand width is determined by the designers requirements and by profiling the individual designs. 34
  • 35. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION 5.2.2. Multipliers The array multiplier, Wallace tree multiplier and Dadda tree multiplier were all implemented as 8-bits units in this project. However, due to time constraints only the array multiplier was implemented as 32-bit and 64-bit units. The majority of this section will therefore focus on the 8-bit units. Firstly, the area usage of these units is given in Figure 5.7. Figure 5.7.: Graph showing the area usage of the 8-bit multipliers. It is clear that the Wallace tree multiplier uses the least area with 73 LUTs as compared to 74 for the array multiplier and 76 for the Dadda multiplier. This is despite the additional overhead imposed by the MBE, which requires more logic than traditional calculation of the partial products. In addition, the Wallace tree multiplier’s ability to handle multiplication of signed numbers places it at a clear advantage against the array multiplier. Meanwhile, the Dadda multiplier’s expanded design is visible in its increased area usage. The next criteria is the maximum worst-case delay. The graph for this is given in Figure 5.8. Again, the Wallace tree multiplier emerges as the unequivocal winner with the shortest delay, while also being able to multiply signed numbers. The Dadda tree multiplier has the longest delay, although by a relatively small margin. However, it is entirely possible that different characteristics could be observed with 32-bit and 64-bit implementations of the tree multi- pliers, as was the case with the carry-select and carry-skip adders. 35
  • 36. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.8.: Graph showing the maximum worst-case of the 8-bit multipliers. For informative purposes, the 8-bit array multiplier was compared to its 32-bit and 64-bit counterparts to analyse the scaling of the array multiplier with operand width. The graphs for area and delay are given in Figure 5.9 and Figure 5.10 respectively. The delay graph shows the array multiplier scales slightly less than linearly between the three bit widths, with delays of 11ns, 38ns and 56ns for the 8-bit, 32-bit and 64-bit multipliers respectively. However, it is the area graph that shows the enormous effect of n2 scaling of area usage. While the 8-bit multiplier only required 74 LUTs, the 32-bit multiplier requires over 1400 and this balloons to almost 6000 for the 64-bit multiplier. Hence, it is obvious that the array multiplier is a poor choice of design for practical multiplier applications as the area requirements are simply too large. 36
  • 37. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.9.: Graph showing the scaling of area usage for array multipliers of various widths. Figure 5.10.: Graph showing the scaling of maximum delay for array multipliers of various widths. 37
  • 38. 6. Conclusion This project has covered the complete development process of a variety of arithmetic units, from the theory that underpins them through to their design and testing process, before being implemented as complete hardware units. Additionally, it has covered the design properties and characteristics that distinguish the units from each other, with the advantages and dis- advantages of each unit discussed in detail. Each unit was also developed with a selection of operand widths to investigate the effects of scaling on the characteristics of the units. In particular, it can be concluded that for the adders, no particular adder emerged as a clear best design. The choice of adder used in a digital circuit is guided heavily by the requirements of the designer’s project. For instance, a compact design requiring an 8-bit adder with min- imal area usage and power consumption is best served by the corresponding ripple-carry adder. A design that calls for minimal delay in an 8-bit adder is best served by the carry- lookahead adder. For a 32-bit adder, the area scaling issues of a carry-lookahead adder make it unsuitable for many applications, so a designer seeking out low delay while keeping area usage acceptable would select the carry-select adder. For the multipliers, the situation is more clear-cut. The 8-bit Wallace tree multiplier proved to be superior to its array multiplier counterpart in both aspects. Its use of a modified Booth encoder and a tree structure allowed it to use less area while simultaneously possessing a shorter delay. It also avoids the n2 area scaling issue of the array multiplier as seen with the latter’s 32-bit and 64-bit variants. The ability to correctly multiply signed integers cements its advantage as many digital designs will require the ability to multiply negative numbers together. Meanwhile, the Dadda tree multiplier was disadvantaged by having a larger area and delay than the other two multiplier designs. However, since these results were only obtained for 8-bit multipliers, it is entirely possible that wider variants of the Dadda tree multiplier may give more favourable characteristics than its Wallace counterpart. The primary issues that occurred with this project was a lack of time as well as a lack of prior knowledge and experience in digital hardware design and arithmetic units. In partic- ular, understanding the logic and theory of the tree-based multiplier units and the modified Booth encoder was very time-consuming. Possessing a thorough understanding of these units was essential before development work could begin. Hence, there was only sufficient 38
  • 39. CHAPTER 6. CONCLUSION time to complete the design, verification and implementation of the 8-bit variants of the tree multipliers. Given more time, an analysis of the scaling characteristics of the tree multipliers with 32-bit and 64-bit wide operands could have been undertaken. Another issue was the inability to obtain data on the power consumption qualities of the arithmetic units as was originally intended, from both software estimated values and actual values measured from the hardware. Technical issues as well a lack of prior experience with the software made it difficult to obtain meaningful dynamic power consumption estimates. Only estimates of static power consumption (from transistor leakage) was available, which were not useful for quantifying the power consumed when evaluating a calculation. Ad- ditionally, there were further issues with using the arithmetic units on the physical FPGA hardware. While it was possible to generate a bitstream from the synthesised units and download this to the hardware, there was no clear method of interfacing with the arithmetic unit. It was not possible to send test data to the unit nor to read its output, severely limit- ing the usefulness of this approach. With more time available, it would have been easier to overcome these issues. It would also have been possible to make physical measurements on the hardware, allowing for a useful analysis of real-world power consumption by the units. Despite these issues, the project was still successful in that many useful results were ob- tained. Nearly all of the intended units were fully designed, verified and implemented in the course of this project. Ultimately, it has been an extremely rewarding experience and has given a significant amount of in-depth knowledge and practical hands-on experience in the realm of digital hardware design. 39
• 40. Bibliography

Bohsali, M. and M. Doan. Rectangular Styled Wallace Tree Multipliers. url: http://www.veech.com/index files/Wallace%20Tree.pdf.
EDA Cafe. Datapath Logic Cells. url: http://www10.edacafe.com/book/ASIC/Book/Book/Book/CH02/CH02.6.php.
Nvidia. What is GPU Accelerated Computing? url: http://www.nvidia.com/object/what-is-gpu-computing.html.
Punnaiah, S. et al. 'Design and Evaluation of High Performance Multiplier Using Modified Booth Encoding Algorithm'. In: International Journal of Engineering and Innovative Technology 1 (6 June 2012), pp. 16–19.
Saharan, K. and J. Kaur. 'Design and Implementation of an Efficient Modified Booth Multiplier using VHDL'. In: International Journal of Advances in Engineering Sciences 3 (3 July 2013), pp. 78–81.
Terms, Tech. ALU (Arithmetic Logic Unit) Definition. url: http://www.techterms.com/definition/alu.
Tohoku University. Hardware Algorithms For Arithmetic Modules. url: http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html.
Tufts University. 4*4 multiplier. url: http://www.eecs.tufts.edu/~ryun01/vlsi/verilog simulation.htm.
Vahid, Frank. Digital Design. Wiley, 2011. isbn: 978-0-470-53108-2.
Vladutiu, Mircea. Computer Arithmetic. Algorithms and Hardware Implementations. Springer, 2012. isbn: 978-3-642-18315-7.
Yovits, Marshall C. Advances in Computers. Academic Press, 1993, p. 105. isbn: 978-0-470-53108-2.
• 41. Appendix A. Verilog Source Code for the 8-bit Arithmetic Units

A.1. Adders

A.1.1. Half Adder and Full Adder

    // Simple half-adder that computes the sum and carry-out of two bits
    module halfadder (
        input wire a,
        input wire b,
        output wire s,
        output wire cout
    );
        assign s = a ^ b;
        assign cout = a & b;
    endmodule

    // Simple full-adder that computes the sum and carry-out of two bits,
    // plus a carry-in
    module fulladder (
        input wire a,
        input wire b,
        input wire cin,
        output wire s,
        output wire cout
    );
        wire s0, c0, c1;
• 42.
        // Two half-adders are chained to create a full adder
        halfadder ha0 (.a(a), .b(b), .s(s0), .cout(c0));
        halfadder ha1 (.a(s0), .b(cin), .s(s), .cout(c1));

        assign cout = c0 | c1;
    endmodule

A.1.2. Ripple-carry Adder

    // 4-bit ripple-carry adder with four full-adders
    module ripplecarryadd4 (
        input wire [3:0] a,
        input wire [3:0] b,
        input wire cin,
        output wire [3:0] s,
        output wire cout
    );
        wire [3:0] c;

        assign cout = c[3];

        fulladder fa0 (.a(a[0]), .b(b[0]), .cin(cin),  .s(s[0]), .cout(c[0]));
        fulladder fa1 (.a(a[1]), .b(b[1]), .cin(c[0]), .s(s[1]), .cout(c[1]));
        fulladder fa2 (.a(a[2]), .b(b[2]), .cin(c[1]), .s(s[2]), .cout(c[2]));
        fulladder fa3 (.a(a[3]), .b(b[3]), .cin(c[2]), .s(s[3]), .cout(c[3]));
    endmodule

    // 8-bit ripple-carry adder with two 4-bit RCAs
    module ripplecarryadd8 (
        input wire [7:0] a,
        input wire [7:0] b,
        input wire cin,
        output wire [7:0] s,
        output wire cout
    );
• 43.
        wire c0;

        ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(cin), .s(s[3:0]), .cout(c0));
        ripplecarryadd4 rca4_1 (.a(a[7:4]), .b(b[7:4]), .cin(c0),  .s(s[7:4]), .cout(cout));
    endmodule

A.1.3. Carry-lookahead Adder

    // Propagate/Generate adder that computes the P/G signals
    // instead of the carry-out as in the full adder
    module pgadder (
        input wire a,
        input wire b,
        input wire cin,
        output wire s,
        output wire p,
        output wire g
    );
        assign s = a ^ b ^ cin;
        assign p = a | b;
        assign g = a & b;
    endmodule

    // 8-bit carry-lookahead adder
    module carrylookaheadadd8 (
        input wire [7:0] a,
        input wire [7:0] b,
        input wire cin,
        output wire [7:0] s,
        output wire cout
    );
        // Propagate, generate and carry output signals for each bit
        wire [7:0] p, g, c;
• 44.
        // The formula for the lookahead is c_i+1 = g_i | (p_i & c_i), where c_i is expanded recursively
        assign c[0] = cin;
        assign c[1] = g[0] | (p[0] & c[0]);
        assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
        assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0]);
        assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0]) |
                      (p[3] & p[2] & p[1] & p[0] & c[0]);
        assign c[5] = g[4] | (p[4] & g[3]) | (p[4] & p[3] & g[2]) | (p[4] & p[3] & p[2] & g[1]) |
                      (p[4] & p[3] & p[2] & p[1] & g[0]) | (p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
        assign c[6] = g[5] | (p[5] & g[4]) | (p[5] & p[4] & g[3]) | (p[5] & p[4] & p[3] & g[2]) |
                      (p[5] & p[4] & p[3] & p[2] & g[1]) | (p[5] & p[4] & p[3] & p[2] & p[1] & g[0]) |
                      (p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
        assign c[7] = g[6] | (p[6] & g[5]) | (p[6] & p[5] & g[4]) | (p[6] & p[5] & p[4] & g[3]) |
                      (p[6] & p[5] & p[4] & p[3] & g[2]) | (p[6] & p[5] & p[4] & p[3] & p[2] & g[1]) |
                      (p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & g[0]) |
                      (p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
        assign cout = g[7] | (p[7] & g[6]) | (p[7] & p[6] & g[5]) | (p[7] & p[6] & p[5] & g[4]) |
                      (p[7] & p[6] & p[5] & p[4] & g[3]) | (p[7] & p[6] & p[5] & p[4] & p[3] & g[2]) |
                      (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & g[1]) |
                      (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & g[0]) |
                      (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);

        // Propagate/Generate adders which give the P/G signals and use the carries computed by the CLA logic
        pgadder pga0 (.a(a[0]), .b(b[0]), .cin(c[0]), .s(s[0]), .p(p[0]), .g(g[0]));
        pgadder pga1 (.a(a[1]), .b(b[1]), .cin(c[1]), .s(s[1]), .p(p[1]), .g(g[1]));
        pgadder pga2 (.a(a[2]), .b(b[2]), .cin(c[2]), .s(s[2]), .p(p[2]), .g(g[2]));
• 45.
        pgadder pga3 (.a(a[3]), .b(b[3]), .cin(c[3]), .s(s[3]), .p(p[3]), .g(g[3]));
        pgadder pga4 (.a(a[4]), .b(b[4]), .cin(c[4]), .s(s[4]), .p(p[4]), .g(g[4]));
        pgadder pga5 (.a(a[5]), .b(b[5]), .cin(c[5]), .s(s[5]), .p(p[5]), .g(g[5]));
        pgadder pga6 (.a(a[6]), .b(b[6]), .cin(c[6]), .s(s[6]), .p(p[6]), .g(g[6]));
        pgadder pga7 (.a(a[7]), .b(b[7]), .cin(c[7]), .s(s[7]), .p(p[7]), .g(g[7]));
    endmodule

A.1.4. Carry-select Adder

    // 8-bit carry-select adder
    module carryselectadd8 (
        input wire [7:0] a,
        input wire [7:0] b,
        input wire cin,
        output wire [7:0] s,
        output wire cout
    );
        wire cs, cout_0, cout_1;
        wire [3:0] result_0, result_1;

        // The appropriate output for the upper bits is selected by the carry-select signal
        assign {cout, s[7:4]} = (cs) ? {cout_1, result_1} : {cout_0, result_0};

        // Simple RCA adds the lower four bits and emits a carry-select signal
        ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(cin), .s(s[3:0]), .cout(cs));

        // The upper bits are computed twice, for a carry-in of 0
ripplecarryadd4 rca4_1_0 (.a(a[7:4]), .b(b[7:4]), .cin(1'b0), .s(result0), .cout(cout0));
ripplecarryadd4 rca4_1_1 (.a(a[7:4]), .b(b[7:4]), .cin(1'b1), .s(result1), .cout(cout1));
endmodule

A.1.5. Carry-skip Adder

// A 4-bit RCA that provides a propagate signal for the carry-skip adder
module pgripplecarryadd4 (input wire [3:0] a,
                          input wire [3:0] b,
                          input wire cin,
                          output wire [3:0] s,
                          output wire cout,
                          output wire p);

wire [3:0] c;

assign cout = c[3];
assign p = (a[0] | b[0]) & (a[1] | b[1]) & (a[2] | b[2]) & (a[3] | b[3]);

fulladder fa0 (.a(a[0]), .b(b[0]), .cin(cin), .s(s[0]), .cout(c[0]));
fulladder fa1 (.a(a[1]), .b(b[1]), .cin(c[0]), .s(s[1]), .cout(c[1]));
fulladder fa2 (.a(a[2]), .b(b[2]), .cin(c[1]), .s(s[2]), .cout(c[2]));
fulladder fa3 (.a(a[3]), .b(b[3]), .cin(c[2]), .s(s[3]), .cout(c[3]));
endmodule
// 8-bit carry-skip adder
module carryskipadd8 (input wire [7:0] a,
                      input wire [7:0] b,
                      input wire cin,
                      output wire [7:0] s,
                      output wire cout);

wire [1:0] rcin, rcout;
wire p;

assign rcin[0] = cin;
assign rcin[1] = rcout[0];

// If the second RCA will propagate a carry, simply pass rcout[0] to the cout, skipping the second RCA
assign cout = rcout[1] | (p & rcout[0]);

ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(rcin[0]), .s(s[3:0]), .cout(rcout[0]));
pgripplecarryadd4 rca4_1 (.a(a[7:4]), .b(b[7:4]), .cin(rcin[1]), .s(s[7:4]), .cout(rcout[1]), .p(p));
endmodule
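The 8-bit adder modules in this appendix share the same (a, b, cin, s, cout) interface, so a single exhaustive, self-checking testbench can cover any of them by swapping the instantiated module. The sketch below is illustrative only and is not part of the original sources; the testbench name tb_adder8 is an assumption.

// Illustrative exhaustive testbench (not from the original sources): checks
// every (a, b, cin) combination of an 8-bit adder against built-in addition.
module tb_adder8;
    reg  [7:0] a, b;
    reg        cin;
    wire [7:0] s;
    wire       cout;
    integer    i, j, k, errors;

    // Swap in carrylookaheadadd8 or carryselectadd8 to test the other units
    carryskipadd8 dut (.a(a), .b(b), .cin(cin), .s(s), .cout(cout));

    initial begin
        errors = 0;
        for (i = 0; i < 256; i = i + 1)
            for (j = 0; j < 256; j = j + 1)
                for (k = 0; k < 2; k = k + 1) begin
                    a = i; b = j; cin = k;
                    #1;
                    if ({cout, s} !== (i + j + k)) begin
                        errors = errors + 1;
                        $display("Mismatch: %0d + %0d + %0d gave %0d", i, j, k, {cout, s});
                    end
                end
        $display("Exhaustive adder check complete, %0d mismatches", errors);
        $finish;
    end
endmodule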
A.2. Multipliers

A.2.1. Array Multiplier

// Array multiplier module that computes a bit product and adds it to a sum-in
module mulmodule (input wire x,
                  input wire y,
                  input wire sin,
                  input wire cin,
                  output wire cout,
                  output wire sout);

wire bin;

assign bin = x & y;

fulladder adder (.a(sin), .b(bin), .cin(cin), .s(sout), .cout(cout));
endmodule

// 8-bit unsigned array multiplier
module arraymultiplier8 (input wire [7:0] a,
                         input wire [7:0] b,
                         output wire [15:0] result);

wire [7:0] c [8:0], s [7:0]; // Intermediate carry and sum wires

// Partial products of multiplicand with each multiplier bit
mulmodule mm00_00 (.x(a[0]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][0]), .sout(s[0][0]));
mulmodule mm00_01 (.x(a[1]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][1]), .sout(s[0][1]));
mulmodule mm00_02 (.x(a[2]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][2]), .sout(s[0][2]));
mulmodule mm00_03 (.x(a[3]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][3]), .sout(s[0][3]));
mulmodule mm00_04 (.x(a[4]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][4]), .sout(s[0][4]));
mulmodule mm00_05 (.x(a[5]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][5]), .sout(s[0][5]));
mulmodule mm00_06 (.x(a[6]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][6]), .sout(s[0][6]));
mulmodule mm00_07 (.x(a[7]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][7]), .sout(s[0][7]));

mulmodule mm01_00 (.x(a[0]), .y(b[1]), .sin(s[0][1]), .cin(c[0][0]), .cout(c[1][0]), .sout(s[1][0]));
mulmodule mm01_01 (.x(a[1]), .y(b[1]), .sin(s[0][2]), .cin(c[0][1]), .cout(c[1][1]), .sout(s[1][1]));
mulmodule mm01_02 (.x(a[2]), .y(b[1]), .sin(s[0][3]), .cin(c[0][2]), .cout(c[1][2]), .sout(s[1][2]));
mulmodule mm01_03 (.x(a[3]), .y(b[1]), .sin(s[0][4]), .cin(c[0][3]), .cout(c[1][3]), .sout(s[1][3]));
mulmodule mm01_04 (.x(a[4]), .y(b[1]), .sin(s[0][5]), .cin(c[0][4]), .cout(c[1][4]), .sout(s[1][4]));
mulmodule mm01_05 (.x(a[5]), .y(b[1]), .sin(s[0][6]), .cin(c[0][5]), .cout(c[1][5]), .sout(s[1][5]));
mulmodule mm01_06 (.x(a[6]), .y(b[1]), .sin(s[0][7]), .cin(c[0][6]), .cout(c[1][6]), .sout(s[1][6]));
mulmodule mm01_07 (.x(a[7]), .y(b[1]), .sin(1'b0), .cin(c[0][7]), .cout(c[1][7]), .sout(s[1][7]));

mulmodule mm02_00 (.x(a[0]), .y(b[2]), .sin(s[1][1]), .cin(c[1][0]), .cout(c[2][0]), .sout(s[2][0]));
mulmodule mm02_01 (.x(a[1]), .y(b[2]), .sin(s[1][2]), .cin(c[1][1]), .cout(c[2][1]), .sout(s[2][1]));
mulmodule mm02_02 (.x(a[2]), .y(b[2]), .sin(s[1][3]), .cin(c[1][2]), .cout(c[2][2]), .sout(s[2][2]));
mulmodule mm02_03 (.x(a[3]), .y(b[2]), .sin(s[1][4]), .cin(c[1][3]), .cout(c[2][3]), .sout(s[2][3]));
mulmodule mm02_04 (.x(a[4]), .y(b[2]), .sin(s[1][5]), .cin(c[1][4]), .cout(c[2][4]), .sout(s[2][4]));
mulmodule mm02_05 (.x(a[5]), .y(b[2]), .sin(s[1][6]), .cin(c[1][5]), .cout(c[2][5]), .sout(s[2][5]));
mulmodule mm02_06 (.x(a[6]), .y(b[2]), .sin(s[1][7]), .cin(c[1][6]), .cout(c[2][6]), .sout(s[2][6]));
mulmodule mm02_07 (.x(a[7]), .y(b[2]), .sin(1'b0), .cin(c[1][7]), .cout(c[2][7]), .sout(s[2][7]));

mulmodule mm03_00 (.x(a[0]), .y(b[3]), .sin(s[2][1]), .cin(c[2][0]), .cout(c[3][0]), .sout(s[3][0]));
mulmodule mm03_01 (.x(a[1]), .y(b[3]), .sin(s[2][2]), .cin(c[2][1]), .cout(c[3][1]), .sout(s[3][1]));
mulmodule mm03_02 (.x(a[2]), .y(b[3]), .sin(s[2][3]), .cin(c[2][2]), .cout(c[3][2]), .sout(s[3][2]));
mulmodule mm03_03 (.x(a[3]), .y(b[3]), .sin(s[2][4]), .cin(c[2][3]), .cout(c[3][3]), .sout(s[3][3]));
mulmodule mm03_04 (.x(a[4]), .y(b[3]), .sin(s[2][5]), .cin(c[2][4]), .cout(c[3][4]), .sout(s[3][4]));
mulmodule mm03_05 (.x(a[5]), .y(b[3]), .sin(s[2][6]), .cin(c[2][5]), .cout(c[3][5]), .sout(s[3][5]));
mulmodule mm03_06 (.x(a[6]), .y(b[3]), .sin(s[2][7]), .cin(c[2][6]), .cout(c[3][6]), .sout(s[3][6]));
mulmodule mm03_07 (.x(a[7]), .y(b[3]), .sin(1'b0), .cin(c[2][7]), .cout(c[3][7]), .sout(s[3][7]));

mulmodule mm04_00 (.x(a[0]), .y(b[4]), .sin(s[3][1]), .cin(c[3][0]), .cout(c[4][0]), .sout(s[4][0]));
mulmodule mm04_01 (.x(a[1]), .y(b[4]), .sin(s[3][2]), .cin(c[3][1]), .cout(c[4][1]), .sout(s[4][1]));
mulmodule mm04_02 (.x(a[2]), .y(b[4]), .sin(s[3][3]), .cin(c[3][2]), .cout(c[4][2]), .sout(s[4][2]));
mulmodule mm04_03 (.x(a[3]), .y(b[4]), .sin(s[3][4]), .cin(c[3][3]), .cout(c[4][3]), .sout(s[4][3]));
mulmodule mm04_04 (.x(a[4]), .y(b[4]), .sin(s[3][5]), .cin(c[3][4]), .cout(c[4][4]), .sout(s[4][4]));
mulmodule mm04_05 (.x(a[5]), .y(b[4]), .sin(s[3][6]), .cin(c[3][5]), .cout(c[4][5]), .sout(s[4][5]));
mulmodule mm04_06 (.x(a[6]), .y(b[4]), .sin(s[3][7]), .cin(c[3][6]), .cout(c[4][6]), .sout(s[4][6]));
mulmodule mm04_07 (.x(a[7]), .y(b[4]), .sin(1'b0), .cin(c[3][7]), .cout(c[4][7]), .sout(s[4][7]));

mulmodule mm05_00 (.x(a[0]), .y(b[5]), .sin(s[4][1]), .cin(c[4][0]), .cout(c[5][0]), .sout(s[5][0]));
mulmodule mm05_01 (.x(a[1]), .y(b[5]), .sin(s[4][2]), .cin(c[4][1]), .cout(c[5][1]), .sout(s[5][1]));
mulmodule mm05_02 (.x(a[2]), .y(b[5]), .sin(s[4][3]), .cin(c[4][2]), .cout(c[5][2]), .sout(s[5][2]));
mulmodule mm05_03 (.x(a[3]), .y(b[5]), .sin(s[4][4]), .cin(c[4][3]), .cout(c[5][3]), .sout(s[5][3]));
mulmodule mm05_04 (.x(a[4]), .y(b[5]), .sin(s[4][5]), .cin(c[4][4]), .cout(c[5][4]), .sout(s[5][4]));
mulmodule mm05_05 (.x(a[5]), .y(b[5]), .sin(s[4][6]), .cin(c[4][5]), .cout(c[5][5]), .sout(s[5][5]));
mulmodule mm05_06 (.x(a[6]), .y(b[5]), .sin(s[4][7]), .cin(c[4][6]), .cout(c[5][6]), .sout(s[5][6]));
mulmodule mm05_07 (.x(a[7]), .y(b[5]), .sin(1'b0), .cin(c[4][7]), .cout(c[5][7]), .sout(s[5][7]));

mulmodule mm06_00 (.x(a[0]), .y(b[6]), .sin(s[5][1]), .cin(c[5][0]), .cout(c[6][0]), .sout(s[6][0]));
mulmodule mm06_01 (.x(a[1]), .y(b[6]), .sin(s[5][2]), .cin(c[5][1]), .cout(c[6][1]), .sout(s[6][1]));
mulmodule mm06_02 (.x(a[2]), .y(b[6]), .sin(s[5][3]), .cin(c[5][2]), .cout(c[6][2]), .sout(s[6][2]));
mulmodule mm06_03 (.x(a[3]), .y(b[6]), .sin(s[5][4]), .cin(c[5][3]), .cout(c[6][3]), .sout(s[6][3]));
mulmodule mm06_04 (.x(a[4]), .y(b[6]), .sin(s[5][5]), .cin(c[5][4]), .cout(c[6][4]), .sout(s[6][4]));
mulmodule mm06_05 (.x(a[5]), .y(b[6]), .sin(s[5][6]), .cin(c[5][5]), .cout(c[6][5]), .sout(s[6][5]));
mulmodule mm06_06 (.x(a[6]), .y(b[6]), .sin(s[5][7]), .cin(c[5][6]), .cout(c[6][6]), .sout(s[6][6]));
mulmodule mm06_07 (.x(a[7]), .y(b[6]), .sin(1'b0), .cin(c[5][7]), .cout(c[6][7]), .sout(s[6][7]));

mulmodule mm07_00 (.x(a[0]), .y(b[7]), .sin(s[6][1]), .cin(c[6][0]), .cout(c[7][0]), .sout(s[7][0]));
mulmodule mm07_01 (.x(a[1]), .y(b[7]), .sin(s[6][2]), .cin(c[6][1]), .cout(c[7][1]), .sout(s[7][1]));
mulmodule mm07_02 (.x(a[2]), .y(b[7]), .sin(s[6][3]), .cin(c[6][2]), .cout(c[7][2]), .sout(s[7][2]));
mulmodule mm07_03 (.x(a[3]), .y(b[7]), .sin(s[6][4]), .cin(c[6][3]), .cout(c[7][3]), .sout(s[7][3]));
mulmodule mm07_04 (.x(a[4]), .y(b[7]), .sin(s[6][5]), .cin(c[6][4]), .cout(c[7][4]), .sout(s[7][4]));
mulmodule mm07_05 (.x(a[5]), .y(b[7]), .sin(s[6][6]), .cin(c[6][5]), .cout(c[7][5]), .sout(s[7][5]));
mulmodule mm07_06 (.x(a[6]), .y(b[7]), .sin(s[6][7]), .cin(c[6][6]), .cout(c[7][6]), .sout(s[7][6]));
mulmodule mm07_07 (.x(a[7]), .y(b[7]), .sin(1'b0), .cin(c[6][7]), .cout(c[7][7]), .sout(s[7][7]));

// Lower 8 bits can be obtained from the sum out of the last layer
assign result[0] = s[0][0];
assign result[1] = s[1][0];
assign result[2] = s[2][0];
assign result[3] = s[3][0];
assign result[4] = s[4][0];
assign result[5] = s[5][0];
assign result[6] = s[6][0];
assign result[7] = s[7][0];

// Upper 8 bits need to be summed with carry-outs from previous bits
halfadder ha00 (.a(s[7][1]), .b(c[7][0]), .s(result[8]), .cout(c[8][0]));
fulladder fa01 (.a(s[7][2]), .b(c[7][1]), .cin(c[8][0]), .s(result[9]), .cout(c[8][1]));
fulladder fa02 (.a(s[7][3]), .b(c[7][2]), .cin(c[8][1]), .s(result[10]), .cout(c[8][2]));
fulladder fa03 (.a(s[7][4]), .b(c[7][3]), .cin(c[8][2]), .s(result[11]), .cout(c[8][3]));
fulladder fa04 (.a(s[7][5]), .b(c[7][4]), .cin(c[8][3]), .s(result[12]), .cout(c[8][4]));
fulladder fa05 (.a(s[7][6]), .b(c[7][5]), .cin(c[8][4]), .s(result[13]), .cout(c[8][5]));
fulladder fa06 (.a(s[7][7]), .b(c[7][6]), .cin(c[8][5]), .s(result[14]), .cout(c[8][6]));

assign result[15] = c[7][7] ^ c[8][6];
endmodule
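Because the array multiplier is unsigned and purely combinational, its output can be compared directly against Verilog's built-in multiply operator in simulation. The wrapper below is an illustrative sketch only; the module name check_arraymultiplier8 and its structure are assumptions, not part of the original sources.

// Illustrative comparison wrapper (not from the original sources): binds the
// array multiplier next to a behavioural reference product and reports any
// mismatch observed in simulation.
module check_arraymultiplier8 (input wire [7:0] a, input wire [7:0] b);
    wire [15:0] result;
    wire [15:0] expected = a * b;   // behavioural unsigned reference product

    arraymultiplier8 dut (.a(a), .b(b), .result(result));

    always @(a or b)
        #1 if (result !== expected)
            $display("arraymultiplier8 mismatch: a=%0d b=%0d result=%0d expected=%0d",
                     a, b, result, expected);
endmodule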
A.2.2. Modified Booth Encoder

// 8-bit Modified Booth Encoder to generate four partial products to be summed
module boothencoder8 (input wire [7:0] a,
                      input wire [7:0] b,
                      output reg [8:0] p00,
                      output reg [8:0] p01,
                      output reg [8:0] p02,
                      output reg [8:0] p03);

// Group multiplier bits into overlapping groups of three bits, then decide
// what the partial products should be based on the bits
always @(a or b) begin
    // Equivalent to appending a zero to bits 1 and 0, only need to check four cases
    case (b[1:0])
        2'b00: p00 <= 9'b000000000;
        2'b01: p00 <= $signed(a);
        2'b10: p00 <= {~a, 1'b1};
        2'b11: p00 <= $signed(~a);
        default: p00 <= 9'bxxxxxxxxx;
    endcase

    case (b[3:1])
        3'b000: p01 <= 9'b000000000;   // P = 0
        3'b001: p01 <= $signed(a);     // P = A
        3'b010: p01 <= $signed(a);     // P = A
        3'b011: p01 <= {a, 1'b0};      // P = 2A
        3'b100: p01 <= {~a, 1'b1};     // P = -2A
        3'b101: p01 <= $signed(~a);    // P = -A
        3'b110: p01 <= $signed(~a);    // P = -A
        3'b111: p01 <= 9'b111111111;   // P = 0 (all 1s, plus complement bit gives 0)
        default: p01 <= 9'bxxxxxxxxx;  // Should not occur in normal operation with defined inputs
    endcase
    case (b[5:3])
        3'b000: p02 <= 9'b000000000;
        3'b001: p02 <= $signed(a);
        3'b010: p02 <= $signed(a);
        3'b011: p02 <= {a, 1'b0};
        3'b100: p02 <= {~a, 1'b1};
        3'b101: p02 <= $signed(~a);
        3'b110: p02 <= $signed(~a);
        3'b111: p02 <= 9'b111111111;
        default: p02 <= 9'bxxxxxxxxx;
    endcase

    case (b[7:5])
        3'b000: p03 <= 9'b000000000;
        3'b001: p03 <= $signed(a);
        3'b010: p03 <= $signed(a);
        3'b011: p03 <= {a, 1'b0};
        3'b100: p03 <= {~a, 1'b1};
        3'b101: p03 <= $signed(~a);
        3'b110: p03 <= $signed(~a);
        3'b111: p03 <= 9'b111111111;
        default: p03 <= 9'bxxxxxxxxx;
    endcase
end
endmodule
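As a worked example (added here for illustration, not part of the original listing), take b = 8'b01101010 (106). The four overlapping groups are {b[1], b[0], 0} = 100, {b[3], b[2], b[1]} = 101, {b[5], b[4], b[3]} = 101 and {b[7], b[6], b[5]} = 011, which recode to -2A, -A, -A and +2A respectively. Weighting group i by 4^i gives (-2)·1 + (-1)·4 + (-1)·16 + 2·64 = 106·A, as required. Note that the encoder only emits the one's complement for the negative rows; the extra +1 that completes each two's-complement negation appears to be injected later in the reduction trees through the b[1], b[3], b[5] and b[7] bits that feed the adders below.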
A.2.3. Wallace Tree Multiplier

// 8-bit signed Wallace Tree multiplier
module wallacetreemultiplier8 (input wire [7:0] a,
                               input wire [7:0] b,
                               output wire [15:0] result);

wire [8:0] p [3:0];
wire [2:0] c [15:2];
wire [2:0] w [15:2];
wire cla0_cout;

// Use the modified Booth encoder to generate the partial products
boothencoder8 pp (.a(a), .b(b), .p00(p[0]), .p01(p[1]), .p02(p[2]), .p03(p[3]));

// Sum the partial products by wire weight until only two wires are left for each weight
// Weight 2^2
halfadder ha02_00 (.a(p[0][2]), .b(p[1][0]), .s(w[2][0]), .cout(c[2][0]));

// Weight 2^3
halfadder ha03_00 (.a(p[0][3]), .b(p[1][1]), .s(w[3][0]), .cout(c[3][0]));

// Weight 2^4
fulladder fa04_00 (.a(p[0][4]), .b(p[1][2]), .cin(p[2][0]), .s(w[4][0]), .cout(c[4][0]));
halfadder ha04_01 (.a(w[4][0]), .b(b[5]), .s(w[4][1]), .cout(c[4][1]));

// Weight 2^5
fulladder fa05_00 (.a(p[0][5]), .b(p[1][3]), .cin(p[2][1]), .s(w[5][0]), .cout(c[5][0]));
halfadder ha05_01 (.a(w[5][0]), .b(c[4][0]), .s(w[5][1]), .cout(c[5][1]));

// Weight 2^6
fulladder fa06_00 (.a(p[0][6]), .b(p[1][4]), .cin(p[2][2]), .s(w[6][0]), .cout(c[6][0]));
fulladder fa06_01 (.a(w[6][0]), .b(p[3][0]), .cin(b[7]), .s(w[6][1]), .cout(c[6][1]));
halfadder fa06_02 (.a(w[6][1]), .b(c[5][0]), .s(w[6][2]), .cout(c[6][2]));
// Weight 2^7
fulladder fa07_00 (.a(p[0][7]), .b(p[1][5]), .cin(p[2][3]), .s(w[7][0]), .cout(c[7][0]));
fulladder fa07_01 (.a(w[7][0]), .b(p[3][1]), .cin(c[6][0]), .s(w[7][1]), .cout(c[7][1]));
halfadder ha07_02 (.a(w[7][1]), .b(c[6][1]), .s(w[7][2]), .cout(c[7][2]));

// Weight 2^8
fulladder fa08_00 (.a(p[0][8]), .b(p[1][6]), .cin(p[2][4]), .s(w[8][0]), .cout(c[8][0]));
fulladder fa08_01 (.a(w[8][0]), .b(p[3][2]), .cin(c[7][0]), .s(w[8][1]), .cout(c[8][1]));
halfadder ha08_02 (.a(w[8][1]), .b(c[7][1]), .s(w[8][2]), .cout(c[8][2]));

// Weight 2^9
fulladder fa09_00 (.a(p[0][8]), .b(p[1][7]), .cin(p[2][5]), .s(w[9][0]), .cout(c[9][0]));
fulladder fa09_01 (.a(w[9][0]), .b(p[3][3]), .cin(c[8][0]), .s(w[9][1]), .cout(c[9][1]));
halfadder fa09_02 (.a(w[9][1]), .b(c[8][1]), .s(w[9][2]), .cout(c[9][2]));

// Weight 2^10
fulladder fa10_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][6]), .s(w[10][0]), .cout(c[10][0]));
fulladder fa10_01 (.a(w[10][0]), .b(p[3][4]), .cin(c[9][0]), .s(w[10][1]), .cout(c[10][1]));
halfadder ha10_02 (.a(w[10][1]), .b(c[9][1]), .s(w[10][2]), .cout(c[10][2]));

// Weight 2^11
fulladder fa11_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][7]), .s(w[11][0]), .cout(c[11][0]));
fulladder fa11_01 (.a(w[11][0]), .b(p[3][5]), .cin(c[10][0]), .s(w[11][1]), .cout(c[11][1]));
halfadder ha11_02 (.a(w[11][1]), .b(c[10][1]), .s(w[11][2]), .cout(c[11][2]));

// Weight 2^12
fulladder fa12_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][8]), .s(w[12][0]), .cout(c[12][0]));
fulladder fa12_01 (.a(w[12][0]), .b(p[3][6]), .cin(c[11][0]), .s(w[12][1]), .cout(c[12][1]));
halfadder ha12_02 (.a(w[12][1]), .b(c[11][1]), .s(w[12][2]), .cout(c[12][2]));

// Weight 2^13
fulladder fa13_00 (.a(w[12][0]), .b(p[3][7]), .cin(c[12][0]), .s(w[13][0]), .cout(c[13][0]));
halfadder ha13_01 (.a(w[13][0]), .b(c[12][1]), .s(w[13][1]), .cout(c[13][1]));

// Weight 2^14
fulladder fa14_00 (.a(w[12][0]), .b(p[3][8]), .cin(c[12][0]), .s(w[14][0]), .cout(c[14][0]));
halfadder ha14_01 (.a(w[14][0]), .b(c[13][0]), .s(w[14][1]), .cout(c[14][1]));

// Weight 2^15
assign w[15][0] = w[14][0] ^ c[14][0];

// Use two chained CLA adders to perform the final addition
carrylookaheadadd8 cla0 (.a({w[7][2], w[6][2], w[5][1], w[4][1], w[3][0], w[2][0], p[0][1], p[0][0]}),
                         .b({c[6][2], c[5][1], c[4][1], c[3][0], c[2][0], b[3], 1'b0, b[1]}),
                         .cin(1'b0), .s(result[7:0]), .cout(cla0_cout));
carrylookaheadadd8 cla1 (.a({w[15][0], w[14][1], w[13][1], w[12][2], w[11][2], w[10][2], w[9][2], w[8][2]}),
                         .b({c[14][1], c[13][1], c[12][2], c[11][2], c[10][2], c[9][2], c[8][2], c[7][2]}),
                         .cin(cla0_cout), .s(result[15:8]));
endmodule
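A usage note added for clarity: the Wallace tree above, and the Dadda tree that follows, operate on signed operands, so any behavioural reference used to check them in simulation should be the signed product $signed(a) * $signed(b) rather than the unsigned product used for the array multiplier earlier; apart from that change, the same style of comparison wrapper applies unchanged.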
A.2.4. Dadda Tree Multiplier

// 8-bit signed Dadda Tree multiplier
module daddatreemultiplier8 (input wire [7:0] a,
                             input wire [7:0] b,
                             output wire [15:0] result);

wire [8:0] p [3:0];
wire [2:0] c [15:2];
wire [2:0] w [15:2];
wire cla0_cout;

// Use the modified Booth encoder to generate the partial products
boothencoder8 pp (.a(a), .b(b), .p00(p[0]), .p01(p[1]), .p02(p[2]), .p03(p[3]));

// Sum the partial products by wire weight until only two wires are left for each weight
// Weight 2^2
halfadder ha02_02 (.a(p[0][2]), .b(p[1][0]), .s(w[2][0]), .cout(c[2][0]));

// Weight 2^3
halfadder ha03_02 (.a(p[0][3]), .b(p[1][1]), .s(w[3][0]), .cout(c[3][0]));

// Weight 2^4
halfadder ha04_01 (.a(p[0][4]), .b(p[1][2]), .s(w[4][0]), .cout(c[4][0]));
fulladder fa04_02 (.a(p[2][0]), .b(b[5]), .cin(c[3][0]), .s(w[4][1]), .cout(c[4][1]));

// Weight 2^5
halfadder ha05_01 (.a(p[0][5]), .b(p[1][3]), .s(w[5][0]), .cout(c[5][0]));
fulladder fa05_02 (.a(p[2][1]), .b(c[4][0]), .cin(c[4][1]), .s(w[5][1]), .cout(c[5][1]));

// Weight 2^6
halfadder ha06_00 (.a(p[0][6]), .b(p[1][4]), .s(w[6][0]), .cout(c[6][0]));
fulladder fa06_01 (.a(p[2][2]), .b(p[3][0]), .cin(b[7]), .s(w[6][1]), .cout(c[6][1]));
fulladder fa06_02 (.a(w[6][0]), .b(c[5][0]), .cin(c[5][1]), .s(w[6][2]), .cout(c[6][2]));

// Weight 2^7
fulladder fa07_00 (.a(p[0][7]), .b(p[1][5]), .cin(p[2][3]), .s(w[7][0]), .cout(c[7][0]));
halfadder ha07_01 (.a(p[3][1]), .b(c[6][0]), .s(w[7][1]), .cout(c[7][1]));
fulladder fa07_02 (.a(w[7][0]), .b(c[6][1]), .cin(c[6][2]), .s(w[7][2]), .cout(c[7][2]));

// Weight 2^8
halfadder ha08_00 (.a(p[0][8]), .b(p[1][6]), .s(w[8][0]), .cout(c[8][0]));
fulladder fa08_01 (.a(p[2][4]), .b(p[3][2]), .cin(c[7][0]), .s(w[8][1]), .cout(c[8][1]));
fulladder fa08_02 (.a(w[8][0]), .b(c[7][1]), .cin(c[7][2]), .s(w[8][2]), .cout(c[8][2]));