University of Manchester
School of Computer Science
Project Report 2014
Design and implementation of arithmetic units with Xilinx FPGA
Author: Prakhar Bahuguna
Supervisor: Dr. Vasilis Pavlidis
The aim of the project is to design, implement, test and profile various arithmetic units, such as adders and multipliers, on an FPGA platform. These are algorithms for performing arithmetic efficiently in hardware and are widely used in many applications. This project strives to detail the various types of such arithmetic units, providing example designs and implementations for each and critically evaluating the merits and issues of each design. The designs will then be implemented on a Xilinx Virtex-7 FPGA development board, upon which real-world performance, logic area and power consumption can be measured and analysed.
Contents

1. Introduction
2. Background
   2.1. Adders
      2.1.1. Half Adder and Full Adder
      2.1.2. Ripple-carry
      2.1.3. Carry-lookahead
      2.1.4. Carry-select
      2.1.5. Carry-skip
   2.2. Multipliers
      2.2.1. Array Multiplier
      2.2.2. Modified Booth Encoder
      2.2.3. Wallace Tree
      2.2.4. Dadda Tree
3. Design
   3.1. Requirements
   3.2. Implementation Details
4. Simulation
   4.1. Testing Methodology For 8-bit Units
   4.2. Testing Methodology For Larger Units
5. Implementation
   5.1. The Synthesis Process
   5.2. Synthesis Reports and Statistics
      5.2.1. Adders
      5.2.2. Multipliers
6. Conclusion
A. Unit Source Code
   A.1. Adders
      A.1.1. Half Adder and Full Adder
      A.1.2. Ripple-carry Adder
      A.1.3. Carry-lookahead Adder
      A.1.4. Carry-select Adder
      A.1.5. Carry-skip Adder
   A.2. Multipliers
      A.2.1. Array Multiplier
      A.2.2. Modified Booth Encoder
      A.2.3. Wallace Tree Multiplier
      A.2.4. Dadda Tree Multiplier
B. Testbench Source Code
   B.1. 8-bit Testbenches
      B.1.1. Adders
      B.1.2. Unsigned Multipliers
      B.1.3. Signed Multipliers
   B.2. Testdata Generation Script
   B.3. 32-bit Testbenches
      B.3.1. Adders
      B.3.2. Unsigned Multipliers
      B.3.3. Signed Multipliers
   B.4. 64-bit Testbenches
      B.4.1. Adders
      B.4.2. Unsigned Multipliers
      B.4.3. Signed Multipliers
List of Figures

2.1. Half Adder
2.2. Full Adder
2.3. Ripple-Carry Adder
2.4. Propagate-Generate Full Adder
2.5. Carry-lookahead Adder
2.6. Carry-select Adder
2.7. Carry-skip Adder
2.8. Modified Full Adder
2.9. Array Multiplier
2.10. Wallace Tree Multiplier
2.11. Dadda Tree Multiplier
4.1. Simulation Process
5.1. Area Usage For 8-bit Adders
5.2. Area Usage For 32-bit Adders
5.3. Area Usage For 64-bit Adders
5.4. Delay For 8-bit Adders
5.5. Delay For 32-bit Adders
5.6. Delay For 64-bit Adders
5.7. Area Usage For 8-bit Multipliers
5.8. Delay For 8-bit Multipliers
5.9. Area Usage For Array Multipliers
5.10. Delay For Array Multipliers
List of Tables

2.1. Basic Addition
2.2. Long Multiplication
2.3. MBE Truth Table
2.4. MBE Partial Products
1. Introduction
Arithmetic units are hardware circuits designed to perform some type of arithmetic operation on binary numbers. They are typically integrated into some form of processor, which relies on its arithmetic units for much of its operation. This project aims to design, simulate, implement and evaluate various types of arithmetic units using an electronic design workflow in conjunction with an FPGA development platform. Each unit will be benchmarked for characteristics such as area usage, delay and power consumption, all of which are important considerations in hardware design.
This project in particular will focus on adders and multipliers. Addition and multiplication
are very common operations that are utilised heavily by even the most basic of processors,
both for internal operations and for processing data inputs. Both arithmetic operations have
a variety of designs that can efficiently perform arithmetic in hardware, and each design has
its own trade-offs with regard to the key characteristics. The project strives to detail these
various types of arithmetic units, with example designs and implementations for each. The
merits and issues of each design will then be critically evaluated and compared with each
other.
The next chapter gives a complete overview of the arithmetic units that will be implemented in the course of this project. This includes the approach taken by each design and the logic behind its functioning, its gate/block-level schematic and an estimate of the number of gates required, as well as the critical path delay. The design chapter then covers the details of designing the units and the considerations that need to be taken into account for their development. The units also need to be simulated and verified to ensure correctness, preferably in an automated manner that can provide guarantees of correct operation; this important stage is addressed by the simulation chapter. The designs are then implemented in hardware, targeting a Xilinx Virtex-7 FPGA development board. Finally, the implementation chapter covers the synthesis process and evaluates the characteristics of the implemented hardware units, comparing and contrasting the different varieties of adders and multipliers with each other.
2. Background
Arithmetic units as used in computers are digital circuits that perform elementary arithmetic
operations on numbers. They are based on binary arithmetic, taking operands as input and
giving a result as output, both as binary numbers. Typically, these operations are simple
mathematical arithmetic such as addition, subtraction, multiplication and division, though
more complex units may implement more complicated mathematical operations.
Arithmetic units can be classified into two main groups: integer units and floating-point units. Integer units operate exclusively with integers and typically implement operations such as addition and multiplication, where the result for two integer operands is always an integer. As fractional values do not need to be considered, integer units are typically smaller and faster, and require less power. Floating-point units operate with numbers that have a floating radix point. Like integer units, they can perform additions, subtractions and multiplications, but with floating-point operands. They are also able to perform operations such as division, exponentiation and square root calculation that often give a floating-point result even for integer operands. As the logic to handle floating-point numbers is more complex, floating-point units are usually larger and more power-hungry.
Typically, arithmetic units are implemented in hardware within Application-Specific Integrated Circuits (ASICs). They are usually grouped together to form the complete Arithmetic Logic Unit (ALU) of a microprocessor, enabling the processor to perform useful work.[1] ALUs are also found in other types of processors: Graphics Processing Units (GPUs) use a large number of ALUs to perform complex graphics calculations in parallel,[2] and Digital Signal Processors (DSPs) are largely built around their ALUs to process a continuous data stream in a pipelined fashion.[3] In this project, the arithmetic units will be implemented on an FPGA, as an ASIC implementation is simply too costly and time-consuming to consider for this purpose and is irrelevant to the learning objectives of this project. However, the same principles of hardware design still apply, and the underlying difference in implementation can be abstracted away for the purposes of this project.

The rest of this chapter provides a complete overview of the arithmetic units that will be implemented in this project. A variety of adders, followed by multipliers, will be introduced along with their design details. In addition, each will have an estimate of delay and area.

[1] Terms, ALU (Arithmetic Logic Unit) Definition.
[2] Nvidia, What is GPU Accelerated Computing?
[3] Yovits, Advances in Computers.
2.1. Adders
Addition is one of the most basic mathematical operations needed in modern computer systems. It is performed bitwise on two binary operands in a similar fashion to traditional base-10 addition. The least significant bits are added together first; in binary, this can be performed by XORing the bits together. If the result is greater than 1, a carry is passed to the next most significant bit and incorporated into that addition. This process is repeated column by column up to the most significant bit, as shown in Table 2.1.
Carry:    1 1
        1 0 0 1
      + 0 0 1 1
        1 1 0 0

Table 2.1.: The basic addition process.
The method just described is the most straightforward and natural way for a human to
add two binary numbers together. However, in computer hardware, there are multiple approaches to solve the problem of adding two numbers together efficiently, and each presents
its own set of advantages and disadvantages. The primary concern in digital design is the
critical path. This is the path in the circuit which has the longest delay between the input
being fed to the unit and the correct output being obtained from it. As the delay in the critical
path defines the maximum speed at which the adder can operate, minimising this delay is
crucial to improve performance. Other concerns in adder design are the power consumption
of the circuit and the area needed to implement the circuit, which depends directly on the
number of logic gates that are used. These are usually at odds with minimising the critical
delay as more sophisticated logic demands higher power consumption and more logic gates.
Given this situation, there are various designs that trade off speed against area and power in different ways, depending on the requirements of the hardware being developed. The pros and cons of each design are analysed and evaluated in the following subsections.
2.1.1. Half Adder and Full Adder
The most basic building blocks of any adder are the 1-bit half adder and full adder. A half adder simply takes two operands A and B. It calculates the sum S by XORing A and B together, denoted as S = A ⊕ B. A carry output cout can also be evaluated by ANDing the two operands together as cout = A · B.[4] This is shown in Figure 2.1. Thus, the half adder is sufficient for calculating the sum of two 1-bit operands.
Figure 2.1.: A half adder with two operands A and B.
However, it is not typical to be adding only 1-bit numbers together. Often, several bits need to be added, with the carry of the previous column needed to correctly compute the result of the current column. The full adder is a complete 1-bit adder, including a carry-in that allows it to be chained to previous bits to compute an n-bit sum.[5] It calculates S = A ⊕ B ⊕ cin and cout = (A · B) + (cin · (A ⊕ B)), and an implementation can be produced by chaining two half adders together, as demonstrated by Figure 2.2.
Figure 2.2.: A full adder with two operands and a carry-in, generated from two half adders.

[4] Vahid, Digital Design.
[5] Ibid.
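To make these equations concrete, a minimal structural sketch in Verilog is shown below. The module and port names are illustrative only and are not taken from the listings in Appendix A.1.1.

// Half adder: S = A xor B, cout = A and B
module half_adder (
    input  wire a,
    input  wire b,
    output wire s,
    output wire cout
);
    xor (s, a, b);
    and (cout, a, b);
endmodule

// Full adder built from two half adders and an OR gate:
// S = A xor B xor cin, cout = (A and B) or (cin and (A xor B))
module full_adder (
    input  wire a,
    input  wire b,
    input  wire cin,
    output wire s,
    output wire cout
);
    wire s1, c1, c2;
    half_adder ha0 (.a(a),  .b(b),   .s(s1), .cout(c1));
    half_adder ha1 (.a(s1), .b(cin), .s(s),  .cout(c2));
    or (cout, c1, c2);
endmodule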
2.1.2. Ripple-carry
The ripple-carry adder (RCA) is the simplest possible type of n-bit adder. The RCA utilises a chain of full adders connected in series, with the carry-out of each full adder feeding into the carry-in of the next, as illustrated by Figure 2.3. It is named as such because the carry from the rightmost column ‘ripples’ through to the leftmost column in a sequential fashion.[6] As this adder uses n full adders with five logic gates each, it only requires 5n logic gates.
Figure 2.3.: A ripple-carry adder resulting from chaining multiple full adders.
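A sketch of this chaining in Verilog is shown below, assuming a full_adder module such as the one sketched in Section 2.1.1; the module name, parameter and generate loop are illustrative rather than the code listed in Appendix A.1.2.

module ripple_carry_adder #(
    parameter N = 8
) (
    input  wire [N-1:0] a,
    input  wire [N-1:0] b,
    input  wire         cin,
    output wire [N-1:0] s,
    output wire         cout
);
    // c[i] is the carry into bit i; c[N] is the final carry-out
    wire [N:0] c;
    assign c[0] = cin;
    assign cout = c[N];

    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : fa_chain
            full_adder fa (.a(a[i]), .b(b[i]), .cin(c[i]), .s(s[i]), .cout(c[i+1]));
        end
    endgenerate
endmodule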
The main issue with the ripple-carry adder is that the nature of the design means that the
final result is not known until the carry has propagated all the way to the leftmost column.
This gives a long critical path between A0/B0 and cout, which makes the adder slow to calculate the result. The delay along this path is three gate delays per full adder, for a total of 3n gate delays (assuming that every gate along the path has a similar delay). Clearly, an alternative design for adding two operands needs to be developed.
2.1.3. Carry-lookahead
The carry-lookahead adder (CLA) attempts to avoid the slow carry ripple of the ripple-carry adder by predicting ahead of time what the carry from the previous column will be. It does this by replacing the carry-out signal of each full adder with two signals: P (propagate) and G (generate). These signals indicate whether each full adder will propagate a carry-in of 1 to its carry-out, or will generate a carry itself.

A full adder will propagate a carry-in of 1 to its carry-out whenever A or B (or both) is 1, so we can set P = A + B. A full adder will generate a carry if both A and B are 1, regardless of the value of cin, as A + B will be greater than one. The generate signal can be set to G = A · B.[7] The full adder can be modified using these results to create a propagate-generate full adder, as shown in Figure 2.4.
The propagate and generate signals from prior columns can now be used to look ahead and evaluate the expected carry-in for each full adder. Suppose the modified full adders are assembled with some lookahead logic that determines cin for each adder, as in Figure 2.5.

[6] Vahid, Digital Design.
[7] Tohoku University, Hardware Algorithms For Arithmetic Modules.
Figure 2.4.: A full adder with propagate and generate signals instead of a carry-out.
Figure 2.5.: Block diagram of a carry-lookahead adder. The carry-in for each full adder is evaluated from the lookahead logic rather than the previous full adder.
We know that c1 will be 1 if G0 is 1, as the first column will generate a carry-out regardless of the value of c0. If G0 is 0, the only other way that c1 can be 1 is if the first adder propagates a carry-in; this propagation occurs when P0 is 1, so c1 will be 1 if both P0 and c0 are 1. This logic can therefore be formulated as c1 = G0 + P0 · c0 and is easily implemented with one OR gate and one AND gate.
This carry-lookahead logic can now be generalised to all full adders as ci+1 = Gi + Pi · ci.[8] Recursive substitution and expansion of ci then gives an expression for every carry-in ci+1, as described in Equation 2.1:

c1 = G0 + P0 · c0
c2 = G1 + P1 · c1 = G1 + P1 · (G0 + P0 · c0) = G1 + P1 · G0 + P1 · P0 · c0
c3 = G2 + P2 · c2 = G2 + P2 · (G1 + P1 · G0 + P1 · P0 · c0)
   = G2 + P2 · G1 + P2 · P1 · G0 + P2 · P1 · P0 · c0
...
cn+1 = Gn + Pn · Gn−1 + Pn · Pn−1 · Gn−2 + . . . + Pn · Pn−1 · · · P0 · c0.   (2.1)
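As an illustration, a hypothetical 4-bit Verilog description of this lookahead logic, written directly from Equation 2.1, might look as follows; it is a sketch rather than the implementation listed in Appendix A.1.3.

module cla_4bit (
    input  wire [3:0] a,
    input  wire [3:0] b,
    input  wire       cin,
    output wire [3:0] s,
    output wire       cout
);
    wire [3:0] p = a | b;   // propagate: P = A + B
    wire [3:0] g = a & b;   // generate:  G = A . B
    wire [4:0] c;

    // Carry-lookahead equations, expanded as in Equation 2.1
    assign c[0] = cin;
    assign c[1] = g[0] | (p[0] & c[0]);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                       | (p[2] & p[1] & p[0] & c[0]);
    assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                       | (p[3] & p[2] & p[1] & g[0])
                       | (p[3] & p[2] & p[1] & p[0] & c[0]);

    // Each sum bit uses the locally predicted carry: S = A xor B xor c
    assign s    = a ^ b ^ c[3:0];
    assign cout = c[4];
endmodule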
The critical path has now been shortened significantly as it now runs between Ai/Bi and
Si. After one gate delay, Pi and Gi are evaluated. The parallel nature of the lookahead logic
requires only two gate delays. Finally, the addition requires two gate delays, so the overall
critical delay is just five gate delays, regardless of the number of bits in the adder.
The main downside is that the carry-lookahead adder needs significantly more logic gates for the lookahead logic. While c1 only needs two gates to compute, c2 requires three gates, c3 requires four gates and so on, with each ci requiring one more gate than the last. The lookahead logic in total therefore needs n(n − 1)/2 gates. When combined with the full adders, a complete n-bit carry-lookahead adder requires n(n + 7)/2 gates. Clearly, an adder design with a quadratic gate count will not scale well to larger sizes.
2.1.4. Carry-select
A carry-select adder (CSLA) attempts to provide a compromise between the simplicity of a
ripple-carry adder and the speed of a carry-lookahead adder. It uses a chain of ripple-carry
adders of a certain width (usually 4-bit), but for each subsequent block after the first, two
adders are placed which simultaneously calculate the sum of the 4-bit operands. One adder
assumes a carry-in of 0, the other assumes a carry-in of 1, allowing for the two possible results
to be precomputed. The correct result is then selected by a 2:1 mux using the carry-out of the previous block.[9] The complete layout is given by Figure 2.6.
Figure 2.6.: Diagram of the first section of a carry-select adder. Each block after the initial block has two ripple-carry adders to compute the two possible results simultaneously.

The critical path still runs from A0/B0 to cout, but the delay is much shorter as each block is computed simultaneously. Assuming that each mux has a gate delay of 3, the overall gate delay of a carry-select adder with 4-bit ripple-carry blocks is 12 + 3(n/4 − 1) = 3n/4 + 9. The increase in delay is still linear, but it is significantly smaller than that of a ripple-carry adder. More logic gates are required than for a ripple-carry adder, but the design does not suffer from the quadratic increase that the carry-lookahead adder has, making the carry-select adder far more scalable.

[8] Vladutiu, Computer Arithmetic.
[9] Ibid.
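As a sketch of this structure, one carry-select block could be described in Verilog as below, reusing the hypothetical ripple_carry_adder module from Section 2.1.2; the names are assumptions made for illustration.

// One 4-bit carry-select block: both possible sums are precomputed and
// the carry-out of the previous block selects the correct pair.
module carry_select_block (
    input  wire [3:0] a,
    input  wire [3:0] b,
    input  wire       cin,   // carry-out of the previous block
    output wire [3:0] s,
    output wire       cout
);
    wire [3:0] s0, s1;
    wire       c0, c1;

    ripple_carry_adder #(.N(4)) rca0 (.a(a), .b(b), .cin(1'b0), .s(s0), .cout(c0));
    ripple_carry_adder #(.N(4)) rca1 (.a(a), .b(b), .cin(1'b1), .s(s1), .cout(c1));

    // 2:1 multiplexers select the precomputed result once cin is known
    assign s    = cin ? s1 : s0;
    assign cout = cin ? c1 : c0;
endmodule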
2.1.5. Carry-skip
The carry-skip adder (CSKA) is a variation on the carry-select adder. It avoids waiting for the carry to ripple through a block when it can conclusively determine that the block will simply propagate its carry-in onwards. In a similar fashion to carry-lookahead, this can be determined by evaluating P for the entire block, which is true if Pi is true for every bit i in the block.[10] The overall expression for P for a 4-bit block is therefore given in Equation 2.2:

P = P0 · P1 · P2 · P3 = (A0 + B0) · (A1 + B1) · (A2 + B2) · (A3 + B3).   (2.2)

[10] Vladutiu, Computer Arithmetic.
The carry-in cini for block i can now be evaluated faster. Assume that both the carry-out couti−2 of block i − 2 and Pi−1 from block i − 1 have been evaluated. If Pi−1 is true, we can simply pass the value of couti−2 on to cini, as block i − 1 will simply propagate it. Otherwise, we have to wait for couti−1 to be evaluated, as its value may differ from couti−2. This can be summed up by Equation 2.3:

cini = couti−1 + Pi−1 · couti−2,   (2.3)

and the entire schematic is depicted in Figure 2.7.
Figure 2.7.: Block diagram of a carry-skip adder. If the current block propagates its carry-in, the carry-in is used directly for the next block.
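One block of this scheme can be sketched in Verilog as follows, again assuming the hypothetical ripple_carry_adder module from earlier; it illustrates Equations 2.2 and 2.3 rather than reproducing the implementation in the appendix.

// One 4-bit carry-skip block: the incoming carry skips ahead to the next
// block whenever the whole block propagates (Equations 2.2 and 2.3).
module carry_skip_block (
    input  wire [3:0] a,
    input  wire [3:0] b,
    input  wire       cin,    // carry arriving at this block
    output wire [3:0] s,
    output wire       cout    // carry passed on to the next block
);
    wire rca_cout;
    wire block_p = &(a | b);  // P = (A0+B0).(A1+B1).(A2+B2).(A3+B3)

    ripple_carry_adder #(.N(4)) rca (.a(a), .b(b), .cin(cin), .s(s), .cout(rca_cout));

    // If the block propagates, forward cin directly; otherwise use the rippled carry
    assign cout = rca_cout | (block_p & cin);
endmodule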
2.2. Multipliers
Multiplication is another common operation found in the ALUs of most modern processors. As with all the logic circuits considered here, an arithmetic multiplier operates on two binary operands, referred to as the multiplicand and the multiplier. A set of partial products is computed, one from each bit of the multiplier, much in the same way that a human performs long multiplication by hand, but in binary. Each partial product is zero if its corresponding multiplier bit is zero, and equal to the multiplicand shifted left by the appropriate number of positions if the multiplier bit is one. These partial products are then summed with multiple adders to compute the final product. This long multiplication method is illustrated in Table 2.2.
          1 0 0 1
        × 0 0 1 1
          1 0 0 1
        1 0 0 1
      0 0 0 0
  + 0 0 0 0
  0 0 0 1 1 0 1 1

Table 2.2.: The long multiplication method for binary operands.
Generating the partial products for a multiplication a × b is extremely easy: each partial product pi is evaluated as pi = a · bi, shifted left by i, for each bit i of the multiplier b. The difficulty lies in summing, or reducing, the partial products to compute the final product in an efficient manner.

As with adders, there is a large variety of designs which can compute this partial-product sum. Each has its own trade-offs between delay and area/power consumption, depending on the requirements of the hardware being developed. As there is a large number of possible multiplier designs, three common designs will be analysed and evaluated in this report, namely the array multiplier, the Wallace tree multiplier and the Dadda tree multiplier.
2.2.1. Array Multiplier
Much like the ripple-carry adder, the array multiplier is the most straightforward implementation of an n-bit multiplier. It uses an array of modified full adders arranged in an n × n grid to evaluate the result, with the carries rippling diagonally through the grid and the sum-outs rippling down. The i-th column of the grid corresponds to the i-th bit of the final product and the j-th row corresponds to the j-th partial product, which is generated by the j-th bit of the multiplier. A final row of regular full adders is then used to sum the remaining carry-outs to compute the upper bits of the final product.[11] A schematic of this modified adder is given in Figure 2.8 and the complete array multiplier in Figure 2.9.
Figure 2.8.: An array multiplier full adder module, modified to compute the product of two bits and sum this with the previous partial product.

Figure 2.9.: An array multiplier, showing the grid structure of the full adder modules to compute the partial products and sum them.

As with the RCA, the longest critical path of the array multiplier is easy to visualise. It runs from the least significant bits of the operands in the top-right, diagonally through the carry-outs to the most significant bit of the final product in the bottom-left. Hence, it traverses n modified full adders (which have a gate delay of four), one half adder (gate delay of one) and n − 1 full adders (gate delay of three). The delay of the array multiplier is thus 4n + 1 + 3n = 7n + 1. This is significantly faster than performing repeated addition to compute the multiplication, which will necessarily have gate delays larger than 7n + 1.

[11] Vahid, Digital Design.
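For illustration, the modified cell of Figure 2.8 can be sketched in Verilog as a full adder fed by an AND gate, assuming the hypothetical full_adder module from Section 2.1.1; this is not the listing given in the appendix.

// One cell of the array multiplier grid: computes the product of one
// multiplicand bit and one multiplier bit, and adds it to the incoming
// partial sum and carry.
module array_mult_cell (
    input  wire a,      // multiplicand bit
    input  wire b,      // multiplier bit
    input  wire sin,    // partial sum from the cell above
    input  wire cin,    // carry from the neighbouring cell
    output wire sout,   // partial sum passed to the cell below
    output wire cout    // carry passed diagonally onwards
);
    wire pp;
    and (pp, a, b);     // partial product bit
    full_adder fa (.a(pp), .b(sin), .cin(cin), .s(sout), .cout(cout));
endmodule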
However, the most apparent problem, as visible in the schematic for the array multiplier, is the large amount of logic required to perform the multiplication. An n-bit array multiplier requires n × n modified adders, so the logic required scales with n^2. An 8-bit multiplier requires 64 full adders while a 32-bit multiplier needs 1024. Clearly, the array multiplier does not scale efficiently for practical applications where 32-bit or 64-bit multipliers would be needed, as the power and area requirements are too large.

Another problem with the array multiplier is its inability to deal with signed integers. With addition, the addition process inherently gives the correct answer whether the operands are interpreted as unsigned or as two's-complement signed values; the array multiplier, by contrast, only produces the correct product for unsigned operands. These problems are addressed by the Wallace tree and Dadda tree multipliers, in conjunction with a Modified Booth Encoder. The latter will be discussed first as it forms a logic sub-block of both tree-based multipliers.
2.2.2. Modified Booth Encoder
The Modified Booth Encoder (MBE) serves two important purposes for more sophisticated
multiplier designs. Firstly, it allows the multiplier to correctly handle signed integers as
part of the partial product reduction process without any additional sign-extension logic.
Secondly, it reduces the number of partial products that need to be computed by half. This
is accomplished by re-encoding the partial products according to the patterns of repeated
1s and 0s in the bits of the multiplier. For instance, a multiplication involving 4-bit integers
such as a × 0011 would normally give the partial product sum of a + 2a + 0 + 0. This can
be re-written as −a + 4a. Similarly, a × 0111 would normally require the calculation of
a + 2a + 4a + 0. This can be formulated as −a + 8a. Hence, the number of partial products
has been reduced from four to two.
This encoding is accomplished by first padding the bits of the multiplier with a zero to the
least significant bit (LSB). If the multiplier has an odd number of bits, two additional zeros
are padded to the most significant bit, otherwise no additional padding is necessary. The bits
of the padded multiplier are then grouped into overlapping groups of three, as illustrated in Equation 2.4.[12]

87 = 01010111 ⇒ 010101110 (with padding)
   ⇒ 010 | 010 | 011 | 110   (2.4)
     (Bit Group 4, Bit Group 3, Bit Group 2, Bit Group 1)
Each of these bit groups corresponds to a partial product that will be generated by the MBE. The value of each partial product is determined by the truth table in Table 2.3. In this table, ∼a means invert all the bits of a, and a ≪ 1 means shift a left by one.

Bit Group   Operation   Partial Product
000         0 × a       0 . . . 0
001         1 × a       a
010         1 × a       a
011         2 × a       a ≪ 1
100         −2 × a      (∼a ≪ 1) + 2
101         −1 × a      ∼a + 1
110         −1 × a      ∼a + 1
111         0 × a       1 . . . 1 + 1

Table 2.3.: Truth table for the MBE partial products.

Given two 8-bit operands a and b, the partial products from an MBE can then be summed by the addition logic of the multiplier, as shown in Table 2.4.[13] The outcome of utilising the MBE is that only four partial products need to be summed instead of eight, saving on the logic required for the multiplier.

[12] Saharan and Kaur, ‘Design and Implementation of an Efficient Modified Booth Multiplier using VHDL’.
a7 a6 a5 a4 a3 a2 a1 a0
× b7 b6 b5 b4 b3 b2 b1 b0
p07 p06 p05 p04 p03 p02 p01 p00
p17 p16 p15 p14 p13 p12 p11 p10 b1
p27 p26 p25 p24 p23 p22 p21 p20 b3
p37 p36 p35 p34 p33 p32 p31 p30 b5
+ b7
Table 2.4.: The partial products and addition tree generated by an MBE.
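A behavioural sketch of the recoding in Table 2.3 is shown below. It only derives the control signals for one bit group, leaving the selection, shifting and negation of the multiplicand to the surrounding partial product logic; the module name and signal encoding are assumptions made for illustration.

// Radix-4 Booth recoder for one bit group (Table 2.3): maps three
// overlapping multiplier bits onto control signals that select
// 0, +/-a or +/-2a as the partial product.
module mbe_recoder (
    input  wire [2:0] group,  // {b(2i+1), b(2i), b(2i-1)}
    output reg        zero,   // partial product is zero
    output reg        two,    // use 2a (a shifted left by one)
    output reg        neg     // negate the selected value
);
    always @(*) begin
        case (group)
            3'b000: begin zero = 1'b1; two = 1'b0; neg = 1'b0; end // 0
            3'b001: begin zero = 1'b0; two = 1'b0; neg = 1'b0; end // +a
            3'b010: begin zero = 1'b0; two = 1'b0; neg = 1'b0; end // +a
            3'b011: begin zero = 1'b0; two = 1'b1; neg = 1'b0; end // +2a
            3'b100: begin zero = 1'b0; two = 1'b1; neg = 1'b1; end // -2a
            3'b101: begin zero = 1'b0; two = 1'b0; neg = 1'b1; end // -a
            3'b110: begin zero = 1'b0; two = 1'b0; neg = 1'b1; end // -a
            3'b111: begin zero = 1'b1; two = 1'b0; neg = 1'b1; end // -0
        endcase
    end
endmodule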
2.2.3. Wallace Tree
The Wallace tree multiplier takes the partial products of a multiplication and groups the constituent bits according to their weight. The weight of a particular bit depends on its position: for instance, the least significant bit has weight 2^0 = 1, while bit 3 has weight 2^3 = 8. These bits are then reduced by layers of half adders and full adders in a tree structure to compute the final product from the partial products.[14]

[13] Punnaiah et al., ‘Design and Evaluation of High Performance Multiplier Using Modified Booth Encoding Algorithm’.
[14] Vladutiu, Computer Arithmetic.
The Wallace structure operates with multiple layers that reduce the number of bits with the same weight at each stage. The operation performed depends on the number of bits of a given weight in the layer:
• One: Pass the bit down to the next layer.
• Two: Add the bits together with a half adder, passing the sum to the same weight in
the next layer and the carry-out to the next weight in the next layer.
• Three or more: Add any three bits together with a full adder. Pass the sum and any remaining bits to the same weight in the next layer. Pass the carry-out to the next weight in the next layer.
Layers are added to the Wallace structure in this fashion until all weights have been reduced to one bit.[15] These remaining bits form the final product of the multiplication. Figure 2.10 shows the structure of a Wallace tree multiplier, where each layer of adders reduces the partial products until the final product has been computed.
Figure 2.10.: A Wallace tree multiplier, showing the tree structure of layers of half and full adders.[16]
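The building block of each reduction layer is the 3:2 (carry-save) compression performed by the full adders. The sketch below shows one such step applied across three equal-weight rows; the module is an illustrative assumption rather than the appendix code. The sum row keeps the same weight, while the carry row must be added one weight higher, exactly as the rules above describe.

// One carry-save (3:2) reduction step used inside a Wallace tree:
// three rows of equal-weight bits are compressed into a sum row
// (same weight) and a carry row (one weight higher).
module csa_3to2 #(
    parameter W = 16
) (
    input  wire [W-1:0] x,
    input  wire [W-1:0] y,
    input  wire [W-1:0] z,
    output wire [W-1:0] sum,    // stays at the same weight
    output wire [W-1:0] carry   // to be added shifted left by one
);
    assign sum   = x ^ y ^ z;
    assign carry = (x & y) | (x & z) | (y & z);
endmodule

Layers of such compressors are applied until the remaining rows can be summed directly, with each carry row consumed one weight higher in the next layer.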
The primary advantage of the Wallace tree is that it has a significantly smaller number of logic gates compared to an array multiplier. The tree structure requires log n reduction layers, with each layer containing at most 2n adders, so no more than 2n log n adders are required, as opposed to the n^2 adders required for an array multiplier. The use of an MBE halves the number of partial products, so that the number of adders required is further reduced to 2n log(n/2) in this instance. The MBE itself requires n/2 partial product logic blocks, with each logic block requiring approximately twelve gates to evaluate the partial product.

[15] Bohsali and Doan, Rectangular Styled Wallace Tree Multipliers.
The disadvantage of the Wallace tree is that, in contrast to an array multiplier, it has an irregular layout and wiring structure. This is because the higher weights have more bits and therefore require more adders than the lower weights. These extra adders also require additional internal wiring to connect them all up correctly. This irregular routing and layout is particularly problematic for FPGAs, which are based around a regular grid of lookup tables and logic blocks. Hence, a fully synthesised Wallace tree multiplier may require more logic blocks than would be expected.
2.2.4. Dadda Tree
A Dadda tree multiplier operates in a very similar manner to a Wallace tree. It receives a set of partial products as inputs, each consisting of bits of different weights, and sums these together using layers of adders to compute the final product. It differs from the Wallace tree in the structure of these adders, reducing the complexity of each reduction layer at the cost of using additional layers. This structure is illustrated by Figure 2.11. The reduction rules for the Dadda structure are as follows:[17]
• One: Pass the bit down to the next layer.
• Two: If all weights in the layer have two or fewer bits, then add the bits together with
a half adder, passing the sum to the same weight in the next layer and the carry-out to
the next weight in the next layer. Otherwise, pass the bits down to the next layer.
• Three or more: Add any three bits together with a full adder, ensuring that the total number of bits remains equal or close to a multiple of three. Pass the sum and any remaining bits to the same weight in the next layer. Pass the carry-out to the next weight in the next layer.
The Dadda tree still gives a similar n log n scaling in logic area to the Wallace tree multiplier due to its tree structure. The actual area used is slightly larger than that of the Wallace
Tree as each reduction layer is less aggressive at summing the partial products, resulting in
more layers needed to compute the sum. One advantage of this slight trade-off in area is
that the complexity of the wiring is reduced. This is useful for FPGA implementation as it is
likely to synthesise with better layout and routing than a Wallace Tree.
[17] EDA Cafe, Datapath Logic Cells.

Figure 2.11.: A Dadda tree multiplier, showing a tree structure that is larger but more regular than a Wallace tree.[18]
3. Design
To develop the arithmetic units discussed in the previous chapter, it is necessary to specify their logic. This is accomplished using Verilog, a hardware description language (HDL). Each arithmetic unit is written in this language to describe its inputs, its outputs and the internal logic that computes the outputs from the inputs. These Verilog source files can then be used by Electronic Design Automation (EDA) tools such as Cadence, Synopsys or Xilinx ISE to simulate and verify the logic, as well as to synthesise it into a bitstream suitable for configuring an FPGA to implement the logic. This chapter addresses the requirements and details of designing the arithmetic units.
3.1. Requirements
An arithmetic unit such as an adder or a multiplier is typically used as a block within a
larger unit, such as an ALU. These ALUs in turn are typically utilised by a processing unit,
most commonly the CPU of a computer system. Thus, an arithmetic unit must satisfy the
requirements of the encompassing unit.
The first requirement is the width of the operands that will be utilised. A processor will
typically use a common word width for its registers, memory addresses and data buses, hence
the arithmetic units it relies on will need to match. Early processors were 8-bit, but most modern processors such as ARM and Intel x86 now use word sizes of 32 bits and 64 bits. Hence,
the arithmetic units in this project each have 8-bit, 32-bit and 64-bit variants. However, due
to time constraints the Wallace tree and Dadda tree multipliers were only produced as 8-bit
variants.
The second requirement for arithmetic units is that the result should be computed within a specific deadline. Processors are based on clocked logic, with operations triggered on the rising edge of a clock cycle. For the processor to operate correctly, the result from the previous operation needs to be computed and latched within a discrete number of clock periods. This imposes timing requirements on the arithmetic units, and their delay plays a part in determining the maximum clock speed of the entire design. The worst-case delays for each arithmetic unit must therefore be analysed and evaluated as part of the development process.
Finally, a significant consideration is the area used by the designs, and by extension their
power consumption. In the context of an FPGA, area usage is determined by the number of
logic slices and look-up tables (LUTs) that are used by the synthesised design. It is important
to ensure that the FPGA has enough logic slices to load the bitstream for the entire design,
so the designer of the arithmetic units must ensure enough area is left for the rest of the
design. In addition, a larger design requires more power to operate as the additional gates
draw more power upon switching. The power draw of the unit influences a device’s current
requirements, thermal constraints and battery lifetime in the case of mobile devices. Hence,
it is important to analyse the power consumption of the arithmetic units.
3.2. Implementation Details
Since Verilog is a hardware description language, a source file simply describes the behavioural logic of the hardware and the state of its outputs given a set of inputs. The EDA toolchain is free to synthesise any gates and wires necessary to ensure the unit behaves according to the source file. This means that, in the case of an adder, it is perfectly valid to write s = a + b, and synthesising this will give a correctly functioning adder. However, since the actual implementation of this adder is completely determined by the synthesis tool, this approach is not useful for this project.

Instead, to design the specific arithmetic units discussed in the previous chapter, it is necessary to be explicit and specify the exact logic of each unit. The arithmetic units developed in this project are designed at the level of basic logic gates such as NOT, AND, OR and XOR, with the wiring between ports specified explicitly. This ensures that the EDA toolchain will not attempt to substitute its own optimised logic for the logic specified in the source file. This approach allows the differences between the types of units to be clearly distinguished for further analysis.
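The contrast can be illustrated with two hypothetical fragments: a behavioural description that leaves the structure entirely to the toolchain, and a structural description that fixes the gates and wiring explicitly. Neither fragment is taken from the project source code.

// Behavioural: the synthesis tool chooses whatever adder structure it likes.
module adder_behavioural (
    input  wire [7:0] a,
    input  wire [7:0] b,
    output wire [8:0] s
);
    assign s = a + b;
endmodule

// Structural: the gates and their wiring are fixed by the designer
// (a single explicit full-adder bit slice, S = A xor B xor cin,
// cout = (A and B) or (cin and (A xor B))).
module full_adder_structural (
    input  wire a,
    input  wire b,
    input  wire cin,
    output wire s,
    output wire cout
);
    wire axb, ab, cab;
    xor (axb, a, b);
    xor (s, axb, cin);
    and (ab, a, b);
    and (cab, cin, axb);
    or  (cout, ab, cab);
endmodule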
However, some optimisations that are difficult to avoid entirely occur during the translation and mapping stages of synthesising a unit. For example, when synthesising an XOR gate, the EDA toolchain has a number of possibilities for configuring an LUT to implement it. In addition, the toolchain can be configured to optimise for particular design goals such as minimising area or delay. The exact algorithms and optimisations for doing this are proprietary and depend on both the synthesis tool and the capabilities and properties of the FPGA hardware being used. Since it is not possible to directly observe what exactly occurs at the synthesis level, it is best to hold this source of variability constant to ensure consistent results. For this project, the synthesis tool used is XST, from the Xilinx ISE 14.5 package. This will be used to synthesise designs targeted at the Xilinx Virtex-7 VC707 evaluation kit. The ISE projects are configured with the default Balanced profile, which aims to strike a balance between compact area usage and short delays when synthesising the units.
The Verilog source files for the 8-bit units are provided in the appendix for reference (the
32-bit and 64-bit units have been omitted due to space constraints, but are straightforward
extensions of the basic logic). The next stage after designing and developing these units is
performing logic simulation and testing to ensure they perform as expected and give the
correct result for any set of inputs. This topic is covered in the next chapter.
4. Simulation
Logic simulation, also known as functional simulation, is a process by which software can
be used to determine the behaviour of a digital circuit. Simulation is performed through
the use of a stimulus testbench and the logic unit being tested, referred to as the device
under test (DUT). The testbench unit sends test data to the DUT’s inputs and the outputs
are observed and recorded in the simulation as a waveform trace. Additionally, the outputs
can be compared to a set of expected results for each test case. This allows for a unit to be
tested and verified for correct operation, as well as allowing the designer to find the source
of any faults in the unit. Figure 4.1 shows an example of a multiplier unit being simulated
with the ISim tool in Xilinx ISE, with the expected product being compared alongside the
actual output value from the multiplier unit. One can observe that this unit is performing as
expected, giving the correct product for every operand combination.
Figure 4.1.: The simulation process for a Wallace tree multiplier in Xilinx ISim.
In the context of the arithmetic units in this project, simulation is essential to ensure that
the correct result is computed for any set of operands. The unit should correctly handle
small and large numbers, corner cases such as one operand set to one or zero, and negative
numbers if appropriate. To achieve good testing coverage of the units, it is necessary to test
using a large number of operand combinations. As performing this many tests manually
would be tedious and impractical, it is desirable to automate the test, allowing for a quick
pass/fail decision to be made for a unit.
4.1. Testing Methodology For 8-bit Units
For the 8-bit arithmetic units, it is feasible to test every single combination of operands and verify the result. This is referred to as exhaustive testing. It is possible because, in the case of an 8-bit multiplier, there are 2^8 possible values for operand a and 2^8 possible values for operand b. Hence there are 2^8 × 2^8 = 2^16 = 65536 test cases to consider. This is a seemingly large number, but it can easily be covered by an automated test running on a computer. The testbench for the 8-bit adders and multipliers simply uses two nested loops to iterate through every possible value of both operands, checking that the output result is equal to that calculated in software. If a discrepancy occurs, it halts and logs the error; otherwise the testbench continues the simulation until the end. The code for this is given in the appendix.
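A minimal sketch of such an exhaustive testbench is shown below; the DUT name and port list are placeholders, and the project's actual testbenches are those listed in Appendix B.1.

`timescale 1ns / 1ps

// Exhaustive testbench for an 8-bit adder: every operand combination is
// applied and the output is checked against the sum computed in software.
module tb_adder_8bit_exhaustive;
    reg  [7:0] a, b;
    wire [8:0] s;
    integer i, j;

    adder_8bit dut (.a(a), .b(b), .s(s));   // hypothetical DUT interface

    initial begin
        for (i = 0; i < 256; i = i + 1) begin
            for (j = 0; j < 256; j = j + 1) begin
                a = i;
                b = j;
                #10;  // allow the combinational logic to settle
                if (s !== i + j) begin
                    $display("FAIL: %0d + %0d gave %0d", i, j, s);
                    $finish;
                end
            end
        end
        $display("PASS: all 65536 cases correct");
        $finish;
    end
endmodule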
4.2. Testing Methodology For Larger Units
The exhaustive testing approach, however, does not work in practice for 32-bit and 64-bit arithmetic units. This is because the number of test cases required scales exponentially as 2^2n. An exhaustive 32-bit multiplier testbench requires 2^64 ≈ 1.8 × 10^19 cases and a 64-bit unit requires 2^128 ≈ 3.4 × 10^38 cases. These are extremely large numbers and it simply is not feasible to test every possibility in a reasonable amount of time. An alternative approach is to test a large but feasible set of test cases, each with a randomly selected combination of operands. If the unit gives the correct result for all of these test cases and all paths through the unit have been exercised at least once, it can be assumed with a high degree of confidence that it will give the correct result in all cases.
The approach taken in this project for testing the 32-bit and 64-bit units is to generate
a set of test data for the testbench. This involves a Python script that selects two numbers
at random (within the bit width constraints) using the system’s random number generator.
It computes their sum, unsigned product and signed product and appends the output as a
formatted string of hexadecimal numbers to a test data file. This process is repeated for one
million cases. The process of generating random numbers occasionally generates duplicate
operand pairs. These duplicate pairs are removed from the test data file using standard UNIX
utilities, allowing for the script to be rerun to generate the remaining test cases until the test
data contains one million unique pairs of operands. The test data file can then be used by the
testbenches, which scan each line of the test data, set the input values according to the two
operands and compare the output with the expected result in the data. If the output differs
from the expected result, the test halts and logs the error, otherwise the test continues until
the last line of the test data is reached. The code for the test data script and the testbench
is given in the appendix, but the actual test data used is omitted from this report due to its
large size.
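The sketch below illustrates how a file-driven testbench of this kind can be structured in Verilog; the file name, line format and DUT interface are assumptions made for illustration, and the project's actual testbenches and generation script are listed in Appendix B.

`timescale 1ns / 1ps

// Random-vector testbench for a 32-bit adder: each line of the (assumed)
// test data file holds the two operands and the expected sum in hexadecimal.
module tb_adder_32bit_random;
    reg  [31:0] a, b;
    reg  [32:0] expected;
    wire [32:0] s;
    integer fd, status;

    adder_32bit dut (.a(a), .b(b), .s(s));   // hypothetical DUT interface

    initial begin
        fd = $fopen("testdata_32bit.txt", "r");
        if (fd == 0) begin
            $display("ERROR: could not open test data file");
            $finish;
        end
        while (!$feof(fd)) begin
            status = $fscanf(fd, "%h %h %h", a, b, expected);
            if (status == 3) begin
                #10;
                if (s !== expected) begin
                    $display("FAIL: %h + %h gave %h, expected %h", a, b, s, expected);
                    $finish;
                end
            end
        end
        $display("PASS: all test cases correct");
        $fclose(fd);
        $finish;
    end
endmodule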
Once a unit has successfully passed all of the test cases in the testbench, it can be assumed with high confidence to be functionally correct. It can then be synthesised as a hardware
unit suitable for implementation on the FPGA hardware. This process is covered in the next
chapter.
5. Implementation
Once an arithmetic unit has been designed and fully verified, it is ready to be implemented. This process involves transforming the Verilog source code for a hardware unit into binary bitstream data suitable for downloading to an FPGA device. The synthesis process also generates various statistics that are useful for analysing and comparing the various types of adders and multipliers. These will be the primary focus of this chapter.
5.1. The Synthesis Process
Synthesis is the process by which a hardware description of a logic unit is used to generate a hardware implementation in logic gates. This implementation can be targeted at the fabrication of a physical ASIC, or at a bitstream for the configuration of an FPGA. In this manner, synthesis is roughly analogous to compiling the source code of some software into a binary executable. The process is performed by the synthesis tool of an EDA toolchain; in the case of this project, the synthesis tool is XST (Xilinx Synthesis Technology), which is integrated into the Xilinx ISE suite. The main stages of synthesis are as follows:
• Translate: The translation stage parses the source file and generates a netlist (a list of the wires in the design) together with the logic gates associated with them. Various optimisations are applied to minimise the specified Boolean logic to a set of logic gates with equivalent truth tables. This assists in reducing the area and delay of the unit.
• Map: The map stage takes the aforementioned gate list and assigns the gates to specific logic slices and inputs/outputs on the FPGA. The LUTs are also configured to reflect the logic required for the design.
• Place-and-route: Once the design has been mapped to the FPGA, the place-and-route stage uses the netlist to decide how the design should be arranged on the chip and how the wires between the logic gates should be routed. A selection of optimisation target profiles can be used to influence this stage; for example, the tool can be directed to minimise area usage or delay, or to strike a balance between the two. This project utilises the Balanced profile for all units.
Once a unit has been synthesised, a bitstream can be generated from it. This requires the
designer to assign the unit’s inputs and outputs to the physical pins on the FPGA chip, along
with any other constraints that may be required. Once this has been completed, the toolchain
generates a complete bitstream of the entire FPGA’s configuration. This bitstream can then
be downloaded to the FPGA, finishing the process of implementing the design into hardware.
5.2. Synthesis Reports and Statistics
An important part of the synthesis process is the reports and statistics that are generated
along with the synthesised unit. These provide important details with regard to the unit such
as the area usage, the number of input/output buffers (IOBs) required, the estimated pin-to-
pin delay for each combination of input and output pin, the estimated power consumption
at a given clock rate, and so forth. This data is important from a development perspective
and here it will be utilised to evaluate each of the arithmetic units developed in the course
of this project.
Unfortunately, due to technical issues with the development software it was not possible
to generate reliable dynamic power consumption reports for the purposes of this project, nor
was it possible to interact directly with the implemented designs on the hardware. However,
since the dynamic power consumption of a digital circuit grows with the number of logic gates that switch, it can be inferred indirectly that a design with a larger area is expected to consume more power, assuming a similar proportion of gates switches with each computation.
5.2.1. Adders
In this project, the four adders discussed earlier were fully implemented, namely the ripple-
carry adder, the carry-lookahead adder, the carry-select adder and the carry-skip adder. These
were run through the synthesis tool, which generated a synthesis report for each. The first criterion for evaluating the adders was the area usage. This was quantified by examining the
number of slice LUTs required for the design. The results for the 8-bit variants are graphed
in Figure 5.1.
As would be expected, the ripple-carry adder's simple design gives it the smallest area usage of the four units, with eight LUTs used. What was not expected was that the carry-select adder also used an equal amount of area; this was most likely due to synthesis optimisations that allowed the adder to be implemented efficiently in the FPGA. The carry-lookahead adder required an additional LUT for its lookahead logic, while the carry-skip adder required the most, with ten LUTs. The scaling of these adders is depicted in Figure 5.2 and Figure 5.3 for the 32-bit and 64-bit adders respectively.

Figure 5.1.: Graph showing the area usage of the 8-bit adder variants, in terms of LUTs used.
From these graphs, it is immediately apparent that the area requirements of the carry-
lookahead adder scale up very quickly in relation to operand width. The 32-bit and 64-bit
carry-lookahead adders are significantly larger than the other variants, which suggests that
large carry-lookahead adders are not suitable for practical use. Again, both the ripple-carry
and carry-select adders use the least area, while the carry-skip adder uses noticeably more
LUTs.
Figure 5.2.: Graph showing the area usage of the 32-bit adder variants.
Figure 5.3.: Graph showing the area usage of the 64-bit adder variants.
The next criterion to analyse is the maximum worst-case delay, a value that must be determined in order to integrate the unit within a larger logic unit. This value is obtained from the pin-to-pin delay report, which details the delay between each bit of every input pin and each bit of every output pin. The largest value in this list is taken as the maximum delay. The results are shown in Figure 5.4.
Figure 5.4.: Graph showing the maximum worst-case delay of the 8-bit adder variants, measured in nanoseconds.
Somewhat surprisingly, the ripple-carry adder comes first, with the shortest delay compared to the other adders. This is likely because, with only eight full adders in the ripple-carry chain, the ripple delay is not yet large enough to be a significant issue; the benefits of the critical path optimisations present in the other adder designs therefore do not yet outweigh their additional cost. Figure 5.5 and Figure 5.6 show the maximum delay of the 32-bit and 64-bit adders respectively.
Here, the other adders now provide a tangible reduction in maximum delay as compared
to the ripple-carry adder. The carry-lookahead adder in particular has the shortest delay,
though this comes at the cost of significantly more area usage as discussed previously. The
32-bit carry-select adder also improves on the delay relative to the ripple-carry adder, while
the 32-bit carry-skip adder proves to be the slowest. This situation is reversed for the 64-bit
adders where the carry-skip adder proves to be faster than the carry-select adder, though
not as fast as the carry-lookahead adder. In fact, in this instance the carry-select adder has a longer delay than the ripple-carry adder. Hence, we can conclude that there is no ‘best’ adder design for all cases; the best choice of adder for each operand width is determined by the designer's requirements and by profiling the individual designs.

Figure 5.5.: Graph showing the maximum worst-case delay of the 32-bit adder variants.
Figure 5.6.: Graph showing the maximum worst-case delay of the 64-bit adder variants.
5.2.2. Multipliers
The array multiplier, Wallace tree multiplier and Dadda tree multiplier were all implemented as 8-bit units in this project. However, due to time constraints, only the array multiplier was implemented as 32-bit and 64-bit units. The majority of this section will therefore focus on the 8-bit units. Firstly, the area usage of these units is given in Figure 5.7.
Figure 5.7.: Graph showing the area usage of the 8-bit multipliers.
It is clear that the Wallace tree multiplier uses the least area with 73 LUTs as compared
to 74 for the array multiplier and 76 for the Dadda multiplier. This is despite the additional
overhead imposed by the MBE, which requires more logic than traditional calculation of the
partial products. In addition, the Wallace tree multiplier’s ability to handle multiplication of
signed numbers places it at a clear advantage against the array multiplier. Meanwhile, the
Dadda multiplier’s expanded design is visible in its increased area usage.
The next criterion is the maximum worst-case delay; the graph for this is given in Figure 5.8. Again, the Wallace tree multiplier emerges as the unequivocal winner with the shortest delay, while also being able to multiply signed numbers. The Dadda tree multiplier has the longest delay, although by a relatively small margin. However, it is entirely possible that different characteristics could be observed with 32-bit and 64-bit implementations of the tree multipliers, as was the case with the carry-select and carry-skip adders.
Figure 5.8.: Graph showing the maximum worst-case delay of the 8-bit multipliers.
For informative purposes, the 8-bit array multiplier was compared to its 32-bit and 64-bit
counterparts to analyse the scaling of the array multiplier with operand width. The graphs
for area and delay are given in Figure 5.9 and Figure 5.10 respectively.
The delay graph shows that the array multiplier scales slightly less than linearly across the three bit widths, with delays of 11 ns, 38 ns and 56 ns for the 8-bit, 32-bit and 64-bit multipliers respectively. However, it is the area graph that shows the enormous effect of the n^2 scaling of area usage. While the 8-bit multiplier only requires 74 LUTs, the 32-bit multiplier requires over 1400, and this balloons to almost 6000 for the 64-bit multiplier. Hence, it is obvious that the array multiplier is a poor choice of design for practical multiplier applications, as the area requirements are simply too large.
Figure 5.9.: Graph showing the scaling of area usage for array multipliers of various widths.
Figure 5.10.: Graph showing the scaling of maximum delay for array multipliers of various
widths.
6. Conclusion
This project has covered the complete development process of a variety of arithmetic units,
from the theory that underpins them through to their design and testing process, before being
implemented as complete hardware units. Additionally, it has covered the design properties
and characteristics that distinguish the units from each other, with the advantages and dis-
advantages of each unit discussed in detail. Each unit was also developed with a selection of
operand widths to investigate the effects of scaling on the characteristics of the units.
In particular, it can be concluded that for the adders, no single design emerged as a clear
best. The choice of adder used in a digital circuit is guided heavily by the requirements
of the designer’s project. For instance, a compact design requiring an 8-bit adder with min-
imal area usage and power consumption is best served by the corresponding ripple-carry
adder. A design that calls for minimal delay in an 8-bit adder is best served by the carry-
lookahead adder. For a 32-bit adder, the area scaling issues of a carry-lookahead adder make
it unsuitable for many applications, so a designer seeking out low delay while keeping area
usage acceptable would select the carry-select adder.
For the multipliers, the situation is more clear-cut. The 8-bit Wallace tree multiplier proved
to be superior to its array multiplier counterpart in both area and delay. Its use of a modified Booth
encoder and a tree structure allowed it to use less area while simultaneously possessing a
shorter delay. It also avoids the n² area scaling issue of the array multiplier, as seen with the
latter’s 32-bit and 64-bit variants. The ability to correctly multiply signed integers cements
its advantage as many digital designs will require the ability to multiply negative numbers
together. Meanwhile, the Dadda tree multiplier was disadvantaged by having a larger area
and delay than the other two multiplier designs. However, since these results were only
obtained for 8-bit multipliers, it is entirely possible that wider variants of the Dadda tree
multiplier may give more favourable characteristics than its Wallace counterpart.
The primary issues that occurred with this project were a lack of time as well as a lack of
prior knowledge and experience in digital hardware design and arithmetic units. In partic-
ular, understanding the logic and theory of the tree-based multiplier units and the modified
Booth encoder was very time-consuming. Possessing a thorough understanding of these
units was essential before development work could begin. Hence, there was only sufficient
time to complete the design, verification and implementation of the 8-bit variants of the tree
multipliers. Given more time, an analysis of the scaling characteristics of the tree multipliers
with 32-bit and 64-bit wide operands could have been undertaken.
Another issue was the inability to obtain data on the power consumption characteristics of
the arithmetic units as was originally intended, from both software-estimated values and actual
values measured from the hardware. Technical issues as well as a lack of prior experience with
the software made it difficult to obtain meaningful dynamic power consumption estimates.
Only estimates of static power consumption (from transistor leakage) were available, which
were not useful for quantifying the power consumed when evaluating a calculation. Ad-
ditionally, there were further issues with using the arithmetic units on the physical FPGA
hardware. While it was possible to generate a bitstream from the synthesised units and
download this to the hardware, there was no clear method of interfacing with the arithmetic
unit. It was not possible to send test data to the unit nor to read its output, severely limit-
ing the usefulness of this approach. With more time available, it would have been easier to
overcome these issues. It would also have been possible to make physical measurements on
the hardware, allowing for a useful analysis of real-world power consumption by the units.
Despite these issues, the project was still successful in that many useful results were ob-
tained. Nearly all of the intended units were fully designed, verified and implemented in the
course of this project. Ultimately, it has been an extremely rewarding experience and has
given a significant amount of in-depth knowledge and practical hands-on experience in the
realm of digital hardware design.
Bibliography
Bohsali, M. and M. Doan. Rectangular Styled Wallace Tree Multipliers. url: http://www.veech.com/index files/Wallace%20Tree.pdf.
EDA Cafe. Datapath Logic Cells. url: http://www10.edacafe.com/book/ASIC/Book/Book/Book/CH02/CH02.6.php.
Nvidia. What is GPU Accelerated Computing? url: http://www.nvidia.com/object/what-is-gpu-computing.html.
Punnaiah, S. et al. ‘Design and Evaluation of High Performance Multiplier Using Modified Booth Encoding Algorithm’. In: International Journal of Engineering and Innovative Technology 1 (6 June 2012), pp. 16–19.
Saharan, K. and J. Kaur. ‘Design and Implementation of an Efficient Modified Booth Multiplier using VHDL’. In: International Journal of Advances in Engineering Sciences 3 (3 July 2013), pp. 78–81.
Terms, Tech. ALU (Arithmetic Logic Unit) Definition. url: http://www.techterms.com/definition/alu.
Tohoku University. Hardware Algorithms For Arithmetic Modules. url: http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html.
Tufts University. 4*4 multiplier. url: http://www.eecs.tufts.edu/~ryun01/vlsi/verilog simulation.htm.
Vahid, Frank. Digital Design. Wiley, 2011. isbn: 978-0-470-53108-2.
Vladutiu, Mircea. Computer Arithmetic. Algorithms and Hardware Implementations. Springer, 2012. isbn: 978-3-642-18315-7.
Yovits, Marshall C. Advances in Computers. Academic Press, 1993, p. 105. isbn: 978-0-470-53108-2.
Appendix A.
Verilog Source Code for the 8-bit Arithmetic Units
A.1. Adders
A.1.1. Half Adder and Full Adder
// Simple half-adder that computes the sum and carry-out of two bits
module halfadder(input wire a,
                 input wire b,
                 output wire s,
                 output wire cout);

    assign s = a ^ b;
    assign cout = a & b;

endmodule

// Simple full-adder that computes the sum and carry-out of two bits, plus a carry-in
module fulladder(input wire a,
                 input wire b,
                 input wire cin,
                 output wire s,
                 output wire cout);

    wire s0, c0, c1;

    // Two half-adders are chained to create a full adder
    halfadder ha0 (.a(a),  .b(b),   .s(s0), .cout(c0));
    halfadder ha1 (.a(s0), .b(cin), .s(s),  .cout(c1));
    assign cout = c0 | c1;

endmodule
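As an illustration of how a small unit such as this can be verified exhaustively, the following sketch drives the full adder with all eight input combinations and compares the outputs against the expected sum. It is only an illustrative sketch: the testbench module name and its messages are arbitrary choices.

// Illustrative sketch only: exhaustively check the full adder against the expected sum.
module fulladder_check;
    reg a, b, cin;
    wire s, cout;
    integer i;

    fulladder dut (.a(a), .b(b), .cin(cin), .s(s), .cout(cout));

    initial begin
        for (i = 0; i < 8; i = i + 1) begin
            {a, b, cin} = i[2:0];   // Apply one of the eight input combinations
            #1;
            if ({cout, s} !== a + b + cin)
                $display("Mismatch for a=%b b=%b cin=%b: got %b%b", a, b, cin, cout, s);
        end
        $display("Full adder check complete");
        $finish;
    end
endmodule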
A.1.2. Ripple-carry Adder
// 4-bit ripple-carry adder with four full-adders
module ripplecarryadd4(input wire [3:0] a,
                       input wire [3:0] b,
                       input wire cin,
                       output wire [3:0] s,
                       output wire cout);

    wire [3:0] c;
    assign cout = c[3];

    fulladder fa0 (.a(a[0]), .b(b[0]), .cin(cin),  .s(s[0]), .cout(c[0]));
    fulladder fa1 (.a(a[1]), .b(b[1]), .cin(c[0]), .s(s[1]), .cout(c[1]));
    fulladder fa2 (.a(a[2]), .b(b[2]), .cin(c[1]), .s(s[2]), .cout(c[2]));
    fulladder fa3 (.a(a[3]), .b(b[3]), .cin(c[2]), .s(s[3]), .cout(c[3]));

endmodule

// 8-bit ripple-carry adder with two 4-bit RCAs
module ripplecarryadd8(input wire [7:0] a,
                       input wire [7:0] b,
                       input wire cin,
                       output wire [7:0] s,
                       output wire cout);

    wire c0;

    ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(cin), .s(s[3:0]), .cout(c0));
    ripplecarryadd4 rca4_1 (.a(a[7:4]), .b(b[7:4]), .cin(c0),  .s(s[7:4]), .cout(cout));

endmodule
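The 32-bit and 64-bit ripple-carry adders follow the same chaining pattern. As an illustrative sketch only, the same structure can be expressed for an arbitrary operand width with a generate loop; the module name ripplecarryaddn and the WIDTH parameter below are placeholders rather than part of the project sources.

// Illustrative sketch only: a width-parameterised ripple-carry adder
// built with a generate loop rather than explicit full-adder instances.
module ripplecarryaddn #(parameter WIDTH = 8)
                        (input  wire [WIDTH-1:0] a,
                         input  wire [WIDTH-1:0] b,
                         input  wire             cin,
                         output wire [WIDTH-1:0] s,
                         output wire             cout);

    wire [WIDTH:0] c;           // Carry chain; c[0] is the carry-in
    assign c[0] = cin;
    assign cout = c[WIDTH];

    genvar i;
    generate
        for (i = 0; i < WIDTH; i = i + 1) begin : chain
            fulladder fa (.a(a[i]), .b(b[i]), .cin(c[i]), .s(s[i]), .cout(c[i+1]));
        end
    endgenerate

endmodule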
A.1.3. Carry-lookahead Adder
// Propagate/Generate adder that computes the P/G signals
// instead of the carry-out as in the full adder
module pgadder(input wire a,
               input wire b,
               input wire cin,
               output wire s,
               output wire p,
               output wire g);

    assign s = a ^ b ^ cin;
    assign p = a | b;
    assign g = a & b;

endmodule

// 8-bit carry-lookahead adder
module carrylookaheadadd8(input wire [7:0] a,
                          input wire [7:0] b,
                          input wire cin,
                          output wire [7:0] s,
                          output wire cout);

    // Propagate, generate and carry output signals for each bit
    wire [7:0] p, g, c;

    // The formula for the lookahead is c[i+1] = g[i] | (p[i] & c[i]), where c[i] is expanded recursively
    assign c[0] = cin;
    assign c[1] = g[0] | (p[0] & c[0]);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0]);
    assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
                | (p[3] & p[2] & p[1] & p[0] & c[0]);
    assign c[5] = g[4] | (p[4] & g[3]) | (p[4] & p[3] & g[2]) | (p[4] & p[3] & p[2] & g[1])
                | (p[4] & p[3] & p[2] & p[1] & g[0]) | (p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
    assign c[6] = g[5] | (p[5] & g[4]) | (p[5] & p[4] & g[3]) | (p[5] & p[4] & p[3] & g[2])
                | (p[5] & p[4] & p[3] & p[2] & g[1]) | (p[5] & p[4] & p[3] & p[2] & p[1] & g[0])
                | (p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
    assign c[7] = g[6] | (p[6] & g[5]) | (p[6] & p[5] & g[4]) | (p[6] & p[5] & p[4] & g[3])
                | (p[6] & p[5] & p[4] & p[3] & g[2]) | (p[6] & p[5] & p[4] & p[3] & p[2] & g[1])
                | (p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & g[0])
                | (p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
    assign cout = g[7] | (p[7] & g[6]) | (p[7] & p[6] & g[5]) | (p[7] & p[6] & p[5] & g[4])
                | (p[7] & p[6] & p[5] & p[4] & g[3]) | (p[7] & p[6] & p[5] & p[4] & p[3] & g[2])
                | (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & g[1])
                | (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & g[0])
                | (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);

    // Propagate/Generate adders which give the P/G signals and use the carries computed by the CLA logic
    pgadder pga0 (.a(a[0]), .b(b[0]), .cin(c[0]), .s(s[0]), .p(p[0]), .g(g[0]));
    pgadder pga1 (.a(a[1]), .b(b[1]), .cin(c[1]), .s(s[1]), .p(p[1]), .g(g[1]));
    pgadder pga2 (.a(a[2]), .b(b[2]), .cin(c[2]), .s(s[2]), .p(p[2]), .g(g[2]));
    pgadder pga3 (.a(a[3]), .b(b[3]), .cin(c[3]), .s(s[3]), .p(p[3]), .g(g[3]));
    pgadder pga4 (.a(a[4]), .b(b[4]), .cin(c[4]), .s(s[4]), .p(p[4]), .g(g[4]));
    pgadder pga5 (.a(a[5]), .b(b[5]), .cin(c[5]), .s(s[5]), .p(p[5]), .g(g[5]));
    pgadder pga6 (.a(a[6]), .b(b[6]), .cin(c[6]), .s(s[6]), .p(p[6]), .g(g[6]));
    pgadder pga7 (.a(a[7]), .b(b[7]), .cin(c[7]), .s(s[7]), .p(p[7]), .g(g[7]));

endmodule
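The expanded assignments above are the flattened form of the recurrence c[i+1] = g[i] | (p[i] & c[i]). For illustration only, the sketch below writes that recurrence directly in a generate loop; the module name is a placeholder. Note that synthesising this loop form simply recreates a rippling chain of AND-OR gates through the carries, so it is the flattening performed above that actually removes the serial carry path.

// Illustration only: the carry recurrence written directly.  Logically
// equivalent to the flattened equations in carrylookaheadadd8, but it
// synthesises to a rippling carry chain rather than lookahead logic.
module clacarrychain8(input wire [7:0] p,
                      input wire [7:0] g,
                      input wire cin,
                      output wire [8:0] c);

    assign c[0] = cin;

    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : chain
            assign c[i+1] = g[i] | (p[i] & c[i]);
        end
    endgenerate

endmodule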
A.1.4. Carry-select Adder
// 8-bit carry-select adder
module carryselectadd8(input wire [7:0] a,
                       input wire [7:0] b,
                       input wire cin,
                       output wire [7:0] s,
                       output wire cout);

    wire cs, cout_0, cout_1;
    wire [3:0] result_0, result_1;

    // The appropriate output for the upper bits is selected by the carry-select signal
    assign {cout, s[7:4]} = (cs) ? {cout_1, result_1} : {cout_0, result_0};

    // Simple RCA adds the lower four bits and emits a carry-select signal
    ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(cin), .s(s[3:0]), .cout(cs));

    // The upper bits are computed twice, for a carry-in of 0 and 1, with the correct answer selected later
    ripplecarryadd4 rca4_1_0 (.a(a[7:4]), .b(b[7:4]), .cin(1'b0), .s(result_0), .cout(cout_0));
    ripplecarryadd4 rca4_1_1 (.a(a[7:4]), .b(b[7:4]), .cin(1'b1), .s(result_1), .cout(cout_1));

endmodule
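Wider carry-select adders can be formed by cascading blocks such as this one, with the carry out of each block driving the carry in of the next. The 16-bit arrangement below is an illustrative sketch only and is not necessarily how the project's 32-bit and 64-bit carry-select variants were structured.

// Illustrative sketch only: one possible way to cascade two 8-bit
// carry-select blocks into a 16-bit adder.  The block-to-block carry
// still ripples between the two blocks.
module carryselectadd16(input wire [15:0] a,
                        input wire [15:0] b,
                        input wire cin,
                        output wire [15:0] s,
                        output wire cout);

    wire c8;    // Carry between the lower and upper 8-bit blocks

    carryselectadd8 lo (.a(a[7:0]),  .b(b[7:0]),  .cin(cin), .s(s[7:0]),  .cout(c8));
    carryselectadd8 hi (.a(a[15:8]), .b(b[15:8]), .cin(c8),  .s(s[15:8]), .cout(cout));

endmodule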
A.1.5. Carry-skip Adder
// A 4-bit RCA that provides a propagate signal for the carry-skip adder
module pgripplecarryadd4(input wire [3:0] a,
                         input wire [3:0] b,
                         input wire cin,
                         output wire [3:0] s,
                         output wire cout,
                         output wire p);

    wire [3:0] c;
    assign cout = c[3];
    assign p = (a[0] | b[0]) & (a[1] | b[1]) & (a[2] | b[2]) & (a[3] | b[3]);

    fulladder fa0 (.a(a[0]), .b(b[0]), .cin(cin),  .s(s[0]), .cout(c[0]));
    fulladder fa1 (.a(a[1]), .b(b[1]), .cin(c[0]), .s(s[1]), .cout(c[1]));
    fulladder fa2 (.a(a[2]), .b(b[2]), .cin(c[1]), .s(s[2]), .cout(c[2]));
    fulladder fa3 (.a(a[3]), .b(b[3]), .cin(c[2]), .s(s[3]), .cout(c[3]));

endmodule

// 8-bit carry-skip adder
module carryskipadd8(input wire [7:0] a,
                     input wire [7:0] b,
                     input wire cin,
                     output wire [7:0] s,
                     output wire cout);

    wire [1:0] rcin, rcout;
    wire p;

    assign rcin[0] = cin;
    assign rcin[1] = rcout[0];

    // If the second RCA will propagate a carry, simply pass rcout[0] to the cout, skipping the second RCA
    assign cout = rcout[1] | (p & rcout[0]);

    ripplecarryadd4   rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(rcin[0]), .s(s[3:0]), .cout(rcout[0]));
    pgripplecarryadd4 rca4_1 (.a(a[7:4]), .b(b[7:4]), .cin(rcin[1]), .s(s[7:4]), .cout(rcout[1]), .p(p));

endmodule
A.2. Multipliers
A.2.1. Array Multiplier
// Array multiplier module that computes a bit product and adds it to a sum-in
module mulmodule(input wire x,
                 input wire y,
                 input wire sin,
                 input wire cin,
                 output wire cout,
                 output wire sout);

    wire b_in;
    assign b_in = x & y;

    fulladder adder (.a(sin), .b(b_in), .cin(cin), .s(sout), .cout(cout));

endmodule

// 8-bit unsigned array multiplier
module arraymultiplier8(input wire [7:0] a,
                        input wire [7:0] b,
                        output wire [15:0] result);

    wire [7:0] c [8:0], s [7:0];    // Intermediate carry and sum wires

    // Partial products of multiplicand with each multiplier bit
    mulmodule mm00_00 (.x(a[0]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][0]), .sout(s[0][0]));
    mulmodule mm00_01 (.x(a[1]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][1]), .sout(s[0][1]));
    mulmodule mm00_02 (.x(a[2]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][2]), .sout(s[0][2]));
    mulmodule mm00_03 (.x(a[3]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][3]), .sout(s[0][3]));
    mulmodule mm00_04 (.x(a[4]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][4]), .sout(s[0][4]));
    mulmodule mm00_05 (.x(a[5]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][5]), .sout(s[0][5]));
    mulmodule mm00_06 (.x(a[6]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][6]), .sout(s[0][6]));
    mulmodule mm00_07 (.x(a[7]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][7]), .sout(s[0][7]));

    mulmodule mm01_00 (.x(a[0]), .y(b[1]), .sin(s[0][1]), .cin(c[0][0]), .cout(c[1][0]), .sout(s[1][0]));
    mulmodule mm01_01 (.x(a[1]), .y(b[1]), .sin(s[0][2]), .cin(c[0][1]), .cout(c[1][1]), .sout(s[1][1]));
    mulmodule mm01_02 (.x(a[2]), .y(b[1]), .sin(s[0][3]), .cin(c[0][2]), .cout(c[1][2]), .sout(s[1][2]));
    mulmodule mm01_03 (.x(a[3]), .y(b[1]), .sin(s[0][4]), .cin(c[0][3]), .cout(c[1][3]), .sout(s[1][3]));
    mulmodule mm01_04 (.x(a[4]), .y(b[1]), .sin(s[0][5]), .cin(c[0][4]), .cout(c[1][4]), .sout(s[1][4]));
    mulmodule mm01_05 (.x(a[5]), .y(b[1]), .sin(s[0][6]), .cin(c[0][5]), .cout(c[1][5]), .sout(s[1][5]));
    mulmodule mm01_06 (.x(a[6]), .y(b[1]), .sin(s[0][7]), .cin(c[0][6]), .cout(c[1][6]), .sout(s[1][6]));
    mulmodule mm01_07 (.x(a[7]), .y(b[1]), .sin(1'b0),    .cin(c[0][7]), .cout(c[1][7]), .sout(s[1][7]));

    mulmodule mm02_00 (.x(a[0]), .y(b[2]), .sin(s[1][1]), .cin(c[1][0]), .cout(c[2][0]), .sout(s[2][0]));
    mulmodule mm02_01 (.x(a[1]), .y(b[2]), .sin(s[1][2]), .cin(c[1][1]), .cout(c[2][1]), .sout(s[2][1]));
    mulmodule mm02_02 (.x(a[2]), .y(b[2]), .sin(s[1][3]), .cin(c[1][2]), .cout(c[2][2]), .sout(s[2][2]));
    mulmodule mm02_03 (.x(a[3]), .y(b[2]), .sin(s[1][4]), .cin(c[1][3]), .cout(c[2][3]), .sout(s[2][3]));
    mulmodule mm02_04 (.x(a[4]), .y(b[2]), .sin(s[1][5]), .cin(c[1][4]), .cout(c[2][4]), .sout(s[2][4]));
    mulmodule mm02_05 (.x(a[5]), .y(b[2]), .sin(s[1][6]), .cin(c[1][5]), .cout(c[2][5]), .sout(s[2][5]));
    mulmodule mm02_06 (.x(a[6]), .y(b[2]), .sin(s[1][7]), .cin(c[1][6]), .cout(c[2][6]), .sout(s[2][6]));
    mulmodule mm02_07 (.x(a[7]), .y(b[2]), .sin(1'b0),    .cin(c[1][7]), .cout(c[2][7]), .sout(s[2][7]));

    mulmodule mm03_00 (.x(a[0]), .y(b[3]), .sin(s[2][1]), .cin(c[2][0]), .cout(c[3][0]), .sout(s[3][0]));
    mulmodule mm03_01 (.x(a[1]), .y(b[3]), .sin(s[2][2]), .cin(c[2][1]), .cout(c[3][1]), .sout(s[3][1]));
    mulmodule mm03_02 (.x(a[2]), .y(b[3]), .sin(s[2][3]), .cin(c[2][2]), .cout(c[3][2]), .sout(s[3][2]));
    mulmodule mm03_03 (.x(a[3]), .y(b[3]), .sin(s[2][4]), .cin(c[2][3]), .cout(c[3][3]), .sout(s[3][3]));
    mulmodule mm03_04 (.x(a[4]), .y(b[3]), .sin(s[2][5]), .cin(c[2][4]), .cout(c[3][4]), .sout(s[3][4]));
    mulmodule mm03_05 (.x(a[5]), .y(b[3]), .sin(s[2][6]), .cin(c[2][5]), .cout(c[3][5]), .sout(s[3][5]));
    mulmodule mm03_06 (.x(a[6]), .y(b[3]), .sin(s[2][7]), .cin(c[2][6]), .cout(c[3][6]), .sout(s[3][6]));
    mulmodule mm03_07 (.x(a[7]), .y(b[3]), .sin(1'b0),    .cin(c[2][7]), .cout(c[3][7]), .sout(s[3][7]));

    mulmodule mm04_00 (.x(a[0]), .y(b[4]), .sin(s[3][1]), .cin(c[3][0]), .cout(c[4][0]), .sout(s[4][0]));
    mulmodule mm04_01 (.x(a[1]), .y(b[4]), .sin(s[3][2]), .cin(c[3][1]), .cout(c[4][1]), .sout(s[4][1]));
    mulmodule mm04_02 (.x(a[2]), .y(b[4]), .sin(s[3][3]), .cin(c[3][2]), .cout(c[4][2]), .sout(s[4][2]));
    mulmodule mm04_03 (.x(a[3]), .y(b[4]), .sin(s[3][4]), .cin(c[3][3]), .cout(c[4][3]), .sout(s[4][3]));
    mulmodule mm04_04 (.x(a[4]), .y(b[4]), .sin(s[3][5]), .cin(c[3][4]), .cout(c[4][4]), .sout(s[4][4]));
    mulmodule mm04_05 (.x(a[5]), .y(b[4]), .sin(s[3][6]), .cin(c[3][5]), .cout(c[4][5]), .sout(s[4][5]));
    mulmodule mm04_06 (.x(a[6]), .y(b[4]), .sin(s[3][7]), .cin(c[3][6]), .cout(c[4][6]), .sout(s[4][6]));
    mulmodule mm04_07 (.x(a[7]), .y(b[4]), .sin(1'b0),    .cin(c[3][7]), .cout(c[4][7]), .sout(s[4][7]));

    mulmodule mm05_00 (.x(a[0]), .y(b[5]), .sin(s[4][1]), .cin(c[4][0]), .cout(c[5][0]), .sout(s[5][0]));
    mulmodule mm05_01 (.x(a[1]), .y(b[5]), .sin(s[4][2]), .cin(c[4][1]), .cout(c[5][1]), .sout(s[5][1]));
    mulmodule mm05_02 (.x(a[2]), .y(b[5]), .sin(s[4][3]), .cin(c[4][2]), .cout(c[5][2]), .sout(s[5][2]));
    mulmodule mm05_03 (.x(a[3]), .y(b[5]), .sin(s[4][4]), .cin(c[4][3]), .cout(c[5][3]), .sout(s[5][3]));
    mulmodule mm05_04 (.x(a[4]), .y(b[5]), .sin(s[4][5]), .cin(c[4][4]), .cout(c[5][4]), .sout(s[5][4]));
    mulmodule mm05_05 (.x(a[5]), .y(b[5]), .sin(s[4][6]), .cin(c[4][5]), .cout(c[5][5]), .sout(s[5][5]));
    mulmodule mm05_06 (.x(a[6]), .y(b[5]), .sin(s[4][7]), .cin(c[4][6]), .cout(c[5][6]), .sout(s[5][6]));
    mulmodule mm05_07 (.x(a[7]), .y(b[5]), .sin(1'b0),    .cin(c[4][7]), .cout(c[5][7]), .sout(s[5][7]));

    mulmodule mm06_00 (.x(a[0]), .y(b[6]), .sin(s[5][1]), .cin(c[5][0]), .cout(c[6][0]), .sout(s[6][0]));
    mulmodule mm06_01 (.x(a[1]), .y(b[6]), .sin(s[5][2]), .cin(c[5][1]), .cout(c[6][1]), .sout(s[6][1]));
    mulmodule mm06_02 (.x(a[2]), .y(b[6]), .sin(s[5][3]), .cin(c[5][2]), .cout(c[6][2]), .sout(s[6][2]));
    mulmodule mm06_03 (.x(a[3]), .y(b[6]), .sin(s[5][4]), .cin(c[5][3]), .cout(c[6][3]), .sout(s[6][3]));
    mulmodule mm06_04 (.x(a[4]), .y(b[6]), .sin(s[5][5]), .cin(c[5][4]), .cout(c[6][4]), .sout(s[6][4]));
    mulmodule mm06_05 (.x(a[5]), .y(b[6]), .sin(s[5][6]), .cin(c[5][5]), .cout(c[6][5]), .sout(s[6][5]));
    mulmodule mm06_06 (.x(a[6]), .y(b[6]), .sin(s[5][7]), .cin(c[5][6]), .cout(c[6][6]), .sout(s[6][6]));
    mulmodule mm06_07 (.x(a[7]), .y(b[6]), .sin(1'b0),    .cin(c[5][7]), .cout(c[6][7]), .sout(s[6][7]));

    mulmodule mm07_00 (.x(a[0]), .y(b[7]), .sin(s[6][1]), .cin(c[6][0]), .cout(c[7][0]), .sout(s[7][0]));
    mulmodule mm07_01 (.x(a[1]), .y(b[7]), .sin(s[6][2]), .cin(c[6][1]), .cout(c[7][1]), .sout(s[7][1]));
    mulmodule mm07_02 (.x(a[2]), .y(b[7]), .sin(s[6][3]), .cin(c[6][2]), .cout(c[7][2]), .sout(s[7][2]));
    mulmodule mm07_03 (.x(a[3]), .y(b[7]), .sin(s[6][4]), .cin(c[6][3]), .cout(c[7][3]), .sout(s[7][3]));
    mulmodule mm07_04 (.x(a[4]), .y(b[7]), .sin(s[6][5]), .cin(c[6][4]), .cout(c[7][4]), .sout(s[7][4]));
    mulmodule mm07_05 (.x(a[5]), .y(b[7]), .sin(s[6][6]), .cin(c[6][5]), .cout(c[7][5]), .sout(s[7][5]));
    mulmodule mm07_06 (.x(a[6]), .y(b[7]), .sin(s[6][7]), .cin(c[6][6]), .cout(c[7][6]), .sout(s[7][6]));
    mulmodule mm07_07 (.x(a[7]), .y(b[7]), .sin(1'b0),    .cin(c[6][7]), .cout(c[7][7]), .sout(s[7][7]));

    // Lower 8 bits can be obtained from the sum out of the last layer
    assign result[0] = s[0][0];
    assign result[1] = s[1][0];
    assign result[2] = s[2][0];
    assign result[3] = s[3][0];
    assign result[4] = s[4][0];
    assign result[5] = s[5][0];
    assign result[6] = s[6][0];
    assign result[7] = s[7][0];

    // Upper 8 bits need to be summed with carry-outs from previous bits
    halfadder ha00 (.a(s[7][1]), .b(c[7][0]),                .s(result[8]),  .cout(c[8][0]));
    fulladder fa01 (.a(s[7][2]), .b(c[7][1]), .cin(c[8][0]), .s(result[9]),  .cout(c[8][1]));
    fulladder fa02 (.a(s[7][3]), .b(c[7][2]), .cin(c[8][1]), .s(result[10]), .cout(c[8][2]));
    fulladder fa03 (.a(s[7][4]), .b(c[7][3]), .cin(c[8][2]), .s(result[11]), .cout(c[8][3]));
    fulladder fa04 (.a(s[7][5]), .b(c[7][4]), .cin(c[8][3]), .s(result[12]), .cout(c[8][4]));
    fulladder fa05 (.a(s[7][6]), .b(c[7][5]), .cin(c[8][4]), .s(result[13]), .cout(c[8][5]));
    fulladder fa06 (.a(s[7][7]), .b(c[7][6]), .cin(c[8][5]), .s(result[14]), .cout(c[8][6]));
    assign result[15] = c[7][7] ^ c[8][6];

endmodule
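In simulation, a structural multiplier such as this can be checked against the behavioural '*' operator. The following sketch is illustrative only; the module name and the choice of 100 random operand pairs are arbitrary.

// Illustrative sketch only: compare the structural array multiplier
// against the behavioural '*' operator for random unsigned operand pairs.
module arraymultiplier8_check;
    reg  [7:0]  a, b;
    wire [15:0] result;
    integer i;

    arraymultiplier8 dut (.a(a), .b(b), .result(result));

    initial begin
        for (i = 0; i < 100; i = i + 1) begin
            a = $random;
            b = $random;
            #1;
            if (result !== a * b)
                $display("Mismatch: %d * %d gave %d", a, b, result);
        end
        $display("Array multiplier check complete");
        $finish;
    end
endmodule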
A.2.2. Modified Booth Encoder
// 8-bit Modified Booth Encoder to generate four partial products to be summed
module boothencoder8(input wire [7:0] a,
                     input wire [7:0] b,
                     output reg [8:0] p00,
                     output reg [8:0] p01,
                     output reg [8:0] p02,
                     output reg [8:0] p03);

    // Group multiplier bits into overlapping groups of three bits, then decide
    // what the partial products should be based on the bits
    always @(a or b)
    begin
        // Equivalent to appending a zero to bits 1 and 0, only need to check four cases
        case (b[1:0])
            2'b00:   p00 <= 9'b000000000;
            2'b01:   p00 <= $signed(a);
            2'b10:   p00 <= {~a, 1'b1};
            2'b11:   p00 <= $signed(~a);
            default: p00 <= 9'bxxxxxxxxx;
        endcase

        case (b[3:1])
            3'b000:  p01 <= 9'b000000000;  // P = 0
            3'b001:  p01 <= $signed(a);    // P = A
            3'b010:  p01 <= $signed(a);    // P = A
            3'b011:  p01 <= {a, 1'b0};     // P = 2A
            3'b100:  p01 <= {~a, 1'b1};    // P = -2A
            3'b101:  p01 <= $signed(~a);   // P = -A
            3'b110:  p01 <= $signed(~a);   // P = -A
            3'b111:  p01 <= 9'b111111111;  // P = 0 (all 1s, plus complement bit gives 0)
            default: p01 <= 9'bxxxxxxxxx;  // Should not occur in normal operation with defined inputs
        endcase

        case (b[5:3])
            3'b000:  p02 <= 9'b000000000;
            3'b001:  p02 <= $signed(a);
            3'b010:  p02 <= $signed(a);
            3'b011:  p02 <= {a, 1'b0};
            3'b100:  p02 <= {~a, 1'b1};
            3'b101:  p02 <= $signed(~a);
            3'b110:  p02 <= $signed(~a);
            3'b111:  p02 <= 9'b111111111;
            default: p02 <= 9'bxxxxxxxxx;
        endcase

        case (b[7:5])
            3'b000:  p03 <= 9'b000000000;
            3'b001:  p03 <= $signed(a);
            3'b010:  p03 <= $signed(a);
            3'b011:  p03 <= {a, 1'b0};
            3'b100:  p03 <= {~a, 1'b1};
            3'b101:  p03 <= $signed(~a);
            3'b110:  p03 <= $signed(~a);
            3'b111:  p03 <= 9'b111111111;
            default: p03 <= 9'bxxxxxxxxx;
        endcase
    end

endmodule
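The sketch below, which is illustrative only, drives the encoder with a single operand pair and prints the four recoded partial products. Note that negative Booth digits are emitted here in one's-complement form; the additional '+1' needed to complete the two's complement is supplied later in the tree multipliers, which can be seen feeding the top bit of each multiplier group (for example b[5] and b[7] at weights 2^4 and 2^6) into the reduction.

// Illustrative sketch only: drive the Booth encoder with one operand
// pair and print the four recoded partial products.
module boothencoder8_view;
    reg  [7:0] a, b;
    wire [8:0] p00, p01, p02, p03;

    boothencoder8 dut (.a(a), .b(b), .p00(p00), .p01(p01), .p02(p02), .p03(p03));

    initial begin
        a = 8'sd13;      // Example operands, chosen arbitrarily
        b = -8'sd11;
        #1 $display("a=%d b=%d -> p00=%b p01=%b p02=%b p03=%b",
                    $signed(a), $signed(b), p00, p01, p02, p03);
        $finish;
    end
endmodule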
A.2.3. Wallace Tree Multiplier
// 8-bit signed Wallace Tree multiplier
module wallacetreemultiplier8(input wire [7:0] a,
                              input wire [7:0] b,
                              output wire [15:0] result);

    wire [8:0] p [3:0];
    wire [2:0] c [15:2];
    wire [2:0] w [15:2];
    wire cla0_cout;

    // Use the modified Booth encoder to generate the partial products
    boothencoder8 pp (.a(a), .b(b), .p00(p[0]), .p01(p[1]), .p02(p[2]), .p03(p[3]));

    // Sum the partial products by wire weight until only two wires are left for each weight

    // Weight 2^2
    halfadder ha02_00 (.a(p[0][2]), .b(p[1][0]), .s(w[2][0]), .cout(c[2][0]));

    // Weight 2^3
    halfadder ha03_00 (.a(p[0][3]), .b(p[1][1]), .s(w[3][0]), .cout(c[3][0]));

    // Weight 2^4
    fulladder fa04_00 (.a(p[0][4]), .b(p[1][2]), .cin(p[2][0]), .s(w[4][0]), .cout(c[4][0]));
    halfadder ha04_01 (.a(w[4][0]), .b(b[5]), .s(w[4][1]), .cout(c[4][1]));

    // Weight 2^5
    fulladder fa05_00 (.a(p[0][5]), .b(p[1][3]), .cin(p[2][1]), .s(w[5][0]), .cout(c[5][0]));
    halfadder ha05_01 (.a(w[5][0]), .b(c[4][0]), .s(w[5][1]), .cout(c[5][1]));

    // Weight 2^6
    fulladder fa06_00 (.a(p[0][6]), .b(p[1][4]), .cin(p[2][2]), .s(w[6][0]), .cout(c[6][0]));
    fulladder fa06_01 (.a(w[6][0]), .b(p[3][0]), .cin(b[7]), .s(w[6][1]), .cout(c[6][1]));
    halfadder fa06_02 (.a(w[6][1]), .b(c[5][0]), .s(w[6][2]), .cout(c[6][2]));

    // Weight 2^7
    fulladder fa07_00 (.a(p[0][7]), .b(p[1][5]), .cin(p[2][3]), .s(w[7][0]), .cout(c[7][0]));
    fulladder fa07_01 (.a(w[7][0]), .b(p[3][1]), .cin(c[6][0]), .s(w[7][1]), .cout(c[7][1]));
    halfadder ha07_02 (.a(w[7][1]), .b(c[6][1]), .s(w[7][2]), .cout(c[7][2]));

    // Weight 2^8
    fulladder fa08_00 (.a(p[0][8]), .b(p[1][6]), .cin(p[2][4]), .s(w[8][0]), .cout(c[8][0]));
    fulladder fa08_01 (.a(w[8][0]), .b(p[3][2]), .cin(c[7][0]), .s(w[8][1]), .cout(c[8][1]));
    halfadder ha08_02 (.a(w[8][1]), .b(c[7][1]), .s(w[8][2]), .cout(c[8][2]));

    // Weight 2^9
    fulladder fa09_00 (.a(p[0][8]), .b(p[1][7]), .cin(p[2][5]), .s(w[9][0]), .cout(c[9][0]));
    fulladder fa09_01 (.a(w[9][0]), .b(p[3][3]), .cin(c[8][0]), .s(w[9][1]), .cout(c[9][1]));
    halfadder fa09_02 (.a(w[9][1]), .b(c[8][1]), .s(w[9][2]), .cout(c[9][2]));

    // Weight 2^10
    fulladder fa10_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][6]), .s(w[10][0]), .cout(c[10][0]));
    fulladder fa10_01 (.a(w[10][0]), .b(p[3][4]), .cin(c[9][0]), .s(w[10][1]), .cout(c[10][1]));
    halfadder ha10_02 (.a(w[10][1]), .b(c[9][1]), .s(w[10][2]), .cout(c[10][2]));

    // Weight 2^11
    fulladder fa11_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][7]), .s(w[11][0]), .cout(c[11][0]));
    fulladder fa11_01 (.a(w[11][0]), .b(p[3][5]), .cin(c[10][0]), .s(w[11][1]), .cout(c[11][1]));
    halfadder ha11_02 (.a(w[11][1]), .b(c[10][1]), .s(w[11][2]), .cout(c[11][2]));

    // Weight 2^12
    fulladder fa12_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][8]), .s(w[12][0]), .cout(c[12][0]));
    fulladder fa12_01 (.a(w[12][0]), .b(p[3][6]), .cin(c[11][0]), .s(w[12][1]), .cout(c[12][1]));
    halfadder ha12_02 (.a(w[12][1]), .b(c[11][1]), .s(w[12][2]), .cout(c[12][2]));

    // Weight 2^13
    fulladder fa13_00 (.a(w[12][0]), .b(p[3][7]), .cin(c[12][0]), .s(w[13][0]), .cout(c[13][0]));
    halfadder ha13_01 (.a(w[13][0]), .b(c[12][1]), .s(w[13][1]), .cout(c[13][1]));

    // Weight 2^14
    fulladder fa14_00 (.a(w[12][0]), .b(p[3][8]), .cin(c[12][0]), .s(w[14][0]), .cout(c[14][0]));
    halfadder ha14_01 (.a(w[14][0]), .b(c[13][0]), .s(w[14][1]), .cout(c[14][1]));

    // Weight 2^15
    assign w[15][0] = w[14][0] ^ c[14][0];

    // Use two chained CLA adders to perform the final addition
    carrylookaheadadd8 cla0 (.a({w[7][2], w[6][2], w[5][1], w[4][1], w[3][0], w[2][0], p[0][1], p[0][0]}),
                             .b({c[6][2], c[5][1], c[4][1], c[3][0], c[2][0], b[3], 1'b0, b[1]}),
                             .cin(1'b0), .s(result[7:0]), .cout(cla0_cout));
    carrylookaheadadd8 cla1 (.a({w[15][0], w[14][1], w[13][1], w[12][2], w[11][2], w[10][2], w[9][2], w[8][2]}),
                             .b({c[14][1], c[13][1], c[12][2], c[11][2], c[10][2], c[9][2], c[8][2], c[7][2]}),
                             .cin(cla0_cout), .s(result[15:8]));

endmodule
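As with the array multiplier, the signed Wallace tree multiplier can be compared against the behavioural signed product in simulation. The check below is an illustrative sketch only; the module name and iteration count are arbitrary.

// Illustrative sketch only: compare the signed Wallace tree multiplier
// against the behavioural signed product for random operand pairs.
module wallacetreemultiplier8_check;
    reg  [7:0]  a, b;
    wire [15:0] result;
    integer i;

    wallacetreemultiplier8 dut (.a(a), .b(b), .result(result));

    initial begin
        for (i = 0; i < 100; i = i + 1) begin
            a = $random;
            b = $random;
            #1;
            if ($signed(result) !== $signed(a) * $signed(b))
                $display("Mismatch: %d * %d gave %d", $signed(a), $signed(b), $signed(result));
        end
        $display("Wallace tree multiplier check complete");
        $finish;
    end
endmodule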
A.2.4. Dadda Tree Multiplier
// 8-bit signed Dadda Tree multiplier
module daddatreemultiplier8(input wire [7:0] a,
                            input wire [7:0] b,
                            output wire [15:0] result);

    wire [8:0] p [3:0];
    wire [2:0] c [15:2];
    wire [2:0] w [15:2];
    wire cla0_cout;

    // Use the modified Booth encoder to generate the partial products
    boothencoder8 pp (.a(a), .b(b), .p00(p[0]), .p01(p[1]), .p02(p[2]), .p03(p[3]));

    // Sum the partial products by wire weight until only two wires are left for each weight

    // Weight 2^2
    halfadder ha02_02 (.a(p[0][2]), .b(p[1][0]), .s(w[2][0]), .cout(c[2][0]));

    // Weight 2^3
    halfadder ha03_02 (.a(p[0][3]), .b(p[1][1]), .s(w[3][0]), .cout(c[3][0]));

    // Weight 2^4
    halfadder ha04_01 (.a(p[0][4]), .b(p[1][2]), .s(w[4][0]), .cout(c[4][0]));
    fulladder fa04_02 (.a(p[2][0]), .b(b[5]), .cin(c[3][0]), .s(w[4][1]), .cout(c[4][1]));

    // Weight 2^5
    halfadder ha05_01 (.a(p[0][5]), .b(p[1][3]), .s(w[5][0]), .cout(c[5][0]));
    fulladder fa05_02 (.a(p[2][1]), .b(c[4][0]), .cin(c[4][1]), .s(w[5][1]), .cout(c[5][1]));

    // Weight 2^6
    halfadder ha06_00 (.a(p[0][6]), .b(p[1][4]), .s(w[6][0]), .cout(c[6][0]));
    fulladder fa06_01 (.a(p[2][2]), .b(p[3][0]), .cin(b[7]), .s(w[6][1]), .cout(c[6][1]));
    fulladder fa06_02 (.a(w[6][0]), .b(c[5][0]), .cin(c[5][1]), .s(w[6][2]), .cout(c[6][2]));

    // Weight 2^7
    fulladder fa07_00 (.a(p[0][7]), .b(p[1][5]), .cin(p[2][3]), .s(w[7][0]), .cout(c[7][0]));
    halfadder ha07_01 (.a(p[3][1]), .b(c[6][0]), .s(w[7][1]), .cout(c[7][1]));
    fulladder fa07_02 (.a(w[7][0]), .b(c[6][1]), .cin(c[6][2]), .s(w[7][2]), .cout(c[7][2]));

    // Weight 2^8
    halfadder ha08_00 (.a(p[0][8]), .b(p[1][6]), .s(w[8][0]), .cout(c[8][0]));
    fulladder fa08_01 (.a(p[2][4]), .b(p[3][2]), .cin(c[7][0]), .s(w[8][1]), .cout(c[8][1]));
    fulladder fa08_02 (.a(w[8][0]), .b(c[7][1]), .cin(c[7][2]), .s(w[8][2]), .cout(c[8][2]));
59
report
report
report
report
report
report
report
report
report
report
report
report
report
report
report

More Related Content

What's hot

Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...ssuserd6b1fd
 
Embedded linux barco-20121001
Embedded linux barco-20121001Embedded linux barco-20121001
Embedded linux barco-20121001Marc Leeman
 
Perl &lt;b>5 Tutorial&lt;/b>, First Edition
Perl &lt;b>5 Tutorial&lt;/b>, First EditionPerl &lt;b>5 Tutorial&lt;/b>, First Edition
Perl &lt;b>5 Tutorial&lt;/b>, First Editiontutorialsruby
 
Ref arch for ve sg248155
Ref arch for ve sg248155Ref arch for ve sg248155
Ref arch for ve sg248155Accenture
 
Pratical mpi programming
Pratical mpi programmingPratical mpi programming
Pratical mpi programmingunifesptk
 
C++ annotations version
C++ annotations versionC++ annotations version
C++ annotations versionPL Sharma
 
Notes of 8085 micro processor Programming for BCA, MCA, MSC (CS), MSC (IT) &...
Notes of 8085 micro processor Programming  for BCA, MCA, MSC (CS), MSC (IT) &...Notes of 8085 micro processor Programming  for BCA, MCA, MSC (CS), MSC (IT) &...
Notes of 8085 micro processor Programming for BCA, MCA, MSC (CS), MSC (IT) &...ssuserd6b1fd
 
Introduction to Arduino
Introduction to ArduinoIntroduction to Arduino
Introduction to ArduinoRimsky Cheng
 
Mongo db replication guide
Mongo db replication guideMongo db replication guide
Mongo db replication guideDeysi Gmarra
 
Reverse engineering for_beginners-en
Reverse engineering for_beginners-enReverse engineering for_beginners-en
Reverse engineering for_beginners-enAndri Yabu
 
Introduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - InglêsIntroduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - InglêsMarcelo Negreiros
 
Algorithms for programmers ideas and source code
Algorithms for programmers ideas and source code Algorithms for programmers ideas and source code
Algorithms for programmers ideas and source code Duy Phan
 

What's hot (20)

thesis
thesisthesis
thesis
 
Bash
BashBash
Bash
 
zend framework 2
zend framework 2zend framework 2
zend framework 2
 
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
Notes for C Programming for MCA, BCA, B. Tech CSE, ECE and MSC (CS) 1 of 5 by...
 
Javanotes6 linked
Javanotes6 linkedJavanotes6 linked
Javanotes6 linked
 
Embedded linux barco-20121001
Embedded linux barco-20121001Embedded linux barco-20121001
Embedded linux barco-20121001
 
10.1.1.652.4894
10.1.1.652.489410.1.1.652.4894
10.1.1.652.4894
 
Perl &lt;b>5 Tutorial&lt;/b>, First Edition
Perl &lt;b>5 Tutorial&lt;/b>, First EditionPerl &lt;b>5 Tutorial&lt;/b>, First Edition
Perl &lt;b>5 Tutorial&lt;/b>, First Edition
 
Ref arch for ve sg248155
Ref arch for ve sg248155Ref arch for ve sg248155
Ref arch for ve sg248155
 
Pratical mpi programming
Pratical mpi programmingPratical mpi programming
Pratical mpi programming
 
Mat power manual
Mat power manualMat power manual
Mat power manual
 
c programming
c programmingc programming
c programming
 
C++ annotations version
C++ annotations versionC++ annotations version
C++ annotations version
 
Notes of 8085 micro processor Programming for BCA, MCA, MSC (CS), MSC (IT) &...
Notes of 8085 micro processor Programming  for BCA, MCA, MSC (CS), MSC (IT) &...Notes of 8085 micro processor Programming  for BCA, MCA, MSC (CS), MSC (IT) &...
Notes of 8085 micro processor Programming for BCA, MCA, MSC (CS), MSC (IT) &...
 
Akka java
Akka javaAkka java
Akka java
 
Introduction to Arduino
Introduction to ArduinoIntroduction to Arduino
Introduction to Arduino
 
Mongo db replication guide
Mongo db replication guideMongo db replication guide
Mongo db replication guide
 
Reverse engineering for_beginners-en
Reverse engineering for_beginners-enReverse engineering for_beginners-en
Reverse engineering for_beginners-en
 
Introduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - InglêsIntroduction to Programming Using Java v. 7 - David J Eck - Inglês
Introduction to Programming Using Java v. 7 - David J Eck - Inglês
 
Algorithms for programmers ideas and source code
Algorithms for programmers ideas and source code Algorithms for programmers ideas and source code
Algorithms for programmers ideas and source code
 

Similar to report (20)

Programming
ProgrammingProgramming
Programming
 
eclipse.pdf
eclipse.pdfeclipse.pdf
eclipse.pdf
 
test6
test6test6
test6
 
MS_Thesis
MS_ThesisMS_Thesis
MS_Thesis
 
Javanotes5 linked
Javanotes5 linkedJavanotes5 linked
Javanotes5 linked
 
C++ For Quantitative Finance
C++ For Quantitative FinanceC++ For Quantitative Finance
C++ For Quantitative Finance
 
Francois fleuret -_c++_lecture_notes
Francois fleuret -_c++_lecture_notesFrancois fleuret -_c++_lecture_notes
Francois fleuret -_c++_lecture_notes
 
Perltut
PerltutPerltut
Perltut
 
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
Explorations in Parallel Distributed Processing: A Handbook of Models, Progra...
 
Thesis_Report
Thesis_ReportThesis_Report
Thesis_Report
 
javanotes5.pdf
javanotes5.pdfjavanotes5.pdf
javanotes5.pdf
 
Aidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_ReportAidan_O_Mahony_Project_Report
Aidan_O_Mahony_Project_Report
 
Master_Thesis
Master_ThesisMaster_Thesis
Master_Thesis
 
Jmetal4.5.user manual
Jmetal4.5.user manualJmetal4.5.user manual
Jmetal4.5.user manual
 
Fraser_William
Fraser_WilliamFraser_William
Fraser_William
 
The maxima book
The maxima bookThe maxima book
The maxima book
 
java web_programming
java web_programmingjava web_programming
java web_programming
 
Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016
 
Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016
 
Data over dab
Data over dabData over dab
Data over dab
 

report

  • 1. University of Manchester School of Computer Science Project Report 2014 Design and implementation of arithmetic units with Xilinx FPGA Author: Prakhar Bahuguna Supervisor: Dr. Vasilis Pavlidis
  • 2. Design and implementation of arithmetic units with Xilinx FPGA Author: Prakhar Bahuguna The aim of the project is to design, implement, test and profile various different arithmetic units, such as adders and multipliers on an FPGA platform. These are algorithms for effi- ciently performing arithmetic in hardware that are widely used in various different applic- ations. This project strives to detail the various types of such arithmetic units, providing example designs and implementations for each and critically evaluating the merits and is- sues with each design. The designs will then be implemented on an Xilinx Virtex-7 FPGA development board upon which real-world performance, logic area and power consumption can be measured and analysed. Supervisor: Dr. Vasilis Pavlidis
  • 3. Contents 1. Introduction 7 2. Background 8 2.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1. Half Adder and Full Adder . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.2. Ripple-carry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.3. Carry-lookahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.4. Carry-select . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1.5. Carry-skip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2. Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1. Array Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.2. Modified Booth Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.3. Wallace Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.4. Dadda Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3. Design 23 3.1. Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2. Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4. Simulation 26 4.1. Testing Methodology For 8-bit units . . . . . . . . . . . . . . . . . . . . . . . 27 4.2. Testing Methodology For Larger Units . . . . . . . . . . . . . . . . . . . . . 27 5. Implementation 29 5.1. The Synthesis Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.2. Synthesis Reports and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.2.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.2.2. Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 6. Conclusion 38 3
  • 4. Contents Contents A. Unit Source Code 41 A.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 A.1.1. Half Adder and Full Adder . . . . . . . . . . . . . . . . . . . . . . . . 41 A.1.2. Ripple-carry Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 A.1.3. Carry-lookahead Adder . . . . . . . . . . . . . . . . . . . . . . . . . 43 A.1.4. Carry-select Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 A.1.5. Carry-skip Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 A.2. Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 A.2.1. Array Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 A.2.2. Modified Booth Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 52 A.2.3. Wallace Tree Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . 54 A.2.4. Dadda Tree Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . 58 B. Testbench Source Code 62 B.1. 8-bit Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 B.1.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 B.1.2. Unsigned Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 B.1.3. Signed Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 B.2. Testdata Generation Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 B.3. 32-bit Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 B.3.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 B.3.2. Unsigned Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 B.3.3. Signed Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 B.4. 64-bit Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 B.4.1. Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 B.4.2. Unsigned Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 B.4.3. Signed Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4
  • 5. List of Figures 2.1. Half Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2. Full Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3. Ripple-Carry Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4. Propagate-Generate Full Adder . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5. Carry-lookahead Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.6. Carry-select Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.7. Carry-skip Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.8. Modified Full Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.9. Array Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.10. Wallace Tree Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.11. Dadda Tree Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.1. Simulation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5.1. Area Usage For 8-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.2. Area Usage For 32-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.3. Area Usage For 64-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.4. Delay For 8-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.5. Delay For 32-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.6. Delay For 64-bit Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.7. Area Usage For 8-bit Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.8. Delay For 8-bit Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.9. Area Usage For Array Multipliers . . . . . . . . . . . . . . . . . . . . . . . . 37 5.10. Delay For Array Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5
  • 6. List of Tables 2.1. Basic Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2. Long Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3. MBE Truth Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4. MBE Partial Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 6
  • 7. 1. Introduction Arithmetic units are hardware circuits that are designed to perform some type of arithmetic operation on binary numbers. They are typically integrated into some form of processor, which rely on the arithmetic units for much of their operation. This project aims to design, simulate, implement and evaluate the various different types of arithmetic units using an electronic design workflow in conjunction with an FPGA development platform. Each unit will be benchmarked for characteristics such as area usage, delay and power consumption that are important considerations in hardware design. This project in particular will focus on adders and multipliers. Addition and multiplication are very common operations that are utilised heavily by even the most basic of processors, both for internal operations and for processing data inputs. Both arithmetic operations have a variety of designs that can efficiently perform arithmetic in hardware, and each design has its own trade-offs with regard to the key characteristics. The project strives to detail these various types of arithmetic units, with example designs and implementations for each. The merits and issues of each design will then be critically evaluated and compared with each other. The next chapter will give a complete overview of the arithmetic units that will be im- plemented in the course of this project. This includes the approach taken by the design and the logic behind its functioning, its gate/block-level schematic and an estimate of number of gates required as well as the critical path delay. The design chapter will then cover the details of designing the units and the considerations that need to be taken into account for their development. The units will also need to be simulated and verified to ensure correct- ness, preferably in an automated manner that can provide guarantees of correct operation. This important stage is addressed by the simulation chapter. Finally, the designs will then be implemented in hardware, targeting an Xilinx Virtex-7 FPGA development board. Finally, the implementation chapter will cover the synthesis process and evaluate the characterist- ics of the implemented hardware units, comparing and contrasting the different varieties of adders and multipliers with each other. 7
  • 8. 2. Background Arithmetic units as used in computers are digital circuits that perform elementary arithmetic operations on numbers. They are based on binary arithmetic, taking operands as input and giving a result as output, both as binary numbers. Typically, these operations are simple mathematical arithmetic such as addition, subtraction, multiplication and division, though more complex units may implement more complicated mathematical operations. Arithmetic units can be classified into the two main groups of integer units and floating- point units. Integer units operate exclusively with integers and typically implement opera- tions such as addition and multiplication where the result for two integer operands is always an integer. As decimals do not need to be considered, integer units are typically smaller, faster and require lower power and less area. Floating-point units operate with numbers that have a floating decimal point. Like integer units, they can also perform additions, sub- tractions and multiplications but with floating-point operands. However, they are also able to perform operations such as division, exponentiation and square root calculation that often give a floating-point result even for integer operands. As the logic to handle floating-point numbers is more complex, floating-point units are usually larger and more power-hungry. Typically, arithmetic units are implemented in hardware within Application-Specific In- tegrated Circuits (ASIC). They are usually grouped together to form a complete Arithmetic Logic Unit (ALU) of a microprocessor, enabling the processor to perform useful work.1 ALUs are also found in different types of processors such as Graphics Processing Units (GPUs), which use a large number of complex ALUs to perform complex graphics calculations in parallel.2 Digital Signal Processors (DSPs) are largely based around their ALUs to process a continuous data stream in a pipelined fashion.3 In this project, the arithmetic units will be implemented in an FPGA as an ASIC implementation is simply too costly and time con- suming to consider for this use and is irrelevant to the learning objectives of this project. However, the same principles of hardware design still apply and the underlying difference in implementation can be abstracted away for the purposes of this project. The rest of this chapter will provide a complete overview of the arithmetic units that will be 1 Terms, ALU (Arithmetic Logic Unit) Definition. 2 Nvidia, What is GPU Accelerated Computing? 3 Yovits, Advances in Computers. 8
• 9. implemented in this project. A variety of adders, followed by multipliers, will be introduced along with their design details. In addition, each will have an estimate of delay and area.

2.1. Adders

Addition is one of the most basic mathematical operations needed in modern computer systems. It is performed bitwise on two binary operands in a similar fashion to traditional base-10 addition. The least significant bits are first added together; in binary, this can be performed by XORing the bits together. If the result is greater than 1, the carry is passed to the next set of significant bits and incorporated into the addition. This addition process is repeated to the left up to the most significant bit, as shown in Table 2.1.

    Carry:      1 1
            1 0 0 1
          + 0 0 1 1
            1 1 0 0

Table 2.1.: The basic addition process.

The method just described is the most straightforward and natural way for a human to add two binary numbers together. However, in computer hardware, there are multiple approaches to solve the problem of adding two numbers together efficiently, and each presents its own set of advantages and disadvantages.

The primary concern in digital design is the critical path. This is the path in the circuit which has the longest delay between the input being fed to the unit and the correct output being obtained from it. As the delay in the critical path defines the maximum speed at which the adder can operate, minimising this delay is crucial to improve performance. Other concerns in adder design are the power consumption of the circuit and the area needed to implement the circuit, which depends directly on the number of logic gates that are used. These are usually at odds with minimising the critical delay as more sophisticated logic demands higher power consumption and more logic gates. Given this situation, there are various designs that result in different trade-offs between these two goals, depending on the requirements for the hardware being developed. The pros and cons of each design are analysed and evaluated in the following subsections.

2.1.1. Half Adder and Full Adder

The most basic building blocks of any adder are the 1-bit half adder and full adder. A half adder simply takes two operands A and B. It calculates the sum S by XORing A and B together,
• 10. denoted as S = A ⊕ B. A carry output cout can also be evaluated by ANDing the two operands together as cout = A · B.4 This is shown in Figure 2.1. Thus, the half adder is sufficient for calculating the sum of two 1-bit operands.

Figure 2.1.: A half adder with two operands A and B.

However, it is not typical to be adding 1-bit numbers together. Often, several bits need to be added, with the carry of the previous column needed to correctly compute the result of the current column. The full adder is a complete 1-bit adder, including a carry-in that allows it to be chained to previous bits to compute an n-bit sum.5 It calculates S = A ⊕ B ⊕ cin and cout = (A · B) + (cin · (A ⊕ B)), and an implementation can be produced by chaining two half adders together as demonstrated by Figure 2.2.

Figure 2.2.: A full adder with two operands and a carry-in generated from two half adders.

2.1.2. Ripple-carry

The ripple-carry adder (RCA) is the simplest possible type of n-bit adder. The RCA utilises a chain of full adders connected in series with the carry-outs of each full adder feeding into

4 Vahid, Digital Design.
5 Ibid.
• 11. the carry-in of the next, as illustrated by Figure 2.3. It is named as such because the carry from the rightmost column 'ripples' through to the left column in a sequential fashion.6 As this adder uses n full adders with five logic gates each, it only requires 5n logic gates.

Figure 2.3.: A ripple-carry adder resulting from chaining multiple full adders.

The main issue with the ripple-carry adder is that the nature of the design means that the final result is not known until the carry has propagated all the way to the leftmost column. This situation gives a long critical path between A0/B0 and cout which makes the adder slow to calculate the result. The delay for this path is three gate delays for each full adder, with a total delay of 3n gate delays (assuming that every gate along this path has a similar delay). Clearly, an alternative design to add two operands needs to be developed.

2.1.3. Carry-lookahead

The carry-lookahead adder (CLA) attempts to avoid the slow carry ripple of the ripple-carry adder by predicting ahead of time what the carry from the previous column is likely to be. It does this by replacing the carry-out signal from the full adders with two signals: P (propagate) and G (generate). These signals are based on whether each full adder will propagate a carry-in of 1 to its carry-out, or will generate a carry itself. A full adder will propagate a carry-in when either A or B or both are 1, since a carry-in of 1 will then produce a carry-out of 1. Hence, we can set P = A + B. A full adder will generate a carry if both A and B are 1, regardless of the value of cin, as A + B will be greater than one. The generate signal can be set to G = A · B.7 The full adder can be modified using these results to create a propagate-generate full adder as shown in Figure 2.4.

The propagate and generate signals from prior columns can now be used to look ahead and evaluate the expected carry-in for each full adder. Suppose the modified full adders are assembled as below with some lookahead logic to determine cin for each adder, as in Figure 2.5.

6 Vahid, Digital Design.
7 Tohoku University, Hardware Algorithms For Arithmetic Modules.
• 12.

Figure 2.4.: A full-adder with propagate and generate signals instead of a carry-out.

Figure 2.5.: Block diagram of a carry-lookahead adder. The carry-in for each full adder is evaluated from the lookahead logic rather than the previous full adder.

We know that c1 will definitely be 1 if G0 is 1 as the first column will definitely generate a carry-out regardless of the value of c0. If G0 is 0, the only other way that c1 will be 1 is if the previous adder propagates a carry-in. This propagation will occur when P0 is 1, so c1 will be 1 if both P0 and c0 are 1. This logic can therefore be formulated as c1 = G0 + P0 · c0 and is easily implemented with one OR gate and one AND gate.
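Writing this recurrence out for a short slice of the adder makes the structure of the lookahead logic clear. The following minimal Verilog sketch (module and signal names are illustrative and not taken from the project source; the full 8-bit version used in this project is listed in Appendix A.1.3) produces the carries of a 4-bit slice directly from its propagate and generate signals:

    // Illustrative 4-bit carry-lookahead slice: each carry is produced
    // directly from the P/G signals rather than rippling from bit to bit.
    module cla_slice4 (
        input  wire [3:0] p,   // propagate signals, p[i] = a[i] | b[i]
        input  wire [3:0] g,   // generate signals,  g[i] = a[i] & b[i]
        input  wire       c0,  // carry into the slice
        output wire [4:1] c    // carry into each subsequent bit position
    );
        assign c[1] = g[0] | (p[0] & c0);
        assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
        assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                           | (p[2] & p[1] & p[0] & c0);
        assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                           | (p[3] & p[2] & p[1] & g[0])
                           | (p[3] & p[2] & p[1] & p[0] & c0);
    endmodule

Each expression depends only on the operand bits and c0, which is why the lookahead carries settle after a fixed number of gate delays irrespective of the adder width.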
• 13. This carry-lookahead logic can now be generalised to all full adders as ci+1 = Gi + Pi · ci.8 Recursive substitution and expansion of ci then gives an expression for every carry-in ci+1, as described in Equation 2.1.

    c1 = G0 + P0 · c0
    c2 = G1 + P1 · c1 = G1 + P1 · (G0 + P0 · c0) = G1 + P1 · G0 + P1 · P0 · c0
    c3 = G2 + P2 · c2 = G2 + P2 · (G1 + P1 · G0 + P1 · P0 · c0)
       = G2 + P2 · G1 + P2 · P1 · G0 + P2 · P1 · P0 · c0
    ...
    cn+1 = Gn + Pn · Gn−1 + Pn · Pn−1 · Gn−2 + . . . + Pn · Pn−1 · . . . · P0 · c0          (2.1)

The critical path has now been shortened significantly as it now runs between Ai/Bi and Si. After one gate delay, Pi and Gi are evaluated. The parallel nature of the lookahead logic requires only two gate delays. Finally, the addition requires two gate delays, so the overall critical delay is just five gate delays, regardless of the number of bits in the adder.

The main downside is that the carry-lookahead adder needs significantly more logic gates for the lookahead logic. While c1 only needs two gates to compute, c2 requires three gates, c3 requires four gates and so on, with the gate count growing linearly for each successive carry. The lookahead logic in total therefore needs n(n − 1)/2 gates. When combined with the full adders, a complete n-bit carry-lookahead adder requires n(n + 7)/2 gates. Clearly, an adder design with a quadratic gate count will not scale well to larger sizes.

2.1.4. Carry-select

A carry-select adder (CSLA) attempts to provide a compromise between the simplicity of a ripple-carry adder and the speed of a carry-lookahead adder. It uses a chain of ripple-carry adders of a certain width (usually 4-bit), but for each subsequent block after the first, two adders are placed which simultaneously calculate the sum of the 4-bit operands. One adder assumes a carry-in of 0, the other assumes a carry-in of 1, allowing for the two possible results to be precomputed. The correct result is then selected by a 2:1 mux using the carry-out of the previous block.9 The complete layout is given by Figure 2.6.

The critical path still runs from A0/B0 to cout, but the delay is much shorter as each block is computed simultaneously. Assuming that each mux has a gate delay of 3, the overall gate delay of a carry-select adder with 4-bit ripple-carry blocks is 12 + 3(n/4 − 1) = 3n/4 + 9.

8 Vladutiu, Computer Arithmetic.
9 Ibid.
• 14.

Figure 2.6.: Diagram of the first section of a carry-select adder. Each block after the initial block has two ripple-carry adders to compute the two possible results simultaneously.

The increase in delay is still linear but is significantly smaller as compared to a ripple-carry adder. More logic gates are required than a ripple-carry adder, but the design does not suffer from the quadratic increase that the carry-lookahead adder has, making the carry-select adder far more scalable.

2.1.5. Carry-skip

The carry-skip adder (CSKA) is a variation on the carry-select. It avoids waiting for the carry to ripple through the previous block whenever it can conclusively determine that the previous block will simply propagate its carry-in onwards. In a similar fashion to carry-lookahead, this can be determined by evaluating P for the entire block, which is true if Pi is true for every bit i in the block.10 The overall expression for P for a 4-bit block is therefore given in Equation 2.2.

10 Vladutiu, Computer Arithmetic.
• 15.

    P = P0 · P1 · P2 · P3 = (A0 + B0) · (A1 + B1) · (A2 + B2) · (A3 + B3).          (2.2)

The carry-in cini for block i can now be evaluated faster. Assume that both the carry-out couti−2 of block i − 2 and Pi−1 from block i − 1 have been evaluated. If Pi−1 is true, we can simply pass the value of couti−2 to cini, as the block will simply propagate it. Otherwise, we have to wait for couti−1 to be evaluated, as its value may differ from couti−2. This can be summed up by Equation 2.3

    cini = couti−1 + Pi−1 · couti−2          (2.3)

and the entire schematic is depicted in Figure 2.7.

Figure 2.7.: Block diagram of a carry-skip adder. If the current block propagates its carry-in, the carry-in is used directly for the next block.

2.2. Multipliers

Multiplication is another common operation that is found in the ALUs of most modern processors. As for all electronic logic circuits, an arithmetic multiplier operates on two binary operands, referred to as the multiplicand and the multiplier. A set of partial products is computed from each bit of the multiplier, much in the same way that a human performs long multiplication by hand but in binary. Each partial product is zero if its corresponding multiplier bit is zero, and equal to the multiplicand shifted left by the appropriate number of positions if the multiplier bit is one. These partial products are then summed with multiple adders to compute the final product. This long multiplication method is illustrated in Table 2.2.
• 16.

              1 0 0 1
            × 0 0 1 1
              1 0 0 1
            1 0 0 1 0
          0 0 0 0 0 0
        + 0 0 0 0 0 0 0
              1 1 0 1 1

Table 2.2.: The long multiplication method for binary operands.

Generating the partial products for a multiplication calculation of a × b is extremely easy. Each partial product pi is evaluated as pi = a · bi, shifted left by i for each bit i in the multiplier b. The difficulty arises in summing, or reducing, the partial products to compute the final product in an efficient manner.

As with adders, there is a large variety of designs which can compute this sum of partial products. Each has its own trade-offs between delay and area/power consumption depending on the requirements of the hardware being developed. As there are a large number of possible multiplier designs, three common designs will be analysed and evaluated in this report, namely the array multiplier, the Wallace tree multiplier and the Dadda tree multiplier.

2.2.1. Array Multiplier

Much like the ripple-carry adder, the array multiplier is the most straightforward implementation of an n-bit multiplier. It uses an array of modified full adders arranged in an n × n grid to evaluate the result, with the carries rippling diagonally through the grid and the sum-outs rippling down. The ith column of the grid corresponds to the ith bit of the final product and the jth row corresponds to the jth partial product, which is generated by the jth bit of the multiplier. A final row of regular full adders is then used to sum the remaining carry-outs to compute the upper bits of the final product.11 A schematic of this modified adder is given in Figure 2.8 and the complete array multiplier in Figure 2.9.

As with the RCA, the longest critical path of the array multiplier is easy to visualise. It runs from the least significant bits of the operands in the top-right, diagonally through the carry-outs to the most significant bit of the final product in the bottom-left. Hence, it traverses n modified full adders (which have a gate delay of four), one half adder (gate delay of one) and n − 1 full adders (gate delay of three). The delay of the array multiplier is thus 4n + 1 + 3(n − 1) = 7n − 2. This is significantly faster than performing repeated addition to compute the

11 Vahid, Digital Design.
• 17.

Figure 2.8.: An array multiplier full adder module, modified to compute the product of two bits and sum this with the previous partial product.

Figure 2.9.: An array multiplier, showing the grid structure of the full adder modules to compute the partial products and sum them.

multiplication, which will necessarily have gate delays far larger than this. However, the most apparent problem, as visible in the schematic for the array multiplier, is the large amount of logic required to perform the multiplication. An n-bit array multiplier requires n × n modified adders, so that the logic required scales by n². An 8-bit multiplier
• 18. requires 64 full adders while a 32-bit multiplier will need 1024. Clearly, the array multiplier does not scale efficiently for practical applications where 32-bit or 64-bit width multipliers would be needed, as the power and area requirements are too large.

Another problem with the array multiplier is its inability to deal with signed integers. With addition, the addition process inherently gives the correct answer whether the value is unsigned or signed. These problems are addressed by the Wallace tree and Dadda tree multipliers, in conjunction with a Modified Booth Encoder. The latter will be discussed first as it forms a logic sub-block of both tree-based multipliers.

2.2.2. Modified Booth Encoder

The Modified Booth Encoder (MBE) serves two important purposes for more sophisticated multiplier designs. Firstly, it allows the multiplier to correctly handle signed integers as part of the partial product reduction process without any additional sign-extension logic. Secondly, it reduces the number of partial products that need to be computed by half. This is accomplished by re-encoding the partial products according to the patterns of repeated 1s and 0s in the bits of the multiplier. For instance, a multiplication involving 4-bit integers such as a × 0011 would normally give the partial product sum of a + 2a + 0 + 0. This can be re-written as −a + 4a. Similarly, a × 0111 would normally require the calculation of a + 2a + 4a + 0. This can be formulated as −a + 8a. Hence, the number of partial products has been reduced from four to two.

This encoding is accomplished by first padding the bits of the multiplier with a zero to the right of the least significant bit (LSB). If the multiplier has an odd number of bits, two additional zeros are padded to the most significant bit, otherwise no additional padding is necessary. The bits of the padded multiplier are then grouped into overlapping groups of three, as illustrated in Equation 2.4.12

    87 = 01010111
       ⇒ 010101110 (with padding)
       ⇒ 010 | 010 | 011 | 110          (2.4)
         Bit Group 4 | Bit Group 3 | Bit Group 2 | Bit Group 1

Each of these bit groups corresponds to a partial product that will be generated by the MBE. The value of each partial product is determined by the truth table in Table 2.3. In this table, ∼a means invert all the bits of a, and a ≪ 1 means shift a left by one position. Given two 8-bit operands a and b, the partial products from an MBE can then be summed

12 Saharan and Kaur, 'Design and Implementation of an Efficient Modified Booth Multiplier using VHDL'.
• 19.

    Bit Value | Operation | Partial Product
       000    |   0 × a   | 0...0
       001    |   1 × a   | a
       010    |   1 × a   | a
       011    |   2 × a   | a ≪ 1
       100    |  −2 × a   | (∼a ≪ 1) + 2
       101    |  −1 × a   | ∼a + 1
       110    |  −1 × a   | ∼a + 1
       111    |   0 × a   | 1...1 + 1

Table 2.3.: Truth table for the MBE partial products.

by the addition logic of the multiplier as shown in Table 2.4.13 The outcome of utilising the MBE is that only four partial products need to be summed instead of eight, saving on the logic required for the multiplier.

                                            a7  a6  a5  a4  a3  a2  a1  a0
                                          × b7  b6  b5  b4  b3  b2  b1  b0
                            p07 p06 p05 p04 p03 p02 p01 p00
                p17 p16 p15 p14 p13 p12 p11 p10          b1
        p27 p26 p25 p24 p23 p22 p21 p20          b3
    p37 p36 p35 p34 p33 p32 p31 p30          b5
      +                          b7

Table 2.4.: The partial products and addition tree generated by an MBE.

2.2.3. Wallace Tree

The Wallace Tree multiplier takes the partial products of a multiplication and groups the constituent bits according to their weight. The weight of a particular bit depends on its position; for instance, the least significant bit has weight 2⁰ = 1 while bit 3 has weight 2³ = 8. These bits are then reduced by layers of half adders and full adders in a tree structure to compute the final product from the partial products.14

13 Punnaiah et al., 'Design and Evaluation of High Performance Multiplier Using Modified Booth Encoding Algorithm'.
14 Vladutiu, Computer Arithmetic.
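To make the recoding concrete, the sketch below decodes a single three-bit group into one partial product according to Table 2.3. It is only an illustrative fragment, not part of the project's source: the module and signal names are invented here, the partial product is widened to ten bits so that ±2a always fits, and the '+1' correction needed for the negative encodings is produced as a separate neg output so that it can be added in a later column of the addition tree, as in Table 2.4.

    // Illustrative radix-4 Booth decoder for one bit group (see Table 2.3).
    module booth_group_decode (
        input  wire [7:0] a,      // multiplicand
        input  wire [2:0] group,  // one overlapping 3-bit slice of the multiplier
        output reg  [9:0] pp,     // partial product before the +1 correction
        output reg        neg     // 1 when the +1 correction must be added later
    );
        always @(*) begin
            case (group)
                3'b000, 3'b111: begin pp = 10'd0;           neg = 1'b0; end //  0 x a
                3'b001, 3'b010: begin pp = {{2{a[7]}}, a};  neg = 1'b0; end // +1 x a
                3'b011:         begin pp = {a[7], a, 1'b0}; neg = 1'b0; end // +2 x a
                3'b100:         begin pp = ~{a[7], a, 1'b0}; neg = 1'b1; end // -2 x a
                3'b101, 3'b110: begin pp = ~{{2{a[7]}}, a};  neg = 1'b1; end // -1 x a
            endcase
        end
    endmodule

Table 2.3 writes the 111 case as all-ones plus one; the decoder above simply outputs zero, which is the same value. One such decoder would be instantiated for each of the n/2 bit groups, and the resulting rows are what the reduction tree then sums.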
• 20. The Wallace structure operates with multiple layers that reduce the number of bits with the same weight at each stage. The operation performed depends on the number of bits in the layer:

• One: Pass the bit down to the next layer.
• Two: Add the bits together with a half adder, passing the sum to the same weight in the next layer and the carry-out to the next weight in the next layer.
• Three or more: Add any three bits together with a full adder. Pass the sum and any remaining bits to the same weight in the next layer. Pass the carry-out to the next weight in the next layer.

Layers are added to the Wallace structure in this fashion until all weights have been reduced to one bit.15 These remaining bits form the final product of the multiplication. Figure 2.10 shows the structure of a Wallace tree multiplier, where each layer of adders reduces the partial products until the final product has been computed.

Figure 2.10.: A Wallace tree multiplier, showing the tree structure of layers of half and full adders.16

The primary advantage of the Wallace Tree is that it has a significantly smaller number of logic gates as compared to an array multiplier. The tree structure requires log n reduction layers with each layer containing at most 2n adders, so no more than 2n log n adders are required as opposed to the n² adders required for an array multiplier. The use of an MBE halves the number of partial products, so that the number of adders required is further reduced to

15 Bohsali and Doan, Rectangular Styled Wallace Tree Multipliers.
• 21. 2n log(n/2) in this instance. The MBE itself requires n/2 partial product logic blocks, with each logic block requiring approximately twelve gates to evaluate the partial product.

The disadvantage of the Wallace Tree is that, in contrast to an array multiplier, it has an irregular layout and wiring structure. This is because the higher weights have more bits to reduce and therefore require more wires and adders than the lower weights. These extra adders also require additional internal wiring to connect them all up correctly. This irregular routing and layout is particularly problematic for FPGAs, which are based around utilising a regular grid of lookup tables and logic blocks to implement their functionality. Hence, a fully synthesised Wallace Tree multiplier may require more logic blocks than would be expected.

2.2.4. Dadda Tree

A Dadda Tree multiplier operates in a very similar manner to a Wallace Tree. It receives a set of partial products as inputs, each consisting of bits of different weights, and sums these together using layers of adders to compute the final product. It differs from the Wallace Tree in terms of the structure of these adders, reducing the complexity of each reduction layer at the cost of using additional layers. This structure is illustrated by Figure 2.11. The reduction rules for the Dadda structure are as follows:17

• One: Pass the bit down to the next layer.
• Two: If all weights in the layer have two or fewer bits, then add the bits together with a half adder, passing the sum to the same weight in the next layer and the carry-out to the next weight in the next layer. Otherwise, pass the bits down to the next layer.
• Three or more: Add any three bits together with a full adder, ensuring that the total number of bits remains equal or close to a multiple of three. Pass the sum and any remaining bits to the same weight in the next layer. Pass the carry-out to the next weight in the next layer.

The Dadda Tree still gives a similar n log n scaling in logic area as the Wallace Tree multiplier due to its tree structure. The actual area used is slightly larger than that of the Wallace Tree as each reduction layer is less aggressive at summing the partial products, resulting in more layers being needed to compute the sum. One advantage of this slight trade-off in area is that the complexity of the wiring is reduced. This is useful for FPGA implementation as it is likely to synthesise with better layout and routing than a Wallace Tree.

17 EDA Cafe, Datapath Logic Cells.
  • 22. 2.2. MULTIPLIERS CHAPTER 2. BACKGROUND Figure 2.11.: A Dadda tree multiplier, showing a tree structure that is larger but more regular than a Wallace tree.18 22
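Both tree multipliers are built from the same primitive step: a full adder used as a 3:2 compressor, which turns three bits of equal weight into one bit of the same weight and one bit of the next weight up. The following minimal Verilog sketch (the module name and parameterisation are illustrative and not taken from the project code) applies that step across three rows at once; repeating such layers, under either the Wallace or the Dadda scheduling rules described above, is what reduces the partial products down to a final result.

    // One 3:2 compression layer: three same-width rows are reduced to a
    // sum row (same weights) and a carry row (shifted up by one weight).
    // The carry out of the top column is dropped, on the assumption that
    // the rows have already been padded to the full product width.
    module compress_3to2 #(parameter WIDTH = 16) (
        input  wire [WIDTH-1:0] row0,
        input  wire [WIDTH-1:0] row1,
        input  wire [WIDTH-1:0] row2,
        output wire [WIDTH-1:0] sum,
        output wire [WIDTH-1:0] carry
    );
        // Majority function of each column gives that column's carry bit.
        wire [WIDTH-1:0] maj = (row0 & row1) | (row0 & row2) | (row1 & row2);

        assign sum   = row0 ^ row1 ^ row2;        // full-adder sum, column by column
        assign carry = {maj[WIDTH-2:0], 1'b0};    // carries belong to the next weight
    endmodule

In a Wallace tree the layers apply this compression as aggressively as possible, whereas a Dadda tree defers some of it across additional layers, giving the larger but more regular structure shown in Figure 2.11.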
  • 23. 3. Design To develop the arithmetic units discussed in the previous chapter, it is necessary to specify their logic. This will be accomplished using Verilog, a hardware description language (HDL). Each arithmetic unit is written in this language to describe its inputs, outputs and the in- ternal logic of the unit to compute their outputs based on their inputs. These Verilog source code files can then be used by Electronic Design Automation (EDA) tools such as Cadence, Synopsys or Xilinx ISE to simulate and verify the logic, as well as synthesising it into a bit- stream suitable for configuring an FPGA to implement the logic. This chapter addresses the requirements and details of designing the arithmetic units. 3.1. Requirements An arithmetic unit such as an adder or a multiplier is typically used as a block within a larger unit, such as an ALU. These ALUs in turn are typically utilised by a processing unit, most commonly the CPU of a computer system. Thus, an arithmetic unit must satisfy the requirements of the encompassing unit. The first requirement is the width of the operands that will be utilised. A processor will typically use a common word width for its registers, memory addresses and data buses, hence the arithmetic units it relies on will need to match. Early processors were 8-bit, but most mod- ern processors such as ARM and Intel x86 now use word sizes of 32 bits and 64 bits. Hence, the arithmetic units in this project each have 8-bit, 32-bit and 64-bit variants. However, due to time constraints the Wallace tree and Dadda tree multipliers were only produced as 8-bit variants. The second requirement for arithmetic units is that the result should be computed within a specific deadline. Processors are based on clocked logic, with operations triggered on the rising edge of a clock cycle. For the processor to operate correctly, the result from the previ- ous operation needs to be computed and latched within a discrete number of clock periods. This imposes timing requirements on the arithmetic units and the delay plays a part in de- termining the maximum clock speed of the entire design. The worst-case delays for each arithmetic unit must therefore be analysed and evaluated as part of the development pro- 23
  • 24. 3.2. IMPLEMENTATION DETAILS CHAPTER 3. DESIGN cess. Finally, a significant consideration is the area used by the designs, and by extension their power consumption. In the context of an FPGA, area usage is determined by the number of logic slices and look-up tables (LUTs) that are used by the synthesised design. It is important to ensure that the FPGA has enough logic slices to load the bitstream for the entire design, so the designer of the arithmetic units must ensure enough area is left for the rest of the design. In addition, a larger design requires more power to operate as the additional gates draw more power upon switching. The power draw of the unit influences a device’s current requirements, thermal constraints and battery lifetime in the case of mobile devices. Hence, it is important to analyse the power consumption of the arithmetic units. 3.2. Implementation Details Since Verilog is a hardware description language, a source file simply describes the beha- vioural logic of the hardware and the state of its outputs given a set of inputs. The EDA toolchain is free to synthesise any gates and wires necessary to ensure the unit will behave according to the source file. This means that for the instance of an adder, it is perfectly valid to write s = a + b and synthesising this will give a correctly functioning adder by the tool- chain. However, since the actual implementation of this adder is completely determined by the synthesis tool, this approach is not useful for this project. Instead, to design the specific arithmetic units discussed in the previous chapter, it is ne- cessary to be explicit and specify the exact logic of each unit. The arithmetic units developed in this project will be designed at the level of basic logic gates such as NOT, AND, OR and XOR gates, with the wiring between ports specified explicitly. This ensures that the EDA toolchain will not attempt to generate its own optimised logic to substitute as an equivalent to the logic specified in the source file. This approach allows for the differences between the types of units to be clearly distinguished for further analysis. However, some optimisations that are difficult to avoid entirely occur during the trans- lation and mapping stages of synthesising a unit. For example, when synthesising an XOR gate, the EDA toolchain has a number of possibilities for configuring an LUT to implement this. In addition, the toolchain can be configured to optimise for particular design goals such as minimising area or delay. The exact algorithms and optimisations for doing this are pro- prietary and depend on both the synthesis tool and the capabilities and properties of the FPGA hardware being used. Since it is not possible to directly observe what exactly occurs at the synthesis level, it is best to hold this source of variability constant to ensure consistent results. For this project, the synthesis tool used will be XST, from the Xilinx ISE 14.5 pack- 24
  • 25. 3.2. IMPLEMENTATION DETAILS CHAPTER 3. DESIGN age. This will be used to synthesis designs targeted at the Xilinx Virtex-7 VC707 evaluation kit. The ISE projects will be configured for the default Balanced profile which aims to give a balance between compact area usage and short delays when synthesising the units. The Verilog source files for the 8-bit units are provided in the appendix for reference (the 32-bit and 64-bit units have been omitted due to space constraints, but are straightforward extensions of the basic logic). The next stage after designing and developing these units is performing logic simulation and testing to ensure they perform as expected and give the correct result for any set of inputs. This topic is covered in the next chapter. 25
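To illustrate the distinction drawn in Section 3.2 between letting the toolchain infer the arithmetic and fixing the structure explicitly, the two fragments below describe an adder in each style. They are illustrative sketches rather than excerpts from the project source; the project's actual gate-level units are listed in Appendix A.

    // Behavioural description: the synthesis tool is free to implement the
    // addition however it sees fit, typically using the FPGA's carry chain.
    module add8_behavioural (
        input  wire [7:0] a, b,
        input  wire       cin,
        output wire [7:0] s,
        output wire       cout
    );
        assign {cout, s} = a + b + cin;
    endmodule

    // Structural description: the individual gates and their wiring are
    // fixed by the designer, which is the style used for the units in this
    // project so that the different adder architectures can be compared.
    module add1_structural (
        input  wire a, b, cin,
        output wire s, cout
    );
        wire axb = a ^ b;

        assign s    = axb ^ cin;
        assign cout = (a & b) | (axb & cin);
    endmodule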
  • 26. 4. Simulation Logic simulation, also known as functional simulation, is a process by which software can be used to determine the behaviour of a digital circuit. Simulation is performed through the use of a stimulus testbench and the logic unit being tested, referred to as the device under test (DUT). The testbench unit sends test data to the DUT’s inputs and the outputs are observed and recorded in the simulation as a waveform trace. Additionally, the outputs can be compared to a set of expected results for each test case. This allows for a unit to be tested and verified for correct operation, as well as allowing the designer to find the source of any faults in the unit. Figure 4.1 shows an example of a multiplier unit being simulated with the ISim tool in Xilinx ISE, with the expected product being compared alongside the actual output value from the multiplier unit. One can observe that this unit is performing as expected, giving the correct product for every operand combination. Figure 4.1.: The simulation process for a Wallace tree multiplier in Xilinx ISim. 26
• 27. In the context of the arithmetic units in this project, simulation is essential to ensure that the correct result is computed for any set of operands. The unit should correctly handle small and large numbers, corner cases such as one operand set to one or zero, and negative numbers if appropriate. To achieve good testing coverage of the units, it is necessary to test using a large number of operand combinations. As performing this many tests manually would be tedious and impractical, it is desirable to automate the test, allowing for a quick pass/fail decision to be made for a unit.

4.1. Testing Methodology For 8-bit units

For the 8-bit arithmetic unit, it is feasible to test every single combination of operands and verify the result. This is referred to as exhaustive testing. This testing scenario is possible because in the case of an 8-bit multiplier, there are 2⁸ possible values for operand a and 2⁸ possible values for operand b. Hence there are 2⁸ × 2⁸ = 2¹⁶ = 65536 test cases to consider. This is a seemingly large number but it can be easily performed by a computer with an automated test.

The testbench for the 8-bit adders and multipliers simply uses two nested loops to loop through every possible value for both operands, checking that the output result is equal to that calculated in software. If a discrepancy occurs, it halts and logs the error, otherwise the testbench continues the simulation until the end. The code for this is given in the appendix.

4.2. Testing Methodology For Larger Units

The exhaustive testing approach, however, does not work in practice for 32-bit and 64-bit arithmetic units. This is because the number of test cases required scales exponentially as 2²ⁿ. An exhaustive 32-bit multiplier testbench requires 2⁶⁴ ≈ 1.8 × 10¹⁹ cases and a 64-bit unit requires 2¹²⁸ ≈ 3.4 × 10³⁸ cases. These are extremely large numbers and it simply isn't feasible to test every possibility in a reasonable amount of time.

An alternative approach is to test a large but feasible set of test cases, each with a randomly selected combination of operands. If the unit gives the correct result for all of these test cases and all paths through the unit have been tested at least once, it can be assumed with a high degree of confidence that it will give the correct result in all cases.

The approach taken in this project for testing the 32-bit and 64-bit units is to generate a set of test data for the testbench. This involves a Python script that selects two numbers at random (within the bit width constraints) using the system's random number generator. It computes their sum, unsigned product and signed product and appends the output as a
  • 28. 4.2. TESTING METHODOLOGY FOR LARGER UNITS CHAPTER 4. SIMULATION formatted string of hexadecimal numbers to a test data file. This process is repeated for one million cases. The process of generating random numbers occasionally generates duplicate operand pairs. These duplicate pairs are removed from the test data file using standard UNIX utilities, allowing for the script to be rerun to generate the remaining test cases until the test data contains one million unique pairs of operands. The test data file can then be used by the testbenches, which scan each line of the test data, set the input values according to the two operands and compare the output with the expected result in the data. If the output differs from the expected result, the test halts and logs the error, otherwise the test continues until the last line of the test data is reached. The code for the test data script and the testbench is given in the appendix, but the actual test data used is omitted from this report due to its large size. Once a unit has successfully passed all of the test cases in the testbench, it can be assumed to be functionally correct under all input conditions. It can then be synthesised as a hardware unit suitable for implementation on the FPGA hardware. This process is covered in the next chapter. 28
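As an illustration of the exhaustive approach described in Section 4.1, the sketch below drives every operand pair through an 8-bit adder and checks each result against a reference computed by the simulator. It is a minimal example rather than the project's own testbench (which is given in the appendix): the device under test shown here is the ripple-carry adder from Appendix A.1.2, and the carry-in is held at zero for brevity.

    // Minimal self-checking exhaustive testbench for an 8-bit adder.
    module tb_add8_exhaustive;
        reg  [7:0] a, b;
        wire [7:0] s;
        wire       cout;
        integer    i, j, errors;

        // Device under test: the 8-bit ripple-carry adder from Appendix A.1.2.
        ripplecarryadd8 dut (.a(a), .b(b), .cin(1'b0), .s(s), .cout(cout));

        initial begin
            errors = 0;
            for (i = 0; i < 256; i = i + 1) begin
                for (j = 0; j < 256; j = j + 1) begin
                    a = i[7:0];
                    b = j[7:0];
                    #10; // allow the combinational logic to settle
                    if ({cout, s} !== i + j) begin
                        $display("FAIL: %0d + %0d gave %0d", i, j, {cout, s});
                        errors = errors + 1;
                    end
                end
            end
            if (errors == 0)
                $display("PASS: all 65536 cases correct");
            $finish;
        end
    endmodule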
  • 29. 5. Implementation Once an arithmetic has been designed and fully verified, it is ready to be implemented. This process involves transforming the Verilog source code for a hardware unit into binary bit- stream data suitable for downloading to an FPGA device. The synthesis process also gener- ates various statistics that are useful in analysing and comparing the various types of adders and multipliers. These will be the primary focus of this chapter. 5.1. The Synthesis Process Synthesis is the process by which a hardware description of a logic unit is used to generate a hardware implementation of logic gates. This implementation can be targeted at the fab- rication of a physical ASIC, or a bitstream for the configuration of an FPGA. In this manner, synthesis is roughly analogous to compiling the source code for some software into a binary executable. The process is performed by the synthesis tool of an EDA toolchain. In the case of this project, the synthesis tool is XST (Xilinx Synthesis Tool) which is integrated into the Xilinx ISE suite. The main stages of synthesis are as follows: • Translate: The translation stage parses the source file and generates a netlist (a list of the wires in the design) and the logic gates associated with them. Various optim- isations are utilised to minimise the specified Boolean logic to a set of logic gates with equivalent truth tables. This assists in reducing the area and delay for the unit. • Map: The map stage uses the aforementioned gate list, assigning them to specific logic slices and inputs/outputs on the FPGA. The LUTs are also configured to reflect the logic required for the design. • Place-and-route: Once the design has been mapped to the FPGA, the place-and-route stage uses the netlist to decide on how the design should be arranged on the chip and the routing of wires between the logic gates. There are a selection of optimisation target profiles that can be used to influence the place-and-route stage. For example, the synthesis tool can be directed to minimize area usage or delay, or strike a balance between both goals. This project utilises the Balanced profile for all units. 29
  • 30. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Once a unit has been synthesised, a bitstream can be generated from it. This requires the designer to assign the unit’s inputs and outputs to the physical pins on the FPGA chip, along with any other constraints that may be required. Once this has been completed, the toolchain generates a complete bitstream of the entire FPGA’s configuration. This bitstream can then be downloaded to the FPGA, finishing the process of implementing the design into hardware. 5.2. Synthesis Reports and Statistics An important part of the synthesis process is the reports and statistics that are generated along with the synthesised unit. These provide important details with regard to the unit such as the area usage, the number of input/output buffers (IOBs) required, the estimated pin-to- pin delay for each combination of input and output pin, the estimated power consumption at a given clock rate, and so forth. This data is important from a development perspective and here it will be utilised to evaluate each of the arithmetic units developed in the course of this project. Unfortunately, due to technical issues with the development software it was not possible to generate reliable dynamic power consumption reports for the purposes of this project, nor was it possible to interact directly with the implemented designs on the hardware. However, since power consumption of a digital circuit is directly proportional to the number of logic gates in the circuit, it can be indirectly inferred that a design with a larger area is expected to consume more power, assuming a similar percentage of gates are switched with each computation. 5.2.1. Adders In this project, the four adders discussed earlier were fully implemented, namely the ripple- carry adder, the carry-lookahead adder, the carry-select adder and the carry-skip adder. These were run through the synthesis tool which generated synthesis reports for each. The first criteria for evaluating the adders was the area usage. This was quantified by examining the number of slice LUTs required for the design. The results for the 8-bit variants are graphed in Figure 5.1. As would be expected, the ripple-carry adder’s simple design gives it the smallest area us- age of the four units, with eight LUTs used. What was not expected was that the carry-select adder also used an equal amount of area. This was most likely due to synthesis optimisations that allowed for the adder to be efficiently implemented in the FPGA. The carry-lookahead adder required an additional LUT for its lookahead logic while the carry-skip adder required 30
  • 31. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.1.: Graph showing the area usage of the 8-bit adder variants, in terms of LUTs used. the most with ten LUTs. The scaling of these adders is depicted in Figure 5.2 and Figure 5.3 for the 32-bit and 64-bit adders respectively. From these graphs, it is immediately apparent that the area requirements of the carry- lookahead adder scale up very quickly in relation to operand width. The 32-bit and 64-bit carry-lookahead adders are significantly larger than the other variants, which suggests that large carry-lookahead adders are not suitable for practical use. Again, both the ripple-carry and carry-select adders use the least area, while the carry-skip adder uses noticeably more LUTs. 31
  • 32. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.2.: Graph showing the area usage of the 32-bit adder variants. Figure 5.3.: Graph showing the area usage of the 64-bit adder variants. 32
  • 33. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION The next criteria to analyse is the maximum worst-case delay, a value that is necessary to determine for the purpose of integrating the unit within a larger logic unit. This value is obtained from the pin-to-pin delay report, which details the delays between each bit of every input pin and every bit of every output pin. The largest value from this list is selected as being the maximum delay. The results of this are shown in Figure 5.4. Figure 5.4.: Graph showing the maximum worst-case delay of the 8-bit adder variants, meas- ured in nanoseconds. Slightly surprisingly, the ripple-carry adder comes first with the shortest delay as com- pared to the other adders. This is likely due to the fact that with only eight full adders in a ripple-carry chain, the ripple delay is not yet large enough to be a significant issue. The additional cost of the critical path optimisations present in the other adder designs hence do not outweigh the benefits derived from them. Figure 5.5 and Figure 5.6 show the maximum delay of the 32-bit and 64-bit adders respectively. Here, the other adders now provide a tangible reduction in maximum delay as compared to the ripple-carry adder. The carry-lookahead adder in particular has the shortest delay, though this comes at the cost of significantly more area usage as discussed previously. The 32-bit carry-select adder also improves on the delay relative to the ripple-carry adder, while the 32-bit carry-skip adder proves to be the slowest. This situation is reversed for the 64-bit adders where the carry-skip adder proves to be faster than the carry-select adder, though not as fast as the carry-lookahead adder. In fact, in this instance the carry-select adder has 33
  • 34. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.5.: Graph showing the maximum worst-case delay of the 32-bit adder variants. Figure 5.6.: Graph showing the maximum worst-case delay of the 64-bit adder variants. a longer delay than the ripple-carry adder. Hence, we can conclude that there is no ‘best’ adder design in all cases. The best choice of adder for each operand width is determined by the designers requirements and by profiling the individual designs. 34
  • 35. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION 5.2.2. Multipliers The array multiplier, Wallace tree multiplier and Dadda tree multiplier were all implemented as 8-bits units in this project. However, due to time constraints only the array multiplier was implemented as 32-bit and 64-bit units. The majority of this section will therefore focus on the 8-bit units. Firstly, the area usage of these units is given in Figure 5.7. Figure 5.7.: Graph showing the area usage of the 8-bit multipliers. It is clear that the Wallace tree multiplier uses the least area with 73 LUTs as compared to 74 for the array multiplier and 76 for the Dadda multiplier. This is despite the additional overhead imposed by the MBE, which requires more logic than traditional calculation of the partial products. In addition, the Wallace tree multiplier’s ability to handle multiplication of signed numbers places it at a clear advantage against the array multiplier. Meanwhile, the Dadda multiplier’s expanded design is visible in its increased area usage. The next criteria is the maximum worst-case delay. The graph for this is given in Figure 5.8. Again, the Wallace tree multiplier emerges as the unequivocal winner with the shortest delay, while also being able to multiply signed numbers. The Dadda tree multiplier has the longest delay, although by a relatively small margin. However, it is entirely possible that different characteristics could be observed with 32-bit and 64-bit implementations of the tree multi- pliers, as was the case with the carry-select and carry-skip adders. 35
  • 36. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.8.: Graph showing the maximum worst-case of the 8-bit multipliers. For informative purposes, the 8-bit array multiplier was compared to its 32-bit and 64-bit counterparts to analyse the scaling of the array multiplier with operand width. The graphs for area and delay are given in Figure 5.9 and Figure 5.10 respectively. The delay graph shows the array multiplier scales slightly less than linearly between the three bit widths, with delays of 11ns, 38ns and 56ns for the 8-bit, 32-bit and 64-bit multipliers respectively. However, it is the area graph that shows the enormous effect of n2 scaling of area usage. While the 8-bit multiplier only required 74 LUTs, the 32-bit multiplier requires over 1400 and this balloons to almost 6000 for the 64-bit multiplier. Hence, it is obvious that the array multiplier is a poor choice of design for practical multiplier applications as the area requirements are simply too large. 36
  • 37. 5.2. SYNTHESIS REPORTS AND STATISTICS CHAPTER 5. IMPLEMENTATION Figure 5.9.: Graph showing the scaling of area usage for array multipliers of various widths. Figure 5.10.: Graph showing the scaling of maximum delay for array multipliers of various widths. 37
  • 38. 6. Conclusion This project has covered the complete development process of a variety of arithmetic units, from the theory that underpins them through to their design and testing process, before being implemented as complete hardware units. Additionally, it has covered the design properties and characteristics that distinguish the units from each other, with the advantages and dis- advantages of each unit discussed in detail. Each unit was also developed with a selection of operand widths to investigate the effects of scaling on the characteristics of the units. In particular, it can be concluded that for the adders, no particular adder emerged as a clear best design. The choice of adder used in a digital circuit is guided heavily by the requirements of the designer’s project. For instance, a compact design requiring an 8-bit adder with min- imal area usage and power consumption is best served by the corresponding ripple-carry adder. A design that calls for minimal delay in an 8-bit adder is best served by the carry- lookahead adder. For a 32-bit adder, the area scaling issues of a carry-lookahead adder make it unsuitable for many applications, so a designer seeking out low delay while keeping area usage acceptable would select the carry-select adder. For the multipliers, the situation is more clear-cut. The 8-bit Wallace tree multiplier proved to be superior to its array multiplier counterpart in both aspects. Its use of a modified Booth encoder and a tree structure allowed it to use less area while simultaneously possessing a shorter delay. It also avoids the n2 area scaling issue of the array multiplier as seen with the latter’s 32-bit and 64-bit variants. The ability to correctly multiply signed integers cements its advantage as many digital designs will require the ability to multiply negative numbers together. Meanwhile, the Dadda tree multiplier was disadvantaged by having a larger area and delay than the other two multiplier designs. However, since these results were only obtained for 8-bit multipliers, it is entirely possible that wider variants of the Dadda tree multiplier may give more favourable characteristics than its Wallace counterpart. The primary issues that occurred with this project was a lack of time as well as a lack of prior knowledge and experience in digital hardware design and arithmetic units. In partic- ular, understanding the logic and theory of the tree-based multiplier units and the modified Booth encoder was very time-consuming. Possessing a thorough understanding of these units was essential before development work could begin. Hence, there was only sufficient 38
  • 39. CHAPTER 6. CONCLUSION time to complete the design, verification and implementation of the 8-bit variants of the tree multipliers. Given more time, an analysis of the scaling characteristics of the tree multipliers with 32-bit and 64-bit wide operands could have been undertaken. Another issue was the inability to obtain data on the power consumption qualities of the arithmetic units as was originally intended, from both software estimated values and actual values measured from the hardware. Technical issues as well a lack of prior experience with the software made it difficult to obtain meaningful dynamic power consumption estimates. Only estimates of static power consumption (from transistor leakage) was available, which were not useful for quantifying the power consumed when evaluating a calculation. Ad- ditionally, there were further issues with using the arithmetic units on the physical FPGA hardware. While it was possible to generate a bitstream from the synthesised units and download this to the hardware, there was no clear method of interfacing with the arithmetic unit. It was not possible to send test data to the unit nor to read its output, severely limit- ing the usefulness of this approach. With more time available, it would have been easier to overcome these issues. It would also have been possible to make physical measurements on the hardware, allowing for a useful analysis of real-world power consumption by the units. Despite these issues, the project was still successful in that many useful results were ob- tained. Nearly all of the intended units were fully designed, verified and implemented in the course of this project. Ultimately, it has been an extremely rewarding experience and has given a significant amount of in-depth knowledge and practical hands-on experience in the realm of digital hardware design. 39
• 40. Bibliography

Bohsali, M. and M. Doan. Rectangular Styled Wallace Tree Multipliers. url: http://www.veech.com/index files/Wallace%20Tree.pdf.
EDA Cafe. Datapath Logic Cells. url: http://www10.edacafe.com/book/ASIC/Book/Book/Book/CH02/CH02.6.php.
Nvidia. What is GPU Accelerated Computing? url: http://www.nvidia.com/object/what-is-gpu-computing.html.
Punnaiah, S. et al. 'Design and Evaluation of High Performance Multiplier Using Modified Booth Encoding Algorithm'. In: International Journal of Engineering and Innovative Technology 1 (6 June 2012), pp. 16–19.
Saharan, K. and J. Kaur. 'Design and Implementation of an Efficient Modified Booth Multiplier using VHDL'. In: International Journal of Advances in Engineering Sciences 3 (3 July 2013), pp. 78–81.
Terms, Tech. ALU (Arithmetic Logic Unit) Definition. url: http://www.techterms.com/definition/alu.
Tohoku University. Hardware Algorithms For Arithmetic Modules. url: http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html.
Tufts University. 4*4 multiplier. url: http://www.eecs.tufts.edu/~ryun01/vlsi/verilog simulation.htm.
Vahid, Frank. Digital Design. Wiley, 2011. isbn: 978-0-470-53108-2.
Vladutiu, Mircea. Computer Arithmetic. Algorithms and Hardware Implementations. Springer, 2012. isbn: 978-3-642-18315-7.
Yovits, Marshall C. Advances in Computers. Academic Press, 1993, p. 105. isbn: 978-0-470-53108-2.
• 41. Appendix A. Verilog Source Code for the 8-bit Arithmetic Units

A.1. Adders

A.1.1. Half Adder and Full Adder

    // Simple half-adder that computes the sum and carry-out of two bits
    module halfadder (
        input wire a,
        input wire b,
        output wire s,
        output wire cout
    );
        assign s = a ^ b;
        assign cout = a & b;
    endmodule

    // Simple full-adder that computes the sum and carry-out of two bits,
    // plus a carry-in
    module fulladder (
        input wire a,
        input wire b,
        input wire cin,
        output wire s,
        output wire cout
    );
        wire s0, c0, c1;
• 42.
        // Two half-adders are chained to create a full adder
        halfadder ha0 (.a(a), .b(b), .s(s0), .cout(c0));
        halfadder ha1 (.a(s0), .b(cin), .s(s), .cout(c1));

        assign cout = c0 | c1;
    endmodule

A.1.2. Ripple-carry Adder

    // 4-bit ripple-carry adder with four full-adders
    module ripplecarryadd4 (
        input wire [3:0] a,
        input wire [3:0] b,
        input wire cin,
        output wire [3:0] s,
        output wire cout
    );
        wire [3:0] c;

        assign cout = c[3];

        fulladder fa0 (.a(a[0]), .b(b[0]), .cin(cin),  .s(s[0]), .cout(c[0]));
        fulladder fa1 (.a(a[1]), .b(b[1]), .cin(c[0]), .s(s[1]), .cout(c[1]));
        fulladder fa2 (.a(a[2]), .b(b[2]), .cin(c[1]), .s(s[2]), .cout(c[2]));
        fulladder fa3 (.a(a[3]), .b(b[3]), .cin(c[2]), .s(s[3]), .cout(c[3]));
    endmodule

    // 8-bit ripple-carry adder with two 4-bit RCAs
    module ripplecarryadd8 (
        input wire [7:0] a,
        input wire [7:0] b,
        input wire cin,
        output wire [7:0] s,
        output wire cout
    );
• 43.
        wire c0;

        ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(cin), .s(s[3:0]), .cout(c0));
        ripplecarryadd4 rca4_1 (.a(a[7:4]), .b(b[7:4]), .cin(c0),  .s(s[7:4]), .cout(cout));
    endmodule

A.1.3. Carry-lookahead Adder

    // Propagate/Generate adder that computes the P/G signals
    // instead of the carry-out as in the full adder
    module pgadder (
        input wire a,
        input wire b,
        input wire cin,
        output wire s,
        output wire p,
        output wire g
    );
        assign s = a ^ b ^ cin;
        assign p = a | b;
        assign g = a & b;
    endmodule

    // 8-bit carry-lookahead adder
    module carrylookaheadadd8 (
        input wire [7:0] a,
        input wire [7:0] b,
        input wire cin,
        output wire [7:0] s,
        output wire cout
    );
        // Propagate, generate and carry output signals for each bit
        wire [7:0] p, g, c;
• 44.
        // The formula for the lookahead is c_i+1 = g_i | (p_i & c_i), where c_i is expanded recursively
        assign c[0] = cin;
        assign c[1] = g[0] | (p[0] & c[0]);
        assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
        assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0]);
        assign c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0]) |
                      (p[3] & p[2] & p[1] & p[0] & c[0]);
        assign c[5] = g[4] | (p[4] & g[3]) | (p[4] & p[3] & g[2]) | (p[4] & p[3] & p[2] & g[1]) |
                      (p[4] & p[3] & p[2] & p[1] & g[0]) | (p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
        assign c[6] = g[5] | (p[5] & g[4]) | (p[5] & p[4] & g[3]) | (p[5] & p[4] & p[3] & g[2]) |
                      (p[5] & p[4] & p[3] & p[2] & g[1]) | (p[5] & p[4] & p[3] & p[2] & p[1] & g[0]) |
                      (p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
        assign c[7] = g[6] | (p[6] & g[5]) | (p[6] & p[5] & g[4]) | (p[6] & p[5] & p[4] & g[3]) |
                      (p[6] & p[5] & p[4] & p[3] & g[2]) | (p[6] & p[5] & p[4] & p[3] & p[2] & g[1]) |
                      (p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & g[0]) |
                      (p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);
        assign cout = g[7] | (p[7] & g[6]) | (p[7] & p[6] & g[5]) | (p[7] & p[6] & p[5] & g[4]) |
                      (p[7] & p[6] & p[5] & p[4] & g[3]) | (p[7] & p[6] & p[5] & p[4] & p[3] & g[2]) |
                      (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & g[1]) |
                      (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & g[0]) |
                      (p[7] & p[6] & p[5] & p[4] & p[3] & p[2] & p[1] & p[0] & c[0]);

        // Propagate/Generate adders which give the P/G signals and use the carries computed by the CLA logic
        pgadder pga0 (.a(a[0]), .b(b[0]), .cin(c[0]), .s(s[0]), .p(p[0]), .g(g[0]));
        pgadder pga1 (.a(a[1]), .b(b[1]), .cin(c[1]), .s(s[1]), .p(p[1]), .g(g[1]));
        pgadder pga2 (.a(a[2]), .b(b[2]), .cin(c[2]), .s(s[2]), .p(p[2]), .g(g[2]));
• 45.
        pgadder pga3 (.a(a[3]), .b(b[3]), .cin(c[3]), .s(s[3]), .p(p[3]), .g(g[3]));
        pgadder pga4 (.a(a[4]), .b(b[4]), .cin(c[4]), .s(s[4]), .p(p[4]), .g(g[4]));
        pgadder pga5 (.a(a[5]), .b(b[5]), .cin(c[5]), .s(s[5]), .p(p[5]), .g(g[5]));
        pgadder pga6 (.a(a[6]), .b(b[6]), .cin(c[6]), .s(s[6]), .p(p[6]), .g(g[6]));
        pgadder pga7 (.a(a[7]), .b(b[7]), .cin(c[7]), .s(s[7]), .p(p[7]), .g(g[7]));
    endmodule

A.1.4. Carry-select Adder

    // 8-bit carry-select adder
    module carryselectadd8 (
        input wire [7:0] a,
        input wire [7:0] b,
        input wire cin,
        output wire [7:0] s,
        output wire cout
    );
        wire cs, cout_0, cout_1;
        wire [3:0] result_0, result_1;

        // The appropriate output for the upper bits is selected by the carry-select signal
        assign {cout, s[7:4]} = (cs) ? {cout_1, result_1} : {cout_0, result_0};

        // Simple RCA adds the lower four bits and emits a carry-select signal
        ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(cin), .s(s[3:0]), .cout(cs));

        // The upper bits are computed twice, for a carry-in of 0
ripplecarryadd4 rca4_1_0 (.a(a[7:4]), .b(b[7:4]), .cin(1'b0), .s(result0), .cout(cout0));
ripplecarryadd4 rca4_1_1 (.a(a[7:4]), .b(b[7:4]), .cin(1'b1), .s(result1), .cout(cout1));
endmodule

A.1.5. Carry-skip Adder

// A 4-bit RCA that provides a propagate signal for the carry-skip adder
module pgripplecarryadd4 (input wire [3:0] a,
                          input wire [3:0] b,
                          input wire cin,
                          output wire [3:0] s,
                          output wire cout,
                          output wire p);

wire [3:0] c;

assign cout = c[3];
assign p = (a[0] | b[0]) & (a[1] | b[1]) & (a[2] | b[2]) & (a[3] | b[3]);

fulladder fa0 (.a(a[0]), .b(b[0]), .cin(cin), .s(s[0]), .cout(c[0]));
fulladder fa1 (.a(a[1]), .b(b[1]), .cin(c[0]), .s(s[1]), .cout(c[1]));
fulladder fa2 (.a(a[2]), .b(b[2]), .cin(c[1]), .s(s[2]), .cout(c[2]));
fulladder fa3 (.a(a[3]), .b(b[3]), .cin(c[2]), .s(s[3]), .cout(c[3]));
endmodule
// 8-bit carry-skip adder
module carryskipadd8 (input wire [7:0] a,
                      input wire [7:0] b,
                      input wire cin,
                      output wire [7:0] s,
                      output wire cout);

wire [1:0] rcin, rcout;
wire p;

assign rcin[0] = cin;
assign rcin[1] = rcout[0];

// If the second RCA will propagate a carry, simply pass rcout[0] to the cout, skipping the second RCA
assign cout = rcout[1] | (p & rcout[0]);

ripplecarryadd4 rca4_0 (.a(a[3:0]), .b(b[3:0]), .cin(rcin[0]), .s(s[3:0]), .cout(rcout[0]));
pgripplecarryadd4 rca4_1 (.a(a[7:4]), .b(b[7:4]), .cin(rcin[1]), .s(s[7:4]), .cout(rcout[1]), .p(p));
endmodule
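The 8-bit adder modules in this appendix share the same (a, b, cin, s, cout) interface, so a single exhaustive, self-checking testbench can cover any of them by swapping the instantiated module. The sketch below is illustrative only and is not part of the original sources; the testbench name tb_adder8 is an assumption.

// Illustrative exhaustive testbench (not from the original sources): checks
// every (a, b, cin) combination of an 8-bit adder against built-in addition.
module tb_adder8;
    reg  [7:0] a, b;
    reg        cin;
    wire [7:0] s;
    wire       cout;
    integer    i, j, k, errors;

    // Swap in carrylookaheadadd8 or carryselectadd8 to test the other units
    carryskipadd8 dut (.a(a), .b(b), .cin(cin), .s(s), .cout(cout));

    initial begin
        errors = 0;
        for (i = 0; i < 256; i = i + 1)
            for (j = 0; j < 256; j = j + 1)
                for (k = 0; k < 2; k = k + 1) begin
                    a = i; b = j; cin = k;
                    #1;
                    if ({cout, s} !== (i + j + k)) begin
                        errors = errors + 1;
                        $display("Mismatch: %0d + %0d + %0d gave %0d", i, j, k, {cout, s});
                    end
                end
        $display("Exhaustive adder check complete, %0d mismatches", errors);
        $finish;
    end
endmodule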
A.2. Multipliers

A.2.1. Array Multiplier

// Array multiplier module that computes a bit product and adds it to a sum-in
module mulmodule (input wire x,
                  input wire y,
                  input wire sin,
                  input wire cin,
                  output wire cout,
                  output wire sout);

wire bin;

assign bin = x & y;

fulladder adder (.a(sin), .b(bin), .cin(cin), .s(sout), .cout(cout));
endmodule

// 8-bit unsigned array multiplier
module arraymultiplier8 (input wire [7:0] a,
                         input wire [7:0] b,
                         output wire [15:0] result);

wire [7:0] c [8:0], s [7:0]; // Intermediate carry and sum wires

// Partial products of multiplicand with each multiplier bit
mulmodule mm00_00 (.x(a[0]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][0]), .sout(s[0][0]));
mulmodule mm00_01 (.x(a[1]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][1]), .sout(s[0][1]));
mulmodule mm00_02 (.x(a[2]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][2]), .sout(s[0][2]));
mulmodule mm00_03 (.x(a[3]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][3]), .sout(s[0][3]));
mulmodule mm00_04 (.x(a[4]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][4]), .sout(s[0][4]));
mulmodule mm00_05 (.x(a[5]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][5]), .sout(s[0][5]));
mulmodule mm00_06 (.x(a[6]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][6]), .sout(s[0][6]));
mulmodule mm00_07 (.x(a[7]), .y(b[0]), .sin(1'b0), .cin(1'b0), .cout(c[0][7]), .sout(s[0][7]));

mulmodule mm01_00 (.x(a[0]), .y(b[1]), .sin(s[0][1]), .cin(c[0][0]), .cout(c[1][0]), .sout(s[1][0]));
mulmodule mm01_01 (.x(a[1]), .y(b[1]), .sin(s[0][2]), .cin(c[0][1]), .cout(c[1][1]), .sout(s[1][1]));
mulmodule mm01_02 (.x(a[2]), .y(b[1]), .sin(s[0][3]), .cin(c[0][2]), .cout(c[1][2]), .sout(s[1][2]));
mulmodule mm01_03 (.x(a[3]), .y(b[1]), .sin(s[0][4]), .cin(c[0][3]), .cout(c[1][3]), .sout(s[1][3]));
mulmodule mm01_04 (.x(a[4]), .y(b[1]), .sin(s[0][5]), .cin(c[0][4]), .cout(c[1][4]), .sout(s[1][4]));
mulmodule mm01_05 (.x(a[5]), .y(b[1]), .sin(s[0][6]), .cin(c[0][5]), .cout(c[1][5]), .sout(s[1][5]));
mulmodule mm01_06 (.x(a[6]), .y(b[1]), .sin(s[0][7]), .cin(c[0][6]), .cout(c[1][6]), .sout(s[1][6]));
mulmodule mm01_07 (.x(a[7]), .y(b[1]), .sin(1'b0), .cin(c[0][7]), .cout(c[1][7]), .sout(s[1][7]));

mulmodule mm02_00 (.x(a[0]), .y(b[2]), .sin(s[1][1]), .cin(c[1][0]), .cout(c[2][0]), .sout(s[2][0]));
mulmodule mm02_01 (.x(a[1]), .y(b[2]), .sin(s[1][2]), .cin(c[1][1]), .cout(c[2][1]), .sout(s[2][1]));
mulmodule mm02_02 (.x(a[2]), .y(b[2]), .sin(s[1][3]), .cin(c[1][2]), .cout(c[2][2]), .sout(s[2][2]));
mulmodule mm02_03 (.x(a[3]), .y(b[2]), .sin(s[1][4]), .cin(c[1][3]), .cout(c[2][3]), .sout(s[2][3]));
mulmodule mm02_04 (.x(a[4]), .y(b[2]), .sin(s[1][5]), .cin(c[1][4]), .cout(c[2][4]), .sout(s[2][4]));
mulmodule mm02_05 (.x(a[5]), .y(b[2]), .sin(s[1][6]), .cin(c[1][5]), .cout(c[2][5]), .sout(s[2][5]));
mulmodule mm02_06 (.x(a[6]), .y(b[2]), .sin(s[1][7]), .cin(c[1][6]), .cout(c[2][6]), .sout(s[2][6]));
mulmodule mm02_07 (.x(a[7]), .y(b[2]), .sin(1'b0), .cin(c[1][7]), .cout(c[2][7]), .sout(s[2][7]));

mulmodule mm03_00 (.x(a[0]), .y(b[3]), .sin(s[2][1]), .cin(c[2][0]), .cout(c[3][0]), .sout(s[3][0]));
mulmodule mm03_01 (.x(a[1]), .y(b[3]), .sin(s[2][2]), .cin(c[2][1]), .cout(c[3][1]), .sout(s[3][1]));
mulmodule mm03_02 (.x(a[2]), .y(b[3]), .sin(s[2][3]), .cin(c[2][2]), .cout(c[3][2]), .sout(s[3][2]));
mulmodule mm03_03 (.x(a[3]), .y(b[3]), .sin(s[2][4]), .cin(c[2][3]), .cout(c[3][3]), .sout(s[3][3]));
mulmodule mm03_04 (.x(a[4]), .y(b[3]), .sin(s[2][5]), .cin(c[2][4]), .cout(c[3][4]), .sout(s[3][4]));
mulmodule mm03_05 (.x(a[5]), .y(b[3]), .sin(s[2][6]), .cin(c[2][5]), .cout(c[3][5]), .sout(s[3][5]));
mulmodule mm03_06 (.x(a[6]), .y(b[3]), .sin(s[2][7]), .cin(c[2][6]), .cout(c[3][6]), .sout(s[3][6]));
mulmodule mm03_07 (.x(a[7]), .y(b[3]), .sin(1'b0), .cin(c[2][7]), .cout(c[3][7]), .sout(s[3][7]));

mulmodule mm04_00 (.x(a[0]), .y(b[4]), .sin(s[3][1]), .cin(c[3][0]), .cout(c[4][0]), .sout(s[4][0]));
mulmodule mm04_01 (.x(a[1]), .y(b[4]), .sin(s[3][2]), .cin(c[3][1]), .cout(c[4][1]), .sout(s[4][1]));
mulmodule mm04_02 (.x(a[2]), .y(b[4]), .sin(s[3][3]), .cin(c[3][2]), .cout(c[4][2]), .sout(s[4][2]));
mulmodule mm04_03 (.x(a[3]), .y(b[4]), .sin(s[3][4]), .cin(c[3][3]), .cout(c[4][3]), .sout(s[4][3]));
mulmodule mm04_04 (.x(a[4]), .y(b[4]), .sin(s[3][5]), .cin(c[3][4]), .cout(c[4][4]), .sout(s[4][4]));
mulmodule mm04_05 (.x(a[5]), .y(b[4]), .sin(s[3][6]), .cin(c[3][5]), .cout(c[4][5]), .sout(s[4][5]));
mulmodule mm04_06 (.x(a[6]), .y(b[4]), .sin(s[3][7]), .cin(c[3][6]), .cout(c[4][6]), .sout(s[4][6]));
mulmodule mm04_07 (.x(a[7]), .y(b[4]), .sin(1'b0), .cin(c[3][7]), .cout(c[4][7]), .sout(s[4][7]));

mulmodule mm05_00 (.x(a[0]), .y(b[5]), .sin(s[4][1]), .cin(c[4][0]), .cout(c[5][0]), .sout(s[5][0]));
mulmodule mm05_01 (.x(a[1]), .y(b[5]), .sin(s[4][2]), .cin(c[4][1]), .cout(c[5][1]), .sout(s[5][1]));
mulmodule mm05_02 (.x(a[2]), .y(b[5]), .sin(s[4][3]), .cin(c[4][2]), .cout(c[5][2]), .sout(s[5][2]));
mulmodule mm05_03 (.x(a[3]), .y(b[5]), .sin(s[4][4]), .cin(c[4][3]), .cout(c[5][3]), .sout(s[5][3]));
mulmodule mm05_04 (.x(a[4]), .y(b[5]), .sin(s[4][5]), .cin(c[4][4]), .cout(c[5][4]), .sout(s[5][4]));
mulmodule mm05_05 (.x(a[5]), .y(b[5]), .sin(s[4][6]), .cin(c[4][5]), .cout(c[5][5]), .sout(s[5][5]));
mulmodule mm05_06 (.x(a[6]), .y(b[5]), .sin(s[4][7]), .cin(c[4][6]), .cout(c[5][6]), .sout(s[5][6]));
mulmodule mm05_07 (.x(a[7]), .y(b[5]), .sin(1'b0), .cin(c[4][7]), .cout(c[5][7]), .sout(s[5][7]));

mulmodule mm06_00 (.x(a[0]), .y(b[6]), .sin(s[5][1]), .cin(c[5][0]), .cout(c[6][0]), .sout(s[6][0]));
mulmodule mm06_01 (.x(a[1]), .y(b[6]), .sin(s[5][2]), .cin(c[5][1]), .cout(c[6][1]), .sout(s[6][1]));
mulmodule mm06_02 (.x(a[2]), .y(b[6]), .sin(s[5][3]), .cin(c[5][2]), .cout(c[6][2]), .sout(s[6][2]));
mulmodule mm06_03 (.x(a[3]), .y(b[6]), .sin(s[5][4]), .cin(c[5][3]), .cout(c[6][3]), .sout(s[6][3]));
mulmodule mm06_04 (.x(a[4]), .y(b[6]), .sin(s[5][5]), .cin(c[5][4]), .cout(c[6][4]), .sout(s[6][4]));
mulmodule mm06_05 (.x(a[5]), .y(b[6]), .sin(s[5][6]), .cin(c[5][5]), .cout(c[6][5]), .sout(s[6][5]));
mulmodule mm06_06 (.x(a[6]), .y(b[6]), .sin(s[5][7]), .cin(c[5][6]), .cout(c[6][6]), .sout(s[6][6]));
mulmodule mm06_07 (.x(a[7]), .y(b[6]), .sin(1'b0), .cin(c[5][7]), .cout(c[6][7]), .sout(s[6][7]));

mulmodule mm07_00 (.x(a[0]), .y(b[7]), .sin(s[6][1]), .cin(c[6][0]), .cout(c[7][0]), .sout(s[7][0]));
mulmodule mm07_01 (.x(a[1]), .y(b[7]), .sin(s[6][2]), .cin(c[6][1]), .cout(c[7][1]), .sout(s[7][1]));
mulmodule mm07_02 (.x(a[2]), .y(b[7]), .sin(s[6][3]), .cin(c[6][2]), .cout(c[7][2]), .sout(s[7][2]));
mulmodule mm07_03 (.x(a[3]), .y(b[7]), .sin(s[6][4]), .cin(c[6][3]), .cout(c[7][3]), .sout(s[7][3]));
mulmodule mm07_04 (.x(a[4]), .y(b[7]), .sin(s[6][5]), .cin(c[6][4]), .cout(c[7][4]), .sout(s[7][4]));
mulmodule mm07_05 (.x(a[5]), .y(b[7]), .sin(s[6][6]), .cin(c[6][5]), .cout(c[7][5]), .sout(s[7][5]));
mulmodule mm07_06 (.x(a[6]), .y(b[7]), .sin(s[6][7]), .cin(c[6][6]), .cout(c[7][6]), .sout(s[7][6]));
mulmodule mm07_07 (.x(a[7]), .y(b[7]), .sin(1'b0), .cin(c[6][7]), .cout(c[7][7]), .sout(s[7][7]));

// Lower 8 bits can be obtained from the sum out of the last layer
assign result[0] = s[0][0];
assign result[1] = s[1][0];
assign result[2] = s[2][0];
assign result[3] = s[3][0];
assign result[4] = s[4][0];
assign result[5] = s[5][0];
assign result[6] = s[6][0];
assign result[7] = s[7][0];

// Upper 8 bits need to be summed with carry-outs from previous bits
halfadder ha00 (.a(s[7][1]), .b(c[7][0]), .s(result[8]), .cout(c[8][0]));
fulladder fa01 (.a(s[7][2]), .b(c[7][1]), .cin(c[8][0]), .s(result[9]), .cout(c[8][1]));
fulladder fa02 (.a(s[7][3]), .b(c[7][2]), .cin(c[8][1]), .s(result[10]), .cout(c[8][2]));
fulladder fa03 (.a(s[7][4]), .b(c[7][3]), .cin(c[8][2]), .s(result[11]), .cout(c[8][3]));
fulladder fa04 (.a(s[7][5]), .b(c[7][4]), .cin(c[8][3]), .s(result[12]), .cout(c[8][4]));
fulladder fa05 (.a(s[7][6]), .b(c[7][5]), .cin(c[8][4]), .s(result[13]), .cout(c[8][5]));
fulladder fa06 (.a(s[7][7]), .b(c[7][6]), .cin(c[8][5]), .s(result[14]), .cout(c[8][6]));

assign result[15] = c[7][7] ^ c[8][6];
endmodule
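Because the array multiplier is unsigned and purely combinational, its output can be compared directly against Verilog's built-in multiply operator in simulation. The wrapper below is an illustrative sketch only; the module name check_arraymultiplier8 and its structure are assumptions, not part of the original sources.

// Illustrative comparison wrapper (not from the original sources): binds the
// array multiplier next to a behavioural reference product and reports any
// mismatch observed in simulation.
module check_arraymultiplier8 (input wire [7:0] a, input wire [7:0] b);
    wire [15:0] result;
    wire [15:0] expected = a * b;   // behavioural unsigned reference product

    arraymultiplier8 dut (.a(a), .b(b), .result(result));

    always @(a or b)
        #1 if (result !== expected)
            $display("arraymultiplier8 mismatch: a=%0d b=%0d result=%0d expected=%0d",
                     a, b, result, expected);
endmodule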
A.2.2. Modified Booth Encoder

// 8-bit Modified Booth Encoder to generate four partial products to be summed
module boothencoder8 (input wire [7:0] a,
                      input wire [7:0] b,
                      output reg [8:0] p00,
                      output reg [8:0] p01,
                      output reg [8:0] p02,
                      output reg [8:0] p03);

// Group multiplier bits into overlapping groups of three bits, then decide
// what the partial products should be based on the bits
always @(a or b) begin
    // Equivalent to appending a zero to bits 1 and 0, only need to check four cases
    case (b[1:0])
        2'b00: p00 <= 9'b000000000;
        2'b01: p00 <= $signed(a);
        2'b10: p00 <= {~a, 1'b1};
        2'b11: p00 <= $signed(~a);
        default: p00 <= 9'bxxxxxxxxx;
    endcase

    case (b[3:1])
        3'b000: p01 <= 9'b000000000;   // P = 0
        3'b001: p01 <= $signed(a);     // P = A
        3'b010: p01 <= $signed(a);     // P = A
        3'b011: p01 <= {a, 1'b0};      // P = 2A
        3'b100: p01 <= {~a, 1'b1};     // P = -2A
        3'b101: p01 <= $signed(~a);    // P = -A
        3'b110: p01 <= $signed(~a);    // P = -A
        3'b111: p01 <= 9'b111111111;   // P = 0 (all 1s, plus complement bit gives 0)
        default: p01 <= 9'bxxxxxxxxx;  // Should not occur in normal operation with defined inputs
    endcase
    case (b[5:3])
        3'b000: p02 <= 9'b000000000;
        3'b001: p02 <= $signed(a);
        3'b010: p02 <= $signed(a);
        3'b011: p02 <= {a, 1'b0};
        3'b100: p02 <= {~a, 1'b1};
        3'b101: p02 <= $signed(~a);
        3'b110: p02 <= $signed(~a);
        3'b111: p02 <= 9'b111111111;
        default: p02 <= 9'bxxxxxxxxx;
    endcase

    case (b[7:5])
        3'b000: p03 <= 9'b000000000;
        3'b001: p03 <= $signed(a);
        3'b010: p03 <= $signed(a);
        3'b011: p03 <= {a, 1'b0};
        3'b100: p03 <= {~a, 1'b1};
        3'b101: p03 <= $signed(~a);
        3'b110: p03 <= $signed(~a);
        3'b111: p03 <= 9'b111111111;
        default: p03 <= 9'bxxxxxxxxx;
    endcase
end
endmodule
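As a worked example (added here for illustration, not part of the original listing), take b = 8'b01101010 (106). The four overlapping groups are {b[1], b[0], 0} = 100, {b[3], b[2], b[1]} = 101, {b[5], b[4], b[3]} = 101 and {b[7], b[6], b[5]} = 011, which recode to -2A, -A, -A and +2A respectively. Weighting group i by 4^i gives (-2)·1 + (-1)·4 + (-1)·16 + 2·64 = 106·A, as required. Note that the encoder only emits the one's complement for the negative rows; the extra +1 that completes each two's-complement negation appears to be injected later in the reduction trees through the b[1], b[3], b[5] and b[7] bits that feed the adders below.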
A.2.3. Wallace Tree Multiplier

// 8-bit signed Wallace Tree multiplier
module wallacetreemultiplier8 (input wire [7:0] a,
                               input wire [7:0] b,
                               output wire [15:0] result);

wire [8:0] p [3:0];
wire [2:0] c [15:2];
wire [2:0] w [15:2];
wire cla0_cout;

// Use the modified Booth encoder to generate the partial products
boothencoder8 pp (.a(a), .b(b), .p00(p[0]), .p01(p[1]), .p02(p[2]), .p03(p[3]));

// Sum the partial products by wire weight until only two wires are left for each weight
// Weight 2^2
halfadder ha02_00 (.a(p[0][2]), .b(p[1][0]), .s(w[2][0]), .cout(c[2][0]));

// Weight 2^3
halfadder ha03_00 (.a(p[0][3]), .b(p[1][1]), .s(w[3][0]), .cout(c[3][0]));

// Weight 2^4
fulladder fa04_00 (.a(p[0][4]), .b(p[1][2]), .cin(p[2][0]), .s(w[4][0]), .cout(c[4][0]));
halfadder ha04_01 (.a(w[4][0]), .b(b[5]), .s(w[4][1]), .cout(c[4][1]));

// Weight 2^5
fulladder fa05_00 (.a(p[0][5]), .b(p[1][3]), .cin(p[2][1]), .s(w[5][0]), .cout(c[5][0]));
halfadder ha05_01 (.a(w[5][0]), .b(c[4][0]), .s(w[5][1]), .cout(c[5][1]));

// Weight 2^6
fulladder fa06_00 (.a(p[0][6]), .b(p[1][4]), .cin(p[2][2]), .s(w[6][0]), .cout(c[6][0]));
fulladder fa06_01 (.a(w[6][0]), .b(p[3][0]), .cin(b[7]), .s(w[6][1]), .cout(c[6][1]));
halfadder fa06_02 (.a(w[6][1]), .b(c[5][0]), .s(w[6][2]), .cout(c[6][2]));
// Weight 2^7
fulladder fa07_00 (.a(p[0][7]), .b(p[1][5]), .cin(p[2][3]), .s(w[7][0]), .cout(c[7][0]));
fulladder fa07_01 (.a(w[7][0]), .b(p[3][1]), .cin(c[6][0]), .s(w[7][1]), .cout(c[7][1]));
halfadder ha07_02 (.a(w[7][1]), .b(c[6][1]), .s(w[7][2]), .cout(c[7][2]));

// Weight 2^8
fulladder fa08_00 (.a(p[0][8]), .b(p[1][6]), .cin(p[2][4]), .s(w[8][0]), .cout(c[8][0]));
fulladder fa08_01 (.a(w[8][0]), .b(p[3][2]), .cin(c[7][0]), .s(w[8][1]), .cout(c[8][1]));
halfadder ha08_02 (.a(w[8][1]), .b(c[7][1]), .s(w[8][2]), .cout(c[8][2]));

// Weight 2^9
fulladder fa09_00 (.a(p[0][8]), .b(p[1][7]), .cin(p[2][5]), .s(w[9][0]), .cout(c[9][0]));
fulladder fa09_01 (.a(w[9][0]), .b(p[3][3]), .cin(c[8][0]), .s(w[9][1]), .cout(c[9][1]));
halfadder fa09_02 (.a(w[9][1]), .b(c[8][1]), .s(w[9][2]), .cout(c[9][2]));

// Weight 2^10
fulladder fa10_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][6]), .s(w[10][0]), .cout(c[10][0]));
fulladder fa10_01 (.a(w[10][0]), .b(p[3][4]), .cin(c[9][0]), .s(w[10][1]), .cout(c[10][1]));
halfadder ha10_02 (.a(w[10][1]), .b(c[9][1]), .s(w[10][2]), .cout(c[10][2]));

// Weight 2^11
fulladder fa11_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][7]), .s(w[11][0]), .cout(c[11][0]));
fulladder fa11_01 (.a(w[11][0]), .b(p[3][5]), .cin(c[10][0]), .s(w[11][1]), .cout(c[11][1]));
halfadder ha11_02 (.a(w[11][1]), .b(c[10][1]), .s(w[11][2]), .cout(c[11][2]));

// Weight 2^12
fulladder fa12_00 (.a(p[0][8]), .b(p[1][8]), .cin(p[2][8]), .s(w[12][0]), .cout(c[12][0]));
fulladder fa12_01 (.a(w[12][0]), .b(p[3][6]), .cin(c[11][0]), .s(w[12][1]), .cout(c[12][1]));
halfadder ha12_02 (.a(w[12][1]), .b(c[11][1]), .s(w[12][2]), .cout(c[12][2]));

// Weight 2^13
fulladder fa13_00 (.a(w[12][0]), .b(p[3][7]), .cin(c[12][0]), .s(w[13][0]), .cout(c[13][0]));
halfadder ha13_01 (.a(w[13][0]), .b(c[12][1]), .s(w[13][1]), .cout(c[13][1]));

// Weight 2^14
fulladder fa14_00 (.a(w[12][0]), .b(p[3][8]), .cin(c[12][0]), .s(w[14][0]), .cout(c[14][0]));
halfadder ha14_01 (.a(w[14][0]), .b(c[13][0]), .s(w[14][1]), .cout(c[14][1]));

// Weight 2^15
assign w[15][0] = w[14][0] ^ c[14][0];

// Use two chained CLA adders to perform the final addition
carrylookaheadadd8 cla0 (.a({w[7][2], w[6][2], w[5][1], w[4][1], w[3][0], w[2][0], p[0][1], p[0][0]}),
                         .b({c[6][2], c[5][1], c[4][1], c[3][0], c[2][0], b[3], 1'b0, b[1]}),
                         .cin(1'b0), .s(result[7:0]), .cout(cla0_cout));
carrylookaheadadd8 cla1 (.a({w[15][0], w[14][1], w[13][1], w[12][2], w[11][2], w[10][2], w[9][2], w[8][2]}),
                         .b({c[14][1], c[13][1], c[12][2], c[11][2], c[10][2], c[9][2], c[8][2], c[7][2]}),
                         .cin(cla0_cout), .s(result[15:8]));
endmodule
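A usage note added for clarity: the Wallace tree above, and the Dadda tree that follows, operate on signed operands, so any behavioural reference used to check them in simulation should be the signed product $signed(a) * $signed(b) rather than the unsigned product used for the array multiplier earlier; apart from that change, the same style of comparison wrapper applies unchanged.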
A.2.4. Dadda Tree Multiplier

// 8-bit signed Dadda Tree multiplier
module daddatreemultiplier8 (input wire [7:0] a,
                             input wire [7:0] b,
                             output wire [15:0] result);

wire [8:0] p [3:0];
wire [2:0] c [15:2];
wire [2:0] w [15:2];
wire cla0_cout;

// Use the modified Booth encoder to generate the partial products
boothencoder8 pp (.a(a), .b(b), .p00(p[0]), .p01(p[1]), .p02(p[2]), .p03(p[3]));

// Sum the partial products by wire weight until only two wires are left for each weight
// Weight 2^2
halfadder ha02_02 (.a(p[0][2]), .b(p[1][0]), .s(w[2][0]), .cout(c[2][0]));

// Weight 2^3
halfadder ha03_02 (.a(p[0][3]), .b(p[1][1]), .s(w[3][0]), .cout(c[3][0]));

// Weight 2^4
halfadder ha04_01 (.a(p[0][4]), .b(p[1][2]), .s(w[4][0]), .cout(c[4][0]));
fulladder fa04_02 (.a(p[2][0]), .b(b[5]), .cin(c[3][0]), .s(w[4][1]), .cout(c[4][1]));

// Weight 2^5
halfadder ha05_01 (.a(p[0][5]), .b(p[1][3]), .s(w[5][0]), .cout(c[5][0]));
fulladder fa05_02 (.a(p[2][1]), .b(c[4][0]), .cin(c[4][1]), .s(w[5][1]), .cout(c[5][1]));

// Weight 2^6
halfadder ha06_00 (.a(p[0][6]), .b(p[1][4]), .s(w[6][0]), .cout(c[6][0]));
fulladder fa06_01 (.a(p[2][2]), .b(p[3][0]), .cin(b[7]), .s(w[6][1]), .cout(c[6][1]));
fulladder fa06_02 (.a(w[6][0]), .b(c[5][0]), .cin(c[5][1]), .s(w[6][2]), .cout(c[6][2]));

// Weight 2^7
fulladder fa07_00 (.a(p[0][7]), .b(p[1][5]), .cin(p[2][3]), .s(w[7][0]), .cout(c[7][0]));
halfadder ha07_01 (.a(p[3][1]), .b(c[6][0]), .s(w[7][1]), .cout(c[7][1]));
fulladder fa07_02 (.a(w[7][0]), .b(c[6][1]), .cin(c[6][2]), .s(w[7][2]), .cout(c[7][2]));

// Weight 2^8
halfadder ha08_00 (.a(p[0][8]), .b(p[1][6]), .s(w[8][0]), .cout(c[8][0]));
fulladder fa08_01 (.a(p[2][4]), .b(p[3][2]), .cin(c[7][0]), .s(w[8][1]), .cout(c[8][1]));
fulladder fa08_02 (.a(w[8][0]), .b(c[7][1]), .cin(c[7][2]), .s(w[8][2]), .cout(c[8][2]));