
LOW POWER VLSI DESIGN FOR MULTIMEDIA APPLICATIONS USING SPS TECHNIQUE
NCET Page 1
CHAPTER 1
INTRODUCTION
1. Introduction:
Power dissipation is recognized as a critical parameter in modern VLSI design. To
keep pace with Moore’s law and to produce consumer electronics with longer battery backup and
less weight, low-power VLSI design is necessary.
Fast multipliers are essential parts of digital signal processing systems. The speed of the
multiply operation is of great importance in digital signal processing as well as in
general-purpose processors today, especially since media processing took off. In the past,
multiplication was generally implemented via a sequence of addition, subtraction, and shift
operations. Multiplication can be considered as a series of repeated additions. The number to be
added is the multiplicand, the number of times that it is added is the multiplier, and the result is
the product. Each step of addition generates a partial product. In most computers, the operands
usually contain the same number of bits. When the operands are interpreted as integers, the
product is generally twice the length of the operands in order to preserve the information content.
The repeated addition method suggested by the arithmetic definition is so slow that it is
almost always replaced by an algorithm that makes use of positional representation. It is possible
to decompose multipliers into two parts. The first part is dedicated to the generation of partial
products, and the second one collects and adds them.
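The difference between repeated addition and the positional (shift-and-add) method can be sketched in a few lines of Python. This is only an illustrative software model under the assumption of unsigned operands; the function names and the `width` parameter are hypothetical and do not come from the design described here.

```python
def repeated_add_multiply(multiplicand: int, multiplier: int) -> int:
    """Multiplication by the arithmetic definition: add the multiplicand
    'multiplier' times. The number of steps grows with the multiplier value."""
    total = 0
    for _ in range(multiplier):
        total += multiplicand
    return total


def shift_add_multiply(multiplicand: int, multiplier: int, width: int = 8) -> int:
    """Positional method: one partial product per multiplier bit.
    Part 1 generates the partial product, part 2 accumulates it."""
    product = 0
    for i in range(width):
        bit = (multiplier >> i) & 1                      # i-th multiplier bit
        partial_product = (multiplicand << i) if bit else 0
        product += partial_product                       # accumulate shifted partial product
    return product
```

The positional version needs only `width` steps regardless of the operand values, which is why it replaces repeated addition in practice.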
The basic multiplication principle is twofold: evaluation of partial products and
accumulation of the shifted partial products. It is performed by successive additions of the
columns of the shifted partial product matrix. The multiplier is successively shifted and gates
the appropriate bit of the multiplicand. The delayed, gated instances of the multiplicand must all
be in the same column of the shifted partial product matrix. They are then added to form the
product bit for the particular form. Multiplication is therefore a multi-operand operation. To
extend the multiplication to both signed and unsigned numbers, a convenient number system
is the two’s complement representation.
Multipliers are key components of many high performance systems such as FIR filters,
microprocessors, digital signal processors, etc. A system’s performance is generally determined
by the performance of the multiplier because the multiplier is generally the slowest element in
the system. Furthermore, it is generally the most area consuming. Hence, optimizing the speed
and area of the multiplier is a major design issue. However, area and speed are usually
conflicting constraints, so improving speed mostly results in larger area. As a result, a whole
spectrum of multipliers with different area-speed trade-offs has been designed, ranging from fully
parallel multipliers at one end to fully serial multipliers at the other. In between are digit serial
multipliers, where single digits consisting of several bits are operated on. These multipliers have
moderate performance in both speed and area. However, existing digit serial multipliers have been
plagued by complicated switching systems and/or irregularities in design. Radix 2^n multipliers,
which operate on digits in a parallel fashion instead of bits, bring the pipelining to the digit
level and avoid most of the above problems. They were introduced by M. K. Ibrahim in 1993. These
structures are iterative and modular. The pipelining done at the digit level brings the benefit of
constant operation speed irrespective of the size of the multiplier. The clock speed is determined
only by the digit size, which is already fixed before the design is implemented.
The growing market for fast floating-point co-processors, digital signal processing chips,
and graphics processors has created a demand for high speed, area-efficient multipliers. Current
architectures range from small, low-performance shift and add multipliers, to large, high-
performance array and tree multipliers. Conventional linear array multipliers achieve high
performance in a regular structure, but require large amounts of silicon. Tree structures achieve
even higher performance than linear arrays but the tree interconnection is more complex and less
regular, making them even larger than linear arrays. Ideally, one would want the speed benefits
of a tree structure, the regularity of an array multiplier, and the small size of a shift and add
multiplier.
To reduce the size of the multiplier a partial tree is used together with a 4-2 carry-save
accumulator placed at its outputs to iteratively accumulate the partial products. This allows a full
multiplier to be built in a fraction of the area required by a full array. Higher performance is
achieved by increasing the hardware utilization of the partial 4-2 tree through pipelining. To
ensure optimal performance of the pipelined 4-2 tree, the clock frequency must be tightly
controlled to match the delay of the 4-2 adder pipe stages.
Figure 2.2 Minimal Iterative Structures

In an attempt to increase the performance of the minimal iterative structure, additional rows
of CSAs could be added to make a bigger array. For example, the addition of one row of CSAs
to the minimal structure would yield a partial array with two rows of CSAs. This structure
provides two advantages over the single row of CSA cells:
1) It reduces the required clock frequency, and
2) It requires only half as many latch delays.
It is important to note that although the number of CSAs has been doubled, the latency
was reduced only by halving the number of latch delays. The number of CSA delays remains the
same. Thus, assuming the latch delays are small relative to the CSA delays, increasing the depth
of the partial array by adding rows of CSAs in a linear structure yields only a slight
increase in performance.
2.4 Multiplication of Unsigned and Signed Numbers:
Multiplication is less common than addition, but is still essential for microprocessors,
digital signal processors, and graphics engines. The most basic form of multiplication consists of
forming the product of two unsigned (positive) binary numbers. This can be accomplished
through the traditional technique taught in primary school, simplified to base 2. For example, the
multiplication of two positive 6-bit binary integers, 25₁₀ and 39₁₀, proceeds as shown in Figure
2.3. M × N-bit multiplication P = Y × X can be viewed as forming N partial products of M bits
each, and then summing the appropriately shifted partial products to produce an M + N-bit result
P. Multiplication of two single bits is equivalent to a logical AND operation. Therefore, generating
partial products consists of the logical ANDing of the appropriate bits of the multiplier and
multiplicand. Each column of partial products must then be added and, if necessary, any carry
values passed to the next column. We denote the multiplicand as Y = {y_{M–1}, y_{M–2}, …, y_1, y_0}
and the multiplier as X = {x_{N–1}, x_{N–2}, …, x_1, x_0}. For unsigned multiplication, the product is
given in EQ (2.1). Figure 2.4 illustrates the generation, shifting, and summing of partial products
in a 6 × 6-bit multiplier.
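The generation and summation of partial products can be illustrated with a short Python sketch. It assumes the usual double-sum form of the unsigned product, P = Σ_i Σ_j x_i · y_j · 2^(i+j), and uses the 25 × 39 example from the text. The function name `partial_products` is hypothetical.

```python
def partial_products(y: int, x: int, m: int, n: int) -> list[int]:
    """Return the N shifted partial products of an M x N-bit unsigned multiply.
    y is the M-bit multiplicand, x is the N-bit multiplier."""
    rows = []
    for i in range(n):
        x_i = (x >> i) & 1
        row = 0
        for j in range(m):
            y_j = (y >> j) & 1
            row |= (x_i & y_j) << j    # one AND gate per partial product bit
        rows.append(row << i)          # shift the row by the weight of x_i
    return rows


rows = partial_products(25, 39, 6, 6)  # 25 = 011001, 39 = 100111
product = sum(rows)                    # summing the shifted rows gives the product
```

Here `rows` holds six partial products, two of which are zero because bits 3 and 4 of the multiplier 39 are zero, and `product` fits in M + N = 12 bits.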

Fig 2.3 multiplication of two positive 6-bit binary integers
Fig 2.4 generation, shifting, and summing of partial products in a 6 × 6-bit multiplier
Large multiplications can be more conveniently illustrated using dot diagrams. Figure 2.5 shows
a dot diagram for a simple 16 × 16 multiplier. Each dot represents a placeholder for a single bit
that can be a 0 or 1. The partial products are represented by a horizontal boxed row of dots,
shifted according to their weight. The multiplier bits used to generate the partial products are
shown on the right.
There are a number of techniques that can be used to perform multiplication.
In general, the choice is based upon factors such as latency, throughput, energy, area, and
design complexity. An obvious approach is to use an M + 1-bit carry-propagate adder (CPA)
to add the first two partial products, then another CPA to add the third partial product to
the running sum, and so forth. Such an approach requires N – 1 CPAs and is slow, even if a fast
CPA is employed. More efficient parallel approaches use some sort of array or tree of full adders
to sum the partial products. We begin with a simple array for unsigned multipliers, and then
modify the array to handle signed two’s complement numbers using the Baugh-Wooley
algorithm. The number of partial products to sum can be reduced using Booth encoding and the
number of logic levels required to perform the summation can be reduced with Wallace trees.
Unfortunately, Wallace trees are complex to lay out and have long, irregular wires, so hybrid
array/tree structures may be more attractive. For completeness, we consider a serial multiplier
architecture. This was once popular when gates were relatively expensive, but is now rarely
necessary.
Fig 2.5 dot diagram for a simple 16 × 16 multiplier
2.4.1 Unsigned Array Multiplication:
Fast multipliers use carry-save adders (CSAs) to sum the partial products.
A CSA typically has a delay of 1.5–2 FO4 inverters independent of the width of the partial
product, while a carry-propagate adder (CPA) tends to have a delay of 4–15+ FO4 inverters
depending on the width, architecture, and circuit family. Figure 2.6 shows a 4 × 4 array multiplier
for unsigned numbers using an array of CSAs. Each cell contains a 2-input AND gate that forms
a partial product and a full adder (CSA) to add the partial product into the running sum. The first
row converts the first partial product into carry-save redundant form. Each later row uses the
CSA to add the corresponding partial product to the carry-save redundant result of the previous
row and generate a carry-save redundant result. The least significant N output bits are available as
sum outputs directly from CSAs. The most significant output bits arrive in carry-save redundant
form and require an M-bit carry-propagate adder to convert into regular binary form. In
Figure 2.6, the CPA is implemented as a carry-ripple adder. The array is regular in structure and
uses a single type of cell, so it is easy to design and lay out. Assuming the carry output is faster
than the sum output in a CSA, the critical path through the array is marked on
the figure with a dashed line. The adder can easily be pipelined with the placement of registers
between rows. In practice, circuits are assigned rectangular blocks in the floorplan, so the
parallelogram shape wastes space. Figure 2.7 shows the same multiplier squashed to fit a rectangular
block.

Fig 2.6 Array Multiplier
A key element of the design is a compact CSA. This not only benefits area but also helps
performance because it leads to short wires with low wire capacitance. An ideal CSA design has
approximately equal sum and carry delays because the greater of these two delays limits
performance. The mirror adder is commonly used for its compact layout even though the sum
delay exceeds the carry delay. The sum output can be connected to the faster carry input to
partially compensate. Note that the first row of CSAs adds the first partial product to a pair of 0s.
This leads to a regular structure, but is inefficient. At a slight cost to regularity, the first row of
CSAs can be used to add the first three partial products together. This reduces the number of
rows by two and correspondingly reduces the adder propagation delay. Yet another way to
improve the multiplier array performance is to replace the bottom row with a faster CPA such as
a lookahead or tree adder. In summary, the critical path of an array multiplier involves N–2
CSAs and a CPA.
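The row-by-row behavior of the array can be modeled in software as follows. This is a behavioral sketch only: it assumes word-wide bitwise operations stand in for a row of CSA cells, it uses the simple regular form in which every partial product (including the first) passes through a CSA row, and the function names are hypothetical.

```python
def csa_row(a: int, b: int, c: int) -> tuple[int, int]:
    """One row of carry-save adders applied bitwise to three words.
    Returns (sum_word, carry_word) such that a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # majority, one column to the left
    return sum_word, carry_word


def array_multiply(y: int, x: int, n: int) -> int:
    """Unsigned multiply of y by the n-bit multiplier x using carry-save rows."""
    sum_word, carry_word = 0, 0
    for i in range(n):
        partial_product = (y << i) if (x >> i) & 1 else 0   # AND gating of Y by x_i
        sum_word, carry_word = csa_row(sum_word, carry_word, partial_product)
    return sum_word + carry_word   # the final carry-propagate adder (CPA)
```

No carry ripples inside the loop; the only carry propagation happens in the final addition, which mirrors the role of the CPA at the bottom of the array.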
Fig 2.7 Rectangular Multiplier
2.4.2 Two’s Complement Array Multiplication:
Multiplication of two’s complement numbers at first might seem more difficult because some
partial products are negative and must be subtracted. Recall that the most significant bit of a
two’s complement number has a negative weight. Hence, the product is
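A reconstruction from the negative-weight definition above (assumed, not confirmed, to match the typeset form of EQ (2.2)) is obtained by writing each operand with a negatively weighted most significant bit and expanding the product:

```latex
\begin{aligned}
P &= \left(-y_{M-1}2^{M-1} + \sum_{j=0}^{M-2} y_j 2^{j}\right)
     \left(-x_{N-1}2^{N-1} + \sum_{i=0}^{N-2} x_i 2^{i}\right) \\
  &= \sum_{i=0}^{N-2}\sum_{j=0}^{M-2} x_i y_j 2^{i+j}
     + x_{N-1}\,y_{M-1}\,2^{M+N-2}
     - \left(\sum_{i=0}^{N-2} x_i\, y_{M-1}\, 2^{i+M-1}
           + \sum_{j=0}^{M-2} x_{N-1}\, y_j\, 2^{j+N-1}\right)
\end{aligned}
```

The first two terms are positive; the two sums inside the parentheses carry negative weight.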

In EQ (2.2), two of the partial products have negative weight and thus should be
subtracted rather than added. The Baugh-Wooley multiplier algorithm handles subtraction by
taking the two’s complement of the terms to be subtracted (i.e., inverting the bits and adding
one). Figure 2.8 shows the partial products that must be summed. The upper parallelogram
represents the unsigned multiplication of all but the most significant bits of the inputs. The next
row is a single bit corresponding to the product of the most significant bits. The next two pairs of
rows are the inversions of the terms to be subtracted.
Each term has implicit leading and trailing zeros, which are inverted to leading and
trailing ones. Extra ones must be added in the least significant column when taking the two’s
complement. The multiplier delay depends on the number of partial product rows to be summed.
The modified Baugh-Wooley multiplier reduces this number of partial products by precomputing
the sums of the constant ones and pushing some of the terms upward into extra columns. Figure
2.9 shows such an arrangement. The parallelogram shaped array can again be squashed into a
rectangle, as shown in Figure 2.10, giving a design almost identical to the unsigned multiplier of
Figure 2.7. The AND gates are replaced by NAND gates in the hatched cells and 1s are added in
place of 0s at two of the unused inputs. The signed and unsigned arrays are so similar that a
single array can be used for both purposes if XOR gates are used to conditionally invert some of
the terms depending on the mode.
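The modified Baugh-Wooley arrangement can be checked with a small Python model. The sketch below assumes a square N × N multiplier, in which the two precomputed constants work out to 2^N and 2^(2N–1) when the result is taken modulo 2^(2N). The function names are hypothetical, and the code models arithmetic only, not the cell layout.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low 'bits' bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def baugh_wooley(x: int, y: int, n: int) -> int:
    """N x N signed multiply using only additions of AND/NAND terms."""
    xb = [(x >> i) & 1 for i in range(n)]   # two's complement bits of x
    yb = [(y >> j) & 1 for j in range(n)]   # two's complement bits of y
    total = 0
    # Unsigned product of all but the most significant bits (AND cells).
    for i in range(n - 1):
        for j in range(n - 1):
            total += (xb[i] & yb[j]) << (i + j)
    # Product of the two most significant bits (positive weight).
    total += (xb[n - 1] & yb[n - 1]) << (2 * n - 2)
    # Negative-weight terms: inverted (NAND) instead of subtracted.
    for i in range(n - 1):
        total += (1 - (xb[i] & yb[n - 1])) << (i + n - 1)
    for j in range(n - 1):
        total += (1 - (xb[n - 1] & yb[j])) << (j + n - 1)
    # Precomputed sum of the constant ones (assumption: square array).
    total += (1 << n) + (1 << (2 * n - 1))
    return to_signed(total, 2 * n)
```

For example, `baugh_wooley(-3, 5, 4)` models a 4 × 4 signed multiply in which every term is added and none is subtracted.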
Fig 2.8 Partial products for two’s complement multiplier

Fig 2.9 Simplified partial products for two’s complement multiplier
Fig 2.10 Modified Baugh-Wooley two’s complement multiplier
2.5 Booth Encoding:
The array multipliers in the previous sections compute the partial products in a radix-2 manner,
i.e., by observing one bit of the multiplier at a time. Radix-2^r multipliers produce N/r partial
products, each of which depends on r bits of the multiplier. Fewer partial products lead to a
smaller and faster CSA array. For example, a radix-4 multiplier produces N/2 partial products.
Each partial product is 0, Y, 2Y, or 3Y, depending on a pair of bits of X. Computing 2Y is a simple
shift, but 3Y is a hard multiple requiring a slow carry-propagate addition of Y + 2Y before partial
product generation begins.
Booth encoding was originally proposed to accelerate serial multiplication. Modified Booth
encoding [MacSorley61] allows higher radix parallel operation without generating the hard 3Y
multiple by instead using negative partial products. Observe that 3Y = 4Y – Y and 2Y = 4Y – 2Y.
However, 4Y in a radix-4 multiplier array is equivalent to Y in the next row of the array that
carries four times the weight. Hence, partial products are chosen by considering a pair of bits
along with the most significant bit from the previous pair. If the most significant bit from the
previous pair is true, Y must be added to the current partial product. If the most significant bit
of the current pair is true, the current partial product is selected to be negative and the next
partial product is incremented.
Table 2.1 shows how the partial products are selected, based on bits of the multiplier.
Negative partial products are generated by taking the two’s complement of the multiplicand
(possibly left-shifted by one column for –2Y). An unsigned radix-4 Booth-encoded multiplier
requires ⌈(N + 1)/2⌉ partial products rather than N. Each partial product is M + 1 bits to
accommodate the 2Y and –2Y multiples. Even though X and Y are unsigned, the partial products
can be negative and must be sign extended properly. The Booth selects will be discussed further
after an example.
Table 2.1 Radix-4 modified Booth encoding values
In a typical radix-4 Booth-encoded multiplier design, each group of 3 bits (a pair, along with the
most significant bit of the previous pair) is encoded into several select lines (SINGLE_i,
DOUBLE_i, and NEG_i, given in the rightmost columns of Table 2.1) and driven across the partial
product row as shown in Figure 2.11. The multiplicand Y is distributed to all the rows. The select
lines control Booth selectors that choose the appropriate multiple of Y for each partial product.
The Booth selectors substitute for the AND gates of a simple array multiplier to determine the ith
partial product. Figure 2.11 shows a conventional Booth encoder and selector design. Y is
zero-extended to M + 1 bits. Depending on SINGLE_i and DOUBLE_i, the A22OI gate selects either 0,
Y, or 2Y. Negative partial products should be two’s-complemented (i.e., invert and add 1). If
NEG_i is asserted, the partial product is inverted. The extra 1 can be added in the least
significant column of the next row to avoid needing a CPA.
Even in an unsigned multiplier, negative partial products must be sign-extended to be
summed correctly. Figure 2.12 shows a 16-bit radix-4 Booth partial product array for an
unsigned multiplier using the dot diagram notation.
Fig 2.11 Radix-4 Booth encoder and selector

Fig 2.12 Radix-4 Booth-encoded partial products with sign extension
Each dot in the Booth-encoded multiplier is produced by a Booth selector rather than a
simple AND gate. Partial products 0–7 are 17 bits. Each partial product i is sign extended with
s_i = NEG_i = x_{2i+1}, which is 1 for negative multiples (those in the bottom half of Table 2.1) or 0
for positive multiples. Observe how an extra 1 is added to the least significant bit in the next row
to form the 2’s complement of negative multiples. Inverting the implicit leading zeros generates
leading ones on negative multiples. The extra terms increase the size of the multiplier. PP8 is
required in case PP7 is negative; this partial product is always 0 or Y because x16 and x17 are 0.
Hence, partial product 8 is only 16 bits.
Observe that the sign extension bits are all either 1s or 0s. If a single 1 is added to the least
significant position in a string of 1s, the result is a string of 0s plus a carry out of the top bit that
may be discarded. Therefore, the large number of s bits in each partial product can be replaced by
an equal number of constant 1s plus the inverse of s added to the least significant position, as
shown in Figure 2.13(a). These constants mostly can be optimized out of the array by
precomputing their sum. The simplified result is shown in Figure 2.13(b). As usual, it can be
squashed to fit a rectangular floorplan.
The critical path of the multiplier involves the Booth decoder, the select line drivers,
the Booth selector, approximately N/2 CSAs, and a final CPA.
Each partial product fills about M + 5 columns. 54 × 54-bit radix-4 Booth multipliers for IEEE
double-precision floating-point units are typically 20–50% smaller (and arguably up to 20%
faster) than nonencoded counterparts, so the technique is widely used. The multiplier requires
M × N/2 Booth selectors.
Because the selectors account for a substantial portion of the area and only a
small fraction of the critical path, they should be optimized for size over speed. For example,
one reported design is a sign select Booth encoder and selector that uses only 10 transistors per
selector bit at the expense of a more complex encoder. Another is a one-hot Booth encoder and
selector that chooses one of the six possible partial products using a transmission gate multiplexer.
Fig 2.13 Radix-4 Booth-encoded partial products with simplified sign extension
Booth Encoding Signed Multipliers:
Signed two’s complement multiplication is similar, but the
multiplicand may have been negative so sign extension must be done based on the sign bit of the
partial product, PP_{iM}. Figure 2.14 shows such an array, where the sign extension bit is e_i =
PP_{iM}. Also notice that PP8, which was either Y or 0 for unsigned multiplication, is always 0 and
can be omitted for signed multiplication because the multiplier x is sign-extended such that x17 =
x16 = x15. The same Booth selector and encoder can be employed, but Y should be
sign-extended rather than zero-extended to M + 1 bits.
Fig 2.14 Radix-4 Booth-encoded partial products for signed multiplication
Higher Radix Booth Encoding:
Large multipliers can use Booth encoding of higher radix. For
example, ordinary radix-8 multiplication reduces the number of partial products by a factor of 3,
but requires hard multiples of 3Y, 5Y, and 7Y. Radix-8 Booth encoding only requires the hard 3Y
multiple, as shown in Table 2.2. Although this requires a CPA before partial product generation,
it can be justified by the reduction in array size and delay. Higher-radix Booth encoding is
possible, but generating the other hard multiples appears not to be worthwhile for multipliers of
fewer than 64 bits. Similar techniques apply to sign-extending higher-radix multipliers.

Table 2.2 Radix-8 modified Booth encoding values
Column Addition:
The critical path in a multiplier involves summing the dots in each column. Observe that a CSA is
effectively a “ones counter” that adds the number of 1s on the A, B, and C inputs and encodes
them on the sum and carry outputs, as summarized in Table 2.3. A CSA is therefore also known
as a (3,2) counter because it converts three inputs into a count encoded in two outputs. The
carry-out is passed to the next more significant column, while a corresponding carry-in is received
from the previous column. This is called a horizontal path because it crosses columns. For
simplicity, a carry is represented as being passed directly down the column. Figure 2.15 shows a
dot diagram of an array multiplier column that sums N partial products sequentially using N–2
CSAs. For example, the 16 × 16 Booth-encoded multiplier from Figure 2.13(b) sums nine partial
products with seven levels of CSAs. The output is produced in carry-save redundant form suitable
for the final CPA.
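The "ones counter" view of a CSA can be verified by enumerating all input combinations. The short Python sketch below prints the eight rows of such a truth table; the function name and the printed column order are illustrative assumptions.

```python
from itertools import product


def counter_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """(3,2) counter: returns (carry, sum) so that 2*carry + sum == a + b + c."""
    s = a ^ b ^ c                          # sum output (weight 1)
    carry = (a & b) | (a & c) | (b & c)    # carry output (weight 2), a majority gate
    return carry, s


truth_table = []
for a, b, c in product((0, 1), repeat=3):
    carry, s = counter_3_2(a, b, c)
    truth_table.append((a, b, c, carry, s))
    print(a, b, c, "->", carry, s)         # the two outputs encode the number of 1s
```

In every row the two output bits, read as a binary number, equal the count of 1s on the three inputs.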

Table 2.3 CSA as a ones counter
Fig 2.15 Dot diagram for array multiplier
The column addition is slow because only one CSA is active at a time. One way to speed up the
column addition is to sum partial products in parallel rather than sequentially. Figure 2.16 shows
a Wallace tree using this approach [Wallace64]. The Wallace tree requires ⌈log_{3/2}(N/2)⌉
levels of (3,2) counters to reduce N inputs down to two carry-save redundant form outputs.
Even though the CSAs in the Wallace tree are shown in two dimensions, they are logically
packed into a single column of the multiplier. This leads to long and irregular wires along the
column to connect the CSAs. The wire capacitance increases the delay and energy of the multiplier,
and the wires can be difficult to lay out.
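The gain from summing in parallel can be estimated by counting levels. The sketch below is a simplified row-count model rather than a gate-level simulation: it assumes that at each level every full group of three rows is reduced to two by a (3,2) counter and leftover rows pass through unchanged. The function names are hypothetical.

```python
def wallace_levels(n: int) -> int:
    """Number of (3,2) counter levels needed to reduce n rows to 2."""
    levels = 0
    while n > 2:
        n = 2 * (n // 3) + (n % 3)   # each group of 3 rows becomes 2; leftovers pass through
        levels += 1
    return levels


def array_levels(n: int) -> int:
    """Sequential (array) column addition uses n - 2 CSA levels."""
    return max(n - 2, 0)
```

For the nine partial products of the 16 × 16 Booth-encoded example, the array needs seven CSA levels, while this model gives four levels for the tree (9 → 6 → 4 → 3 → 2).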
Fig 2.16 Dot diagram for Wallace tree multiplier

2.6 Compressor Trees
[4:2] compressors can be used in a binary tree to produce a more regular layout, as shown in
Figure 2.17. A [4:2] compressor takes four inputs of equal weight and produces two outputs. It
can be constructed from two (3,2) counters as shown in Figure 2.18. Along the way, it generates
an intermediate carry, t_i, into the next column and accepts a carry, t_{i–1}, from the previous
column, so it may more aptly be called a (5,3) counter. This horizontal path does not impact the
delay because the output of the top CSA in one column is the input of the bottom CSA in the
next column.
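A bit-level model of the two-CSA construction makes the (5,3) counter behavior explicit. This sketch assumes the wiring described above (top CSA on three of the inputs, bottom CSA on its sum, the fourth input, and the incoming intermediate carry). The function and signal names are hypothetical.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """(3,2) counter: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a: int, b: int, c: int, d: int, t_in: int) -> tuple[int, int, int]:
    """[4:2] compressor built from two CSAs. Returns (sum, carry, t_out)."""
    s1, t_out = full_adder(a, b, c)        # top CSA; t_out goes to the next column
    s, carry = full_adder(s1, d, t_in)     # bottom CSA; t_in comes from the previous column
    return s, carry, t_out


# Five inputs of weight 1 are encoded as one weight-1 and two weight-2 outputs.
checks = [
    a + b + c + d + t == s + 2 * (carry + t_out)
    for a, b, c, d, t in product((0, 1), repeat=5)
    for s, carry, t_out in [compressor_4_2(a, b, c, d, t)]
]
```

Because `t_out` depends only on `a`, `b`, and `c` and never on `t_in`, the intermediate carry cannot ripple across more than one column.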
Fig 2.17 Dot diagram for [4:2] tree multiplier
Fig 2.18 [4:2] compressor (a) implementation with two CSAs (b) symbol
The [4:2] CSA symbol shows only the primary inputs and outputs to emphasize the main
function of reducing four inputs to two outputs. Only ⌈log_2(N/2)⌉
levels of [4:2] compressors are required, although each has greater delay than a CSA. The
regular layout and routing also make the binary tree attractive. To see the benefits of a [4:2]
compressor, we introduce the notion of fast and slow inputs and outputs. Figure 2.19 shows a
simple gate-level CSA design. The
longest path through the CSA involves two levels of XOR2 to compute the sum. X is called a fast
input, while Y and Z are slow inputs because they pass through a second level of XOR. C is the
fast output because it involves a single gate delay, while S is the slow output because it involves
two gate delays. A [4:2] compressor might be expected to use four levels of XOR2s.
Fig 2.19 Gate-level carry-save adder
Fig 2.20 [4:2] compressors

Figure 2.20 shows various [4:2] compressor designs that reduce the critical path to only 3
XOR2s. In Figure 2.20(a), the slow output of the first CSA is connected to the fast input of the
second. In Figure 2.20(b), the [4:2] compressor has been merged into a single cell, allowing a
majority gate to be replaced with a multiplexer. In Figure 2.20(c), the initial XORs have been
replaced with 2-level XNOR circuits that allow some sharing of subfunctions, reducing the
transistor count. Figure 2.21 shows a transmission gate implementation of a [4:2] compressor.
It uses only 48 transistors, allowing for a smaller multiplier array with shorter wires. Note
that it uses three distinct XNOR circuit forms and two transmission gate multiplexers.
Fig 2.21 Transmission gate [4:2] compressor

Fig 2.22 16 × 16 Booth-encoded multiplier floorplans
(a) array (b) Wallace tree (c) [4:2] tree
Figure 2.22 compares floorplans of the 16 × 16 Booth-encoded array multiplier from Figure
2.15, the Wallace tree from Figure 2.16, and the [4:2] tree from Figure 2.17. Each row represents
a horizontal slice of the multiplier containing a Booth selector or a CSA. Vertical busses connect
CSAs. The Wallace tree has the most irregular and lengthy wiring. In practice, the parallelogram
may be squashed into a rectangular form to make better use of the space.
2.7 Three-Dimensional Method
The notion of connecting slow outputs to fast inputs generalizes
to compressors with more than four inputs. By examining the entire partial product array at
once, one can construct trees for each column that sum all of the partial products in the shortest
possible time. This approach is called the three-dimensional method (TDM) because it considers
the arrival time as a third dimension along with rows and columns. Figure 11.92 shows an
example of a 16 × 16 multiplier. The parallelogram at the top shows the dot diagram from Figure
2.13(b) containing nine partial product rows obtained through Booth encoding. The partial
products in each of the 32 columns must be summed to produce the 32-bit result. As we have
seen, this is done with a compressor to produce a pair of outputs, followed by a final CPA.

Table 2.4 Comparison of XOR levels in multiplier trees
2.8 Hybrid Multiplication
Arrays offer regular layout, but many levels of CSAs. Trees offer
fewer levels of CSAs, but less regular layout and some long wires. A number of hybrids have
been proposed that offer trade-offs between these two extremes. These include odd/even arrays,
arrays of arrays, balanced delay trees, overturned-staircase trees, and upper/lower left-to-right
leapfrog (ULLRF) trees. They can achieve nearly as few levels of logic as the Wallace tree while
offering more regular (and faster) wiring. None have caught on as distinctly better than [4:2]
trees.
The three steps of multiplication are partial product generation, partial product
reduction, and carry-propagate addition. A simple M × N multiplier generates N partial
products using AND gates. For multipliers of 16 or more bits, radix-4 Booth encoding is
typically used to cut the number of partial products in two, saving substantial area and power.
Some implementations find Booth encoding is faster, while others find it has little speed
benefit. The partial products are then reduced to a pair of numbers in carry-save redundant
form using an array or tree of CSAs. Trees have fewer levels of logic, but longer and less
regular wiring; nevertheless, most large multipliers use trees or hybrid structures. Pass
transistor Booth selectors and CSAs were popular in the 1990s, but the trend is toward
static CMOS as supply voltage scales. Finally, a CPA converts the result to nonredundant form.

CHAPTER 3
SPST MODIFIED BOOTH ENCODER
3.1. Spurious power suppression technique:
The figure below shows the five cases of a 16-bit addition in which spurious switching activities
occur. The data are separated into the Most Significant Part (MSP) and the Least Significant
Part (LSP). The 1st case illustrates a transient state in which spurious transitions of carry signals
occur in the MSP although the final result of the MSP is unchanged. The 2nd and the 3rd cases
describe the situations of adding a negative operand to a positive operand without and with
carry from the LSP, respectively. Moreover, the 4th and the 5th cases respectively demonstrate the
addition of two negative operands without and with carry-in from the LSP. In those cases, the results
of the MSP are predictable. Therefore, the computations in the MSP are useless and can be
neglected. To know whether the MSP affects the computation results or not, we need a
detection logic unit to detect the effective ranges of the inputs. The Boolean logical equations
shown below express the behavioral principles of the detection logic unit in the MSP circuits of
the SPST-based adder/subtractor:
Figure 2. Spurious transition cases in multimedia/ DSP processing
AMSP = A[15:8]; BMSP = B[15:8];
Aand = A[15] · A[14] · … · A[8];
Band = B[15] · B[14] · … · B[8];
Anor = NOT(A[15] + A[14] + … + A[8]);
Bnor = NOT(B[15] + B[14] + … + B[8]);
where A[m] and B[n] respectively denote the mth bit of the operand A and the nth bit of
the operand B, and AMSP and BMSP respectively denote the MSP parts, i.e. the 9th bit to the
16th bit, of the operands A and B. When the bits in AMSP and/or those in BMSP are all ones,
the value of Aand and/or that of Band respectively becomes one; when the bits in AMSP and/or
those in BMSP are all zeros, the value of Anor and/or that of Bnor respectively turns into one.
Being one of the three outputs of the detection logic unit, close denotes whether the MSP circuits
can be neglected or not. When the two input operands can be classified into one of the five classes
shown in Figure 1, the value of close becomes zero, which indicates that the MSP circuits can be
closed. Figure 1 also shows that it is necessary to compensate the sign bit of the computing
results. Accordingly, we derive the Karnaugh maps which lead to the Boolean equations (7) and (8)
for the Carr_ctrl and the sign signals, respectively. In equations (7) and (8), CLSP denotes the
carry propagated from the LSP circuits.
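The detection behavior described above can be illustrated with a behavioral Python model of a 16-bit SPST addition. Several parts are assumptions made for this sketch and are not taken from the text: the expression used for close (zero when both MSPs are all ones or all zeros), and the arithmetic compensation of the MSP result, which stands in for the Carr_ctrl and sign equations (7) and (8) that are not reproduced here. The function names are hypothetical.

```python
MASK16 = 0xFFFF


def detect(a: int, b: int) -> tuple[bool, bool, bool]:
    """Detection logic on the MSPs (bits 15..8). Returns (Aand, Band, close)."""
    a_msp, b_msp = (a >> 8) & 0xFF, (b >> 8) & 0xFF
    a_and, b_and = a_msp == 0xFF, b_msp == 0xFF      # all ones
    a_nor, b_nor = a_msp == 0x00, b_msp == 0x00      # all zeros
    # Assumption: close is 0 (MSP can be turned off) when both MSPs are trivial.
    close = not ((a_and or a_nor) and (b_and or b_nor))
    return a_and, b_and, close


def spst_add(a: int, b: int) -> int:
    """16-bit addition that skips the MSP adder whenever close is 0."""
    a &= MASK16
    b &= MASK16
    a_and, b_and, close = detect(a, b)
    lsp = (a & 0xFF) + (b & 0xFF)                    # LSP adder always runs
    c_lsp = lsp >> 8                                 # carry from the LSP
    if close:
        msp = ((a >> 8) + (b >> 8) + c_lsp) & 0xFF   # normal MSP addition
    else:
        # MSP adder inputs would be held at zero; the result is compensated.
        # An all-ones MSP contributes -1 (mod 256), an all-zeros MSP contributes 0.
        msp = (c_lsp - int(a_and) - int(b_and)) & 0xFF
    return (msp << 8) | (lsp & 0xFF)
```

For operands such as small positive or small negative values, the model produces the same 16-bit result as a full adder while leaving the MSP addition unused.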

Fig 3.1 shows a 16-bit adder/subtractor design example based on the proposed SPST. In
this example, the 16-bit adder/subtractor is divided into MSP and LSP at the place between the
8th bit and the 9th bit. Latches implemented by simple AND gates are used to control the input
data of the MSP. When the MSP is necessary, the input data of the MSP remain the same as usual;
when the MSP is negligible, the input data of the MSP become zeros to avoid switching power
consumption. From the derived Boolean equations (1) to (8), the detection logic unit of the SPST
is designed as shown in figure 4. The use of the MSP can be determined by whether the input data
of the MSP should be latched or not. Moreover, we add three 1-bit registers to control the
assertion of the close, sign, and Carr_ctrl signals in order to further decrease the glitch signals
occurring in the cascaded circuits which are usually adopted in VLSI architectures designed for
video coding.

Fig 3.1 16-bit adder/subtractor design example
Fig 3.1 shows a 16-bit adder/subtractor design example adopting the proposed SPST. In this
example, the 16-bit adder/subtractor is divided into MSP and LSP between the eighth and the
ninth bits. Latches implemented by simple AND gates are used to control the input data of the
MSP. When the MSP is necessary, the input data of the MSP remain unchanged. However, when the
MSP is negligible, the input data of the MSP become zeros to avoid glitching power
consumption. The two operands of the MSP enter the detection-logic unit as well as the
adder/subtractor, so that the detection-logic unit can decide whether to turn off the MSP or not.
Based on the derived Boolean equations (1) to (8), the detection-logic unit of the SPST is shown in
Fig. 6(a), which can determine whether the input data of the MSP should be latched or not.
Moreover, we propose a novel glitch-diminishing technique that adds three 1-bit registers to
control the assertion of the close, sign, and Carr_ctrl signals to further decrease the transient
signals occurring in the cascaded circuits which are usually adopted in VLSI architectures designed
for multimedia/DSP applications. The timing diagram is shown in Fig. 6(b). A certain amount of
delay is used to assert the close, sign, and Carr_ctrl signals after the period of data transition,
which is achieved by controlling the three 1-bit registers at the outputs of the detection-logic
unit.
Hence, the transients of the detection-logic unit can be filtered out; thus, the data latches
shown in Fig can prevent the glitch signals from flowing into the MSP with tiny cost. The data

NCET Page 28
transient time and the earliest required time of all the inputs are also illustrated. The delay should
be set in the range of, which is shown as the shadow area in Fig, to filter out the glitch signals as
well as to keep the computation results correct. Based on Figs. 5 and 6, the timing issue of the
SPST is analyzed as follows.
3.1.1. When the detection-logic unit turns off the MSP:
At this moment, the outputs of the MSP are directly compensated by the SE unit; therefore, the
time saved from skipping the computations in the MSP circuits shall cancel out the delay caused
by the detection-logic unit.
3.1.2. When the detection-logic unit turns on the MSP:
The MSP circuits must wait for the notification of the detection-logic unit to turn on the data
latches to let the data in. Hence, the delay caused by the detection-logic unit will contribute to
the delay of the whole combinational circuitry, i.e., the 16-bit adder/subtractor in this design
example.
3.1.3. When the detection-logic unit keeps its decision:
No matter whether the last decision is turning on or turning off the MSP, the delay of the
detection logic is negligible because the path of the combinational circuitry (i.e., the 16-bit
adder/subtractor in this design example) remains the same. From the analysis above, we
know that the total delay is affected only when the detection-logic unit turns on the MSP.
Hence, the detection-logic unit should be a speed-oriented design. When the SPST is applied
to combinational circuitries, we should first determine the longest transitions of the cross
sections of interest in each combinational circuitry, which are a timing characteristic and are also
related to the adopted technology. The longest transitions can be obtained by analyzing the timing
differences between the earliest-arriving and the latest-arriving signals at the cross sections of a
combinational circuitry.

3.2. MAC
3.2.1 Block Diagram of MAC:
In this project, a new architecture for a high-speed MAC is proposed. In this MAC, the
computations of multiplication and accumulation are combined and a hybrid-type CSA structure
is proposed to reduce the critical path and improve the output rate. It uses the modified Booth
algorithm (MBA) based on the 1's complement number system. A modified array structure for the
sign bits is used to increase the density of the operands. A carry look-ahead adder (CLA) is
inserted in the CSA tree to reduce the number of bits in the final adder. In addition, in order to
increase the output rate by optimizing the pipeline efficiency, intermediate calculation results are
accumulated in the form of sum and carry instead of the final adder outputs.
A multiplier can be divided into three operational steps. The first is radix-2 Booth
encoding, in which a partial product is generated from the multiplicand X and the multiplier Y.
The second is the adder array or partial product compression, which adds all partial products and
converts them into the form of sum and carry. The last is the final addition, in which the final
multiplication result is produced by adding the sum and the carry. If the process of accumulating
the multiplied results is included, a MAC consists of four steps, as shown in Fig. 1, which shows
the operational steps explicitly.

Fig 3.2 MAC Operational steps
3.2.2. Proposed MAC Architecture:
In this section, the expression for the new arithmetic will be derived from the equations of
the standard design. From this result, a VLSI architecture for the new MAC will be proposed. In
addition, a hybrid-type CSA architecture that can satisfy the operation of the proposed MAC
will be proposed.
3.3. Radix-4 modified Booth's algorithm:
Booth's algorithm is simple but powerful. The speed of the VMFU depends on the number
of partial products and on the speed at which the partial products are accumulated. Booth's
algorithm lets us reduce the number of partial products. We choose the radix-4 algorithm for the
reasons below.
• The original Booth's algorithm has an inefficient case: 17 partial products are generated in a
16bit x 16bit signed or unsigned multiplication.
• The modified Booth's radix-8 algorithm has a prohibitive encoding time in a 16bit x 16bit multiplication.
The radix-8 algorithm has a 3x term, which means that a partial product cannot be generated by
shifting alone. Therefore, 2x + 1x is needed in the encoding process. One solution is to handle
an additional 1x term in the Wallace tree. However, a large Wallace tree has problems of its own.
A radix-4 modified Booth's algorithm: Booth's radix-4 algorithm is widely used to reduce
the area of the multiplier and to increase its speed. Grouping 3 bits of the multiplier with
overlapping halves the number of partial products, which improves the system speed. The radix-4
modified Booth's algorithm is shown below:
• x-1 = 0: insert a 0 on the right side of the LSB of the multiplier.
• Start grouping each 3 bits with overlapping from x-1.
• If the number of multiplier bits is odd, add an extra 1 bit on the left side of the MSB.
• Generate the partial product from the truth table.
• When a new partial product is generated, each partial product is added with a 2-bit left
shift in regular sequence.
x: multiplicand y: multiplier
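The grouping steps listed above can be sketched in Python as follows. This is a minimal software model, not the encoder circuit of this design: the function names are hypothetical, the digit rule -2*b2 + b1 + b0 is the standard radix-4 recoding, and the extra bit added for an odd width is assumed to be a copy of the sign bit.

```python
def booth_radix4_digits(y, width):
    """Recode a `width`-bit two's-complement pattern into digits in {-2..2}."""
    y &= (1 << width) - 1
    if width % 2:                          # odd width: add one extra bit (assumed sign copy)
        sign = (y >> (width - 1)) & 1
        y |= sign << width
        width += 1
    y_ext = y << 1                         # insert 0 to the right of the LSB
    digits = []
    for i in range(0, width, 2):           # overlapping 3-bit groups, step of 2 bits
        group = (y_ext >> i) & 0b111
        b_hi, b_mid, b_lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits


def booth_multiply(x, y, width):
    """Signed product of x and y, where y fits in `width` bits."""
    digits = booth_radix4_digits(y, width)
    # Each partial product is digit * x, shifted two more bits than the previous one.
    return sum((d * x) << (2 * k) for k, d in enumerate(digits))
```

A 16-bit multiplier yields eight digits, that is, eight partial products instead of sixteen.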
3.4. Sign or zero extension
Our MAC supports signed or unsigned multiplication, and the produced result is 64bit,
which is stored in 2 special 32bit registers. First, the MAC receives a multiplicand and a multiplier,
but Booth's radix-4 algorithm treats its 16bit operands as signed numbers. Hence, an extension bit is
required so that an operand can be expressed as a signed number. The core idea is that a 16bit
unsigned number can be expressed by a 33bit signed number. The 17 partial products are generated
in the 33bit x 33bit case (16 partial products in the 32bit x 32bit case). Here is an example of
signed and unsigned multiplication. When x (multiplicand) is 3bit 111 and y (multiplier) is 3bit 111,
the signed and unsigned results are different. In the signed case x × y = 1 (-1 x -1 = 1) and in the
unsigned case x × y = 49 (7 x 7 = 49).
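The 3-bit example above can be checked with a short Python sketch. The helper names are hypothetical, and extending by exactly one bit is an illustrative choice; it does not model the specific operand widths quoted in the text.

```python
def as_unsigned(bits, width):
    """Interpret the low `width` bits as an unsigned number."""
    return bits & ((1 << width) - 1)


def as_signed(bits, width):
    """Interpret the low `width` bits as a two's-complement number."""
    value = bits & ((1 << width) - 1)
    return value - (1 << width) if value >> (width - 1) else value


def zero_extend_to_signed(bits, width):
    """View an unsigned operand as a (width + 1)-bit signed number by prepending a 0."""
    return as_signed(as_unsigned(bits, width), width + 1)


x = y = 0b111
signed_product = as_signed(x, 3) * as_signed(y, 3)          # (-1) * (-1)
unsigned_product = as_unsigned(x, 3) * as_unsigned(y, 3)    # 7 * 7
```

With the extension bit, the same signed multiplier handles the unsigned case: zero_extend_to_signed(0b111, 3) is 7 rather than -1.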
3.5. Carry-Save Adder
When three or more operands are to be added simultaneously using two-operand adders,
the time-consuming carry propagation must be repeated several times. If the number of operands
is ‘k’, then carries have to propagate (k-1) times (Weste & Harris, 3rd Ed). In carry-save
addition, we let the carry propagate only in the last step, while in all the other steps we generate
the partial sum and the sequence of carries separately. A CSA is capable of reducing the number of
operands to be added from 3 to 2 without any carry propagation. A CSA can be implemented in
different ways. In the simplest implementation, the basic element of the carry-save adder is the
combination of two half adders or a 1-bit full adder (Weste & Harris, 3rd Ed).
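A bitwise Python sketch of the 3-to-2 reduction is given below. It assumes non-negative integers of arbitrary width, and the function names are hypothetical. Each bit position behaves like one full adder, and the only carry-propagating addition is the final one.

```python
def carry_save_add(a, b, c):
    """3:2 compression: three operands in, (sum, carry) out, no carry ripple."""
    partial_sum = a ^ b ^ c                         # sum output of each full adder
    carries = ((a & b) | (b & c) | (a & c)) << 1    # carry output, one position higher
    return partial_sum, carries


def add_many(operands):
    """Add k non-negative integers with a CSA chain and one final adder."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)
    return s + c                                    # the single carry-propagate step
```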
3.6 Circuit Design Features
One of the most advanced types of MAC for general-purpose digital signal processing
has been proposed by Elguibaly. It is an architecture in which accumulation has been combined
with the carry save adder (CSA) tree that compresses partial products. In that architecture,
the critical path was reduced by eliminating the adder for accumulation and
decreasing the number of input bits in the final adder. While it has a better performance because
of the reduced critical path compared to the previous VMFU architectures, there is a need to
improve the output rate due to the use of the final adder results for accumulation. An
architecture that merges the adder block into the accumulator register in the VMFU operator was
proposed to provide the possibility of using two separate N/2-bit adders instead of one N-bit adder
to accumulate the MAC results. Recently, Zicari proposed an architecture that took a merging
technique to fully utilize the 4–2 compressor. It also took this compressor as the basic building
block for the multiplication circuit.
Fig 3.3 basic building blocks for the multiplication circuit.

CHAPTER 4
IMPLEMENTATION
4.1 Introduction to VMFU:
If an operation that multiplies two N-bit numbers and accumulates into a 2N-bit number, along with
addition, subtraction, Sum of Absolute Difference (SAD), and interpolation, is considered, the
critical path is determined by the 2N-bit accumulation operation. If a pipeline scheme is applied
for each step in the standard design of Fig. 4.1, the delay of the last accumulator must be reduced
in order to improve the performance of the MAC. The overall performance of the proposed
VMFU is improved by eliminating the accumulator itself by combining it with the CSA function.
If the accumulator has been eliminated, the critical path is then determined by the final adder in
the multiplier. The basic method to improve the performance of the final adder is to decrease the
number of input bits. In order to reduce this number of input bits, the multiple partial products
are compressed into a sum and a carry by the CSA. The number of bits of sums and carries to be
transferred to the final adder is reduced by adding the lower bits of sums and carries in advance,
within the range in which the overall performance will not be degraded. A 2-bit CLA is used to
add the lower bits in the CSA. In addition, to increase the output rate when pipelining is applied,
the sums and carries from the CSA are accumulated instead of the outputs from the final adder, in
such a manner that the sum and carry from the CSA in the previous cycle are fed back into the CSA. Due
to this feedback of both sum and carry, the number of inputs to the CSA increases compared to the
standard design. In order to handle the increase in the amount of data efficiently, the CSA
architecture is modified to treat the sign bit.
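The feedback of sum and carry described above can be modelled in a few lines of Python. This is a behavioral sketch only: the class name CarrySaveMac is hypothetical, the default accumulator width of 34 bits is taken from the accumulator size mentioned later in the text, the product is computed with an ordinary multiplication in place of the Booth encoder and CSA tree, and the 2-bit CLA for the lower bits is not modelled.

```python
class CarrySaveMac:
    """MAC whose accumulator is a (sum, carry) pair instead of a single word."""

    def __init__(self, acc_bits=34):
        self.mask = (1 << acc_bits) - 1
        self.sum = 0
        self.carry = 0

    def step(self, a, b):
        """One cycle: fold the new product into the fed-back sum and carry."""
        product = (a * b) & self.mask        # stands in for the partial-product tree
        s = self.sum ^ self.carry ^ product
        c = ((self.sum & self.carry)
             | (self.carry & product)
             | (self.sum & product)) << 1
        self.sum, self.carry = s & self.mask, c & self.mask

    def read(self):
        """Final adder: the only carry-propagating addition, done on readout."""
        return (self.sum + self.carry) & self.mask
```

Because step() never propagates a carry, its delay does not grow with the accumulator width; the full-width addition is paid once, in read().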

Fig 4.1 Versatile Multimedia Functional Unit
VMFU is composed of an adder, a multiplier and an accumulator. The adders
implemented are usually Carry-Select or Carry-Save adders, as speed is of utmost importance in DSP
(Chandrakasan, Sheng, & Brodersen, 1992 and Weste & Harris, 3rd Ed). One implementation of
the multiplier could be as a parallel array multiplier. The inputs for the VMFU are fetched
from a memory location and fed to the multiplier block, which performs the multiplication and
gives the result to the adder, which accumulates the result and then stores it into a
memory location. This entire process is to be achieved in a single clock cycle (Weste & Harris,
3rd Ed).

Fig 4.2 Architecture of MAC
Figure 4.2 shows the architecture of the MAC unit designed in this work. The
design consists of one 16-bit register, one 16-bit Modified Booth multiplier, a 33-bit
accumulator using ripple carry and two 16-bit accumulator registers. To multiply the values of A
and B, a Modified Booth multiplier is used instead of a conventional multiplier because the Modified
Booth multiplier can increase the MAC unit design speed and reduce multiplication complexity.
A Carry Save Adder (CSA) is used as an accumulator in this design. Together with the
utilization of the Wallace tree multiplier approach, the carry save adder in the final stage of the Modified
Booth multiplier and the Carry Save Adder as the accumulator, this VMFU unit design not only
reduces the standby power consumption but can also enhance the VMFU unit speed so as to
gain better system performance. The operation of the designed VMFU unit is as in Equation 2.1.
The product Ai x Bi is always fed back into the 34-bit carry save accumulator and then added
again with the next product Ai x Bi. This MAC unit is capable of multiplying and adding with
the previous product consecutively up to as many as eight times.
Operation: Output = Σ Ai Bi (2.1)
In this project, the design of a 16x16 multiplier unit is carried out that can perform accumulation
on a 34-bit number. This MAC unit has a 34-bit output and its operation is to add repeatedly the
multiplication results. The total design area is also inspected by observing the total count
of gates [Hardwires]. The power-delay product is calculated by multiplying the power consumption
result by the time delay.
Design of VMFU
In the majority of digital signal processing (DSP) applications the critical operations
usually involve many multiplications and/or accumulations. For real-time signal processing, a
high-speed and high-throughput Multiplier-Accumulator (MAC) is always a key to achieving a
high-performance digital signal processing system and versatile multimedia functional units. In
the last few years, the main consideration of MAC design has been to enhance its speed, because
speed and throughput rate are always the concern of the VMFU. But in the epoch of
personal communication, low power design also becomes another main design consideration,
because the battery energy available for these portable products limits the power
consumption of the system. Therefore, the main motivation of this work is to investigate various
pipelined multiplier/accumulator architectures and circuit design techniques which are suitable
for implementing high-throughput signal processing algorithms and at the same time achieve low
power consumption. A conventional VMFU unit consists of a (fast) multiplier and an
accumulator that contains the sum of the previous consecutive products. The function of the
VMFU unit is given by the following equation:
F = Σ Ai Bi (2.1)
The main goal of a VMFU design is to enhance the speed of the MAC unit, and at the same time
limit the power consumption. In a pipelined MAC circuit, the delay of a pipeline stage is the delay of a
1-bit full adder. Estimating this delay will assist in identifying the overall delay of the pipelined MAC. In
this work, a 1-bit full adder is designed. Area, power and delay are calculated for the full adder, based on
which the pipelined MAC unit is designed for low power.

4.2 Explanation
4.2.1. High-Speed Booth Encoded Parallel Multiplier Design:
Fast multipliers are essential parts of digital signal processing systems. The speed of the multiply
operation is of great importance in digital signal processing as well as in today's general-purpose
processors, especially since media processing took off. In the past, multiplication was generally
implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be
considered as a series of repeated additions. The number to be added is the multiplicand, the number of
times that it is added is the multiplier, and the result is the product. Each step of addition generates a
partial product. In most computers, the operands usually contain the same number of bits. When the
operands are interpreted as integers, the product is generally twice the length of the operands in order to
preserve the information content. The repeated addition method suggested by the arithmetic
definition is so slow that it is almost always replaced by an algorithm that makes use of positional
representation. It is possible to decompose multipliers into two parts. The first part is dedicated to the
generation of partial products, and the second one collects and adds them.
The basic multiplication principle is twofold, i.e., evaluation of partial products and accumulation
of the shifted partial products. It is performed by the successive additions of the columns of the shifted
partial product matrix. The ‘multiplier’ is successively shifted and gates the appropriate bit of the
‘multiplicand’. The delayed, gated instances of the multiplicand must all be in the same column of the
shifted partial product matrix. They are then added to form the product bit for the particular form.
Multiplication is therefore a multi-operand operation. The multiplication is to be extended to both
signed and unsigned numbers.
4.2.2. Modified Booth Encoder:
In order to achieve high-speed multiplication, multiplication algorithms using parallel counters,
such as the modified Booth algorithm, have been proposed, and some multipliers based on these algorithms
have been implemented for practical use. This type of multiplier operates much faster than an array
multiplier for longer operands because its computation time is proportional to the logarithm of the word
length of the operands.

Fig 4.3 Radix-4 Booth Encoding
Booth multiplication is a technique that allows for smaller, faster multiplication circuits, by
recoding the numbers that are multiplied. It is possible to reduce the number of partial products by half,
by using the technique of radix-4 Booth recoding. The basic idea is that, instead of shifting and adding for
every column of the multiplier term and multiplying by 1 or 0, we only take every second column, and
multiply by ±1, ±2, or 0, to obtain the same results. The advantage of this method is the halving of the
number of partial products. To Booth recode the multiplier term, we consider the bits in blocks of three,
such that each block overlaps the previous block by one bit. Grouping starts from the LSB, and the first
block only uses two bits of the multiplier. Figure 3 shows the grouping of bits from the multiplier term for
use in modified Booth encoding.
Fig.4.4 Grouping of bits from the multiplier term
Each block is decoded to generate the correct partial product. The encoding of the multiplier Y, using the
modified Booth algorithm, generates the following five signed digits: -2, -1, 0, +1, +2. Each encoded digit
in the multiplier performs a certain operation on the multiplicand, X, as illustrated in Table 4.1.

Table 4.1 Operation performed on the multiplicand, X, by each encoded digit in the multiplier
For the partial product generation, we adopt the Radix-4 Modified Booth algorithm to reduce the
number of partial products by roughly one half. For multiplication of 2’s complement numbers, the
two-bit encoding using this algorithm scans a triplet of bits. When the multiplier B is divided into groups of
two bits, the algorithm is applied to each group of divided bits.
Figure 4 shows a computing example of Booth multiplying two numbers “2AC9” and “006A”.
The shadow denotes that the numbers in this part of the Booth multiplication are all zero, so that this part of
the computations can be neglected. Saving those computations can significantly reduce the power
consumption caused by the transient signals. According to the analysis of the multiplication shown in
figure 4, we propose the SPST-equipped modified-Booth encoder, which is controlled by a detection unit.
The detection unit has one of the two operands as its input to decide whether the Booth encoder calculates
redundant computations. As shown in figure 9, the latches can, respectively, freeze the inputs of MUX-4
to MUX-7 or only those of MUX-6 to MUX-7 when PP4 to PP7 or PP6 to PP7 are zero, to reduce
the transition power dissipation. Figure 10 shows the Booth partial product generation circuit. It includes
AND/OR/EX-OR logic.
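The zero partial products in this example can be reproduced with a short Python sketch. It assumes that "006A" is the Booth-encoded operand and that the detection unit simply checks whether the upper bits of that operand are pure sign extension; the function names and the two bit ranges are this sketch's own choices, not the detection logic of figure 9.

```python
def booth_digits16(y):
    """Eight radix-4 Booth digits of a 16-bit multiplier pattern."""
    y_ext = (y & 0xFFFF) << 1                  # implicit 0 to the right of the LSB
    digits = []
    for k in range(8):
        t = (y_ext >> (2 * k)) & 0b111
        digits.append(-2 * (t >> 2) + ((t >> 1) & 1) + (t & 1))
    return digits


def frozen_partial_products(y):
    """Indices of the partial products that are certainly zero.

    PP4..PP7 are zero when bits 15..7 are all equal;
    PP6..PP7 are zero when bits 15..11 are all equal.
    """
    def uniform(lo):                           # bits 15..lo all 0 or all 1
        field = (y & 0xFFFF) >> lo
        return field == 0 or field == (1 << (16 - lo)) - 1

    if uniform(7):
        return [4, 5, 6, 7]
    if uniform(11):
        return [6, 7]
    return []
```

For 0x006A the digits are [-2, -1, -1, 2, 0, 0, 0, 0], so PP4 to PP7 are zero and the corresponding multiplexer inputs could be frozen.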

Fig.4.5 Illustration of multiplication using modified Booth encoding
The PP generator generates five candidates of the partial products, i.e., {-2A,-A, 0, A, 2A}. These
are then selected according to the Booth encoding results of the operand B. When the operand besides the
Booth encoded one has a small absolute value, there are opportunities to reduce the spurious power
dissipated in the compression tree.
Fig4.6 SPST equipped modified Booth encoder

4.2.3. Partial product generator:
Fig4.7 Booth partial product selector logic
The first step of the multiplication generates from A and X a set of bits whose weighted sum is the product P. For
unsigned multiplication, the weight of the most significant bit of P is positive, while in 2's complement it is negative.
The partial products are generated by ANDing ‘a’ and ‘b’, which are 4-bit vectors, as
shown in the figure. If we take a 4-bit multiplier and a 4-bit multiplicand, we get sixteen partial product bits,
grouped into four 4-bit partial products; the first partial product is stored in ‘q’. Similarly, the second,
third and fourth partial products are stored in the 4-bit vectors n, x and y.
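The AND-based generation described above can be sketched in Python as follows, assuming unsigned 4-bit operands. The function names are hypothetical; the variable names q, n, x and y in the usage follow the text.

```python
def and_array_partial_products(a, b, width=4):
    """Rows of the unsigned partial-product array: row i is a AND-ed with bit b[i]."""
    mask = (1 << width) - 1
    rows = []
    for i in range(width):
        b_i = (b >> i) & 1
        rows.append((a & mask) if b_i else 0)   # every bit of the row is a[j] & b[i]
    return rows


def product_from_rows(rows):
    """Weight row i by 2**i and add the rows."""
    return sum(row << i for i, row in enumerate(rows))


q, n, x, y = and_array_partial_products(0b1011, 0b0110)   # 11 times 6
```

Here q and y are zero because bits 0 and 3 of the multiplier are zero, while n and x both hold 1011.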
Fig.4.8 Booth partial products Generation

The second step of the multiplication reduces the partial products from the preceding step into two
numbers while preserving the weighted sum. The sought-after product P is the sum of those two numbers.
The two numbers are added during the third step. The "Wallace tree" synthesis follows Dadda's
algorithm, which assures the minimum number of counters. If, on top of that, we impose reducing as late as
(or as soon as) possible, then the solution is unique. The two binary numbers to be added during the third
step may also be seen as one number in CSA notation (2 bits per digit).
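As a small illustration of Dadda's scheme, the Python sketch below lists the column heights targeted at each reduction stage, using the usual Dadda sequence 2, 3, 4, 6, 9, 13, ... (each term is the previous one multiplied by 1.5 and rounded down). The function name is hypothetical and the sketch only counts stages; it does not place any counters.

```python
def dadda_stage_heights(num_rows):
    """Target heights for each reduction stage, ending with the final two rows."""
    if num_rows <= 2:
        return []                       # already two rows: nothing to reduce
    heights = [2]
    while True:
        nxt = heights[-1] * 3 // 2      # next term of the Dadda sequence
        if nxt >= num_rows:
            break
        heights.append(nxt)
    return heights[::-1]                # first stage first
```

For eight partial products the stages reduce the array to heights 6, 4, 3 and 2, i.e. four stages before the final adder.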
Fig.4.9 Booth single partial product selector logic
4.2.4. Truth Table of Modified Booth Encoder:
Multiplication consists of three steps: 1) the first step generates the partial products; 2) the
second step adds the generated partial products until only the last two rows remain; 3) the third step
computes the final multiplication result by adding the last two rows. The modified Booth algorithm
reduces the number of partial products by half in the first step. We used the modified Booth encoding
(MBE) scheme, which is known as the most efficient Booth encoding and decoding scheme.
Multiplying X by Y using the modified Booth algorithm starts from grouping Y by three bits and encoding
each group into one of {-2, -1, 0, 1, 2}. Table I shows the rules to generate the encoded signals by the MBE scheme and
Fig. 1 (a) shows the corresponding logic diagram. The Booth decoder generates the partial products using
the encoded signals as shown in Fig. 1.

Fig.4.10 Booth Encoder
Fig.4.11.Booth Decoder
The figure shows the generated partial products and sign extension scheme of the 8-bit modified Booth
multiplier. The partial products generated by the modified Booth algorithm are added in parallel using the
Wallace tree until only the last two rows remain. The final multiplication results are generated by adding
the last two rows. A carry propagation adder is usually used in this step.
Fig 4.12 Truth table for MBE Scheme

CHAPTER 5
5.1 Introduction to FPGA:
FPGA stands for Field Programmable Gate Array, which has an array of logic modules,
I/O modules and routing tracks (programmable interconnect). An FPGA can be configured by the end user
to implement specific circuitry. Speeds were up to 100 MHz, but at present speeds are in the GHz range.
Main applications are DSP, FPGA-based computers, logic emulation, ASIC and ASSP.
FPGAs are mainly programmed using SRAM (Static Random Access Memory). It is volatile, and
the main advantage of SRAM programming technology is re-configurability. Issues in FPGA
technology are the complexity of the logic element, clock support, IO support and interconnections
(routing).
5.2 Block diagram of FPGA:
An FPGA contains a two-dimensional array of logic blocks and interconnections between
logic blocks. Both the logic blocks and the interconnects are programmable. Logic blocks are
programmed to implement a desired function and the interconnects are programmed using the
switch boxes to connect the logic blocks.
To be more clear, if we want to implement a complex design (a CPU, for instance), then the
design is divided into small sub functions and each sub function is implemented using one logic
block. Now, to get our desired design (CPU), all the sub functions implemented in logic blocks
must be connected, and this is done by programming the interconnects. The internal structure of an FPGA is
depicted in the following figure.

Fig 5.1 Internal structure of an FPGA
FPGAs, an alternative to custom ICs, can be used to implement an entire System on
Chip (SoC). The main advantage of an FPGA is the ability to reprogram it. The user can reprogram an FPGA
to implement a design, and this is done after the FPGA is manufactured. This gives rise to the name
“Field Programmable.”
Custom ICs are expensive and take a long time to design, so they are useful when
produced in bulk amounts. FPGAs, in contrast, are easy to implement within a short time with the help
of Computer Aided Design (CAD) tools (because there is no physical layout process, no mask
making, and no IC manufacturing).
Some disadvantages of FPGAs are that they are slow compared to custom ICs, they can’t
handle very complex designs, and they draw more power.
A Xilinx logic block consists of one Look Up Table (LUT) and one flip-flop. An LUT is
used to implement a number of different functions. The input lines to the logic block go into
the LUT and enable it. The output of the LUT gives the result of the logic function that it
implements, and the output of the logic block is the registered or unregistered output from the LUT.
SRAM is used to implement a LUT. A k-input logic function is implemented using a 2^k x
1 SRAM. The number of different possible functions for a k-input LUT is 2^(2^k). The advantage of
such an architecture is that it supports the implementation of so many logic functions; however, the
disadvantage is the unusually large number of memory cells required to implement such a logic
block when the number of inputs is large.
Figure 5.2 4-input LUT based implementation of logic block
LUT-based design provides for better logic block utilization. A k-input LUT-based logic
block can be implemented in a number of different ways, with a trade-off between performance and
logic density.
An n-LUT can be seen as a direct implementation of a function truth table. Each of the latches
holds the value of the function corresponding to one input combination. For example, a 2-LUT
can be used to implement 16 types of functions, such as AND, OR, A + not B, etc.
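The truth-table view of a LUT can be modelled directly in Python. The class below is a hypothetical illustration, not a vendor primitive: the stored cells play the role of the 2^k x 1 memory and the inputs form the address, with the first input assumed to be the most significant address bit.

```python
class Lut:
    """k-input LUT: a 2**k x 1 memory addressed by the input bits."""

    def __init__(self, k, truth_table):
        if len(truth_table) != 2 ** k:
            raise ValueError("truth table needs 2**k entries")
        self.k = k
        self.cells = list(truth_table)     # one stored bit per input combination

    def __call__(self, *inputs):
        address = 0
        for bit in inputs:                 # first input is the address MSB
            address = (address << 1) | (bit & 1)
        return self.cells[address]


def function_count(k):
    """Number of distinct Boolean functions of k inputs: 2**(2**k)."""
    return 2 ** (2 ** k)


and2 = Lut(2, [0, 0, 0, 1])                # A AND B
a_or_not_b = Lut(2, [1, 0, 1, 1])          # A + not B
```

Changing only the stored cells turns the same structure into any of the 16 two-input functions, which is what function_count(2) counts.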
Interconnects
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an FPGA can
be termed as a track.

Typically an FPGA has logic blocks, interconnects and switch blocks (Input/Output
blocks). Switch blocks lie in the periphery of logic blocks and interconnect. Wire segments are
connected to logic blocks through switch blocks. Depending on the required design, one logic
block is connected to another and so on.
5.3 FPGA Design Flow
In this part of the tutorial we give a short introduction to the FPGA design flow. A simplified
version of the design flow is given in the following diagram.
Fig 5.3 FPGA Design Flow
Design Entry
There are different techniques for design entry: schematic based, Hardware Description
Language (HDL) based, a combination of both, etc. The selection of a method depends on the design and
the designer. If the designer wants to deal more with hardware, then schematic entry is the better
choice. When the design is complex or the designer thinks of the design in an algorithmic way, then
HDL is the better choice. Language-based entry is faster but lags in performance and density.
HDLs represent a level of abstraction that can isolate the designers from the details of the
hardware implementation. Schematic-based entry gives designers much more visibility into the
hardware. It is the better choice for those who are hardware oriented. Another method, but rarely
used, is state machines. It is the better choice for designers who think of the design as a series of
states. But the tools for state machine entry are limited. In this documentation we are going to
deal with HDL-based design entry.
Synthesis
Synthesis is the process which translates VHDL or Verilog code into a device netlist format, i.e., a
complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design
contains more than one sub design, e.g., to implement a processor we need a CPU as one design
element and RAM as another and so on, then the synthesis process generates a netlist for each
design element. The synthesis process checks the code syntax and analyzes the hierarchy of the design,
which ensures that the design is optimized for the design architecture the designer has selected.
The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx® Synthesis
Technology (XST)).
Fig 5.4 FPGA Synthesis
Implementation
This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route

Translate:
This process combines all the input netlists and constraints into a logic design file. This
information is saved as an NGD (Native Generic Database) file. This can be done using the NGDBuild
program. Here, defining constraints is nothing but assigning the ports in the design to the
physical elements (e.g., pins, switches, buttons, etc.) of the targeted device and specifying the timing
requirements of the design. This information is stored in a file named UCF (User Constraints
File). Tools used to create or modify the UCF are PACE, Constraint Editor, etc.
Fig 5.5 FPGA Translate
Map
This process divides the whole circuit with logical elements into sub blocks such that they can
be fit into the FPGA logic blocks. That means the map process fits the logic defined by the NGD file
into the targeted FPGA elements (Configurable Logic Blocks (CLB), Input Output Blocks
(IOB)) and generates an NCD (Native Circuit Description) file which physically represents the
design mapped to the components of the FPGA. The MAP program is used for this purpose.

Fig 5.6 FPGA map
Place and Route:
The PAR program is used for this process. The place and route process places the sub blocks
from the map process into logic blocks according to the constraints and connects the logic
blocks. For example, if a sub block is placed in a logic block which is very near to an IO pin, then it may save
time but it may affect some other constraint. So the trade-off between all the constraints is taken
into account by the place and route process.
The PAR tool takes the mapped NCD file as input and produces a completely routed
NCD file as output. The output NCD file contains the routing information.
Fig 5.7 FPGA Place and route
Device Programming:
Now the design must be loaded onto the FPGA. But the design must first be converted to a
format that the FPGA can accept. The BITGEN program deals with the conversion. The routed
NCD file is given to the BITGEN program to generate a bit stream (a .BIT file) which can
be used to configure the target FPGA device. This can be done using a cable. The selection of the cable
depends on the design.
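The hand-off of files between the steps described above can be summarized in a small Python table. This is only a bookkeeping sketch built from the tool and file names mentioned in the text; it does not invoke any tool, and the helper names are hypothetical. The UCF constraints file is an additional input to the Translate step and is not part of the chain.

```python
FLOW = [
    # (stage, tool named in the text, input file type, output file type)
    ("Synthesis",          "XST",      "HDL", "NGC"),
    ("Translate",          "NGDBuild", "NGC", "NGD"),   # also reads the UCF
    ("Map",                "MAP",      "NGD", "NCD"),   # mapped NCD
    ("Place and Route",    "PAR",      "NCD", "NCD"),   # routed NCD
    ("Device Programming", "BITGEN",   "NCD", "BIT"),
]


def check_chain(flow):
    """True if every stage consumes the file type the previous stage produced."""
    return all(prev[3] == cur[2] for prev, cur in zip(flow, flow[1:]))


def artifacts(flow):
    """File types in the order they appear along the flow."""
    return [flow[0][2]] + [stage[3] for stage in flow]
```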
Behavioral Simulation (RTL Simulation):
This is the first of the simulation steps that are encountered throughout the hierarchy of the
design flow. This simulation is performed before the synthesis process to verify the RTL (behavioral)
code and to confirm that the design is functioning as intended. Behavioral simulation can be
performed on either VHDL or Verilog designs. In this process, signals and variables are
observed, procedures and functions are traced and breakpoints are set. This is a very fast
simulation and so allows the designer to change the HDL code within a short time period if the
required functionality is not met. Since the design is not yet synthesized to gate level, timing and
resource usage properties are still unknown.
5.4 Introduction to Hardware Description Language
Classical design methods relied on schematics and manual methods to design a circuit,
but today computer-based languages are widely used to design circuits of enormous size and
complexity. There are several reasons for this shift in practice. No team of engineers can
correctly design and manage, by manual methods, the details of state-of-the-art integrated
circuits (ICs) containing several million gates, but using hardware description languages (HDLs)
designers easily manage the complexity of large designs. Even small designs rely on language-
based descriptions, because designers have to quickly produce correct designs targeted for an
ever-shrinking window of opportunity in the marketplace.
Language-based designs are portable and independent of technology, allowing design
teams to modify and re-use designs to keep pace with improvements in technology. As physical
dimensions of devices shrink, denser circuits with better performance can be synthesized from an
original HDL-based model. HDLs are a convenient medium for integrating intellectual property
(IP) from a variety of sources with a proprietary design. By relying on a common design
language, models can be integrated for testing and synthesized separately or together, with a net
reduction in time for the design cycle. Some simulators also support mixed descriptions based on
multiple languages.

The most significant gain that results from the use of an HDL is that a working circuit
can be synthesized automatically from a language-based description, bypassing the laborious
steps that characterize manual design methods (e.g., logic minimization with Karnaugh maps).
HDL-based synthesis is now the dominant design paradigm used by industry.
Today, designers build a software prototype/model of the design, verify its
functionality, and then use a synthesis tool to automatically optimize the circuit and create a
netlist in a physical technology.
HDLs and synthesis tools focus an engineer's attention on functionality rather than on
individual transistors or gates; they synthesize a circuit that will realize the desired functionality,
and satisfy area and/or performance constraints. Moreover, alternative architectures can be
generated from a single HDL model and evaluated quickly to perform design tradeoffs.
Functional models are also referred to as behavioral models.
HDLs serve as a platform for several tools: design entry, design verification, test
generation, fault analysis and simulation, timing analysis and/or verification, synthesis, and
automatic generation of schematics. This breadth of use improves the efficiency of the design
flow by eliminating translations of design descriptions as the design moves through the tool
chain.
Two languages enjoy widespread industry support: Verilog and VHDL. Both languages
are IEEE (Institute of Electrical and Electronics Engineers) standards; both are supported by
synthesis tools for ASICs (application-specific integrated circuits) and FPGAs (field-
programmable gate arrays). Languages for analog circuit design, such as Spice, play an
important role in verifying critical timing paths of a circuit, but these languages impose a
prohibitive computational burden on large designs, cannot support abstract styles of design, and
become impractical when used on a large scale. Hybrid languages (e.g., Verilog-A) are used in
designing mixed-signal circuits, which have both digital and analog circuitry. System-level
design languages, such as SystemC and Superlog, are now emerging to support a higher level of
design abstraction than can be supported by Verilog or VHDL.

Difference between HDL and other software languages:
The main difference from traditional programming languages is that an HDL represents
extensively parallel operations, whereas traditional languages represent mostly serial
operations.
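The difference can be illustrated with a register swap. The sketch below is a Python model, not HDL: `sequential_swap` runs its statements one after another as ordinary software does, while `concurrent_swap` imitates clocked hardware by sampling all inputs first and then updating every register together. Both function names are hypothetical and chosen only for this example.

```python
def sequential_swap(a, b):
    """Serial semantics: the second statement sees the already-updated a."""
    a = b
    b = a
    return a, b


def concurrent_swap(a, b):
    """Parallel semantics: all right-hand sides are sampled before any update."""
    next_a, next_b = b, a  # sample the current values of both registers
    return next_a, next_b  # both registers change together


if __name__ == "__main__":
    print(sequential_swap(1, 0))  # both values end up equal
    print(concurrent_swap(1, 0))  # the two values are exchanged
```

The serial version loses one of the values, whereas the parallel version exchanges them, which is the behavior two cross-coupled registers would show on a clock edge.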
Importance of HDLs:
HDLs have many advantages compared to traditional schematic-based design:
• Designs can be described at a very abstract level by use of HDLs. Designers can write
their RTL description without choosing a specific fabrication technology. Logic synthesis
tools can automatically convert the design to any fabrication technology. If a new
technology emerges, designers do not need to redesign their circuit. They simply input
the RTL description to the logic synthesis tool and create a new gate-level netlist, using
the new fabrication technology. The logic synthesis tool will optimize the circuit in area
and timing for the new technology.
• By describing designs in HDLs, functional verification of the design can be done early in
the design cycle. Since designers work at the RTL level, they can optimize and modify
the RTL description until it meets the desired functionality. Most design bugs are
eliminated at this point. This cuts down design cycle time significantly because the
probability of hitting a functional bug at a later time in the gate-level netlist or physical
layout is minimized.
• Designing with HDLs is analogous to computer programming. A textual description with
comments is an easier way to develop and debug circuits. This also provides a concise
representation of the design, compared to gate-level schematics. Gate-level schematics
are almost incomprehensible for very complex designs.
Importance of Computer-Aided Digital Design:
The earliest digital circuits were designed with vacuum tubes and transistors. Integrated
circuits were then invented where logic gates were placed on a single chip. The first integrated
circuit (IC) chips were SSI (Small Scale Integration) chips where the gate count was very small.
As technologies became sophisticated, designers were able to place circuits with hundreds of
gates on a chip. These chips were called MSI (Medium Scale Integration) chips. With the advent
of LSI (Large Scale Integration), designers could put thousands of gates on a single chip. At this
point, design processes started getting very complicated, and designers felt the need to automate
these processes. Electronic Design Automation (EDA) techniques began to evolve. Chip
designers began to use circuit and logic simulation techniques to verify the functionality of
building blocks of the order of about 100 transistors. The circuits were still tested on the
breadboard, and the layout was done on paper or by hand on a graphic computer terminal.
With the advent of VLSI (Very Large Scale Integration) technology, designers could
design single chips with more than 100,000 transistors. Because of the complexity of these
circuits, it was not possible to verify these circuits on a breadboard. Computer-aided techniques
became critical for verification and design of VLSI digital circuits. Computer programs to do
automatic placement and routing of circuit layouts also became popular. The designers were now
building gate-level digital circuits manually on graphic terminals. They would build small
building blocks and then derive higher-level blocks from them. This process would continue
until they had built the top-level block. Logic simulators came into existence to verify the
functionality of these circuits before they were fabricated on chip.
What is a gate-level netlist:
A gate-level netlist is a description of the circuit in terms of gates and the connections between them.
Logic synthesis tools convert the RTL description to a gate-level netlist.
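As a rough illustration of "gates and connections", a netlist can be modelled as a list of gate instances, each naming its output net and input nets. The half-adder below, the `GATES` table, and the `evaluate` helper are all assumptions made for this sketch; real synthesis tools emit netlists in their own formats.

```python
# Hypothetical gate library: gate type -> Boolean function on 0/1 values.
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

# A half adder as a netlist: (gate type, output net, input nets).
HALF_ADDER = [
    ("XOR", "sum", ("a", "b")),
    ("AND", "carry", ("a", "b")),
]


def evaluate(netlist, inputs):
    """Evaluate a combinational netlist.

    Assumes the gates are listed in topological order, so every input net
    already has a value when its gate is reached.
    """
    nets = dict(inputs)
    for gate, out, ins in netlist:
        nets[out] = GATES[gate](*(nets[name] for name in ins))
    return nets


if __name__ == "__main__":
    print(evaluate(HALF_ADDER, {"a": 1, "b": 1}))
```

The shared net names `a` and `b` are the "connections": both gates read the same two nets.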
Problems associated with the conventional approach to digital design:
Digital ICs of SSI and MSI types have become universally standardized and have been
accepted for use. Whenever a designer has to realize a digital function, he uses a standard set of
ICs along with a minimal set of additional discrete circuitry.
Consider a simple example of realizing the function $Q_{n+1} = Q_n + (A \cdot B)$.
Here $Q_n$, $A$, and $B$ are Boolean variables, with $Q_n$ being the value of $Q$ at the nth time step.
$A \cdot B$ signifies the logical AND of $A$ and $B$; the '+' symbol signifies the logical OR of the logic
variables on either side. A circuit to realize the function is shown in Fig. 5.8. The circuit can
be realized in terms of two ICs: an A-O-I gate and a flip-flop. It can be directly wired up, tested,
and used.
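The behavior of this function over successive time steps can be checked with a short Python model. The function names `next_q` and `simulate` and the input sequence are assumptions made for illustration; the update rule itself is the one given above, with OR for '+' and AND for the product.

```python
def next_q(q, a, b):
    """One time step of the function: new Q = Q OR (A AND B), on 0/1 values."""
    return q | (a & b)


def simulate(inputs, q0=0):
    """Apply a sequence of (A, B) pairs and return Q after each step."""
    q = q0
    trace = []
    for a, b in inputs:
        q = next_q(q, a, b)
        trace.append(q)
    return trace


if __name__ == "__main__":
    print(simulate([(0, 1), (1, 0), (1, 1), (0, 0)]))
```

Once A and B are both 1 in the same step, Q becomes 1 and stays there, because the old value of Q is fed back into the OR.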
Fig. 5.8 A simple digital circuit
The accepted approach to digital design here is a mix of the top-down and bottom-up
approaches as follows:
1. Decide the requirements at the system level and translate them to circuit requirements.
2. Identify the major functional blocks required like timer, DMA unit, register file etc., say
as in the design of a processor.
3. Whenever a function can be realized using a standard IC, use the same, for example
programmable counter, mux, demux, etc.
4. Whenever the above is not possible, form the circuit to carry out the block functions
using standard SSI – for example gates, flip-flops, etc.
5. Use additional components like transistor, diode, resistor, capacitor, etc., wherever
essential.
Once the above steps have been completed, a paper design is ready. Starting with the
paper design, one has to do a circuit layout. The physical location of all the components is
tentatively decided; they are interconnected and the 'circuit-on-paper' is made ready. Once a
paper design is done, a layout is carried out and a net-list prepared. Based on this, the PCB is
fabricated and populated, and all the populated cards are tested and debugged. The procedure is
shown as a process flowchart in Fig. 5.9.

Fig. 5.9 Sequence of steps in conventional electronic circuit design
At the debugging stage one may encounter three types of problems:
1. Functional mismatch: The realized and expected functions are different. One may have
to go through the relevant functional block carefully and locate any error logically.
Finally the necessary correction has to be carried out in hardware.
2. Timing mismatch: The problem can manifest in different forms. One possibility is due to
the signal going through different propagation delays in two paths and arriving at a point
with a timing mismatch. This can cause faulty operation. Another possibility is a race
condition in a circuit involving asynchronous feedback. This kind of problem may call
for elaborate debugging. The preferred practice is to debug at the smaller module
stages and to ensure that feedback through larger loops is avoided; it becomes essential to
check for the existence of long asynchronous loops.
3. Overload: Some signals may be overloaded to such an extent that the signal transition
may be unduly delayed or even suppressed. The problem manifests as reflections and
erratic behavior in some cases (The signal has to be suitably buffered here.). In fact,
overload on a signal can lead to timing mismatches.
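The timing mismatch in item 2 can be made concrete with a small calculation. The sketch below adds up the propagation delays along two paths that reconverge at the same point and reports the difference in arrival time. The delay values (in nanoseconds) and the function names are hypothetical and chosen only to illustrate the idea.

```python
def arrival_time(path_delays_ns):
    """Total propagation delay along a path, given each stage's delay in ns."""
    return sum(path_delays_ns)


def skew_ns(path_a, path_b):
    """Difference in arrival time between two paths that meet at one point."""
    return abs(arrival_time(path_a) - arrival_time(path_b))


if __name__ == "__main__":
    # Hypothetical stage delays: path A passes through three gates, path B through two.
    path_a = [2, 3, 1]
    path_b = [2, 1]
    print(skew_ns(path_a, path_b))
```

A nonzero result means the two versions of the signal reach the reconvergence point at different times, which is the condition under which faulty operation can occur.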

The above debugging can be carried out only after the prototype PCB has been
manufactured; it involves cost, time, and a redesign process to arrive at a bug-free
design.
Logic simulation and synthesis:
There are two applications of HDL processing: simulation and synthesis.
Simulation:
Simulation is used to verify the functionality of the circuit.
A) Functional simulation: study of the circuit's operation independent of timing parameters and
gate delays.
B) Timing simulation: study including estimated delays, verifying that setup, hold, and other timing
requirements of devices such as flip-flops are met.
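The setup and hold check performed during timing simulation can be sketched as follows. This is a simplified model under stated assumptions: a single data transition, a single active clock edge, times in nanoseconds, and a hypothetical function name `meets_setup_hold`. Data must not change inside the window that opens one setup time before the clock edge and closes one hold time after it.

```python
def meets_setup_hold(data_change_ns, clock_edge_ns, setup_ns, hold_ns):
    """Return True if a data transition respects setup and hold times.

    The data input must be stable from (edge - setup) to (edge + hold).
    A transition strictly inside that window is treated as a violation.
    """
    window_start = clock_edge_ns - setup_ns
    window_end = clock_edge_ns + hold_ns
    return not (window_start < data_change_ns < window_end)


if __name__ == "__main__":
    # Hypothetical flip-flop: clock edge at 10 ns, setup 2 ns, hold 1 ns.
    print(meets_setup_hold(7, 10, 2, 1))   # changes well before the window
    print(meets_setup_hold(9, 10, 2, 1))   # changes inside the setup window
    print(meets_setup_hold(12, 10, 2, 1))  # changes after the hold window
```

Functional simulation ignores these windows entirely; timing simulation is where a check of this kind becomes meaningful.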
Synthesis:
Synthesis is one of the foremost back-end steps. It converts a VHDL or
Verilog description into a set of primitives (equations, as in a CPLD) or components (as in
an FPGA) that fit the target technology. In short, the synthesis tool converts the design
description into equations or components.

CHAPTER 6
RESULT ANALYSIS
6.1. Simulation Results of VMFU:
6.1.1 Partial products Generators:
Fig 6.1 Simulation result of Partial products Generators
6.1.2 Booth Encoder:
Fig 6.2 Simulation result of Booth Encoder

6.1.3 Carry-Save Adder:
Fig 6.3 Simulation result of Carry-save Adder
6.1.4 Versatile Multimedia Functional Unit:
Fig 6.4 Simulation result of Versatile Multimedia Functional Unit
6.2 Synthesis Result
The developed MAC design is simulated and its functionality is verified. Once
functional verification is done, the RTL model is taken through the synthesis process using the Xilinx
ISE tool. In the synthesis process, the RTL model is converted to a gate-level netlist mapped
to a specific technology library. This MAC design can be synthesized for the Spartan 3E
family.
Many different devices of the Spartan 3E family are available in the Xilinx ISE
tool. To synthesize this design, the device "XC3S500E" was chosen, with the
package "FG320" and the device speed grade "-4". The MAC design was synthesized and
its results were analyzed as follows.
Device utilization summary:
The device utilization report includes the following:
• Logic utilization
• Logic distribution
• Total gate count for the design
The device utilization summary is shown above. It gives the number of resources used out of
those available on the device, also expressed as a percentage, for the chosen device and
package.
Timing Summary:
Speed Grade: -4
Minimum period: 35.100ns (Maximum Frequency: 28.490MHz)
Minimum input arrival time before clock: 23.605ns
Maximum output required time after clock: 4.283ns
Maximum combinational path delay: No path found
The period and frequency shown in the timing summary are approximate values estimated during
synthesis. The exact timing summary is obtained after place and route is complete.
The maximum operating frequency of this synthesized design is therefore 28.490MHz and
the minimum period is 35.100ns. Here, OFFSET IN is the minimum input arrival time before
clock and OFFSET OUT is the maximum output required time after clock.
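The two figures in the timing summary are related by f = 1/T. As a quick check of the reported numbers, the sketch below converts the minimum period in nanoseconds to a maximum frequency in megahertz. The function name `max_frequency_mhz` is an assumption made for this example; the input value is the one reported above.

```python
def max_frequency_mhz(min_period_ns):
    """Convert a minimum clock period in ns to a maximum frequency in MHz.

    f [MHz] = 1000 / T [ns], since 1 / (1 ns) = 1000 MHz.
    """
    return 1000.0 / min_period_ns


if __name__ == "__main__":
    print(f"{max_frequency_mhz(35.100):.3f} MHz")
```

The result agrees with the 28.490MHz reported by the synthesis tool to three decimal places.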
RTL Schematic
The RTL (Register Transfer Level) schematic can be viewed as a black box after the design
has been synthesized. It shows the inputs and outputs of the system. Double-clicking on the
diagram reveals the gates, flip-flops, and multiplexers inside.
Figure 6.5 Schematic with Basic Inputs and Output
(Figure labels: INPUTS, OUTPUTS)

Figure 6.6 Schematic of Booth Encoder with SPST Adder
6.3 Summary
• The developed VMFU design is modelled and simulated using the ModelSim tool.
• The simulation results are discussed by considering different cases.
• The RTL model is synthesized using the Xilinx tool for Spartan 3E, and the synthesis
results are discussed with the help of the generated reports.

CHAPTER 7
CONCLUSION
In this project, a versatile multimedia functional unit (VMFU) is designed using a low-power
technique called SPST. It is built around a 16x16 multiplier-accumulator (MAC) and supports
addition, subtraction, sum of absolute differences, and interpolation. A Radix-2 Modified Booth
multiplier circuit is used for the MAC architecture. Compared to other circuits, the Booth
multiplier has the highest operational speed and a lower hardware count. The basic building
blocks of the VMFU are identified and each block is analyzed for its performance. Power and
delay are calculated for the blocks. The MAC unit is designed with an enable input so that the
block-enable technique can reduce the total power consumption. Using this block, the N-bit MAC
unit is constructed and its total power consumption is calculated.
This project presents the low-power technique called SPST and explores its applications in
multimedia/DSP computations, where the theoretical analysis and the realization issues of the
SPST are fully discussed. The proposed SPST can clearly decrease the switching (or
dynamic) power dissipation, which comprises a significant portion of the whole power
dissipation in integrated circuits. In addition, the proposed SPST can achieve a 24% saving in power
consumption at the expense of only a 10% area overhead for the proposed VMFU.

FUTURE SCOPE

