LOW POWER VLSI DESIGN FOR MULTIMEDIA APPLICATIONS USING SPS TECHNIQUE
NCET Page 1
CHAPTER 1
INTRODUCTION
1. Introduction:
Power dissipation is recognized as a critical parameter in modern VLSI design. To
keep pace with Moore's law and to produce consumer electronics with longer battery life and less
weight, low power VLSI design is necessary.
Fast multipliers are essential parts of digital signal processing systems. The speed of
multiply operation is of great importance in digital signal processing as well as in the general
purpose processors today, especially since media processing took off. In the past,
multiplication was generally implemented via a sequence of addition, subtraction, and shift
operations. Multiplication can be considered as a series of repeated additions. The number to be
added is the multiplicand, the number of times that it is added is the multiplier, and the result is
the product. Each step of addition generates a partial product. In most computers, the operands
usually contain the same number of bits. When the operands are interpreted as integers, the
product is generally twice the length of the operands in order to preserve the information content.
The repeated addition method suggested by the arithmetic definition is so slow that it is
almost always replaced by an algorithm that makes use of positional representation. It is possible
to decompose multipliers into two parts. The first part is dedicated to the generation of partial
products, and the second one collects and adds them.
The basic multiplication principle is twofold: evaluation of the partial products and
accumulation of the shifted partial products. Accumulation is performed by successive additions
of the columns of the shifted partial product matrix. The multiplier is successively shifted and
gates the appropriate bit of the multiplicand. The delayed, gated instances of the multiplicand
must all be in the same column of the shifted partial product matrix. They are then added to form
the product bit for that column. Multiplication is therefore a multi-operand operation. To
extend the multiplication to both signed and unsigned numbers, a convenient number system
is the two's complement representation.
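The contrast between repeated addition and the positional (shift-and-add) method can be sketched as follows. This is a minimal illustration, not a hardware description; the function names and the `n_bits` parameter are assumptions introduced here.

```python
def repeated_addition(multiplicand: int, multiplier: int) -> int:
    """Multiply by adding the multiplicand `multiplier` times (slow reference)."""
    total = 0
    for _ in range(multiplier):
        total += multiplicand
    return total


def shift_and_add(multiplicand: int, multiplier: int, n_bits: int) -> int:
    """Multiply two unsigned n_bits-wide operands using shifted partial products."""
    limit = 1 << n_bits
    if not (0 <= multiplicand < limit and 0 <= multiplier < limit):
        raise ValueError("operands must fit in n_bits")
    product = 0
    for i in range(n_bits):
        multiplier_bit = (multiplier >> i) & 1
        # The multiplier bit gates the multiplicand; the shift gives it weight 2**i.
        partial_product = (multiplicand << i) if multiplier_bit else 0
        product += partial_product
    return product
```

The first function needs as many additions as the value of the multiplier, while the second needs only one addition per multiplier bit. For n-bit operands the result always fits in 2n bits, which is why the product is kept at twice the operand length.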
Multipliers are key components of many high performance systems such as FIR filters,
microprocessors, digital signal processors, etc. A system's performance is generally determined
by the performance of the multiplier because the multiplier is generally the slowest element in
the system. Furthermore, it is generally the most area consuming. Hence, optimizing the speed
and area of the multiplier is a major design issue. However, area and speed are usually
conflicting constraints, so that improving speed mostly results in larger area. As a result, a whole
spectrum of multipliers with different area-speed trade-offs has been designed, with fully parallel
multipliers at one end and fully serial multipliers at the other. In between are digit serial
multipliers, where single digits consisting of several bits are operated on. These multipliers have
moderate performance in both speed and area. However, existing digit serial multipliers have been
plagued by complicated switching systems and/or irregularities in design.
Radix-2^n multipliers that operate on digits in a parallel fashion
instead of bits bring the pipelining to the digit level and avoid most of the above problems. They
were introduced by M. K. Ibrahim in 1993. These structures are iterative and modular.
Pipelining at the digit level brings the benefit of constant operation speed irrespective of the
size of the multiplier. The clock speed is determined only by the digit size, which is
fixed before the design is implemented.
The growing market for fast floating-point co-processors, digital signal processing chips,
and graphics processors has created a demand for high speed, area-efficient multipliers. Current
architectures range from small, low-performance shift and add multipliers, to large, high-
performance array and tree multipliers. Conventional linear array multipliers achieve high
performance in a regular structure, but require large amounts of silicon. Tree structures achieve
even higher performance than linear arrays but the tree interconnection is more complex and less
regular, making them even larger than linear arrays. Ideally, one would want the speed benefits
of a tree structure, the regularity of an array multiplier, and the small size of a shift and add
multiplier.
To reduce the size of the multiplier a partial tree is used together with a 4-2 carry-save
accumulator placed at its outputs to iteratively accumulate the partial products. This allows a full
multiplier to be built in a fraction of the area required by a full array. Higher performance is
achieved by increasing the hardware utilization of the partial 4-2 tree through pipelining. To
ensure optimal performance of the pipelined 4-2 tree, the clock frequency must be tightly
controlled to match the delay of the 4-2 adder pipe stages.
Figure 2.2 Minimal Iterative Structures
In an attempt to increase the performance of the minimal iterative structure, additional rows
of CSAs could be added to make a bigger array. For example, the addition of one row of CSAs
to the minimal structure would yield a partial array with two rows of CSAs. This structure
provides two advantages over the single row of CSA cells:
1) It reduces the required clock frequency, and
2) It requires only half as many latch delays.
It is important to note that although the number of CSAs has been doubled, the latency
was reduced only by halving the number of latch delays. The number of CSA delays remains the
same. Thus, assuming the latch delays are small relative to the CSA delays, increasing the depth
of the partial array by adding rows of CSAs in a linear structure yields only a slight
increase in performance.
2.4 Multiplication of Unsigned and Signed Numbers:
Multiplication is less common than addition, but is still essential for microprocessors,
digital signal processors, and graphics engines. The most basic form of multiplication consists of
forming the product of two unsigned (positive) binary numbers. This can be accomplished
through the traditional technique taught in primary school, simplified to base 2. For example, the
multiplication of two positive 6-bit binary integers, 25 (base 10) and 39 (base 10), proceeds as
shown in Figure 2.3.
M × N-bit multiplication P = Y × X can be viewed as forming N partial products of M bits
each, and then summing the appropriately shifted partial products to produce an M + N-bit result
P. Single-bit binary multiplication is equivalent to a logical AND operation. Therefore, generating
partial products consists of the logical ANDing of the appropriate bits of the multiplier and
multiplicand. Each column of partial products must then be added and, if necessary, any carry
values passed to the next column. We denote the multiplicand as $Y = \{y_{M-1}, y_{M-2}, \ldots, y_1, y_0\}$
and the multiplier as $X = \{x_{N-1}, x_{N-2}, \ldots, x_1, x_0\}$. For unsigned multiplication, the product is
given in EQ (2.1):

$$P = \left(\sum_{j=0}^{M-1} y_j 2^j\right)\left(\sum_{i=0}^{N-1} x_i 2^i\right) = \sum_{i=0}^{N-1}\sum_{j=0}^{M-1} x_i y_j 2^{i+j} \qquad (2.1)$$

Figure 2.4 illustrates the generation, shifting, and summing of partial products
in a 6 × 6-bit multiplier.
Fig 2.3 multiplication of two positive 6-bit binary integers
Fig 2.4 generation, shifting, and summing of partial products in a 6 × 6-bit multiplier
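The generation and column-wise summing of partial products described above can be sketched in a few lines. This is only a behavioral illustration under the assumption that Y is the multiplicand and X the multiplier; the helper names are hypothetical.

```python
def partial_products(y: int, x: int, m: int, n: int) -> list:
    """Return N rows of M bits; row i, column j holds x_i AND y_j."""
    return [[((x >> i) & 1) & ((y >> j) & 1) for j in range(m)] for i in range(n)]


def sum_columns(rows: list, m: int, n: int) -> int:
    """Add the shifted rows column by column, passing carries to the next column."""
    product = 0
    carry = 0
    for col in range(m + n):
        total = carry
        for i, row in enumerate(rows):
            j = col - i          # row i is shifted left by i positions
            if 0 <= j < m:
                total += row[j]
        product |= (total & 1) << col   # low bit stays in this column
        carry = total >> 1              # the rest moves to the next column
    return product


rows = partial_products(25, 39, 6, 6)
result = sum_columns(rows, 6, 6)
```

Here `result` is the 12-bit product of the two 6-bit operands from the example (25 and 39).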
Large multiplications can be more conveniently illustrated using dot diagrams. Figure 2.5 shows
a dot diagram for a simple 16 × 16 multiplier. Each dot represents a placeholder for a single bit
that can be a 0 or 1. The partial products are represented by a horizontal boxed row of dots,
shifted according to their weight. The multiplier bits used to generate the partial products are
shown on the right.
There are a number of techniques that can be used to perform multiplication.
In general, the choice is based upon factors such as latency, throughput, energy, area, and
design complexity. An obvious approach is to use an M + 1-bit carry-propagate adder (CPA)
to add the first two partial products, then another CPA to add the third partial product to
the running sum, and so forth. Such an approach requires N – 1 CPAs and is slow, even if a fast
CPA is employed. More efficient parallel approaches use some sort of array or tree of full adders
to sum the partial products. We begin with a simple array for unsigned multipliers, and then
modify the array to handle signed two's complement numbers using the Baugh-Wooley
algorithm. The number of partial products to sum can be reduced using Booth encoding, and the
number of logic levels required to perform the summation can be reduced with Wallace trees.
Unfortunately, Wallace trees are complex to lay out and have long, irregular wires, so hybrid
array/tree structures may be more attractive. For completeness, we consider a serial multiplier
architecture. This was once popular when gates were relatively expensive, but is now rarely
necessary.
Fig 2.5 dot diagram for a simple 16 × 16 multiplier
2.4.1 Unsigned Array Multiplication:
Fast multipliers use carry-save adders (CSAs) to sum the partial products.
A CSA typically has a delay of 1.5–2 FO4 inverters independent of the width of the partial
product, while a carry-propagate adder (CPA) tends to have a delay of 4–15+ FO4 inverters
depending on the width, architecture, and circuit family. Figure 2.6 shows a 4 × 4 array multiplier
for unsigned numbers using an array of CSAs. Each cell contains a 2-input AND gate that forms
a partial product and a full adder (CSA) to add the partial product into the running sum. The first
row converts the first partial product into carry-save redundant form. Each later row uses the
CSA to add the corresponding partial product to the carry-save redundant result of the previous
row and generate a carry-save redundant result. The least significant N output bits are available as
sum outputs directly from CSAs. The most significant output bits arrive in carry-save redundant
form and require an M-bit carry-propagate adder to convert into regular binary form. In
Figure 2.6, the CPA is implemented as a carry-ripple adder. The array is regular in structure and
uses a single type of cell, so it is easy to design and lay out. Assuming the carry output is faster
than the sum output in a CSA, the critical path through the array is marked on
the figure with a dashed line. The multiplier can easily be pipelined with the placement of registers
between rows. In practice, circuits are assigned rectangular blocks in the floorplan, so the
parallelogram shape wastes space. Figure 2.7 shows the same multiplier squashed to fit a rectangular
block.
Fig 2.6 Array Multiplier
A key element of the design is a compact CSA. This not only benefits area but also helps
performance because it leads to short wires with low wire capacitance. An ideal CSA design has
approximately equal sum and carry delays because the greater of these two delays limits
performance. The mirror adder is commonly used for its compact layout even though the sum
delay exceeds the carry delay. The sum output can be connected to the faster carry input to
partially compensate. Note that the first row of CSAs adds the first partial product to a pair of 0s.
This leads to a regular structure, but is inefficient. At a slight cost to regularity, the first row of
CSAs can be used to add the first three partial products together. This reduces the number of
rows by two and correspondingly reduces the adder propagation delay. Yet another way to
improve the multiplier array performance is to replace the bottom row with a faster CPA such as
a carry-lookahead or tree adder. In summary, the critical path of an array multiplier involves N–2
CSAs and a CPA.
Fig 2.7 Rectangular Multiplier
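The row-by-row accumulation in carry-save redundant form can be modeled at word level. The sketch below is an assumption-laden behavioral model (one CSA row per partial product, starting from a pair of zeros, followed by a single carry-propagate addition); it does not reproduce the cell-level wiring of Figure 2.6, and the function names are hypothetical.

```python
def csa(a: int, b: int, c: int) -> tuple:
    """Word-level carry-save adder: three words in, a sum word and a carry word out."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # carries have twice the weight
    return sum_word, carry_word


def array_multiply(y: int, x: int, m: int, n: int) -> int:
    """Unsigned M x N multiply using one CSA row per partial product and a final CPA."""
    sum_word, carry_word = 0, 0
    for i in range(n):
        partial_product = (y << i) if (x >> i) & 1 else 0
        # No carry propagates along a row; the result stays in redundant form.
        sum_word, carry_word = csa(sum_word, carry_word, partial_product)
    # The only carry-propagate addition converts the redundant pair to binary.
    return (sum_word + carry_word) & ((1 << (m + n)) - 1)
```

At every row the invariant `sum_word + carry_word == running total` holds, which is why a single CPA at the end is sufficient.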
2.4.2 Two's Complement Array Multiplication:
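Multiplication of two's complement numbers at first might seem more difficult because some
partial products are negative and must be subtracted. Recall that the most significant bit of a
two's complement number has a negative weight. Hence, the product is

$$P = \left(-y_{M-1}2^{M-1} + \sum_{j=0}^{M-2} y_j 2^j\right)\left(-x_{N-1}2^{N-1} + \sum_{i=0}^{N-2} x_i 2^i\right)$$

$$= \sum_{i=0}^{N-2}\sum_{j=0}^{M-2} x_i y_j 2^{i+j} + x_{N-1}y_{M-1}2^{M+N-2} - \left(\sum_{i=0}^{N-2} x_i y_{M-1} 2^{i+M-1} + \sum_{j=0}^{M-2} x_{N-1} y_j 2^{j+N-1}\right) \qquad (2.2)$$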
In EQ (2.2), two of the partial products have negative weight and thus should be
subtracted rather than added. The Baugh-Wooley multiplier algorithm handles subtraction by
taking the two’s complement of the terms to be subtracted (i.e., inverting the bits and adding
one). Figure 2.8 shows the partial products that must be summed. The upper parallelogram
represents the unsigned multiplication of all but the most significant bits of the inputs. The next
row is a single bit corresponding to the product of the most significant bits. The next two pairs of
rows are the inversions of the terms to be subtracted.
Each term has implicit leading and trailing zeros, which are inverted to leading and
trailing ones. Extra ones must be added in the least significant column when taking the two’s
complement. The multiplier delay depends on the number of partial product rows to be summed.
The modified Baugh-Wooley multiplier reduces this number of partial products by precomputing
the sums of the constant ones and pushing some of the terms upward into extra columns. Figure
2.9 shows such an arrangement. The parallelogram shaped array can again be squashed into a
rectangle, as shown in Figure 2.10, giving a design almost identical to the unsigned multiplier of
Figure 2.7. The AND gates are replaced by NAND gates in the hatched cells and 1s are added in
place of 0s at two of the unused inputs. The signed and unsigned arrays are so similar that a
single array can be used for both purposes if XOR gates are used to conditionally invert some of
the terms depending on the mode.
Fig 2.8 Partial products for two’s complement multiplier
Fig 2.9 Simplified partial products for two’s complement multiplier
Fig 2.10 Modified Baugh-Wooley two’s complement multiplier
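The way the Baugh-Wooley approach replaces subtraction by addition of two's complements can be checked numerically. The sketch below splits each operand into its negatively weighted most significant bit and its remaining bits, forms the four groups of terms (unsigned part, product of the sign bits, and the two groups to be subtracted), and then adds the inverted groups plus one instead of subtracting. It is an arithmetic check only, with hypothetical function names, and does not model the row arrangement of Figures 2.8 to 2.10.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret a raw bit pattern as a two's complement number."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def baugh_wooley_product(y: int, x: int, m: int, n: int) -> int:
    """Signed product of raw patterns y (M bits) and x (N bits), using additions only."""
    mask = (1 << (m + n)) - 1
    y_msb, x_msb = (y >> (m - 1)) & 1, (x >> (n - 1)) & 1
    y_low, x_low = y & ((1 << (m - 1)) - 1), x & ((1 << (n - 1)) - 1)

    unsigned_part = y_low * x_low                      # all bits except the MSBs
    msb_part = (x_msb & y_msb) << (m + n - 2)          # product of the two sign bits
    negative_a = (x_low * y_msb) << (m - 1)            # terms with negative weight
    negative_b = (y_low * x_msb) << (n - 1)

    # Subtract by inverting the bits and adding one, modulo 2**(M+N).
    total = unsigned_part + msb_part
    total += ((~negative_a) & mask) + 1
    total += ((~negative_b) & mask) + 1
    return to_signed(total & mask, m + n)
```

For example, `baugh_wooley_product(0b1001, 0b0011, 4, 4)` multiplies -7 by 3 using only additions.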
2.5 Booth Encoding:
The array multipliers in the previous sections compute the partial products in a radix-2 manner,
i.e., by observing one bit of the multiplier at a time. Radix-2^r multipliers produce N/r partial
products, each of which depends on r bits of the multiplier. Fewer partial products lead to a
smaller and faster CSA array. For example, a radix-4 multiplier produces N/2 partial products.
Each partial product is 0, Y, 2Y, or 3Y, depending on a pair of bits of X. Computing 2Y is a simple
shift, but 3Y is a hard multiple requiring a slow carry-propagate addition of Y + 2Y before partial
product generation begins.
Booth encoding was originally proposed to accelerate serial multiplication. Modified Booth
encoding [MacSorley61] allows higher radix parallel operation without generating the hard 3Y
multiple by instead using negative partial products. Observe that 3Y = 4Y – Y and 2Y = 4Y – 2Y.
However, 4Y in a radix-4 multiplier array is equivalent to Y in the next row of the array, which
carries four times the weight. Hence, partial products are chosen by considering a pair of bits
along with the most significant bit from the previous pair. If the most significant bit from the
previous pair is true, Y must be added to the current partial product. If the most significant bit
of the current pair is true, the current partial product is selected to be negative
and the next partial product is incremented.
Table 2.1 shows how the partial products are selected, based on bits of the multiplier.
Negative partial products are generated by taking the two's complement of the multiplicand
(possibly left-shifted by one column for –2Y). An unsigned radix-4 Booth-encoded multiplier
requires $\lceil (N+1)/2 \rceil$ partial products rather than N. Each partial product is M + 1 bits to
accommodate the 2Y and –2Y multiples. Even though X and Y are unsigned, the partial products
can be negative and must be sign extended properly. The Booth selects will be discussed further
after an example.
Table 2.1 Radix-4 modified Booth encoding values
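The selection rule described above can be sketched as a small recoding routine. Each group of three bits (a pair plus the most significant bit of the previous pair) is mapped to a digit in {-2, -1, 0, 1, 2}. The mapping of the digits onto the SINGLE, DOUBLE, and NEG select lines shown here is an assumption inferred from the names of the lines and from NEG being the upper bit of the pair; the function names are hypothetical.

```python
def booth_select_lines(x_hi: int, x_mid: int, x_lo: int) -> tuple:
    """Assumed encoding: SINGLE selects Y, DOUBLE selects 2Y, NEG negates."""
    single = x_mid ^ x_lo
    double = int((x_hi, x_mid, x_lo) in ((0, 1, 1), (1, 0, 0)))
    neg = x_hi
    return single, double, neg


def booth_radix4_digits(x: int, n_bits: int) -> list:
    """Recode an unsigned n_bits multiplier into radix-4 digits in {-2, ..., 2}."""
    count = (n_bits + 2) // 2          # equals ceil((n_bits + 1) / 2)
    padded = x << 1                    # appends the implicit bit x_{-1} = 0
    digits = []
    for i in range(count):
        group = (padded >> (2 * i)) & 0b111
        x_hi, x_mid, x_lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        single, double, neg = booth_select_lines(x_hi, x_mid, x_lo)
        magnitude = single + 2 * double
        digits.append(-magnitude if neg else magnitude)
    return digits


def booth_multiply(y: int, x: int, n_bits: int) -> int:
    """Sum the Booth-selected multiples of y, each weighted by 4**i."""
    return sum(d * y * (4 ** i) for i, d in enumerate(booth_radix4_digits(x, n_bits)))
```

For a 16-bit unsigned multiplier this recoding yields nine digits, so nine partial products are summed instead of sixteen.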
In a typical radix-4 Booth-encoded multiplier design, each group of 3 bits (a pair, along with the
most significant bit of the previous pair) is encoded into several select lines (SINGLE_i,
DOUBLE_i, and NEG_i, given in the rightmost columns of Table 2.1) and driven across the partial
product row as shown in Figure 2.11. The multiplicand Y is distributed to all the rows. The select
lines control Booth selectors that choose the appropriate multiple of Y for each partial product.
The Booth selectors substitute for the AND gates of a simple array multiplier to determine the ith
partial product. Figure 2.11 shows a conventional Booth encoder and selector design. Y is
zero-extended to M + 1 bits. Depending on SINGLE_i and DOUBLE_i, the A22OI gate selects either 0,
Y, or 2Y. Negative partial products should be two's-complemented (i.e., invert and add 1). If
NEG_i is asserted, the partial product is inverted. The extra 1 can be added in the least
significant column of the next row to avoid needing a CPA.
Even in an unsigned multiplier, negative partial products must be sign-extended to be
summed correctly. Figure 2.12 shows a 16-bit radix-4 Booth partial product array for an
unsigned multiplier using the dot diagram notation.
Fig 2.11 Radix-4 Booth encoder and selector
Fig 2.12 Radix-4 Booth-encoded partial products with sign extension
Each dot in the Booth-encoded multiplier is produced by a Booth selector rather than a
simple AND gate. Partial products 0–7 are 17 bits. Each partial product i is sign extended with
$s_i = NEG_i = x_{2i+1}$, which is 1 for negative multiples (those in the bottom half of Table 2.1) or 0
for positive multiples. Observe how an extra 1 is added to the least significant bit in the next row
to form the 2's complement of negative multiples. Inverting the implicit leading zeros generates
leading ones on negative multiples. The extra terms increase the size of the multiplier. PP8 is
required in case PP7 is negative; this partial product is always 0 or Y because x16 and x17 are 0.
Hence, partial product 8 is only 16 bits.
Observe that the sign extension bits are all either 1s or 0s. If a single 1 is added to the least
significant position in a string of 1s, the result is a string of 0s plus a carry out of the top bit that
may be discarded. Therefore, the large number of s bits in each partial product can be replaced by
an equal number of constant 1s plus the inverse of s added to the least significant position, as
shown in Figure 2.13(a). These constants mostly can be optimized out of the array by
precomputing their sum. The simplified result is shown in Figure 2.13(b). As usual, it can be
squashed to fit a rectangular floorplan.
The critical path of the multiplier involves the Booth decoder, the select line drivers,
the Booth selector, approximately N/2 CSAs, and a final CPA. Each partial product fills
about M + 5 columns. 54 × 54-bit radix-4 Booth multipliers for IEEE double-precision
floating-point units are typically 20–50% smaller (and arguably up to 20% faster) than
nonencoded counterparts, so the technique is widely used. The multiplier requires
M × N/2 Booth selectors.
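The sign extension simplification described above (replacing a run of s bits by constant 1s plus the inverse of s at the least significant position of the run) can be verified with a short check. The field width `k` and the function names are assumptions introduced for illustration.

```python
def sign_extension_field(s: int, k: int) -> int:
    """A run of k copies of the sign bit s, viewed as a k-bit field."""
    return (1 << k) - 1 if s else 0


def constant_ones_form(s: int, k: int) -> int:
    """k constant ones plus NOT s at the least significant position of the field."""
    ones = (1 << k) - 1
    inverse_of_s = 1 - s
    # The carry out of the top bit of the field is discarded by the mask.
    return (ones + inverse_of_s) & ones
```

When s is 0, the added 1 ripples through the string of ones and leaves all zeros; when s is 1, nothing is added and the ones remain. Both forms therefore agree for every field width.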
Because the selectors account for a substantial portion of the area and only a
small fraction of the critical path, they should be optimized for size over speed. For example,
one published design is a sign select Booth encoder and selector that uses only 10 transistors per
selector bit at the expense of a more complex encoder. Another is a one-hot Booth encoder and
selector that chooses one of the six possible partial products using a transmission gate multiplexer.
Fig 2.13 Radix-4 Booth-encoded partial products with simplified sign extension
Booth Encoding Signed Multipliers:
Signed two's complement multiplication is similar, but the
multiplicand may have been negative, so sign extension must be done based on the sign bit of the
partial product, PPiM. Figure 2.14 shows such an array, where the sign extension bit is ei =
PPiM. Also notice that PP8, which was either Y or 0 for unsigned multiplication, is always 0 and
can be omitted for signed multiplication because the multiplier X is sign-extended such that x17 =
x16 = x15. The same Booth selector and encoder can be employed, but Y should be
sign-extended rather than zero-extended to M + 1 bits.
Fig 2.14 Radix-4 Booth-encoded partial products for signed multiplication
Higher Radix Booth Encoding:
Large multipliers can use Booth encoding of higher radix. For
example, ordinary radix-8 multiplication reduces the number of partial products by a factor of 3,
but requires hard multiples of 3Y, 5Y, and 7Y. Radix-8 Booth encoding only requires the hard 3Y
multiple, as shown in Table 2.2. Although this requires a CPA before partial product generation,
it can be justified by the reduction in array size and delay. Higher-radix Booth encoding is
possible, but generating the other hard multiples appears not to be worthwhile for multipliers of
fewer than 64 bits. Similar techniques apply to sign-extending higher-radix multipliers.
Table 2.2 Radix-8 modified Booth encoding values
Column Addition:
The critical path in a multiplier involves summing the dots in each column. Observe that a CSA is
effectively a "ones counter" that adds the number of 1s on the A, B, and C inputs and encodes
them on the sum and carry outputs, as summarized in Table 2.3. A CSA is therefore also known
as a (3,2) counter because it converts three inputs into a count encoded in two outputs. The
carry-out is passed to the next more significant column, while a corresponding carry-in is received
from the previous column. This is called a horizontal path because it crosses columns. For
simplicity, a carry is represented as being passed directly down the column. Figure 2.15 shows a
dot diagram of an array multiplier column that sums N partial products sequentially using N–2
CSAs. For example, the 16 × 16 Booth-encoded multiplier from Figure 2.13(b) sums nine partial
products with seven levels of CSAs. The output is produced in carry-save redundant form suitable
for the final CPA.
Table 2.3 Carry-save adder as a ones counter
Fig 2.15 Dot diagram for array multiplier
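The "ones counter" view of a CSA can be made concrete by enumerating all input combinations of a full adder. The sketch below assumes the usual XOR/majority form of the full adder; the names are hypothetical.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple:
    """(3,2) counter: returns (carry, sum) so that 2*carry + sum counts the ones."""
    sum_bit = a ^ b ^ c
    carry_bit = (a & b) | (a & c) | (b & c)   # majority of the three inputs
    return carry_bit, sum_bit


def ones_counter_table() -> list:
    """Rows of (A, B, C, carry, sum) for all eight input combinations."""
    return [(a, b, c, *full_adder(a, b, c)) for a, b, c in product((0, 1), repeat=3)]
```

Each row of `ones_counter_table()` satisfies 2·carry + sum = A + B + C, which is exactly the count of ones on the inputs.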
The column addition is slow because only one CSA is active at a time. Another way to speed the
column addition is to sum partial products in parallel rather than sequentially. Figure 2.16 shows
a Wallace tree using this approach [Wallace64]. The Wallace tree requires about
$\lceil \log_{3/2}(N/2) \rceil$
levels of (3,2) counters to reduce N inputs down to two carry-save redundant form outputs.
Even though the CSAs in the Wallace tree are shown in two dimensions, they are logically
packed into a single column of the multiplier. This leads to long and irregular wires along the
column to connect the CSAs. The wire capacitance increases the delay and energy of the multiplier,
and the wires can be difficult to lay out.
Fig 2.16 Dot diagram for Wallace tree multiplier
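The depth advantage of summing in parallel can be estimated by counting reduction levels. The sketch below assumes that at each level every complete group of three rows is replaced by two rows and leftover rows pass through unchanged; it compares the resulting level count with the N–2 levels of the sequential array column. The function names are hypothetical.

```python
def wallace_levels(n_rows: int) -> int:
    """Count levels of (3,2) counters needed to reduce n_rows to two rows."""
    levels = 0
    while n_rows > 2:
        # Each full group of three rows becomes two; leftovers pass through.
        n_rows = 2 * (n_rows // 3) + n_rows % 3
        levels += 1
    return levels


def array_levels(n_rows: int) -> int:
    """Sequential column addition uses one CSA level per extra row beyond two."""
    return max(n_rows - 2, 0)
```

For the nine partial products of the 16 × 16 Booth-encoded example, the row count shrinks as 9, 6, 4, 3, 2, so the tree needs four levels where the array needs seven.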
2.6 Compressor Trees
[4:2] compressors can be used in a binary tree to produce a more regular layout, as shown in
Figure 2.17. A [4:2] compressor takes four inputs of equal weight and produces two outputs. It
can be constructed from two (3,2) counters as shown in Figure 2.18. Along the way, it generates
an intermediate carry, $t_i$, into the next column and accepts a carry, $t_{i-1}$, from the previous
column, so it may more aptly be called a (5,3) counter. This horizontal path does not impact the
delay because the output of the top CSA in one column is the input of the bottom CSA in the
next column.
Fig 2.17 Dot diagram for [4:2] tree multiplier
Fig 2.18 [4:2] compressor (a) implementation with two CSAs (b) symbol
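The construction of a [4:2] compressor from two (3,2) counters can be checked exhaustively. In this sketch the top CSA takes three of the four inputs and produces the intermediate carry, and the bottom CSA combines its sum with the fourth input and the incoming intermediate carry. The argument names (`t_in`, `t_out`) are assumptions standing in for the carries between adjacent columns.

```python
from itertools import product


def csa_bit(a: int, b: int, c: int) -> tuple:
    """Single-bit (3,2) counter returning (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a: int, b: int, c: int, d: int, t_in: int) -> tuple:
    """[4:2] compressor built from two CSAs; returns (sum, carry, t_out)."""
    top_sum, t_out = csa_bit(a, b, c)                 # t_out goes to the next column
    sum_bit, carry = csa_bit(top_sum, d, t_in)        # t_in comes from the previous column
    return sum_bit, carry, t_out


def check_all() -> bool:
    """Verify a + b + c + d + t_in == sum + 2 * (carry + t_out) for all inputs."""
    for a, b, c, d, t_in in product((0, 1), repeat=5):
        s, carry, t_out = compressor_4_2(a, b, c, d, t_in)
        if a + b + c + d + t_in != s + 2 * (carry + t_out):
            return False
    return True
```

Note that `t_out` depends only on the inputs of the top CSA and never on `t_in`, so the horizontal carry does not ripple across columns.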
The [4:2] CSA symbol shows only the primary inputs and outputs to emphasize the main
function of reducing four inputs to two outputs. Only
$\lceil \log_2(N/2) \rceil$
levels of [4:2] compressors are required, although each has greater delay than a CSA. The
regular layout and routing also make the binary tree attractive. To see the benefits of a [4:2]
compressor, we introduce the notion of fast and slow inputs and outputs. Figure 2.19 shows a
simple gate-level CSA design. The
longest path through the CSA involves two levels of XOR2 to compute the sum. X is called a fast
input, while Y and Z are slow inputs because they pass through a second level of XOR. C is the
fast output because it involves a single gate delay, while S is the slow output because it involves
two gate delays. A [4:2] compressor might be expected to use four levels of XOR2s.
Fig 2.19 Gate-level carry-save adder
Fig 2.20 [4:2] compressors
Figure 2.20 shows various [4:2] compressor designs that reduce the critical path to only 3
XOR2s. In Figure 2.20(a), the slow output of the first CSA is connected to the fast input of the
second. In Figure 2.20(b), the [4:2] compressor has been merged into a single cell, allowing a
majority gate to be replaced with a multiplexer. In Figure 2.20(c), the initial XORs have been
replaced with 2-level XNOR circuits that allow some sharing of subfunctions, reducing the
transistor count. Figure 2.21 shows a transmission gate implementation of a [4:2] compressor.
It uses only 48 transistors, allowing for a smaller multiplier array with shorter wires. Note
that it uses three distinct XNOR circuit forms and two transmission gate multiplexers.
Fig 2.21 Transmission gate [4:2] compressor
Fig 2.22 16 × 16 Booth-encoded multiplier floorplans
(a) array (b) Wallace tree (c) [4:2] tree
Figure 2.22 compares floorplans of the 16 × 16 Booth-encoded array multiplier from Figure
2.15, the Wallace tree from Figure 2.16, and the [4:2] tree from Figure 2.17. Each row represents
a horizontal slice of the multiplier containing a Booth selector or a CSA. Vertical busses connect
CSAs. The Wallace tree has the most irregular and lengthy wiring. In practice, the parallelogram
may be squashed into a rectangular form to make better use of the space.
2.7 Three-Dimensional Method:
The notion of connecting slow outputs to fast inputs generalizes
to compressors with more than four inputs. By examining the entire partial product array at
once, one can construct trees for each column that sum all of the partial products in the shortest
possible time. This approach is called the three-dimensional method (TDM) because it considers
the arrival time as a third dimension along with rows and columns. Figure 11.92 shows an
example of a 16 × 16 multiplier. The parallelogram at the top shows the dot diagram from Figure
2.13(b) containing nine partial product rows obtained through Booth encoding. The partial
products in each of the 32 columns must be summed to produce the 32-bit result. As we have
seen, this is done with a compressor to produce a pair of outputs, followed by a final CPA.
Table 2.4 Comparison of XOR levels in multiplier trees
2.8 Hybrid Multiplication:
Arrays offer regular layout, but many levels of CSAs. Trees offer
fewer levels of CSAs, but less regular layout and some long wires. A number of hybrids have
been proposed that offer trade-offs between these two extremes. These include odd/even arrays,
arrays of arrays, balanced delay trees, overturned-staircase trees, and upper/lower left-to-right
leapfrog (ULLRF) trees. They can achieve nearly as few levels of logic as the Wallace tree while
offering more regular (and faster) wiring. None have caught on as distinctly better than [4:2]
trees.
The three steps of multiplication are partial product generation, partial product
reduction, and carry-propagate addition. A simple M × N multiplier generates N partial
products using AND gates. For multipliers of 16 or more bits, radix-4 Booth encoding is
typically used to cut the number of partial products in two, saving substantial area and power.
Some implementations find Booth encoding is faster, while others find it has little speed
benefit. The partial products are then reduced to a pair of numbers in carry-save redundant
form using an array or tree of CSAs. Trees have fewer levels of logic, but longer and less
regular wiring; nevertheless most large multipliers use trees or hybrid structures. Pass
transistor Booth selectors and CSAs were popular in the 1990s, but the trend is toward
static CMOS as supply voltage scales. Finally, a CPA converts the result to nonredundant form.
CHAPTER 3
SPST MODIFIED BOOTH ENCODER
3.1. Spurious power suppression technique:
Figure 2 shows the five cases of a 16-bit addition in which spurious switching activities
occur. The 1st case illustrates a transient state in which spurious transitions of carry signals
occur in the MSP though the final result of the MSP is unchanged. The 2nd and the 3rd cases
describe the situations of adding one negative operand to one positive operand without and with
carry from the LSP, respectively. Moreover, the 4th and the 5th cases respectively demonstrate the
addition of two negative operands without and with carry-in from the LSP. In those cases, the results
of the MSP are predictable. Therefore, the computations in the MSP are useless and can be
neglected. The data are separated into the Most Significant Part (MSP) and the Least Significant
Part (LSP). To know whether the MSP affects the computation results or not, we need a
detection logic unit to detect the effective ranges of the inputs. The Boolean logical equations
shown below express the behavioral principles of the detection logic unit in the MSP circuits of
the SPST-based adder/subtractor:
Figure 2. Spurious transition cases in multimedia/ DSP processing
$A_{MSP} = A[15:8]$; $B_{MSP} = B[15:8]$;
$A_{and} = A[15] \cdot A[14] \cdots A[8]$;
$B_{and} = B[15] \cdot B[14] \cdots B[8]$;
where A[m] and B[n] respectively denote the mth bit of the operand A and the nth bit of
the operand B, and AMSP and BMSP respectively denote the MSP parts, i.e. the 9th bit to the
16th bit, of the operands A and B. When the bits in AMSP and/or those in BMSP are all ones,
the value of Aand and/or that of Band respectively becomes one; when the bits in AMSP and/or
those in BMSP are all zeros, the value of Anor and/or that of Bnor respectively turns into one.
Being one of the three outputs of the detection logic unit, close denotes whether the MSP circuits
can be neglected or not. When the two input operands can be classified into one of the five classes
shown in Figure 2, the value of close becomes zero, which indicates that the MSP circuits can be
closed. Figure 2 also shows that it is necessary to compensate the sign bit of the computing
results. Accordingly, we derive the Karnaugh maps which lead to the Boolean equations (7) and (8)
for the Carr_ctrl and the sign signals, respectively. In equations (7) and (8), CLSP denotes the
carry propagated from the LSP circuits.
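The behavior of the detection logic can be illustrated with a small Python model. This is only a sketch under stated assumptions: the function names are hypothetical, the expression used for close (zero when each operand's MSP is all ones or all zeros) is an assumed form inferred from the description above, and the compensation of the MSP result is a behavioral stand-in for equations (7) and (8), which are not reproduced in this text.

```python
def detect(a, b):
    """Detection logic for 16-bit operands split between bit 7 and bit 8."""
    a_msp, b_msp = (a >> 8) & 0xFF, (b >> 8) & 0xFF
    a_and, a_nor = int(a_msp == 0xFF), int(a_msp == 0x00)
    b_and, b_nor = int(b_msp == 0xFF), int(b_msp == 0x00)
    # Assumed form of close: 0 when both MSPs are all ones or all zeros.
    close = int(not ((a_and or a_nor) and (b_and or b_nor)))
    return a_and, b_and, close


def spst_add(a, b):
    """16-bit addition that skips the MSP adder whenever close is 0."""
    a &= 0xFFFF
    b &= 0xFFFF
    a_and, b_and, close = detect(a, b)
    lsp = (a & 0xFF) + (b & 0xFF)          # the LSP adder always runs
    c_lsp = lsp >> 8                       # carry out of the LSP
    if close:
        msp = ((a >> 8) + (b >> 8) + c_lsp) & 0xFF   # MSP adder is needed
    else:
        # Each MSP is 0x00 (value 0) or 0xFF (value -1 modulo 256), so the
        # MSP result is predictable from the flags and the LSP carry alone.
        msp = (c_lsp - a_and - b_and) & 0xFF
    return (msp << 8) | (lsp & 0xFF)


samples = (0x0000, 0x0012, 0x00FF, 0xFF80, 0xFFFF, 0x1234, 0x8000)
for a in samples:
    for b in samples:
        assert spst_add(a, b) == (a + b) & 0xFFFF
```

In this model the bypass path never touches the upper eight bits of the operands, which corresponds to the latched (zeroed) MSP inputs in the hardware.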
Fig 3.1 shows a 16-bit adder/subtractor design example based on the proposed SPST. In
this example, the 16-bit adder/subtractor is divided into MSP and LSP between the
8th bit and the 9th bit. Latches implemented by simple AND gates are used to control the input
data of the MSP. When the MSP is necessary, the input data of the MSP remain the same as usual;
when the MSP is negligible, the input data of the MSP become zeros to avoid switching power
consumption. From the derived Boolean equations (1) to (8), the detection logic unit of the SPST
is designed as shown in figure 4. Whether the MSP is used is determined by whether the input data
of the MSP are latched or not. Moreover, we add three 1-bit registers to control the assertion of
the close, sign, and Carr-ctrl signals in order to further decrease the glitch signals occurring in
the cascaded circuits which are usually adopted in VLSI architectures designed for video coding.
Fig 3.1 16-bit adder/subtractor design example
Fig 3.1 shows a 16-bit adder/subtractor design example adopting the proposed SPST. In this
example, the 16-bit adder/subtractor is divided into MSP and LSP between the eighth and the
ninth bits. Latches implemented by simple AND gates are used to control the input data of the
MSP. When the MSP is necessary, the input data of the MSP remain unchanged. However, when the
MSP is negligible, the input data of the MSP become zeros to avoid glitching power
consumption. The two operands of the MSP enter the detection-logic unit as well as the
adder/subtractor, so that the detection-logic unit can decide whether to turn off the MSP or not.
Based on the derived Boolean equations (1) to (8), the detection-logic unit of SPST is shown in
Fig. 6(a), which can determine whether the input data of MSP should be latched or not.
Moreover, we propose the novel glitch-diminishing technique by adding three 1-bit registers to
control the assertion of the close, sign, and carr-ctrl signals to further decrease the transient
signals occurring in the cascaded circuits which are usually adopted in VLSI architectures designed
for multimedia/DSP applications. The timing diagram is shown in Fig. 6(b). A certain amount of
delay is used to assert the close, sign, and carr-ctrl signals after the period of data transition,
which is achieved by controlling the three 1-bit registers at the outputs of the detection-logic
unit.
Hence, the transients of the detection-logic unit can be filtered out; thus, the data latches
can prevent the glitch signals from flowing into the MSP at a tiny cost. The data
transient time and the earliest required time of all the inputs are also illustrated. The delay should
be set within the range shown as the shaded area in the timing diagram, to filter out the glitch
signals as well as to keep the computation results correct. Based on Figs. 5 and 6, the timing issue
of the SPST is analyzed as follows.
3.1.1. When the detection-logic unit turns off the MSP:
At this moment, the outputs of the MSP are directly compensated by the SE unit; therefore, the
time saved from skipping the computations in the MSP circuits shall cancel out the delay caused
by the detection-logic unit.
3.1.2. When the detection-logic unit turns on the MSP:
The MSP circuits must wait for the notification of the detection-logic unit to turn on the data
latches to let the data in. Hence, the delay caused by the detection-logic unit will contribute to
the delay of the whole combinational circuitry, i.e., the 16-bit adder/subtractor in this design
example.
3.1.3. When the detection-logic unit retains its decision:
No matter whether the last decision is turning on or turning off the MSP, the delay of the
detection logic is negligible because the path of the combinational circuitry (i.e., the 16-bit
adder/subtractor in this design example) remains the same. From the analysis above, we can
see that the total delay is affected only when the detection-logic unit turns on the MSP.
Therefore, the detection-logic unit should be a speed-oriented design. When the SPST is applied
to combinational circuitry, we should first determine the longest transitions of the cross
sections of interest of each combinational circuit, which are a timing characteristic and are also
related to the adopted technology. The longest transitions can be obtained by analyzing the timing
differences between the earliest-arriving and the latest-arriving signals of the cross sections of a
combinational circuit.
3.2. MAC
3.2.1 Block Diagram of MAC:
In this Project, a new architecture for a high-speed MAC is proposed. In this MAC, the
computations of multiplication and accumulation are combined and a hybrid-type CSA structure
is proposed to reduce the critical path and improve the output rate. It uses the modified Booth
algorithm (MBA) based on the 1's complement number system. A modified array structure for the sign bits is used to
increase the density of the operands. A carry look-ahead adder (CLA) is inserted in the CSA tree
to reduce the number of bits in the final adder. In addition, in order to increase the output rate by
optimizing the pipeline efficiency, intermediate calculation results are accumulated in the form
of sum and carry instead of the final adder outputs.
A multiplier can be divided into three operational steps. The first is radix-2 Booth
encoding in which a partial product is generated from the multiplicand X and the multiplier Y .
The second is adder array or partial product compression to add all partial products and convert
them into the form of sum and carry. The last is the final addition in which the final
multiplication result is produced by adding the sum and the carry. If the process to accumulate
the multiplied results is included, a MAC consists of four steps, as shown in Fig 3.2, which shows
the operational steps explicitly.
Fig 3.2 MAC Operational steps
3.2.2. Proposed MAC Architecture:
In this section, the expression for the new arithmetic will be derived from equations of
the standard design. From this result, VLSI architecture for the new MAC will be proposed. In
addition, a hybrid-typed CSA architecture that can satisfy the operation of the proposed MAC
will be proposed.
3.3. Radix-4 modified Booth's algorithm:
Booth's algorithm is simple but powerful. The speed of the VMFU depends on the number
of partial products and on the speed at which the partial products are accumulated. Booth's
algorithm reduces the number of partial products. We choose the radix-4 algorithm for the reasons below.
• The original Booth's algorithm has an inefficient case.
The 17 partial products are generated in 16bit x 16bit signed or unsigned multiplication.
• The radix-8 modified Booth's algorithm has a fatal encoding time in 16bit x 16bit multiplication.
The radix-8 algorithm has a 3x term, which means that a partial product cannot be generated by
shifting alone. Therefore, 2x + 1x is needed in the encoding process. One solution is to handle
an additional 1x term in the Wallace tree. However, a large Wallace tree has some problems too.
A radix-4 modified Booth's algorithm: Booth's radix-4 algorithm is widely used to reduce
the area of the multiplier and to increase its speed. Grouping 3 bits of the multiplier with
overlapping halves the number of partial products, which improves the system speed. The radix-4
modified Booth's algorithm is shown below:
• Set y-1 = 0, i.e. insert a 0 to the right of the LSB of the multiplier.
• Starting from y-1, group the multiplier bits three at a time, with each group overlapping the
previous one by one bit.
• If the number of multiplier bits is odd, extend the multiplier by one extra bit to the left of the MSB.
• Generate a partial product for each group from the truth table.
• Each new partial product is shifted left by 2 bits relative to the previous one before it is
added.
x: multiplicand y: multiplier
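The steps listed above can be sketched in Python as follows. This is a minimal behavioral model, not the hardware encoder: the function names are hypothetical, and the extra bit added for an odd width is assumed to be a sign-extension bit, which the text does not state explicitly. The lookup table is the standard radix-4 Booth recoding of the bit triplet (y[2i+1], y[2i], y[2i-1]).

```python
# Standard radix-4 Booth recoding: (y[2i+1], y[2i], y[2i-1]) -> signed digit.
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(y, n_bits):
    """Recode an n_bits-wide two's-complement multiplier into radix-4 digits, LSB first."""
    if n_bits % 2:                               # odd width: add one bit (assumed sign extension)
        sign = (y >> (n_bits - 1)) & 1
        y |= sign << n_bits
        n_bits += 1
    y &= (1 << n_bits) - 1
    bits = [0] + [(y >> i) & 1 for i in range(n_bits)]   # bits[0] is y(-1) = 0
    return [BOOTH_TABLE[(bits[i + 2], bits[i + 1], bits[i])]
            for i in range(0, n_bits, 2)]


def booth_multiply(x, y, n_bits):
    """Multiply signed x by signed y, where y fits in n_bits two's complement."""
    product = 0
    for j, digit in enumerate(booth_digits(y, n_bits)):
        product += (digit * x) << (2 * j)        # each partial product is shifted 2 bits further
    return product


print(booth_digits(7, 4))          # digits of 0111, LSB first
print(booth_multiply(-3, 7, 4))
```

An n-bit multiplier yields n/2 digits (rounded up), which is where the halving of the partial-product count comes from.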
3.4. Sign or zero extension
Our MAC supports signed or unsigned multiplication, and the produced result is 64bit,
which is stored in 2 special 32bit registers. The MAC first receives a multiplicand and a multiplier,
but Booth's radix-4 algorithm treats the 16bit operands as signed numbers. Hence, an extension bit is
required so that the operands can be expressed as signed numbers. The core idea is that a 16bit
unsigned number can be expressed by a 33bit signed number. The 17 partial products are generated
in the 33bit x 33bit case (16 partial products in the 32bit x 32bit case). Here is an example of
signed and unsigned multiplication. When x (multiplicand) is 3bit 111 and y (multiplier) is 3bit 111,
the signed and unsigned products differ. In the signed case x × y = 1 (-1 x -1 = 1) and in the
unsigned case x × y = 49 (7 x 7 = 49).
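The 3-bit example above can be reproduced with a short Python sketch. The helper names are hypothetical; the point is only that extending each operand by one bit (a copy of the MSB for signed operands, a zero for unsigned operands) lets a single signed multiplier handle both cases.

```python
def to_signed(pattern, width):
    """Interpret a width-bit pattern as a two's-complement signed integer."""
    pattern &= (1 << width) - 1
    return pattern - (1 << width) if pattern >> (width - 1) else pattern


def extend(pattern, width, signed):
    """Return the (width + 1)-bit pattern after sign extension or zero extension."""
    pattern &= (1 << width) - 1
    msb = (pattern >> (width - 1)) & 1
    return pattern | ((msb if signed else 0) << width)


def multiply(x, y, width, signed):
    """Multiply two width-bit patterns using one signed multiplication on extended operands."""
    xs = to_signed(extend(x, width, signed), width + 1)
    ys = to_signed(extend(y, width, signed), width + 1)
    return xs * ys


print(multiply(0b111, 0b111, 3, signed=True))    # signed interpretation: -1 x -1
print(multiply(0b111, 0b111, 3, signed=False))   # unsigned interpretation: 7 x 7
```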
3.5. Carry-Save Adder
When three or more operands are to be added simultaneously using two-operand adders,
the time-consuming carry propagation must be repeated several times. If the number of operands
is 'k', then carries have to propagate (k-1) times (Weste & Harris, 3rd Ed). In carry-save
addition, we let the carry propagate only in the last step, while in all the other steps we generate
the partial sum and the sequence of carries separately. A CSA is capable of reducing the number of
operands to be added from 3 to 2 without any carry propagation. A CSA can be implemented in
different ways. In the simplest implementation, the basic element of the carry-save adder is a
combination of two half adders or a 1-bit full adder (Weste & Harris, 3rd Ed).
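A word-level Python sketch of carry-save addition is given below, assuming non-negative operands; the function names are hypothetical. Each bit position behaves as an independent 1-bit full adder, so no carry ripples between positions until the single final addition.

```python
def carry_save_add(a, b, c):
    """3:2 compression: returns (sum_word, carry_word) with a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c                               # per-bit full-adder sum output
    carry_word = ((a & b) | (b & c) | (a & c)) << 1    # per-bit carry output, one position higher
    return sum_word, carry_word


def add_operands(operands):
    """Add k non-negative operands with only one carry-propagating addition."""
    s, c = 0, 0
    for operand in operands:
        s, c = carry_save_add(s, c, operand)           # no carry propagation here
    return s + c                                       # the only carry-propagate step


print(carry_save_add(5, 6, 7))
print(add_operands([3, 5, 7, 9]))
```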
3.6 Circuit Design Features
One of the most advanced types of MAC for general-purpose digital signal processing
has been proposed by Elguibaly. It is an architecture in which accumulation has been combined
with the carry save adder (CSA) tree that compresses partial products. In that architecture,
the critical path was reduced by eliminating the adder for accumulation and
decreasing the number of input bits in the final adder. While it has a better performance because
of the reduced critical path compared to the previous VMFU architectures, there is a need to
improve the output rate due to the use of the final adder results for accumulation. An
architecture that merges the adder block into the accumulator register in the VMFU operator was
proposed to provide the possibility of using two separate N/2-bit adders instead of one N-bit adder
to accumulate the MAC results. Recently, Zicari proposed an architecture that took a merging
technique to fully utilize the 4–2 compressor. It also took this compressor as the basic building
block for the multiplication circuit.
Fig 3.3 basic building blocks for the multiplication circuit.
CHAPTER 4
IMPLEMENTATION
4.1 Introduction to VMFU:
Consider an operation that multiplies two N-bit numbers and accumulates the result into a 2N-bit
number, together with addition, subtraction, Sum of Absolute Difference (SAD), and interpolation.
The critical path is determined by the 2N-bit accumulation operation. If a pipeline scheme is applied
for each step in the standard design of Fig. 4.1, the delay of the last accumulator must be reduced
in order to improve the performance of the MAC. The overall performance of the proposed
VMFU is improved by eliminating the accumulator itself by combining it with the CSA function.
If the accumulator has been eliminated, the critical path is then determined by the final adder in
the multiplier. The basic method to improve the performance of the final adder is to decrease the
number of input bits. In order to reduce this number of input bits, the multiple partial products
are compressed into a sum and a carry by the CSA. The number of bits of sums and carries to be
transferred to the final adder is reduced by adding the lower bits of sums and carries in advance
within the range in which the overall performance will not be degraded. A 2-bit CLA is used to
add the lower bits in the CSA. In addition, to increase the output rate when pipelining is applied,
the sums and carries from the CSA are accumulated instead of the outputs from the final adder, in
such a manner that the sum and carry from the CSA in the previous cycle are fed back into the CSA.
Due to this feedback of both sum and carry, the number of inputs to the CSA increases compared to
the standard design. In order to handle the increased amount of data efficiently, the CSA
architecture is modified to treat the sign bit.
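The feedback of sum and carry described above can be modeled behaviorally in Python. This is a sketch under simplifying assumptions: operands are unsigned, partial products are plain AND rows rather than Booth-encoded ones, and the sign-bit handling and the 2-bit CLA are left out. All names are hypothetical.

```python
def csa(a, b, c):
    """3:2 carry-save compression of three words into a sum word and a carry word."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1


def mac_carry_save(pairs):
    """Accumulate sum(a * b) while keeping the running total in (sum, carry) form."""
    acc_sum, acc_carry = 0, 0
    for a, b in pairs:
        # Partial products of this multiplication (unsigned shift-and-AND rows).
        rows = [a << i for i in range(b.bit_length()) if (b >> i) & 1]
        # Feed the previous cycle's sum and carry back into the same CSA tree.
        rows += [acc_sum, acc_carry]
        while len(rows) > 2:
            s, c = csa(rows.pop(), rows.pop(), rows.pop())
            rows += [s, c]
        acc_sum, acc_carry = rows
    return acc_sum + acc_carry      # the final adder is used once, at the end


print(mac_carry_save([(3, 4), (5, 6), (7, 8)]))
```

Because the accumulator never passes through the final adder between cycles, the carry-propagating addition is removed from the per-cycle path in this model, which is the motivation given in the text for accumulating in sum and carry form.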
Fig 4.1 Versatile Multimedia Functional Unit
The VMFU is composed of an adder, a multiplier and an accumulator. The adders
implemented are usually Carry-Select or Carry-Save adders, as speed is of utmost importance in DSP
(Chandrakasan, Sheng, & Brodersen, 1992 and Weste & Harris, 3rd Ed). One implementation of
the multiplier could be a parallel array multiplier. The inputs for the VMFU are fetched
from a memory location and fed to the multiplier block, which performs the multiplication and
gives the result to the adder, which accumulates the result and then stores it into a
memory location. This entire process is to be achieved in a single clock cycle (Weste & Harris,
3rd Ed).
Fig 4.2 Architecture of MAC
Figure 4.2 shows the architecture of the MAC unit designed in this work. The
design consists of one 16 bit register, one 16-bit Modified Booth multiplier, a 33-bit
accumulator using ripple carry and two 16-bit accumulator registers. To multiply the values of A
and B, a Modified Booth multiplier is used instead of a conventional multiplier because the Modified
Booth multiplier can increase the MAC unit design speed and reduce multiplication complexity.
A Carry save Adder (CSA) is used as an accumulator in this design. With the
Wallace tree multiplier approach, a carry save adder in the final stage of the Modified
Booth multiplier, and a Carry save Adder as the accumulator, this VMFU unit design not only
reduces the standby power consumption but can also enhance the VMFU unit speed so as to
gain better system performance. The operation of the designed VMFU unit is as in Equation 2.1.
The product of Ai X Bi is always fed back into the 34-bit Carry save accumulator and then added
again with the next product Ai x Bi. This MAC unit is capable of multiplying and adding with
the previous product consecutively up to as many as eight times.
Operation: Output = Σ Ai Bi (2.1)
In this Project, a 16x16 multiplier unit is designed that can perform accumulation
on a 34 bit number. This MAC unit has a 34 bit output and its operation is to add repeatedly the
multiplication results. The total design area is also inspected by observing the total count
of gates [Hardwires]. The power delay product is calculated by multiplying the power consumption
result with the time delay.
Design of VMFU
In the majority of digital signal processing (DSP) applications the critical operations
usually involve many multiplications and/or accumulations. For real-time signal processing, a
high speed and high throughput Multiplier-Accumulator (MAC) is always a key to achieve a
high performance digital signal processing system and versatile multimedia functional units. In
the last few years, the main consideration of MAC design has been to enhance its speed, because
speed and throughput rate are always the concern of the VMFU. In the era of
personal communication, however, low power design has become another main design consideration,
because the battery energy available for these portable products limits the power
consumption of the system. Therefore, the main motivation of this work is to investigate various
pipelined multiplier/accumulator architectures and circuit design techniques which are suitable
for implementing high throughput signal processing algorithms and at the same time achieve low
power consumption. A conventional VMFU unit consists of a (fast) multiplier and an
accumulator that contains the sum of the previous consecutive products. The function of the
VMFU unit is given by the following equation:
$F = \sum_i A_i B_i$ (2.1)
The main goal of a VMFU design is to enhance the speed of the MAC unit and at the same time
limit the power consumption. In a pipelined MAC circuit, the delay of a pipeline stage is the delay of a
1-bit full adder. Estimating this delay assists in identifying the overall delay of the pipelined MAC. In
this work, a 1-bit full adder is designed. Area, power and delay are calculated for the full adder, based on
which the pipelined MAC unit is designed for low power.
4.2 Explanation
4.2.1. High-Speed Booth Encoded Parallel Multiplier Design:
Fast multipliers are essential parts of digital signal processing systems. The speed of multiply
operation is of great importance in digital signal processing as well as in the general purpose processors
today, especially since the media processing took off. In the past multiplication was generally
implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be
considered as a series of repeated additions. The number to be added is the multiplicand, the number of
times that it is added is the multiplier, and the result is the product. Each step of addition generates a
partial product. In most computers, the operand usually contains the same number of bits. When the
operands are interpreted as integers, the product is generally twice the length of operands in order to
preserve the information content. This repeated addition method that is suggested by the arithmetic
definition is so slow that it is almost always replaced by an algorithm that makes use of positional
representation. It is possible to decompose multipliers into two parts. The first part is dedicated to the
generation of partial products, and the second one collects and adds them.
The basic multiplication principle is twofold, i.e. evaluation of partial products and accumulation
of the shifted partial products. It is performed by the successive additions of the columns of the shifted
partial product matrix. The 'multiplier' is successively shifted and gates the appropriate bit of the
'multiplicand'. The delayed, gated instances of the multiplicand must all be in the same column of the
shifted partial product matrix. They are then added to form the product bit for the particular form.
Multiplication is therefore a multi-operand operation, which can be extended to both signed and
unsigned operands.
4.2.2. Modified Booth Encoder:
In order to achieve high-speed multiplication, multiplication algorithms using parallel counters,
such as the modified Booth algorithm, have been proposed, and some multipliers based on these algorithms
have been implemented for practical use. This type of multiplier operates much faster than an array
multiplier for longer operands because its computation time is proportional to the logarithm of the word
length of operands.
Fig 4.3 Radix-4 Booth Encoding
Booth multiplication is a technique that allows for smaller, faster multiplication circuits, by
recoding the numbers that are multiplied. It is possible to reduce the number of partial products by half,
by using the technique of radix-4 Booth recoding. The basic idea is that, instead of shifting and adding for
every column of the multiplier term and multiplying by 1 or 0, we only take every second column, and
multiply by ±1, ±2, or 0, to obtain the same results. The advantage of this method is the halving of the
number of partial products. To Booth recode the multiplier term, we consider the bits in blocks of three,
such that each block overlaps the previous block by one bit. Grouping starts from the LSB, and the first
block only uses two bits of the multiplier. Fig. 4.4 shows the grouping of bits from the multiplier term for
use in modified Booth encoding.
Fig.4.4 Grouping of bits from the multiplier term
Each block is decoded to generate the correct partial product. The encoding of the multiplier Y, using the
modified booth algorithm, generates the following five signed digits, -2, -1, 0, +1, +2. Each encoded digit
in the multiplier performs a certain operation on the multiplicand, X, as illustrated in Table 4.1
Table 4.1 Operation performed on the multiplicand, X, by each encoded digit in the multiplier
For the partial product generation, we adopt the Radix-4 Modified Booth algorithm to reduce the
number of partial products by roughly one half. For multiplication of 2's complement numbers, the
two-bit encoding using this algorithm scans a triplet of bits. When the multiplier B is divided into groups of
two bits, the algorithm is applied to each group of divided bits.
Fig. 4.5 shows a computing example of Booth multiplying two numbers "2AC9" and "006A".
The shadow denotes that the numbers in this part of the Booth multiplication are all zero, so that this part of
the computations can be neglected. Saving those computations can significantly reduce the power
consumption caused by the transient signals. According to the analysis of the multiplication shown in
Fig. 4.5, we propose the SPST-equipped modified-Booth encoder, which is controlled by a detection unit.
The detection unit has one of the two operands as its input to decide whether the Booth encoder calculates
redundant computations. As shown in Fig 4.6, the latches can, respectively, freeze the inputs of MUX-4
to MUX-7 or only those of MUX-6 to MUX-7 when PP4 to PP7 or PP6 to PP7 are zero, to reduce
the transition power dissipation. Fig. 4.8 shows the Booth partial product generation circuit. It includes
AND/OR/EX-OR logic.
Fig.4.5 Illustration of multiplication using modified Booth encoding
The PP generator generates five candidates of the partial products, i.e., {-2A,-A, 0, A, 2A}. These
are then selected according to the Booth encoding results of the operand B. When the operand besides the
Booth encoded one has a small absolute value, there are opportunities to reduce the spurious power
dissipated in the compression tree.
Fig4.6 SPST equipped modified Booth encoder
4.2.3. Partial product generator:
Fig4.7 Booth partial product selector logic
The first step of the multiplication generates from A and X a set of bits whose weighted sum is the product P. For
unsigned multiplication, the weight of the most significant bit of P is positive, while in 2's complement it is negative.
The partial products are generated by ANDing 'a' and 'b', which are 4-bit vectors as
shown in the figure. With a 4-bit multiplier and a 4-bit multiplicand we get sixteen partial product bits, of which
the first partial product row is stored in 'q'. Similarly, the second, third and fourth partial product rows are stored
in the 4-bit vectors n, x, y.
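The AND-based generation of the sixteen partial product bits can be sketched in Python as follows, for unsigned 4-bit operands. The function names are hypothetical; the row names q, n, x, y follow the text.

```python
def and_partial_products(a, b, width=4):
    """Row i holds the bits a[j] AND b[i] for j = 0..width-1 (LSB first)."""
    return [[((a >> j) & 1) & ((b >> i) & 1) for j in range(width)]
            for i in range(width)]


def weighted_sum(rows):
    """Sum of all partial product bits, where the bit in row i, column j has weight 2**(i + j)."""
    return sum(bit << (i + j)
               for i, row in enumerate(rows)
               for j, bit in enumerate(row))


q, n, x, y = and_partial_products(0b1011, 0b0110)   # 11 x 6
print(q, n, x, y)
print(weighted_sum([q, n, x, y]))
```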
Fig.4.8 Booth partial products Generation
The second step of the multiplication reduces the partial products from the preceding step into two
numbers while preserving the weighted sum. The sought-after product P is the sum of those two numbers.
The two numbers will be added during the third step. The synthesis of the "Wallace trees" follows Dadda's
algorithm, which ensures the minimum number of counters. If, on top of that, we require the reduction to
happen as late as (or as soon as) possible, then the solution is unique. The two binary numbers to be added
during the third step may also be seen as one number in CSA notation (2 bits per digit).
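As a rough illustration of how many reduction stages such a tree needs, the following Python sketch lists the target column heights used in Dadda's scheme (2, 3, 4, 6, 9, 13, ..., each about 1.5 times the previous one). The function name is hypothetical and the sketch only counts stages; it does not place the counters.

```python
def dadda_heights(n_rows):
    """Target heights after each reduction stage, from the first stage down to 2 rows."""
    heights = []
    d = 2
    while d < n_rows:
        heights.append(d)
        d = d * 3 // 2          # next height in the sequence: floor(1.5 * d)
    return heights[::-1]


for rows in (4, 8, 16):
    stages = dadda_heights(rows)
    print(rows, "rows ->", stages, "(", len(stages), "stages )")
```

Under this scheme, the eight partial products of a 16-bit radix-4 Booth multiplier are reduced to two rows in four stages.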
Fig.4.9 Booth single partial product selector logic
4.2.4. Truth Table of Modified Booth Encoder:
Multiplication consists of three steps: 1) the first step generates the partial products; 2) the
second step adds the generated partial products until only two rows remain; 3) the third step
computes the final multiplication result by adding the last two rows. The modified Booth algorithm
reduces the number of partial products by half in the first step. We used a previously proposed modified
Booth encoding (MBE) scheme, which is known as the most efficient Booth encoding and decoding scheme.
Multiplying X by Y using the modified Booth algorithm starts by grouping Y by three bits and encoding
each group into one of {-2, -1, 0, 1, 2}. Fig 4.12 shows the rules to generate the encoded signals by the MBE
scheme and Fig. 4.10 shows the corresponding logic diagram. The Booth decoder generates the partial products
using the encoded signals as shown in Fig. 4.11.
Fig.4.10 Booth Encoder
Fig.4.11.Booth Decoder
The figure shows the generated partial products and sign extension scheme of the 8-bit modified Booth
multiplier. The partial products generated by the modified Booth algorithm are added in parallel using the
Wallace tree until only two rows remain. The final multiplication result is generated by adding
the last two rows. A carry propagation adder is usually used in this step.
Fig 4.12 Truth table for MBE Scheme
CHAPTER 5
5.1 Introduction to FPGA:
FPGA stands for Field Programmable Gate Array, which has an array of logic modules, I/O
modules and routing tracks (programmable interconnect). An FPGA can be configured by the end user
to implement specific circuitry. Speed was formerly up to 100 MHz, but at present it is in the GHz range.
Main applications are DSP, FPGA based computers, logic emulation, ASIC and ASSP.
FPGAs are mainly programmed using SRAM (Static Random Access Memory). It is volatile, and
the main advantage of SRAM programming technology is re-configurability. Issues in FPGA
technology are the complexity of the logic element, clock support, IO support and interconnections
(routing).
5.2 Block diagram of FPGA:
An FPGA contains a two-dimensional array of logic blocks and interconnections between
logic blocks. Both the logic blocks and the interconnects are programmable. Logic blocks are
programmed to implement a desired function and the interconnects are programmed using the
switch boxes to connect the logic blocks.
To be more clear, if we want to implement a complex design (a CPU for instance), then the
design is divided into small sub-functions and each sub-function is implemented using one logic
block. Now, to get our desired design (CPU), all the sub-functions implemented in logic blocks
must be connected, and this is done by programming the interconnects. The internal structure of
an FPGA is depicted in the following figure.
Fig 5.1 Internal structure of an FPGA
FPGAs, an alternative to custom ICs, can be used to implement an entire System On Chip
(SOC). The main advantage of an FPGA is the ability to reprogram it. The user can reprogram an FPGA
to implement a design, and this is done after the FPGA is manufactured. This gives rise to the name
"Field Programmable."
Custom ICs are expensive and take a long time to design, so they are useful when
produced in bulk. FPGAs, in contrast, are easy to implement within a short time with the help
of Computer Aided Design (CAD) tools (because there is no physical layout process, no mask
making, and no IC manufacturing).
Some disadvantages of FPGAs are that they are slow compared to custom ICs, they can't
handle very complex designs, and they draw more power.
A Xilinx logic block consists of one Look Up Table (LUT) and one flip-flop. An LUT is
used to implement a number of different functions. The input lines to the logic block go into
the LUT and enable it. The output of the LUT gives the result of the logic function that it
implements, and the output of the logic block is the registered or unregistered output from the LUT.
SRAM is used to implement a LUT. A k-input logic function is implemented using a 2^k x
1 SRAM. The number of different possible functions for a k-input LUT is 2^(2^k). The advantage of
such an architecture is that it supports the implementation of so many logic functions; however, the
disadvantage is the unusually large number of memory cells required to implement such a logic
block when the number of inputs is large.
Figure 5.2 4-input LUT based implementation of logic block
LUT based design provides for better logic block utilization. A k-input LUT based logic
block can be implemented in number of different ways with trade off between performance and
logic density.
An n-LUT can be seen as a direct implementation of a function truth table. Each latch
holds the value of the function corresponding to one input combination. For example, a 2-LUT
can be used to implement 16 types of functions such as AND, OR, A + not B, etc.
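The view of a LUT as a small memory addressed by its inputs can be sketched in Python as follows. The class and function names are hypothetical, and the convention that the first input is the most significant address bit is an assumption made for this example.

```python
class LUT:
    """k-input look-up table modeled as 2**k one-bit memory cells."""

    def __init__(self, k, truth_table):
        if len(truth_table) != 2 ** k:
            raise ValueError("truth table must have 2**k entries")
        self.k = k
        self.cells = list(truth_table)

    def __call__(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("expected %d inputs" % self.k)
        address = 0
        for bit in inputs:                    # inputs form the memory address, first input is the MSB
            address = (address << 1) | (bit & 1)
        return self.cells[address]


def count_functions(k):
    """Number of distinct logic functions a k-input LUT can implement: 2**(2**k)."""
    return 2 ** (2 ** k)


and2 = LUT(2, [0, 0, 0, 1])                   # A AND B
or2 = LUT(2, [0, 1, 1, 1])                    # A OR B
a_or_not_b = LUT(2, [1, 0, 1, 1])             # A + not B

print(count_functions(2), count_functions(4))
print([a_or_not_b(a, b) for a in (0, 1) for b in (0, 1)])
```

The same hardware implements any of these functions; only the stored cell contents differ.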
Interconnects
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an FPGA can
be termed as a track.
Typically an FPGA has logic blocks, interconnects and switch blocks (Input/Output
blocks). Switch blocks lie in the periphery of logic blocks and interconnect. Wire segments are
connected to logic blocks through switch blocks. Depending on the required design, one logic
block is connected to another and so on.
5.3 FPGA Design flow
In this part of the tutorial we give a short introduction to the FPGA design flow. A simplified
version of the design flow is given in the following diagram.
Fig 5.3 FPGA DesignFlow
Design Entry
There are different techniques for design entry: schematic based, Hardware Description
Language (HDL) based, a combination of both, etc. The selection of a method depends on the design and
the designer. If the designer wants to deal more with hardware, then schematic entry is the better
choice. When the design is complex or the designer thinks of the design in an algorithmic way, then
HDL is the better choice. Language based entry is faster but lags in performance and density.
HDLs represent a level of abstraction that can isolate the designers from the details of the
hardware implementation. Schematic based entry gives designers much more visibility into the
hardware. It is the better choice for those who are hardware oriented. Another method, but rarely
used, is state machines. It is the better choice for designers who think of the design as a series of
states, but the tools for state machine entry are limited. In this documentation we are going to
deal with the HDL based design entry.
Synthesis
Synthesis is the process which translates VHDL or Verilog code into a device netlist format, i.e. a
complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design
contains more than one sub-design, e.g. to implement a processor we need a CPU as one design
element and RAM as another and so on, then the synthesis process generates a netlist for each
design element. The synthesis process checks code syntax and analyzes the hierarchy of the design,
which ensures that the design is optimized for the architecture the designer has selected.
The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx® Synthesis
Technology (XST)).
Fig 5.4 FPGA Synthesis
Implementation
This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route
Translate:
The Translate process combines all the input netlists and constraints into a logic design
file, which is saved as an NGD (Native Generic Database) file. This is done using the NGDBuild
program. Here, defining constraints means assigning the ports in the design to the physical
elements (e.g., pins, switches, buttons) of the targeted device and specifying the timing
requirements of the design. This information is stored in a UCF (User Constraints File). Tools
used to create or modify the UCF include PACE and the Constraint Editor.
Fig 5.5 FPGA Translate
Map
The Map process divides the whole circuit of logical elements into sub-blocks that fit into
the FPGA logic blocks. In other words, it fits the logic defined by the NGD file into the
targeted FPGA elements (Configurable Logic Blocks (CLB), Input Output Blocks (IOB)) and
generates an NCD (Native Circuit Description) file, which physically represents the design
mapped to the components of the FPGA. The MAP program is used for this purpose.
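The idea of dividing a circuit into sub-blocks that fit the logic blocks can be illustrated with a toy packing routine. This is only a sketch: the element names, the capacity of four elements per block, and the simple fill-in-order strategy are assumptions made for illustration, and the real MAP program uses far more elaborate algorithms.

```python
def pack_into_blocks(elements, capacity):
    """Pack logic elements, in order, into blocks of a fixed capacity (toy model)."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    blocks = []
    for element in elements:
        # Open a new block when there is none yet or the last one is full.
        if not blocks or len(blocks[-1]) == capacity:
            blocks.append([])
        blocks[-1].append(element)
    return blocks


# Hypothetical netlist elements; the names are illustrative only.
elements = ["and0", "or0", "ff0", "xor0", "and1", "ff1"]
blocks = pack_into_blocks(elements, capacity=4)
print(blocks)
```

Six elements with a capacity of four need two blocks; the second block is left partly empty, which is the kind of slack a utilization report would show.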
Fig 5.6 FPGA map
Place and Route:
The PAR program is used for this process. The place and route process places the sub-blocks
from the map process into logic blocks according to the constraints and connects the logic
blocks. For example, placing a sub-block in a logic block very near an I/O pin may save time
but may adversely affect some other constraint, so the place and route process takes the
trade-off between all the constraints into account.
The PAR tool takes the mapped NCD file as input and produces a completely routed NCD file as
output. The output NCD file contains the routing information.
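A very small model of placement helps show why block position matters. The sketch below tries every assignment of three blocks to three sites and keeps the one with the shortest total wire length. The block names, site coordinates, and the single cost measure (Manhattan distance) are assumptions for illustration; the actual PAR tool balances many constraints and does not search exhaustively.

```python
from itertools import permutations


def wire_length(placement, nets):
    """Total Manhattan distance over all two-pin nets."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total


def best_placement(blocks, sites, nets):
    """Try every assignment of blocks to sites and keep the shortest one."""
    best = None
    for perm in permutations(sites, len(blocks)):
        candidate = dict(zip(blocks, perm))
        cost = wire_length(candidate, nets)
        if best is None or cost < best[0]:
            best = (cost, candidate)
    return best


# Hypothetical design: an I/O block feeding b0, which feeds b1.
blocks = ["io", "b0", "b1"]
sites = [(0, 0), (1, 0), (2, 0)]
nets = [("io", "b0"), ("b0", "b1")]
cost, placement = best_placement(blocks, sites, nets)
print(cost, placement)
```

Because b0 is connected to both other blocks, the cheapest placement puts it on the middle site.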
Fig 5.7 FPGA Place and route
Device Programming:
Now the design must be loaded onto the FPGA, but first it must be converted to a format
that the FPGA can accept. The BITGEN program handles this conversion: the routed NCD file is
given to BITGEN to generate a bitstream (a .BIT file), which can be used to configure the
target FPGA device. This can be done using a cable. Selection of the cable depends on the
design.
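The steps from synthesis to device programming form a chain in which each tool consumes the file produced by the previous one. The sketch below simply records that chain as data, using the tool and file names given in the text; the table layout and the helper names are assumptions for illustration and do not invoke any Xilinx tool.

```python
# Each stage: (step, tool named in the text, input, output file type).
FLOW = [
    ("Synthesis", "XST", "HDL source", "NGC"),
    ("Translate", "NGDBuild", "NGC + UCF", "NGD"),
    ("Map", "MAP", "NGD", "NCD"),
    ("Place and Route", "PAR", "NCD", "routed NCD"),
    ("Device Programming", "BITGEN", "routed NCD", "BIT"),
]


def describe(flow):
    """Return one human-readable line per stage."""
    return [f"{step}: {tool} reads {src} and writes {dst}"
            for step, tool, src, dst in flow]


def is_chained(flow):
    """Check that every stage's output is part of the next stage's input."""
    return all(flow[i][3] in flow[i + 1][2] for i in range(len(flow) - 1))


for line in describe(FLOW):
    print(line)
```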
Behavioral Simulation (RTL Simulation):
This is the first of the simulation steps encountered in the design flow. It is performed
before the synthesis process to verify the RTL (behavioral) code and to confirm that the design
functions as intended. Behavioral simulation can be performed on either VHDL or Verilog
designs. In this process, signals and variables are observed, procedures and functions are
traced, and breakpoints are set. This simulation is very fast, so the designer can change the
HDL code within a short time if the required functionality is not met. Since the design is not
yet synthesized to gate level, timing and resource usage properties are still unknown.
5.4 Introduction to Hardware Description Language
Classical design methods relied on schematics and manual methods to design a circuit,
but today computer-based languages are widely used to design circuits of enormous size and
complexity. There are several reasons for this shift in practice. No team of engineers can
correctly design and manage, by manual methods, the details of state-of-the-art integrated
circuits (ICs) containing several million gates, but using hardware description languages (HDLs)
designers easily manage the complexity of large designs. Even small designs rely on language-
based descriptions, because designers have to quickly produce correct designs targeted for an
ever-shrinking window of opportunity in the marketplace.
Language-based designs are portable and independent of technology, allowing design
teams to modify and re-use designs to keep pace with improvements in technology. As physical
dimensions of devices shrink, denser circuits with better performance can be synthesized from an
original HDL-based model. HDLs are a convenient medium for integrating intellectual property
(IP) from a variety of sources with a proprietary design. By relying on a common design
language, models can be integrated for testing and synthesized separately or together, with a net
reduction in time for the design cycle. Some simulators also support mixed descriptions based on
multiple languages.
The most significant gain that results from the use of an HDL is that a working circuit
can be synthesized automatically from a language-based description, bypassing the laborious
steps that characterize manual design methods (e.g., logic minimization with Karnaugh maps).
HDL-based synthesis is now the dominant design paradigm used by industry.
Today, designers build a software prototype/model of the design, verify its
functionality, and then use a synthesis tool to automatically optimize the circuit and create a
netlist in a physical technology.
HDLs and synthesis tools focus an engineer's attention on functionality rather than on
individual transistors or gates; they synthesize a circuit that will realize the desired functionality,
and satisfy area and/or performance constraints. Moreover, alternative architectures can be
generated from a single HDL model and evaluated quickly to perform design tradeoffs.
Functional models are also referred to as behavioral models.
HDLs serve as a platform for several tools: design entry, design verification, test
generation, fault analysis and simulation, timing analysis and/or verification, synthesis, and
automatic generation of schematics. This breadth of use improves the efficiency of the design
flow by eliminating translations of design descriptions as the design moves through the tool
chain.
Two languages enjoy widespread industry support: Verilog and VHDL. Both languages
are IEEE (Institute of Electrical and Electronics Engineers) standards; both are supported by
synthesis tools for ASICs (application-specific integrated circuits) and FPGAs (field-
programmable gate arrays). Languages for analog circuit design, such as Spice, play an
important role in verifying critical timing paths of a circuit, but these languages impose a
prohibitive computational burden on large designs, cannot support abstract styles of design, and
become impractical when used on a large scale. Hybrid languages (e.g., Verilog-A) are used in
designing mixed-signal circuits, which have both digital and analog circuitry. System-level
design languages, such as SystemC and Superlog, are now emerging to support a higher level of
design abstraction than can be supported by Verilog or VHDL.
Difference between HDL and other software languages:
• The main difference from traditional programming languages is that an HDL represents
extensive parallel operations, whereas traditional languages represent mostly serial
operations.
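The difference between serial and parallel semantics can be seen in a small software model. In the sketch below, the first function runs two assignments one after another, as ordinary software does; the second samples both right-hand sides first and then updates both targets together, which is how two registers clocked by the same edge behave. The function names are hypothetical, and the model is a simplification of real HDL simulation semantics.

```python
def serial_swap_attempt(a, b):
    """Statements run one after another, as in ordinary software."""
    a = b  # a is overwritten first ...
    b = a  # ... so b receives the new a, not the old one
    return a, b


def concurrent_update(a, b):
    """All right-hand sides are sampled first, then all targets update together."""
    next_a, next_b = b, a
    return next_a, next_b


print(serial_swap_attempt(1, 0))
print(concurrent_update(1, 0))
```

The serial version loses the original value of a, while the concurrent version exchanges the two values.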
Importance of HDLs:
HDLs have many advantages compared to traditional schematic-based design:
• Designs can be described at a very abstract level by use of HDLs. Designers can write
their RTL description without choosing a specific fabrication technology. Logic synthesis
tools can automatically convert the design to any fabrication technology. If a new
technology emerges, designers do not need to redesign their circuit. They simply input
the RTL description to the logic synthesis tool and create a new gate-level netlist, using
the new fabrication technology. The logic synthesis tool will optimize the circuit in area
and timing for the new technology.
• By describing designs in HDLs, functional verification of the design can be done early in
the design cycle. Since designers work at the RTL level, they can optimize and modify
the RTL description until it meets the desired functionality. Most design bugs are
eliminated at this point. This cuts down design cycle time significantly because the
probability of hitting a functional bug at a later time in the gate-level netlist or physical
layout is minimized.
• Designing with HDLs is analogous to computer programming. A textual description with
comments is an easier way to develop and debug circuits. This also provides a concise
representation of the design, compared to gate-level schematics. Gate-level schematics
are almost incomprehensible for very complex designs.
Importance of Computer-Aided Digital Design:
The earliest digital circuits were designed with vacuum tubes and transistors. Integrated
circuits were then invented where logic gates were placed on a single chip. The first integrated
circuit (IC) chips were SSI (Small Scale Integration) chips where the gate count was very small.
As technologies became sophisticated, designers were able to place circuits with hundreds of
gates on a chip. These chips were called MSI (Medium Scale Integration) chips. With the advent
of LSI (Large Scale Integration), designers could put thousands of gates on a single chip. At this
point, design processes started getting very complicated, and designers felt the need to automate
these processes. Electronic Design Automation (EDA) techniques began to evolve. Chip
designers began to use circuit and logic simulation techniques to verify the functionality of
building blocks of the order of about 100 transistors. The circuits were still tested on the
breadboard, and the layout was done on paper or by hand on a graphic computer terminal.
With the advent of VLSI (Very Large Scale Integration) technology, designers could
design single chips with more than 100,000 transistors. Because of the complexity of these
circuits, it was not possible to verify these circuits on a breadboard. Computer-aided techniques
became critical for verification and design of VLSI digital circuits. Computer programs to do
automatic placement and routing of circuit layouts also became popular. The designers were now
building gate-level digital circuits manually on graphic terminals. They would build small
building blocks and then derive higher-level blocks from them. This process would continue
until they had built the top-level block. Logic simulators came into existence to verify the
functionality of these circuits before they were fabricated on chip.
What is a gate-level netlist:
A gate-level netlist is a description of the circuit in terms of gates and connections between them.
Logic synthesis tools convert the RTL description to a gate-level netlist.
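A gate-level netlist can be pictured as a table of gates and the nets that connect them. The sketch below stores such a table in a dictionary and evaluates an output net from the primary inputs. The net names, the three gate types, and the dictionary layout are assumptions for illustration; real netlist formats carry much more information.

```python
# A hypothetical netlist: each entry maps an output net to (gate type, input nets).
NETLIST = {
    "n1": ("AND", ["a", "b"]),
    "y": ("OR", ["n1", "c"]),
}

GATES = {
    "AND": lambda ins: int(all(ins)),
    "OR": lambda ins: int(any(ins)),
    "NOT": lambda ins: int(not ins[0]),
}


def evaluate(netlist, inputs, net):
    """Recursively compute the value of a net from the primary inputs."""
    if net in inputs:
        return inputs[net]
    gate, in_nets = netlist[net]
    return GATES[gate]([evaluate(netlist, inputs, n) for n in in_nets])


print(evaluate(NETLIST, {"a": 1, "b": 1, "c": 0}, "y"))
```

Here the output y is (a AND b) OR c, expressed purely as gates and connections.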
Problems associated with the conventional approach to digital design:
Digital ICs of SSI and MSI types have become universally standardized and have been
accepted for use. Whenever a designer has to realize a digital function, he uses a standard set of
ICs along with a minimal set of additional discrete circuitry.
Consider a simple example of realizing the function $Q_{n+1} = Q_n + (A \cdot B)$.
Here $Q_n$, $A$, and $B$ are Boolean variables, with $Q_n$ being the value of $Q$ at the nth
time step. $A \cdot B$ signifies the logical AND of $A$ and $B$; the '+' symbol signifies the
logical OR of the logic variables on either side. A circuit to realize the function is shown in
Fig. 5.8. The circuit can be realized in terms of two ICs: an A-O-I gate and a flip-flop. It can
be directly wired up, tested, and used.
Fig. 5.8 A simple digital circuit
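The behavior of this circuit can be checked in software before anything is wired up. The sketch below models one clock step of the function, with the OR and AND written as bitwise operators on 0/1 values; the function names and the stimulus sequence are assumptions chosen for illustration.

```python
def next_q(q, a, b):
    """One clock step: Q(n+1) = Q(n) OR (A AND B), with 0/1 values."""
    return q | (a & b)


def simulate(q0, stimulus):
    """Apply a sequence of (A, B) pairs and record Q after each clock step."""
    q, trace = q0, []
    for a, b in stimulus:
        q = next_q(q, a, b)
        trace.append(q)
    return trace


print(simulate(0, [(0, 1), (1, 0), (1, 1), (0, 0)]))
```

Q stays at 0 until A and B are both 1, and then remains at 1 because of the feedback through the OR term.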
The accepted approach to digital design here is a mix of the top-down and bottom-up
approaches as follows:
1. Decide the requirements at the system level and translate them to circuit requirements.
2. Identify the major functional blocks required like timer, DMA unit, register file etc., say
as in the design of a processor.
3. Whenever a function can be realized using a standard IC, use the same –for example
programmable counter, mux, demux, etc.
4. Whenever the above is not possible, form the circuit to carry out the block functions
using standard SSI – for example gates, flip-flops, etc.
5. Use additional components like transistor, diode, resistor, capacitor, etc., wherever
essential.
Once the above steps are gone through, a paper design is ready. Starting with the paper
design, one has to do a circuit layout. The physical location of all the components is
tentatively decided; they are interconnected and the 'circuit-on-paper' is made ready. Once a
paper design is done, a layout is carried out and a net-list prepared. Based on this, the PCB is
fabricated and populated, and all the populated cards are tested and debugged. The procedure is
shown as a process flowchart in Fig. 5.9.
Fig.5.9 Sequence of steps in conventional electronic circuit design.
At the debugging stage one may encounter three types of problems:
1. Functional mismatch: The realized and expected functions are different. One may have
to go through the relevant functional block carefully and locate any error logically.
Finally the necessary correction has to be carried out in hardware.
2. Timing mismatch: The problem can manifest in different forms. One possibility is due to
the signal going through different propagation delays in two paths and arriving at a point
with a timing mismatch. This can cause faulty operation. Another possibility is a race
condition in a circuit involving asynchronous feedback. This kind of problem may call
for elaborate debugging. The preferred practice is to debug at smaller module stages and to
ensure that feedback through larger loops is avoided; it becomes essential to check for the
existence of long asynchronous loops.
3. Overload: Some signals may be overloaded to such an extent that the signal transition
may be unduly delayed or even suppressed. The problem manifests as reflections and
erratic behavior in some cases (The signal has to be suitably buffered here.). In fact,
overload on a signal can lead to timing mismatches.
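The timing mismatch described above, where one signal travels along two paths with different propagation delays, can be estimated with simple arithmetic. The sketch below adds up per-stage delays along each path and compares the arrival times against a tolerance. The delay values, the tolerance, and the function names are hypothetical; integer nanoseconds are used to keep the arithmetic exact.

```python
def arrival_time(path_delays_ns):
    """Arrival time at the end of a path is the sum of its stage delays."""
    return sum(path_delays_ns)


def skew_ns(path_a, path_b):
    """Difference between the arrival times of two paths."""
    return abs(arrival_time(path_a) - arrival_time(path_b))


def has_timing_mismatch(path_a, path_b, tolerance_ns):
    """True when the two paths arrive further apart than the tolerance allows."""
    return skew_ns(path_a, path_b) > tolerance_ns


# Hypothetical example: a two-gate path against a single-gate path.
print(skew_ns([3, 4], [2]))
print(has_timing_mismatch([3, 4], [2], tolerance_ns=4))
```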
The above debugging has to be carried out after the prototype PCB has been manufactured;
it involves cost, time, and a redesign process to arrive at a bug-free design.
Logic simulation and synthesis:
• There are two applications of HDL processing: simulation and synthesis.
Simulation: Simulation is used to verify the functionality of the circuit.
A) Functional simulation: study of the circuit's operation independent of timing parameters and
gate delays.
B) Timing simulation: study including estimated delays; verifies that setup, hold, and other
timing requirements of devices such as flip-flops are met.
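A setup and hold check of the kind performed in timing simulation can be sketched as a window test around the clock edge: the data input must not change from the setup time before the edge until the hold time after it. The parameter names and the picosecond values below are assumptions for illustration, not figures from any device data sheet.

```python
def meets_setup_hold(data_change_ps, clock_edge_ps, setup_ps, hold_ps):
    """True when the data change falls outside the forbidden window
    (clock_edge - setup, clock_edge + hold)."""
    window_start = clock_edge_ps - setup_ps
    window_end = clock_edge_ps + hold_ps
    return not (window_start < data_change_ps < window_end)


# Hypothetical flip-flop: 200 ps setup, 100 ps hold, clock edge at 1000 ps.
print(meets_setup_hold(700, 1000, 200, 100))
print(meets_setup_hold(900, 1000, 200, 100))
```

A change at 700 ps is early enough to be captured safely, while a change at 900 ps violates the setup requirement.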
Synthesis:
Synthesis is one of the foremost back-end steps. It converts a VHDL or Verilog description
into a set of primitives (equations, as in a CPLD) or components (as in FPGAs) that fit the
target technology. In essence, synthesis tools convert the design description into equations or
components.
CHAPTER 6
RESULT ANALYSIS
6.1. Simulation Results of VMFU:
6.1.1 Partial products Generators:
Fig 6.1 Simulation result of Partial products Generators
6.1.2 Booth Encoder:
Fig 6.2 Simulation result of Booth Encoder
6.1.3 Carry-Save Adder:
Fig 6.3 Simulation result of Carry-save Adder
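The principle of a carry-save adder can be stated compactly: three operands are reduced to a sum word and a carry word without propagating carries, and a final ordinary addition of the two words gives the total. The sketch below is a generic software model of that 3:2 compression on non-negative integers; it is an assumption-based illustration and not the adder design simulated above.

```python
def carry_save_add(a, b, c):
    """3:2 compression: reduce three operands to a sum word and a carry word."""
    partial_sum = a ^ b ^ c                     # bitwise sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1  # majority bits, shifted left
    return partial_sum, carry


s, cy = carry_save_add(5, 3, 6)
print(s, cy, s + cy)
```

Adding the two result words gives the same value as adding the three operands directly.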
6.1.4 Versatile Multimedia Functional Unit:
Fig 6.4 Simulation result of Versatile Multimedia Functional Unit
6.2 Synthesis Result
The developed MAC design is simulated and its functionality is verified. Once functional
verification is done, the RTL model is taken through the synthesis process using the Xilinx ISE
tool. In the synthesis process, the RTL model is converted to a gate-level netlist mapped to a
specific technology library. This MAC design can be synthesized on the Spartan 3E family.
Many different devices of the Spartan 3E family are available in the Xilinx ISE tool. To
synthesize this design, the device “XC3S500E” has been chosen, with the package “FG320” and the
speed grade “-4”. The MAC design is synthesized and its results are analyzed as follows.
Device utilization summary:
The device utilization summary includes the following:
• Logic Utilization
• Logic Distribution
• Total Gate Count for the Design
The device utilization summary shown above gives the number of resources used out of those
available on the device, also expressed as a percentage. It is the result of the synthesis
process for the chosen device and package.
Timing Summary:
Speed Grade: -4
Minimum period: 35.100ns (Maximum Frequency: 28.490MHz)
Minimum input arrival time before clock: 23.605ns
Maximum output required time after clock: 4.283ns
Maximum combinational path delay: No path found
The period and frequency shown in the timing summary are approximate at the synthesis stage;
the exact timing summary is obtained after place and route is complete. The maximum operating
frequency of this synthesized design is 28.490 MHz and the minimum period is 35.100 ns. Here,
OFFSET IN is the minimum input arrival time before clock and OFFSET OUT is the maximum output
required time after clock.
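The two headline figures in the timing summary are related by a simple reciprocal: the maximum frequency is one divided by the minimum period. The short check below reproduces the reported frequency from the reported period; the function name is a hypothetical helper.

```python
def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a given minimum period in ns."""
    return 1000.0 / min_period_ns


print(f"{max_frequency_mhz(35.100):.3f} MHz")
```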
RTL Schematic
The RTL (Register Transfer Level) schematic can be viewed as a black box after the design
has been synthesized. It shows the inputs and outputs of the system. Double-clicking on the
diagram reveals the gates, flip-flops, and MUXes.
Figure 6.5 Schematic with Basic Inputs and Output (figure labels: INPUTS, OUTPUTS)
Figure 6.6 Schematic of Booth Encoder with SPST Adder
6.3 Summary
• The developed VMFU design is modelled and simulated using the ModelSim tool.
• The simulation results are discussed by considering different cases.
• The RTL model is synthesized using the Xilinx tool for Spartan 3E, and the synthesis results
are discussed with the help of the generated reports.
CHAPTER 7
CONCLUSION
In this project, a versatile multimedia functional unit (VMFU) is designed using a low-power
technique called SPST. It provides 16x16 multiply-accumulate (MAC), addition, subtraction, sum
of absolute differences, and interpolation. A Radix-2 Modified Booth multiplier circuit is used
for the MAC architecture. Compared to other circuits, the Booth multiplier has the highest
operational speed and a lower hardware count. The basic building blocks of the VMFU are
identified, and each block is analyzed for its performance. Power and delay are calculated for
the blocks. The MAC unit is designed with an enable input to reduce the total power
consumption, based on the block-enable technique. Using this block, the N-bit MAC unit is
constructed and its total power consumption is calculated.
This project presents a low-power technique called SPST and explores its applications in
multimedia/DSP computations, where the theoretical analysis and the realization issues of the
SPST are fully discussed. The proposed SPST can clearly decrease the switching (or dynamic)
power dissipation, which comprises a significant portion of the whole power dissipation in
integrated circuits. In addition, the proposed SPST achieves a 24% saving in power consumption
at the expense of only a 10% area overhead for the proposed VMFU.
FUTURE SCOPE
BIBLIOGRAPHY
[1] T. Stockhammer, M. Hannuksela, and T. Wiegand, “H.264/AVC in wireless environments,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 657–673, Jul. 2003.
[2] R. Schafer, T. Wiegand, and H. Schwarz, “The emerging H.264/AVC standard,” EBU Technical Review, Jan. 2003 [Online]. Available: http://www.ebu.ch/trev_293-schaefer.pdf
[3] A. Bellaouar and M. I. Elmasry, Low-Power Digital VLSI Design: Circuits and Systems. Norwell, MA: Kluwer, 1995.
[4] A. P. Chandrakasan and R. W. Brodersen, “Minimizing power consumption in digital CMOS circuits,” Proc. IEEE, vol. 83, no. 4, pp. 498–523, Apr. 1995.
[5] K. K. Parhi, “Approaches to low-power implementations of DSP systems,” IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 48, no. 10, pp. 1214–1224, Oct. 2001.
[6] K. Choi, R. Soma, and M. Pedram, “Dynamic voltage and frequency scaling based on workload decomposition,” in Proc. IEEE Int. Symp. Low Power Electron. Des., 2004, pp. 174
[7] J. Choi, J. Jeon, and K. Choi, “Power minimization of functional units by partially guarded computation,” in Proc. IEEE Int. Symp. Low Power Electron. Des., 2000, pp. 131–136.
[8] O. Chen, R. Sheen, and S. Wang, “A low-power adder operating on effective dynamic data ranges,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 4, pp. 435–453, Aug. 2002.
[9] O. Chen, S. Wang, and Y. W. Wu, “Minimization of switching activities of partial products for designing low-power multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 3, pp. 418–433, Jun. 2003.
[10] L. Benini, G. D. Micheli, A. Macii, E. Macii, M. Poncino, and R. Scarsi, “Glitch power minimization by selective gate freezing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 3, pp. 287–298, Jun. 2000.
[11] S. Henzler, G. Georgakos, J. Berthold, and D. Schmitt-Landsiedel, “Fast power-efficient circuit-block switch-off scheme,” Electron. Lett., vol. 40, no. 2, pp. 103–104, Jan. 2004.
[12] T. Xanthopoulos and A. P. Chandrakasan, “A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization,” IEEE J. Solid-State Circuits, vol. 35, no. 5, pp. 740–750, May 2000.
[13] K. H. Chen, J. I. Guo, and J. S. Wang, “A high-performance direct 2-D transform coding IP design for MPEG-4 AVC/H.264,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 4, pp. 472–483, Apr. 2006.
[14] K. H. Chen, K. C. Chao, J. I. Guo, J. S. Wang, and Y. S. Chu, “Design exploration of a spurious power suppression technique (SPST) and its applications,” in Proc. IEEE Asian Solid-State Circuits Conf., Hsinchu, Taiwan, Nov. 2005, pp. 341–344.
[15] K. H. Chen, Y. M. Chen, and Y. S. Chu, “A versatile multimedia functional unit design using the spurious power suppression technique,” in Proc. IEEE Asian Solid-State Circuits Conf., Hangzhou, China, Nov. 2006, pp. 111–114.
[16] H. H. Chang, S. H. Sun, and S. I. Liu, “A low-jitter and precise multiphase delay-locked loop using shifted averaging VCDL,” in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2003, vol. 1, pp. 434–505.
[17] Y. J. Jung, S. W. Lee, D. Shim, W. Kim, C. Kim, and S. I. Cho, “A dual-loop delay-locked loop using multiple voltage-controlled delay lines,” IEEE J. Solid-State Circuits, vol. 36, no. 5, pp. 784–791, May 2001.

More Related Content

What's hot

A Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth MultiplierA Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth MultiplierIJERA Editor
 
Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...
Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...
Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...IJERA Editor
 
Low Power and Area Efficient Multiplier Layout using Transmission Gate
Low Power and Area Efficient Multiplier Layout using Transmission GateLow Power and Area Efficient Multiplier Layout using Transmission Gate
Low Power and Area Efficient Multiplier Layout using Transmission GateIJEEE
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467IJRAT
 
Design and Implementation of Low Power DSP Core with Programmable Truncated V...
Design and Implementation of Low Power DSP Core with Programmable Truncated V...Design and Implementation of Low Power DSP Core with Programmable Truncated V...
Design and Implementation of Low Power DSP Core with Programmable Truncated V...ijsrd.com
 
Review On Design Of Low Power Multiply And Accumulate Unit Using Baugh-Wooley...
Review On Design Of Low Power Multiply And Accumulate Unit Using Baugh-Wooley...Review On Design Of Low Power Multiply And Accumulate Unit Using Baugh-Wooley...
Review On Design Of Low Power Multiply And Accumulate Unit Using Baugh-Wooley...IRJET Journal
 
A METHODOLOGY FOR IMPROVEMENT OF ROBA MULTIPLIER FOR ELECTRONIC APPLICATIONS
A METHODOLOGY FOR IMPROVEMENT OF ROBA MULTIPLIER FOR ELECTRONIC APPLICATIONSA METHODOLOGY FOR IMPROVEMENT OF ROBA MULTIPLIER FOR ELECTRONIC APPLICATIONS
A METHODOLOGY FOR IMPROVEMENT OF ROBA MULTIPLIER FOR ELECTRONIC APPLICATIONSVLSICS Design
 
Implementation of Radix-4 Booth Multiplier by VHDL
Implementation of Radix-4 Booth Multiplier by VHDLImplementation of Radix-4 Booth Multiplier by VHDL
Implementation of Radix-4 Booth Multiplier by VHDLpaperpublications3
 
JOURNAL PAPER
JOURNAL PAPERJOURNAL PAPER
JOURNAL PAPERRaj kumar
 
FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Res...
FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Res...FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Res...
FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Res...AishwaryaRavishankar8
 
VLSI Implementation of High Speed & Low Power Multiplier in FPGA
VLSI Implementation of High Speed & Low Power Multiplier in FPGAVLSI Implementation of High Speed & Low Power Multiplier in FPGA
VLSI Implementation of High Speed & Low Power Multiplier in FPGAIOSR Journals
 
Compressor based approximate multiplier architectures for media processing ap...
Compressor based approximate multiplier architectures for media processing ap...Compressor based approximate multiplier architectures for media processing ap...
Compressor based approximate multiplier architectures for media processing ap...IJECEIAES
 
High Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing ApplicationsHigh Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing ApplicationsIOSR Journals
 
IRJET- Low Power Adder and Multiplier Circuits Design Optimization in VLSI
IRJET- Low Power Adder and Multiplier Circuits Design Optimization in VLSIIRJET- Low Power Adder and Multiplier Circuits Design Optimization in VLSI
IRJET- Low Power Adder and Multiplier Circuits Design Optimization in VLSIIRJET Journal
 

What's hot (17)

A Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth MultiplierA Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth Multiplier
 
Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...
Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...
Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...
 
Low Power and Area Efficient Multiplier Layout using Transmission Gate
Low Power and Area Efficient Multiplier Layout using Transmission GateLow Power and Area Efficient Multiplier Layout using Transmission Gate
Low Power and Area Efficient Multiplier Layout using Transmission Gate
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467
 
Design and Implementation of Low Power DSP Core with Programmable Truncated V...
Design and Implementation of Low Power DSP Core with Programmable Truncated V...Design and Implementation of Low Power DSP Core with Programmable Truncated V...
Design and Implementation of Low Power DSP Core with Programmable Truncated V...
 
Review On Design Of Low Power Multiply And Accumulate Unit Using Baugh-Wooley...
Review On Design Of Low Power Multiply And Accumulate Unit Using Baugh-Wooley...Review On Design Of Low Power Multiply And Accumulate Unit Using Baugh-Wooley...
Review On Design Of Low Power Multiply And Accumulate Unit Using Baugh-Wooley...
 
A METHODOLOGY FOR IMPROVEMENT OF ROBA MULTIPLIER FOR ELECTRONIC APPLICATIONS
A METHODOLOGY FOR IMPROVEMENT OF ROBA MULTIPLIER FOR ELECTRONIC APPLICATIONSA METHODOLOGY FOR IMPROVEMENT OF ROBA MULTIPLIER FOR ELECTRONIC APPLICATIONS
A METHODOLOGY FOR IMPROVEMENT OF ROBA MULTIPLIER FOR ELECTRONIC APPLICATIONS
 
F1074145
F1074145F1074145
F1074145
 
Implementation of Radix-4 Booth Multiplier by VHDL
Implementation of Radix-4 Booth Multiplier by VHDLImplementation of Radix-4 Booth Multiplier by VHDL
Implementation of Radix-4 Booth Multiplier by VHDL
 
JOURNAL PAPER
JOURNAL PAPERJOURNAL PAPER
JOURNAL PAPER
 
FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Res...
FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Res...FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Res...
FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Res...
 
Bn26425431
Bn26425431Bn26425431
Bn26425431
 
VLSI Implementation of High Speed & Low Power Multiplier in FPGA
VLSI Implementation of High Speed & Low Power Multiplier in FPGAVLSI Implementation of High Speed & Low Power Multiplier in FPGA
VLSI Implementation of High Speed & Low Power Multiplier in FPGA
 
Compressor based approximate multiplier architectures for media processing ap...
Compressor based approximate multiplier architectures for media processing ap...Compressor based approximate multiplier architectures for media processing ap...
Compressor based approximate multiplier architectures for media processing ap...
 
High Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing ApplicationsHigh Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing Applications
 
Radix8
Radix8Radix8
Radix8
 
IRJET- Low Power Adder and Multiplier Circuits Design Optimization in VLSI
IRJET- Low Power Adder and Multiplier Circuits Design Optimization in VLSIIRJET- Low Power Adder and Multiplier Circuits Design Optimization in VLSI
IRJET- Low Power Adder and Multiplier Circuits Design Optimization in VLSI
 

222083242 full-documg

  • 1. LOW POWER VLSI DESIGN FOR MULTIMEDIA APPLICATIONS USING SPS TECHNIQUE NCET Page 1 CHAPTER 1 INTRODUCTION 1. Introduction: Power dissipation is recognized as a critical parameter in modern VLSI design. To keep pace with Moore's law and to produce consumer electronics with longer battery backup and less weight, low-power VLSI design is necessary. Fast multipliers are essential parts of digital signal processing systems. The speed of the multiply operation is of great importance in digital signal processing as well as in the general
  • 2. purpose processors today, especially since media processing took off. In the past, multiplication was generally implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times that it is added is the multiplier, and the result is the product. Each step of addition generates a partial product. In most computers, the operands usually contain the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of the operands in order to preserve the information content. The repeated-addition method suggested by the arithmetic definition is so slow that it is almost always replaced by an algorithm that makes use of positional representation. It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them. The basic multiplication principle is twofold: evaluation of partial products and accumulation of the shifted partial products. It is performed by successive additions of the columns of the shifted partial product matrix. The multiplier is successively shifted and gates the appropriate bit of the multiplicand. The delayed, gated instances of the multiplicand must all be in the same column of the shifted partial product matrix. They are then added to form the product bit for that column. Multiplication is therefore a multi-operand operation. To extend multiplication to both signed and unsigned numbers, a convenient number system is the two's complement format. Multipliers are key components of many high-performance systems such as FIR filters, microprocessors, and digital signal processors.
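The contrast between repeated addition and the positional (shift-and-add) method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the addition counters are hypothetical and do not come from the report, and the operands are assumed to be unsigned.

```python
def multiply_repeated_addition(multiplicand, multiplier):
    """Add the multiplicand `multiplier` times; return (product, additions)."""
    product = 0
    additions = 0
    for _ in range(multiplier):
        product += multiplicand
        additions += 1
    return product, additions


def multiply_shift_add(multiplicand, multiplier, n_bits):
    """Positional method: one gated, shifted partial product per multiplier bit."""
    product = 0
    additions = 0
    for i in range(n_bits):
        bit = (multiplier >> i) & 1
        # The multiplier bit gates the multiplicand; the shift gives it its weight.
        partial_product = (multiplicand << i) if bit else 0
        product += partial_product
        additions += 1
    return product, additions
```

For 6-bit operands the positional method always needs 6 additions, whereas repeated addition needs as many additions as the value of the multiplier.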
A system's performance is generally determined by the performance of the multiplier because the multiplier is generally the slowest element in the system. Furthermore, it is generally the most area consuming. Hence, optimizing the speed and area of the multiplier is a major design issue. However, area and speed are usually conflicting constraints, so improving speed mostly results in larger area. As a result, a whole spectrum of multipliers with different area-speed constraints has been designed, with fully parallel multipliers at one end and fully serial multipliers at the other. In between are digit-serial multipliers, where single digits consisting of several bits are operated on. These multipliers have moderate performance in both speed and area. However, existing digit-serial multipliers have been plagued by complicated switching systems and/or irregularities in design. Radix-2^n multipliers, which operate on digits in a parallel fashion
  • 3. instead of bits, bring the pipelining to the digit level and avoid most of the above problems. They were introduced by M. K. Ibrahim in 1993. These structures are iterative and modular. Pipelining at the digit level brings the benefit of constant operation speed irrespective of the size of the multiplier. The clock speed is determined only by the digit size, which is fixed before the design is implemented. The growing market for fast floating-point co-processors, digital signal processing chips, and graphics processors has created a demand for high-speed, area-efficient multipliers. Current architectures range from small, low-performance shift-and-add multipliers to large, high-performance array and tree multipliers. Conventional linear array multipliers achieve high performance in a regular structure, but require large amounts of silicon. Tree structures achieve even higher performance than linear arrays, but the tree interconnection is more complex and less regular, making them even larger than linear arrays. Ideally, one would want the speed benefits of a tree structure, the regularity of an array multiplier, and the small size of a shift-and-add multiplier. To reduce the size of the multiplier, a partial tree is used together with a 4-2 carry-save accumulator placed at its outputs to iteratively accumulate the partial products. This allows a full multiplier to be built in a fraction of the area required by a full array. Higher performance is achieved by increasing the hardware utilization of the partial 4-2 tree through pipelining. To ensure optimal performance of the pipelined 4-2 tree, the clock frequency must be tightly controlled to match the delay of the 4-2 adder pipe stages. Figure 2.2 Minimal Iterative Structures
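The idea of iteratively accumulating partial products in carry-save form can be sketched as below. This is a behavioural model under simplifying assumptions, not the structure of Figure 2.2: it assumes unsigned operands, feeds two partial products per iteration into a 4-2 compressor built from two 3:2 carry-save stages, and uses hypothetical function names.

```python
def csa(a, b, c):
    """Word-level 3:2 carry-save adder: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def compress_4_2(a, b, c, d):
    """4:2 compression from two 3:2 stages: a + b + c + d == sum_word + carry_word."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)


def iterative_multiply(y, x, n_bits):
    """Accumulate two partial products per iteration in carry-save form."""
    acc_sum, acc_carry = 0, 0
    for i in range(0, n_bits, 2):
        pp0 = (y << i) if (x >> i) & 1 else 0
        pp1 = (y << (i + 1)) if (x >> (i + 1)) & 1 else 0
        # The accumulator outputs are fed back together with the new partial products.
        acc_sum, acc_carry = compress_4_2(acc_sum, acc_carry, pp0, pp1)
    # A single carry-propagate addition converts the redundant result to binary.
    return acc_sum + acc_carry
```

No carry propagates inside the loop; the only carry-propagate addition is the final one, which is what keeps the per-iteration delay independent of the operand width.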
  • 4. In an attempt to increase the performance of the minimal iterative structure, additional rows of CSAs could be added to make a bigger array. For example, the addition of one row of CSAs to the minimal structure would yield a partial array with two rows of CSAs. This structure provides two advantages over the single row of CSA cells: 1) it reduces the required clock frequency, and 2) it requires only half as many latch delays. It is important to note that although the number of CSAs has been doubled, the latency was reduced only by halving the number of latch delays. The number of CSA delays remains the same. Thus, assuming the latch delays are small relative to the CSA delays, increasing the depth of the partial array by adding rows of CSAs in a linear structure yields only a slight increase in performance. 2.4 Multiplication of Unsigned and Signed Numbers: Multiplication is less common than addition, but is still essential for microprocessors, digital signal processors, and graphics engines. The most basic form of multiplication consists of forming the product of two unsigned (positive) binary numbers. This can be accomplished through the traditional technique taught in primary school, simplified to base 2. For example, the multiplication of two positive 6-bit binary integers, 25₁₀ and 39₁₀, proceeds as shown in Figure 2.3. M × N-bit multiplication P = Y × X can be viewed as forming N partial products of M bits each, and then summing the appropriately shifted partial products to produce an M + N-bit result P. Binary multiplication is equivalent to a logical AND operation. Therefore, generating partial products consists of the logical ANDing of the appropriate bits of the multiplier and multiplicand. Each column of partial products must then be added and, if necessary, any carry values passed to the next column.
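The partial-product view described above can be sketched in Python for the 25 × 39 example. The function names and the LSB-first bit-list representation are assumptions made for illustration only.

```python
def partial_products(y, x, m_bits, n_bits):
    """Return N rows; row i holds the M bits (y_j AND x_i), least significant bit first."""
    return [[((y >> j) & 1) & ((x >> i) & 1) for j in range(m_bits)]
            for i in range(n_bits)]


def sum_columns(rows, m_bits, n_bits):
    """Add each column of the shifted partial products, passing carries to the next column."""
    product_bits = []
    carry = 0
    for col in range(m_bits + n_bits):
        total = carry
        for i, row in enumerate(rows):
            j = col - i  # row i is shifted left by i positions
            if 0 <= j < m_bits:
                total += row[j]
        product_bits.append(total & 1)  # bit that stays in this column
        carry = total >> 1              # everything else moves to the next column
    return sum(bit << k for k, bit in enumerate(product_bits))
```

Calling `sum_columns(partial_products(25, 39, 6, 6), 6, 6)` forms six rows of six AND terms and sums them into a 12-bit result.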
We denote the multiplicand as Y = {y_{M–1}, y_{M–2}, …, y_1, y_0} and the multiplier as X = {x_{N–1}, x_{N–2}, …, x_1, x_0}. For unsigned multiplication, the product is given in EQ (2.1). Figure 2.4 illustrates the generation, shifting, and summing of partial products in a 6 × 6-bit multiplier.
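With the notation above, the unsigned product can be written as follows. This is a sketch of the standard expansion, assuming EQ (2.1) takes the usual double-sum form; the index names i and j are assumptions.

```latex
P = \left( \sum_{j=0}^{M-1} y_j 2^{j} \right)
    \left( \sum_{i=0}^{N-1} x_i 2^{i} \right)
  = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} x_i \, y_j \, 2^{i+j}
```

Each term x_i y_j is one AND gate output, and the factor 2^{i+j} is the column in which it is summed.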
  • 5. Fig 2.3 Multiplication of two positive 6-bit binary integers Fig 2.4 Generation, shifting, and summing of partial products in a 6 × 6-bit multiplier Large multiplications can be more conveniently illustrated using dot diagrams. Figure 2.5 shows a dot diagram for a simple 16 × 16 multiplier. Each dot represents a placeholder for a single bit that can be a 0 or 1. The partial products are represented by a horizontal boxed row of dots, shifted according to their weight. The multiplier bits used to generate the partial products are shown on the right. There are a number of techniques that can be used to perform multiplication. In general, the choice is based upon factors such as latency, throughput, energy, area, and design complexity. An obvious approach is to use an M + 1-bit carry-propagate adder (CPA) to add the first two partial products, then another CPA to add the third partial product to
  • 6. the running sum, and so forth. Such an approach requires N – 1 CPAs and is slow, even if a fast CPA is employed. More efficient parallel approaches use some sort of array or tree of full adders to sum the partial products. We begin with a simple array for unsigned multipliers, and then modify the array to handle signed two's complement numbers using the Baugh-Wooley algorithm. The number of partial products to sum can be reduced using Booth encoding, and the number of logic levels required to perform the summation can be reduced with Wallace trees. Unfortunately, Wallace trees are complex to lay out and have long, irregular wires, so hybrid array/tree structures may be more attractive. For completeness, we consider a serial multiplier architecture. This was once popular when gates were relatively expensive, but is now rarely necessary. Fig 2.5 Dot diagram for a simple 16 × 16 multiplier 2.4.1 Unsigned Array Multiplication: Fast multipliers use carry-save adders (CSAs) to sum the partial products. A CSA typically has a delay of 1.5–2 FO4 inverters independent of the width of the partial product, while a carry-propagate adder (CPA) tends to have a delay of 4–15+ FO4 inverters depending on the width, architecture, and circuit family. Figure 2.6 shows a 4 × 4 array multiplier for unsigned numbers using an array of CSAs. Each cell contains a 2-input AND gate that forms
  • 7. a partial product and a full adder (CSA) to add the partial product into the running sum. The first row converts the first partial product into carry-save redundant form. Each later row uses the CSA to add the corresponding partial product to the carry-save redundant result of the previous row and generate a carry-save redundant result. The least significant N output bits are available as sum outputs directly from the CSAs. The most significant output bits arrive in carry-save redundant form and require an M-bit carry-propagate adder to convert into regular binary form. In Figure 2.6, the CPA is implemented as a carry-ripple adder. The array is regular in structure and uses a single type of cell, so it is easy to design and lay out. Assuming the carry output is faster than the sum output in a CSA, the critical path through the array is marked on the figure with a dashed line. The array can easily be pipelined by placing registers between rows. In practice, circuits are assigned rectangular blocks in the floorplan, so the parallelogram shape wastes space. Figure 2.7 shows the same array squashed to fit a rectangular block.
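A bit-level behavioural model of the array just described is sketched below. It assumes one AND gate and one full adder per cell, a diagonal sum connection between rows, and a carry-ripple CPA at the bottom; the function names and list-based wiring are hypothetical and are not taken from Figure 2.6.

```python
def full_adder(a, b, cin):
    """One CSA cell: returns (sum, carry_out) for three input bits."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def array_multiply(y, x, m_bits, n_bits):
    """Unsigned M x N array multiplier: N rows of CSA cells, then an M-bit ripple CPA."""
    sums = [0] * (m_bits + 1)  # sums[m_bits] stays 0: there is no cell further left
    carries = [0] * m_bits
    product = 0
    for i in range(n_bits):
        x_i = (x >> i) & 1
        new_sums = [0] * (m_bits + 1)
        new_carries = [0] * m_bits
        for j in range(m_bits):
            pp = ((y >> j) & 1) & x_i  # AND gate forms the partial product bit
            # Sum comes diagonally from cell j+1 of the previous row, carry from cell j.
            new_sums[j], new_carries[j] = full_adder(pp, sums[j + 1], carries[j])
        sums, carries = new_sums, new_carries
        product |= sums[0] << i  # least significant bits leave the array directly
    carry = 0
    for j in range(m_bits):  # ripple CPA resolves the carry-save upper bits
        bit, carry = full_adder(sums[j + 1], carries[j], carry)
        product |= bit << (n_bits + j)
    return product
```

The first row adds its partial product to zeros, mirroring the regular but slightly wasteful structure of the basic array.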
  • 8. Fig 2.6 Array Multiplier A key element of the design is a compact CSA. This not only benefits area but also helps performance because it leads to short wires with low wire capacitance. An ideal CSA design has approximately equal sum and carry delays because the greater of these two delays limits performance. The mirror adder is commonly used for its compact layout even though the sum delay exceeds the carry delay. The sum output can be connected to the faster carry input to partially compensate. Note that the first row of CSAs adds the first partial product to a pair of 0s. This leads to a regular structure, but is inefficient. At a slight cost to regularity, the first row of CSAs can be used to add the first three partial products together. This reduces the number of rows by two and correspondingly reduces the adder propagation delay. Yet another way to improve the multiplier array performance is to replace the bottom row with a faster CPA such as
  • 9. a lookahead or tree adder. In summary, the critical path of an array multiplier involves N–2 CSAs and a CPA. Fig 2.7 Rectangular Multiplier 2.4.2 Two's Complement Array Multiplication: Multiplication of two's complement numbers at first might seem more difficult because some partial products are negative and must be subtracted. Recall that the most significant bit of a two's complement number has a negative weight. Hence, the product is
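A sketch of the expansion this sentence leads to, assuming the Y and X notation introduced for the unsigned case and a negative weight on each most significant bit (the grouping of terms is an assumption, chosen so that the two subtracted terms appear together):

```latex
P = \left( -y_{M-1} 2^{M-1} + \sum_{j=0}^{M-2} y_j 2^{j} \right)
    \left( -x_{N-1} 2^{N-1} + \sum_{i=0}^{N-2} x_i 2^{i} \right)
  = \sum_{i=0}^{N-2} \sum_{j=0}^{M-2} x_i \, y_j \, 2^{i+j}
    + x_{N-1} \, y_{M-1} \, 2^{M+N-2}
    - \left( \sum_{i=0}^{N-2} x_i \, y_{M-1} \, 2^{i+M-1}
           + \sum_{j=0}^{M-2} x_{N-1} \, y_j \, 2^{j+N-1} \right)
```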
  • 10. In EQ (2.2), two of the partial products have negative weight and thus should be subtracted rather than added. The Baugh-Wooley multiplier algorithm handles subtraction by taking the two's complement of the terms to be subtracted (i.e., inverting the bits and adding one). Figure 2.8 shows the partial products that must be summed. The upper parallelogram represents the unsigned multiplication of all but the most significant bits of the inputs. The next row is a single bit corresponding to the product of the most significant bits. The next two pairs of rows are the inversions of the terms to be subtracted. Each term has implicit leading and trailing zeros, which are inverted to leading and trailing ones. Extra ones must be added in the least significant column when taking the two's complement. The multiplier delay depends on the number of partial product rows to be summed. The modified Baugh-Wooley multiplier reduces this number of partial products by precomputing the sums of the constant ones and pushing some of the terms upward into extra columns. Figure 2.9 shows such an arrangement. The parallelogram-shaped array can again be squashed into a rectangle, as shown in Figure 2.10, giving a design almost identical to the unsigned multiplier of Figure 2.7. The AND gates are replaced by NAND gates in the hatched cells, and 1s are added in place of 0s at two of the unused inputs. The signed and unsigned arrays are so similar that a single array can be used for both purposes if XOR gates are used to conditionally invert some of the terms depending on the mode. Fig 2.8 Partial products for two's complement multiplier
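The subtraction-by-complement step of the Baugh-Wooley approach can be checked with a word-level Python sketch. It models the unmodified form (invert the term to be subtracted across the full M + N-bit width and add one), not the precomputed-constant arrangement of Figure 2.9; the function name and the word-level arithmetic are assumptions for illustration.

```python
def baugh_wooley_multiply(y, x, m_bits, n_bits):
    """Signed M x N multiply using only additions of (possibly inverted) terms."""
    width = m_bits + n_bits
    mask = (1 << width) - 1
    y_u = y & ((1 << m_bits) - 1)  # raw two's complement bit patterns
    x_u = x & ((1 << n_bits) - 1)
    y_msb = (y_u >> (m_bits - 1)) & 1
    x_msb = (x_u >> (n_bits - 1)) & 1
    y_low = y_u & ((1 << (m_bits - 1)) - 1)
    x_low = x_u & ((1 << (n_bits - 1)) - 1)

    # Upper parallelogram: unsigned partial products of all but the MSBs.
    positive = sum(y_low << i for i in range(n_bits - 1) if (x_low >> i) & 1)
    # Single bit: product of the two most significant bits.
    msb_term = (x_msb & y_msb) << (width - 2)
    # The two negative-weight terms.
    neg_a = (x_low << (m_bits - 1)) if y_msb else 0
    neg_b = (y_low << (n_bits - 1)) if x_msb else 0

    # Subtract by inverting every bit (leading and trailing zeros become ones)
    # and adding one in the least significant column.
    total = positive + msb_term + ((~neg_a & mask) + 1) + ((~neg_b & mask) + 1)
    total &= mask
    # Reinterpret the M+N-bit pattern as a signed value.
    return total - (1 << width) if total >> (width - 1) else total
```

Checking every pair of 4-bit signed operands against ordinary integer multiplication is a quick way to confirm the sign handling.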
  • 11. Fig 2.9 Simplified partial products for two’s complement multiplier
    Fig 2.10 Modified Baugh-Wooley two’s complement multiplier
    2.5 Booth Encoding: The array multipliers in the previous sections compute the partial products in a radix-2 manner, i.e., by observing one bit of the multiplier at a time. Radix-$2^r$ multipliers produce N/r partial products, each of which depends on r bits of the multiplier. Fewer partial products lead to a smaller and faster CSA array. For example, a radix-4 multiplier produces N/2 partial products.
  • 12. Each partial product is 0, Y, 2Y, or 3Y, depending on a pair of bits of X. Computing 2Y is a simple shift, but 3Y is a hard multiple requiring a slow carry-propagate addition of Y + 2Y before partial product generation begins. Booth encoding was originally proposed to accelerate serial multiplication. Modified Booth encoding [MacSorley61] allows higher-radix parallel operation without generating the hard 3Y multiple by instead using negative partial products. Observe that 3Y = 4Y – Y and 2Y = 4Y – 2Y. However, 4Y in a radix-4 multiplier array is equivalent to Y in the next row of the array, which carries four times the weight. Hence, partial products are chosen by considering a pair of bits along with the most significant bit from the previous pair. If the most significant bit from the previous pair is true, Y must be added to the current partial product. If the most significant bit of the current pair is true, the current partial product is selected to be negative and the next partial product is incremented. Table 2.1 shows how the partial products are selected, based on bits of the multiplier. Negative partial products are generated by taking the two’s complement of the multiplicand (possibly left-shifted by one column for –2Y). An unsigned radix-4 Booth-encoded multiplier requires $\lceil (N+1)/2 \rceil$ partial products rather than N. Each partial product is M + 1 bits to accommodate the 2Y and –2Y multiples. Even though X and Y are unsigned, the partial products can be negative and must be sign-extended properly. The Booth selects will be discussed further after an example.
    Table 2.1 Radix-4 modified Booth encoding values
    In a typical radix-4 Booth-encoded multiplier design, each group of 3 bits (a pair, along with the most significant bit of the previous pair) is encoded into several select lines (SINGLEi,
  • 13. DOUBLEi, and NEGi, given in the rightmost columns of Table 2.1) and driven across the partial product row as shown in Figure 2.11. The multiplicand Y is distributed to all the rows. The select lines control Booth selectors that choose the appropriate multiple of Y for each partial product. The Booth selectors substitute for the AND gates of a simple array multiplier to determine the ith partial product. Figure 2.11 shows a conventional Booth encoder and selector design. Y is zero-extended to M + 1 bits. Depending on SINGLEi and DOUBLEi, the A22OI gate selects either 0, Y, or 2Y. Negative partial products should be two’s-complemented (i.e., invert and add 1). If NEGi is asserted, the partial product is inverted. The extra 1 can be added in the least significant column of the next row to avoid needing a CPA. Even in an unsigned multiplier, negative partial products must be sign-extended to be summed correctly. Figure 2.12 shows a 16-bit radix-4 Booth partial product array for an unsigned multiplier using the dot diagram notation.
    Fig 2.11 Radix-4 Booth encoder and selector
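The radix-4 modified Booth selection described above can be sketched in software. This is an illustrative model under stated assumptions, not the encoder circuit of Figure 2.11: the multiplier X is treated as an unsigned n-bit integer with x(-1) = 0, the function names are hypothetical, and the SINGLE, DOUBLE and NEG values are derived from the three examined bits so that each group selects one of 0, ±Y or ±2Y.

```python
def bit(x: int, k: int) -> int:
    """Bit k of x, with bit -1 defined as 0."""
    return (x >> k) & 1 if k >= 0 else 0


def select_lines(x: int, i: int):
    """Return (single, double, neg) for partial product i of multiplier x."""
    lo, mid, hi = bit(x, 2 * i - 1), bit(x, 2 * i), bit(x, 2 * i + 1)
    single = lo ^ mid                      # magnitude Y
    double = int(lo == mid and mid != hi)  # magnitude 2Y
    neg = hi                               # negative multiple (or "negative zero")
    return single, double, neg


def booth_radix4_digits(x: int, n: int) -> list:
    """Recode an unsigned n-bit multiplier into digits from {-2, -1, 0, 1, 2}."""
    count = (n + 2) // 2                   # equals ceil((n + 1) / 2)
    digits = []
    for i in range(count):
        single, double, neg = select_lines(x, i)
        magnitude = single + 2 * double
        digits.append(-magnitude if neg else magnitude)
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the selected multiples of y, each weighted by 4**i."""
    return sum(d * y * 4 ** i for i, d in enumerate(booth_radix4_digits(x, n)))


if __name__ == "__main__":
    print(booth_radix4_digits(0xFFFF, 16))
    print(booth_multiply(0xFFFF, 0x1234, 16) == 0xFFFF * 0x1234)
```

For a 16-bit unsigned multiplier the recoding yields nine digits; the last one is only ever 0 or 1 because the bits above the operand are zero.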
  • 14. Fig 2.12 Radix-4 Booth-encoded partial products with sign extension
    Each dot in the Booth-encoded multiplier is produced by a Booth selector rather than a simple AND gate. Partial products 0–7 are 17 bits. Each partial product i is sign-extended with $s_i = NEG_i = x_{2i+1}$, which is 1 for negative multiples (those in the bottom half of Table 2.1) or 0 for positive multiples. Observe how an extra 1 is added to the least significant bit in the next row to form the 2’s complement of negative multiples. Inverting the implicit leading zeros generates leading ones on negative multiples. The extra terms increase the size of the multiplier. PP8 is required in case PP7 is negative; this partial product is always 0 or Y because x16 and x17 are 0. Hence, partial product 8 is only 16 bits. Observe that the sign extension bits are all either 1s or 0s. If a single 1 is added to the least significant position in a string of 1s, the result is a string of 0s plus a carry out of the top bit that may be discarded. Therefore, the large number of s bits in each partial product can be replaced by an equal number of constant 1s plus the inverse of s added to the least significant position, as shown in Figure 2.13(a). These constants can mostly be optimized out of the array by precomputing their sum. The simplified result is shown in Figure 2.13(b). As usual, it can be squashed to fit a rectangular floorplan. The critical path of the multiplier involves the Booth decoder, the select line drivers, the Booth selector, approximately N/2 CSAs, and a final CPA. Each partial product fills about M + 5 columns. 54 × 54-bit radix-4 Booth multipliers for IEEE double-precision floating-point units are typically 20–50% smaller (and arguably up to 20% faster) than nonencoded counterparts, so the technique is widely used. The multiplier requires
  • 15. M × N/2 Booth selectors. Because the selectors account for a substantial portion of the area and only a small fraction of the critical path, they should be optimized for size over speed. For example, one published design describes a sign-select Booth encoder and selector that uses only 10 transistors per selector bit at the expense of a more complex encoder. Another presents a one-hot Booth encoder and selector that chooses one of the six possible partial products using a transmission gate multiplexer.
    Fig 2.13 Radix-4 Booth-encoded partial products with simplified sign extension
    Booth Encoding Signed Multipliers: Signed two’s complement multiplication is similar, but the multiplicand may have been negative, so sign extension must be done based on the sign bit of the partial product, $PP_{iM}$. Figure 2.14 shows such an array, where the sign extension bit is $e_i = PP_{iM}$. Also notice that PP8, which was either Y or 0 for unsigned multiplication, is always 0 and can be omitted for signed multiplication because the multiplier x is sign-extended such that x17 =
  • 16. x16 = x15. The same Booth selector and encoder can be employed, but Y should be sign-extended rather than zero-extended to M + 1 bits.
    Fig 2.14 Radix-4 Booth-encoded partial products for signed multiplication
    Higher Radix Booth Encoding: Large multipliers can use Booth encoding of higher radix. For example, ordinary radix-8 multiplication reduces the number of partial products by a factor of 3, but requires hard multiples of 3Y, 5Y, and 7Y. Radix-8 Booth encoding only requires the hard 3Y multiple, as shown in Table 2.2. Although this requires a CPA before partial product generation, it can be justified by the reduction in array size and delay. Higher-radix Booth encoding is possible, but generating the other hard multiples appears not to be worthwhile for multipliers of fewer than 64 bits. Similar techniques apply to sign-extending higher-radix multipliers.
  • 17. Table 2.2 Radix-8 modified Booth encoding values
    Column Addition: The critical path in a multiplier involves summing the dots in each column. Observe that a CSA is effectively a “ones counter” that adds the number of 1s on the A, B, and C inputs and encodes them on the sum and carry outputs, as summarized in Table 2.3. A CSA is therefore also known as a (3,2) counter because it converts three inputs into a count encoded in two outputs. The carry-out is passed to the next more significant column, while a corresponding carry-in is received from the previous column. This is called a horizontal path because it crosses columns. For simplicity, a carry is represented as being passed directly down the column. Figure 2.15 shows a dot diagram of an array multiplier column that sums N partial products sequentially using N–2 CSAs. For example, the 16 × 16 Booth-encoded multiplier from Figure 2.13(b) sums nine partial products with seven levels of CSAs. The output is produced in carry-save redundant form suitable for the final CPA.
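The behaviour of a CSA as a (3,2) counter, and the sequential column addition of an array multiplier, can be modelled on whole words. The sketch below is an assumption-laden software analogy (non-negative Python integers stand in for bit vectors, and `csa` and `array_sum` are hypothetical names): each CSA turns three operands into a sum word and a carry word without propagating carries, and only the final step performs a carry-propagate addition.

```python
def csa(a: int, b: int, c: int):
    """Carry-save adder: reduce three operands to a (sum, carry) pair."""
    s = a ^ b ^ c                                # per-bit sum of the three inputs
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, moved one column left
    return s, carry


def array_sum(operands):
    """Add operands sequentially, one CSA level per extra operand."""
    s, c = operands[0], operands[1]
    levels = 0
    for op in operands[2:]:
        s, c = csa(s, c, op)
        levels += 1
    return s + c, levels                         # the final '+' plays the role of the CPA


if __name__ == "__main__":
    total, levels = array_sum(list(range(1, 10)))  # nine operands
    print(total, levels)
```

With nine operands the loop runs seven times, matching the N–2 CSA levels quoted for the array multiplier.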
  • 18. Table 2.3 Carry-save adder as a ones counter
    Fig 2.15 Dot diagram for array multiplier
    The column addition is slow because only one CSA is active at a time. One way to speed the column addition is to sum partial products in parallel rather than sequentially. Figure 2.16 shows a Wallace tree using this approach [Wallace64]. The Wallace tree requires $\lceil \log_{3/2}(N/2) \rceil$ levels of (3,2) counters to reduce N inputs down to two carry-save redundant form outputs. Even though the CSAs in the Wallace tree are shown in two dimensions, they are logically packed into a single column of the multiplier. This leads to long and irregular wires along the column to connect the CSAs. The wire capacitance increases the delay and energy of the multiplier, and the wires can be difficult to lay out.
    Fig 2.16 Dot diagram for Wallace tree multiplier
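The depth advantage of a Wallace tree over sequential column addition can be illustrated by counting levels directly. This is a small sketch with a hypothetical function name; it assumes that at every level as many groups of three dots as possible are fed to (3,2) counters and any leftover dots pass through unchanged.

```python
def wallace_levels(n: int) -> int:
    """Number of (3,2)-counter levels needed to reduce n dots in a column to 2."""
    levels = 0
    while n > 2:
        # Each full group of three becomes two outputs; leftovers pass through.
        n = 2 * (n // 3) + n % 3
        levels += 1
    return levels


if __name__ == "__main__":
    for rows in (3, 9, 16):
        print(rows, wallace_levels(rows), max(rows - 2, 0))
```

For the nine partial product rows of the 16 × 16 Booth-encoded example, this count gives four levels, compared with seven levels when the rows are added one CSA at a time.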
  • 19. 2.6 Compressor Trees: [4:2] compressors can be used in a binary tree to produce a more regular layout, as shown in Figure 2.17. A [4:2] compressor takes four inputs of equal weight and produces two outputs. It can be constructed from two (3,2) counters as shown in Figure 2.18. Along the way, it generates an intermediate carry, $t_i$, into the next column and accepts a carry, $t_{i-1}$, from the previous column, so it may more aptly be called a (5,3) counter. This horizontal path does not impact the delay because the output of the top CSA in one column is the input of the bottom CSA in the next column.
    Fig 2.17 Dot diagram for [4:2] tree multiplier
    Fig 2.18 [4:2] compressor (a) implementation with two CSAs (b) symbol
    The [4:2] CSA symbol shows only the primary inputs and outputs to emphasize the main function of reducing four inputs to two outputs. Only $\lceil \log_2(N/2) \rceil$ levels of [4:2] compressors are required, although each has greater delay than a CSA. The regular layout and routing also make the binary tree attractive. To see the benefits of a [4:2]
  • 20. compressor, we introduce the notion of fast and slow inputs and outputs. Figure 2.19 shows a simple gate-level CSA design. The longest path through the CSA involves two levels of XOR2 to compute the sum. X is called a fast input, while Y and Z are slow inputs because they pass through a second level of XOR. C is the fast output because it involves a single gate delay, while S is the slow output because it involves two gate delays. A [4:2] compressor might be expected to use four levels of XOR2s.
    Fig 2.19 Gate-level carry-save adder
    Fig 2.20 [4:2] compressors
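The construction of a [4:2] compressor from two (3,2) counters, as described in Section 2.6, can be checked exhaustively at the bit level. The following is a behavioural sketch only (the function names are illustrative, and it says nothing about the fast and slow paths of a real circuit): the first counter produces the intermediate carry that leaves the column, and the second counter absorbs the intermediate carry arriving from the previous column.

```python
from itertools import product


def full_adder(x: int, y: int, z: int):
    """(3,2) counter on single bits: returns (sum, carry)."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def compressor_4_2(a: int, b: int, c: int, d: int, t_in: int):
    """[4:2] compressor from two (3,2) counters: returns (sum, carry, t_out)."""
    s1, t_out = full_adder(a, b, c)      # t_out does not depend on t_in
    s, carry = full_adder(s1, d, t_in)
    return s, carry, t_out


if __name__ == "__main__":
    for bits in product((0, 1), repeat=5):
        s, carry, t_out = compressor_4_2(*bits)
        # Five equal-weight inputs are encoded as one sum bit plus two
        # outputs of double weight, hence the name (5,3) counter.
        assert sum(bits) == s + 2 * (carry + t_out)
    print("all 32 input combinations are consistent")
```

Because `t_out` is computed before `t_in` is used, the horizontal carry never ripples through more than one column, which is the property the text relies on.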
  • 21. Figure 2.20 shows various [4:2] compressor designs that reduce the critical path to only 3 XOR2s. In Figure 2.20(a), the slow output of the first CSA is connected to the fast input of the second. In Figure 2.20(b), the [4:2] compressor has been merged into a single cell, allowing a majority gate to be replaced with a multiplexer. In Figure 2.20(c), the initial XORs have been replaced with 2-level XNOR circuits that allow some sharing of subfunctions, reducing the transistor count. Figure 2.21 shows a transmission gate implementation of a [4:2] compressor. It uses only 48 transistors, allowing for a smaller multiplier array with shorter wires. Note that it uses three distinct XNOR circuit forms and two transmission gate multiplexers.
    Fig 2.21 Transmission gate [4:2] compressor
  • 22. Fig 2.22 16 × 16 Booth-encoded multiplier floorplans (a) array (b) Wallace tree (c) [4:2] tree
    Figure 2.22 compares floorplans of the 16 × 16 Booth-encoded array multiplier from Figure 2.15, the Wallace tree from Figure 2.16, and the [4:2] tree from Figure 2.17. Each row represents a horizontal slice of the multiplier containing a Booth selector or a CSA. Vertical busses connect CSAs. The Wallace tree has the most irregular and lengthy wiring. In practice, the parallelogram may be squashed into a rectangular form to make better use of the space.
    2.7 Three-Dimensional Method: The notion of connecting slow outputs to fast inputs generalizes to compressors with more than four inputs. By examining the entire partial product array at once, one can construct trees for each column that sum all of the partial products in the shortest possible time. This approach is called the three-dimensional method (TDM) because it considers the arrival time as a third dimension along with rows and columns. As an example, consider a 16 × 16 multiplier whose dot diagram, from Figure 2.13(b), contains nine partial product rows obtained through Booth encoding. The partial products in each of the 32 columns must be summed to produce the 32-bit result. As we have seen, this is done with a compressor to produce a pair of outputs, followed by a final CPA.
  • 23. Table 2.4 Comparison of XOR levels in multiplier trees
    2.8 Hybrid Multiplication: Arrays offer regular layout, but many levels of CSAs. Trees offer fewer levels of CSAs, but less regular layout and some long wires. A number of hybrids have been proposed that offer trade-offs between these two extremes. These include odd/even arrays, arrays of arrays, balanced delay trees, overturned-staircase trees, and upper/lower left-to-right leapfrog (ULLRF) trees. They can achieve nearly as few levels of logic as the Wallace tree while offering more regular (and faster) wiring. None has caught on as distinctly better than [4:2] trees.
    The three steps of multiplication are partial product generation, partial product reduction, and carry-propagate addition. A simple M × N multiplier generates N partial products using AND gates. For multipliers of 16 or more bits, radix-4 Booth encoding is typically used to cut the number of partial products in two, saving substantial area and power. Some implementations find Booth encoding is faster, while others find it has little speed benefit. The partial products are then reduced to a pair of numbers in carry-save redundant form using an array or tree of CSAs. Trees have fewer levels of logic, but longer and less regular wiring; nevertheless, most large multipliers use trees or hybrid structures. Pass-transistor Booth selectors and CSAs were popular in the 1990s, but the trend is toward static CMOS as supply voltage scales. Finally, a CPA converts the result to nonredundant form.
  • 24. CHAPTER 3 SPST MODIFIED BOOTH ENCODER
    3.1. Spurious power suppression technique: The figure shows the five cases of a 16-bit addition in which spurious switching activities occur. The data are separated into the Most Significant Part (MSP) and the Least Significant Part (LSP). The 1st case illustrates a transient state in which spurious transitions of the carry signals occur in the MSP even though the final result of the MSP is unchanged. The 2nd and 3rd cases describe the addition of one negative operand and one positive operand, without and with a carry from the LSP, respectively. The 4th and 5th cases demonstrate the addition of two negative operands, without and with a carry-in from the LSP, respectively. In these cases the results of the MSP are predictable; therefore, the computations in the MSP are useless and can be neglected. To know whether the MSP affects the computation results or not, we need a detection logic unit to detect the effective ranges of the inputs. The Boolean logical equations
  • 25. shown below express the behavioral principles of the detection logic unit in the MSP circuits of the SPST-based adder/subtractor:
    Figure 2. Spurious transition cases in multimedia/DSP processing
    $$A_{MSP} = A[15:8]; \quad B_{MSP} = B[15:8];$$
    $$A_{and} = A[15] \cdot A[14] \cdots A[8]; \quad B_{and} = B[15] \cdot B[14] \cdots B[8];$$
    where A[m] and B[n] respectively denote the mth bit of the operand A and the nth bit of the operand B, and $A_{MSP}$ and $B_{MSP}$ respectively denote the MSP parts, i.e. the 9th bit to the 16th bit, of the operands A and B. When the bits in $A_{MSP}$ and/or those in $B_{MSP}$ are all ones, the value of $A_{and}$ and/or that of $B_{and}$ respectively becomes one, whereas when the bits in $A_{MSP}$ and/or those in $B_{MSP}$ are all zeros, the value of $A_{nor}$ and/or that of $B_{nor}$ respectively becomes one. Being one of the three outputs of the detection logic unit, close denotes whether the MSP circuits can be neglected or not. When the two input operands can be classified into one of the five classes shown in Figure 1, the value of close becomes zero, which indicates that the MSP circuits can be closed. Figure 1 also shows that it is necessary to compensate the sign bit of the computing results. Accordingly, we derive the Karnaugh maps which lead to the Boolean equations (7) and (8) for the Carr_ctrl and the sign signals, respectively. In equations (7) and (8), $C_{LSP}$ denotes the carry propagated from the LSP circuits.
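The role of the detection logic can be illustrated with a behavioural model of a 16-bit addition split at bit 8. This is a sketch under explicit assumptions, not the authors' circuit: the expression used for `close` is inferred from the description above (it is zero when both MSPs are all ones or all zeros), and the shortcut used when the MSP is closed is an illustrative stand-in for the Carr_ctrl and sign compensation, whose exact equations are not reproduced in this text.

```python
def detect(a: int, b: int):
    """Return (a_and, b_and, close) for two 16-bit operands."""
    a_msp, b_msp = (a >> 8) & 0xFF, (b >> 8) & 0xFF
    a_and, a_nor = int(a_msp == 0xFF), int(a_msp == 0x00)
    b_and, b_nor = int(b_msp == 0xFF), int(b_msp == 0x00)
    # Assumed form: close is 0 when both MSPs are pure sign extension.
    close = int(not ((a_and or a_nor) and (b_and or b_nor)))
    return a_and, b_and, close


def spst_add16(a: int, b: int) -> int:
    """16-bit addition that bypasses the MSP adder whenever close == 0."""
    lsp = (a & 0xFF) + (b & 0xFF)
    c_lsp = lsp >> 8                       # carry out of the LSP
    a_and, b_and, close = detect(a, b)
    if close == 0:
        # Each MSP is 0x00 or 0xFF (0 or -1 modulo 256), so the MSP result
        # depends only on three bits and needs no 8-bit adder.
        msp = (c_lsp - a_and - b_and) & 0xFF
    else:
        msp = (((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + c_lsp) & 0xFF
    return (msp << 8) | (lsp & 0xFF)


if __name__ == "__main__":
    for a, b in [(0x0012, 0x0034), (0xFFF0, 0x0020), (0xFF80, 0xFF80), (0x1234, 0x00FF)]:
        print(hex(a), hex(b), detect(a, b)[2], hex(spst_add16(a, b)))
```

In hardware the saving comes from holding the MSP inputs at zero so that the MSP adder does not switch; the model above only shows that the result stays correct when the MSP addition is skipped.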
  • 26. The figure shows a 16-bit adder/subtractor design example based on the proposed SPST. In this example, the 16-bit adder/subtractor is divided into MSP and LSP at the place between the 8th bit and the 9th bit. Latches implemented by simple AND gates are used to control the input data of the MSP. When the MSP is necessary, the input data of the MSP remain the same as usual; when the MSP is negligible, the input data of the MSP become zeros to avoid switching power consumption. From the derived Boolean equations (1) to (8), the detection logic unit of the SPST is designed as shown in figure 4. The use of the MSP is determined by whether the input data of the MSP should be latched or not. Moreover, we add three 1-bit registers to control the assertion of the close, sign, and Carr_ctrl signals in order to further decrease the glitch signals that occur in the cascaded circuits usually adopted in VLSI architectures designed for video coding.
  • 27. Fig 3.1 16-bit adder/subtractor design example
    Fig 3.1 shows a 16-bit adder/subtractor design example adopting the proposed SPST. In this example, the 16-bit adder/subtractor is divided into MSP and LSP between the eighth and the ninth bits. Latches implemented by simple AND gates are used to control the input data of the MSP. When the MSP is necessary, the input data of the MSP remain unchanged. However, when the MSP is negligible, the input data of the MSP become zeros to avoid glitching power consumption. The two operands of the MSP enter the detection-logic unit as well as the adder/subtractor, so that the detection-logic unit can decide whether to turn off the MSP or not. Based on the derived Boolean equations (1) to (8), the detection-logic unit of the SPST is shown in Fig. 6(a); it determines whether the input data of the MSP should be latched or not. Moreover, we propose a novel glitch-diminishing technique that adds three 1-bit registers to control the assertion of the close, sign, and Carr_ctrl signals, further decreasing the transient signals that occur in the cascaded circuits usually adopted in VLSI architectures designed for multimedia/DSP applications. The timing diagram is shown in Fig. 6(b). A certain amount of delay is used to assert the close, sign, and Carr_ctrl signals after the period of data transition, which is achieved by controlling the three 1-bit registers at the outputs of the detection-logic unit. Hence, the transients of the detection-logic unit can be filtered out; thus, the data latches shown in the figure can prevent the glitch signals from flowing into the MSP at tiny cost. The data
  • 28. transient time and the earliest required time of all the inputs are also illustrated. The delay should be set within the range shown as the shaded area in the figure, to filter out the glitch signals as well as to keep the computation results correct. Based on Figs. 5 and 6, the timing issue of the SPST is analyzed as follows.
    3.1.1. When the detection-logic unit turns off the MSP: At this moment, the outputs of the MSP are directly compensated by the SE unit; therefore, the time saved by skipping the computations in the MSP circuits cancels out the delay caused by the detection-logic unit.
    3.1.2. When the detection-logic unit turns on the MSP: The MSP circuits must wait for the notification of the detection-logic unit to turn on the data latches and let the data in. Hence, the delay caused by the detection-logic unit contributes to the delay of the whole combinational circuitry, i.e., the 16-bit adder/subtractor in this design example.
    3.1.3. When the detection-logic unit keeps its decision: No matter whether the last decision was turning the MSP on or off, the delay of the detection logic is negligible because the path of the combinational circuitry (i.e., the 16-bit adder/subtractor in this design example) remains the same.
    From the analysis above, the total delay is affected only when the detection-logic unit turns on the MSP. Therefore, the detection-logic unit should be a speed-oriented design. When the SPST is applied to combinational circuitry, we should first determine the longest transitions of the cross sections of interest in each combinational circuit, which are a timing characteristic and are also related to the adopted technology. The longest transitions can be obtained by analyzing the timing differences between the earliest-arriving and the latest-arriving signals of the cross sections of a combinational circuit.
  • 29. 3.2. MAC
    3.2.1 Block Diagram of MAC: In this project, a new architecture for a high-speed MAC is proposed. In this MAC, the computations of multiplication and accumulation are combined, and a hybrid-type CSA structure is proposed to reduce the critical path and improve the output rate. It uses the MBA algorithm based on the 1’s complement number system. A modified array structure for the sign bits is used to increase the density of the operands. A carry look-ahead adder (CLA) is inserted in the CSA tree to reduce the number of bits in the final adder. In addition, in order to increase the output rate by optimizing the pipeline efficiency, intermediate calculation results are accumulated in the form of sum and carry instead of the final adder outputs. A multiplier can be divided into three operational steps. The first is radix-2 Booth encoding, in which a partial product is generated from the multiplicand X and the multiplier Y. The second is the adder array or partial product compression, which adds all partial products and converts them into the form of sum and carry. The last is the final addition, in which the final multiplication result is produced by adding the sum and the carry. If the process of accumulating the multiplied results is included, a MAC consists of four steps, as shown in Fig 3.2, which shows the operational steps explicitly.
  • 30. Fig 3.2 MAC operational steps
    3.2.2. Proposed MAC Architecture: In this section, the expression for the new arithmetic will be derived from the equations of the standard design. From this result, a VLSI architecture for the new MAC will be proposed. In addition, a hybrid-type CSA architecture that can satisfy the operation of the proposed MAC will be proposed.
    3.3. Radix-4 modified Booth's algorithm: Booth's algorithm is simple but powerful. The speed of the VMFU depends on the number of partial products and on the speed with which the partial products are accumulated. Booth's algorithm allows us to reduce the number of partial products. We choose the radix-4 algorithm for the following reasons.
    - The original Booth's algorithm has an inefficient case: 17 partial products are generated in a 16-bit × 16-bit signed or unsigned multiplication.
    - The modified Booth's radix-8 algorithm has a prohibitive encoding time in a 16-bit × 16-bit multiplication.
  • 31. The radix-8 algorithm has a 3x term, which means that a partial product cannot be generated by shifting alone. Therefore, 2x + 1x is needed in the encoding process. One solution is to handle an additional 1x term in the Wallace tree. However, a large Wallace tree has some problems too.
    A radix-4 modified Booth's algorithm: Booth's radix-4 algorithm is widely used to reduce the area of the multiplier and to increase its speed. Grouping 3 bits of the multiplier with overlap halves the number of partial products, which improves the system speed. The radix-4 modified Booth's algorithm is as follows:
    - x(-1) = 0: insert a 0 to the right of the LSB of the multiplier.
    - Start grouping every 3 bits, with overlap, from x(-1).
    - If the number of multiplier bits is odd, add one extra bit to the left of the MSB.
    - Generate each partial product from the truth table.
    - When a new partial product is generated, it is shifted left by 2 bits relative to the previous one before being added.
    x: multiplicand, y: multiplier
    3.4. Sign or zero extension: Our MAC supports signed or unsigned multiplication, and the produced result is 64 bits, which is stored in 2 special 32-bit registers. First the MAC receives a multiplicand and a multiplier, but only 16-bit operands are signed numbers in Booth's radix-4 algorithm. Hence, an extension bit is required to express a 16-bit signed number. The core idea is that a 16-bit unsigned number can
  • 32. be expressed by a 33-bit signed number. 17 partial products are generated in the 33-bit × 33-bit case (16 partial products in the 32-bit × 32-bit case). Here is an example of signed and unsigned multiplication. When x (the multiplicand) is the 3-bit value 111 and y (the multiplier) is the 3-bit value 111, the signed and unsigned products differ: in the signed case x × y = 1 (–1 × –1 = 1), and in the unsigned case x × y = 49 (7 × 7 = 49).
    3.5. Carry-Save Adder: When three or more operands are to be added simultaneously using two-operand adders, the time-consuming carry propagation must be repeated several times. If the number of operands is k, then carries have to propagate (k–1) times (Weste & Harris, 3rd Ed). In carry-save addition, we let the carry propagate only in the last step, while in all the other steps we generate the partial sum and the sequence of carries separately. A CSA is capable of reducing the number of operands to be added from 3 to 2 without any carry propagation. A CSA can be implemented in different ways. In the simplest implementation, the basic element of the carry-save adder is the combination of two half adders or a 1-bit full adder (Weste & Harris, 3rd Ed).
    3.6 Circuit Design Features: One of the most advanced types of MAC for general-purpose digital signal processing has been proposed by Elguibaly. It is an architecture in which accumulation has been combined with the carry-save adder (CSA) tree that compresses partial products. In that architecture, the critical path was reduced by eliminating the adder for accumulation and decreasing the number of input bits in the final adder. While it has better performance than the previous VMFU architectures because of the reduced critical path, there is a need to improve the output rate, because the final adder results are used for accumulation. The
  • 33. architecture that merges the adder block into the accumulator register in the VMFU operator was proposed to provide the possibility of using two separate N/2-bit adders instead of one N-bit adder to accumulate the MAC results. Recently, Zicari proposed an architecture that uses a merging technique to fully utilize the 4–2 compressor. It also takes this compressor as the basic building block of the multiplication circuit.
    Fig 3.3 Basic building blocks for the multiplication circuit
  • 34. CHAPTER 4 IMPLEMENTATION
    4.1 Introduction to VMFU: Consider an operation that multiplies two N-bit numbers and accumulates the result into a 2N-bit number, alongside addition, subtraction, Sum of Absolute Differences (SAD), and interpolation. The critical path is determined by the 2N-bit accumulation operation. If a pipeline scheme is applied to each step in the standard design of Fig. 4.1, the delay of the last accumulator must be reduced in order to improve the performance of the MAC. The overall performance of the proposed VMFU is improved by eliminating the accumulator itself and combining it with the CSA function. Once the accumulator has been eliminated, the critical path is determined by the final adder in the multiplier. The basic method to improve the performance of the final adder is to decrease its number of input bits. In order to reduce this number of input bits, the multiple partial products are compressed into a sum and a carry by the CSA. The number of bits of sums and carries to be transferred to the final adder is reduced by adding the lower bits of the sums and carries in advance, within the range in which the overall performance is not degraded. A 2-bit CLA is used to add the lower bits in the CSA. In addition, to increase the output rate when pipelining is applied, the sums and carries from the CSA are accumulated instead of the outputs from the final adder, in such a manner that the sum and carry from the CSA in the previous cycle are input to the CSA. Due to this feedback of both sum and carry, the number of inputs to the CSA increases compared to the standard design. In order to handle the increased amount of data efficiently, the CSA architecture is modified to treat the sign bit.
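The idea of accumulating in carry-save form, with the sum and carry of the previous cycle fed back into the CSA, can be sketched in software. This is an illustrative model with assumed names and an assumed 34-bit datapath width; in particular, the product is computed with an ordinary multiplication as a stand-in for the Booth-encoded partial product tree, which this sketch does not model.

```python
def csa(a: int, b: int, c: int, mask: int):
    """Carry-save adder on words of limited width: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask  # drop overflow bits
    return s, carry


def mac_carry_save(pairs, width: int = 34) -> int:
    """Accumulate the products of (a, b) pairs, keeping sum and carry separate."""
    mask = (1 << width) - 1
    acc_sum, acc_carry = 0, 0
    for a, b in pairs:
        product = (a * b) & mask          # stand-in for the partial product tree
        # Feedback: last cycle's sum and carry re-enter the CSA with the new
        # product, so no carry-propagate addition happens inside the loop.
        acc_sum, acc_carry = csa(product, acc_sum, acc_carry, mask)
    # A single carry-propagate addition at the very end yields the result.
    return (acc_sum + acc_carry) & mask


if __name__ == "__main__":
    data = [(3, 5), (7, 11), (65535, 65535)]
    print(mac_carry_save(data) == sum(a * b for a, b in data))
```

Keeping the accumulator in redundant form moves the slow carry propagation out of the per-cycle path, which is the motivation given above for improving the output rate.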
  • 35. Fig 4.1 Versatile Multimedia Functional Unit
    The VMFU is composed of an adder, a multiplier and an accumulator. The adders implemented are usually carry-select or carry-save adders, as speed is of utmost importance in DSP (Chandrakasan, Sheng, & Brodersen, 1992 and Weste & Harris, 3rd Ed). One implementation of the multiplier could be a parallel array multiplier. The inputs for the VMFU are fetched from a memory location and fed to the multiplier block, which performs the multiplication and gives the result to the adder, which accumulates the result and then stores it into a memory location. This entire process is to be achieved in a single clock cycle (Weste & Harris, 3rd Ed).
  • 36. Fig 4.2 Architecture of MAC
    Figure 4.2 shows the architecture of the MAC unit designed in this work. The design consists of one 16-bit register, one 16-bit Modified Booth multiplier, a 33-bit accumulator using ripple carry, and two 16-bit accumulator registers. To multiply the values of A and B, a Modified Booth multiplier is used instead of a conventional multiplier because it can increase the speed of the MAC unit and reduce the multiplication complexity. A carry-save adder (CSA) is used as the accumulator in this design. By combining the Wallace tree multiplier approach, a carry-save adder in the final stage of the Modified Booth multiplier, and a carry-save adder as the accumulator, this VMFU design not only reduces the standby power consumption but also enhances the speed of the VMFU unit, giving better system performance. The operation of the designed VMFU unit is given in Equation 2.1. The product Ai × Bi is always fed back into the 34-bit carry-save accumulator and then added
  • 37. again with the next product Ai x Bi. This MAC unit is capable of multiplying and adding with the previous product consecutively up to eight times.
Operation: Output = Σ Ai Bi (2.1)
In this project, a 16x16 multiplier unit is designed that can accumulate into a 34-bit number. The MAC unit has a 34-bit output and operates by repeatedly adding the multiplication results. The total design area is also inspected by observing the total gate count [Hardwires]. The power-delay product is calculated by multiplying the power consumption by the time delay.
Design of VMFU
In the majority of digital signal processing (DSP) applications the critical operations involve many multiplications and/or accumulations. For real-time signal processing, a high-speed, high-throughput Multiplier-Accumulator (MAC) is always a key to achieving a high-performance digital signal processing system and versatile multimedia functional units. In the last few years, the main consideration in MAC design has been to enhance its speed, because speed and throughput are always the main concerns for a VMFU. In the era of personal communication, however, low power design has become another main design consideration, because the battery energy available in portable products limits the power consumption of the system. Therefore, the main motivation of this work is to investigate pipelined multiplier/accumulator architectures and circuit design techniques that are suitable for implementing high-throughput signal processing algorithms while achieving low power consumption. A conventional VMFU unit consists of a fast multiplier and an accumulator that contains the sum of the previous consecutive products.
The function of the VMFU unit is given by the following equation:
F = Σ Ai Bi (2.1)
The main goal of a VMFU design is to enhance the speed of the MAC unit and, at the same time, limit the power consumption. In a pipelined MAC circuit, the delay of a pipeline stage is the delay of a 1-bit full adder. Estimating this delay helps to identify the overall delay of the pipelined MAC. In this work, a 1-bit full adder is designed. Area, power and delay are calculated for the full adder, and the pipelined MAC unit is then designed for low power on that basis.
  • 38. 4.2 Explanation
4.2.1. High-Speed Booth Encoded Parallel Multiplier Design:
Fast multipliers are essential parts of digital signal processing systems. The speed of the multiply operation is of great importance in digital signal processing as well as in today's general purpose processors, especially since media processing took off. In the past, multiplication was generally implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times that it is added is the multiplier, and the result is the product. Each step of addition generates a partial product. In most computers, the operands usually contain the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of the operands in order to preserve the information content. The repeated addition method suggested by the arithmetic definition is so slow that it is almost always replaced by an algorithm that makes use of positional representation. It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them.
The basic multiplication principle is twofold, i.e. evaluation of partial products and accumulation of the shifted partial products. It is performed by the successive additions of the columns of the shifted partial product matrix. The 'multiplier' is successively shifted and gates the appropriate bit of the 'multiplicand'. The delayed, gated instances of the multiplicand must all be in the same column of the shifted partial product matrix. They are then added to form the product bit for that particular column. Multiplication is therefore a multi-operand operation. The multiplication is to be extended to both signed and unsigned numbers.
4.2.2.
Modified Booth Encoder: In order to achieve high-speed multiplication, multiplication algorithms using parallel counters, such as the modified Booth algorithm, have been proposed, and some multipliers based on these algorithms have been implemented for practical use. This type of multiplier operates much faster than an array multiplier for longer operands because its computation time is proportional to the logarithm of the word length of the operands.
  • 39. Fig 4.3 Radix-4 Booth Encoding
Booth multiplication is a technique that allows for smaller, faster multiplication circuits by recoding the numbers that are multiplied. Using radix-4 Booth recoding, it is possible to reduce the number of partial products by half. The basic idea is that, instead of shifting and adding for every column of the multiplier term and multiplying by 1 or 0, we take only every second column and multiply by ±1, ±2, or 0 to obtain the same result. To Booth recode the multiplier term, we consider the bits in blocks of three, such that each block overlaps the previous block by one bit. Grouping starts from the LSB, and the first block uses only two bits of the multiplier. Fig. 4.4 shows the grouping of bits from the multiplier term for use in modified Booth encoding.
Fig.4.4 Grouping of bits from the multiplier term
Each block is decoded to generate the correct partial product. The encoding of the multiplier Y using the modified Booth algorithm generates the following five signed digits: -2, -1, 0, +1, +2. Each encoded digit in the multiplier performs a certain operation on the multiplicand, X, as illustrated in Table 4.1
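The grouping into overlapping blocks of three bits can be sketched in software as follows. This is an illustrative model only; the function names `booth_radix4_digits` and `booth_multiply` and the default width of 16 bits are assumptions made for the example and do not come from the report.

```python
def booth_radix4_digits(y, n_bits=16):
    """Recode an n_bits two's-complement multiplier into radix-4 Booth
    digits in {-2, -1, 0, +1, +2}, least significant digit first."""
    y &= (1 << n_bits) - 1
    # bits[0] is the implied bit to the right of the LSB (always 0),
    # so bits[k + 1] holds multiplier bit k.
    bits = [0] + [(y >> i) & 1 for i in range(n_bits)]
    digits = []
    for i in range(0, n_bits, 2):
        low, mid, high = bits[i], bits[i + 1], bits[i + 2]
        digits.append(low + mid - 2 * high)  # value of one overlapping triplet
    return digits

def booth_multiply(x, y, n_bits=16):
    """Sum the Booth partial products d * x, each weighted by 4**k."""
    return sum(d * x * 4 ** k
               for k, d in enumerate(booth_radix4_digits(y, n_bits)))

print(booth_radix4_digits(0x006A))
print(booth_multiply(0x2AC9, 0x006A) == 0x2AC9 * 0x006A)
```

A 16-bit multiplier yields eight digits, so only eight partial products are formed instead of sixteen, which is the halving described in the text.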
  • 40. Table 4.1 Operation performed on the multiplicand X by each encoded digit of the multiplier
For the partial product generation, we adopt the Radix-4 Modified Booth algorithm to reduce the number of partial products by roughly one half. For multiplication of 2's complement numbers, the two-bit encoding using this algorithm scans a triplet of bits. When the multiplier B is divided into groups of two bits, the algorithm is applied to each group of divided bits. Fig. 4.5 shows a computing example of Booth multiplying the two numbers "2AC9" and "006A". The shadow denotes that the numbers in this part of the Booth multiplication are all zero, so that this part of the computation can be neglected. Saving those computations can significantly reduce the power consumption caused by the transient signals. Based on the analysis of the multiplication shown in Fig. 4.5, we propose the SPST-equipped modified Booth encoder, which is controlled by a detection unit. The detection unit takes one of the two operands as its input and decides whether the Booth encoder would perform redundant computations. As shown in Fig. 4.6, the latches can, respectively, freeze the inputs of MUX-4 to MUX-7, or only those of MUX-6 to MUX-7, when PP4 to PP7 or PP6 to PP7 are zero, in order to reduce the transition power dissipation. Fig. 4.8 shows the Booth partial product generation circuit. It includes AND/OR/EX-OR logic.
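One way such a detection unit could decide which partial products are zero is sketched below. The rule used here is an assumption for illustration, not the report's circuit: a radix-4 Booth digit is zero exactly when its three multiplier bits are equal, so PP4 to PP7 vanish when bits 7 to 15 of a 16-bit multiplier are all equal, and PP6 to PP7 vanish when bits 11 to 15 are all equal. The function name `zero_pp_groups` and its return strings are hypothetical.

```python
def zero_pp_groups(y, n_bits=16):
    """Report which upper Booth partial products are certainly zero for
    the 16-bit multiplier y (assumed to be the Booth-encoded operand)."""
    y &= (1 << n_bits) - 1

    def all_equal(lo, hi):
        # True when bits lo..hi are all 0 or all 1 (pure sign extension).
        width = hi - lo + 1
        field = (y >> lo) & ((1 << width) - 1)
        return field == 0 or field == (1 << width) - 1

    if all_equal(7, 15):
        return "PP4-PP7"   # inputs of MUX-4 to MUX-7 could be frozen
    if all_equal(11, 15):
        return "PP6-PP7"   # only inputs of MUX-6 to MUX-7 could be frozen
    return "none"

print(zero_pp_groups(0x006A))
```

For the multiplier 006A used in the example of the text, this check reports that PP4 to PP7 can be skipped, which agrees with the shaded region described there.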
  • 41. Fig.4.5 Illustration of multiplication using modified Booth encoding
The PP generator generates five candidates for the partial products, i.e., {-2A, -A, 0, A, 2A}. These are then selected according to the Booth encoding results of the operand B. When the operand other than the Booth-encoded one has a small absolute value, there are opportunities to reduce the spurious power dissipated in the compression tree.
Fig4.6 SPST equipped modified Booth encoder
  • 42. 4.2.3. Partial product generator:
Fig4.7 Booth partial product selector logic
The first step of the multiplication generates, from A and X, a set of bits whose weighted sum is the product P. For unsigned multiplication, the most significant bit of P has a positive weight, while in 2's complement it has a negative weight. Each partial product bit is generated by ANDing a bit of 'a' with a bit of 'b', which are 4-bit vectors as shown in the figure. With a 4-bit multiplier and a 4-bit multiplicand we get sixteen partial product bits, of which the first row is stored in 'q'. Similarly, the second, third and fourth rows are stored in the 4-bit vectors n, x and y.
Fig.4.8 Booth partial products Generation
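The AND-based generation of the sixteen partial product bits for two 4-bit operands can be modelled in a few lines. This is a minimal sketch for unsigned operands; the function names are hypothetical, and the four rows returned play the role of the vectors q, n, x and y mentioned above.

```python
def and_partial_products(a, b, n_bits=4):
    """Return n_bits rows; row j holds the bits a[i] AND b[j], LSB first."""
    rows = []
    for j in range(n_bits):
        bj = (b >> j) & 1
        rows.append([((a >> i) & 1) & bj for i in range(n_bits)])
    return rows

def sum_partial_products(rows):
    """Add every partial product bit with its weight 2**(i + j)."""
    total = 0
    for j, row in enumerate(rows):
        for i, bit in enumerate(row):
            total += bit << (i + j)
    return total

q, n, x, y = and_partial_products(0b1011, 0b0110)
print(q, n, x, y)
print(sum_partial_products([q, n, x, y]))
```

The weighted sum of the sixteen bits equals the unsigned product, which is the property the later reduction steps must preserve.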
  • 43. The second step of the multiplication reduces the partial products from the preceding step into two numbers while preserving the weighted sum. The sought-after product P is the sum of those two numbers, which are added during the third step. The synthesis of the "Wallace trees" follows Dadda's algorithm, which assures the minimum number of counters. If, on top of that, we require the reduction to be performed as late as (or as soon as) possible, then the solution is unique. The two binary numbers to be added during the third step may also be seen as one number in CSA notation (2 bits per digit).
Fig.4.9 Booth single partial product selector logic
4.2.4. Truth Table of Modified Booth Encoder:
Multiplication consists of three steps: 1) the first step generates the partial products; 2) the second step adds the generated partial products until only two rows remain; 3) the third step computes the final multiplication result by adding the last two rows. The modified Booth algorithm reduces the number of partial products by half in the first step. We used the modified Booth encoding (MBE) scheme, which is known as the most efficient Booth encoding and decoding scheme. Multiplying X by Y using the modified Booth algorithm starts by grouping Y into blocks of three bits and encoding each block into one of {-2, -1, 0, 1, 2}. Fig. 4.12 shows the rules for generating the encoded signals with the MBE scheme, and Fig. 4.10 shows the corresponding logic diagram. The Booth decoder generates the partial products using the encoded signals, as shown in Fig. 4.11
  • 44. Fig.4.10 Booth Encoder
Fig.4.11 Booth Decoder
The figure shows the generated partial products and the sign extension scheme of the 8-bit modified Booth multiplier. The partial products generated by the modified Booth algorithm are added in parallel using the Wallace tree until only two rows remain. The final multiplication result is generated by adding the last two rows. A carry propagation adder is usually used in this step.
Fig 4.12 Truth table for MBE Scheme
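The second and third steps (reduce to two rows, then add them) can be sketched at word level. This is a simplified illustration under stated assumptions: whole rows are compressed three at a time with a 3:2 compressor, whereas a real Wallace or Dadda tree places counters column by column; the function names are hypothetical and the rows are taken to be non-negative integers that already include their shifts.

```python
def compress_3_to_2(a, b, c):
    """Carry-save step: three rows in, two rows out, same total."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def reduce_to_two_rows(rows):
    """Repeatedly compress groups of three rows until at most two remain."""
    rows = list(rows)
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):
            nxt.extend(compress_3_to_2(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fit into a group of three pass through unchanged.
        nxt.extend(rows[len(rows) - len(rows) % 3:])
        rows = nxt
    return rows

partial_products = [5, 12, 7, 300, 41, 8, 1023, 64]
last_two = reduce_to_two_rows(partial_products)
# Third step: one carry-propagate addition of the last two rows.
print(sum(last_two) == sum(partial_products))
```

No carry propagates inside the loop, so each level has the delay of a single full adder; only the final addition needs a carry propagation adder.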
  • 45. CHAPTER 5
5.1 Introduction to FPGA:
FPGA stands for Field Programmable Gate Array, a device that contains an array of logic modules, I/O modules and routing tracks (programmable interconnect). An FPGA can be configured by the end user to implement specific circuitry. Speeds were once limited to about 100 MHz, but present devices operate in the GHz range. The main applications are DSP, FPGA-based computers, logic emulation, ASIC and ASSP. FPGAs are mainly programmed using SRAM (Static Random Access Memory) technology. SRAM is volatile, and the main advantage of SRAM programming technology is re-configurability. The issues in FPGA technology are the complexity of the logic element, clock support, IO support and interconnections (routing).
5.2 Block diagram of FPGA:
An FPGA contains a two-dimensional array of logic blocks and interconnections between the logic blocks. Both the logic blocks and the interconnects are programmable. Logic blocks are programmed to implement a desired function, and the interconnects are programmed using the switch boxes to connect the logic blocks. To be more clear, if we want to implement a complex design (a CPU, for instance), the design is divided into small sub-functions and each sub-function is implemented using one logic block. To obtain the desired design (the CPU), all the sub-functions implemented in logic blocks must then be connected, and this is done by programming the interconnects. The internal structure of an FPGA is depicted in the following figure.
  • 46. Fig 5.1 Internal structure of an FPGA
FPGAs, as an alternative to custom ICs, can be used to implement an entire System on Chip (SoC). The main advantage of an FPGA is the ability to reprogram it. The user can reprogram an FPGA to implement a design, and this is done after the FPGA is manufactured, which gives rise to the name "Field Programmable." Custom ICs are expensive and take a long time to design, so they are worthwhile only when produced in bulk. FPGAs, in contrast, are easy to implement within a short time with the help of Computer Aided Design (CAD) tools, because there is no physical layout process, no mask making, and no IC manufacturing. Some disadvantages of FPGAs are that they are slow compared to custom ICs, they cannot handle very complex designs, and they draw more power.
A Xilinx logic block consists of one Look Up Table (LUT) and one flip-flop. A LUT can be used to implement a number of different functions. The input lines to the logic block go into
  • 47. the LUT and enable it. The output of the LUT gives the result of the logic function that it implements, and the output of the logic block is the registered or unregistered output of the LUT. SRAM is used to implement a LUT. A k-input logic function is implemented using a 2^k x 1 SRAM. The number of different possible functions for a k-input LUT is 2^(2^k). The advantage of such an architecture is that it supports the implementation of a very large number of logic functions; the disadvantage is the unusually large number of memory cells required to implement such a logic block when the number of inputs is large.
Figure 5.2 4-input LUT based implementation of logic block
LUT based design provides better logic block utilization. A k-input LUT based logic block can be implemented in a number of different ways, with a trade-off between performance and logic density. An n-LUT can be seen as a direct implementation of a function truth table. Each of the latches holds the value of the function corresponding to one input combination. For example, a 2-LUT can be used to implement any of 16 functions, such as AND, OR, A + not B, etc.
Interconnects
A wire segment can be described as two end points of an interconnect with no programmable switch between them. A sequence of one or more wire segments in an FPGA can be termed a track.
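The statement that a k-input LUT is a 2^k x 1 memory holding a truth table can be made concrete with a short model. This is an illustrative sketch only; `make_lut` and `lut_read` are hypothetical helper names, and input i is assumed to drive address bit i.

```python
def make_lut(func, k):
    """Fill a 2**k x 1 table by evaluating func on every input combination."""
    return [func(*[(addr >> i) & 1 for i in range(k)]) & 1
            for addr in range(2 ** k)]

def lut_read(table, *inputs):
    """Look up the stored output bit; the inputs form the memory address."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]

and_lut = make_lut(lambda a, b: a & b, 2)
a_or_not_b = make_lut(lambda a, b: a | (1 - b), 2)

print(and_lut, a_or_not_b)
print(lut_read(and_lut, 1, 1), lut_read(a_or_not_b, 0, 1))
print(2 ** (2 ** 2))  # number of distinct 2-input functions
```

Changing the function only changes the table contents, not the hardware structure, which is why the same logic block can realize any of the 2^(2^k) functions.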
  • 48. Typically an FPGA has logic blocks, interconnects and switch blocks (Input/Output blocks). Switch blocks lie on the periphery of the logic blocks and interconnect. Wire segments are connected to logic blocks through switch blocks. Depending on the required design, one logic block is connected to another, and so on.
5.3 FPGA Design flow
This part gives a short introduction to the FPGA design flow. A simplified version of the design flow is given in the following diagram.
Fig 5.3 FPGA Design Flow
Design Entry
There are different techniques for design entry: schematic based, Hardware Description Language (HDL) based, a combination of both, etc. The selection of a method depends on the design and the designer. If the designer wants to deal more with hardware, schematic entry is the better choice. When the design is complex, or the designer thinks of the design in an algorithmic way, HDL is the better choice. Language based entry is faster but lags in performance and density. HDLs represent a level of abstraction that can isolate designers from the details of the hardware implementation. Schematic based entry gives designers much more visibility into the hardware and is the better choice for those who are hardware oriented. Another method but rarely
  • 49. used is state machines. It is the better choice for designers who think of the design as a series of states, but the tools for state machine entry are limited. In this documentation we deal with HDL based design entry.
Synthesis
Synthesis is the process that translates VHDL or Verilog code into a device netlist format, i.e. a complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design contains more than one sub-design (for example, to implement a processor we need a CPU as one design element, RAM as another, and so on), the synthesis process generates a netlist for each design element. The synthesis process checks the code syntax and analyzes the hierarchy of the design, which ensures that the design is optimized for the design architecture the designer has selected. The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx® Synthesis Technology (XST)).
Fig 5.4 FPGA Synthesis
Implementation
This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route
  • 50. Translate:
The Translate process combines all the input netlists and constraints into a logic design file. This information is saved as an NGD (Native Generic Database) file. This is done using the NGDBuild program. Here, defining constraints means assigning the ports in the design to the physical elements (e.g. pins, switches, buttons, etc.) of the targeted device and specifying the timing requirements of the design. This information is stored in a file named UCF (User Constraints File). Tools used to create or modify the UCF include PACE and the Constraint Editor.
Fig 5.5 FPGA Translate
Map
The Map process divides the whole circuit of logical elements into sub-blocks such that they fit into the FPGA logic blocks. That is, the map process fits the logic defined by the NGD file into the targeted FPGA elements (Configurable Logic Blocks (CLB), Input Output Blocks (IOB)) and generates an NCD (Native Circuit Description) file, which physically represents the design mapped to the components of the FPGA. The MAP program is used for this purpose.
  • 51. Fig 5.6 FPGA map
Place and Route:
The PAR program is used for this process. The place and route process places the sub-blocks from the map process into logic blocks according to the constraints and connects the logic blocks. For example, if a sub-block is placed in a logic block that is very near to an IO pin, it may save time but it may affect some other constraint. The trade-off between all the constraints is therefore taken into account by the place and route process. The PAR tool takes the mapped NCD file as input and produces a completely routed NCD file as output. The output NCD file contains the routing information.
Fig 5.7 FPGA Place and route
Device Programming:
Now the design must be loaded onto the FPGA, but it must first be converted to a format that the FPGA can accept. The BITGEN program deals with this conversion. The routed NCD file is given to the BITGEN program to generate a bit stream (a .BIT file) which can
  • 52. be used to configure the target FPGA device. This can be done using a cable. The selection of the cable depends on the design.
Behavioral Simulation (RTL Simulation):
This is the first of the simulation steps encountered throughout the design flow. It is performed before the synthesis process to verify the RTL (behavioral) code and to confirm that the design functions as intended. Behavioral simulation can be performed on either VHDL or Verilog designs. In this process, signals and variables are observed, procedures and functions are traced, and breakpoints are set. This simulation is very fast, so the designer can change the HDL code within a short time if the required functionality is not met. Since the design is not yet synthesized to gate level, timing and resource usage properties are still unknown.
5.4 Introduction to Hardware Description Language
Classical design methods relied on schematics and manual methods to design a circuit, but today computer-based languages are widely used to design circuits of enormous size and complexity. There are several reasons for this shift in practice. No team of engineers can correctly design and manage, by manual methods, the details of state-of-the-art integrated circuits (ICs) containing several million gates, but using hardware description languages (HDLs) designers easily manage the complexity of large designs. Even small designs rely on language-based descriptions, because designers have to quickly produce correct designs targeted for an ever-shrinking window of opportunity in the marketplace. Language-based designs are portable and independent of technology, allowing design teams to modify and re-use designs to keep pace with improvements in technology.
As physical dimensions of devices shrink, denser circuits with better performance can be synthesized from an original HDL-based model. HDLs are a convenient medium for integrating intellectual property (IP) from a variety of sources with a proprietary design. By relying on a common design language, models can be integrated for testing and synthesized separately or together, with a net reduction in time for the design cycle. Some simulators also support mixed descriptions based on multiple languages.
  • 53. The most significant gain that results from the use of an HDL is that a working circuit can be synthesized automatically from a language-based description, bypassing the laborious steps that characterize manual design methods (e.g., logic minimization with Karnaugh maps). HDL-based synthesis is now the dominant design paradigm used by industry. Today, designers build a software prototype/model of the design, verify its functionality, and then use a synthesis tool to automatically optimize the circuit and create a netlist in a physical technology. HDLs and synthesis tools focus an engineer's attention on functionality rather than on individual transistors or gates; they synthesize a circuit that will realize the desired functionality, and satisfy area and/or performance constraints. Moreover, alternative architectures can be generated from a single HDL model and evaluated quickly to perform design tradeoffs. Functional models are also referred to as behavioral models.
HDLs serve as a platform for several tools: design entry, design verification, test generation, fault analysis and simulation, timing analysis and/or verification, synthesis, and automatic generation of schematics. This breadth of use improves the efficiency of the design flow by eliminating translations of design descriptions as the design moves through the tool chain.
Two languages enjoy widespread industry support: Verilog and VHDL. Both languages are IEEE (Institute of Electrical and Electronics Engineers) standards; both are supported by synthesis tools for ASICs (application-specific integrated circuits) and FPGAs (field-programmable gate arrays).
Languages for analog circuit design, such as Spice, play an important role in verifying critical timing paths of a circuit, but these languages impose a prohibitive computational burden on large designs, cannot support abstract styles of design, and become impractical when used on a large scale. Hybrid languages (e.g., Verilog-A) are used in designing mixed-signal circuits, which have both digital and analog circuitry. System-level design languages, such as SystemC and Superlog, are now emerging to support a higher level of design abstraction than can be supported by Verilog or VHDL.
  • 54. Difference between HDL and other software languages:
- The main difference from traditional programming languages is that an HDL represents extensive parallel operations, whereas traditional languages represent mostly serial operations.
Importance of HDLs:
HDLs have many advantages compared to traditional schematic-based design.
- Designs can be described at a very abstract level by use of HDLs. Designers can write their RTL description without choosing a specific fabrication technology. Logic synthesis tools can automatically convert the design to any fabrication technology. If a new technology emerges, designers do not need to redesign their circuit. They simply input the RTL description to the logic synthesis tool and create a new gate-level netlist, using the new fabrication technology. The logic synthesis tool will optimize the circuit in area and timing for the new technology.
- By describing designs in HDLs, functional verification of the design can be done early in the design cycle. Since designers work at the RTL level, they can optimize and modify the RTL description until it meets the desired functionality. Most design bugs are eliminated at this point. This cuts down design cycle time significantly because the probability of hitting a functional bug at a later time in the gate-level netlist or physical layout is minimized.
- Designing with HDLs is analogous to computer programming. A textual description with comments is an easier way to develop and debug circuits. This also provides a concise representation of the design, compared to gate-level schematics. Gate-level schematics are almost incomprehensible for very complex designs.
Importance of Computer-Aided Digital Design:
The earliest digital circuits were designed with vacuum tubes and transistors. Integrated circuits were then invented where logic gates were placed on a single chip.
The first integrated circuit (IC) chips were SSI (Small Scale Integration) chips where the gate count was very small. As technologies became sophisticated, designers were able to place circuits with hundreds of
  • 55. gates on a chip. These chips were called MSI (Medium Scale Integration) chips. With the advent of LSI (Large Scale Integration), designers could put thousands of gates on a single chip. At this point, design processes started getting very complicated, and designers felt the need to automate these processes. Electronic Design Automation (EDA) techniques began to evolve. Chip designers began to use circuit and logic simulation techniques to verify the functionality of building blocks of the order of about 100 transistors. The circuits were still tested on the breadboard, and the layout was done on paper or by hand on a graphic computer terminal. With the advent of VLSI (Very Large Scale Integration) technology, designers could design single chips with more than 100,000 transistors. Because of the complexity of these circuits, it was not possible to verify them on a breadboard. Computer-aided techniques became critical for verification and design of VLSI digital circuits. Computer programs to do automatic placement and routing of circuit layouts also became popular. The designers were now building gate-level digital circuits manually on graphic terminals. They would build small building blocks and then derive higher-level blocks from them. This process would continue until they had built the top-level block. Logic simulators came into existence to verify the functionality of these circuits before they were fabricated on chip.
What is a gate-level netlist:
A gate-level netlist is a description of the circuit in terms of gates and the connections between them. Logic synthesis tools convert the RTL description to a gate-level netlist.
Problems associated with the conventional approach to digital design:
Digital ICs of SSI and MSI types have become universally standardized and have been accepted for use.
Whenever a designer has to realize a digital function, he uses a standard set of ICs along with a minimal set of additional discrete circuitry. Consider a simple example of realizing the function Q(n+1) = Q(n) + (A.B). Here Q(n), A, and B are Boolean variables, with Q(n) being the value of Q at the nth time step. A.B signifies the logical AND of A and B; the '+' symbol signifies the logical OR of the logic variables on either side. A circuit to realize the function is shown in Fig. 5.8. The circuit can
  • 56. be realized in terms of two ICs: an A-O-I gate and a flip-flop. It can be directly wired up, tested, and used.
Fig. 5.8 A simple digital circuit
The accepted approach to digital design here is a mix of the top-down and bottom-up approaches, as follows:
1. Decide the requirements at the system level and translate them to circuit requirements.
2. Identify the major functional blocks required, such as timer, DMA unit, register file, etc., say as in the design of a processor.
3. Whenever a function can be realized using a standard IC, use it, for example a programmable counter, mux, demux, etc.
4. Whenever the above is not possible, form the circuit to carry out the block functions using standard SSI, for example gates, flip-flops, etc.
5. Use additional components like transistors, diodes, resistors, capacitors, etc., wherever essential.
Once the above steps have been gone through, a paper design is ready. Starting with the paper design, one has to do a circuit layout. The physical location of all the components is tentatively decided; they are interconnected and the 'circuit-on-paper' is made ready. Once a paper design is done, a layout is carried out and a net-list prepared. Based on this, the PCB is fabricated and populated, and all the populated cards are tested and debugged. The procedure is shown as a process flowchart in Fig. 5.9.
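Returning to the example function Q(n+1) = Q(n) + (A.B) realized above with an A-O-I gate and a flip-flop, its clocked behaviour can be checked with a few lines of simulation before anything is wired up. This is a minimal sketch; the function name `simulate` and the reset value of Q are assumptions made for the example.

```python
def simulate(inputs, q0=0):
    """Step Q(n+1) = Q(n) OR (A AND B) once per clock for each (A, B) pair."""
    q = q0
    trace = []
    for a, b in inputs:
        q = q | (a & b)   # next state is computed by the AND-OR logic
        trace.append(q)   # value held by the flip-flop after the clock edge
    return trace

print(simulate([(0, 1), (1, 0), (1, 1), (0, 0)]))
```

The trace shows that Q stays at 0 until A and B are both 1 in the same cycle and then remains set, which is the behaviour expected from feeding Q back through the OR term.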
  • 57. Fig.5.9 Sequence of steps in conventional electronic circuit design.
At the debugging stage one may encounter three types of problems:
1. Functional mismatch: The realized and expected functions are different. One may have to go through the relevant functional block carefully and locate any error logically. Finally the necessary correction has to be carried out in hardware.
2. Timing mismatch: The problem can manifest in different forms. One possibility is that the signal goes through different propagation delays in two paths and arrives at a point with a timing mismatch. This can cause faulty operation. Another possibility is a race condition in a circuit involving asynchronous feedback. This kind of problem may call for elaborate debugging. The preferred practice is to do debugging at smaller module stages and to ensure that feedback through larger loops is avoided; it becomes essential to check for the existence of long asynchronous loops.
3. Overload: Some signals may be overloaded to such an extent that the signal transition may be unduly delayed or even suppressed. The problem manifests as reflections and erratic behavior in some cases (the signal has to be suitably buffered here). In fact, overload on a signal can lead to timing mismatches.
  • 58. The above has to be carried out after the prototype PCB has been manufactured; it involves cost, time, and a redesign process to arrive at a bug-free design.
Logic simulation and synthesis:
There are two applications of HDL processing: simulation and synthesis.
Simulation
Simulation is used to verify the functionality of the circuit.
A) Functional Simulation: study of the circuit's operation independent of timing parameters and gate delays.
B) Timing Simulation: study including estimated delays; it verifies that setup, hold and other timing requirements of devices such as flip-flops are met.
Synthesis:
Synthesis is one of the foremost back-end steps. It converts a VHDL or Verilog description into a set of primitives (equations, as in a CPLD) or components (as in FPGAs) that fit the target technology. Basically, the synthesis tools convert the design description into equations or components.
  • 59. CHAPTER 6 RESULT ANALYSIS 6.1. Simulation Results of VMFU: 6.1.1 Partial products Generators: Fig 6.1 Simulation result of Partial products Generators 6.1.2 Booth Encoder: Fig 6.2 Simulation result of Booth Encoder
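As an aid to reading a Booth encoder waveform, the following is a minimal behavioral sketch of classic radix-2 Booth recoding. It is an assumption made for illustration, not the project's RTL: the function names and the 16-bit width are hypothetical, and the actual encoder in the design may use a different recoding scheme. Each multiplier bit pair (b[i], b[i-1]) is recoded into a digit in {-1, 0, +1}, and the partial products are the multiplicand scaled by those digits.

```python
def booth_radix2_digits(multiplier: int, width: int) -> list[int]:
    """Recode a two's-complement multiplier into radix-2 Booth digits.

    Digit i equals b[i-1] - b[i], with b[-1] taken as 0, so each digit
    is -1, 0, or +1. Digits are returned least significant first.
    """
    bits = multiplier & ((1 << width) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(width):
        cur = (bits >> i) & 1
        digits.append(prev - cur)
        prev = cur
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int = 16) -> int:
    """Sum the shifted partial products selected by the Booth digits."""
    digits = booth_radix2_digits(multiplier, width)
    return sum(d * (multiplicand << i) for i, d in enumerate(digits))


print(booth_radix2_digits(6, 4))   # digits for the 4-bit pattern 0110
print(booth_multiply(7, -3))       # signed product of 7 and -3
```

A run of identical multiplier bits produces zero digits, so only the boundaries of each run generate a non-zero partial product.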
  • 60. 6.1.3 Carry-Save Adder: Fig 6.3 Simulation result of Carry-Save Adder 6.1.4 Versatile Multimedia Functional Unit: Fig 6.4 Simulation result of Versatile Multimedia Functional Unit 6.2 Synthesis Result The developed MAC design is simulated and its functionality is verified. Once functional verification is done, the RTL model is taken through synthesis using the Xilinx ISE tool. In the synthesis process, the RTL model is converted to a gate-level netlist mapped to a specific technology library. This MAC design is synthesized for the Spartan 3E family, for which Xilinx ISE offers many different devices. To synthesize this design, the device “XC3S500E” has been chosen, with the package “FG320” and the device speed grade “-4”.
  • 61. The MAC design is synthesized and its results are analyzed as follows. Device utilization summary: The device utilization report covers the following: logic utilization, logic distribution, and total gate count for the design. The device utilization summary is shown above; it gives the number of resources used out of those available on the chosen device and package, also expressed as a percentage. Timing Summary: Speed Grade: -4 Minimum period: 35.100ns (Maximum Frequency: 28.490MHz)
  • 62. Minimum input arrival time before clock: 23.605ns Maximum output required time after clock: 4.283ns Maximum combinational path delay: No path found. The period and frequency reported in the timing summary are estimates made during synthesis; exact timing figures are available only after place and route. The synthesis estimate of the maximum operating frequency for this design is 28.490 MHz, corresponding to a minimum period of 35.100 ns. Here, OFFSET IN is the minimum input arrival time before the clock and OFFSET OUT is the maximum output required time after the clock. RTL Schematic: The RTL (Register Transfer Level) schematic can be viewed as a black box once the design has been synthesized. It shows the inputs and outputs of the system. Double-clicking on the diagram reveals the gates, flip-flops, and multiplexers inside. Figure 6.5 Schematic with Basic Inputs and Output
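The relation between the minimum period and the maximum frequency quoted in the timing summary can be checked with a short calculation. This is only a sketch of the arithmetic (frequency is the reciprocal of the period); the function name is hypothetical, and the single input value is the minimum period from the synthesis report.

```python
def max_frequency_mhz(min_period_ns: float) -> float:
    """Convert a minimum clock period in ns to a maximum frequency in MHz."""
    # 1 / (1 ns) = 1000 MHz, so f[MHz] = 1000 / T[ns].
    return 1000.0 / min_period_ns


fmax = max_frequency_mhz(35.100)  # minimum period from the timing summary
print(f"Maximum frequency: {fmax:.3f} MHz")
```

Because this figure is a synthesis estimate, the same conversion would need to be repeated with the post-place-and-route period to obtain the final operating frequency.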
  • 63. Figure 6.6 Schematic of Booth Encoder with SPST Adder 6.3 Summary: The developed VMFU design is modelled and simulated using the ModelSim tool. The simulation results are discussed by considering different cases. The RTL model is synthesized using the Xilinx tool for the Spartan 3E, and the synthesis results are discussed with the help of the generated reports.
  • 64. CHAPTER 7 CONCLUSION In this project, a versatile multimedia functional unit (VMFU) is designed using the low-power technique called SPST. The unit supports 16x16 multiply-accumulate (MAC), addition, subtraction, sum of absolute differences, and interpolation. A Radix-2 Modified Booth multiplier circuit is used for the MAC architecture. Compared with other multiplier circuits, the Booth multiplier offers the highest operating speed and a lower hardware count. The basic building blocks of the VMFU are identified, and each block is analyzed for its performance; power and delay are calculated for the blocks. The MAC unit is designed with an enable signal to reduce the total power consumption, based on the block-enable technique. Using this block, the N-bit MAC unit is constructed and its total power consumption is calculated. This work presents the low-power technique called SPST and explores its applications in multimedia/DSP computations; the theoretical analysis and the realization issues of the SPST are discussed in full. The proposed SPST decreases the switching (or dynamic) power dissipation, which comprises a significant portion of the total power dissipation in integrated circuits. In addition, the proposed SPST achieves a 24% saving in power consumption at the expense of only 10% area overhead for the proposed VMFU.
  • 65. FUTURE SCOPE