EC8095: VLSI Design Department of ECE 2020-2021
St.Joseph’s College of Engineering / St.Joseph’s Institute of Technology 1
UNIT – I
INTRODUCTION - BASIC MOS TRANSISTOR
The invention of the transistor by William B. Shockley, Walter H. Brattain and John Bardeen
of Bell Telephone Laboratories was followed by the development of the integrated circuit (IC).
The very first IC emerged at the beginning of 1960, and since that time there have already
been four generations of ICs:
1) SSI ( Small Scale Integration)
2) MSI ( Medium Scale Integration)
3) LSI ( Large Scale Integration)
4) VLSI ( Very Large Scale Integration)
Now we see the emergence of the 5th generation, ULSI (Ultra Large Scale Integration), which
is characterized by complexities in excess of 3 million devices on a single IC chip. Within the bounds
of MOS technology, the possible circuit realizations may be based on pMOS, nMOS, CMOS and now
BiCMOS devices. Although CMOS is the dominant technology, some of the examples used to
illustrate the design processes will be presented in nMOS form. The reasons are :
1) For nMOS technology, the design methodology and the design rules are easily learned, thus
providing a simple but excellent introduction to structured design for VLSI.
2) nMOS technology and design processes provide an excellent background for other
technologies. In particular some familiarity with nMOS allows a relatively easy transition to CMOS
technology and design.
3) For GaAs technology some arrangements in relation to logic design are similar to those
employed in nMOS technology. Therefore, understanding the basics of nMOS design will assist in the
layout of GaAs circuits.
BASIC MOS TRANSISTORS
nMOS devices are formed in a p-type substrate of moderate doping level. The source and drain
regions are formed by diffusing n-type impurities through suitable masks into these areas to give the
desired n-impurity concentration. They give rise to depletion regions which extend mainly into the more
lightly doped p-region.
• Thus, source and drain are isolated from one another by two diodes.
• Connections to the source and drain are made by a deposited metal layer (Fig. a).
• A polysilicon gate is deposited on a layer of insulation over the region between source and drain.
• If the gate is connected to a suitable positive voltage with respect to the source, then the
electric field established between the gate and the substrate gives rise to a charge inversion region in
the substrate under the gate insulation, and a conducting path or channel is formed between source and
drain. This is the enhancement mode device.
• In a depletion mode device, the channel is present even under the condition Vgs = 0. It is established by
implanting suitable impurities in the region between source and drain before the insulation and the
gate are deposited (Fig. b).
• In a pMOS device, the substrate is of n-type material and the source and drain diffusions are consequently p-type (Fig. c).
ENHANCEMENT MODE TRANSISTOR ACTION:
• To establish the channel in the first place, a minimum voltage level, the threshold voltage Vt, must be established between gate and source.
• Fig (a) indicates the conditions prevailing with the channel established but no current flowing between source and drain (Vds = 0).
• Now consider the conditions when current flows in the channel because a voltage Vds is applied between drain and source. There must be a corresponding IR drop = Vds along the channel.
• This results in the voltage between gate and channel varying with distance along the channel, with the voltage being a maximum of Vgs at the source end.
• Since the effective gate voltage is Vg = Vgs – Vt, there will be voltage available to invert the channel at the drain end so long as Vgs – Vt >= Vds.
• The limiting condition comes when Vds = Vgs – Vt. For all voltages Vds < Vgs – Vt, the device is in the non-saturated region of operation.
• When Vds > Vgs – Vt, an IR drop = Vgs – Vt takes place over less than the whole length of the channel, so that over part of the channel, near the drain, there is insufficient electric field available to give rise to an inversion layer to create the channel. The channel is said to be pinched off.
• Diffusion current completes the path from source to drain in this case, causing the channel to exhibit a high resistance and behave as a constant current source. This region is known as the saturation region.
DEPLETION MODE TRANSISTOR ACTION
• The channel is established, due to the implant, even when Vgs = 0, and to cause the channel to cease to exist a negative voltage Vtd must be applied between gate and source.
• Vtd is typically < -0.8 Vdd, depending on the implant and substrate bias, but, threshold voltage differences apart, the action is similar to that of the enhancement mode transistor.
Drain to source current Ids versus voltage Vds relationships
• The whole concept of the MOS transistor evolves from the use of a voltage on the gate to induce a
charge in the channel between source and drain, which may then be caused to move from source to
drain under the influence of an electric field created by voltage Vds applied between source and drain.
• Since the charge induced is dependent on the gate to source voltage Vgs, Ids is dependent on
both Vgs and Vds.
• Consider a structure in which electrons will flow from source to drain:
$$I_{ds} = -I_{sd} = \frac{\text{charge induced in channel } (Q_c)}{\text{electron transit time } (\tau)} \qquad (1)$$
First, the transit time is
$$\tau_{sd} = \frac{\text{length of channel } (L)}{\text{velocity } (v)}$$
But velocity $v = \mu E_{ds}$, where μ = electron or hole mobility (surface) and Eds = electric field (drain to
source).
Now $E_{ds} = V_{ds}/L$, so that $v = \mu V_{ds}/L$. Thus,
$$\tau_{sd} = \frac{L^2}{\mu V_{ds}} \qquad (2)$$
Typical values of μ at room temperature are μn = 650 cm²/V·s (surface) and μp = 240 cm²/V·s (surface).
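As a quick numerical check of the transit-time relation τsd = L²/(μ·Vds), the following sketch evaluates it for electrons and holes using the surface mobilities quoted above. The channel length (5 μm) and drain voltage (3 V) are illustrative assumptions, not values from these notes.

```python
def transit_time(length_cm, mobility_cm2_per_vs, vds):
    """Source-to-drain transit time tau_sd = L^2 / (mu * Vds), in seconds."""
    if vds <= 0:
        raise ValueError("Vds must be positive for carriers to drift")
    return length_cm ** 2 / (mobility_cm2_per_vs * vds)


MU_N = 650.0  # electron surface mobility, cm^2/(V*s), from the notes
MU_P = 240.0  # hole surface mobility, cm^2/(V*s), from the notes

L_CM = 5e-4   # assumed channel length: 5 um expressed in cm
VDS = 3.0     # assumed drain-to-source voltage in volts

tau_n = transit_time(L_CM, MU_N, VDS)
tau_p = transit_time(L_CM, MU_P, VDS)
print(f"electron transit time: {tau_n * 1e9:.3f} ns")
print(f"hole transit time:     {tau_p * 1e9:.3f} ns")
```

Because τsd scales with L², halving the channel length cuts the transit time by a factor of four at the same Vds.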
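Non Saturated region:
• The charge induced in the channel by the gate voltage is due to the voltage difference between the gate
and the channel, Vgs.
• The voltage along the channel varies linearly with distance x from the source due to the IR drop in the
channel. Assuming the device is not saturated, its average value is Vds/2.
• The effective gate voltage is Vg = Vgs – Vt, where Vt is the threshold voltage needed to invert the charge
under the gate and establish the channel.
Charge per unit area = $E_g \varepsilon_{ins} \varepsilon_0$. Thus the induced charge is
$$Q_c = E_g \varepsilon_{ins} \varepsilon_0 W L$$
where
Eg = average electric field, gate to channel
εins = relative permittivity of the insulation between gate and channel
ε0 = permittivity of free space = 8.85 × 10^-14 F cm^-1
$$E_g = \frac{(V_{gs} - V_t) - V_{ds}/2}{D}$$
where D = oxide thickness. Thus
$$Q_c = \frac{W L \varepsilon_{ins} \varepsilon_0}{D}\left((V_{gs} - V_t) - \frac{V_{ds}}{2}\right) \qquad (3)$$
Combining equations (2) and (3) in (1), we have
$$I_{ds} = \frac{\varepsilon_{ins}\varepsilon_0 \mu}{D}\,\frac{W}{L}\left((V_{gs} - V_t) - \frac{V_{ds}}{2}\right)V_{ds}$$
or
$$I_{ds} = K\,\frac{W}{L}\left((V_{gs} - V_t)V_{ds} - \frac{V_{ds}^2}{2}\right) \qquad (4)$$
in the non-saturated or resistive region, where Vds < Vgs – Vt and $K = \varepsilon_{ins}\varepsilon_0\mu/D$.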
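The factor W/L is of course contributed by the geometry, and it is common practice to write
β = K · W/L
so that
$$I_{ds} = \beta\left((V_{gs} - V_t)V_{ds} - \frac{V_{ds}^2}{2}\right) \qquad (4a)$$
which is an alternate form of equation 4.
The gate/channel capacitance (parallel plate) is $C_g = \varepsilon_{ins}\varepsilon_0 W L / D$. Also $K = C_g \mu / (W L)$, so
$$I_{ds} = \frac{C_g \mu}{L^2}\left((V_{gs} - V_t)V_{ds} - \frac{V_{ds}^2}{2}\right) \qquad (4b)$$
Sometimes it is convenient to use gate capacitance per unit area Co rather than Cg. Noting that Cg = Co W L,
we may also write
$$I_{ds} = C_o \mu \frac{W}{L}\left((V_{gs} - V_t)V_{ds} - \frac{V_{ds}^2}{2}\right) \qquad (4c)$$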
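Saturated region:
Saturation begins when Vds = Vgs – Vt, since at this point the IR drop in the channel equals the
effective gate to channel voltage at the drain, and we may assume that the current remains fairly
constant as Vds increases further. Thus, substituting Vds = Vgs – Vt into the non-saturated expression,
$$I_{ds} = K\,\frac{W}{L}\,\frac{(V_{gs} - V_t)^2}{2} \qquad (5)$$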
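The two regions can be combined into one piecewise function. The sketch below is a minimal illustration of the first-order model: it uses the resistive-region expression Ids = K(W/L)((Vgs – Vt)Vds – Vds²/2) for Vds < Vgs – Vt and holds the current at its Vds = Vgs – Vt value beyond that point. Treating the device as fully off for Vgs <= Vt is an added simplifying assumption, and the numerical values of K, W/L and Vt are hypothetical.

```python
def drain_current(vgs, vds, vt, k, w_over_l):
    """First-order nMOS drain current in amperes (vds >= 0 assumed)."""
    v_eff = vgs - vt              # effective gate voltage Vgs - Vt
    if v_eff <= 0:
        return 0.0                # assumption: no conduction below threshold
    beta = k * w_over_l           # beta = K * W/L
    if vds < v_eff:
        # non-saturated (resistive) region
        return beta * (v_eff * vds - vds ** 2 / 2)
    # saturated region: current frozen at its value for Vds = Vgs - Vt
    return beta * v_eff ** 2 / 2


K = 50e-6      # hypothetical process constant, A/V^2
W_OVER_L = 2.0 # hypothetical aspect ratio
VT = 1.0       # hypothetical threshold voltage, V

for vds in (1.0, 4.0, 6.0):
    ids = drain_current(5.0, vds, VT, K, W_OVER_L)
    print(f"Vgs = 5.0 V, Vds = {vds:.1f} V -> Ids = {ids * 1e6:.1f} uA")
```

The two branches give the same current at Vds = Vgs – Vt, so the modelled characteristic is continuous at the onset of saturation.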
Ideal I-V Characteristics
Drain current of MOS device in different operating regions.
MOS transistors have three regions of operation:
• Cutoff or sub-threshold region
• Linear region
• Saturation region
The long-channel model assumes that the current through an OFF transistor is 0. When a transistor
turns ON (Vgs > Vt), the gate attracts carriers (electrons) to form a channel. The electrons drift from
source to drain at a rate proportional to the electric field between these regions. Thus, we can
compute currents if we know the amount of charge in the channel and the rate at which it moves. We
know that the charge on each plate of a capacitor is Q = CV. Thus, the charge in the channel Qchannel
is
$$Q_{channel} = C_g (V_{gc} - V_t)$$
where Cg is the capacitance of the gate to the channel and Vgc – Vt is
the amount of voltage attracting charge to the channel beyond the minimum required to invert from
p to n. The gate voltage is referenced to the channel, which is not grounded. If the source is at Vs and
the drain is at Vd, the average is Vc = (Vs + Vd)/2 = Vs + Vds/2. Therefore, the mean difference between
the gate and channel potentials Vgc is Vg – Vc = Vgs – Vds/2, as shown in Figure 2.5. We can model the
gate as a parallel plate capacitor with capacitance proportional to area over thickness. If the gate has
length L and width W and the oxide thickness is tox, as shown in Figure 2.6, the capacitance is
$$C_g = k_{ox}\varepsilon_0 \frac{W L}{t_{ox}} = \varepsilon_{ox}\frac{W L}{t_{ox}} = C_{ox} W L$$
where ε0 is the permittivity of free space, 8.85 × 10^-14 F/cm, and the permittivity of SiO2 is
kox = 3.9 times as great. Often, the εox/tox term is called Cox, the capacitance per unit area of the
gate oxide.
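To get a feel for the magnitudes involved, the parallel-plate expression can be evaluated directly. This sketch uses the ε0 and kox values quoted above; the oxide thickness (100 Å) and the gate dimensions (W = 1 μm, L = 0.5 μm) are assumed purely for illustration.

```python
EPS0_F_PER_CM = 8.85e-14  # permittivity of free space, F/cm (from the notes)
K_OX = 3.9                # relative permittivity of SiO2 (from the notes)


def cox_per_area(tox_cm):
    """Gate oxide capacitance per unit area, Cox = kox * eps0 / tox, in F/cm^2."""
    return K_OX * EPS0_F_PER_CM / tox_cm


def gate_capacitance(width_cm, length_cm, tox_cm):
    """Parallel-plate gate capacitance Cg = Cox * W * L, in farads."""
    return cox_per_area(tox_cm) * width_cm * length_cm


TOX_CM = 100e-8  # assumed oxide thickness: 100 angstrom (1 angstrom = 1e-8 cm)
cox = cox_per_area(TOX_CM)
cg = gate_capacitance(1e-4, 0.5e-4, TOX_CM)  # assumed W = 1 um, L = 0.5 um

# 1 cm^2 = 1e8 um^2, so F/cm^2 -> fF/um^2 is a factor of 1e15 / 1e8
print(f"Cox = {cox * 1e15 / 1e8:.2f} fF/um^2")
print(f"Cg  = {cg * 1e15:.2f} fF")
```

Working in centimetres throughout keeps the units consistent with ε0 in F/cm; the conversion to fF/μm² is done only for display.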
Some nanometer processes use a different gate dielectric with a higher dielectric constant. In these
processes, tox is the equivalent oxide thickness (EOT), the thickness of a layer of SiO2 that has the
same Cox. In this case, tox is thinner than the actual dielectric. Each carrier in the channel is
accelerated to an average velocity, v, proportional to the lateral electric field, i.e., the field between
source and drain. The constant of proportionality μ is called the mobility:
$$v = \mu E$$
The electric field E is the voltage difference between drain and source Vds divided by the channel length:
$$E = \frac{V_{ds}}{L}$$
The time required for carriers to cross the channel is the channel length divided by the carrier
velocity: L/v. Therefore, the current between source and drain is the total amount of charge in the
channel divided by the time required to cross:
$$I_{ds} = \frac{Q_{channel}}{L/v} = \mu C_{ox}\frac{W}{L}\left(V_{gs} - V_t - \frac{V_{ds}}{2}\right)V_{ds} = \beta\left(V_{GT} - \frac{V_{ds}}{2}\right)V_{ds} \qquad (2.5)$$
The term Vgs – Vt arises so often that it is convenient to abbreviate it as VGT. EQ (2.5) describes the
linear region of operation, for Vgs > Vt, but Vds relatively small. It is called linear or resistive
because when Vds << VGT, Ids increases almost linearly with Vds, just like an ideal resistor. The
geometry and technology-dependent parameters are sometimes merged into a single factor β:
$$\beta = \mu C_{ox}\frac{W}{L}$$
If Vds > Vdsat ≡ VGT, the channel is no longer inverted in the vicinity of the drain; we say it is pinched
off. Beyond this point, called the drain saturation voltage, increasing the drain voltage has no further
effect on current. Substituting Vds = Vdsat at this point of maximum current into EQ (2.5), we find an
expression for the saturation current that is independent of Vds:
$$I_{ds} = \frac{\beta}{2}V_{GT}^2$$
This expression is valid for Vgs > Vt and Vds > Vdsat. Thus, long-channel MOS transistors are said to
exhibit square-law behavior in saturation.
Two key figures of merit for a transistor are Ion and Ioff. Ion (also called Idsat) is the ON current,
Ids, when Vgs = Vds = VDD. Ioff is the OFF current when Vgs = 0 and Vds = VDD. According to the
long-channel model, Ioff = 0 and
$$I_{on} = \frac{\beta}{2}(V_{DD} - V_t)^2$$
Figure 2.7(a) shows the I-V characteristics for the transistor. According to the first-order model, the current
is zero for gate voltages below Vt. For higher gate voltages, current increases linearly with Vds for
small Vds. As Vds reaches the saturation point Vdsat=VGT, current rolls off and eventually becomes
independent of Vds when the transistor is saturated. pMOS transistors behave in the same way, but
with the signs of all voltages and currents reversed. The I-V characteristics are in the third quadrant,
as shown in Figure2.7 (b).
Non -Ideal I-V Effects
The saturation current increases less than quadratically with increasing Vgs . This is caused
by two effects: velocity saturation and mobility degradation.
 At high lateral field strengths (Vds /L), carrier velocity ceases to increase linearly with field
strength. This is called velocity saturation and results in lower Ids than expected at high Vds .
 At high vertical field strengths (Vgs /tox ), the carriers scatter off the oxide interface more
often, slowing their progress. This mobility degradation effect also leads to less current than
expected at high Vgs .
 The saturation current of the nonideal transistor increases somewhat with Vds . This is caused
by channel length modulation, in which higher Vds increases the size of the depletion region
around the drain and thus effectively shortens the channel.
 Increasing the potential between the source and body raises the threshold through the body
effect. Increasing the drain voltage lowers the threshold through drain-induced barrier
lowering. Increasing the channel length raises the threshold through the short channel effect.
 When Vgs<Vt , the current drops off exponentially rather than abruptly becoming zero. This is
called subthreshold conduction. The current into the gate Ig is ideally 0. However, as the
thickness of gate oxides reduces to only a small number of atomic layers, electrons tunnel through
the gate, causing some gate leakage current. The source and drain diffusions are typically reverse-
biased diodes and also experience junction leakage into the substrate or well.
Both mobility and threshold voltage decrease with rising temperature. The mobility effect
tends to dominate for strongly ON transistors, resulting in lower Ids at high temperature. The
threshold effect is most important for OFF transistors, resulting in higher leakage current at high
temperature. In summary, MOS characteristics degrade with temperature.
Mobility Degradation and Velocity Saturation
 Carrier drift velocity, and hence current, is proportional to the lateral electric field Elat = Vds /L
between source and drain. The constant of proportionality is called the carrier mobility, μ. The long-
channel model assumed that carrier mobility is independent of the applied fields.
 A high voltage at the gate of the transistor attracts the carriers to the edge of the channel, causing
collisions with the oxide interface that slow the carriers. This is called mobility degradation.
 Carriers approach a maximum velocity vsat when high fields are applied. This phenomenon is
called velocity saturation.
Channel Length Modulation
Ideally, Ids is independent of Vds for a transistor in saturation, making the transistor a perfect
current source. The p–n junction between the drain and body forms a depletion region with a width Ld
that increases with Vdb. The depletion region effectively shortens the channel length to Leff = L - Ld
Assume the source voltage is close to the body voltage so Vdb = Vds. Hence, increasing Vds
decreases the effective channel length. Shorter channel length results in higher current; thus, Ids
increases with Vds in saturation. This can be crudely modeled by multiplying EQ (2.10) by a factor of
(1 + Vds / VA), where VA is called the Early voltage. In the saturation region,
$$I_{ds} = \frac{\beta}{2}V_{GT}^2\left(1 + \frac{V_{ds}}{V_A}\right)$$
As channel length gets shorter, the effect of the channel length modulation becomes relatively more
important. Hence, VA is proportional to channel length. This channel length modulation model is a
gross oversimplification of nonlinear behavior and is more useful for conceptual understanding than
for accurate device modeling.
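The effect of the (1 + Vds/VA) factor is easy to see numerically. The sketch below applies it to the square-law saturation current (β/2)·VGT²; the values of β, VGT and VA are hypothetical and chosen only to make the trend visible.

```python
def ids_saturation(beta, vgt, vds, early_voltage):
    """Saturation current with channel length modulation:
    Ids = (beta / 2) * VGT^2 * (1 + Vds / VA)."""
    return 0.5 * beta * vgt ** 2 * (1.0 + vds / early_voltage)


BETA = 100e-6  # hypothetical gain factor, A/V^2
VGT = 0.7      # hypothetical Vgs - Vt, V

# A shorter channel is modelled here simply as a smaller Early voltage.
for va in (20.0, 10.0):
    i_low = ids_saturation(BETA, VGT, 0.7, va)
    i_high = ids_saturation(BETA, VGT, 1.7, va)
    print(f"VA = {va:4.1f} V: Ids rises {100 * (i_high / i_low - 1):.1f}% "
          f"as Vds goes from 0.7 V to 1.7 V")
```

Halving VA doubles the slope of Ids versus Vds in saturation, which is the sense in which shorter channels suffer more from channel length modulation.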
Threshold Effects
So far, we have treated the threshold voltage as a constant. However, Vt increases with the source
voltage, decreases with the body voltage, decreases with the drain voltage, and increases with channel
length. This section models each of these effects.
Body Effect
The body is an implicit fourth terminal. When a voltage Vsb is applied between the source and body,
it increases the amount of charge required to invert the channel; hence, it increases the threshold
voltage. The threshold voltage can be modeled as
$$V_t = V_{t0} + \gamma\left(\sqrt{\phi_s + V_{sb}} - \sqrt{\phi_s}\right)$$
where Vt0 is the threshold voltage when the source is at the body potential, ϕs is the surface potential
at threshold, and γ is the body effect coefficient, typically in the range 0.4 to 1 V^(1/2).
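A short numerical sketch of the body effect, assuming the usual first-order form Vt = Vt0 + γ(√(ϕs + Vsb) – √ϕs). The values Vt0 = 0.4 V and ϕs = 0.8 V are hypothetical; γ = 0.4 V^(1/2) is taken from the low end of the range quoted above.

```python
import math


def threshold_with_body_effect(vsb, vt0, gamma, phi_s):
    """Vt = Vt0 + gamma * (sqrt(phi_s + Vsb) - sqrt(phi_s)), all in volts."""
    if phi_s + vsb < 0:
        raise ValueError("phi_s + Vsb must be non-negative")
    return vt0 + gamma * (math.sqrt(phi_s + vsb) - math.sqrt(phi_s))


VT0 = 0.4    # hypothetical zero-bias threshold, V
GAMMA = 0.4  # body effect coefficient, V^(1/2)
PHI_S = 0.8  # hypothetical surface potential at threshold, V

for vsb in (0.0, 0.5, 1.0):
    vt = threshold_with_body_effect(vsb, VT0, GAMMA, PHI_S)
    print(f"Vsb = {vsb:.1f} V -> Vt = {vt:.3f} V")
```

With Vsb = 0 the expression collapses to Vt0, and the threshold rises sub-linearly as the source is lifted above the body.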
i. Drain induced barrier Lowering (DIBL)
The drain voltage Vds creates an electric field that affects the threshold voltage. This drain-induced
barrier lowering (DIBL) effect is especially pronounced in short-channel transistors.
• It can be modeled as Vt = Vt0 – ηVds, where η is the DIBL coefficient, typically on the order
of 0.1 (often expressed as 100 mV/V).
Drain-induced barrier lowering causes Ids to increase with Vds in saturation, in much the same way as
channel length modulation does. This effect can be lumped into a smaller Early voltage VA.
Short Channel Effects
The threshold voltage typically increases with channel length. This phenomenon is especially
pronounced for small L where the source and drain depletion regions extend into a significant portion
of the channel, and hence is called the short channel effect or Vt rolloff.
ii. Leakage
 Even when transistors are nominally OFF, they leak small amounts of current. Leakage
mechanisms include subthreshold conduction between source and drain, gate leakage from the
gate to body, and junction leakage from source to body and drain to body.
 Subthreshold conduction is caused by thermal emission of carriers over the potential barrier set by
the threshold. Gate leakage is a quantum-mechanical effect caused by tunneling through the
extremely thin gate dielectric. Junction leakage is caused by current through the p-n junction
between the source/drain diffusions and the body.
Subthreshold Leakage
 The long-channel transistor I-V model assumes current only flows from source to drain when
Vgs> Vt. In real transistors, current does not abruptly cut off below threshold, but rather drops off
exponentially.
 When the gate voltage is high, the transistor is strongly ON. When the gate falls below Vt , the
exponential decline in current appears as a straight line on the logarithmic scale. This regime of
Vgs<Vt is called weak inversion.
 The subthreshold leakage current increases significantly with Vds because of drain-induced
barrier lowering. There is a lower limit on Ids set by drain junction leakage that is exacerbated by
the negative gate voltage.
 Subthreshold leakage current is described by EQ (2.42). Ids0 is the current at threshold and is
dependent on process and device geometry.
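The exponential behavior below threshold can be illustrated with a commonly used simplified expression, Ids = Ids0 · exp((Vgs – Vt + ηVds)/(n·vT)) · (1 – exp(–Vds/vT)). This is a sketch under stated assumptions and is not claimed to reproduce EQ (2.42) term for term: the body-bias term is omitted, and the values of n, vT, Ids0 and Vt are hypothetical. Only η = 0.1 comes from the DIBL discussion above.

```python
import math


def subthreshold_current(vgs, vds, ids0, vt, n=1.5, eta=0.1, v_thermal=0.026):
    """Simplified subthreshold current (body-bias term omitted).

    ids0: current at threshold; n: assumed slope factor;
    eta: DIBL coefficient; v_thermal: thermal voltage kT/q near room temperature.
    """
    exponent = (vgs - vt + eta * vds) / (n * v_thermal)
    return ids0 * math.exp(exponent) * (1.0 - math.exp(-vds / v_thermal))


def subthreshold_slope(n=1.5, v_thermal=0.026):
    """Gate voltage change (V) that alters the current by a factor of ten."""
    return n * v_thermal * math.log(10)


IDS0 = 1e-7  # hypothetical current at threshold, A
VT = 0.4     # hypothetical threshold voltage, V

print(f"subthreshold slope: {subthreshold_slope() * 1e3:.1f} mV/decade")
for vds in (0.1, 1.0):
    i_off = subthreshold_current(0.0, vds, IDS0, VT)
    print(f"Vgs = 0, Vds = {vds:.1f} V -> Ioff = {i_off:.3e} A")
```

The straight line on a logarithmic plot corresponds to the constant slope n·vT·ln 10, and the larger OFF current at higher Vds reflects the ηVds term.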
Gate Leakage
According to quantum mechanics, the electron cloud surrounding an atom has a probabilistic spatial
distribution. For gate oxides thinner than 15–20 Å, there is a nonzero probability that an electron in
the gate will find itself on the wrong side of the oxide, where it will get whisked away
through the channel. This effect of carriers crossing a thin barrier is called tunneling, and results in
leakage current through the gate.
Two physical mechanisms for gate tunneling are called Fowler-Nordheim (FN) tunneling and
direct tunneling. FN tunneling is most important at high voltage and moderate oxide thickness and is
used to program EEPROM memories. Direct tunneling is most important at lower voltage with thin
oxides and is the dominant leakage component. The direct gate tunneling current can be estimated as
$$I_{gate} = W A \left(\frac{V_{DD}}{t_{ox}}\right)^2 e^{-B\,t_{ox}/V_{DD}}$$
where A and B are technology constants.
Junction Leakage
The p–n junctions between diffusion and the substrate or well form diodes. The well-to-
substrate junction is another diode. The substrate and well are tied to GND or VDD to ensure these
diodes do not become forward biased in normal operation. However, reverse-biased diodes still
conduct a small amount of current ID:
$$I_D = I_S\left(e^{V_D/v_T} - 1\right)$$
where IS depends on doping levels and on the area and perimeter of the diffusion region, VD is the
diode voltage (e.g., –Vsb or –Vdb), and vT is the thermal voltage. When a junction is reverse biased by significantly
more than the thermal voltage, the leakage is just –IS, generally in the 0.1–0.01 fA/μm² range, which
is negligible compared to other leakage mechanisms.
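The claim that the leakage settles at –IS under reverse bias follows directly from the ideal diode equation ID = IS(exp(VD/vT) – 1). The sketch below evaluates it for a few bias points; the saturation current (0.1 fA) and the thermal voltage (26 mV) are assumed illustrative values.

```python
import math


def junction_current(vd, i_s, v_thermal=0.026):
    """Ideal diode current ID = IS * (exp(VD / vT) - 1), in amperes."""
    return i_s * (math.exp(vd / v_thermal) - 1.0)


I_S = 1e-16  # assumed saturation current: 0.1 fA

for vd in (0.0, -0.026, -0.1, -1.0):
    i_d = junction_current(vd, I_S)
    print(f"VD = {vd:+.3f} V -> ID = {i_d / I_S:+.4f} * IS")
```

Once the reverse bias exceeds a few thermal voltages, the exponential term is negligible and the current is indistinguishable from –IS.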
More significantly, heavily doped drains are subject to band-to-band tunneling (BTBT) and
gate-induced drain leakage (GIDL).
Temperature Dependence
Transistor characteristics are influenced by temperature. Carrier mobility decreases with temperature.
An approximate relation is
$$\mu(T) = \mu(T_r)\left(\frac{T}{T_r}\right)^{-k_\mu}$$
where T is the absolute temperature, Tr is room temperature, and kμ is a fitting parameter with a
typical value of about 1.5. vsat also decreases with temperature, dropping by about 20% from 300 to
400 K. The magnitude of the threshold voltage decreases nearly linearly with temperature and may
be approximated by
$$V_t(T) = V_t(T_r) - k_{vt}(T - T_r)$$
where kvt is typically about 1–2 mV/K. Ion at high VDD decreases with temperature. Subthreshold
leakage increases exponentially with temperature.
Conversely, circuit performance can be improved by cooling:
• Subthreshold leakage is exponentially dependent on temperature, so lower threshold voltages can
be used at low temperature.
• Velocity saturation occurs at higher fields, providing more current.
• As mobility is also higher, these fields are reached at a lower power supply, saving power.
• Depletion regions become wider, resulting in less junction capacitance.
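The two temperature relations can be evaluated side by side, assuming the forms μ(T) = μ(Tr)(T/Tr)^(–kμ) and Vt(T) = Vt(Tr) – kvt(T – Tr). The sketch uses kμ = 1.5 and kvt = 1.5 mV/K, both within the typical ranges quoted above; the room-temperature threshold of 0.4 V is a hypothetical value, and the mobility is the electron surface mobility given earlier.

```python
def mobility_at(temp_k, mu_room, t_room=300.0, k_mu=1.5):
    """mu(T) = mu(Tr) * (T / Tr) ** (-k_mu)."""
    return mu_room * (temp_k / t_room) ** (-k_mu)


def threshold_at(temp_k, vt_room, t_room=300.0, k_vt=1.5e-3):
    """Vt(T) = Vt(Tr) - k_vt * (T - Tr), with k_vt in V/K."""
    return vt_room - k_vt * (temp_k - t_room)


MU_ROOM = 650.0  # cm^2/(V*s), electron surface mobility at room temperature
VT_ROOM = 0.4    # hypothetical room-temperature threshold, V

for temp in (250.0, 300.0, 400.0):
    mu = mobility_at(temp, MU_ROOM)
    vt = threshold_at(temp, VT_ROOM)
    print(f"T = {temp:.0f} K: mu = {mu:.0f} cm^2/Vs, Vt = {vt:.3f} V")
```

Heating from 300 K to 400 K removes roughly a third of the mobility and 150 mV of threshold in this example, which is why ON current falls while leakage rises.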
Geometry Dependence
• The layout designer draws transistors with width and length Wdrawn and Ldrawn. The actual gate
dimensions may differ by some factors XW and XL.
• The source and drain tend to diffuse laterally under the gate by LD, producing a shorter effective
channel length that the carriers must traverse between source and drain. Similarly, WD accounts
for other effects that shrink the transistor width. Putting these factors together, the effective dimensions are
$$L_{eff} = L_{drawn} + X_L - 2L_D \qquad W_{eff} = W_{drawn} + X_W - 2W_D$$
The factors of two come from lateral diffusion on both sides of the channel.
 Therefore, a transistor drawn twice as long may have an effective length that is more than twice as
great. Similarly, two transistors differing in drawn widths by a factor of two may differ in
saturation current by more than a factor of two.
 Threshold voltages also vary with transistor dimensions because of the short and narrow channel
effects.
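The observation that doubling the drawn length can more than double the effective length is a matter of simple arithmetic, assuming the usual relations Leff = Ldrawn + XL – 2LD and Weff = Wdrawn + XW – 2WD. The numbers below (drawn lengths of 100 nm and 200 nm, XL = 0, LD = 10 nm) are hypothetical.

```python
def effective_length(l_drawn, x_l, l_d):
    """Leff = Ldrawn + XL - 2 * LD (all in the same length unit)."""
    return l_drawn + x_l - 2 * l_d


def effective_width(w_drawn, x_w, w_d):
    """Weff = Wdrawn + XW - 2 * WD (all in the same length unit)."""
    return w_drawn + x_w - 2 * w_d


X_L = 0.0   # hypothetical manufacturing bias on gate length, nm
L_D = 10.0  # hypothetical lateral diffusion per side, nm

short = effective_length(100.0, X_L, L_D)
long_ = effective_length(200.0, X_L, L_D)
print(f"drawn 100 nm -> effective {short:.0f} nm")
print(f"drawn 200 nm -> effective {long_:.0f} nm")
print(f"effective length ratio: {long_ / short:.2f}")
```

Because the same fixed offset is subtracted from both devices, the ratio of effective lengths exceeds the ratio of drawn lengths.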
Combining threshold changes, effective channel lengths, channel length modulation, and
velocity saturation effects, Idsat does not scale exactly as 1/L. In general, when currents must be
precisely matched (e.g., in sense amplifiers or A/D converters), it is best to use the same width and
length for each device. Current ratios can be produced by tying several identical transistors in parallel.
CMOS TECHNOLOGIES
CMOS provides an inherently low power static circuit technology that has the capability of
providing a lower power-delay product than comparable design-rule nMOS or pMOS technologies. The
four dominant CMOS technologies are:
p-well process
n-well process
twin-tub process
Silicon on insulator (SOI) process
nMOS FABRICATION
• Processing is carried out on a thin wafer cut from a single crystal of silicon of high purity into
which the required p-impurities are introduced as the crystal is grown.
• A layer of silicon dioxide (SiO2), typically 1 μm thick, is grown all over the surface of the wafer
to protect the surface, act as a barrier to dopants during processing, and provide a generally
insulating substrate on to which other layers may be deposited and patterned.
 The surface is now covered with a photo resist which is deposited onto the wafer and spun to
achieve an even distribution of the required thickness.
 The photo resist layer is then exposed to ultra violet light through a mask which defines those
regions into which diffusion is to take place together with transistor channels.
 These areas are subsequently readily etched away together with the underlying silicon dioxide so
that the wafer surface is exposed in the window defined by the mask.
 Remaining photo resist is removed and a thin layer of SiO2 is grown over the entire chip surface
and then polysilicon is deposited on top of this to form the gate structure. The Layer consists of
heavily doped polysilicon deposited by chemical vapor deposition (CVD).
 Photo resist coating and masking allows the polysilicon to be patterned and then the thin oxide is
removed to expose areas into which n-type impurities are to be diffused.
 Thin oxide is grown over all again and is then masked with photo resist and etched to expose
selected areas of the polysilicon gate and the drain and source areas where connections are to be
made.
• The whole chip then has metal (Al) deposited over its surface to a thickness typically of 1 μm.
This metal layer is then masked and etched to form the required interconnection pattern.
CMOS FABRICATION
• The p-well process is widely used in practice, and the n-well process is also popular.
P-well process
• The diffusion must be carried out with special care since the p-well doping concentration and depth
will affect the threshold voltages as well as the breakdown voltages of the n-transistors.
• To achieve low threshold voltages (0.6 to 1.0 V) we need either deep well diffusion or high well
resistivity.
• But deep wells require larger spacing due to lateral diffusion and therefore a larger chip area.
• The p-wells act as substrates for the n-devices within the parent n-substrate and, provided that voltage
polarity restrictions are observed, the two areas are electrically isolated.
Layout Design rules
Layout design rules describe how small features can be and how closely they can be reliably
packed in a particular manufacturing process. Industrial design rules are usually specified in
microns. This makes migrating from one process to a more advanced process or a different foundry's
process difficult because not all rules scale in the same way.
Mead and Conway popularized scalable design rules based on a single parameter, λ, that
characterizes the resolution of the process. λ is generally half of the minimum drawn transistor
channel length. This length is the distance between the source and drain of a transistor and is set by
the minimum width of a polysilicon wire. Designers often describe a process by its feature size.
Feature size refers to minimum transistor length, so λ is half the feature size.
For example, a 180 nm process has a minimum polysilicon
width (and hence transistor length) of 0.18 μm and uses design rules with λ = 0.09 μm. Lambda-based
rules are necessarily conservative because they round up dimensions to an integer multiple of
λ.
A conservative but easy-to-use set of design rules for layouts with two metal layers in an n-well
process is as follows:
 Metal and diffusion have minimum width and spacing of 4 λ.
 Contacts are 2 λ × 2 λ and must be surrounded by 1 λ on the layers above and below.
 Polysilicon uses a width of 2 λ.
 Polysilicon overlaps diffusion by 2λ where a transistor is desired and has a spacing
of 1 λ away where no transistor is desired.
 Polysilicon and contacts have a spacing of 3λ from other polysilicon or contacts.
 N-well surrounds pMOS transistors by 6λ and avoids nMOS transistors by 6λ.
Transistor dimensions are often specified by their Width/Length (W/L) ratio. For example, the
nMOS transistor in Figure 1.39 formed where polysilicon crosses n-diffusion has a W/L of 4/2. In a
0.6 μm process, this corresponds to an actual width of 1.2 μm and a length of 0.6 μm. Such a
minimum-width contacted transistor is often called a unit transistor.
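Converting λ-based dimensions to physical dimensions is a one-line calculation once λ is taken as half the feature size. The sketch below reproduces the 4/2 transistor in a 0.6 μm process and the λ value for a 180 nm process; the function names are illustrative only.

```python
def lambda_um(feature_size_um):
    """Lambda is half the feature size (minimum transistor length)."""
    return feature_size_um / 2


def to_microns(dimension_in_lambda, feature_size_um):
    """Convert a dimension expressed in lambda to microns."""
    return dimension_in_lambda * lambda_um(feature_size_um)


# Unit transistor with W/L = 4/2 (in lambda) in a 0.6 um process
width_um = to_microns(4, 0.6)
length_um = to_microns(2, 0.6)
print(f"0.6 um process: W = {width_um:.2f} um, L = {length_um:.2f} um")

# Lambda for a 180 nm process
print(f"180 nm process: lambda = {lambda_um(0.18):.2f} um")
```

The same layout, expressed in λ, can be retargeted to another process simply by changing the feature size passed in, which is the appeal of scalable rules.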
pMOS transistors are often wider than nMOS transistors because holes move more slowly than
electrons so the transistor has to be wider to deliver the same current. Figure 1.40(a) shows a unit
inverter layout with a unit nMOS transistor and a double-sized pMOS transistor. Figure 1.40(b)
shows a schematic for the inverter annotated with Width/ Length for each transistor. In digital
systems, transistors are typically chosen to have the minimum possible length because short-channel
transistors are faster, smaller, and consume less power. Figure 1.40(c) shows a shorthand we will
often use, specifying multiples of unit width and assuming minimum length.
Gate layouts
The line-of-diffusion layout style consists of four horizontal strips:
metal ground at the bottom of the cell, n-diffusion, p-diffusion, and metal power at the top.
The power and ground lines are often called supply rails. Polysilicon lines run vertically to form
transistor gates. Metal wires within the cell connect the transistors appropriately.
Figure 1.41(a) shows such a layout for an inverter. The input A can be connected from the
top, bottom, or left in polysilicon. The output Y is available at the right side of the cell in metal.
Recall that the p-substrate and n-well must be tied to ground and power, respectively.
Figure 1.41(b) shows the same inverter with well and substrate taps placed under the power
and ground rails, respectively. Figure 1.42 shows a 3-input NAND gate. Notice how the nMOS
transistors are connected in series while the pMOS transistors are connected in parallel. Power and
ground extend 2 λ on each side so if two gates were abutted the contents would be separated by 4 λ,
satisfying design rules. The height of the cell is 36 λ, or 40 λ if the 4 λ space between the cell and
another wire above it is counted. All these examples use transistors of width 4 λ.
UNIT II COMBINATIONAL CIRCUIT DESIGN
DESIGN PRINCIPLE OF STATIC CMOS DESIGN
Digital CMOS circuits are implemented using either static or dynamic design
techniques. In static CMOS, the output is tied to VDD or ground via a low resistance path
(except during switching), and this leads to robust circuit implementations with good noise
immunity. In static CMOS design any function can be realized as a sum of products (SOP) or
a product of sums (POS). If an SOP function pulls the output high, then an SOP-BAR function
will pull the output low. A POS function can pull the output high, while a POS-BAR function
can pull the output low, as shown in the figure.
Important properties of static CMOS design:
At any instant of time, the output of the gate is directly connected to Vss or VDD. All
functions are composed of either AND'ed or OR'ed sub functions. The AND function is
composed of NMOS transistors in series. The OR function is composed of NMOS transistors
in parallel. Contains a pull-up network (PUP) and pull down network (PDN). PUP networks
consist of PMOS transistors. PDN networks consist of NMOS transistors. Each network is
the dual of the other network. The output of the complementary gate is inverted.
Advantages of static CMOS design:
 Robust in construction.
 Good noise immunity.
 Static logic has no minimum clock rate, the clock can be paused indefinitely.
 Low power consumption.
 For low operating frequencies, CMOS static logic is used to obtain a relatively small
die size.
Limitations of static CMOS design:
The main limitation of static circuits is slower-speed as compared to dynamic circuits. The
reasons are
1. Increased gate capacitance due to the presence of both PMOS and NMOS transistors.
2. Output depends on the previous cycle inputs due to charges that may be present at internal
inputs.
3. Multiple switching of the output within a cycle depending on the input switching pattern
MOSFETS as Switches
The gate controls the passage of current between the source and the drain. CMOS uses
positive logic: VDD is logic '1' and Vss is logic '0'. We turn a transistor on or off using the
gate terminal. There are two kinds of CMOS transistors, n-channel transistors and p-channel
transistors. An n-channel transistor requires a logic '1' on the gate to make the switch
conducting (to turn the transistor on). A p-channel transistor requires a logic '0' on the gate
to make the switch conducting (to turn the transistor on). The conventional schematic icon
representation along with the switch characteristics is shown.
Basic CMOS Gates
In this section, the basic gate implementations in static CMOS are
presented.
AND Gate
If two N-switches are placed in series, the composite switch is closed (or ON) only if both
switches are driven by logic '1'. If either switch is at logic '0', the circuit is in the open
(or OFF) state. This yields an 'AND' function. The switch logic of the AND function is shown
in the figure.
OR Gate
If two N-switches are placed in parallel, the composite switch is closed (or ON) if either
switch is driven by logic '1'. This yields an 'OR' function.
Bubble Pushing
CMOS stages are inherently inverting, so AND and OR functions must be built from
NAND and NOR gates. De Morgan's law helps with this conversion:
A NAND gate is equivalent to an OR of inverted inputs. A NOR gate is equivalent to
an AND of inverted inputs. The same relationship applies to gates with more inputs.
Switching between these representations is easy to do on a whiteboard and is often called
bubble pushing.
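The equivalences used in bubble pushing can be verified exhaustively. The sketch below is a plain truth-table check and not a circuit model; the helper names are illustrative.

```python
from itertools import product

def nand(*inputs):
    return int(not all(inputs))

def nor(*inputs):
    return int(not any(inputs))

def or_of_inverted(*inputs):
    # OR gate with a bubble on every input
    return int(any(not x for x in inputs))

def and_of_inverted(*inputs):
    # AND gate with a bubble on every input
    return int(all(not x for x in inputs))

def de_morgan_holds(n_inputs):
    """Check both equivalences for every combination of n_inputs bits."""
    for bits in product([0, 1], repeat=n_inputs):
        if nand(*bits) != or_of_inverted(*bits):
            return False
        if nor(*bits) != and_of_inverted(*bits):
            return False
    return True

for n in (2, 3, 4):
    print(n, de_morgan_holds(n))
```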
Compound Gates:
• Static CMOS also efficiently handles compound gates computing various inverting
combinations of AND/OR functions in a single stage.
• The logical effort of each input is the ratio of the input capacitance of that input to the
input capacitance of an inverter with the same drive.
For the AOI21 gate, this means the logical effort is slightly lower for the OR terminal (C)
than for the two AND terminals (A, B).
The parasitic delay is crudely estimated from the total diffusion capacitance on the output
node by summing the sizes of the transistors attached to the output.
Input Ordering Delay Effect
The logical effort and parasitic delay of different gate inputs are often different. Other
gates, like NANDs and NORs, are nominally symmetric but actually have slightly different
logical effort and parasitic delays for the different inputs.
Figure shows a 2-input NAND gate annotated with diffusion parasitics. Consider the
falling output transition that occurs when one input holds a stable 1 value and the other rises
from 0 to 1. If input B rises last, node x will initially be at VDD - Vt ≈ VDD because it was
pulled up through the nMOS transistor on input A. Both node x and the output must then be
discharged, so the Elmore delay is (R/2)(2C) + R(6C) = 7RC. On the other hand, if input A
rises last, node x will initially be at 0 V because it was discharged through the nMOS
transistor on input B. No charge must be delivered to node x, so the Elmore delay is simply
R(6C) = 6RC.
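The two Elmore delays can be reproduced with a short calculation. The sketch below assumes the RC ladder implied by the expressions above: two series nMOS transistors of resistance R/2 each, 2C on node x and 6C on the output. Resistances are in units of R and capacitances in units of C, so the result is in units of RC.

```python
from fractions import Fraction

def elmore_delay(resistances, capacitances):
    """Elmore delay of an RC ladder.

    resistances[i] is the resistance between node i-1 and node i
    (node -1 is the supply rail); capacitances[i] is the capacitance
    that must be discharged at node i.
    """
    total = Fraction(0)
    path_resistance = Fraction(0)
    for r, c in zip(resistances, capacitances):
        path_resistance += r          # resistance from the rail up to this node
        total += path_resistance * c  # each capacitor sees the upstream resistance
    return total

half = Fraction(1, 2)

# Input B (outer) rises last: node x (2C) and the output (6C) both discharge.
outer_last = elmore_delay([half, half], [2, 6])

# Input A (inner) rises last: node x is already at 0 V, so only 6C discharges.
inner_last = elmore_delay([half, half], [0, 6])

print(outer_last, inner_last)
```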
In general, we define the outer input to be the input closer to the supply rail (e.g., B)
and the inner input to be the input closer to the output (e.g., A). The parasitic delay is smallest
when the inner input switches last because the intermediate nodes have already been
discharged. Therefore, if one signal is known to arrive later than the others, the gate is fastest
when that signal is connected to the inner input.
The inner input has a lower parasitic delay. The logical efforts are lower than
initial estimates might predict because of velocity saturation. Interestingly, the inner input has
a slightly higher logical effort because the intermediate node x tends to rise and cause
negative feedback when the inner input turns ON.
This effect is seldom significant to the designer because the inner input remains faster
over the range of fan-outs used in reasonable circuits. When one input is far less critical than
another, even nominally symmetric gates can be made asymmetric to favor the late input at
the expense of the early one.
For example, consider the path in Figure. Under ordinary conditions, the path acts as a
buffer between A and Y. When reset is asserted, the path forces the output low.
If reset only occurs under exceptional circumstances and can take place slowly, the
circuit should be optimized for input-to-output delay at the expense of reset.
The pulldown resistance is R/4 + R/(4/3) = R, so the gate still offers the same drive
as a unit inverter. However, the capacitance on input A is only 10/3, so the logical effort is
10/9. This is better than 4/3, which is normally associated with a NAND gate. In the limit of
an infinitely large reset transistor and unit-sized nMOS transistor for input A, the logical
effort approaches 1, just like an inverter.
The improvement in logical effort of input A comes at the cost of much higher effort
on the reset input. Note that the pMOS transistor on the reset input is also shrunk. This
reduces its diffusion capacitance and parasitic delay at the expense of slower response to
reset.
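The numbers for the asymmetric gate can be checked as follows. The nMOS widths (4 for reset, 4/3 for input A) are read off the resistance expression above; the pMOS width of 2 on input A is an assumption inferred from the stated input capacitance of 10/3, since the figure is not reproduced here.

```python
from fractions import Fraction

UNIT_INVERTER_CIN = Fraction(3)  # nMOS width 1 + pMOS width 2

def series_resistance(widths):
    # A transistor of width w has resistance R/w; series resistances add.
    # The result is in units of R.
    return sum(Fraction(1) / w for w in widths)

def logical_effort(input_capacitance):
    # Valid when the gate has the same drive as the unit inverter.
    return input_capacitance / UNIT_INVERTER_CIN

reset_nmos = Fraction(4)
a_nmos = Fraction(4, 3)
a_pmos = Fraction(2)  # assumed width, inferred from Cin = 10/3

pulldown_r = series_resistance([reset_nmos, a_nmos])
cin_a = a_nmos + a_pmos
g_a = logical_effort(cin_a)

print(pulldown_r, cin_a, g_a)
```

With both nMOS transistors sized at 2, as in a symmetric NAND2, the same functions give an input capacitance of 4 and a logical effort of 4/3.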
Skewed Gates
In other cases, one input transition is more important than the other. We define HI-skew
gates to favor the rising output transition and LO-skew gates to favor the falling output
transition. This favoring can be done by decreasing the size of the noncritical transistor.
The logical efforts for the rising (up) and falling (down) transitions are called gu and gd,
respectively, and are the ratio of the input capacitance of the skewed gate to the input
capacitance of an unskewed inverter with equal drive for that transition.
Figure (a) shows how a HI-skew inverter is constructed by downsizing the nMOS
transistor. This maintains the same effective resistance for the critical transition while
reducing the input capacitance relative to the unskewed inverter of Figure (b), thus reducing
the logical effort on that critical transition to gu = 2.5/3 = 5/6.
Of course, the improvement comes at the expense of the effort on the
noncritical transition. The logical effort for the falling transition is estimated by comparing
the inverter to a smaller unskewed inverter with equal pulldown current, shown in Figure (c),
giving a logical effort of gd = 2.5/1.5 = 5/3.
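The skewed-inverter efforts follow directly from the transistor widths. The widths used below (pMOS 2 and nMOS 1/2 for the HI-skew inverter, 2 and 1 for the unskewed inverter with equal pullup, 1 and 1/2 for the unskewed inverter with equal pulldown) are assumptions chosen to match the capacitances 2.5, 3 and 1.5 quoted above.

```python
from fractions import Fraction

def inverter_cin(pmos_width, nmos_width):
    # Input capacitance is proportional to the total gate width on the input.
    return Fraction(pmos_width) + Fraction(nmos_width)

# HI-skew inverter: pMOS unchanged, nMOS downsized by the skew factor of two.
hi_skew_cin = inverter_cin(2, Fraction(1, 2))

# Unskewed inverter with the same pullup (rising) drive.
equal_rise_cin = inverter_cin(2, 1)

# Unskewed inverter with the same pulldown (falling) drive.
equal_fall_cin = inverter_cin(1, Fraction(1, 2))

g_u = hi_skew_cin / equal_rise_cin  # effort for the critical rising edge
g_d = hi_skew_cin / equal_fall_cin  # effort for the noncritical falling edge

print(g_u, g_d)
```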
The degree of skewing (e.g., the ratio of effective resistance for the fast transition
relative to the slow transition) impacts the logical efforts and noise margins; a factor of two is
common. Figure catalogs HI-skew and LO-skew gates with a skew factor of two. Skewed
gates are sometimes denoted with an H or an L on their symbol in a schematic.
P/N Ratios
The pMOS transistors in the unskewed gate are enormous in order to provide
equal rise delay. They contribute input capacitance for both transitions, while only helping
the rising delay. By accepting a slower rise delay, the pMOS transistors can be downsized to
reduce input capacitance and average delay significantly.
Reducing the pMOS size from 2 to √2 for the inverter gives the theoretical fastest
average delay, but this delay improvement is only 3%. However, this significantly reduces
the pMOS transistor area.
It also reduces input capacitance, which in turn reduces power consumption.
Unfortunately, it leads to unequal delay between the outputs. Some paths can be slower than
average if they trigger the worst edge of each gate.
Excessively slow rising outputs can also cause hot electron degradation. And
reducing the pMOS size also moves the switching point lower and reduces the inverter's
noise margin. In summary, the P/N ratio of a library of cells should be chosen on the basis of
area, power, and reliability, not average delay.
For NOR gates, reducing the size of the pMOS transistors significantly improves
both delay and area. In most standard cell libraries, the pitch of the cell determines the P/N
ratio that can be achieved in any particular gate. Ratios of 1.5–2 are commonly used for
inverters.
Multiple Threshold Voltages
Some CMOS processes offer two or more threshold voltages. Transistors with lower
threshold voltages produce more ON current, but also leak exponentially more OFF current.
Libraries can provide both high and low threshold versions of gates. The low-threshold
gates can be used sparingly to reduce the delay of critical paths. Skewed gates can use low
threshold devices on only the critical network of transistors.
Delay estimation:
Estimation of the delay of a Boolean function from its functional description is an
important step towards design exploration at the register transfer level (RTL). The problem
is to estimate the delay of certain optimal multi-level implementations of combinational
circuits, given only their functional description. The following delay terms are used:
tpdr: rising propagation delay, from input to rising output crossing VDD/2
tpdf: falling propagation delay, from input to falling output crossing VDD/2
tpd: average propagation delay, tpd = (tpdr + tpdf)/2
tr: rise time, from output crossing 20% to 80% of VDD
tf: fall time, from output crossing 80% to 20% of VDD
tcdr: rising contamination delay, minimum time from input to rising output crossing VDD/2
tcdf: falling contamination delay, minimum time from input to falling output crossing VDD/2
tcd: average contamination delay, tcd = (tcdr + tcdf)/2
Delay is estimated using RC delay models. With C the total capacitance on the output node
and R the effective resistance of the driving transistors, tpd = RC.
Transistors are characterized by finding their effective R.
Transistor sizing:
• Not all gates need to have the same delay.
• Not all inputs to a gate need to have the same delay.
• Adjust transistor sizes to achieve the desired delay.
Logical effort
Logical effort is a gate delay model that takes transistor sizes into account. It allows us
to optimize transistor sizes over combinational networks. It is less accurate for circuits with
reconvergent fanout.
Logical effort gate delay model
• Express delays in a process-independent unit.
• Gate delay is measured in units of the minimum-size inverter delay τ: d = dabs / τ, where
τ = 3RC ≈ 12 ps in a 180 nm process and 40 ps in a 0.6 µm process.
• Gate delay formula: d = f + p.
• Effort delay f is related to the gate's load.
• Parasitic delay p depends on the gate's structure. It represents the delay of the gate driving
no load and is set by internal parasitic capacitance.
Effort delay
• Effort delay has two components: f = gh.
• Electrical effort h is determined by the gate's load: h = Cout/Cin. It is sometimes called fanout.
• Logical effort g is determined by the gate's structure. It measures the relative ability of the
gate to deliver current; g ≡ 1 for an inverter.
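The delay model d = gh + p can be evaluated directly. In the sketch below the parasitic delays (p = 1 for the inverter, p = 2 for a 2-input NAND) and the NAND logical effort of 4/3 are commonly quoted textbook values used here as assumptions; τ = 12 ps is the 180 nm figure given above.

```python
from fractions import Fraction

TAU_180NM_PS = 12  # ps, minimum-size inverter delay unit quoted for 180 nm

def gate_delay(g, h, p):
    """Normalized gate delay d = f + p with effort delay f = g * h."""
    return Fraction(g) * Fraction(h) + Fraction(p)

def absolute_delay_ps(d, tau_ps=TAU_180NM_PS):
    # d = dabs / tau, so dabs = d * tau
    return d * tau_ps

# Inverter driving four identical inverters (h = Cout/Cin = 4).
d_inv = gate_delay(g=1, h=4, p=1)

# 2-input NAND driving the same electrical effort (assumed g = 4/3, p = 2).
d_nand2 = gate_delay(g=Fraction(4, 3), h=4, p=2)

print(d_inv, absolute_delay_ps(d_inv))
print(d_nand2, absolute_delay_ps(d_nand2))
```

Because d is expressed in units of τ, the same normalized delays can be rescaled to the 0.6 µm process by passing tau_ps=40.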
Delay plots:
Computing Logical Effort
Logical effort is the ratio of the input capacitance of a gate to the input capacitance of an
inverter delivering the same output current. It can be measured from delay vs. fanout plots or
estimated by counting transistor widths.
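Estimating logical effort by counting transistor widths can be sketched as follows. The sizing rule is an assumption: pMOS transistors are taken to be twice as wide as nMOS transistors for equal drive, and every series stack is widened so that the gate matches the drive of a unit inverter (nMOS width 1, pMOS width 2).

```python
from fractions import Fraction

UNIT_INVERTER_CIN = Fraction(3)  # nMOS width 1 + pMOS width 2

def nand_logical_effort(n_inputs):
    # n series nMOS, each n wide, to match a unit nMOS; parallel pMOS stay 2 wide.
    nmos_width = Fraction(n_inputs)
    pmos_width = Fraction(2)
    return (nmos_width + pmos_width) / UNIT_INVERTER_CIN

def nor_logical_effort(n_inputs):
    # Parallel nMOS stay 1 wide; n series pMOS, each 2n wide, to match a unit pullup.
    nmos_width = Fraction(1)
    pmos_width = Fraction(2 * n_inputs)
    return (nmos_width + pmos_width) / UNIT_INVERTER_CIN

for n in (1, 2, 3, 4):
    print(n, nand_logical_effort(n), nor_logical_effort(n))
```

With one input both expressions reduce to the inverter value g = 1, and the NOR effort grows faster than the NAND effort because the series transistors are the wider pMOS devices.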
Circuit families and their comparison:
The method of logical effort does not apply to arbitrary transistor networks, but only
to logic gates. A logic gate has one or more inputs and one output, subject to the following
restrictions:
The gate of each transistor is connected to an input, a power supply, or the output; and
Inputs are connected only to transistor gates.
The first condition rules out multiple logic gates masquerading as one, and the second
keeps inputs from being connected to transistor sources or drains, as in transmission gates
without explicit drivers.
Pseudo-NMOS circuits
Static CMOS gates are slowed because an input must drive both NMOS and PMOS
transistors. In any transition, either the pullup or pulldown network is activated, meaning the
input capacitance of the inactive network loads the input. Moreover, PMOS transistors have
poor mobility and must be sized larger to achieve comparable rising and falling delays,
further increasing input capacitance.
Pseudo-NMOS and dynamic gates offer improved speed by removing the PMOS
transistors from loading the input. Pseudo-NMOS gates resemble static gates, but replace the
slow PMOS pullup stack with a single grounded PMOS transistor which acts as a pullup
resistor. The effective pullup resistance should be large enough that the NMOS transistors
can pull the output to near ground, yet low enough to rapidly pull the output high.
Figure shows several pseudo-NMOS gates ratioed such that the pulldown transistors
are about four times as strong as the pullup. The logical effort follows from considering the
output current and input capacitance compared to the reference inverter from Figure. Sized as
shown, the PMOS transistors produce 1/3 of the current of the reference inverter and the
NMOS transistor stacks produce 4/3 of the current of the reference inverter.
For falling transitions, the output current is the pulldown current minus the pullup
current that is fighting the pulldown, 4/3 - 1/3 = 1. For rising transitions, the output current
is just the pullup current, 1/3. The inverter and NOR gate have an input capacitance of 4/3.
Gate type     Logical effort g
              Rising    Falling   Average
2-NAND        8/3       8/9       16/9
3-NAND        4         4/3       8/3
4-NAND        16/3      16/9      32/9
n-NOR         4/3       4/9       8/9
n-mux         8/3       8/9       16/9
The average logical effort of the NOR gate is g = (4/9 + 4/3)/2 = 8/9. This is independent of
the number of inputs, explaining why pseudo-NMOS is a way to build fast wide NOR gates.
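The table entries can be reproduced from the currents and capacitances stated above. The sketch assumes that each nMOS in a stack of k series transistors is k times wider than the 4/3 needed by a single transistor, and that the reference inverter has input capacitance 3 and unit output current.

```python
from fractions import Fraction

REFERENCE_CIN = Fraction(3)         # reference CMOS inverter input capacitance
PULLUP_CURRENT = Fraction(1, 3)     # grounded pMOS load
PULLDOWN_CURRENT = Fraction(4, 3)   # nMOS stack

def pseudo_nmos_effort(series_nmos):
    """Return (rising, falling, average) logical effort for one input."""
    # Each input drives one nMOS whose width scales with the stack height.
    cin = series_nmos * PULLDOWN_CURRENT
    falling_current = PULLDOWN_CURRENT - PULLUP_CURRENT  # pullup fights pulldown
    rising_current = PULLUP_CURRENT
    # Effort = (Cin / I) of the gate divided by (Cin / I) of the reference inverter.
    g_rise = (cin / rising_current) / REFERENCE_CIN
    g_fall = (cin / falling_current) / REFERENCE_CIN
    return g_rise, g_fall, (g_rise + g_fall) / 2

gates = {
    "2-NAND": 2,
    "3-NAND": 3,
    "4-NAND": 4,
    "n-NOR": 1,   # parallel pulldowns: one transistor in every path
    "n-mux": 2,   # data transistor in series with a select transistor
}

for name, stack in gates.items():
    print(name, *pseudo_nmos_effort(stack))
```

The NOR entry does not depend on n because adding inputs adds parallel transistors without changing the width seen by any one input.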
Pass Transistor Logic :
A pass transistor is a MOS transistor whose gate is driven by a control signal, whose
drain (in) is connected to a constant or variable voltage, and whose source is the output (out).
When the control signal is high, the input is passed to the output; when the control signal is
low, the output is floating. Circuits built with this topology are called pass transistor circuits.
Pass transistor logic reduces the number of transistors needed to implement logic
by allowing the primary inputs to drive source and drain terminals as well as gate terminals. In
complementary CMOS logic, primary inputs are allowed to drive only gate terminals.
Figure shows an implementation of the AND function using only MOS pass transistors. In this
gate, if the B input is high, the left NMOS is turned ON and copies the input A to the output F.
When B is low, the right NMOS pass transistor is turned ON and passes a '0' to the output F.
This satisfies the truth table of the AND gate reproduced in the table below for verification.
'OR' gate using pass transistor logic
The truth table of the 'OR' gate is shown in the table below. Figure below shows the
implementation of the OR function using NMOS transistors only. In this gate, if the B input is
high, the right NMOS is turned ON and copies logic 1 to F; this operation is not
affected by the 'A' input. When B is low, the left NMOS is turned ON and the logic value of 'A'
is copied to the output F.
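The two pass transistor gates can be checked against their truth tables with a small behavioral sketch. It treats each NMOS as an ideal switch (the threshold voltage drop is ignored) and assumes that the transistor that turns ON when B is low has its gate driven by the complement of B.

```python
from itertools import product

def pass_and(a, b):
    # B = 1: left NMOS ON, copies A to F.
    # B = 0: right NMOS ON (gate assumed driven by ~B), passes '0' to F.
    return a if b else 0

def pass_or(a, b):
    # B = 1: right NMOS ON, passes logic 1 to F regardless of A.
    # B = 0: left NMOS ON (gate assumed driven by ~B), copies A to F.
    return 1 if b else a

print("A B AND OR")
for a, b in product([0, 1], repeat=2):
    print(a, b, pass_and(a, b), pass_or(a, b))
```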
Advantage:
• Fewer transistors are required to implement a given function.
• Lower capacitance because of the reduced number of transistors.
• There is no path from VDD to GND, so no standby power (static power dissipation) is
consumed.
Drawback:
As discussed, NMOS devices are effective at passing a strong '0' but poor at
pulling a node to VDD. Hence, when the pass transistor pulls a node to logic high, the output
only rises to VDD - VTh. This is the major disadvantage of pass transistors.
Pass transistor logic (PTL) circuits are often superior to standard CMOS circuits in
terms of layout density, circuit delay and power consumption.
Transmission Gate Logic:
The transmission gate logic is used to solve the voltage drop problem of pass
transistor logic. This technique uses the complementary properties of NMOS and PMOS
transistors, i.e., NMOS devices pass a strong '0' but a weak '1', while PMOS transistors
pass a strong '1' but a weak '0'. The transmission gate combines the best of the two devices
by placing an NMOS transistor in parallel with a PMOS transistor as shown in Figure below.
The control signals to the transmission gate, C and ~C, are complementary to each
other. The transmission gate is mainly a bidirectional switch enabled by the gate signal 'C'.
When C = 1 both MOSFETs are ON and the signal passes through the gate, i.e., A = B if C = 1.
Whereas C = 0 puts both MOSFETs in cutoff, creating an open circuit between nodes A and B.
Basic Structure :
The basic structure of the transmission gate is shown in Figure below. It consists of
an NMOS and a PMOS transistor. Here, VG is applied to the gate of the NMOS and (VDD - VG)
is applied to the gate of the PMOS.
The transmission gate works as a voltage-controlled switch. When VG is high, both the NMOS
and the PMOS are conducting, so the switch is closed and a conduction path exists between the
left and right sides. When VG is low, the MOSFETs are in cutoff and the switch is open, so
there is no direct relationship between VA and VB. Figure below shows the
symbol of the transmission gate controlled by switching signals X and X* that are applied to the
gates of the NMOS and PMOS respectively.
The circuit is constructed by connecting a PMOS and an NMOS transistor in parallel, with
their drain and source terminals shorted together. The gate terminals use two complementary
select signals, s and its complement. When s is high, the transmission gate passes the signal at
its input. The main advantage of the transmission gate is that it eliminates the threshold
voltage drop. Transmission gates are used as multiplexing elements (path selectors), latch
elements and analog switches, and act as a voltage-controlled resistor connecting the input
and output.
2 : 1 MUX using transmission gate :
A 2:1 multiplexer is shown in Figure below. This gate selects either input A or B on the basis
of the value of the control signal 'C'. When control signal C is logic high the output is equal
to the input A, and when control signal C is logic low the output is equal to the input B.
A 2:1 multiplexer can be implemented using transmission gates. Figure below shows the
connection diagram of the 2:1 multiplexer using transmission gates.
The 2:1 MUX selects either A or B depending upon the control signal C. This is
equivalent to implementing the Boolean function F = A·C + B·~C. When the control
signal C is high, the upper transmission gate is ON and passes A, so the
output = A.
When the control signal C is low, the upper transmission gate turns OFF and blocks A.
At the same time the lower transmission gate is ON and passes
B, so the output = B.
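The multiplexer behavior can be sketched at the switch level. The model below treats each transmission gate as an ideal switch that either passes its input or presents an open circuit (None); the function names are illustrative.

```python
from itertools import product

def transmission_gate(data, enable):
    # Ideal bidirectional switch: passes data when enabled, otherwise open (None).
    return data if enable else None

def mux2(a, b, c):
    upper = transmission_gate(a, c)          # enabled by C
    lower = transmission_gate(b, 1 - c)      # enabled by ~C
    # Exactly one gate conducts, so the shared output node is never floating
    # and never driven by both inputs at once.
    assert (upper is None) != (lower is None)
    return upper if upper is not None else lower

def mux2_boolean(a, b, c):
    # F = A.C + B.~C
    return (a & c) | (b & (1 - c))

for a, b, c in product([0, 1], repeat=3):
    print(a, b, c, mux2(a, b, c))
```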
DYNAMIC CMOS LOGIC
Ratioed circuits reduce the input capacitance by replacing the pMOS
transistors connected to the inputs with a single resistive pullup. The drawbacks of ratioed
circuits include slow rising transitions, contention on the falling transitions, static power
dissipation, and a nonzero VOL.
Dynamic circuits circumvent these drawbacks by using a clocked pullup transistor
rather than a pMOS that is always ON. Figure compares (a) static CMOS, (b) pseudo-nMOS,
and (c) dynamic inverters. Dynamic circuit operation is divided into two modes, precharge and
evaluation, as shown in Figure.
Dynamic circuits are the fastest commonly used circuit family because they have
lower input capacitance and no contention during switching. They also have zero static power
dissipation. However, they require careful clocking, consume significant dynamic power, and
are sensitive to noise during evaluation.
In Figure, if the input A is 1 during precharge, contention will take place because both
the pMOS and nMOS transistors will be ON.
When the input cannot be guaranteed to be 0 during precharge, an extra clocked evaluation
transistor can be added to the bottom of the nMOS stack to avoid contention as shown in
Figure. The extra transistor is sometimes called a foot.
Figure estimates the falling logical effort of both footed and unfooted dynamic gates.
As usual, the pulldown transistors' widths are chosen to give unit resistance. Precharge
occurs while the gate is idle and often may take place more slowly. Therefore, the precharge
transistor width is chosen for twice unit resistance.
This reduces the capacitive load on the clock and the parasitic capacitance at the
expense of greater rising delays. We see that the logical efforts are very low. Footed gates
have higher logical effort than their unfooted counterparts but are still an improvement over
static logic. In practice, the logical effort of footed gates is better than predicted because
velocity saturation means series nMOS transistors have less resistance than we have
estimated.
The size of the foot can be increased relative to the other nMOS transistors to reduce
logical effort of the other inputs at the expense of greater clock loading. Like pseudo-nMOS
gates, dynamic gates are particularly well suited to wide NOR functions or multiplexers
because the logical effort is independent of the number of inputs.
A fundamental difficulty with dynamic circuits is the monotonicity
requirement. While a dynamic gate is in evaluation, the inputs must be monotonically rising.
That is, the input can start LOW and remain LOW, start LOW and rise HIGH, start HIGH
and remain HIGH, but not start HIGH and fall LOW.
Figure shows waveforms for a footed dynamic inverter in which the input violates
monotonicity. During precharge, the output is pulled HIGH. When the clock rises, the input
is HIGH so the output is discharged LOW through the pulldown network, as you would want
to have happen in an inverter. The input later falls LOW, turning off the pulldown network.
However, the precharge transistor is also OFF, so the output floats and stays LOW rather than
rising as it would in a static inverter.
The output of a dynamic gate begins HIGH and monotonically falls LOW during
evaluation. This monotonically falling output X is not a suitable input to a second dynamic
gate expecting monotonically rising signals.
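The effect of a monotonicity violation can be illustrated with a simple sample-by-sample model of a footed dynamic inverter. This is a behavioral sketch only: the dynamic node is modeled as an ideal storage element with no leakage, and the waveforms are made-up examples.

```python
def footed_dynamic_inverter(clock, inputs):
    """Return the output waveform for matching clock and input samples."""
    output = []
    node = 1
    for phi, a in zip(clock, inputs):
        if phi == 0:
            node = 1   # precharge: clocked pMOS pulls the node HIGH
        elif a == 1:
            node = 0   # evaluate: pulldown path discharges the node
        # otherwise the node floats and holds its previous value
        output.append(node)
    return output

def is_monotonically_rising(samples):
    # LOW-LOW, LOW-HIGH and HIGH-HIGH are allowed; HIGH-LOW is not.
    return all(later >= earlier for earlier, later in zip(samples, samples[1:]))

clock = [0, 0, 1, 1, 1, 1]
bad_input = [1, 1, 1, 1, 0, 0]   # falls during evaluation
good_input = [0, 0, 0, 1, 1, 1]  # rises during evaluation

print(is_monotonically_rising(bad_input[2:]), footed_dynamic_inverter(clock, bad_input))
print(is_monotonically_rising(good_input[2:]), footed_dynamic_inverter(clock, good_input))
```

For the violating input, the output stays LOW in the last two samples even though the input has returned to 0, which is not the value an inverter should produce.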
CMOS Domino Logic
The monotonicity problem can be solved by placing a static CMOS inverter between
dynamic gates, as shown in Figure. This converts the monotonically falling output into a
monotonically rising signal suitable for the next gate, as shown in Figure.
The dynamic static pair together is called a domino gate because precharge
resembles setting up a chain of dominos and evaluation causes the gates to fire like dominos
tipping over, each triggering the next.
A single clock can be used to precharge and evaluate all the logic gates within the
chain. The dynamic output is monotonically falling during evaluation, so the static inverter
output is monotonically rising. Therefore, the static inverter is usually a HI-skew gate to
favor this rising output.
In general, more complex inverting static CMOS gates such as NANDs or NORs can
be used in place of the inverter. This mixture of dynamic and static logic is called compound
domino.
Domino gates are inherently noninverting, while some functions like XOR gates
necessarily require inversion. Three methods of addressing this problem include pushing
inversions into static logic, delaying clocks, and using dual-rail domino logic.
A second approach is to directly cascade dynamic gates without the static CMOS
inverter, delaying the clock to the later gates to ensure the inputs are monotonic during
evaluation.
Domino circuits
Pseudo-NMOS gates eliminate the bulky PMOS transistors loading the inputs, but pay
the price of quiescent power dissipation and contention between the pullup and pulldown
transistors. Dynamic gates offer even better logical effort and lower power consumption by
using a clocked precharge transistor instead of a pullup that is always conducting.
The dynamic gate is precharged HIGH then may evaluate LOW through an NMOS
stack. Unfortunately, if one dynamic inverter directly drives another, a race can corrupt the
result. When the clock rises, both outputs have been precharged HIGH.
The HIGH input to the first gate causes its output to fall, but the second gate's output
also falls in response to its initial HIGH input. The circuit therefore produces an incorrect
result because the second output will never rise during evaluation, as shown in Figure 10.3.
Domino circuits solve this problem by using inverting static gates between dynamic gates so
that the input to each dynamic gate is initially LOW. The falling dynamic output and rising
static output ripple through a chain of gates like a chain of toppling dominos.
In summary, domino logic runs 1.5 to 2 times faster than static CMOS logic because
dynamic gates present a much lower input capacitance for the same output current and have a
lower switching threshold, and because the inverting static gate can be skewed to favor the
critical monotonically rising evaluation edges. Figure shows some domino gates. Each
domino gate consists of a dynamic gate followed by an inverting static gate.
The static gate is often but not always an inverter. Since the dynamic gate's output
falls monotonically during evaluation, the static gate should be skewed high to favor its
monotonically rising output.
A dynamic gate may be designed with or without a clocked evaluation transistor; the
extra transistor slows the gate but eliminates any path between power and ground during
precharge when the inputs are still high.
Dual-Rail Domino Logic:
Dual-rail domino gates encode each signal with a pair of wires. The input and output
signal pairs are denoted with sig_h and sig_l, respectively. Table summarizes the encoding.
The sig_h wire is asserted to indicate that the output of the gate is "high" or 1. The sig_l wire
is asserted to indicate that the output of the gate is "low" or 0.
When the gate is precharged, neither sig_h nor sig_l is asserted. The pair of lines
should never be both asserted simultaneously during correct operation.
Dual-rail domino gates accept both true and complementary inputs and compute both
true and complementary outputs, as shown in Figure. Observe that this is identical to static
CVSL circuits from Figure except that the cross-coupled pMOS transistors are instead
connected to the precharge clock. Therefore, dual-rail domino can be viewed as a dynamic
form of CVSL, sometimes called DCVS.
Figure shows a dual-rail AND/NAND gate and Figure shows a dual-rail XOR/XNOR
gate. The gates are shown with clocked evaluation transistors, but can also be unfooted. Dual-
rail domino is a complete logic family in that it can compute all inverting and non inverting
logic functions.
However, it requires more area, wiring, and power. Dual rail structures also lose the
efficiency of wide dynamic NOR gates because they require complementary tall dynamic
NAND stacks.
Dual-rail domino signals not only the result of a computation but also when
the computation is done. Before computation completes, both rails are precharged. When the
computation completes, one rail will be asserted. A NAND gate can be used for completion
detection, as shown in Figure. This is particularly useful for asynchronous circuits.
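The dual-rail encoding and completion detection can be sketched behaviorally. The example models a dual-rail AND/NAND gate; it assumes that the completion NAND is fed by the two dynamic nodes, which are precharged HIGH and are the complements of sig_h and sig_l. That wiring is an assumption, since the figure is not reproduced here.

```python
PRECHARGED = (0, 0)

def encode(value):
    # (sig_h, sig_l): 1 -> (1, 0), 0 -> (0, 1)
    return (1, 0) if value else (0, 1)

def decode(pair):
    if pair == (0, 0):
        return "precharged"
    if pair == (1, 0):
        return 1
    if pair == (0, 1):
        return 0
    raise ValueError("both rails asserted: illegal in correct operation")

def dual_rail_and(a, b, phi):
    """a and b are (sig_h, sig_l) pairs; phi = 0 precharges, phi = 1 evaluates."""
    if phi == 0:
        return PRECHARGED
    (a_h, a_l), (b_h, b_l) = a, b
    y_h = a_h & b_h   # true rail: series stack computes A AND B
    y_l = a_l | b_l   # complement rail: parallel stack computes ~A OR ~B
    return (y_h, y_l)

def completion(pair):
    sig_h, sig_l = pair
    node_h, node_l = 1 - sig_h, 1 - sig_l   # dynamic nodes (assumed), precharged HIGH
    return int(not (node_h and node_l))     # NAND goes HIGH once either node falls

y = dual_rail_and(encode(1), encode(0), phi=1)
print(decode(y), completion(y), completion(dual_rail_and(encode(1), encode(0), phi=0)))
```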
Keepers
Dynamic circuits also suffer from charge leakage on the dynamic node. If a dynamic
node is precharged high and then left floating, the voltage on the dynamic node will drift over
time due to subthreshold, gate, and junction leakage. The time constants tend to be in the
millisecond to nanosecond range, depending on process and temperature. This problem is
analogous to leakage in dynamic RAMs.
Moreover, dynamic circuits have poor input noise margins. If the input rises above
Vt while the gate is in evaluation, the input transistors will turn on weakly and can incorrectly
discharge the output. Both leakage and noise margin problems can be addressed by adding a
keeper circuit.
Figure shows a conventional keeper on a domino buffer. The keeper is a weak
transistor that holds, or staticizes, the output at the correct level when it would otherwise
float. When the dynamic node X is high, the output Y is low and the keeper is ON to prevent
X from floating. When X falls, the keeper initially opposes the transition so it must be much
weaker than the pulldown network. Eventually Y rises, turning the keeper OFF and avoiding
static power dissipation.
The keeper must be strong (i.e., wide) enough to compensate for any leakage current
drawn when the output is floating and the pulldown stack is OFF. Strong keepers also
improve the noise margin because when the inputs are slightly above Vt the keeper can
supply enough current to hold the output high.
NP and Zipper Domino
Another variation on domino is shown in Figure. The HI-skew inverting static gates
are replaced with predischarged dynamic gates using pMOS logic.
For example, a footed dynamic p-logic NAND gate is shown in Figure. When Φ is 0,
the first and third stages precharge high while the second stage predischarges low. When Φ
rises, all the stages evaluate. Domino connections are possible, as shown in Figure. The
design style is called NP Domino or NORA Domino (NO RAce).
NORA has two major drawbacks. The logical effort of footed p-logic gates is
generally worse than that of HI-skew gates (e.g., 2 vs. 3/2 for NOR2 and 4/3 vs. 1 for
NAND2). Secondly, NORA is extremely susceptible to noise.
In an ordinary dynamic gate, the input has a low noise margin (about Vt ), but is
strongly driven by a static CMOS gate.
The floating dynamic output is more prone to noise from coupling and charge sharing,
but drives another static CMOS gate with a larger noise margin. In NORA, however, the
sensitive dynamic inputs are driven by noise prone dynamic outputs. Given these drawbacks
and the extra clock phase required, there is little reason to use NORA.
Zipper domino is a closely related technique that leaves the precharge transistors slightly ON
during evaluation by using precharge clocks that swing between 0 and VDD – |Vtp| for the
pMOS precharge and Vtn and VDD for the nMOS precharge. This plays much the same role
as a keeper.
THE STATIC AND DYNAMIC POWER DISSIPATION IN CMOS CIRCUITS
Static CMOS gates are very power-efficient because they dissipate nearly zero power
while idle. For much of the history of CMOS design, power was a secondary consideration
behind speed and area for many chips. As transistor counts and clock frequencies have
increased, power consumption has skyrocketed and now is a primary design constraint.
The instantaneous power P(t) drawn from the power supply is proportional to the
supply current iDD(t) and the supply voltage VDD:
$P(t) = i_{DD}(t)\,V_{DD}$
The energy consumed over some time interval T is the integral of the instantaneous power:
$E = \int_0^T i_{DD}(t)\,V_{DD}\,dt$
The average power over this interval is
$P_{avg} = \frac{E}{T} = \frac{1}{T}\int_0^T i_{DD}(t)\,V_{DD}\,dt$
Power dissipation in CMOS circuits comes from two components:
Static dissipation due to
• subthreshold conduction through OFF transistors
• tunneling current through gate oxide
• leakage through reverse-biased diodes
• contention current in ratioed circuits
Dynamic dissipation due to
• charging and discharging of load capacitances
• "short circuit" current while both pMOS and nMOS networks are partially ON
Ptotal = Pstatic + Pdynamic
Static Dissipation
Considering the static CMOS inverter shown in Figure, if the input = '0,' the
associated nMOS transistor is OFF and the pMOS transistor is ON. The output voltage is
VDD or logic 1.'
When the input = 1 the associated nMOS transistor is ON and the pMOS transistor is
OFF. The output voltage is 0 volts (GND). Note that one of the transistors is always OFF
when the gate is in either of these logic states.
Ideally, no current flows through the OFF transistor so the power dissipation is zero
when the circuit is quiescent, i.e., when no transistors are switching. Zero quiescent power
dissipation is a principal advantage of CMOS over competing transistor technologies.
However, secondary effects including subthreshold conduction, tunneling, and
leakage lead to small amounts of static current flowing through the OFF transistor. Assuming
the leakage current is constant so instantaneous and average power are the same, the static
power dissipation is the product of total leakage current and the supply voltage.
Pstatic = Istatic VDD
OFF transistors still conduct a small amount of subthreshold current. As subthreshold current
is exponentially dependent on threshold voltage, it is increasing dramatically as threshold
voltages have scaled down. There is also some small static dissipation due to reverse biased
diode leakage between diffusion regions, wells, and the substrate. In modern processes, diode
leakage is generally much smaller than the subthreshold or gate leakage and may be
neglected.
Dynamic Dissipation
Over any given interval of time T, the load will be charged and discharged Tfsw times.
Current flows from VDD to the load to charge it. Current then flows from the load to GND
during discharge. In one complete charge/discharge cycle, a total charge of Q = CVDD is
thus transferred from VDD to GND. The average dynamic power dissipation is
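$P_{dynamic} = \frac{1}{T} \int_0^T i_{DD}(t) V_{DD} \, dt = \frac{V_{DD}}{T} \int_0^T i_{DD}(t) \, dt$
$P_{dynamic} = \frac{V_{DD}}{T} [T f_{sw} C V_{DD}] = C V_{DD}^2 f_{sw}$
Because most gates do not switch every clock cycle, it is often more convenient to express
switching frequency fsw as an activity factor α times the clock frequency f.
Now the dynamic power dissipation may be rewritten as:
$P_{dynamic} = \alpha C V_{DD}^2 f$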
A clock has an activity factor of α=1, because it rises and falls every cycle. Most data
has a maximum activity factor of 0.5 because it transitions only once each cycle.
 Static CMOS logic has been empirically determined to have activity factors closer to
0.1 because some gates maintain one output state more often than another.
 Because the input rise/fall time is greater than zero, both nMOS and pMOS
transistors will be ON for a short period of time while the input is between Vtn and VDD - |Vtp|.
This results in an additional "short circuit" current pulse from VDD to GND and typically
increases power dissipation by about 10%.
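The switching-power relation, activity factor times C times VDD squared times f, and the roughly 10% short-circuit overhead can be put into numbers. This is a minimal sketch; the load capacitance, supply voltage and clock frequency are assumed illustrative values, not figures from these notes.

```python
def dynamic_power(c_load, v_dd, f_clk, alpha=0.1, short_circuit_fraction=0.10):
    """Return (switching power, switching + short-circuit power) in watts.

    switching = alpha * C * VDD^2 * f
    The short-circuit contribution is modelled as a fixed fraction (about 10%)
    of the switching power, following the rule of thumb quoted in the text.
    """
    switching = alpha * c_load * v_dd ** 2 * f_clk
    return switching, switching * (1.0 + short_circuit_fraction)


# Assumed example: 1 nF of total switched capacitance, 1 V supply, 1 GHz clock.
SWITCHING, TOTAL = dynamic_power(c_load=1e-9, v_dd=1.0, f_clk=1e9, alpha=0.1)
HALF_VDD, _ = dynamic_power(c_load=1e-9, v_dd=0.5, f_clk=1e9, alpha=0.1)
print(SWITCHING, TOTAL, HALF_VDD)
```

Halving the supply voltage cuts the switching power to one quarter, which is why supply reduction is the most effective of the dynamic-power measures.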
Methods to reduce dynamic power dissipation
1. Reduce the product of capacitance and its switching frequency.
2. Eliminate logic switching that is not necessary for computation.
3. Reduce the activity factor.
4. Reduce the supply voltage.
Methods to reduce static power dissipation
1. By using multiple threshold voltages: low-Vt transistors on critical circuit paths
for performance, and high-Vt transistors on other paths to reduce leakage.
2. By using two operating modes, active and standby, for each function block.
3. By adjusting the body bias, i.e., applying FBB (Forward Body Bias) in active
mode to increase performance and RBB (Reverse Body Bias) in standby mode
to reduce leakage.
4. By using sleep transistors to isolate the supply from the block to achieve
significant leakage power savings.
UNIT III: SEQUENTIAL LOGIC CIRCUITS
Static & Dynamic Latches and Registers, Pipelining
 In sequential logic circuits, the output not only depends upon the current values of
the inputs, but also upon preceding input values. In other words, a sequential circuit
remembers some of the past history of the system, that is, it has memory.
 Figure shows a block diagram of a generic finite state machine (FSM) that consists
of combinational logic and registers, which hold the system state. The system
depicted here belongs to the class of synchronous sequential systems, in which all
registers are under control of a single global clock. The outputs of the FSM are a
function of the current Inputs and the Current State. The Next State is determined
based on the Current State and the current Inputs and is fed to the inputs of
registers.
 On the rising edge of the clock, the Next State bits are copied to the outputs of the
registers (after some propagation delay), and a new cycle begins. The register then
ignores changes in the input signals until the next rising edge. In general, registers
can be positive edge-triggered (where the input data is copied on the positive edge
of the clock) or negative edge-triggered (where the input data is copied on the
negative edge, as is indicated by a small circle at the clock input).
Block diagram of a finite state machine using positive edge-triggered registers.
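The behaviour of the synchronous FSM can be sketched in software: the outputs are a combinational function of the current state and inputs, and the state register is updated only on a rising clock edge. The class below and the "11" sequence detector used to exercise it are hypothetical illustrations, not circuits taken from these notes.

```python
class SynchronousFSM:
    """Behavioural model of combinational logic plus a positive edge-triggered register."""

    def __init__(self, next_state_logic, output_logic, initial_state):
        self.next_state_logic = next_state_logic
        self.output_logic = output_logic
        self.state = initial_state      # value currently held at the register outputs

    def outputs(self, inputs):
        # Outputs depend on the current inputs and the current state.
        return self.output_logic(self.state, inputs)

    def rising_edge(self, inputs):
        # The next state is copied into the register on the rising clock edge.
        self.state = self.next_state_logic(self.state, inputs)


def detect_11_next_state(state, bit):
    return bit                          # remember the most recent input bit


def detect_11_output(state, bit):
    return state & bit                  # 1 when the previous and current bits are both 1


def run(fsm, input_bits):
    observed = []
    for bit in input_bits:
        observed.append(fsm.outputs(bit))   # sample the output before the clock edge
        fsm.rising_edge(bit)
    return observed


FSM_OUTPUTS = run(SynchronousFSM(detect_11_next_state, detect_11_output, 0),
                  [1, 1, 0, 1, 1, 1])
print(FSM_OUTPUTS)
```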
Timing Metrics for Sequential Circuits
There are three important timing parameters associated with a register as illustrated in
Figure.
1. The set-up time (tsu) is the time that the data inputs (D input) must be valid before
the clock transition (this is, the 0 to 1 transition for a positive edge-triggered
register).
2. The hold time (thold) is the time the data input must remain valid after the clock
edge.
3. Assuming that the set-up and hold times are met, the data at the D input is copied to
the Q output after a worst-case propagation delay (with reference to the clock edge)
denoted by tc-q.
Given the timing information for the registers and the combinational logic, some
system-level timing constraints can be derived. Assume that the worst-case propagation
delay of the logic equals tplogic, while its minimum delay (also called the contamination
delay) is tcdlogic. The minimum clock period T required for proper operation of the
sequential circuit is given by
$T \geq t_{c-q} + t_{plogic} + t_{su}$
The hold time of the register imposes an extra constraint for proper operation,
$t_{cdregister} + t_{cdlogic} \geq t_{hold}$
where tcdregister is the minimum propagation delay (or contamination delay) of the register.
It is important to minimize the values of the timing parameters associated with the register, as
these directly affect the rate at which a sequential circuit can be clocked. In fact, modern
high-performance systems are characterized by a very-low logic depth, and the register
propagation delay and set-up times account for a significant portion of the clock period.
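The two register timing constraints can be checked with a few lines of arithmetic. The sketch below assumes the usual form of the constraints (clock period at least tc-q + tplogic + tsu, and register plus logic contamination delay at least thold); all delay values are hypothetical and given in picoseconds.

```python
def min_clock_period(t_cq, t_plogic, t_su):
    """Smallest clock period that satisfies the set-up constraint."""
    return t_cq + t_plogic + t_su


def hold_satisfied(t_cd_register, t_cd_logic, t_hold):
    """True when the fastest path cannot corrupt data before the hold time ends."""
    return t_cd_register + t_cd_logic >= t_hold


# Assumed example values in picoseconds.
T_MIN = min_clock_period(t_cq=100, t_plogic=800, t_su=50)
HOLD_OK = hold_satisfied(t_cd_register=60, t_cd_logic=20, t_hold=50)
HOLD_FAIL = hold_satisfied(t_cd_register=20, t_cd_logic=10, t_hold=50)
print(T_MIN, HOLD_OK, HOLD_FAIL)
```

Note that the hold check does not involve the clock period: a hold violation cannot be fixed by slowing the clock, only by adding delay to the fast path.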
Classification of Memory Elements
Foreground versus Background Memory
Memory that is embedded into logic is foreground memory (internal memory), and is most
often organized as individual registers or register banks. Large amounts of centralized
memory core are referred to as background memory (external memory).
Static versus Dynamic Memory
 Static memories preserve the state as long as the power is turned on.
 They are built using positive feedback or regeneration, where the circuit topology consists of
intentional connections between the output and the input of a combinational circuit.
 Static memories are most useful when the register won't be updated for extended
periods of time, e.g. configuration data loaded at power-up time.
 This condition also holds for most processors that use conditional clocking (i.e.,
gated clocks) where the clock is turned off for unused modules. In that case, there
are no guarantees on how frequently the registers will be clocked, and static
memories are needed to preserve the state information.
 Memories based on positive feedback fall under the class of elements called
multivibrator circuits. The bistable element is its most popular representative, but
other elements such as monostable and astable circuits are also frequently used.
 Dynamic memories store state for a short period of time, on the order of
milliseconds. They are based on the principle of temporary charge storage on
parasitic capacitors associated with MOS devices. The capacitors have to be refreshed
periodically to compensate for charge leakage.
 Dynamic memories tend to be simpler, resulting in significantly higher performance
and lower power dissipation. They are most useful in datapath circuits that require
high performance levels and are periodically clocked.
Latches versus Registers
A latch is an essential component in the construction of an edge-triggered register. It is a
level-sensitive circuit that passes the D input to the Q output when the clock signal is high.
The latch is then said to be in transparent mode. When the clock is low, the input data sampled
on the falling edge of the clock is held stable at the output for the entire phase, and the latch
is in hold mode. The inputs must be stable for a short period around the falling edge of the
clock to meet set-up and hold requirements. A latch operating under the above conditions is
a positive latch. Similarly, a negative latch passes the D input to the Q output when the
clock signal is low.
Timing of positive and negative latches
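The difference between positive and negative latches can be illustrated with a small behavioural model. This is a sketch with ideal, zero-delay latches; the sampled clock and data sequences are made-up test vectors.

```python
def latch_trace(clk, d, positive=True, q_init=0):
    """Return the Q output of a level-sensitive latch for sampled CLK and D sequences.

    A positive latch is transparent while CLK = 1 and holds while CLK = 0;
    a negative latch does the opposite.
    """
    q = q_init
    trace = []
    for c, data in zip(clk, d):
        transparent = (c == 1) if positive else (c == 0)
        if transparent:
            q = data                # transparent mode: Q follows D
        trace.append(q)             # hold mode: Q keeps the last sampled value
    return trace


CLK = [0, 0, 1, 1, 0, 0, 1, 1]
D = [1, 0, 1, 0, 1, 1, 0, 1]
Q_POSITIVE = latch_trace(CLK, D, positive=True)
Q_NEGATIVE = latch_trace(CLK, D, positive=False)
print(Q_POSITIVE)
print(Q_NEGATIVE)
```

In the positive-latch trace, Q follows every change of D while CLK is high, which is exactly the transparency that an edge-triggered register avoids.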
Static Latches and Registers
The Bistability Principle
Static memories use positive feedback to create a bistable circuit — a circuit having two
stable states that represent 0 and 1. The basic idea is shown in Figure a, which shows two
inverters connected in cascade along with a voltage-transfer characteristic typical of such a
circuit. Assume now that the output of the second inverter Vo2 is connected to the input of
the first Vi1, as shown by the dotted lines in Figure a.
The resulting circuit has only three possible operation points (A, B, and C). Under the
condition that the gain of the inverter in the transient region is larger than 1, only A and B
are stable operation points, and C is a metastable operation point. Suppose that the cross-
coupled inverter pair is biased at point C. A small deviation from this bias point, possibly
caused by noise, is amplified and regenerated around the circuit loop. This is a
consequence of the gain around the loop being larger than 1.
On the other hand, A and B are stable operation points. In these points, the loop gain is
much smaller than unity. Hence the cross-coupling of two inverters results in a
bistable circuit, which serves as a memory, storing either a 1 or a 0 (corresponding to
positions A and B). In order to change the stored value, we must be able to bring the circuit
from state A to B and vice-versa. This is generally done by applying a trigger pulse at Vi1
or Vi2. The width of the trigger pulse need be only a little larger than the total propagation
delay around the circuit loop, which is twice the average propagation delay of the
inverters.
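The regeneration argument can be reproduced numerically. The sketch below assumes an idealized tanh-shaped inverter transfer characteristic with a gain of 5 at the switching threshold and VDD = 1 V (both are modelling assumptions, not values from these notes), and repeatedly passes a voltage around the two-inverter loop.

```python
import math

V_DD = 1.0


def inverter(v_in, gain=5.0):
    """Idealized inverter VTC: smooth, symmetric, slope -gain at V_DD / 2."""
    half = V_DD / 2
    return half * (1.0 - math.tanh(gain * (v_in - half) / half))


def settle(v_start, rounds=50):
    """Pass the voltage around the cross-coupled loop `rounds` times."""
    v = v_start
    for _ in range(rounds):
        v = inverter(inverter(v))
    return v


ABOVE = settle(0.5 + 1e-6)   # tiny disturbance above the metastable point
BELOW = settle(0.5 - 1e-6)   # tiny disturbance below the metastable point
EXACT = settle(0.5)          # exactly at the metastable point
print(ABOVE, BELOW, EXACT)
```

A disturbance of one microvolt is enough to drive the loop to one of the two stable states, while a start exactly at the metastable point stays there only in this noise-free model.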
SR Flip-Flops
The SR (set-reset) flip-flop circuit is similar to the cross-coupled inverter pair with NOR
gates replacing the inverters. The second input of each NOR gate is connected to a trigger
input (S or R), which makes it possible to force the outputs Q and Q' to a given state. These
outputs are complementary (except for the SR = 11 state). When both S and R are 0, the
flip-flop is in a quiescent state and both outputs retain their value. If a positive (or 1) pulse
is applied to the S input, the Q output is forced into the 1 state (with Q' going to 0). Vice
versa, a 1 pulse on R resets the flip-flop and the Q output goes to 0.
When both S and R are high, both Q and Q' are forced to zero. Since the outputs are then no
longer complementary, this input combination is forbidden. An additional problem with this
condition is that when the input triggers return to their zero levels, the resulting state of
the latch is unpredictable and depends on whichever input is last to go low.
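The NOR-based SR behaviour, including the forbidden state, can be explored with a gate-level sketch. The model below assumes two ideal NOR gates with equal delay that are updated simultaneously; it returns None when the outputs never settle, which is how this simplified model exposes the unpredictable release from S = R = 1.

```python
def nor(a, b):
    return int(not (a or b))


def sr_latch(s, r, q, q_bar, max_iterations=10):
    """Iterate the cross-coupled NOR pair until the outputs stop changing.

    Returns the settled (Q, Q') pair, or None if no stable state is reached.
    """
    for _ in range(max_iterations):
        new_q, new_q_bar = nor(r, q_bar), nor(s, q)
        if (new_q, new_q_bar) == (q, q_bar):
            return q, q_bar
        q, q_bar = new_q, new_q_bar
    return None


SET = sr_latch(s=1, r=0, q=0, q_bar=1)         # set pulse
HOLD = sr_latch(s=0, r=0, q=1, q_bar=0)        # quiescent state keeps the value
RESET = sr_latch(s=0, r=1, q=1, q_bar=0)       # reset pulse
FORBIDDEN = sr_latch(s=1, r=1, q=1, q_bar=0)   # both outputs forced low
RELEASE = sr_latch(s=0, r=0, q=0, q_bar=0)     # S and R released at the same instant
print(SET, HOLD, RESET, FORBIDDEN, RELEASE)
```

In a real circuit the release case resolves to one state or the other depending on which input falls last and on device mismatch; the oscillation seen here is an artifact of the equal-delay assumption.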
CMOS clocked SR flip-flop
One possible realization of a clocked SR flip-flop (a level-sensitive positive latch) is
shown in Figure. It consists of a cross-coupled inverter pair, plus 4 extra transistors to drive
the flip-flop from one state to another and to provide clocked operation.
Multiplexer-Based Latches
Advantage: the sizing of devices only affects performance and is not critical to the
functionality. For a negative latch, when the clock signal is low, the input 0 of the
multiplexer is selected, and the D input is passed to the output. When the clock signal is
high, the input 1 of the multiplexer, which connects to the output of the latch, is selected.
The feedback holds the output stable while the clock signal is high.
A transistor level implementation of a positive latch based on multiplexers is shown in
Figure.
 When CLK is high, the bottom transmission gate is on and the latch is transparent -
that is, the D input is copied to the Q output.
 The feedback does not have to be overridden to write the memory and hence sizing of
transistors is not critical for realizing correct functionality. The number of transistors
that the clock touches is important since it has an activity factor of 1.
 This implementation is not efficient by this metric, as it presents a load of 4 transistors
to the CLK signal.
The clock load can be reduced to 2 transistors by using NMOS-only pass transistors, as shown
in Figure.
Advantage:
 Reduced clock load of only two NMOS devices.
 Simple circuit.
Disadvantage:
It passes a degraded high voltage of VDD - VTn to the input of the first inverter.
This impacts both noise margin and the switching performance, especially in the case of
low values of VDD and high values of VTn. It also causes static power dissipation in first
inverter. Since the maximum input-voltage to the inverter equals VDD-VTn, the PMOS
device of the inverter is never turned off, resulting in a static current flow.
Master-Slave Edge-Triggered Register
 The register consists of a negative latch (master stage) cascaded with a positive
latch (slave stage).
 On the low phase of the clock, the master stage is transparent, and the D input is passed
to the master stage output, QM. During this period, the slave stage is in the hold mode,
keeping its previous value using feedback.
 On the rising edge of the clock, the master stage stops sampling the input, and the slave
stage starts sampling. During the high phase of the clock, the slave stage samples the
output of the master stage (QM), while the master stage remains in a hold mode. Since
QM is constant during the high phase of the clock, the output Q makes only one
transition per cycle.
 The value of Q is the value of D right before the rising edge of the clock, achieving the
positive edge-triggered effect. A negative edge-triggered register can be constructed
using the same principle by simply switching the order of the positive and negative
latch (that is, placing the positive latch first).
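The master-slave principle can be sketched behaviourally: a negative latch (master) feeds a positive latch (slave), so Q only takes the value D had just before the rising edge. The model assumes ideal latches with no delay and non-overlapping clock phases; the test vectors are made up.

```python
def master_slave_register(clk, d, qm_init=0, q_init=0):
    """Positive edge-triggered register built from a negative and a positive latch."""
    qm, q = qm_init, q_init
    trace = []
    for c, data in zip(clk, d):
        if c == 0:
            qm = data       # master transparent, slave holds Q
        else:
            q = qm          # master holds QM, slave transparent
        trace.append(q)
    return trace


CLK = [0, 0, 1, 1, 0, 0, 1, 1]
D = [0, 1, 0, 0, 1, 0, 1, 1]
Q_TRACE = master_slave_register(CLK, D)
print(Q_TRACE)
```

At the first rising edge D was 1 just beforehand, so Q becomes 1 and stays there even though D drops during the high phase; at the second rising edge D was 0 just beforehand, so Q becomes 0 even though D returns to 1 while the clock is high.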
A complete transistor-level implementation of the master-slave positive edge-triggered
register is shown in Figure below.
Drawback of the transmission gate register: the high capacitive load presented to the clock
signal. The clock load per register is important, since it directly impacts the power
dissipation of the clock network. Each register has a clock load of 8 transistors. One
approach to reduce the clock load at the cost of robustness is to make the circuit ratioed.
Figure below shows that the feedback transmission gate can be eliminated by directly cross
coupling the inverters.
Another problem with this scheme is reverse conduction, that is, the second stage can
affect the state of the first latch. When the slave stage is on (Figure above), it is possible for
the combination of T2 and I4 to influence the data stored in I1-I2 latch. As long as I4 is a
weak device, this is fortunately not a major problem.
Non-ideal clock signals
Variations can exist in the wires used to route the two clock signals, or the load
capacitances can vary based on data stored in the connecting latches. This effect, known as
clock skew, is a major problem and causes the two clock signals to overlap, as is shown in
Figure 7.20b. Clock overlap can cause two types of failures, as illustrated for the
NMOS-only negative master-slave register.
 When the clock goes high, the slave stage should stop sampling the master stage
output and go into a hold mode. However, since CLK and CLK bar are both high for
a short period of time (the overlap period), both sampling pass transistors conduct
and there is a direct path from the D input to the Q output. As a result, data at the
output can change on the rising edge of the clock. This is a race condition in which
the value of the output Q is a function of whether the input D arrives at node X
before or after the falling edge of CLK. If node X is sampled in the metastable state,
the output will switch to a value determined by noise in the system.
 The primary advantage of the multiplexer-based register is that the feedback loop is
open during the sampling period, and therefore sizing of devices is not critical to
functionality. However, if there is clock overlap between CLK bar and CLK, node A
can be driven by both D and B, resulting in an undefined state.
Those problems can be avoided by using two non-overlapping clocks PHI1 and PHI2
instead, and by keeping the non-overlap time tnon_overlap between the clocks large
enough such that no overlap occurs even in the presence of clock-routing delays.
Dynamic Latches and Registers
This class of circuits is based on temporary storage of charge on parasitic capacitors. Charge
stored on a capacitor can be used to represent a logic signal. The absence of charge denotes
a 0, while its presence stands for a stored 1. Because the stored charge leaks away over time,
a periodic refresh of its value is necessary. Hence the name dynamic storage.
Dynamic Transmission-Gate Edge-triggered Registers:
A fully dynamic positive edge-triggered register based on the master-slave concept is
shown in Figure below.
 When CLK = 0, the input data is sampled on storage node 1, which has an equivalent
capacitance of C1 consisting of the gate capacitance of I1, the junction capacitance
of T1, and the overlap gate capacitance of T1.
 During this period, the slave stage is in a hold mode, with node 2 in a high-
impedance (floating) state.
 On the rising edge of clock, the transmission gate T2 turns on, and the value sampled
on node 1 right before the rising edge propagates to the output Q
 Node 2 now stores the inverted version of node 1.
This implementation is very efficient, as it requires only 8 transistors. The sampling switches
can be implemented using NMOS-only pass transistors (6-transistor implementation).
The set-up time of this circuit is simply the delay of the transmission gate, and corresponds
to the time it takes node 1 to sample the D input. The hold time is approximately zero, since
the transmission gate is turned off on the clock edge and further input changes are ignored.
The propagation delay (tc-q) is equal to two inverter delays plus the delay of the
transmission gate T2.
Race Condition and Preventive Measures
Clock overlap is an important concern for this dynamic register. Consider the clock
waveforms shown in Figure below. During the 0-0 overlap period, the PMOS of T1 and
the PMOS of T2 are simultaneously on, creating a direct path for data to flow from the D
input of the register to the Q output. As a result, data at the output can change on the
falling edge of the clock if the overlap period is large, which is undesired for a positive
edge-triggered register. This is known as a race condition, in which the value of the output
Q is a function of whether the input D arrives at node X before or after the rising edge of
CLK. The same is true for the 1-1 overlap region, where an input-output path exists through
the NMOS of T1 and the NMOS of T2. The latter case is taken care of by enforcing a hold
time constraint. That is, the data must be stable during the high-high overlap period. The
former situation (0-0 overlap) can be addressed by making sure that there is enough delay
between the D input and node 2, ensuring that new data sampled by the master stage does
not propagate through to the slave stage. Generally the built-in single inverter delay should
be sufficient, and the overlap period constraint is given as:
$t_{overlap0-0} < t_{T1} + t_{I1} + t_{T2}$
Similarly, the constraint for the 1-1 overlap is given as:
$t_{hold} > t_{overlap1-1}$
Impact of overlapping clocks.
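The two overlap conditions can be checked numerically. The sketch below assumes the commonly quoted form of the constraints for this register: the 0-0 overlap must be shorter than the delay through T1, I1 and T2, and the hold time must exceed the 1-1 overlap. All delay values are hypothetical and given in picoseconds.

```python
def overlap_constraints_met(t_overlap_00, t_overlap_11, t_t1, t_i1, t_t2, t_hold):
    """Return (no race during 0-0 overlap, no race during 1-1 overlap)."""
    no_race_00 = t_overlap_00 < t_t1 + t_i1 + t_t2   # data cannot reach node 2 in time
    no_race_11 = t_hold > t_overlap_11               # D is held stable through the overlap
    return no_race_00, no_race_11


# Assumed example values in picoseconds.
SAFE = overlap_constraints_met(t_overlap_00=40, t_overlap_11=30,
                               t_t1=20, t_i1=25, t_t2=20, t_hold=35)
RACY = overlap_constraints_met(t_overlap_00=80, t_overlap_11=30,
                               t_t1=20, t_i1=25, t_t2=20, t_hold=35)
print(SAFE, RACY)
```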
C2MOS: A Clock-Skew Insensitive Approach (Method to prevent race condition)
Figure below shows an ingenious positive edge-triggered register, based on a master-slave
concept insensitive to clock overlap. This circuit is called the C2MOS (Clocked CMOS)
register, and operates in two phases.
1. CLK = 0 (CLK bar = 1): The first tri-state driver is turned on, and the master stage
acts as an inverter sampling the inverted version of D on the internal node X. The
master stage is in the evaluation mode. Meanwhile, the slave section is in a
high-impedance mode, or in a hold mode. Both transistors M7 and M8 are off, decoupling
the output from the input. The output Q retains its previous value stored on the
output capacitor CL2.
2. The roles are reversed when CLK = 1: The master stage section is in hold mode
(M3-M4 off), while the second section evaluates (M7-M8 on). The value stored on
CL1 propagates to the output node through the slave stage, which acts as an inverter.
In the (0-0) overlap case, both PMOS devices are on during this period. New data is
sampled on node X through the series PMOS devices M2-M4, and node X can make a 0-to-1
transition during the overlap period. However, this data cannot propagate to the output
since the NMOS device M7 is turned off. At the end of the overlap period, CLK bar = 1 and
both M7 and M8 turn off, putting the slave stage in the hold mode.
In the (1-1) overlap case, both NMOS devices M3 and M7 are turned on. If the D input
changes during the overlap period, node X can make a 1-to-0 transition, but this transition
cannot propagate to the output. However, as soon as the overlap period is over, the PMOS M8
is turned on and the 0 propagates to the output. This effect is not desirable.
The problem is fixed by imposing a hold time constraint on the input data, D, or, in other
words, the data D should be stable during the overlap period.
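The overlap behaviour of the C2MOS register can be traced with a switch-level sketch. The model below treats each stage as a clocked inverter whose PMOS stack conducts only when its clock input and its data input are both 0, and whose NMOS stack conducts only when both are 1; otherwise the node floats and keeps its value. The clock-to-transistor assignment (master PMOS on CLK, master NMOS on CLK bar, slave PMOS on CLK bar, slave NMOS on CLK) is inferred from the description above, and delays are ignored.

```python
def clocked_inverter(value_in, pmos_clock, nmos_clock, held):
    """One C2MOS stage: drive the inverted input when a stack conducts, else hold."""
    if value_in == 0 and pmos_clock == 0:
        return 1            # pull-up path conducts
    if value_in == 1 and nmos_clock == 1:
        return 0            # pull-down path conducts
    return held             # high-impedance: node keeps its stored charge


def c2mos_step(d, clk, clk_bar, x, q):
    """Evaluate master (D -> X) and slave (X -> Q) for one clock condition."""
    new_x = clocked_inverter(d, pmos_clock=clk, nmos_clock=clk_bar, held=x)
    new_q = clocked_inverter(new_x, pmos_clock=clk_bar, nmos_clock=clk, held=q)
    return new_x, new_q


# (0-0) overlap: X rises from 0 to 1, but Q keeps its value.
OVERLAP_00 = c2mos_step(d=0, clk=0, clk_bar=0, x=0, q=1)

# (1-1) overlap: X falls from 1 to 0 and Q still holds...
OVERLAP_11 = c2mos_step(d=1, clk=1, clk_bar=1, x=1, q=0)
# ...but once CLK bar returns to 0 the slave PMOS conducts and Q changes.
AFTER_11 = c2mos_step(d=1, clk=1, clk_bar=0, x=OVERLAP_11[0], q=OVERLAP_11[1])
print(OVERLAP_00, OVERLAP_11, AFTER_11)
```

The last step shows why a hold time constraint is needed: the value that slipped onto X during the 1-1 overlap reaches Q as soon as the overlap ends.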
Pipelining: An approach to optimize sequential circuits
Pipelining is a popular design technique often used to accelerate the operation of the
datapaths in digital processors. The idea is easily explained with the example of
Figure (a). The goal of the presented circuit is to compute log(|a + b|), where both a and b
represent streams of numbers, that is, the computation must be performed on a large set of
input values.
The minimal clock period Tmin necessary to ensure correct evaluation is given as:
$T_{min} = t_{c-q} + t_{pd,logic} + t_{su}$
where tc-q and tsu are the propagation delay and the set-up time of the register, respectively.
We assume that the registers are edge-triggered D registers. The term tpd,logic stands for
the worst-case delay path through the combinational network, which consists of the adder,
absolute value, and logarithm functions. In conventional systems, the latter delay is
generally much larger than the delays associated with the registers and dominates the
circuit performance. Assume that each logic module has an equal propagation delay. We
note that each logic module is then active for only 1/3 of the clock period (if the delay of
the register is ignored). For example, the adder unit is active during the first third of the
period and remains idle (that is, it does no useful computation) during the other 2/3 of
the period.
(a) Non-pipelined version. (b) Pipelined version.
Pipelining is a technique to improve the resource utilization, and increase the functional
throughput. Assume that we introduce registers between the logic blocks, as shown in
Figure b. This causes the computation for one set of input data to spread over a number of
clock periods, as shown in Table. The advantage of pipelined operation becomes apparent
when examining the minimum clock period of the modified circuit. The combinational
circuit block has been partitioned into three sections, each of which has a smaller
propagation delay than the original function. This effectively reduces the value of the
minimum allowable clock period:
$T_{min,pipe} = t_{c-q} + \max(t_{pd,add}, t_{pd,abs}, t_{pd,log}) + t_{su}$
Suppose that all logic blocks have approximately the same propagation delay, and that the
register overhead is small with respect to the logic delays. The pipelined network
outperforms the original circuit by a factor of three under these assumptions, or
Tmin,pipe = Tmin/3. The increased performance comes at the relatively small cost of two
additional registers, and an increased latency.
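The speed-up from pipelining can be worked through with numbers. The sketch below assumes the reference circuit needs a clock period of tc-q plus the sum of the three block delays plus tsu, while the pipelined version needs tc-q plus the largest block delay plus tsu; the delays (in picoseconds) for the adder, absolute-value and logarithm blocks are hypothetical.

```python
def t_min_reference(t_cq, t_su, stage_delays):
    """Clock period when all logic blocks sit between one pair of registers."""
    return t_cq + sum(stage_delays) + t_su


def t_min_pipelined(t_cq, t_su, stage_delays):
    """Clock period when a register follows every logic block."""
    return t_cq + max(stage_delays) + t_su


STAGES = [300, 300, 300]     # assumed adder, |x| and log delays in ps

# Ideal case: register overhead ignored, speed-up is exactly 3.
IDEAL_REF = t_min_reference(0, 0, STAGES)
IDEAL_PIPE = t_min_pipelined(0, 0, STAGES)

# With register overhead the speed-up is somewhat less than 3.
REAL_REF = t_min_reference(30, 20, STAGES)
REAL_PIPE = t_min_pipelined(30, 20, STAGES)
print(IDEAL_REF, IDEAL_PIPE, REAL_REF, REAL_PIPE, REAL_REF / REAL_PIPE)
```

The register overhead is paid once per stage, so it limits how much further subdividing the logic can shorten the clock period.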
Latch- vs. Register-Based Pipelines
Consider the pipelined circuit of Figure below. The pipeline system is implemented based
on pass-transistor-based positive and negative latches instead of edge-triggered registers.
Latch-based systems give significantly more flexibility in implementing a pipelined
system, and often offer higher performance. When the clocks CLK and CLK bar are
non-overlapping, correct pipeline operation is obtained. Input data is sampled on C1 at the
negative edge of CLK and the computation of logic block F starts; the result of the logic
block F is stored on C2 on the falling edge of CLK bar, and the computation of logic block G
starts. The non-overlapping of the clocks ensures correct operation. The value stored on C2
at the end of the CLK low phase is the result of passing the previous input (stored on the
falling edge of CLK on C1) through the logic function F.
NORA-CMOS—A Logic Style for Pipelined Structures
The latch-based pipeline circuit can also be implemented using C2MOS latches, as shown
in Figure below. This topology has one additional, important property: a C2MOS-based
pipelined circuit is race-free as long as all the logic functions F between the latches are
non-inverting.
The reasoning for the above argument is similar to the argument made in the construction
of a C2MOS register. During a (0-0) overlap between CLK and CLK bar, all C2MOS latches
simplify to pure pull-up networks (see Figure 7.27).
The only way a signal can race from stage to stage under this condition is when the logic
function F is inverting, as illustrated in Figure above, where F is replaced by a single,
static CMOS inverter. Similar considerations are valid for the (1-1) overlap.
Sources of Clock Skew and Jitter
A perfect clock is defined as a perfectly periodic signal that is simultaneously triggered at
various memory elements on the chip. However, due to a variety of process and
environmental variations, clocks are not ideal. To illustrate the sources of skew and jitter,
consider the simplistic view of clock generation and distribution as shown in Figure below.
Typically, a high frequency clock is either provided from off chip or generated on-chip.
From a central point, the clock is distributed using multiple matched paths to the low-level
memory elements (registers). Here two paths are shown. The clock paths include wiring and
the associated distributed buffers required to drive interconnects and loads. A key point to
realize in clock distribution is that the absolute delay through a clock distribution path is
not important; what matters is the relative arrival time between the outputs of each path at
the register points.
The sources of clock uncertainty can be classified in several ways. Systematic errors are
nominally identical from chip to chip, and are typically predictable (e.g., variation in total
load capacitance of each clock path). In principle, such errors can be modeled and
corrected at design time given sufficiently good models and simulators. Random errors are
due to manufacturing variations (e.g., dopant fluctuations that result in threshold
variations) that are difficult to model and eliminate. Mismatch may also be characterized as
static or time-varying. Below, the various sources of skew and jitter, introduced in Figure
10.14, are described in detail.
 Clock-Signal Generation (1)
The generation of the clock signal itself causes jitter. A typical on-chip clock
generator takes a low-frequency reference clock signal, and produces a high-frequency
global reference for the processor. The core of such a generator is a
Voltage-Controlled Oscillator (VCO). A major problem is coupling from the surrounding
noisy digital circuitry through the substrate. These noise sources cause temporal
variations of the clock signal that propagate unfiltered through the clock drivers to
the flip-flops.
 Manufacturing Device Variations (2)
Distributed buffers are integral components of the clock distribution networks, as
they are required to drive both the register loads and the global and local
interconnects. The matching of devices in the buffers along multiple clock paths is
critical to minimizing timing uncertainty. Device parameters in the buffers vary
along different paths, resulting in static skew. There are many sources of variations,
including oxide variations (that affect the gain and threshold), dopant variations,
and lateral dimension (width and length) variations.
 Interconnect Variations (3)
Vertical and lateral dimension variations cause the interconnect capacitance and
resistance to vary across a chip. Since this variation is static, it causes skew between
different paths. One important source of interconnect variation is the variation in
Inter-level Dielectric (ILD) thickness. Other interconnect variations include deviations
in the width of the wires and in line spacing. These result from photolithography and
etch dependencies.
 Environmental Variations (4 and 5)
The two major sources are temperature and power supply. Temperature gradients
across the chip are a result of variations in power dissipation across the die (chip). This
is an issue with clock gating, where some parts of the chip may be idle while other
parts of the chip might be active. Since the device parameters (such as threshold,
mobility, etc.) depend strongly on temperature, the buffer delay of a clock distribution
network along one path can differ drastically from that along another path. The delay
through buffers is a very strong function of power supply, as it directly affects the drive
of the transistors. As with temperature, the power supply voltage is a strong function of the
switching activity. Power supply variations can be classified into static (or slow) and
high-frequency variations. Static power supply variations may result from fixed
currents drawn from various modules, while high-frequency variations result from
instantaneous IR drops along the power grid due to fluctuations in switching activity.
 Capacitive Coupling (6 and 7)
The variation in capacitive load also contributes to timing uncertainty. There are two
major sources of capacitive load variations: coupling between the clock lines and
adjacent signal wires and variation in gate capacitance. Any coupling between the
clock wire and adjacent signal results in timing uncertainty leading to clock jitter.
Another major source of clock uncertainty is variation in the gate capacitance related
to the sequential elements. The load capacitance is highly non-linear and depends on
the applied voltage.
Timing Issues in Digital Circuits, Clock Distribution Techniques, Synchronous and
Asynchronous Design
All sequential circuits have one property in common—a well-defined ordering of the
switching events must be imposed if the circuit is to operate correctly. If this were not the
case, wrong data might be written into the memory elements, resulting in a functional
failure. The synchronous system approach, in which all memory elements in the system are
simultaneously updated using a globally distributed periodic synchronization signal (that
is, a global clock signal), represents an effective and popular way to enforce this ordering.
Functionality is ensured by imposing some strict constraints on the generation of the clock
signals and their distribution to the memory elements distributed over the chip; non-
compliance often leads to malfunction.
We analyze the impact of spatial variations of the clock signal, called clock skew, and
temporal variations of the clock signal, called clock jitter, and introduce techniques to cope
with them. These variations fundamentally limit the performance that can be achieved using a
conventional design methodology.
At the other end of the design spectrum is an approach called asynchronous design,
which avoids the problem of clock uncertainty altogether by eliminating the need for
globally distributed clocks. After discussing the basics of the asynchronous design approach,
we analyze the associated overhead and identify some practical applications. The important
issue of synchronization, which is required when interfacing different clock domains
or when sampling an asynchronous signal, also deserves some in-depth treatment. Finally,
the fundamentals of on-chip clock generation using feedback are introduced, along with
trends in timing.
Timing Classification Of Digital Systems
In digital systems, signals can be classified depending on how they are related to a local
clock. Signals that transition only at predetermined periods in time can be classified as
synchronous, mesochronous, or plesiochronous with respect to a system clock. A signal that
can transition at arbitrary times is considered asynchronous.
 Synchronous Interconnect: A signal with exactly the same frequency as, and a known fixed
phase offset with respect to, the local clock.
 Mesochronous Interconnect: A signal with the same frequency but an unknown
phase offset with respect to the local clock.
 Plesiochronous Interconnect: A signal which has nominally the same, but slightly
different, frequency as the local clock.
 Asynchronous Interconnect: Asynchronous signals can transition at any arbitrary
time, and are not slaved to any local clock.
Synchronous Design:
Synchronous Timing Basics
Virtually all systems designed today use a periodic synchronization signal, or clock. The generation
and distribution of a clock has a significant impact on performance and power dissipation.
In the ideal world, assuming the clock paths from a central distribution point to each
register are perfectly balanced, the phase of the clock (i.e., the position of the clock edge
relative to a reference) at various points in the system is going to be exactly equal.
However, the clock is neither perfectly periodic nor perfectly simultaneous. This results in
performance degradation and/or circuit malfunction. The figure shows the basic structure of a
synchronous pipelined datapath.
In the ideal scenario, the clocks at registers 1 and 2 have the same clock period and
transition at the exact same time. The following timing parameters characterize the timing
of the sequential circuit.
 The contamination (minimum) delay t_{c-q,cd} and maximum propagation delay t_{c-q} of
the register, and the set-up (t_{su}) and hold (t_{hold}) times of the registers.
 The contamination delay t_{logic,cd} and maximum delay t_{logic} of the combinational
logic.
 t_{clk1} and t_{clk2}, corresponding to the position of the rising edge of the clock relative
to a global reference.
Under ideal conditions (t_{clk1} = t_{clk2}), the worst-case propagation delays determine the
minimum clock period required for this sequential circuit. The period must be long enough
for the data to propagate through the registers and logic and be set up at the destination
register before the next rising edge of the clock. This constraint is given by
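Written out in terms of the parameters listed above, this is the familiar set-up constraint. The form below is a reconstruction from the surrounding definitions, with T denoting the clock period:

```latex
T > t_{c-q} + t_{logic} + t_{su}
```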
Clock Skew
The spatial variation in arrival time of a clock transition on an integrated circuit is commonly
referred to as clock skew. The clock skew between two points i and j on an IC is given by
δ(i,j) = t_i - t_j, where t_i and t_j are the positions of the rising edge of the clock with respect to a
reference. Consider the transfer of data between registers R1 and R2 in Figure 10.5. The clock
skew can be positive or negative depending upon the routing direction and position of the
clock source. The timing diagram for the case with positive skew is shown in the figure. The
rising clock edge is delayed by a positive δ at the second register.
 Clock skew is caused by static mismatches in the clock paths and differences in the clock
load, and by definition skew is constant from cycle to cycle. That is, if in one cycle CLK2 lagged
CLK1 by δ, then on the next cycle it will lag it by the same amount.
 Skew has strong implications on performance and functionality. First consider the
impact of clock skew on performance. From the figure, a new input In sampled by R1 at
edge 1 will propagate through the combinational logic and be sampled by R2 on edge
4. If the clock skew is positive, the time available for the signal to propagate from R1 to
R2 is increased by the skew δ. The output of the combinational logic must be valid
one set-up time before the rising edge of CLK2 (point 4). The constraint on the
minimum clock period can then be derived.
 With positive skew, the minimum clock period required to operate the circuit reliably
decreases as the clock skew increases.
 As above, assume that input In is sampled on the rising edge of CLK1 at edge 1 into R1.
The new value at the output of R1 propagates through the combinational logic and
should be valid before edge 4 at CLK2. However, if the minimum delay of the
combinational logic block is small, the inputs to R2 may change before clock edge
2, resulting in incorrect evaluation. To avoid races, the minimum
propagation delay through the register and logic must be long enough that the
inputs to R2 remain valid for a hold time after edge 2. The constraint can be formally
stated as
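Using the symbols defined earlier, the two constraints discussed in this list can be reconstructed as follows. The first is the set-up (performance) constraint with skew; the second is the hold (race) constraint, which appears to be the condition cited later as Eq. (10.4). Both forms are a reconstruction from the surrounding text, not a copy of the original equations.

```latex
\begin{align}
T + \delta &\geq t_{c-q} + t_{logic} + t_{su}
  \quad\Longleftrightarrow\quad
  T \geq t_{c-q} + t_{logic} + t_{su} - \delta \\
\delta + t_{hold} &< t_{c-q,cd} + t_{logic,cd}
  \quad\Longleftrightarrow\quad
  \delta < t_{c-q,cd} + t_{logic,cd} - t_{hold}
\end{align}
```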
The corresponding figure shows the timing diagram for the case δ < 0. For this case, the rising edge
of CLK2 happens before the rising edge of CLK1. On the rising edge of CLK1, a new input is
sampled by R1. The new sampled data propagates through the combinational logic and is
sampled by R2 on the rising edge of CLK2, which corresponds to edge 4. A negative skew
directly impacts the performance of a sequential system. However, a negative skew implies
that the system never fails due to races, since edge 2 happens before edge 1.
 δ > 0: This corresponds to a clock routed in the same direction as the flow of the data
through the pipeline (Figure 10.8a). In this case, the skew has to be strictly controlled
and satisfy Eq. (10.4). If this constraint is not met, the circuit malfunctions
independent of the clock period.
 δ < 0: When the clock is routed in the opposite direction of the data (Figure 10.8b), the
skew is negative and condition (10.4) is unconditionally met. The circuit operates
correctly independent of the skew, but the skew reduces the time available for actual
computation, so the clock period has to be increased by |δ|.
 Unfortunately, since a general logic circuit can have data flowing in both directions, this
solution to eliminate races will not always work. The skew can assume both positive and
negative values depending on the direction of the data transfer. The designer has to
account for the worst-case skew condition.
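The effect of skew on both constraints can be checked numerically. The sketch below assumes the set-up constraint T + δ ≥ t_{c-q} + t_{logic} + t_{su} and the hold constraint δ + t_{hold} < t_{c-q,cd} + t_{logic,cd}; the function names and all delay values (in picoseconds) are hypothetical and chosen only for illustration.

```python
def min_clock_period(t_cq, t_logic, t_su, skew=0):
    """Smallest period T that satisfies T + skew >= t_cq + t_logic + t_su."""
    return t_cq + t_logic + t_su - skew


def hold_ok(t_cq_cd, t_logic_cd, t_hold, skew=0):
    """True if skew + t_hold < t_cq_cd + t_logic_cd, i.e. no race occurs."""
    return skew + t_hold < t_cq_cd + t_logic_cd


# Hypothetical delays in picoseconds (not taken from the notes).
worst_case = dict(t_cq=100, t_logic=800, t_su=50)
contamination = dict(t_cq_cd=60, t_logic_cd=40, t_hold=30)

for skew in (-80, 0, 50, 80):
    period = min_clock_period(skew=skew, **worst_case)
    safe = hold_ok(skew=skew, **contamination)
    print(f"skew={skew:4d} ps  min period={period} ps  hold met={safe}")
```

With these numbers, positive skew shortens the minimum period but eventually violates the hold condition, while negative skew lengthens the period and always meets it, which is the trade-off described above.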
Clock Jitter
Clock jitter refers to the temporal variation of the clock period at a given point; that is, the
clock period can reduce or expand on a cycle-by-cycle basis. It is strictly a temporal
uncertainty measure and is often specified at a given point on the chip. Cycle-to-cycle jitter
refers to the time-varying deviation of a single clock period and, for a given spatial location i, is
given as T_{jitter,i}(n) = T_{i,n+1} - T_{i,n} - T_{CLK}, where T_{i,n} is the arrival time of clock
edge n at location i, T_{i,n+1} is the arrival time of edge n+1, and T_{CLK} is the nominal clock period.
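As a small numerical illustration, the expression above can be evaluated on a list of measured edge times. This sketch reads T_{i,n} as the arrival time of the n-th clock edge (an assumption, under which the expression gives the deviation of each period from nominal); the function name and the edge times are hypothetical.

```python
def cycle_jitter(edge_times, t_clk):
    """Return T[n+1] - T[n] - t_clk for each pair of consecutive clock edges."""
    return [t_next - t - t_clk for t, t_next in zip(edge_times, edge_times[1:])]


# Hypothetical rising-edge arrival times in picoseconds, nominal period 1000 ps.
edges = [0, 1010, 1990, 3000]
jitter = cycle_jitter(edges, t_clk=1000)
print(jitter)                      # [10, -20, 10]
print(1000 + min(jitter))          # shortest observed period: 980
```

The shortest observed period, not the nominal one, is what bounds the time available to the logic.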
Jitter directly impacts the performance of a sequential system. The figure above shows the
nominal clock period as well as the variation in period. Ideally the clock period starts at edge
2 and ends at edge 5, with a nominal clock period of T_{CLK}. However, as a result of jitter,
the worst-case scenario happens when the leading edge of the current clock period is delayed
(edge 3) and the leading edge of the next clock period occurs early (edge 4). As a result, the
total time available to complete the operation is reduced by 2 t_{jitter} in the worst case and is
given by
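With the register and logic delays defined earlier, the resulting constraint can be reconstructed as follows. This is a sketch from the surrounding definitions, taking t_{jitter} as the maximum displacement of a single clock edge:

```latex
T_{CLK} - 2\,t_{jitter} \geq t_{c-q} + t_{logic} + t_{su}
\quad\Longleftrightarrow\quad
T_{CLK} \geq t_{c-q} + t_{logic} + t_{su} + 2\,t_{jitter}
```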
Clock-Distribution Techniques
It is necessary to design a clock network that minimizes skew and jitter. Another important
consideration in clock distribution is the power dissipation. To reduce power dissipation,
clock networks must support clock conditioning, that is, the ability to shut down parts of
the clock network.
Fabrics for clocking
Most clock distribution schemes exploit the fact that only the relative phase between two
clocking points is important. Therefore one common approach to distributing a clock is to
use balanced paths or trees.
1) H-tree configuration:
The most common type of clock primitive is the H-tree network in Figure (a), where a 4x4
array is shown. In this scheme, the clock is routed to a central point on the chip and balanced
paths, that include both matched interconnect as well as buffers, are used to distribute the
reference to various leaf nodes. Ideally, if each path is balanced, the clock skew is zero.
However, in reality, as discussed in the previous section, process and environmental
variations cause clock skew and jitter to occur.
Figure: (a) H-tree clock network for a 4x4 array; (b) clock grid.
The H-tree configuration is particularly useful for regular-array networks in which all
elements are identical and the clock can be distributed as a binary tree (for example, arrays of
identical tiled processors). The more general approach, referred to as routed RC trees,
represents a floor plan that distributes the clock signal so that the interconnections carrying
the clock signals to the functional sub-blocks are of equal length.
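The balanced-path property of the H-tree can be illustrated with a short recursive construction. The sketch below is a geometric model only (no buffers or RC delay): each level routes from the centre horizontally and then vertically by the same distance, and the next level repeats the pattern at half the size. The function name, coordinates and dimensions are hypothetical.

```python
def h_tree_leaves(x, y, half, levels, length=0.0):
    """Return (leaf_x, leaf_y, wire_length_from_root) for every H-tree leaf."""
    if levels == 0:
        return [(x, y, length)]
    leaves = []
    for dx in (-half, half):          # horizontal arm of the H
        for dy in (-half, half):      # vertical arm of the H
            leaves += h_tree_leaves(x + dx, y + dy, half / 2, levels - 1,
                                    length + 2 * half)
    return leaves


# Two levels of H give the 16 leaves of a 4x4 array.
leaves = h_tree_leaves(0.0, 0.0, half=2.0, levels=2)
lengths = {length for _, _, length in leaves}
print(len(leaves), lengths)           # 16 {6.0}
```

Every leaf sees the same wire length from the root, which is why the nominal skew of an ideal H-tree is zero; real skew comes from the process and environmental variations discussed earlier.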
2) Grid configuration:
Grids are typically used in the final stage of the clock network to distribute the clock to the
clocking element loads (Fig. (b)). The main difference from the balanced-tree approach is that the
delay from the final driver to each load is not matched. Rather, the absolute delay is minimized, assuming that the grid
size is small. Advantage: It allows for late design changes, since the clock is easily accessible
at various points on the die. Disadvantage: The structure contains a lot of redundant
interconnect, which increases the capacitive load and power dissipation.
Design Techniques- Dealing with Clock Skew and Jitter
To fully exploit the improved performance of logic gates with technology scaling, clock skew
and jitter must be carefully addressed. Skew and jitter can fundamentally limit the
performance of digital circuits. Some guidelines for reducing clock skew and jitter are
presented below.
1. To minimize skew, balance clock paths from a central distribution source to
individual clocking elements using H-tree structures or, more generally, routed tree
structures. When using routed clock trees, the effective clock load of each path, which
includes wiring as well as transistor loads, must be equalized.
2. The use of local clock grids (instead of routed trees) can reduce skew at the cost of
increased capacitive load and power dissipation.
3. If data-dependent clock load variations cause significant jitter, differential registers
that have a data-independent clock load should be used. The use of gated clocks to
save power also results in data-dependent clock load and increased jitter. In clock networks
where the fixed load is large (e.g., using clock grids), the data-dependent variation
might not be significant.
4. If data flows in one direction, route data and clock in opposite directions. This
eliminates races at the cost of performance.
5. Avoid data dependent noise by shielding clock wires from adjacent signal wires. By
placing power lines (VDD or GND) next to the clock wires, coupling from neighboring
signal nets can be minimized or avoided.
6. Variations in interconnect capacitance due to inter-layer dielectric thickness
variation can be greatly reduced through the use of dummy fills. Dummy fills are
very common and reduce skew by increasing uniformity. Systematic variations should
be modeled and compensated for.
7. Variation in chip temperature across the die causes variations in clock buffer delay.
The use of feedback circuits based on delay locked loops can easily compensate for
temperature variations.
8. Power supply variation is a significant component of jitter as it impacts the cycle to
cycle delay through clock buffers. High frequency power supply variation can be
reduced by addition of on-chip decoupling capacitors. Unfortunately, decoupling
capacitors require a significant amount of area and efficient packaging solutions
must be leveraged to reduce chip area.
Asynchronous Design
Self-Timed Logic - An Asynchronous Technique
The synchronous design approach advocated in the previous sections assumes that all circuit
events are orchestrated by a central clock. Those clocks have a dual function.
 They ensure that the physical timing constraints are met.
 Clock events serve as a logical ordering mechanism for the global system events.
Consider the pipelined datapath of the figure below. In this circuit, the data transitions through
logic stages under the command of the clock. The important point to note under this
methodology is that the clock period is chosen to be larger than the worst-case delay of each
pipeline stage, or T > max(t_{pd1}, t_{pd2}, t_{pd3}) + t_{pd,reg}. At each clock transition, a new set of
inputs is sampled and computation is started anew. The throughput of the system, which is
equivalent to the number of data samples processed per second, is equal to the clock
rate.
 Advantages: It presents a structured, deterministic approach to the problem of
choreographing the myriad of events that take place in digital designs. The approach
taken is to equalize the delays of all operations by making them as bad as the worst of
the set. The approach is robust and easy to adhere to.
 Disadvantages: It assumes that all clock events or timing references happen
simultaneously over the complete circuit. This is not the case in reality, because of
effects such as clock skew and jitter.
One way to avoid these problems is to opt for an asynchronous design approach and to
eliminate all the clocks. A more reliable and robust technique is the self-timed approach,
which presents a local solution to the timing problem. The approach in the figure below assumes
that each combinational function has a means of indicating that it has completed a
computation for a particular piece of data. The computation of a logic block is initiated by
asserting a Start signal. The combinational logic block computes on the input data and, in a
data-dependent fashion (taking the physical constraints into account), generates a Done flag
once the computation is finished. Additionally, the operators must signal each other that they
are either ready to receive a next input word or that they have a legal data word at their
outputs that is ready for consumption. This signaling ensures the logical ordering of the
events and can be achieved with the aid of extra Ack(nowledge) and Req(uest) signals. In
the case of the pipelined datapath, the scenario could proceed as follows.
1. An input word arrives, and a Req(uest) to the block F1 is raised. If F1 is inactive at
that time, it transfers the data and acknowledges this fact to the input buffer, which
can go ahead and fetch the next word.
2. F1 is enabled by raising the Start signal. After a certain amount of time, dependent
upon the data values, the Done signal goes high, indicating the completion of the
computation.
3. A Req(uest) is issued to the F2 module. If this function is free, an Ack(nowledge) is
raised, the output value is transferred, and F1 can go ahead with its next
computation.
The self-timed approach effectively separates the physical and logical ordering functions
implied in circuit timing. The completion signal Done ensures that the physical timing
constraints are met and that the circuit is in steady state before accepting a new input. The
logical ordering of the operations is ensured by the acknowledge-request scheme, often
called a handshaking protocol.
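The Req/Ack/Start/Done sequence described above can be sketched as a small timing model. This is a simplified illustration under stated assumptions, not a circuit-accurate simulation: a stage raises Req when its Done goes high, the next stage acknowledges by taking the data as soon as it is free, and handshake and latch overheads are ignored. The function names and the delay values are hypothetical.

```python
def self_timed_finish(delays):
    """Completion time of the last item in a self-timed pipeline.

    delays[s][i] is the data-dependent computation time of stage s for item i.
    """
    n_stages, n_items = len(delays), len(delays[0])
    start = [[0] * n_items for _ in range(n_stages)]
    done = [[0] * n_items for _ in range(n_stages)]
    for i in range(n_items):
        for s in range(n_stages):
            # Req: the data for item i is available once the previous stage is Done.
            data_ready = done[s - 1][i] if s > 0 else 0
            if i == 0:
                free = 0
            elif s + 1 < n_stages:
                free = start[s + 1][i - 1]    # Ack: next stage took item i-1
            else:
                free = done[s][i - 1]         # last stage: output consumed at once
            start[s][i] = max(data_ready, free)         # Start is raised
            done[s][i] = start[s][i] + delays[s][i]     # Done goes high
    return done[-1][-1]


def synchronous_finish(delays, t_reg=0):
    """Same workload with a global clock: T = worst-case stage delay + t_reg."""
    n_stages, n_items = len(delays), len(delays[0])
    period = max(max(stage) for stage in delays) + t_reg
    return (n_items + n_stages - 1) * period


# Hypothetical delays: two stages (F1, F2), three data items.
delays = [[2, 1, 3], [1, 4, 1]]
print(self_timed_finish(delays), synchronous_finish(delays))   # 8 16
```

In this idealized model the self-timed pipeline runs at the actual, data-dependent delays, whereas the clocked pipeline pays the worst-case delay on every cycle. The handshaking overhead omitted here reduces that advantage in practice.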
 In contrast to the global centralized approach of the synchronous methodology, timing
signals are generated locally. This avoids all problems and overheads associated with
distributing high-speed clocks.
 Separating the physical and logical ordering mechanisms results in a potential increase
in performance.
 The automatic shut-down of blocks that are not in use can result in power saving.
 Self-timed circuits are by nature robust to variations in manufacturing and operating
conditions such as temperature.
Unfortunately, these nice properties are not for free; they come at the expense of a substantial
circuit-level overhead, which is caused by the need to generate completion signals and the
need for handshaking logic that acts as a local traffic agent to order the circuit events.
DESIGNING OF MEMORY AND ARRAY STRUCTURES
A large portion of the silicon area of many contemporary digital designs is dedicated to the storage
of data values and program instructions.
Memory Classification: Classification criteria
I. Size: Depending upon the level of abstraction, different means are used to express the size of
a memory unit. The circuit designer tends to define the size of a memory in terms of
bits, which is equivalent to the number of individual cells (flip-flops) needed to
store the data. The chip designer expresses the memory size in bytes or its multiples. The
system designer likes to quote the storage requirement in words.
II.Timing Parameters:
The time it takes to retrieve data (read) from the memory is called the read access time, which is
equal to the delay between the read request and the moment the data is available at the output.
This time is different from the write access time, which is the time elapsed between a write
request and the final writing of the input data into the memory. The read or write cycle time of the
memory is the minimum time required between successive reads or writes.
III. Function and Access patterns
IV. Input output architecture
Number of data input and output ports (multiport memories)
V. Application
Standalone ICs
Embedded
Secondary or tertiary memories (magnetic and optical disc)
Memory architecture and building blocks:
When implementing an N-word memory where each word is M bits wide, the most intuitive
approach is to stack the subsequent memory words in a linear fashion. One word at a time is
selected for reading or writing with the aid of a select bit (S_0 to S_{N-1}), if we assume that this
module is a single-port memory.
A decoder is inserted to reduce the number of select signals. A memory word is selected by
providing a binary-encoded address word (A_0 to A_{K-1}). The decoder translates this address into
N = 2^K select lines, only one of which is active at a time. This approach reduces the number
of address lines from N to log_2(2^K) = K.
This design does not address the issue of memory aspect ratio (the height is very large compared
to the width), which results in a design that cannot be implemented. Besides the bizarre shape
factor, the resulting design is extremely slow: the vertical wires connecting the storage cells
to the input/output become excessively long. To address this problem, memory arrays are
organized so that vertical and horizontal dimensions are of the same order of magnitude, thus
the aspect ratio approaches unity. Multiple words are stored in a single row and are selected
simultaneously. To route the correct word to the input/output terminals, an extra piece of
circuitry called the column decoder is needed. The address word is partitioned into a column
address (A_0 to A_{K-1}) and a row address (A_K to A_{L-1}). The row address enables one row of the
memory for R/W while the column address picks one particular word from the selected row.
For larger memories, the memory is partitioned into P smaller blocks. The composition of
each of the individual blocks is identical to the above figure. A word is selected based on the
row and column address that are broadcast to all the blocks. An extra address word called the
block address, selects one of the P blocks to be read or written. This approach has a dual
advantage.
1. The length of the local word lines and bit lines, i.e. the length of the lines within the blocks,
is kept within bounds, resulting in faster access times.
2. The block address can be used to activate only the addressed block. Non active blocks
are put in power saving mode with sense amplifiers and row and column decoders
disabled. This results in a substantial power saving that is desirable.
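The partitioning of the address word into block, row and column fields can be sketched with simple bit slicing. The column field occupies the low-order bits and the row field the bits above it, as described earlier; placing the block address in the top bits is an assumption made here for illustration, as are the function name and the field widths.

```python
def split_address(addr, col_bits, row_bits, block_bits):
    """Split a flat address into (block, row, column) fields."""
    col = addr & ((1 << col_bits) - 1)
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    block = (addr >> (col_bits + row_bits)) & ((1 << block_bits) - 1)
    return block, row, col


def block_enables(block, block_bits):
    """One-hot enables: only the addressed block is activated, the rest stay idle."""
    return [int(b == block) for b in range(1 << block_bits)]


# Hypothetical organization: 4 blocks, 8 rows per block, 4 words per row.
block, row, col = split_address(0b10_101_11, col_bits=2, row_bits=3, block_bits=2)
print(block, row, col)                 # 2 5 3
print(block_enables(block, 2))         # [0, 0, 1, 0]
```

The one-hot enable vector corresponds to the second advantage listed above: blocks whose enable is 0 can keep their decoders and sense amplifiers disabled.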
The Memory Core
Read only memories
Programs for processors with fixed applications such as washing machines, calculators and
game machines, once developed and debugged need only reading.
ROM cells - An overview:
The cell should be designed so that a 1 or 0 is presented to the bit line upon activation of its
word line. Figure shows several ways to accomplish this.
Diode ROM
 The bit line (BL) is resistively clamped to ground, i.e. BL is pulled low through the
resistor connected to ground in the absence of any other excitation or input.
 0 cell: There is no physical connection between BL and the word line (WL), so BL stays low.
 1 cell: When a high voltage is applied to the WL of a 1 cell, the diode is enabled and the BL is
pulled up to V_{WL} - V_{D(on)}, resulting in a 1 on the BL.
 Disadvantage: The diode cell does not isolate the BL from the WL.
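The read behaviour of the diode ROM can be modelled at the logic level. In the sketch below the array contents, the word-line voltage and the diode on-voltage are all hypothetical values; a 1 in the array stands for a diode between word line and bit line, a 0 for no connection.

```python
V_WL = 2.5     # hypothetical word-line high level (volts)
V_D_ON = 0.7   # hypothetical diode on-voltage (volts)

# rom[row][col] = 1 if a diode connects word line `row` to bit line `col`.
rom = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]


def bit_line_voltages(rom, row):
    """Bit-line voltages when word line `row` is driven to V_WL."""
    # A 1 cell pulls BL up to V_WL - V_D_ON; otherwise the resistor holds BL at 0 V.
    return [V_WL - V_D_ON if cell else 0.0 for cell in rom[row]]


def read_word(rom, row, threshold=V_WL / 2):
    """Convert bit-line voltages to logic values."""
    return [int(v > threshold) for v in bit_line_voltages(rom, row)]


print(read_word(rom, 1))    # [0, 1, 1, 0]
```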
A better approach is to use an active device in the cell. The diode is replaced by the
gate-source connection of an NMOS transistor, whose drain is connected to the supply voltage.
Read Write memories (RAM)
Static RAM (SRAM)
A generic SRAM cell consists of 6 transistors (6T) per bit. Access to the cell is enabled by
the WL, which replaces the clock and controls two pass transistors M5 and M6, shared
between the read and write operations. In contrast to ROM cells, two bit lines transferring both
the stored signal and its inverse are required. Doing so improves the noise margin during both
read and write operations.
Operation of SRAM cell
Read operation:
Assume that a 1 is stored at Q. Both bit lines are precharged to 2.5 V before the read
operation is initiated. The read cycle is started by asserting the word line, enabling both pass
transistors M5 and M6 after the initial WL delay. During a correct read operation, the values
stored in Q and Q_BAR are transferred to the bit lines by leaving BL at its precharged value
and discharging BL_BAR through the series connection of M1 and M5. Careful sizing of the
transistors is necessary to avoid accidentally writing a 1 into the cell. This type of malfunction
is frequently called a read upset.
Write operation:
Assume that a 1 is stored in the cell (or Q=1). A 0 is written into the cell by setting BL_BAR
to 1 and BL to 0, which is equivalent to applying a reset pulse to an SR latch. This causes the
flip-flop to change its state if the devices are properly sized.
Dynamic RAM (DRAM)
3T Dynamic Memory cell:
The cell is written by placing the appropriate data value on BL1 and asserting the write word
line (WWL). The data is retained as charge on the capacitance C_S once WWL is lowered. When
reading the cell, the read word line (RWL) is raised. The storage transistor M2 is either on or off
depending on the stored value. The bit line BL2 is either clamped to VDD with the aid of a load
device or is precharged to either VDD or VDD-VT. The series connection of M2 and M3 pulls
BL2 low when a 1 is stored. BL2 remains high in the opposite case. Notice that the cell is
inverting, i.e. the inverse value of the stored signal is sensed on the BL. The most common
approach to refreshing a cell is to read the stored data, put its inverse on BL1 and assert WWL in
consecutive order.
Properties of the 3T cell
 In contrast to the SRAM cell, no constraints exist on the device ratios.
 Reading the 3T cell is non destructive i.e. the data value stored in the cell is not
affected by a read.
 No special process steps are needed. The storage capacitance is nothing more than the
gate capacitance of the readout device.
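The write, inverting read and refresh sequence of the 3T cell can be captured in a purely logical model. The sketch below ignores charge leakage and threshold drops; the class and method names are hypothetical.

```python
class ThreeTCell:
    """Logic-level model of a 3T DRAM cell (no leakage, no analog effects)."""

    def __init__(self):
        self.cs = 0                      # charge on storage capacitance C_S (0 or 1)

    def write(self, bl1):
        """Assert WWL: the value on BL1 is stored on C_S."""
        self.cs = bl1

    def read(self):
        """Raise RWL: BL2 (precharged high) is pulled low if a 1 is stored."""
        return 0 if self.cs else 1       # inverting, non-destructive read

    def refresh(self):
        """Read the cell, put the inverse of BL2 on BL1, then assert WWL."""
        self.write(1 - self.read())


cell = ThreeTCell()
cell.write(1)
print(cell.read())     # 0, the inverse of the stored 1
cell.refresh()
print(cell.cs)         # 1, the stored value is restored
```

In a real cell the refresh matters because the charge on C_S leaks away over time; the model only shows that the read-invert-write sequence restores the original value.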
Memory Peripheral Circuitry (Control Circuitry)
Since the memory core trades performance and reliability for reduced area, memory design
relies exceedingly on the peripheral circuitry to recover both speed and electrical integrity.
The address decoders:
Whenever a memory allows for random address-based access, address decoders must be
present. There are two classes of decoders: the row decoder, whose task is to enable one memory
row out of 2^M, and the column and block decoders, which can be described as 2^K-input
multiplexers, where M and K are the widths of the respective fields in the address word.
Row decoders:
A 1-out-of-2^M decoder is nothing less than a collection of 2^M complex M-input logic gates.
Consider an 8-bit address decoder. Each of the outputs WL_i is a logic function of the 8 input
address signals (A_0 to A_7). For example, the word lines for addresses 0 and 127 are enabled by
the following logic functions (taking A_0 as the least significant bit):
WL_0 = A_0' A_1' A_2' A_3' A_4' A_5' A_6' A_7'
WL_{127} = A_0 A_1 A_2 A_3 A_4 A_5 A_6 A_7'
For a single-stage implementation, WL_0 can be transformed into a wide NOR using De Morgan's
rules:
WL_0 = (A_0 + A_1 + A_2 + A_3 + A_4 + A_5 + A_6 + A_7)'
Static Decoder Design:
Implementing a wide NOR function in complementary CMOS is impractical. Splitting a
complex gate into two or more logic layers most often produces a faster and cheaper
implementation. Segments of the address are decoded in a first layer of logic called the
predecoder. A second layer of logic gates then produces the final word line signals.
WL_0 = (A_0 + A_1)' (A_2 + A_3)' (A_4 + A_5)' (A_6 + A_7)'
For this particular case, the address is partitioned into sections of 2 bits that are decoded in
advance. The resulting signals are combined using 4-input NAND gates, whose outputs are the
complemented (active-low) word line signals, to produce the fully decoded array of WL signals.
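The two-level scheme can be checked exhaustively for the 8-bit case. The sketch below models the predecoder as a one-hot decode of each 2-bit address section and the second level as a 4-input NAND followed by an inversion; the function names are hypothetical, and the inversion after the NAND is an assumption about how the active-high word line is obtained.

```python
def predecode(addr, groups=4):
    """First level: one-hot decode each 2-bit section of the address."""
    lines = []
    for g in range(groups):
        pair = (addr >> (2 * g)) & 0b11
        lines.append([int(pair == v) for v in range(4)])
    return lines


def word_lines(addr, groups=4):
    """Second level: one 4-input NAND per word line, then inverted."""
    pre = predecode(addr, groups)
    wl = []
    for row in range(4 ** groups):
        selected = [pre[g][(row >> (2 * g)) & 0b11] for g in range(groups)]
        nand = int(not all(selected))    # active-low NAND output
        wl.append(1 - nand)              # inverted to drive the word line
    return wl


# Exactly one word line is high, and it is the addressed one.
for addr in (0, 127, 255):
    wl = word_lines(addr)
    print(addr, sum(wl), wl.index(1))
```

Each word-line gate now has 4 inputs instead of 8, and the predecoder outputs are shared among all 256 word lines.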
Dynamic Decoders:
Since only one transition determines the decoder speed, it is interesting to evaluate other
circuit implementations.
Column and Block decoders:
The functionality of a column and block decoder is best described as a 2^K-input multiplexer,
where K stands for the size of the address word. One implementation is based on the CMOS
pass-transistor multiplexer. The control signals of the pass transistors are generated using a
K-to-2^K predecoder. The schematic of a 4-to-1 column decoder using only NMOS transistors is
shown. The main advantage of this implementation is its speed. Only a single pass transistor
is inserted in the signal path, which introduces only a minimal extra resistance. The column
decoding is one of the last actions to be performed in the read sequence, so that the
predecoding can be executed in parallel with other operations such as memory access and
sensing and can be performed as soon as the column address is available. Consequently, the
propagation delay does not add to the overall memory access time.
A more efficient implementation is offered by a tree decoder that uses a binary reduction
scheme. Notice that no predecoder is required. The number of devices is drastically reduced
as shown.
N_{tree} = 2^K + 2^{K-1} + \cdots + 4 + 2 = 2(2^K - 1)
A 4-to-1 tree based column decoder
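The binary reduction performed by the tree decoder, and its device count, can be sketched as follows. The model assumes one pass transistor per line entering each level and uses address bit A_0 at the level nearest the bit lines; the function name and the bit ordering are assumptions made for illustration.

```python
def tree_select(bit_lines, addr):
    """Route one of 2**K bit lines to the output through K pass-transistor levels.

    Returns (selected_value, number_of_pass_transistors).
    """
    level = list(bit_lines)
    devices = 0
    bit = 0
    while len(level) > 1:
        devices += len(level)                 # one pass transistor per incoming line
        a = (addr >> bit) & 1                 # address bit steering this level
        level = [level[i + a] for i in range(0, len(level), 2)]
        bit += 1
    return level[0], devices


# 4-to-1 example (K = 2): bit lines carry their own index for easy checking.
value, devices = tree_select([0, 1, 2, 3], addr=2)
print(value, devices)                         # 2 6
```

For K = 2 the count is 4 + 2 = 6 = 2(2^2 - 1), in agreement with the expression for N_{tree}. The price is that K pass transistors now sit in series in the signal path, instead of one.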
Sense Amplifiers:
They perform the following functions:
 Amplification: In certain memory structures such as a 1T RAM, amplification is
required for proper functionality since the typical circuit swing is limited to 100 mV.
 Delay reduction: The amplifier compensates for the restricted fan out driving
capability of the memory cell by accelerating the BL transition, or by detecting and
amplifying small transitions on the BL to large output swings.
 Power reduction: Reducing the signal swing on the bitlines can eliminate a substantial
part of the power dissipation related to the charging and discharging of the bit lines.
 Signal restoration: Because the read and refresh functions are intrinsically linked in
1T DRAMs, it is necessary to drive the BLs to the full signal range after sensing.
Differential Voltage Sensing Amplifiers:
Effectiveness of a differential amplifier is characterized by
1. Common mode rejection ratio CMRR: ability to amplify the true difference between
the signals and reject the common noise.
2. Power supply rejection ratio PSRR: ability to reject spikes (noise) on the power
supply.
The figure shows the most basic sense amplifier. Amplification is achieved with a single stage,
based on the current-mirroring concept. The input signals are heavily loaded and driven by the
SRAM memory cell. The swing on those lines is small, as the small memory cell drives a
large capacitive load. The inputs are fed to the differential input devices (M1 and M2), and
M3 and M4 act as an active current-mirror load. The amplifier is conditioned by the sense
amplifier enable signal SE. Initially the inputs are precharged and equalized to a common value
while SE is low, disabling the circuit. Once the read operation is initiated, one of the bit lines
drops. SE is enabled when a sufficient differential signal has been established, and the
amplifier evaluates.
Power dissipation in memories:
Reduction of power dissipation in memories is becoming of premier importance. Technology
scaling with its reduction in supply and threshold voltages and its deterioration of the off
current of the transistor causes the standby power of the memory to rise.
Sources of power dissipation in memories:
The power consumption in a memory chip can be attributed to three major sources – the
memory cell array, the decoders (block, row, column) and the periphery. A unified active
power equation for a modern CMOS memory array of m columns and n rows is
approximately given by:
For a normal read cycle,
P = V_{DD} I_{DD}
I_{DD} = I_{array} + I_{decode} + I_{periphery}
       = [m i_{act} + m(n-1) i_{hld}] + [(n+m) C_{DE} V_{int} f] + [C_{PT} V_{int} f + I_{DCP}]
where
i_{act}: effective current of the selected (active) cells;
i_{hld}: data retention current of the inactive cells;
C_{DE}: output capacitance of each decoder;
C_{PT}: total capacitance of the CMOS logic and peripheral circuits;
V_{int}: internal supply voltage;
I_{DCP}: static or quasi-static current of the periphery. The major sources of this current are
the sense amplifiers and the column circuitry; other sources are the on-chip voltage generators;
f: operating frequency.
The power dissipation is proportional to the size of the memory. Dividing the memory into
subarrays and keeping n and m small are essential to keep the power within bounds.
In general, the power dissipation of the memory is dominated by the array. The active power
dissipation of the peripheral circuits is small compared to other components. Its standby
power can be high however requiring that circuits such as sense amplifiers are turned off
when not in action. The decoder charging current is also negligibly small in modern RAMs
especially if care is taken that only one out of the n or m nodes is charged at every cycle.
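As a rough numerical illustration of the unified current equation above, the following Python sketch evaluates the three components for an array before and after partitioning. The function name `memory_idd` and all parameter values are hypothetical, chosen only to show how the array term scales with m and n; they are not taken from any real memory.

```python
def memory_idd(m, n, i_act, i_hld, c_de, c_pt, v_int, f, i_dcp):
    """Return (I_array, I_decode, I_periphery) for an m-column, n-row array."""
    i_array = m * i_act + m * (n - 1) * i_hld   # active cells + retention of the rest
    i_decode = (n + m) * c_de * v_int * f       # decoder output charging
    i_periphery = c_pt * v_int * f + i_dcp      # peripheral switching + static current
    return i_array, i_decode, i_periphery

# Hypothetical values, for illustration only.
params = dict(i_act=100e-6, i_hld=1e-9, c_de=50e-15, c_pt=10e-12,
              v_int=1.2, f=100e6, i_dcp=1e-3)
full = memory_idd(m=1024, n=1024, **params)
sub = memory_idd(m=256, n=256, **params)
print("full array IDD:", sum(full))
print("sub-array IDD :", sum(sub))
```

With these assumed numbers the array term dominates, and shrinking m and n reduces it directly, which is the motivation for partitioning.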
Power reduction Techniques:
1) Partitioning of the memory
A proper division of the memory into submodules goes a long way in confining active power
dissipation to the limited areas of the overall array. Memory units that are not in use should
consume only the power necessary for data retention. Memory partitioning is accomplished by
reducing m (the number of cells on a word line) and/or n (the number of cells on a bit line). By
dividing the word line into several sub word lines that are enabled only when addressed, the
overall switched capacitance per access is reduced.
Partitioning of the bit line reduces the capacitance switched at every read/write operation. An
approach that is often used in DRAM memories is the partially activated bit line. The bit line
is partitioned into multiple sections. All three sections share a common sense amplifier,
column decoder and I/O module.
2) Addressing the active power dissipation
Reducing the voltage levels is one of the most effective techniques to reduce power
dissipation in memories.
SRAM Active power dissipation:
To obtain a fast read operation, the voltage swing on the bit line is made as small as possible
typically between 0.1 and 0.3 V. The resulting signal is sent to the sense amplifier for
restoration. Since the signal is developed as a result of the ratio operation of the bit line load
and the cell transistor, a current flows through the bit line as long as the word line is activated
(t). Limiting t and the bit line swing helps to keep the active dissipation of SRAM low.
The situation is worse for the write operation, since BL and BL_BAR have to make a full
excursion. Reduction of the core voltage is the only remedy for this. Ultimately, the reduction
of the core voltage is limited by the mismatch between the paired MOS transistors in the
SRAM cell. Stringent control of the MOS transistor characteristics either at the process time
or at the run time using techniques such as body biasing is essential in low voltage operation
mode.
DRAM Active power dissipation:
The destructive readout process of a DRAM necessitates successive operations of readout,
amplification and restoration of the selected cells. Consequently, the bit lines are charged and
discharged over the full swing (VBL) for every read operation. Care should thus be taken to
reduce bit line dissipation charge mCBLVBL, since it dominates the active power. Reducing
CBL (bit line capacitance) is advantageous from both a power and SNR perspective. Reducing
VBL, while very beneficial from a power perspective, negatively impacts the SNR.
Voltage reduction thus has to be accompanied by either an increase in the size of the storage
capacitor and/or a noise reduction. A number of techniques have proven to be quite effective.
a) Half-VDD precharge: Precharging the bit lines to VDD/2 helps to reduce active power in
DRAM memories by a factor of almost 2.
b) Boosted word line: Raising the value of the WL above VDD during a write operation
eliminates the threshold drop over the access transistor, yielding a substantial increase in
stored charge.
c) Increased capacitor area or value: Vertical capacitors such as those used in stacked and
trench cells are very effective in increasing the capacitance value. Keeping the ground
plate of the storage capacitor at VDD/2 reduces the maximum voltage over CS, making it
possible to use thinner oxides.
d) Increasing the cell size: Ultra-low voltage DRAM memory operation might require a
sacrifice of the area efficiency, especially for memories that are embedded in a system-on-
chip.
3) Data retention dissipation:
Data retention in SRAMs
In principle an SRAM array should not have any static power dissipation. Yet leakage current
of the cell transistors is becoming a major source of the retention current (due to subthreshold
leakage). Techniques to reduce retention current of SRAM memories:
a) Turning off unused memory blocks:
Memory functions such as caches do not fully use the available capacity most of the time.
Disconnecting unused blocks from the supply rails using high threshold switches reduces
their leakage to very low values. Obviously, the data stored in the memory is lost in this
approach.
b) Increasing the threshold by using body biasing:
Negative bias of the non active cells increases the thresholds of the devices and reduces the
leakage current.
c) Inserting extra resistance in the leakage path:
When data retention is necessary, the insertion of a low threshold switch in the leakage path
provides a means to reduce leakage current while keeping the data intact. The low
threshold device still leaks on its own, and this leakage is sufficient to maintain the state in the memory. At
the same time, a voltage drop over the switch introduces a "stacking effect" in the memory
cells connected to it. A reduction of VGS combined with a negative VBS results in a substantial
drop in the leakage current.
d) Lowering supply voltage:
DRAM Retention power:
To combat leakage and loss of signal, DRAMs have to be refreshed continuously when in
data retention mode. The refresh operation is performed by reading the m cells connected
to a word line and restoring them. This operation is performed for each of the n word
lines in a sequence. The standby power is thus proportional to the bit line dissipation
charge and the refresh frequency.
The secret to leakage minimization in DRAM memories is VT control. This can be
accomplished at the design time (the fixed VT approach) or dynamically (the variable VT
technique). One option to reduce leakage through the access transistor in the DRAM cell
is to turn off the device hard by applying a negative voltage (-VWL) to the word line of
non-active cells.
UNIT IV DESIGNING ARITHMETIC BUILDING BLOCKS
Data path circuits:
Fig. Basic DSP architecture
Building Blocks for Digital Architectures include
• Arithmetic unit - Bit-sliced datapath (adder, multiplier, shifter, comparator, etc.)
• Memory - RAM, ROM, buffers, shift registers
• Control - Finite state machine (PLA, random logic), counters
• Interconnect - Switches, arbiters, bus
BIT – SLICED DATA PATH ORGANISATION:
Datapaths are often arranged in a bit-sliced organisation. Data in a processor is word
based: typical microprocessor datapaths are 32 or 64 bits wide, while those in DSL modems,
magnetic disk drives and compact disk players are of arbitrary width, typically 5 to 24 bits.
A 32-bit datapath, for example, consists of 32 bit slices, each operating on a single bit. Hence the name
bit-sliced data path organization.
Arithmetic Building Blocks of bit-sliced data path organization include
• Datapath elements - registers
• Adder design
– Static adder
– Dynamic adder
• Multiplier design
– Array multipliers
• Shifters, parity circuits
Fig.:Bit sliced Datapath Organisation
ADDERS:
Addition forms the basis for many processing operations, from ALUs to address generation
to multiplication to filtering. As a result, adder circuits that add two binary numbers are of
great interest to digital system designers.
FULL ADDER:
For a full adder, it is sometimes useful to define Generate (G), Propagate (P), and Kill (K)
signals. The adder generates a carry when Cout is true independent of Cin, so $G = A \cdot B$.
The adder kills a carry when Cout is false independent of Cin, so $K = \overline{A} \cdot \overline{B} = \overline{A + B}$.
The adder propagates a carry, i.e., it produces a carry-out if and only if it receives a carry-in,
when exactly one input is true: $P = A \oplus B$.
The sum and carry-out signals in terms of G and P are given by:
$C_o(G,P) = G + P\,C_i$
$S(G,P) = P \oplus C_i$
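The G, P and K definitions can be checked exhaustively against an ordinary full adder. The short Python sketch below does this for all eight input combinations; the helper name `full_adder` is illustrative and not part of the original notes.

```python
from itertools import product

def full_adder(a, b, ci):
    """Reference full adder on single bits: returns (sum, carry_out)."""
    total = a + b + ci
    return total & 1, total >> 1

for a, b, ci in product((0, 1), repeat=3):
    g = a & b               # generate
    p = a ^ b               # propagate
    k = (1 - a) & (1 - b)   # kill
    s, co = full_adder(a, b, ci)
    assert co == g | (p & ci)   # Co = G + P.Ci
    assert s == p ^ ci          # S = P xor Ci
    assert g + p + k == 1       # exactly one of G, P, K is true
print("G/P/K identities hold for all 8 input combinations")
```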
RIPPLE CARRY ADDER:
The delay of an N-bit ripple carry adder is given by
$t_{adder} = (N-1)\,t_{carry} + t_{sum}$
Two significant conclusions follow from the delay equation:
1. The propagation delay of the ripple carry adder is linearly proportional to N. This property
becomes increasingly important when designing adders for wide datapaths (N = 16, ..., 128).
2. When designing a fast RCA from full adders, it is important to optimize t_carry, since it is multiplied by (N-1) while t_sum appears only once.
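A behavioural sketch of the ripple carry adder makes the serial carry dependence explicit. The Python below is only a software model; the function names are illustrative, and the delay helper simply evaluates the expression given above for caller-supplied t_carry and t_sum.

```python
def ripple_carry_add(a, b, n, cin=0):
    """Add two n-bit unsigned integers bit by bit; returns (sum mod 2**n, carry_out)."""
    carry, result = cin, 0
    for i in range(n):                        # carry ripples from LSB to MSB
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return result, carry

def rca_delay(n, t_carry, t_sum):
    """Worst-case delay: (N-1) carry stages followed by one sum stage."""
    return (n - 1) * t_carry + t_sum

print(ripple_carry_add(0b1011, 0b0110, 4))    # (1, 1), since 11 + 6 = 17 = 16 + 1
```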
Inverting property: inverting all inputs to a full adder results in inverted outputs. With
$S = A \oplus B \oplus C_i$ and $C_o = AB + BC_i + AC_i$, this can be expressed as
$\overline{S}(A,B,C_i) = S(\overline{A},\overline{B},\overline{C_i})$
$\overline{C_o}(A,B,C_i) = C_o(\overline{A},\overline{B},\overline{C_i})$
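The inverting property is easy to confirm by enumeration. The following Python sketch checks both identities for every input combination; the function names are illustrative.

```python
from itertools import product

def sum_bit(a, b, ci):
    return a ^ b ^ ci

def carry_out(a, b, ci):
    return (a & b) | (b & ci) | (a & ci)

def inverting_property_holds():
    """Check both identities for every input combination."""
    for a, b, ci in product((0, 1), repeat=3):
        na, nb, nci = 1 - a, 1 - b, 1 - ci
        if 1 - sum_bit(a, b, ci) != sum_bit(na, nb, nci):
            return False
        if 1 - carry_out(a, b, ci) != carry_out(na, nb, nci):
            return False
    return True

print(inverting_property_holds())
```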
COMPLEMENTARY STATIC CMOS FULL ADDER USING 28 TRANSISTORS:
Fig: Static CMOS Full Adder Using 28 Transistors
• The complementary static full adder requires 28 transistors. Hence it consumes a large area
and the circuit is slow.
• Tall PMOS transistor stacks are present in both the carry and sum generation circuits.
• The intrinsic load capacitance of the Co signal is large and consists of two diffusion and six
gate capacitances, plus the wiring capacitance.
• The signal propagates through the inverting stages in the carry generation circuit.
Minimizing the carry path delay is the prime goal of the designer of a high speed adder
circuit.
• The sum generation requires one extra logic stage, but this is not that significant, as the sum
delay appears only once in the propagation delay of the RCA.
MIRROR ADDER CIRCUIT DESIGN:
Fig: Mirror Adder Design of Full Adder
• The NMOS and PMOS chains are completely symmetrical. This guarantees identical
rising and falling transitions if the NMOS and PMOS devices are properly sized. A
maximum of two series transistors can be observed in the carry-generation circuitry.
• When laying out the cell, the most critical issue is the minimization of capacitance at
node Co. The reduction of diffusion capacitance is particularly important.
• The capacitance at node Co is composed of four diffusion capacitances, two internal
gate capacitances and six gate capacitances in the connecting adder cell.
• The transistors connected to Ci are placed closest to the output.
• Only the transistors in the carry stage have to be optimized for speed. All
transistors in the sum stage can be minimal size.
MANCHESTER CARRY CHAIN ADDER:
Fig: Manchester carry chain adder
• A Manchester carry chain adder uses a cascade of pass transistors to implement the
carry chain.
• During the precharge phase (Φ=0), all intermediate nodes of the pass transistor carry
chain are precharged to Vdd.
• During evaluation, a node is discharged when there is an incoming carry and the
propagate signal is high, or when the generate signal of that stage is high.
• The worst case delay of the carry chain is modeled by a linearized RC network.
• Increasing the transistor width reduces the time constant, but it loads the gates in the
previous stage.
• Therefore, the transistor size is limited by the input loading capacitance.
• The distributed RC nature of the carry chain results in a propagation delay that is
quadratic in the number of bits N.
• To avoid this, it is necessary to insert signal-buffering inverters.
• Adding inverters makes the overall propagation delay a linear function of N, as is the
case with ripple carry adders.
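The quadratic versus linear behaviour can be illustrated with a simple RC ladder model. The Python sketch below assumes an Elmore-style estimate with the usual 0.69 factor and identical R and C in every stage; these modelling choices, the function names, and the buffer delay `t_buf` are assumptions for illustration and are not stated in the notes.

```python
def chain_delay(n, r=1.0, c=1.0):
    """Elmore-style delay of an n-stage RC ladder with equal R and C per stage."""
    # Node i sees the resistance of the i transistors between it and the driver.
    return 0.69 * sum(c * (i * r) for i in range(1, n + 1))

def buffered_chain_delay(n, m, t_buf, r=1.0, c=1.0):
    """Split the chain into segments of at most m stages, each followed by a buffer."""
    full, rest = divmod(n, m)
    delay = full * (chain_delay(m, r, c) + t_buf)
    if rest:
        delay += chain_delay(rest, r, c) + t_buf
    return delay

for n in (8, 16, 32):
    print(n, chain_delay(n), buffered_chain_delay(n, m=4, t_buf=2.0))
```

In this model the unbuffered delay grows as N(N+1)/2, while the buffered delay doubles when N doubles.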
LOOK AHEAD ADDER DESIGN
LOOK AHEAD - BASIC IDEA
• Carry look ahead logic uses the concepts of generating and propagating carries.
• A carry-lookahead adder improves speed by reducing the amount of time required to
determine carry bits.
• The carry-lookahead adder calculates one or more carry bits before the sum. This
reduces the wait time to calculate the result of the higher-order bits. The Kogge-Stone
adder and Brent-Kung adder are examples of this type of adder.
• Carry lookahead depends on two things:
- Calculating, for each digit position, whether that position is going to propagate a
carry if one comes in from the right.
- Combining these calculated values to deduce quickly whether, for
each group of digits, that group is going to propagate a carry that comes in
from the right.
Suppose that groups of 4 digits are chosen. Then the sequence of events goes something like
this:
-All 1-bit adders calculate their results. Simultaneously, the lookahead units perform their
calculations.
-Suppose that a carry arises in a particular group. Within at most 5 gate delays, that carry will
emerge at the left-hand end of the group and start propagating through the group to its left.
-If that carry is going to propagate all the way through the next group, the lookahead unit will
already have deduced this. Accordingly, before the carry emerges from the next group the
lookahead unit is immediately (within 1 gate delay) able to tell the next group to the left that
it is going to receive a carry –and, at the same time, to tell the next lookahead unit to the left
that a carry is on its way.
CARRY-LOOK-AHEAD ADDERS:
• Objective - generate all incoming carries in parallel
• Feasible - carries depend only on $x_{n-1}, x_{n-2}, \ldots, x_0$ and $y_{n-1}, y_{n-2}, \ldots, y_0$, information available to
all stages for calculating the incoming carry and sum bit
• Requires a large number of inputs to each stage of the adder - impractical
• Number of inputs at each stage can be reduced - find out from the inputs whether new
carries will be generated and whether they will be propagated.
CARRY PROPAGATION
• If $x_i = y_i = 1$: carry-out generated regardless of incoming carry; no additional
information needed
• If $x_i y_i = 10$ or $x_i y_i = 01$: incoming carry propagated
• If $x_i = y_i = 0$: no carry propagation
• $G_i = x_i y_i$ - generated carry; $P_i = x_i + y_i$ - propagated carry
• $c_{i+1} = x_i y_i + c_i (x_i + y_i) = G_i + c_i P_i$
• Substituting $c_i = G_{i-1} + c_{i-1} P_{i-1}$ gives $c_{i+1} = G_i + G_{i-1} P_i + c_{i-1} P_{i-1} P_i$
• Further substitutions:
$c_{i+1} = G_i + G_{i-1} P_i + G_{i-2} P_{i-1} P_i + c_{i-2} P_{i-2} P_{i-1} P_i = \ldots$
$= G_i + G_{i-1} P_i + G_{i-2} P_{i-1} P_i + \ldots + c_0 P_0 P_1 \cdots P_i$
• All carries can be calculated in parallel from $x_{n-1}, x_{n-2}, \ldots, x_0$, $y_{n-1}, y_{n-2}, \ldots, y_0$ and the forced
carry $c_0$
Mirror implementation of Look Ahead Carry Adder
Look-Ahead: Topology
Carry Output equations for 4-bit Look Ahead Adder
$c_1 = G_0 + c_0 P_0$
$c_2 = G_1 + G_0 P_1 + c_0 P_0 P_1$
$c_3 = G_2 + G_1 P_2 + G_0 P_1 P_2 + c_0 P_0 P_1 P_2$
$c_4 = G_3 + G_2 P_3 + G_1 P_2 P_3 + G_0 P_1 P_2 P_3 + c_0 P_0 P_1 P_2 P_3$
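The four carry equations can be checked against a stage-by-stage carry computation. The Python sketch below evaluates them directly, using the generate and propagate definitions $G_i = x_i y_i$ and $P_i = x_i + y_i$ from these notes; the function names are illustrative.

```python
def cla_carries(x, y, c0=0):
    """Carries c1..c4 of a 4-bit adder from the expanded look-ahead equations."""
    g = [(x >> i) & (y >> i) & 1 for i in range(4)]     # G_i = x_i y_i
    p = [((x >> i) | (y >> i)) & 1 for i in range(4)]   # P_i = x_i + y_i
    c1 = g[0] | (c0 & p[0])
    c2 = g[1] | (g[0] & p[1]) | (c0 & p[0] & p[1])
    c3 = g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2]) | (c0 & p[0] & p[1] & p[2])
    c4 = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
          | (g[0] & p[1] & p[2] & p[3]) | (c0 & p[0] & p[1] & p[2] & p[3]))
    return [c1, c2, c3, c4]

def ripple_carries(x, y, c0=0):
    """Reference: the same carries computed one stage at a time."""
    carries, c = [], c0
    for i in range(4):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        c = (xi & yi) | (c & (xi | yi))
        carries.append(c)
    return carries

assert all(cla_carries(x, y, c) == ripple_carries(x, y, c)
           for x in range(16) for y in range(16) for c in (0, 1))
```

Each carry depends only on the inputs and c0, so in hardware all four can be produced in parallel.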
4-bit module design
Addition can be reduced to a three-step process:
1. Computing bitwise generate (G) and propagate(P) signals- Bitwise PG logic
2. Combining PG signals to determine group generate(G) and propagate(P) signals-
Group PG Logic
3. Calculating the sums- Sum Logic
Fig: 4-bit Carry Look Ahead Adder Module
16-bit Carry Look Ahead Adder design
In general, a CLA using k groups of n bits each has a delay of
$t_{ds} = t_{pg} + t_{pg(n)} + [(n-1) + (k-1)]\,t_{AO} + t_{xor}$
Manchester carry chain implementation of carry bypass adder (carry skip adder)
• Consider the four-bit adder shown in the above fig. Suppose the values of Ak and Bk (k=0…3)
are such that all propagate signals Pk (k=0…3) are high.
• An incoming carry Ci,0=1 propagates under those conditions through the complete
adder chain and causes an outgoing carry Co,3=1. In other words, if (P0 P1 P2 P3
=1) then Co,3 = Ci,0, else either DELETE or GENERATE occurred.
• This information can be used to speed up the operation of the adder, as in the fig.
When BP = P0 P1 P2 P3 = 1, the incoming carry is forwarded immediately to the next
block through the bypass transistor Mb, hence the name carry-bypass adder or
carry-skip adder.
Fig: Manchester carry chain implementation of carry bypass adder
• The figure shows the possible carry propagation paths when the full-adder circuit is
implemented in Manchester carry style. This kind of arrangement speeds up
addition.
• The carry either propagates through the bypass path, or it is generated
somewhere in the chain.
• In both cases, the delay is smaller than in the normal ripple configuration.
Fig.16-bit Carry Bypass adder
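The bypass condition can be modelled in a few lines. The Python sketch below computes the carry out of one block and reports which path produced it; it is a behavioural model only, and the function name and the returned path label are illustrative.

```python
def block_carry_out(a, b, cin, m=4):
    """Carry out of an m-bit block, taking the bypass path when all P_k are 1."""
    p = [((a >> k) ^ (b >> k)) & 1 for k in range(m)]
    if all(p):                     # BP = P0 P1 ... P(m-1) = 1
        return cin, "bypass"       # incoming carry is forwarded unchanged
    carry = cin
    for k in range(m):             # otherwise the carry chain decides
        g = (a >> k) & (b >> k) & 1
        carry = g | (p[k] & carry)
    return carry, "chain"

print(block_carry_out(0b1010, 0b0101, 1))   # (1, 'bypass')
print(block_carry_out(0b1100, 0b0100, 0))   # (1, 'chain')
```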
Propagation delay of carry bypass adder:
• The delay of an N-bit carry skip adder built from stages of M bits each is computed as
$t_p = t_{setup} + M\,t_{carry} + (N/M - 1)\,t_{bypass} + (M-1)\,t_{carry} + t_{sum}$
• t_setup: the fixed overhead time to create the generate and propagate signals.
• t_carry: the propagation delay through a single bit. The worst case carry-propagation
delay through a single stage of M bits is approximately M times
larger.
• t_bypass: the propagation delay through the bypass multiplexer of a single stage.
• t_sum: the time to generate the sum of the final stage.
Fig. ripple adder vs carry bypass adder
Carry-Select Adder:
• In an RCA, every FA cell has to wait for the incoming carry before an outgoing carry is
generated.
• In a carry-select adder, the results for both possible values of the carry input are evaluated in advance.
• Once the real value of the incoming carry is known, the correct result is easily selected
with a simple multiplexer stage.
• This implementation idea is called the carry-select adder.
Fig:carry select adder
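The select step can be sketched behaviourally as follows. This Python model assumes n is a multiple of the block size m; the function name is illustrative. In hardware the two speculative sums of every block are computed in parallel, whereas the loop here only models the selection order.

```python
def carry_select_add(a, b, n=16, m=4, cin=0):
    """n-bit addition in m-bit blocks; each block precomputes results for carry-in 0 and 1."""
    mask = (1 << m) - 1
    result, carry = 0, cin
    for shift in range(0, n, m):
        x, y = (a >> shift) & mask, (b >> shift) & mask
        sum0 = x + y                          # speculative result for carry-in = 0
        sum1 = x + y + 1                      # speculative result for carry-in = 1
        chosen = sum1 if carry else sum0      # multiplexer stage
        result |= (chosen & mask) << shift
        carry = chosen >> m                   # selects the next block's result
    return result, carry

print(carry_select_add(0xFFFF, 1))            # (0, 1)
```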
EC6601: VLSI Design Department of ECE 2018-19
St. Joseph’s College of Engineering / St. Joseph’s Institute of Technology 64
16-bit carry select adder:
Propagation delay of carry select adder
MULTIPLICATION
• Multiplication needs M cycles using an N-bit adder
• In shift and add:
- M partial products are added
- Each partial product is the AND of a multiplier bit and the multiplicand, followed by a 'shift'
PARTIAL PRODUCT GENERATION:
• Logical AND of the multiplicand X and the multiplier bit Yi
• Adding zeros has no impact on the result.
• Can reduce the number of partial products by half!
• E.g. $01111110 = 100000\bar{1}0$, where $\bar{1} = -1$
- So only two partial products need to be added!
- Multiplier word $Y = \sum_{j=0}^{(N-1)/2} Y_j 4^j$ with $Y_j \in \{-2, -1, 0, 1, 2\}$
• This transformation is Booth's recoding
- Leads to fewer additions, with area reduction and higher speed.
- Alternating 10101010 for eight bits is the worst case!
- Multiplying with {-2, -1, 0, 1, 2} versus {1, 0} needs encoding
- Use modified Booth's recoding for a consistent operation size.
Modified Booth's Recoding
Partial product selection table
Multiplier bits    Recoded bits
000                0
001                +Multiplicand
010                +Multiplicand
011                +2 * Multiplicand
100                -2 * Multiplicand
101                -Multiplicand
110                -Multiplicand
111                0
• Bunch the bits from msb to lsb in groups of three with successive overlap
• Assign the multiplier digit as per the table
• The number of partial products is halved
E.g. 01111111 is bunched into
-> 01(1), 11(1), 11(1), 11(0)
-> Multiplier = 10 00 00 0$\bar{1}$, where $\bar{1}$ = -1 (see table)
-> Four partial products are developed instead of eight
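The recoding table can be applied mechanically. The Python sketch below recodes an n-bit two's-complement multiplier (n even) into radix-4 digits by examining overlapping groups of three bits, with an implicit 0 appended to the right of the LSB; the function names are illustrative.

```python
def booth_radix4(y, n):
    """Recode an n-bit two's-complement multiplier (n even) into radix-4 digits.

    Returns the digits most significant first, each in {-2, -1, 0, 1, 2}.
    """
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    y_ext = (y & ((1 << n) - 1)) << 1      # append the implicit 0 right of the LSB
    digits = [table[(y_ext >> j) & 0b111] for j in range(0, n, 2)]
    return digits[::-1]

def booth_value(digits):
    """Evaluate sum(Y_j * 4**j) for digits given most significant first."""
    value = 0
    for d in digits:
        value = 4 * value + d
    return value

print(booth_radix4(0b01111111, 8))   # [2, 0, 0, -1]
print(booth_radix4(0b01111110, 8))   # [2, 0, 0, -2]
```

Each digit selects 0, ±X or ±2X as a partial product, so an 8-bit multiplier yields four partial products.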
THE ARRAY MULTIPLIER:
An array multiplier is a digital combinational circuit that is used for the multiplication of two
binary numbers by employing an array of full adders and half adders. This array is used for
the nearly simultaneous addition of the various product terms involved.
To form the various product terms, an array of AND gates is used before the Adder array. An
array multiplier is a vast improvement in speed over the traditional bit serial multipliers in
which only one full adder along with a storage memory was used to carry out all the bit
additions involved and also over the row serial multipliers in which product rows (also
known as the partial products) were sequentially added one by one via the use of only one
multi-bit adder.
The tradeoff for this extra speed is the extra hardware required to lay down the adder array.
But with the much decreased costs of these adders, this extra hardware has become quite
affordable to a designer. In spite of the vast improvement in speed, there is still a level of
delay that is involved in an array multiplier before the final product is achieved. Before
committing hardware resources to the circuit, it is important for the designer to calculate the
aforementioned delay in order to make sure that the circuit is compatible with the timing
requirements of the user.
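The structure described above, an AND array followed by shifted additions, can be modelled as below. This Python sketch is behavioural only: the row additions use ordinary integer addition instead of explicit full-adder and half-adder rows, and the function name is illustrative.

```python
def array_multiply(x, y, m, n):
    """Multiply m-bit x by n-bit y: n rows of m AND gates, then shifted row sums."""
    # Row i is the multiplicand gated by multiplier bit y_i.
    partial = [[((x >> j) & 1) & ((y >> i) & 1) for j in range(m)]
               for i in range(n)]
    product = 0
    for i, row in enumerate(partial):
        row_value = sum(bit << j for j, bit in enumerate(row))
        product += row_value << i      # the shift is wiring, not logic
    return product

print(array_multiply(13, 11, 4, 4))    # 143
```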
Fig: Array Multiplier
• N partial products of M bits each.
• N x M two-input AND gates; N-1 M-bit adders
• The layout need not be staggered; the routing takes care of the shift
Carry save multiplier:
Fig: Carry save Multiplier
• A large number of almost identical critical paths are present in the array multiplier
• Increasing the performance of the structure through transistor sizing yields
marginal benefits
• A more efficient realization can be obtained by noticing that the multiplication
result does not change when the output carry bits are passed diagonally
downwards instead of only to the right
• An extra adder, called a vector merging adder, is included to generate the final
result
• The resulting multiplier is called a carry save multiplier, because the carry bits
are not immediately added but are rather saved for the next adder stage
• In the final stage, carries and sums are merged in a fast carry propagate adder
stage
• It has the advantage that its worst case critical path is shorter.
The delay of the carry save multiplier is given by the expression below:
$t_{mult} = (N-1)\,t_{carry} + t_{and} + t_{merge}$
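The carry-save idea can be sketched at word level: three operands are compressed into a sum word and a carry word without propagating carries, and only the final merge uses a carry-propagate addition. The Python below is a behavioural sketch with illustrative function names, not a gate-level description.

```python
def carry_save_add(a, b, c):
    """3:2 compression: three operands in, sum word and carry word out."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def carry_save_multiply(x, y, n):
    """Accumulate the n partial products of x*y in carry-save form."""
    s, c = 0, 0
    for i in range(n):
        pp = (x << i) if (y >> i) & 1 else 0
        s, c = carry_save_add(s, c, pp)   # carries are saved, not propagated
    return s + c                          # vector-merging adder

print(carry_save_multiply(13, 11, 4))     # 143
```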
Wallace tree multiplier:
A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two
integers, devised by Australian Computer Scientist Chris Wallace in 1964.
The Wallace tree has three steps:
1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other,
yielding $n^2$ results. Depending on the position of the multiplied bits, the wires carry
different weights; for example, the wire carrying the product of two bits whose positions sum to 7 has weight 128.
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires in two numbers, and add them with a conventional adder.
The second step works as follows. As long as there are three or more wires with the same
weight, add a following layer:
• Take any three wires with the same weight and input them into a full adder. The result
will be an output wire of the same weight and an output wire with a higher weight for
each three input wires.
• If there are two wires of the same weight left, input them into a half adder.
• If there is just one wire left, connect it to the next layer.
• These computations only consider gate delays and do not deal with wire delays, which can
also be very substantial.
• The Wallace tree can also be represented by a tree of 3/2 or 4/2 adders.
• It is sometimes combined with Booth encoding.
The advantages of Wallace Tree multipliers are
1. Substantial hardware saving
2. It offers greater speed
The disadvantage is irregular and inefficient layout.
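The three steps of the Wallace tree can be followed in a small software model that tracks the wires of each weight. The Python below is a sketch under the layer rules listed above (full adder on three wires, half adder on two, single wire passed on); the function name and data layout are illustrative.

```python
def wallace_multiply(a, b, n):
    """Multiply two n-bit unsigned integers with a Wallace-style reduction."""
    # Step 1: AND every bit pair; cols[w] holds the wires of weight 2**w.
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Step 2: add layers while any weight still has three or more wires.
    while any(len(col) >= 3 for col in cols):
        nxt = [[] for _ in range(len(cols) + 1)]
        for w, col in enumerate(cols):
            k = 0
            while len(col) - k >= 3:          # full adder on three wires
                x, y, z = col[k:k + 3]
                nxt[w].append(x ^ y ^ z)
                nxt[w + 1].append((x & y) | (y & z) | (x & z))
                k += 3
            if len(col) - k == 2:             # half adder on two wires
                x, y = col[k:k + 2]
                nxt[w].append(x ^ y)
                nxt[w + 1].append(x & y)
            elif len(col) - k == 1:           # single wire passes through
                nxt[w].append(col[k])
        cols = nxt

    # Step 3: at most two wires per weight remain; add the two numbers.
    first = sum(col[0] << w for w, col in enumerate(cols) if len(col) >= 1)
    second = sum(col[1] << w for w, col in enumerate(cols) if len(col) >= 2)
    return first + second

print(wallace_multiply(13, 11, 4))            # 143
```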
The characteristics of the Wallace tree multiplier include:
• The final adder choice is critical; it depends on the structure of the accumulator array
• Carry look ahead might be good if the data arrives simultaneously
• Place a pipeline stage before the final addition
• In a non-pipelined design, other adders can be used with similar performance and lower
hardware requirements
DIVIDER:
Unsigned non-restoring division:
Input: An n-bit dividend and an m-bit divisor
Output: The quotient and remainder
Begin:
1. Load the divisor and dividend into registers M and D, respectively; clear the partial remainder
register R and set the loop count cnt equal to n-1.
2. Left shift register pair R:D one bit.
3. Compute R = R - M;
4. Repeat
If (R < 0) begin
D[0] = 0; left shift R:D one bit; R = R + M; end
Else begin
D[0] = 1; left shift R:D one bit; R = R - M; end
cnt = cnt - 1; until (cnt == 0)
5. If (R < 0) begin D[0] = 0; R = R + M; end else begin D[0] = 1; end
End
On completion, D holds the quotient and R holds the remainder.
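The steps above can be sketched in Python as follows. Register names follow the algorithm (M, D, R); the function name is illustrative, Python's unbounded signed integers stand in for the signed partial remainder register, and the zero-divisor check is an addition not present in the algorithm as stated.

```python
def nonrestoring_divide(dividend, divisor, n):
    """Unsigned non-restoring division of an n-bit dividend; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    mask = (1 << n) - 1
    m, d, r = divisor, dividend & mask, 0

    def shift_left(r, d):
        # Left shift the register pair R:D; the MSB of D moves into R.
        return 2 * r + (d >> (n - 1)), (d << 1) & mask

    r, d = shift_left(r, d)            # step 2
    r -= m                             # step 3
    for _ in range(n - 1):             # step 4, repeated n-1 times
        if r < 0:
            d &= ~1                    # D[0] = 0
            r, d = shift_left(r, d)
            r += m
        else:
            d |= 1                     # D[0] = 1
            r, d = shift_left(r, d)
            r -= m
    if r < 0:                          # step 5: final quotient bit and correction
        d &= ~1
        r += m
    else:
        d |= 1
    return d, r

print(nonrestoring_divide(7, 2, 4))    # (3, 1)
```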
Fig:Sequential Implementation of Non-Restoring Divider.
BARREL SHIFTER:
Any general purpose n-bit shifter should be able to shift incoming data by up to n-1 places in
either the right shift or the left shift direction. If we further specify that all shifts should be on an end-around
basis, so that any bit shifted out at one end of a data word will be shifted in at the
other end of the word, then the problem of left shift or right shift is greatly eased.
For a 4-bit word, a 1-bit shift right is equal to a 3-bit shift left, a 2-bit shift right is equal to a
2-bit shift left, etc. Thus we can achieve a capability to shift left or right by zero, one, two or
three places by designing a circuit which shifts right only, by zero, one, two or three places.
The barrel shifter is an adaptation of the crossbar switch which recognizes the fact that we can
couple the switch gates together in groups of four and also form four separate groups
corresponding to shifts of zero, one, two and three bits.
The arrangement is readily adapted so that the in-lines also run horizontally. The resulting
arrangement is known as a barrel shifter. The inter-bus switches have their gate inputs
connected in a staircase fashion in groups of four, and there are now four shift control inputs
which must be mutually exclusive in the active state. The structure of the barrel shifter is of high
regularity and generality.
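The end-around equivalence between left and right shifts can be checked with a small behavioural model. In the Python sketch below the one-hot `select` list plays the role of the four mutually exclusive shift control inputs; all names are illustrative.

```python
def rotate_right(word, k, n=4):
    """End-around right shift of an n-bit word by k places."""
    k %= n
    mask = (1 << n) - 1
    word &= mask
    return ((word >> k) | (word << (n - k))) & mask

def rotate_left(word, k, n=4):
    """A left rotation by k equals a right rotation by n - k."""
    return rotate_right(word, (n - k) % n, n)

def barrel_shift(word, select, n=4):
    """select is a one-hot list; select[k] = 1 requests a right rotation by k."""
    assert sum(select) == 1, "shift control inputs must be mutually exclusive"
    return rotate_right(word, select.index(1), n)

print(bin(barrel_shift(0b0110, [0, 1, 0, 0])))   # 0b11
```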

vlsi.pdf important qzn answer for ece department

  • 1.
    EC8095: VLSI DesignDepartment of ECE 2020-2021 St.Joseph’s College of Engineering / St.Joseph’s Institute of Technology 1 UNIT – I INTRODUCTION - BASIC MOS TRANSISTOR The invention of the transistor by William B. Shockley, Walter H. Brattain and John Bardeen of Bell Telephone laboratories was followed by the development of the Integrated circuit (IC) The very first IC emerged at the beginning of 1960 and since that time there have already been 4 generations of ICs 1) SSI ( Small Scale Integration) 2) MSI ( Medium Scale Integration) 3) LSI ( Large Scale Integration) 4) VLSI ( Very Large Scale Integration) Now we see the emergence of the 5th generation, ULSI ( Ultra Large Scale Integration) which is characterized by complexities in excess of 3 million devices on a single IC chip.Within the bounds of MOS technology, the possible circuit realizations may be based on pMOS, nMOS, CMOS and now BiCMOS devices. Although CMOS is the dominant technology, some of the examples used to illustrate the design processes will be presented in nMOS form. The reasons are : 1) For NMOS technology, the design methodology and the design rules are easily learned, thus providing a simple but excellent introduction to structured design for VLSI. 2) nMOS technology and design processes provide an excellent background for other technologies. In particular some familiarity with nMOS allows a relatively easy transition to CMOS technology and design. 3) For GaAs technology some arrangements in relation to logic design are similar to those employed in nMOS technology. Therefore, understanding the basics of nMOS design will assist in the layout of GaAs circuits. BASIC MOS TRANSISTORS nMOS devices are formed in a p-type substrate of moderate doping level. The source and drain regions are formed by diffusing n-type impurities through suitable masks into 3 areas to give the desired n-impurity concentration and give rise to depletion regions which extend mainly in the more lightly doped p-region. 
 Thus, source and drain are isolated from one another by 2 diodes.  Connections to the source and drain are made by a deposited metal layer. . ( Fig a)  A polysilicon gate is deposited on a layer of insulation over the region between source and drain  If the gate is connected to a suitable positive voltage with respect to the source, then the electric field established between the gate and the substrate gives rise to a charge inversion region in the substrate under the gate insulation and a conducting path or channel is formed between source and drain.  Channel may also be established so that it is present under the condition Vgs = 0 by implanting suitable impurities in the region between the insulation and the gate. (fig b)  Substrate is of n-type material and the source and drain diffusions are consequently p-type.(fig c)
  • 2.
    EC8095: VLSI DesignDepartment of ECE 2020-2021 St.Joseph’s College of Engineering / St.Joseph’s Institute of Technology 2 ENHANCEMENT MODE TRANSISTOR ACTION:  In order to establish the channel in the first place a min. voltage level of threshold voltage Vt must be established between gate and source.  Fig (a) indicates the conditions prevailing with the channel established but no current flowing between source and drain (Vds = 0)  Condition: When current flows in the channel by applying a voltage Vds between drain and source.  Corresponding IR drop = Vds along the channel.  This results in the voltage between gate and channel varying with distance along the channel with the voltage being a max. ofVgs at the source end.  Effective voltage Vg = Vgs-Vt, there will be voltage available to invert the channel at the drain end so long as Vgs – Vt>= Vds.  Limiting condition comes when Vds = Vgs – Vt.  For all voltages Vds<Vgs – Vt, the device is in the non-saturated region of operation.  IR drop = Vgs –Vt takes place over less than the whole length of the channel so that over part of the channel, near the drain, there is insufficient electric field available to give rise to inversion layer to create the channel.  Diffusion current completes the path from source to drain causing the channel to exhibit a high resistance known as saturation region. DEPLETION MODE TRANSISTOR ACTION  The channel is established, due to the implant, even when Vgs = 0 and to cause the channel to cease to exist a –ve voltage Vtd must be applied between gate and source. Vtd is typically < -0.8Vdd, depending on the implant and substrate bias, but threshold voltage differences apart. 
Drain to source current Ids versus voltage Vds relationships  The whole concept of the MOS transistor evolves from the use of a voltage on the gate to induce a charge in the channel between source and drain, which may then be caused to move from source to drain under the influence of an electric field created by voltage Vds applied between source and drain.  Since the charge induced is dependent on the gate to source voltage Vgs then Ids is independent on both Vgs and Vds.  Consider a structure in which electrons will flow from source to drain. = , First, transit time ζ sd But velocity ,Where μ = electron or hole mobility (surface) Eds = electric field (drain to source) ; Now , So that , Thus, Typical values of μ at room temp. areμn = 650 cm2 /Vsec ( surface) μp = 240 cm2 /Vsec (surface) Non Saturated region:
  • 3.
    EC8095: VLSI DesignDepartment of ECE 2020-2021 St.Joseph’s College of Engineering / St.Joseph’s Institute of Technology 3  Charge induced in channel due to gate voltage is due to to the voltage difference between the gate and the channel Vgs  Voltage along the channel varies linearly with distance X from source due to the IR drop in the channel.  Assuming the device is not saturated then the average value is Vds/2  Effective gate voltage Vg = Vgs-Vt, Where Vt is the threshold voltage needed to invert the charge under the gate and establish the channel. , Thus induced charge , Where Eg= avg. electric field gate to channel εins = relative permittivity of insulation between gate and channel ε0 = permittivity of free space = 8.85x10-14 Fcm-1 Where D = oxide thickness Thus 3 Combine eqn 2 & 3 in 1 , we have or in the non saturated or resistive region where Vds<Vgs - Vtand /D The factor W/L is of course contributed by the geometry and it is a common practice to write  = K. W/L so that Ids =    2 / ) ( 2 ds V Vds Vt Vgs    4a ( Alternate form of Eqn 4) Gate/Channel Capacitance (parallel plate) Also , so Sometimes it is convenient to use gate capacitance per unit area Co rather than Cg. Noting that Cg = Co WL We may also write , Ids = Co W/L  2 / ) ( 2 ds V Vds Vt Vgs   4c Saturated region: Saturation begins when Vds = Vgs - Vt. Since at this point the IR drop in the channel equals the effective gate to channel voltage at the drain and we may assume that the current remains fairly constant as Vds increases further. Ideal I-V Characteristics Drain current of MOS device in different operating regions. MOS transistors have three regions of operation: • Cutoff or sub-threshold region •Linear region • Saturation region The long-channel model assumes that the current through an OFF transistor is 0.When a transistor turns ON (Vgs>Vt),the gate attracts carriers(electrons) to form a channel. 
The electrons drift from source to drain at a rate proportional to the electric field between these regions. Thus, we can compute currents if we know the amount of charge in the channel and the rate at which it moves. We know that the charge on each plate of a capacitor is Q = CV. Thus, the charge in the channel Qchannel is

$Q_{channel} = C_g (V_{gc} - V_t)$

where Cg is the capacitance of the gate to the channel and Vgc - Vt is the amount of voltage attracting charge to the channel beyond the minimum required to invert from p to n. The gate voltage is referenced to the channel, which is not grounded. If the source is at Vs and the drain is at Vd, the average is Vc = (Vs + Vd)/2 = Vs + Vds/2. Therefore, the mean difference between the gate and channel potentials Vgc is Vg - Vc = Vgs - Vds/2, as shown in Figure 2.5.

We can model the gate as a parallel plate capacitor with capacitance proportional to area over thickness. If the gate has length L and width W and the oxide thickness is tox, as shown in Figure 2.6, the capacitance is

$C_g = k_{ox} \varepsilon_0 \dfrac{W L}{t_{ox}} = \varepsilon_{ox} \dfrac{W L}{t_{ox}} = C_{ox} W L$

where ε0 is the permittivity of free space, 8.85 × 10⁻¹⁴ F/cm, and the permittivity of SiO2 is kox = 3.9 times as great. Often, the εox/tox term is called Cox, the capacitance per unit area of the gate oxide.

Some nanometer processes use a different gate dielectric with a higher dielectric constant. In these processes, tox is the equivalent oxide thickness (EOT), the thickness of a layer of SiO2 that has the same Cox. In this case, tox is thinner than the actual dielectric.

Each carrier in the channel is accelerated to an average velocity, v, proportional to the lateral electric field, i.e., the field between source and drain. The constant of proportionality μ is called the mobility:

$v = \mu E$

The electric field E is the voltage difference between drain and source Vds divided by the channel length:

$E = \dfrac{V_{ds}}{L}$

The time required for carriers to cross the channel is the channel length divided by the carrier velocity: L/v. Therefore, the current between source and drain is the total amount of charge in the channel divided by the time required to cross:

$I_{ds} = \dfrac{Q_{channel}}{L/v} = \mu C_{ox} \dfrac{W}{L}\left(V_{gs} - V_t - \dfrac{V_{ds}}{2}\right)V_{ds} = \beta\left(V_{GT} - \dfrac{V_{ds}}{2}\right)V_{ds}$    (2.5)

The term Vgs - Vt arises so often that it is convenient to abbreviate it as VGT. Equation (2.5) describes the linear region of operation, for Vgs > Vt but Vds relatively small. It is called linear or resistive because when Vds << VGT, Ids increases almost linearly with Vds, just like an ideal resistor. The geometry and technology-dependent parameters are sometimes merged into a single factor β:

$\beta = \mu C_{ox} \dfrac{W}{L}$
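The first-order model described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`beta`, `ids_long_channel`) and the example device values (μ = 350 cm²/V·s, tox = 100 Å, W/L = 2, Vt = 0.7 V) are hypothetical and not taken from these notes. The saturation branch simply holds the current at the value the linear expression reaches when Vds = Vgs - Vt.

```python
EPS0_F_PER_CM = 8.85e-14   # permittivity of free space (value quoted in the notes)
K_OX = 3.9                 # relative permittivity of SiO2 (value quoted in the notes)


def beta(mu_cm2_per_vs, tox_cm, w_over_l):
    """Return beta = mu * Cox * (W/L), with Cox = k_ox * eps0 / tox in F/cm^2."""
    cox = K_OX * EPS0_F_PER_CM / tox_cm
    return mu_cm2_per_vs * cox * w_over_l


def ids_long_channel(vgs, vds, vt, b):
    """First-order nMOS drain current in amperes (assumes vds >= 0)."""
    vgt = vgs - vt
    if vgt <= 0:
        return 0.0                          # cutoff: the model assumes zero current
    if vds < vgt:
        return b * (vgt - vds / 2) * vds    # linear (resistive) region
    return 0.5 * b * vgt ** 2               # saturation: independent of vds


if __name__ == "__main__":
    # Hypothetical device, chosen only for illustration.
    b = beta(350.0, 100e-8, 2.0)            # 100 angstrom = 100e-8 cm
    for vds in (0.1, 0.5, 1.0, 2.0, 3.0):
        ids_ua = ids_long_channel(3.0, vds, 0.7, b) * 1e6
        print(f"Vgs = 3.0 V, Vds = {vds:.1f} V -> Ids = {ids_ua:.1f} uA")
```

Sweeping Vds at a fixed Vgs shows the current rising almost linearly at first and then flattening once Vds reaches Vgs - Vt.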
If Vds > Vdsat ≡ VGT, the channel is no longer inverted in the vicinity of the drain; we say it is pinched off. Beyond this point, called the drain saturation voltage, increasing the drain voltage has no further effect on current. Substituting Vds = Vdsat at this point of maximum current into Eq. (2.5), we find an expression for the saturation current that is independent of Vds:

$I_{ds} = \dfrac{\beta}{2} V_{GT}^2$

This expression is valid for Vgs > Vt and Vds > Vdsat. Thus, long-channel MOS transistors are said to exhibit square-law behavior in saturation.

Two key figures of merit for a transistor are Ion and Ioff. Ion (also called Idsat) is the ON current, Ids, when Vgs = Vds = VDD. Ioff is the OFF current when Vgs = 0 and Vds = VDD. According to the long-channel model, Ioff = 0 and

$I_{on} = I_{dsat} = \dfrac{\beta}{2}(V_{DD} - V_t)^2$

Figure 2.7(a) shows the I-V characteristics for the transistor. According to the first-order model, the current is zero for gate voltages below Vt. For higher gate voltages, current increases linearly with Vds for small Vds. As Vds reaches the saturation point Vdsat = VGT, current rolls off and eventually becomes independent of Vds when the transistor is saturated. pMOS transistors behave in the same way, but with the signs of all voltages and currents reversed. The I-V characteristics are in the third quadrant, as shown in Figure 2.7(b).

Non-Ideal I-V Effects

The saturation current increases less than quadratically with increasing Vgs. This is caused by two effects: velocity saturation and mobility degradation.

• At high lateral field strengths (Vds/L), carrier velocity ceases to increase linearly with field strength. This is called velocity saturation and results in lower Ids than expected at high Vds.
• At high vertical field strengths (Vgs/tox), the carriers scatter off the oxide interface more often, slowing their progress. This mobility degradation effect also leads to less current than expected at high Vgs.
• The saturation current of the nonideal transistor increases somewhat with Vds. This is caused by channel length modulation, in which higher Vds increases the size of the depletion region around the drain and thus effectively shortens the channel.
• Increasing the potential between the source and body raises the threshold through the body effect. Increasing the drain voltage lowers the threshold through drain-induced barrier lowering. Increasing the channel length raises the threshold through the short channel effect.
• When Vgs < Vt, the current drops off exponentially rather than abruptly becoming zero. This is called subthreshold conduction.

The current into the gate Ig is ideally 0. However, as the thickness of gate oxides reduces to only a small number of atomic layers, electrons tunnel through the gate, causing some gate leakage current. The source and drain diffusions are typically reverse-biased diodes and also experience junction leakage into the substrate or well.

Both mobility and threshold voltage decrease with rising temperature. The mobility effect tends to dominate for strongly ON transistors, resulting in lower Ids at high temperature. The threshold effect is most important for OFF transistors, resulting in higher leakage current at high temperature. In summary, MOS characteristics degrade with temperature.
Mobility Degradation and Velocity Saturation

• Carrier drift velocity, and hence current, is proportional to the lateral electric field Elat = Vds/L between source and drain. The constant of proportionality is called the carrier mobility, μ. The long-channel model assumed that carrier mobility is independent of the applied fields.
• A high voltage at the gate of the transistor attracts the carriers to the edge of the channel, causing collisions with the oxide interface that slow the carriers. This is called mobility degradation.
• Carriers approach a maximum velocity vsat when high fields are applied. This phenomenon is called velocity saturation.

Channel Length Modulation

Ideally, Ids is independent of Vds for a transistor in saturation, making the transistor a perfect current source. In practice, the p-n junction between the drain and body forms a depletion region with a width Ld that increases with Vdb. The depletion region effectively shortens the channel length to

$L_{eff} = L - L_d$

Assume the source voltage is close to the body voltage so Vdb = Vds. Hence, increasing Vds decreases the effective channel length. Shorter channel length results in higher current; thus, Ids increases with Vds in saturation. This can be crudely modeled by multiplying EQ (2.10) by a factor of (1 + Vds/VA), where VA is called the Early voltage. In the saturation region,

$I_{ds} = \dfrac{\beta}{2} V_{GT}^2 \left(1 + \dfrac{V_{ds}}{V_A}\right)$

As channel length gets shorter, the effect of channel length modulation becomes relatively more important. Hence, VA is proportional to channel length. This channel length modulation model is a gross oversimplification of nonlinear behavior and is more useful for conceptual understanding than for accurate device modeling.

Threshold Effects

So far, we have treated the threshold voltage as a constant.
However, Vt increases with the source voltage, decreases with the body voltage, decreases with the drain voltage, and increases with channel length. This section models each of these effects.

Body Effect

The body is an implicit fourth terminal. When a voltage Vsb is applied between the source and body, it increases the amount of charge required to invert the channel; hence, it increases the threshold voltage. The threshold voltage can be modeled as

$V_t = V_{t0} + \gamma\left(\sqrt{\phi_s + V_{sb}} - \sqrt{\phi_s}\right)$

where Vt0 is the threshold voltage when the source is at the body potential, ϕs is the surface potential at threshold, and γ is the body effect coefficient, typically in the range 0.4 to 1 V^(1/2).

Drain-Induced Barrier Lowering (DIBL)

The drain voltage Vds creates an electric field that affects the threshold voltage. This drain-induced barrier lowering (DIBL) effect is especially pronounced in short-channel transistors. It can be modeled as

$V_t = V_{t0} - \eta V_{ds}$

where η is the DIBL coefficient, typically on the order of 0.1 (often expressed as 100 mV/V). Drain-induced barrier lowering causes Ids to increase with Vds in saturation, in much the same way as channel length modulation does. This effect can be lumped into a smaller Early voltage VA.

Short Channel Effect

The threshold voltage typically increases with channel length. This phenomenon is especially pronounced for small L where the source and drain depletion regions extend into a significant portion of the channel, and hence is called the short channel effect or Vt rolloff.

Leakage

• Even when transistors are nominally OFF, they leak small amounts of current. Leakage mechanisms include subthreshold conduction between source and drain, gate leakage from the gate to body, and junction leakage from source to body and drain to body.
• Subthreshold conduction is caused by thermal emission of carriers over the potential barrier set by the threshold. Gate leakage is a quantum-mechanical effect caused by tunneling through the extremely thin gate dielectric. Junction leakage is caused by current through the p-n junction between the source/drain diffusions and the body.

Subthreshold Leakage

• The long-channel transistor I-V model assumes current only flows from source to drain when Vgs > Vt. In real transistors, current does not abruptly cut off below threshold, but rather drops off exponentially.
• When the gate voltage is high, the transistor is strongly ON. When the gate falls below Vt, the exponential decline in current appears as a straight line on a logarithmic scale. This regime of Vgs < Vt is called weak inversion.
• The subthreshold leakage current increases significantly with Vds because of drain-induced barrier lowering. There is a lower limit on Ids set by drain junction leakage that is exacerbated by negative gate voltage.
• Subthreshold leakage current is described by EQ (2.42). Ids0 is the current at threshold and is dependent on process and device geometry.

Gate Leakage

According to quantum mechanics, the electron cloud surrounding an atom has a probabilistic spatial distribution. For gate oxides thinner than 15–20 Å, there is a nonnegligible probability that an electron in the gate will find itself on the wrong side of the oxide, where it will get whisked away through the channel. This effect of carriers crossing a thin barrier is called tunneling, and results in leakage current through the gate.

Two physical mechanisms for gate tunneling are called Fowler-Nordheim (FN) tunneling and direct tunneling. FN tunneling is most important at high voltage and moderate oxide thickness and is used to program EEPROM memories. Direct tunneling is most important at lower voltage with thin oxides and is the dominant leakage component. The direct gate tunneling current can be estimated as

$I_{gate} = W A \left(\dfrac{V_{DD}}{t_{ox}}\right)^2 e^{-B t_{ox}/V_{DD}}$

where A and B are technology constants.

Junction Leakage

The p-n junctions between diffusion and the substrate or well form diodes.
The well-to-substrate junction is another diode. The substrate and well are tied to GND or VDD to ensure these diodes do not become forward biased in normal operation. However, reverse-biased diodes still conduct a small amount of current ID:

$I_D = I_S\left(e^{V_D / v_T} - 1\right)$

where IS depends on doping levels and on the area and perimeter of the diffusion region, VD is the diode voltage (e.g., -Vsb or -Vdb), and vT is the thermal voltage. When a junction is reverse biased by significantly more than the thermal voltage, the leakage is just -IS, generally in the 0.1–0.01 fA/μm² range, which is negligible compared to other leakage mechanisms. More significantly, heavily doped drains are subject to band-to-band tunneling (BTBT) and gate-induced drain leakage (GIDL).

Temperature Dependence

Transistor characteristics are influenced by temperature. Carrier mobility decreases with temperature. An approximate relation is

$\mu(T) = \mu(T_r)\left(\dfrac{T}{T_r}\right)^{-k_\mu}$

where T is the absolute temperature, Tr is room temperature, and kμ is a fitting parameter with a typical value of about 1.5. vsat also decreases with temperature, dropping by about 20% from 300 to 400 K. The magnitude of the threshold voltage decreases nearly linearly with temperature and may be approximated by

$V_t(T) = V_t(T_r) - k_{vt}(T - T_r)$

where kvt is typically about 1–2 mV/K. Ion at high VDD decreases with temperature. Subthreshold leakage increases exponentially with temperature.

Conversely, operating a circuit at low temperature offers several advantages:

• Subthreshold leakage is exponentially dependent on temperature, so lower threshold voltages can be used.
• Velocity saturation occurs at higher fields, providing more current.
• As mobility is also higher, these fields are reached at a lower power supply, saving power.
• Depletion regions become wider, resulting in less junction capacitance.

Geometry Dependence

• The layout designer draws transistors with width and length Wdrawn and Ldrawn. The actual gate dimensions may differ by some factors XW and XL.
• The source and drain tend to diffuse laterally under the gate by LD, producing a shorter effective channel length that the carriers must traverse between source and drain. Similarly, WD accounts for other effects that shrink the transistor width. Putting these factors together gives

$L_{eff} = L_{drawn} + X_L - 2L_D$
$W_{eff} = W_{drawn} + X_W - 2W_D$

The factors of two come from lateral diffusion on both sides of the channel.
• Therefore, a transistor drawn twice as long may have an effective length that is more than twice as great. Similarly, two transistors differing in drawn widths by a factor of two may differ in saturation current by more than a factor of two.
• Threshold voltages also vary with transistor dimensions because of the short and narrow channel effects.

Combining threshold changes, effective channel lengths, channel length modulation, and velocity saturation effects, Idsat does not scale exactly as 1/L. In general, when currents must be precisely matched (e.g., in sense amplifiers or A/D converters), it is best to use the same width and length for each device.
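The claim that doubling the drawn length more than doubles the effective length can be checked with a short sketch. It assumes the usual form Leff = Ldrawn + XL - 2·LD described above; the function name and the numeric values (a 0.25 μm drawn length, XL = 0, LD = 0.025 μm) are hypothetical and chosen only to make the effect visible.

```python
def effective_length(l_drawn_um, xl_um, ld_um):
    """Effective channel length: drawn length, plus the mask/etch offset XL,
    minus lateral diffusion LD under the gate from both source and drain."""
    return l_drawn_um + xl_um - 2 * ld_um


if __name__ == "__main__":
    # Hypothetical process values, for illustration only.
    XL_UM, LD_UM = 0.0, 0.025
    short = effective_length(0.25, XL_UM, LD_UM)
    long_ = effective_length(0.50, XL_UM, LD_UM)
    print(f"Leff at Ldrawn = 0.25 um: {short:.3f} um")
    print(f"Leff at Ldrawn = 0.50 um: {long_:.3f} um")
    print(f"ratio of effective lengths: {long_ / short:.2f}")
```

Because the same fixed amount is subtracted from both devices, the ratio of effective lengths exceeds the ratio of drawn lengths, which is one reason matched devices should share identical drawn dimensions.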
Current ratios can be produced by tying several identical transistors in parallel.

CMOS TECHNOLOGIES

CMOS provides an inherently low-power static circuit technology that has the capability of providing a lower power-delay product than comparable design-rule nMOS or pMOS technologies. The four dominant CMOS technologies are:

• p-well process
• n-well process
• twin-tub process
• silicon-on-insulator process

nMOS FABRICATION

• Processing is carried out on a thin wafer cut from a single crystal of silicon of high purity into which the required p-impurities are introduced as the crystal is grown.
• A layer of silicon dioxide (SiO2), typically 1 μm thick, is grown all over the surface of the wafer to protect the surface, act as a barrier to dopants during processing, and provide a generally insulating substrate onto which other layers may be deposited and patterned.
• The surface is now covered with a photoresist, which is deposited onto the wafer and spun to achieve an even distribution of the required thickness.
• The photoresist layer is then exposed to ultraviolet light through a mask which defines those regions into which diffusion is to take place, together with transistor channels.
• These areas are subsequently readily etched away together with the underlying silicon dioxide so that the wafer surface is exposed in the window defined by the mask.
• The remaining photoresist is removed and a thin layer of SiO2 is grown over the entire chip surface, and then polysilicon is deposited on top of this to form the gate structure. The layer consists of heavily doped polysilicon deposited by chemical vapor deposition (CVD).
• Photoresist coating and masking allow the polysilicon to be patterned, and then the thin oxide is removed to expose areas into which n-type impurities are to be diffused.
• Thick oxide is grown over all again and is then masked with photoresist and etched to expose selected areas of the polysilicon gate and the drain and source areas where connections are to be made.
• The whole chip then has metal (Al) deposited over its surface to a thickness typically of 1 μm. This metal layer is then masked and etched to form the required interconnection pattern.
CMOS FABRICATION

The p-well process is widely used in practice, and the n-well process is also popular.

P-well process

• The diffusion must be carried out with special care since the p-well doping concentration and depth will affect the threshold voltages as well as the breakdown voltages of the n-transistors.
• To achieve low threshold voltages (0.6 to 1.0 V) we need either deep well diffusion or high well resistivity.
• But deep wells require larger spacing due to lateral diffusion and therefore a larger chip area.
• The p-wells act as substrates for the n-devices within the parent n-substrate and, provided that voltage polarity restrictions are observed, the two areas are electrically isolated.

Layout Design Rules

Layout design rules describe how small features can be and how closely they can be reliably packed in a particular manufacturing process. Industrial design rules are usually specified in microns. This makes migrating from one process to a more advanced process or a different foundry's process difficult because not all rules scale in the same way.

Mead and Conway popularized scalable design rules based on a single parameter, λ, that characterizes the resolution of the process. λ is generally half of the minimum drawn transistor channel length. This length is the distance between the source and drain of a transistor and is set by the minimum width of a polysilicon wire. Designers often describe a process by its feature size. Feature size refers to minimum transistor length, so λ is half the feature size. For example, a 180 nm process has a minimum polysilicon width (and hence transistor length) of 0.18 μm and uses design rules with λ = 0.09 μm. Lambda-based rules are necessarily conservative because they round up dimensions to an integer multiple of λ.

A conservative but easy-to-use set of design rules for layouts with two metal layers in an n-well process is as follows:

• Metal and diffusion have minimum width and spacing of 4 λ.
• Contacts are 2 λ × 2 λ and must be surrounded by 1 λ on the layers above and below.
• Polysilicon uses a width of 2 λ.
• Polysilicon overlaps diffusion by 2 λ where a transistor is desired and has a spacing of 1 λ away where no transistor is desired.
• Polysilicon and contacts have a spacing of 3 λ from other polysilicon or contacts.
• N-well surrounds pMOS transistors by 6 λ and avoids nMOS transistors by 6 λ.

Transistor dimensions are often specified by their Width/Length (W/L) ratio. For example, the nMOS transistor in Figure 1.39, formed where polysilicon crosses n-diffusion, has a W/L of 4/2. In a 0.6 μm process, this corresponds to an actual width of 1.2 μm and a length of 0.6 μm. Such a minimum-width contacted transistor is often called a unit transistor. pMOS transistors are often wider than nMOS transistors because holes move more slowly than electrons, so the transistor has to be wider to deliver the same current. Figure 1.40(a) shows a unit inverter layout with a unit nMOS transistor and a double-sized pMOS transistor. Figure 1.40(b) shows a schematic for the inverter annotated with Width/Length for each transistor. In digital systems, transistors are typically chosen to have the minimum possible length because short-channel transistors are faster, smaller, and consume less power. Figure 1.40(c) shows a shorthand we will often use, specifying multiples of unit width and assuming minimum length.

Gate Layouts

A line-of-diffusion layout style consists of four horizontal strips: metal ground at the bottom of the cell, n-diffusion, p-diffusion, and metal power at the top. The power and ground lines are often called supply rails. Polysilicon lines run vertically to form transistor gates. Metal wires within the cell connect the transistors appropriately.

Figure 1.41(a) shows such a layout for an inverter. The input A can be connected from the top, bottom, or left in polysilicon. The output Y is available at the right side of the cell in metal. Recall that the p-substrate and n-well must be tied to ground and power, respectively. Figure 1.41(b) shows the same inverter with well and substrate taps placed under the power and ground rails, respectively.

Figure 1.42 shows a 3-input NAND gate. Notice how the nMOS transistors are connected in series while the pMOS transistors are connected in parallel. Power and ground extend 2 λ on each side, so if two gates were abutted the contents would be separated by 4 λ, satisfying design rules. The height of the cell is 36 λ, or 40 λ if the 4 λ space between the cell and another wire above it is counted. All these examples use transistors of width 4 λ.
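Converting λ-based dimensions to physical dimensions is simple arithmetic, sketched below. The dictionary keys and function names are hypothetical labels introduced for this example; the numeric rule values are the ones listed in the design rules above, and λ is taken as half the feature size as stated in the text.

```python
# Lambda-based rule values quoted in the notes (units: lambda).
RULES_LAMBDA = {
    "metal_width": 4,
    "metal_spacing": 4,
    "diffusion_width": 4,
    "contact_size": 2,
    "poly_width": 2,
    "poly_overlap_of_diffusion": 2,
    "poly_to_poly_spacing": 3,
    "nwell_surround_pmos": 6,
}


def lambda_um(feature_size_um):
    """Lambda is half the feature size (minimum drawn transistor length)."""
    return feature_size_um / 2


def to_microns(n_lambda, feature_size_um):
    """Convert a dimension expressed in lambda to microns for a given process."""
    return n_lambda * lambda_um(feature_size_um)


if __name__ == "__main__":
    feature = 0.6  # the 0.6 um process used as the example in the text
    print(f"lambda = {lambda_um(feature):.2f} um")
    print(f"unit transistor W = {to_microns(4, feature):.2f} um, "
          f"L = {to_microns(2, feature):.2f} um")
    print(f"cell height = {to_microns(36, feature):.1f} um "
          f"({to_microns(40, feature):.1f} um including spacing)")
    for name, value in RULES_LAMBDA.items():
        print(f"{name}: {value} lambda = {to_microns(value, feature):.2f} um")
```

Changing `feature` to 0.18 rescales every rule at once, which is the portability that scalable rules are meant to provide.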
UNIT II
COMBINATIONAL CIRCUIT DESIGN

DESIGN PRINCIPLE OF STATIC CMOS DESIGN

Digital CMOS circuits are implemented using either static or dynamic design techniques. In static CMOS, the output is tied to VDD or ground via a low-resistance path (except during switching), and this leads to robust circuit implementations with good noise immunity. In static CMOS design any function can be realized as a sum of products (SOP) or a product of sums (POS). If an SOP function pulls the output high, then an SOP-BAR function will pull the output low. A POS function can pull the output high, while a POS-BAR function can pull the output low, as shown in the figure.

Important properties of static CMOS design:

• At any instant of time, the output of the gate is directly connected to Vss or VDD.
• All functions are composed of either AND'ed or OR'ed subfunctions.
• The AND function is composed of NMOS transistors in series.
• The OR function is composed of NMOS transistors in parallel.
• The gate contains a pull-up network (PUP) and a pull-down network (PDN).
• PUP networks consist of PMOS transistors.
• PDN networks consist of NMOS transistors.
• Each network is the dual of the other network.
• The output of the complementary gate is inverted.

Advantages of static CMOS design:

• Robust in construction.
• Good noise immunity.
• Static logic has no minimum clock rate; the clock can be paused indefinitely.
• Low power consumption.
• For low operating frequencies, CMOS static logic is used to obtain a relatively small die size.

Limitations of static CMOS design:

The main limitation of static circuits is slower speed as compared to dynamic circuits. The reasons are:

1. Increased gate capacitance due to the presence of both PMOS and NMOS transistors.
2. Output depends on the previous cycle inputs due to charges that may be present at internal inputs.
3. Multiple switching of the output within a cycle depending on the input switching pattern.

MOSFETs as Switches

The gate controls the passage of current between the source and the drain. CMOS uses positive logic: VDD is logic '1' and Vss is logic '0'. We turn a transistor on or off using the gate terminal. There are two kinds of CMOS transistors, n-channel transistors and p-channel transistors. An n-channel transistor requires a logic '1' on the gate to make the switch conducting (to turn the transistor on). A p-channel transistor requires a logic '0' on the gate to make the switch conducting (to turn the transistor on). The conventional schematic icon representation along with the switch characteristics is shown in the figure.

Basic CMOS Gates

In this section, the basic gate implementations in static CMOS are presented.

AND Gate

If two N-switches are placed in series, the composite switch constructed by this action is closed (or ON) if both switches are connected to logic '1'. If any one of the switches is at logic '0', the circuit is said to be in the open (or OFF) state. This yields an 'AND' function. The switch logic of the AND function is shown in the figure.

OR Gate

If two N-switches are placed in parallel, the composite switch constructed by this action is closed (or ON) if any one of the switches is connected to logic '1'.

Bubble Pushing

CMOS stages are inherently inverting, so AND and OR functions must be built from NAND and NOR gates. De Morgan's law helps with this conversion:

$\overline{A \cdot B} = \overline{A} + \overline{B}$
$\overline{A + B} = \overline{A} \cdot \overline{B}$

A NAND gate is equivalent to an OR of inverted inputs. A NOR gate is equivalent to an AND of inverted inputs. The same relationship applies to gates with more inputs. Switching between these representations is easy to do on a whiteboard and is often called bubble pushing.

Compound Gates

• Static CMOS also efficiently handles compound gates computing various inverting combinations of AND/OR functions in a single stage.
• The logical effort of each input is the ratio of the input capacitance of that input to the input capacitance of the inverter. For the AOI21 gate, this means the logical effort is slightly lower for the OR terminal (C) than for the two AND terminals (A, B).
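The AOI21 statement above can be checked by counting transistor widths. The sizes used here are an assumption, not given in these notes: they follow the common convention of sizing each gate to match the drive of a unit inverter (nMOS width 1, pMOS width 2, input capacitance 3), which gives series nMOS devices on A and B a width of 2, the parallel nMOS on C a width of 1, and all three pMOS devices a width of 4 because the worst-case pull-up path has two in series. The dictionary and function names are hypothetical.

```python
from fractions import Fraction

# Input capacitance of the reference unit inverter: nMOS width 1 + pMOS width 2.
INVERTER_INPUT_CAP = 1 + 2

# Assumed AOI21 sizing for Y = NOT(A.B + C), matched to unit-inverter drive.
# Each entry is (nMOS width, pMOS width) seen by that input.
AOI21_WIDTHS = {
    "A": (2, 4),   # nMOS in series with B; pMOS in parallel with B, in series with C
    "B": (2, 4),
    "C": (1, 4),   # nMOS alone in its pull-down branch; pMOS in series with A || B
}


def logical_effort(input_name, widths=AOI21_WIDTHS):
    """Logical effort = input capacitance of this input / inverter input capacitance."""
    n_width, p_width = widths[input_name]
    return Fraction(n_width + p_width, INVERTER_INPUT_CAP)


if __name__ == "__main__":
    for name in ("A", "B", "C"):
        print(f"g_{name} = {logical_effort(name)}")
```

Under these assumed sizes the OR terminal C presents less capacitance than the AND terminals, so its logical effort is lower, consistent with the text.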
The parasitic delay is crudely estimated from the total diffusion capacitance on the output node by summing the sizes of the transistors attached to the output.

Input Ordering Delay Effect

The logical effort and parasitic delay of different gate inputs are often different. Other gates, like NANDs and NORs, are nominally symmetric but actually have slightly different logical effort and parasitic delays for the different inputs.

The figure shows a 2-input NAND gate annotated with diffusion parasitics. Consider the falling output transition occurring when one input holds a stable 1 value and the other rises from 0 to 1. If input B rises last, node x will initially be at VDD - Vt ≈ VDD because it was pulled up through the nMOS transistor on input A. The Elmore delay is (R/2)(2C) + R(6C) = 7RC. On the other hand, if input A rises last, node x will initially be at 0 V because it was discharged through the nMOS transistor on input B. No charge must be delivered to node x, so the Elmore delay is simply R(6C) = 6RC.

In general, we define the outer input to be the input closer to the supply rail (e.g., B) and the inner input to be the input closer to the output (e.g., A). The parasitic delay is smallest when the inner input switches last because the intermediate nodes have already been discharged. Therefore, if one signal is known to arrive later than the others, the gate is fastest when that signal is connected to the inner input.

The inner input has a lower parasitic delay. The logical efforts are lower than initial estimates might predict because of velocity saturation. Interestingly, the inner input has a slightly higher logical effort because the intermediate node x tends to rise and cause negative feedback when the inner input turns ON. This effect is seldom significant to the designer because the inner input remains faster over the range of fan-outs used in reasonable circuits.

Asymmetric Gates

When one input is far less critical than another, even nominally symmetric gates can be made asymmetric to favor the late input at the expense of the early one. For example, consider the path in the figure. Under ordinary conditions, the path acts as a buffer between A and Y. When reset is asserted, the path forces the output low. If reset only occurs under exceptional circumstances and can take place slowly, the circuit should be optimized for input-to-output delay at the expense of reset.

The pulldown resistance is R/4 + R/(4/3) = R, so the gate still offers the same drive as a unit inverter. However, the capacitance on input A is only 10/3, so the logical effort is 10/9. This is better than 4/3, which is normally associated with a NAND gate. In the limit of an infinitely large reset transistor and unit-sized nMOS transistor for input A, the logical effort approaches 1, just like an inverter. The improvement in logical effort of input A comes at the cost of much higher effort on the reset input. Note that the pMOS transistor on the reset input is also shrunk. This reduces its diffusion capacitance and parasitic delay at the expense of slower response to reset.

Skewed Gates

In other cases, one input transition is more important than the other. We define HI-skew gates to favor the rising output transition and LO-skew gates to favor the falling output transition. This favoring can be done by decreasing the size of the noncritical transistor. The logical efforts for the rising (up) and falling (down) transitions are called gu and gd, respectively, and are the ratio of the input capacitance of the skewed gate to the input capacitance of an unskewed inverter with equal drive for that transition. Figure (a) shows how a HI-skew inverter is constructed by downsizing the nMOS transistor.
This maintains the same effective resistance for the critical transition while reducing the input capacitance relative to the unskewed inverter of Figure (b), thus reducing the logical effort on that critical transition to gu = 2.5/3 = 5/6. Of course, the improvement comes at the expense of the effort on the noncritical transition. The logical effort for the falling transition is estimated by comparing the inverter to a smaller unskewed inverter with equal pulldown current, shown in Figure (c), giving a logical effort of gd = 2.5/1.5 = 5/3. The degree of skewing (e.g., the ratio of effective resistance for the fast transition relative to the slow transition) impacts the logical efforts and noise margins; a factor of two is common. The figure catalogs HI-skew and LO-skew gates with a skew factor of two. Skewed gates are sometimes denoted with an H or an L on their symbol in a schematic.

P/N Ratios

The pMOS transistors in the unskewed gate are enormous in order to provide equal rise delay. They contribute input capacitance for both transitions, while only helping the rising delay. By accepting a slower rise delay, the pMOS transistors can be downsized to reduce input capacitance and average delay significantly.
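The trade-off can be explored numerically with a simple RC sketch. The model below is an assumption introduced for illustration, not a formula from these notes: an inverter with nMOS width 1 and pMOS width P drives an identical inverter, so the load is proportional to 1 + P; the falling delay is proportional to that load, and the rising delay is larger by the factor (mobility ratio)/P. A mobility ratio of 2 is assumed, and the function names are hypothetical.

```python
import math


def average_delay(p_width, mobility_ratio=2.0):
    """Average of rising and falling delay, in arbitrary RC units."""
    load = 1.0 + p_width                       # gate capacitance of the driven copy
    t_fall = load                              # unit nMOS pulls down
    t_rise = load * mobility_ratio / p_width   # pMOS of width P pulls up
    return (t_rise + t_fall) / 2


def best_p_width(mobility_ratio=2.0, steps=4000):
    """Scan pMOS widths from 0.5 to 3.0 and return the one with least average delay."""
    candidates = [0.5 + 2.5 * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda p: average_delay(p, mobility_ratio))


if __name__ == "__main__":
    p_best = best_p_width()
    gain = 1 - average_delay(p_best) / average_delay(2.0)
    print(f"best pMOS width ~ {p_best:.3f} (sqrt(2) = {math.sqrt(2):.3f})")
    print(f"average delay improvement over P = 2: {gain * 100:.1f}%")
```

In this model the minimum falls at the square root of the mobility ratio, and the gain in average delay over the equal-rise-and-fall sizing is only a few percent, so area, power, and reliability tend to matter more than delay when choosing the ratio.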
Reducing the pMOS size from 2 to √2 for the inverter gives the theoretical fastest average delay, but this delay improvement is only 3%. However, this significantly reduces the pMOS transistor area. It also reduces input capacitance, which in turn reduces power consumption. Unfortunately, it leads to unequal delay between the outputs. Some paths can be slower than average if they trigger the worst edge of each gate. Excessively slow rising outputs can also cause hot electron degradation. And reducing the pMOS size also moves the switching point lower and reduces the inverter's noise margin.

In summary, the P/N ratio of a library of cells should be chosen on the basis of area, power, and reliability, not average delay. For NOR gates, reducing the size of the pMOS transistors significantly improves both delay and area. In most standard cell libraries, the pitch of the cell determines the P/N ratio that can be achieved in any particular gate. Ratios of 1.5–2 are commonly used for inverters.

Multiple Threshold Voltages

Some CMOS processes offer two or more threshold voltages. Transistors with lower threshold voltages produce more ON current, but also leak exponentially more OFF current. Libraries can provide both high- and low-threshold versions of gates. The low-threshold gates can be used sparingly to reduce the delay of critical paths. Skewed gates can use low-threshold devices on only the critical network of transistors.

Delay Estimation

Estimation of the delay of a Boolean function from its functional description is an important step towards design exploration at the register transfer level (RTL). The problem is to estimate the delay of certain optimal multi-level implementations of combinational circuits, given only their functional description.
- tpdr: rising propagation delay, from input to rising output crossing VDD/2
- tpdf: falling propagation delay, from input to falling output crossing VDD/2
- tpd: average propagation delay, tpd = (tpdr + tpdf)/2
- tr: rise time, from output crossing 20% to 80% of VDD
- tf: fall time, from output crossing 80% to 20% of VDD
- tcdr: rising contamination delay, minimum time from input to rising output crossing VDD/2
- tcdf: falling contamination delay, minimum time from input to falling output crossing VDD/2
- tcd: average contamination delay, tcd = (tcdr + tcdf)/2
RC delay models are used to estimate delay. C is the total capacitance on the output node and R is the effective resistance of the driving transistors, so tpd = RC. Transistors are characterized by finding their effective R.
Transistor sizing:
- Not all gates need to have the same delay.
- Not all inputs to a gate need to have the same delay.
- Adjust transistor sizes to achieve the desired delay.
Logical effort
Logical effort is a gate delay model that takes transistor sizes into account. It allows us to optimize transistor sizes over combinational networks. It is not as accurate for circuits with reconvergent fanout.
Logical effort gate delay model
- Delays are expressed in a process-independent unit.
- Gate delay is measured in units of the minimum-size inverter delay τ: d = dabs / τ, where τ = 3RC ≈ 12 ps in a 180 nm process and 40 ps in a 0.6 µm process.
- Gate delay formula: d = f + p.
- Effort delay f is related to the gate's load.
- Parasitic delay p depends on the gate's structure. It represents the delay of the gate driving no load and is set by internal parasitic capacitance.
Effort delay
- Effort delay has two components: f = gh.
- Electrical effort h is determined by the gate's load: h = Cout/Cin. It is sometimes called fanout.
- Logical effort g is determined by the gate's structure. It measures the relative ability of the gate to deliver current; g ≡ 1 for an inverter.
Delay plots:
Computing Logical Effort
Logical effort is the ratio of the input capacitance of a gate to the input capacitance of an inverter delivering the same output current. It can be measured from delay versus fanout plots or estimated by counting transistor widths.
Circuit families and their comparison:
The method of logical effort does not apply to arbitrary transistor networks, but only to logic gates. A logic gate has one or more inputs and one output, subject to the following restrictions:
- The gate of each transistor is connected to an input, a power supply, or the output; and
- Inputs are connected only to transistor gates.
The first condition rules out multiple logic gates masquerading as one, and the second keeps inputs from being connected to transistor sources or drains, as in transmission gates without explicit drivers.
Pseudo-NMOS circuits
Static CMOS gates are slowed because an input must drive both NMOS and PMOS transistors. In any transition, either the pullup or pulldown network is activated, meaning the input capacitance of the inactive network loads the input.
Moreover, PMOS transistors have poor mobility and must be sized larger to achieve comparable rising and falling delays, further increasing input capacitance. Pseudo-NMOS and dynamic gates offer improved speed by removing the PMOS transistors from loading the input. Pseudo-NMOS gates resemble static gates, but replace the slow PMOS pullup stack with a single grounded PMOS transistor which acts as a pullup resistor. The effective pullup resistance should be large enough that the NMOS transistors can pull the output to near ground, yet low enough to rapidly pull the output high. The figure shows several pseudo-NMOS gates ratioed such that the pulldown transistors are about four times as strong as the pullup. The logical effort follows from considering the output current and input capacitance compared to the reference inverter. Sized as shown, the PMOS transistors produce 1/3 of the current of the reference inverter and the NMOS transistor stacks produce 4/3 of the current of the reference inverter.
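The arithmetic behind these logical efforts can be checked with a short sketch. It assumes the usual logical-effort convention that the reference inverter has an input capacitance of 3 (nMOS width 1 plus pMOS width 2) at unit output current, and it takes the input capacitances 4/3 (inverter or NOR input) and 8/3 (2-input NAND input) from the sizing implied by the table of logical efforts. The function and variable names are illustrative only.

```python
from fractions import Fraction as F

# Assumption: reference inverter input capacitance is 3 at unit output current.
REF_INVERTER_CIN = F(3)


def pseudo_nmos_effort(cin, pulldown=F(4, 3), pullup=F(1, 3)):
    """Return (rising, falling, average) logical effort for one gate input.

    Logical effort = input capacitance of the gate divided by the input
    capacitance of a reference inverter delivering the same output current.
    """
    i_fall = pulldown - pullup   # the pullup fights the pulldown
    i_rise = pullup              # only the pullup drives a rising output
    g_fall = cin / (REF_INVERTER_CIN * i_fall)
    g_rise = cin / (REF_INVERTER_CIN * i_rise)
    return g_rise, g_fall, (g_rise + g_fall) / 2


nor_effort = pseudo_nmos_effort(F(4, 3))    # any number of NOR inputs
nand2_effort = pseudo_nmos_effort(F(8, 3))  # series nMOS widths doubled
print("NOR  :", nor_effort)
print("NAND2:", nand2_effort)
```

Because each NOR input sees the same single pulldown transistor, the result does not change with the number of inputs, which is the property that makes wide pseudo-NMOS NOR gates attractive.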
For falling transitions, the output current is the pulldown current minus the pullup current which is fighting the pulldown, 4/3 − 1/3 = 1. For rising transitions, the output current is just the pullup current, 1/3. The inverter and NOR gate have an input capacitance of 4/3. The falling logical effort is this input capacitance divided by that of a reference inverter with the same output current, gd = (4/3)/3 = 4/9. The rising logical effort is three times greater, gu = 4/3, because the rising current is one third of the falling current. The average logical effort is g = (4/9 + 4/3)/2 = 8/9. This is independent of the number of inputs, explaining why pseudo-NMOS is a way to build fast wide NOR gates.
Logical effort g of pseudo-NMOS gates:
Gate type    Rising    Falling    Average
2-NAND       8/3       8/9        16/9
3-NAND       4         4/3        8/3
4-NAND       16/3      16/9       32/9
n-NOR        4/3       4/9        8/9
n-mux        8/3       8/9        16/9
Pass Transistor Logic:
A pass transistor is a MOS transistor whose gate is driven by a control signal. The drain is the input (in), connected to a constant or variable voltage, and the source is the output (out). When the control signal is high, the input is passed to the output; when the control signal is low, the output floats. Circuits with this topology are called pass transistor circuits. Pass transistor logic reduces the number of transistors needed to implement a function by allowing the primary inputs to drive source and drain terminals as well as gate terminals. In complementary CMOS logic, primary inputs are allowed to drive only gate terminals. The figure shows the implementation of the AND function using only NMOS pass transistors. In this gate, if the B input is high, the left NMOS is turned ON and copies the input A to the output F. When B is low, the right NMOS pass transistor is turned ON and passes a '0' to the output F. This satisfies the truth table of the AND gate, reproduced in the table below for verification.
'OR' gate using pass transistor logic
The truth table of the 'OR' gate is shown in the table below. The figure below shows the implementation of the OR function using NMOS transistors only.
In this gate, if the B input is high, the right NMOS is turned ON and copies logic 1 to F; this operation is not affected by the A input. When B is low, the left NMOS is turned ON and the logic value of A is copied to the output F.
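The two pass-transistor gates described above can be checked against their truth tables with a small behavioural sketch. It models each NMOS as an ideal switch and assumes the second transistor in each gate is controlled by the complement of B, which the description implies but does not state. The function names are illustrative.

```python
def pass_and(a, b):
    """AND gate: B=1 turns on the left nMOS (copies A), B=0 turns on the right nMOS (passes 0)."""
    return a if b else 0


def pass_or(a, b):
    """OR gate: B=1 turns on the right nMOS (passes 1), B=0 turns on the left nMOS (copies A)."""
    return 1 if b else a


print(" A B | AND OR")
for a in (0, 1):
    for b in (0, 1):
        print(f" {a} {b} |  {pass_and(a, b)}   {pass_or(a, b)}")
```

The model is purely logical; it ignores the degraded high level that a real NMOS pass transistor produces.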
Advantages:
- Fewer transistors are required to implement a given function.
- Lower capacitance because of the reduced number of transistors.
- There is no path from VDD to GND, so they do not dissipate standby power (static power dissipation).
Drawback:
As discussed, NMOS devices are effective at passing a strong '0' but poor at pulling a node to VDD. Hence, when the pass transistor pulls a node to logic high, the output only rises to VDD − VTh. This is the major disadvantage of pass transistors. Pass transistor logic (PTL) circuits are often superior to standard CMOS circuits in terms of layout density, circuit delay and power consumption.
Transmission Gate Logic:
Transmission gate logic is used to solve the voltage drop problem of pass transistor logic. This technique uses the complementary properties of NMOS and PMOS transistors: NMOS devices pass a strong '0' but a weak '1', while PMOS transistors pass a strong '1' but a weak '0'. The transmission gate combines the best of the two devices by placing an NMOS transistor in parallel with a PMOS transistor, as shown in the figure below. The control signals to the transmission gate, C and ~C, are complementary to each other. The transmission gate is essentially a bidirectional switch enabled by the gate signal C. When C = 1, both MOSFETs are ON and the signal passes through the gate, i.e. A = B if C = 1. C = 0 cuts off both MOSFETs, creating an open circuit between nodes A and B.
Basic Structure:
The basic structure of the transmission gate is shown in the figure below, which consists of NMOS and PMOS transistors. Here, VG is applied to the NMOS and (VDD − VG) is applied to the PMOS. The transmission gate works as a voltage-controlled switch. When VG is high, both the NMOS and the PMOS conduct, so the switch is closed and a conduction path exists between the left and right sides.
When VG is low, then the MOSFETs are in cutoff and switch is open. Therefore, there is no direct relationship between VA and VB. Figure below shows the symbol of transmission gate controlled by switching signals X and X* that are applied to the gates of NMOS and PMOS respectively.
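A rough numerical sketch of why the transmission gate removes the threshold drop is given below. The supply and threshold values are hypothetical, body effect is ignored, and each device is treated as an ideal switch whose output level is simply clipped by its threshold voltage.

```python
VDD = 1.8   # hypothetical supply voltage (V)
VTN = 0.4   # hypothetical nMOS threshold (V)
VTP = 0.4   # hypothetical pMOS threshold magnitude (V)


def nmos_pass(vin, on):
    """nMOS pass transistor: strong 0, but a high level is clipped at VDD - VTN."""
    return min(vin, VDD - VTN) if on else None   # None models a floating output


def pmos_pass(vin, on):
    """pMOS pass transistor: strong 1, but a low level is clipped at |VTP|."""
    return max(vin, VTP) if on else None


def transmission_gate(vin, c):
    """Parallel nMOS and pMOS: the device that passes vin undegraded sets the output."""
    if not c:
        return None
    candidates = (nmos_pass(vin, True), pmos_pass(vin, True))
    return min(candidates, key=lambda v: abs(v - vin))


for vin in (0.0, VDD):
    print(f"vin={vin:.1f}  nMOS only={nmos_pass(vin, True):.1f}  "
          f"pMOS only={pmos_pass(vin, True):.1f}  "
          f"TG={transmission_gate(vin, True):.1f}")
```

With the control signal low, all three models return `None`, which stands for the open-circuit (floating) condition described above.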
The circuit is constructed from a parallel connection of a PMOS and an NMOS transistor with their drain and source terminals shorted together. The gate terminals use two select signals, s and its complement; when s is high, the transmission gate passes the signal on its input. The main advantage of the transmission gate is that it eliminates the threshold voltage drop. Transmission gates are used as:
- a multiplexing element or path selector
- a latch element
- an analog switch
- a voltage-controlled resistor connecting the input and output
2:1 MUX using transmission gates:
A 2:1 multiplexer is shown in the figure below. This gate selects either input A or input B on the basis of the value of the control signal C. When C is logic high the output is equal to input A, and when C is logic low the output is equal to input B. A 2:1 multiplexer can be implemented using transmission gates. The figure below shows the connection diagram of the 2:1 multiplexer using transmission gates. This is equivalent to implementing the Boolean function F = (A · C + B · ~C). When the control signal C is high, the upper transmission gate is ON and passes A, so the output = A. When the control signal C is low, the upper transmission gate turns OFF and blocks A; at the same time the lower transmission gate is ON and passes B, so the output = B.
DYNAMIC CMOS LOGIC
Ratioed circuits reduce the input capacitance by replacing the pMOS transistors connected to the inputs with a single resistive pullup. The drawbacks of ratioed circuits include slow rising transitions, contention on the falling transitions, static power dissipation, and a nonzero VOL. Dynamic circuits circumvent these drawbacks by using a clocked pullup transistor rather than a pMOS that is always ON.
The figure compares (a) static CMOS, (b) pseudo-nMOS, and (c) dynamic inverters. Dynamic circuit operation is divided into two modes, precharge and evaluation, as shown in the figure.
Dynamic circuits are the fastest commonly used circuit family because they have lower input capacitance and no contention during switching. They also have zero static power dissipation. However, they require careful clocking, consume significant dynamic power, and are sensitive to noise during evaluation. In the figure, if the input A is 1 during precharge, contention will take place because both the pMOS and nMOS transistors will be ON. When the input cannot be guaranteed to be 0 during precharge, an extra clocked evaluation transistor can be added to the bottom of the nMOS stack to avoid contention, as shown in the figure. The extra transistor is sometimes called a foot. The figure estimates the falling logical effort of both footed and unfooted dynamic gates. As usual, the pulldown transistors' widths are chosen to give unit resistance. Precharge occurs while the gate is idle and often may take place more slowly. Therefore, the precharge transistor width is chosen for twice unit resistance. This reduces the capacitive load on the clock and the parasitic capacitance at the expense of greater rising delays. We see that the logical efforts are very low. Footed gates have higher logical effort than their unfooted counterparts but are still an improvement over static logic. In practice, the logical effort of footed gates is better than predicted because velocity saturation means series nMOS transistors have less resistance than we have estimated. The size of the foot can be increased relative to the other nMOS transistors to reduce the logical effort of the other inputs at the expense of greater clock loading. Like pseudo-nMOS gates, dynamic gates are particularly well suited to wide NOR functions or multiplexers because the logical effort is independent of the number of inputs. A fundamental difficulty with dynamic circuits is the monotonicity requirement.
While a dynamic gate is in evaluation, the inputs must be monotonically rising. That is, the input can start LOW and remain LOW, start LOW and rise HIGH, or start HIGH and remain HIGH, but not start HIGH and fall LOW. The figure shows waveforms for a footed dynamic inverter in which the input violates monotonicity. During precharge, the output is pulled HIGH. When the clock rises, the input is HIGH, so the output is discharged LOW through the pulldown network, as expected of an inverter. The input later falls LOW, turning off the pulldown network. However, the precharge transistor is also OFF during evaluation, so the output floats and stays LOW instead of rising as it would in a static inverter.
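The effect of a monotonicity violation can be reproduced with a simple cycle-by-cycle behavioural model of a footed dynamic inverter. This is a sketch under idealised assumptions (no leakage, no charge sharing, one sample per time step); the function and signal names are illustrative.

```python
def dynamic_inverter(clk_seq, a_seq):
    """Footed dynamic inverter: precharge when clk=0, conditionally discharge when clk=1."""
    y = 1
    trace = []
    for clk, a in zip(clk_seq, a_seq):
        if clk == 0:
            y = 1      # precharge: clocked pMOS pulls the output HIGH
        elif a == 1:
            y = 0      # evaluate: pulldown path conducts and discharges the output
        # otherwise the output node floats and keeps its previous value
        trace.append(y)
    return trace


def static_inverter(a_seq):
    return [1 - a for a in a_seq]


clk         = [0, 1, 1, 1, 0]
a_violating = [1, 1, 0, 0, 0]   # input falls HIGH -> LOW during evaluation

print("dynamic:", dynamic_inverter(clk, a_violating))
print("static :", static_inverter(a_violating))
```

After the input falls, the static inverter output returns HIGH, whereas the dynamic output stays LOW until the next precharge, which is exactly the incorrect behaviour the monotonicity rule is meant to prevent.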
The output of a dynamic gate begins HIGH and monotonically falls LOW during evaluation. This monotonically falling output X is not a suitable input to a second dynamic gate expecting monotonically rising signals.
CMOS Domino Logic
The monotonicity problem can be solved by placing a static CMOS inverter between dynamic gates, as shown in the figure. This converts the monotonically falling output into a monotonically rising signal suitable for the next gate. The dynamic-static pair together is called a domino gate because precharge resembles setting up a chain of dominos and evaluation causes the gates to fire like dominos tipping over, each triggering the next. A single clock can be used to precharge and evaluate all the logic gates within the chain. The dynamic output is monotonically falling during evaluation, so the static inverter output is monotonically rising. Therefore, the static inverter is usually a HI-skew gate to favor this rising output. In general, more complex inverting static CMOS gates such as NANDs or NORs can be used in place of the inverter. This mixture of dynamic and static logic is called compound domino. Domino gates are inherently noninverting, while some functions like XOR gates necessarily require inversion. Three methods of addressing this problem include pushing inversions into static logic, delaying clocks, and using dual-rail domino logic. The second of these directly cascades dynamic gates without the static CMOS inverter, delaying the clock to the later gates to ensure the inputs are monotonic during evaluation.
Domino circuits
Pseudo-NMOS gates eliminate the bulky PMOS transistors loading the inputs, but pay the price of quiescent power dissipation and contention between the pullup and pulldown transistors. Dynamic gates offer even better logical effort and lower power consumption by using a clocked precharge transistor instead of a pullup that is always conducting. The dynamic gate is precharged HIGH and then may evaluate LOW through an NMOS stack. Unfortunately, if one dynamic inverter directly drives another, a race can corrupt the result. When the clock rises, both outputs have been precharged HIGH. The HIGH input to the first gate causes its output to fall, but the second gate's output also falls in response to its initial HIGH input. The circuit therefore produces an incorrect result because the second output will never rise during evaluation, as shown in Figure 10.3. Domino circuits solve this problem by using inverting static gates between dynamic gates so that the input to each dynamic gate is initially LOW. The falling dynamic output and rising static output ripple through a chain of gates like a chain of toppling dominos.
In summary, domino logic runs 1.5 to 2 times faster than static CMOS logic because dynamic gates present a much lower input capacitance for the same output current and have a lower switching threshold, and because the inverting static gate can be skewed to favor the critical monotonically rising evaluation edges. The figure shows some domino gates. Each domino gate consists of a dynamic gate followed by an inverting static gate. The static gate is often but not always an inverter. Since the dynamic gate's output falls monotonically during evaluation, the static gate should be skewed high to favor its monotonically rising output. A dynamic gate may be designed with or without a clocked evaluation transistor; the extra transistor slows the gate but eliminates any path between power and ground during precharge when the inputs are still high.
Dual-Rail Domino Logic:
Dual-rail domino gates encode each signal with a pair of wires. The input and output signal pairs are denoted with sig_h and sig_l, respectively. The table summarizes the encoding. The sig_h wire is asserted to indicate that the output of the gate is "high" or 1. The sig_l wire is asserted to indicate that the output of the gate is "low" or 0.
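The sig_h / sig_l encoding can be illustrated with a small behavioural sketch. It assumes the usual convention that neither wire is asserted while the gate is precharged and that asserting both is illegal. The dual-rail AND gate is included only as an example of a gate that computes both the true and the complementary output; all names are hypothetical.

```python
PRECHARGED = (0, 0)   # (sig_h, sig_l): neither rail asserted


def encode(bit):
    """Map a logic value onto the (sig_h, sig_l) pair."""
    return (1, 0) if bit else (0, 1)


def decode(pair):
    """Return 0 or 1, or None while the pair is still precharged."""
    if pair == PRECHARGED:
        return None
    if pair == (1, 1):
        raise ValueError("both rails asserted: invalid dual-rail state")
    return pair[0]


def dual_rail_and(a, b, clk):
    """Dual-rail AND/NAND: outputs stay precharged when clk=0, evaluate when clk=1."""
    if clk == 0:
        return PRECHARGED
    (a_h, a_l), (b_h, b_l) = a, b
    y_h = a_h & b_h   # true rail fires only if both inputs are 1
    y_l = a_l | b_l   # complementary rail fires if either input is 0
    return (y_h, y_l)


for a in (0, 1):
    for b in (0, 1):
        out = dual_rail_and(encode(a), encode(b), clk=1)
        print(a, b, "->", out, "decoded:", decode(out))
```

Because exactly one rail rises once the inputs have arrived, the outputs are monotonically rising for both the true and the inverted function, which is what lets dual-rail domino implement inverting logic.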
When the gate is precharged, neither sig_h nor sig_l is asserted. The pair of lines should never both be asserted simultaneously during correct operation. Dual-rail domino gates accept both true and complementary inputs and compute both true and complementary outputs, as shown in the figure. Observe that this is identical to the static CVSL circuits except that the cross-coupled pMOS transistors are instead connected to the precharge clock. Therefore, dual-rail domino can be viewed as a dynamic form of CVSL, sometimes called DCVS. The figures show a dual-rail AND/NAND gate and a dual-rail XOR/XNOR gate. The gates are shown with clocked evaluation transistors, but can also be unfooted. Dual-rail domino is a complete logic family in that it can compute all inverting and noninverting logic functions. However, it requires more area, wiring, and power. Dual-rail structures also lose the efficiency of wide dynamic NOR gates because they require complementary tall dynamic NAND stacks. Dual-rail domino signals not only the result of a computation but also when the computation is done. Before computation completes, both rails are precharged. When the computation completes, one rail will be asserted. A NAND gate can be used for completion detection, as shown in the figure. This is particularly useful for asynchronous circuits.
Keepers
Dynamic circuits also suffer from charge leakage on the dynamic node. If a dynamic node is precharged high and then left floating, the voltage on the dynamic node will drift over time due to subthreshold, gate, and junction leakage. The time constants tend to be in the millisecond to nanosecond range, depending on process and temperature. This problem is analogous to leakage in dynamic RAMs.
Moreover, dynamic circuits have poor input noise margins. If the input rises above Vt while the gate is in evaluation, the input transistors will turn on weakly and can incorrectly discharge the output. Both leakage and noise margin problems can be addressed by adding a keeper circuit. The figure shows a conventional keeper on a domino buffer. The keeper is a weak transistor that holds, or staticizes, the output at the correct level when it would otherwise float. When the dynamic node X is high, the output Y is low and the keeper is ON to prevent X from floating. When X falls, the keeper initially opposes the transition, so it must be much weaker than the pulldown network. Eventually Y rises, turning the keeper OFF and avoiding static power dissipation. The keeper must be strong (i.e., wide) enough to compensate for any leakage current drawn when the output is floating and the pulldown stack is OFF. Strong keepers also improve the noise margin because when the inputs are slightly above Vt the keeper can supply enough current to hold the output high.
NP and Zipper Domino
Another variation on domino is shown in the figure. The HI-skew inverting static gates are replaced with predischarged dynamic gates using pMOS logic. For example, a footed dynamic p-logic NAND gate is shown in the figure. When φ is 0, the first and third stages precharge high while the second stage predischarges low. When φ rises, all the stages evaluate. Domino connections are possible, as shown in the figure. The design style is called NP Domino or NORA (NO RAce) Domino. NORA has two major drawbacks. The logical effort of footed p-logic gates is generally worse than that of HI-skew gates (e.g., 2 vs. 3/2 for NOR2 and 4/3 vs. 1 for NAND2). Secondly, NORA is extremely susceptible to noise.
In an ordinary dynamic gate, the input has a low noise margin (about Vt), but is strongly driven by a static CMOS gate. The floating dynamic output is more prone to noise from coupling and charge sharing, but drives another static CMOS gate with a larger noise margin. In NORA, however, the sensitive dynamic inputs are driven by noise-prone dynamic outputs. Given these drawbacks and the extra clock phase required, there is little reason to use NORA. Zipper domino is a closely related technique that leaves the precharge transistors slightly ON during evaluation by using precharge clocks that swing between 0 and VDD − |Vtp| for the pMOS precharge and between Vtn and VDD for the nMOS precharge. This plays much the same role as a keeper.
THE STATIC AND DYNAMIC POWER DISSIPATION IN CMOS CIRCUITS
Static CMOS gates are very power-efficient because they dissipate nearly zero power while idle. For much of the history of CMOS design, power was a secondary consideration behind speed and area for many chips. As transistor counts and clock frequencies have increased, power consumption has skyrocketed and is now a primary design constraint. The instantaneous power P(t) drawn from the power supply is proportional to the supply current iDD(t) and the supply voltage VDD:
$P(t) = i_{DD}(t)\,V_{DD}$
The energy consumed over some time interval T is the integral of the instantaneous power:
$E = \int_0^T i_{DD}(t)\,V_{DD}\,dt$
The average power over this interval is
$P_{avg} = \frac{E}{T} = \frac{1}{T}\int_0^T i_{DD}(t)\,V_{DD}\,dt$
Power dissipation in CMOS circuits comes from two components.
Static dissipation due to
- subthreshold conduction through OFF transistors
- tunneling current through gate oxide
- leakage through reverse-biased diodes
- contention current in ratioed circuits
Dynamic dissipation due to
- charging and discharging of load capacitances
- "short circuit" current while both pMOS and nMOS networks are partially ON
Ptotal = Pstatic + Pdynamic
Static Dissipation
Considering the static CMOS inverter shown in the figure, if the input = '0', the associated nMOS transistor is OFF and the pMOS transistor is ON. The output voltage is VDD, or logic '1'. When the input = '1', the associated nMOS transistor is ON and the pMOS transistor is OFF.
The output voltage is 0 volts (GND). Note that one of the transistors is always OFF when the gate is in either of these logic states. Ideally, no current flows through the OFF transistor, so the power dissipation is zero when the circuit is quiescent, i.e., when no transistors are switching. Zero quiescent power dissipation is a principal advantage of CMOS over competing transistor technologies. However, secondary effects including subthreshold conduction, tunneling, and leakage lead to small amounts of static current flowing through the OFF transistor. Assuming the leakage current is constant, so that instantaneous and average power are the same, the static power dissipation is the product of total leakage current and the supply voltage:
$P_{static} = I_{static}\,V_{DD}$
OFF transistors still conduct a small amount of subthreshold current. As subthreshold current is exponentially dependent on threshold voltage, it is increasing dramatically as threshold voltages have scaled down. There is also some small static dissipation due to reverse-biased diode leakage between diffusion regions, wells, and the substrate. In modern processes, diode leakage is generally much smaller than the subthreshold or gate leakage and may be neglected.
Dynamic Dissipation
Over any given interval of time T, the load will be charged and discharged T·fsw times. Current flows from VDD to the load to charge it. Current then flows from the load to GND during discharge. In one complete charge/discharge cycle, a total charge of Q = C·VDD is thus transferred from VDD to GND. The average dynamic power dissipation is
$P_{dynamic} = \frac{1}{T}\int_0^T i_{DD}(t)\,V_{DD}\,dt = \frac{V_{DD}}{T}\int_0^T i_{DD}(t)\,dt$
$P_{dynamic} = \frac{V_{DD}}{T}\left[T f_{sw} C V_{DD}\right] = C V_{DD}^2 f_{sw}$
Because most gates do not switch every clock cycle, it is often more convenient to express the switching frequency fsw as an activity factor α times the clock frequency f. The dynamic power dissipation may then be rewritten as
$P_{dynamic} = \alpha C V_{DD}^2 f$
- A clock has an activity factor of α = 1, because it rises and falls every cycle. Most data has a maximum activity factor of 0.5 because it transitions only once each cycle.
- Static CMOS logic has been empirically determined to have activity factors closer to 0.1 because some gates maintain one output state more often than another.
- Because the input rise/fall time is greater than zero, both nMOS and pMOS transistors will be ON for a short period of time while the input is between Vtn and VDD − |Vtp|. This results in an additional "short circuit" current pulse from VDD to GND and typically increases power dissipation by about 10%.
Methods to reduce dynamic power dissipation
1. Reduce the product of capacitance and its switching frequency.
2. Eliminate logic switching that is not necessary for computation.
3. Reduce the activity factor.
4. Reduce the supply voltage.
Methods to reduce static power dissipation
1. By using multiple threshold voltages: low-Vt transistors on critical circuit paths for speed, and high-Vt transistors on other paths to reduce leakage.
2. By using two operating modes, active and standby, for each function block.
3. By adjusting the body bias, i.e. applying FBB (Forward Body Bias) in active mode to increase performance and RBB (Reverse Body Bias) in standby mode to reduce leakage.
4. By using sleep transistors to isolate the supply from the block to achieve significant leakage power savings.
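The power relations of this section, dynamic power equal to the activity factor times C·VDD² times the clock frequency, and static power equal to the leakage current times VDD, can be evaluated with a short sketch. All numerical values below are hypothetical and chosen only to show the orders of magnitude and the quadratic dependence on supply voltage.

```python
def dynamic_power(alpha, c_load, vdd, f_clk):
    """P_dynamic = alpha * C * VDD^2 * f (watts)."""
    return alpha * c_load * vdd ** 2 * f_clk


def static_power(i_static, vdd):
    """P_static = I_static * VDD (watts)."""
    return i_static * vdd


# Hypothetical block: 100 pF switched capacitance, 1 GHz clock, 1 mA leakage.
ALPHA, C_LOAD, F_CLK, I_LEAK = 0.1, 100e-12, 1e9, 1e-3

p_dyn_full = dynamic_power(ALPHA, C_LOAD, 1.0, F_CLK)
p_dyn_half = dynamic_power(ALPHA, C_LOAD, 0.5, F_CLK)
p_total = p_dyn_full + static_power(I_LEAK, 1.0)

print(f"dynamic power at VDD = 1.0 V: {p_dyn_full * 1e3:.2f} mW")
print(f"dynamic power at VDD = 0.5 V: {p_dyn_half * 1e3:.2f} mW")
print(f"total power at VDD = 1.0 V  : {p_total * 1e3:.2f} mW")
```

Halving the supply voltage cuts the dynamic term to one quarter, which is why supply voltage reduction appears in the list of methods above. The model ignores short-circuit current and the fact that leakage itself changes with VDD.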
UNIT III: SEQUENTIAL LOGIC CIRCUITS
Static & Dynamic Latches and Registers, Pipelining
- In sequential logic circuits, the output depends not only on the current values of the inputs, but also on preceding input values. In other words, a sequential circuit remembers some of the past history of the system: it has memory.
- The figure shows a block diagram of a generic finite state machine (FSM) that consists of combinational logic and registers, which hold the system state. The system depicted here belongs to the class of synchronous sequential systems, in which all registers are under control of a single global clock. The outputs of the FSM are a function of the current Inputs and the Current State. The Next State is determined based on the Current State and the current Inputs and is fed to the inputs of the registers.
- On the rising edge of the clock, the Next State bits are copied to the outputs of the registers (after some propagation delay), and a new cycle begins. The register then ignores changes in the input signals until the next rising edge. In general, registers can be positive edge-triggered (where the input data is copied on the positive edge of the clock) or negative edge-triggered (where the input data is copied on the negative edge, as is indicated by a small circle at the clock input).
Figure: Block diagram of a finite state machine using positive edge-triggered registers.
Timing Metrics for Sequential Circuits
There are three important timing parameters associated with a register, as illustrated in the figure.
1. The set-up time (tsu) is the time that the data inputs (D input) must be valid before the clock transition (that is, the 0 to 1 transition for a positive edge-triggered register).
2. The hold time (thold) is the time the data input must remain valid after the clock edge.
3. Assuming that the set-up and hold times are met, the data at the D input is copied to the Q output after a worst-case propagation delay (with reference to the clock edge) denoted by tc-q.
Given the timing information for the registers and the combinational logic, some system-level timing constraints can be derived. Assume that the worst-case propagation delay of the logic equals tplogic, while its minimum delay (also called the contamination delay) is tcd. The minimum clock period T required for proper operation of the sequential circuit is given by
$T \geq t_{c-q} + t_{plogic} + t_{su}$
The hold time of the register imposes an extra constraint for proper operation,
$t_{cdregister} + t_{cd} \geq t_{hold}$
where tcdregister is the minimum propagation delay (or contamination delay) of the register. It is important to minimize the values of the timing parameters associated with the register, as these directly affect the rate at which a sequential circuit can be clocked. In fact, modern high-performance systems are characterized by a very low logic depth, and the register propagation delay and set-up times account for a significant portion of the clock period.
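The two register timing constraints can be checked numerically. The sketch below uses the standard forms of the constraints, namely that the clock period must be at least tc-q + tplogic + tsu and that the register contamination delay plus the logic contamination delay must be at least thold. The delay values are hypothetical, expressed in picoseconds, and the function names are illustrative.

```python
def min_clock_period(t_cq, t_plogic, t_su):
    """Smallest clock period that satisfies the set-up constraint."""
    return t_cq + t_plogic + t_su


def hold_ok(t_cd_register, t_cd_logic, t_hold):
    """True if the earliest possible data change arrives after the hold window."""
    return t_cd_register + t_cd_logic >= t_hold


# Hypothetical register and logic delays in picoseconds.
T_CQ, T_SU, T_HOLD, T_CD_REG = 50, 30, 25, 20
T_PLOGIC, T_CD_LOGIC = 400, 15

t_min = min_clock_period(T_CQ, T_PLOGIC, T_SU)
print(f"minimum clock period: {t_min} ps  (max frequency {1e6 / t_min:.0f} MHz)")
print("hold constraint met:", hold_ok(T_CD_REG, T_CD_LOGIC, T_HOLD))
print("hold constraint with no logic between registers:",
      hold_ok(T_CD_REG, 0, T_HOLD))
```

Note that the hold check does not involve the clock period at all, so a hold violation (as in the last line, where registers are connected back to back) cannot be fixed by slowing the clock.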
Classification of Memory Elements
Foreground versus Background Memory
Memory that is embedded into logic is foreground memory (internal memory), and is most often organized as individual registers or register banks. Large amounts of centralized memory core are referred to as background memory (external memory).
Static versus Dynamic Memory
- Static memories preserve the state as long as the power is turned on.
- They are built using positive feedback or regeneration, where the circuit topology consists of intentional connections between the output and the input of a combinational circuit.
- Static memories are most useful when the register will not be updated for extended periods of time, e.g. configuration data loaded at power-up time.
- This condition also holds for most processors that use conditional clocking (i.e., gated clocks), where the clock is turned off for unused modules. In that case, there are no guarantees on how frequently the registers will be clocked, and static memories are needed to preserve the state information.
- Memory based on positive feedback falls under the class of elements called multivibrator circuits. The bistable element is its most popular representative, but other elements such as monostable and astable circuits are also frequently used.
- Dynamic memories store state for a short period of time, on the order of milliseconds. They are based on the principle of temporary charge storage on parasitic capacitors associated with MOS devices. The capacitors have to be refreshed periodically to compensate for charge leakage.
- Dynamic memories tend to be simpler, resulting in significantly higher performance and lower power dissipation. They are most useful in datapath circuits that require high performance levels and are periodically clocked.
Latches versus Registers
A latch is an essential component in the construction of an edge-triggered register. It is a level-sensitive circuit that passes the D input to the Q output when the clock signal is high. The latch is then said to be in transparent mode. When the clock is low, the input data sampled on the falling edge of the clock is held stable at the output for the entire phase, and the latch is in hold mode. The inputs must be stable for a short period around the falling edge of the clock to meet set-up and hold requirements. A latch operating under the above conditions is a positive latch. Similarly, a negative latch passes the D input to the Q output when the clock signal is low.
(Figure: timing of positive and negative latches)

Static Latches and Registers

The Bistability Principle
Static memories use positive feedback to create a bistable circuit, that is, a circuit having two stable states that represent 0 and 1. The basic idea is shown in Figure a, which shows two inverters connected in cascade along with a voltage-transfer characteristic typical of such a circuit. Assume now that the output of the second inverter, Vo2, is connected to the input of the first, Vi1, as shown by the dotted lines in Figure a. The resulting circuit has only three possible operating points (A, B, and C). Under the condition that the gain of the inverter in the transition region is larger than 1, only A and B are stable operating points, and C is a metastable operating point. Suppose that the cross-coupled inverter pair is biased at point C. A small deviation from this bias point, possibly caused by noise, is amplified and regenerated around the circuit loop, because the gain around the loop is larger than 1. The circuit therefore moves away from C and settles at A or B. At A and B, on the other hand, the loop gain is much smaller than unity, so a small deviation dies out and these points are stable. Hence the cross-coupling of two inverters results in a bistable circuit, which serves as a memory, storing either a 1 or a 0 (corresponding to positions A and B).
In order to change the stored value, we must be able to bring the circuit from state A to B and vice versa. This is generally done by applying a trigger pulse at Vi1 or Vi2. The width of the trigger pulse need be only a little larger than the total propagation delay around the circuit loop, which is twice the average propagation delay of the inverters.

SR Flip-Flops
The SR (set-reset) flip-flop circuit is similar to the cross-coupled inverter pair, with NOR gates replacing the inverters.
The second input of each NOR gate is connected to a trigger input (S or R), which makes it possible to force the outputs Q and Q' to a given state. These outputs are complementary (except for the SR = 11 state). When both S and R are 0, the flip-flop is in a quiescent state and both outputs retain their value. If a positive (or 1) pulse is applied to the S input, the Q output is forced into the 1 state (with Q' going to 0). Vice versa, a 1 pulse on R resets the flip-flop and the Q output goes to 0.
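
The set, hold and reset behaviour of the NOR-based SR flip-flop can be checked with a short gate-level sketch. It assumes ideal gates with identical delay, evaluated in lock-step until the outputs settle; `nor` and `sr_latch` are hypothetical helper names introduced for this illustration.

```python
def nor(a, b):
    """Two-input NOR gate on 0/1 values."""
    return 0 if (a or b) else 1


def sr_latch(s, r, q, qn, max_iters=4):
    """Re-evaluate both cross-coupled NOR gates until the outputs stop changing."""
    for _ in range(max_iters):
        q_next, qn_next = nor(r, qn), nor(s, q)
        if (q_next, qn_next) == (q, qn):
            break
        q, qn = q_next, qn_next
    return q, qn


state = (0, 1)  # start with Q = 0, Q' = 1
for s, r in [(1, 0), (0, 0), (0, 1), (0, 0), (1, 1)]:
    state = sr_latch(s, r, *state)
    print(f"S={s} R={r} -> Q={state[0]} Q'={state[1]}")
```

The last input pair, S = R = 1, drives both outputs to 0, so they are no longer complementary.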
When both S and R are high, both Q and Q' are forced to zero. This input combination is forbidden, because the outputs are then no longer complementary. An additional problem with this condition is that when the input triggers return to their zero levels, the resulting state of the latch is unpredictable and depends on whichever input is last to go low.

CMOS clocked SR flip-flop
One possible realization of a clocked SR flip-flop (a level-sensitive positive latch) is shown in the figure. It consists of a cross-coupled inverter pair, plus 4 extra transistors to drive the flip-flop from one state to another and to provide clocked operation.

Multiplexer-Based Latches
Advantage: the sizing of devices only affects performance and is not critical to the functionality.
For a negative latch, when the clock signal is low, input 0 of the multiplexer is selected, and the D input is passed to the output. When the clock signal is high, input 1 of the multiplexer, which connects to the output of the latch, is selected. The feedback holds the output stable while the clock signal is high.
A transistor-level implementation of a positive latch based on multiplexers is shown in the figure.
• When CLK is high, the bottom transmission gate is on and the latch is transparent, that is, the D input is copied to the Q output.
• The feedback does not have to be overridden to write the memory, and hence the sizing of transistors is not critical for realizing correct functionality.
• The number of transistors that the clock touches is important, since the clock has an activity factor of 1.
• The latch is not efficient by this metric, as it presents a load of 4 transistors to the CLK signal.
The clock load can be reduced to 2 transistors by using NMOS-only pass transistors, as shown in the figure.
Advantages:
• Reduced clock load of only two NMOS devices.
• Simple circuit.
Disadvantage: the NMOS pass transistor passes a degraded high voltage of VDD - VTn to the input of the first inverter. This impacts both noise margin and switching performance, especially in the case of low values of VDD and high values of VTn. It also causes static power dissipation in the first inverter: since the maximum input voltage to the inverter equals VDD - VTn, the PMOS device of the inverter is never fully turned off, resulting in a static current flow.

Master-Slave Edge-Triggered Register
• The register consists of a negative latch (master stage) cascaded with a positive latch (slave stage).
• On the low phase of the clock, the master stage is transparent, and the D input is passed to the master stage output, QM. During this period, the slave stage is in hold mode, keeping its previous value using feedback.
• On the rising edge of the clock, the master stage stops sampling the input, and the slave stage starts sampling. During the high phase of the clock, the slave stage samples the output of the master stage (QM), while the master stage remains in hold mode. Since QM is constant during the high phase of the clock, the output Q makes only one transition per cycle.
• The value of Q is the value of D right before the rising edge of the clock, achieving the positive edge-triggered effect.
A negative edge-triggered register can be constructed using the same principle by simply switching the order of the positive and negative latches (that is, placing the positive latch first).
A complete transistor-level implementation of the master-slave positive edge-triggered register is shown in the figure below.
Drawback of the transmission-gate register: the high capacitive load presented to the clock signal. The clock load per register is important, since it directly impacts the power dissipation of the clock network.
Each register has a clock load of 8 transistors. One approach to reduce the clock load at the cost of robustness is to make the circuit ratioed.
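
The master-slave principle (a negative latch followed by a positive latch) can be sketched behaviourally as follows. This is a simplified, delay-free model with hypothetical helper names (`latch`, `master_slave`); it only illustrates that Q takes the value D had just before each rising clock edge.

```python
def latch(d, clk, q_prev, positive=True):
    """Level-sensitive latch: pass D when transparent, otherwise hold Q."""
    transparent = (clk == 1) if positive else (clk == 0)
    return d if transparent else q_prev


def master_slave(clk_wave, d_wave, qm0=0, q0=0):
    """Positive edge-triggered register built from a negative and a positive latch."""
    qm, q = qm0, q0
    trace = []
    for clk, d in zip(clk_wave, d_wave):
        qm = latch(d, clk, qm, positive=False)  # master: transparent when CLK = 0
        q = latch(qm, clk, q, positive=True)    # slave: transparent when CLK = 1
        trace.append((qm, q))
    return trace


CLK = [0, 0, 1, 1, 0, 0, 1, 1]
D = [1, 1, 0, 0, 1, 0, 1, 1]
q_wave = [q for _, q in master_slave(CLK, D)]
print(q_wave)  # [0, 0, 1, 1, 1, 1, 0, 0]
```

Changes of D while CLK is high (steps 2, 3, 6 and 7) never reach Q, because the master is in hold mode during that phase.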
The figure below shows that the feedback transmission gate can be eliminated by directly cross-coupling the inverters.
Another problem with this scheme is reverse conduction, that is, the second stage can affect the state of the first latch. When the slave stage is on (figure above), it is possible for the combination of T2 and I4 to influence the data stored in the I1-I2 latch. As long as I4 is a weak device, this is fortunately not a major problem.

Non-ideal clock signals
Variations can exist in the wires used to route the two clock signals, or the load capacitances can vary based on the data stored in the connecting latches. This effect, known as clock skew, is a major problem, and causes the two clock signals to overlap, as shown in Figure 7.20b. Clock overlap can cause two types of failures, as illustrated for the NMOS-only negative master-slave register.
• When the clock goes high, the slave stage should stop sampling the master stage output and go into hold mode. However, since CLK and CLK bar are both high for a short period of time (the overlap period), both sampling pass transistors conduct and there is a direct path from the D input to the Q output. As a result, data at the output can change on the rising edge of the clock. This is a race condition in which the value of the output Q is a function of whether the input D arrives at node X before or after the falling edge of CLK bar. If node X is sampled in the metastable state, the output will switch to a value determined by noise in the system.
• The primary advantage of the multiplexer-based register is that the feedback loop is open during the sampling period, and therefore the sizing of devices is not critical to functionality. However, if there is clock overlap between CLK bar and CLK, node A can be driven by both D and B, resulting in an undefined state.
Those problems can be avoided by using two non-overlapping clocks PHI1 and PHI2 instead, and by keeping the non-overlap time tnon_overlap between the clocks large enough that no overlap occurs even in the presence of clock-routing delays.

Dynamic Latches and Registers
This class of circuits is based on the temporary storage of charge on parasitic capacitors. Charge stored on a capacitor can be used to represent a logic signal: the absence of charge denotes a 0, while its presence stands for a stored 1. Because the stored charge leaks away over time, a periodic refresh of the value is necessary, hence the name dynamic storage.

Dynamic Transmission-Gate Edge-Triggered Registers
A fully dynamic positive edge-triggered register based on the master-slave concept is shown in the figure below.
• When CLK = 0, the input data is sampled on storage node 1, which has an equivalent capacitance C1 consisting of the gate capacitance of I1, the junction capacitance of T1, and the overlap gate capacitance of T1.
• During this period, the slave stage is in hold mode, with node 2 in a high-impedance (floating) state.
• On the rising edge of the clock, the transmission gate T2 turns on, and the value sampled on node 1 right before the rising edge propagates to the output Q.
• Node 2 now stores the inverted version of node 1.
The circuit is very efficient, requiring only 8 transistors. The sampling switches can also be implemented using NMOS-only pass transistors (6-transistor implementation).
The set-up time of this circuit is simply the delay of the transmission gate, and corresponds to the time it takes node 1 to sample the D input. The hold time is approximately zero, since the transmission gate is turned off on the clock edge and further input changes are ignored. The propagation delay (tc-q) is equal to two inverter delays plus the delay of the transmission gate T2.

Race Condition and Preventive Measures
Clock overlap is an important concern for this dynamic register. Consider the clock waveforms shown in the figure below.
During the 0-0 overlap period, the PMOS of T1 and the PMOS of T2 are simultaneously on, creating a direct path for data to flow from the D input of the register to the Q output. This is known as a race condition, in which the value of the output Q is a function of whether the input D arrives at node X before or after the rising edge of CLK. The output Q can change on the falling edge of the clock if the overlap period is large, which is undesirable for a positive edge-triggered register. The same is true for the 1-1 overlap region, where an input-output path exists through the NMOS of T1 and the NMOS of T2. The latter case is taken care of by enforcing a hold time constraint, that is, the data must be stable during the high-high overlap period. The former situation (0-0 overlap) can be addressed by making sure that there is enough delay between the D input and node 2, ensuring that new data sampled by the master stage does not propagate through to the slave stage. Generally the built-in single inverter delay should be sufficient, and the overlap period constraint is given as:
$t_{overlap0-0} < t_{T1} + t_{I1} + t_{T2}$
where $t_{T1}$, $t_{I1}$ and $t_{T2}$ are the propagation delays of T1, I1 and T2. Similarly, the constraint for the 1-1 overlap is given as:
$t_{hold} > t_{overlap1-1}$
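
The two overlap conditions can be turned into a quick numerical check. The sketch below assumes the usual textbook forms of the constraints for this register (the 0-0 overlap must be shorter than the delay through T1, I1 and T2, and the hold time must exceed the 1-1 overlap); the function name `race_free` and all delay values (in picoseconds) are made up for illustration.

```python
def race_free(t_overlap00, t_overlap11, t_t1, t_i1, t_t2, t_hold):
    """Return (ok_00, ok_11) for the two clock-overlap constraints.

    ok_00: t_overlap00 < t_t1 + t_i1 + t_t2   (0-0 overlap shorter than D-to-Q path)
    ok_11: t_hold > t_overlap11               (data held stable through 1-1 overlap)
    """
    ok_00 = t_overlap00 < t_t1 + t_i1 + t_t2
    ok_11 = t_hold > t_overlap11
    return ok_00, ok_11


# Illustrative numbers only (picoseconds).
print(race_free(40, 30, t_t1=20, t_i1=25, t_t2=20, t_hold=35))  # (True, True)
print(race_free(80, 30, t_t1=20, t_i1=25, t_t2=20, t_hold=35))  # (False, True)
```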
(Figure: impact of overlapping clocks)

C2MOS: A Clock-Skew Insensitive Approach (Method to Prevent the Race Condition)
The figure below shows an ingenious positive edge-triggered register, based on the master-slave concept, that is insensitive to clock overlap. This circuit is called the C2MOS (Clocked CMOS) register, and operates in two phases.
1. CLK = 0 (CLK bar = 1): The first tri-state driver is turned on, and the master stage acts as an inverter, sampling the inverted version of D on the internal node X. The master stage is in evaluation mode. Meanwhile, the slave section is in high-impedance mode, or hold mode. Both transistors M7 and M8 are off, decoupling the output from the input. The output Q retains its previous value, stored on the output capacitor CL2.
2. The roles are reversed when CLK = 1: The master stage is in hold mode (M3-M4 off), while the second section evaluates (M7-M8 on). The value stored on CL1 propagates to the output node through the slave stage, which acts as an inverter.
In the (0-0) overlap case, both PMOS devices are on during the overlap period. New data is sampled on node X through the series PMOS devices M2-M4, and node X can make a 0-to-1 transition during the overlap period. However, this data cannot propagate to the output, since the NMOS device M7 is turned off. At the end of the overlap period, CLK bar = 1 and both M7 and M8 turn off, putting the slave stage in hold mode.
In the (1-1) overlap case, both NMOS devices M3 and M7 are turned on. If the D input changes during the overlap period, node X can make a 1-to-0 transition, but this transition cannot propagate to the output. However, as soon as the overlap period is over, the PMOS M8 is turned on and the 0 propagates to the output. This effect is not desirable.
The problem is fixed by imposing a hold time constraint on the input data D; in other words, the data D should be stable during the overlap period.

Pipelining: An Approach to Optimize Sequential Circuits
Pipelining is a popular design technique often used to accelerate the operation of the datapaths in digital processors. The idea is easily explained with the example of Figure (a). The goal of the presented circuit is to compute log(|a + b|), where both a and b represent streams of numbers, that is, the computation must be performed on a large set of input values. The minimal clock period Tmin necessary to ensure correct evaluation is given as:
$T_{min} = t_{c-q} + t_{pd,logic} + t_{su}$
where tc-q and tsu are the propagation delay and the set-up time of the register, respectively. We assume that the registers are edge-triggered D registers. The term tpd,logic stands for the worst-case delay path through the combinational network, which consists of the adder, absolute value, and logarithm functions. In conventional systems, the latter delay is generally much larger than the delays associated with the registers and dominates the circuit performance. Assume that each logic module has an equal propagation delay. We note that each logic module is then active for only 1/3 of the clock period (if the delay of the register is ignored). For example, the adder unit is active during the first third of the period and remains idle (that is, it does no useful computation) during the other 2/3 of the period.
Pipelining is a technique to improve resource utilization and increase the functional throughput. Assume that we introduce registers between the logic blocks, as shown in Figure (b). This causes the computation for one set of input data to spread over a number of clock periods, as shown in the table. The advantage of pipelined operation becomes apparent when examining the minimum clock period of the modified circuit. The combinational circuit block has been partitioned into three sections, each of which has a smaller propagation delay than the original function. This effectively reduces the value of the minimum allowable clock period:
$T_{min,pipe} = t_{c-q} + \max(t_{pd,add}, t_{pd,abs}, t_{pd,log}) + t_{su}$
Suppose that all logic blocks have approximately the same propagation delay, and that the register overhead is small with respect to the logic delays. Under these assumptions the pipelined network outperforms the original circuit by a factor of three, or Tmin,pipe = Tmin/3. The increased performance comes at the relatively small cost of two additional registers and an increased latency.

Latch- vs. Register-Based Pipelines
Consider the pipelined circuit of the figure below. The pipeline system is implemented using pass-transistor-based positive and negative latches instead of edge-triggered registers. Latch-based systems give significantly more flexibility in implementing a pipelined system, and often offer higher performance.
When the clocks CLK and CLK bar are non-overlapping, correct pipeline operation is obtained. Input data is sampled on C1 at the negative edge of CLK, and the computation of logic block F starts; the result of logic block F is stored on C2 on the falling edge of CLK bar, and the computation of logic block G starts. The non-overlapping of the clocks ensures correct operation. The value stored on C2 at the end of the CLK low phase is the result of passing the previous input (stored on C1 on the falling edge of CLK) through the logic function F.
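
The benefit of pipelining the log(|a + b|) datapath can be made concrete with a short calculation. The sketch assumes the minimum clock period is the register clock-to-Q delay plus the worst-case logic delay plus the set-up time, with the logic delay being the sum of all stages when unpipelined and the slowest single stage when pipelined. The function names and the delay numbers (in picoseconds) are illustrative assumptions, not values from the notes.

```python
def t_min_unpipelined(t_cq, t_su, stage_delays):
    """Minimum period when all logic stages sit between one pair of registers."""
    return t_cq + sum(stage_delays) + t_su


def t_min_pipelined(t_cq, t_su, stage_delays):
    """Minimum period when a register follows every logic stage."""
    return t_cq + max(stage_delays) + t_su


stages = [400, 400, 400]  # adder, absolute value, logarithm (assumed equal delays)
t_cq, t_su = 50, 30

t_flat = t_min_unpipelined(t_cq, t_su, stages)
t_pipe = t_min_pipelined(t_cq, t_su, stages)
print(t_flat, t_pipe)            # 1280 480
print(len(stages) * t_pipe)      # latency of the pipelined version: 1440
```

With these numbers the clock period shrinks from 1280 ps to 480 ps, slightly less than the ideal factor of three because of the register overhead, while the latency for one input pair grows from 1280 ps to 1440 ps.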

NORA-CMOS: A Logic Style for Pipelined Structures
The latch-based pipeline circuit can also be implemented using C2MOS latches, as shown in the figure below. This topology has one additional, important property: a C2MOS-based pipelined circuit is race-free as long as all the logic functions F between the latches are non-inverting.
The reasoning behind this statement is similar to the argument made in the construction of a C2MOS register. During a (0-0) overlap between CLK and CLK bar, all C2MOS latches simplify to pure pull-up networks (see Figure 7.27). The only way a signal can race from stage to stage under this condition is when the logic function F is inverting, as illustrated in the figure above, where F is replaced by a single, static CMOS inverter. Similar considerations are valid for the (1-1) overlap.

Sources of Clock Skew and Jitter
A perfect clock is defined as a perfectly periodic signal that is triggered simultaneously at the various memory elements on the chip. However, due to a variety of process and environmental variations, clocks are not ideal. To illustrate the sources of skew and jitter, consider the simplistic view of clock generation and distribution shown in the figure below. Typically, a high-frequency clock is either provided from off chip or generated on chip. From a central point, the clock is distributed using multiple matched paths to the low-level memory elements (registers). Here two paths are shown. The clock paths include wiring and the associated distributed buffers required to drive interconnects and loads. A key point to realize in clock distribution is that the absolute delay through a clock distribution path is not important; what matters is the relative arrival time between the outputs of each path at the register points.
The sources of clock uncertainty can be classified in several ways. Systematic errors are nominally identical from chip to chip, and are typically predictable (e.g., variation in the total load capacitance of each clock path). In principle, such errors can be modeled and corrected at design time, given sufficiently good models and simulators. Random errors are due to manufacturing variations (e.g., dopant fluctuations that result in threshold variations) that are difficult to model and eliminate. Mismatch may also be characterized as static or time-varying. Below, the various sources of skew and jitter, introduced in Figure 10.14, are described in detail.
• Clock-Signal Generation (1)
The generation of the clock signal itself causes jitter. A typical on-chip clock generator takes a low-frequency reference clock signal and produces a high-frequency global reference for the processor. The core of such a generator is a Voltage-Controlled Oscillator (VCO). A major problem is coupling from the surrounding noisy digital circuitry through the substrate. These noise sources cause temporal variations of the clock signal that propagate unfiltered through the clock drivers to the flip-flops.
• Manufacturing Device Variations (2)
Distributed buffers are integral components of clock distribution networks, as they are required to drive both the register loads and the global and local interconnects. The matching of devices in the buffers along multiple clock paths is critical to minimizing timing uncertainty. Device parameters in the buffers vary along different paths, resulting in static skew. There are many sources of variation, including oxide variations (which affect the gain and threshold), dopant variations, and lateral dimension (width and length) variations.
• Interconnect Variations (3)
Vertical and lateral dimension variations cause the interconnect capacitance and resistance to vary across a chip. Since this variation is static, it causes skew between different paths. One important source of interconnect variation is Inter-level Dielectric (ILD) thickness variation. Other interconnect variations include deviations in the width of the wires and in line spacing, which result from photolithography and etch dependencies.
• Environmental Variations (4 and 5)
The two major sources are temperature and power supply. Temperature gradients across the chip are a result of variations in power dissipation across the die. This is an issue with clock gating, where some parts of the chip may be idle while other parts are active. Since the device parameters (such as threshold and mobility) depend strongly on temperature, the buffer delay of a clock distribution network along one path can differ drastically from that along another path. The delay through buffers is a very strong function of the power supply, as it directly affects the drive of the transistors. As with temperature, the power supply voltage is a strong function of the switching activity. Power supply variations can be classified into static (or slow) and high-frequency variations. Static power supply variations may result from fixed currents drawn by various modules, while high-frequency variations result from instantaneous IR drops along the power grid due to fluctuations in switching activity.
• Capacitive Coupling (6 and 7)
The variation in capacitive load also contributes to timing uncertainty. There are two major sources of capacitive load variation: coupling between the clock lines and adjacent signal wires, and variation in gate capacitance. Any coupling between the clock wire and an adjacent signal results in timing uncertainty, leading to clock jitter. Another major source of clock uncertainty is variation in the gate capacitance of the sequential elements. The load capacitance is highly non-linear and depends on the applied voltage.

Timing Issues in Digital Circuits, Clock Distribution Techniques, Synchronous and Asynchronous Design
All sequential circuits have one property in common: a well-defined ordering of the switching events must be imposed if the circuit is to operate correctly. If this were not the case, wrong data might be written into the memory elements, resulting in a functional failure. The synchronous system approach, in which all memory elements in the system are simultaneously updated using a globally distributed periodic synchronization signal (that is, a global clock signal), represents an effective and popular way to enforce this ordering. Functionality is ensured by imposing some strict constraints on the generation of the clock signals and their distribution to the memory elements distributed over the chip; non-compliance often leads to malfunction.
We analyze the impact of spatial variations of the clock signal, called clock skew, and temporal variations of the clock signal, called clock jitter, and introduce techniques to cope with them. These variations fundamentally limit the performance that can be achieved using a conventional design methodology.
At the other end of the design spectrum is an approach called asynchronous design, which avoids the problem of clock uncertainty altogether by eliminating the need for globally distributed clocks. After discussing the basics of the asynchronous design approach, we analyze the associated overhead and identify some practical applications. The important issue of synchronization, which is required when interfacing different clock domains or when sampling an asynchronous signal, also deserves some in-depth treatment. Finally, the fundamentals of on-chip clock generation using feedback are introduced, along with trends in timing.

Timing Classification of Digital Systems
In digital systems, signals can be classified depending on how they are related to a local clock. Signals that transition only at predetermined periods in time can be classified as synchronous, mesochronous, or plesiochronous with respect to a system clock. A signal that can transition at arbitrary times is considered asynchronous.
• Synchronous interconnect: a signal with exactly the same frequency as the local clock and a known, fixed phase offset with respect to it.
• Mesochronous interconnect: a signal with the same frequency as the local clock but an unknown phase offset with respect to it.
• Plesiochronous interconnect: a signal whose frequency is nominally the same as, but slightly different from, that of the local clock.
• Asynchronous interconnect: asynchronous signals can transition at any arbitrary time, and are not slaved to any local clock.

Synchronous Design: Synchronous Timing Basics
Virtually all systems designed today use a periodic synchronization signal or clock. The generation and distribution of a clock has a significant impact on performance and power dissipation. In the ideal world, assuming the clock paths from a central distribution point to each register are perfectly balanced, the phase of the clock (i.e., the position of the clock edge relative to a reference) at various points in the system is exactly equal. In practice, however, the clock is neither perfectly periodic nor perfectly simultaneous. This results in performance degradation and/or circuit malfunction. The figure shows the basic structure of a synchronous pipelined datapath. In the ideal scenario, the clocks at registers 1 and 2 have the same clock period and transition at exactly the same time. The following timing parameters characterize the timing of the sequential circuit.
• The contamination (minimum) delay tc-q,cd and the maximum propagation delay tc-q of the register, and the set-up (tsu) and hold (thold) times of the registers.
• The contamination delay tlogic,cd and the maximum delay tlogic of the combinational logic.
• tclk1 and tclk2, corresponding to the position of the rising edge of the clock relative to a global reference.
Under ideal conditions (tclk1 = tclk2), the worst-case propagation delays determine the minimum clock period required for this sequential circuit. The period must be long enough for the data to propagate through the registers and logic and be set up at the destination register before the next rising edge of the clock. This constraint is given by
$T > t_{c-q} + t_{logic} + t_{su}$

Clock Skew
The spatial variation in the arrival time of a clock transition on an integrated circuit is commonly referred to as clock skew. The clock skew between two points i and j on an IC is given by δ(i,j) = ti - tj, where ti and tj are the positions of the rising edge of the clock with respect to a reference. Consider the transfer of data between registers R1 and R2 in Figure 10.5. The clock skew can be positive or negative, depending on the routing direction and the position of the clock source. The timing diagram for the case with positive skew is shown in the figure.
The rising clock edge is delayed by a positive δ at the second register.
• Clock skew is caused by static path-length mismatches in the clock load, and by definition skew is constant from cycle to cycle. That is, if in one cycle CLK2 lagged CLK1 by δ, then on the next cycle it will lag it by the same amount.
• Skew has strong implications for performance and functionality. First consider the impact of clock skew on performance. From the figure, a new input In sampled by R1 at edge 1 will propagate through the combinational logic and be sampled by R2 on edge 4. If the clock skew is positive, the time available for the signal to propagate from R1 to R2 is increased by the skew δ. The output of the combinational logic must be valid one set-up time before the rising edge of CLK2 (point 4). The constraint on the minimum clock period can then be derived as
$T + \delta \ge t_{c-q} + t_{logic} + t_{su}$, or $T \ge t_{c-q} + t_{logic} + t_{su} - \delta$
• The minimum clock period required to operate the circuit reliably reduces with increasing clock skew.
• As above, assume that input In is sampled on the rising edge of CLK1 at edge 1 into R1. The new values at the output of R1 propagate through the combinational logic and should be valid before edge 4 at CLK2. However, if the minimum delay of the combinational logic block is small, the inputs to R2 may change before clock edge 2, resulting in incorrect evaluation. To avoid races, the minimum propagation delay through the register and logic must be long enough that the inputs to R2 remain valid for a hold time after edge 2. The constraint can be formally stated as
$\delta + t_{hold} < t_{c-q,cd} + t_{logic,cd}$, or $\delta < t_{c-q,cd} + t_{logic,cd} - t_{hold}$
The figure above shows the timing diagram for the case when δ < 0. For this case, the rising edge of CLK2 happens before the rising edge of CLK1. On the rising edge of CLK1, a new input is sampled by R1. The newly sampled data propagates through the combinational logic and is sampled by R2 on the rising edge of CLK2, which corresponds to edge 4. A negative skew directly reduces the performance of a sequential system, since it shortens the time available for computation. However, a negative skew implies that the system never fails because of a race, since edge 2 happens before edge 1.
• δ > 0: This corresponds to a clock routed in the same direction as the flow of the data through the pipeline (Figure 10.8a). In this case, the skew has to be strictly controlled and must satisfy Eq. (10.4). If this constraint is not met, the circuit malfunctions regardless of the clock period.
• δ < 0: When the clock is routed in the opposite direction of the data (Figure 10.8b), the skew is negative and condition (10.4) is unconditionally met. The circuit operates correctly independent of the skew. The skew reduces the time available for actual computation, so the clock period has to be increased by |δ|.
• Unfortunately, since a general logic circuit can have data flowing in both directions, this solution to eliminate races will not always work.
The skew can assume both positive and negative values depending on the direction of the data transfer. The designer has to account for the worst-case skew condition.

Clock Jitter
Clock jitter refers to the temporal variation of the clock period at a given point, that is, the clock period can reduce or expand on a cycle-by-cycle basis. It is strictly a temporal uncertainty measure and is often specified at a given point on the chip. Cycle-to-cycle jitter refers to the time-varying deviation of a single clock period, and for a given spatial location i is given as $T_{jitter,i}(n) = T_{i,n+1} - T_{i,n} - T_{CLK}$, where $T_{i,n}$ and $T_{i,n+1}$ are the times at which clock edges n and n+1 arrive at location i (so that their difference is the actual length of period n), and $T_{CLK}$ is the nominal clock period.
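
The effect of skew on the set-up and hold constraints, and the definition of cycle-to-cycle jitter, can be explored numerically. The sketch below assumes the usual forms of the constraints (T + δ ≥ tc-q + tlogic + tsu for set-up, δ + thold < tc-q,cd + tlogic,cd for hold) and interprets jitter as the deviation of each measured edge-to-edge period from the nominal period. Function names and all timing values (in picoseconds) are illustrative assumptions.

```python
def min_period_with_skew(t_cq, t_logic, t_su, skew):
    """Smallest T that satisfies T + skew >= t_cq + t_logic + t_su."""
    return t_cq + t_logic + t_su - skew


def hold_ok(t_cq_cd, t_logic_cd, t_hold, skew):
    """True if skew + t_hold < t_cq_cd + t_logic_cd (no race through the logic)."""
    return skew + t_hold < t_cq_cd + t_logic_cd


def cycle_to_cycle_jitter(edge_times, t_clk):
    """Deviation of each measured period from the nominal period t_clk."""
    return [(t1 - t0) - t_clk for t0, t1 in zip(edge_times, edge_times[1:])]


# Positive skew shortens the required period, negative skew lengthens it.
print(min_period_with_skew(50, 800, 30, skew=40))    # 840
print(min_period_with_skew(50, 800, 30, skew=-40))   # 920

# Positive skew threatens the hold constraint; negative skew relaxes it.
print(hold_ok(30, 20, 25, skew=40))    # False
print(hold_ok(30, 20, 25, skew=-40))   # True

# Rising-edge arrival times at one register, nominal period 1000 ps.
print(cycle_to_cycle_jitter([0, 1000, 2010, 2995, 4000], 1000))  # [0, 10, -15, 5]
```

Note that the hold check does not involve the clock period at all, which is why a hold violation caused by positive skew cannot be repaired by slowing the clock down.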
Jitter directly impacts the performance of a sequential system. The figure above shows the nominal clock period as well as the variation in period. Ideally the clock period starts at edge 2 and ends at edge 5, with a nominal clock period of TCLK. However, as a result of jitter, the worst-case scenario happens when the leading edge of the current clock period is delayed (edge 3), and the leading edge of the next clock period occurs early (edge 4). As a result, the total time available to complete the operation is reduced by 2 tjitter in the worst case, and is given by
$T_{CLK} - 2 t_{jitter} \ge t_{c-q} + t_{logic} + t_{su}$, or $T_{CLK} \ge t_{c-q} + t_{logic} + t_{su} + 2 t_{jitter}$

Clock-Distribution Techniques
It is necessary to design a clock network that minimizes skew and jitter. Another important consideration in clock distribution is power dissipation. To reduce power dissipation, clock networks must support clock conditioning, that is, the ability to shut down parts of the clock network.

Fabrics for clocking
Most clock distribution schemes exploit the fact that only the relative phase between two clocking points is important. Therefore, one common approach to distributing a clock is to use balanced paths or trees.
1) H-tree configuration:
The most common type of clock primitive is the H-tree network in Figure (a), where a 4x4 array is shown. In this scheme, the clock is routed to a central point on the chip, and balanced paths, which include both matched interconnect and buffers, are used to distribute the reference to the various leaf nodes. Ideally, if each path is balanced, the clock skew is zero. In reality, however, as discussed in the previous section, process and environmental variations cause clock skew and jitter to occur.
The H-tree configuration is particularly useful for regular-array networks in which all elements are identical and the clock can be distributed as a binary tree (for example, arrays of identical tiled processors).
    The more general approach, referred to as routed RC trees, represents a floor plan that distributes the clock signal so that the interconnections carrying the clock signals to the functional sub-blocks are of equal length.
    2) Grid configuration: Grids are typically used in the final stage of the clock network to distribute the clock to the clocking element loads (Fig. (b)). The main difference from the tree approach is that the delay from the final driver to each load is not matched. Rather, the absolute delay is minimized, assuming that the grid size is small.
  • 44.
    Advantage: It allows for late design changes, since the clock is easily accessible at various points on the die.
    Disadvantage: The structure has a lot of unnecessary interconnect.
    Design Techniques - Dealing with Clock Skew and Jitter
    To fully exploit the improved performance of logic gates with technology scaling, clock skew and jitter must be carefully addressed. Skew and jitter can fundamentally limit the performance of a digital circuit. Some guidelines for reducing clock skew and jitter are presented below.
    1. To minimize skew, balance clock paths from a central distribution source to individual clocking elements using H-tree structures or, more generally, routed tree structures. When using routed clock trees, the effective clock load of each path, which includes wiring as well as transistor loads, must be equalized.
    2. The use of local clock grids (instead of routed trees) can reduce skew at the cost of increased capacitive load and power dissipation.
    3. If data-dependent clock load variations cause significant jitter, differential registers that have a data-independent clock load should be used. The use of gated clocks to save power also results in a data-dependent clock load and increased jitter. In clock networks where the fixed load is large (e.g., when using clock grids), the data-dependent variation might not be significant.
    4. If data flows in one direction, route data and clock in opposite directions. This eliminates races at the cost of performance.
    5. Avoid data-dependent noise by shielding clock wires from adjacent signal wires. By placing power lines (VDD or GND) next to the clock wires, coupling from neighboring signal nets can be minimized or avoided.
    6. Variations in interconnect capacitance due to inter-layer dielectric thickness variation can be greatly reduced through the use of dummy fills.
    Dummy fills are very common and reduce skew by increasing uniformity. Systematic variations should be modeled and compensated for.
    7. Variation in chip temperature across the die causes variations in clock buffer delay. Feedback circuits based on delay-locked loops can compensate for temperature variations.
    8. Power supply variation is a significant component of jitter, as it impacts the cycle-to-cycle delay through clock buffers. High-frequency power supply variation can be reduced by adding on-chip decoupling capacitors. Unfortunately, decoupling capacitors require a significant amount of area, and efficient packaging solutions must be leveraged to reduce chip area.
    Asynchronous Design
    Self-Timed Logic - An Asynchronous Technique
    The synchronous design approach advocated in the previous sections assumes that all circuit events are orchestrated by a central clock. Those clocks have a dual function.
    • They ensure that the physical timing constraints are met.
    • Clock events serve as a logical ordering mechanism for the global system events.
    Consider the pipelined datapath of the figure below. In this circuit, the data transitions through logic stages under the command of the clock. The important point to note under this methodology is that the clock period is chosen to be larger than the worst-case delay of each pipeline stage, or $T > \max(t_{pd1}, t_{pd2}, t_{pd3}) + t_{pd,reg}$. At each clock transition, a new set of inputs is sampled and computation is started anew. The throughput of the system, which is the number of data samples processed per second, is equal to the clock rate.
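    As a rough illustration of the clock-period constraint above, the following Python sketch evaluates the bound $\max(t_{pd,i}) + t_{pd,reg}$ for a three-stage pipeline. The function names and all delay values are hypothetical, and adding a $2t_{jitter}$ margin to the same bound is an assumption made here for illustration (it follows the worst-case jitter argument given earlier, not a formula stated in these notes).

```python
def clock_period_bound_ns(stage_delays_ns, t_reg_ns, t_jitter_ns=0.0):
    """Lower bound on the clock period (ns) of a synchronous pipeline.

    Applies T > max(tpd_i) + tpd,reg. As an added assumption, it also
    reserves 2 * t_jitter for worst-case jitter.
    """
    return max(stage_delays_ns) + t_reg_ns + 2.0 * t_jitter_ns


def throughput_msamples_per_s(period_ns):
    """One sample per clock cycle, so throughput equals the clock rate."""
    return 1e3 / period_ns


# Hypothetical stage delays, chosen only for illustration.
stages_ns = [1.0, 2.5, 1.5]
bound_no_jitter = clock_period_bound_ns(stages_ns, t_reg_ns=0.5)
bound_with_jitter = clock_period_bound_ns(stages_ns, t_reg_ns=0.5, t_jitter_ns=0.25)
print(bound_no_jitter, bound_with_jitter)
print(throughput_msamples_per_s(bound_with_jitter))
```

    Note that the slowest stage alone sets the bound: speeding up the other two stages does not raise the throughput, which is the "as bad as the worst of the set" behaviour of synchronous design.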
  • 45.
    • Advantages: It presents a structured, deterministic approach to the problem of choreographing the myriad of events that take place in digital designs. The approach taken is to equalize the delays of all operations by making them as bad as the worst of the set. The approach is robust and easy to adhere to.
    • Disadvantages: It assumes that all clock events or timing references happen simultaneously over the complete circuit. This is not the case in reality, because of effects such as clock skew and jitter.
    One way to avoid these problems is to opt for an asynchronous design approach and to eliminate all the clocks. A more reliable and robust technique is the self-timed approach, which presents a local solution to the timing problem. The approach in the figure below assumes that each combinational function has a means of indicating that it has completed a computation for a particular piece of data. The computation of a logic block is initiated by asserting a Start signal. The combinational logic block computes on the input data and, in a data-dependent fashion (taking the physical constraints into account), generates a Done flag once the computation is finished. Additionally, the operators must signal each other that they are either ready to receive a next input word or that they have a legal data word at their outputs that is ready for consumption. This signaling ensures the logical ordering of the events and can be achieved with the aid of extra Ack(nowledge) and Req(uest) signals. In the case of the pipelined datapath, the scenario could proceed as follows.
    1. An input word arrives, and a Req(uest) to the block F1 is raised. If F1 is inactive at that time, it transfers the data and acknowledges this fact to the input buffer, which can go ahead and fetch the next word.
    2. F1 is enabled by raising the Start signal. After a certain amount of time, dependent upon the data values, the Done signal goes high, indicating the completion of the computation.
    3. A Req(uest) is issued to the F2 module. If this function is free, an Ack(nowledge) is raised, the output value is transferred, and F1 can go ahead with its next computation.
    The self-timed approach effectively separates the physical and logical ordering functions implied in circuit timing. The completion signal Done ensures that the physical timing constraints are met and that the circuit is in steady state before accepting a new input. The logical ordering of the operations is ensured by the acknowledge-request scheme, often called a handshaking protocol.
    • In contrast to the global, centralized approach of the synchronous methodology, timing signals are generated locally. This avoids the problems and overheads associated with distributing high-speed clocks.
    • Separating the physical and logical ordering mechanisms results in a potential increase in performance.
  • 46.
    • The automatic shut-down of blocks that are not in use can result in power savings.
    • Self-timed circuits are by nature robust to variations in manufacturing and operating conditions such as temperature.
    Unfortunately, these properties do not come for free; they come at the expense of a substantial circuit-level overhead, which is caused by the need to generate completion signals and the need for handshaking logic that acts as a local traffic agent to order the circuit events.
    DESIGNING OF MEMORY AND ARRAY STRUCTURES
    A large portion of the Si area of many contemporary digital designs is dedicated to the storage of data values and program instructions.
    Memory Classification: Classification criteria
    I. Size: Depending upon the level of abstraction, different means are used to express the size of a memory unit. The circuit designer tends to define the size of a memory in terms of the number of bits, which is equivalent to the number of individual cells (flip-flops) needed to store the data. The chip designer expresses the memory size in bytes or its multiples. The system designer likes to quote the storage requirement in words.
    II. Timing parameters: The time it takes to retrieve data (read) from the memory is called the read-access time, which is equal to the delay between the read request and the moment the data is available at the output. This time is different from the write-access time, which is the time elapsed between a write request and the final writing of the input data into the memory. The read or write cycle time of the memory is the minimum time required between successive reads or writes.
    III. Function and access patterns
    IV. Input/output architecture: Number of data input and output ports (multiport memories)
    V. Application: Standalone ICs, embedded memories, secondary or tertiary memories (magnetic and optical disc)
  • 47.
    Memory architecture and building blocks:
    When implementing an N-word memory where each word is M bits wide, the most intuitive approach is to stack the subsequent memory words in a linear fashion. One word at a time is selected for reading or writing with the aid of a select bit ($S_0$ to $S_{N-1}$), if we assume that this module is a single-port memory. A decoder is inserted to reduce the number of select signals. A memory word is selected by providing a binary-encoded address word ($A_0$ to $A_{K-1}$). The decoder translates this address into $N = 2^K$ select lines, only one of which is active at a time. This approach reduces the number of address lines from N to $\log_2 N = K$.
    This design does not address the issue of the memory aspect ratio (the height is very large compared to the width), which results in a design that cannot be implemented. Besides the bizarre shape factor, the resulting design is extremely slow, because the vertical wires connecting the storage cells to the input/output become excessively long. To address this problem, memory arrays are organized so that the vertical and horizontal dimensions are of the same order of magnitude; thus the aspect ratio approaches unity. Multiple words are stored in a single row and are selected simultaneously. To route the correct word to the input/output terminals, an extra piece of circuitry called the column decoder is needed. The address word is partitioned into a column address ($A_0$ to $A_{K-1}$) and a row address ($A_K$ to $A_{L-1}$). The row address enables one row of the memory for R/W, while the column address picks one particular word from the selected row.
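    The address partitioning described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names are hypothetical, and the example memory size (1024 words of 8 bits) is an assumption chosen only to show how the choice of K changes the array shape.

```python
def split_address(address, column_bits, total_bits):
    """Split an L-bit address into (row, column) fields.

    The column address is the low K bits (A0 .. A(K-1)); the row
    address is the remaining bits (A(K) .. A(L-1)).
    """
    if not 0 <= address < (1 << total_bits):
        raise ValueError("address out of range")
    column = address & ((1 << column_bits) - 1)
    row = address >> column_bits
    return row, column


def array_shape(total_bits, column_bits, word_width):
    """Return (rows, bit columns) of the cell array for a given K."""
    rows = 1 << (total_bits - column_bits)
    bit_columns = (1 << column_bits) * word_width
    return rows, bit_columns


# Hypothetical memory: 1024 words (L = 10 address bits) of 8 bits each.
for k in range(0, 5):
    print(k, array_shape(total_bits=10, column_bits=k, word_width=8))

print(split_address(0b101101, column_bits=2, total_bits=6))
```

    With K = 0 the array is 1024 rows by 8 columns (the tall, linear organization), while K = 3 or K = 4 brings the two dimensions to within a factor of two of each other.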
  • 48.
    For larger memories, the memory is partitioned into P smaller blocks. The composition of each of the individual blocks is identical to the above figure. A word is selected based on the row and column addresses that are broadcast to all the blocks. An extra address word, called the block address, selects one of the P blocks to be read or written. This approach has a dual advantage.
    1. The length of the local word lines and bit lines, i.e. the length of the lines within the blocks, is kept within bounds, resulting in faster access times.
    2. The block address can be used to activate only the addressed block. Non-active blocks are put in power-saving mode, with sense amplifiers and row and column decoders disabled. This results in a substantial power saving.
    The Memory Core
    Read only memories
    Programs for processors with fixed applications, such as washing machines, calculators and game machines, need only be read once they have been developed and debugged.
    ROM cells - An overview: The cell should be designed so that a 1 or 0 is presented to the bit line upon activation of its word line. The figure shows several ways to accomplish this.
    Diode ROM
    • The bit line (BL) is resistively clamped to ground, i.e. BL is pulled low through the resistor connected to ground in the absence of any other excitations or inputs.
    • 0 cell: There is no physical connection between the BL and the word line (WL).
    • When a high voltage is applied to the WL of a 1 cell, the diode is enabled and the BL is pulled up to $V_{WL} - V_{D,on}$, resulting in a 1 on the BL.
    • Disadvantage: The diode cell does not isolate the BL from the WL.
  • 49.
    • A better approach is to use an active device in the cell. The diode is replaced by the gate-source connection of an NMOS transistor whose drain is connected to the supply voltage.
  • 50.
    Read Write memories (RAM)
    Static RAM (SRAM)
    A generic SRAM cell consists of 6 transistors (6T) per bit. Access to the cell is enabled by the WL, which replaces the clock and controls two pass transistors, M5 and M6, shared between the read and write operations. In contrast to ROM cells, two bit lines, transferring both the stored signal and its inverse, are required. Doing so improves the noise margin during both read and write operations.
    Operation of SRAM cell
    Read operation: Assume that a 1 is stored at Q. Both bit lines are precharged to 2.5 V before the read operation is initiated. The read cycle is started by asserting the word line, enabling both pass transistors M5 and M6 after the initial WL delay. During a correct read operation, the values stored in Q and Q_BAR are transferred to the bit lines by leaving BL at its precharged value and discharging BL_BAR through M1 and M5. Careful sizing of the transistors is necessary to avoid accidentally writing a 1 into the cell. This type of malfunction is frequently called a read upset.
    Write operation: Assume that a 1 is stored in the cell (Q = 1). A 0 is written into the cell by setting BL_BAR to 1 and BL to 0, which is equivalent to applying a reset pulse to an SR latch. This causes the flip-flop to change its state if the devices are properly sized.
  • 51.
    Dynamic RAM (DRAM)
    3T Dynamic Memory cell:
    The cell is written by placing the appropriate data value on BL1 and asserting the write word line (WWL). The data is retained as a charge on the capacitance CS once WWL is lowered. When reading the cell, the read word line (RWL) is raised. The storage transistor M2 is either on or off, depending on the stored value. The bit line BL2 is either clamped to VDD with the aid of a load device or precharged to either VDD or VDD-VT. The series connection of M2 and M3 pulls BL2 low when a 1 is stored. BL2 remains high in the opposite case. Notice that the cell is inverting, i.e. the inverse value of the stored signal is sensed on the BL. The most common approach to refreshing a cell is to read the stored data, put its inverse on BL1 and assert WWL, in consecutive order.
    The properties of the 3T cell
    • In contrast to the SRAM cell, no constraints exist on the device ratios.
    • Reading the 3T cell is non-destructive, i.e. the data value stored in the cell is not affected by a read.
    • No special process steps are needed. The storage capacitance is nothing more than the gate capacitance of the readout device.
    Memory Peripheral Circuitry (Control Circuitry)
    Since the memory core trades performance and reliability for reduced area, memory design relies heavily on the peripheral circuitry to recover both speed and electrical integrity.
    The address decoders: Whenever a memory allows for random address-based access, address decoders must be present. There are two classes of decoders: the row decoders, whose task is to enable one memory row out of $2^M$, and the column and block decoders, which can be described as $2^K$-input multiplexers, where M and K are the widths of the respective fields in the address word.
    Row decoders: A 1-out-of-$2^M$ decoder is nothing less than a collection of $2^M$ complex M-input logic gates. Consider an 8-bit address decoder.
    Each of the outputs $WL_i$ is a logic function of the 8 input address signals ($A_0$ to $A_7$). For example, address 0 is enabled by the following logic function:
    $WL_0 = \bar{A}_0 \bar{A}_1 \bar{A}_2 \bar{A}_3 \bar{A}_4 \bar{A}_5 \bar{A}_6 \bar{A}_7$
    For a single-stage implementation, it can be transformed into a wide NOR using De Morgan's rules:
    $WL_0 = \overline{A_0 + A_1 + A_2 + A_3 + A_4 + A_5 + A_6 + A_7}$
    Static Decoder Design: Implementing a wide NOR function in complementary CMOS is impractical. Splitting a complex gate into two or more logic layers most often produces a faster and cheaper implementation.
  • 52.
    Segments of the address are decoded in a first layer of logic called the predecoder. A second layer of logic gates then produces the final word line signals:
    $WL_0 = \overline{(A_0 + A_1)} \cdot \overline{(A_2 + A_3)} \cdot \overline{(A_4 + A_5)} \cdot \overline{(A_6 + A_7)}$
    For this particular case, the address is partitioned into sections of 2 bits that are decoded in advance. The resulting signals are combined using 4-input NAND gates (each followed by an inversion for an active-high word line) to produce the fully decoded array of WL signals.
    Dynamic Decoders: Since only one transition determines the decoder speed, it is interesting to evaluate other circuit implementations.
    Column and Block decoders: The functionality of a column and block decoder is best described as a $2^K$-input multiplexer, where K stands for the size of the address word. One implementation is based on the CMOS pass-transistor multiplexer. The control signals of the pass transistors are generated using a K-to-$2^K$ predecoder. The schematic of a 4-to-1 column decoder using only NMOS transistors is shown. The main advantage of this implementation is its speed. Only a single pass transistor is inserted in the signal path, which introduces only a minimal extra resistance. Column decoding is one of the last actions to be performed in the read sequence, so the predecoding can be executed in parallel with other operations, such as memory access and sensing, and can be performed as soon as the column address is available. Consequently, its propagation delay does not add to the overall memory access time.
    A more efficient implementation is offered by a tree decoder that uses a binary reduction scheme. Notice that no predecoder is required. The number of devices is drastically reduced, as shown.
    $N_{tree} = 2^K + 2^{K-1} + \dots + 4 + 2 = 2(2^K - 1)$
    [Figure: A 4-to-1 tree-based column decoder]
    Sense Amplifiers: They perform the following functions:
    • Amplification: In certain memory structures, such as the 1T DRAM, amplification is required for proper functionality, since the typical circuit swing is limited to 100 mV.
    • Delay reduction: The amplifier compensates for the restricted fan-out driving capability of the memory cell by accelerating the BL transition, or by detecting and amplifying small transitions on the BL to large output swings.
    • Power reduction: Reducing the signal swing on the bit lines can eliminate a substantial part of the power dissipation related to the charging and discharging of the bit lines.
    • Signal restoration: Because the read and refresh functions are intrinsically linked in 1T DRAMs, it is necessary to drive the BLs to the full signal range after sensing.
    Differential Voltage Sensing Amplifiers: The effectiveness of a differential amplifier is characterized by:
  • 53.
    1. Common-mode rejection ratio (CMRR): the ability to amplify the true difference between the signals and reject the common noise.
    2. Power supply rejection ratio (PSRR): the ability to reject spikes on the power supply.
    The figure shows the most basic sense amplifier. Amplification is achieved with a single stage, based on the current-mirroring concept. The input signals are heavily loaded and driven by the SRAM memory cell. The swing on those lines is small, as the small memory cell drives a large capacitive load. The inputs are fed to the differential input devices (M1 and M2), and M3 and M4 act as an active current-mirror load. The amplifier is conditioned by the sense amplifier enable signal SE. Initially, the inputs are precharged and equalized to a common value while SE is low, disabling the circuit. Once the read operation is initiated, one of the bit lines drops. SE is enabled when a sufficient differential signal has been established, and the amplifier evaluates.
    Power dissipation in memories:
    Reduction of power dissipation in memories is becoming of premier importance. Technology scaling, with its reduction in supply and threshold voltages and its deterioration of the off current of the transistor, causes the standby power of the memory to rise.
    Sources of power dissipation in memories: The power consumption in a memory chip can be attributed to three major sources: the memory cell array, the decoders (block, row, column) and the periphery.
    A unified active power equation for a modern CMOS memory array of m columns and n rows is approximately given by the following. For a normal read cycle,
    $P = V_{DD} I_{DD}$
    $I_{DD} = I_{array} + I_{decode} + I_{periphery} = [m\,i_{act} + m(n-1)\,i_{hld}] + [(n+m)\,C_{DE} V_{int} f] + [C_{PT} V_{int} f + I_{DCP}]$
    where
    $i_{act}$: effective current of the selected or active cells;
    $i_{hld}$: the data retention current of the inactive cells;
    $C_{DE}$: output capacitance of each decoder;
    $C_{PT}$: the total capacitance of the CMOS logic and peripheral circuits;
    $V_{int}$: internal supply voltage;
    $I_{DCP}$: the static or quasi-static current of the periphery. The major sources of this current are the sense amplifiers and the column circuitry. Other sources are the on-chip voltage generators;
    $f$: operating frequency.
    The power dissipation is proportional to the size of the memory. Dividing the memory into subarrays and keeping n and m small are essential to keep the power within bounds. In general, the power dissipation of the memory is dominated by the array. The active power dissipation of the peripheral circuits is small compared to the other components. Its standby power can be high, however, requiring that circuits such as sense amplifiers be turned off when not in action. The decoder charging current is also negligibly small in modern RAMs, especially if care is taken that only one out of the n or m nodes is charged at every cycle.
    Power reduction Techniques:
    1) Partitioning of the memory
    A proper division of the memory into submodules goes a long way in confining active power dissipation to limited areas of the overall array. Memory units that are not in use should consume only the power necessary for data retention. Memory partitioning is accomplished by reducing m (the number of cells on a word line) and/or n (the number of cells on a bit line). By dividing the word line into several sub-word lines that are enabled only when addressed, the overall switched capacitance per access is reduced.
  • 54.
    Partitioning of the bit line reduces the capacitance switched at every read/write operation. An approach that is often used in DRAM memories is the partially activated bit line. The bit line is partitioned into multiple sections. All sections share a common sense amplifier, column decoder and I/O module.
    2) Addressing the active power dissipation
    Reducing the voltage levels is one of the most effective techniques to reduce power dissipation in memories.
    SRAM Active power dissipation: To obtain a fast read operation, the voltage swing on the bit line is made as small as possible, typically between 0.1 and 0.3 V. The resulting signal is sent to the sense amplifier for restoration. Since the signal is developed as a result of the ratioed operation of the bit line load and the cell transistor, a current flows through the bit line for as long as the word line is activated (a time t). Limiting t and the bit line swing helps to keep the active dissipation of the SRAM low. The situation is worse for the write operation, since BL and BL_BAR have to make a full excursion. Reduction of the core voltage is the only remedy for this. Ultimately, the reduction of the core voltage is limited by the mismatch between the paired MOS transistors in the SRAM cell. Stringent control of the MOS transistor characteristics, either at process time or at run time using techniques such as body biasing, is essential in low-voltage operation.
    DRAM Active power dissipation: The destructive readout process of a DRAM necessitates successive operations of readout, amplification and restoration of the selected cells. Consequently, the bit lines are charged and discharged over the full swing ($V_{BL}$) for every read operation.
    Care should thus be taken to reduce the bit line dissipation charge $m C_{BL} V_{BL}$, since it dominates the active power. Reducing $C_{BL}$ (the bit line capacitance) is advantageous from both a power and an SNR perspective. Reducing $V_{BL}$, while very beneficial from a power perspective, negatively impacts the SNR. Voltage reduction thus has to be accompanied by an increase in the size of the storage capacitor and/or a noise reduction. A number of techniques have proven to be quite effective.
    a) Half-VDD precharge: Precharging the bit lines to VDD/2 helps to reduce the active power in DRAM memories by a factor of almost 2.
    b) Boosted word line: Raising the value of the WL above VDD during a write operation eliminates the threshold drop over the access transistor, yielding a substantial increase in stored charge.
    c) Increased capacitor area or value: Vertical capacitors, such as those used in stacked and trench cells, are very effective in increasing the capacitance value. Keeping the ground plate of the storage capacitor at VDD/2 reduces the maximum voltage over CS, making it possible to use thinner oxides.
    d) Increasing the cell size: Ultra-low-voltage DRAM operation might require a sacrifice in area efficiency, especially for memories that are embedded in a system-on-chip.
    3) Data retention dissipation:
    Data retention in SRAMs
    In principle, an SRAM array should not have any static power dissipation. Yet the leakage current of the cell transistors (due to subthreshold leakage) is becoming a major source of retention current.
    Techniques to reduce the retention current of SRAM memories:
    a) Turning off unused memory blocks: Memory functions such as caches do not fully use the available capacity most of the time. Disconnecting unused blocks from the supply rails using high-threshold switches reduces their leakage to very low values. Obviously, the data stored in the memory is lost in this approach.
    b) Increasing the threshold by using body biasing: Negative bias of the non-active cells increases the thresholds of the devices and reduces the leakage current.
    c) Inserting extra resistance in the leakage path: When data retention is necessary, the insertion of a low-threshold switch in the leakage path provides a means to reduce the leakage current while keeping the data intact. The low-threshold device leaks on its own, and this leakage is sufficient to maintain the state in the memory. At the same time, the voltage drop over the switch introduces a "stacking effect" in the memory cells connected to it. A reduction of VGS combined with a negative VBS results in a substantial drop in the leakage current.
  • 55.
    d) Lowering supply voltage:
    DRAM Retention power: To combat leakage and loss of signal, DRAMs have to be refreshed continuously when in data retention mode. The refresh operation is performed by reading the m cells connected to a word line and restoring them. This operation is performed for each of the n word lines in sequence. The standby power is thus proportional to the bit line dissipation charge and the refresh frequency. The secret to leakage minimization in DRAM memories is VT control. This can be accomplished at design time (the fixed-VT approach) or dynamically (the variable-VT technique). One option to reduce leakage through the access transistor in the DRAM cell is to turn the device off hard by applying a negative voltage (-VWL) to the word line of non-active cells.
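    To make the read-cycle current expression from the power dissipation discussion concrete, the Python sketch below evaluates its three bracketed terms, $I_{array} = m\,i_{act} + m(n-1)\,i_{hld}$, $I_{decode} = (n+m)\,C_{DE} V_{int} f$ and $I_{periphery} = C_{PT} V_{int} f + I_{DCP}$, and compares one large array with a smaller subarray. The function names and every numeric value are hypothetical and chosen only to show the scaling with m and n; they are not data from any real memory.

```python
def read_cycle_currents(m, n, i_act, i_hld, c_de, c_pt, v_int, f, i_dcp):
    """Return (I_array, I_decode, I_periphery) in amperes for one read cycle."""
    i_array = m * i_act + m * (n - 1) * i_hld
    i_decode = (n + m) * c_de * v_int * f
    i_periphery = c_pt * v_int * f + i_dcp
    return i_array, i_decode, i_periphery


def read_cycle_power(v_dd, currents):
    """P = VDD * IDD, with IDD the sum of the three current components."""
    return v_dd * sum(currents)


# Hypothetical parameter values, for illustration only.
params = dict(i_act=10e-6, i_hld=1e-9, c_de=50e-15, c_pt=100e-12,
              v_int=1.2, f=100e6, i_dcp=1e-3)

full_array = read_cycle_currents(m=1024, n=1024, **params)
sub_array = read_cycle_currents(m=256, n=256, **params)

print("full array  :", full_array, read_cycle_power(1.2, full_array))
print("one subarray:", sub_array, read_cycle_power(1.2, sub_array))
```

    Under these assumed values the array term shrinks roughly in proportion to m when only one subarray is activated, which is the motivation for partitioning.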
  • 56.
    UNIT IV
    DESIGNING ARITHMETIC BUILDING BLOCKS
    Data path circuits:
    Fig. Basic DSP architecture
    Building Blocks for Digital Architectures include
    • Arithmetic unit - Bit-sliced datapath (adder, multiplier, shifter, comparator, etc.)
    • Memory - RAM, ROM, buffers, shift registers
    • Control - Finite state machine (PLA, random logic), counters
    • Interconnect - Switches, arbiters, bus
    BIT-SLICED DATA PATH ORGANISATION:
    Datapaths are often arranged in a bit-sliced organisation. Data processing in a processor is word based. Typical microprocessor datapaths are 32 bits or 64 bits wide. Those in DSL modems, magnetic disk drives and compact disk players are of arbitrary width, typically 5 to 24 bits. A 32-bit datapath consists of 32 bit slices, each operating on a single bit. Hence the name bit-sliced datapath organisation.
    Arithmetic Building Blocks of the bit-sliced datapath organisation include
    • Datapath elements - registers
    • Adder design - Static adder, Dynamic adder
    • Multiplier design - Array multipliers
    • Shifters, Parity circuits
    Fig.: Bit-sliced Datapath Organisation
    ADDERS:
    Addition forms the basis for many processing operations, from ALUs to address generation to multiplication to filtering. As a result, adder circuits that add two binary numbers are of great interest to digital system designers.
  • 57.
    EC8095: VLSI DesignDepartment of ECE 2020-2021 St.Joseph’s College of Engineering / St.Joseph’s Institute of Technology 57 FULL ADDER: For a full adder, it is sometimes useful to define Generate (G), Propagate (P), and Kill (K) signals. The adder generates a carry when Cout is true independent of Cin, so G = A · B. The adder kills a carry when Cout is false independent of Cin, so K = A · B = A + B. The adder propagates a carry; i.e., it produces a carry-out if and only if it receives a carry-in, when exactly one input is true: P = A B. The sum and carry out signals interms of G,P,K can be given by: Co(G,P)=G+PCi S(G,P)=P Gi RIPPLE CARRY ADDER: The delay of N-bit Ripple carry adder can be given by tadder= (N-1)t Carry + t sum There are two significant conclusion from the delay equation 1.The propogation delay of Ripple carry adder is linearly proportional to N.This properties becomes increasingly important when designing adders for the wide datapaths(N=16,…128) 2.For designing the fast RCA using full adder, it is important to optimize the t carry. Inverting property of RCA: Inverting all inputs to a full adder results in inverted output and it can be expressed as
  • 58.
    EC8095: VLSI DesignDepartment of ECE 2020-2021 St.Joseph’s College of Engineering / St.Joseph’s Institute of Technology 58 S‘(A,B,Ci)=S(A‘,B‘,Ci‘) Co‘(A,B,Ci)=Co(A‘,B‘,Ci‘) S=A B C S‘(A,B,Ci)=S(A‘,B‘,Ci‘) Co=AB+BCi+ACi Co‘(A,B,Ci)=Co(A‘,B‘,Ci‘) COMPLIMENTARY STATIC CMOS FULL ADDER USING 28 TRANSISTOR: Fig: Static Cmos Full Adder Using 28 Transistor  Complimentary Static Full adder consumes 28 transistors .Hence it consumes large area and the circuit is slow .  Tall PMOS transistor stacks are present in both carry and sum generation circuits.  The Intrinsic load capacitance of Co signal is large and consist of two diffusion and six gate capacitances, plus the wiring capacitance.  The signal propagates through the inverting stages in the carry generation circuits. Minimizing the carry path delay is the prime goal of the designer in the high speed adder circuit .  The sum generation requires one extra logic stage and is not that significant as the sum delay factor appears only once in the propagation delay of RCA . MIRROR ADDER CIRCUIT DESIGN: Fig: Mirror Adder Design of Full Adder  The NMOS and PMOS chains are completely symmetrical. This guarantees identical
  rising and falling transitions if the NMOS and PMOS devices are properly sized. A maximum of two series transistors can be observed in the carry-generation circuitry.
• When laying out the cell, the most critical issue is the minimization of the capacitance at node Co. The reduction of the diffusion capacitance is particularly important.
• The capacitance at node Co is composed of four diffusion capacitances, two internal gate capacitances, and six gate capacitances in the connecting adder cell.
• The transistors connected to Ci are placed closest to the output.
• Only the transistors in the carry stage have to be optimized for speed. All transistors in the sum stage can be minimum size.
MANCHESTER CARRY CHAIN ADDER:
Fig: Manchester carry chain adder
• A Manchester carry chain adder uses a cascade of pass transistors to implement the carry chain.
• During the precharge phase (Φ = 0), all intermediate nodes of the pass-transistor carry chain are precharged to Vdd.
• During evaluation, a node is discharged when there is an incoming carry and the propagate signal is high, or when the generate signal for that stage is high.
• The worst-case delay of the carry chain adder is modeled by a linearized RC network.
• Increasing the transistor width reduces the time constant, but it loads the gates in the previous stage.
• Therefore the transistor size is limited by the input loading capacitance.
• The distributed RC nature of the carry chain results in a propagation delay that is quadratic in the number of bits N.
• To avoid this, it is necessary to insert signal-buffering inverters.
• Adding the inverters makes the overall propagation delay a linear function of N, as is the case with ripple carry adders.
LOOK AHEAD ADDER DESIGN
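As a baseline for the look-ahead designs, the ripple carry adder described earlier can be modeled in a few lines. The sketch below is only a behavioral illustration: the function names full_adder and ripple_carry_add are assumptions made for this example, not names from the notes. It builds the full adder from the G and P signals, chains N of them, and checks the inverting property for all eight input combinations.

```python
from itertools import product


def full_adder(a, b, ci):
    """One-bit full adder written in terms of generate and propagate."""
    g = a & b            # generate: carry-out is 1 regardless of ci
    p = a ^ b            # propagate: carry-out follows ci
    s = p ^ ci           # S(G,P) = P xor Ci
    co = g | (p & ci)    # Co(G,P) = G + P.Ci
    return s, co


def ripple_carry_add(a, b, n, ci=0):
    """Add two n-bit integers by chaining n full adders, LSB first."""
    total = 0
    for i in range(n):
        s, ci = full_adder((a >> i) & 1, (b >> i) & 1, ci)
        total |= s << i
    return total, ci     # n-bit sum and the final carry-out


# Inverting property: inverting every input inverts both outputs.
for a, b, ci in product((0, 1), repeat=3):
    s, co = full_adder(a, b, ci)
    assert full_adder(a ^ 1, b ^ 1, ci ^ 1) == (s ^ 1, co ^ 1)

print(ripple_carry_add(11, 6, 4))  # 11 + 6 = 17 -> (1, 1): sum bits 0001, carry-out 1
```

The loop in ripple_carry_add makes the linear delay visible: the carry into bit i is not known until the i full adders below it have been evaluated.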
LOOK AHEAD - BASIC IDEA
• Carry look-ahead logic uses the concepts of generating and propagating carries.
• A carry-lookahead adder improves speed by reducing the amount of time required to determine carry bits.
• The carry-lookahead adder calculates one or more carry bits before the sum. This reduces the wait time to calculate the result of the higher-order bits. The Kogge-Stone adder and Brent-Kung adder are examples of this type of adder.
• Carry lookahead depends on two things:
  - Calculating, for each digit position, whether that position is going to propagate a carry if one comes in from the right.
  - Combining these calculated values to deduce quickly whether, for each group of digits, that group is going to propagate a carry that comes in from the right.
Suppose that groups of 4 digits are chosen. Then the sequence of events goes something like this:
  - All 1-bit adders calculate their results. Simultaneously, the lookahead units perform their calculations.
  - Suppose that a carry arises in a particular group. Within at most 5 gate delays, that carry will emerge at the left-hand end of the group and start propagating through the group to its left.
  - If that carry is going to propagate all the way through the next group, the lookahead unit will already have deduced this. Accordingly, before the carry emerges from the next group, the lookahead unit is immediately (within 1 gate delay) able to tell the next group to the left that it is going to receive a carry and, at the same time, to tell the next lookahead unit to the left that a carry is on its way.
CARRY-LOOK-AHEAD ADDERS:
• Objective: generate all incoming carries in parallel.
• Feasible: the carries depend only on x_{n-1}, x_{n-2}, ..., x_0 and y_{n-1}, y_{n-2}, ..., y_0, information that is available to all stages for calculating the incoming carry and sum bit.
• Computing each carry directly requires a large number of inputs to each stage of the adder, which is impractical.
• The number of inputs at each stage can be reduced by finding out from the inputs whether new carries will be generated and whether they will be propagated.
CARRY PROPAGATION
• If x_i = y_i = 1: a carry-out is generated regardless of the incoming carry; no additional information is needed.
• If x_i y_i = 10 or x_i y_i = 01: the incoming carry is propagated.
• If x_i = y_i = 0: no carry propagation.
• G_i = x_i y_i is the generated carry; P_i = x_i + y_i is the propagated carry.
• c_{i+1} = x_i y_i + c_i (x_i + y_i) = G_i + c_i P_i
• Substituting c_i = G_{i-1} + c_{i-1} P_{i-1} gives c_{i+1} = G_i + G_{i-1} P_i + c_{i-1} P_{i-1} P_i
• Further substitutions: c_{i+1} = G_i + G_{i-1} P_i + G_{i-2} P_{i-1} P_i + c_{i-2} P_{i-2} P_{i-1} P_i = ... = G_i + G_{i-1} P_i + G_{i-2} P_{i-1} P_i + ... + c_0 P_0 P_1 ... P_i
• All carries can be calculated in parallel from x_{n-1}, x_{n-2}, ..., x_0, y_{n-1}, y_{n-2}, ..., y_0 and the forced carry c_0.
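The fully expanded carry expression can be checked with a short behavioral sketch. The names lookahead_carries and cla_add are assumptions made for this illustration. Each carry is evaluated as a sum of products of the G_i, P_i and c_0 only, so no carry depends on another carry; P_i = x_i + y_i is used for the carries as in the text, while the sum bit uses x_i xor y_i xor c_i.

```python
def lookahead_carries(x, y, n, c0=0):
    """Return [c_0, ..., c_n], each computed directly from the inputs and c0."""
    G = [(x >> i) & (y >> i) & 1 for i in range(n)]      # G_i = x_i y_i
    P = [((x >> i) | (y >> i)) & 1 for i in range(n)]    # P_i = x_i + y_i
    carries = [c0]
    for i in range(n):
        # c_{i+1} = G_i + G_{i-1} P_i + ... + c_0 P_0 P_1 ... P_i
        terms = []
        for j in range(i, -1, -1):
            term = G[j]
            for m in range(j + 1, i + 1):
                term &= P[m]
            terms.append(term)
        tail = c0
        for m in range(i + 1):
            tail &= P[m]
        terms.append(tail)
        carries.append(int(any(terms)))
    return carries


def cla_add(x, y, n, c0=0):
    """n-bit addition using the look-ahead carries; returns (sum, carry-out)."""
    c = lookahead_carries(x, y, n, c0)
    s = 0
    for i in range(n):
        s |= ((((x >> i) ^ (y >> i)) & 1) ^ c[i]) << i   # s_i = x_i xor y_i xor c_i
    return s, c[n]


print(lookahead_carries(0b0111, 0b0001, 4))  # [0, 1, 1, 1, 0]
print(cla_add(0b0111, 0b0001, 4))            # (8, 0)
```

The nested loops grow with i, which mirrors the hardware cost noted above: the product terms for high-order carries need many inputs, and this is why practical designs group the bits.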
Fig: Mirror implementation of look-ahead carry adder
Fig: Look-ahead topology
Carry output equations for a 4-bit look-ahead adder:
c_1 = G_0 + c_0 P_0
c_2 = G_1 + G_0 P_1 + c_0 P_0 P_1
c_3 = G_2 + G_1 P_2 + G_0 P_1 P_2 + c_0 P_0 P_1 P_2
c_4 = G_3 + G_2 P_3 + G_1 P_2 P_3 + G_0 P_1 P_2 P_3 + c_0 P_0 P_1 P_2 P_3
4-bit module design
Addition can be reduced to a three-step process:
1. Computing bitwise generate (G) and propagate (P) signals: bitwise PG logic
2. Combining PG signals to determine group generate (G) and propagate (P) signals: group PG logic
3. Calculating the sums: sum logic
Fig: 4-bit carry look-ahead adder module
16-bit carry look-ahead adder design
In general, a CLA using k groups of n bits each has a delay of
tds = tpg + tpg(n) + [(n-1) + (k-1)] tAO + txor
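The three-step process can be sketched behaviorally for a 16-bit adder made of four 4-bit groups. All function names here (bitwise_pg, group_pg, cla_add_grouped) are assumptions made for the illustration. The group generate and propagate are the same quantities that appear in the c_4 equation: G_group = G_3 + G_2 P_3 + G_1 P_2 P_3 + G_0 P_1 P_2 P_3 and P_group = P_0 P_1 P_2 P_3. The propagate signal is taken as the XOR so that it can be reused in the sum logic.

```python
def bitwise_pg(a, b, width):
    """Step 1: bitwise generate and propagate signals (LSB first)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    return g, p


def group_pg(g, p):
    """Step 2: group generate and propagate of a slice (LSB first)."""
    G, P = 0, 1
    for gi, pi in zip(g, p):
        G = gi | (pi & G)    # unrolls to G3 + G2.P3 + G1.P2.P3 + G0.P1.P2.P3
        P &= pi              # P0.P1.P2.P3
    return G, P


def cla_add_grouped(a, b, cin=0, width=16, group=4):
    """Behavioral model: returns (width-bit sum, carry-out)."""
    if width % group:
        raise ValueError("width must be a multiple of group")
    g, p = bitwise_pg(a, b, width)
    carry_in = [0] * width
    c_group = cin
    for base in range(0, width, group):
        # Carries inside the group (flattened look-ahead logic in hardware;
        # evaluated in a loop here only for brevity).
        c = c_group
        for i in range(base, base + group):
            carry_in[i] = c
            c = g[i] | (p[i] & c)
        # The carry into the next group needs only this group's G, P and carry-in.
        G, P = group_pg(g[base:base + group], p[base:base + group])
        c_group = G | (P & c_group)
    total = 0
    for i in range(width):
        total |= (p[i] ^ carry_in[i]) << i   # Step 3: sum logic
    return total, c_group


print(cla_add_grouped(0xFFFF, 0x0001))  # (0, 1)
```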
Manchester carry chain implementation of carry bypass adder (carry skip adder)
• Consider the four-bit adder shown in the figure above. Suppose the values of Ak and Bk (k = 0...3) are such that all propagate signals Pk (k = 0...3) are high.
• An incoming carry Ci,0 = 1 propagates under those conditions through the complete adder chain and causes an outgoing carry Co,3 = 1. In other words, if (P0 P1 P2 P3 = 1) then Co,3 = Ci,0, else either DELETE or GENERATE occurred.
• This information can be used to speed up the operation of the adder. When BP = P0 P1 P2 P3 = 1, the incoming carry is forwarded immediately to the next block through the bypass transistor Mb, hence the name carry-bypass adder or carry-skip adder.
Fig: Manchester carry chain implementation of carry bypass adder
• The figure shows the possible carry propagation paths when the full-adder circuit is implemented in Manchester carry style. This arrangement speeds up addition.
• The carry either propagates through the bypass path or is generated somewhere in the chain.
• In both cases, the delay is smaller than in the normal ripple configuration.
Fig: 16-bit carry bypass adder
Propagation delay of carry bypass adder:
• The delay of an N-bit carry skip adder built from blocks of M bits is
  tp = tsetup + M tcarry + (N/M - 1) tbypass + (M-1) tcarry + tsum
• tsetup: the fixed overhead time to create the generate and propagate signals.
• tcarry: the propagation delay through a single bit. The worst-case carry-propagation delay through a single stage of M bits is approximately M times larger.
• tbypass: the propagation delay through the bypass multiplexer of a single stage.
• tsum: the time to generate the sum of the final stage.
Fig: Ripple adder vs carry bypass adder
Carry-Select Adder:
• In an RCA, every FA cell has to wait for the incoming carry before an outgoing carry is generated.
• To avoid this wait, the results for both possible values of the carry input are evaluated in advance.
• Once the real value of the incoming carry is known, the correct result is easily selected with a simple multiplexer stage.
• This implementation idea is called the carry-select adder.
Fig: Carry select adder
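The carry-select idea can be illustrated with a small behavioral model. The function name carry_select_add and the default block size of 4 are assumptions made for this sketch. For every block, the sums for carry-in 0 and carry-in 1 are both prepared, and the real carry only drives the selection.

```python
def carry_select_add(a, b, width=16, block=4, cin=0):
    """Behavioral carry-select adder; returns (width-bit sum, carry-out)."""
    if width % block:
        raise ValueError("width must be a multiple of block")
    mask = (1 << block) - 1
    result, carry = 0, cin
    for base in range(0, width, block):
        x = (a >> base) & mask
        y = (b >> base) & mask
        # Both candidate results exist before the real carry is known.
        sum0 = x + y        # result assuming carry-in = 0
        sum1 = x + y + 1    # result assuming carry-in = 1
        chosen = sum1 if carry else sum0    # the multiplexer stage
        result |= (chosen & mask) << base
        carry = chosen >> block             # selected carry-out of this block
    return result, carry


print(carry_select_add(0x0FFF, 0x0001))  # (4096, 0), i.e. 0x1000 with no carry-out
```

In hardware the two candidate sums of all blocks are computed at the same time, so the only serial part left is the chain of multiplexer selections.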
Fig: 16-bit carry select adder
Propagation delay of carry select adder
MULTIPLICATION
• Multiplication needs M cycles using an N-bit adder.
• In shift-and-add:
  - M partial products are added.
  - Each partial product is the AND of a multiplier bit with the multiplicand, followed by a shift.
PARTIAL PRODUCT GENERATION:
• A partial product is the logical AND of the multiplicand X and the multiplier bit Yi.
• Adding zeros has no impact on the result.
• The number of partial products can be reduced by half.
• E.g. 01111110 = 100000(-1)0, where (-1) denotes a digit of value -1 (written as 1 with an overbar).
  - So only two partial products need to be added.
  - Multiplier word Y = Σ_{j=0}^{(N-1)/2} Y_j 4^j with Y_j ∈ {-2, -1, 0, 1, 2}
• This transformation is Booth's recoding.
  - It leads to fewer additions, with reduced area and higher speed.
  - The alternating pattern 10101010 (for eight bits) is the worst case.
  - Multiplying with {-2, -1, 0, 1, 2} instead of {1, 0} requires encoding.
  - Modified Booth's recoding is used for a consistent operation size.
Modified Booth's Recoding
Partial product selection table
Multiplier bits    Recoded bits
000                0
001                +Multiplicand
010                +Multiplicand
011                +2 * Multiplicand
100                -2 * Multiplicand
101                -Multiplicand
110                -Multiplicand
111                0
• Group the bits from msb to lsb in threes, with successive groups overlapping by one bit (a 0 is appended to the right of the lsb).
• Assign the multiplier digit for each group as per the table.
• The number of partial products is halved.
E.g. 01111111 is grouped into -> 01(1), 11(1), 11(1), 11(0)
  -> Multiplier = 10 00 00 0(-1) (see table), where (-1) denotes a digit of value -1
  -> Four partial products are generated instead of eight.
THE ARRAY MULTIPLIER:
An array multiplier is a digital combinational circuit that multiplies two binary numbers by employing an array of full adders and half adders. This array is used for the nearly simultaneous addition of the various product terms involved. To form the various product terms, an array of AND gates is used before the adder array. An array multiplier is a vast improvement in speed over the traditional bit-serial multipliers, in which only one full adder along with a storage memory was used to carry out all the bit additions involved, and also over the row-serial multipliers, in which product rows (also known as the partial products) were sequentially added one by one using only one multi-bit adder. The tradeoff for this extra speed is the extra hardware required to lay down the adder array. But with the much decreased cost of these adders, this extra hardware has become quite affordable to a designer.
In spite of the vast improvement in speed, an array multiplier still incurs some delay before the final product is available. Before committing hardware resources to the circuit, the designer should calculate this delay to make sure that the circuit meets the timing requirements of the user.
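Two ideas from the multiplication discussion can be checked with a short sketch: the modified Booth recoding table and the AND-array partial products of the array multiplier. The names BOOTH_TABLE, booth_digits, booth_value and partial_products are assumptions made for this illustration. The recoding treats the multiplier as an n-bit two's-complement number with n even, which is an assumption of the sketch rather than something stated in the notes.

```python
# Modified Booth table: (y_{j+1}, y_j, y_{j-1}) -> multiple of the multiplicand
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(y, n):
    """Radix-4 Booth digits of an n-bit multiplier, most significant first."""
    if n % 2:
        raise ValueError("n must be even")
    bits = [(y >> i) & 1 for i in range(n)]

    def bit(i):
        return 0 if i < 0 else bits[i]   # a 0 is appended right of the lsb

    # Overlapping groups of three bits, taken from msb to lsb.
    return [BOOTH_TABLE[(bit(j + 1), bit(j), bit(j - 1))]
            for j in range(n - 2, -1, -2)]


def booth_value(digits):
    """Value represented by the recoded digits: sum of Y_j * 4^j."""
    return sum(d * 4 ** k for k, d in enumerate(reversed(digits)))


def partial_products(x, y, n):
    """Rows of the AND array: row i is (x AND y_i), shifted left by i."""
    return [(x if (y >> i) & 1 else 0) << i for i in range(n)]


digits = booth_digits(0b01111111, 8)
print(digits)                             # [2, 0, 0, -1]
print(booth_value(digits))                # 127
print(sum(partial_products(13, 11, 4)))   # 143
```

The digits [2, 0, 0, -1] correspond to the recoded multiplier in the example above: four partial products, two of which are zero.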
Fig: Array multiplier
• N partial products of M bits each.
• N x M two-input AND gates; N-1 M-bit adders.
• The layout need not be staggered; the routing takes care of the shift.
Carry save multiplier:
Fig: Carry save multiplier
• A large number of critical paths of almost identical length are present in the array multiplier.
• Increasing the performance of the structure through transistor sizing therefore yields only marginal benefits.
• A more efficient realization can be obtained by noticing that the multiplication result does not change when the output carry bits are passed diagonally downwards instead of only to the right.
• An extra adder, called a vector-merging adder, is included to generate the final result.
• The resulting multiplier is called a carry-save multiplier because the carry bits are not immediately added but are rather saved for the next adder stage.
• In the final stage, carries and sums are merged in a fast carry-propagate adder stage.
• Its advantage is that its worst-case critical path is shorter.
The delay of the carry save multiplier is given by the expression below:
tmult = (N-1) tcarry + tand + tmerge
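The carry-save principle can be modeled at the word level. The function name carry_save_multiply is an assumption made for this sketch. Each row compresses three vectors (running sum, saved carries, next partial product) into two without letting any carry ripple along the row; a single ordinary addition at the end plays the role of the vector-merging adder.

```python
def carry_save_multiply(x, y, n):
    """Multiply x by an n-bit unsigned y using carry-save accumulation."""
    s, c = 0, 0    # running sum vector and saved-carry vector
    for i in range(n):
        pp = (x if (y >> i) & 1 else 0) << i    # partial product row i
        # Full adder at every bit position: sum stays in place, the carry is
        # saved and handed one position to the left for the next row.
        s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
    return s + c   # vector-merging adder: the only carry-propagate addition


print(carry_save_multiply(13, 11, 4))  # 143
```

After every row, s + c equals the sum of the partial products processed so far, which is why deferring the carries does not change the result.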
Wallace tree multiplier:
A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two integers, devised by Australian computer scientist Chris Wallace in 1964.
The Wallace tree has three steps:
1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other, yielding n^2 results. Depending on the position of the multiplied bits, the wires carry different weights; for example, a wire carrying the product of two bits whose positions add up to 7 has weight 2^7 = 128.
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires into two numbers, and add them with a conventional adder.
The second step works as follows. As long as there are three or more wires with the same weight, add a following layer:
• Take any three wires with the same weight and input them into a full adder. The result will be an output wire of the same weight and an output wire with a higher weight for each three input wires.
• If there are two wires of the same weight left, input them into a half adder.
• If there is just one wire left, connect it to the next layer.
• Delay estimates for this structure only consider gate delays and do not deal with wire delays, which can also be very substantial.
• The Wallace tree can also be represented by a tree of 3/2 or 4/2 adders.
• It is sometimes combined with Booth encoding.
The advantages of Wallace tree multipliers are:
1. Substantial hardware saving
2. Greater speed
The disadvantage is an irregular and inefficient layout.
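The three steps can be followed in a small bit-level sketch. The function name wallace_multiply and the use of a dictionary keyed by bit position are assumptions made for this illustration. The reduction loop applies the layer rule exactly as listed: full adders on triples, a half adder on a leftover pair, and a pass-through for a single leftover wire.

```python
from collections import defaultdict


def wallace_multiply(a, b, n):
    """Multiply two n-bit unsigned integers; returns (product, layers used)."""
    # Step 1: AND every bit pair; cols[k] holds the wires of weight 2^k.
    cols = defaultdict(list)
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Step 2: add layers while some weight still has three or more wires.
    layers = 0
    while any(len(wires) > 2 for wires in cols.values()):
        nxt = defaultdict(list)
        for k in sorted(cols):
            wires = cols[k]
            pos = 0
            while len(wires) - pos >= 3:             # full adder on three wires
                x, y, z = wires[pos:pos + 3]
                nxt[k].append(x ^ y ^ z)
                nxt[k + 1].append((x & y) | (y & z) | (x & z))
                pos += 3
            rest = wires[pos:]
            if len(rest) == 2:                       # half adder on two wires
                nxt[k].append(rest[0] ^ rest[1])
                nxt[k + 1].append(rest[0] & rest[1])
            elif len(rest) == 1:                     # single wire passes through
                nxt[k].append(rest[0])
        cols = nxt
        layers += 1

    # Step 3: group the remaining wires into two numbers and add them.
    first = sum(w[0] << k for k, w in cols.items() if len(w) >= 1)
    second = sum(w[1] << k for k, w in cols.items() if len(w) == 2)
    return first + second, layers


print(wallace_multiply(13, 11, 4)[0])  # 143
```

Every full or half adder preserves the weighted sum of its wires, so the two numbers left after the reduction always add up to the product.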
The characteristics of the Wallace tree multiplier include:
• The choice of final adder is critical; it depends on the structure of the accumulator array.
• Carry look-ahead might be good if the data arrives simultaneously.
• Place a pipeline stage before the final addition.
• In non-pipelined designs, other adders can be used with similar performance and less hardware.
DIVIDER:
Unsigned non-restoring division:
Input: An n-bit dividend and an m-bit divisor
Output: The quotient and remainder
Begin
1. Load the divisor and dividend into registers M and D, respectively; clear the partial-remainder register R and set the loop count cnt equal to n-1.
2. Left shift register pair R:D one bit.
3. Compute R = R - M;
4. Repeat
     if (R < 0) begin D[0] = 0; left shift R:D one bit; R = R + M; end
     else begin D[0] = 1; left shift R:D one bit; R = R - M; end
     cnt = cnt - 1;
   until (cnt == 0)
5. if (R < 0) begin D[0] = 0; R = R + M; end
   else begin D[0] = 1; end
End
On completion, D holds the quotient and R holds the remainder.
Fig: Sequential implementation of non-restoring divider
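The algorithm can be traced with a direct Python transcription. The function name nonrestoring_divide is an assumption made for this sketch; the registers M, D and R follow the pseudocode, and R is kept as a signed Python integer so that "R < 0" can be tested directly instead of inspecting a sign bit.

```python
def nonrestoring_divide(dividend, divisor, n):
    """Unsigned non-restoring division of an n-bit dividend.

    Returns (quotient, remainder).
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if not 0 <= dividend < (1 << n):
        raise ValueError("dividend must fit in n bits")
    mask = (1 << n) - 1
    M, D, R = divisor, dividend, 0            # step 1

    def shift_left(R, D):
        """Left shift the register pair R:D by one bit (D[0] becomes 0)."""
        msb = (D >> (n - 1)) & 1
        return 2 * R + msb, (D << 1) & mask

    R, D = shift_left(R, D)                   # step 2
    R -= M                                    # step 3
    for _ in range(n - 1):                    # step 4, cnt = n-1 down to 1
        if R < 0:
            # D[0] = 0 is already in place after the previous shift.
            R, D = shift_left(R, D)
            R += M
        else:
            D |= 1                            # D[0] = 1
            R, D = shift_left(R, D)
            R -= M
    if R < 0:                                 # step 5: final correction
        R += M
    else:
        D |= 1
    return D, R


print(nonrestoring_divide(13, 3, 4))  # (4, 1)
```

Unlike restoring division, a negative partial remainder is not corrected immediately; it is compensated by adding M in the next iteration, and only the final remainder may need one restoring addition.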
BARREL SHIFTER:
Any general-purpose n-bit shifter should be able to shift incoming data by up to n-1 places in a right-shift or left-shift direction. If we further specify that all shifts are on an end-around basis, so that any bit shifted out at one end of a data word is shifted in at the other end of the word, then the problem of left shift or right shift is greatly eased. For a 4-bit word, a 1-bit right shift is equal to a 3-bit left shift, a 2-bit right shift is equal to a 2-bit left shift, and so on. Thus we can achieve a capability to shift left or right by zero, one, two or three places by designing a circuit which shifts right only by one, two or three places.
The barrel shifter is an adaptation of the crossbar switch which recognizes the fact that we can couple the switch gates together in groups of four and also form four separate groups corresponding to shifts of zero, one, two and three bits. The arrangement is readily adapted so that the in-lines also run horizontally. The resulting arrangement is known as a barrel shifter. The inter-bus switches have their gate inputs connected in a staircase fashion in groups of four, and there are now four shift control inputs, which must be mutually exclusive in the active state. The structure of the barrel shifter is of high regularity and generality.
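The end-around behavior and the mutually exclusive shift controls can be illustrated with a small model. The function names barrel_rotate_right and rotate_left, and the representation of the control inputs as a one-hot list, are assumptions made for this sketch. Each control line corresponds to one group of switches that connects input bit (i + k) mod n to output bit i.

```python
def barrel_rotate_right(word, select, n=4):
    """Rotate an n-bit word right; select is one-hot, select[k] = shift by k."""
    if len(select) != n or sum(select) != 1:
        raise ValueError("exactly one of the n shift controls must be active")
    bits = [(word >> i) & 1 for i in range(n)]
    out = 0
    for k, active in enumerate(select):       # one group of switches per shift
        if active:
            for i in range(n):
                out |= bits[(i + k) % n] << i  # bit shifted out re-enters at the top
    return out


def rotate_left(word, k, n=4):
    """A left shift by k is a right shift by n - k on the same hardware."""
    select = [0] * n
    select[(n - k) % n] = 1
    return barrel_rotate_right(word, select, n)


print(barrel_rotate_right(0b0001, [0, 1, 0, 0]))  # 8, i.e. 1000
print(rotate_left(0b0001, 3))                     # 8, same result
```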