System design using HDL - Module 5

ABBREVIATIONS
CPLD Complex Programmable Logic Device
FPGA Field Programmable Gate Array
GAL Generic Array Logic
HDL Hardware Description Language
IEEE Institute of Electrical and Electronics Engineers
IP Intellectual Property
ILA Integrated Logic Analyzer
ISE Integrated Software Environment
ISP In-System Programming
JKFF Jack-Kilby Flip Flop
JTAG Joint Test Action Group
LEC Logic Equivalence Checker
LMG Logic Modeling Group
LUT Look-Up Table
NGC Native Generic Compiler
OTP One-Time Programmable
PACE Pin-out and Area Constraints Editor
PAL Programmable Array Logic
PCI Peripheral Component Interconnect
PLA Programmable Logic Array
TBW Test-Bench Waveform
UCF User Constraints File
VHDL VHSIC Hardware Description Language
VHSIC Very High Speed Integrated Circuit
XST Xilinx Synthesis Technology

SYSTEM DESIGN USING HDL (ECE43)
Textbook: Digital System Design Using Verilog, Charles Roth, Lizy Kurian John, Byeong Kil Lee, 1st Edition, 2016, Cengage Learning
Module #: Textbook sections
1: 2.1, 2.2, 2.3 - 2.8, 2.11, 2.13 - 2.15
2: 2.9, 2.10, 2.12, 2.16 - 2.19, 8.1, 8.2
3: 3.1 - 3.4, 5.1, 5.2.1, 5.3
4: 4.1 - 4.5, 4.8, 4.6, 4.7, 4.9, 4.11
5: 6.1 - 6.5, 6.7 - 6.12
DESIGNING WITH FPGA

07/03/2019, Aravinda K., Dept. of E&C, NHCE, Bengaluru
Example-1: Design of a 4:1 multiplexer using FPGA
Configurable Logic Block in FPGA
Each CLB in the FPGA contains
two 4-variable function generators.
It also contains two flip-flops which
can be used for latching the function.

As a 4:1 mux has 6 inputs (4 data and 2 select) and each function generator
accepts only 4 variables, it cannot be implemented using 1 CLB in the given
FPGA. Therefore, the 4:1 mux is decomposed into 2:1 mux blocks, each of
which is a 3-variable function. The flip-flops are of no use here.
M = S1'S0'I0 + S1'S0I1 + S1S0'I2 + S1S0I3
M1 = S0'I0 + S0I1
M2 = S0'I2 + S0I3
M = S1'M1 + S1M2
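
The decomposition above can be sketched in Verilog as three 2:1 mux instances, one per function generator. This is only an illustrative sketch: the module and port names (mux2, mux4_decomposed, S, I, M) are assumptions and do not come from the slides.

```verilog
// 2:1 mux: a 3-variable function, small enough for one 4-input function generator.
module mux2 (input wire s, input wire i0, input wire i1, output wire m);
  assign m = (~s & i0) | (s & i1);
endmodule

// 4:1 mux built from three 2:1 muxes, following
// M1 = S0'I0 + S0I1, M2 = S0'I2 + S0I3, M = S1'M1 + S1M2.
module mux4_decomposed (input wire [1:0] S, input wire [3:0] I, output wire M);
  wire M1, M2;
  mux2 u_m1 (.s(S[0]), .i0(I[0]), .i1(I[1]), .m(M1));
  mux2 u_m2 (.s(S[0]), .i0(I[2]), .i1(I[3]), .m(M2));
  mux2 u_m  (.s(S[1]), .i0(M1),   .i1(M2),   .m(M));
endmodule
```

With this structure, M1 and M2 occupy the two function generators of one CLB, and the final mux needs a function generator from a second CLB.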

Instead of logic equations, modern FPGAs use the LUT as the
basic building block.
For this particular design of the 4:1 mux, the contents of
the LUT4s are as shown:
Each LUT4 can implement one 1-bit function of 4 input variables.
Hence, each LUT4 requires 16 SRAM cells, one for every input
combination ("don't care" terms still have to be stored as logic
states in the LUT).
Three LUT4s require 48 SRAM cells. Therefore, this is an expensive
implementation.
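
To make the "16 SRAM cells per LUT4" idea concrete, a LUT4 can be modelled as a 16-bit table indexed by its four inputs. The sketch below is a behavioural model only, not a vendor primitive; the names lut4, INIT, sram and mux2_in_lut4, and the input ordering, are assumptions made for illustration.

```verilog
// Behavioural model of a 4-input LUT: 16 stored bits, one per input combination.
module lut4 #(parameter [15:0] INIT = 16'h0000)
             (input wire [3:0] addr, output wire out);
  wire [15:0] sram = INIT;    // the 16 configuration (SRAM) cells
  assign out = sram[addr];    // the inputs simply select one stored bit
endmodule

// A 2:1 mux stored in a LUT4, with addr = {unused, s, i1, i0}.
// For s = 0 the stored bits follow i0, for s = 1 they follow i1, giving 8'hCA.
// The unused ("don't care") input still costs storage, so the pattern is repeated.
module mux2_in_lut4 (input wire s, input wire i0, input wire i1, output wire m);
  lut4 #(.INIT(16'hCACA)) u_lut (.addr({1'b0, s, i1, i0}), .out(m));
endmodule
```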

If the CLB in the FPGA has a
provision to combine the outputs
of the function generators, then
the 4:1 mux can be implemented
using a single CLB. (e.g.: XC4000)
This method requires
2 LUT4s and 1 LUT3.
Hence, the number of
SRAM cells required
is, 16 + 16 + 8 = 40.

Example-2: Circular Shift Register (Ring Counter)
Even though the function generators (FGs) are not needed for any logic here, they still have to be used.

Q1: How many CLBs are required to implement a 3-to-8 decoder?
A1: If the decoder is implemented using logic gates, 3 NOT gates and
8 AND gates are required. When CLBs are used, as there are 8 outputs,
one for each input combination, 8 FGs are required. As each CLB
contains 2 sets of (FG4 + FF), the number of CLBs required is 4.
• If a LUT-based FPGA is used, each output requires 8 SRAM cells and one
8:1 mux. Thus, 8 LUTs are required. But in the CLB, as each FG4 contains
16 SRAM cells (including the "don't care" entries), the 8 FG4s require
16 x 8 = 128 SRAM cells.

Implementing functions using Shannon's decomposition
• Shannon's expansion theorem helps to decompose a function of a larger
number of variables into functions of fewer variables.
• In the example shown, instead of a single 6-variable FG, two 5-variable
FGs along with half of a third FG are used to realize the function.
• Thus, Shannon's expansion theorem helps in the reduction of hardware.
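
For reference, the standard statement of Shannon's expansion about one variable is given below. The symbols $f$ and $x_1, \dots, x_n$ are generic names used here for illustration, with a prime denoting the complement as elsewhere in these slides.

```latex
f(x_1, x_2, \dots, x_n) = x_1' \, f(0, x_2, \dots, x_n) + x_1 \, f(1, x_2, \dots, x_n)
```

Each of the two sub-functions depends on only $n-1$ variables, and the final combination is a 2:1 mux with $x_1$ as the select input.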

Example-3: Consider Z = abcd'ef' + a'b'c'def' + b'cde'f
Setting a = 0 gives Z0 = b'c'def' + b'cde'f
Setting a = 1 gives Z1 = bcd'ef' + b'cde'f
so that Z = a'Z0 + aZ1.
• Therefore, two LUT5s, along with either a 2:1 mux or another LUT5,
can be used to implement the function.
• The number of terms in Z0 or Z1 does not matter, as each of them is
implemented by a LUT.
• If only LUT4s are available in the CLB, then the function needs
to be decomposed further, by using "a" and "b" together.
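
A minimal Verilog sketch of this example is shown below. It builds Z from the two 5-variable sub-functions and a 2:1 mux, and also computes the original expression so that the two can be compared in simulation. The module name shannon_ex3 and the extra output Z_direct are assumptions added for illustration.

```verilog
module shannon_ex3 (input wire a, b, c, d, e, f,
                    output wire Z, output wire Z_direct);
  // Sub-functions of the five variables b..f (one LUT5 each).
  wire Z0 = (~b & ~c &  d & e & ~f) | (~b & c & d & ~e & f);  // a = 0
  wire Z1 = ( b &  c & ~d & e & ~f) | (~b & c & d & ~e & f);  // a = 1

  // 2:1 mux with "a" as the select input: Z = a'Z0 + aZ1.
  assign Z = a ? Z1 : Z0;

  // Original 6-variable expression, kept only for comparison.
  assign Z_direct = ( a &  b &  c & ~d & e & ~f)
                  | (~a & ~b & ~c &  d & e & ~f)
                  | (~b &  c &  d & ~e & f);
endmodule
```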

This method requires seven LUT4s, in general: four for the sub-functions, and three for the 4:1 mux that selects among them.

Example-4: Consider Z = abcd'ef' + a'b'c'def' + b'cde'f
Substituting the four combinations of a and b:
Y0 (a = 0, b = 0) = c'def' + cde'f
Y1 (a = 0, b = 1) = 0
Y2 (a = 1, b = 0) = cde'f
Y3 (a = 1, b = 1) = cd'ef'
As there is a null function, this requires only 5 LUT4s.

Q2: What is the maximum number of LUT5s required for realizing a 7-variable function?
A2:
Four LUT5s are required for implementing Y0, Y1, Y2 and Y3. Another
LUT5 is required to combine the first three terms (its inputs are Y0, Y1, Y2
and the two selecting variables). The last LUT5 is required to combine the
last term with the previous output. Thus, the total number of LUT5s required
is 6. However, in the last LUT5, one input remains unused, and it has to be
treated as a "don't care".
Example-5: Implement a 7-variable function using 4-input LUTs and 2:1 multiplexers.
7-variable function = two LUT6 + one 2:1 mux.
6-variable function = two LUT5 + one 2:1 mux.
5-variable function = two LUT4 + one 2:1 mux.
Substituting accordingly, we obtain: 7-variable function = eight LUT4 + seven 2:1 muxes.
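
The substitution above can be written as a pair of recurrences. Here $L(n)$ and $M(n)$ are names introduced only for this note: the number of LUT4s and of 2:1 muxes needed for an $n$-variable function ($n \ge 4$) when every expansion step uses one mux.

```latex
\begin{aligned}
L(n) &= 2\,L(n-1), & L(4) &= 1 &&\Rightarrow\; L(n) = 2^{\,n-4} \\
M(n) &= 2\,M(n-1) + 1, & M(4) &= 0 &&\Rightarrow\; M(n) = 2^{\,n-4} - 1 \\
L(7) &= 8, & M(7) &= 7
\end{aligned}
```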

• If muxes are unavailable in the CLB, then more LUTs are needed. Xilinx
Spartan FPGAs provide muxes in addition to LUT4s. A logic unit in
these FPGAs is called a "slice".

REALIZATION OF 7-VARIABLE FUNCTION USING 4 SLICES
Example-6: Implement the parity function A⊕B⊕C⊕D⊕E using 4-variable
function generators.
• For direct implementation, this 5-variable function requires only
one LUT5.
• Using Shannon's expansion, this function can be decomposed into two
4-variable functions, and can be realized using two LUT4s and one
2:1 multiplexer.
• If a multiplexer is not present in the CLB, then this decomposition
requires three LUT4s.
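
A small sketch of the decomposed parity function is given below, expanding about E. The choice of E as the expansion variable and the names parity5_shannon, P0 and P1 are assumptions made for illustration.

```verilog
module parity5_shannon (input wire A, B, C, D, E, output wire P);
  wire P0 = A ^ B ^ C ^ D;      // sub-function for E = 0 (one LUT4)
  wire P1 = ~(A ^ B ^ C ^ D);   // sub-function for E = 1 (one LUT4)
  assign P = E ? P1 : P0;       // 2:1 mux, or a third LUT4 if no mux is available
endmodule
```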

CARRY CHAINS IN FPGA
• Addition is a very common operation in digital circuits.
• As the LUT4 is a standard building block in FPGAs, each bit of an adder
requires two LUT4s: one for the sum bit and one for the carry bit.
• Thus, an n-bit adder requires 2n LUT4s.
• But if the FPGA provides dedicated circuitry for generating the carry
bit and propagating it to the next stage, then only n LUT4s are required,
for the sum bits.
• The dedicated carry chain generates the carry bit in parallel.

CASCADE CHAINS IN FPGA
• For functions with a large number of variables, FPGAs provide
cascade AND and cascade OR chains (for PoS and SoP terms).
• Thus, instead of using separate FGs to perform the AND or OR
functions, the cascade circuitry can be used to create such functions.
• Hence, for a 32-variable SoP function, only 8 LUT4s are required.
Without the cascade chain, 11 LUT4s are required (8 + 2 + 1).
• FPGAs such as Altera Stratix IV provide register chains as well.

Examples of Logic blocks in commercial FPGAs
1. Xilinx Kintex CLB
• Kintex uses LUT6s in each slice. A CLB contains 4 copies of the slice.
• Xilinx Virtex and Spartan FPGAs use LUT4s. Each slice contains
two FGs, two muxes, two flip-flops, and additional logic.

2. Altera Stratix IV Logic Module
• Each Stratix IV LM contains two LUT6s and two flip-flops. Each
LUT6 has two independent inputs and four shared inputs. In
addition, two 1-bit adders are built in, with carry chaining.
• Flip-flops with register chaining allow shift registers to be created.

3. Microsemi Fusion VersaTile
• The Fusion VersaTile consists of muxes and logic gates. Each block
has 4 inputs: X1, X2, X3 and XC. The VersaTile block is of
significantly finer grain than the LUT4 present in other FPGAs.
• Each VersaTile can be configured as: a 3-input logic function, or
a latch (with clear or set), or a D flip-flop (with enable, clear or set).

DEDICATED MULTIPLIERS IN FPGA
• Multiplication is also a common operation, and implementing it
requires several programmable logic blocks. In addition, such a
multiplier will be slower, because of the interconnecting switches.
• Hence, some Xilinx and Altera FPGAs contain dedicated 18 x 18
multipliers. When multiplication of larger numbers is required,
several of the built-in multipliers can be put together.
• e.g., if A and B are 32 bits wide, then they can be represented as:
A = (C x 2^16) + D, B = (E x 2^16) + F, and
A x B = (CE x 2^32) + ((DE + CF) x 2^16) + DF.
Thus, 4 multipliers are required to generate the partial products CE,
DE, CF and DF, which are later added by means of several adders.
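
The partial-product scheme can be sketched in Verilog as follows. Each 16 x 16 product would fit one dedicated 18 x 18 multiplier, although whether a synthesis tool actually maps it there depends on the tool and the device. The module name mul32_from_16 and the signal names are assumptions made for illustration.

```verilog
module mul32_from_16 (input wire [31:0] A, input wire [31:0] B,
                      output wire [63:0] P);
  // Split each operand into 16-bit halves: A = C*2^16 + D, B = E*2^16 + F.
  wire [15:0] C = A[31:16];
  wire [15:0] D = A[15:0];
  wire [15:0] E = B[31:16];
  wire [15:0] F = B[15:0];

  // Four 16 x 16 partial products, 32 bits each.
  wire [31:0] CE = C * E;
  wire [31:0] DE = D * E;
  wire [31:0] CF = C * F;
  wire [31:0] DF = D * F;

  // A*B = CE*2^32 + (DE + CF)*2^16 + DF, with every term widened to 64 bits
  // so that no carry is lost in the additions.
  assign P = ({32'd0, CE} << 32)
           + (({32'd0, DE} + {32'd0, CF}) << 16)
           + {32'd0, DF};
endmodule
```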

Cost of programmability
• The logic block shown requires a total of 46 SRAM cells (276 transistors,
i.e. 6 transistors per cell) for its configuration. Additional configuration
bits are required for the programmable interconnect and for the programmable
I/O. Thus, the flexibility of programmable points comes with a much
higher additional cost of associated memory cells (SRAM/Flash).
• e.g.: Xilinx Virtex-II XC2V40 (with 512 LUT4s and 88 I/O pins) needs
338,976 configuration bits. Virtex-II XC2V8000 (with 93,184 LUT4s and
1108 I/O pins) needs more than 26 million configuration bits.

FPGAs and One-Hot state assignment
• While implementing a state machine, in general, state encoding is
performed with n bits for 2^n states. e.g.: for a machine with 4 states,
2-bit encoding is used. An increase in n requires more logic blocks.
• For faster implementation of the design, it is desirable to reduce
the number of logic blocks and interconnections. Hence, instead of the
encoding method, the one-hot method can be used, which reduces the
number of logic blocks.
• This method, in turn, increases the number of flip-flops; but this
does not affect the implementation much, as each FPGA logic block
contains two flip-flops.

• For the state graph shown, the binary encoding can be 00, 01, 10 and 11.
• With the one-hot method, the state encoding will be 1000, 0100, 0010
and 0001. Each state uses one flip-flop.
• The next-state equation for the flip-flop Q3 can be written as:
Q3+ = X1 Q0 Q1' Q2' Q3' + X2 Q0' Q1 Q2' Q3' + X3 Q0' Q1' Q2 Q3' + X4 Q0' Q1' Q2' Q3.
• In the one-hot method, this equation reduces to:
Q3+ = X1 Q0 + X2 Q1 + X3 Q2 + X4 Q3.
Here, each term in the equation contains exactly one state variable.
The output equations are:
Z1 = X1 Q0 + X3 Q2,
Z2 = X2 Q1 + X4 Q3.
As each term contains only one state variable, this leads to fewer logic cells.
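
The reduced one-hot equations translate directly into Verilog. The sketch below covers only the Q3 next-state logic and the two outputs, since the rest of the state graph is not reproduced in the text. The module name onehot_logic and the packing of the state bits into a vector Q[3:0] (Q[0] = Q0, and so on) are assumptions.

```verilog
module onehot_logic (input wire X1, X2, X3, X4,
                     input wire [3:0] Q,       // one-hot state: exactly one bit is 1
                     output wire Q3_next,
                     output wire Z1, output wire Z2);
  // Q3+ = X1Q0 + X2Q1 + X3Q2 + X4Q3: one state variable per term.
  assign Q3_next = (X1 & Q[0]) | (X2 & Q[1]) | (X3 & Q[2]) | (X4 & Q[3]);

  // Output equations.
  assign Z1 = (X1 & Q[0]) | (X3 & Q[2]);
  assign Z2 = (X2 & Q[1]) | (X4 & Q[3]);
endmodule
```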

• In electronic designs, a "cell" is defined as a predesigned and
precharacterized circuit element.
• Thus, a cell contains pretested and prestored instances of a circuit
diagram, its circuit symbol, and its physical description (layout).

FPGA CAPACITY (Maximum gates versus usable gates)
• An ASIC contains exactly the number of gates required for the design.
An FPGA, however, contains arrays of gates, or arrays of LUTs. Thus, if a
larger design needs to be implemented in an FPGA, the designer needs to
know whether the design will fit into a given FPGA.
• For the designer, the number of gates inside the FPGA is not a useful
metric, as the FPGA is programmable. Hence, a term called "equivalent gate
count" is defined, as a count of the circuitry that can fit into a
particular FPGA. This type of gate count is extremely difficult to compute,
as it depends on the type of circuitry, the type of interconnections, and
the routing resources available in the FPGA.

• One method for computing the equivalent gate count for a CLB is as
follows: 2:1 mux = 4 gates, 3-input XOR gate = 6 gates, 4-input XOR gate =
9 gates, flip-flop = 7 gates, and so on. Thus, the equivalent gate count
for a CLB can be obtained. The total gate count can be estimated by
multiplying the equivalent gate count by the number of CLBs in the FPGA.
In general, this type of gate count is likely to be higher than the gate
count of the practical circuitry that is being realized.
• Another method is to use benchmark circuits (e.g.: the benchmark suite
prepared by PREP [Programmable Electronics Performance company]). For
example, if an ASIC contains 2000 gates, and if an FPGA can fit 20 copies
of the ASIC, with no routing between the copies, then the maximum gate
count of the FPGA can be considered as 20 x 2000 = 40,000.

DESIGN TRANSLATION (SYNTHESIS)
• Synthesis is the process of translating an abstract high-level design
into a detailed circuit description.
• The synthesis tool implements the digital system as an interconnection
of gates, flip-flops, registers, counters, muxes, adders, and other basic
building blocks.
• The representation of the design as a logic schematic, together with an
associated wirelist, is called a netlist.
[Figure: one code fragment results in an AND gate; another results in an AND gate followed by a flip-flop.]
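
The code fragments behind these two results are not reproduced in the text. As a hedged illustration only, fragments of the following kind would typically synthesize to such hardware. The module and signal names (and_comb, and_registered, A, B, C, clk) are assumptions.

```verilog
// Continuous assignment: typically synthesizes to a single AND gate.
module and_comb (input wire A, input wire B, output wire C);
  assign C = A & B;
endmodule

// Assignment inside a clocked 'always' block: typically synthesizes to
// an AND gate followed by a flip-flop that stores the result.
module and_registered (input wire clk, input wire A, input wire B, output reg C);
  always @(posedge clk)
    C <= A & B;
endmodule
```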

• The synthesis tool performs a line-by-line translation of the HDL into
hardware.
• The synthesis tool selects components that are available in the library.
• In general, a ‘case’ statement results in muxes, a comparison results in
adders or comparators, a shift results in registers, and so on.
• For implementation with different technologies, different component
libraries can be provided.
• The resulting hardware is optimized later on.

Synthesis of a ‘case’ statement
module case_eg (a,b);
  input [1:0] a;
  output reg [1:0] b;
  always @(a)
  begin
    case (a)
      0: b<=1;
      1: b<=3;
      2: b<=0;
      3: b<=1;
    endcase
  end
endmodule
[Figures: synthesized circuit before optimization; logic optimization; synthesized circuit after optimization]
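
The optimized circuit itself is not reproduced in the text. Working from the truth table of case_eg (a = 0, 1, 2, 3 gives b = 01, 11, 00, 01), one possible minimized form is sketched below. The module name case_eg_opt is an assumption, and an actual tool may arrive at a different but equivalent circuit.

```verilog
module case_eg_opt (input wire [1:0] a, output wire [1:0] b);
  assign b[1] = ~a[1] & a[0];   // 1 only for a = 1
  assign b[0] = ~a[1] | a[0];   // 1 for a = 0, 1 and 3
endmodule
```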

Unintentional latch creation
module latch_eg (a,b);
  input [1:0] a;
  output reg b;
  always @(a)
  begin
    case (a)
      0: b<=1;
      1: b<=0;
      2: b<=1;
      // a = 3 is not covered, so b has to hold its old value: a latch is inferred
    endcase
  end
endmodule
[Figures: initial output of naïve synthesizer; optimized output of naïve synthesizer]

[Figures: output of optimizing synthesizer; output of naïve synthesizer]
Solution to eliminate latch
module latch_eg (a,b);
  input [1:0] a;
  output reg b;
  always @(a)
  begin
    case (a)
      0: b<=1;
      1: b<=0;
      2: b<=1;
      3: b<=0;  // every value of a now assigns b, so no latch is needed
    endcase
  end
endmodule

Synthesis of ‘if’ statements
Ambiguous code, that results in latch:
if (A == 1'b1)
begin
  nextstate <= 3;
  Z <= 1;
end
Unambiguous code:
if (A == 1'b1)
begin
  nextstate <= 3;
  Z <= 1;
end
else
begin
  nextstate <= 2;
  Z <= 0;
end
module if_eg (A,B,C,D,E,Z);
  input A,B;
  input [2:0] C,D,E;
  output reg [2:0] Z;
  // C, D and E are read in the block, so they are listed as well;
  // otherwise simulation may not match the synthesized muxes
  always @(A or B or C or D or E)
  begin
    if (A == 1'b1)
      Z <= C;
    else if (B == 1'b0)
      Z <= D;
    else
      Z <= E;
  end
endmodule
[Figure: synthesized output]
Synthesis of arithmetic components
module ar_eg (clk,A,B,ge,acc,count);
  input clk;
  input [3:0] A,B;
  inout [3:0] acc,count;
  output ge;
  reg [3:0] acc_t, count_t;
  assign acc = acc_t;
  assign count = count_t;
  assign ge = (A >= B);
  always @(posedge clk)
  begin
    acc_t <= acc + B;
    count_t <= count + 1;
  end
endmodule
[Figure: synthesized output]

Example-7: What hardware results from the statement,
assign LE = (A <= B);
where A and B are 4-bit vectors?
• The symbol "<=" is a relational operator here (less than or equal to),
not an assignment.
• The following statement inside a combinational ‘always’ block,
LE <= (A <= B);
results in the same hardware.
Example-8: What is the optimized hardware for,
assign EQ3 = (A == 3);
where A is a 4-bit vector?
• A naïve synthesizer may produce a 4-bit comparator, with ‘A’ and ‘3’
as inputs.
• For optimization, the statement can be altered as:
assign EQ3 = ~A[3]&~A[2]&A[1]&A[0];

Ideal requirements (Practical tradeoffs)
Area of the silicon chip: Minimum
Power consumed: Minimum
Speed of operation: Maximum
Size of the product: Optimum
Weight of the product: Optimum
Memory capacity: Maximum
Cost of the product: Minimum
Delay of operation: Minimum
Area, power and delay optimizations
• Area & delay of a circuit are inversely related (e.g.: serial v/s parallel).
• Energy & delay of a circuit are also inversely related (more switching
implies increased dynamic power).
• Thus, the Area-Time (AT) product and the Energy-Delay (ED) product are
the metrics used to qualify the circuit.
• The path with the longest delay in the circuit is called the "critical path".

MAPPING, PLACEMENT AND ROUTING
• These are the 3 major steps that transform the design from the netlist
form to the appropriate target technology (MPGA, CPLD, FPGA, ASIC).
• Mapping is the process of translating the design into the building
blocks available in the target technology [e.g.: LUT with mux (Xilinx),
mux with gates (Microsemi)].
• In other words, it is the process of binding the technology-dependent
circuits of the target technology to the technology-independent circuits
that are in the design.
• In the case of an FPGA, the design has to be mapped into muxes, LUTs etc.
In the case of an ASIC, the design has to be mapped into the standard cells
that are available in the library (e.g.: logic gates, muxes, decoders,
encoders, comparators, counters etc.)

• Placement is the process of taking the defined logic & I/O blocks from
the technology mapper, and assigning them to physical locations on the
target implementation.
• Routing is the process of interconnecting those blocks and sub-blocks
on the target implementation.
• "Place & route" are often done along with each other. Two of the popular
algorithms used for this purpose are ‘Simulated annealing’ and
‘Iterative improvement’.

• In metallurgy, annealing is the process used to toughen a metal, by
heating it and then cooling it slowly, in a series of steps. The
temperature is kept high in the beginning, and it is reduced gradually
in the subsequent steps.
• In a similar fashion, for placing & routing, the simulated annealing
algorithm takes bigger risks in the beginning, by making random
modifications to a feasible solution, and gradually arrives at an optimal
solution. In the beginning, corresponding to the high temperature, risky
moves are performed. In the subsequent steps, as the temperature is
reduced, the probability of accepting bad moves decreases.
• In contrast, the iterative improvement algorithm accepts only better
solutions in each step. Such algorithms are called ‘greedy’. At the end of
simulated annealing, the algorithm has to be greedy, so as to accept only
positive moves.
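
The slides do not state an acceptance rule. A commonly used one, given here only as an illustration, is the Metropolis criterion. The symbols $\Delta C$ (the change in placement cost caused by a move) and $T$ (the current temperature) are assumptions introduced for this note.

```latex
P(\text{accept move}) =
\begin{cases}
1, & \Delta C \le 0 \\
e^{-\Delta C / T}, & \Delta C > 0
\end{cases}
```

At high $T$, the exponential is close to 1 and most bad moves are accepted; as $T \to 0$, it tends to 0 and the algorithm behaves like the greedy iterative improvement method.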

ASIC DESIGN FLOW
