3. Prerequisites to FPGA Learning
• Before learning FPGAs, you should know:
• Basic Boolean operations (AND, OR, NOT, XOR)
• Number representations and binary math
• Digital circuits
• Programming in C or assembly
• Some microcontroller development experience
• To start with, you need:
• A hardware description language (HDL) such as VHDL or Verilog
• Coding experience in a programming language such as C, for turning ideas into syntax
Dept. of Information Engineering and Mathematics. University of Siena. Nov - 2015 3
4. Birth of FPGA(1975-1985)
1975: PLA (programmable logic array), made up of a programmable AND-gate plane and a programmable OR-gate plane, connected to produce a desired output in SOP (sum of products) or POS (product of sums) form.
1978: PAL (programmable array logic), similar to the PLA; it has one PROM array, with a programmable AND plane and a fixed OR plane.
1983: EEPROM (electrically erasable programmable read-only memory).
1983: GAL (generic array logic), which is fully erasable and reprogrammable, unlike the PAL.
1984: FLASH (a type of EEPROM), a non-volatile memory that can be erased in blocks.
5. Birth of FPGA contd…
1985: The first FPGA, the XC2064, had 64 configurable logic blocks (CLBs), each with two three-input lookup tables (LUTs).
It offered 800 gates, sold for $55, and was produced on a 2.0 µm process.
What is a Field Programmable Gate Array (FPGA)?
“FPGAs are programmable semiconductor devices that are based around a matrix of Configurable Logic Blocks (CLBs) connected through programmable interconnects. FPGAs can be programmed to the desired application or functionality requirements” (Xilinx).
Types of FPGA:
1. One-Time Programmable (OTP) FPGAs
2. SRAM-based FPGAs (can be reprogrammed as the design evolves)
Vendors: Altera, Xilinx
6. Field-Programmable Gate Array
CLB: CLBs contain clusters of LUTs, registers, arithmetic and other circuitry.
LUT: A look-up table (LUT) is a hardware implementation of a truth table.
An FPGA is a special kind of chip that is configurable by the end user.
It has programmable logic and can implement any digital circuit that fits within its resources.
7. But.. why Use FPGA?
Applications need tailored hardware.
No need to over-provision (like a custom ASIC).
Don't worry about mistakes: the chip is “reconfigurable”.
Chip development is faster.
FPGAs provide significantly more hardware-acceleration performance per watt:
“FPGA-based accelerators can achieve up to 25x better performance per watt and 50-75x latency improvement compared to CPU/GPU implementations while also providing excellent I/O integration (PCI, DDR4 SDRAM interfaces, high-speed Ethernet, etc.)” [1]
[1] Image source: Xilinx SDAccel Developer Zone. http://www.xilinx.com/products/design-tools/software-zone/sdaccel.html
8. Where are FPGAs Used Today?
Networking, Computer & Storage
Telecom and Wireless
Automotive, Aerospace, Industrial Automation, Military etc…
Did you know?
Microsoft Bing Search is built on FPGA-based accelerators.
9. But.. what about Weaknesses
For a specific circuit, relative to a custom ASIC, FPGAs use more area and power and are slower.
FPGA resources are of a fixed size and have limited flexibility options.
But you may not have the option for “reconfiguration”.
Metric          FPGA vs ASIC [1]    FPGA vs ASIC [2]
Area            30-40X              2-20X
Delay           3-4X                1.7-3X
Dynamic power   12X                 -
Static power    5-90X               2-5X
[1] Compares Altera Stratix II to STMicroelectronics standard cells (90 nm technology) [Kuon et al., TCAD '07].
[2] Altera Corp 2006.
10. Resource usage of a 2-input function and a 3-input function is the same.
A 6-input function uses one 6-LUT; a 7-input function uses two 6-LUTs (double the resource usage).
6x6-bit multiply has same DSP usage as 8x8-bit multiply
Similar arguments for memories
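The jump from one 6-LUT to two 6-LUTs at seven inputs can be sketched with a Shannon decomposition: split the function on one input, store each half in a 6-LUT, and select between them with a 2:1 multiplexer. The sketch below is a behavioural model only; the names `make_lut`, `read_lut` and the 7-input parity function are assumptions chosen for illustration, not part of any vendor tool.

```python
from itertools import product

def make_lut(func, n_inputs):
    """Build a LUT as a tuple of output bits, one entry per input combination."""
    return tuple(func(*bits) for bits in product((0, 1), repeat=n_inputs))

def read_lut(lut, bits):
    """Look up the output for a tuple of input bits (first bit is the most significant)."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return lut[index]

def parity7(*bits):
    """Hypothetical 7-input example function: 1 when an odd number of inputs are 1."""
    return sum(bits) % 2

# Shannon decomposition on the first input x0:
#   f(x0, x1..x6) = f(1, x1..x6) if x0 else f(0, x1..x6)
lut_x0_is_0 = make_lut(lambda *rest: parity7(0, *rest), 6)
lut_x0_is_1 = make_lut(lambda *rest: parity7(1, *rest), 6)

def parity7_from_two_luts(*bits):
    """Evaluate the 7-input function using two 6-LUTs and a 2:1 mux driven by x0."""
    x0, rest = bits[0], bits[1:]
    return read_lut(lut_x0_is_1, rest) if x0 else read_lut(lut_x0_is_0, rest)
```

Each of the two tables holds 64 entries, so the 7-input function costs twice the LUT storage of any function with six or fewer inputs.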
The biggest problem with FPGAs used for application acceleration has been programming.
Programming an FPGA is more involved than programming a microcontroller.
Practical Examples
12. Canonical FPGA Floor plan
CLB: Configurable Logic Block
Hard IP block: multiplier, DSP, processor, etc.
Hard IP (Intellectual Property) is fabricated directly on the silicon.
A combinational circuit is represented by a graph.
Interconnects
13. Configurable Logic Block (CLB)
“CLB is the basic logic unit in a
FPGA. Every CLB consists of a
configurable switch matrix with 4
or 6 inputs, some selection
circuitry (MUX, etc), and flip-flops.
The switch matrix is highly flexible
and can be configured to handle
combinatorial logic, shift registers
or RAM”-Xilinx.
Xilinx calls this block a CLB; Altera calls it a LAB (logic array block).
Intra-CLB interconnect is used for local connections.
14. Logic Tile (CLB/LAB) Comparison
Xilinx CLB: 8 6-LUTs, each with 6 inputs
• Each LUT can implement any two functions that together use <= 5 inputs
• Fast carry circuitry: 1 sum bit / 6-LUT
• Flip-flops (FFs): 2 FFs / 6-LUT
Altera LAB: 10 6-LUTs, each with 8 inputs
• Each LUT can be fractured into two independent 4-LUTs
• Fast carry circuitry: 2 sum bits / 6-LUT
• Flip-flops (FFs): 4 FFs / 6-LUT
15. LUT
LUTs are used to implement function generators in CLBs. These function generators can implement any arbitrarily defined Boolean function of their inputs.
A LUT is a small memory that holds the output value for each input combination.
LUT size doubles with each input added.
Both Xilinx and Altera FPGAs allow (some) LUTs to be used as memories.
Example: a 2-input LUT with inputs A and B and output C (here C = A AND B):
A  B  C
0  0  0
0  1  0
1  0  0
1  1  1
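A minimal sketch of the truth table above as a memory lookup, assuming the inputs A and B form the address with A as the most significant bit. The names `lut2`, `AND_LUT` and `XOR_LUT` are illustrative only. Changing the stored bits changes the function without changing the hardware, which is the essence of reconfiguration.

```python
def lut_size(n_inputs):
    """Number of stored bits in an n-input LUT: it doubles with each added input."""
    return 2 ** n_inputs

# Stored contents, indexed by (A << 1) | B.
AND_LUT = [0, 0, 0, 1]   # the truth table shown above
XOR_LUT = [0, 1, 1, 0]   # same structure, different contents

def lut2(contents, a, b):
    """Read a 2-input LUT: the inputs select one stored bit."""
    return contents[(a << 1) | b]
```

For example, `lut2(AND_LUT, 1, 1)` reads the last row of the table, and `lut_size(6)` gives the 64 bits held by a 6-LUT.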
16. LUTs & SRAM
A LUT is a small memory (a 6-LUT is a 64x1-bit memory).
• Altera calls this an MLAB (memory logic array block). One MLAB has 10 6-LUTs = 640 bits of memory.
• Xilinx calls this distributed RAM.
• Generally, this style of RAM is useful for very small RAMs.
SRAM blocks are interspersed in the fabric.
• Generally single- or dual-port.
• Each is 20 Kbits, with configurable aspect ratio: 16Kx1, 8Kx2, … 1Kx20, 512x32, …
• Can be chained together to build deeper, wider RAMs.
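The arithmetic behind these figures can be sketched as follows. The block-count estimate assumes that only the four aspect ratios listed above are available and that blocks are tiled in a simple depth-by-width grid; real devices offer more modes and the vendor tools may pack memories differently. The function names are hypothetical.

```python
from math import ceil

def lut_ram_bits(n_inputs, n_luts=1):
    """Bits of LUT RAM: each n-input LUT stores 2**n bits."""
    return n_luts * 2 ** n_inputs

# (depth, width) modes taken from the list above; other modes are elided there.
MODES = [(16 * 1024, 1), (8 * 1024, 2), (1024, 20), (512, 32)]

def blocks_needed(depth, width, modes=MODES):
    """Return (block count, chosen mode) for the mode that needs the fewest blocks."""
    best = None
    for mode_depth, mode_width in modes:
        # Chain blocks for depth, place them side by side for width.
        count = ceil(depth / mode_depth) * ceil(width / mode_width)
        if best is None or count < best[0]:
            best = (count, (mode_depth, mode_width))
    return best
```

Under these assumptions a 2048x20 RAM is built from two blocks in 1Kx20 mode chained for depth, while a 512x64 RAM uses two blocks in 512x32 mode placed side by side for width.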
17. Cores: Hard & Soft
Hard cores:
• Speeds up to 1 GHz+, so they can achieve much faster processing.
• Fixed and cannot be modified (dedicated silicon area on the FPGA).
• Examples: PowerPC in Virtex-4/5, and the dual-core ARM Cortex-A9 processor in the Zynq-7000 All Programmable SoC from Xilinx.
Soft cores:
• Range from a simple microcontroller to a full-fledged microprocessor.
• Lower speed, around 250 MHz, limited by the speed of the fabric.
• Can be easily modified and tuned to specific requirements (more features, custom instructions, etc.).
• Examples: LEON3, OpenRISC, MicroBlaze, PicoBlaze, Nios II.
18. Slices
Every slice contains four logic-function generators (LUTs), eight storage elements, wide-function muxes, and carry logic.
These elements are used by all slices to provide logic, arithmetic, and ROM functions.
A CLB contains a pair of slices.
The two slices do not have direct connections to each other.
Each slice is organized as a column.
Each slice in a column has an independent carry chain.
Image Source: Xilinx
For each CLB, slices in the bottom of the CLB
are labelled as SLICE(0), and slices in the top
of the CLB are labelled as SLICE(1).
19. Xilinx Virtex-6 FPGA
Logic Block Slice Description
Image Source: “Virtex-6 FPGA CLB User Guide” UG364 (v1.2) February 3, 2012
20. FPGA I/O Support
I/Os are programmable to operate according to a variety of signaling standards.
All state-of-the-art FPGAs incorporate multi-gigabit transceivers (MGTs).
High-speed serial I/Os:
• Virtex-7 and Stratix V: individual MGTs operable up to 28 Gb/s
• Virtex-7 has 2.7 Tb/s peak serial bandwidth
22. About HDL
HDL: Hardware Description Language.
HDL is a collective name for all hardware description languages (such as Verilog and VHDL).
Register-transfer level (RTL) is a design abstraction and a way of describing a circuit.
RTL describes flip-flops, latches, and how data is transferred between them.
You write RTL code in an HDL; synthesis tools then translate it into a gate-level description, in the same HDL, for the target device or process.
A bitstream is the sequence of bits sent to the FPGA to configure it to perform the needed operations.
23. Verilog
Verilog-XL (a logic simulator plus hardware description language) was the first Verilog HDL, developed by Gateway Design Automation in the 1980s.
The Verilog HDL is an IEEE standard (IEEE Std. 1364-1995).
SystemVerilog is a large set of extensions to Verilog.
Verilog and VHDL are two different HDLs.
Why use Verilog?
Structural level (lower level): gate level
• Code is always synthesizable.
Functional level (higher level): gate level, RTL, high-level behavioural
• Easier to write, but not always synthesizable.
24. Verilog
Data types:
Basic type: bit vector. Values: 0, 1, X (unknown or don't care), Z (high impedance)
Examples: binary: 4'b11_10, hex: 16'h034f, decimal: 32'd270
Use wire to connect components:
• Single wire. Example: wire my_wire;
• Array of wires. Example: wire [7:0] my_wire;
Use reg for procedural assignments:
• Example: reg [3:0] accum; // 4-bit “reg”
• A reg is not necessarily a hardware register.
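To make the sized-literal notation concrete, here is a small sketch that decodes literals of the form width'base digits, such as the three examples above. It is a simplified reading aid, not a Verilog parser: it assumes an explicit width, handles only the b, o, d and h bases, and does not accept X or Z digits or signed literals. The function name is hypothetical.

```python
def parse_verilog_literal(text):
    """Return (width, value) for literals like 4'b11_10, 16'h034f or 32'd270."""
    width_text, rest = text.split("'")
    base_char = rest[0].lower()
    digits = rest[1:].replace("_", "")   # underscores are visual separators only
    base = {"b": 2, "o": 8, "d": 10, "h": 16}[base_char]
    width = int(width_text)
    value = int(digits, base)
    return width, value & ((1 << width) - 1)   # keep only the declared number of bits
```

For example, `parse_verilog_literal("4'b11_10")` yields a 4-bit value of 14.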
25. Simple Verilog Code
Synchronous-reset D flip-flop
module dff_sync_reset (
  data  , // Data input
  clk   , // Clock input
  reset , // Reset input (active low, sampled on the clock edge)
  q       // Q output
);
// Input ports
input data, clk, reset;
// Output ports
output q;
// Internal variables
reg q;
// Code starts here
always @ (posedge clk)
  if (~reset) begin
    q <= 1'b0;   // reset is low: clear q
  end else begin
    q <= data;   // otherwise capture data
  end
endmodule // End of module dff_sync_reset
Sample Mux Code
module mux_using_if (
  din_0   , // Mux first input
  din_1   , // Mux second input
  sel     , // Select input
  mux_out   // Mux output
);
// Input ports
input din_0, din_1, sel;
// Output ports
output mux_out;
// Internal variables
reg mux_out;
// Code starts here
always @ (sel or din_0 or din_1)
begin : MUX
  if (sel == 1'b0) begin
    mux_out = din_0;
  end else begin
    mux_out = din_1;
  end
end
endmodule // End of module mux_using_if
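The behaviour of the two modules above can be checked with a cycle-level model before running an HDL simulator. This is a sketch only: it treats each loop step as one rising clock edge, ignores X and Z values, and follows the `if (~reset)` test in the flip-flop, which makes the reset active low. The `simulate` helper is an assumption added for illustration.

```python
def dff_sync_reset(data, reset):
    """Value captured by q at a rising clock edge (reset is active low)."""
    return 0 if reset == 0 else data

def mux_using_if(din_0, din_1, sel):
    """Combinational 2:1 mux: sel == 0 selects din_0, otherwise din_1."""
    return din_0 if sel == 0 else din_1

def simulate(stimulus, q=0):
    """Apply one (data, reset) pair per rising edge and record q after each edge."""
    trace = []
    for data, reset in stimulus:
        q = dff_sync_reset(data, reset)
        trace.append(q)
    return trace
```

Feeding `simulate` the sequence (1, 1), (1, 0), (0, 1), (1, 1) shows q following data except on the edge where reset is low.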
26. VHDL
VHDL: VHSIC Hardware Description Language.
VHSIC: Very High Speed Integrated Circuit.
VHDL was initiated in 1981 by the United States Department of Defense.
1983-85: Development of the baseline language by Intermetrics, IBM and TI.
IEEE standard: IEEE 1076-1993.
Simulation and synthesis are the two main kinds of tools that operate on VHDL.
VHDL supports three levels of abstraction: algorithm, register-transfer level (RTL), and gate level.
Algorithmic descriptions are generally not synthesizable, RTL is the input to synthesis, and gate level is the output from synthesis.
27. Algorithm: consists of a set of instructions and has neither a clock nor delays. Some synthesis tools can take algorithmic VHDL code as input.
RTL: has a clock, but no detailed delays below the cycle level. “Re-timing” is a feature that allows operations to be re-scheduled across clock cycles.
Gates: consists of a network of gates and registers instanced from a technology library, which contains technology-specific delay information for each gate.
[Figure: VHDL levels of abstraction: Algorithm, RTL, Gates]
28. Sample VHDL Code
AND-OR-Invert (AOI) Gate
library IEEE;                  -- library clause
use IEEE.STD_LOGIC_1164.all;   -- use package STD_LOGIC_1164
entity AOI is
  port (
    A, B, C, D : in  STD_LOGIC;
    F          : out STD_LOGIC
  );
end AOI;
architecture V1 of AOI is
begin
  F <= not ((A and B) or (C and D));
end V1;
Sample Mux Code
library ieee;
use ieee.std_logic_1164.all;
entity MUX2to1 is
  port (
    A, B : in  std_logic_vector(7 downto 0);
    Sel  : in  std_logic;
    Y    : out std_logic_vector(7 downto 0)
  );
end MUX2to1;
architecture behavior of MUX2to1 is
begin
  -- Sensitivity list holds all inputs: the process reruns if any of them changes
  process (Sel, A, B)
  begin
    if (Sel = '1') then
      Y <= B;
    else
      Y <= A;
    end if; -- note that *end if* is two words
  end process;
end behavior;
29. Tips on Coding
Note: std_logic_vector is used to define a signal of more than 1 bit. In this case A, B and Y are all 8 bits wide and can be referred to as a vector or as individual elements such as A(7), A(6), etc.
In process (Sel, A, B), the signals in parentheses form the sensitivity list.
Sel is 1 bit, so the syntax is if (Sel = '1').
Rule 1: To synthesize combinational logic using a process, all inputs read by the process must appear in the sensitivity list.
Rule 2: To synthesize combinational logic using a process, all signals assigned in the process must be assigned under all conditions; otherwise the synthesizer infers a latch to hold the old value.
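Rule 2 can be illustrated with a behavioural sketch. The two functions below model one evaluation of a mux process: the first assigns the output on every path, the second omits the else branch. The function names and the explicit `y_prev` argument are assumptions made for illustration; in VHDL the previous value is held implicitly by the signal.

```python
def complete_process(sel, a, b, y_prev):
    """Output assigned on every path: the result depends only on the current inputs."""
    return b if sel == 1 else a

def incomplete_process(sel, a, b, y_prev):
    """No assignment when sel is 0: the output keeps its previous value."""
    return b if sel == 1 else y_prev
```

With sel at 0, the complete version returns a regardless of history, which is combinational behaviour. The incomplete version returns whatever y was before, so the circuit needs storage, and synthesis produces a latch instead of pure combinational logic.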
30. Bits of Advice..
Popular HDLs are Verilog and VHDL.
There are also vendor-specific ones such as AHDL (Altera HDL).
If you are familiar with C/C++ programming, Verilog may be an easier choice than VHDL, since its syntax is similar to C.
Get a simulator. Open source: Icarus Verilog [1] is a Verilog simulation and synthesis tool.
Happy Coding!!!
But… as an alternative, you could use high-level synthesis techniques such as:
• Xilinx's Vivado HLS
• Altera's OpenCL solution
[1] http://iverilog.icarus.com/
31. Xilinx FPGA Design Flow Overview
The design flow comprises the following steps:
• Design entry
• Design synthesis
• Design implementation
• Xilinx® device programming
• Design verification: includes both functional and timing verification, and takes place at different points during the design flow.
Image Source: Xilinx
32. Xilinx FPGA Design Flow Overview
The Synthesis process checks code syntax and analyzes the hierarchy of your design. The resulting netlist is saved to an NGC file (for Xilinx® Synthesis Technology (XST)).
The Check Syntax process checks the syntax of the selected source file prior to generating a netlist of the design by synthesis or compile.
Image Source: Xilinx
Core files are in EDIF (Electronic Design Interchange Format).
NGC files contain both logical design data and constraints.
33. Xilinx FPGA Design Flow Overview
Design Implementation
Image Source: Xilinx
The Translate process merges all input netlists and design constraints and outputs a Xilinx Native Generic Database (NGD) file, which describes the logical design reduced to Xilinx primitives.
• Input formats: EDIF, SEDIF, EDN, EDF, NGC, UCF, NCF, URF, NMC, BMM.
• Output: BLD (report), NGD.
The Map process maps the logic defined by an NGD file into FPGA elements, such as CLBs and IOBs. The output is an NCD file.
The Place and Route process takes a mapped NCD file, places and routes the design, and produces an NCD (Native Circuit Description) file that is used as input for bitstream generation.
The Generate Programming File process produces a bitstream for Xilinx device configuration. After the design is completely routed, you must configure the device so it can execute the desired function.
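The hand-off between these processes can be summarised as a small data structure. This is only a bookkeeping sketch of the file types named on these slides: each stage lists one representative input and its main output, other accepted inputs (EDIF, UCF and so on) are left out, and "HDL" and "bitstream" are labels rather than file extensions.

```python
# (process, main input, main output), in the order described above.
FLOW = [
    ("Synthesis", "HDL", "NGC"),
    ("Translate", "NGC", "NGD"),
    ("Map", "NGD", "NCD"),
    ("Place and Route", "NCD", "NCD"),
    ("Generate Programming File", "NCD", "bitstream"),
]

def chain_is_consistent(flow):
    """True when every stage consumes what the previous stage produced."""
    return all(flow[i][2] == flow[i + 1][1] for i in range(len(flow) - 1))

def stages_producing(flow, file_type):
    """Names of the processes whose main output is the given file type."""
    return [name for name, _, out in flow if out == file_type]
```

Note that NCD appears twice as an output: Map writes a mapped NCD file, and Place and Route writes a routed one.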
34. Introduction to High-Level Synthesis (HLS)
[Figure: traditional process flow without HLS vs. HLS process flow]
35. HLS
LLVM: Low-Level Virtual Machine
• Open-source compiler framework (http://llvm.org)
• Used by Apple, NVIDIA, AMD and others
• Code quality competitive with gcc; performs 50+ standard optimizations
• Several HLS tools (LegUp, Altera, Xilinx) are built as “back-ends” of LLVM
• LLVM compiles C code into a control flow graph (CFG)
36. HLS
Control flow graph (CFG):
• Composed of basic blocks
• A basic block is a sequence of instructions (shift, add, divide, xor, and, branch, call, etc.) terminated by exactly one branch
• Each basic block can be represented by an acyclic data flow graph
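The definition of a basic block can be sketched by cutting an instruction list after every branch. The instruction strings below are an invented mini-notation, not LLVM IR, and the sketch assumes that branch targets always fall at the start of a block, so it does not need to split at labels. Treating `ret` as a terminator alongside `br` is also an assumption of this sketch.

```python
TERMINATORS = {"br", "ret"}   # opcodes that end a basic block in this sketch

def split_basic_blocks(instructions):
    """Group instructions into blocks, each ending with exactly one terminator."""
    blocks, current = [], []
    for instr in instructions:
        current.append(instr)
        if instr.split()[0] in TERMINATORS:
            blocks.append(current)
            current = []
    if current:                # trailing instructions without a terminator
        blocks.append(current)
    return blocks

program = [
    "add t0 b c",
    "shl t1 t0 1",
    "br loop",
    "xor t2 t1 t0",
    "ret t2",
]
blocks = split_basic_blocks(program)
```

Here `program` splits into two blocks of three and two instructions. Within each block there is no control flow, which is why its operations can be drawn as an acyclic data flow graph.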
HLS tools (both built within the LLVM compiler framework):
• Xilinx Vivado HLS (language support: C, C++, SystemC)
• Altera OpenCL SDK
Open Computing Language (OpenCL) is the first open, royalty-free standard for cross-platform, parallel programming.
https://www.khronos.org/opencl/
37. HLS: Key Aspect
Scheduling: defines the hardware's finite state machine.
• How should the computations of a program be assigned to hardware time steps?
• Which operations can be scheduled in the same time step?
• Which operations depend on others?
SDC [1] (System of Difference Constraints): formulates scheduling as a mathematical optimization problem (a linear program, LP).
Variables: for each operation (op) to schedule, create a variable (var) that holds the cycle number in which the op is scheduled.
[1] Cong, Zhang, “An efficient and versatile scheduling algorithm based on SDC formulation”. DAC 2006: 433-438.
38. Constraints:
• Dependency constraints: e.g. the subtract can only happen after the add and shift.
• Clock period constraints: for each chain of dependent operations in the data flow graph (DFG), find the path delay D.
• Resource constraints: e.g. allow up to 2 load/store operations in a cycle.
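The dependency constraints alone already form a system of difference constraints: for every edge from producer to consumer, start[consumer] - start[producer] >= latency[producer]. The sketch below finds the earliest feasible cycle for each operation by repeated relaxation (a longest-path computation) instead of the linear program used in the SDC paper, and it ignores the clock period and resource constraints. The one-cycle latencies and all names are assumptions for illustration.

```python
def schedule_asap(ops, deps, latency):
    """Earliest start cycle per op, given (producer, consumer) dependency pairs."""
    start = {op: 0 for op in ops}
    for _ in range(len(ops) + 1):          # enough passes for any acyclic graph
        changed = False
        for producer, consumer in deps:
            earliest = start[producer] + latency[producer]
            if start[consumer] < earliest:  # constraint violated: push consumer later
                start[consumer] = earliest
                changed = True
        if not changed:
            return start
    raise ValueError("dependency cycle: the constraints cannot be satisfied")

# The example from the slide: the subtract depends on the add and the shift.
ops = ["add", "shift", "sub"]
deps = [("add", "sub"), ("shift", "sub")]
latency = {"add": 1, "shift": 1, "sub": 1}
schedule = schedule_asap(ops, deps, latency)
```

The add and shift land in cycle 0 and the subtract in cycle 1, so the two independent operations share a time step while the dependent one waits.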
Binding: e.g. Bind the following scheduled operations.
Loop pipelining: overlap the execution of adjacent loop iterations. It can be combined with loop unrolling.
for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}
Each iteration requires:
• 2 loads from memory
• 1 store
There are no dependencies between iterations.
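A rough cycle count shows why pipelining this loop pays off. The sketch assumes the resource constraint quoted earlier (at most 2 load/store operations per cycle), so the 3 memory operations per iteration bound the initiation interval (II), the number of cycles between the starts of consecutive iterations. The iteration latency of 3 cycles is a made-up figure, and the function names are hypothetical.

```python
from math import ceil

def resource_min_ii(mem_ops_per_iteration, mem_ports):
    """Smallest initiation interval allowed by the memory ports alone."""
    return ceil(mem_ops_per_iteration / mem_ports)

def sequential_cycles(n_iterations, iteration_latency):
    """Each iteration finishes before the next one starts."""
    return n_iterations * iteration_latency

def pipelined_cycles(n_iterations, ii, iteration_latency):
    """A new iteration starts every ii cycles; the last one still needs its full latency."""
    return (n_iterations - 1) * ii + iteration_latency
```

With 2 loads and 1 store per iteration and 2 memory ports, `resource_min_ii(3, 2)` gives an II of 2. For N = 100 and the assumed latency of 3, the loop takes 201 cycles pipelined against 300 cycles sequentially.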
39. HLS Pragma
Examples:
• PIPELINE: pipeline a loop
• UNROLL: unroll a loop
• ARRAY_PARTITION: partition an array into multiple arrays for parallel access
• ARRAY_MAP: map multiple arrays into a single array
• INLINE: inline a function
• LATENCY: set the scheduling latency
• ALLOCATION: limit the number of hardware instances of an operation, core or function
40. Recommended Readings
Books:
• Advanced FPGA Design: Architecture, Implementation, and Optimization, 1st Edition.
• Circuit Design with VHDL by Volnei A. Pedroni.
• Embedded Systems Design with Platform FPGAs: Principles and Practices, 1st Edition, by Ronald Sass and Andrew G. Schmidt.
• Verilog HDL: A Guide to Digital Design and Synthesis by Samir Palnitkar.
• Advanced Chip Design, Practical Examples in Verilog by Kishore K. Mishra.
• The Verilog Hardware Description Language by Philip R. Moorby and Donald E. Thomas.
• List of books: http://www.verilog.com/v-books.html
Tutorials:
• http://www.fpga4fun.com/HDL%20tutorials.html
41. Notice
These slides were prepared from the slides of Prof. Jason Anderson (University of Toronto, Canada), with the help of Xilinx manuals.