This document provides an overview of register transfer level (RTL) design and programmable logic circuits. It defines RTL as describing digital systems using register transfers, where data is processed by flowing between registers. Graphs are used to describe different levels of parallelism in register transfers. Control can be centralized or distributed. An example multiplier implementation is provided to illustrate an RTL design. The history of programmable logic devices is covered, from PALs to FPGAs, which allow implementing designs in reprogrammable logic and interconnects.
2. Outline
What is Register Transfer Level
• Data path
• Control
Description of Register Transfers with Graphs
• Parallelism
From Graphs to RTL Implementations
• Sequential and parallel RTL
• Centralized and distributed control
RTL design example
• Multiplier
History of Programmable logic circuits
Modern FPGAs
4. Motivation: Large digital system
Recall: hierarchical design of the plane flap control
For sequential systems, state machines can be hierarchical as well
Register Transfer Level (RTL) is a technique to describe hierarchical and large digital systems
[Figure: hierarchical flap control system with blocks processor a, processor b, processor c, resolver and flap control]
5. Recall our previous example
The system multiplies A by 2, 3, or 4
We implemented the control for the
multiplexers, which effectively selects
what values the adder sums up
Our state machine controls the flow of
data
The whole system is divided into two
logical parts:
• Control
• Data path
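[Figure: block diagram divided into Control and Data path; signals A, X1, X0, start, Y1, Y0, ready, f and Clk; the data path contains multiplexers, an adder (+) and a clocked register]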
6. Benefits
Data path can be designed
independently of the control
Control was simplified
• Only 11 rows in state table
• General purpose: the width of A can be
changed without changes to control
We divided the problem into logical
parts and developed them
independently
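The separation can be illustrated with a minimal Python sketch. It is not the circuit from the slide: the names `DataPath`, `control` and `multiply` are hypothetical, and the sketch assumes the product is formed by adding A to a result register once per clock cycle. What it shows is that only the data path depends on the width of A, while the control sequence depends only on the factor.

```python
class DataPath:
    """Adder plus result register f; knows nothing about the control sequence."""

    def __init__(self, width: int):
        # Assumption: the result register is two bits wider than A so 4*A fits.
        self.mask = (1 << (width + 2)) - 1
        self.f = 0

    def clock(self, a: int, clear: bool, add: bool) -> None:
        """One clock edge: clear f, add A to f, or hold the value."""
        if clear:
            self.f = 0
        elif add:
            self.f = (self.f + a) & self.mask


def control(factor: int):
    """Yield (clear, add, ready) for each clock cycle; independent of A's width."""
    yield (True, False, False)       # start: clear the result register
    for _ in range(factor):
        yield (False, True, False)   # one addition of A per cycle
    yield (False, False, True)       # done: result is valid


def multiply(a: int, factor: int, width: int) -> int:
    """Run the control sequence against a data path of the given width."""
    datapath = DataPath(width)
    for clear, add, ready in control(factor):
        datapath.clock(a, clear, add)
        if ready:
            return datapath.f
    raise RuntimeError("control sequence ended without ready")
```

Calling `multiply(5, 3, width=4)` and `multiply(200, 4, width=8)` uses exactly the same `control` generator; only the `DataPath` instance changes.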
[Figure: the same Control and Data path block diagram together with the control state table. Present state (PS) rows: 2∙A, 3∙A, 4∙A, done and others; next state (NS) columns for inputs X1 X0 st = 000 … 111; output columns m, r. Output entries listed with the states: 2∙A: 10,0; 3∙A: 11,0; 4∙A: 11,0; done: 01,1; other: 01,1]
7. If not divided into control and data path…
Alternatively, the system could be
designed as a single sequential
system:
• State table size depends on A’s width
• The wider A is, the more complex the
design
A benefit is that the result might
have been possible to obtain in
one clock cycle
• For example implemented with
memory
A trade off exists between the
“design speed” and system
speed!
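[Figure: single sequential system with inputs X1, X0, A and Clk and outputs f (result) and ready. Its state table has present state (PS) rows and one next state (NS) column for every combination of x1, x0, st and A: 0000000, 0000001, …, 1111111]
“Stupid design”: A’s width affects the state table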
8. What is RTL?
The function of the whole system is performed as a sequence of register transfers (in one or more clock
cycles)
Register transfer: A transformation performed on data while the data is transferred from one register to
another
[Figure: chain of RTL registers; a function ƒ is applied to the data on its way from one register to the next]
9. What is ”Register” at RTL Level?
RTL Register is a general term
• Previously we used “register” only to store
the state of a state machine
RTL Register is a functional unit
that performs a specified operation
• Each functional unit has a set of internal
operations it can do
RTL Register transfer input data
comes from
• Another clocked register (functional
unit,…)
• External input data
RTL register is synchronous
(clocked)
[Figure: RTL register containing a Control block and a Data path block; ports Data_in, Data_out, Control_in and Control_out; Control and Status signals between the two blocks]
10. Data and Control
Data(path)
• Processes or moves data
• Many registers and functional units
• Operation selected by control
Control Unit controls the sequence of transfers
• Specifies what the data path does
• Most often includes a state machine
• Can also be purely combinational (implements truth table)
• Control decisions are based on status information from data path or
from external information (e.g. user interface)
11. Register Transfer Expression
General form of register transfer
expression:
• Destination Register ← Information source
• where ‘←’ is the replacement operator
• E.g. A ← B*C
Destination register content is
replaced by the source information
• Usually done at the next clock edge
Basis of VHDL and Verilog
languages
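A register transfer such as A ← B*C can be mimicked in a few lines of Python. This is only a sketch: the `Register` class and the `clock_edge` helper are hypothetical names, and the model assumes every register replaces its content at the same clock edge.

```python
class Register:
    """Clocked register: d is the pending input, q the stored output."""

    def __init__(self, value: int = 0):
        self.q = value
        self.d = value


def clock_edge(*registers: Register) -> None:
    """All registers replace their content at the same clock edge."""
    for reg in registers:
        reg.q = reg.d


A, B, C = Register(0), Register(3), Register(4)

A.d = B.q * C.q      # source information for A <- B*C
before = A.q         # still the old content, no clock edge yet
clock_edge(A, B, C)
after = A.q          # replaced by B*C at the clock edge
```

The destination keeps its old content until the clock edge, which is the behaviour a clocked assignment describes in VHDL or Verilog.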
13. Register transfers can be described with task graphs
Example algorithm:
We want to replace exponentiation by multiplications and additions
We can do this in several ways depending on how many independent operations can be performed at the same time (level
of parallelism)
Two examples: recursive and factored
• We can draw them as data flow graphs
14. Sequential Unfolded Graph
Node = execution = RTL
register
Arc = predecessor
• Execution can start when all
predecessors are ready = data is
available from all inputs
One RTL register is active
at a time
• No parallel execution, only
sequential one after another
It takes 7 cycles to get the
result
15. Sequential Iterative Graph (Loop)
Some of the RTL
registers are reused
during the
whole operation
Still sequential
operation
16. Parallel Graph
More than one functional
unit is active at the same
time independently
Dependency exists
between rows (I-III), but
all RTL transfers can take
place at the same time
[Figure: graph B drawn in three rows (I, II, III) of RTL transfers]
17. Hybrid graph
Within each group
truly parallel
execution, but
between groups
sequential
It takes 3 cycles to
give the output
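The cycle counts of the sequential and grouped schedules can be reproduced with a small Python sketch. The graph below is a hypothetical seven-node example (four multiplications feeding three additions), not the graph from the slides, and `asap_levels` is an illustrative helper that places every node in the earliest cycle in which all of its predecessors are ready.

```python
def asap_levels(preds: dict) -> dict:
    """Map each node to the earliest cycle in which it can execute."""
    level = {}

    def visit(node):
        if node not in level:
            # A node can start one cycle after its latest predecessor.
            level[node] = 1 + max((visit(p) for p in preds[node]), default=0)
        return level[node]

    for node in preds:
        visit(node)
    return level


# Hypothetical task graph: node -> list of predecessor nodes.
graph = {
    "m1": [], "m2": [], "m3": [], "m4": [],
    "a1": ["m1", "m2"], "a2": ["m3", "m4"],
    "a3": ["a1", "a2"],
}

levels = asap_levels(graph)
sequential_cycles = len(graph)         # one node active at a time
grouped_cycles = max(levels.values())  # parallel within a group, sequential between groups
```

For this graph the sequential schedule needs 7 cycles and the grouped schedule needs 3, one cycle per group.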
[Figure: hybrid graph with groups I, II and III]
18. Graph Transformations
Change the level of parallelism: from parallel to sequential
• Note that we need to store values that wait for the following operations
(extra data storage registers)
[Figure: a parallel graph with nodes A to H and the corresponding sequential graph]
20. From Graphs to Implementation
Mapping is the process of placing RTL level graph tasks
into available hardware components (RTL registers)
Functional units can be utilized in three ways depending
on parallelism:
• Nonsharing system
• Sharing system
• Unimodule system
21. Non-sharing
Each execution graph node has a corresponding functional unit in hardware
Arcs in graphs correspond to connections between functional units
Also called one-to-one and direct mapping
[Figure: graph and its RTL implementation with one multiplier and three multiply-accumulate units]
22. Sharing and Unimodule Systems
Sharing system
• One or more functional units are reused
Unimodule system
• Extreme sharing, only one functional unit
[Figure: graph, transformed graph and RTL implementation with a multiplier, a multiply-accumulate unit and a register]
23. Centralized Control of Functional Units in Datapath
One logical (most often also physical)
control unit
• Easy to design and manage
• Problem: lots of signals
24. Distributed Control of Functional Units in Datapath
Control is decentralized and included
in functional units
Functional unit can independently
decide what operation is performed
Triggering of execution
• Perform operation when input data is
received
• A token is moved with data and tells what
operation is performed to the accompanying
data
• Modules collaborate to control operations
(pass control signals)
27. Example: Sharing
Instantiate identical multiply-accumulate modules several times
• Two independent MAC units per module (left and right side) in this example
Central control
Mapping corresponds to group-sequential graph
28. Unimodule
• The module performs the operation
• Sequence of register transfers
• Note how close to assembly
language!
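A unimodule system can be sketched in Python as one functional unit that is reused for every register transfer. The example assumes, purely for illustration, that the single unit is a multiply-accumulate and that the task is evaluating a polynomial; `mac` and `unimodule_poly` are hypothetical names. The recorded trace reads much like an assembly listing.

```python
def mac(acc: int, x: int, c: int) -> int:
    """The only functional unit: multiply-accumulate."""
    return acc * x + c


def unimodule_poly(coeffs, x):
    """Evaluate a polynomial (highest-order coefficient first) with one MAC unit."""
    acc = 0
    trace = []
    for c in coeffs:
        acc = mac(acc, x, c)                     # one register transfer per cycle
        trace.append(f"ACC <- ACC * X + {c}")
    return acc, trace


value, trace = unimodule_poly([2, 3, 4], 5)      # 2*x**2 + 3*x + 4 at x = 5
```

Each entry in `trace` is one register transfer executed by the same module, so the number of cycles equals the number of coefficients.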
[Figure: the operation, with symbols z, a, b, c and +]
30. Multiplier
• Specification
• Use “Right-shift multiplication algorithm with partial
products” (recall last lecture!)
$$z^{(j+1)} = \left(z^{(j)} + y_j \, x \, 2^{n}\right) 2^{-1}$$
• $z^{(j+1)}$: next partial product
• $x \, 2^{n}$: align to left (MSB side of product)
• $+$: add
• $2^{-1}$: shift right
31. Example of Right-shift Multiplication Algorithm With Partial Products
$$z^{(j+1)} = \left(z^{(j)} + y_j \, x \, 2^{n}\right) 2^{-1}$$
[Figure: worked example with multiplicand x and multiplier y. Start from partial product 0; x·y0 aligned to left, add PP0, shift to obtain PP1; x·y1 aligned to left, add PP1, shift to obtain PP2]
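The right-shift recurrence can be checked with a short Python sketch. It assumes unsigned n-bit operands, with x as the multiplicand and y as the multiplier; the function name `right_shift_multiply` is hypothetical.

```python
def right_shift_multiply(x: int, y: int, n: int) -> int:
    """Multiply two unsigned n-bit integers with add and right-shift steps."""
    if not (0 <= x < 2 ** n and 0 <= y < 2 ** n):
        raise ValueError("operands must fit in n bits")
    z = 0                                # partial product 0
    for j in range(n):
        y_j = (y >> j) & 1               # current multiplier bit, LSB first
        z = (z + y_j * (x << n)) >> 1    # align left, add, shift right
    return z


product = right_shift_multiply(13, 11, 4)
```

After n steps the partial product register holds x·y. The right shift never discards a set bit, because the lowest bit of the sum is always zero before the shift.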
32. Example of Right-shift Multiplication Algorithm With Partial Products
Recurrence requires the following operations
Task graph:
33. Highest Level Multiplier Block Diagram
Only one external control signal, “start”, indicates that X and Y are ready at the inputs
An “output ready” signal is also recommended, to indicate to another system that the result is
ready
• Depends on how this multiplier is integrated into a larger system
“Multiplier” here consists of internal control and data subsystems
34. Datapath Design of Multiplier
Corresponds to the block diagram shown in the previous
lecture, but here the multiplier register (Y) is not reused
[Figure: datapath with the (partial) product register and the multiplicand and multiplier registers]
ld = load
sh = shift
clr = clear
“Align left” = place the adder result to the MSB side of Z
“Right shift” = hardwired, no separate shifting of Z register
36. Control Design of the Multiplier
State diagram, design of state machine
37. Conclusions
We could express the application algorithm in several ways, with
varying level of parallelism
Thus, there can be several RTL descriptions as well
• Trade off between speed and area
RTL is the standard level of abstraction in VHDL and Verilog
languages
FPGA design is performed at RTL level, even with HDL designer’s
block diagrams
38. RTL Summary
Data is processed by register
transfers
• Flowing linearly through each level
• Looping by reusing some levels
• In parallel
RTL Register includes internally
a data path and control
Control can be centralized or
distributed
40. Motivation
Before programmable circuits there
were only
• ASICs (Application Specific Integrated Circuits)
  • Expensive, fixed function
  • Design may still take 2 years
• Discrete components
  • Require lots of space and power
  • Not possible for current products
41. PAL – Programmable Array Logic (1978)
One-time programmable AND stage for SOP (sum-of-products) expressions
Fixed OR stage
42. PAL – Programmable Array Logic
Was used in Digital Design course exercise in 1990!
The programming was defined in a text file
45. Do it yourself: PSA using memory
You can also use plain
memory for state
machine implementation
Cheap, any kind of
function, slower than
gate network + register
Feasible for simple
devices like toys
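A memory-based state machine can be sketched in Python as a table lookup. The example is hypothetical: a 2-bit counter with an enable input, where the memory address is formed from the present state and the input, and the stored word is the next state. The names `ROM`, `step` and `run` are illustrative.

```python
# Memory contents: address = (state << 1) | enable, data word = next state.
ROM = [(state + enable) % 4 for state in range(4) for enable in (0, 1)]


def step(state: int, enable: int) -> int:
    """One clock cycle: read the next state from memory."""
    return ROM[(state << 1) | enable]


def run(inputs, state: int = 0):
    """Apply a sequence of enable inputs and return the visited states."""
    trace = []
    for enable in inputs:
        state = step(state, enable)
        trace.append(state)
    return trace


states = run([1, 1, 0, 1, 1])
```

Changing the behaviour of the machine only requires new memory contents, which is why any function can be realized this way.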
46. Re-programmable circuits (1983)
“The probabilities are high that someone will produce an electrically alterable logic array” Hartmann,
Newhagen and Magranet 1982
Altera presented the first re-programmable device in 1983
Programmable Logic Device (PLD)
• EP300: 320 gates, 3-µm CMOS
• 10-MHz, 20 I/O Pins
• Programming was erased by exposing the chip to UV light, which is why the package had a window
49. FPGA Logic Elements
LE is programmed by configuring the Look-up-Table
• Any switching expression of the n LUT inputs can be realized by storing the right values
to the SRAM cells (minterms)
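The look-up table idea can be sketched in Python: the SRAM cells hold one output bit per input combination, and the inputs simply address a cell. `make_lut` and `lut_read` are hypothetical helper names, and the 3-input majority function is only an example.

```python
def make_lut(func, n: int):
    """Fill the 2**n SRAM cells with the truth table of func."""
    cells = []
    for address in range(2 ** n):
        inputs = [(address >> i) & 1 for i in range(n)]  # input i is address bit i
        cells.append(func(*inputs) & 1)
    return cells


def lut_read(cells, *inputs):
    """Evaluate the programmed function: the inputs form the cell address."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return cells[address]


majority = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)
```

Reprogramming the logic element means writing different cell contents; the read path stays the same.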
53. FPGAs today
The largest FPGAs have ~10M Logic Elements and ~2000 I/O pins
A rapidly growing application area is machine learning in large data centers
The trend is also to add hard cores (processors, special function units) instead of increasing the LE count
55. Hard CPU core and FPGA on the same chip
Example: Xilinx Zynq-7000
Used in course FPGA board (ZYNQ XC7Z020-1CLG400C)
56. FPGA Summary
Solution between general purpose processor and fixed
function ASIC
• Slower than ASIC
• Faster than processor in special functions (massive parallelism)
FPGAs are used in ASIC emulation, AI computation and
interfacing complex logic
Current FPGAs also include processor cores and other
fixed blocks -> System on Chip
57. TAU’s first FPGA-based Neural Network computer “TUTNC” (1996)
[Figure: TUTNC board with FPGA, DSP processor, SRAM and an interface to the host PC]