VHDL Implementation of a CORDIC Arithmetic Processor Chip
Faculty of Computing and Information Technology
Department of Robotics and Digital Technology
Technical Report 94-9
A VHDL Implementation of a CORDIC
Arithmetic Processor Chip
Grant Hampson, Student Member, IEEE
Andrew Paplinski, Member, IEEE
October 10, 1994
Enquiries:-
Technical Report Coordinator
Robotics and Digital Technology
Monash University
Clayton VIC 3168
Australia
tr.coord@rdt.monash.edu.au +61 3 905 3402
List of Figures
1.1 Rotation of a point in 2-D space.
2.1 Generic Processor Architecture.
2.2 An Optimised Word-Serial CORDIC Architecture.
2.3 Word-Parallel CORDIC architecture with possible data pipelining.
3.1 Numerical accuracy of the CORDIC processor.
3.2 Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath.
3.3 A plot showing bits of error for a typical test vector rotated through all possible angles.
3.4 A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results.
3.5 An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results.
3.6 Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme.
3.7 An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.
4.1 The basic CORDIC unit.
4.2 A Hierarchical Design of the Adder/Subtracter for n = 4.
4.3 A Flat Design of the Adder/Subtracter for n = 4.
4.4 A Behavioural Design of the Adder/Subtracter for n = 4.
4.5 The structure of CORDIC unit showing the various entities.
4.6 The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range and Rounding components.
Abstract
This report describes the fundamentals of the CORDIC (Co-ordinate Rotations Digital
Computer) algorithm and a possible implementation using the VHDL hardware description
language. The errors associated with a fixed point implementation of CORDIC are
analysed, and methods for reducing these errors are discussed. One such method is a
normalisation scheme which reduces error and requires no extra hardware. Various CORDIC
structures and possible VHDL implementations are described in detail, including design
and language issues. Finally a parallel hardware implementation is described and simulated.
CORDIC has many applications, some of which can be used for array imaging techniques.
Keywords
CORDIC, VHDL
Preface
CORDIC is an acronym for Coordinate Rotations Digital Computer and was derived by
Volder [1] in the late 1950's for the purpose of calculating trigonometric functions. Its
popularity came about nearly twenty years later when VLSI solutions became a reality.
The original algorithm describes the rotation of a 2-D vector, which can be applied
in areas such as Digital Signal Processing [2] (Fourier Transforms, Digital Filters),
Computer Graphics [3] and Robotics [4].
CORDIC processing offers high computational rates, making it attractive to applications
such as computer graphics where a combination of scaling and rotations is required
in real time. CORDIC is also attractive to Robotics since the fundamental operation is
coordinate transformation; however, it could also be used for more computationally
intensive processes such as motion planning and collision detection.
Array Imaging typically involves complex signal processing which may require many
computationally intensive matrix operations. Increasing the complexity of the imaging
model places greater demands on accuracy. Solutions to such complex systems require
better, and hence more complex, algorithms. Most of these algorithms are based on matrix
factorisation (decomposition) techniques, of which Singular Value Decomposition (SVD)
is the most robust method. The SVD factorisation requires a two-sided transformation
which involves several trigonometric operations and rotations ideally suited to dedicated
VLSI hardware (CORDIC processing) for real time calculations. CORDIC has also been
applied to phase correction in dynamic range focusing when Digital Baseband
Demodulation [5] techniques are employed in Interpolation Beamforming [6]. A complex signal
is represented by its in-phase, I, and quadrature, Q, components, which are phase corrected
by rotating the complex signal.
Haviland and Tuszynski designed and built a CORDIC processor [7] in 1980 which
used an iterative process to calculate circular, linear and hyperbolic functions. A more
recent implementation (1993) by Duprat and Muller [8] discusses the possibility of using
a redundant number system for the representation of a signed digit.
This report is broken into four logical sections, namely, CORDIC Theory, Hardware
Implementations, Improving CORDIC Accuracy and finally a VHDL Implementation.
Chapter 1
The CORDIC Algorithm
Consider a 2-D vector $(x, y)$ represented by a point $v = x + jy$ in the complex plane. If
the vector is rotated by an angle $\phi$, the new co-ordinate vector is given by:
$$\tilde{v} = v\, e^{j\phi} \tag{1.1}$$
and shown in Figure (1.1).
Figure 1.1: Rotation of a point in 2-D space.
The angle $\phi$ can be expanded into a set of elementary angles $\alpha_i$ with pseudo-digits
$q_i \in \{-1, +1\}$, and angle expansion error $z_n$, such that
$$\phi = \sum_{i=-1}^{n-1} q_i \alpha_i + z_n \tag{1.2}$$
and the sub-rotation angles $\alpha_i$ take on the following values:
$$\alpha_i = \begin{cases} \pi/2 & \text{for } i = -1 \\ \arctan(2^{-i}) & \text{for } i = 0, 1, \ldots, n-1 \end{cases} \tag{1.3}$$
Note that $\alpha_i$ is approximately equal to but less than $2^{-i}$ and the resulting angular
expansion error is therefore $|z_n| < 2^{-(n-1)}$.
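Substitution of Equation (1.2) into Equation (1.1) gives:
$$\tilde{v} = v \prod_{i=-1}^{n-1} e^{j q_i \alpha_i}\, e^{j z_n} = v\,(j q_{-1}) \prod_{i=0}^{n-1} e^{j q_i \alpha_i}\, e^{j z_n} \tag{1.4}$$
and expanding $e^{j q_i \alpha_i}$, using $q_i = \pm 1$ and $\tan\alpha_i = 2^{-i}$,
$$e^{j q_i \alpha_i} = \cos q_i\alpha_i + j \sin q_i\alpha_i = \cos q_i\alpha_i\,(1 + j \tan q_i\alpha_i) = \cos\alpha_i\,(1 + j q_i 2^{-i})$$
Finally
$$\tilde{v} = v \left(\prod_{i=0}^{n-1} \cos\alpha_i\right) (j q_{-1}) \left(\prod_{i=0}^{n-1} \left(1 + j q_i 2^{-i}\right)\right) e^{j z_n} \tag{1.5}$$
The range of rotation angles which can be represented by Equation (1.2) is $\pm\phi_{max}$, where
$$\phi_{max} = \sum_{i=-1}^{n-1} \alpha_i \approx 190^\circ \tag{1.6}$$
and some values of $\alpha_i$ are given in Table (1.1).
If the expected range of rotation angles is $\pm 90^\circ$ then the initial rotation by $\pm 90^\circ$, that
is, $e^{j q_{-1} \pi/2} = j q_{-1}$, does not have to be performed and the initial rotation is by $\pm 45^\circ$.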
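The second term of Equation (1.5), the product of cosines, is a constant scaling factor.
For a given value of $n$ it can be pre-evaluated using Equation (1.7), and the first 15
values are evaluated in Table (1.2).
$$K_n = \prod_{i=0}^{n-1} \cos\alpha_i = \prod_{i=0}^{n-1} \left(1 + 2^{-2i}\right)^{-1/2} = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + \frac{1}{4^i}}} \tag{1.7}$$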
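The basic CORDIC algorithm, which describes rotation of a unity length vector
$v = x + jy$ by an angle $\phi$, can be derived from Equation (1.5) using the following
initial conditions, where $z_i$ is the accumulated angular residue:
$$v_{-1} = v\, K_n, \qquad z_{-1} = \phi$$
and proceeding with $i = -1, 0, \ldots, n-1$:
$$q_i = \begin{cases} -1 & \text{if } z_i < 0 \\ +1 & \text{if } z_i \ge 0 \end{cases} \tag{1.8}$$
$$v_{i+1} = \begin{cases} v_i\, j q_i & \text{if } i = -1 \\ v_i\,(1 + j q_i 2^{-i}) & \text{if } i = 0, 1, \ldots, n-1 \end{cases} \tag{1.9}$$
$$z_{i+1} = z_i - q_i \alpha_i \tag{1.10}$$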
The final rotated vector is $v_n$, with angle expansion error $z_n$:
$$v_n = v\, e^{j\phi}\, e^{-j z_n} = \tilde{v}\, e^{-j z_n} \tag{1.11}$$
$$z_n = \phi - \sum_{i=-1}^{n-1} q_i \alpha_i \tag{1.12}$$
One complex operation on $v_i$ is equivalent to two operations on real numbers. For $i = -1$,
$$x_0 + j y_0 = j q_{-1} (x_{-1} + j y_{-1})$$
hence
$$x_0 = -q_{-1}\, y_{-1} \tag{1.13}$$
$$y_0 = q_{-1}\, x_{-1} \tag{1.14}$$
For $i = 0, 1, \ldots, n-1$,
$$x_{i+1} + j y_{i+1} = (x_i + j y_i)(1 + j q_i 2^{-i})$$
hence
$$x_{i+1} = x_i - q_i\, y_i\, 2^{-i} \tag{1.15}$$
$$y_{i+1} = y_i + q_i\, x_i\, 2^{-i} \tag{1.16}$$
The CORDIC algorithm therefore reduces to an iterative set of operations consisting of a
binary shift and an accumulation for each of $x$, $y$ and $z$.
Refer to Appendix A for a list of transcendental functions.
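The iteration of Equations (1.8) to (1.16) can be sketched in Python as follows. This is a floating-point model under stated assumptions, not the hardware described later: the multiplications by $2^{-i}$ stand in for binary shifts, and the function name `cordic_rotate` and its default of 16 stages are illustrative choices.

```python
import math

def cordic_rotate(x, y, phi, n=16):
    """Rotate (x, y) by phi radians; returns (x_n, y_n, z_n)."""
    # Initial conditions: v_{-1} = v * K_n and z_{-1} = phi.
    k_n = 1.0
    for i in range(n):
        k_n /= math.sqrt(1.0 + 4.0 ** -i)
    x, y, z = x * k_n, y * k_n, phi

    # Stage i = -1: rotation by +/-90 degrees (Equations 1.13 and 1.14).
    q = 1 if z >= 0 else -1
    x, y = -q * y, q * x
    z -= q * math.pi / 2

    # Stages i = 0 .. n-1: shift-and-add (Equations 1.15, 1.16 and 1.10).
    for i in range(n):
        q = 1 if z >= 0 else -1
        x, y = x - q * y * 2.0 ** -i, y + q * x * 2.0 ** -i
        z -= q * math.atan(2.0 ** -i)
    return x, y, z

if __name__ == "__main__":
    x, y, z = cordic_rotate(1.0, 0.0, math.radians(30.0))
    print(x, y, z)  # x and y approximate cos(30 deg) and sin(30 deg)
```

Rotating the unit vector (1, 0) in this way yields approximations of the cosine and sine of the rotation angle, with the residue $z_n$ indicating how closely the angle was expanded.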
Chapter 2
CORDIC Hardware Implementations
A hardware implementation of a CORDIC processor depends on the number of functions
required and the computational speed needed. If all functions are to be computed, then
there will be a necessary overhead for selecting each function. However, a small, fast
design will result if only a small number of functions is required. This chapter presents
possible solutions to a mixture of design problems.
2.1 CORDIC Processor Architecture
A CORDIC algorithm can take on two primary architectures, namely, word-serial or
word-parallel. A word-serial processor minimises hardware requirements by utilising a single
CORDIC unit repeatedly. However, iterative algorithms which are controlled by a small
number of variables can be expanded over a two-dimensional area, i.e., instead of executing
a certain set of instructions n times using a single element (e.g., a CORDIC unit), n
duplicated elementary cells are used in successive steps of an iteration [9]. This flattened
structure can perform many operations in parallel and is called a word-parallel
CORDIC processor.
A word-parallel architecture has the advantage of being up to n times faster but, due
to the expansion, requires at worst n times more hardware. However, the word-serial
architecture requires complex controlling hardware and a variable shifter, which reduces
its hardware saving relative to the word-parallel structure.
2.1.1 A Word-Serial CORDIC Architecture
The CORDIC algorithm has the advantage of not requiring any special hardware other
than an accumulator and a variable shifter, which are generally available in most
microcontrollers.
A multi-function word-serial CORDIC processor architecture could be realised using
a basic micro structure consisting of a two-port register file and a variable shifter combined
with an ALU, interconnected by several data paths as shown in Figure (2.1).
Figure 2.1: Generic Processor Architecture. (Blocks: ROMs holding the K_n's and alpha_i's,
register file, variable shifter, ALU with CC register and controlling micro-code; input
data buses x_i, y_i, z_i; result bus x_i+1, y_i+1, z_i+1.)
A generic controller could consist of microcode instructions for the ALU and register
file, and would execute an iterative algorithm. This structure is similar to that of a
microprocessor or DSP and allows many variations of the CORDIC algorithm, as the
order of operations and the expanded instruction set increase flexibility. This type of
structure illustrates that it would be possible to implement the CORDIC algorithm on
any microprocessor or DSP.
Optimising the generic processor structure for a word-serial CORDIC processor is
achieved by reducing the functionality to only those operations required by the CORDIC
algorithm. A possible word-serial architecture is shown in Figure (2.2) where the ALU now
contains three adders and dedicated registers. The microcode controller has been replaced
by faster Combinational Control Logic dedicated to the CORDIC operation sequence.
2.1.2 A Word-Parallel CORDIC Architecture
The word-parallel method expands a one-dimensional algorithm into a two-dimensional
problem and results in shorter computation times. Greater speeds of computation can be
obtained by pipelining between stages so that many partial results can be calculated in
parallel. A pipelined word-parallel architecture is shown in Figure (2.3), where each
iteration is represented by a separate CORDIC block and a latch is placed after each
iteration, or after several iterations.
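The pipelined word-parallel idea can be modelled in software as a list of cells, each with its own fixed shift and angle, separated by latches. The Python sketch below is an assumption-laden illustration rather than the report's design: it omits the i = -1 stage (so angles must stay within roughly ±99°), uses integer fixed point with `frac_bits` fractional bits, and the names `make_cell` and `run_pipeline` are hypothetical.

```python
import math

def make_cell(i, frac_bits):
    """Build CORDIC cell i with its shift and angle fixed in advance."""
    alpha_i = round(math.atan(2.0 ** -i) * (1 << frac_bits))  # fixed-point angle

    def cell(x, y, z):
        q = 1 if z >= 0 else -1  # q_i = sign[z_i]
        # '>>' on a Python int is an arithmetic shift, like a signed data bus.
        return x - q * (y >> i), y + q * (x >> i), z - q * alpha_i

    return cell

def run_pipeline(cells, inputs):
    """Clock a pipeline with one latch after every cell."""
    latches = [None] * len(cells)
    results = []
    # Extra empty clocks drain the pipeline after the last input.
    feed = list(inputs) + [None] * (len(cells) - 1)
    for item in feed:
        sources = [item] + latches[:-1]  # values present before the clock edge
        latches = [cell(*src) if src is not None else None
                   for cell, src in zip(cells, sources)]
        if latches[-1] is not None:
            results.append(latches[-1])
    return results

if __name__ == "__main__":
    n, frac_bits = 12, 14
    one = 1 << frac_bits
    cells = [make_cell(i, frac_bits) for i in range(n)]
    k_n = math.prod(1.0 / math.sqrt(1.0 + 4.0 ** -i) for i in range(n))
    angles = [10, 30, 60]
    inputs = [(round(k_n * one), 0, round(math.radians(a) * one)) for a in angles]
    for a, (x, y, z) in zip(angles, run_pipeline(cells, inputs)):
        print(a, x / one, y / one)
```

Once the pipeline is full, one result emerges per clock, at the cost of a latency equal to the number of cells.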
The following chapters will develop, implement, and simulate such a parallel CORDIC
structure using the VHDL hardware description language.
Figure 2.2: An Optimised Word-Serial CORDIC Architecture. (Combinational control logic
with Load, Precision, Reset and Clock inputs; n-bit registers for x, y and z; an m-bit
counter register; a look-up table of alpha_i's; a Finished flag.)
Figure 2.3: Word-Parallel CORDIC architecture with possible data pipelining. (Cells #0
to #n, each with q_i = sign[z_i], separated by clocked latches for pipelining of data.)
Chapter 3
Improving CORDIC Accuracy
Iterative algorithms calculate results by approximation, so the solution will contain
errors. CORDIC is no exception: errors are introduced by a combination of quantisation
and approximation. The accuracy of a CORDIC processor depends on the word length used
for the three input variables x, y and z, as well as the number of iterations or steps
performed. This chapter describes the errors associated with a fixed point implementation
and a means of reducing these errors.
3.1 Estimation of CORDIC Accuracy
The fundamental operation performed by a CORDIC processor is the shift-and-add process,
in which fixed point arithmetic will introduce errors. For example, consider the binary
scaling of the vector $v_i = (x_i, y_i)$ at the $i$th stage:
if $i \le m$ then $v_{i+1}$ is updated with the truncated value $v_i 2^{-i}$
if $i > m$ then $v_{i+1} = v_i$, and the update will be 0
where m is the internal bus width of v and limits the maximum number of useful iterations.
Peak accuracy could be achieved after m iterations since all accuracy has been exhausted
in v. However, truncation errors may exceed the accuracy gained by more iterations,
and it is desirable to find the optimal number of iterations.
The accuracy of the rotation will be determined by how closely the input rotation
angle is approximated by the summation of sub-rotation angles $\alpha_i$. The error in v after
n iterations will be proportional to the error in z. An increase in the z datapath width
will increase the accuracy of the z update and hence the v update.
The numerical accuracy of the CORDIC algorithm can be calculated by examining
truncation and approximation errors. Truncation errors are due to the finite word
length and approximation errors are due to the finite number of iterations. Walther [10]
analysed the x and y iterations independently of the z iterations and concluded that
$\log_2 n$ extra bits in the data paths can provide n bits of accuracy. This work was
re-calculated by Kota and Cavallaro [11] in a non-independent manner, who concluded that
$\log_2 n + 2$ extra bits are required to achieve n bits of accuracy after n iterations.
This solution represents an upper bound of error in the CORDIC processor. A graph
of this function appears in Figure (3.1), from which it can be seen that to achieve 8 or 16
bit accuracy, the internal datapaths need to be 13 and 22 bits wide respectively
(8 + 3 + 2 and 16 + 4 + 2).
Figure 3.1: Numerical accuracy of the CORDIC processor. (Datapath resolution vs output
resolution: output resolution of n bits with n iterations, plotted against internal
datapath width n + log(n) + 2.)
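The datapath width bound plotted in Figure (3.1) can be evaluated directly. The short Python sketch below assumes a base-2 logarithm, which reproduces the 13 and 22 bit figures quoted for 8 and 16 bit accuracy; rounding the logarithm up for values of n that are not powers of two is an added assumption, and `datapath_width` is an illustrative name.

```python
import math

def datapath_width(n):
    """Internal width n + log2(n) + 2 for n-bit accuracy after n iterations."""
    # ceil() is an assumption for n that is not a power of two.
    return n + math.ceil(math.log2(n)) + 2

if __name__ == "__main__":
    for n in (8, 12, 16):
        print(n, datapath_width(n))
```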
3.2 The Lower Bound of CORDIC Accuracy
A CORDIC processor can be presented with all possible input combinations to find the
lower bound of error. Simulation results are shown in Figure (3.2), where a 12 bit CORDIC
processor with a variable number of stages is presented with all possible rotation angles
in the range $-\pi \le z_{-1} \le \pi$ and the resulting accuracy in bits is calculated. Kota and
Cavallaro's upper bound of error (as defined by their maximum error equation in Appendix (B))
is also shown in Figure (3.2). The upper bound of error has a well defined peak of
accuracy; however, the simulation results indicate that accuracy will improve if more
iterations are performed.
Figure 3.2: Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal
datapath. (Output accuracy against number of stages n. Solid: predicted accuracy,
dashed: actual accuracy.)
Figure (3.3) illustrates, by simulation, the accuracy of a 12 bit, 12 stage processor and
the resulting bits of error produced. About 0.3% of results have more than 2 bits of error,
which indicates that the error bound of a CORDIC processor is positioned between the
upper and lower bounds of error.
Figure 3.3: A plot showing bits of error for a typical test vector rotated through all
possible angles. (Polar plot of bits of error against rotation angle.)
The simulation results indicate that n + log n + 2 is an over-estimation of the data path
width required, and a reduction in datapath width is possible if the number of iterations
is increased. Simulation results of two 8 stage CORDIC processors with 12 bit and 8 bit
datapaths are shown for comparison in Figure (3.4) and Figure (3.5) respectively. The
simulation results were obtained by varying the magnitude of v and the angle $\phi$ in
uniform steps. The difference in resolution obtained is two bits, indicating that the lower
bound of error is closer to the error bound of CORDIC.
3.3 Reducing the z update error
In the rotational mode of CORDIC, $z$ converges towards zero by adding/subtracting
sub-rotation angles, and the final iterations of the $z_i$ update will result in numbers
approaching zero. More precisely, the angular error $z_i$ is approximately equal to $2^{-i}$;
thus, for a bus width m, only $(m - i)$ bits are used to represent the error.
To reduce the $z_i$ error a floating point system could be used, but it has complex
hardware implementations not suited to word-parallel structures. A simpler method to
improve accuracy, i.e., to utilise all m bits, is a quasi-floating point or normalisation
scheme, which could be implemented by scaling the existing sequence by $2^i$, i.e.,
$$\hat{z}_i = 2^i z_i$$
Therefore, the new sequence becomes
$$\hat{z}_{i+1} = 2^{i+1} z_{i+1} = 2 \cdot 2^i (z_i - q_i \alpha_i) = 2\,(2^i z_i - q_i 2^i \alpha_i) = 2\,(\hat{z}_i - q_i \hat{\alpha}_i) \tag{3.1}$$
which requires a shift left at each iteration, and requires no extra hardware for a
word-parallel structure. A new sequence of sub-rotation angles can be defined as:
$$\hat{\alpha}_i = 2^i \alpha_i = 2^i \arctan(2^{-i}) \tag{3.2}$$
where $\hat{\alpha}_i$ approaches a finite value of 1 for increasing values of i, and will utilise
most of the bus width. Since the scaling system results in full use of the databus width,
overflow may occur if the bus width is too small. Using Equation (3.1), the maximum value
$\hat{z}_{i+1}$ can have is when $\hat{z}_i$ approaches zero, giving
$$\max[\hat{z}_{i+1}] \approx 2 \max[\hat{\alpha}_i] \tag{3.3}$$
Calculating the increase in accuracy is beyond the scope of this report; however,
simulation indicates that there is a direct improvement in accuracy. The simulation
results indicate that, using the traditional scheme, the accuracy of the rotation is
$$\text{accuracy} \propto \log(z_i \text{ datapath width}) + \log(\text{number of stages}) \tag{3.4}$$
whereas the normalisation scheme has the advantage of
$$\text{accuracy} \propto \log(\text{number of stages}) \tag{3.5}$$
since the z datapath is always in a semi-normalised state.
Using the traditional scheme, $\alpha_i \to 0$, limiting the number of useful stages. However,
when normalised, there is no limit on the number of stages and a significant reduction in
hardware is possible by reducing the bus width of z.
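The equivalence between the traditional z update and the normalised sequence can be checked numerically. The Python sketch below is a floating-point illustration only, assuming the iteration starts at i = 0 (the ±90° stage is omitted); it does not model the fixed point bus whose utilisation motivates the scheme, and the names `z_plain` and `z_normalised` are illustrative.

```python
import math

def z_plain(phi, n):
    """Traditional residue: z_{i+1} = z_i - q_i * alpha_i."""
    z = phi
    for i in range(n):
        q = 1 if z >= 0 else -1
        z -= q * math.atan(2.0 ** -i)
    return z

def z_normalised(phi, n):
    """Normalised residue: z_hat_{i+1} = 2 * (z_hat_i - q_i * alpha_hat_i)."""
    z_hat = phi  # z_hat_0 = 2^0 * z_0
    for i in range(n):
        q = 1 if z_hat >= 0 else -1
        alpha_hat = 2.0 ** i * math.atan(2.0 ** -i)  # tends to 1 as i grows
        z_hat = 2.0 * (z_hat - q * alpha_hat)
    return z_hat  # equals 2^n * z_n

if __name__ == "__main__":
    n = 10
    print(z_plain(0.5, n), z_normalised(0.5, n) / 2 ** n)
```

The plain residue shrinks towards zero with each stage, whereas the normalised residue stays of order one, which is what allows all bits of the z bus to carry information.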
Figure (3.6) illustrates the dependence of the error on the number of stages and bits for
the scaled and unscaled CORDIC processors. Figure (3.6(a)) and Figure (3.6(b)) show
the angular expansion error. Figure (3.6(c)) and Figure (3.6(d)) show the dependence of
the v error on the angular expansion error.
Figure 3.6: Simulation results from a CORDIC processor illustrating the effects of the
normalisation scheme. (Four surface plots, without and with alpha scaling: angle
expansion error against stages and bits, and relative v error against stages/bits in v
and bits in z.)
3.4 Unexpected Truncation Errors
Using fixed point arithmetic in a CORDIC processor will introduce an unexpected
truncation error. The error occurs when the vector (x, y) has a negative component. Consider
the final iterations, where the update of vector v is expected to approach 0 since a larger
number of right shifts is performed at each iteration. This is not the case if x or y is
negative.
For example, let $x_{i \to N}$ equal the hex number X"2D", or positive 45. The right shifted
value of $x_{i \to N}$ approaches zero. However, the negative of X"2D" in two's complement form
is X"D3", and its right shifted value approaches X"FF", or -1, not the expected zero.
This is a significant problem in the CORDIC processor, since the addition of extra
iterations will only increase the error. A simple method of removing this error is to
round the shifted value instead of truncating it. A simple way of rounding
values is to add the bit that was last shifted out to the shifted value.
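The X"2D" example and the proposed rounding can be reproduced with a few lines of Python. This sketch relies on the fact that Python's `>>` on a negative integer is an arithmetic (sign-extending) shift, which matches a two's complement data bus; the function names are illustrative.

```python
def shift_truncate(x, k):
    """Arithmetic right shift by k bits (truncates towards minus infinity)."""
    return x >> k

def shift_round(x, k):
    """Arithmetic right shift by k bits, then add the last bit shifted out."""
    if k == 0:
        return x
    last_bit_out = (x >> (k - 1)) & 1
    return (x >> k) + last_bit_out

if __name__ == "__main__":
    for k in range(0, 9):
        print(k, shift_truncate(45, k), shift_truncate(-45, k),
              shift_round(45, k), shift_round(-45, k))
```

With truncation the negative value settles at -1 however far it is shifted, whereas with rounding both the positive and negative values settle at zero.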
The rounder could be implemented using a half-adder and typically requires three
logic gates per bit. Minimal extra hardware is required in the word-serial
architecture; however, a word-parallel structure requires two half-adders per stage. The
additional delay will have a direct effect on the performance of the processor.
Figure (3.7) shows the simulation results of two CORDIC processors, with and without
rounding units. The test vector was rotated in steps of 5° through 360°, and the rounded
results are significantly more accurate. The rounding maintains monotonicity in the actual
angle of rotation as well as uniform magnitude.
Figure 3.7: An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.
Chapter 4
VHDL Implementation
Various tools can be used to implement the CORDIC processor; however, a standardised
approach to this problem would unify the solution for further development in various
applications. VHDL (VHSIC Hardware Description Language) has been used here
to describe the structural and behavioural characteristics of a Word-Parallel CORDIC
processor. VHDL has become the standard hardware description language and has its
own IEEE standard [12].
4.1 The Basic CORDIC Unit
Any CORDIC structure will involve a basic unit containing three adders/subtracters, as
shown in Figure (4.1). The binary scaler would be variable in the case of a Word-Serial
device; however, it is much simpler in the Word-Parallel device, as a shift translates
directly to a misalignment of the data bus.
Figure 4.1: The basic CORDIC unit. (Cell i with inputs y_i, x_i, z_i and alpha_i, and
outputs y_i+1, x_i+1, z_i+1.)
This unit, together with a suitable FSM and registers, could form a word-serial structure.
A word-parallel implementation can be obtained by linking n CORDIC units.
The rest of this chapter deals with the development of a Word-Parallel unit and the
interconnection of these devices using the VHDL language. It should be a relatively trivial
task, but unfortunately the Viewlogic VHDL Synthesiser has many bugs and supports
only a subset of the full VHDL standard.
The main aim of the project was to describe a CORDIC processor using the VHDL
language and to allow the application designer to change the size of the structure easily.
This flexibility could include fundamental changes such as variable datapath widths and a
variable number of stages. Other options such as rounding intermediate nodes and pipelining
could also be easily integrated.
Currently, Viewlogic's VHDL is a partial implementation of the 1987 IEEE Standard
VHDL, and many constructs are missing from their implementation. Most of the useful
constructs are present, but they are accompanied by ambiguous messages indicating that
the construct is only partially supported. This made it very difficult to work with.
4.2 VHDL Describes Structure and Behaviour
VHDL has the ability to describe a design in two ways:
in terms of its component structure,
in terms of the behavioural functionality of the design,
with the further possibility of integrating the two streams. A requirement for structural
descriptions is that the lowest level description be behavioural, to ensure
portability between different synthesis libraries. An example of a lowest level operator is
the logical operator AND (behavioural), used to describe the ANDing of two operands.
This may be synthesised as an AND standard cell from the library. A component therefore
cannot be accessed directly from a cell library, since doing so would limit portability.
Consider a slightly more complex design, an n-bit adder/subtracter, which could be
described by the following behavioural description:
addsub : PROCESS(a,b,sel)
VARIABLE res : VLBIT_VECTOR(n DOWNTO 0);
BEGIN
res := zero(n DOWNTO 0); -- needs to be initialised
IF sel = '1' THEN
res := add2c(a,b);
ELSE
res := sub2c(a,b);
END IF;
s <= res(n-1 downto 0); -- discard cout
END PROCESS;
The process activates when one of the signals in the sensitivity list changes, and
then produces a result in the internal variable res. The signal s is assigned the lower
portion of the sum, discarding the carry out. Now consider a structural description of the
same adder/subtracter where several components are used:
c(0) <= sel; -- carry in
connect: FOR i IN 0 TO n-1 GENERATE
  invert: invf101 PORT MAP( b(i), b_bar(i) );
  mux_b_b_bar: muxf201 PORT MAP( b_bar(i), b(i), sel, b_hat(i) );
  addsub: faf001 PORT MAP( a(i), b_hat(i), c(i), s(i), c(i+1) );
END GENERATE;
Note that the muxf201 component is used to select between the non-inverted and
inverted signals of the b bus. The components are user defined entities describing the
appropriate logic gates. For example, a fragment of the faf001 component contains the
following lowest level behavioural description:
SUM <= A1 xor B1 xor CIN2;
CO <= (A1 and B1) or (A1 and CIN2) or (B1 and CIN2);
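The structural description can be checked with a bit-level software model. The Python sketch below reuses the SUM and CO equations of the full adder; the polarity of sel (0 selects a + b, 1 selects a - b by inverting b and setting the carry in) is an assumption, since the behaviour of muxf201 is not spelled out, and the XOR is used as an equivalent of the inverter plus mux. The names `full_adder` and `add_sub` are illustrative.

```python
def full_adder(a, b, cin):
    """One-bit full adder using the SUM and CO equations of faf001."""
    s = a ^ b ^ cin
    co = (a & b) | (a & cin) | (b & cin)
    return s, co

def add_sub(a, b, sel, n=4):
    """n-bit ripple-carry adder/subtracter on unsigned bit patterns.

    Assumed polarity: sel = 0 gives a + b, sel = 1 gives a - b (modulo 2^n).
    """
    carry = sel  # c(0) <= sel
    result = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        b_hat = b_i ^ sel  # equivalent to muxing between b(i) and its inverse
        s_i, carry = full_adder(a_i, b_hat, carry)
        result |= s_i << i
    return result  # the final carry out is discarded

if __name__ == "__main__":
    print(add_sub(5, 3, 0), add_sub(5, 3, 1), add_sub(3, 5, 1))
```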
It is not immediately obvious which way a designer should describe a particular design;
however, the next section presents the results of the synthesiser, on which a decision may
be based. In general, the easier it is for a designer to write a design in VHDL, the
more optimisation the synthesiser needs to perform.
4.2.1 Hierarchical vs Flat Designs
One of the very useful features of Viewlogic's VHDL Synthesiser [13] is the ability to
create either a hierarchical (top-down) or a flat (bottom-up) design. A hierarchical design
allows the engineer to see lower level interconnections between design units, unlike the
flat design where no (or little) hierarchy can be seen. This allows easier debugging of
designs; however, it has the disadvantage of being less efficient than a flat design, which
combines all the design elements together into one circuit and then performs optimisation.
Figure (4.2) illustrates the previous structural design of the Adder/Subtracter, where
it can be observed that the schematic consists of higher level components than standard
library cells. This feature of Viewlogic VHDL enables easy debugging of high level
components when compared to a flat design. It is relatively simple to navigate between
levels in a design.
Most libraries contain standard cells for full adders, muxes and inverters but, since
VHDL does not allow direct access to library cells, these components had to be described
behaviourally. A mux simply maps to an IF statement; however, no behavioural description
will map to the full adder cell, so the description stated previously must be used.
Compiling the same design using the flat (bottom-up) design approach, the synthesiser
produces the following statistics when using, for example, the X2000 library. The schematic
generated by the synthesiser is shown in Figure (4.3).
*********************************************
Gate Usage Summary
*********************************************
Cell          Count  Area/Cell    Cell          Count  Area/Cell
----------------------------------------------------------------------------
X2000:NAND2      15       0.25    X2000:OR2         3       0.25
X2000:XOR2       15       0.25
----------------------------------------------------------------------------
Total Cells : 33                  Total Area : 8.25
*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 14       Total number of nets = 42
Figure 4.2: A Hierarchical Design of the Adder/Subtracter for n = 4.
Figure 4.3: A Flat Design of the Adder/Subtracter for n = 4.
Reconsidering the behavioural description of the Adder/Subtracter and synthesising
the design, the following statistics are generated; the corresponding schematic is shown
in Figure (4.4).
*********************************************
Gate Usage Summary
*********************************************
Cell          Count  Area/Cell    Cell          Count  Area/Cell
----------------------------------------------------------------------------
X2000:AND2       21       0.25    X2000:AND3        1       0.50
X2000:INV        11       0.00    X2000:NAND2       8       0.25
X2000:OR2        17       0.25    X2000:XOR2        3       0.25
----------------------------------------------------------------------------
Total Cells : 61                  Total Area : 12.75
*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 11       Total number of nets = 70
Figure 4.4: A Behavioural Design of the Adder/Subtracter for n = 4.
From the statistics of each design, it is important to note that the total area and the
maximum level of gates differ. The structural description produces a small but slow
design (area 8.25, 14 gate levels) when compared to the behavioural description, which
produces a fast but large design (area 12.75, 11 gate levels).
A characteristic of the synthesiser is that it maps a behavioural description to a
structure by representing each output in terms of its inputs, much like a lookup table, and
removes any structure. The synthesiser performs logic level optimisation on the
structural description, thus producing a design with less logic.
4.2.2 The Viewlogic Synthesiser
The Viewlogic Synthesiser has the ability to alter the emphasis on speed or area when
optimising a design. The statistics generated in the previous section were area optimised,
and neglected the effect of gate delays. For example, optimising the behavioural design for
speed, the synthesiser generates 14 more gates than before; however, the maximum level of
gates decreases from 11 to 9:
*********************************************
Gate Usage Summary
*********************************************
Cell Count Area/Cell Cell Count Area/Cell
----------------------------------------------------------------------------
X2000:AND2 10 0.25 X2000:AND3 2 0.50
X2000:AND4 1 0.75 X2000:INV 15 0.00
X2000:NAND2 17 0.25 X2000:NAND3 1 0.50
X2000:NAND4 1 0.75 X2000:NOR3 2 0.50
X2000:NOR4 2 0.75 X2000:OR2 22 0.25
X2000:OR4 1 0.75 X2000:XOR2 1 0.25
----------------------------------------------------------------------------
Total Cells : 75 Total Area : 18.75
*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 9 Total number of nets = 84
The synthesiser can optimise small designs, but when the design grows large, the
memory and processing power required to optimise it is considerable. The
design of the CORDIC unit contains three adders/subtracters and takes several minutes
to compile and optimise. However, when this unit is integrated into a larger design of
several units, the compiler has many problems and eventually crashes after
half an hour of compilation.
A way around this optimisation problem is to use a hierarchical flow and
describe the components using behavioural or structural descriptions. Using this method
the compiler knows nothing about large components and cannot perform any global
optimisation. This is not a fully optimised solution, but it is currently the best solution.
It is, however, possible to flatten the design below the top level, making the design
slightly more efficient.
4.3 VHDL Design of the CORDIC Unit
The first stage in the design of a CORDIC processor is to create the CORDIC unit, for
which two approaches can be taken: a behavioural description or a structural description.
Firstly, consider the following behavioural description, in which the shifting of the values
(x_i, y_i) is done external to the CORDIC unit, in the top level design. This approach is
optimal, since it only requires a misalignment of the data buses in the top level
interconnections.
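The bus misalignment can be sketched as follows. This is a minimal illustration written in standard IEEE VHDL rather than Viewlogic's vlbit types; the entity name fixed_shift and the generics n and shift are assumptions introduced here and do not appear in the report. Because the shift amount is a constant, the arithmetic right shift reduces to wiring plus replication of the sign bit.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical illustration (not from the report): an arithmetic right
-- shift by a constant amount expressed purely as bus wiring.
entity fixed_shift is
  generic (
    n     : positive := 8;  -- word width (assumed)
    shift : natural  := 3   -- constant shift amount, 0 <= shift < n
  );
  port (
    x  : in  std_logic_vector(n-1 downto 0);
    xs : out std_logic_vector(n-1 downto 0)
  );
end entity fixed_shift;

architecture wiring of fixed_shift is
begin
  -- Lower bits: x moved down by 'shift' positions.
  xs(n-1-shift downto 0) <= x(n-1 downto shift);

  -- Upper bits: copies of the sign bit (two's complement sign extension).
  -- For shift = 0 the range is null and nothing is generated.
  sign_ext : for j in n-shift to n-1 generate
    xs(j) <= x(n-1);
  end generate sign_ext;
end architecture wiring;
```

No gates are implied by this description, only connections, which is why performing the shift at the top level costs nothing in area.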
If the shifting were instead contained inside the CORDIC unit, each unit would require
a variable shifter, which could not be optimised using the current version of Viewlogic
VHDL for the reasons discussed previously. Another reason why shifting is done external
to the CORDIC unit is that the loop variable of a generate statement cannot be passed to
any user defined function, procedure or entity. This is not stated in the manual, and the
problem took many days to identify.
The behavioural description is as follows:
ARCHITECTURE behaviour OF adder IS
begin
  cell_i : process (xi,xs,yi,ys,zi,ai)
    VARIABLE x_res: vlbit_vector(n downto 0); -- temporary results
    VARIABLE y_res: vlbit_vector(n downto 0);
    VARIABLE z_res: vlbit_vector(k downto 0);
  begin
    x_res := zero(n downto 0); -- initialise, unless comp complains
    y_res := zero(n downto 0);
    z_res := zero(k downto 0);
    if zi(k-1) = '0' then -- z_i is positive
      x_res := add2c (xi, ys);
      y_res := sub2c (yi, xs);
      z_res := sub2c (zi, ai);
    else -- z_i is negative
      x_res := sub2c (xi, ys);
      y_res := add2c (yi, xs);
      z_res := add2c (zi, ai);
    end if;
    xip1 <= x_res (n-1 downto 0); -- drop the extra result bit
    yip1 <= y_res (n-1 downto 0);
    zip1 <= z_res (e-1 downto 0);
  end process;
END behaviour;
The synthesiser generates the following statistics for an 8 bit version of the code. The
maximum level of gates is 20, since each bit requires 2 levels, plus additional gates for the
multiplexer and inversion.
*********************************************
Gate Usage Summary
*********************************************
Cell Count Area/Cell Cell Count Area/Cell
----------------------------------------------------------------------------
X2000:AND2 159 0.25 X2000:AND3 3 0.50
X2000:INV 69 0.00 X2000:NAND2 76 0.25
X2000:OR2 125 0.25 X2000:XOR2 7 0.25
----------------------------------------------------------------------------
Total Cells : 439 Total Area : 93.25
*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 20 Total number of nets = 487
The structural description of the CORDIC unit is slightly more complex and is
best represented pictorially, as shown in Figure (4.5). Each box in the figure represents
a different VHDL entity (component), and some components are used more than once.
The design is very bulky, and it is easy to make mistakes.
[Figure 4.5 (diagram not reproduced): adders.vhd contains three adder/subtracter blocks,
each built from an inverter (inv101.vhd), a 2-to-1 multiplexer (muxf201.vhd) and a full
adder (faf001.vhd). The z block (addsub_e.vhd) takes zi and ai and produces zip1; the
x block (addsub_n.vhd) takes xi and ys and produces xip1; the y block (addsub_n.vhd)
takes yi and xs and produces yip1.]
Figure 4.5: The structure of CORDIC unit showing the various entities.
It achieves the same functionality as the behavioural description but requires a lot more
effort to make sure all the connections are correct. As stated previously, the structural
design minimises area but results in a slower design, as reflected by the following
synthesiser statistics.
*********************************************
Gate Usage Summary
*********************************************
Cell Count Area/Cell Cell Count Area/Cell
----------------------------------------------------------------------------
X2000:INV 3 0.00 X2000:NAND2 139 0.25
X2000:OR2 41 0.25 X2000:XOR2 75 0.25
----------------------------------------------------------------------------
Total Cells : 258 Total Area : 63.75
*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 31 Total number of nets = 306
Using the structural design saves about 30% in area (63.75 against 93.25) but adds about
50% to the maximum level of gates (31 against 20), so it will run correspondingly slower.
In an FPGA implementation speed may be more desirable than area optimisation, since
these devices operate relatively slowly compared with a custom VLSI device. The extra
gates required by the faster behavioural design will be a relatively small concern.
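As a check on these percentages, the ratios can be worked directly from the two sets of synthesiser statistics for the 8 bit CORDIC unit (behavioural: area 93.25, 20 levels; structural: area 63.75, 31 levels). Using the maximum level of gates as a proxy for delay is an assumption made here, since the report gives no timing figures.

```latex
\[
\frac{93.25 - 63.75}{93.25} \approx 0.32
\qquad \text{(area saved by the structural design)}
\]
\[
\frac{31}{20} = 1.55
\qquad \text{(maximum gate levels, structural relative to behavioural)}
\]
\[
\frac{93.25}{63.75} \approx 1.46
\qquad \text{(area of the behavioural design relative to the structural one)}
\]
```

The first two figures correspond to the rounded values of 30% and 50%. The third shows that the area penalty looks larger when measured from the structural design as the baseline.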
4.3.1 The Rounding Unit
The rounding unit is formed by the interconnection of n half adders or, in behavioural
terms, by the addition of the bit shifted out during the shifting process. Describing it
structurally involves using the inc001 component, which contains an AND and an XOR gate
to form a half adder. The interconnection of the inc001 components is:
c(0) <= cin; -- first carry
connect: for i in 0 to n-1 generate
addsub: inc001 port map( a(i), c(i), s(i), c(i+1) );
end generate;
Alternatively, a much simpler behavioural description can be created using the unsigned
addition routine addum. This avoids the sign extension performed by the add2c routine.
rounder : process (a,cin)
  VARIABLE res: vlbit_vector(n downto 0); -- temporary results
begin
  res := zero(n downto 0); -- initialise, unless comp complains
  res := addum(a,cin);     -- use addum rather than add2c: add2c would sign
                           -- extend the cin input, making it -1 not +1
  s <= res (n-1 downto 0);
end process;
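The sign extension problem can be illustrated with a sketch in standard IEEE numeric_std, which is an assumption here since the report uses Viewlogic's own addum and add2c routines. The entity name round_sketch and the variable inc are hypothetical. A one bit two's complement value of '1' represents -1, so the rounding bit has to be zero extended before it is added.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical rounding unit written against IEEE numeric_std
-- (not the Viewlogic addum/add2c routines used in the report).
entity round_sketch is
  generic (n : positive := 8);        -- word width (assumed, n >= 2)
  port (
    a   : in  signed(n-1 downto 0);   -- shifted operand
    cin : in  std_logic;              -- bit shifted out during the shift
    s   : out signed(n-1 downto 0)    -- rounded result
  );
end entity round_sketch;

architecture behaviour of round_sketch is
begin
  rounder : process (a, cin)
    variable inc : signed(n-1 downto 0);
  begin
    inc    := (others => '0');  -- zero extend cin, so inc is 0 or +1
    inc(0) := cin;
    s <= a + inc;               -- result wraps to n bits, carry out dropped
  end process rounder;
end architecture behaviour;
```

As in the report's version, the carry out of the most significant bit is discarded.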
4.4 Combining the CORDIC Units
Combining the CORDIC and rounding units involves writing the top level design of the
hierarchical solution. As before with structural descriptions, the generate statement is
used; it allows iterative or conditional generation of a portion of the description.
The first definition to be made in the top level file is that of the alpha_i constants; this
version implements the Alpha Normalisation Scheme. Next, the x, y, z intermediate signals
between CORDIC units are shifted by the appropriate amount. The function shift_all is
defined in another file, which contains the user defined functions. The shifting has to be
done here because it will not work inside the generate statement: concurrent procedure
calls only execute when a signal in their sensitivity list changes state, and a change in the
shift value is not recognisable inside the generate statement.
-- Scaled a_i * 2^i values are decimal 45 53 56 57 57 57 57 57
ai <= X"39_39_39_39_39_38_35_2D";
sh_x: xis <= shift_all(xi); -- shift intermediate signals
sh_y: yis <= shift_all(yi);
sh_z: zis <= shift_z(zi);
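The constants in the comment above can be reproduced by a short calculation, assuming the usual CORDIC angles alpha_i = arctan(2^-i) expressed in degrees and rounded to the nearest integer; the report does not state the rounding rule, so that part is an assumption.

```latex
\[
2^{i}\alpha_i = 2^{i}\arctan\!\left(2^{-i}\right)
\quad \text{(in degrees)}, \qquad
\lim_{i\to\infty} 2^{i}\arctan\!\left(2^{-i}\right) = \frac{180}{\pi} \approx 57.3
\]
\[
\begin{array}{l|cccccccc}
i & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \hline
2^{i}\alpha_i & 45.00 & 53.13 & 56.14 & 57.00 & 57.22 & 57.28 & 57.29 & 57.29 \\
\text{rounded} & 45 & 53 & 56 & 57 & 57 & 57 & 57 & 57 \\
\text{hex} & \texttt{2D} & \texttt{35} & \texttt{38} & \texttt{39} & \texttt{39} & \texttt{39} & \texttt{39} & \texttt{39}
\end{array}
\]
```

Read from right to left, the bytes of the hexadecimal literal assigned to ai match the bottom row, so the value for stage 0 occupies the least significant byte.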
It should be noted that the variables xis, yis, zis, xi, yi, and zi are large vectors
containing several smaller vectors. This system had to be used since Viewlogic's VHDL
cannot handle two-dimensional arrays of vlbit. The shifting of intermediate signals is
done by the following function:
FUNCTION shift_all (x : vlbit_vector (n*(k-1)-1 downto 0))
RETURN vlbit_vector IS
VARIABLE x_s : vlbit_vector(n*(k-1)-1 downto 0) := zero(n*(k-1)-1 downto 0);
BEGIN
x_s(1*n-1 downto 0) := shiftr2c(x( 1*n-1 downto 0 ),1); -- 2 stage
x_s(2*n-1 downto 1*n) := shiftr2c(x( 2*n-1 downto 1*n ),2); -- 3 stage
x_s(3*n-1 downto 2*n) := shiftr2c(x( 3*n-1 downto 2*n ),3); -- 4 stage
x_s(4*n-1 downto 3*n) := shiftr2c(x( 4*n-1 downto 3*n ),4); -- 5 stage
x_s(5*n-1 downto 4*n) := shiftr2c(x( 5*n-1 downto 4*n ),5); -- 6 stage
x_s(6*n-1 downto 5*n) := shiftr2c(x( 6*n-1 downto 5*n ),6); -- 7 stage
x_s(7*n-1 downto 6*n) := shiftr2c(x( 7*n-1 downto 6*n ),7); -- 8 stage
x_s(8*n-1 downto 7*n) := shiftr2c(x( 8*n-1 downto 7*n ),8); -- 9 stage
x_s(9*n-1 downto 8*n) := shiftr2c(x( 9*n-1 downto 8*n ),9); -- 10 stage
return x_s;
END shift_all;
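For comparison, the nine slice assignments can be folded into a loop when a tool supports IEEE numeric_std. The following is a sketch under that assumption; the package name shift_pkg and the constants n and m are introduced here for illustration, with m standing for the k-1 packed words of the original. Whether Viewlogic's compiler accepts this form is not established by the report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package (not from the report): loop form of shift_all.
package shift_pkg is
  constant n : positive := 8;  -- word width (assumed)
  constant m : positive := 9;  -- number of packed words (k-1 in the text)
  function shift_all (x : std_logic_vector(n*m-1 downto 0))
    return std_logic_vector;
end package shift_pkg;

package body shift_pkg is
  function shift_all (x : std_logic_vector(n*m-1 downto 0))
    return std_logic_vector is
    variable x_s : std_logic_vector(n*m-1 downto 0);
  begin
    -- Word j (counting from 1 at the least significant end) is shifted
    -- right arithmetically by j places, as in the hand-written version.
    for j in 1 to m loop
      x_s(j*n-1 downto (j-1)*n) :=
        std_logic_vector(shift_right(signed(x(j*n-1 downto (j-1)*n)), j));
    end loop;
    return x_s;
  end function shift_all;
end package body shift_pkg;
```

The call site is unchanged from the top level code in the report, for example xis <= shift_all(xi).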
Next comes the connection of the init component, which is used to expand the convergence
range of the CORDIC processor to -190° < z < 190°. The input signals x_in, y_in and
z_in are connected to a unit similar to the CORDIC unit, except that an extra bit is
appended to the alpha bus to account for the expanded convergence range.
initial: init port map(xi   => X"00",
                       xs   => x_in,
                       yi   => X"00",
                       ys   => y_in,
                       zi   => z_in,
                       ai   => B"0_0101_1010", -- add/sub 90 degrees
                       xip1 => xinit,          -- xinit = 0 +- yin
                       yip1 => yinit,          -- yinit = 0 -+ xin
                       zip1 => zinit );
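The effect of this initial stage can be written out explicitly. The case split below assumes that the init component uses the same add/subtract sense as the behavioural CORDIC unit (add to x and subtract from y and z when z is non-negative); the report describes it only as similar to that unit, so the signs are an inference from the port map comments.

```latex
\[
(x_{\mathrm{init}},\, y_{\mathrm{init}},\, z_{\mathrm{init}}) =
\begin{cases}
(+y_{\mathrm{in}},\; -x_{\mathrm{in}},\; z_{\mathrm{in}} - 90^\circ), & z_{\mathrm{in}} \ge 0,\\[2pt]
(-y_{\mathrm{in}},\; +x_{\mathrm{in}},\; z_{\mathrm{in}} + 90^\circ), & z_{\mathrm{in}} < 0,
\end{cases}
\]
\[
\sum_{i=0}^{\infty} \arctan\!\left(2^{-i}\right) \approx 99.88^\circ
\quad\Longrightarrow\quad
|z_{\mathrm{in}}| \lesssim 99.88^\circ + 90^\circ \approx 190^\circ .
\]
```

The second line shows where the quoted range comes from: the basic CORDIC iterations converge for angles up to roughly 99.9 degrees, and the fixed 90 degree pre-rotation extends this by a further 90 degrees on either side.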
The following code has been compressed to reduce detail; however, it can be seen that there
are three separate stages: initial connection, intermediate connections, and final connection.
This can be seen in Figure (4.6). (Not shown is the conditional generation of
components, e.g. selection of behavioural or structural components, rounding units, etc.)
connect: for i in 0 to k-1 generate -- k stages
  ls_unit: if i=0 generate
    first_unit: adder port map( ... );
  end generate ls_unit;
  i_unit: if i>0 and i<k-1 generate
    x_round: round port map ( ... );
    y_round: round port map ( ... );
    middle_units: adder port map( ... );
  end generate i_unit;
  ms_unit: if i=k-1 generate
    x_round_last: round port map ( ... );
    y_round_last: round port map ( ... );
    last_unit: adder port map( ... );
  end generate ms_unit;
end generate connect;
The contents of ... are similar to the port map of the init component.
4.4.1 A Solution
This represents a solution to the CORDIC problem. It is close to an optimised solution,
but compiler and language difficulties make a completely optimised solution impossible.
Within these constraints, the design has been optimised as far as possible.
There are many choices to be made about the design of the CORDIC unit, chiefly
whether it is to be area or speed efficient.