Faculty of Computing and Information Technology
   Department of Robotics and Digital Technology
                       Technical Report 94-9



       A VHDL Implementation of a CORDIC
           Arithmetic Processor Chip
                 Grant Hampson, Student Member, IEEE
                    Andrew Paplinski, Member, IEEE
                               October 10, 1994




Enquiries:-
     Technical Report Coordinator
     Robotics and Digital Technology
     Monash University
     Clayton VIC 3168
     Australia
     tr.coord@rdt.monash.edu.au                   +61 3 905 3402
Contents
Abstract and Keywords
Preface
1 The CORDIC Algorithm
2 CORDIC Hardware Implementations
  2.1 CORDIC Processor Architecture
      2.1.1 A Word-Serial CORDIC Architecture
      2.1.2 A Word-Parallel CORDIC Architecture
3 Improving CORDIC Accuracy
  3.1 Estimation of CORDIC Accuracy
  3.2 The Lower Bound of CORDIC Accuracy
  3.3 Reducing the z update error
  3.4 Unexpected Truncation Errors
4 VHDL Implementation
  4.1 The Basic CORDIC Unit
  4.2 VHDL Describes Structure and Behaviour
      4.2.1 Hierarchical vs Flat Designs
      4.2.2 The Viewlogic Synthesiser
  4.3 VHDL Design of the CORDIC Unit
      4.3.1 The Rounding Unit
  4.4 Combining the CORDIC Units
      4.4.1 A Solution
  4.5 Improvements
Conclusion
A CORDIC Functions
B Upper Bound of CORDIC Error
References

List of Tables
 1.1   Elementary angles $\alpha_i$
 1.2   Various values of $K_n$
 4.1   Some CORDIC hardware statistics.
 A.1   The six CORDIC modes.

List of Figures
 1.1   Rotation of a point in 2-D space.
 2.1   Generic Processor Architecture.
 2.2   An Optimised Word-Serial CORDIC Architecture.
 2.3   Word-Parallel CORDIC architecture with possible data pipelining.
 3.1   Numerical accuracy of the CORDIC processor.
 3.2   Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath.
 3.3   A plot showing bits of error for a typical test vector rotated through all possible angles.
 3.4   A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results.
 3.5   An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results.
 3.6   Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme.
 3.7   An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.
 4.1   The basic CORDIC unit.
 4.2   A Hierarchical Design of the Adder/Subtracter for n = 4.
 4.3   A Flat Design of the Adder/Subtracter for n = 4.
 4.4   A Behavioural Design of the Adder/Subtracter for n = 4.
 4.5   The structure of CORDIC unit showing the various entities.
 4.6   The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range
       and Rounding components.

Abstract
This report describes the fundamentals of the CORDIC (Co-ordinate Rotations Digital
Computer) algorithm and a possible implementation using the VHDL hardware description
language. The errors associated with a fixed point implementation of CORDIC are analysed,
together with methods for reducing them. One such method is a normalisation scheme which
reduces error and requires no extra hardware. Various CORDIC structures and possible VHDL
implementations are described in detail, including design and language issues. Finally a
parallel hardware implementation is described and simulated.
    CORDIC has many applications, some of which can be used in array imaging techniques.

Keywords
CORDIC, VHDL




Preface
CORDIC is an acronym for Coordinate Rotations Digital Computer and was derived by
Volder [1] in the late 1950's for the purpose of calculating trigonometric functions. Its
popularity came about nearly twenty years later when VLSI solutions became a reality.
    The original algorithm describes the rotation of a 2-D vector, which can be applied
in areas such as Digital Signal Processing [2] (Fourier Transforms, Digital Filters),
Computer Graphics [3] and Robotics [4].
    CORDIC processing offers high computational rates, making it attractive to applications
such as computer graphics where a combination of scaling and rotations is required
in real time. CORDIC is also attractive to Robotics since the fundamental operation is
coordinate transformation; however, it could also be used for more computationally
intensive processes such as motion planning and collision detection.
    Array Imaging typically involves complex signal processing which may require many
computationally intensive matrix operations. Increasing the complexity of the imaging
model places greater demands on accuracy. Solutions to such complex systems require
better, and hence more complex, algorithms. Most of these algorithms are based on matrix
factorisation (decomposition) techniques, of which Singular Value Decomposition (SVD)
is the most robust method. The SVD factorisation requires a two-sided transformation
which involves several trigonometric operations and rotations ideally suited to dedicated
VLSI hardware (CORDIC processing) for real time calculations. CORDIC has also been
applied to phase correction during dynamic range focusing when Digital Baseband
Demodulation [5] techniques are employed in Interpolation Beamforming [6]. A complex signal
is represented by its in-phase, I, and quadrature, Q, components, which are phase corrected
by rotating the complex signal.
    Haviland and Tuszynski designed and built a CORDIC processor [7] in 1980 which
used an iterative process to calculate circular, linear and hyperbolic functions. A more
recent implementation (1993) by Duprat and Muller [8] discusses the possibility of using
a redundant number system for the representation of a signed digit.
    This report is broken into four logical sections, namely, CORDIC Theory, Hardware
Implementations, Improving CORDIC Accuracy and finally a VHDL Implementation.




Chapter 1
The CORDIC Algorithm
Consider a 2-D vector $(x, y)$ represented by a point $v = x + jy$ in the complex plane. If
the vector is rotated by an angle $\theta$, the new co-ordinate vector is given by:
$$\tilde{v} = v\, e^{j\theta} \tag{1.1}$$
and shown in Figure (1.1).
[Figure: the point $v = x + jy$ and its rotated image $\tilde{v} = \tilde{x} + j\tilde{y}$ in the $x$-$y$ plane.]

                     Figure 1.1: Rotation of a point in 2-D space.
    The angle $\theta$ can be expanded into a set of elementary angles $\alpha_i$ with
pseudo-digits $q_i \in \{-1, +1\}$, and angle expansion error $z_n$, such that
$$\theta = \sum_{i=-1}^{n-1} q_i \alpha_i + z_n \tag{1.2}$$
and the sub-rotation angles $\alpha_i$ take on the following values:
$$\alpha_i = \begin{cases} \pi/2 & \text{for } i = -1 \\ \arctan(2^{-i}) & \text{for } i = 0, 1, \dots, n-1 \end{cases} \tag{1.3}$$
Note that $\alpha_i$ is approximately equal to but less than $2^{-i}$ and the resulting
angular expansion error is therefore $|z_n| < 2^{-(n-1)}$.
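
The expansion in Equation (1.2) can be illustrated with a short sketch. It assumes the pseudo-digits are chosen greedily from the sign of the remaining residue, which is the rule the algorithm adopts in Equation (1.8). The function names `elementary_angles` and `expand_angle` are hypothetical and are not part of the report.

```python
import math

def elementary_angles(n):
    """Return [alpha_{-1}, alpha_0, ..., alpha_{n-1}] in radians (Equation 1.3)."""
    return [math.pi / 2] + [math.atan(2.0 ** -i) for i in range(n)]

def expand_angle(theta, n):
    """Greedily choose pseudo-digits q_i in {-1, +1}; return (digits, residue z_n)."""
    z = theta
    digits = []
    for alpha in elementary_angles(n):
        q = 1 if z >= 0 else -1   # steer the residue towards zero
        digits.append(q)
        z -= q * alpha
    return digits, z

if __name__ == "__main__":
    digits, z_n = expand_angle(math.radians(30.0), n=12)
    print("pseudo-digits:", digits)
    print("residue z_n  :", z_n, " bound 2^-(n-1):", 2.0 ** -11)
```

For angles inside the representable range the residue returned by this sketch stays within the bound $|z_n| < 2^{-(n-1)}$ stated above.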

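Substitution of Equation (1.2) into Equation (1.1) gives:
$$\tilde{v} = v \prod_{i=-1}^{n-1} e^{j q_i \alpha_i}\, e^{j z_n}
            = v\,(j q_{-1}) \prod_{i=0}^{n-1} e^{j q_i \alpha_i}\, e^{j z_n} \tag{1.4}$$
and expanding $e^{j q_i \alpha_i}$,
$$e^{j q_i \alpha_i} = \cos q_i \alpha_i + j \sin q_i \alpha_i
                     = \cos q_i \alpha_i\,(1 + j \tan q_i \alpha_i)
                     = \cos \alpha_i\,(1 + j q_i 2^{-i})$$
where the last step uses $q_i = \pm 1$ and $\tan \alpha_i = 2^{-i}$. Finally
$$\tilde{v} = v \left( \prod_{i=0}^{n-1} \cos \alpha_i \right) (j q_{-1}) \left( \prod_{i=0}^{n-1} \left( 1 + j q_i 2^{-i} \right) \right) e^{j z_n} \tag{1.5}$$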
The range of rotation angles which can be represented by Equation (1.2) is $\pm\theta_{\max}$, where
$$\theta_{\max} = \sum_{i=-1}^{n-1} \alpha_i \approx 190^\circ \tag{1.6}$$
and some values of $\alpha_i$ are given in Table (1.1).
    If the expected range of rotation angles is $\pm 90^\circ$ then the initial rotation by
$\pm 90^\circ$, that is, $e^{j q_{-1} \pi/2} = j q_{-1}$, does not have to be performed and the
initial rotation is by $\pm 45^\circ$.
    The product of cosines in Equation (1.5) is a constant scaling factor. For a given value
of $n$ it can be pre-evaluated using Equation (1.7), and values are tabulated in Table (1.2).
$$K_n = \prod_{i=0}^{n-1} \cos \alpha_i = \prod_{i=0}^{n-1} \left( 1 + 2^{-2i} \right)^{-1/2} = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + \frac{1}{4^i}}} \tag{1.7}$$
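
As a check on Equation (1.7), the scaling factor can be evaluated numerically. The sketch below is illustrative only; the helper name `scale_factor` is hypothetical. It takes the last index of the product as its argument, because the rows of Table (1.2) appear to correspond to the product taken over $i = 0, \dots, n$ inclusive (the row $n = 0$ equals $\cos 45^\circ$).

```python
import math

def scale_factor(last_index):
    """Product of cos(arctan(2^-i)) for i = 0 .. last_index inclusive."""
    k = 1.0
    for i in range(last_index + 1):
        # cos(arctan(t)) = 1 / sqrt(1 + t^2) with t = 2^-i, so t^2 = 4^-i
        k *= 1.0 / math.sqrt(1.0 + 4.0 ** -i)
    return k

if __name__ == "__main__":
    for n in range(16):
        print(f"{n:2d}  {scale_factor(n):.14f}")
```

The product converges quickly, so a single constant can serve for all but the shortest iteration counts.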
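    The basic CORDIC algorithm which describes rotation of a unity length vector
$v = x + jy$ by an angle $\theta$ can be derived from Equation (1.5) using the initial
conditions, where $z_i$ is the accumulated angular residue:
$$v_{-1} = v\, K_n$$
$$z_{-1} = \theta$$
And, proceeding with $i = -1, 0, \dots, n-1$:
$$q_i = \begin{cases} -1 & \text{if } z_i < 0 \\ +1 & \text{if } z_i \ge 0 \end{cases} \tag{1.8}$$
$$v_{i+1} = \begin{cases} v_i\, j q_i & \text{if } i = -1 \\ v_i \left( 1 + j q_i 2^{-i} \right) & \text{if } i = 0, 1, \dots, n-1 \end{cases} \tag{1.9}$$
$$z_{i+1} = z_i - q_i \alpha_i \tag{1.10}$$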
 i    Angle            Angle (degrees)   16-bit binaries
 0    arctan(2^0)         45.0000        B400 = 101101.0000000000
 1    arctan(2^-1)        26.5651        6A43 = 011010.1001000011
 2    arctan(2^-2)        14.0362        3825 = 001110.0000100101
 3    arctan(2^-3)         7.1250        1C80 = 000111.0010000000
 4    arctan(2^-4)         3.5763        0E40 = 000011.1001000000
 5    arctan(2^-5)         1.7899        0729 = 000001.1100101001
 6    arctan(2^-6)         0.8952        0395 = 000000.1110010101
 7    arctan(2^-7)         0.4476        01CA = 000000.0111001010
 8    arctan(2^-8)         0.2238        00E5 = 000000.0011100101
 9    arctan(2^-9)         0.1119        0073 = 000000.0001110011
10    arctan(2^-10)        0.0560        0039 = 000000.0000111001
11    arctan(2^-11)        0.0280        001D = 000000.0000011101
12    arctan(2^-12)        0.0140        000E = 000000.0000001110
13    arctan(2^-13)        0.0070        0007 = 000000.0000000111
14    arctan(2^-14)        0.0035        0004 = 000000.0000000100
15    arctan(2^-15)        0.0017        0002 = 000000.0000000010
16    arctan(2^-16)        0.0008        0001 = 000000.0000000001

                Table 1.1: Elementary angles $\alpha_i$



                      n           Kn
                       0   0.70710678118655
                       1   0.63245553203368
                       2   0.61357199107790
                       3   0.60883391251775
                       4   0.60764825625617
                       5   0.60735177014130
                       6   0.60727764409353
                       7   0.60725911229889
                       8   0.60725447933256
                       9   0.60725332108988
                      10   0.60725303152913
                      11   0.60725295913894
                      12   0.60725294104140
                      13   0.60725293651701
                      14   0.60725293538591
                      15   0.60725293510314

                 Table 1.2: Various values of $K_n$ (each row is the product over $i = 0, \dots, n$)

The final rotated vector is $v_n$, with angle expansion error $z_n$:
$$v_n = \tilde{v}\, e^{-j z_n} = v\, e^{j\theta} e^{-j z_n} \tag{1.11}$$
$$z_n = \theta - \sum_{i=-1}^{n-1} q_i \alpha_i \tag{1.12}$$
One complex operation on $v_i$ is equivalent to two operations on real numbers. For $i = -1$:
$$x_0 + j y_0 = j q_{-1} \left( x_{-1} + j y_{-1} \right)$$
Hence
$$x_0 = -q_{-1}\, y_{-1} \tag{1.13}$$
$$y_0 = q_{-1}\, x_{-1} \tag{1.14}$$
For $i = 0, 1, \dots, n-1$:
$$x_{i+1} + j y_{i+1} = \left( x_i + j y_i \right) \left( 1 + j q_i 2^{-i} \right)$$
Hence
$$x_{i+1} = x_i - q_i\, y_i\, 2^{-i} \tag{1.15}$$
$$y_{i+1} = y_i + q_i\, x_i\, 2^{-i} \tag{1.16}$$
    The CORDIC algorithm reduces to an iterative set of operations consisting of a binary
shift and an accumulator for each of $x$, $y$ and $z$.
    Refer to Appendix A for a list of transcendental functions.
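
The rotation recurrences in Equations (1.8) to (1.16) can be sketched in floating point as follows. This is a minimal illustration rather than the hardware described later in the report: the function name `cordic_rotate` is hypothetical, and the multiplications by $2^{-i}$ stand in for the binary shifts a fixed point implementation would use.

```python
import math

def cordic_rotate(x, y, theta, n):
    """Rotate (x, y) by theta radians using the initial 90 degree step and n micro-rotations."""
    # Pre-scale by K_n = prod_{i=0}^{n-1} cos(alpha_i)  (Equation 1.7).
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 4.0 ** -i)
    x, y, z = x * k, y * k, theta

    # i = -1: rotation by +/- 90 degrees (Equations 1.13 and 1.14).
    q = 1 if z >= 0 else -1
    x, y = -q * y, q * x
    z -= q * math.pi / 2

    # i = 0 .. n-1: micro-rotations (Equations 1.15, 1.16 and 1.10).
    for i in range(n):
        q = 1 if z >= 0 else -1                      # Equation 1.8
        x, y = x - q * y * 2.0 ** -i, y + q * x * 2.0 ** -i
        z -= q * math.atan(2.0 ** -i)
    return x, y, z

if __name__ == "__main__":
    theta = math.radians(30.0)
    x, y, z = cordic_rotate(1.0, 0.0, theta, 16)
    print(x, y, "residue:", z)
    print(math.cos(theta), math.sin(theta))
```

Rotating the unit vector $(1, 0)$ yields approximations of $\cos\theta$ and $\sin\theta$, with the returned residue corresponding to $z_n$ in Equation (1.12).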




Chapter 2
CORDIC Hardware
Implementations
A hardware implementation of a CORDIC processor depends on the number of functions
required and the computational speed. If all functions are to be computed, then
there will be a necessary overhead for selecting each function. However, a small, fast
design will result if only a small number of functions is required. This chapter presents
possible solutions to a mixture of design problems.

2.1 CORDIC Processor Architecture
A CORDIC processor can take one of two primary architectures, namely, word-serial or
word-parallel. A word-serial processor minimises hardware requirements by utilising a single
CORDIC unit repeatedly. However, iterative algorithms which are controlled by a small
number of variables can be expanded over a two-dimensional area, i.e., instead of executing
a certain set of instructions n times using a single element (e.g., a CORDIC unit), n
duplicated elementary cells are used in successive steps of an iteration [9]. This flattened
structure can now perform many operations in parallel and is called a word-parallel
CORDIC processor.
    A word-parallel architecture has the advantage of being up to n times faster, but due
to the expansion requires, at worst, n times more hardware. However, the word-serial
architecture requires complex controlling hardware and a variable shifter, decreasing the
hardware saving ratio.

2.1.1 A Word-Serial CORDIC Architecture
The CORDIC algorithm has the advantage of not requiring any special hardware other
than an accumulator and a variable shifter, which are generally available in most
microcontrollers.
    A multi-function word-serial CORDIC processor architecture could be realised using
a basic micro structure consisting of a two-port register file and a variable shifter combined
with an ALU, interconnected by several data paths as shown in Figure (2.1).

[Figure: ROMs holding the $K_n$ and $\alpha_i$ values, a register file, an ALU, a variable
shifter producing $2^{-i} y_i$ or $2^{-i} x_i$, a CC register and controlling micro-code, with
input data buses $x_i$, $y_i$, $z_i$ and a result bus $x_{i+1}$, $y_{i+1}$, $z_{i+1}$.]

                         Figure 2.1: Generic Processor Architecture.

    A generic controller could consist of microcode instructions for the ALU and register
file, and would execute an iterative algorithm. This structure is similar to that of a
microprocessor or DSP and allows many variations of the CORDIC algorithm, as the
order of operations and the expanded instruction set increase flexibility. This type of
structure illustrates that it would be possible to implement the CORDIC algorithm on
any microprocessor or DSP.
     Optimising the generic processor structure for a word-serial CORDIC processor is
achieved by reducing the functionality to only those operations required by the CORDIC
algorithm. A possible word-serial architecture is shown in Figure (2.2), where the ALU now
contains three adders and dedicated registers. The microcode controller has been replaced
by faster Combinational Control Logic dedicated to the CORDIC operation sequence.
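
The word-serial datapath can be mimicked in software with integer shifts and adds. The sketch below is an assumption-laden illustration, not the report's design: the register format (`FRAC` fractional bits), the helper `to_fixed` and the function `cordic_word_serial` are all hypothetical, and the initial $\pm 90^\circ$ step is omitted, so the input angle is assumed to lie within $\pm 90^\circ$.

```python
import math

FRAC = 16  # hypothetical number of fractional bits in the x, y and z registers

def to_fixed(value):
    """Quantise a real value to an integer with FRAC fractional bits."""
    return int(round(value * (1 << FRAC)))

def cordic_word_serial(x, y, theta, n):
    """Integer-only rotation loop: one pass through the adders per iteration."""
    # Look-up table of alpha_i, playing the role of the angle ROM.
    angle_lut = [to_fixed(math.atan(2.0 ** -i)) for i in range(n)]
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 4.0 ** -i)
    xi, yi, zi = to_fixed(x * k), to_fixed(y * k), to_fixed(theta)
    for i in range(n):                       # the iteration counter
        q = 1 if zi >= 0 else -1             # sign of z selects add or subtract
        # Python's >> on ints is an arithmetic shift, so low bits are truncated.
        xi, yi = xi - q * (yi >> i), yi + q * (xi >> i)
        zi -= q * angle_lut[i]
    scale = float(1 << FRAC)
    return xi / scale, yi / scale, zi / scale

if __name__ == "__main__":
    print(cordic_word_serial(1.0, 0.0, 0.5, 12))
    print(math.cos(0.5), math.sin(0.5))
```

Each loop pass corresponds to one use of the single CORDIC unit, with the shift amount taken from the iteration counter in the way the variable shifter would be driven.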

2.1.2 A Word-Parallel CORDIC Architecture
The word-parallel method expands the problem of a single dimensional algorithm into
a two-dimensional problem and results in shorter computational times. Greater speeds
of computation can be obtained by pipelining between stages so that many partial
results can be calculated in parallel. A pipelined word-parallel architecture is shown in
Figure (2.3), where each iteration is represented by a separate CORDIC block and a latch
is placed after each iteration, or after several iterations.
    The following chapters will develop, implement, and simulate such a parallel CORDIC
structure using the VHDL hardware description language.
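
The effect of placing a latch after every iteration can be sketched with a small clocked simulation. This is only a behavioural illustration in floating point, assuming one latch per stage and inputs that have already been scaled by $K_n$; the names `make_stage` and `run_pipeline` are hypothetical.

```python
import math

def make_stage(i):
    """Return the combinational function of cell #i (Equations 1.15, 1.16, 1.10)."""
    alpha = math.atan(2.0 ** -i)
    def stage(x, y, z):
        q = 1 if z >= 0 else -1
        return x - q * y * 2.0 ** -i, y + q * x * 2.0 ** -i, z - q * alpha
    return stage

def run_pipeline(inputs, n):
    """Clock a stream of (x, y, z) triples through n latched stages."""
    stages = [make_stage(i) for i in range(n)]
    latches = [None] * n               # latches[k] holds the output of cell #k
    outputs = []
    feed = list(inputs) + [None] * n   # extra clocks flush the pipeline
    for item in feed:                  # one loop pass models one clock edge
        if latches[-1] is not None:
            outputs.append(latches[-1])
        # Update from the last cell backwards so each cell sees the old latch value.
        for k in range(n - 1, 0, -1):
            prev = latches[k - 1]
            latches[k] = stages[k](*prev) if prev is not None else None
        latches[0] = stages[0](*item) if item is not None else None
    return outputs

if __name__ == "__main__":
    stream = [(0.6, 0.0, 0.3), (0.6, 0.0, -0.7), (0.0, 0.6, 1.0)]
    for result in run_pipeline(stream, 8):
        print(result)
```

After an initial latency of n clocks, one result leaves the pipeline on every clock, which is the speed advantage the word-parallel structure trades against its larger hardware cost.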




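[Figure: Load, Precision, Reset and Clock inputs feed Combinational Control Logic with a
next-state q-bit register and an m-bit iteration counter (Increment, Zero). Initial inputs
$x_0$, $y_0$, $z_0$ are selected into three adders, which form $x_{i+1}$, $y_{i+1}$ and
$z_{i+1}$ from $x_i$, $y_i$, $z_i$, the shifted terms $q_i y_i 2^{-i}$ and $q_i x_i 2^{-i}$,
and a look-up table of $\alpha_i$ values. Results are held in three n-bit registers, and a
Finished Flag is produced.]

                Figure 2.2: An Optimised Word-Serial CORDIC Architecture.
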
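[Figure: a chain of cells #0 to #n. Cell #i takes $y_i$, $x_i$, $z_i$ and the constant
$\alpha_i$, sets $q_i = \mathrm{sign}[z_i]$, and uses three adders with the shifted terms
$q_i x_i 2^{-i}$ and $q_i y_i 2^{-i}$ to produce $y_{i+1}$, $x_{i+1}$, $z_{i+1}$. Clocked
latches for pipelining of data sit between cells; the final outputs are $y_n$, $x_n$, $z_n$.]

Figure 2.3: Word-Parallel CORDIC architecture with possible data pipelining.
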
Chapter 3
Improving CORDIC Accuracy
As expected, iterative algorithms calculate results by approximation and the solution will
contain errors. CORDIC is no exception: errors are introduced by a combination of
quantisation and approximation. The accuracy of a CORDIC processor is dependent on the
word length used for the three input variables $x$, $y$ and $z$, as well as the number of
iterations or steps performed. This chapter describes the errors associated with a fixed
point implementation and a means of reducing these errors.

3.1 Estimation of CORDIC Accuracy
The fundamental operation performed by a CORDIC processor is the shift-and-add process,
in which fixed point arithmetic introduces errors. For example, consider the binary
scaling of the vector $v_i = (x_i, y_i)$ at the $i$th stage:

            if $i \le m$ then $v_{i+1}$ is updated with the truncated value $v_i 2^{-i}$
            if $i > m$ then $v_{i+1} = v_i$, and the update will be 0

where $m$ is the internal bus width of $v$ and limits the maximum number of useful iterations.
Peak accuracy could be achieved after $m$ iterations since all accuracy has been exhausted
in $v$. However, truncation errors may exceed the accuracy gained by more iterations,
and it is desirable to find the optimal number of iterations.
    The accuracy of the rotation will be determined by how closely the input rotation
angle was approximated by the summation of sub-rotation angles $\alpha_i$. The error in $v$
after $n$ iterations will be proportional to the error in $z$. An increase in the $z$ datapath
width will increase the accuracy of the $z$ update and hence the $v$ update.
    The numerical accuracy of the CORDIC algorithm can be calculated by the examination
of truncation and approximation errors. Truncation errors are due to the finite word
length and approximation errors are due to the finite number of iterations. Walther [10]
analysed the $x$ and $y$ iterations independently of the $z$ iterations and concluded that
$\log_2 n$ extra bits in the data paths can provide $n$ bits of accuracy. This work was
re-calculated by Kota and Cavallaro [11] in a non-independent manner, and they concluded
that $\log_2 n + 2$ extra bits are required to achieve $n$ bits of accuracy after $n$ iterations.
This solution represents an upper bound of error in the CORDIC processor. A graph
of this function appears in Figure (3.1), from which it can be seen that to achieve 8 or 16
bit accuracy, the internal datapaths need to be 13 and 22 bits respectively.
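
The datapath widths quoted above follow directly from the $n + \log n + 2$ rule. A small sketch makes the arithmetic explicit; it assumes a base 2 logarithm (which reproduces the 13 and 22 bit figures) and rounds up to a whole number of bits, the rounding being an assumption for values of $n$ that are not powers of two. The name `datapath_width` is hypothetical.

```python
import math

def datapath_width(n):
    """Internal width suggested by the n + log2(n) + 2 bound, in whole bits."""
    return n + math.ceil(math.log2(n)) + 2

if __name__ == "__main__":
    for n in (8, 12, 16, 24, 32):
        print(f"{n:2d} bit result -> {datapath_width(n)} bit datapath")
```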

[Plot: "Datapath resolution vs Output Resolution". Vertical axis: output resolution
($n$ bits with $n$ iterations), 0 to 32. Horizontal axis: internal datapath width
($n + \log(n) + 2$), 0 to 40.]

               Figure 3.1: Numerical accuracy of the CORDIC processor.


3.2 The Lower Bound of CORDIC Accuracy
A CORDIC processor can be presented with all possible input combinations to find the
lower bound of error. Simulation results are shown in Figure (3.2), where a 12 bit CORDIC
processor with a variable number of stages is presented with all possible rotation angles
$z_{-1}$ over its input range and the resulting accuracy in bits is calculated. Kota and
Cavallaro's upper bound of error (as defined by their maximum error equation in
Appendix (B)) is also shown in Figure (3.2). The upper bound of error has a well defined
peak of accuracy; however, the simulation results indicate that accuracy will improve if
more iterations are performed.
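
An exhaustive accuracy sweep of this kind can be approximated in software. The sketch below is not the simulation used for Figure (3.2): it assumes a hypothetical register format with `frac_bits` fractional bits, sweeps the angle over $\pm 90^\circ$ in uniform steps with a unit input vector, does not model register overflow, and reports accuracy as $-\log_2$ of the worst observed error. The names `fixed_cordic` and `worst_case_bits` are illustrative.

```python
import math

def fixed_cordic(x, y, theta, stages, frac_bits):
    """Fixed point rotation with truncating shifts; returns (x, y) as floats."""
    one = 1 << frac_bits
    k = 1.0
    for i in range(stages):
        k /= math.sqrt(1.0 + 4.0 ** -i)
    xi, yi, zi = round(x * k * one), round(y * k * one), round(theta * one)
    for i in range(stages):
        q = 1 if zi >= 0 else -1
        xi, yi = xi - q * (yi >> i), yi + q * (xi >> i)   # shifts truncate low bits
        zi -= q * round(math.atan(2.0 ** -i) * one)
    return xi / one, yi / one

def worst_case_bits(stages, frac_bits, steps=2000):
    """Accuracy in bits, taken from the largest error seen over the angle sweep."""
    worst = 0.0
    for s in range(steps + 1):
        theta = -math.pi / 2 + math.pi * s / steps
        x, y = fixed_cordic(1.0, 0.0, theta, stages, frac_bits)
        err = max(abs(x - math.cos(theta)), abs(y - math.sin(theta)))
        worst = max(worst, err)
    return float(frac_bits) if worst == 0.0 else -math.log2(worst)

if __name__ == "__main__":
    for stages in range(2, 13, 2):
        print(stages, "stages:", round(worst_case_bits(stages, 11), 2), "bits")
```

Running the sweep for an increasing number of stages gives a rough software counterpart to the dashed curve of measured accuracy.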

[Plot: "Solid: Predicted Accuracy, Dashed: Actual Accuracy". Vertical axis: output
accuracy, 0 to 12 bits. Horizontal axis: number of stages $n$, 0 to 12.]

Figure 3.2: Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal
datapath.

Figure (3.3) shows, by simulation, the bits of error produced by a 12 bit, 12 stage
processor. About 0.3% of results have more than 2 bits of error,
which indicates that the error bound of a CORDIC processor is positioned between the
upper and lower bounds of error.

[Polar plot: bits of error (radial axis, 1 to 3) against rotation angle (0 to 360 degrees).]

Figure 3.3: A plot showing bits of error for a typical test vector rotated through all
possible angles.

    The simulation results indicate that $n + \log_2 n + 2$ is an overestimate of the datapath
width required, and that a reduction in datapath width is possible if the number of iterations
is increased. Simulation results of two 8 stage CORDIC processors with 12 bit and 8 bit
datapaths are shown for comparison in Figure (3.4) and Figure (3.5) respectively. The
simulation results were obtained by varying the magnitude of $v$ and the angle $\theta$ in
uniform steps. The difference in resolution obtained is two bits, indicating that the lower
bound of error is closer to the error bound of CORDIC.

3.3 Reducing the z update error
In the rotational mode of CORDIC, the residual angle converges towards zero by
adding/subtracting sub-rotation angles, so the final iterations of the $z_i$ update produce
numbers approaching zero. More precisely, the angular error $z_i$ is approximately equal to
$2^{-i}$, thus for a bus width $m$, only $(m - i)$ bits are used to represent the error.
    To reduce the $z_i$ error a floating point system could be used, but its hardware
implementation is complex and not suited to word-parallel structures. A simpler method to

          [Polar plot: magnitude (radial scale up to 1.0) against angle in degrees]

Figure 3.4: A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results.


          [Polar plot: magnitude (radial scale up to 1.0) against angle in degrees]

Figure 3.5: An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results.

improve accuracy, i.e., to utilise all $m$ bits, is a quasi-floating point or normalisation
scheme, which could be implemented by scaling the existing sequence by $2^i$, i.e.,

$$\hat{z}_i = 2^i z_i$$

Therefore, the new sequence becomes

$$\begin{aligned}
\hat{z}_{i+1} &= 2^{i+1} z_{i+1} \\
              &= 2 \cdot 2^i (z_i - q_i \alpha_i) \\
              &= 2 (2^i z_i - q_i 2^i \alpha_i) \\
              &= 2 (\hat{z}_i - q_i \hat{\alpha}_i)
\end{aligned} \qquad (3.1)$$

which requires a shift left at each iteration, and requires no extra hardware for a
word-parallel structure. A new sequence of sub-rotation angles can be defined as:

$$\hat{\alpha}_i = 2^i \alpha_i = 2^i \tan^{-1}(2^{-i}) \qquad (3.2)$$

where $\hat{\alpha}_i$ approaches a finite value of 1 for increasing values of $i$, and will
utilise most of the bus width. Since the scaling system results in full use of the databus
width, overflow may occur if the bus width is too small. Using Equation (3.1),
$\hat{z}_{i+1}$ reaches its maximum value when $\hat{z}_i$ approaches zero, giving

$$\max[\hat{z}_{i+1}] \approx 2 \max[\hat{\alpha}_i] \qquad (3.3)$$
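As a rough illustration of Equation (3.1), one stage of the normalised z update might be described in standard VHDL as follows. This is only a sketch under stated assumptions: it uses the IEEE numeric_std package rather than the Viewlogic types used elsewhere in this report, and the entity name z_norm_stage, the ports z_hat, alpha_hat and z_hat_nxt, and the generic m are hypothetical names chosen for the example. The sign bit of z_hat stands in for $q_i$, and the factor of 2 is a one place left shift.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch of one normalised z stage (not taken from the report).
entity z_norm_stage is
  generic (
    m : positive := 12  -- assumed width of the z datapath
  );
  port (
    z_hat     : in  signed(m - 1 downto 0);  -- 2^i * z_i
    alpha_hat : in  signed(m - 1 downto 0);  -- 2^i * alpha_i, a constant per stage
    z_hat_nxt : out signed(m - 1 downto 0)   -- 2^(i+1) * z_(i+1)
  );
end entity z_norm_stage;

architecture rtl of z_norm_stage is
  signal diff : signed(m - 1 downto 0);
begin
  -- q_i = +1 when z_hat is non-negative (sign bit '0'), otherwise q_i = -1,
  -- so diff = z_hat - q_i * alpha_hat.
  diff <= z_hat - alpha_hat when z_hat(m - 1) = '0' else
          z_hat + alpha_hat;

  -- Multiply by 2: in a word-parallel structure this is only wiring.
  -- No overflow detection is attempted here; the fixed point format is
  -- assumed to leave headroom for values of about 2 * alpha_hat (Eq. 3.3).
  z_hat_nxt <= shift_left(diff, 1);
end architecture rtl;
```

Both inputs are assumed to share the same fixed point format. Because the most significant bit of diff is discarded by the shift, the choice of format decides whether the overflow mentioned above can occur.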
   Calculating the increase in accuracy is beyond the scope of this report; however,
simulation indicates that there is a direct improvement in accuracy. The simulation
results indicated that using the traditional scheme the accuracy of the rotation is

$$\text{accuracy} \propto \log(z_i \text{ datapath width}) + \log(\text{number of stages}) \qquad (3.4)$$

whereas the normalisation scheme has the advantage of

$$\text{accuracy} \propto \log(\text{number of stages}) \qquad (3.5)$$

since the z datapath is always in a semi-normalised state.
    Using the traditional scheme, $\alpha_i \to 0$, limiting the number of useful stages.
However, when normalised, there is no limit on the number of stages and a significant
reduction in hardware is possible by reducing the bus width of z.
    Figure (3.6) illustrates how the error depends on the number of stages and bits for
the scaled and unscaled CORDIC processors. Figure (3.6(a)) and Figure (3.6(b)) show
the angular expansion error. Figure (3.6(c)) and Figure (3.6(d)) show the dependence of
the v error on the angular expansion error.



[Surface plots, each with a "No alpha scaling" panel and an "Alpha scaling" panel:
 top row, angle expansion error (x 10^-3) against stages and bits;
 bottom row, relative v error against stages/bits in v and bits in z]

Figure 3.6: Simulation results from a CORDIC processor illustrating the effects of the
normalisation scheme.



3.4 Unexpected Truncation Errors
Using fixed point arithmetic in a CORDIC processor will introduce an unexpected
truncation error. The error occurs when the vector $(x, y)$ has a negative component.
Consider the final iterations, where the update of vector v approaches 0 since a larger
number of right shifts is performed at each iteration. However, this is not the case if x or
y is negative.
    For example, let $x_{i \to N}$ equal the hex number X"2D", or positive 45. The right
shifted value of $x_{i \to N}$ approaches zero. However, the negative of X"2D" in
twos-complement form is X"D3", and its right shifted value approaches X"FF", or $-1$,
not the expected zero.
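The example above can be checked with a small self-checking simulation. The sketch below is an illustration only: it assumes standard IEEE numeric_std arithmetic (whose shift_right replicates the sign bit for SIGNED operands) rather than the Viewlogic packages, and the entity name trunc_shift_demo is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical test bench: shows that truncating right shifts of a negative
-- twos-complement number settle at -1 rather than 0.
entity trunc_shift_demo is
end entity trunc_shift_demo;

architecture sim of trunc_shift_demo is
begin
  check : process
    constant pos : signed(7 downto 0) := to_signed(45, 8);   -- X"2D"
    constant neg : signed(7 downto 0) := to_signed(-45, 8);  -- X"D3"
  begin
    -- The positive value decays to zero as the shift count grows.
    assert shift_right(pos, 6) = to_signed(0, 8)
      report "45 shifted right 6 places should be 0" severity failure;

    -- Intermediate value for the negative input: X"FA" = -6.
    assert shift_right(neg, 3) = to_signed(-6, 8)
      report "-45 shifted right 3 places should be -6" severity failure;

    -- The negative value sticks at X"FF" = -1 for any further shifts.
    assert shift_right(neg, 6) = to_signed(-1, 8)
      report "-45 shifted right 6 places should be -1" severity failure;
    assert shift_right(neg, 7) = to_signed(-1, 8)
      report "-45 shifted right 7 places should be -1" severity failure;

    report "truncation demo finished" severity note;
    wait;  -- stop the process
  end process check;
end architecture sim;
```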
    This is a significant problem in the CORDIC processor, since the addition of extra
iterations will only increase the error. A simple method of removing this error would be to
round the shifted value instead of truncating it. A simple method for rounding values is
to add the bit that was last shifted out to the shifted value.
    The rounder could be implemented using a half-adder and typically requires three
logic gates per bit to implement. Minimal extra hardware is required in the word-serial
architecture; however, a word-parallel structure requires two half-adders per stage. The
additional delay will have a direct effect on the performance of the processor.
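The rounding rule described above (add the last bit shifted out to the shifted value) can be sketched as a pair of functions. This is a minimal illustration assuming IEEE numeric_std; the package name round_shift_pkg, the function names asr_trunc and asr_round, and the test bench are hypothetical and are not part of the design described in this report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package round_shift_pkg is
  -- Arithmetic right shift by k places with truncation.
  function asr_trunc(x : signed; k : natural) return signed;
  -- Same shift, then add the last bit shifted out.
  function asr_round(x : signed; k : natural) return signed;
end package round_shift_pkg;

package body round_shift_pkg is

  function asr_trunc(x : signed; k : natural) return signed is
  begin
    return shift_right(x, k);  -- sign bit is replicated for SIGNED
  end function asr_trunc;

  function asr_round(x : signed; k : natural) return signed is
    alias xv   : signed(x'length - 1 downto 0) is x;  -- normalised index range
    variable r : signed(x'length - 1 downto 0);
  begin
    r := shift_right(xv, k);
    if k >= 1 and k <= x'length then
      if xv(k - 1) = '1' then   -- last bit shifted out
        r := r + 1;             -- cannot overflow after a shift of >= 1
      end if;
    end if;
    return r;
  end function asr_round;

end package body round_shift_pkg;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.round_shift_pkg.all;

entity round_shift_tb is
end entity round_shift_tb;

architecture sim of round_shift_tb is
begin
  check : process
    constant pos : signed(7 downto 0) := to_signed(45, 8);   -- X"2D"
    constant neg : signed(7 downto 0) := to_signed(-45, 8);  -- X"D3"
  begin
    -- Truncation leaves the negative value at -1 ...
    assert asr_trunc(neg, 7) = to_signed(-1, 8)
      report "truncated shift of -45 should be -1" severity failure;
    -- ... whereas rounding brings both signs to the expected zero.
    assert asr_round(neg, 7) = to_signed(0, 8)
      report "rounded shift of -45 should be 0" severity failure;
    assert asr_round(pos, 7) = to_signed(0, 8)
      report "rounded shift of 45 should be 0" severity failure;
    wait;
  end process check;
end architecture sim;
```

The increment in asr_round corresponds to the chain of half-adders mentioned above. In a word-parallel structure the shift count k would be a constant for each stage, so only the incrementer costs hardware.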
    Figure (3.7) shows the simulation results of two CORDIC processors, with and without
rounding units. The test vector was rotated in steps of 5°, through 360°, and the rounded
results are significantly more accurate. The rounding maintains monotonicity in the actual
angle of rotation as well as uniform magnitude.
      [Two polar plots, (a) and (b): magnitude (radial scale 32.95) against angle in degrees]


Figure 3.7: An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.




Chapter 4
VHDL Implementation
Various tools can be used to implement the CORDIC processor; however, a standardised
approach to this problem would unify the solution for further development in various
applications. VHDL (VHSIC Hardware Description Language) has been used here
to describe the structural and behavioural characteristics of a Word-Parallel CORDIC
processor. VHDL has become the standard hardware description language and has its
own IEEE standard [12].

4.1 The Basic CORDIC Unit
Any CORDIC structure will involve a basic unit containing three adders/subtracters, as
shown in Figure (4.1). The binary scaler would be variable in the case of a Word-Serial
device; in the Word-Parallel device it is much simpler, as a shift translates directly
to a misalignment of the data bus.
          [Block diagram: Cell i with inputs $y_i$, $x_i$, $z_i$ and $\alpha_i$, and
           outputs $y_{i+1}$, $x_{i+1}$, $z_{i+1}$]
                          Figure 4.1: The basic CORDIC unit.
    This unit, together with a suitable FSM and registers, could form a word-serial
structure. A word-parallel implementation can be obtained by linking n CORDIC units.
    The rest of this chapter deals with the development of a Word-Parallel unit and the
interconnection of these devices using the VHDL language. This should be a relatively
trivial task, but unfortunately the Viewlogic VHDL Synthesiser has many bugs and
supports only a subset of the full VHDL standard.
    The main aim of the project was to describe a CORDIC processor using the VHDL
language and to allow the application designer to change the size of the structure easily.
This flexibility could include fundamental changes such as variable datapath widths and a
variable number of stages. Other options such as rounding intermediate nodes and
pipelining could also be easily integrated.
   Currently, Viewlogic's VHDL is a partial implementation of the 1987 IEEE Standard
VHDL, and many constructs are missing from their implementation. Most of the useful
constructs are present, but they produce ambiguous messages stating that the construct
is only partially supported. This made it very difficult to work with.

4.2 VHDL Describes Structure and Behaviour
VHDL has the ability to describe a design in two ways:
      - in terms of its component structure,
      - in terms of the behavioural functionality of the design,
with the possibility of integrating the two streams. A requirement for structural
descriptions is that the lowest level description is behavioural, to ensure
portability between different synthesis libraries. An example of a lowest level operator is
the logical operator AND (behavioural), which describes the ANDing of two operands.
This may be synthesised as an AND standard cell from the library. In this way a component
is never accessed directly from a cell library, which would limit portability.
   Consider a slightly more complex design of an n-bit adder/subtracter, which could be
described by the following behavioural description:
    addsub : PROCESS(a,b,sel)

       VARIABLE res : VLBIT_VECTOR(n DOWNTO 0);

       BEGIN
          res := zero(n DOWNTO 0);    -- needs to be initialised

          IF sel = '1' THEN
             res := add2c(a,b);
          ELSE
             res := sub2c(a,b);
          END IF;

          s <= res(n-1 downto 0);    -- discard cout
       END PROCESS;

   The process activates when one of the signals in the sensitivity list changes, and
then produces a result in the internal variable res. The signal s is assigned the lower
portion of the sum. Now consider a structural description of the same adder/subtracter
where several components are used:
    c(0) <= sel;   -- carry in

    connect: FOR i IN 0 TO n-1 GENERATE

      invert: invf101 PORT MAP( b(i), b_bar(i) );
      mux_b_b_bar: muxf201 PORT MAP( b_bar(i), b(i), sel, b_hat(i) );
      addsub: faf001 PORT MAP( a(i), b_hat(i), c(i), s(i), c(i+1) );

    END GENERATE;


    Note that the muxf201 component is used to select between the non-inverted and
inverted signals of the b bus. The components are user defined entities describing the
appropriate logic gates. For example, a fragment of the faf001 component contains the
following lowest level behavioural description:
   SUM <= A1 xor B1 xor CIN2;
   CO <= (A1 and B1) or (A1 and CIN2) or (B1 and CIN2);


   It is not immediately obvious which way a designer should describe a particular design,
however the next section reveals the results of the synthesiser on which a decision may be
based. In general however, the easier it is for a designer to write a design in VHDL, the
more optimisation the synthesiser needs to perform.
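For readers without the Viewlogic packages, the behavioural adder/subtracter could be sketched in standard VHDL roughly as follows. This is an illustration under assumptions, not the description that was synthesised for this report: it replaces VLBIT_VECTOR, add2c and sub2c with the IEEE numeric_std SIGNED type and its "+" and "-" operators, and the generic n and the port names simply follow the listing above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of an n-bit twos-complement adder/subtracter using numeric_std.
entity addsub is
  generic (
    n : positive := 4  -- datapath width (assumed default)
  );
  port (
    a   : in  signed(n - 1 downto 0);
    b   : in  signed(n - 1 downto 0);
    sel : in  std_logic;               -- '1' adds, otherwise subtracts
    s   : out signed(n - 1 downto 0)   -- carry out is discarded
  );
end entity addsub;

architecture behaviour of addsub is
begin
  addsub_p : process (a, b, sel)
  begin
    if sel = '1' then
      s <= a + b;   -- result keeps the operand width, so it wraps on overflow
    else
      s <= a - b;
    end if;
  end process addsub_p;
end architecture behaviour;
```

Because the numeric_std operators return a result of the operand width, no temporary variable or explicit initialisation is needed here. Whether a given synthesiser maps this to an adder cell or to flattened logic depends on the tool.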

4.2.1 Hierarchical vs Flat Designs
One very useful feature of Viewlogic's VHDL Synthesiser [13] is the ability to create
either a hierarchical (top-down) or a flat (bottom-up) design. A hierarchical design allows
the engineer to see lower level interconnections between design units, unlike the flat design
where no (or little) hierarchy can be seen. This allows easier debugging of designs; however,
it has the disadvantage of being less efficient than a flat design, which combines all the
design elements together into one circuit and then performs optimisation.
    Figure (4.2) illustrates the previous structural design of the Adder/Subtracter, where
it can be observed that the schematic consists of higher level components than standard
library cells. This feature of Viewlogic VHDL enables easy debugging of high level
components when compared to a flat design. It is relatively simple to navigate between
levels in a design.
    Most libraries contain standard cells for full adders, muxes, and inverters, but
since VHDL does not allow direct access to library cells, these components
had to be described behaviourally. A mux simply maps to an IF statement;
however, no behavioural description will map to the full adder cell, so the
description stated previously has to be used.
    Compiling the same design using the flat (bottom-up) design approach, the synthesiser
produces the following statistics when using, for example, the X2000 library. The schematic
generated by the synthesiser is shown in Figure (4.3).
             *********************************************
                          Gate Usage Summary
             *********************************************

Cell         Count       Area/Cell    Cell         Count       Area/Cell
----------------------------------------------------------------------------
X2000:NAND2    15           0.25      X2000:OR2       3           0.25
X2000:XOR2     15           0.25
----------------------------------------------------------------------------

Total Cells :           33              Total Area   :        8.25

             *********************************************
                           Netlist Statistics
             *********************************************

Maximum level of gates =       14     Total number of nets    =      42

[Schematic: four bit slices, each built from INVF101, MUXF201 and FAF001 components,
 with inputs A0-A3, B0-B3 and SEL and outputs S0-S3]

                Figure 4.2: A Hierarchical Design of the Adder/Subtracter for n = 4.



[Schematic: flat netlist of NAND2, OR2 and XOR2 gates connecting inputs A0-A3, B0-B3
 and SEL to outputs S0-S3]




                         Figure 4.3: A Flat Design of the Adder/Subtracter for n = 4.
   Reconsidering the behavioural description of the Adder/Subtracter and synthesising
the design, the following statistics are generated, and the corresponding schematic is
shown in Figure (4.4).
             *********************************************
                          Gate Usage Summary
             *********************************************

Cell         Count       Area/Cell    Cell         Count       Area/Cell
----------------------------------------------------------------------------
X2000:AND2     21           0.25      X2000:AND3      1           0.50
X2000:INV      11           0.00      X2000:NAND2     8           0.25
X2000:OR2      17           0.25      X2000:XOR2      3           0.25
----------------------------------------------------------------------------

Total Cells :           61              Total Area   :       12.75

             *********************************************
                           Netlist Statistics
             *********************************************

Maximum level of gates =       11     Total number of nets    =      70



[Schematic: flat netlist of AND2, AND3, INV, NAND2, OR2 and XOR2 gates connecting
 inputs A0-A3, B0-B3 and SEL to outputs S0-S3]




                        Figure 4.4: A Behavioural Design of the Adder/Subtracter for n = 4.
   From the statistics of each design, it is important to note that the total area and the
maximum level of gates differ. The structural description produces a small but slow
design, whereas the behavioural description produces a fast but large design.
   A characteristic of the synthesiser is that it maps a behavioural description to a
structure by representing each output in terms of its inputs, much like a lookup table,
which removes any structure. On the structural description the synthesiser performs logic
level optimisation, producing a design with less logic.

      4.2.2 The Viewlogic Synthesiser
The Viewlogic Synthesiser has the ability to alter the emphasis on speed or area when
optimizing a design. The statistics generated in the previous section were area optimized,
and neglected the effect of gate delays. For example, optimizing the behavioural design for
speed, the synthesiser generates 14 more gates than before; however, there is a significant
decrease in the maximum level of gates:
             *********************************************
                          Gate Usage Summary
             *********************************************

Cell         Count       Area/Cell    Cell         Count Area/Cell
----------------------------------------------------------------------------
X2000:AND2     10           0.25      X2000:AND3      2           0.50
X2000:AND4      1           0.75      X2000:INV      15           0.00
X2000:NAND2    17           0.25      X2000:NAND3     1           0.50
X2000:NAND4     1           0.75      X2000:NOR3      2           0.50
X2000:NOR4      2           0.75      X2000:OR2      22           0.25
X2000:OR4       1           0.75      X2000:XOR2      1           0.25
----------------------------------------------------------------------------

Total Cells :           75              Total Area   :       18.75

             *********************************************
                           Netlist Statistics
             *********************************************

Maximum level of gates =        9     Total number of nets    =      84


    The synthesiser can optimise small designs, but as a design grows the memory and
processing power required to optimise it become considerable. The CORDIC unit contains
three adders/subtracters and takes several minutes to compile and optimise. When this
unit is integrated into a larger design of several units, however, the compiler runs into
many problems and eventually crashes after half an hour of compilation.
    A way around this optimisation problem is to use a hierarchical flow and describe
the components using behavioural or structural descriptions. With this method the
compiler knows nothing about the large components and cannot perform any global
optimisation. This is not a fully optimised solution, but it is currently the best
available. It is, however, possible to flatten the design below the top level, making the
design slightly more efficient.

4.3 VHDL Design of the CORDIC Unit
The first stage in the design of a CORDIC processor is to create the CORDIC unit, for
which two approaches can be taken: a behavioural description or a structural description.
Firstly, consider a behavioural description in which the shifting of the values (xi, yi)
is done external to the CORDIC unit, in the top level design. This approach is optimal,
since a shift by a fixed amount only requires a misalignment of the data buses in the top
level interconnections.
    However, if the shifting were contained inside the CORDIC unit, each unit would
require a variable shifter, which could not be optimised using the current version of
Viewlogic VHDL for the reasons discussed previously. Another reason why shifting is done
external to the CORDIC unit is that the loop variable of the generate statement cannot be
passed to any user defined function, procedure or entity. This is not stated in the
manual, and the problem took many days to identify.
    The behavioural description is as follows:
ARCHITECTURE behaviour OF adder IS

begin

  cell_i : process (xi,xs,yi,ys,zi,ai)

    VARIABLE x_res:    vlbit_vector(n downto 0);   -- temporary results
    VARIABLE y_res:    vlbit_vector(n downto 0);
    VARIABLE z_res:    vlbit_vector(k downto 0);

    begin

        x_res := zero(n downto 0);   -- initialise, unless comp complains
        y_res := zero(n downto 0);
        z_res := zero(k downto 0);

        if zi(k-1) = '0' then     -- z_i is positive

            x_res := add2c (xi, ys);
            y_res := sub2c (yi, xs);
            z_res := sub2c (zi, ai);

        else                      -- z_i is negative

            x_res := sub2c (xi, ys);
            y_res := add2c (yi, xs);
            z_res := add2c (zi, ai);

        end if;

        xip1 <= x_res (n-1 downto 0);
        yip1 <= y_res (n-1 downto 0);
        zip1 <= z_res (e-1 downto 0);

  end process;

END behaviour;

The synthesiser generates the following statistics for an 8 bit version of the code. The
maximum level of gates is 20, since each of the 8 bits requires 2 levels (16 in total),
plus additional levels for the multiplexer and inversion.
               *********************************************
                            Gate Usage Summary
               *********************************************

Cell         Count       Area/Cell    Cell         Count Area/Cell
----------------------------------------------------------------------------
X2000:AND2    159           0.25      X2000:AND3      3           0.50
X2000:INV      69           0.00      X2000:NAND2    76           0.25
X2000:OR2     125           0.25      X2000:XOR2      7           0.25
----------------------------------------------------------------------------

Total Cells :          439                 Total Area       :     93.25

              *********************************************
                            Netlist Statistics
              *********************************************

Maximum level of gates =           20     Total number of nets      =         487
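
The behavioural description above relies on Viewlogic's vlbit_vector type and its
add2c/sub2c routines. As a rough sketch only, and assuming a tool that supports the
standard IEEE numeric_std package, the same unit might be written as shown below. The
entity name cordic_unit and the generics N and E are assumptions made for this
illustration (the report's own unit is called adder); the add/subtract pattern is copied
from the listing above. The gate statistics quoted above do not apply to this sketch,
which has not been synthesised.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity name and generics (assumptions for this sketch);
-- the report's unit is called "adder" and uses vlbit_vector.
entity cordic_unit is
  generic (
    N : positive := 8;   -- width of the x and y data paths
    E : positive := 8    -- width of the z (angle) data path
  );
  port (
    xi, yi : in  signed(N-1 downto 0);  -- current x and y values
    xs, ys : in  signed(N-1 downto 0);  -- x and y, already shifted externally
    zi     : in  signed(E-1 downto 0);  -- angular residue
    ai     : in  signed(E-1 downto 0);  -- elementary angle for this stage
    xip1   : out signed(N-1 downto 0);
    yip1   : out signed(N-1 downto 0);
    zip1   : out signed(E-1 downto 0)
  );
end entity cordic_unit;

architecture behaviour of cordic_unit is
begin
  cell_i : process (xi, xs, yi, ys, zi, ai)
  begin
    if zi(E-1) = '0' then          -- sign bit clear: z is not negative
      xip1 <= xi + ys;
      yip1 <= yi - xs;
      zip1 <= zi - ai;
    else                           -- sign bit set: z is negative
      xip1 <= xi - ys;
      yip1 <= yi + xs;
      zip1 <= zi + ai;
    end if;
  end process cell_i;
end architecture behaviour;
```

Addition and subtraction on numeric_std signed operands keep the operand width, so the
results wrap on overflow. This should correspond to the original listing, which computes
temporaries one bit wider and then keeps only the low-order bits.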


   The structural description of the CORDIC unit is slightly more complex and is best
represented pictorially, as shown in Figure (4.5). Each box in the figure represents a
different VHDL entity (component), and some components are used more than once. The
design is very bulky, which makes it easier to make mistakes.


     [Figure 4.5 (schematic): inside adders.vhd there are three adder/subtracter
     entities. addsub_e.vhd takes zi and ai and produces zip1; one addsub_n.vhd takes
     xi and ys and produces xip1; a second addsub_n.vhd takes yi and xs and produces
     yip1. Each adder/subtracter contains an inverter (inv101.vhd), a 2-to-1
     multiplexer (muxf201.vhd) and a full adder (faf001.vhd).]

          Figure 4.5: The structure of CORDIC unit showing the various entities.

   It achieves the same functionality as the behavioural description but requires a lot
more effort to make sure all the connections are correct. As stated previously, the
structural design will minimise area, but will result in a slower design, as reflected by
the following synthesiser statistics.
             *********************************************
                          Gate Usage Summary
             *********************************************

Cell         Count       Area/Cell    Cell         Count Area/Cell
----------------------------------------------------------------------------
X2000:INV       3           0.00      X2000:NAND2   139           0.25
X2000:OR2      41           0.25      X2000:XOR2     75           0.25
----------------------------------------------------------------------------

Total Cells :          258              Total Area   :       63.75

             *********************************************
                           Netlist Statistics
             *********************************************

Maximum level of gates =       31     Total number of nets    =      306

Using the structural design saves about 30% on area (63.75 against 93.25) but, with 31
levels of gates instead of 20, it will execute about 50% slower. In an FPGA
implementation speed may be more desirable than area optimisation, since these devices
are relatively slow when compared to a custom VLSI device. The additional gates of the
behavioural design will then be a relatively small concern.

4.3.1 The Rounding Unit
The rounding unit is formed by the interconnection of n half adders or, in behavioural
terms, by the addition of the bit shifted out during the shifting process. Describing it
structurally involves using the inc001 component, which contains an AND and an XOR gate
to form a half adder. The interconnection of the inc001 components is:
    c(0) <= cin;    -- first carry

    connect: for i in 0 to n-1 generate

      addsub: inc001 port map( a(i), c(i), s(i), c(i+1) );

    end generate;

Alternatively, a much simpler behavioural description can be created using the unsigned
addition routine addum. This avoids the sign extension performed by the add2c routine.

  rounder : process (a,cin)

    VARIABLE res:    vlbit_vector(n downto 0);   -- temporary results

    begin

      res := zero(n downto 0);   -- initialise, unless comp complains

      res := addum(a,cin); -- use addum rather than add2c, which would sign
                           -- extend the cin input, making it -1 not +1

      s <= res (n-1 downto 0);

  end process;
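
To make the two descriptions of the rounding unit concrete, the following is a minimal
sketch in standard VHDL, assuming the IEEE std_logic_1164 and numeric_std packages rather
than Viewlogic's vlbit types. The port names of inc001 and the generic N are assumptions;
only the positional order of the inc001 ports (operand, carry in, sum, carry out) is
taken from the port map above. One entity is given two architectures, so that either can
be selected when the unit is instantiated.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Half adder: an XOR gate for the sum and an AND gate for the carry.
-- Port names are assumed for this sketch.
entity inc001 is
  port (
    a, ci : in  std_logic;
    s, co : out std_logic
  );
end entity inc001;

architecture gates of inc001 is
begin
  s  <= a xor ci;
  co <= a and ci;
end architecture gates;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Rounding unit: s = a + cin, where cin is the bit shifted out.
entity round is
  generic (N : positive := 8);
  port (
    a   : in  std_logic_vector(N-1 downto 0);
    cin : in  std_logic;
    s   : out std_logic_vector(N-1 downto 0)
  );
end entity round;

-- Structural version: a ripple chain of N half adders.
architecture structure of round is
  component inc001
    port (a, ci : in std_logic; s, co : out std_logic);
  end component;
  signal c : std_logic_vector(N downto 0);  -- c(N), the final carry, is unused
begin
  c(0) <= cin;                              -- first carry
  connect : for i in 0 to N-1 generate
    cell : inc001 port map (a(i), c(i), s(i), c(i+1));
  end generate connect;
end architecture structure;

-- Behavioural version: unsigned addition of the zero-extended carry.
architecture behaviour of round is
begin
  rounder : process (a, cin)
    variable inc : unsigned(N-1 downto 0);
  begin
    inc    := (others => '0');
    inc(0) := cin;                          -- zero-extended, so cin adds +1
    s <= std_logic_vector(unsigned(a) + inc);
  end process rounder;
end architecture behaviour;
```

In the behavioural version the single carry bit is placed in the least significant
position of an otherwise zero vector, so it contributes +1 and is never treated as a sign
bit. This is the same concern that leads the report to use addum instead of add2c.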


4.4 Combining the CORDIC Units
Combining the CORDIC and rounding units involves writing the top level design of the
hierarchical solution. As before with structural descriptions, the generate statement is
used; it allows iterative or conditional generation of a portion of the description.
    The first definition to be made in the top level file is that of the alpha_i
constants; this version implements the Alpha Normalisation Scheme. Next, the x, y, z
intermediate signals between CORDIC units are shifted by the appropriate amount. The
function shift_all is defined in another file, which contains the user defined functions.
The shifting has to be performed here because it will not work inside the generate
statement: concurrent procedure calls only execute when a signal in their sensitivity
list changes state, and a change in the shift value is not recognisable inside the
generate statement.
    -- Scaled a_i * 2^i values are decimal 45 53 56 57 57 57 57 57
    ai <= X"39_39_39_39_39_38_35_2D";

    sh_x: xis <= shift_all(xi);      -- shift intermediate signals
    sh_y: yis <= shift_all(yi);
    sh_z: zis <= shift_z(zi);


It should be noted that the variables xis, yis, zis, xi, yi, and zi are large vectors
containing several smaller vectors. This system had to be used since Viewlogic's VHDL
cannot handle two-dimensional arrays of vlbit. The shifting of intermediate signals is
done by the following function:
 FUNCTION shift_all (x : vlbit_vector (n*(k-1)-1 downto 0))
 RETURN vlbit_vector IS

  VARIABLE x_s : vlbit_vector(n*(k-1)-1 downto 0) := zero(n*(k-1)-1 downto 0);

  BEGIN

     x_s(1*n-1 downto 0  ) := shiftr2c(x(1*n-1 downto 0  ),1);   -- stage 2
     x_s(2*n-1 downto 1*n) := shiftr2c(x(2*n-1 downto 1*n),2);   -- stage 3
     x_s(3*n-1 downto 2*n) := shiftr2c(x(3*n-1 downto 2*n),3);   -- stage 4
     x_s(4*n-1 downto 3*n) := shiftr2c(x(4*n-1 downto 3*n),4);   -- stage 5
     x_s(5*n-1 downto 4*n) := shiftr2c(x(5*n-1 downto 4*n),5);   -- stage 6
     x_s(6*n-1 downto 5*n) := shiftr2c(x(6*n-1 downto 5*n),6);   -- stage 7
     x_s(7*n-1 downto 6*n) := shiftr2c(x(7*n-1 downto 6*n),7);   -- stage 8
     x_s(8*n-1 downto 7*n) := shiftr2c(x(8*n-1 downto 7*n),8);   -- stage 9
     x_s(9*n-1 downto 8*n) := shiftr2c(x(9*n-1 downto 8*n),9);   -- stage 10

     return x_s;

   END shift_all;
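
The nine assignments in shift_all follow a regular pattern: field j-1 of the packed
vector is shifted right by j places. As a sketch only, assuming standard VHDL with the
IEEE numeric_std package and assuming that shiftr2c is an arithmetic (sign preserving)
right shift, the same function could be written with a loop. The package name
cordic_shift and the values chosen for N and K are assumptions for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package cordic_shift is
  constant N : positive := 8;   -- bits per intermediate value (assumed)
  constant K : positive := 10;  -- K-1 = 9 packed values, as in the listing above

  function shift_all (x : std_logic_vector(N*(K-1)-1 downto 0))
    return std_logic_vector;
end package cordic_shift;

package body cordic_shift is
  function shift_all (x : std_logic_vector(N*(K-1)-1 downto 0))
    return std_logic_vector is
    variable x_s : std_logic_vector(N*(K-1)-1 downto 0);
  begin
    for j in 1 to K-1 loop
      -- Field j-1 occupies bits j*N-1 downto (j-1)*N and is shifted by j.
      -- shift_right on a signed operand replicates the sign bit.
      x_s(j*N-1 downto (j-1)*N) := std_logic_vector(
        shift_right(signed(x(j*N-1 downto (j-1)*N)), j));
    end loop;
    return x_s;
  end function shift_all;
end package cordic_shift;
```

Each shift amount is a constant once the loop is unrolled, so this form still describes
fixed wiring rather than a variable shifter. Whether the Viewlogic tools of the time
would have accepted such a loop is not known.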


Next comes the connection of the init component, which is used to expand the convergence
range of the CORDIC processor to -190° < z < 190°. The input signals x_in, y_in and z_in
are connected to a unit similar to the CORDIC unit, except that an extra bit is appended
to the alpha bus to account for the expanded convergence range.


    initial: init port map(xi => X"00",
                           xs => x_in,

                           yi => X"00",
                           ys => y_in,

                           zi => z_in,
                           ai => B"0_0101_1010", -- add/sub 90 degrees

                           xip1 => xinit, -- xinit = 0 +- yin
                           yip1 => yinit, -- yinit = 0 -+ xin
                           zip1 => zinit );


The following code has been compressed to reduce detail; however, it can be seen that
there are three separate stages: the initial connection, the intermediate connections,
and the final connection. These can be seen in Figure (4.6). (Also not shown is the
conditional generation of components, e.g. the selection of behavioural or structural
components, rounding units, etc.)
    connect: for i in 0 to k-1 generate           -- k stages

      ls_unit: if i=0 generate

        first_unit: adder port map( ... );

      end generate ls_unit;

      i_unit: if i>0 and i<k-1 generate

         x_round: round port map ( ... );
         y_round: round port map ( ... );
         middle_units: adder port map( ... );

      end generate i_unit;

      ms_unit: if i=k-1 generate

        x_round_last: round port map ( ... );
        y_round_last: round port map ( ... );
        last_unit: adder port map( ... );

      end generate ms_unit;

   end generate connect;


The contents of ... are similar to the port map of the init component.
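
The way a generate statement indexes into one large vector holding several smaller
vectors can be illustrated with a toy example. This is not the report's top level design:
each stage below merely shifts its input, and the entity name chain_demo, the generics
and the signal layout are all assumptions. It is written in standard VHDL with the IEEE
numeric_std package, where the generate parameter is a constant and may be passed to a
function; the report found that Viewlogic's VHDL did not allow this, which is why its
shifts are performed outside the generate statement.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity chain_demo is
  generic (
    N : positive := 8;   -- bits per value (assumed)
    K : positive := 4    -- number of stages (assumed)
  );
  port (
    x_in  : in  std_logic_vector(N-1 downto 0);
    x_out : out std_logic_vector(N-1 downto 0)
  );
end entity chain_demo;

architecture structure of chain_demo is
  -- K+1 values of N bits each, packed into one vector:
  -- field i occupies bits (i+1)*N-1 downto i*N.
  signal xi : std_logic_vector(N*(K+1)-1 downto 0);
begin
  xi(N-1 downto 0) <= x_in;                 -- field 0 is the input

  connect : for i in 0 to K-1 generate
    -- Stage i reads field i and drives field i+1.
    -- The generate parameter i is used directly as the shift count.
    xi((i+2)*N-1 downto (i+1)*N) <= std_logic_vector(
      shift_right(signed(xi((i+1)*N-1 downto i*N)), i));
  end generate connect;

  x_out <= xi(N*(K+1)-1 downto N*K);        -- field K is the output
end architecture structure;
```

In the real design each stage would instantiate the adder and round components in place
of the single shift, with the first and last stages selected by if-generate statements
as in the listing above.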

4.4.1 A Solution
This represents a solution to the CORDIC problem, and is close to an optimised solution,
but due to compiler and language difficulties a completely optimised solution is not
possible. Under these circumstances the design has nevertheless been optimised as far as
possible.
    There are many choices to be made about the design of the CORDIC unit, depending on
whether it is to be area or speed efficient.
  • 1. Faculty of Computing and Information Technology Department of Robotics and Digital Technology Technical Report 94-9 A VHDL Implementation of a CORDIC Arithmetic Processor Chip Grant Hampson, Student Member, IEEE Andrew Paplinski, Member, IEEE October 10, 1994 Enquiries:- Technical Report Coordinator Robotics and Digital Technology Monash University Clayton VIC 3168 Australia tr.coord@rdt.monash.edu.au +61 3 905 3402
  • 2. Contents Abstract and Keywords 4 Preface 5 1 The CORDIC Algorithm 6 2 CORDIC Hardware Implementations 10 2.1 CORDIC Processor Architecture : : : : : : : : : : : : : : : : : : : : : : : 10 2.1.1 A Word-Serial CORDIC Architecture : : : : : : : : : : : : : : : : : 10 2.1.2 A Word-Parallel CORDIC Architecture : : : : : : : : : : : : : : : : 11 3 Improving CORDIC Accuracy 14 3.1 Estimation of CORDIC Accuracy : : : : : : : : : : : : : : : : : : : : : : : 14 3.2 The Lower Bound of CORDIC Accuracy : : : : : : : : : : : : : : : : : : : 15 3.3 Reducing the z update error : : : : : : : : : : : : : : : : : : : : : : : : : : 16 3.4 Unexpected Truncation Errors : : : : : : : : : : : : : : : : : : : : : : : : : 20 4 VHDL Implementation 21 4.1 The Basic CORDIC Unit : : : : : : : : : : : : : : : : : : : : : : : : : : : : 21 4.2 VHDL Describes Structure and Behaviour : : : : : : : : : : : : : : : : : : 22 4.2.1 Hierarchical vs Flat Designs : : : : : : : : : : : : : : : : : : : : : : 23 4.2.2 The Viewlogic Synthesiser : : : : : : : : : : : : : : : : : : : : : : : 25 4.3 VHDL Design of the CORDIC Unit : : : : : : : : : : : : : : : : : : : : : : 26 4.3.1 The Rounding Unit : : : : : : : : : : : : : : : : : : : : : : : : : : : 29 4.4 Combining the CORDIC Units : : : : : : : : : : : : : : : : : : : : : : : : 30 4.4.1 A Solution : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 31 4.5 Improvements : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 33 Conclusion 34 A CORDIC Functions 35 B Upper Bound of CORDIC Error 37 References 38 1
  • 3. List of Tables 1.1 Elementary angles of i : : : : : : : : : : : : : : : : : : : : : : : : : : : : 8 1.2 Various values of Kn : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 8 4.1 Some CORDIC hardware statistics. : : : : : : : : : : : : : : : : : : : : : : 33 A.1 The six CORDIC modes. : : : : : : : : : : : : : : : : : : : : : : : : : : : : 36 2
  • 4. List of Figures 1.1 Rotation of a point in 2-D space. : : : : : : : : : : : : : : : : : : : : : : : 6 2.1 Generic Processor Architecture. : : : : : : : : : : : : : : : : : : : : : : : : 11 2.2 A Optimised Word-Serial CORDIC Architecture. : : : : : : : : : : : : : : 12 2.3 Word-Parallel CORDIC architecture with possible data pipelining. : : : : : 13 3.1 Numerical accuracy of the CORDIC processor. : : : : : : : : : : : : : : : : 15 3.2 Predicted and Actual accuracy of a CORDIC processor with a 12 bit in- ternal datapath. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 15 3.3 A plot showing bits of error for a typical test vector rotated through all possible angles. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 16 3.4 A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results. : : : 17 3.5 An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results. : : : 17 3.6 Simulation results from a CORDIC processor illustrating the e ects of the normalisation scheme. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 19 3.7 An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding. 20 4.1 The basic CORDIC unit. : : : : : : : : : : : : : : : : : : : : : : : : : : : : 21 4.2 A Hierarchical Design of the Adder/Subtracter for n = 4. : : : : : : : : : : 24 4.3 A Flat Design of the Adder/Subtracter for n = 4. : : : : : : : : : : : : : : 24 4.4 A Behavioural Design of the Adder/Subtracter for n = 4. : : : : : : : : : : 25 4.5 The structure of CORDIC unit showing the various entities. : : : : : : : : 28 4.6 The top level schematic of an 4 stage CORDIC processor with Increased Convergence Range and Rounding components. : : : : : : : : : : : : : : : 32 3
  • 5. Abstract This report describes the fundamentals of CORDIC (Co-ordinate Rotations Digital Com- puter) algorithm and a possible implementation using the VHDL hardware description language. An analysis of errors associated with a xed point implementation of CORDIC is also discussed and methods for reducing these errors. A normalisation scheme which reduces error and requires no extra hardware is such a method. Various CORDIC struc- tures and possible VHDL implementations are described in detail, including design and language issues. Finally a parallel hardware implementation is described and simulated. CORDIC has many applications, of which, some can be used for array imaging tech- niques. Keywords CORDIC, VHDL 4
  • 6. Preface CORDIC is an acronym for Coordinate Rotations Digital Computer and was derived by Volder 1] in the late 1950's for the purpose of calculating trigonometric functions. Its popularity came about nearly twenty years later when VLSI solutions became a reality. The original algorithm describes the rotation of a 2-D vector which can be applied in applications such as Digital Signal Processing 2] (Fourier Transforms, Digital Filters), Computer Graphics 3] and Robotics 4]. CORDIC processing o ers high computational rates making it attractive to applica- tions such as computer graphics where a combination of scaling and rotations are required in real time. CORDIC is also attractive to Robotics since the fundamental operation is coordinate transformations, however it could be used for more computationally intensive processes such as motion planning and collision detection. Array Imaging typically involves complex signal processing which may require many computationally intensive matrix operations. Increasing the complexity of the imaging model places greater demands on accuracy. Solutions to such complex systems requires better, and hence, more complex algorithms. Most of these algorithms are based on matrix factorization (decomposition) techniques, of which Singular Value Decomposition (SVD) is the most robust method. The SVD factorisation requires a two-sided transformation which involves several trigometric operations and rotations ideally suited to dedicated VLSI hardware (CORDIC processing) for real time calculations. CORDIC has also been applied to phase correction when dynamic range focusing when Digital Baseband Demod- ulation 5] techniques are employed in Interpolation Beamforming 6] . A complex signal is represented by the in-phase, I, and quadrature, Q, components, and are phase corrected by rotating the complex signal. 
Haviland and Tuszynski designed and built a CORDIC processor 7] in 1980 which used a iterative process to calculate circular, linear and hyperbolic functions. A more recent implementation (1993) by Duprat and Muller 8] discusses the possibility of using a redundant number system for the representation of a signed digit. This report is broken into four logical sections, namely, CORDIC Theory, Hardware Implementations, Improving CORDIC Accuracy and nally a VHDL Implementation. 5
  • 7. Chapter 1 The CORDIC Algorithm Consider a 2-D vector (x; y) represented by a point v = x + |y in the complex plane. If the vector is rotated by an angle , the new co-ordinate vector is given by: v = v ej ~ (1:1) and shown in Figure (1.1). y v = x + |y ~ ~ ~ v = x + |y x Figure 1.1: Rotation of a point in 2-D space. The angle can be expanded into a set of elementary angles i with pseudo-digits qi 2 f?1; +1g, and angle expansion error zn , such that n?1 X = qi i + zn (1:2) i=?1 and the sub-rotation angles i take on the following values: ( i = arctan(2?i ) for i = ?1 ; ; n ? 1 =2 for i = 0; 1 (1:3) Note that i is approximately equal to but less than 2?i and the resulting angular expan- sion error is therefore jznj < 2?(n?1). 6
  • 8. Substitution of Equation(1.2) into Equation (1.1) gives: n?1 Y |q v = v ~ e i i e | zn i=?1 n?1 Y |q = v (|qi) e i i e | zn (1.4) i=0 and expanding ejqi i , ejqi i = cos qi i + j sin qi i = cos qi i (1 + j tan qi i) = cos i 1 + j qi 2?i Finally n?1 ! n?1 ! Y Y v=v ~ cos i (|q?1) 1 + | qi 2?i e?j zn (1:5) i=0 i=0 The range of rotation angles which can be represented by Equation (1.2) is max , where n?1 X max = i 190 (1:6) i=?1 and some values of i are given in Table (1.1). If the expected range of rotation angles is 90 then the initial rotation by 90 , that is, e|q?q 2 = j q?1, does not have to be performed and the initial rotation is by 45 . The second term is a constant scaling factor and for given value of n it can be pre- evaluated using Equation (1.7), and the rst 15 evaluated in Table (1.2). n?1 Y n?1 Y ?2 1 n?1 Y 1 Kn = cos i = 1 + 2?2i = q (1:7) i=0 i=0 i=0 1 + 41i The basic CORDIC algorithm which describes rotation of a unity length vector v = x + |y by an angle can be derived from Equation (1.5) using the initial conditions, where zi is the accumulated angular residue: v?1 = v Kn z?1 = And, proceeding with i = ?1; 0; ;n ? 1 ( qi = ?1 if zi < 0 (1.8) +1 0 ( v |qi if i ? vi+1 = vi (1 + |q 2?i ) if i = 0 1 (1.9) i i zi+1 = zi ? qi i (1.10) 7
  • 9. i Angle Angle (degrees) 16-bit binaries 0 arctan(2 0) 45:0000 B400 = 110001:0000000000 1 arctan(2 ?1 ) 26:5651 6A43 = 011010:1001000011 2 arctan(2 ?2 ) 14:0362 3825 = 001110:0000100101 3 arctan(2 ?3 ) 7:1250 1C80 = 000111:0010000000 4 arctan(2?4 ) 3:5763 0E40 = 000011:1001000000 5 arctan(2 ?5 ) 1:7899 0729 = 000001:1100101001 6 arctan(2 ?6 ) 0:8952 0395 = 000000:1110010101 7 arctan(2?7 ) 0:4476 01CA = 000000:0111001010 8 arctan(2 ?8 ) 0:2238 00E5 = 000000:0011100101 9 arctan(2 ?9 ) 0:1119 0073 = 000000:0001110011 10 arctan(2 ?10 ) 0:0560 0039 = 000000:0000111001 11 arctan(2?11 ) 0:0280 001D = 000000:0000011101 12 arctan(2 ?12 ) 0:0140 000E = 000000:0000001110 13 arctan(2 ?13 ) 0:0070 0007 = 000000:0000000111 14 arctan(2?14 ) 0:0035 0004 = 000000:0000000100 15 arctan(2 ?15 ) 0:0017 0002 = 000000:0000000010 16 arctan(2 ?16 ) 0:0008 0001 = 000000:0000000001 Table 1.1: Elementary angles of i n Kn 0 0.70710678118655 1 0.63245553203368 2 0.61357199107790 3 0.60883391251775 4 0.60764825625617 5 0.60735177014130 6 0.60727764409353 7 0.60725911229889 8 0.60725447933256 9 0.60725332108988 10 0.60725303152913 11 0.60725295913894 12 0.60725294104140 13 0.60725293651701 14 0.60725293538591 15 0.60725293510314 Table 1.2: Various values of Kn 8
  • 10. The nal rotated vector is vn, with angle expansion error zn vn = v = v e| e?|zn ~ (1.11) n?1 X zn = ? qi i (1.12) i=?1 One complex operation on vi is equivalent to two operations on real numbers. For i = ?1 x0 + |y0 = |q?1(x?1 + |y?1) Hence =) x0 = ?q?1y?1 (1.13) y0 = q?1x?1 (1.14) For i = 0; 1; ;n ?1 xi+1 + |yi+1 = (xi + |yi)(1 + |qi 2?i ) Hence =) xi+1 = xi ? qi yi 2?i (1.15) yi+1 = yi + qi xi 2?i (1.16) The CORDIC algorithm reduces to an iterative set of operations consisting of a binary shift and an accumulator for each of x; y and z. Refer to Appendix A for a list of transcendental functions. 9
  • 11. Chapter 2 CORDIC Hardware Implementations A Hardware implementation of CORDIC processor is dependent on the number of func- tions required and the computational speed. If all functions are to be computed, then there will be a necessary overhead for selecting each function. However, a small fast de- sign will result if a small number of functions are required. This chapter presents possible solutions to a mixture of design problems. 2.1 CORDIC Processor Architecture A CORDIC algorithm can take on two primary architectures, namely, word serial or word parallel. A word-serial processor minimises hardware requirements by utilising a single CORDIC unit repeatedly. However, iterative algorithms which are controlled by a small number of variables can be expanded on a two-dimensional area. ie., instead of executing a certain set of instructions n times using a single element (eg., a CORDIC unit), n times duplicated elementary cells are used in successive steps of an iteration 9]. This attened structure can now perform many operations in parallel and is so called a word-parallel CORDIC processor. A word-parallel architecture has the advantage of being up to n times faster, but due to the expansion requires, at worst, n times more hardware. However, the word-serial architecture requires complex controlling hardware and a variable shifter, decreasing the hardware saving ratio. 2.1.1 A Word-Serial CORDIC Architecture The CORDIC algorithm has the advantage of not requiring any special hardware other than an accumulator and a variable shifter which are generally available in most micro- controllers. A multi-function word-serial CORDIC processor architecture could be realised using a basic micro structure consisting of a two-port register le, a variable shifter combined with an ALU interconnected by several data paths as shown in Figure (2.1). A generic controller could consist of a microcode instructions for the ALU and register 10
  • 12. file, and would execute an iterative algorithm. This structure is similar to that of a microprocessor or DSP and allows many variations of the CORDIC algorithm, as the order of operations and the expanded instruction set increase flexibility. This type of structure illustrates that it would be possible to implement the CORDIC algorithm on any microprocessor or DSP.

[Figure 2.1: Generic Processor Architecture. Blocks: register file, ALU, variable shifter (producing $2^{-i} y_i$ or $2^{-i} x_i$), ROMs for the $K_n$ and $\alpha_i$ values, CC register and controlling micro-code; input data buses $x_i$, $y_i$, $z_i$; result bus $x_{i+1}$, $y_{i+1}$, $z_{i+1}$.]

Optimising the generic processor structure for a word-serial CORDIC processor is achieved by reducing the functionality to only those operations required by the CORDIC algorithm. A possible word-serial architecture is shown in Figure (2.2), where the ALU now contains three adders and dedicated registers. The microcode controller has been replaced by faster Combinational Control Logic dedicated to the CORDIC operation sequence.

2.1.2 A Word-Parallel CORDIC Architecture

The word-parallel method expands a one-dimensional algorithm into a two-dimensional problem and results in shorter computation times. Greater speeds of computation can be obtained by pipelining between stages so that many partial results can be calculated in parallel. A pipelined word-parallel architecture is shown in Figure (2.3), where each iteration is represented by a separate CORDIC block and a latch is placed after each iteration, or after several iterations.

The following chapters develop, implement, and simulate such a parallel CORDIC structure using the VHDL hardware description language.
  • 13. [Figure 2.2: An Optimised Word-Serial CORDIC Architecture. Blocks: Combinational Control Logic (inputs Load, Precision, Reset, Clock; signals Select, Next State, Increment, Zero, Finished Flag), look-up table of $\alpha_i$ values, counter $i$, three adders forming $x_i - q_i y_i 2^{-i}$, $y_i + q_i x_i 2^{-i}$ and the $z_i$ update, and clocked registers holding $x_{i+1}$, $y_{i+1}$, $z_{i+1}$; initial inputs $x_0$, $y_0$, $z_0$.]
  • 14. [Figure 2.3: Word-Parallel CORDIC architecture with possible data pipelining. Cells #0 to #n are chained; cell #i computes $y_{i+1}$, $x_{i+1}$, $z_{i+1}$ from $y_i$, $x_i$, $z_i$ using $q_i = \mathrm{sign}[z_i]$ and $\alpha_i$, and a clocked latch for pipelining of data may be placed between cells.]
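The pipelined word-parallel structure of Figure 2.3 can be imitated in software to show why latching each cell gives one result per clock once the pipe is full. The following is only a behavioural sketch in floating point: `cordic_cell` and `run_pipeline` are hypothetical names, the $i = -1$ cell is omitted, and a latch is assumed after every cell.

```python
import math

def cordic_cell(i, state):
    """One CORDIC cell: the shift-and-add update of stage i."""
    x, y, z = state
    q = 1 if z >= 0 else -1
    return (x - q * y * 2.0 ** -i,
            y + q * x * 2.0 ** -i,
            z - q * math.atan(2.0 ** -i))

def run_pipeline(inputs, n):
    """Clock a stream of (x, y, z) tuples through n latched cells.

    Returns the list of results and the number of clocks used.
    """
    latches = [None] * n                      # latches[i]: output of cell i
    outputs, clocks = [], 0
    stream = list(inputs) + [None] * (n - 1)  # idle clocks flush the pipe
    for item in stream:
        # Update from the last cell backwards so that each cell reads the
        # value its predecessor latched on the previous clock.
        for i in range(n - 1, 0, -1):
            prev = latches[i - 1]
            latches[i] = cordic_cell(i, prev) if prev is not None else None
        latches[0] = cordic_cell(0, item) if item is not None else None
        clocks += 1
        if latches[-1] is not None:
            outputs.append(latches[-1])
    return outputs, clocks
```

With four input vectors and n = 8 cells, this model needs 11 clocks, whereas a single word-serial unit would need 4 × 8 = 32 iterations.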
  • 15. Chapter 3 Improving CORDIC Accuracy

As expected, iterative algorithms calculate results by approximation and the solution will contain errors. CORDIC is no exception: errors are introduced by a combination of quantisation and approximation. The accuracy of a CORDIC processor depends on the word length used for the three input variables x, y and z, as well as on the number of iterations or steps performed. This chapter describes the errors associated with a fixed point implementation and a means of reducing these errors.

3.1 Estimation of CORDIC Accuracy

The fundamental operation performed by a CORDIC processor is the shift-and-add process, in which fixed point arithmetic will introduce errors. For example, consider the binary scaling of the vector $v_i = (x_i, y_i)$ at the $i$th stage:

    if $i \le m$ then $v_{i+1}$ is updated with the truncated value $v_i 2^{-i}$
    if $i > m$ then $v_{i+1} = v_i$, and the update will be 0

where m is the internal bus width of v and limits the maximum number of useful iterations. Peak accuracy could be achieved after m iterations, since all accuracy has been exhausted in v. However, truncation errors may exceed the accuracy gained by more iterations, and it is desirable to find the optimal number of iterations.

The accuracy of the rotation is determined by how closely the input rotation angle is approximated by the summation of sub-rotation angles $\alpha_i$. The error in v after n iterations will be proportional to the error in z. An increase in the z datapath width will increase the accuracy of the z update and hence of the v update.

The numerical accuracy of the CORDIC algorithm can be calculated by examining truncation and approximation errors. Truncation errors are due to the finite word length and approximation errors are due to the finite number of iterations. Walther [10] analysed the x and y iterations independently of the z iterations and concluded that log n extra bits in the data paths can provide n bits of accuracy.
Kota and Cavallaro [11] repeated this analysis without treating the iterations independently and concluded that log n + 2 extra bits are required to achieve n bits of accuracy after n iterations.
  • 16. This solution represents an upper bound of error in the CORDIC processor. A graph of this function appears in Figure (3.1), from which it can be seen that to achieve 8 or 16 bit accuracy, the internal datapaths need to be 13 and 22 bits wide respectively.

[Figure 3.1: Numerical accuracy of the CORDIC processor. Plot of datapath resolution vs output resolution; horizontal axis: internal datapath width (n + log(n) + 2); vertical axis: output resolution of n bits with n iterations.]

3.2 The Lower Bound of CORDIC Accuracy

A CORDIC processor can be presented with all possible input combinations to find the lower bound of error. Simulation results are shown in Figure (3.2), where a 12 bit CORDIC processor with a variable number of stages is presented with all possible rotation angles $z_{-1}$ and the resulting accuracy in bits is calculated. Kota and Cavallaro's upper bound of error (as defined by their maximum error equation in Appendix (B)) is also shown in Figure (3.2). The upper bound of error has a well defined peak of accuracy; however, the simulation results indicate that accuracy will improve if more iterations are performed.

[Figure 3.2: Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath. Solid: predicted accuracy; dashed: actual accuracy; horizontal axis: number of stages n; vertical axis: output accuracy in bits.]
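The datapath widths quoted from Figure 3.1 can be checked with a short calculation. This sketch assumes that the logarithm in n + log n + 2 is base 2 and is rounded up to a whole number of bits; `datapath_width` is a hypothetical helper name.

```python
import math

def datapath_width(n):
    """Internal datapath width suggested by the n + log n + 2 upper bound."""
    # Assumption: log is base 2, rounded up to a whole number of bits.
    return n + math.ceil(math.log2(n)) + 2

for bits in (8, 16):
    print(bits, "bit accuracy needs a", datapath_width(bits), "bit datapath")
```

Under these assumptions the function reproduces the 13 and 22 bit figures stated for 8 and 16 bit accuracy.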
  • 17. Figure (3.3) illustrates, by simulation, the accuracy of a 12 bit, 12 stage processor and the resulting bits of error produced. About 0.3% of results have more than 2 bits of error, which indicates that the error bound of a CORDIC processor lies between the upper and lower bounds of error.

[Figure 3.3: A plot showing bits of error for a typical test vector rotated through all possible angles. Polar plot; radial axis: bits of error (1 to 3).]

The simulation results indicate that n + log n + 2 overestimates the datapath width required, and that a reduction in datapath width is possible if the number of iterations is increased. Simulation results of two 8 stage CORDIC processors, with 12 bit and 8 bit datapaths, are shown for comparison in Figure (3.4) and Figure (3.5) respectively. The simulation results were obtained by varying the magnitude of v and the angle $\theta$ in uniform steps. The difference in resolution obtained is two bits, indicating that the lower bound of error is closer to the error bound of CORDIC.

3.3 Reducing the z update error

In the rotational mode of CORDIC, $z_i$ converges towards zero by adding or subtracting sub-rotation angles, so the final iterations of the $z_i$ update produce numbers approaching zero. More precisely, the angular error $z_i$ is approximately equal to $2^{-i}$; thus, for a bus width m, only (m - i) bits are used to represent the error.

To reduce the $z_i$ error a floating point system could be used, but its hardware implementation is complex and not suited to word-parallel structures. A simpler method to
  • 18. [Figure 3.4: A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results. Polar plot of magnitude (0 to 1.0) against angle.]

[Figure 3.5: An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results. Polar plot of magnitude (0 to 1.0) against angle.]
  • 19. improve accuracy, i.e., to utilise all m bits, is a quasi-floating point or normalisation scheme, implemented by scaling the existing sequence by $2^i$, i.e.,

    $$ \hat{z}_i = 2^i z_i $$

Therefore, the new sequence becomes

    $$ \hat{z}_{i+1} = 2^{i+1} z_{i+1} = 2 \cdot 2^i (z_i - q_i \alpha_i) = 2 (2^i z_i - q_i 2^i \alpha_i) = 2 (\hat{z}_i - q_i \hat{\alpha}_i) \qquad (3.1) $$

which requires a shift left at each iteration, and requires no extra hardware for a word-parallel structure. A new sequence of sub-rotation angles can be defined as:

    $$ \hat{\alpha}_i = 2^i \alpha_i = 2^i \tan^{-1}(2^{-i}) \qquad (3.2) $$

where $\hat{\alpha}_i$ approaches a finite value of 1 for increasing values of i, and will utilise most of the bus width.

Since the scaling system results in full use of the data bus width, overflow may occur if the bus width is too small. Using Equation (3.1), $\hat{z}_{i+1}$ takes its maximum value when $\hat{z}_i$ approaches zero, giving

    $$ \max[\hat{z}_{i+1}] \le 2 \max[\hat{\alpha}_i] \qquad (3.3) $$

Calculating the increase in accuracy is beyond the scope of this report; however, simulation indicates that there is a direct improvement in accuracy. The simulation results indicate that, using the traditional scheme, the accuracy of the rotation is

    $$ \text{accuracy} \propto \log(z_i \text{ datapath width}) + \log(\text{number of stages}) \qquad (3.4) $$

whereas the normalisation scheme has the advantage of

    $$ \text{accuracy} \propto \log(\text{number of stages}) \qquad (3.5) $$

since the z datapath is always in a semi-normalised state. Using the traditional scheme, $\alpha_i \to 0$, limiting the number of useful stages. When normalised, however, there is no limit on the number of stages, and a significant reduction in hardware is possible by reducing the bus width of z.

Figure (3.6) illustrates the error dependencies on the number of stages and bits for the scaled and unscaled CORDIC processors. Figure (3.6(a)) and Figure (3.6(b)) show the angular expansion error. Figure (3.6(c)) and Figure (3.6(d)) show the dependence of the v error on the angular expansion error.
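The normalised recurrence (3.1) and the scaled angles (3.2) can be compared against the traditional z update numerically. The sketch below works in floating point and radians, and assumes $q_i = \mathrm{sign}(\hat{z}_i)$; `alpha_hat` and `z_updates` are hypothetical names.

```python
import math

def alpha_hat(i):
    """Scaled sub-rotation angle of equation (3.2), in radians."""
    return 2.0 ** i * math.atan(2.0 ** -i)

def z_updates(z0, n):
    """Run the traditional and the normalised z updates side by side.

    Returns a list of (z_{i+1}, z_hat_{i+1}) pairs for i = 0 .. n-1.
    """
    z, z_hat = z0, z0                  # z_hat_0 = 2^0 * z_0
    history = []
    for i in range(n):
        q = 1 if z_hat >= 0 else -1    # assumed: q_i = sign(z_hat_i)
        z = z - q * math.atan(2.0 ** -i)
        z_hat = 2.0 * (z_hat - q * alpha_hat(i))   # equation (3.1)
        history.append((z, z_hat))
    return history

for i, (z, z_hat) in enumerate(z_updates(0.6, 8)):
    print(i + 1, z, z_hat)
```

In this model $\hat{z}_{i+1}$ stays equal to $2^{i+1} z_{i+1}$, so the normalised value keeps a roughly constant magnitude while the traditional $z_i$ shrinks towards zero, and `alpha_hat(i)` climbs from $\pi/4$ towards 1.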
  • 20. [Figure 3.6: Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme. Four surface plots: (a) angle expansion error with no alpha scaling and (b) with alpha scaling, plotted against stages and bits; (c) relative v error with no alpha scaling and (d) with alpha scaling, plotted against stages/bits in v and bits in z.]
  • 21. 3.4 Unexpected Truncation Errors

Using fixed point arithmetic in a CORDIC processor introduces an unexpected truncation error. The error occurs when the vector (x, y) has a negative component. Consider the final iterations, where the update of vector v approaches 0 since a larger number of right shifts is performed at each iteration. This is not the case if x or y is negative. For example, let $x_{i \to N}$ equal hex X"2D", or positive 45. The right-shifted value of $x_{i \to N}$ approaches zero. However, the negative of X"2D" in twos-complement form is X"D3", and its right-shifted value approaches X"FF", or -1, not the expected zero. This is a significant problem in the CORDIC processor, since the addition of extra iterations will only increase the error.

A simple method of removing this error is to round the shifted value instead of truncating it. A simple method for rounding is to add the bit that was last shifted out to the shifted value. The rounder could be implemented using a half-adder and typically requires three logic gates per bit. Minimal extra hardware is required in the word-serial architecture; however, a word-parallel structure requires two half-adders per stage. The additional delay will have a direct effect on the performance of the processor.

Figure (3.7) shows the simulation results of two CORDIC processors, with and without rounding units. The test vector was rotated in steps of 5° through 360°, and the rounded results are significantly more accurate. The rounding maintains monotonicity in the actual angle of rotation as well as uniform magnitude.

[Figure 3.7: An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding. Two polar plots of the rotated test vector.]
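The X"2D" and X"D3" example above can be reproduced with ordinary integers, since Python's `>>` on negative integers behaves like a two's-complement arithmetic shift. This is a minimal sketch of the described rounding rule (add the last bit shifted out); the function names are hypothetical and the 8 bit width is taken from the example.

```python
def shift_truncate(value, shift):
    """Arithmetic right shift: truncates towards minus infinity."""
    return value >> shift

def shift_round(value, shift):
    """Right shift, then add the bit that was last shifted out."""
    if shift == 0:
        return value
    last_bit_out = (value >> (shift - 1)) & 1
    return (value >> shift) + last_bit_out

def as_hex8(value):
    """8 bit two's-complement pattern of value, as two hex digits."""
    return format(value & 0xFF, "02X")

for shift in range(8):
    print(shift,
          as_hex8(shift_truncate(45, shift)),
          as_hex8(shift_truncate(-45, shift)),
          as_hex8(shift_round(-45, shift)))
```

For large shifts the truncated negative value settles at X"FF" (-1), while the rounded value settles at zero, as it does for the positive operand.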
  • 22. Chapter 4 VHDL Implementation

Various tools can be used to implement the CORDIC processor; however, a standardised approach to this problem would unify the solution for further development in various applications. VHDL (VHSIC Hardware Description Language) has been used here to describe the structural and behavioural characteristics of a Word-Parallel CORDIC processor. VHDL has become the standard among hardware description languages and has its own IEEE standard [12].

4.1 The Basic CORDIC Unit

Any CORDIC structure involves a basic unit containing three adders/subtracters, as shown in Figure (4.1). The binary scaler would be variable in the case of a Word-Serial device, but is much simpler in the Word-Parallel device, as a shift translates directly to a misalignment of the data bus.

[Figure 4.1: The basic CORDIC unit. Cell i takes $y_i$, $x_i$, $z_i$ and $\alpha_i$ and produces $y_{i+1}$, $x_{i+1}$, $z_{i+1}$.]

This unit, together with a suitable FSM and registers, could form a word-serial structure. A word-parallel implementation can be obtained by linking n CORDIC units.

The rest of this chapter deals with the development of a Word-Parallel unit and the interconnection of these devices using the VHDL language. It should be a relatively trivial task, but unfortunately the Viewlogic VHDL Synthesiser has many bugs and supports only a subset of the full VHDL standard.

The main aim of the project was to describe a CORDIC processor using the VHDL language and to allow the application designer to change the size of the structure easily. This
  • 23. flexibility could include fundamental changes such as variable datapath widths and a variable number of stages. Other options such as rounding intermediate nodes and pipelining could also be easily integrated.

Currently, Viewlogic's VHDL is a partial implementation of the 1987 IEEE Standard VHDL, and many constructs are missing from their implementation. Most of the useful constructs are present, but they are accompanied by ambiguous messages stating that they work only partially. This made it very difficult to work with.

4.2 VHDL Describes Structure and Behaviour

VHDL has the ability to describe a design in two ways: in terms of its component structure, or in terms of the behavioural functionality of the design, with the possibility of integrating the two streams. A requirement for structural descriptions is that the lowest level description is behavioural, to ensure portability between different synthesis libraries. An example of a lowest level operator is the logical operator AND (behavioural), used to describe the ANDing of two operands. This may be synthesised as an AND standard cell from the library. Consequently, there is no way of directly accessing a component from a cell library, which would limit portability.

Consider a slightly more complex design of an n-bit adder/subtracter, which could be described by the following behavioural description:

    addsub : PROCESS(a,b,sel)
      VARIABLE res : VLBIT_VECTOR(n DOWNTO 0);
    BEGIN
      res := zero(n DOWNTO 0);      -- needs to be initialised
      IF sel = '1' THEN
        res := add2c(a,b);
      ELSE
        res := sub2c(a,b);
      END IF;
      s <= res(n-1 downto 0);       -- discard cout
    END PROCESS;

The process activates when one of the signals in the sensitivity list changes, and then produces a result in the internal variable res. The signal s is assigned the lower portion of the sum.
Now consider a structural description of the same adder/subtracter, where several components are used:

    c(0) <= sel;                    -- carry in
    connect: FOR i IN 0 TO n-1 GENERATE
      invert:      invf101 PORT MAP( b(i), b_bar(i) );
      mux_b_b_bar: muxf201 PORT MAP( b_bar(i), b(i), sel, b_hat(i) );
      addsub:      faf001  PORT MAP( a(i), b_hat(i), c(i), s(i), c(i+1) );
    END GENERATE;

  • 24. Note that the muxf201 component is used to select between the non-inverted and inverted signals of the b bus. The components are user defined entities describing the appropriate logic gates. For example, a fragment of the faf001 component contains the following lowest level behavioural description:

    SUM <= A1 xor B1 xor CIN2;
    CO  <= (A1 and B1) or (A1 and CIN2) or (B1 and CIN2);

It is not immediately obvious which way a designer should describe a particular design; however, the next section shows the results of the synthesiser, on which a decision may be based. In general, the easier it is for a designer to write a design in VHDL, the more optimisation the synthesiser needs to perform.

4.2.1 Hierarchical vs Flat Designs

One of the very useful features of Viewlogic's VHDL Synthesiser [13] is the ability to create either a hierarchical (top-down) or a flat (bottom-up) design. A hierarchical design allows the engineer to see lower level interconnections between design units, unlike the flat design, where no (or little) hierarchy can be seen. This allows easier debugging of designs; however, it has the disadvantage of being less efficient than a flat design, which combines all the design elements into one circuit and then performs optimisation.

Figure (4.2) illustrates the previous structural design of the Adder/Subtracter, where it can be observed that the schematic consists of higher level components than standard library cells. This feature of Viewlogic VHDL enables easy debugging of high level components when compared to a flat design. It is relatively simple to navigate between levels in a design.
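The structural adder/subtracter of Section 4.2 can be modelled bit by bit to check that a chain of full adders with a selectable inverted b input behaves as expected. This sketch makes an assumption the text does not state: that sel = 1 selects the inverted b (subtraction), which is what a carry-in of sel implies for two's-complement arithmetic. The function names are hypothetical.

```python
def full_adder(a1, b1, cin2):
    """Full adder using the SUM and CO equations of the faf001 fragment."""
    s = a1 ^ b1 ^ cin2
    co = (a1 & b1) | (a1 & cin2) | (b1 & cin2)
    return s, co

def addsub_ripple(a, b, sel, n):
    """n-bit ripple-carry adder/subtracter; the final carry is discarded.

    Assumption: sel = 0 adds, sel = 1 subtracts (b inverted, carry-in 1).
    """
    carry = sel                       # c(0) <= sel
    result = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        b_hat = b_i ^ sel             # mux between b(i) and b_bar(i)
        s_i, carry = full_adder(a_i, b_hat, carry)
        result |= s_i << i
    return result

print(addsub_ripple(5, 3, 0, 4))      # 5 + 3
print(addsub_ripple(5, 3, 1, 4))      # 5 - 3
```

Results are n-bit patterns, so a negative difference such as 3 - 5 appears as its two's-complement value modulo $2^n$.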
However, most libraries contain standard cells for full adders, muxes, and inverters, but since VHDL does not allow direct access to library cells, these components had to be described behaviourally. A mux simply maps to an IF statement; however, no behavioural description will map to the full adder cell, so the description stated previously must be used.

Compiling the same design using the flat (bottom-up) design approach, the synthesiser produces the following statistics when using, for example, the X2000 library. The schematic generated by the synthesiser is shown in Figure (4.3).

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------
    X2000:NAND2      15       0.25    X2000:OR2         3       0.25
    X2000:XOR2       15       0.25
    ----------------------------------------------------------------
    Total Cells : 33            Total Area : 8.25
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 14
    Total number of nets = 42

  • 25. [Figure 4.2: A Hierarchical Design of the Adder/Subtracter for n = 4. Schematic built from INVF101, MUXF201 and FAF001 components with inputs A0 to A3, B0 to B3, SEL and outputs S0 to S3.]

[Figure 4.3: A Flat Design of the Adder/Subtracter for n = 4. Schematic built from NAND2, OR2 and XOR2 cells.]

Reconsidering the behavioural description of the Adder/Subtracter and synthesising the design, the following statistics are generated; the corresponding schematic is shown in Figure (4.4).

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------
    X2000:AND2       21       0.25    X2000:AND3        1       0.50
    X2000:INV        11       0.00    X2000:NAND2       8       0.25
    X2000:OR2        17       0.25    X2000:XOR2        3       0.25
    ----------------------------------------------------------------
    Total Cells : 61            Total Area : 12.75
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 11
    Total number of nets = 70

  • 26. [Figure 4.4: A Behavioural Design of the Adder/Subtracter for n = 4. Schematic built from AND2, AND3, INV, NAND2, OR2 and XOR2 cells.]

From the statistics of each design, it is important to note that the total area and the maximum level of gates differ. The structural description produces a small but slow design when compared to the behavioural description, which produces a fast but large design.

A characteristic of the synthesiser is that a behavioural description maps to a structure by representing each output in terms of its inputs, much like a lookup table, which removes any structure. The synthesiser performs logic level optimisation on the structural description and thus produces a design with less logic.

4.2.2 The Viewlogic Synthesiser

The Viewlogic Synthesiser has the ability to alter the emphasis on speed or area when optimising a design. The statistics generated in the previous section were area optimised,
  • 27. and neglected the effect of gate delays. For example, when optimising the behavioural design for speed, the synthesiser generates 14 more gates than before; however, there is a significant decrease in the maximum level of gates:

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------
    X2000:AND2       10       0.25    X2000:AND3        2       0.50
    X2000:AND4        1       0.75    X2000:INV        15       0.00
    X2000:NAND2      17       0.25    X2000:NAND3       1       0.50
    X2000:NAND4       1       0.75    X2000:NOR3        2       0.50
    X2000:NOR4        2       0.75    X2000:OR2        22       0.25
    X2000:OR4         1       0.75    X2000:XOR2        1       0.25
    ----------------------------------------------------------------
    Total Cells : 75            Total Area : 18.75
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 9
    Total number of nets = 84

The synthesiser can optimise small designs, but when the design grows large, the memory and processing power required to optimise it are considerable. The CORDIC unit contains three adders/subtracters and takes several minutes to compile and optimise. When this unit is integrated into a larger design of several units, however, the compiler has many problems and eventually crashes after half an hour of compilation.

A way around this optimisation problem is to use a hierarchical flow and describe the components using behavioural or structural descriptions. Using this method the compiler knows nothing about large components and cannot perform any global optimisation. This is not a fully optimised solution, but it is currently the best solution. However, it is possible to flatten the design below the top level, making the design slightly more efficient.
4.3 VHDL Design of the CORDIC Unit

The first stage of the design of a CORDIC processor is to create the CORDIC unit, where two approaches can be taken: a behavioural description or a structural description. Firstly, consider the following behavioural description, where the shifting of $(x_i, y_i)$ is done external to the CORDIC unit in the top level design. This approach is optimal, since it only requires a misalignment of the data buses in the top level interconnections. If the shifting were contained inside the CORDIC unit, each unit would require a variable shifter and could not be optimised using the current version of Viewlogic VHDL, for reasons discussed previously. Another reason why shifting is done external to the CORDIC unit
  • 28. is that the LOOP variable inside the generate statement cannot be passed to any user defined function, procedure or entity. This is not stated in the manual, and the problem took many days to identify. The behavioural description is as follows:

    ARCHITECTURE behaviour OF adder IS
    begin
      cell_i : process (xi,xs,yi,ys,zi,ai)
        VARIABLE x_res: vlbit_vector(n downto 0);   -- temporary results
        VARIABLE y_res: vlbit_vector(n downto 0);
        VARIABLE z_res: vlbit_vector(k downto 0);
      begin
        x_res := zero(n downto 0);   -- initialise, unless comp complains
        y_res := zero(n downto 0);
        z_res := zero(k downto 0);
        if zi(k-1) = '0' then        -- z_i is positive
          x_res := add2c (xi, ys);
          y_res := sub2c (yi, xs);
          z_res := sub2c (zi, ai);
        else                         -- z_i is negative
          x_res := sub2c (xi, ys);
          y_res := add2c (yi, xs);
          z_res := add2c (zi, ai);
        end if;
        xip1 <= x_res (n-1 downto 0);
        yip1 <= y_res (n-1 downto 0);
        zip1 <= z_res (e-1 downto 0);
      end process;
    END behaviour;

The synthesiser generates the following statistics for an 8 bit version of the code. The maximum level of gates is 20, since each bit requires 2 levels, plus additional gates for the multiplexer and inversion.

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------
    X2000:AND2      159       0.25    X2000:AND3        3       0.50
    X2000:INV        69       0.00    X2000:NAND2      76       0.25
    X2000:OR2       125       0.25    X2000:XOR2        7       0.25
    ----------------------------------------------------------------
    Total Cells : 439           Total Area : 93.25
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 20
    Total number of nets = 487

  • 29. The structural description of the CORDIC unit is slightly more complex and is best represented pictorially, as shown in Figure (4.5). Each box in the figure represents a different VHDL entity (component), and some components are used more than once. The design is bulky and it is easy to make mistakes.

[Figure 4.5: The structure of the CORDIC unit showing the various entities. adders.vhd contains three adder/subtracters: addsub_e.vhd (inputs zi, ai; output zip1) and two instances of addsub_n.vhd (inputs xi, ys; output xip1, and inputs yi, xs; output yip1). Each is built from an inverter (inv101.vhd), a 2-to-1 mux (muxf201.vhd) and a full adder (faf001.vhd).]

It achieves the same functionality as the behavioural description but requires much more effort to make sure that all the connections are correct. As stated previously, the structural design minimises area but results in a slower design, as reflected by the following synthesiser statistics.
  • 30.
    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------
    X2000:INV         3       0.00    X2000:NAND2     139       0.25
    X2000:OR2        41       0.25    X2000:XOR2       75       0.25
    ----------------------------------------------------------------
    Total Cells : 258           Total Area : 63.75
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 31
    Total number of nets = 306

Using the structural design saves about 30% on area but executes about 50% slower. In an FPGA implementation speed might be more desirable than area optimisation, since these devices operate relatively slowly when compared to a custom VLSI device. A 30% increase in the number of gates is a relatively small concern.

4.3.1 The Rounding Unit

The rounding unit is formed by the interconnection of n half adders or, in behavioural terms, by the addition of the bit shifted out during the shifting process. Describing it structurally involves using the inc001 component, which contains an AND and an XOR gate to form a half adder. The interconnection of the inc001 components is:

    c(0) <= cin;                    -- first carry
    connect: for i in 0 to n-1 generate
      addsub: inc001 port map( a(i), c(i), s(i), c(i+1) );
    end generate;

Alternatively, a much simpler behavioural description is created using the unsigned addition routine addum. This avoids the sign extension used in the add2c routine.

    rounder : process (a,cin)
      VARIABLE res: vlbit_vector(n downto 0);   -- temporary results
    begin
      res := zero(n downto 0);   -- initialise, unless comp complains
      res := addum(a,cin);       -- use addum instead of add2c as it sign
                                 -- extends the cin input making it -1 not +1
      s <= res (n-1 downto 0);
    end process;
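The comment in the rounder about sign extension can be illustrated with a small integer model. A 1 bit operand holding '1' is -1 when read as a two's-complement number, so a signed add subtracts one instead of adding one. The sketch below only models that behaviour; it is not the Viewlogic add2c or addum code, and all function names are hypothetical.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return value - (1 << width) if value & sign_bit else value

def wrap(value, n):
    """Keep the low n bits, as when the carry out is discarded."""
    return value & ((1 << n) - 1)

def round_with_signed_add(a, cin, n):
    """cin is sign-extended: a 1 bit '1' becomes -1."""
    return wrap(a + to_signed(cin, 1), n)

def round_with_unsigned_add(a, cin, n):
    """cin is zero-extended: a 1 bit '1' stays +1."""
    return wrap(a + cin, n)

print(round_with_signed_add(0x16, 1, 8), round_with_unsigned_add(0x16, 1, 8))
```

For a shifted value of 0x16 with a shifted-out bit of 1, the signed model yields 0x15 and the unsigned model yields the intended 0x17.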
  • 31. 4.4 Combining the CORDIC Units

The process of combining the CORDIC and Rounding units involves writing the top level design of the hierarchical solution. As before with structural descriptions, the generate statement is used, which allows iterative or conditional generation of a portion of the description.

The first definition to be made in the top level file is the alpha_i constants; this version implements the Alpha Normalisation Scheme. Next, the x, y, z intermediate signals between CORDIC units are shifted by the appropriate amount. The function shift_all is defined in another file containing user defined functions. This operation is required here because it will not work inside the generate statement: concurrent procedure calls only execute when a signal in the sensitivity list changes state, and a change in the shift value is not recognised inside the generate statement.

    -- Scaled a_i * 2^i values are decimal 45 53 56 57 57 57 57 57
    ai <= X"39_39_39_39_39_38_35_2D";
    sh_x: xis <= shift_all(xi);     -- shift intermediate signals
    sh_y: yis <= shift_all(yi);
    sh_z: zis <= shift_z(zi);

It should be noted that the signals xis, yis, zis, xi, yi, and zi are large vectors containing several smaller vectors. This system had to be used since Viewlogic's VHDL cannot handle two-dimensional arrays of vlbit.
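The packed constant assigned to ai can be unpacked and compared with the scaled angles it is said to hold. This sketch assumes that stage 0 occupies the least significant byte (matching the order of the comment and of the slices used for the intermediate vectors) and that each entry is $2^i \tan^{-1}(2^{-i})$ in degrees, rounded to the nearest integer; `unpack` and `scaled_alpha_degrees` are hypothetical names.

```python
import math

AI = 0x39_39_39_39_39_38_35_2D   # the packed ai constant from the listing

def unpack(packed, width, count):
    """Split one wide vector into `count` fields of `width` bits, LSB first."""
    mask = (1 << width) - 1
    return [(packed >> (width * i)) & mask for i in range(count)]

def scaled_alpha_degrees(i):
    """2^i * atan(2^-i) in degrees, rounded to the nearest integer."""
    return round(2 ** i * math.degrees(math.atan(2.0 ** -i)))

print(unpack(AI, 8, 8))
print([scaled_alpha_degrees(i) for i in range(8)])
```

Both lines give the sequence 45 53 56 57 57 57 57 57 quoted in the comment, and the same slicing idea is how one large vector can stand in for a two-dimensional array.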
The shifting of intermediate signals is done by the following function:

    FUNCTION shift_all (x : vlbit_vector (n*(k-1)-1 downto 0))
      RETURN vlbit_vector IS
      VARIABLE x_s : vlbit_vector(n*(k-1)-1 downto 0)
                     := zero(n*(k-1)-1 downto 0);
    BEGIN
      x_s(1*n-1 downto 0)   := shiftr2c(x( 1*n-1 downto 0   ),1);  -- 2 stage
      x_s(2*n-1 downto 1*n) := shiftr2c(x( 2*n-1 downto 1*n ),2);  -- 3 stage
      x_s(3*n-1 downto 2*n) := shiftr2c(x( 3*n-1 downto 2*n ),3);  -- 4 stage
      x_s(4*n-1 downto 3*n) := shiftr2c(x( 4*n-1 downto 3*n ),4);  -- 5 stage
      x_s(5*n-1 downto 4*n) := shiftr2c(x( 5*n-1 downto 4*n ),5);  -- 6 stage
      x_s(6*n-1 downto 5*n) := shiftr2c(x( 6*n-1 downto 5*n ),6);  -- 7 stage
      x_s(7*n-1 downto 6*n) := shiftr2c(x( 7*n-1 downto 6*n ),7);  -- 8 stage
      x_s(8*n-1 downto 7*n) := shiftr2c(x( 8*n-1 downto 7*n ),8);  -- 9 stage
      x_s(9*n-1 downto 8*n) := shiftr2c(x( 9*n-1 downto 8*n ),9);  -- 10 stage
      return x_s;
    END shift_all;

Next comes the connection of the init component, which is used to expand the convergence range of the CORDIC processor to -190° < z < 190°. The input signals x_in, y_in and z_in are connected to a unit similar to the CORDIC unit, except that an extra bit is appended to the alpha bus to account for the expanded convergence range.
  • 32.
    initial: init port map(xi   => X"00",
                           xs   => x_in,
                           yi   => X"00",
                           ys   => y_in,
                           zi   => z_in,
                           ai   => B"0_0101_1010",   -- add/sub 90 degrees
                           xip1 => xinit,            -- xinit = 0 +- yin
                           yip1 => yinit,            -- yinit = 0 -+ xin
                           zip1 => zinit );

The following code has been compressed to reduce detail; however, it can be seen that there are three separate stages: initial connection, intermediate connections, and final connection. This can be seen in Figure (4.6). (Also not shown is the conditional generation of components, e.g., selection of behavioural or structural components, rounding units, etc.)

    connect: for i in 0 to k-1 generate        -- k stages
      ls_unit: if i=0 generate
        first_unit: adder port map( ... );
      end generate ls_unit;
      i_unit: if i>0 and i<k-1 generate
        x_round: round port map ( ... );
        y_round: round port map ( ... );
        middle_units: adder port map( ... );
      end generate i_unit;
      ms_unit: if i=k-1 generate
        x_round_last: round port map ( ... );
        y_round_last: round port map ( ... );
        last_unit: adder port map( ... );
      end generate ms_unit;
    end generate connect;

The contents of ... are similar to the port map of the init component.

4.4.1 A Solution

This represents a solution to the CORDIC problem and is close to an optimised solution, but due to compiler and language difficulties a completely optimised solution is not possible. Under these circumstances the design has been optimised as far as possible. There are many choices to be made about the design of the CORDIC unit, chiefly whether it is to be area efficient or speed efficient.