Systolic, Transposed & Semi-Parallel Architectures and Programming

                                        Sandip Jassar
                          http://www.linkedin.com/in/sandipjassar




                                          Abstract

The critical routines within signal processing algorithms are typically data-intensive and
iterate many times, carrying out the same functions on multi-dimensional arrays and input
streams. These routines use the vast majority of an implementation's resources, and for the
longest time over the course of the algorithm's execution. Such routines are classed as
Nested-Loop Programs. Whether the implementation is software running on a processor or
a higher-performance customized hardware implementation, there is always a trade-off
between the throughput performance (execution time) and the resources used, such as the
amount of processor memory and registers in a software implementation or the gate-count
in a hardware implementation. This article shows the reader techniques and methods for
manipulating a signal processing algorithm, in particular one conforming to a Nested-Loop
Program; presents the different generic implementation architectures along this trade-off
spectrum; and describes the different methods for describing and analyzing an algorithm
implementation. Each of these methods for describing an algorithm is shown to correspond
to a different abstraction-level view, and as such exposes different features and properties for
ease of analysis and manipulation. Manipulation techniques, mainly algorithmic
transformations, are described, and it is shown how these transformations take an
implementation and form a new one at a different point in the trade-off spectrum of
resources used versus throughput, by performing calculations in a different order and a
different way to arrive at the same result.



                                           Contents


            1 Loop Optimization
               1.1 The Unrolling Transformation
               1.2 The Skewing Transformation

            2 The three abstraction levels for representing Signal Processing Algorithms
               2.1 Code Representation
               2.2 Data-dependency Graph
               2.3 Hardware Implementation

            3 Parallel implementations of a Signal Processing algorithm
               3.1 The Transposed FIR Filter
               3.2 The Systolic FIR Filter
               3.3 The Semi-Parallel FIR Filter

            Conclusion
1 Loop Optimization

The Nested-Loop Program (NLP) form of a signal processing algorithm represents what is
known as its sequential model of computation. The Finite Impulse Response (FIR) filter is the
most well-known and most widely used NLP in the field of signal processing, and as such is
used throughout this article as an example of a simple NLP. The C-code description of the FIR
filter for calculating a given number of output values is shown below in figure 1.



                           for( j = 0; j < num_of_samples_to_calculate; j++ )
                           {
                                for( i = 0; i < N; i++ )
                                {
                                     y[j] = y[j] + ( x[j - i] * h[i] );
                                }
                           }

                 Figure 1 : C-code description of the NLP form of the FIR filter operation


An algorithm's sequential model of computation represents the way in which the algorithm
would be executed as software running on a standard single-threaded processor, hence the
name sequential model of computation.



1.1 The Unrolling Transformation

The algorithmic transformation of unrolling an NLP acts to enhance its task-level parallelism,
whereby tasks that are independent of each other are explicitly shown to be so. A task in
this context refers to a loop-iteration, and such tasks are said to be mutually independent.
After the transformation has been applied the resulting algorithm is different,
although it remains functionally equivalent to the original. The independent tasks can then be
mapped to separate processors/execution-units (H/W resources), and as such it is said
that unrolling allows for the use of spatial parallelism. Spatial parallelism results in greater
throughput, as loop-iterations are processed concurrently, although this comes at the expense
of multiplying the hardware required (mainly processors and the inter-processor registers that
transfer data between loop-iterations) by the unrolling_factor, which is the number of
concurrent loops that result from the unrolling.
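
As a minimal sketch (not taken from the original figures), the figure 1 NLP is unrolled below
with an assumed unrolling_factor of 2: the two accumulations inside the inner-loop body
belong to independent outer-loop iterations and so could be mapped to two separate
execution-units.

                           // Hypothetical sketch: the figure 1 outer-loop unrolled by an
                           // unrolling_factor of 2. The two statements in the inner-loop body
                           // are mutually independent. Assumes num_of_samples_to_calculate
                           // is even.
                           for( j = 0; j < num_of_samples_to_calculate; j += 2 )
                           {
                                for( i = 0; i < N; i++ )
                                {
                                     y[j]   = y[j]   + ( x[j - i]     * h[i] );  // iteration j
                                     y[j+1] = y[j+1] + ( x[(j+1) - i] * h[i] );  // iteration j+1
                                }
                           }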


1.2 The Skewing Transformation

The algorithmic transformation of skewing exposes the allowable latency between dependent
operations (in different loop-iterations) so that their execution can be scheduled apart on
the H/W processors/execution-units, or in the S/W processing threads, that will execute
them. Where a nested (inner) loop iterates over a multi-dimensional array, in the NLP's
sequential model of computation each iteration of the loop is dependent on all previous
iterations. The skewing transformation re-schedules the instructions such that dependencies
only exist between iterations of the outer-loop. The skewing transformation thus allows
temporal parallelism (pipelining) to be employed in the implementation, where the executions
of successive outer-loop iterations are time-overlapped; as such, a single set of registers
is shared by the corresponding inner-loop iterations, as their executions are carried out
using the same resources. This practice is also referred to as overlapped scheduling.
Overall, the unrolling and skewing transformations can be used to transform a sequential
model of computation into something closer to a data-flow model, which exploits the
algorithm's inherent parallelism and thus can be used to yield a more efficient
implementation. Dependent operations can also be scheduled apart by simply changing the
order in which operations are executed. The simplest example of this is when the inner and
outer loop indices are swapped, as sketched below.
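
A minimal sketch of this re-ordering, using the figure 1 variables: with the loop indices
swapped, the dependent accumulations into any given y[j] are a full pass of the new
inner-loop apart rather than adjacent, so they are naturally scheduled apart.

                           // Hypothetical sketch: the figure 1 loops with their indices
                           // swapped. Assumes every y[j] is zero-initialized before the
                           // loops run.
                           for( i = 0; i < N; i++ )
                           {
                                for( j = 0; j < num_of_samples_to_calculate; j++ )
                                {
                                     y[j] = y[j] + ( x[j - i] * h[i] );
                                }
                           }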



2 The three abstraction levels for representing Signal Processing Algorithms

The following section shows how to represent a signal processing algorithm in terms of its
Code description, its Data-dependency graph and its Hardware implementation. Code shows
how an algorithm would be executed on a processor (its Software implementation) and is used
to describe the algorithm from the highest abstraction level; the Hardware implementation
represents the lowest abstraction level. The reader will learn how to transfer the description
of an algorithm between these abstraction levels, and also gain an appreciation of which
characteristics of an algorithm are better exposed for ease of manipulation at a particular
abstraction level. The example algorithm used throughout this section is the FIR (Finite
Impulse Response) filter in its basic sequential form, as the FIR filter is very simple and widely
used in the field of signal processing.

The FIR filter operation essentially carries out a vector dot-product in calculating each value of
y[n]. This is illustrated below in figure 2 for N = 4, where N is the width of the filter in time
samples (the number of filter taps).
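
Written out, each output sample is the dot-product of the coefficient vector with the N most
recent input samples:

        y[n] = \sum_{i=0}^{N-1} h[i] \, x[n-i]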



   [Figure: the four most recent input samples x[n], x[n-1], x[n-2], x[n-3] are each multiplied
   by the corresponding coefficient h[0] to h[3] and summed to give
   x[n].h[0] + x[n-1].h[1] + x[n-2].h[2] + x[n-3].h[3]]

  Figure 2 : showing how the FIR filter operation is comprised of a vector dot-product for the calculation of each
                                               output-sample value




2.1 Code representation

With reference to the C-code description of the Sequential FIR filter shown below in figure 3,
all of the required input samples are assumed to be stored in the register file (with a stride of
1) of the processor that would theoretically be executing the code, with x initially pointing to
the input sample x[n] corresponding to the first output sample to be calculated, y[n]. It is
also assumed that all of the required coefficients are stored in the same way, in a group of
registers used to store the h[ ] array.
void sequentialFirFilter( int num_taps, int num_samples, const float *x, const float h[ ], float *y )
{
         int i, j;        // 'j' is the outer-loop counter and 'i' is the inner-loop index
         float y_accum;   // output sample is accumulated into 'y_accum'
         const float *k;  // pointer to the required input sample

         for( j = 0; j < num_samples; j++ )
         {
                  k = x++;       // x points to x[n+j] and is incremented (post assignment) to point to
                                 // x[(n+j)+1]
                  y_accum = 0.0;

                  for( i = 0; i < num_taps; i++ )
                  {
                           y_accum += h[i] * *(k--);     // y[n+j] += h[i] * x[(n+j) - i]
                  }

                  *y++ = y_accum; // y points to the address where y[n+j] is to be written and is
                                  // incremented (post assignment) to point to the address where
         }                        // the next output sample y[(n+j)+1] is to be written
}

                       Figure 3 : A C-code description of the FIR filter’s sequential model of computation




2.2 Data-dependency Graph

The code description of an algorithm can easily be translated into its data-dependency graph;
for the Sequential FIR filter this is shown below in figure 4, which is the data-dependency
graph translation of the code in figure 3 (with num_samples = 4). With reference to figure 4
below, the contents of a vertical bar represent an iteration of the outer-loop, where successive
output-samples are calculated in successive bars going from left to right.


   [Figure: four vertical bars, one per outer-loop iteration j = 0 to 3, each containing a
   top-to-bottom chain of four MACC_1 operators, one per inner-loop iteration i = 0 to 3.
   The operator at row i of bar j takes h(i) and x[(n+j)-i] as inputs and passes y_accum down
   the chain, so that bar j produces
   y[n+j] = h(0).x[n+j] + h(1).x[n+j-1] + h(2).x[n+j-2] + h(3).x[n+j-3]]
             Figure 4 : Data-dependency graph showing the operation of the Sequential FIR filter with N=4
The circles in each bar of figure 4 above represent the instances of the physical operator
that carries out each iteration of the inner-loop, where successive iterations are represented
from top to bottom. The name given to this physical operator for the Sequential FIR filter is
MACC_1, as a multiply-accumulate is the only instruction (in S/W) or operation (in H/W) that
is carried out. With the FIR filter's sequential model of computation, all
instructions/operations in the inner-loop are dependent on all of the previous ones, and the
outer-loop is carried out one iteration at a time, so only one instruction/operation can be
scheduled for execution at any one time and thus only one execution-unit can be used.



2.3 Hardware Implementation

As can be seen above in figure 4, each MACC instruction/operation uses two new input
values and its output result is used as an input to the next MACC instruction/operation. The
hardware implementation of the Sequential FIR filter is shown below in figure 5.




   [Figure: a single MACC_1 execution-unit. The Input Data Samples and Coefficients feed a
   multiplier, whose output feeds an adder; the registered adder output y_accum is fed back to
   the adder's second input and driven out as y_out once per output sample.]
                       Figure 5 : showing an example H/W implementation of the Sequential FIR filter



Since each MACC up until the last one in each outer-loop iteration feeds its result forward
to the input of the next one, instead of writing each MACC result to the register file and
then reading it back before the start of the next MACC, the hardware implementation has a
feedback loop from the output of the adder to its second input. That this optimization is
possible can be seen in the code description, although it is shown more explicitly in the
data-dependency graph and most clearly in the hardware implementation.
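
A behavioural C sketch of the figure 5 datapath is given below, with hypothetical names; it
models one output-sample calculation on the single MACC_1 execution-unit, with y_accum
acting as the feedback register from the adder output back to its second input.

         // Hypothetical behavioural model of the figure 5 datapath (not from the
         // original article). 'x_newest' points at x[n]; older samples sit at
         // negative offsets, as in figure 3.
         float macc_unit( const float *x_newest, const float *h, int num_taps )
         {
                  float y_accum = 0.0;   // the feedback register, cleared per output sample
                  for( int i = 0; i < num_taps; i++ )
                  {
                           y_accum = y_accum + ( h[i] * *(x_newest - i) );  // multiply-accumulate
                  }
                  return y_accum;        // driven out as y_out
         }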




3 Parallel implementations of a Signal Processing algorithm


The primary trade-off between sequential and parallel implementations of the same algorithm
is the amount of hardware resources, or of S/W processing threads (i.e. a more complex
processing architecture which is harder to write code for), required versus the throughput
achieved. As the Sequential FIR filter implements the FIR filter function in a completely
sequential manner, the required resources are reduced by a factor of N compared to a fully
parallel implementation that would use one execution-unit for each of the N coefficients
(where the N MACCs would be cascaded), although so too is the throughput.

As can be seen from figure 3 above, and more clearly from the data-dependency graph
shown above in figure 4, the Sequential FIR filter evaluates (accumulates) only one output
value at a time. Assuming that each multiplication of x[(n+j)-i] and h[i] takes one clock cycle,
the performance of this implementation is given by the following equation:

Throughput = Clock frequency ÷ Number of coefficients
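
For example, with a hypothetical 100 MHz clock and N = 16 coefficients, the Sequential FIR
filter would sustain 100 MHz ÷ 16 = 6.25 million output samples per second.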

The simplest fully-parallel FIR filter implementation is the Direct form type 1. This is formed by
completely unrolling the inner-loop and mapping each of the unrolled instructions/operations
to separate execution-units. The problem with this implementation is that the outputs of the
execution-units (the MACC outputs) are all summed through an adder-chain, where the
resulting latency (and gate-count in a H/W implementation) means that the Direct form type 1
FIR filter does not scale very well for large problem sizes (N). The other reason why the Direct
form type 1 FIR filter is not used in signal processing is that it requires N input values to be
read in for each output value to be calculated, a design feature which does not scale very well
for large N due to the greater latency and possibly (depending on the implementation) the
higher number of memory read ports.
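
For reference, a minimal sketch of the Direct form type 1 body for an assumed N = 4 is shown
below, using the figure 1 variables; the four multiplies are independent and could map to four
execution-units, but their products must still ripple through the adder-chain on the final line.

         // Hypothetical Direct form type 1 sketch (N = 4): all N input values are
         // read per output value, and the products are summed through an adder-chain
         // whose latency grows with N.
         for( j = 0; j < num_of_samples_to_calculate; j++ )
         {
                  float p0 = x[j]     * h[0];   // the four multiplies are independent
                  float p1 = x[j - 1] * h[1];   // and can execute concurrently
                  float p2 = x[j - 2] * h[2];
                  float p3 = x[j - 3] * h[3];
                  y[j] = ( ( p0 + p1 ) + p2 ) + p3;   // the adder-chain
         }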



3.1 The Transposed FIR filter


The Transposed FIR filter is also a fully-parallel implementation. It is formed by first
completely unrolling the inner-loop and then skewing the outer-loop in the reverse direction by
a factor of 1. (Skewing the outer-loop in the reverse direction is the same as reversing the
order of the outer-loop indices and skewing by the same factor in the forward direction.)
This manipulation of the Sequential FIR filter is most easily done using a data-dependency
graph, where the scheduling of MACC instructions/operations and their dependencies is
explicitly shown. The data-dependency graph for the Transposed FIR filter is shown below in figure 6.
   [Figure: the unrolled and reverse-skewed schedule. The execution-units MACC_1 to MACC_4
   hold the fixed coefficients h(3), h(2), h(1) and h(0) respectively; in each time-step the
   current input sample is broadcast to all four units, and the partial sums (y_accum0 = 0
   through y_accum3) advance one unit per time-step, so that column j completes
   y[n+j] = h(0).x[n+j] + h(1).x[n+j-1] + h(2).x[n+j-2] + h(3).x[n+j-3] at MACC_4. The purple
   circles mark the spin-up operations that fill the pipeline.]
    Figure 6 : dependency graph showing the operation of the Transposed form of the FIR filter (with N = 4)

As can be seen above in figure 6, the circles are now labeled MACC_1 to MACC_4, where
each of these is assigned the same filter coefficient (i.e. one of the inner-loop iterations)
throughout the execution of the algorithm. The purple circles represent MACC
instructions/operations that occur during the spin-up process, which is the process of filling
the pipeline. The pipeline can consist either of N separate execution-units, or of one execution-
unit which itself consists of N separate stages that have the same latency.

The skewing of the outer-loop results in temporal overlapping of successive output sample
calculations. This temporal overlapping schedules apart the dependent MACC
instructions/operations within a single iteration of the outer-loop, and this provides several
advantages over the Direct form type 1 implementation.

Once the spin-up procedure is complete, only one input data value is read per output value
calculated. In the Transposed implementation this input value is then simultaneously
broadcast to all of the N execution-units. These data values are read in time-order, meaning
that the input can be streamed in; this lends itself very well to signal processing, where
the objective is to process large amounts of data quickly.
The other advantage over the Direct form type 1 is that the additions that form each output
value are distributed across the register-pipelined stages, one addition per MACC cycle, rather
than rippling through a long combinational adder-chain; as such, the Transposed
implementation scales up very well for large and arbitrary values of N.

The disadvantage of the Transposed implementation is the latency of the spin-up and spin-
down (the opposite of spin-up) procedures, which are each N-1 cycles long. This latency is not
seen with the Direct form type 1 implementation; however, the more output values that are
calculated, the more this overhead is amortized. During the time the pipeline is full, the
throughput is equal to the frequency with which successive MACC instructions/operations
are executed, and this is equal to the throughput of the Direct form type 1 implementation.
Another disadvantage of the Transposed implementation is that the number of filter taps is
limited by the fan-out capability of the register driving the common input of all the execution-
units with the input data value.

The data-dependency graph of the Transposed FIR filter is translated into a code description
(higher abstraction) and a hardware implementation (lower abstraction) below in figures 7
and 8 respectively. In the C-code description of the Transposed implementation below in
figure 7, where an instruction ends with a comma rather than a semi-colon, it and all
subsequent instructions up to and including the first one ending in a semi-colon are to be
scheduled for parallel execution.


void transposedFirFilter( int num_samples, const float *x, const float h[ ], float *y )
{
         const int num_taps = 4; // value of N
         int j;                  // the only loop counter
         const float *k = x++;   // 'x' initially points to the input sample (x[n]) corresponding to the
                                 // first output sample to be calculated (y[n])
         float y_accum0 = 0, y_accum1 = 0, y_accum2 = 0, y_accum3 = 0; // variables used to store the
                                                                      // partial results at different
                                                                      // stages of the accumulation
         // Spin-up procedure : filling up the pipeline
         y_accum1 = ( h[3] * *(k-3) ) + y_accum0;

         y_accum1 = ( h[3] * *(k-2) ) + y_accum0,
                y_accum2 = ( h[2] * *(k-2) ) + y_accum1;

         y_accum1 = ( h[3] * *(k-1) ) + y_accum0,
                y_accum2 = ( h[2] * *(k-1) ) + y_accum1,
                       y_accum3 = ( h[1] * *(k-1) ) + y_accum2;

         // Steady state : where the pipeline is full
         for( j = 0; j < (num_samples - (num_taps-1)); j++ )
         {
                  // y points to the address where y[n+j] is to be written and is incremented (post
                  // assignment) to point to the address where the next output sample y[(n+j)+1]
                  // is to be written
                  *y++ = ( h[0] * *k ) + y_accum3,
                         y_accum3 = ( h[1] * *k ) + y_accum2,
                                y_accum2 = ( h[2] * *k ) + y_accum1,
                                       y_accum1 = ( h[3] * *k ) + y_accum0;

                  k = x++; // k is advanced to the next input sample x[(n+j)+1], post-incrementing 'x'
         }

         // Spin-down procedure : emptying the pipeline
         *y++ = ( h[0] * *k ) + y_accum3,
                y_accum3 = ( h[1] * *k ) + y_accum2,
                       y_accum2 = ( h[2] * *k ) + y_accum1;

         *y++ = ( h[0] * *(k+1) ) + y_accum3,
                y_accum3 = ( h[1] * *(k+1) ) + y_accum2;

         *y++ = ( h[0] * *(k+2) ) + y_accum3;
}

Figure 7 : A C-code description of the Transposed FIR filter which is a translation of the data-dependency graph of
                                                 figure 6 above




   [Figure: y_in enters the adder chain at y_accum0 = 0; the four cascaded execution-units
   MACC_1 to MACC_4 hold the coefficients h(3), h(2), h(1) and h(0) respectively; the input
   data value is broadcast to all four multipliers, and the registered partial sums y_accum1,
   y_accum2 and y_accum3 pass along the chain to y_out.]

    Figure 8 : showing the H/W implementation of the Transposed form of the FIR filter (with N = 4) which is a
                            translation of the data-dependency graph of figure 6 above



With reference to figure 8 above, the coefficients are assigned to the execution-units in
descending order going from the first input to the adder chain to its output. The significance
of this is discussed below in section 3.2.
3.2 The Systolic FIR filter


The Systolic FIR filter is another fully-parallel implementation, and is formed from the
Sequential FIR filter by completely unrolling the inner-loop and skewing the outer-loop in the
forward direction by a factor of 1. This manipulation of the Sequential FIR filter is depicted in
the data-dependency graph of the Systolic FIR filter shown below in figure 9.

   [Figure: the unrolled and forward-skewed schedule. The execution-units MACC_1 to MACC_4
   hold the fixed coefficients h(0), h(1), h(2) and h(3) respectively; both the input samples
   and the partial sums (y_accum0 = 0 through y_accum3) advance from unit to unit, so that the
   calculation of y[n+j] starts at MACC_1 with h(0).x[n+j] and completes N-1 time-steps later
   at MACC_4 with the addition of h(3).x[(n+j)-3]. The purple circles mark the spin-up
   operations that fill the pipeline.]

                 Figure 9 : dependency graph showing the operation of the Systolic form of the FIR filter
                                                      (with N=4)

As with the Transposed implementation, the Systolic implementation goes through a spin-up
procedure (where the pipeline is filled, represented above in figure 9 by the purple circles)
and a spin-down procedure (where the pipeline is emptied), and again both of these last for
N-1 cycles.

As with the Transposed implementation, when the pipeline of the Systolic FIR filter is full
there is one input data value read per output value calculated, and these input values are read
in time-order; as previously mentioned, these are the properties desired when the input is
streamed into the filter. With reference to figure 9 above it can be seen that the filter
coefficients are assigned to the execution-units in the opposite order to that of the
Transposed implementation (as the outer-loop in each case is skewed in opposite directions).
This means that there is a latency of N MACC cycles between an input data value being read
and the corresponding output value having finished being calculated, whereas for the Direct
form type 1 as well as the Transposed implementation this latency is only 1 MACC cycle. The
spin-up latency (the time for the first output value to emerge after starting the filter) and the
spin-down latency (the time after the last output value emerges before the filter is stopped)
are, however, the same for both the Transposed and Systolic implementations.

As previously explained in section 3.1, the Transposed implementation is not suitable for very
high values of N because of the way the same input data value is broadcast to all of the
MACCs. The Systolic implementation has no such limitation, and is thus more suitable than
the Transposed implementation for implementing a FIR filter with a large number of taps. This
however comes at the cost of an additional register per execution-unit; these registers
transfer input data values between adjacent execution-units with a latency of 1 MACC cycle.
In summary, depending on the implementation platform, there is a breakpoint in the size of N
below which it is preferable to use the Transposed implementation and above which it is
preferable to use the Systolic implementation.



void systolicFirFilter( int num_samples, const float *x, const float h[ ], float *y )
{
         const int num_taps = 4; // value of N
         int j;                  // the only loop counter
         const float *k = x++;   // 'x' initially points to the input sample (x[n]) corresponding to the
                                 // first output sample to be calculated (y[n])
         float y_accum0 = 0, y_accum1 = 0, y_accum2 = 0, y_accum3 = 0; // variables used to store the
                                                                      // partial results at different
                                                                      // stages of the accumulation
         // Spin-up procedure : filling up the pipeline
         y_accum1 = ( h[0] * *k ) + y_accum0;

         k = x++;
         y_accum2 = ( h[1] * *(k-2) ) + y_accum1,
                y_accum1 = ( h[0] * *k ) + y_accum0;

         k = x++;
         y_accum3 = ( h[2] * *(k-4) ) + y_accum2,
                y_accum2 = ( h[1] * *(k-2) ) + y_accum1,
                       y_accum1 = ( h[0] * *k ) + y_accum0;

         // Steady state : where the pipeline remains full
         for( j = 0; j < (num_samples - (num_taps-1)); j++ )
         {
                  // k is advanced to the newest input sample, post-incrementing 'x'
                  k = x++;
                  // y points to the address where y[n+j] is to be written and is incremented
                  // (post assignment) to point to the address where the next output sample
                  // y[(n+j)+1] is to be written
                  *y++ = ( h[3] * *(k-6) ) + y_accum3,
                         y_accum3 = ( h[2] * *(k-4) ) + y_accum2,
                                y_accum2 = ( h[1] * *(k-2) ) + y_accum1,
                                       y_accum1 = ( h[0] * *k ) + y_accum0;
         }

         // Spin-down procedure : emptying the pipeline
         k = x++;
         *y++ = ( h[3] * *(k-6) ) + y_accum3,
                y_accum3 = ( h[2] * *(k-4) ) + y_accum2,
                       y_accum2 = ( h[1] * *(k-2) ) + y_accum1;

         k = x++;
         *y++ = ( h[3] * *(k-6) ) + y_accum3,
                y_accum3 = ( h[2] * *(k-4) ) + y_accum2;

         k = x;
         *y++ = ( h[3] * *(k-6) ) + y_accum3;
}

    Figure 10 : A C-code description of the Systolic form of the FIR filter (with N = 4) which is a translation of the
                                      data-dependency graph of figure 9 above
   [Figure: y_in enters the adder chain at y_accum0 = 0; the four cascaded execution-units
   MACC_1 to MACC_4 hold the coefficients h(0), h(1), h(2) and h(3) respectively; the input
   data values pass from unit to unit through an extra register per stage, and the registered
   partial sums y_accum1, y_accum2 and y_accum3 pass along the chain to y_out.]

      Figure 11 : showing the H/W implementation of the Systolic form of the FIR filter (with N = 4) which is a
                             translation of the data-dependency graph of figure 9 above




3.3 The Semi-parallel FIR filter


The Semi-Parallel FIR filter implementation divides its N coefficients amongst M execution-
units. Like the Transposed and Systolic implementations, the Semi-Parallel implementation
employs both spatial and temporal parallelism, although it is not a fully-parallel implementation,
as the degree of parallelism employed depends on the M:N ratio. The Semi-parallel FIR is
formed from the Sequential FIR by first unrolling the inner-loop by a factor of M, so that its
number of iterations goes from N to N/M (the case where N/M is not an integer value is
discussed below). The inner-loop is then skewed in the forward direction by a factor of 1, such
that the N/M stages of calculating an output-sample value are overlapped in time on the M-
stage pipeline. Finally, the outer-loop is skewed in the forward direction by a factor of 1, such
that the calculations of successive output-sample values are overlapped in time. This
manipulation of the Sequential FIR is depicted below in the data-dependency graph of the
Semi-parallel implementation.

   [Figure: two outer-loop iterations (j = 0, 1), each divided into four inner-loop phases
   (i = 0 to 3). The four execution-units MACC_1 to MACC_4 each cycle through their own group
   of four coefficients (MACC_1: h(0) to h(3), MACC_2: h(4) to h(7), MACC_3: h(8) to h(11),
   MACC_4: h(12) to h(15)) while input samples shift between adjacent units, and the output-
   accumulator ACC_1 sums the four partial results, giving for example
   y[n] = h(0).x[n] + h(1).x[n-1] + ... + h(15).x[n-15].]

  Figure 12 : dependency graph showing the operation of the Semi-parallel form of the FIR filter (with N=16 and M=4)
The red circles represent MACC instructions/operations used to calculate the inner-
products of y[n], and the dark-red circles represent the output-accumulator being used to
accumulate the inner-products of y[n]. The blue and dark-blue circles represent the same for
y[n-1], whilst the yellow circles represent MACC instructions/operations used to calculate
the inner-products of y[n+1].

Each group of N/M coefficients is assigned to one of the execution-units and stored in order
within the associated coefficient-memory. The first group (coefficients 0 to (N/M – 1)) is
assigned to the top execution-unit (to which the input samples are applied), with ascending
coefficient groups being assigned to the execution-units from top to bottom (towards the
output-sample value accumulator). If N is not exactly integer-divisible by M, then the higher-
order coefficient-memories are padded with zeros.
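
As a concrete illustration, this coefficient partitioning and zero-padding could be expressed in C as follows (a minimal sketch; the helper name and bank layout are illustrative assumptions, not part of the filter code in figure 13 below):

           // Sketch : partition N coefficients into M banks of ceil(N/M) entries each,
           // padding the higher-order banks with zeros when N is not exactly divisible
           // by M. Bank 0 (coefficients 0 to N/M - 1) belongs to the top execution-unit.
           void partition_coefficients( int N, int M, const float *h, float **bank )
           {
               int bank_len = (N + M - 1) / M;                  // ceil(N/M)
               for( int m = 0; m < M; m++ )
                   for( int i = 0; i < bank_len; i++ )
                   {
                       int n = m * bank_len + i;                // global coefficient index
                       bank[m][i] = (n < N) ? h[n] : 0.0f;      // zero-pad the tail
                   }
           }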

As can be seen above in figure 12, the coefficient indices for each execution-unit lie in the range offset : offset + ((N/M) – 1), where the offset increases by N/M between successive execution-units going from the input to the output-accumulator. Figure 12 above also shows how the MACC instructions/operations in the inner-loop are rescheduled: once the M instructions are mapped to the M execution-units, the input-data stream can be kept in time-order and divided into M chunks, with successive chunks in time assigned to successive execution-units going from the input to the output-accumulator. As seen above in figure 12, this means that between successive output-value calculations only one input-data value is read in, and each execution-unit shifts its oldest input-data value into the adjacent execution-unit nearer the output-accumulator in the pipeline. This yields the most efficient time-overlapping of successive output-value calculations, as there are no wasted MACC cycles. This temporal parallelism is in turn necessary to achieve the Semi-parallel implementation's maximum throughput of one output sample every N/M MACC cycles, because once an output sample has been retrieved from the output-accumulator, the output-accumulator must be reset (to either zero or its input value). For its input value to be the first sum of M inner-products of the next output sample, the evaluation of this sum must have finished in the previous MACC cycle; otherwise the output-accumulator has to be reset to zero at the start of each new calculation-cycle (i.e. outer-loop iteration). If the accumulator were reset to zero between calculation-cycles in this way, the time-overhead of setting up a new output-value calculation would be seen at the output, thus degrading performance.
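
The accumulator behaviour just described can be modelled with a few lines of C (a minimal sketch assuming the reset-to-input-value scheme; the type and function names are illustrative):

           // Sketch : per-MACC-cycle behaviour of the output-accumulator (ACC_1). Each
           // cycle it absorbs one y_accum4 value (a sum of M products); every N/M cycles
           // a finished output sample is retrieved and the accumulator restarts from the
           // newly arrived y_accum4 rather than from zero, so no MACC cycle is wasted.
           typedef struct { float acc; int cycle; int started; int cycles_per_output; } acc1_t;

           int acc1_step( acc1_t *a, float y_accum4, float *y_out )  // returns 1 when *y_out is valid
           {
               int retrieve = (a->cycle == 0) && a->started;         // a full output is waiting
               if( retrieve )
                   *y_out = a->acc;                                  // retrieve the output sample
               // reset-to-input-value on the first sum of each output; accumulate otherwise
               a->acc = (a->cycle == 0) ? y_accum4 : a->acc + y_accum4;
               a->started = 1;
               a->cycle = (a->cycle + 1) % a->cycles_per_output;
               return retrieve;
           }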

A higher M:N ratio means the implementation is less folded and yields more parallelism both
spatially (as there are more execution-units employed) and temporally (as more inner-loop and
outer-loop execution threads are scheduled for parallel execution). The trade-off is thus the
performance obtained versus the resources required, as can be seen from the equation for the
performance of a Semi-parallel FIR filter implementation:

Throughput = ( Clock frequency ÷ N ) × M
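
For example, with N = 16, M = 4 and an illustrative MACC clock of 400 MHz, the throughput would be ( 400 MHz ÷ 16 ) × 4 = 100 Msamples/s, i.e. one output sample every N/M = 4 MACC cycles.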

The Semi-parallel implementation may be extrapolated either towards being a fully-parallel
implementation like the Transposed and Systolic implementations by using more MACC units,
or the other way towards being a Sequential FIR filter by using fewer MACC units.

Figures 13 and 14 below show a C-code description and hardware implementation
respectively for the Semi-parallel FIR filter (again with N=16 and M=4).
void semi_parallelFirFilter( int num_taps, int num_samples, float *x, const float h[ ], float *y )
{
           int i, j;
           int i0, i1, i2, i3;        // coefficient indices, one per execution-unit
           float *k0, *k1, *k2, *k3;  // input-data pointers, one per execution-unit
           register float y_accum0 = 0, y_accum1 = 0, y_accum2 = 0, y_accum3 = 0, y_accum4 = 0, y_out = 0;

           // Spin-up procedure : filling up the pipeline
           i0 = 0;
           k0 = x++;
           y_accum1 = ( h[i0] * *k0 ) + y_accum0;

           i1 = 4 + i0++;
           k1 = (k0--) - 4;
           y_accum2 = ( h[i1] * *k1 ) + y_accum1,
                        y_accum1 = ( h[i0] * *k0 ) + y_accum0;

           i2 = i1 + 4,            i1 = 4 + i0++;
           k2 = k1 - 4,            k1 = (k0--) - 4;
           y_accum3 = ( h[i2] * *k2 ) + y_accum2,
                        y_accum2 = ( h[i1] * *k1 ) + y_accum1,
                                   y_accum1 = ( h[i0] * *k0 ) + y_accum0;

           i3 = i2 + 4,            i2 = i1 + 4,            i1 = 4 + i0++;
           k3 = k2 - 4,            k2 = k1 - 4,            k1 = (k0--) - 4;
           y_accum4 = ( h[i3] * *k3 ) + y_accum3,
                        y_accum3 = ( h[i2] * *k2 ) + y_accum2,
                                   y_accum2 = ( h[i1] * *k1 ) + y_accum1,
                                                y_accum1 = ( h[i0] * *k0 ) + y_accum0;

           // Steady state : where the pipeline remains full
           i3 = i2 + 4,            i2 = i1 + 4,            i1 = 4 + i0,            i0 = 0;
           k3 = k2 - 4,            k2 = k1 - 4,            k1 = k0 - 4,            k0 = x++;

           for( j = 0; j < (((num_taps / 4) * num_samples) - 4); j++ )   // 4 = M
           {
                         y_out = 0;

                         for( i = 0; i < 4; i++ )   // N/M = 4 MACC cycles per output sample
                         {
                                        y_out += y_accum4,
                                                   y_accum4 = ( h[i3] * *k3 ) + y_accum3,
                                                             y_accum3 = ( h[i2] * *k2 ) + y_accum2,
                                                                           y_accum2 = ( h[i1] * *k1 ) + y_accum1,
                                                                                      y_accum1 = ( h[i0] * *k0 ) + y_accum0;

                                        i3 = i2 + 4,           i2 = i1 + 4,           i1 = i0 + 4,           i0++;
                                        k3 = k2 - 4,           k2 = k1 - 4,           k1 = k0 - 4,           k0--;
                         }
                         k0 = x++;
                         i0 = 0;
                         *y++ = y_out;
           }

           // Spin-down procedure : emptying the pipeline
           i3 = i2 + 4,           i2 = i1 + 4,             i1 = i0 + 4;
           k3 = k2 - 4,           k2 = k1 - 4,             k1 = k0 - 4;
           y_out += y_accum4,
                        y_accum4 = ( h[i3] * *k3 ) + y_accum3,
                                  y_accum3 = ( h[i2] * *k2 ) + y_accum2,
                                                y_accum2 = ( h[i1] * *k1 ) + y_accum1;

           i3 = i2 + 4,           i2 = i1 + 4;
           k3 = k2 - 4,           k2 = k1 - 4;
           y_out += y_accum4,
                        y_accum4 = ( h[i3] * *k3 ) + y_accum3,
                                  y_accum3 = ( h[i2] * *k2 ) + y_accum2;

           i3 = i2 + 4;
           k3 = k2 - 4;
           y_out += y_accum4,
                       y_accum4 = ( h[i3] * *k3 ) + y_accum3;

           y_out += y_accum4;

           *y = y_out;
}


                            Figure 13 : a C-code description of the Semi-Parallel form of the FIR filter
                    (with N = 16, M = 4) which is a translation of the data-dependency graph of figure 12 above
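
For illustration, assuming x, h and y point to suitably sized input, coefficient and output buffers (an assumption of this example), the 16-tap filter above would be invoked as semi_parallelFirFilter( 16, num_samples, x, h, y );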



With reference to the code description of the Semi-parallel implementation shown above in
figure 13, notice how the coefficient indices and input-data pointers for each execution-unit
are derived from those of the immediately preceding execution-unit rather than being explicitly
evaluated. This simplification of index and/or pointer calculation between loop iterations is
known as strength reduction, and in a hardware implementation translates to simplified
control logic with a lower gate-count.
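
The same idea in isolation (a generic sketch, not taken from the filter code) contrasts an explicitly evaluated index with one derived incrementally:

           // Sketch : strength reduction on an index calculation. Without it, the
           // address would be re-computed with a multiply on every iteration, e.g.
           // sum += a[ i * stride ]. With it, the multiply becomes an addition that
           // derives each index from the previous one, just as the coefficient indices
           // and input-data pointers in figure 13 are derived from those of the
           // preceding execution-unit.
           float sum_strided( const float *a, int n, int stride )
           {
               float sum = 0.0f;
               for( int i = 0, idx = 0; i < n; i++, idx += stride )
                   sum += a[ idx ];
               return sum;
           }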

[Hardware block diagram: the input (y_in) feeds a cascade of four execution-units, MACC_1 to MACC_4, whose coefficient-memories hold h[0:N/M-1], h[N/M:2N/M-1], h[2N/M:3N/M-1] and h[3N/M:4N/M-1] respectively. The partial sums y_accum1, y_accum2 and y_accum3 link adjacent execution-units (with y_accum0 = 0 into MACC_1), and y_accum4 feeds the output-accumulator ACC_1, which produces y_out.]

Figure 14 : showing the H/W implementation of the Semi-Parallel form of the FIR filter (with N = 16, M = 4) which
                         is a translation of the data-dependency graph of figure 12 above




                                                                         Conclusion


This article has presented the different generic implementation architectures along the
spectrum from low performance and low resource utilization to high performance and high
resource utilization. The analysis started at the sequential model of computation, which is
the lowest-performance architecture with the lowest resource requirements. The Transposed
and Systolic architectures were then presented; these are fully-parallel implementations
having the maximum throughput and maximum resource requirements, and it was explained
how and why each of the two suits a different range of problem sizes. Finally, the
Semi-parallel implementation was analyzed; this architecture can be further folded, moving
it towards the sequential model of computation, or further unfolded, moving it towards a
fully-parallel implementation. Throughout the article, the three abstraction-level views of a
signal processing algorithm (the code description, the data-dependency graph and the
hardware implementation) were used to show the reader how different attributes of an
algorithm are more clearly exposed for ease of manipulation at each abstraction level, and
the translation between these views was also demonstrated.

More Related Content

What's hot

Modules and packages in python
Modules and packages in pythonModules and packages in python
Modules and packages in pythonTMARAGATHAM
 
Graphics User Interface Lab Manual
Graphics User Interface Lab ManualGraphics User Interface Lab Manual
Graphics User Interface Lab ManualVivek Kumar Sinha
 
pointers,virtual functions and polymorphism
pointers,virtual functions and polymorphismpointers,virtual functions and polymorphism
pointers,virtual functions and polymorphismrattaj
 
Data structures - unit 1
Data structures - unit 1Data structures - unit 1
Data structures - unit 1SaranyaP45
 
Modular Programming in C
Modular Programming in CModular Programming in C
Modular Programming in Cbhawna kol
 
Ppt on this and super keyword
Ppt on this and super keywordPpt on this and super keyword
Ppt on this and super keywordtanu_jaswal
 
Oops concepts || Object Oriented Programming Concepts in Java
Oops concepts || Object Oriented Programming Concepts in JavaOops concepts || Object Oriented Programming Concepts in Java
Oops concepts || Object Oriented Programming Concepts in JavaMadishetty Prathibha
 
Data Types & Variables in JAVA
Data Types & Variables in JAVAData Types & Variables in JAVA
Data Types & Variables in JAVAAnkita Totala
 
Class and Objects in Java
Class and Objects in JavaClass and Objects in Java
Class and Objects in JavaSpotle.ai
 
Handwriting Recognition Using Deep Learning and Computer Version
Handwriting Recognition Using Deep Learning and Computer VersionHandwriting Recognition Using Deep Learning and Computer Version
Handwriting Recognition Using Deep Learning and Computer VersionNaiyan Noor
 
An Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case StudiesAn Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case Studies3G4G
 
Verification and Validation in Software Engineering SE19
Verification and Validation in Software Engineering SE19Verification and Validation in Software Engineering SE19
Verification and Validation in Software Engineering SE19koolkampus
 

What's hot (20)

Packages in java
Packages in javaPackages in java
Packages in java
 
Modules and packages in python
Modules and packages in pythonModules and packages in python
Modules and packages in python
 
Object detection
Object detectionObject detection
Object detection
 
Project report
Project reportProject report
Project report
 
Graphics User Interface Lab Manual
Graphics User Interface Lab ManualGraphics User Interface Lab Manual
Graphics User Interface Lab Manual
 
Merge sort
Merge sortMerge sort
Merge sort
 
pointers,virtual functions and polymorphism
pointers,virtual functions and polymorphismpointers,virtual functions and polymorphism
pointers,virtual functions and polymorphism
 
Data structures - unit 1
Data structures - unit 1Data structures - unit 1
Data structures - unit 1
 
Modular Programming in C
Modular Programming in CModular Programming in C
Modular Programming in C
 
Ppt on this and super keyword
Ppt on this and super keywordPpt on this and super keyword
Ppt on this and super keyword
 
Oops concepts || Object Oriented Programming Concepts in Java
Oops concepts || Object Oriented Programming Concepts in JavaOops concepts || Object Oriented Programming Concepts in Java
Oops concepts || Object Oriented Programming Concepts in Java
 
Merge sort algorithm
Merge sort algorithmMerge sort algorithm
Merge sort algorithm
 
Data Types & Variables in JAVA
Data Types & Variables in JAVAData Types & Variables in JAVA
Data Types & Variables in JAVA
 
Class and Objects in Java
Class and Objects in JavaClass and Objects in Java
Class and Objects in Java
 
Handwriting Recognition Using Deep Learning and Computer Version
Handwriting Recognition Using Deep Learning and Computer VersionHandwriting Recognition Using Deep Learning and Computer Version
Handwriting Recognition Using Deep Learning and Computer Version
 
AD3251-Data Structures Design-Notes-Searching-Hashing.pdf
AD3251-Data Structures  Design-Notes-Searching-Hashing.pdfAD3251-Data Structures  Design-Notes-Searching-Hashing.pdf
AD3251-Data Structures Design-Notes-Searching-Hashing.pdf
 
Final keyword in java
Final keyword in javaFinal keyword in java
Final keyword in java
 
An Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case StudiesAn Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case Studies
 
Verification and Validation in Software Engineering SE19
Verification and Validation in Software Engineering SE19Verification and Validation in Software Engineering SE19
Verification and Validation in Software Engineering SE19
 
Chapter 07 inheritance
Chapter 07 inheritanceChapter 07 inheritance
Chapter 07 inheritance
 

Similar to Systolic, Transposed & Semi-Parallel Architectures and Programming

PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...
PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...
PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...cscpconf
 
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...csandit
 
VCE Unit 01 (1).pptx
VCE Unit 01 (1).pptxVCE Unit 01 (1).pptx
VCE Unit 01 (1).pptxskilljiolms
 
VCE Unit 01 (2).pptx
VCE Unit 01 (2).pptxVCE Unit 01 (2).pptx
VCE Unit 01 (2).pptxskilljiolms
 
Implementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adderImplementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adderVLSICS Design
 
A High Speed Transposed Form FIR Filter Using Floating Point Dadda Multiplier
A High Speed Transposed Form FIR Filter Using Floating Point Dadda MultiplierA High Speed Transposed Form FIR Filter Using Floating Point Dadda Multiplier
A High Speed Transposed Form FIR Filter Using Floating Point Dadda MultiplierIJRES Journal
 
Unit i basic concepts of algorithms
Unit i basic concepts of algorithmsUnit i basic concepts of algorithms
Unit i basic concepts of algorithmssangeetha s
 
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS FilterPerformance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filteridescitation
 
FPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT AlgorithmFPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT Algorithmcscpconf
 
FPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT AlgorithmFPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT Algorithmcscpconf
 
Fpga sotcore architecture for lifting scheme revised
Fpga sotcore architecture for lifting scheme revisedFpga sotcore architecture for lifting scheme revised
Fpga sotcore architecture for lifting scheme revisedijcite
 
IMPLEMENTATION OF FRACTIONAL ORDER TRANSFER FUNCTION USING LOW COST DSP
IMPLEMENTATION OF FRACTIONAL ORDER TRANSFER FUNCTION USING LOW COST DSPIMPLEMENTATION OF FRACTIONAL ORDER TRANSFER FUNCTION USING LOW COST DSP
IMPLEMENTATION OF FRACTIONAL ORDER TRANSFER FUNCTION USING LOW COST DSPIAEME Publication
 
Design of optimized Interval Arithmetic Multiplier
Design of optimized Interval Arithmetic MultiplierDesign of optimized Interval Arithmetic Multiplier
Design of optimized Interval Arithmetic MultiplierVLSICS Design
 
An Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing SystemAn Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing SystemIRJET Journal
 
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...ijsrd.com
 
Bca examination 2016 csa
Bca examination 2016 csaBca examination 2016 csa
Bca examination 2016 csaAnjaan Gajendra
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520IJRAT
 

Similar to Systolic, Transposed & Semi-Parallel Architectures and Programming (20)

PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...
PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...
PERFORMANCE EVALUATIONS OF GRIORYAN FFT AND COOLEY-TUKEY FFT ONTO XILINX VIRT...
 
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...
Performance evaluations of grioryan fft and cooley tukey fft onto xilinx virt...
 
VCE Unit 01 (1).pptx
VCE Unit 01 (1).pptxVCE Unit 01 (1).pptx
VCE Unit 01 (1).pptx
 
VCE Unit 01 (2).pptx
VCE Unit 01 (2).pptxVCE Unit 01 (2).pptx
VCE Unit 01 (2).pptx
 
FPGA Implementation of High Speed FIR Filters and less power consumption stru...
FPGA Implementation of High Speed FIR Filters and less power consumption stru...FPGA Implementation of High Speed FIR Filters and less power consumption stru...
FPGA Implementation of High Speed FIR Filters and less power consumption stru...
 
Implementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adderImplementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adder
 
A High Speed Transposed Form FIR Filter Using Floating Point Dadda Multiplier
A High Speed Transposed Form FIR Filter Using Floating Point Dadda MultiplierA High Speed Transposed Form FIR Filter Using Floating Point Dadda Multiplier
A High Speed Transposed Form FIR Filter Using Floating Point Dadda Multiplier
 
Unit i basic concepts of algorithms
Unit i basic concepts of algorithmsUnit i basic concepts of algorithms
Unit i basic concepts of algorithms
 
Handout#10
Handout#10Handout#10
Handout#10
 
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS FilterPerformance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
 
FPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT AlgorithmFPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT Algorithm
 
FPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT AlgorithmFPGA based Efficient Interpolator design using DALUT Algorithm
FPGA based Efficient Interpolator design using DALUT Algorithm
 
Fpga sotcore architecture for lifting scheme revised
Fpga sotcore architecture for lifting scheme revisedFpga sotcore architecture for lifting scheme revised
Fpga sotcore architecture for lifting scheme revised
 
IMPLEMENTATION OF FRACTIONAL ORDER TRANSFER FUNCTION USING LOW COST DSP
IMPLEMENTATION OF FRACTIONAL ORDER TRANSFER FUNCTION USING LOW COST DSPIMPLEMENTATION OF FRACTIONAL ORDER TRANSFER FUNCTION USING LOW COST DSP
IMPLEMENTATION OF FRACTIONAL ORDER TRANSFER FUNCTION USING LOW COST DSP
 
Design of optimized Interval Arithmetic Multiplier
Design of optimized Interval Arithmetic MultiplierDesign of optimized Interval Arithmetic Multiplier
Design of optimized Interval Arithmetic Multiplier
 
An Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing SystemAn Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing System
 
Gf3511031106
Gf3511031106Gf3511031106
Gf3511031106
 
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
Analysis of different FIR Filter Design Method in terms of Resource Utilizati...
 
Bca examination 2016 csa
Bca examination 2016 csaBca examination 2016 csa
Bca examination 2016 csa
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520
 

Recently uploaded

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 

Recently uploaded (20)

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 

Systolic, Transposed & Semi-Parallel Architectures and Programming

  • 1. Systolic, Transposed & Semi-Parallel Architectures and Programming Sandip Jassar http://www.linkedin.com/in/sandipjassar Abstract The critical routines within signal processing algorithms are typically data intensive and iterate many times, carrying out the same functions on multi-dimensional arrays and input streams. These routines use the vast majority of an implementation’s resources, and for the longest time over the course of the algorithms execution. Such routines are classed as Nested-Loop Programs. Whether the implementation is software running on a processor or a higher performance customized hardware implementation, there is always a tradeoff between the throughput performance (execution-time) and the resources used such as the amount of processor memory and registers in a software implementation or the gate-count in a hardware implementation. This article shows the reader techniques and methods for manipulating a signal processing algorithm, in particular those conforming to a Nested-Loop Program, and the different generic implementation architectures along this tradeoff spectrum, as well as the different methods for describing and analyzing an algorithm implementation. Each of these methods for describing an algorithm is shown to correspond to a different abstraction-level view and as such exposes different features and properties for ease of analysis and manipulation. Manipulation techniques, mainly algorithmic- transformations are described and it is shown how these transformations take an implementation and form a new one at a different point in the trade-off spectrum of resources used versus throughput by performing calculations in a different order and a different way to arrive at the same result. Contents 1 Loop Optimization 1.1 The Unrolling Transformation 1.2 The Skewing Transformation 2 The three abstraction levels for representing Signal Processing Algorithms 2.1 Code Representation 2.2 Data-dependency Graph 2.3 Hardware Implementation 3 Parallel implementations of a Signal Processing algorithm 3.1 The Transposed FIR Filter 3.2 The Systolic FIR Filter 3.3 The Semi-Parallel FIR Filter Conclusion
  • 2. 1 Loop Optimization The Nested-Loop Program (NLP) form of a signal processing algorithm represents what’s known as its sequential model of computation. The Finite Impulse Response (FIR) filter is the most well known and most widely used NLP in the field of signal processing and as such is used throughout this article as an example of a simple NLP. The C-code description of the FIR filter for calculating four output values is shown below in figure 1. for(j = 0; j < num_of_samples_to_calculate; j++) { for(i = 0; i < N; i++) { y[j] = y[j] + ( x[j - i] * h[i] ) } } Figure 1 : C-code description of the NLP form of the FIR filter operation An algorithms’ sequential model of computation represents the way in which the algorithm would be executed as software running on a standard single-threaded processor, hence the name sequential model of computation. 1.1 The Unrolling Transformation The algorithmic transformation of unrolling a NLP acts to enhance its task-level parallelism, whereby the tasks that are independent of each other are explicitly shown to be so. A task in this context refers to a loop-iteration. Such tasks are said to be mutually exclusive to one another. After the transformation has been applied the resulting algorithm is different, although it remains functionally equivalent to the original. The independent tasks can then be mapped to separate processors/execution-units (H/W resources), and as such it is said that unrolling allows for the use of spatial parallelism. Spatial parallelism results in greater throughput as loop iterations are processed concurrently, although this is at the expense of multiplying the hardware required (mainly processors and the inter-processor registers to transfer data between loop-iterations) by the unrolling_factor, which is the number of concurrent loops that result from the unrolling. 1.2 The Skewing Transformation The algorithmic transformation of skewing exposes the allowable latency between dependent operations (in different loop-iterations) in order for their execution to be scheduled apart on the H/W processors/execution-units or in the S/W processing threads that will execute them. Where a nested (inner) loop iterates over a multidimensional array, in the NLP’s sequential model of computation each iteration of the loop is dependent of all previous iterations. The skewing transformation re-schedules the instructions such that dependencies only exist between iterations of the outer-loop. The skewing transformation thus allows temporal parallelism (pipelining) to be employed in the implementation where the executions of successive outer-loop iterations are time-overlapped, and as such, a single set of registers are shared by the corresponding inner-loop iterations, as their executions are carried out using the same resources. This practice is also referred to as overlapped scheduling.
  • 3. Overall the unrolling and skewing transformations can be used to transform a sequential model of computation into something that it is closer to a data-flow model which exploits the algorithm’s inherent parallelism and thus can be used to yield a more efficient implementation. Dependent operations can also be scheduled apart by simply changing the order in with operations are executed. The simplest example of this is when the inner and outer loop indices are swapped. 2 The three abstraction levels for representing Signal Processing Algorithms The following section shows how to represent a signal processing algorithm in terms of its Code description, as a Data-dependency graph and its Hardware implementation. Code shows how an algorithm would be executed on a processor (its Software implementation) and is used to describe the algorithm from the highest abstraction level. The Hardware implementation represents the lowest abstraction level. The reader will learn how to transfer the description of an algorithm between these abstraction levels, and also gain an appreciation of which characteristics of an algorithm are greater exposed for ease of manipulation at a particular abstraction level. The example algorithm used throughout this section is the FIR (Finite Impulse Response) filter in its basic sequential form, as the FIR filter is very simple and widely used in the field of signal processing. The FIR filter operation essentially carries out a vector dot-product in calculating each value of y[n]. This is illustrated below in figure 2 for N = 4, where N is essentially the width of the filter in time samples (number of filter taps). h[0] h[1] x[n] x[n-1] x[n-2] x[n-3] x[n].h[0] + x[n-1].h[1] + x[n-2].h[2] + x[n-3].h[3] h[2] h[3] Figure 2 : showing how the FIR filter operation is comprised of a vector dot-product for the calculation of each output-sample value 2.1 Code representation With reference to the C-code description of the Sequential FIR filter shown below in figure 3, all of the required input samples are assumed to be stored in the register file (with a stride of 1) of the processor that would theoretically be executing the code, with x initially pointing to the input sample x[n] corresponding to the first output sample to be calculated y[n]. It is also assumed that all of the required coefficients are stored in the same way in a group of registers used to store the h[ ] array.
  • 4. void sequentialFirFilter( int num_taps, int num_samples, float *x, const float *h[ ], float *y ) { int i, j; // ‘j’ is the outer-loop counter and ‘i’ is the inner-loop index float y_accum; // output sample is accumulated into ‘y_accum’ float *k; // pointer to the required input sample for( j = 0; j < num_samples; j++ ) { k = x++; // x points to x[n+j] and is incremented (post assignment) to point to // x[(n+j)+1] y_accum = 0.0; for( i = 0; i < num_taps; i++ ) { y_accum += h[i] * *(k--); // y[n+j] += h[i] * x[(n+j) - i] } *y++ = y_accum; // y points to the variable address where y[n+j] is to be written and is // incremented (post assignment) to point to the variable address } // where the next output sample y[(n+j)+1] is to be written } Figure 3 : A C-code description of the FIR filter’s sequential model of computation 2.2 Data-dependency Graph The code description of an algorithm can easily be translated into its data-dependency graph, and for the Sequential FIR filter this is shown below in figure 4, which is the data-dependency graph translation of the code in figure 3 (with num_samples = 4). With reference to figure 4 below the contents of a vertical bar represent an iteration of the outer-loop, where successive output-samples are calculated in successive bars going from left to right. 0 1 2 3 j h(0) h(0) h(0) h(0) 0 MACC_1 MACC_1 MACC_1 MACC_1 x[n] x[n+1] x[n+2] x[n+3] y-accum y-accum y-accum y-accum + + + + h(1) h(1) h(1) h(1) 1 MACC_1 MACC_1 MACC_1 MACC_1 x[n-1] x[n] x[n+1] x[n+2] y-accum y-accum y-accum y-accum + + + + h(2) h(2) h(2) h(2) 2 MACC_1 MACC_1 MACC_1 MACC_1 x[n-2] x[n-1] x[n] x[n+1] y-accum y-accum y-accum y-accum + + + + h(3) h(3) h(3) h(3) MACC_1 3 x[n-3] MACC_1 x[n-2] MACC_1 x[n-1] x[n] MACC_1 y-accum y-accum y-accum y-accum i h(0).x[n] + h(1).x[n-1] + h(2).x[n-2] h(0).x[n+1] + h(1).x[n] + h(2).x[n-1] h(0).x[n+2] + h(1).x[n+1] + h(0).x[n+3] + h(1).x[n+2] + + h(3).x[n-3] + h(3).x[n-2] h(2).x[n] + h(3).x[n-1] h(2).x[n+1] + h(3).x[n] Figure 4 : Data-dependency graph showing the operation of the Sequential FIR filter with N=4
  • 5. The circles in each bar of figure 4 above represent the instances of the physical operators that carry out each iteration of the inner-loop, where successive iterations are represented from top to bottom. The name given to this physical operator for the Sequential FIR filter is MACC_1 as a multiply-accumulate is the only instruction (in S/W) or operation (in H/W) that is carried out, and with the FIR filter’s sequential model of computation all instructions/operations in the inner-loop are dependent on all of the previous ones, and the outer-loop is carried out one iteration at a time, so only one instruction/operation can be scheduled for execution at any one time and thus only one execution-unit can be used. 2.3 Hardware Implementation As can be seen above in figure 4, each MACC instruction/operation uses two new input values and its output result is used as an input to the next MACC instruction/operation. The hardware implementation of the Sequential FIR filter is shown below in figure 5. MACC_1 Input Data Samples y_accum y_out Coefficients Figure 5 : showing an example H/W implementation of the Sequential FIR filter Since the output of each MACC up until the last one in each outer-loop iteration feeds its result forward to the input of the next one, instead of writing each MACC-result to the register file and then reading it before the start of the next MACC, the hardware implementation has a feedback loop from the output of the adder to its second input. In the code description it can be seen that this optimization is possible, although it can actually be shown in the data-dependency graph and more clearly in the hardware implementation. 3 Parallel implementations of a Signal Processing algorithm The primary trade-off between sequential and parallel implementations of the same algorithm is the amount of hardware resources or S/W processing threads (i.e. a more complex processing architecture which is harder to write code for) required versus the throughput achieved. As the Sequential FIR filter implements the FIR filter function in a completely sequential manner, the required resources are reduced by a factor of N, although so too is the throughput as compared to a fully parallel implementation that would use one execution- unit for each of the N coefficients (where the N MACCs would be cascaded).
  • 6. As can be seen from figure 3 above, and more clearly from the data-dependency graph (shown above in figure 4), the Sequential FIR filter evaluates (accumulates) only one output value at a time. Assuming that each multiplication of x[(n+j)-i] and h[i] takes one clock cycle, then the performance of this implementation is given by the following equation: Throughput = Clock frequency ÷ Number of coefficients The simplest fully-parallel FIR filter implementation is the Direct form type 1. This is formed by completely unrolling the inner-loop and mapping each of the unrolled instructions/operations to separate execution-units. The problem with this implementation is that the outputs of the execution-units (MACC outputs) are all summed through an adder-chain, where the resulting latency (and gate-count with a H/W implementation) mean that the Direct form type 1 FIR filter does not scale very well for large problem sizes (N). The other reason why the Direct form type 1 FIR filter is not used in signal processing is that it requires N input values to be read in for each output value to be calculated, and this design feature does not scale very well for large N due to the greater latency and possibly (depending on the implementation) the higher number of memory read ports. 3.1 The Transposed FIR filter The Transposed FIR filter is also a fully-parallel implementation. This is formed by first completely unrolling the inner-loop and then skewing the outer-loop in the reverse direction by a factor of 1. (Skewing the outer loop in the reverse direction is also the same as reversing the order of the outer-loop indices and skewing by the same factor in the forward direction). This manipulation of the Sequential FIR filter is most easily done using a data-dependency graph where the scheduling of MACC instructions/operations and dependencies is explicitly shown. The data-dependency graph for the Transposed FIR filter is shown below in figure 6. 0 1 j y[n] = h(0).x[n] + h(1).x[n-1] + y[n+1] = h(0).x[n+1] + h(1).x[n] + h(2).x[n-2] + h(3).x[n-3] h(2).x[n-1] + h(3).x[n-2] h(0) MACC_4 MACC_4 h(0) + + x[n] x[n+1] y_accum3 y_accum3 y_accum3 h(1) MACC_3 MACC_3 MACC_3 h(1) h(1) + + + x[n-1] x[n] x[n+1] y_accum2 y_accum2 y_accum2 y_accum2 MACC_2 MACC_2 MACC_2 h(2) MACC_2 h(2) h(2) h(2) + + + + x[n-2] x[n-1] x[n] x[n+1] y_accum1 y_accum1 y_accum1 y_accum1 y_accum1 h(3) MACC_1 MACC_1 MACC_1 MACC_1 MACC_1 h(3) h(3) h(3) h(3) + + + + + y_accum0 = 0 y_accum0 = 0 y_accum0 = 0 y_accum0 = 0 y_accum0 = 0 x[n-3] x[n-2] x[n-1] x[n] x[n+1] Figure 6 : dependency graph showing the operation of the Transposed form of the FIR filter (with N = 4) As can be seen above in figure 6, the circles are now labeled MACC_1 to MACC_4 where each of these is assigned the same filter coefficient throughout the execution of the algorithm (i.e. one of the inner-loop iterations). The purple circles represent MACC
  • 7. instructions/operations that occur during the spin-up process, which is the process of filling the pipeline. The pipeline can either consist of N separate execution-units, or one execution- unit which itself consists of N separate stages that have the same latency. The skewing of the outer-loop results in temporal overlapping of successive output sample calculations. This temporal overlapping schedules apart the dependent MACC instructions/operations within a single iteration of the outer-loop, and this provides several advantages over the Direct form type 1 implementation. Once the spin-up procedure is complete only one input data value is read per output value calculated. In the Transposed implementation this input value is then simultaneously broadcasted to all of the N execution-units. These data-values are read in time-order meaning that the input can be streamed in, and this lends itself very well to signal processing, where the objective is to process large amounts of data quickly. The other advantage over the Direct form type 1 is that the MACC results summed to form each output value are done so using an adder chain, and as such the Transposed implementation scales up very well for large and arbitrary values of N. The disadvantage of the Transposed implementation is that latency of the spin-up and spin- down (opposite of spin-down) procedures which are each N-1 cycles long. This latency is not seen with the Direct form type 1 implementation, however the more output values that are calculated the more this overhead is amortized. During the time the pipeline is full the throughput is equal to the frequency with which successive MACC instructions/operations are executed, and this is equal to the throughput of the Direct form type 1 implementation. Another disadvantage with the Transposed implementation is that the number of filter taps is limited by the fan-out capability of the register driving the common input of all the execution- units with the input data value. The data-dependency graph of the Transposed FIR filter is translated into a code description (higher-abstraction) and a hardware implementation (lower-abstraction) below in figures 7 and 8 respectively. With reference to the C-code description of the Transposed implementation below in figure 7, where instructions are ended with a comma rather than a semi-colon, this means that all subsequent instructions up to and including the first one ended in a semi-colon are to be scheduled for parallel execution. void transposedFirFilter( int num_samples, float *x, const float *h[ ], float *y ) { const num_taps = 4; // value of N int j; // the only loop counter float *k = x; // ‘x’ initially points to the input sample (x[n]) corresponding to the first output // sample to be calculated (y[n]) float y_accum0 = y_accum1 = y_accum2 = y_accum3 = 0; // variables used to store the each // result at different stages of the // accumulation // Spin-up procedure : filling up the pipeline y_accum1 = ( h[3] * *(k-3) ) + y_accum0; y_accum1 = ( h[3] * *(k-2) ) + y_accum0, y_accum2 = ( h[2] * *(k-2) ) + y_accum1; y_accum1 = ( h[3] * *(k-1) ) + y_accum0, y_accum2 = ( h[2] * *(k-1) ) + y_accum1, y_accum3 = ( h[1] * *(k-1) ) + y_accum2; // Steady state : where the pipeline is full
  • 8. for( j = 0; j < (num_samples – (num_taps-1)); j++ ) { // y points to the variable address where y[n+j] is to be written and is incremented (post // assignment) to point to the variable address where the next output sample y[(n+j)+1] // is to be written *y++ = ( h[0] * *k ) + y_accum3, y_accum3 = ( h[1] * *k ) + y_accum2, y_accum2 = ( h[2] * *k ) + y_accum1, y_accum1 = ( h[3] * *k ) + y_accum0; k = x++; // ‘x’ initially points to x[n+j] and is incremented (post assignment) to point to // x[(n+j)+1] } // Spin-down procedure : emptying the pipeline *y++ = ( h[0] * *k ) + y_accum3, y_accum3 = ( h[1] * *k ) + y_accum2, y_accum2 = ( h[2] * *k ) + y_accum1; *y++ = ( h[0] * *(k+1) ) + y_accum3, y_accum3 = ( h[1] * *(k+1) ) + y_accum2; *y++ = ( h[0] * *(k+2) ) + y_accum3; } Figure 7 : A C-code description of the Transposed FIR filter which is a translation of the data-dependency graph of figure 6 above y_in h(3) h(2) h(1) h(0) MACC_1 MACC_2 MACC_3 MACC_4 y_accum1 y_accum2 y_accum3 y_out y_accum0 = 0 Figure 8 : showing the H/W implementation of the Transposed form of the FIR filter (with N = 4) which is a translation of the data-dependency graph of figure 6 above With reference to figure 8 above, the coefficients are assigned to the execution-units in descending order going from the first input to the adder chain to its output. The significance of this is discussed below in section 3.2.
  • 9. 3.2 The Systolic FIR filter The Systolic FIR filter is another fully-parallel implementation, and is formed from the Sequential FIR filter by completely unrolling the inner-loop and skewing the outer-loop in the forward direction by a factor of 1. This manipulation of the Sequential FIR filter is depicted in the data-dependency graph of the Systolic FIR filter shown below in figure 9. x[n] x[n+1] x[n+2] x[n+3] x[n+4] y_accum0 = 0 y_accum0 = 0 y_accum0 = 0 y_accum0 = 0 y_accum0 = 0 + + + + + h(0) MACC_1 h(0) MACC_1 h(0) MACC_1 h(0) MACC_1 h(0) MACC_1 y_accum1 x[n+4] y_accum1 y_accum1 y_accum1 y_accum1 x[n-1] x[n] x[n+1] x[n+2] x[n+3] + + + + h(1) MACC_2 h(1) MACC_2 h(1) MACC_2 h(1) MACC_2 x[n+2] y_accum2 y_accum2 y_accum2 y_accum2 x[n-2] x[n-1] x[n] x[n+1] + + + h(2) MACC_3 h(2) MACC_3 h(2) MACC_3 y_accum3 y_accum3 y_accum3 x[n] x[n-3] x[n-2] x[n-1] + + h(3) MACC_4 h(3) MACC_4 y[n] = h(0).x[n] + h(1).x[n-1] + y[n+1] = h(0).x[n+1] + h(1).x[n] + h(2).x[n-2] + h(3).x[n-3] h(2).x[n-1] + h(3).x[n-2] j 0 1 Figure 9 : dependency graph showing the operation of the Systolic form of the FIR filter (with N=4) As previously with the Transposed implementation the Systolic implementation goes through a spin-up procedure (where the pipeline is filled, represented above in figure 9 by the purple circles) and a spin-down procedure (where the pipeline is emptied) and again both of these last for N-1 cycles. As with the Transposed implementation, when the pipeline of the Systolic FIR filter is full there is one input data value read per output value calculated, and as with the Transposed implementation these input values are read in time order, and as previously mentioned these are the properties desired when the input is streamed into the filter. With reference to figure 9 above it can be seen that the filter coefficients are assigned to the execution-units in the opposite order that they were with the Transposed implementation (as the outer-loop in each case is skewed in opposite directions). This means that there is a latency of N MACC cycles between an input data value being read and the corresponding output value having finished being calculated. However this latency for the Direct form type 1 as well as the Transposed implementation was only 1 MACC cycle, although the spin-up latency (time for first output value to emerge after starting the filter) and the spin-down latency (time after last output value emerges before filter is stopped) is the same for both the Transposed and Systolic implementations. As previously explained in section 3.1 the Transposed implementation is not suitable for very high values of N because of the way the same input data value is broadcast to all of the MACCs. This Systolic implementation has no such limitation and thus is more suitable than the Transposed implementation for implementing a FIR with a large number of taps. This however comes at the cost of an additional register per execution-unit, and these transfer input data values between adjacent execution-units with a latency of 1 MACC cycle. Therefore in summary depending on the implementation platform, there is a breakpoint for the size of N
  • 10. below which it is preferable to use the Transposed implementation and above which it is preferable to use the Systolic implementation. void systolicFirFilter( int num_samples, float *x, const float *h[], float *y ) { const num_taps // value of N int j; // the only loop counter float *k = x++; // ‘x’ initially points to the input sample (x[n]) corresponding to the first // output sample to be calculated (y[n]) register y_accum0 = y_accum1 = y_accum2 = y_accum3 = 0; // variables used to store the each // result at different stages of the accumulation // Spin-up procedure : filling up the pipeline y_accum1 = ( h[0] * *k ) + y_accum0; k = x++; y_accum2 = ( h[1] * *(k-2) ) + y_accum1, y_accum1 = ( h[0] * *k ) + y_accum0; k = x++; y_accum3 = ( h[2] * *(k-4) ) + y_accum2, y_accum2 = ( h[1] * *(k-2) ) + y_accum1, y_accum1 = ( h[0] * *k ) + y_accum0; // Steady state : where the pipeline remains full for( j = 0; j < (num_samples – (num_taps-1)); j++ ) { // ‘x’ initially points to x[n+j] and is incremented (post assignment) to point to x[(n+j)+1] k = x++; // y points to the variable’s address where y[n+j] is to be written and is incremented // (post assignment) to point to the variable address where the next output sample // y[(n+j)+1] is to be written *y++ = ( h[3] * *(k-6) ) + y_accum3, y_accum3 = ( h[2] * *(k-4) ) + y_accum2, y_accum2 = ( h[1] * *(k-2) ) + y_accum1, y_accum1 = ( h[0] * *k ) + y_accum0; } // Spin-down procedure : emptying the pipeline k = x++; *y++ = ( h[3] * *(k-6) ) + y_accum3, y_accum3 = ( h[2] * *(k-4) ) + y_accum2, y_accum2 = ( h[1] * *(k-2) ) + y_accum1; k = x++; *y++ = ( h[3] * *(k-6) ) + y_accum3, y_accum3 = ( h[2] * *(k-4) ) + y_accum2; k = x; *y++ = ( h[0] * *(k-6) ) + y_accum3; } Figure 10 : A C-code description of the Systolic form of the FIR filter (with N = 4) which is a translation of the data-dependency graph of figure 9 above
[Figure 11 : showing the H/W implementation of the Systolic form of the FIR filter (with N = 4), a translation of the data-dependency graph of figure 9 above; y_in (with y_accum0 = 0) enters MACC_1, and the partial sums y_accum1, y_accum2 and y_accum3 cascade through MACC_2, MACC_3 and MACC_4 (coefficients h(0) to h(3)) to y_out]

3.3 The Semi-parallel FIR filter

The Semi-Parallel FIR filter implementation divides its N coefficients amongst M execution-units. Like the Transposed and Systolic implementations, the Semi-Parallel implementation employs both spatial and temporal parallelism, although it is not a fully-parallel implementation, as the degree of parallelism employed depends on the M:N ratio.

The Semi-parallel FIR is formed from the Sequential FIR by first unrolling the inner-loop by a factor of M, so that its number of iterations falls from N to N/M (the case where N/M is not an integer value is discussed below). The inner-loop is then skewed in the forward direction by a factor of 1, such that the N/M stages of calculating an output-sample value are overlapped in time on the M-stage pipeline. Finally, the outer-loop is skewed in the forward direction by a factor of 1, such that the calculations of successive output-sample values are overlapped in time. This manipulation of the Sequential FIR is depicted below in the data-dependency graph of the Semi-parallel implementation.

[Figure 12 : dependency graph showing the operation of the Semi-parallel form of the FIR filter (with N = 16 and M = 4); each output sample is built from four partial sums cascaded through MACC_1 to MACC_4 and accumulated in ACC_1, e.g. y[n] = h(0).x[n] + h(1).x[n-1] + ... + h(15).x[n-15]]
The red circles in figure 12 represent MACC instructions/operations used to calculate the inner-products of y[n], and the dark-red circles represent the output-accumulator accumulating those inner-products. The blue and dark-blue circles represent the same for y[n-1], whilst the yellow circles represent MACC instructions/operations used to calculate the inner-products of y[n+1].

Each group of N/M coefficients is assigned to one of the execution-units and stored in order within the associated coefficient-memory. The first group (coefficients 0 to (N/M - 1)) is assigned to the top execution-unit (the one to which the input samples are applied), with ascending coefficient groups assigned to the execution-units from top to bottom (towards the output-sample accumulator). If N is not exactly integer-divisible by M, then the higher-order coefficient-memories are padded with zeros. As can be seen in figure 12, the coefficient index for each execution-unit lies in the range 0:((N/M) - 1) + offset, where the offset increases by N/M between successive execution-units going from the input to the output-accumulator.

Figure 12 also shows how the MACC instructions/operations in the inner-loop are rescheduled so that, once the M instructions are mapped to the M execution-units, the input-data stream can be kept in time-order and divided into M chunks, with successive chunks in time assigned to successive execution-units going from the input to the output-accumulator. Between successive output-value calculations, therefore, only one new input-data value is read in, and each execution-unit shifts its oldest input-data value into the adjacent execution-unit next in the pipeline (towards the output-accumulator). This yields the most efficient time-overlapping of successive output-value calculations, as there are no wasted MACC cycles.

This temporal parallelism is in turn necessary to achieve the Semi-parallel implementation's maximum throughput of one output sample every N/M MACC cycles. Once an output sample has been retrieved from the output-accumulator, the output-accumulator must be reset, either to zero or to its input value. For its input value to be the first sum of M inner-products of the next output sample, the evaluation of that sum must have finished in the previous MACC cycle; if instead the accumulator were reset to zero at the start of each new calculation-cycle (i.e. outer-loop iteration), the time-overhead of setting up a new output-value calculation would be seen at the output, degrading performance.

A higher M:N ratio means the implementation is less folded and yields more parallelism, both spatially (as more execution-units are employed) and temporally (as more inner-loop and outer-loop execution threads are scheduled for parallel execution). The trade-off is thus the performance obtained versus the resources required, as can be seen from the equation for the performance of a Semi-parallel FIR filter implementation:

        Throughput = ( Clock frequency x M ) / N

For example, with N = 16 taps, M = 4 execution-units and a 400 MHz clock, one output sample is produced every N/M = 4 MACC cycles, giving a throughput of 100 Msamples/s. The Semi-parallel implementation may be extrapolated towards a fully-parallel implementation like the Transposed and Systolic implementations by using more MACC units, or the other way towards the Sequential FIR filter by using fewer MACC units.
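To make the coefficient assignment concrete, the sketch below (a hypothetical helper, not one of the article's listings) deals N coefficients out to M coefficient-memories of depth ceil(N/M), leaving the tail of the higher-order memories zero-padded whenever N is not integer-divisible by M:

#include <stdlib.h>

/* Partition h[0..n_taps-1] into m_units coefficient-memories; unit m holds
   h[m*depth .. m*depth + depth - 1], and any entries past the last
   coefficient remain zero (the zero padding described above) */
static float **partition_coefficients( const float *h, int n_taps, int m_units )
{
    int depth = ( n_taps + m_units - 1 ) / m_units;   /* ceil(N / M) */
    float **mem = malloc( m_units * sizeof *mem );
    int m, i;

    for( m = 0; m < m_units; m++ )
    {
        mem[m] = calloc( depth, sizeof **mem );       /* calloc zeroes the memory */
        for( i = 0; i < depth && (m * depth + i) < n_taps; i++ )
            mem[m][i] = h[m * depth + i];
    }
    return mem;   /* caller frees; allocation-failure checks omitted for brevity */
}

With N = 16 and M = 4 this reproduces the h[0:3], h[4:7], h[8:11] and h[12:15] grouping of figures 12 and 14; with, say, N = 18 and M = 4 the depth becomes 5 and the last coefficient-memory holds h[15:17] followed by two zeros.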
Figures 13 and 14 below show a C-code description and a hardware implementation, respectively, of the Semi-parallel FIR filter (again with N = 16 and M = 4).
void semi_parallelFirFilter( int num_taps, int num_samples, const float *x, const float *h, float *y )
{
    int i, j;
    int i0, i1, i2, i3;              /* coefficient indices, one per execution-unit */
    const float *k0, *k1, *k2, *k3;  /* input-data pointers, one per execution-unit */
    float y_accum0 = 0.0f, y_accum1 = 0.0f, y_accum2 = 0.0f,
          y_accum3 = 0.0f, y_accum4 = 0.0f, y_out = 0.0f;

    (void)num_taps;  /* N = 16 and M = 4 are hard-coded in this unrolled version */

    /* 'x' must point at x[n], with the 15 history samples x[n-1] .. x[n-15]
       valid immediately before it; as in figure 10, each comma-sequenced
       statement models one parallel MACC cycle, with the indices and pointers
       for a cycle set up before its MACCs execute */

    /* Spin-up procedure : filling up the pipeline */
    i0 = 0; k0 = x++;
    y_accum1 = ( h[i0] * *k0 ) + y_accum0;

    i1 = 4 + i0++;
    k1 = (k0--) - 4;
    y_accum2 = ( h[i1] * *k1 ) + y_accum1,
    y_accum1 = ( h[i0] * *k0 ) + y_accum0;

    i2 = i1 + 4, i1 = 4 + i0++;
    k2 = k1 - 4, k1 = (k0--) - 4;
    y_accum3 = ( h[i2] * *k2 ) + y_accum2,
    y_accum2 = ( h[i1] * *k1 ) + y_accum1,
    y_accum1 = ( h[i0] * *k0 ) + y_accum0;

    i3 = i2 + 4, i2 = i1 + 4, i1 = 4 + i0++;
    k3 = k2 - 4, k2 = k1 - 4, k1 = (k0--) - 4;
    y_accum4 = ( h[i3] * *k3 ) + y_accum3,
    y_accum3 = ( h[i2] * *k2 ) + y_accum2,
    y_accum2 = ( h[i1] * *k1 ) + y_accum1,
    y_accum1 = ( h[i0] * *k0 ) + y_accum0;

    /* Steady state : one outer iteration = one output sample = 4 MACC cycles */
    i3 = i2 + 4, i2 = i1 + 4, i1 = 4 + i0, i0 = 0;
    k3 = k2 - 4, k2 = k1 - 4, k1 = k0 - 4, k0 = x++;
    for( j = 0; j < (num_samples - 1); j++ )
    {
        y_out = 0;
        for( i = 0; i < 4; i++ )
        {
            y_out   += y_accum4,
            y_accum4 = ( h[i3] * *k3 ) + y_accum3,
            y_accum3 = ( h[i2] * *k2 ) + y_accum2,
            y_accum2 = ( h[i1] * *k1 ) + y_accum1,
            y_accum1 = ( h[i0] * *k0 ) + y_accum0;
            /* derive the next cycle's indices and pointers from the current
               ones rather than re-evaluating them (strength reduction) */
            i3 = i2 + 4, i2 = i1 + 4, i1 = i0 + 4, i0++;
            k3 = k2 - 4, k2 = k1 - 4, k1 = k0 - 4, k0--;
        }
        k0 = x++;   /* the one new input sample read per output calculated */
        i0 = 0;     /* the first execution-unit restarts its coefficient group */
        *y++ = y_out;
    }

    /* Spin-down procedure : no new inputs are read; the indices and pointers
       for the first spin-down cycle were already set at the end of the last
       steady-state iteration, so each cycle's MACCs now precede its updates */
    y_out = 0;
    y_out   += y_accum4,
    y_accum4 = ( h[i3] * *k3 ) + y_accum3,
    y_accum3 = ( h[i2] * *k2 ) + y_accum2,
    y_accum2 = ( h[i1] * *k1 ) + y_accum1;

    i3 = i2 + 4, i2 = i1 + 4;
    k3 = k2 - 4, k2 = k1 - 4;
    y_out   += y_accum4,
    y_accum4 = ( h[i3] * *k3 ) + y_accum3,
    y_accum3 = ( h[i2] * *k2 ) + y_accum2;

    i3 = i2 + 4;
    k3 = k2 - 4;
    y_out   += y_accum4,
    y_accum4 = ( h[i3] * *k3 ) + y_accum3;

    y_out += y_accum4;
    *y = y_out;
}

Figure 13 : a C-code description of the Semi-Parallel form of the FIR filter (with N = 16, M = 4) which is a translation of the data-dependency graph of figure 12 above
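Both listings assume a calling convention that is easy to get wrong: the input pointer must address the first new sample, with num_taps - 1 valid history samples stored immediately before it. A minimal usage sketch (hypothetical buffer names, assuming the listing of figure 13 above):

/* prototype for the listing of figure 13 above */
void semi_parallelFirFilter( int num_taps, int num_samples, const float *x, const float *h, float *y );

#define NUM_TAPS    16
#define NUM_SAMPLES 256

int main( void )
{
    /* history samples x[n-15] .. x[n-1] live in front of the new samples;
       a few spare elements at the end keep the listings' trailing pointer
       increments (which are never dereferenced) inside the array */
    static float buffer[ (NUM_TAPS - 1) + NUM_SAMPLES + 4 ];
    static float h[ NUM_TAPS ];      /* filter coefficients h(0) .. h(15) */
    static float y[ NUM_SAMPLES ];   /* output samples y[n] onwards       */

    /* ... fill h[] and buffer[] here (zero the history for a cold start) ... */

    /* '&buffer[NUM_TAPS - 1]' is x[n], the first sample for which an output
       is to be calculated */
    semi_parallelFirFilter( NUM_TAPS, NUM_SAMPLES, &buffer[NUM_TAPS - 1], h, y );
    return 0;
}

systolicFirFilter of figure 10 follows the same convention, with num_taps = 4.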
With reference to the code description of the Semi-parallel implementation shown above in figure 13, notice how the coefficient indices and input-data pointers for each execution-unit are derived from those of the immediately preceding execution-unit (for example i1 = i0 + 4 and k1 = k0 - 4) rather than being explicitly re-evaluated from the loop counters. This simplification of index and/or pointer calculation between loop iterations is known as strength reduction, and in a hardware implementation it translates into simplified control logic with a lower gate-count.

[Figure 14 : showing the H/W implementation of the Semi-Parallel form of the FIR filter (with N = 16, M = 4), a translation of the data-dependency graph of figure 12 above; y_in feeds MACC_1 to MACC_4, which hold the coefficient groups h[0:N/M-1], h[N/M:2N/M-1], h[2N/M:3N/M-1] and h[3N/M:4N/M-1], and the partial sums y_accum1 to y_accum4 (with y_accum0 = 0) cascade into the output-accumulator ACC_1, which produces y_out]

Conclusion

This article has presented the different generic implementation architectures along the spectrum from low performance and low resource utilization to high performance and high resource utilization. The analysis started with the sequential model of computation, which is the lowest-performance architecture with the lowest resource requirements. The Transposed and Systolic architectures were then presented; these are fully-parallel implementations having the maximum throughput and the maximum resource requirements, and it was explained how and why each of the two suits a different range of problem sizes. Finally, the Semi-parallel implementation was analyzed; this architecture can be folded further, moving towards the sequential model of computation, or unfolded further, moving towards a fully-parallel implementation. Throughout the article, the three abstraction-level views of a signal processing algorithm (the code description, the data-dependency graph and the hardware implementation) were used to show the reader how different attributes of an algorithm are more readily exposed for ease of analysis and manipulation at each abstraction level, and the translation between these views was also demonstrated.