bachelors_thesis_stephensen1987

           
                 
BSc Thesis
Hans Jacob Teglbjærg Stephensen
Fast general purpose implementation of
convolutions using FFT
Supervisor: Christian Igel, Pengfei Diao
June 7, 2015

Contents
1 Introduction 3
2 Theoretical Foundation 4
2.1 Direct Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Fourier Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Fourier Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.3 2D Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.4 Convolution Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.5 Fast Fourier Transform: Cooley-Tukey Radix-2 algorithm . . . . . . . . 9
2.2.6 Alternative methods and implementations . . . . . . . . . . . . . . . . 10
3 Time-complexity analysis 11
3.1 The Cooley-Tukey Radix-2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 2D Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Optimizations 11
4.1 In-place calculations and bit-reversal . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Precomputation of twiddle factors . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 Inverse-FFT division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.4 Row major transposing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.5 Multi-threading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.6 Switching mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5 Implementation 15
5.1 Cooley-Tukey Radix-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5.2 FFTW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
6 Empirical evaluation & experimental setup 17
7 Results 19
7.1 Correctness & error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
7.2 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
7.2.1 Multithreaded implimentation . . . . . . . . . . . . . . . . . . . . . . . 20
7.2.2 Row major transposing & precomputing twiddle factors . . . . . . . . 20
7.3 External benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
7.4 Switching Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7.5.1 Correctness & error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7.5.2 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.5.3 External benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.5.4 Switching mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7.5.5 Possible improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
8 Conclusions and Outlook 27
A Testing results 28
A.1 Correctness & Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
A.2 Finding hard cut-offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
A.3 Linear model results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2

B Tutorial - Convolution using FFTW 33
B.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
B.2 Simple convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
B.3 Convolutions with plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
B.4 Convolution in other dimensions . . . . . . . . . . . . . . . . . . . . . . . . . 34
C Linear model based on relative error 35
Abstract
In this project, I show how convolutions can be computed efficiently using the Fast Fourier
Transform (FFT). I explain the theory that makes FFT possible and present an implementa-
tion, which can be incorporated into to the SHARK Machine Learning Library. Being a low
level algorithm consisting of many small mathematical operations, I show how both reduc-
tions in computational complexity and specific low level optimizations lead to a considerable
reduction in execution time. Different methods for computing the convolution exist, and I
show how it is possible to implement a mechanism to switch between methods. I conclude
that my implementation works and is asymptotically efficient, but that much more work is
needed for the efficiency to compete with state-of-the-art implementations. An alternative
approach using the FFTW library is therefore presented.
Resumé
I dette projekt viser jeg, hvordan man beregner foldninger effektivt ved brug af Fast Fourier
Transform (FFT). Jeg forklarer teorien, som muliggører FFT, og præsenterer en implementer-
ing, som kan indkorporeres i Machine Learning biblioteket SHARK. Fordi FFT er en low-level
algoritme, viser jeg hvordan reduktioner i beregningskompleksitet og specifikke små opti-
mieringer, gør en betydelig forskel for køretiden. Da der findes forskellige metoder til at
beregne foldninger, viser jeg hvordan en mekanisme til at vælge den hurtigste, kan imple-
menteres. Jeg konkluderer, at min implementering er asymptotisk effektiv, men at mere
arbejde er nødvendigt, for at konkurrere med de bedste eksisterende implementeringer. En
alternativ implementering som benytter biblioteket FFTW præsenteres derfor.
1 Introduction
Convolution, being a general mathematical technique, enjoy a broad range of applications in
areas such digital signal and image processing, data processing, finance and even in some-
thing as fundamental as multiplying large numbers. In general, the convolution is an inte-
gral defined function, but in the case of this project, we need only look at the special case
for discrete functions. The convolution is then a sum of products, and therefore computa-
tionally impractical to do by hand for large numbers. In time the emergence of the computer
breached the gap, and the applications is now wide and plentiful.
However, at a computational complexity of O n2 , there’s limitations on the problem size.
Since computational speed is a primary concern during this thesis, reducing complexity is
central to success, and luckily possible as well. As we will see, the convolution operation
enjoys an intricate relationship with the Fourier Transform through what is often known as
the convolution theorem, effectively converting the sum of products into a single product.
3

While Fourier Analysis itself was formalized by Fourier in his study of thermodynamics in
1822 [Berg and Solevej, 2011], the result we will use of discrete Fourier transforms (DFT),
was in fact originally established by Gauss around 1805. In 1965 it was shown how to ef-
ficiently compute the DFT on problems of sizes N = 2m by Cooley-Tukey [Duhamel and
Vetterli, 1990] with what is now widely known as the Cooley-Tukey Radix-2 algorithm,
one of the first of many Fast Fourier Transforms (FFT). With a computational complexity
of O(N lg N), with lg N being the base-2 logarithm of N, the algorithm opens up convolu-
tions to vastly larger problem sizes.
As well for this project is the hope for an efficient implementation of a convolution function
for use in the SHARK Machine Learning C++ library [Igel et al., 2008]. The SHARK library,
having a wide array of machine learning algorithms implemented, is presently lacking the
support for performing fast convolutions on signal data, making this project of direct and
immediate use.
From a theoretical point, the difference between solving the convolution problem using FFT
on one, two or n-dimensional data is arbitrary, I will however, focus mainly on implementing
and explaining the case of the 2D convolution using FFT. In addition, most of the work will
lie in explaining and implementing the Cooley-Tukey Radix-2 algorithm. A primary reason
for choosing this algorithm in particular is that its presentation was a breakthrough in per-
forming fast discrete Fourier transforms, this algorithm also dips into some of the advanced
index mappings needed, without getting too tedious and confusing.
Finally, I will assess correctness of the implementation and give estimates of the error on real
world data. I will also perform benchmarking and compare the results with benchmarking
results from the widely used tool Matlab and and implementation using the FFTW library for
solving the FFT.
Throughout this BSc thesis, I will use the terms signal and image interchangeably referring
to a discrete function which in practice is just an array of input data. Moreover N and M
will often without definition refer to arbitrary dimensions of input data and the size refers to
the total size of the input data, i.e. Size = MN. Likewise, K will refer to the dimension(s)
of a K or K×K vector or square matrix called the kernel.
2 Theoretical Foundation
2.1 Direct Convolution
A convolution is a mathematical operation on two functions f and g resulting in a new
function denoted f ∗g. The convolution operation has many practical applications. In image
processing for instance, the function g could represent an image, and the function f a weight
function known as a kernel. The kernel could in turn represent some filter or transformation
on the image such as a bloom filter or simply a translation or scaling. The convolution on
continuous functions f, g is defined by
(f ∗ g)(x) =
∞
−∞
f(t)g(x − t) dt .
4

In many areas of use, we are often presented with discrete data 1. A discrete version of the
convolution and its two-dimensional counterpart is defined as
(f ∗ g)[x] =
∞
k=−∞
f[k]g[x − k] .
(f ∗ g)[x, y] =
∞
k=−∞
∞
m=−∞
f[k, m]g[x − t, y − m] .
Note that even thought the functions can theoretically extend to infinity, in practice the func-
tions will often be assumed zero-valued outside some bound. In case of image processing,
the bounds is simply the image and filter size. If we assume zero outside the bounds, con-
volved values near the edge will contain irregularities (also known as edge artifacts) since
not all values computed were based on actual image pixels. We therefore define the valid
region as the output region computed exclusively from values inside the bounds.
Assuming zero outside the boundary reduces the sums to
(f ∗ g)[x] =
b
k=a
f[k]g[x − k] ,
where we, in context of the computer, choose to set a = 0 whereto it follows that b = K − 1
for K the kernel array size.
The above method for computing the convolution is called the direct method, whereas using
Fourier Transforms to perform the convolution is called an indirect convolution [Sundarara-
jan, 2001]. When we introduce the convolution in terms of the Fourier transformation, the
convolution will be assumed periodic, meaning for a function defined on n points, for every
point, f[x] = f[x + n]. This is fundamental to Fourier transformation, since the theory at the
root of Fourier Analysis requires the functions to have a period. Due to this, we will see edge
artifact appear near the edges due to the convolution sum doing a wraparound catching val-
ues from the other end of the signal. This is of little concern in practice as we simply extend
function by desired values to a size of N +K −1, in effect extending the period to N +K −1
as well. To see why this works we examine the boundary case x = 0 where k = K − 1. Since
the sum in the convolution goes from 0 to K − 1 we get
f[k]g[x − k] = f[K − 1]g[−(K − 1)]
= f[K − 1]g[N + K − 1 − (K − 1)] (Period N + K − 1)
= f[K − 1]g[N] ,
where we note that g[N] is placed safely in the area of our constructed values. Subtracting
one from every index of g in the above calculations shows the same is true in the boundary
case g[N − 1].
It is often necessary to assume the kernel has what is called a kernel center (sometimes
kernel anchor). The center can be seen as the special case in the sum, in which the signal
input, multiplied by the kernel center, is placed back at the same index. This does not happen
in general. For instance, if we want the result of the convolution to be the sum of every value
with its neighboring values we can apply the one-dimensional filter [1, 1, 1] to the signal. If
the signal is [0, −1, 3, 0, 0], we want the result to be [−1, 2, 2, 3, 0] but from the definition
alone we get [0 + 0 + 0, −1 + 0 + 0, 3 − 1 + 0, 0 + 3 − 1, 0 + 0 + 3] = [0, −1, 2, 2, 3]. Since
1
data is only defined on a (possibly infinite) number of points
5

the convolution has no center by definition, this abstraction is one we create and maintain
ourself. Define kc as the kernel center. When performing the convolution (f ∗ g)[x] for some
x, at some point in the sum, the product f[kc]g[x−kc] appears with signal values originating
from x − kc. In other words, if the kernel center is placed at index kc. The wanted result
will have values shifted in the positive direction by kc. This is a small detail relevant only to
implementation.
2.2 Fourier Transformation
2.2.1 Fourier Series
Before we introduce the Fourier Transformation, it is instructive to introduce the Fourier
Series. In solving the heat-equation2 Jean-Baptiste Joseph Fourier (1768-1830) founded
Harmonic Analysis. Fourier showed that any periodic functions can be described as an infi-
nite sum of sine and cosine functions. It’s worth giving this some thought, as this is neither
a trivial or intuitive result at first, it’s actually quite remarkable. Even without delving into
the theory of why this is true, some intuition for the Fourier transform is granted.
Let f(x) be a periodic function with period P. If we adapt the notation of the complex
exponential cos(2πnx/P) + i sin(2πnx/P) = e2iπnx/P = W−nx
P , the function f(x) can be
described exactly by the Fourier series
f(x) =
∞
n=−∞
cnW−nx
P ,
where the coefficients cn is given by
cn =
1
P
x0+P
x0
f(x)Wnx
P .
This opens up the possibility of describing the function f, not by its values f(x) in the spatial
domain, but by its frequency components cn instead. When we talk about representing a
function in the frequency domain, these are the values we talk about.
2.2.2 Discrete Fourier Transform
Until now we have talked almost exclusively in terms of functions f. For discrete functions,
it is sometimes notationally helpful to simply use xn = f[n] and to treat the values as a
sequence.
On a sequence of values x0, x1, . . . , xN−1, we define the Discrete Fourier Transform (DFT) F
as the sequence of values X0, X1, . . . , XN−1 given by:
F{xn} =
N−1
k=0
xkWkn
N = Xn .
Notice the resemblance with the definition of the frequency coefficients of the Fourier Series.
The integral simply reduces to a sum. By tradition, the factor of 1/N (1/P in the continuous
case), is removed. This has no consequences until when we want to perform the inverse
transformation, transforming the frequency components into values in the spatial domain
again. We could have let the transform have the factor of 1/N, and then let the inverse
transform be without it. The same is true for the factor of −1 in the exponent as should
2
Differential equation describing how heat moves through a metal plate
6

become apparent in the beginning of the next proof due to free movement of the 1/N factor
and symmetry in the exponent. The Inverse Fourier Transform F is hereby propositioned to
be given by:
F−1
{Xn} =
1
N
N−1
k=0
XkW−kn
N = xn .
Proof Inserting the formula of the Discrete Fourier Transform into the proposed inverse
formula yields
F−1
{F{xn}} =
1
N
N−1
k=0
N−1
m=0
xmWkm
N W−kn
N
=
1
N
N−1
m=0
N−1
k=0
xmW
k(m−n)
N .
Now we have two cases. If m = n we have
1
N
N−1
k=0
xmW
k(m−n)
N =
1
N
N−1
k=0
xn = xn ,
and if n = m we first notice that −2π(k + N/2)/N = −2πk/N − π and remind the reader
that for both sine and cosine it holds that sin(x − π) = − sin(x) and cos(x − π) = − cos(x)
and for our purpose we get in addition Wkx+N/2 = −Wkx. Letting m − n = p, and choosing
to sum differently, we get
1
N
N−1
k=0
xmW
k(m−n)
N =
1
N
N−1
k=0
xmWkp
N = xm
1
N
N/2−1
k=0
Wkp
N + W
kp+N/2
N
= xm
1
N
N/2−1
k=0
Wkp
N − Wkp
N = 0 ,
meaning only the case where m is equal to n is non-zero, and the whole summation reduces
to
F−1
{F{xn}} =
1
N
N−1
m=0
N−1
k=0
xmW
k(m−n)
N = xn .
This proves the proposed inverse is in fact the inverse of the Discrete Fourier Transform.
As a side note, one could think the step from the continuous case to the discrete case of
the transform, was at the cost of equivalence or precision, as is the case with many discrete
approximations. This proves that this is in fact not the case. This has the consequence, that
the error of computation should be very low, as any error must come from rounding errors
alone.
2.2.3 2D Transform
The Fourier Transform is defined in any finite number of dimensions, and of particular inter-
est is the two dimensional case given by
Xn,m =
N−1
k=0
M−1
l=0
xk,lWkn
N Wlm
M =
N−1
k=0
M−1
l=0
xk,lWlm
M Wkn
N .
Notice how we right away can, by keeping k constant, rewrite the 2D DFT in terms of an
inner 1D DFT, and then another 1D DFT of the result. This is a common way to perform the
multidimensional DFT’s in practice.
7

2.2.4 Convolution Theorem
The motivation for going through so much trouble of talking frequencies and transformations
has so far only been remarked briefly, so it’s a perfect time to introduce the convolution
theorem, which is the one central result used in every computation of the convolution using
Fourier Transformation, and therefore central to this project.
Theorem 2.1 (Convolution Theorem) For two discrete periodic functions f and g with equal
period, the following is true
F {f ∗ g} = F {f} · F {g} , (2.1)
and equally
(f ∗ g)[n] = F−1
{F {f} [n] · F {g} [n]} . (2.2)
Before proving the convolution theorem, it’s worth discussing what this means in our case,
and under what restrictions. Firstly, instead of convolving the two functions directly by
calculating a sum for each value, we end up with a single multiplication. As we want the
convolution theorem to be applicable, we need the two convolving functions to have the
same period. When performing convolution using the direct method, we need only multiply
the numbers inside the range as we assume zero values outside the bounds, which therefore
contributes nothing. Conversely, when performing the convolution using the convolution
theorem we will need to extend the kernel to the size of the signal.
Proof This proof uses the same trick we used when proving F−1 was the inverse of F that
reduced a sum of complex exponential to a single non-zero case. We start with the definition
of the convolution, and insert the inverse of the transform F−1{F{f}} = f in place of both
functions. For convenience we define F{f} = F and F{g} = G. We get the following:
(f ∗ g)[m] =
N−1
k=0
f[k]g[m − k]
=
N−1
k=0
N−1
n=0
1
N
F[n]W
−nk
N
N−1
l=0
1
N
G[l]W
−l(m−k)
N
=
1
N
N−1
n=0
F[n]
N−1
l=0
G[l]
1
N
N−1
k=0
W
−nk
N W
−l(m−k)
N
=
1
N
N−1
n=0
F[n]
N−1
l=0
G[l]W
−lm
N
1
N
N−1
k=0
W
k(l−n)
N .
With similar arguments as in the proof of the inverse. If n = l the sum over k becomes N,
and reduces to zero in all other cases. Thus the sum over l collapses to the case l = n and
we get:
1
N
N−1
n=0
F[n]
N−1
l=0
G[l]W
−lm
N
1
N
N−1
k=0
W
k(l−n)
N =
1
N
N−1
n=0
F[n] · G[n]W
−nm
N
1
N
N
= F−1
{F[n] · G[n]} .
This ends the proof.
8

2.2.5 Fast Fourier Transform: Cooley-Tukey Radix-2 algorithm
The same year as the Programma 101, the first Desktop Personal Computer went into produc-
tion, Cooley and Tukey published a short paper detailing how the Discrete transformation
problem of size N = 2n, can be solved efficiently[Duhamel and Vetterli, 1990]. The idea
is to restate the problem recursively in terms of two subproblems of half the size occurring
when splitting the input in even and odd indices. We will here use notation FN to explicitly
state the size of the input sequence of the transform. The derivation goes as follows:
FN {xk} =
N−1
n=0
xnWkn
N
=
n even
xnWkn
N +
n odd
xnWkn
N
=
N/2−1
m=0
x2mW2km
N +
N/2−1
m=0
x2m+1W
k(2m+1)
N
=
N/2−1
m=0
x2mW2km
N + Wk
N
N/2−1
m=0
x2m+1W2km
N .
Renaming ym = x2m and zm = x2m+1, letting M = N/2 and rewriting the exponent to
reflect the new problem size
W2km
N = e−4πkm/N
= e
−2πkm
N/2 = Wkm
N/2 = Wkm
M ,
we end up having two DFTs of half the size
M−1
m=0
ymWkm
M + Wk
N
M−1
m=0
zmWkm
M = FM {yk} + Wk
N · FM {zk} .
The factor Wk
N is called a twiddle factor and is the only complex exponential needing to be
calculated in practice since in the base case N = 1 we simply get
F1{xk} =
0
n=0
xnWkn
N = x0W0
1 = x0 .
With analogous derivation, since 1/N = 1/2 · 1/M, the inverse becomes
F−1
N {xn} =
1
2
1
M
M−1
m=0
ymW−km
M + W−k
N
1
M
M−1
m=0
zmW−km
M (2.3)
=
1
2
F−1
M {yk} + W−k
N · F−1
M {zk} . (2.4)
This can be recognized as a classic divide and conquer approach. We gain some simplicity
by how neatly the new subproblems shrink until the absolutely trivial base case solves itself.
However, we gain a great bit of complexity due to the non-trivial index mapping required as
a result of the splitting of even and odd indices repeatedly. This will be addressed in 4.1.
Another consequence of this way of splitting is the possibility to reuse reoccurring twid-
dle factors. This will be explored further in 4.2. Also to be noted is the occurrence of a
great number of so called free complex multiplications. Free multiplications are multipli-
cation of any complex number by +1, −i, −1, +i that happens at each recursive step when
k is 0, N/4, (2N)/4 and (3N)/4 respectively. The exception is N = 2 where we only have
multiplication by +1 and −1. This was not exploited in the project due to time constraints.
9

2.2.6 Alternative methods and implementations
James W. Cooley and John Tukey made a breakthrough with their presentation of the Radix-
2 divide and conquer algorithm, but other methods has since been presented, some with
great success as well. Winograd, for instance, managed to reduce the number of arithmetic
computations using convolutions. Given the goal of this project, this should come as a bit
of a surprise. However, implementations of this method has been disappointing and the
emergence of so called mixed-radix and split-radix algorithms took the scene[Duhamel and
Vetterli, 1990]. The split-radix algorithm resembles the radix-2 approach greatly. Like the
Cooley-Tukey algorithm, the split-radix algorithm divide the input problem into subproblems
of smaller size, but instead of just dividing into two problems of equal size, the problem is
now divided into 3 subproblems, one at half size and two at a quarter the size of the origi-
nal problem. This happens to open up the possibility of performing a higher degree of free
complex multiplications.
The Cooley-Tukey algorithm has also been generalized to any problem size N that can be
factored N = N1N2. This is the aforementioned method named the mixed-radix method
[Frigo and Johnson, 2005]. The index mapping is now instead k = k1 + k2N1 and n =
n1N + n2 and the complete formula needing to be computed is
Xk1+k2N1 =
N2−1
n2=0
N1−1
n1=0
xn1N2+n2 Wn1k1
N1
Wn2k1
N Wn2k2
N2
.
This is effectively making a DFT of size N into a 2D DFT of size N1 × N2. It is not at all
clear why this way of rewriting actually reduces the number of operations, but it’s due to
the reuse of the smaller DFTs. This is what we do in the Cooley-Tukey Radix-2 algorithm as
well having just two DFTs of half the size. We know a DFT of size N has worst case time
complexity in the order of N2 operations by use of the definition. Using this index mapping,
the inner sum are DFTs of size N1. These are each computed a total of N2 times and then
reused to calculate the N values of the original DFT. The outer sum of length N2 is computed
for each N values. In total we get:
T(N) = NN2 + N2N2
1 = N1N2
2 + N2N2
1 = N1N2 (N1 + N2) .
For N1, N2 > 2 this is less than N2[Duhamel and Vetterli, 1990].
One of the most, if not the most, successful FFT library, FFTW, tailors the execution of the
transform to the specific computer architecture it runs on by use of an adaptive method. In
effect, the FFTW implementation is composed of various highly optimized and interchange-
able code snippets (codelets), of which they benchmark the efficiency, during computation,
and switches based on the measurements[Frigo and Johnson, 1998]. An implementation
of this will not be done due to time constraints. Instead, an alternative implementation of
the convolution solver, where the transforms are performed by use of the FFTW library, is
implemented alongside. Since the FFTW library supports precomputation of transformation
plans based on execution time measurements, I have implemented functionality to create a
convolution plan in case multiple convolutions of equal size is needed. This serves as an top
layer interface for the FFTW planning code. A tutorial detailing the use of the implemented
convolution functions using FFTW can be found in the appendix B.
10

3 Time-complexity analysis
3.1 The Cooley-Tukey Radix-2 Algorithm
As shown in the derivation of the Cooley-Tukey algorithm Radix-2 in section 2.2.5, the al-
gorithm uses a classic divide and conquer approach with some not so classic splitting of the
input data. Since all input data of size N will be padded to size Ñ = 2n where n is at most
lg(N + K − 1) . We then get n − 1 = lg( Ñ/2) recursion levels when skipping the trivial
base case. For every recursion level, we perform 2 Ñ operations on complex numbers. We
also reorder the data using bit reversal in linear time. The total computational complexity
of a 1D FFT thus becomes T(N) = 2 Ñ lg( Ñ/2) + Ñ. As we zero-pad first by K − 1 and then
to the next power of two, we can upper bound the zero-padding by the inequality Ñ ≤ 3N
giving us:
T(N) = 2 Ñ lg( Ñ/2) + Ñ
≤ 6N lg(3N/2) + 3N
≤ 6N lg N + 7N
= O(N lg N) .
In case of the 2D FFT, let N1, N2 be the input dimensions of N1 × N2 input. Foreshadowing
a little, we define N = N1N2. We first perform N1 1D FFT’s on the rows of size N2 and then
N2 1D FFT’s on the columns of size N1. With results from above, we get
T(N1, N2) ≤ N1(6N2 lg N2 + 7N2) + N2(6N1 lg N1 + 7N1)
= 6N1N2(lg N1 + lg N2) + 14N1N2
= 6N1N2 lg(N1N2) + 14N1N2
= 6N lg(N) + 14N = O(N lg N) .
Besides an increase in the factor of the linear complexity arising from the fact that we need
to reorder the data again before performing 1D transforms on the columns, we get the same
computational complexity in regards to the total size of the input. Note that I purposely
elected not to add in other linear factors, such as array copying, and later another linear
factor arises from transposing the data, since they do not contribute to the asymptotic com-
plexity.
3.2 2D Convolution
As per the definition of the convolution, the implementation of the direct method requires
each of the N resulting discrete values to be calculated as a sum of length K where K is
kernel size. Thus we get a total computational complexity T(N, K) = O(NK) operations.
Again it is assumed that constant factors are negligible. In case of the 2D convolution,
with the same argument, we simply get O(NMK2). Using the indirect method instead,
the running time is now completely dominated by the Fourier Transformation. Using the
Cooley-Tukey Radix-2 algorithm for computing the forward and backwards transformation
the computational complexity is then O(MN lg(MN)) in the 2D case.
4 Optimizations
While an implementation of the Fourier Transformation using the Cooley-Tukey Radix-2 al-
gorithm reduces the computational complexity greatly from the order O(N2) to O(N lg N),
it is done at the cost of increased constant factors as well as implementation complexity. It
is therefore relevant looking into the possibility of optimizing the computations.
11

4.1 In-place calculations and bit-reversal
The input data type is, as with most in the SHARK library, assumed to be real. Since the
fourier transformation produces complex numbers, the closest we can get to a complete in-
place calculations are to copy once from real to complex, and then back when done. This is
indeed possible, while not entirely trivial to see why since each recursive step of the divide
and conquer algorithm splits the data in a way that seems intuitively like opening a zipper.
This operation however, has a well defined structure as I will show in the following.
To systematically assert the index mapping at each isolated recursion level, observe the fol-
lowing function:
s(x) =
x
2 : x even
x+N−1
2 : x odd
.
The function s(x) is a permutation of the set B = [0, 1, . . . , N − 1] with N = 2m for m ∈ N,
and represents the index mapping of a single step of the recursion in the Cooley-Tukey Radix-
2 algorithm. It happens that s(x) is also what occurs in a cyclic right shift on bit-strings of
length m. To see this we need only remind the reader what happens in a cyclic right shift.
For a number x, if the least significant bit is 0, x is even and the now ordinary right shift
simply halves x. If the least significant bit is 1, x is odd, and the least significant bit becomes
the most significant. In effect, the process can be seen as subtracting the least significant bit,
divide by two, and adding 2m−1, thus we get s(x) = (x − 1)/2 + 2m−1 = (x − 1)/2 + n/2 =
(x + N − 1)/2 which is the odd case of s(x).
Definition 4.1 (Bit-reversal) Define the bit reversal of a bit string bm−1bm−2 . . . b0 of length
m, as the bit string b0b1 . . . bm−1 with bits placed in reverse order such that bit bi is placed at
index m − 1 − i.
Theorem 4.1 Recursively splitting an array of length N = 2m into two evenly sized arrays
with even numbers in the first and odd numbers in the second can be done in-place by complete
bit-reversal of the original index i ∈ [0, 1, . . . , N − 1].
Proof At each recursive step j ∈ [1, . . . , m] the splitting is equivalent to a circular right-shift
of the m − j − 1 rightmost bits while not touching the j − 1 first. Thus at the end of the j’th
split, the j first bits are placed correctly. The recursion terminates in the trivial case when
there’s only one bit remaining, and a circular right-shift alters nothing. Since any bit bi is
right-shifted i times before it at the (i+1)’th step is placed at index m−(i+1) = m−1−i, the
entire bit-string ends up being reversed. Thus by definition of the Bit-reversal, recursively
splitting in even and odd of the sub-arrays is equivalent to reversing the entire bit-string.
In the case of the case of Cooley-Tukey, it is not needed (or efficient for that matter) to per-
form a dividing operation at each recursive step. Instead, reordering the entries beforehand,
into the indices given by the reversed bit string of the indices is sufficient. The output will
be placed correctly when the algorithm finishes. As a side note, this procedure could have
been done by index mapping instead. Two problems then arise. Firstly, it turns out not to be
a trivial task to have input placed correctly in the end. Usually you will have them placed in
a manner comparable to the bit-reversed order. Secondly, this may in fact slow computation
due to inefficient memory placement. In truth, many different methods for reordering has
been proposed, many doing so during computation. As to which one is faster may depend
on the particular hardware used in the computation [Frigo and Johnson, 2005].
12

4.2 Precomputation of twiddle factors
At each level of the recursion. A twiddle factor e−2πik/N with 0 ≤ k < N ∈ Z is calculated.
Assume two recursion levels of problem size N, M with M = N/2. For any twiddle factor
e−2πik/M we get e−2πik/M = e−2πi(2k)/N . Since 2k < 2M = N, this twiddle factor is needed
again here. This property can be exploited by precomputing the twiddle factors based on
the largest sub-problem and then reusing them. The factor of 2k specifies clearly that the
needed entry will be found at the index of double value in the recursion level above.
Note also the following useful property
e−2πi(k+N/2)/N
= e−2πik/N
e−2πi(N/2)/N
= e−2πik/N
e−πi
= −e−2πik/N
.
The result is that we only need half the twiddle factors. Since the twiddle factor needed at
index k < N/2 is simply the negative of the twiddle factor needed at index k + N/2.
Without precomputing twiddle factors, we end up computing, for any recursion level, N/2
twiddle factors (since the second identity above applies). Thus we end up with an order of
O(N lg N) twiddle factors to compute while precomputation requires only an order of O(N).
4.3 Inverse-FFT division
From formula 2.4, the recursion for the inverse transform, the factor of 1/2 is multiplied
at each recursion level. Since the recursion performs an even split on a problems of size
N = 2m, we get a total recursion depth of lgN = m. As every entry is multiplied by 2 at
each recursion level, the factor ends up being simply 2m = 2lg N = N. It is therefore more
efficient to instead multiply every entry by 1/N when the recursion has completed. On a
side note, in the FFTW library, they have elected not to perform the division at all. The only
difference of the forward and backward transform is therefore just the sign of the exponent.
4.4 Row major transposing
Because the 2D FFT is performed by doing 1D FFTs on each row and then on each column,
row major memory optimizations3 will only take effect in the latter case if we transpose the
data so the location in memory becomes consecutive again. Note that we need not transpose
the result after the first transformation since both signal and kernel will have been trans-
posed, and the term by term multiplication is therefore unaffected. Since we equally should
transpose again when performing the inverse FFT, the final result will, by coincidence, not
need to be transposed again before the convolution terminates. Unfortunately we lose some
generality in the FFT implementation, since the function gets further tied (or specifically
designed) to work with this particular convolution implementation, and less oriented for
general use elsewhere. To correct this, the FFT will transpose the data again by default, but
can be omitted by choice.
4.5 Multi-threading
While performing calculations in parallel can be tricky, and therefore not something initially
thought to be part of this project, the fact that the FFT is performed on both the signal and
3
Systems are optimized to pull consecutive data in RAM memory to the cache in batches in an attempt to
guess what data will be used next.
13

kernel separately lends itself to easy parallelization. The parent thread will carry out one
FFT, while a child thread will carry out the second FFT. It is to be expected that performing
the FFTs carry the bulk of the computation time, we can therefore expect a reduction of little
less than 1/3 of the total computation time. This is under the assumption that the input is
big enough for the second thread to execute at roughly the same time as the main thread.
4.6 Switching mechanism
From the asymptotic analysis, we expect a difference in the growth in execution time of the
direct and indirect convolution method for increasing problem sizes. The ability to switch to
the faster method is therefore of interest in regards to minimizing running time. While we
do expect the indirect convolution method to out-compete the direct method asymptotically,
the constant factors means the direct method will be faster for smaller input, and especially
if the kernel is small. It is also expected that the difference in running time will not be a
function which can be found by simple inspection, due to the running time of the indirect
method primarily relying on the zero-padded size. Recall that we need to both zero-pad the
kernel to match the size of the signal, and then again both signal and kernel to reach a power
of two for the Cooley-Tukey Radix-2 algorithm to be applicable. The result is a function of
2 lg(max{N,K}) . A possible obscurity one could envision is the possibility that, even if the
indirect method wins over the direct method for some problem size, bigger problem sizes
may be solved faster with the direct method due the zero-padded size jumping to the next
power of two.
In an effort to develop a mechanism for switching between the two methods, I will attempt
to develop two functions that model the running time of the methods. To do this, bench-
marking data will be generated containing running times for a number of input. In case
of the indirect method, the input sizes will be converted to zero-padded size, and for both
methods the data will be linearized in accordance to the expectation from the asymptotic
analysis. The hope is to fit a good linear model for estimating running time. Since the in-
verse of n lg n is not among the elementary functions, this will be estimated using a simple
bisection search algorithm written in Python 2.7.
While a classic linear model often shows good results when fitting data, results in this project
has show the linear models becomes too inaccurate relative to the running time for smaller
input. This makes sense since the linear model minimizes the square of the distance from
the modeled function to the data points. Intuitively, an error of 0.1 second is relatively small
if the running time is 10 seconds, but enormous is the running time is 0.001 seconds. Some
preliminary testing showed point-wise errors as large as 30000 times the measured time.
This will be addressed using an alternative linear modeling scheme detailed below. Another
problem is that the running times varies a lot on the relative size for small input sizes and
little for bigger input.
A simple and practical solution is to perform a two pass test before even calculating the
running times of the two functions or attempting to model them, where we find two lower
bounds to filter out some small signal and kernel sizes where only the direct method should
be used. The first test is to find the signal size s0 where the direct method is faster for any
signal and kernel size smaller than s0. The second is to find the signal size s1, where for all
kernels of quarter of the signal size, the direct methods is faster than the indirect method.
Note that it must be true that s0 ≤ s1. Since this is a finite small number of signal and kernel
sizes, the difference in execution time can then be measured for all of them and the breaking
point can be found simply by inspection.
Using a linear model ˆy = mx + c estimating the running time t, we can directly compute the
14

estimated time consumption of the direct method by plugging NMK2 into the linear model.
For the linear model arising from the indirect method we need to undo the transformation.
Let z be the zero-padded size of the input, t the execution time and T(x) the transformation
with T−1(x) = x lg x, we have:
T(t) = mz + c
t = T−1
(mz + c)
= (mz + c) lg(mz + c) .
As noted, one big difficulty is the relative nature of the error when benchmarking. In practice
a great relative inaccuracy is seen for smaller input when minimizing the squared difference.
A solution is to use the following linear modeling scheme. Let (xi, yi) be measured data
points. We want to approximate the data by a predicting function ˆyi = mxi + c to not
minimize the squared difference as is usually done, but minimize the squared relative error
n
i ((xi − ˆyi)/xi)2. The complete derivation was performed and can be found in appendix C,
but in short we compute m and c by
m =
n
i=1
n
j=1
xj−xi
y2
i yj
n
i=1
n
j=1
x2
j −xixj
y2
i y2
j
, c =
n
i=1
1
yi
− m · n
i=1
xi
y2
i
n
i=1
1
y2
i
.
As noted by [Frigo and Johnson, 1998], there’s a great difference between which operations
during computation are faster using different computer architectures. This means that what
I conclude during this project may not yield optimal results on other architectures. It is
however possible to automate some estimation, effectively carrying out the steps above on
the system it is used on, in an effort to choose the optimal signal sizes to switch. This will not
be explored during this project, mainly due to how measuring execution time is OS specific,
and thus a bit out-of-scope.
5 Implementation
5.1 Cooley-Tukey Radix-2
In short, the implementation requires the input signal and kernel to be transformed using
FFT, multiplied element-wise, and the result transformed back. The code has primarily been
broken into two functions, one for performing the convolution, and one for the FFT and the
its inverse.
A flow chart of the convolution program can be seen in figure 1. The convolution function,
based on the requested boundary mode, determines how much zero-padding is needed and
then create the needed data structures for storing the complex numbers used during trans-
formation and in the frequency domain. Since the two FFT’s are needed at the same time,
an attempt to spawn a child thread for carrying out one FFT is done. To not require the user
to use C++11, this is done using pthread. A simple data structure for a carrying parameters,
and a wrapper function has been written to not put requirements on the implementation of
the FFT function. Given the results of the transforms, the convolution function multiplies the
values element-wise, performs an inverse FFT and copies the resulting values back into the
given output data structure indexing in accordance with kernel center and boundary mode.
The implemented FFT function, carrying the majority of the complexity in the implementa-
tion, works as follows. First the data is copied into the given output data structure using
15

Figure 1: Flowchart detailing the implementation of the convolution function, the FFT func-
tion and how they interact.
16

bit-reversed indexing. Then the twiddle factors are precomputed. After that, a four times
nested for-loops carries out the 1D recursion on the rows in which the following happens:
1. For every row
2. for every recursive level starting with N/2 subproblems of size 2.
3. for every subproblem of the current size (2, 4, 8, . . . )
4. for every element in the current subproblem, multiply the required values and twiddle
factors, and save them accordingly to fit the next subproblem size of double size.
By this time we reorder the input again using bit-reversal preparing 1D FFT’s on the columns,
but instead of performing the calculations on the columns, we transpose the input and per-
form the transformations on the rows again. As these computations are completed, the
function call terminates.
5.2 FFTW
Since results were not what was hoped for, an effort to make an implementation using the
FFTW library has been created alongside for both one, two and three dimensional input data
as well. To use the FFTW library, it is required to create both input and output arrays, since
in-place calculations is not possible for larger input sizes using FFTW [FFTW-Docs]. To save
memory use, three auxiliary arrays are made instead of two for both signal and kernel, and
the FFT results are juggled around to only overwrite already used data.
Since FFTW uses precomputed plans to optimize computations, an interface for performing
convolution utilizing this feature has been written as well. This is an optional step which is
advisable only in the case many convolutions of the same size input is needed. Using a plan
also gave the possibility to store a transformed kernel for reuse in case the same kernel is
needed in multiple convolutions. The specific details on the use of this implementation can
be found in appendix B.
6 Empirical evaluation & experimental setup
To systematically test the written program, a number of tests has been created and suitable
data chosen to best assert the success of different formulated goals. I have chosen to use
the well known Lena image shown in figure 2, for most testing and benchmarking purposes,
sliced or extended to needed sizes. This image is often used for comparing results between
different image processing algorithms. Here, the choice is a little bit arbitrary, since we do
not need to resort to a qualitative measure for the convolution method. Either it works or it
doesn’t, and the error can be measured at round-off level, making the testing entirely quan-
titative. For this reason, special input data, of which the outcome is known by analytical
means, has been used as well. When benchmarking the implementation I will be using CPU
time measurements in most cases. The reason for this is two-fold. Firstly, I’d like to factor
out the unknowns of operating systems CPU scheduling to give a more general measure, and
secondly, this seems to be the way benchmarking is often done other places (as in Matlab).
This in turn helps to compare results.
To cover the widest range of sizes of input data, the problem sizes used in testing are set to
increase exponentially by a real exponent by either 1.5 or 2, and then floored to an integer.
Unless otherwise stated, each data point in each test is generated from an average of 25
17

Figure 2: Lena, the primary test image of use.
repetitions of that particular problem size.
All tests were done on an Intel Core 4x2 50Ghz i7-4710MQ. The testing goals are, and
will be tested, as follows:
• Correctness & error estimation: As mentioned previously, the error should be low.
Due to round-off errors, some error in the calculations is to be expected. An IEEE
754 double-precision floating point numbers has 52 bits for representing the fractional
part of the number. Since we get an extra bit of precision due to the first digit being
represented implicitly4, the error can thus at most be the difference in the last digits5
2−53 = 10−15.955. While it is theoretically possible to formally derive bounds on the
propagating error, the error is tightly connected to both the relative and absolute size
of the numbers. I therefore expect a result with too large an upper bound on the error
when compared to what is both acceptable and realistic. Previous work suggest the
error in general for different FFT methods increases as a factor of
√
N on average
where N is the input size [Duhamel and Vetterli, 1990]. Thus we should expect and
error of about
10−15.955
·
√
20402 ≈ 10−13
,
on average for the biggest input being transformed. Since every output value from the
convolution has passed two transforms, the error as a result of the convolution should
be around 10−10. As well as the average error, the maximum error is observed. This
is primarily to ensure no single significant error is hiding under a low average error.
These could come from boundary case errors for instance.
Using constructed M ×N input data such that the value at index (i, j) is i+j, and using
a 3 × 3 kernel with center (1, 1) and values 1/4 at indices {(0, 1), (1, 0), (2, 1), (1, 2)}
while leaving the rest zero, should return the input data as output in the valid region.
The upper edge (j = 0) should, except for corner pieces, have the convolved value
4
The fractional number is represented in a such way that the first bit is always one, and can therefore be
represented implicitly.
5
This is actually only true under the assumption that the exponent is 1.
18






0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6





∗



0 1
4 0
1
4 0 1
4
0 1
4 0


 =





1
2 1 7
4
3
2
1 2 3 11
4
7
4 3 4 7
2
3
2
11
4
7
2
5
2





Figure 3: Example of test input for the correctness test
(3i + 1)/4. The bottom edge (j = M − 1) should have values (3(M + i) − 4)/4. Due to
symmetry, this also applies for the left and right edges switching i with j and M with
N. The corner values should, in case M, N > 1, have values 1/2, (N − 1)/2, (M − 1)/2
and (N + M − 3)/2 respectively. An example is shown in figure 3.
Two concerns are present with the proposed test. Firstly, since the kernel is symmetric,
it leaves open a few areas of potential symmetry indexing errors. A second problem
has to do with the nature of values in the frequency domain. Patterns or structures
in the spacial domain often show up as zero values in the frequency domain. For the
above test, the signal input turns out to only have non-zero values in the first row
and first column in the frequency domain. To cover these two problems, tests with
a translating and scaling kernel using the Lena image as signal will also be tested.
If the kernel anchor is c = (cx, cy) and all values in the kernel is zero except for
a = (ax, ay). Convolving any image with this kernel will shift all pixels right by a − c =
(ax − cx, ay − cy) and scale them equal to the value of the kernel at a.
• Optimization Benchmarking: For the implemented optimizations to be any good, we
want to see a performance increase. For the purpose of testing this, a base imple-
mentation has been written, and predefined problem sizes to test on has been chosen.
Benchmarking each optimization separately, I will measure the absolute and relative
difference in the execution time of the optimization and the base test.
• Switching mechanism: Since both a direct and indirect method has been imple-
mented, we need to test if the estimation process is working. To test this, measure-
ments and estimates using the developed models will be compared. Ideally, the esti-
mates are accurate in regard to time consumption of the methods, but most impor-
tantly is that if we chose to use a method solely based on the prediction, we would like
it to be the case that it was in fact the best choice. For this purpose, a wide range of
new problem sizes, not used in modeling, will be benchmarked to see if the prediction
mechanism chose the faster method. A success rate will be presented.
• Convolution Benchmarking (External): For testing the implementation against other
methods, I will compare the implemented Cooley-Tukey Radix-2 algorithm against an
implementation using FFTW and Matlab. Since the convolution implementation using
FFTW also supports precomputed convolution plans and static kernels, the execution
time when using this feature will be measured and compared as well here.
7 Results
7.1 Correctness & error
Using three test cases (symmetric, asymmetric and symmetric on valid boundary mode), the
absolute difference between the correct analytical result and output is show in figure 4. The
19

Figure 4: Error plot of the three correctness and error tests plotted alongside a fitted square
root function and the expected error for comparison.
error is plotted alongside a fitted square root function. The first thing we notice is that the
error is quite low. An order of about 3 · 10−13 at most found at for the largest input. Looking
at the fit it’s not clear if the error rises of the order of a square of the input size as suggested
but a residual plot would no doubt conclude this is not the case with the supplied data. The
error is also lower than expected.
7.2 Optimizations
7.2.1 Multithreaded implimentation
Results of the benchmarking of the multi-threaded implementation can be seen in figure
5. It is clear that as the input reaches a certain size, we see a factor of reduction of about
0.3 in computation time for the Cooley-Tukey Radix-2 implementation. In case of enabling
multi-threading using 4 threads and using FFTW, we see a quite negative impact for small
input but greater reduction in execution time for bigger input sizes. We also see that the
execution time reduction seems to be more volatile and harder to predict. Notice that the
measured time is real time as opposed to CPU time. This is because measuring the CPU time
measures the total time across all threads thus showing as a gain in running time as opposed
to a saving.
7.2.2 Row major transposing & precomputing twiddle factors
The result of the optimizations shown in figure 6 shows clearly the optimizations makes a
great difference. Transposing the data seems to save little for small input sizes, but ends
up making quite a difference for larger sizes. Conversely, precomputing twiddle factors has
great effect in the beginning, but the improvement drops quite a bit for large input sizes. We
notice that this change seems to happen around the same signal sizes.
7.3 External benchmarking
Benchmarking the Cooley-Tukey Radix-2 implementation against an implementation using
the FFTW library and performing the convolution in Matlab was performed and the result can
20

Figure 5: Real time benchmarking results comparing elapsed time using of 1 and 2 threads
for the Cooley-Tukey Radix-2 implementation and 1 and 4 threads for the implementation
using FFTW.
Figure 6: The speed increase of the implemented optimizations measured in CPU time. (Left)
Direct comparison of running times (Right) Factor of time reduction compared to the base
implementation.
21

Figure 7: Running time comparison of the implementations compared with Matlab. The
implementation using FFTW has been benchmarked both with and without precomputed
plan. Matlab has been measured two times since restarting Matlab each time seems to have
a great effect on the running time.
be seen in figure 7. Performing the convolution using Matlab clearly comes out on top, with
the FFTW implementation following reasonably close. On the other hand, the implementa-
tion using the Cooley-Tukey Radix-2 implementation written during the project is lacking
behind. What can be seen from the log-log plot is that it does in fact, for small input, come
out on top relative to Matlab and on one of the FFTW implementation, but that this changes
quickly in favor of the other solutions.
Inconsistent benchmarking results led me to perform additional tests with Matlab, restarting
Matlab after each run. This has a clear negative impact on the running time when using
Matlab. From the plot we also notice how using the implemented convolution plan succeed
in further reductions to the execution time. Adding a static kernel to the plan results in the
lowest measured execution time per convolution.
7.4 Switching Mechanism
Performing the discussed two pass test to filter out small input sizes where only the direct
method should be used yielded the result that for signal sizes 522 or lower, for any kernel
size, the direct method will be faster. The second pass gave the result that for signal sizes
782 or smaller with kernel size 192 or smaller, the direct method should be chosen as well. A
more detailed test output can be seen in appendix A. In the rest of this section, the modeled
data is all of bigger input sizes.
Holding kernel size constant, we plot the benchmarking results as a function of the signal
size. The result is seen in figure 8. We see the direct method seems to be linear in the signal
size with varying growth rate based on the kernel size. The indirect method shows clear
sign that the running time is dominated by the zero-padded size the FFT require. Since both
signal and kernel is square N × N and K × K, the jumps happen when the zero-padded size
of N + K − 1 goes from 2m to 2m + 1.
Recall that we expect the direct method to be linear in the product of the size of the signal
and kernel. The result of a linear fit minimizing the square of the relative difference can be
22

Figure 8: Benchmarking plot showing execution times. (left) direct method (right) indirect
method. Each color indicates a running time function of constant kernel size but varying
signal size.
seen in figure 9 with m ≈ 2.4 · 10−9 and c ≈ −2.3 · 10−5. Since we are minimizing the square
of the relative error, it’s hard to see from any plot if it fits the data well. Instead we look at
the error plot also shown in figure 9. The error is clearly not gauss distributed and has clear
signs of patterns. The error plot shows deviations of a factor of about 0.25 at most.
For the indirect method we plot the time measurements as a function of the zero-padded size,
and it’s approximation shown in figure 10 with m ≈ 2.8·10−7 and c ≈ 1.0. As noted, the data
points has been transformed by the inverse of the n lg n function and re-transformed before
plotting. The function can seem to miss the last small group of points but this is a result
of the relative error method allowing a higher degree of absolute error for data points with
higher values. The points in the error plot is clearly grouped. This is due to all measurements
being zero-padded to the shown sizes.
By benchmarking again on a total of 379 different combinations of signal and kernel sizes
spanning most of the range of possible input sizes, the estimation process succeeded in
choosing the correct method in 365 cases meaning a success rate of about 96%. The specific
correct and incorrect hits can be seen in figure 11. Note that only a single run for each
input size was done because averaging the results of many runs does not correspond to the
intended use. This means that the benchmarked times vary more than the times used to
model the running times. The wrong estimates seem to be grouped slightly.
7.5 Discussion
7.5.1 Correctness & error
The error being at most and order of 10−13 is lower than the expected 10−10. The error is
actually far better than expected. Possible explanations are that the square root estimate does
not hold for consecutive FFTs, or that the particular tests generate lower errors on average,
or that the Cooley-Tukey algorithm produces a lower average errors than other FFT methods.
The last one is a decent possibility since the correctness test of the implementation using FFTW
does show a higher degree of error on my test. This however, could also be explained by
23

Figure 9: (left) Direct method running times plotted as a function of both signal and kernel
size. (right) Residual error plot showing difference in approximated and measured value.
Negative values means the data point is below the approximated value, and vice versa.
Figure 10: (left) Indirect method running times plotted as a function of the zero-padded
size. (right) Residual error plot showing difference in approximated and measured value.
Negative values means the data point is below the approximated value, and vice versa.
24

Figure 11: Visualization of successful and unsuccessful attempts to predict which method is
faster.
the zero-padding of the input data, which at least is an intuitive explanation of the reduced
error. To get a better estimate on the error, the FFT should be tested separately and more
sophisticated test data could be used. Testing on only powers of two would eliminate zero-
padding as being the cause. Since the error is low enough, this will not be done. The
conclusion is therefore that the implementation is satisfactory in regards to correctness and
error.
7.5.2 Optimizations
As could be seen from figure 6, the gain in performance for the optimizations is either climb-
ing in case of the row major transposing, and falling in the precomputation case at one
seemingly particular point. One theory to explain this difference has to do with the size of
the CPU cache. The computer running the test has 6Mb of CPU cache available, and since
complex numbers take up 16 bytes each, it means that for the entire input to be able to fit in
the cache, the amount of complex numbers must be at most 6 · 10242/16 = 393216. This co-
incides quite well with the area the jump in time savings happens. In case of the transposing,
it’s clear now why we get no time savings. We perform row major transposing to help the
CPU predict what values to pull from memory to cache, and thus should make no difference
when the data can reside completely in the cache. The small increase we see instead can be
due to cache layering6. In case of the precomputed twiddle factors it is not as clear. When
the data can not completely reside in the cache, the twiddle factors can as well not reside in
the cache alone, but this should be the case for the base case as well. A possibility is that the
compiler makes a prediction and optimizes the code further, but this is speculative. In both
cases it is clearly worth it having implemented these optimizations.
In regards to the effect of using multiple threads. The reduction of about 30% running time
for the Cooley-Tukey implementation was as expected. However, it is surprising that FFTW
does not gain more from 4 threads compared to the two threads used in the Cooley-Tukey
implementation. More testing may be needed to determine how to get most out of the FFTW
6
Cache has multiple layers, as small as can fit on individual CPU dies, and often two levels up with exponen-
tially increased size.
25

implementation in regards to the convolution operation. The volatile nature of the execution
time reduction in case of the implementation using FFTW does not clearly show if each trans-
form should have only one thread available and then be parallelized on basis of an entire
transform instead. This however, as well as having two threads on each transform is likely
not possible due to many FFTW operations not being thread-safe and therefore potentially
causing segmentation faults without extensive and invasive use of mutex locks. An attempt
of an implementation of this and further benchmarking is therefore needed in further devel-
opment of the convolution using FFTW.
7.5.3 External benchmarking
During benchmarking against other implementations, it is clear the implementation of the
Cooley-Tukey Radix-2 algorithm is not competing with existing implementations. The expla-
nations for this is probably that I have not explored all possible methods for optimizing the
code, this will be detailed further in 7.5.5. Moreover, there may be some overhead I have
not been able to find, perhaps even from C++ itself, as opposed to using C alone. It is not
impossible to remove this overhead, even in the context of SHARK. In theory, the overhead
should disappear as the C++ code is rewritten to C, but this will not be explored during this
project, as available time for such a task was not present.
During testing it became apparent that Matlab does a lot under the hood. Running the com-
putations in Matlab multiple time results in much shorter running times on average. It is
unclear what makes the difference, but could be explained by reuse of results, or optimiza-
tions in the underlying FFTW library Matlab uses. For this reason, data is presented showing
the result of restarting Matlab each time as well. What we see now is what seems to be a big
constant added to the running time, and remember that only the two forward transforms,
point-wise multiplication and inverse transforms is being measured, so this difference should
have nothing to do with data preparation such as loading the used data or other core Matlab
startup processes. It could be that Matlab initialize an FFTW plan as detailed in 2.2.6, which
should make some difference, but this is counter to what the online Matlab documentation
states. Surprisingly, my FFTW implementation does not matching the Matlab results either.
It is unclear how much Matlab has tailored FFTW into their program, and what advantages
they may have gotten should this be the case. It is therefore unclear if this could explain the
difference. Another explanation is if Matlab is simply reusing results which some simple test-
ing does seem to indicate, but not with the results we see. Alternatively, I am either missing
some build flag or have done an oversight in the coding using inefficient code by accident.
From experience, the running time of the FFTW transform can vary noticeably simply when
recompiling, but this as well does not explain the difference. I can deny other parts of the
code is the cause, as measurements show the transformations bear the bulk of the execution
time. As expected from the asymptotic analysis, the workload for the FFT grows faster than
the other parts of the code. For instance, at input size 20402 the FFT (FFTW with precom-
puted plan to be specific) carries about 92% of the total CPU time used.
For use in SHARK, I suggest using the implementation made using FFTW. While it does not
match the Matlab results in computation speed completely, the results are indeed quite close
and for most uses I know, acceptable as well. The results also show that the implemented
feature to create a convolution plan, even though it can take several seconds, can be worth
it if needing to perform the convolution many times on different input of the same size.
26

7.5.4 Switching mechanism
Firstly, the switching mechanism seems to work. Secondly, considering that the input varies
quite a bit, and since hitting the exact crossover point between methods is practically impos-
sible due to random variations in the execution time. An error of less than 5% of the tested
input is quite respectable. That said, it’s hard to argue what is good. The particular input
sizes where the method has a tendency to choose the wrong method may be input sizes that
are used often in practice, or may not be used at all. It is most likely that the error happens
in areas where theres a crossover point. This could explain the grouping we see. In that
case, the difference in time used by either of the methods shouldn’t be too big, and choosing
the wrong method at these points is therefore of less concern as well.
7.5.5 Possible improvements
As noted already, complex multiplications with +1, −1, +i, −i can be regarded as free, and
since I did not exploit this, this is a simple way to reduce execution time. Using only features
from C, and thus skipping the overhead that comes along with C++ should, in the case of an
algorithm like this, yield substantially better results. If we instead wanted to save the mem-
ory use, one way is to only allocate space for the complex numbers, and then use the given
array for the real part of the number, thus getting closer to in-place calculations, but since
the two arrays is now not placed efficiently in memory, this is expected to slow computations
greatly. Alternatively, one could make assumptions on the input data, and assume the data
is already stored in an efficient complex data structure. However, this would violate a goal
in this project, to write a general-purpose implementation, and for that reason fit less in the
context of the SHARK library. I would therefore advice against any of these solutions.
When it comes to the FFTW library, its adaptive nature should resist most attempts to per-
form FFT faster using only a single FFT algorithm. If instead we went for a more realistic
goal just to have an implementation comparable to FFTW and Matlab, I would argue the di-
rection to take would be first to rewrite the code in a low level language such as C. Look
into ensuring memory allocation is efficient and thereafter take a look into implementing the
optimization I have done so far, and adjust it further by previously mentioned optimizations
while looking into even more ways to reduce the number of simple operations. Perhaps
even implementing the split-radix algorithm with the savings in operations count that fol-
lows along, and a mixed-radix implementation would remove most needs for zero-padding.
Doing small optimizations can seem like beating a dead horse, but as noted in [Frigo and
Johnson, 1998], FFT implementations consist of a large number of the same small opera-
tions, meaning a small increases in efficiency on these small operations often leads to great
gain in performance for the entire implementation. And in that regard, the Cooley-Tukey
Radix-2 implementation in this project may in fact be nearly comparable to other implemen-
tations.
8 Conclusions and Outlook
In this BSc thesis, I have shown how the Discrete Fourier Transform, through the convolu-
tion theorem, presents us with an indirect way to perform the convolution operation. I have
shown how it allows for reductions in computational complexity due to FFT algorithms re-
ducing the computational complexity to an order of O(n lg n) as opposed to the polynomial
O(n2).
During the project, an implementation of an indirect convolution method using the Cooley-
Tukey Radix-2 algorithm has been implemented with varying success. I conclude that the
27

implementation works with satisfactory error estimates and that it is significantly faster than
the direct method for larger input sizes. While great success was found by implementing
different optimizations, it did not succeed in getting close to the efficiency of implementa-
tions like Matlab and FFTW. During the project, it became apparent that the FFT function
in Matlab uses the FFTW library as well, and that the adaptive FFT implementation in FFTW
is practically unbeatable with standard FFT algorithms, as FFTW in some sense use all fast
implementations for best results. Even an implementation using FFTW did not manage to
match the speed of the Matlab either, and the reason for this has eluded me.
Improving on the developed solution would require multiple different methods to be im-
plemented along an implementation of a general purpose mixed-radix algorithm to help
relinquish the need for zero-padding. Being a low level algorithm, I suspect that much can
be gained from a low level language implementation. But it would probably be better to
look into optimizing the FFTW implementation.
In an effort to efficiently switch between the direct and indirect method for best results,
a linear model minimizing the square of the relative difference was derived to ensure the
percentage error remained low. The primary reason for the choice was how minimizing the
squared difference, as it’s done in the classic linear regression, yield poor results. Testing
showed satisfactory results choosing the most efficient method in about 95% of test cases for
a wide array of different input sizes.
A Testing results
A.1 Correctness & Error
Presented here is most of the test results. One little detail is what in the test output is de-
noted pseudo-tests. Due to time constraints, the FFTW solution was not tested as rigorously
as the primary implementation. Instead the difference in the results of the direct and indirect
method was used instead. The problem with this is that if both methods give the same wrong
result, the test shows up as a success. A reality check was inserted to ensure the trivial case
where both methods gave no output, did not happen.
Test format:
input height, input width, average error, maximum error
Correctness test (Symmetric synthetic data full mode)
1x1 average error: 0 max error: 0
4x3 average error: 3.23815e-17 max error: 2.22045e-16
28

Correctness test (Symmetric synthetic data valid mode)
Correctness test (Asymmetric real data)
Correctness (pseudo)test for fftw 1d implementation towards simple implementation
size: 3 average error (full ): 0 Reality Check: PASSED
size: 3 average error (valid): 0 Reality Check: PASSED
size: 13 average error (full ): 2.23876e-12 Reality Check: PASSED
size: 43 average error (valid): 4.60247e-11 Reality Check: PASSED
29

size: 3x4 average error (full ): 0 Reality Check: PASSED
size: 3x4 average error (valid): 0 Reality Check: PASSED
size: 3x6 average error (full ): 0 Reality Check: PASSED
size: 6x5 average error (full ): 9.21621e-12 Reality Check: PASSED
size: 23x4 average error (valid): 8.69951e-12 Reality Check: PASSED
30

size: 3x4x3 average error (full ): 0 Reality Check: PASSED
size: 3x4x3 average error (valid): 0 Reality Check: PASSED
size: 3x6x7 average error (full ): 5.77457e-13 Reality Check: PASSED
size: 3x6x7 average error (valid): 5.77457e-13 Reality Check: PASSED
size: 32x32x24 average error (valid): 0 Reality Check: PASSED
Correctness test for fftw 2d implementation (Symmetric synthetic data)
31

Correctness test for fftw 3d implementation (Assymetric synthetic data)
3x3x3 (Boundary not checked) average error: 0 max error: 0
4x4x4 (Boundary not checked) average error: 2.22045e-16 max error: 9.99201e-16
A.2 Finding hard cut-offs
Test format:
signal width, kernel width, indirect time, direct time
Finding hard breaking point for kernel = 1 of signalSize
10,10,0.000320477,1.32999e-05
11,11,0.000259366,1.61502e-05
12,12,0.000249166,2.26399e-05
13,13,0.000247844,3.0429e-05
...
50,50,0.00480329,0.00338261
51,51,0.0048728,0.00390723
52,52,0.00481161,0.00429582
53,53,0.0048077,0.00480856
Finding hard breaking point for kernel = 0.25 of signalSize
10,2,7.53263e-05,1.1628e-06
11,2,6.49629e-05,1.24966e-06
12,3,6.4596e-05,2.65874e-06
13,3,6.56e-05,2.8538e-06
...
76,19,0.00640333,0.00594393
77,19,0.00640245,0.00611908
78,19,0.00640296,0.00629646
79,19,0.00641832,0.00647539
32

A.3 Linear model results
Relative error model results for modeling the running time of the direct and indirect method.
Test format:
m: slope, c: constant term
Direct m = 2.41558753833e-09 c = -2.26816301201e-05
Indirect m = 2.75190398957e-07 c = 1.01787136851
B Tutorial - Convolution using FFTW
B.1 Installation
In order to perform convolution using FFTW, you will need to ﬁrst install the FFTW library
v3.3.4 on your system. The tutorial at the FFTW library website found at
http://www.fftw.org/fftw3_doc/Installation-and-Customization.html is simple to fol-
low. During installation you will need to set the --enable-threads ﬂag by issuing
.configure --enable-threads to enable multi-threading support.
B.2 Simple convolution
Any 2D convolution call will have the form
convolution_fftw_2d(source, destination, kernel, width, height,
kernel_width, kernel_center_x, kernel_center_y,
boundary_mode, thread_max_count, convolution_plan);
Due to the need for internal auxiliary arrays, source and destination is allowed to be the
same array as long as the parameter boundary_mode is set to 1 (default). A simple piece
of code for performing a convolution could in SHARK look like the following.
1 #include "shark/LinAlg/Base.h"
2 #include "shark/Data/Dataset.h"
3 #include "convolution_fftw.hpp"
4
5 int width = 1024;
6 int height = 1024;
7 int kernel_width = 3;
8 int kernel_center_x = 1;
9 int kernel_center_y = 1;
10
11 RealVector imIn(width*height);
12 RealVector imOut(width*height);
13 RealVector kernel(kernel_width*kernel_width);
14
15 /* Fill imIn here */
16
17 convolution_fftw_2d(imIn, imOut, kernel, width, height,
18 kernel_width, kernel_center_x , kernel_center_y);
33

B.3 Convolutions with plan
If one needs to perform many convolutions on equally sized input, it is advised to ﬁrst create
a convolution plan. The convolution plan has the form
convolution_plan(width, height, kernel_width, boundary_mode, thread_max_count);
If the intend is to use the same kernel for multiple input signals, it’s possible to supply the
plan with a static kernel reducing execution time greatly. To use the plan, simply create it
and pass it as argument for repeated convolution calls. Note that the convolution function
will still need a kernel passed, but that it is unused should the plan contain a static kernel.
Here’s some simple example code
1 int width = 1024;
2 int height = 1024;
3 int kernel_width = 3;
4 int kernel_center_x = 1;
5 int kernel_center_y = 1;
6 int thread_max_count = 4;
7
8 convolution_plan* plan = new convolution_plan(width, height, kernel_width,
9 1, thread_max_count);
10
11 plan->setStaticKernel(kernel); // OPTIONAL
12
13 /* Prepare input and output arrays */
14
15 convolution_fftw_2d(imIn, imOut, kernel, width, height,
16 kernel_width, kernel_center_x , kernel_center_y, plan);
17
18 /* Repeat above as needed */
19
20 delete plan;
Creating the plan make take several seconds depending on the input size, but can be reused
indeﬁnitely.
B.4 Convolution in other dimensions
To perform a 1D or 3D convolution, the following is supplied
convolution_fftw_1d(source, destination, kernel, width, kernel_width,
kernel_center_x, boundary_mode, thread_max_count,
convolution_plan);
convolution_plan(width, kernel_width, boundary_mode, thread_max_count);
convolution_fftw_3d(source, destination, kernel, width, height, depth,
kernel_width, kernel_center_x, kernel_center_y,
kernel_center_z, boundary_mode, thread_max_count,
convolution_plan);
convolution_plan(width, height, depth, kernel_width,
boundary_mode, thread_max_count);
34

C Linear model based on relative error
We want to fit data using a linear model minimizing the square of the relative error. Define
the linear model ˆy = mxi + c fitting n data points (xi, yi). We then want to minimize:
R =
n
i=1
yi − ˆy
yi
2
=
n
i=1
1 −
ˆy
yi
2
=
n
i=1
1 −
mxi + c
yi
2
.
We now want find the minimum of R as a function of m and c. This is done by finding the
partial derivatives and setting it zero. For convenience define
sy =
n
i=1
1
yi
, syy =
n
i=1
1
y2
i
, sxy =
n
i=1
xi
yi
,
sxyy =
n
i=1
xi
y2
i
, sxxyy =
n
i=1
x2
i
y2
i
,
so we get
∂R
∂c
=
n
i=1
2 (c + mxi − yi)
y2
i
= 0
c
n
i=1
1
y2
i
+ m
n
i=1
xi
y2
i
−
n
i=1
1
yi
= csyy + msxyy − sy = 0 ,
c =
sy − msxyy
syy
.
Computing the other derivative and substituting in the result yields
∂R
∂m
=
n
i=1
2xi (c + mxi − yi)
y2
i
= 0 ,
c
n
i=1
xi
y2
i
+ m
n
i=1
x2
i
y2
i
−
n
i=1
xi
yi
=
sy − msxyy
syy
sxyy + msxxyy − sxy
= m sxxyy −
s2
xyy
syy
+
sy
syy
sxyy − sxy = 0 ,
m =
sxy − sxyy
sy
syy
sxxyy −
s2
xyy
syy
=
syysxy − sxyysy
syysxxyy − s2
xyy
=
n
i=1
1
y2
i
n
i=1
xi
yi
− n
i=1
xi
y2
i
n
i=1
1
yi
n
i=1
1
y2
i
n
i=1
x2
i
y2
i
− n
i=1
xi
y2
i
n
i=1
xi
y2
i
=
n
i=1
n
j=1
xj
y2
i yj
− xi
y2
i yj
n
i=1
n
j=1
x2
j
y2
i y2
j
−
xixj
y2
i y2
j
=
n
i=1
n
j=1
xj−xi
y2
i yj
n
i=1
n
j=1
x2
j −xixj
y2
i y2
j
.
35

To determine if this is a minimum, we perform the second derivative test.
A =
∂2R
∂m2
=
n
i=1
2x2
i
y2
i
, B =
∂2R
∂m∂c
=
n
i=1
2xi
y2
i
, C =
∂2R
∂c2
=
n
i=1
2
y2
i
.
D = AC − B2
=
n
i=1
2x2
i
y2
i
·
n
i=1
2
y2
i
−
n
i=1
2xi
y2
i
2
=
n
i=1
n
j=1
2x2
i
y2
i
2
y2
j
−
n
i=1
n
i=j
2xi
y2
i
2xj
y2
j
= 4
n
i=1
n
j=1
x2
i − xixj
y2
i y2
j
.
As we only want to show D is positive, we can drop the factor of 4. Note that this double
sum produces all n2 combinations of i, j. The trivial case where i = j we get x2
i − xixj = 0.
If i = j note that there will always be exactly one corresponding pair with i, j in reversed.
Let xi = a and xj = b, if we choose to sum these two in pair we get
a2
− ab + (b2
− ba) = (a − b)2
≥ 0 .
Since D is a sum of these pairs it must hold that D ≥ 0. If D = 0 we don’t know what kind
of extrema this is, but since this corresponds to the trivial case where all points are zero,
this is really of no concern. Since then D > 0 and A > 0 (except for the trivial case), the
ABC-Criteria states that this is a local minimum.
References
C. Berg and J. P. Solevej. Noter til Analyse 1: Fourierrækker og Metriske Rum. Department of
Mathematical Science, Copenhagen, ﬁrst edition edition, 2011.
P. Duhamel and M. Vetterli. Fast fourier transforms: A tutorial review and a state of the art.
Signal Processing, 19:259–299, 1990.
FFTW-Docs. FFTW documentation. http://www.fftw.org/fftw2_doc/fftw_3.html. Ac-
cessed: 2015-05-24.
M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceed-
ings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing,
volume 3, pages 1381–1384. IEEE, 1998.
M. Frigo and S. G. Johnson. The design and implementation of fftw3. Proceedings of the
IEEE, 93(2):216–231, 2005.
C. Igel, V. Heidrich-Meisner, and T. Glasmachers. Shark. Journal of Machine Learning Re-
search, 9:993–996, 2008.
D. Sundararajan. Discrete Fourier Transform: Theory, Algorithms and Applications. World
Scientiﬁc, 2001.
36

bachelors_thesis_stephensen1987

More Related Content

What's hot

Viewers also liked

Similar to bachelors_thesis_stephensen1987

bachelors_thesis_stephensen1987