TWO LAYER LINEAR DIFFUSION EQUATION ON THE GPU
Technical Report TR-CIS-0420-09
Submitted to the Faculty
of
Indiana University-Purdue University Indianapolis
by
Robert J. Zigon
December 2015
Indianapolis, Indiana
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Graphics Processing Unit (GPU) . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Software Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Hardware Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 One Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Factorization of a Tridiagonal Matrix . . . . . . . . . . . . . . . . . . . 11
3.2.1 LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 UL Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 CPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Parallel LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.5 Parallel Solver and Recursive Doubling . . . . . . . . . . . . . . . . . . 18
3.6 Recursive Doubling and Nilpotent Matrices . . . . . . . . . . . . . . . 19
3.7 GPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Diffusion with Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Two Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Interface Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Complete System for Two Layers . . . . . . . . . . . . . . . . . . . . . 30
4.5 Difference Between Inclusion and Exclusion of the Interface Condition . 31
4.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
A CPU code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
B GPU code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
LIST OF TABLES
Table Page
3.1 CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 . . . . . . . 23
3.2 CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50 . . . . . . . 24
3.3 CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 . . . . . 25
4.1 Experimental parameters for 50 timesteps . . . . . . . . . . . . . . . . . . 33
4.2 CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 1, Terms=32, Time Steps=50 . . . . 34
4.3 CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 10, Time Steps=50 . . . . . . . . . . 35
LIST OF FIGURES
Figure Page
2.1 Memory bandwidth for the CPU and GPU . . . . . . . . . . . . . . . . . . 4
2.2 Floating point operations per second for the CPU and GPU . . . . . . . . 4
2.3 A GPU Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 A GPU and a Streaming Multiprocessor (SM or SMX) . . . . . . . . . . . 8
3.1 CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 . . . . . . . 24
3.2 CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 . . . . . 25
4.1 A plant with multiple layers of soil . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Time Evolution of Interface-Experiment 1 . . . . . . . . . . . . . . . . . . 35
4.3 Solution to the Interface Neighborhood-Experiment 1 . . . . . . . . . . . . 36
4.4 Time Evolution of Interface-Experiment 2 . . . . . . . . . . . . . . . . . . 36
4.5 Solution to the Interface Neighborhood-Experiment 2 . . . . . . . . . . . . 37
4.6 Error in the Interface Neighborhood-Experiment 2 . . . . . . . . . . . . . . 37
4.7 Time Evolution of Interface-Experiment 3 . . . . . . . . . . . . . . . . . . 38
4.8 Solution to the Interface Neighborhood-Experiment 3 . . . . . . . . . . . . 38
4.9 Time Evolution of Interface-Experiment 4 . . . . . . . . . . . . . . . . . . 39
4.10 Solution to the Interface Neighborhood-Experiment 4 . . . . . . . . . . . . 39
4.11 Error in the Interface Neighborhood-Experiment 4 . . . . . . . . . . . . . . 40
ABSTRACT
Zigon, Robert MS, Purdue University, December 2015. Two Layer Linear Diffusion
Equation on the GPU. Major Professors: Raymond Chin, Shaofin Fang and Fengguang
Song.
The purpose of this project is to investigate the mathematical framework for evalu-
ating the two layer linear diffusion equation on a GPU. The diffusion equation is first
approximated using finite differences to produce the matrix equation Ax = f. The two
term non-linear recurrence relation for the LU factorization of the A matrix is then con-
verted into a three term linear recurrence relation by way of a Riccati transform. The
three term relation is then shown to be parallelizable. After the numeric underflow prob-
lem for the LU solver of the system is reconciled, Stone’s recursive doubling algorithm is
then implemented. Finally, the parallel implementation is applied to a form of the two
layer diffusion equation that properly models the flux across the internal boundary.
1 INTRODUCTION
In physics, diffusion is defined as the change in the distribution of a collection of
particles, as well as its depletion, in time and space. The underlying partial differential
equation can be used to model many different types of processes. For example, open a
bottle of perfume. As the molecules of the scent first escape the container, they are in
very high concentration. Over time they spread outward in every direction into regions
where they are in low concentration.
Another example of diffusion exists in biology. A process called morphogenesis con-
trols the spatial distribution of cells during the embryonic development of an organism.
Natural patterns, such as the spots on a leopard, are believed to be the result of cellular
differentiation in many different directions [1].
The diffusion equation also appears in oncology with the use of radio frequency
thermal ablation (RFA). In this process, tumor cells are killed by focusing energy on a
diseased portion of the body. In order to better understand the ablation process, models
are used to analyze the energy and temperature distribution in the context of the muscle,
fat and bone that are adjacent to the tumor cells [2].
Yet another example of diffusion exists in hydrogeology – the study of the movement
of groundwater in the soil and rocks of the Earth's crust. Groundwater does not always
flow downhill in the subsurface by following the surface topography. Instead, it can be
driven by pressure gradients in both saturated and unsaturated regions. This results in
behavior that is difficult to predict for all but the simplest situations.
The goal of this project is to implement a solver for the one dimensional linear
diffusion equation with two layers on a GPU. We will begin with a description of the
modern GPU. The problem itself will then be investigated in two phases. The first
phase will start with the finite difference equations for the one dimensional, constant
coefficient, linear diffusion equation (due to its relative simplicity) on the CPU and GPU.
The second phase will then investigate adding layers to the first phase. This two phase
approach will allow us to first understand the issues surrounding diffusion on different
hardware architectures, and then focus on the two layer problem, so that an efficient
parallel solver can be implemented.
The solver consists of two components that are designed to improve execution time or
the accuracy of the solution. First, the tridiagonal structure of the underlying matrices
will be considered so that the LU-decomposition can be applied with a computational
complexity of O(n). While doing so, we will show how to implement Stone's recursive
doubling algorithm to solve 2^26 equations, a scale that, up to now, has not exceeded
1,024 equations. For the second component, we will demonstrate the mathematics to
treat the boundary of two different diffusion coefficients in a manner that reduces error
in the solution.
2 GRAPHICS PROCESSING UNIT (GPU)
The modern Graphics Processing Unit (GPU) has its genesis in 2D and 3D computer
graphics. In 2000, parallel processing and floating point arithmetic capabilities were
added to graphics cards to accelerate the rate that world geometries could be trans-
formed, illuminated, projected, clipped and then displayed as pixels. This sequence of
operations is called the graphics pipeline. It makes heavy use of five basic floating point
operators (addition, subtraction, multiplication, division and square root). The process
itself is called embarrassingly parallel because the transformation sequence applied to a
three dimensional vertex is independent of the other vertices.
In 2002, researchers became interested in these parallel processing and floating point
capabilities. They used the graphics application programming interface (API) to compute
functions such as fast Fourier transforms and convolutions. NVidia took note of
the scientific computing trend with GPUs and developed CUDA - the Compute Unified
Device Architecture. CUDA [6] is a computing platform and programming model that
uses a C-like language to expose the massive parallelism of GPU hardware. In retrospect,
the demand for real time graphics has caused the GPU to evolve into a highly parallel,
multithreaded, many core processor with very high memory bandwidth and computing
throughput as illustrated in figures 2.1 and 2.2.
2.1 Software Model
CUDA was designed to overcome the challenge of writing applications that transpar-
ently scale with increasing numbers of processing cores by maintaining a low learning
curve for programmers familiar with the C programming language. There are three
abstractions at the core of CUDA - a hierarchy of thread groups, shared memories and
barrier synchronization. These abstractions guide the programmer to partition a problem
into sub-problems that can be solved independently in parallel by blocks of threads
executing a kernel.

Figure 2.1.: Memory bandwidth for the CPU and GPU

Figure 2.2.: Floating point operations per second for the CPU and GPU
A kernel is a program written in CUDA C that is downloaded from the host to a
GPU board at runtime. Parameters are passed to the kernel at invocation to provide
it with operands that are transformed by the GPU. In listing 2.1 the kernel program
extends from line 1 through line 5. Line 13 of the main program (that is executing on
the Intel CPU) essentially downloads the V ectorAdd kernel to the GPU and launches
it on 1,000 threads. At runtime each thread is assigned a unique thread index, in this
case ranging from 0 to 999. The ith
thread loads A[i], B[i], adds them, and then writes
the result to C[i]. When all of the threads have executed the V ectorAdd kernel, control
is returned to the main program at line 14.
Modern GPU hardware (like an NVidia Tesla K20 board) can have as many as 2,496
processing cores. When a kernel is launched, one of the required parameters is the thread
count. The requested thread count can exceed the number of physical cores on the GPU.
From a conceptual standpoint, the hardware maps blocks of 32 threads (called warps)
to 32 cores until they have finished executing. When one warp finishes another one is
allocated to the idle block of cores for execution. This is one of the key abstractions
in CUDA that lends itself to transparent scalability. If, for example, a next generation
board arrives with 10,000 cores, the kernels are oblivious to the environmental change.
The hardware and runtime take care of the mapping from threads to cores, and the
kernel is executed. The result is a platform that preserves the user’s investment in code
while insulating it from hardware changes.
1 __global__ void VectorAdd(float *A, float *B, float *C)
2 {
3 int i = threadIdx.x;
4 C[i] = A[i] + B[i];
5 }
6
7 int main ()
8 {
9 const int N = 1000;
10 ...
11 // Kernel invocation from the host with N threads
12
13 VectorAdd <<<1, N>>>(A, B, C);
14 ...
15 }
Listing 2.1: Example kernel and host code
2.2 Hardware Model
From a hardware perspective, the fundamental computing unit in an NVidia GPU
is a core (see figure 2.3). A core contains a 32 bit arithmetic logic unit (ALU) capable
of performing operations such as min, max, add, subtract, multiply, divide, compare
and bitwise logical operators. A core also contains a single and double precision floating
point unit.
In figure 2.4 we see that a collection of cores are grouped together in a unit known as
a Streaming Multiprocessor (SM). The SM in the figure has 192 cores within. A GPU
itself is then a collection of SM's. Although figure 2.4 shows 8 SM's (for a total of 1,536
cores), the Tesla K20 cards used in this project have 13 SM's, for a total of 2,496 cores.
An SM is designed to execute thousands of programming threads concurrently.
To manage such a large number of threads, it employs a unique architecture called
SIMT (Single Instruction, Multiple Thread). An SM schedules and executes the threads
grouped as warps. Individual threads composing a warp begin execution at the same
instruction address, but have their own register state and are therefore free to branch
and execute independently. However, full efficiency is realized when all 32 threads of a
warp agree on their execution path. If threads of a warp diverge via a data dependent
conditional branch, the warp serially executes each branch taken, disabling threads that
are not on that path, and when all paths complete, the threads converge back to a
common path.

Figure 2.3.: A GPU Core
The SIMT architecture is somewhat similar to the architecture of vector processors
known as SIMD (Single Instruction, Multiple Data). A key difference, however, is
that the SIMD organization exposes the width of the vector to the software (and
programmer). SIMT, on the other hand, specifies the execution and branching behavior
in terms of a single thread. This model simplifies parallel programming somewhat and
aids in program correctness.
Figure 2.4.: A GPU and a Streaming Multiprocessor (SM or SMX)
3 DIFFUSION
3.1 One Layer
The one dimensional diffusion equation is

    ∂u/∂t = κ ∂²u/∂x²,  0 ≤ x ≤ L, t > 0  (3.1)

where u = u(x, t) is the dependent variable and κ is a real constant. The initial
condition and boundary conditions are u(x, 0) = 0, u(0, t) = g(t) and u(L, t) = h(t).
The particular problem we will solve has boundary and initial conditions that are equal
to

    u(0, t) = 0,  u(L, t) = 0,  u(x, 0) = sin(πx/L).  (3.2)

The solution to (3.1) subject to (3.2) is then

    u(x, t) = sin(πx/L) exp(−κπ²t/L²).  (3.3)
The results of this project will be validated for both the CPU and GPU using (3.3).
We begin with an implicit time discretization scheme with a trapezoidal rule in which
t_j = jδt. Then

    ∂u/∂t = F(x, t)

is converted to

    [u^(j)(x_i) − u^(j−1)(x_i)] / δt = [F(x_i, t_j) + F(x_i, t_{j−1})] / 2.
Define u(x_i, t_j) to be replaced by u_i^(j). The time discretization of the diffusion equation
becomes

    [u^(j)(x_i) − u^(j−1)(x_i)] / δt = (κ/2) [ (∂²u/∂x²)^(j)|_{x_i} + (∂²u/∂x²)^(j−1)|_{x_i} ].  (3.4)
We now use a central difference with uniform spacing h = L/N for the spatial
dimension, where N is the number of sub-intervals. As a result, (3.4) becomes

    [u^(j)(x_i) − u^(j−1)(x_i)] / δt = (κ/2) [ (u_{i−1} − 2u_i + u_{i+1})^(j)/h² + (u_{i−1} − 2u_i + u_{i+1})^(j−1)/h² ].  (3.5)
If we let r = 2h²/(κδt), then (3.5) can be rewritten as

    −u_{i−1}^(j) + (2 + r)u_i^(j) − u_{i+1}^(j) = u_{i−1}^(j−1) − (2 − r)u_i^(j−1) + u_{i+1}^(j−1),  i = 1 . . . N − 1, j > 0.  (3.6)
For convenience, define

    f_1^(j) = u_0^(j−1) − (2 − r)u_1^(j−1) + u_2^(j−1)
    f_i^(j) = u_{i−1}^(j−1) − (2 − r)u_i^(j−1) + u_{i+1}^(j−1),  2 ≤ i ≤ N − 2
    f_{N−1}^(j) = u_{N−2}^(j−1) − (2 − r)u_{N−1}^(j−1) + u_N^(j−1)
then (3.6) becomes

    (2 + r)u_1^(j) − u_2^(j) = f_1^(j) + u_0^(j)
    −u_{i−1}^(j) + (2 + r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2 . . . N − 2, j > 0  (3.7)
    −u_{N−2}^(j) + (2 + r)u_{N−1}^(j) = f_{N−1}^(j) + u_N^(j).
Now rewrite (3.7) in terms of a left boundary equation, a set of interior equations
for 1 < i < N − 1 and a right boundary equation. The left boundary equation is

    (2 + r)u_1^(j) − u_2^(j) = f_1^(j) + g^(j),  j > 0.  (3.8)

The interior equations are

    −u_{i−1}^(j) + (2 + r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2 . . . N − 2, j > 0  (3.9)

and the right boundary equation is

    −u_{N−2}^(j) + (2 + r)u_{N−1}^(j) = f_{N−1}^(j) + h^(j),  j > 0.  (3.10)

This is done to emphasize the roles of the interior and boundary forcing functions. As
such, the boundaries need no special treatment.
We can now rewrite (3.8), (3.9) and (3.10) in matrix form to yield

    Ax = f  (3.11)

where A is the (N − 1) × (N − 1) tridiagonal matrix

    A = tridiag(−1, 2 + r, −1),  (3.12)

x is the vector of unknowns

    x = (u_1^(j), u_2^(j), . . . , u_{N−2}^(j), u_{N−1}^(j))^T,

and f is the forcing vector

    f = (f_1^(j) + g^(j), f_2^(j), . . . , f_{N−2}^(j), f_{N−1}^(j) + h^(j))^T.  (3.13)

This is the system of equations that will be solved.
3.2 Factorization of a Tridiagonal Matrix
3.2.1 LU Factorization
Assume a tridiagonal matrix B is represented as

    B = tridiag(a_i, b_i, c_i),  (3.14)

an N × N matrix with diagonal entries b_1, . . . , b_N, sub-diagonal entries a_2, . . . , a_N
and super-diagonal entries c_1, . . . , c_{N−1}.
B can be factored into a lower bi-diagonal matrix L and an upper bi-diagonal matrix
U such that B = LU, where L has diagonal entries γ_1, . . . , γ_N with the sub-diagonal
a_2, . . . , a_N of B, and U has a unit diagonal with super-diagonal entries δ_1, . . . , δ_{N−1}.  (3.15)
If we equate the coefficients of (3.14) and (3.15), the result is
γ1 = b1
δ1 = c1/γ1
γi = bi − aiδi−1, i = 2, . . . , N
δi = ci/γi, i = 2, . . . , N − 1.
Alternatively, you may cast LU such that L has a unit diagonal with sub-diagonal
entries β_2, . . . , β_N, and U has diagonal entries η_1, . . . , η_N with the super-diagonal
c_1, . . . , c_{N−1} of B,
with similar systems of recurrence relations for their coefficients. The method chosen
depends on its application.
If we have a system of equations involving B it can be written as
Bx = f, or (3.16)
L(Ux) = f.
Now, introduce the intermediate vector y and we have
Ly = f, the forward substitution (3.17)
Ux = y, the backward substitution. (3.18)
The system of equations that results from (3.17) is

    γ_1 y_1 = f_1
    a_i y_{i−1} + γ_i y_i = f_i,  i = 2, . . . , N

and solving for the y vector yields

    y_1 = f_1/γ_1
    y_i = (f_i − a_i y_{i−1})/γ_i,  i = 2, . . . , N.  (3.19)

With the forward substitution step complete, we can write the backward substitution
(3.18) as

    x_N = y_N
    x_i = y_i − δ_i x_{i+1},  i = N − 1, . . . , 1  (3.20)

and generate the solution to (3.16).
3.2.2 UL Factorization
As it turns out, there is also a UL factorization. If L is pre-multiplied by U, the
resulting system is then
Bx = f, or (3.21)
U(Lx) = f,
where U is upper bi-diagonal with a unit diagonal and super-diagonal entries
δ̄_1, . . . , δ̄_{N−1}, and L is lower bi-diagonal with diagonal entries β_1, . . . , β_N and
the sub-diagonal a_2, . . . , a_N of B.  (3.22)
The UL factorization is equivalent to reordering the vectors x and f from N to 1. It
follows that the information is transmitted from N to 1. The need for this will become
apparent when dealing with the interface equation in section 4.
When the intermediate vector z is introduced, we obtain

    Uz = f, the backward substitution
    Lx = z, the forward substitution.

If we equate the coefficients of (3.14) and (3.22), the result is

    β_N = b_N
    δ̄_i = c_i/β_{i+1},  i = N − 1, . . . , 1
    β_i = b_i − a_{i+1}δ̄_i,  i = N − 1, . . . , 1.

Now that the coefficients of U and L are computed, we can solve for the elements of
vectors z and x to yield

    z_N = f_N
    z_i = f_i − δ̄_i z_{i+1},  i = N − 1, . . . , 1

and

    x_1 = z_1/β_1
    x_i = (z_i − a_i x_{i−1})/β_i,  i = 2, . . . , N

which generates the solution to the system (3.21).
3.3 CPU Implementation
The linear system (3.11) that results from discretizing (3.1) is both tridiagonal and
diagonally dominant (r > 0). The tridiagonal property implies that the LU decompo-
sition can be performed in O(n) time. The diagonal dominance implies that pivoting is
not required. We use these properties to generate the solution to (3.11).
Listing A.1 contains the implementation of the factorization and solver for (3.12)
and (3.13) that runs on the CPU. A main function named RunCPU_1K1LTest repeatedly
calls the pair with various configurations, then measures the execution time
and the relative error with respect to the initial conditions in (3.2). At a high level,
RunCPU_1K1LTest performs 5 functions.
1. Assign boundary conditions.
2. Assign initial conditions.
3. Initialize the sub-diagonal, diagonal, and sup-diagonal of matrix A.
4. Compute the LU factorization.
5. Repeatedly Solve the system and Step time.
3.4 Parallel LU Factorization
The symmetry of A in (3.12) leads to an LU factorization in which L is lower
bi-diagonal with diagonal entries γ_1, . . . , γ_N and every sub-diagonal entry equal to
−1, and U is upper bi-diagonal with a unit diagonal and super-diagonal entries
δ_1, . . . , δ_{N−1},
where
γ1 = 2 + r
δ1 = −1/γ1
γi = 2 + r + δi−1, i = 2, . . . , N
δi = −1/γi, i = 2, . . . , N − 1
which can be simplified to

    γ_1 = 2 + r
    γ_i = 2 + r − 1/γ_{i−1},  i = 2, . . . , N.  (3.23)
The nonlinear two-term recurrence in (3.23) does not lend itself to parallel evaluation.
Stone [7] and Blelloch [8], however, describe an algorithm for parallel evaluation of m-th
order linear recurrence relations. So (3.23) is modified through the use of a Riccati
transformation γi = qi/qi−1 to produce a linear three-term recurrence
q0 = 1
q1 = 2 + r
qi = (2 + r)qi−1 − qi−2, i = 2, . . . , N. (3.24)
The issue with (3.24) is that it can overflow when using floating point arithmetic.
The term q_i grows and becomes unbounded as i → ∞. We seek a truncated form that
gives acceptable results within the finite precision of the microprocessor. Recognize that
the initial value problem in (3.24) can be solved analytically. Assume a solution of the
form

    q_i = A x_+^i + B x_−^i

with A and B real constants. Examination of the characteristic equation
x² − (2 + r)x + 1 = 0 for (3.24) yields roots of

    x_± = (b ± √(b² − 4)) / 2,  where b = 2 + r.

With the initial conditions of q_0 = 1 and q_1 = b, we can find the constants A and B
from

    q_0 = 1 = A + B
    q_1 = b = A x_+ + B x_−,

where A = (b − x_−)/(x_+ − x_−) and B = 1 − A, to yield the solution

    q_i = (x_+^(i+1) − x_−^(i+1)) / (x_+ − x_−)

which can be rewritten as

    q_i = x_+^i [1 − (x_−/x_+)^(i+1)] / [1 − (x_−/x_+)].

With

    [1 − (x_−/x_+)^(i+1)] / [1 − (x_−/x_+)] = 1 + x_−/x_+ + O((x_−/x_+)²),

it follows that

    q_i = x_+^i [1 + x_−/x_+ + O((x_−/x_+)²)].

For sufficiently large i = N,

    q_N = x_+^N [1 + x_−/x_+ + O((x_−/x_+)²)]

and q_i will overflow as q_N → x_+^N.
When (3.23) is combined with (3.19) and (3.20), the forward and backward substitution
logic simplifies to

    y_1 = f_1/γ_1
    y_i = (f_i + y_{i−1})/γ_i,  i = 2, . . . , N

and

    x_N = y_N
    x_i = y_i + x_{i+1}/γ_i,  i = N − 1, . . . , 1.

We now see that 1/γ_i is needed and not q_i, so

    1/γ_i = (x_+^i − x_−^i) / (x_+^(i+1) − x_−^(i+1))
          = (x_+^i / x_+^(i+1)) · [1 − (x_−/x_+)^i] / [1 − (x_−/x_+)^(i+1)]
          = (1/x_+) [1 − (x_−/x_+)^i + O((x_−/x_+)^(i+1))].

This tells us that 1/γ_i approaches its asymptotic limit faster than q_i. If we impose the
constraint that

    |1/γ_K − 1/γ_{K−1}| < 10^(−M)  (3.25)

for some integers K and M, then

    1/γ_K ≈ 1/x_+ − (1/x_+)(x_−/x_+)^K = (1/x_+)(1 − 10^(−M)).

This suggests we stop computing when (3.25) is satisfied. From that point on we impose

    1/γ_i = 1/x_+,  for i > K.
In summary, this section shows how to transform a non-linear two-term recurrence
relation into a linear three-term recurrence relation, and conversely, so that it becomes
a candidate for parallelization. While doing so, a numeric overflow problem was
identified and mitigated by truncating the series once an acceptable level of accuracy
had been reached.
3.5 Parallel Solver and Recursive Doubling
The sequential nature of the solver for the CPU does not lend itself to efficient
implementation on the GPU. Other researchers have chosen to implement Cyclic
Reduction (CR) and Parallel Cyclic Reduction (PCR) for their tridiagonal solvers. Two
other researchers [9, 10] have attempted to implement Stone's Recursive Doubling (RD)
algorithm [11] to solve tridiagonal systems on a GPU. Each reported problems such as
large numerical errors, arithmetic underflow and instability, but failed to analyze the
source of that instability. This section, on the other hand, discusses the source of the
instability (the numeric underflow that occurs on the GPU) and then describes how to
address the issue.
Stone described RD in terms of the following theorem and claimed that it could be
used to solve recurrence relations of all orders.
Theorem (Stone) 1 Let y_i(j) satisfy a non-homogeneous two-term recurrence

    y_{i+1}(j) = y_1(j) + y_i(j − 1)·(−m_j),  i, j ≥ 1  (3.26)

with the boundary conditions

    y_1(j) = b_j, j ≥ 1;  y_i(j) = 0, j ≤ 0;  y_i(j) = 0, i ≤ 0.

Then,

a) for s ≥ 1, y_i(j) satisfies the recurrence relation

    y_{i+s}(j) = y_s(j) + y_i(j − s) ∏_{k=j−s+1}^{i} (−m_k),  i ≥ 1, j ≥ s;

b)

    y_i(j) = Σ_{k=1}^{j} y_1(k) ∏_{s=k+1}^{j} (−m_s),  i ≥ j ≥ 1;  (3.27)

c) for i ≥ j ≥ 1, y_i(j) = z_j, where z_j is the jth component of the solution to

    z_i = b_i − m_i z_{i−1},  z_1 = b_1.
Quite simply, the problem lies with (3.27). If the sequence {m_k}_{k=1}^{n} is bounded by m̄,
then

    ∏_{k=j−s+1}^{i} (−m_k) ≈ ±m̄^p,  where p = i − j + s.

The RD algorithm needs to be modified to avoid the arithmetic underflow that m̄^p
causes when m̄ < 1.
3.6 Recursive Doubling and Nilpotent Matrices
The non-homogeneous two-term recurrence (3.26) with initial value y_1 = b_1 has a
solution of the form

    y_i = Σ_{k=1}^{i} b_k ∏_{s=k+1}^{i} (−m_s).  (3.28)
If the first few terms of the solution are expanded, we have
y1 = b1
y2 = b2 − b1m2
y3 = b3 − b2m3 + b1m2m3.
This is a compact representation of the linear system
Ly = b (3.29)
where L is the n × n lower bidiagonal matrix with a unit diagonal and sub-diagonal
entries m_2, . . . , m_n.
We rewrite L as the sum of an identity matrix I of order n and a nilpotent matrix N of
index n, such that
L = I + N
where N is the matrix whose only non-zero entries are m_2, . . . , m_n on the first lower
diagonal.
With N nilpotent, the inverse of L can be expressed as

    L^(−1) = (I + N)^(−1) = I − N + N² − N³ + · · · + (−1)^(n−1) N^(n−1).
For some p < n, the pth power of N is a matrix with non-zero values filling the pth
lower diagonal. For example, squaring the 4 × 4 case of N with sub-diagonal entries
m_2, m_3, m_4 gives a matrix whose only non-zero entries are m_2m_3 and m_3m_4 on
the second lower diagonal. We can now relate the solution of the non-homogeneous
two-term recurrence (3.28) to the entries of the pth lower diagonal of N^p, given by

    (N^p)_{p+i,i} = ∏_{k=i+1}^{p+i} m_k,  i = 1, 2, . . . , n − p.
If there is an upper bound on the entries of N,

    m̄ = max_{1≤i≤n} |m_i|,

then the absolute value of an entry on the pth lower diagonal is

    |(N^p)_{p+i,i}| = ∏_{k=i+1}^{p+i} |m_k| ≤ m̄^p,  i = 1, 2, . . . , n − p.
If m̄ < 1, we can now choose a value for p ∈ ℕ such that the solution has an error
no bigger than 10^(−digit),

    m̄^p ≤ 10^(−digit),  or  p ≥ digit · ln 10 / (−ln m̄),
where digit is the number of required decimal places. All of this now suggests computing

    y = (I − N + N² − N³ + · · · + (−1)^p N^p) b

as the truncated form of the solution to (3.29). From a practical standpoint we have

    y_i = b_i − m_i y_{i−1},  i = 2, . . . , p,  with y_1 = b_1  (3.30)

for the first p elements of y. Thereafter, for elements p + 1 ≤ j ≤ n, we keep only the
leading terms through the pth power of N:

    y_j = b_j − b_{j−1}m_j + b_{j−2}(m_j m_{j−1}) − · · · + b_{j−p} ∏_{k=j−p+1}^{j} (−m_k),  (3.31)

which is just the truncated form of (3.27).
3.7 GPU Implementation
The forward and backward substitution logic is now represented by

    y_1 = f_1/γ_1
    y_i = (1/γ_i)(y_{i−1} + f_i),  i = 2, . . . , N − 1

and

    x_{N−1} = y_{N−1}
    x_i = y_i + (1/γ_i) x_{i+1},  i = N − 2, . . . , 1

respectively. Each of these two recurrences is remarkably similar to (3.30) and (3.31),
and the GPU code takes advantage of this.
Assume, for example, a diffusion problem has 2^20 equations in 2^20 unknowns, and for
accuracy reasons it requires p = 15. The γ vector is first computed with the CPU and
then passed into the GPU kernels. It was deemed unnecessary to implement the truncated
version of the γ vector on the GPU given that the LU factorization is performed
once outside of the solver loop. The ForwardLU kernel (listing B.1) is launched with
n = 2^20 threads. Recall that the threads are organized in groups of 32 called a warp,
all of which execute the same instruction on a different data element. For a given warp
0 ≤ j < 2^15, threads numbered 32j ≤ i < 32j + 32 will each evaluate the Horner-style
truncated sum

    sum = 0
    sum = (sum + f_{i−k})(1/γ_{i−k}),  k = p, p − 1, . . . , 0.

The BackwardLU kernel (listing B.2) behaves in a similar way.
3.8 Results
All of the software development was performed on a Lenovo S20 workstation with
14 GB of RAM, a 300 GB Western Digital VelociRaptor hard drive, an Intel Xeon W3520
quad core CPU @ 2.67 GHz, and an NVIDIA Tesla K40c. The workstation is running
Windows 7/64. The development tools consisted of Visual Studio 2010, CUDA 6.5 and
Tesla Driver 341.44. All of the generated applications were 64 bit.
Table 3.1 and figure 3.1 summarize the results of the CPU version of the LU decom-
position (CPU LU) versus the CPU version of the truncated γ vector (CPU ZC) that
was described in section 3.4. Each element xi of the solution vector x was the sum of
20 terms. The number of time steps was 50. Table 3.2 is a similar experiment with the
number of terms equal to 32. Table 3.2 emphasizes that the additional 12 terms result
in relative errors that are nearly identical to those of the CPU LU version.
Table 3.3 and figure 3.2 summarize the results of the CPU version of the LU decom-
position versus the GPU version of the RD algorithm (GPU RD1L) that was described
in section 3.5. Again, each element xi of the solution vector x was the sum of 20 terms.
The number of time steps was 50. In this case, the data clearly shows that the GPU
prefers large problems over smaller problems, and ultimately executes 36 times faster
than the CPU LU algorithm.
Table 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50

    Points              CPU LU (ms)   CPU MaxRel Err   CPU ZC (ms)   CPU MaxRel Err   Speed Up
    2^10 = 1,024        1.1           2.467E-05        3.5           3.136E-05        0.31
    2^16 = 65,536       69.4          6.858E-09        231.0         1.773E-05        0.30
    2^20 = 1,048,576    1,222.2       7.650E-11        3,650.9       4.526E-06        0.33
    2^22 = 4,194,304    4,809.4       2.056E-10        14,671.2      2.333E-07        0.32
    2^24 = 16,777,216   19,164.3      3.473E-10        59,080.4      2.333E-07        0.32
    2^26 = 67,108,864   77,671.4      5.239E-09        236,833.5     2.333E-07        0.32
Table 3.2.: CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50

    Points        CPU LU (ms)   CPU MaxRel Err   CPU ZC (ms)   CPU MaxRel Err   Speed Up
    1,024         1.1           2.467E-05        5.7           2.467E-05        0.19
    65,536        69.4          6.858E-09        373.6         8.523E-09        0.18
    1,048,576     1,222.2       7.650E-11        5,848.9       2.486E-10        0.20
    4,194,304     4,809.4       2.056E-10        23,322.6      2.070E-10        0.20
    16,777,216    19,164.3      3.473E-10        93,991.6      3.458E-10        0.20
    67,108,864    77,671.4      5.239E-09        377,045.0     5.241E-09        0.20
Figure 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50
(log-log plot: Time (ms) versus Number of Points, comparing CPU LU and CPU ZC)
Table 3.3.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50

    Points        CPU LU (ms)   CPU MaxRel Err   GPU RD1L (ms)   GPU MaxRel Err   Speed Up
    1,024         1.1           2.467E-05        2.7             3.136E-05        0.4
    65,536        69.4          6.858E-09        4.9             1.773E-05        14.1
    1,048,576     1,222.2       7.650E-11        37.3            4.526E-06        32.7
    4,194,304     4,809.4       2.056E-10        136.9           2.333E-07        35.1
    16,777,216    19,164.3      3.473E-10        536.8           2.333E-07        35.7
    67,108,864    77,671.4      5.239E-09        2,143.9         2.333E-07        36.2
Figure 3.2.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 (log-log plot of time in ms versus number of points, for CPU LU and GPU RD1L)
4 DIFFUSION WITH LAYERS
4.1 Background
Consider a practical problem from hydrogeology shown in figure 4.1. The one dimensional, transient-state, unsaturated flow through this section of earth is

\[
\frac{\partial \theta(z,t)}{\partial t}
  = \frac{\partial}{\partial z}\left[ K(\theta)\left( \frac{\partial \psi(z,t)}{\partial z} + 1 \right) \right]
\tag{4.1}
\]
where
θ(z, t) is the saturation (water content),
K(θ) is the hydraulic conductivity,
ψ(z, t) is the pressure head,
z is the elevation above a vertical datum, and
t is time.
Figure 4.1 effectively has multiple diffusion coefficients (in this case 5) because the hydraulic conductivity K(θ) varies with the subsurface soil type.¹ The hydraulic conductivity coefficient has units of distance/time, and as such can be considered the diffusive velocity with which water moves through a substructure. The value can be as small as $10^{-8}$ meters/day for unfractured shale, and as large as $10^{4}$ meters/day for gravel, thereby spanning 12 orders of magnitude [12].
If the diffusion coefficients are approximately the same order of magnitude, then a finite difference approach may work. However, as the ratio of the coefficients begins to vary by an order of magnitude or more, the finite difference approach can suffer from a host of numerical problems affecting its convergence.

¹ Although equation (4.1) technically is a non-linear partial differential equation, it is used as an example because we believe the reader can relate to the notion of water percolating down through different layers of soil.

Figure 4.1.: A plant with multiple layers of soil
4.2 Two Layers
A two layer problem is simply two single layers connected by interface conditions
across a common boundary. The interface conditions describe the continuity of u(x, t)
and its flux, $K\,\partial u/\partial x$, across an interface $x = l_1$. Equations (4.2) through (4.5) describe the two layer problem.
\[
\frac{\partial u}{\partial t} = \kappa_1 \frac{\partial^2 u}{\partial x^2}, \qquad 0 \le x < l_1,\; t > 0 \tag{4.2}
\]

\[
\frac{\partial u}{\partial t} = \kappa_2 \frac{\partial^2 u}{\partial x^2}, \qquad l_1 < x \le 1,\; t > 0 \tag{4.3}
\]

\[
u(l_1^-, t) = u(l_1^+, t) \tag{4.4}
\]

\[
K_1 \left. \frac{\partial u}{\partial x} \right|_{l_1^-}
= K_2 \left. \frac{\partial u}{\partial x} \right|_{l_1^+} \tag{4.5}
\]
In this case, the diffusion coefficients κ1, κ2, are constant, as are the flux coefficients
K1 and K2. We have used ± to denote the right side (+) and left side (−) of the interface
x = l1. The mathematics of two layers is similar to that in section 3.1, but it will now
be generalized.
Assume we have layers with width $l_1$ and $l_2$. Each layer $i$ has $N_i$ sub-intervals and the width of each sub-interval is $h_i = l_i/N_i$. Now define

\[
M_i = \sum_{k=1}^{i} N_k, \qquad r_i = \frac{2 h_i^2}{\kappa_i\, \delta t}, \qquad i = 1, 2;
\]
so that the coefficient matrices take on the form

\[
A_1 =
\begin{pmatrix}
2+r_1 & -1     &        &        &       \\
-1    & 2+r_1  & -1     &        &       \\
      & \ddots & \ddots & \ddots &       \\
      &        & -1     & 2+r_1  & -1    \\
      &        &        & -1     & 2+r_1
\end{pmatrix}_{(N_1-1)\times(N_1-1)},
\tag{4.6}
\]

\[
A_2 =
\begin{pmatrix}
2+r_2 & -1     &        &        &       \\
-1    & 2+r_2  & -1     &        &       \\
      & \ddots & \ddots & \ddots &       \\
      &        & -1     & 2+r_2  & -1    \\
      &        &        & -1     & 2+r_2
\end{pmatrix}_{(N_2-1)\times(N_2-1)};
\tag{4.7}
\]
the solution vectors take the form

\[
x_1 =
\begin{pmatrix}
u_1^{(j)} \\ u_2^{(j)} \\ \vdots \\ u_{N_1-2}^{(j)} \\ u_{N_1-1}^{(j)}
\end{pmatrix}_{(N_1-1)},
\qquad
x_2 =
\begin{pmatrix}
u_{N_1+1}^{(j)} \\ u_{N_1+2}^{(j)} \\ \vdots \\ u_{M_2-2}^{(j)} \\ u_{M_2-1}^{(j)}
\end{pmatrix}_{(N_2-1)};
\]
the right hand terms take the form

\[
f_1 =
\begin{pmatrix}
f_1^{(j)} + g^{(j)} \\ f_2^{(j)} \\ \vdots \\ f_{N_1-2}^{(j)} \\ f_{N_1-1}^{(j)} + u_{N_1^-}^{(j)}
\end{pmatrix}_{(N_1-1)},
\quad\text{and}\quad
f_2 =
\begin{pmatrix}
f_{N_1+1}^{(j)} + u_{N_1^+}^{(j)} \\ f_{N_1+2}^{(j)} \\ \vdots \\ f_{M_2-2}^{(j)} \\ f_{M_2-1}^{(j)} + h^{(j)}
\end{pmatrix}_{(N_2-1)};
\]
and the system of equations is
A1x1 = f1 and A2x2 = f2.
We now need an equation that connects the layers, while preserving continuity of the
state variable and the flux.
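As a concrete illustration of the setup above, the sketch below (illustrative C++ only, not code from this report; the struct and function names are assumptions) assembles the three diagonals of a layer matrix $A_i$ from $l_i$, $N_i$, $\kappa_i$ and $\delta t$:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch only -- not code from this report.  Builds the three diagonals of
// the (N_i - 1) x (N_i - 1) layer matrix A_i from (4.6)/(4.7), where
// h_i = l_i / N_i and r_i = 2 h_i^2 / (kappa_i * dt).
struct LayerMatrix {
    std::vector<double> sub, diag, super;   // each of length N - 1
};

LayerMatrix AssembleLayer(double l, int N, double kappa, double dt)
{
    const double h = l / N;                       // sub-interval width h_i
    const double r = 2.0 * h * h / (kappa * dt);  // r_i from section 4.2
    LayerMatrix A;
    A.sub.assign(N - 1, -1.0);      // constant sub-diagonal
    A.diag.assign(N - 1, 2.0 + r);  // constant main diagonal 2 + r_i
    A.super.assign(N - 1, -1.0);    // constant super-diagonal
    return A;
}
```

For example, with l = 1, N = 4, κ = 1 and δt = 0.125, each sub-interval has width h = 0.25, r = 2(0.25)²/0.125 = 1, and every main-diagonal entry is 3.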
4.3 Interface Equation
At the interface $x = l_1$ we have the continuity of the state variable

\[
u_{N_1^-}^{(j)} = u_{N_1^+}^{(j)} = u_{N_1}^{(j)}
\]

and the continuity of the flux

\[
K_1 \frac{u_{N_1^-}^{(j)} - u_{N_1-1}^{(j)}}{h_1}
= K_2 \frac{u_{N_1+1}^{(j)} - u_{N_1^+}^{(j)}}{h_2}.
\]

When the two previous interface conditions are combined, we are left with an expression for the solution across the interface,

\[
\left( \frac{K_1}{h_1} + \frac{K_2}{h_2} \right) u_{N_1}^{(j)}
= \frac{K_1}{h_1}\, u_{N_1-1}^{(j)} + \frac{K_2}{h_2}\, u_{N_1+1}^{(j)}
\qquad \forall j.
\tag{4.8}
\]
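Equation (4.8) says that the interface value is a conductance-weighted average of its two neighbors, with weights $K_i/h_i$. A minimal sketch (illustrative code, not from this report):

```cpp
#include <cassert>
#include <cmath>

// Sketch only: solve (4.8) for the interface value u_{N_1} given its two
// neighbors.  The weights c_i = K_i / h_i act as interface conductances.
double InterfaceValue(double K1, double h1, double uLeft,
                      double K2, double h2, double uRight)
{
    const double c1 = K1 / h1;
    const double c2 = K2 / h2;
    return (c1 * uLeft + c2 * uRight) / (c1 + c2);
}
```

When $K_1/h_1 = K_2/h_2$ this reduces to the arithmetic mean of the neighbors; as $K_2/h_2$ grows, the interface value is pulled toward $u_{N_1+1}$, consistent with the flat-tangent behavior discussed in section 4.5.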
4.4 Complete System for Two Layers
In matrix form, the tridiagonal system for two layers with the interface included consists of

\[
A_{if}\, x^{(j)} = f^{(j)} \tag{4.9}
\]

where

\[
A_{if} =
\begin{pmatrix}
2+r_1 & -1     &        &        &        &        &        &        \\
-1    & 2+r_1  & -1     &        &        &        &        &        \\
      & \ddots & \ddots & \ddots &        &        &        &        \\
      &        & -1     & 2+r_1  & -1     &        &        &        \\
      &        &        & -\frac{K_1}{h_1} & \frac{K_1}{h_1}+\frac{K_2}{h_2} & -\frac{K_2}{h_2} & & \\
      &        &        &        & -1     & 2+r_2  & -1     &        \\
      &        &        &        &        & \ddots & \ddots & \ddots \\
      &        &        &        &        &        & -1     & 2+r_2
\end{pmatrix}
\]

and

\[
x =
\begin{pmatrix}
u_1 \\ u_2 \\ \vdots \\ u_{N_1-1} \\ u_{N_1} \\ u_{N_1+1} \\ \vdots \\ u_{M_2-2} \\ u_{M_2-1}
\end{pmatrix},
\qquad
f =
\begin{pmatrix}
f_1 + u_0 \\ f_2 \\ \vdots \\ f_{N_1-1} \\ 0 \\ f_{N_1+1} \\ \vdots \\ f_{M_2-2} \\ f_{M_2-1} + u_{M_2}
\end{pmatrix}.
\]
Notice that the sub-matrix in the upper left hand corner of Aif is A1 from (4.6), the
sub-matrix in the lower right hand corner of Aif is A2 from (4.7), and the 5 terms in
the center of Aif relate to the interface equation (4.8).
Solving (4.9) now consists of the following 5 steps.
1. Compute the LU factorization of $A_1$ and then perform the forward substitution to obtain the relation between $u_{N_1^-}^{(j)}$ and $u_{N_1-1}^{(j)}$.

2. Compute the UL factorization of $A_2$ and then perform the backward substitution to obtain the relation between $u_{N_1^+}^{(j)}$ and $u_{N_1+1}^{(j)}$.

3. Substitute the results of (1) and (2) into the interface equation (4.8) to find $u_{N_1}^{(j)}$.

4. Perform the backward substitution on layer 1.

5. Perform the forward substitution on layer 2.
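For small problems, the five steps above can be sanity-checked against a direct serial solve of the assembled system (4.9). The sketch below (illustrative only; the report's actual solver is the LU/UL split above) applies the classic Thomas algorithm, and any solution it produces must satisfy the interface row, i.e. equation (4.8), which gives a cheap correctness test for the split solver.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch only: a serial Thomas solve of a tridiagonal system such as the
// assembled interface system (4.9).  Row i reads
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = f[i],
// with a[0] and c[n-1] unused.
std::vector<double> ThomasSolve(std::vector<double> a,  // sub-diagonal
                                std::vector<double> b,  // main diagonal
                                std::vector<double> c,  // super-diagonal
                                std::vector<double> f)  // right hand side
{
    const int n = (int)b.size();
    for (int i = 1; i < n; ++i) {            // forward elimination
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        f[i] -= m * f[i - 1];
    }
    std::vector<double> x(n);
    x[n - 1] = f[n - 1] / b[n - 1];
    for (int i = n - 2; i >= 0; --i)         // back substitution
        x[i] = (f[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```

Building $A_{if}$ for a small case ($N_1 = N_2 = 4$, say) and solving it this way, the interface row of the solution satisfies (4.8) to machine precision, because that row of the assembled system is exactly the interface equation with a zero right hand side.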
4.5 Difference Between Inclusion and Exclusion of the Interface Condition
Equations (4.4) and (4.5) describe the continuity of the state variable and its flux across the interface. If we look at

\[
K_1 \frac{\partial u_1}{\partial x} = K_2 \frac{\partial u_2}{\partial x}
\]

and assume that $K_2 \gg K_1$, we have

\[
\frac{K_1}{K_2} \frac{\partial u_1}{\partial x} = \frac{\partial u_2}{\partial x}
\quad\Longrightarrow\quad
\frac{\partial u_2}{\partial x} \approx 0.
\]
This suggests a horizontal tangent of u2 that is present in the solution at the interface.
This behavior is not captured in the form that excludes the interface condition and
results in an error between the two. These errors are described in section 4.7 when the
experiments are discussed.
4.6 Implementation
The CPU implementation for the two layer problem is a simple adaptation of the standard LU factorization and solver, running through the entire set of state variables from 1 to $M_2$. Listing A.1 shows the factorization requiring 5 lines of code and the solver requiring 6 lines. In contrast, the two layer factorization in listing A.2 requires 11 lines of code and the solver in listing A.3 requires 13 lines.
The GPU implementation for the two layer problem also leverages the logic from the
one layer problem. The forward and backward substitution kernels for the LU solver
require approximately 27 lines of code each. Since we now need forward and backward
substitution kernels for the UL solver, they also contribute approximately 27 lines each
to the total line count.
4.7 Results
If the interface expression in (4.8) is excluded and we are presented with a two layer problem, the tridiagonal matrix $A_{12}$ has the form

\[
A_{12} =
\begin{pmatrix}
2+r_1 & -1     &        &        &        &        &        \\
-1    & 2+r_1  & -1     &        &        &        &        \\
      & \ddots & \ddots & \ddots &        &        &        \\
      &        & -1     & 2+r_1  & -1     &        &        \\
      &        &        & -1     & 2+r_2  & -1     &        \\
      &        &        &        & \ddots & \ddots & \ddots \\
      &        &        &        &        & -1     & 2+r_2
\end{pmatrix}
\tag{4.10}
\]

for the system

\[
A_{12}\, x = f. \tag{4.11}
\]
Four experiments were performed that applied the CPU LU decomposition to (4.11) and compared the results to the 5 step algorithm (CPU 2L) described in section 4.4. The parameters for the experiments are described in table 4.1.
Experiment 1 in figures 4.2 and 4.3 shows that the results of solving (4.9) and (4.11) are nearly identical. Figure 4.2 demonstrates the time evolution of the single interface point $u_{N_1}$ when the interface equation is utilized, and compares that value with the two values $u_{N_1}$ and $u_{N_1+1}$ that straddle $u_{N_1+1/2}$ in (4.11), which excludes the interface. In figure 4.3, the systems are solved with 1024 points over 50 time steps. With the interface point at $u_{512}$, the 32 points on either side are in excellent agreement for both systems.
Table 4.1.: Experimental parameters for 50 timesteps

Experiment#   κ1   κ2   K1   K2
1             1    1    1    1
2             1    10   1    1
3             1    1    1    10
4             1    10   1    10
Experiment 2 in figures 4.4 and 4.5 demonstrates the next set of results. The time evolution plot in figure 4.4 shows the interface point significantly above the excluded form, while the solutions graphed in figure 4.5 do a poor job of coinciding. The error plot between the two solutions in figure 4.6 further quantifies the consequences of ignoring the interface equation.
Experiment 3 in figures 4.7 and 4.8 demonstrates results that are in good agreement
with each other. This experiment also seems to suggest that the flux coefficient has little
impact on the solution.
Finally, experiment 4 in figures 4.9 and 4.10 graphically demonstrates the problem
that occurs when the diffusion and conductivity coefficients are κ1 = 1, κ2 = 10 and
K1 = 1, K2 = 10 respectively. Figure 4.9 shows the actual interface point uN1 receding
from the two points that excluded the interface equation. Again, when the solution
is graphed after 50 time steps in figure 4.10, it is readily apparent that excluding the interface equation introduces error into the solution. As in experiment 2, figure 4.11 quantifies the solution error. The results of experiments 2 and 4 seem to suggest that the diffusion coefficients, along with the solver, play a large role in producing a correct solution.
With experiments 1 through 4 giving us faith in the interface equation approach, the benefits of accelerating the process on the GPU are now reported. Table 4.2 demonstrates the 18-fold acceleration of the GPU (GPU RD2L) over the CPU (CPU 2L) version with 2 layers, as well as relative errors that are small when compared with the analytical solution corresponding to κ1 = 1, κ2 = 1, K1 = 1, K2 = 1 with 32 terms.
The results of table 4.3 are not surprising. With κ1 = 1, κ2 = 10, K1 = 1, K2 = 1 and the number of terms set to 32, a maximum acceleration of 18-fold is measured. The modifications made to RD allow us to process problems that are 65,536 times larger than the previously reported sizes for Stone's approach.
Table 4.2.: CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 1, Terms=32, Time Steps=50

Points       CPU 2L (ms)  CPU MaxRel Err  GPU RD2L (ms)  GPU MaxRel Err  Speed Up
1,024        1.7          8.756E-05       7.4            8.756E-05       0.2
65,536       70.9         2.318E-08       11.2           2.360E-08       6.3
1,048,576    1,038.4      8.091E-11       66.1           2.533E-10       15.7
4,194,304    4,163.8      2.056E-10       238.1          2.070E-10       17.4
16,777,216   16,638.1     3.473E-10       928.7          3.458E-10       17.9
67,108,864   66,963.9     5.239E-09       3,698.4        5.241E-09       18.1
Table 4.3.: CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 10, Time Steps=50

Points       CPU 2L (ms)  GPU RD2L (ms)  Speed Up
768          0.7          7.4            0.1
49,152       50.7         10.2           4.9
786,432      818.5        51.6           15.8
3,145,728    3,161.0      180.8          17.4
12,582,912   12,524.5     700.1          17.8
50,331,648   50,639.1     2,779.5        18.2
Figure 4.2.: Time Evolution of Interface-Experiment 1 (solution values over 50 time steps: $u_{N_1}$ including the interface equation versus $u_{N_1}$ and $u_{N_1+1}$ excluding it; κ1 = 1, κ2 = 1, K1 = 1, K2 = 1)
Figure 4.3.: Solution to the Interface Neighborhood-Experiment 1 (solution values versus spatial index near the interface, including versus excluding the interface equation; κ1 = 1, κ2 = 1, K1 = 1, K2 = 1, at ts = 50)
Figure 4.4.: Time Evolution of Interface-Experiment 2 (solution values over 50 time steps: $u_{N_1}$ including the interface equation versus $u_{N_1}$ and $u_{N_1+1}$ excluding it; κ1 = 1, κ2 = 10, K1 = 1, K2 = 1)
Figure 4.5.: Solution to the Interface Neighborhood-Experiment 2 (solution values versus spatial index near the interface, including versus excluding the interface equation; κ1 = 1, κ2 = 10, K1 = 1, K2 = 1, at ts = 50)
Figure 4.6.: Error in the Interface Neighborhood-Experiment 2 (|error| between the included and excluded solutions versus spatial index; κ1 = 1, κ2 = 10, K1 = 1, K2 = 1, at ts = 50)
Figure 4.7.: Time Evolution of Interface-Experiment 3 (solution values over 50 time steps: $u_{N_1}$ including the interface equation versus $u_{N_1}$ and $u_{N_1+1}$ excluding it; κ1 = 1, κ2 = 1, K1 = 1, K2 = 10)
Figure 4.8.: Solution to the Interface Neighborhood-Experiment 3 (solution values versus spatial index near the interface, including versus excluding the interface equation; κ1 = 1, κ2 = 1, K1 = 1, K2 = 10, at ts = 50)
Figure 4.9.: Time Evolution of Interface-Experiment 4 (solution values over 50 time steps: $u_{N_1}$ including the interface equation versus $u_{N_1}$ and $u_{N_1+1}$ excluding it; κ1 = 1, κ2 = 10, K1 = 1, K2 = 10)
Figure 4.10.: Solution to the Interface Neighborhood-Experiment 4 (solution values versus spatial index near the interface, including versus excluding the interface equation; κ1 = 1, κ2 = 10, K1 = 1, K2 = 10, at ts = 50)
Figure 4.11.: Error in the Interface Neighborhood-Experiment 4 (|error| between the included and excluded solutions versus spatial index; κ1 = 1, κ2 = 10, K1 = 1, K2 = 10, at ts = 50)
5 CONCLUSION
We started this project with the goal of creating a GPU based solver for the two layer
linear diffusion equation. Before the solver could be addressed, the LU factorization of
the finite difference form of the diffusion equation was investigated.
The two term non-linear recurrence relation that results from the LU factorization was shown to be sequential and not parallelizable. When converted to a three term linear recurrence relation, the result could be parallelized for a GPU. However, it suffered from arithmetic overflow. When the underlying initial value problem was solved
analytically, we were able to demonstrate that a polynomial form of the solution could
be truncated to achieve a prescribed level of machine accuracy. Our experiments showed
excellent agreement with the solution (3.3) to the initial value problem.
We mentioned briefly that other researchers reported numerical instabilities when
applying Stone’s recursive doubling algorithm to the actual solver for problems with
1, 024 unknowns. Overflow is inherent in Stone’s algorithm. In addition, the instability is
due to an arithmetic underflow problem that appears when the two term linear recurrence
relation associated with the solver is evaluated. When Stone’s algorithm is recast using
matrix algebra, we were able to overcome these deficiencies by applying the theory of
nilpotent matrices to generate a truncated form of the matrix polynomial.
Our experiments demonstrated a GPU based algorithm that is approximately 36
times faster than the CPU version for the 1 layer problem, and approximately 18 times
faster than the CPU for the 2 layer problem. Our 1 layer implementation required two
GPU kernels and the 2 layer implementation required 5 kernels. The GPU is designed
for large problems that stream RAM through the cores. We attribute the 2 fold loss in
efficiency to the presence of the additional kernels, as well as awkward coding patterns
that were required to handle the interface condition. On the positive side, our algorithm is also capable of processing 2^26 equations, a value that is 65,536 times larger than that of other researchers.
Finally, we described the mathematical framework for solving the two layer problem.
Our experiments show that error is introduced into the solution when the interface
conditions, the continuity of the state variable and the flux across the interface boundary,
are not properly accounted for. In fact, our work suggests that the flux coefficients control the placement of the solution, while the diffusion coefficients, in combination with the type of solver used, control the error in the solution. We have also shown that, with the straightforward LU solver for two layers requiring approximately 54 lines of kernel code on the GPU, our parallel two layer solution is only modestly more complex at approximately 108 lines of code.
REFERENCES
[1] Greg Turk. Generating textures on arbitrary surfaces using reaction diffusion. In Proc. of the 18th Annual Conference on Computer Graphics and Interactive Techniques, pages 289–298. ACM Press, 1991.
[2] Daniele Bertaccini and Daniela Calvetti. Fast simulation of solid tumors thermal ablation treatments with a 3D reaction diffusion model. Computers in Biology and Medicine, 37(8):1173–1182, 2007.
[3] F. T. Tracy. Clean two- and three-dimensional analytical solutions of Richards' equation for testing numerical solvers. Water Resources Research, 42(8), 2006.
[4] R. Allan Freeze. Three dimensional, transient, saturated-unsaturated flow in a
groundwater basin. Water Resources Research, 7(2):347–366, 1971.
[5] James S. Boswell and Greg A. Olyphant. Modeling the hydrologic response of groundwater dominated wetlands to transient boundary conditions: Implications for wetland restoration. Journal of Hydrology, 332(3):467–476, 2007.
[6] NVIDIA Corp. CUDA C Programming Guide. Technical Report PG02829001 v5.5, NVIDIA Corp, July 2013.
[7] Harold S. Stone and Peter M. Kogge. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22(8):786–793, 1973.
[8] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-
90-190, Carnegie Mellon University, November 1990.
[9] Volodymyr Kindratenko. A guide for implementing tridiagonal solvers on GPUs. In Numerical Computations with GPUs. Springer International Publishing, 2014.
[10] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the GPU. In Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 127–136. ACM Press, January 2010.
[11] Harold S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM, 20(1):27–38, 1973.
[12] Ralph C. Heath. Basic ground-water hydrology. Technical Report 2220, U.S. Geo-
logical Survey Water-Supply, 1983.
A CPU CODE
// =======================================================================
// LUFactorizationCPU
// =======================================================================
void LUFactorizationCPU(const CVector &SubDiag, const CVector &MainDiag,
                        const CVector &SuperDiag, CVector &Gamma, CVector &Delta)
{
    Gamma[1] = MainDiag[1];
    Delta[1] = SuperDiag[1] / Gamma[1];

    for (int i = 2; i <= Gamma.Last(); i++)
    {
        Gamma[i] = MainDiag[i] - SubDiag[i]*Delta[i-1];
        Delta[i] = SuperDiag[i] / Gamma[i];
    }
}

// =======================================================================
// LUSolveCPU
// =======================================================================
void LUSolveCPU(const CVector &Gamma, const CVector &Delta,
                const CVector &ASubDiag,
                const CVector &f, CVector &x, CVector &y)
{
    //
    // Forward substitution. Solve Ly = f.
    //
    y[1] = f[1] / Gamma[1];

    for (int i = 2; i <= y.Last(); i++)
    {
        y[i] = (f[i] - ASubDiag[i]*y[i-1]) / Gamma[i];
    }

    //
    // Backward substitution. Solve Ux = y.
    //
    x[x.Last()] = y[y.Last()];

    for (int i = x.Last()-1; i > 0; i--)
    {
        x[i] = y[i] - Delta[i]*x[i+1];
    }
}

Listing A.1: 1 Layer CPU Factorization and Solver
// =======================================================================
// TwoLayerFactorization
// =======================================================================
void TwoLayerFactorization(const double r1, const double r2,
                           CVector &Gamma1, CVector &Delta1,
                           CVector &Gamma2, CVector &Delta2)
{
    //
    // The LU factorization for matrix A1
    //
    const double b1 = 2.0 + r1;
    const double b2 = 2.0 + r2;

    Gamma1[1] = b1;
    Delta1[1] = -1.0 / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        Gamma1[i] = b1 + Delta1[i-1];
        Delta1[i] = -1.0 / Gamma1[i];
    }

    //
    // The UL factorization for matrix A2
    //
    Gamma2[Gamma2.Last()] = b2;

    for (int i = Gamma2.Last()-1; i >= 1; i--)
    {
        Delta2[i] = -1.0 / Gamma2[i+1];
        Gamma2[i] = b2 + Delta2[i];
    }
}

Listing A.2: 2 Layer CPU Factorization
// =======================================================================
// Solve2LayerCPU
// =======================================================================
void Solve2LayerCPU(const CVector &Gamma1, const CVector &Delta1, CVector &y,
                    CVector &f, const CVector &Gamma2, const CVector &Delta2,
                    CVector &z, CVector &u,
                    const double K1, const double K2,
                    const double H1, const double H2)
{
    // Top down, y[1] to the point y[N_1-1]
    // The forward substitution for LU.
    y[1] = f[1] / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        y[i] = (f[i] + y[i-1]) / Gamma1[i];
    }

    // Bottom up, z[N_1-1] to z[1]
    // The backward substitution for UL.
    z[z.Last()] = f[f.Last()];

    for (int i = 1; i < z.Last(); i++)
    {
        z[z.Last()-i] = f[f.Last()-i] - Delta2[Delta2.Last()-i]*z[z.Last()-i+1];
    }

    // Fix up U_{N_1}
    const int N1 = Gamma1.Last()+1;

    const double alpha = (K1/H1)*(1.0 - (1.0/Gamma1[Gamma1.Last()]))
                       + (K2/H2)*(1.0 - (1.0/Gamma2[1]));

    u[N1] = ((K1/H1)*y[y.Last()] + (K2/H2)*(z[1]/Gamma2[1])) / alpha; // update U_{N_1}

    // Advance up from point U_{N_1-1} to U_1
    // The backward substitution for LU.
    for (int i = N1-1; i >= 1; i--)
    {
        u[i] = y[i] - Delta1[i]*u[i+1];
    }

    // Move down from point U_{N_1+1} to U_{M_2}
    // The forward substitution for UL.
    for (int i = 1; i <= Gamma2.Last(); i++)
    {
        u[N1+i] = (z[i] + u[N1+i-1]) / Gamma2[i];
    }
}

Listing A.3: 2 Layer CPU Solver
B GPU CODE

// =======================================================================
// ForwardLU_kernel -- compute the forward scan on the GPU
// =======================================================================
__global__ void ForwardLU_kernel(const double * __restrict__ gpuGamma,
                                 const double * __restrict__ gpuFhat,
                                 double *gpuYhat, unsigned int NumberOfElements)
{
    __shared__ double sG[MAXTERMS+1024];
    __shared__ double sF[MAXTERMS+1024];

    const unsigned int gtid0 = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx0  = threadIdx.x;

    double TG0 = 0.0;
    double TF0 = 0.0;

    if (gtid0 < NumberOfElements)
    {
        TG0 = gpuGamma[gtid0];
        TF0 = gpuFhat[gtid0];
    }
    sG[idx0+MAXTERMS] = TG0;
    sF[idx0+MAXTERMS] = TF0;

    if (idx0 < MAXTERMS)
    {
        TG0 = 0.0;
        TF0 = 0.0;
        if (gtid0 > MAXTERMS)
        {
            TG0 = gpuGamma[gtid0-MAXTERMS];
            TF0 = gpuFhat[gtid0-MAXTERMS];
        }
        sG[idx0] = TG0;
        sF[idx0] = TF0;
    }
    __syncthreads();

    double Sum0 = 0.0;

    for (int i = 1; i <= MAXTERMS; i++)
    {
        double F0 = sF[idx0+i];
        double G0 = sG[idx0+i];

        Sum0 = (Sum0 + F0)*G0;
    }

    if (gtid0 < NumberOfElements)
    {
        gpuYhat[gtid0] = Sum0;
    }
}

Listing B.1: GPU Forward LU Solver
// =======================================================================
// BackwardLU_kernel -- compute the backward scan on the GPU
// =======================================================================
__global__ void BackwardLU_kernel(const double * __restrict__ gpuGamma,
                                  const double * __restrict__ gpuYhat,
                                  double *gpuXhat, unsigned int NumberOfElements)
{
    __shared__ double sG[1024+MAXTERMS];
    __shared__ double sY[1024+MAXTERMS];

    const unsigned int gtid   = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx    = threadIdx.x;
    const unsigned int warpid = idx >> 5;

    double TG = 0.0;
    double TY = 0.0;
    if (gtid < NumberOfElements)
    {
        TG = gpuGamma[gtid];
        TY = gpuYhat[gtid];
    }
    sG[idx] = TG;
    sY[idx] = TY;

    if (blockIdx.x < (NumberOfElements >> 10))  // if not in the last block
    {
        if (warpid == 31)                       // if in the last warp
        {
            TG = gpuGamma[gtid+MAXTERMS];
            TY = gpuYhat[gtid+MAXTERMS];
            sG[idx+MAXTERMS] = TG;
            sY[idx+MAXTERMS] = TY;
        }
    }

    __syncthreads();

    double Sum = 0.0;

    for (int i = MAXTERMS-1; i >= 0; i--)
    {
        double Y = sY[idx+i];
        double G = (i == 0 ? 1.0 : sG[idx+i-1]);

        Sum = (Sum + Y)*G;
    }

    if (gtid < NumberOfElements)
        gpuXhat[gtid] = Sum;
}

Listing B.2: GPU Backward LU Solver
A Comparative Study Of Generalized Arc-Consistency AlgorithmsA Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsSandra Long
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasetssunsine123
 
Thesis_Walter_PhD_final_updated
Thesis_Walter_PhD_final_updatedThesis_Walter_PhD_final_updated
Thesis_Walter_PhD_final_updatedWalter Rodrigues
 

Similar to TR-CIS-0420-09 BobZigon (20)

Honours_Thesis2015_final
Honours_Thesis2015_finalHonours_Thesis2015_final
Honours_Thesis2015_final
 
report
reportreport
report
 
Nonlinear Simulation of Rotor-Bearing System Dynamics
Nonlinear Simulation of Rotor-Bearing System DynamicsNonlinear Simulation of Rotor-Bearing System Dynamics
Nonlinear Simulation of Rotor-Bearing System Dynamics
 
bachelors_thesis_stephensen1987
bachelors_thesis_stephensen1987bachelors_thesis_stephensen1987
bachelors_thesis_stephensen1987
 
matconvnet-manual.pdf
matconvnet-manual.pdfmatconvnet-manual.pdf
matconvnet-manual.pdf
 
MSC-2013-12
MSC-2013-12MSC-2013-12
MSC-2013-12
 
Matconvnet manual
Matconvnet manualMatconvnet manual
Matconvnet manual
 
Examensarbete
ExamensarbeteExamensarbete
Examensarbete
 
T401
T401T401
T401
 
final (1)
final (1)final (1)
final (1)
 
SpectrumSharing_Thesis_BSingh_AaltoUni_2014
SpectrumSharing_Thesis_BSingh_AaltoUni_2014SpectrumSharing_Thesis_BSingh_AaltoUni_2014
SpectrumSharing_Thesis_BSingh_AaltoUni_2014
 
Ivo Pavlik - thesis (print version)
Ivo Pavlik - thesis (print version)Ivo Pavlik - thesis (print version)
Ivo Pavlik - thesis (print version)
 
dissertation
dissertationdissertation
dissertation
 
MSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land RoverMSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land Rover
 
Project Dissertation
Project DissertationProject Dissertation
Project Dissertation
 
Sarda_uta_2502M_12076
Sarda_uta_2502M_12076Sarda_uta_2502M_12076
Sarda_uta_2502M_12076
 
On-Line Presentation of Data from PMU Stations
On-Line Presentation of Data from PMU StationsOn-Line Presentation of Data from PMU Stations
On-Line Presentation of Data from PMU Stations
 
A Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsA Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency Algorithms
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasets
 
Thesis_Walter_PhD_final_updated
Thesis_Walter_PhD_final_updatedThesis_Walter_PhD_final_updated
Thesis_Walter_PhD_final_updated
 

TR-CIS-0420-09 BobZigon

4.3 Interface Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Complete System for Two Layers . . . . . . . . . . . . . . . . . . . . . 30
4.5 Difference Between Inclusion and Exclusion of the Interface Condition . . 31
4.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
A CPU code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
B GPU code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
LIST OF TABLES

Table                                                                     Page
3.1  CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 . . . . . . . . 23
3.2  CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50 . . . . . . . . 24
3.3  CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 . . . . . . . 25
4.1  Experimental parameters for 50 timesteps . . . . . . . . . . . . . . . 33
4.2  CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 1, Terms=32, Time Steps=50 . . . . .  34
4.3  CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 10, Time Steps=50 . . . . . . . . . . 35
LIST OF FIGURES

Figure                                                                    Page
2.1  Memory bandwidth for the CPU and GPU . . . . . . . . . . . . . . . . .  4
2.2  Floating point operations per second for the CPU and GPU . . . . . . .  4
2.3  A GPU Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
2.4  A GPU and a Streaming Multiprocessor (SM or SMX) . . . . . . . . . . .  8
3.1  CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 . . . . . . . . 24
3.2  CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 . . . . . . . 25
4.1  A plant with multiple layers of soil . . . . . . . . . . . . . . . . . 27
4.2  Time Evolution of Interface-Experiment 1 . . . . . . . . . . . . . . . 35
4.3  Solution to the Interface Neighborhood-Experiment 1 . . . . . . . . . . 36
4.4  Time Evolution of Interface-Experiment 2 . . . . . . . . . . . . . . . 36
4.5  Solution to the Interface Neighborhood-Experiment 2 . . . . . . . . . . 37
4.6  Error in the Interface Neighborhood-Experiment 2 . . . . . . . . . . . 37
4.7  Time Evolution of Interface-Experiment 3 . . . . . . . . . . . . . . . 38
4.8  Solution to the Interface Neighborhood-Experiment 3 . . . . . . . . . . 38
4.9  Time Evolution of Interface-Experiment 4 . . . . . . . . . . . . . . . 39
4.10 Solution to the Interface Neighborhood-Experiment 4 . . . . . . . . . . 39
4.11 Error in the Interface Neighborhood-Experiment 4 . . . . . . . . . . . 40
ABSTRACT

Zigon, Robert MS, Purdue University, December 2015. Two Layer Linear Diffusion Equation on the GPU. Major Professors: Raymond Chin, Shaofin Fang and Fengguang Song.

The purpose of this project is to investigate the mathematical framework for evaluating the two layer linear diffusion equation on a GPU. The diffusion equation is first approximated using finite differences to produce the matrix equation Ax = f. The two term non-linear recurrence relation for the LU factorization of the A matrix is then converted into a three term linear recurrence relation by way of a Riccati transform. The three term relation is then shown to be parallelizable. After the numeric underflow problem for the LU solver of the system is reconciled, Stone's recursive doubling algorithm is then implemented. Finally, the parallel implementation is applied to a form of the two layer diffusion equation that properly models the flux across the internal boundary.
1 INTRODUCTION

In physics, diffusion is defined as the change in the distribution of a collection of particles, as well as its depletion, in time and space. The underlying partial differential equation can be used to model many different types of processes. For example, open a bottle of perfume. As the molecules of the scent first escape the container, they are in very high concentration. Over time they spread outward in every direction into regions where they are in low concentration.

Another example of diffusion exists in biology. A process called morphogenesis controls the spatial distribution of cells during the embryonic development of an organism. Natural patterns, such as the spots on a leopard, are believed to be the result of cellular differentiation in many different directions [1].

The diffusion equation also appears in oncology with the use of radio frequency thermal ablation (RFA). In this process, tumor cells are killed by focusing energy on a diseased portion of the body. In order to better understand the ablation process, models are used to analyze the energy and temperature distribution in the context of the muscle, fat and bone that are adjacent to the tumor cells [2].

Yet another example of diffusion exists in hydrogeology – the study of the movement of groundwater in the soil and rocks of the Earth's crust. Groundwater does not always flow downhill in the subsurface by following the surface topology. Instead, it can be driven by pressure gradients in both saturated and unsaturated regions. This results in behavior that is difficult to predict for all but the simplest situations.

The goal of this project is to implement a solver for the one dimensional linear diffusion equation with two layers on a GPU. We will begin with a description of the modern GPU. The problem itself will then be investigated in two phases. The first phase will start with the finite difference equations for the one dimensional, constant coefficient, linear diffusion equation (due to its relative simplicity) on the CPU and GPU.
The second phase will then investigate adding layers to the first phase. This two phase approach will allow us to first understand the issues surrounding diffusion on different hardware architectures, and then focus on the two layer problem, so that an efficient parallel solver can be implemented.

The solver consists of two components that are designed to improve execution time or the accuracy of the solution. First, the tridiagonal structure of the underlying matrices will be considered so that the LU decomposition can be applied with a computational complexity of O(n). While doing so, we will show how to implement Stone's recursive doubling algorithm to solve 2^26 equations, a goal that, up to now, has not exceeded 1,024 equations. For the second component, we will demonstrate the mathematics to treat the boundary between two different diffusion coefficients in a manner that reduces error in the solution.
2 GRAPHICS PROCESSING UNIT (GPU)

The modern Graphics Processing Unit (GPU) has its genesis in 2D and 3D computer graphics. In 2000, parallel processing and floating point arithmetic capabilities were added to graphics cards to accelerate the rate at which world geometries could be transformed, illuminated, projected, clipped and then displayed as pixels. This sequence of operations is called the graphics pipeline. It makes heavy use of five basic floating point operators (addition, subtraction, multiplication, division and square root). The process itself is called embarrassingly parallel because the transformation sequence applied to a three dimensional vertex is independent of the other vertices.

In 2002, researchers became interested in these parallel processing and floating point capabilities. They used the graphics application programming interface (API) to compute functions such as fast Fourier transforms and convolutions. NVidia took note of the scientific computing trend with GPUs and developed CUDA - the Compute Unified Device Architecture. CUDA [6] is a computing platform and programming model that uses a C-like language to expose the massive parallelism of GPU hardware. In retrospect, the demand for real time graphics has caused the GPU to evolve into a highly parallel, multithreaded, many core processor with very high memory bandwidth and computing throughput, as illustrated in figures 2.1 and 2.2.

2.1 Software Model

CUDA was designed to overcome the challenge of writing applications that transparently scale with increasing numbers of processing cores while maintaining a low learning curve for programmers familiar with the C programming language. There are three abstractions at the core of CUDA - a hierarchy of thread groups, shared memories and barrier synchronization. These abstractions guide the programmer to partition a problem into sub-problems that can be solved independently in parallel by blocks of threads executing a kernel.

Figure 2.1: Memory bandwidth for the CPU and GPU
Figure 2.2: Floating point operations per second for the CPU and GPU

A kernel is a program written in CUDA C that is downloaded from the host to a GPU board at runtime. Parameters are passed to the kernel at invocation to provide it with operands that are transformed by the GPU. In listing 2.1 the kernel program extends from line 1 through line 5. Line 13 of the main program (which is executing on the Intel CPU) essentially downloads the VectorAdd kernel to the GPU and launches it on 1,000 threads. At runtime each thread is assigned a unique thread index, in this case ranging from 0 to 999. The i-th thread loads A[i] and B[i], adds them, and then writes the result to C[i]. When all of the threads have executed the VectorAdd kernel, control is returned to the main program at line 14.

Modern GPU hardware (like an NVidia Tesla K20 board) can have as many as 2,496 processing cores. When a kernel is launched, one of the required parameters is the thread count. The requested thread count can exceed the number of physical cores on the GPU. From a conceptual standpoint, the hardware maps blocks of 32 threads (called warps) to 32 cores until they have finished executing. When one warp finishes, another one is allocated to the idle block of cores for execution. This is one of the key abstractions in CUDA that lends itself to transparent scalability. If, for example, a next generation board arrives with 10,000 cores, the kernels are oblivious to the environmental change. The hardware and runtime take care of the mapping from threads to cores, and the kernel is executed. The result is a platform that preserves the user's investment in code while insulating it from hardware changes.
1  __global__ void VectorAdd(float *A, float *B, float *C)
2  {
3      int i = threadIdx.x;
4      C[i] = A[i] + B[i];
5  }
6
7  int main()
8  {
9      const int N = 1000;
10     ...
11     // Kernel invocation from the host with N threads
12
13     VectorAdd<<<1, N>>>(A, B, C);
14     ...
15 }

Listing 2.1: Example kernel and host code

2.2 Hardware Model

From a hardware perspective, the fundamental computing unit in an NVidia GPU is a core (see figure 2.3). A core contains a 32-bit arithmetic logic unit (ALU) capable of performing operations such as min, max, add, subtract, multiply, divide, compare and the bitwise logical operators. A core also contains a single and double precision floating point unit. In figure 2.4 we see that a collection of cores is grouped together in a unit known as a Streaming Multiprocessor (SM). The SM in the figure has 192 cores within. A GPU itself is then a collection of SMs. Although figure 2.4 shows 8 SMs (for a total of 1,536 cores), the Tesla K20 cards used in this project have 13 SMs, for a total of 2,496 cores.

An SM is designed to execute thousands of programming threads concurrently. To manage such a large number of threads, it employs a unique architecture called SIMT (Single Instruction, Multiple Thread). An SM schedules and executes the threads grouped as warps. Individual threads composing a warp begin execution at the same instruction address, but have their own register state and are therefore free to branch
Figure 2.3: A GPU Core

and execute independently. However, full efficiency is realized when all 32 threads of a warp agree on their execution path. If the threads of a warp diverge via a data dependent conditional branch, the warp serially executes each branch taken, disabling the threads that are not on that path; when all paths complete, the threads converge back to a common path.

The SIMT architecture is somewhat similar to the architecture of vector processors known as SIMD (Single Instruction, Multiple Data). A key difference, however, is that the SIMD organization exposes the width of the vector to the software (and programmer). SIMT, on the other hand, specifies the execution and branching behavior in terms of a single thread. This model simplifies parallel programming somewhat and aids in program correctness.
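The branch-serialization rule above can be modeled in a few lines. The following Python sketch (an illustration of the SIMT execution rule, not NVidia's actual scheduler) counts one pass per distinct path taken by the active threads of each 32-lane warp, so a warp whose threads split over an if/else costs two passes:

```python
def simt_passes(predicates, warp=32):
    """Count serialized passes for an if/else executed warp by warp.
    predicates[i] is the branch outcome of thread i; a warp needs one
    pass per distinct path taken by its active threads."""
    passes = 0
    for w in range(0, len(predicates), warp):
        taken = set(predicates[w:w + warp])
        passes += len(taken)   # 1 if the warp agrees, 2 if it diverges
    return passes

# 64 threads: the first warp agrees (all take the branch),
# the second warp alternates and therefore diverges
p = [True] * 32 + [i % 2 == 0 for i in range(32)]
```

With this input the first warp contributes one pass and the second warp two, illustrating why full efficiency requires all 32 threads of a warp to agree on their execution path.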
Figure 2.4: A GPU and a Streaming Multiprocessor (SM or SMX)
3 DIFFUSION

3.1 One Layer

The one dimensional diffusion equation is

    ∂u/∂t = κ ∂²u/∂x²,  0 ≤ x ≤ L,  t > 0                                     (3.1)

where u = u(x, t) is the dependent variable and κ is a real constant. The initial condition and boundary conditions are u(x, 0) = 0, u(0, t) = g(t) and u(L, t) = h(t). The particular problem we will solve has boundary and initial conditions that are equal to

    u(0, t) = 0,  u(L, t) = 0,  u(x, 0) = sin(πx/L).                          (3.2)

The solution to (3.1) subject to (3.2) is then

    u(x, t) = sin(πx/L) exp(−κπ²t/L²).                                        (3.3)

The results of this project will be validated for both the CPU and GPU using (3.3).

We begin with an implicit time discretization scheme with a trapezoidal rule in which t_j = jδt. Then ∂u/∂t = F(x, t) is converted to

    [u^(j)(x_i) − u^(j−1)(x_i)]/δt = [F(x_i, t_j) + F(x_i, t_{j−1})]/2.

Define u_i^(j) to replace u(x_i, t_j). The time discretization of the diffusion equation becomes

    [u^(j)(x_i) − u^(j−1)(x_i)]/δt = (κ/2)[(∂²u/∂x²)^(j)|x_i + (∂²u/∂x²)^(j−1)|x_i].   (3.4)

We now use a central difference with uniform spacing h = L/N for the spatial dimension, where N is the number of sub-intervals. As a result, (3.4) becomes
    [u^(j)(x_i) − u^(j−1)(x_i)]/δt = (κ/2)[(u_{i−1} − 2u_i + u_{i+1})/h²]^(j) + (κ/2)[(u_{i−1} − 2u_i + u_{i+1})/h²]^(j−1).   (3.5)

If we let r = 2h²/(κδt), then (3.5) can be rewritten as

    −u_{i−1}^(j) + (2+r)u_i^(j) − u_{i+1}^(j) = u_{i−1}^(j−1) − (2−r)u_i^(j−1) + u_{i+1}^(j−1),  i = 1, ..., N−1, j > 0.   (3.6)

For convenience, define

    f_1^(j)     = u_0^(j−1) − (2−r)u_1^(j−1) + u_2^(j−1)
    f_i^(j)     = u_{i−1}^(j−1) − (2−r)u_i^(j−1) + u_{i+1}^(j−1),  2 ≤ i ≤ N−2
    f_{N−1}^(j) = u_{N−2}^(j−1) − (2−r)u_{N−1}^(j−1) + u_N^(j−1)

then (3.6) becomes

    (2+r)u_1^(j) − u_2^(j) = f_1^(j) + u_0^(j)
    −u_{i−1}^(j) + (2+r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2, ..., N−2, j > 0      (3.7)
    −u_{N−2}^(j) + (2+r)u_{N−1}^(j) = f_{N−1}^(j) + u_N^(j).

Now rewrite (3.7) in terms of a left boundary equation, a set of interior equations for 1 < i < N−1 and a right boundary equation. The left boundary equation is

    (2+r)u_1^(j) − u_2^(j) = f_1^(j) + g^(j),  j > 0.                                 (3.8)

The interior equations are

    −u_{i−1}^(j) + (2+r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2, ..., N−2, j > 0      (3.9)

and the right boundary equation is

    −u_{N−2}^(j) + (2+r)u_{N−1}^(j) = f_{N−1}^(j) + h^(j),  j > 0.                    (3.10)

This is done to emphasize the roles of the interior and boundary forcing functions. As such, the boundaries need no special treatment.
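The scheme can be exercised end to end with a short sketch. The Python below is an illustration, not the report's implementation (which is the C code of Appendix A): it assembles f from u^(j−1), solves the constant tridiagonal system by the LU substitutions derived later in this chapter, and checks the result against the exact solution (3.3). The mesh size and time step are illustrative choices.

```python
import math

def step_diffusion(u, r):
    """Advance one trapezoidal (Crank-Nicolson) time step of (3.6)-(3.10)
    with zero boundary values, solving the tridiag(-1, 2+r, -1) system."""
    N = len(u) - 1                       # u holds N+1 points; u[0], u[N] fixed at 0
    # right-hand side f_i = u_{i-1} - (2 - r) u_i + u_{i+1}, i = 1..N-1
    f = [u[i-1] - (2.0 - r)*u[i] + u[i+1] for i in range(1, N)]
    b = 2.0 + r
    # LU pivots: gamma_1 = 2+r, gamma_i = 2+r - 1/gamma_{i-1}
    gamma = [b]
    for _ in range(1, N - 1):
        gamma.append(b - 1.0/gamma[-1])
    # forward substitution: y_i = (f_i + y_{i-1}) / gamma_i
    y = [f[0]/gamma[0]]
    for i in range(1, N - 1):
        y.append((f[i] + y[i-1]) / gamma[i])
    # backward substitution: x_i = y_i + x_{i+1} / gamma_i
    x = [0.0]*(N - 1)
    x[-1] = y[-1]
    for i in range(N - 3, -1, -1):
        x[i] = y[i] + x[i+1]/gamma[i]
    return [0.0] + x + [0.0]

# validate against u(x,t) = sin(pi x/L) exp(-kappa pi^2 t / L^2), eq (3.3)
L_len, kappa, N, steps, dt = 1.0, 1.0, 64, 50, 1e-4
h = L_len / N
r = 2.0*h*h/(kappa*dt)
u = [math.sin(math.pi*i*h/L_len) for i in range(N + 1)]
for _ in range(steps):
    u = step_diffusion(u, r)
t = steps*dt
exact = [math.sin(math.pi*i*h/L_len)*math.exp(-kappa*math.pi**2*t/L_len**2)
         for i in range(N + 1)]
err = max(abs(a - b) for a, b in zip(u, exact))
```

The observed maximum error is on the order of the scheme's O(δt² + h²) truncation error, which is the same check the report performs against (3.3).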
We can now rewrite (3.8), (3.9) and (3.10) in matrix form to yield

    Ax = f                                                                            (3.11)

where

        | 2+r   −1                     |
        | −1    2+r   −1               |
    A = |       ...   ...   ...        |                                              (3.12)
        |            −1    2+r    −1   |
        |                  −1     2+r  |  (N−1)×(N−1)

    x = [u_1^(j), u_2^(j), ..., u_{N−2}^(j), u_{N−1}^(j)]^T  and
    f = [f_1^(j) + g^(j), f_2^(j), ..., f_{N−2}^(j), f_{N−1}^(j) + h^(j)]^T.          (3.13)

This is the system of equations that will be solved.

3.2 Factorization of a Tridiagonal Matrix

3.2.1 LU Factorization

Assume a tridiagonal matrix B is represented as

        | b_1   c_1                          |
        | a_2   b_2   c_2                    |
    B = |       ...   ...   ...              |                                        (3.14)
        |          a_{N−1}  b_{N−1}  c_{N−1} |
        |                   a_N      b_N     |

B can be factored into a lower bi-diagonal matrix L and an upper bi-diagonal matrix U such that B = LU, where
         | γ_1                |   | 1  δ_1                 |
         | a_2  γ_2           |   |    1   δ_2             |
    LU = |      a_3  γ_3      | × |        ...   ...       |                          (3.15)
         |      ...  ...      |   |            1  δ_{N−1}  |
         |     a_N  γ_N       |   |               1        |

If we equate the coefficients of (3.14) and (3.15), the result is

    γ_1 = b_1
    δ_1 = c_1/γ_1
    γ_i = b_i − a_i δ_{i−1},  i = 2, ..., N
    δ_i = c_i/γ_i,  i = 2, ..., N−1.

Alternatively, you may cast LU such that

         | 1                |   | η_1  c_1                   |
         | β_2  1           |   |      η_2  c_2              |
    LU = |      β_3  1      | × |           ...   ...        |
         |      ...  ...    |   |          η_{N−1}  c_{N−1}  |
         |      β_N  1      |   |                   η_N      |

with similar systems of recurrence relations for their coefficients. The method chosen depends on its application.

If we have a system of equations involving B it can be written as

    Bx = f, or                                                                        (3.16)
    L(Ux) = f.

Now, introduce the intermediate vector y and we have

    Ly = f,  the forward substitution                                                 (3.17)
    Ux = y,  the backward substitution.                                               (3.18)
The system of equations that results from (3.17) is

    γ_1 y_1 = f_1
    a_2 y_1 + γ_2 y_2 = f_2
    a_i y_{i−1} + γ_i y_i = f_i,  i = 3, ..., N

and solving for the y vector yields

    y_1 = f_1/γ_1
    y_i = (f_i − a_i y_{i−1})/γ_i,  i = 2, ..., N.                                    (3.19)

With the forward substitution step complete, we can write the backward substitution (3.18) as

    x_N = y_N
    x_i = y_i − δ_i x_{i+1},  i = N−1, ..., 1                                         (3.20)

and generate the solution to (3.16).

3.2.2 UL Factorization

As it turns out, there is also a UL factorization. If L is pre-multiplied by U, the resulting system is then

    Bx = f, or                                                                        (3.21)
    U(Lx) = f,

where

         | 1  δ̄_1                 |   | β_1                 |
         |    1   δ̄_2             |   | a_2  β_2            |
    UL = |        ...   ...       | × |      a_3  β_3       |                         (3.22)
         |            1  δ̄_{N−1}  |   |      ...  ...       |
         |               1        |   |      a_N  β_N       |
The UL factorization is equivalent to reordering the vectors x and f from N to 1. It follows that the information is transmitted from N to 1. The need for this will become apparent when dealing with the interface equation in section 4.

When the intermediate vector z is introduced, we obtain

    Uz = f,  the backward substitution
    Lx = z,  the forward substitution.

If we equate the coefficients of (3.14) and (3.22), the result is

    β_N = b_N
    δ̄_i = c_i/β_{i+1},  i = N−1, ..., 1
    β_i = b_i − a_{i+1} δ̄_i,  i = N−1, ..., 1.

Now that the coefficients of U and L are computed, we can solve for the elements of vectors z and x to yield

    z_N = f_N
    z_i = f_i − δ̄_i z_{i+1},  i = N−1, ..., 1

and

    x_1 = z_1/β_1
    x_i = (z_i − a_i x_{i−1})/β_i,  i = 2, ..., N

which generates the solution to the system (3.21).

3.3 CPU Implementation

The linear system (3.11) that results from discretizing (3.1) is both tridiagonal and diagonally dominant (r > 0). The tridiagonal property implies that the LU decomposition can be performed in O(n) time. The diagonal dominance implies that pivoting is not required. We use these properties to generate the solution to (3.11).
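The recurrences of section 3.2.1 translate directly into code. The Python sketch below (illustrative helper names; the report's C version is in Appendix A) factors a general tridiagonal matrix and applies the substitutions (3.19) and (3.20):

```python
def lu_tridiag(a, b, c):
    """LU factorization of a tridiagonal matrix, section 3.2.1.
    a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal."""
    n = len(b)
    gamma = [0.0]*n
    delta = [0.0]*(n - 1)
    gamma[0] = b[0]
    delta[0] = c[0]/gamma[0]
    for i in range(1, n):
        gamma[i] = b[i] - a[i]*delta[i-1]       # gamma_i = b_i - a_i delta_{i-1}
        if i < n - 1:
            delta[i] = c[i]/gamma[i]            # delta_i = c_i / gamma_i
    return gamma, delta

def solve_tridiag(a, gamma, delta, f):
    """Forward substitution (3.19) then backward substitution (3.20)."""
    n = len(gamma)
    y = [0.0]*n
    y[0] = f[0]/gamma[0]
    for i in range(1, n):
        y[i] = (f[i] - a[i]*y[i-1])/gamma[i]
    x = [0.0]*n
    x[-1] = y[-1]
    for i in range(n - 2, -1, -1):
        x[i] = y[i] - delta[i]*x[i+1]
    return x
```

Applied to the diagonally dominant matrix (3.12) the factorization needs no pivoting, which is exactly the property the CPU implementation relies on.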
Listing A.1 contains the implementation of the factorization and solver for (3.12) and (3.13) that runs on the CPU. A main function named RunCPU_1K1LTest repeatedly calls the pair with various configurations, then measures the execution time and the relative error with respect to the initial conditions in (3.2). At a high level, RunCPU_1K1LTest performs 5 functions.

1. Assign boundary conditions.
2. Assign initial conditions.
3. Initialize the sub-diagonal, diagonal, and super-diagonal of matrix A.
4. Compute the LU factorization.
5. Repeatedly solve the system and step time.

3.4 Parallel LU Factorization

The symmetry of A in (3.12) leads to an LU factorization of

         | γ_1               |   | 1  δ_1                 |
         | −1   γ_2          |   |    1   δ_2             |
    LU = |      −1   γ_3     | × |        ...   ...       |
         |      ...  ...     |   |            1  δ_{N−1}  |
         |      −1   γ_N     |   |               1        |

where

    γ_1 = 2 + r
    δ_1 = −1/γ_1
    γ_i = 2 + r + δ_{i−1},  i = 2, ..., N
    δ_i = −1/γ_i,  i = 2, ..., N−1
which can be simplified to

    γ_1 = 2 + r
    γ_i = 2 + r − 1/γ_{i−1},  i = 2, ..., N.                                          (3.23)

The nonlinear two-term recurrence in (3.23) does not lend itself to parallel evaluation. Stone [7] and Blelloch [8], however, describe an algorithm for parallel evaluation of m-th order linear recurrence relations. So (3.23) is modified through the use of a Riccati transformation γ_i = q_i/q_{i−1} to produce a linear three-term recurrence

    q_0 = 1
    q_1 = 2 + r
    q_i = (2 + r)q_{i−1} − q_{i−2},  i = 2, ..., N.                                   (3.24)

The issue with (3.24) is that it can overflow when using floating point arithmetic. The term q_i grows and becomes unbounded as i → ∞. We seek a truncated form that gives acceptable results within the finite precision of the microprocessor.

Recognize that the initial value problem in (3.24) can be solved analytically. Assume a solution of the form q_i = A x_+^i + B x_−^i with A and B real constants. Examination of the characteristic equation x² − (2+r)x + 1 = 0 for (3.24) yields roots of

    x_± = [b ± √(b² − 4)]/2,  where b = 2 + r.

With the initial conditions of q_0 = 1 and q_1 = b, we can find the constants A and B from

    q_0 = 1 = A + B
    q_1 = b = A x_+ + B x_−,

where A = (b − x_−)/(x_+ − x_−) and B = 1 − A, to yield the solution

    q_i = (x_+^{i+1} − x_−^{i+1})/(x_+ − x_−)
which can be rewritten as

    q_i = x_+^i [1 − (x_−/x_+)^{i+1}]/[1 − (x_−/x_+)].

With

    [1 − (x_−/x_+)^{i+1}]/[1 − (x_−/x_+)] = 1 + x_−/x_+ + O((x_−/x_+)²),

it follows that

    q_i = x_+^i [1 + x_−/x_+ + O((x_−/x_+)²)].

For sufficiently large i = N,

    q_N = x_+^N [1 + x_−/x_+ + O((x_−/x_+)²)]

and q_i will overflow as q_N → x_+^N.

When (3.23) is combined with (3.19) and (3.20), the forward and backward substitution logic simplifies to

    y_1 = f_1/γ_1
    y_i = (f_i + y_{i−1})/γ_i,  i = 2, ..., N

and

    x_N = y_N
    x_i = y_i + x_{i+1}/γ_i,  i = N−1, ..., 1.

We now see that 1/γ_i is needed and not q_i, so

    1/γ_i = (x_+^i − x_−^i)/(x_+^{i+1} − x_−^{i+1})
          = (x_+^i/x_+^{i+1}) [1 − (x_−/x_+)^i]/[1 − (x_−/x_+)^{i+1}]
          = (1/x_+) [1 − (x_−/x_+)^i + O((x_−/x_+)^{i+1})].

This tells us that 1/γ_i approaches its asymptotic limit faster than q_i. If we impose the constraint that

    |1/γ_K − 1/γ_{K−1}| < 10^{−M}                                                     (3.25)
for some integers K and M, then

    1/γ_K ≈ (1/x_+)[1 − (x_−/x_+)^K] = (1/x_+)(1 − 10^{−M}).

This suggests we stop computing when (3.25) is satisfied. From that point on we impose

    1/γ_i = 1/x_+,  for i > K.

In summary, this section shows how to transform a non-linear two term recurrence relation into a linear three term recurrence relation that is now a candidate for parallelization. While doing so, a numeric overflow problem was identified and mitigated by truncating the series once an acceptable level of accuracy had been reached.

3.5 Parallel Solver and Recursive Doubling

The sequential nature of the solver for the CPU does not lend itself to efficient implementation on the GPU. Other researchers have chosen to implement Cyclic Reduction (CR) and Parallel Cyclic Reduction (PCR) for their tridiagonal solvers. Two other researchers [9, 10] have attempted to implement Stone's Recursive Doubling (RD) algorithm [11] to solve tridiagonal systems on a GPU. Each reported problems such as large numerical errors, arithmetic underflow and instability, but failed to analyze the source of the instability. This section, on the other hand, will discuss the source of the instability (the numeric underflow that occurs on the GPU) and then describe how to address the issue.

Stone described RD in terms of the following theorem and claimed that it could be used to solve recurrence relations of all orders.

Theorem (Stone) 1  Let y_i(j) satisfy a non-homogeneous two-term recurrence

    y_{i+1}(j) = y_1(j) + y_i(j−1)·(−m_j),  i, j ≥ 1                                  (3.26)

with the boundary conditions y_1(j) = b_j, j ≥ 1; y_i(j) = 0, j ≤ 0; y_i(j) = 0, i ≤ 0.
Then,

a) for s ≥ 1, y_i(j) satisfies the recurrence relation

    y_{i+s}(j) = y_s(j) + y_i(j−s) ∏_{k=j−s+1}^{i} (−m_k),  i ≥ 1, j ≥ s;

b)

    y_i(j) = Σ_{k=1}^{j} y_1(k) ∏_{s=k+1}^{j} (−m_s),  i ≥ j ≥ 1;                     (3.27)

c) for i ≥ j ≥ 1, y_i(j) = z_j, where z_j is the j-th component of the solution to z_i = b_i − m_i z_{i−1}, z_1 = b_1.

Quite simply, the problem lies with (3.27). If the sequence {m_k}_1^n is bounded by m̄, then

    ∏_{k=j−s+1}^{i} (−m_k) ≈ m̄^p,  where p = i − j + s.

The RD algorithm needs to be modified to avoid the arithmetic underflow that m̄^p causes when m̄ < 1.

3.6 Recursive Doubling and Nilpotent Matrices

The non-homogeneous two-term recurrence (3.26) with initial value y_1 = b_1 has a solution of the form

    y_i = Σ_{k=1}^{i} b_k ∏_{s=k+1}^{i} (−m_s).                                       (3.28)

If the first few terms of the solution are expanded, we have

    y_1 = b_1
    y_2 = b_2 − b_1 m_2
    y_3 = b_3 − b_2 m_3 + b_1 m_2 m_3.

This is a compact representation of the linear system

    Ly = b                                                                            (3.29)
where L is a lower bidiagonal matrix given by

        | 1                 |
        | m_2  1            |
    L = |      m_3  1       |
        |      ...  ...     |  (n×n)
        |      m_n  1       |

We rewrite L as the sum of an identity matrix I of order n and a nilpotent matrix N of index n, such that L = I + N, where N is the first lower diagonal matrix

        | 0                 |
        | m_2  0            |
    N = |      m_3  0       |
        |      ...  ...     |  (n×n)
        |      m_n  0       |

With N nilpotent, the inverse of L can be expressed as

    L^{−1} = (I + N)^{−1} = I − N + N² − N³ + ⋯ + (−1)^{n−1} N^{n−1}.

For some p < n, the p-th power of N is a matrix with non-zero values filling the p-th lower diagonal. For example,

    | 0              |²   | 0                  |
    | m_2  0         |    | 0        0         |
    | 0   m_3  0     |  = | m_2 m_3  0    0    |
    | 0    0  m_4  0 |    | 0   m_3 m_4   0  0 |
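Because N is nilpotent of index n, the alternating series for (I + N)^{−1} terminates after the (n−1)-th power and is exact, not merely asymptotic. The following Python sketch (illustrative names, not code from the report) verifies this on a small example:

```python
def matmul(A, B):
    """Dense n x n matrix product (plain lists, no external libraries)."""
    n = len(A)
    return [[sum(A[i][k]*B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def neumann_inverse(m):
    """(I + N)^{-1} = I - N + N^2 - ... for the nilpotent lower-diagonal
    matrix N built from m = [m2, m3, ..., mn]; the series terminates."""
    n = len(m) + 1
    I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    N = [[0.0]*n for _ in range(n)]
    for i, mi in enumerate(m):
        N[i+1][i] = mi
    inv = [row[:] for row in I]
    P = [row[:] for row in I]          # running power N^k, starting at N^0
    sign = 1.0
    for _ in range(n - 1):
        P = matmul(P, N)               # next power of N
        sign = -sign
        for i in range(n):
            for j in range(n):
                inv[i][j] += sign * P[i][j]
    return inv
```

Multiplying the result by L = I + N recovers the identity exactly, because every term past N^{n−1} vanishes.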
We can now relate the solution of the non-homogeneous two-term recurrence (3.28) to the entries of the p-th lower diagonal of N^p, given by

    (N^p)_{p+i,i} = ∏_{k=i+1}^{p+i} m_k,  i = 1, 2, ..., n−p.

If there is an upper bound on the entries of N, m̄ = max_{1≤i≤n} |m_i|, then the absolute value of an entry on the p-th lower diagonal is

    |(N^p)_{p+i,i}| = ∏_{k=i+1}^{p+i} |m_k| ≤ m̄^p,  i = 1, 2, ..., n−p.

If m̄ < 1, we can now choose a value for p ∈ ℕ such that the solution has an error no bigger than 10^{−digit},

    m̄^p ≤ 10^{−digit}  or  p ≥ −digit · ln 10 / ln m̄,

where digit is the number of required decimal places. All of this now suggests computing

    y = (I − N + N² − N³ + ⋯ + (−1)^p N^p) b

as the truncated form of the solution to (3.29). From a practical standpoint we have

    y_i = b_i − m_i y_{i−1},  i = 2, ..., p,  with y_1 = b_1                          (3.30)

for the first p elements of y. Thereafter, for elements p+1 ≤ j ≤ n, we have

    y_j = b_j − b_{j−1} m_j + b_{j−2}(m_j m_{j−1}) − ⋯ + b_{j−p} ∏_{k=j−p+1}^{j} (−m_k)   (3.31)

in which only the last p product terms are retained, which is just (3.27) truncated.
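The truncation can be checked numerically. In the Python sketch below (an illustration, not the report's GPU code), solve_truncated keeps only the last p product terms of (3.31) and is compared against the exact sequential recurrence; with m̄ = 0.5, the bound m̄^p ≤ 10^{−digit} gives p ≥ digit·ln 10/ln 2 ≈ 3.32·digit, so p = 30 should deliver roughly nine correct digits.

```python
import math

def solve_bidiagonal(m, b):
    """Exact solution of L y = b via the sequential recurrence (3.30).
    m[0] is unused; m[i] multiplies y_{i-1}."""
    y = [b[0]]
    for i in range(1, len(b)):
        y.append(b[i] - m[i]*y[i-1])
    return y

def solve_truncated(m, b, p):
    """Truncated form (3.31): each y_j keeps only the last p product
    terms, which is what recursive doubling evaluates in parallel."""
    n = len(b)
    y = [0.0]*n
    for j in range(n):
        acc, prod = b[j], 1.0
        for q in range(1, min(p, j) + 1):
            prod *= -m[j - q + 1]        # running product of (-m_k)
            acc += b[j - q]*prod
        y[j] = acc
    return y

n = 200
m = [0.0] + [0.5]*(n - 1)                # m-bar = 0.5 < 1
b = [math.sin(i + 1.0) for i in range(n)]
exact = solve_bidiagonal(m, b)
approx = solve_truncated(m, b, p=30)
err = max(abs(u - v) for u, v in zip(exact, approx))
```

Every element of the truncated solution agrees with the exact one to within roughly m̄^p, independent of n, which is the property that makes the windowed evaluation safe on large systems.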
3.7 GPU Implementation

The forward and backward substitution logic is now represented by

    y_1 = f_1 / \gamma_1,   y_i = \frac{1}{\gamma_i}(y_{i-1} + f_i),   i = 2, \ldots, N-1,

and

    x_{N-1} = y_{N-1},   x_i = y_i + \frac{1}{\gamma_i} x_{i+1},   i = N-2, \ldots, 1,

respectively. Each of these two recurrences is remarkably similar to (3.30) and (3.31), and the GPU code takes advantage of this.

Assume, for example, that a diffusion problem has 2^20 equations in 2^20 unknowns, and for accuracy reasons it requires p = 15. The \gamma vector is first computed on the CPU and then passed into the GPU kernels. It was deemed unnecessary to implement the truncated version of the \gamma vector on the GPU, given that the LU factorization is performed once, outside of the solver loop.

The ForwardLU kernel (listing B.1) is launched with n = 2^20 threads. Recall that the threads are organized in groups of 32 called a warp, all of which execute the same instruction on different data elements. For a given warp 0 \le j < 2^15, threads numbered 32j \le i < 32j + 32 will each execute

    sum = 0;   sum = (sum + f_{i-k}) \frac{1}{\gamma_{i-k}},   k = 1, \ldots, p.

The BackwardLU kernel (listing B.2) behaves in a similar way.

3.8 Results

All of the software development was performed on a Lenovo S20 workstation with 14 GB of RAM, a 300 GB Western Digital VelociRaptor hard drive, an Intel Xeon W3520 quad-core CPU @ 2.67 GHz, and an NVIDIA Tesla K40c. The workstation is running
Windows 7/64. The development tools consisted of Visual Studio 2010, CUDA 6.5 and Tesla driver 341.44. All of the generated applications were 64-bit.

Table 3.1 and figure 3.1 summarize the results of the CPU version of the LU decomposition (CPU LU) versus the CPU version of the truncated \gamma vector (CPU ZC) that was described in section 3.4. Each element x_i of the solution vector x was the sum of 20 terms. The number of time steps was 50. Table 3.2 is a similar experiment with the number of terms equal to 32. Table 3.2 emphasizes that the additional 12 terms result in relative errors that are nearly identical to those of the CPU LU version.

Table 3.3 and figure 3.2 summarize the results of the CPU version of the LU decomposition versus the GPU version of the RD algorithm (GPU RD1L) that was described in section 3.5. Again, each element x_i of the solution vector x was the sum of 20 terms. The number of time steps was 50. In this case, the data clearly shows that the GPU prefers large problems over small ones, and ultimately executes 36 times faster than the CPU LU algorithm.

Table 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50

    Points              CPU LU (ms)   LU MaxRel Err   CPU ZC (ms)   ZC MaxRel Err   Speed Up
    2^10 = 1,024                1.1       2.467E-05           3.5       3.136E-05       0.31
    2^16 = 65,536              69.4       6.858E-09         231.0       1.773E-05       0.30
    2^20 = 1,048,576        1,222.2       7.650E-11       3,650.9       4.526E-06       0.33
    2^22 = 4,194,304        4,809.4       2.056E-10      14,671.2       2.333E-07       0.32
    2^24 = 16,777,216      19,164.3       3.473E-10      59,080.4       2.333E-07       0.32
    2^26 = 67,108,864      77,671.4       5.239E-09     236,833.5       2.333E-07       0.32
Table 3.2.: CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50

    Points         CPU LU (ms)   LU MaxRel Err   CPU ZC (ms)   ZC MaxRel Err   Speed Up
    1,024                  1.1       2.467E-05           5.7       2.467E-05       0.19
    65,536                69.4       6.858E-09         373.6       8.523E-09       0.18
    1,048,576          1,222.2       7.650E-11       5,848.9       2.486E-10       0.20
    4,194,304          4,809.4       2.056E-10      23,322.6       2.070E-10       0.20
    16,777,216        19,164.3       3.473E-10      93,991.6       3.458E-10       0.20
    67,108,864        77,671.4       5.239E-09     377,045.0       5.241E-09       0.20

Figure 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 (log-log plot of time in ms versus number of points for the CPU LU and CPU ZC series).
Table 3.3.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50

    Points         CPU LU (ms)   CPU MaxRel Err   GPU RD1L (ms)   GPU MaxRel Err   Speed Up
    1,024                  1.1        2.467E-05             2.7        3.136E-05        0.4
    65,536                69.4        6.858E-09             4.9        1.773E-05       14.1
    1,048,576          1,222.2        7.650E-11            37.3        4.526E-06       32.7
    4,194,304          4,809.4        2.056E-10           136.9        2.333E-07       35.1
    16,777,216        19,164.3        3.473E-10           536.8        2.333E-07       35.7
    67,108,864        77,671.4        5.239E-09         2,143.9        2.333E-07       36.2

Figure 3.2.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 (log-log plot of time in ms versus number of points for the CPU LU and GPU RD1L series).
4 DIFFUSION WITH LAYERS

4.1 Background

Consider a practical problem from hydrogeology shown in figure 4.1. The one-dimensional, transient-state, unsaturated flow through this section of earth is governed by

    \frac{\partial \theta(z,t)}{\partial t} = \frac{\partial}{\partial z}\left[ K(\theta) \left( \frac{\partial \psi(z,t)}{\partial z} + 1 \right) \right]   (4.1)

where \theta(z,t) is the saturation (water content), K(\theta) is the hydraulic conductivity, \psi(z,t) is the pressure head, z is the elevation above a vertical datum, and t is time.

Figure 4.1 effectively has multiple diffusion coefficients (in this case 5) because the hydraulic conductivity K(\theta) varies with the subsurface soil type.¹ The hydraulic conductivity coefficient has units of distance/time and, as such, can be considered the diffusive velocity with which water moves through a substructure. The value can be as small as 10^-8 meters/day for unfractured shale, and as large as 10^4 meters/day for gravel, thereby spanning 12 orders of magnitude [12].

If the diffusion coefficients are approximately the same order of magnitude, then a finite difference approach may work. However, as the ratio of the coefficients begins to

¹ Although equation (4.1) technically is a non-linear partial differential equation, it is used as an example because we believe the reader can relate to the notion of water percolating down through different layers of soil.
Figure 4.1.: A plant with multiple layers of soil

vary by an order of magnitude or more, the finite difference approach can suffer from a host of numerical problems affecting its convergence.

4.2 Two Layers

A two layer problem is simply two single layers connected by interface conditions across a common boundary. The interface conditions describe the continuity of u(x,t) and its flux, K \partial u / \partial x, across an interface x = l. Equations (4.2) through (4.5) describe the two layer problem.
    \frac{\partial u}{\partial t} = \kappa_1 \frac{\partial^2 u}{\partial x^2},   0 \le x < l_1,  t > 0   (4.2)

    \frac{\partial u}{\partial t} = \kappa_2 \frac{\partial^2 u}{\partial x^2},   l_1 < x \le 1,  t > 0   (4.3)

    u(l_1^-, t) = u(l_1^+, t)   (4.4)

    K_1 \left.\frac{\partial u}{\partial x}\right|_{l_1^-} = K_2 \left.\frac{\partial u}{\partial x}\right|_{l_1^+}   (4.5)

In this case, the diffusion coefficients \kappa_1, \kappa_2 are constant, as are the flux coefficients K_1 and K_2. We have used \pm to denote the right side (+) and left side (-) of the interface x = l_1.

The mathematics of two layers is similar to that in section 3.1, but it will now be generalized. Assume we have layers with widths l_1 and l_2. Each layer i has N_i sub-intervals, and the width of each sub-interval is h_i = l_i / N_i. Now define

    M_i = \sum_{k=1}^{i} N_k,   r_i = \frac{2 h_i^2}{\kappa_i \, \delta t},   i = 1, 2;

so that the coefficient matrices take on the form

    A_1 = \begin{bmatrix}
    2 + r_1 & -1 \\
    -1 & 2 + r_1 & -1 \\
    & \ddots & \ddots & \ddots \\
    & & -1 & 2 + r_1 & -1 \\
    & & & -1 & 2 + r_1
    \end{bmatrix}_{(N_1-1) \times (N_1-1)},   (4.6)

    A_2 = \begin{bmatrix}
    2 + r_2 & -1 \\
    -1 & 2 + r_2 & -1 \\
    & \ddots & \ddots & \ddots \\
    & & -1 & 2 + r_2 & -1 \\
    & & & -1 & 2 + r_2
    \end{bmatrix}_{(N_2-1) \times (N_2-1)};   (4.7)
the solution vectors take the form

    x_1 = \left( u_1^{(j)}, u_2^{(j)}, \ldots, u_{N_1-2}^{(j)}, u_{N_1-1}^{(j)} \right)^T  of length N_1 - 1,
    x_2 = \left( u_{N_1+1}^{(j)}, u_{N_1+2}^{(j)}, \ldots, u_{M_2-2}^{(j)}, u_{M_2-1}^{(j)} \right)^T  of length N_2 - 1;

the right hand terms take the form

    f_1 = \left( f_1^{(j)} + g^{(j)}, f_2^{(j)}, \ldots, f_{N_1-2}^{(j)}, f_{N_1-1}^{(j)} + u_{N_1^-}^{(j)} \right)^T,
    f_2 = \left( f_{N_1+1}^{(j)} + u_{N_1^+}^{(j)}, f_{N_1+2}^{(j)}, \ldots, f_{M_2-2}^{(j)}, f_{M_2-1}^{(j)} + h^{(j)} \right)^T;

and the system of equations is A_1 x_1 = f_1 and A_2 x_2 = f_2. We now need an equation that connects the layers while preserving continuity of the state variable and the flux.

4.3 Interface Equation

At the interface x = l_1 we have the continuity of the state variable,

    u_{N_1^-}^{(j)} = u_{N_1^+}^{(j)} = u_{N_1}^{(j)},

and the continuity of the flux,

    K_1 \frac{u_{N_1^-}^{(j)} - u_{N_1-1}^{(j)}}{h_1} = K_2 \frac{u_{N_1+1}^{(j)} - u_{N_1^+}^{(j)}}{h_2}.

When the two previous interface conditions are combined, we are left with an expression for the solution across the interface,

    \left( \frac{K_1}{h_1} + \frac{K_2}{h_2} \right) u_{N_1}^{(j)} = \frac{K_1}{h_1} u_{N_1-1}^{(j)} + \frac{K_2}{h_2} u_{N_1+1}^{(j)}   \forall j.   (4.8)
4.4 Complete System for Two Layers

In matrix form, the tridiagonal system for two layers with the interface included consists of

    A_{if} \, x^{(j)} = f^{(j)}   (4.9)

where A_{if} has A_1 in its upper left block and A_2 in its lower right block, joined by the interface row; writing only the non-zero entries of rows N_1 - 1 through N_1 + 1,

    row N_1 - 1:   \ldots  -1    2 + r_1    -1
    row N_1:             -K_1/h_1    K_1/h_1 + K_2/h_2    -K_2/h_2
    row N_1 + 1:                      -1    2 + r_2    -1  \ldots

and

    x = \left( u_1, u_2, \ldots, u_{N_1-1}, u_{N_1}, u_{N_1+1}, \ldots, u_{M_2-2}, u_{M_2-1} \right)^T,
    f = \left( f_1 + u_0, f_2, \ldots, f_{N_1-1}, 0, f_{N_1+1}, \ldots, f_{M_2-2}, f_{M_2-1} + u_{M_2} \right)^T.

Notice that the sub-matrix in the upper left hand corner of A_{if} is A_1 from (4.6), the sub-matrix in the lower right hand corner of A_{if} is A_2 from (4.7), and the 5 terms in the center of A_{if} relate to the interface equation (4.8). Solving (4.9) now consists of the following 5 steps.
1. Compute the LU factorization of A_1 and then perform the forward substitution to obtain the relation between u_{N_1^-}^{(j)} and u_{N_1-1}^{(j)}.

2. Compute the UL factorization of A_2 and then perform the backward substitution to obtain the relation between u_{N_1^+}^{(j)} and u_{N_1+1}^{(j)}.

3. Substitute the results of (1) and (2) into the interface equation (4.8) to find u_{N_1}^{(j)}.

4. Perform the backward substitution through layer 1.

5. Perform the forward substitution through layer 2.

4.5 Difference Between Inclusion and Exclusion of the Interface Condition

Equations (4.4) and (4.5) describe the continuity of the state variable and its flux across the interface. If we look at

    K_1 \frac{\partial u_1}{\partial x} = K_2 \frac{\partial u_2}{\partial x}

and assume that K_2 \gg K_1, we have

    \frac{K_1}{K_2} \frac{\partial u_1}{\partial x} = \frac{\partial u_2}{\partial x} \rightarrow 0 \approx \frac{\partial u_2}{\partial x}.

This suggests a horizontal tangent of u_2 at the interface. This behavior is not captured by the form that excludes the interface condition, and an error results between the two. These errors are described in section 4.7, where the experiments are discussed.

4.6 Implementation

The CPU implementation for the two layer problem is a simple adaptation of the standard LU factorization and solver, running through the entire set of state variables from 1 to M_2. Listing A.1 shows that the factorization requires 5 lines of code and the solver requires 6 lines of code. In contrast, the two layer factorization in listing A.2 requires 11 lines of code and the solver in listing A.3 requires 13 lines.
The GPU implementation for the two layer problem also leverages the logic from the one layer problem. The forward and backward substitution kernels for the LU solver require approximately 27 lines of code each. Since we now also need forward and backward substitution kernels for the UL solver, they contribute approximately 27 lines each to the total line count.

4.7 Results

If the interface expression in (4.8) is excluded and we are presented with a two layer problem, the tridiagonal matrix A_12 has the form

    A_{12} = \begin{bmatrix}
    2 + r_1 & -1 \\
    -1 & 2 + r_1 & -1 \\
    & \ddots & \ddots & \ddots \\
    & -1 & 2 + r_1 & -1 \\
    & & -1 & 2 + r_2 & -1 \\
    & & & \ddots & \ddots & \ddots \\
    & & & & -1 & 2 + r_2 & -1 \\
    & & & & & -1 & 2 + r_2
    \end{bmatrix}   (4.10)

for the system

    A_{12} \, x = f.   (4.11)

Four experiments were performed that apply the CPU LU decomposition to (4.11), and the results were compared to the 5 step algorithm (CPU 2L) described in section 4.4. The parameters for the experiments are described in table 4.1.

Experiment 1 in figures 4.2 and 4.3 shows that the results of solving (4.9) and (4.11) are nearly identical. Figure 4.2 demonstrates the time evolution of the single interface point u_{N_1} when the interface equation is utilized, and compares that value with the two values u_{N_1} and u_{N_1+1} that straddle u_{N_1+1/2} in (4.11), which excludes the interface. In figure 4.3, the systems are solved with 1024 points over 50 time steps. With the interface point at u_512, the 32 points on either side are in excellent agreement for both systems.
Table 4.1.: Experimental parameters for 50 timesteps

    Experiment #   κ1   κ2   K1   K2
    1               1    1    1    1
    2               1   10    1    1
    3               1    1    1   10
    4               1   10    1   10

Experiment 2 in figures 4.4 and 4.5 demonstrates the next set of results. The time evolution plot in figure 4.4 shows the interface point significantly above the excluded form, while the solutions graphed in figure 4.5 coincide poorly. The error plot between the two solutions in figure 4.6 further quantifies the consequences of ignoring the interface equation.

Experiment 3 in figures 4.7 and 4.8 demonstrates results that are in good agreement with each other. This experiment also seems to suggest that the flux coefficient alone has little impact on the solution.

Finally, experiment 4 in figures 4.9 and 4.10 graphically demonstrates the problem that occurs when the diffusion and conductivity coefficients are κ1 = 1, κ2 = 10 and K1 = 1, K2 = 10, respectively. Figure 4.9 shows the actual interface point u_{N_1} receding from the two points that excluded the interface equation. Again, when the solution is graphed after 50 time steps in figure 4.10, it is readily apparent that excluding the interface equation introduces error into the solution. As in experiment 2, figure 4.11 quantifies the solution error. The results of experiments 2 and 4 suggest that the diffusion coefficients, along with the solver, play a large role in producing a correct solution.

With experiments 1 through 4 giving us faith in the interface equation approach, the benefits of accelerating the process on the GPU are now reported. Table 4.2 demonstrates the 18 fold acceleration of the GPU (GPU RD2L) over the CPU (CPU 2L)
version with 2 layers, as well as relative errors that are small when compared with the analytical solution corresponding to κ1 = 1, κ2 = 1, K1 = 1, K2 = 1 with 32 terms.

The results of table 4.3 are not surprising. With κ1 = 1, κ2 = 10, K1 = 1, K2 = 1 and the number of terms set to 32, a maximum acceleration of 18 fold is measured. The modifications made to RD allow us to process problems that are 65,536 times larger than the previously reported sizes for Stone's approach.

Table 4.2.: CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 1, Terms=32, Time Steps=50

    Points         CPU 2L (ms)   CPU MaxRel Err   GPU RD2L (ms)   GPU MaxRel Err   Speed Up
    1,024                  1.7        8.756E-05             7.4        8.756E-05        0.2
    65,536                70.9        2.318E-08            11.2        2.360E-08        6.3
    1,048,576          1,038.4        8.091E-11            66.1        2.533E-10       15.7
    4,194,304          4,163.8        2.056E-10           238.1        2.070E-10       17.4
    16,777,216        16,638.1        3.473E-10           928.7        3.458E-10       17.9
    67,108,864        66,963.9        5.239E-09         3,698.4        5.241E-09       18.1
Table 4.3.: CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 10, Time Steps=50

    Points         CPU 2L (ms)   GPU RD2L (ms)   Speed Up
    768                    0.7             7.4        0.1
    49,152                50.7            10.2        4.9
    786,432              818.5            51.6       15.8
    3,145,728          3,161.0           180.8       17.4
    12,582,912        12,524.5           700.1       17.8
    50,331,648        50,639.1         2,779.5       18.2

Figure 4.2.: Time Evolution of Interface-Experiment 1 (κ1 = 1, κ2 = 1, K1 = 1, K2 = 1; solution values versus time step for u_{N_1} with the interface equation included, and for u_{N_1} and u_{N_1+1} with it excluded).
Figure 4.3.: Solution to the Interface Neighborhood-Experiment 1 (κ1 = 1, κ2 = 1, K1 = 1, K2 = 1 at ts = 50; solution values versus spatial index with the interface equation included and excluded).

Figure 4.4.: Time Evolution of Interface-Experiment 2 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 1; solution values versus time step for u_{N_1} with the interface equation included, and for u_{N_1} and u_{N_1+1} with it excluded).
Figure 4.5.: Solution to the Interface Neighborhood-Experiment 2 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 1 at ts = 50; solution values versus spatial index with the interface equation included and excluded).

Figure 4.6.: Error in the Interface Neighborhood-Experiment 2 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 1 at ts = 50; |error| versus spatial index).
Figure 4.7.: Time Evolution of Interface-Experiment 3 (κ1 = 1, κ2 = 1, K1 = 1, K2 = 10; solution values versus time step for u_{N_1} with the interface equation included, and for u_{N_1} and u_{N_1+1} with it excluded).

Figure 4.8.: Solution to the Interface Neighborhood-Experiment 3 (κ1 = 1, κ2 = 1, K1 = 1, K2 = 10 at ts = 50; solution values versus spatial index with the interface equation included and excluded).
Figure 4.9.: Time Evolution of Interface-Experiment 4 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 10; solution values versus time step for u_{N_1} with the interface equation included, and for u_{N_1} and u_{N_1+1} with it excluded).

Figure 4.10.: Solution to the Interface Neighborhood-Experiment 4 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 10 at ts = 50; solution values versus spatial index with the interface equation included and excluded).
Figure 4.11.: Error in the Interface Neighborhood-Experiment 4 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 10 at ts = 50; |error| versus spatial index).
5 CONCLUSION

We started this project with the goal of creating a GPU based solver for the two layer linear diffusion equation. Before the solver could be addressed, the LU factorization of the finite difference form of the diffusion equation was investigated. The two term non-linear recurrence relation that results from the LU factorization was shown to be sequential and not parallelizable. When converted to a three term linear recurrence relation, the result could be parallelized for a GPU. However, it suffered from arithmetic overflow. When the underlying initial value problem was solved analytically, we were able to demonstrate that a polynomial form of the solution could be truncated to achieve a prescribed level of machine accuracy. Our experiments showed excellent agreement with the solution (3.3) to the initial value problem.

We mentioned briefly that other researchers reported numerical instabilities when applying Stone's recursive doubling algorithm to the actual solver for problems with 1,024 unknowns. Overflow is inherent in Stone's algorithm. In addition, the instability is due to an arithmetic underflow problem that appears when the two term linear recurrence relation associated with the solver is evaluated. When Stone's algorithm is recast using matrix algebra, we were able to overcome these deficiencies by applying the theory of nilpotent matrices to generate a truncated form of the matrix polynomial.

Our experiments demonstrated a GPU based algorithm that is approximately 36 times faster than the CPU version for the 1 layer problem, and approximately 18 times faster than the CPU for the 2 layer problem. Our 1 layer implementation required two GPU kernels and the 2 layer implementation required 5 kernels. The GPU is designed for large problems that stream RAM through the cores.
We attribute the two-fold loss in efficiency to the presence of the additional kernels, as well as the awkward coding patterns that were required to handle the interface condition. On the positive side, our algorithm
is also capable of processing 2^26 equations, a value that is 65,536 times larger than that of other researchers.

Finally, we described the mathematical framework for solving the two layer problem. Our experiments show that error is introduced into the solution when the interface conditions, the continuity of the state variable and the flux across the interface boundary, are not properly accounted for. In fact, our work suggests that the flux coefficients control the placement of the solution, while the diffusion coefficients, in combination with the type of solver used, control the error in the solution. We have also shown that, with the straightforward LU solver for two layers requiring approximately 54 lines of kernel code on the GPU, our parallel two layer solution is only modestly more complex at approximately 108 lines of code.
REFERENCES

[1] Greg Turk. Generating textures on arbitrary surfaces using reaction diffusion. In Proc. of the 18th Annual Conference on Computer Graphics and Interactive Techniques, pages 289–298. ACM Press, 1991.

[2] Daniele Bertaccini and Daniela Calvetti. Fast simulation of solid tumors thermal ablation treatments with a 3D reaction diffusion model. Computers in Biology and Medicine, 37(8):1173–1182, 2007.

[3] F. T. Tracy. Clean two- and three-dimensional analytical solutions of Richards' equation for testing numerical solvers. Water Resources Research, 42(8), 2006.

[4] R. Allan Freeze. Three dimensional, transient, saturated-unsaturated flow in a groundwater basin. Water Resources Research, 7(2):347–366, 1971.

[5] James S. Boswell and Greg A. Olyphant. Modeling the hydrologic response of groundwater dominated wetlands to transient boundary conditions: Implications for wetland restoration. Journal of Hydrology, 332(3):467–476, 2007.

[6] NVIDIA Corp. CUDA C Programming Guide. Technical Report PG-02829-001 v5.5, NVIDIA Corp, July 2013.

[7] Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22(8):786–793, 1973.

[8] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, Carnegie Mellon University, November 1990.

[9] Volodymyr Kindratenko. A guide for implementing tridiagonal solvers on GPUs. In Numerical Computations with GPUs. Springer International Publishing, 2014.

[10] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the GPU. In Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 127–136. ACM Press, January 2010.

[11] Harold S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM, 20(1):27–38, 1973.

[12] Ralph C. Heath. Basic ground-water hydrology. U.S. Geological Survey Water-Supply Paper 2220, 1983.
A CPU CODE

// =======================================================================
// LUFactorizationCPU
// =======================================================================
void LUFactorizationCPU(const CVector &SubDiag, const CVector &MainDiag,
                        const CVector &SuperDiag, CVector &Gamma, CVector &Delta)
{
    Gamma[1] = MainDiag[1];
    Delta[1] = SuperDiag[1] / Gamma[1];

    for (int i = 2; i <= Gamma.Last(); i++)
    {
        Gamma[i] = MainDiag[i] - SubDiag[i]*Delta[i-1];
        Delta[i] = SuperDiag[i] / Gamma[i];
    }
}

// =======================================================================
// LUSolveCPU
// =======================================================================
void LUSolveCPU(const CVector &Gamma, const CVector &Delta,
                const CVector &ASubDiag,
                const CVector &f, CVector &x, CVector &y)
{
    //
    // Forward substitution. Solve Ly = f.
    //
    y[1] = f[1] / Gamma[1];

    for (int i = 2; i <= y.Last(); i++)
    {
        y[i] = (f[i] - ASubDiag[i]*y[i-1]) / Gamma[i];
    }

    //
    // Backward substitution. Solve Ux = y.
    //
    x[x.Last()] = y[y.Last()];

    for (int i = x.Last()-1; i > 0; i--)
    {
        x[i] = y[i] - Delta[i]*x[i+1];
    }
}

Listing A.1: 1 Layer CPU Factorization and Solver
// =======================================================================
// TwoLayerFactorization
// =======================================================================
void TwoLayerFactorization(const double r1, const double r2,
                           CVector &Gamma1, CVector &Delta1,
                           CVector &Gamma2, CVector &Delta2)
{
    //
    // The LU factorization for matrix A1
    //
    const double b1 = 2.0 + r1;
    const double b2 = 2.0 + r2;

    Gamma1[1] = b1;
    Delta1[1] = -1.0 / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        Gamma1[i] = b1 + Delta1[i-1];
        Delta1[i] = -1.0 / Gamma1[i];
    }

    //
    // The UL factorization for matrix A2
    //
    Gamma2[Gamma2.Last()] = b2;

    for (int i = Gamma2.Last()-1; i >= 1; i--)
    {
        Delta2[i] = -1.0 / Gamma2[i+1];
        Gamma2[i] = b2 + Delta2[i];
    }
}

Listing A.2: 2 Layer CPU Factorization
// =======================================================================
// Solve2LayerCPU
// =======================================================================
void Solve2LayerCPU(const CVector &Gamma1, const CVector &Delta1, CVector &y,
                    CVector &f, const CVector &Gamma2, const CVector &Delta2,
                    CVector &z, CVector &u,
                    const double K1, const double K2,
                    const double H1, const double H2)
{
    // Top down, y[1] to the point y[N_1-1].
    // The forward substitution for LU.
    y[1] = f[1] / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        y[i] = (f[i] + y[i-1]) / Gamma1[i];
    }

    // Bottom up, z[N_1-1] to z[1].
    // The backward substitution for UL.
    z[z.Last()] = f[f.Last()];

    for (int i = 1; i < z.Last(); i++)
    {
        z[z.Last()-i] = f[f.Last()-i] - Delta2[Delta2.Last()-i]*z[z.Last()-i+1];
    }

    // Fix up U_{N_1}
    const int N1 = Gamma1.Last() + 1;

    const double alpha = (K1/H1)*(1.0 - (1.0/Gamma1[Gamma1.Last()]))
                       + (K2/H2)*(1.0 - (1.0/Gamma2[1]));

    u[N1] = ((K1/H1)*y[y.Last()] + (K2/H2)*(z[1]/Gamma2[1])) / alpha;  // update U_{N_1}

    // Advance up from point U_{N_1-1} to U_1.
    // The backward substitution for LU.
    for (int i = N1-1; i >= 1; i--)
    {
        u[i] = y[i] - Delta1[i]*u[i+1];
    }

    // Move down from point U_{N_1+1} to U_{M_2}.
    // The forward substitution for UL.
    for (int i = 1; i <= Gamma2.Last(); i++)
    {
        u[N1+i] = (z[i] + u[N1+i-1]) / Gamma2[i];
    }
}

Listing A.3: 2 Layer CPU Solver
B GPU CODE

// =======================================================================
// ForwardLU_kernel -- compute the forward scan on the GPU
// =======================================================================
__global__ void ForwardLU_kernel(const double * __restrict__ gpuGamma,
                                 const double * __restrict__ gpuFhat,
                                 double *gpuYhat, unsigned int NumberOfElements)
{
    __shared__ double sG[MAXTERMS+1024];
    __shared__ double sF[MAXTERMS+1024];

    const unsigned int gtid0 = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx0  = threadIdx.x;

    double TG0 = 0.0;
    double TF0 = 0.0;

    if (gtid0 < NumberOfElements)
    {
        TG0 = gpuGamma[gtid0];
        TF0 = gpuFhat[gtid0];
    }
    sG[idx0+MAXTERMS] = TG0;
    sF[idx0+MAXTERMS] = TF0;

    if (idx0 < MAXTERMS)
    {
        TG0 = 0.0;
        TF0 = 0.0;
        if (gtid0 > MAXTERMS)
        {
            TG0 = gpuGamma[gtid0-MAXTERMS];
            TF0 = gpuFhat[gtid0-MAXTERMS];
        }
        sG[idx0] = TG0;
        sF[idx0] = TF0;
    }
    __syncthreads();

    double Sum0 = 0.0;

    for (int i = 1; i <= MAXTERMS; i++)
    {
        double F0 = sF[idx0+i];
        double G0 = sG[idx0+i];

        Sum0 = (Sum0 + F0)*G0;
    }

    if (gtid0 < NumberOfElements)
    {
        gpuYhat[gtid0] = Sum0;
    }
}

Listing B.1: GPU Forward LU Solver
// =======================================================================
// BackwardLU_kernel -- compute the backward scan on the GPU
// =======================================================================
__global__ void BackwardLU_kernel(const double * __restrict__ gpuGamma,
                                  const double * __restrict__ gpuYhat,
                                  double *gpuXhat, unsigned int NumberOfElements)
{
    __shared__ double sG[1024+MAXTERMS];
    __shared__ double sY[1024+MAXTERMS];

    const unsigned int gtid   = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx    = threadIdx.x;
    const unsigned int warpid = idx >> 5;

    double TG = 0.0;
    double TY = 0.0;
    if (gtid < NumberOfElements)
    {
        TG = gpuGamma[gtid];
        TY = gpuYhat[gtid];
    }
    sG[idx] = TG;
    sY[idx] = TY;

    if (blockIdx.x < (NumberOfElements >> 10))  // if not in the last block
    {
        if (warpid == 31)  // if in the last warp
        {
            TG = gpuGamma[gtid+MAXTERMS];
            TY = gpuYhat[gtid+MAXTERMS];
            sG[idx+MAXTERMS] = TG;
            sY[idx+MAXTERMS] = TY;
        }
    }

    __syncthreads();

    double Sum = 0.0;

    for (int i = MAXTERMS-1; i >= 0; i--)
    {
        double Y = sY[idx+i];
        double G = (i == 0 ? 1.0 : sG[idx+i-1]);

        Sum = (Sum + Y)*G;
    }

    if (gtid < NumberOfElements)
        gpuXhat[gtid] = Sum;
}

Listing B.2: GPU Backward LU Solver