TWO LAYER LINEAR DIFFUSION EQUATION ON THE GPU
Technical Report TR-CIS-0420-09
Submitted to the Faculty
of
Indiana University-Purdue University Indianapolis
by
Robert J. Zigon
December 2015
Indianapolis, Indiana
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Graphics Processing Unit (GPU) . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Software Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Hardware Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 One Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Factorization of a Tridiagonal Matrix . . . . . . . . . . . . . . . . . . . 11
3.2.1 LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 UL Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 CPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Parallel LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.5 Parallel Solver and Recursive Doubling . . . . . . . . . . . . . . . . . . 18
3.6 Recursive Doubling and Nilpotent Matrices . . . . . . . . . . . . . . . 19
3.7 GPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Diffusion with Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Two Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Interface Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Complete System for Two Layers . . . . . . . . . . . . . . . . . . . . . 30
4.5 Difference Between Inclusion and Exclusion of the Interface Condition . 31
4.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
A CPU code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
B GPU code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
LIST OF TABLES
Table Page
3.1 CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 . . . . . . . 23
3.2 CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50 . . . . . . . 24
3.3 CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 . . . . . 25
4.1 Experimental parameters for 50 timesteps . . . . . . . . . . . . . . . . . . 33
4.2 CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 1, Terms=32, Time Steps=50 . . . . 34
4.3 CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 10, Time Steps=50 . . . . . . . . . . 35
LIST OF FIGURES
Figure Page
2.1 Memory bandwidth for the CPU and GPU . . . . . . . . . . . . . . . . . . 4
2.2 Floating point operations per second for the CPU and GPU . . . . . . . . 4
2.3 A GPU Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 A GPU and a Streaming Multiprocessor (SM or SMX) . . . . . . . . . . . 8
3.1 CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 . . . . . . . 24
3.2 CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 . . . . . 25
4.1 A plant with multiple layers of soil . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Time Evolution of Interface-Experiment 1 . . . . . . . . . . . . . . . . . . 35
4.3 Solution to the Interface Neighborhood-Experiment 1 . . . . . . . . . . . . 36
4.4 Time Evolution of Interface-Experiment 2 . . . . . . . . . . . . . . . . . . 36
4.5 Solution to the Interface Neighborhood-Experiment 2 . . . . . . . . . . . . 37
4.6 Error in the Interface Neighborhood-Experiment 2 . . . . . . . . . . . . . . 37
4.7 Time Evolution of Interface-Experiment 3 . . . . . . . . . . . . . . . . . . 38
4.8 Solution to the Interface Neighborhood-Experiment 3 . . . . . . . . . . . . 38
4.9 Time Evolution of Interface-Experiment 4 . . . . . . . . . . . . . . . . . . 39
4.10 Solution to the Interface Neighborhood-Experiment 4 . . . . . . . . . . . . 39
4.11 Error in the Interface Neighborhood-Experiment 4 . . . . . . . . . . . . . . 40
ABSTRACT
Zigon, Robert MS, Purdue University, December 2015. Two Layer Linear Diffusion
Equation on the GPU. Major Professors: Raymond Chin, Shaofin Fang and Fengguang
Song.
The purpose of this project is to investigate the mathematical framework for evalu-
ating the two layer linear diffusion equation on a GPU. The diffusion equation is first
approximated using finite differences to produce the matrix equation Ax = f. The two
term non-linear recurrence relation for the LU factorization of the A matrix is then con-
verted into a three term linear recurrence relation by way of a Riccati transform. The
three term relation is then shown to be parallelizable. After the numeric underflow prob-
lem for the LU solver of the system is reconciled, Stone’s recursive doubling algorithm is
then implemented. Finally, the parallel implementation is applied to a form of the two
layer diffusion equation that properly models the flux across the internal boundary.
1 INTRODUCTION
In physics, diffusion is defined as the change in the distribution of a collection of
particles, as well as its depletion, in time and space. The underlying partial differential
equation can be used to model many different types of processes. For example, open a
bottle of perfume. As the molecules of the scent first escape the container, they are in
very high concentration. Over time they spread outward in every direction into regions
where they are in low concentration.
Another example of diffusion exists in biology. A process called morphogenesis con-
trols the spatial distribution of cells during the embryonic development of an organism.
Natural patterns, such as the spots on a leopard, are believed to be the result of cellular
differentiation in many different directions [1].
The diffusion equation also appears in oncology with the use of radio frequency
thermal ablation (RFA). In this process, tumor cells are killed by focusing energy on a
diseased portion of the body. In order to better understand the ablation process, models
are used to analyze the energy and temperature distribution in the context of the muscle,
fat and bone that are adjacent to the tumor cells [2].
Yet another example of diffusion exists in hydrogeology – the study of the movement
of groundwater in the soil and rocks of the Earth's crust. Groundwater does not always
flow downhill in the subsurface by following the surface topography. Instead, it can be
driven by pressure gradients in both saturated and unsaturated regions. This results in
behavior that is difficult to predict for all but the simplest situations.
The goal of this project is to implement a solver for the one dimensional linear
diffusion equation with two layers on a GPU. We will begin with a description of the
modern GPU. The problem itself will then be investigated in two phases. The first
phase will start with the finite difference equations for the one dimensional, constant
coefficient, linear diffusion equation (due to its relative simplicity) on the CPU and GPU.
The second phase will then investigate adding layers to the first phase. This two phase
approach will allow us to first understand the issues surrounding diffusion on different
hardware architectures, and then focus on the two layer problem, so that an efficient
parallel solver can be implemented.
The solver consists of two components that are designed to improve execution time or
the accuracy of the solution. First, the tridiagonal structure of the underlying matrices
will be considered so that the LU-decomposition can be applied with a computational
complexity of O(n). While doing so, we will show how to implement Stone's recursive
doubling algorithm to solve 2^26 equations, a scale that, up to now, has not exceeded
1,024 equations. For the second component, we will demonstrate the mathematics to
treat the boundary of two different diffusion coefficients in a manner that reduces error
in the solution.
2 GRAPHICS PROCESSING UNIT (GPU)
The modern Graphics Processing Unit (GPU) has its genesis in 2D and 3D computer
graphics. In 2000, parallel processing and floating point arithmetic capabilities were
added to graphics cards to accelerate the rate that world geometries could be trans-
formed, illuminated, projected, clipped and then displayed as pixels. This sequence of
operations is called the graphics pipeline. It makes heavy use of five basic floating point
operators (addition, subtraction, multiplication, division and square root). The process
itself is called embarrassingly parallel because the transformation sequence applied to a
three dimensional vertex is independent of the other vertices.
In 2002, researchers became interested in these parallel processing and floating point
capabilities. They used the graphics application programming interface (API) to compute
functions such as fast Fourier transforms and convolutions. NVidia took note of
the scientific computing trend with GPUs and developed CUDA - the Compute Unified
Device Architecture. CUDA [6] is a computing platform and programming model that
uses a C-like language to expose the massive parallelism of GPU hardware. In retrospect,
the demand for real time graphics has caused the GPU to evolve into a highly parallel,
multithreaded, many core processor with very high memory bandwidth and computing
throughput as illustrated in figures 2.1 and 2.2.
2.1 Software Model
CUDA was designed to overcome the challenge of writing applications that transpar-
ently scale with increasing numbers of processing cores by maintaining a low learning
curve for programmers familiar with the C programming language. There are three
abstractions at the core of CUDA - a hierarchy of thread groups, shared memories and
barrier synchronization. These abstractions guide the programmer to partition a problem
into sub-problems that can be solved independently in parallel by blocks of threads
executing a kernel.

Figure 2.1.: Memory bandwidth for the CPU and GPU

Figure 2.2.: Floating point operations per second for the CPU and GPU
A kernel is a program written in CUDA C that is downloaded from the host to a
GPU board at runtime. Parameters are passed to the kernel at invocation to provide
it with operands that are transformed by the GPU. In listing 2.1 the kernel program
extends from line 1 through line 5. Line 13 of the main program (that is executing on
the Intel CPU) essentially downloads the V ectorAdd kernel to the GPU and launches
it on 1,000 threads. At runtime each thread is assigned a unique thread index, in this
case ranging from 0 to 999. The ith
thread loads A[i], B[i], adds them, and then writes
the result to C[i]. When all of the threads have executed the V ectorAdd kernel, control
is returned to the main program at line 14.
Modern GPU hardware (like an NVidia Tesla K20 board) can have as many as 2,496
processing cores. When a kernel is launched, one of the required parameters is the thread
count. The requested thread count can exceed the number of physical cores on the GPU.
From a conceptual standpoint, the hardware maps blocks of 32 threads (called warps)
to 32 cores until they have finished executing. When one warp finishes another one is
allocated to the idle block of cores for execution. This is one of the key abstractions
in CUDA that lends itself to transparent scalability. If, for example, a next generation
board arrives with 10,000 cores, the kernels are oblivious to the environmental change.
The hardware and runtime take care of the mapping from threads to cores, and the
kernel is executed. The result is a platform that preserves the user’s investment in code
while insulating it from hardware changes.
1 __global__ void VectorAdd(float *A, float *B, float *C)
2 {
3 int i = threadIdx.x;
4 C[i] = A[i] + B[i];
5 }
6
7 int main ()
8 {
9 const int N = 1000;
10 ...
11 // Kernel invocation from the host with N threads
12
13 VectorAdd <<<1, N>>>(A, B, C);
14 ...
15 }
Listing 2.1: Example kernel and host code
2.2 Hardware Model
From a hardware perspective, the fundamental computing unit in an NVidia GPU
is a core (see figure 2.3). A core contains a 32 bit arithmetic logic unit (ALU) capable
of performing operations such as min, max, add, subtract, multiply, divide, compare
and bitwise logical operators. A core also contains a single and double precision floating
point unit.
In figure 2.4 we see that a collection of cores are grouped together in a unit known as
a Streaming Multiprocessor (SM). The SM in the figure has 192 cores within. A GPU
itself is then a collection of SM's. Although figure 2.4 shows 8 SM's (for a total of 1,536
cores), the Tesla K20 cards used in this project have 13 SM's, for a total of 2,496 cores.
An SM is designed to execute thousands of programming threads concurrently.
To manage such a large number of threads, it employs a unique architecture called
SIMT (Single Instruction, Multiple Thread). An SM schedules and executes the threads
grouped as warps. Individual threads composing a warp begin execution at the same
instruction address, but have their own register state and are therefore free to branch
and execute independently. However, full efficiency is realized when all 32 threads of a
warp agree on their execution path. If threads of a warp diverge via a data dependent
conditional branch, the warp serially executes each branch taken, disabling threads that
are not on that path, and when all paths complete, the threads converge back to a
common path.

Figure 2.3.: A GPU Core
The SIMT architecture is somewhat similar to the architecture of vector processors
known as SIMD (Single Instruction, Multiple Data). A key difference, however, is
that the SIMD organization exposes the width of the vector to the software (and
programmer). SIMT, on the other hand, specifies the execution and branching behavior
in terms of a single thread. This model simplifies parallel programming somewhat and
aids in program correctness.
Figure 2.4.: A GPU and a Streaming Multiprocessor (SM or SMX)
3 DIFFUSION
3.1 One Layer
The one dimensional diffusion equation is

    ∂u/∂t = κ ∂²u/∂x²,  0 ≤ x ≤ L, t > 0  (3.1)

where u = u(x, t) is the dependent variable and κ is a real constant. The initial
condition and boundary conditions are u(x, 0) = 0, u(0, t) = g(t) and u(L, t) = h(t).
The particular problem we will solve has boundary and initial conditions that are equal
to

    u(0, t) = 0,  u(L, t) = 0,  u(x, 0) = sin(πx/L).  (3.2)

The solution to (3.1) subject to (3.2) is then

    u(x, t) = sin(πx/L) exp(−κπ²t/L²).  (3.3)
The results of this project will be validated for both the CPU and GPU using (3.3).
We begin with an implicit time discretization scheme with a trapezoidal rule in which
t_j = jδt. Then

    ∂u/∂t = F(x, t)

is converted to

    [u^(j)(x_i) − u^(j−1)(x_i)] / δt = [F(x_i, t_j) + F(x_i, t_{j−1})] / 2.
Define u(x_i, t_j) to be replaced by u_i^(j). The time discretization of the diffusion equation
becomes

    [u^(j)(x_i) − u^(j−1)(x_i)] / δt = (κ/2) [ (∂²u/∂x²)^(j)|_{x_i} + (∂²u/∂x²)^(j−1)|_{x_i} ].  (3.4)
We now use a central difference with uniform spacing h = L/N for the spatial
dimension, where N is the number of sub-intervals. As a result, (3.4) becomes

    [u^(j)(x_i) − u^(j−1)(x_i)] / δt = (κ/2) [ (u_{i−1} − 2u_i + u_{i+1})^(j)/h² + (u_{i−1} − 2u_i + u_{i+1})^(j−1)/h² ].  (3.5)
If we let r = 2h²/(κδt), then (3.5) can be rewritten as

    −u_{i−1}^(j) + (2 + r)u_i^(j) − u_{i+1}^(j) = u_{i−1}^(j−1) − (2 − r)u_i^(j−1) + u_{i+1}^(j−1),  i = 1 . . . N − 1, j > 0.  (3.6)
For convenience, define

    f_1^(j) = u_0^(j−1) − (2 − r)u_1^(j−1) + u_2^(j−1)
    f_i^(j) = u_{i−1}^(j−1) − (2 − r)u_i^(j−1) + u_{i+1}^(j−1),  2 ≤ i ≤ N − 2
    f_{N−1}^(j) = u_{N−2}^(j−1) − (2 − r)u_{N−1}^(j−1) + u_N^(j−1)
then (3.6) becomes

    (2 + r)u_1^(j) − u_2^(j) = f_1^(j) + u_0^(j)
    −u_{i−1}^(j) + (2 + r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2 . . . N − 2, j > 0  (3.7)
    −u_{N−2}^(j) + (2 + r)u_{N−1}^(j) = f_{N−1}^(j) + u_N^(j).
Now rewrite (3.7) in terms of a left boundary equation, a set of interior equations
for 1 < i < N − 1 and a right boundary equation. The left boundary equation is

    (2 + r)u_1^(j) − u_2^(j) = f_1^(j) + g^(j),  j > 0.  (3.8)

The interior equations are

    −u_{i−1}^(j) + (2 + r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2 . . . N − 2, j > 0  (3.9)

and the right boundary equation is

    −u_{N−2}^(j) + (2 + r)u_{N−1}^(j) = f_{N−1}^(j) + h^(j),  j > 0.  (3.10)

This is done to emphasize the roles of the interior and boundary forcing functions. As
such, the boundaries need no special treatment.
We can now rewrite (3.8), (3.9) and (3.10) in matrix form to yield

    Ax = f  (3.11)

where A is the (N − 1) × (N − 1) tridiagonal matrix

    A = tridiag(−1, 2 + r, −1),  (3.12)

x is the vector of unknowns

    x = (u_1^(j), u_2^(j), . . . , u_{N−2}^(j), u_{N−1}^(j))^T,

and f is the forcing vector

    f = (f_1^(j) + g^(j), f_2^(j), . . . , f_{N−2}^(j), f_{N−1}^(j) + h^(j))^T.  (3.13)

This is the system of equations that will be solved.
3.2 Factorization of a Tridiagonal Matrix
3.2.1 LU Factorization
Assume a tridiagonal matrix B is represented as

    B = tridiag(a_i, b_i, c_i),  (3.14)

an N × N matrix with diagonal entries b_1, . . . , b_N, sub-diagonal entries a_2, . . . , a_N
and super-diagonal entries c_1, . . . , c_{N−1}.
B can be factored into a lower bi-diagonal matrix L and an upper bi-diagonal matrix
U such that B = LU, where L has diagonal entries γ_1, . . . , γ_N with the sub-diagonal
a_2, . . . , a_N of B, and U has a unit diagonal with super-diagonal entries δ_1, . . . , δ_{N−1}.  (3.15)
If we equate the coefficients of (3.14) and (3.15), the result is
γ1 = b1
δ1 = c1/γ1
γi = bi − aiδi−1, i = 2, . . . , N
δi = ci/γi, i = 2, . . . , N − 1.
Alternatively, you may cast LU such that L has a unit diagonal with sub-diagonal
entries β_2, . . . , β_N, and U has diagonal entries η_1, . . . , η_N with the super-diagonal
c_1, . . . , c_{N−1} of B,
with similar systems of recurrence relations for their coefficients. The method chosen
depends on its application.
If we have a system of equations involving B it can be written as
Bx = f, or (3.16)
L(Ux) = f.
Now, introduce the intermediate vector y and we have
Ly = f, the forward substitution (3.17)
Ux = y, the backward substitution. (3.18)
The system of equations that results from (3.17) is

    γ_1 y_1 = f_1
    a_i y_{i−1} + γ_i y_i = f_i,  i = 2, . . . , N

and solving for the y vector yields

    y_1 = f_1/γ_1
    y_i = (f_i − a_i y_{i−1})/γ_i,  i = 2, . . . , N.  (3.19)

With the forward substitution step complete, we can write the backward substitution
(3.18) as

    x_N = y_N
    x_i = y_i − δ_i x_{i+1},  i = N − 1, . . . , 1  (3.20)

and generate the solution to (3.16).
3.2.2 UL Factorization
As it turns out, there is also a UL factorization. If L is pre-multiplied by U, the
resulting system is then
Bx = f, or (3.21)
U(Lx) = f,
where U is upper bi-diagonal with a unit diagonal and super-diagonal entries
δ̄_1, . . . , δ̄_{N−1}, and L is lower bi-diagonal with diagonal entries β_1, . . . , β_N and
the sub-diagonal a_2, . . . , a_N of B.  (3.22)
The UL factorization is equivalent to reordering the vectors x and f from N to 1. It
follows that the information is transmitted from N to 1. The need for this will become
apparent when dealing with the interface equation in section 4.
When the intermediate vector z is introduced, we obtain

    Uz = f, the backward substitution
    Lx = z, the forward substitution.

If we equate the coefficients of (3.14) and (3.22), the result is

    β_N = b_N
    δ̄_i = c_i/β_{i+1},  i = N − 1, . . . , 1
    β_i = b_i − a_{i+1}δ̄_i,  i = N − 1, . . . , 1.

Now that the coefficients of U and L are computed, we can solve for the elements of
vectors z and x to yield

    z_N = f_N
    z_i = f_i − δ̄_i z_{i+1},  i = N − 1, . . . , 1

and

    x_1 = z_1/β_1
    x_i = (z_i − a_i x_{i−1})/β_i,  i = 2, . . . , N

which generates the solution to the system (3.21).
3.3 CPU Implementation
The linear system (3.11) that results from discretizing (3.1) is both tridiagonal and
diagonally dominant (r > 0). The tridiagonal property implies that the LU decompo-
sition can be performed in O(n) time. The diagonal dominance implies that pivoting is
not required. We use these properties to generate the solution to (3.11).
Listing A.1 contains the implementation of the factorization and solver for (3.12)
and (3.13) that runs on the CPU. A main function named RunCPU_1K1LTest repeatedly
calls the pair with various configurations, then measures the execution time
and the relative error with respect to the initial conditions in (3.2). At a high level,
RunCPU_1K1LTest performs 5 functions.
1. Assign boundary conditions.
2. Assign initial conditions.
3. Initialize the sub-diagonal, diagonal, and sup-diagonal of matrix A.
4. Compute the LU factorization.
5. Repeatedly Solve the system and Step time.
3.4 Parallel LU Factorization
The symmetry of A in (3.12) leads to an LU factorization in which L is lower
bi-diagonal with diagonal entries γ_1, . . . , γ_N and every sub-diagonal entry equal to
−1, and U is upper bi-diagonal with a unit diagonal and super-diagonal entries
δ_1, . . . , δ_{N−1},
where
γ1 = 2 + r
δ1 = −1/γ1
γi = 2 + r + δi−1, i = 2, . . . , N
δi = −1/γi, i = 2, . . . , N − 1
which can be simplified to

    γ_1 = 2 + r
    γ_i = 2 + r − 1/γ_{i−1},  i = 2, . . . , N.  (3.23)
The nonlinear two-term recurrence in (3.23) does not lend itself to parallel evaluation.
Stone [7] and Blelloch [8], however, describe an algorithm for parallel evaluation of m-th
order linear recurrence relations. So (3.23) is modified through the use of a Riccati
transformation γi = qi/qi−1 to produce a linear three-term recurrence
q0 = 1
q1 = 2 + r
qi = (2 + r)qi−1 − qi−2, i = 2, . . . , N. (3.24)
The issue with (3.24) is that it can overflow when using floating point arithmetic.
The term q_i grows and becomes unbounded as i → ∞. We seek a truncated form that
gives acceptable results within the finite precision of the microprocessor. Recognize that
the initial value problem in (3.24) can be solved analytically. Assume a solution of the
form

    q_i = A x_+^i + B x_−^i

with A and B real constants. Examination of the characteristic equation
x² − (2 + r)x + 1 = 0 for (3.24) yields roots of

    x_± = (b ± √(b² − 4)) / 2,  where b = 2 + r.

With the initial conditions of q_0 = 1 and q_1 = b, we can find the constants A and B
from

    q_0 = 1 = A + B
    q_1 = b = A x_+ + B x_−,

where A = (b − x_−)/(x_+ − x_−) and B = 1 − A, to yield the solution

    q_i = (x_+^(i+1) − x_−^(i+1)) / (x_+ − x_−)

which can be rewritten as

    q_i = x_+^i [1 − (x_−/x_+)^(i+1)] / [1 − (x_−/x_+)].

With

    [1 − (x_−/x_+)^(i+1)] / [1 − (x_−/x_+)] = 1 + x_−/x_+ + O((x_−/x_+)²),

it follows that

    q_i = x_+^i [1 + x_−/x_+ + O((x_−/x_+)²)].

For sufficiently large i = N,

    q_N = x_+^N [1 + x_−/x_+ + O((x_−/x_+)²)]

and q_i will overflow as q_N → x_+^N.
When (3.23) is combined with (3.19) and (3.20), the forward and backward substitution
logic simplifies to

    y_1 = f_1/γ_1
    y_i = (f_i + y_{i−1})/γ_i,  i = 2, . . . , N

and

    x_N = y_N
    x_i = y_i + x_{i+1}/γ_i,  i = N − 1, . . . , 1.

We now see that 1/γ_i is needed and not q_i, so

    1/γ_i = (x_+^i − x_−^i) / (x_+^(i+1) − x_−^(i+1))
          = (x_+^i / x_+^(i+1)) · [1 − (x_−/x_+)^i] / [1 − (x_−/x_+)^(i+1)]
          = (1/x_+) [1 − (x_−/x_+)^i + O((x_−/x_+)^(i+1))].

This tells us that 1/γ_i approaches its asymptotic limit faster than q_i. If we impose the
constraint that

    |1/γ_K − 1/γ_{K−1}| < 10^(−M)  (3.25)

for some integers K and M, then

    1/γ_K ≈ 1/x_+ − (1/x_+)(x_−/x_+)^K = (1/x_+)(1 − 10^(−M)).

This suggests we stop computing when (3.25) is satisfied. From that point on we impose

    1/γ_i = 1/x_+,  for i > K.
In summary, this section shows how to transform a non-linear two-term recurrence
relation into a linear three-term recurrence relation, and conversely, so that it becomes
a candidate for parallelization. While doing so, a numeric overflow problem was
identified and mitigated by truncating the series once an acceptable level of accuracy
had been reached.
3.5 Parallel Solver and Recursive Doubling
The sequential nature of the solver for the CPU does not lend itself to efficient
implementation on the GPU. Other researchers have chosen to implement Cyclic
Reduction (CR) and Parallel Cyclic Reduction (PCR) for their tridiagonal solvers. Two
other researchers [9, 10] have attempted to implement Stone's Recursive Doubling (RD)
algorithm [11] to solve tridiagonal systems on a GPU. Each reported problems such as
large numerical errors, arithmetic underflow and instability, but failed to analyze the
source of that instability. This section, on the other hand, discusses the source of the
instability (the numeric underflow that occurs on the GPU) and then describes how to
address the issue.
Stone described RD in terms of the following theorem and claimed that it could be
used to solve recurrence relations of all orders.
Theorem (Stone) 1 Let y_i(j) satisfy a non-homogeneous two-term recurrence

    y_{i+1}(j) = y_1(j) + y_i(j − 1)·(−m_j),  i, j ≥ 1  (3.26)

with the boundary conditions

    y_1(j) = b_j, j ≥ 1;  y_i(j) = 0, j ≤ 0;  y_i(j) = 0, i ≤ 0.

Then,

a) for s ≥ 1, y_i(j) satisfies the recurrence relation

    y_{i+s}(j) = y_s(j) + y_i(j − s) ∏_{k=j−s+1}^{i} (−m_k),  i ≥ 1, j ≥ s;

b)

    y_i(j) = Σ_{k=1}^{j} y_1(k) ∏_{s=k+1}^{j} (−m_s),  i ≥ j ≥ 1;  (3.27)

c) for i ≥ j ≥ 1, y_i(j) = z_j, where z_j is the jth component of the solution to

    z_i = b_i − m_i z_{i−1},  z_1 = b_1.
Quite simply, the problem lies with (3.27). If the sequence {m_k}_{k=1}^{n} is bounded by m̄,
then

    ∏_{k=j−s+1}^{i} (−m_k) ≈ ±m̄^p,  where p = i − j + s.

The RD algorithm needs to be modified to avoid the arithmetic underflow that m̄^p
causes when m̄ < 1.
3.6 Recursive Doubling and Nilpotent Matrices
The non-homogeneous two-term recurrence (3.26) with initial value y_1 = b_1 has a
solution of the form

    y_i = Σ_{k=1}^{i} b_k ∏_{s=k+1}^{i} (−m_s).  (3.28)
If the first few terms of the solution are expanded, we have
y1 = b1
y2 = b2 − b1m2
y3 = b3 − b2m3 + b1m2m3.
This is a compact representation of the linear system
Ly = b (3.29)
where L is the n × n lower bidiagonal matrix with a unit diagonal and sub-diagonal
entries m_2, . . . , m_n.
We rewrite L as the sum of an identity matrix I of order n and a nilpotent matrix N of
index n, such that
L = I + N
where N is the matrix whose only non-zero entries are m_2, . . . , m_n on the first lower
diagonal.
With N nilpotent, the inverse of L can be expressed as

    L^(−1) = (I + N)^(−1) = I − N + N² − N³ + · · · + (−1)^(n−1) N^(n−1).
For some p < n, the pth power of N is a matrix with non-zero values filling the pth
lower diagonal. For example, squaring the 4 × 4 case of N with sub-diagonal entries
m_2, m_3, m_4 gives a matrix whose only non-zero entries are m_2m_3 and m_3m_4 on
the second lower diagonal. We can now relate the solution of the non-homogeneous
two-term recurrence (3.28) to the entries of the pth lower diagonal of N^p, given by

    (N^p)_{p+i,i} = ∏_{k=i+1}^{p+i} m_k,  i = 1, 2, . . . , n − p.
If there is an upper bound on the entries of N,

    m̄ = max_{1≤i≤n} |m_i|,

then the absolute value of an entry on the pth lower diagonal is

    |(N^p)_{p+i,i}| = ∏_{k=i+1}^{p+i} |m_k| ≤ m̄^p,  i = 1, 2, . . . , n − p.
If m̄ < 1, we can now choose a value for p ∈ ℕ such that the solution has an error
no bigger than 10^(−digit),

    m̄^p ≤ 10^(−digit),  or  p ≥ digit · ln 10 / (−ln m̄),
where digit is the number of required decimal places. All of this now suggests computing

    y = (I − N + N² − N³ + · · · + (−1)^p N^p) b

as the truncated form of the solution to (3.29). From a practical standpoint we have

    y_i = b_i − m_i y_{i−1},  i = 2, . . . , p,  with y_1 = b_1  (3.30)

for the first p elements of y. Thereafter, for elements p + 1 ≤ j ≤ n, we keep only the
leading terms through the pth power of N:

    y_j = b_j − b_{j−1}m_j + b_{j−2}(m_j m_{j−1}) − · · · + b_{j−p} ∏_{k=j−p+1}^{j} (−m_k),  (3.31)

which is just the truncated form of (3.27).
3.7 GPU Implementation
The forward and backward substitution logic is now represented by

    y_1 = f_1/γ_1
    y_i = (1/γ_i)(y_{i−1} + f_i),  i = 2, . . . , N − 1

and

    x_{N−1} = y_{N−1}
    x_i = y_i + (1/γ_i) x_{i+1},  i = N − 2, . . . , 1

respectively. Each of these two recurrences is remarkably similar to (3.30) and (3.31),
and the GPU code takes advantage of this.
Assume, for example, a diffusion problem has 2^20 equations in 2^20 unknowns, and for
accuracy reasons it requires p = 15. The γ vector is first computed with the CPU and
then passed into the GPU kernels. It was deemed unnecessary to implement the truncated
version of the γ vector on the GPU given that the LU factorization is performed
once outside of the solver loop. The ForwardLU kernel (listing B.1) is launched with
n = 2^20 threads. Recall that the threads are organized in groups of 32 called a warp,
all of which execute the same instruction on a different data element. For a given warp
0 ≤ j < 2^15, threads numbered 32j ≤ i < 32j + 32 will each evaluate the Horner-style
truncated sum

    sum = 0
    sum = (sum + f_{i−k})(1/γ_{i−k}),  k = p, p − 1, . . . , 0.

The BackwardLU kernel (listing B.2) behaves in a similar way.
3.8 Results
All of the software development was performed on a Lenovo S20 workstation with
14 GB of RAM, a 300 GB Western Digital VelociRaptor hard drive, an Intel Xeon W3520
quad core CPU @ 2.67 GHz, and an NVIDIA Tesla K40c. The workstation is running
Windows 7/64. The development tools consisted of Visual Studio 2010, CUDA 6.5 and
Tesla Driver 341.44. All of the generated applications were 64 bit.
Table 3.1 and figure 3.1 summarize the results of the CPU version of the LU decom-
position (CPU LU) versus the CPU version of the truncated γ vector (CPU ZC) that
was described in section 3.4. Each element xi of the solution vector x was the sum of
20 terms. The number of time steps was 50. Table 3.2 is a similar experiment with the
number of terms equal to 32. Table 3.2 emphasizes that the additional 12 terms result
in relative errors that are nearly identical to those of the CPU LU version.
Table 3.3 and figure 3.2 summarize the results of the CPU version of the LU decom-
position versus the GPU version of the RD algorithm (GPU RD1L) that was described
in section 3.5. Again, each element xi of the solution vector x was the sum of 20 terms.
The number of time steps was 50. In this case, the data clearly shows that the GPU
prefers large problems over smaller problems, and ultimately executes 36 times faster
than the CPU LU algorithm.
Table 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50

    Points              CPU LU (ms)   CPU MaxRel Err   CPU ZC (ms)   CPU MaxRel Err   Speed Up
    2^10 = 1,024        1.1           2.467E-05        3.5           3.136E-05        0.31
    2^16 = 65,536       69.4          6.858E-09        231.0         1.773E-05        0.30
    2^20 = 1,048,576    1,222.2       7.650E-11        3,650.9       4.526E-06        0.33
    2^22 = 4,194,304    4,809.4       2.056E-10        14,671.2      2.333E-07        0.32
    2^24 = 16,777,216   19,164.3      3.473E-10        59,080.4      2.333E-07        0.32
    2^26 = 67,108,864   77,671.4      5.239E-09        236,833.5     2.333E-07        0.32
Table 3.2.: CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50

    Points        CPU LU (ms)   CPU MaxRel Err   CPU ZC (ms)   CPU MaxRel Err   Speed Up
    1,024         1.1           2.467E-05        5.7           2.467E-05        0.19
    65,536        69.4          6.858E-09        373.6         8.523E-09        0.18
    1,048,576     1,222.2       7.650E-11        5,848.9       2.486E-10        0.20
    4,194,304     4,809.4       2.056E-10        23,322.6      2.070E-10        0.20
    16,777,216    19,164.3      3.473E-10        93,991.6      3.458E-10        0.20
    67,108,864    77,671.4      5.239E-09        377,045.0     5.241E-09        0.20
Figure 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50
(log-log plot: Time (ms) versus Number of Points, comparing CPU LU and CPU ZC)
Table 3.3.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50

    Points        CPU LU (ms)   CPU MaxRel Err   GPU RD1L (ms)   GPU MaxRel Err   Speed Up
    1,024         1.1           2.467E-05        2.7             3.136E-05        0.4
    65,536        69.4          6.858E-09        4.9             1.773E-05        14.1
    1,048,576     1,222.2       7.650E-11        37.3            4.526E-06        32.7
    4,194,304     4,809.4       2.056E-10        136.9           2.333E-07        35.1
    16,777,216    19,164.3      3.473E-10        536.8           2.333E-07        35.7
    67,108,864    77,671.4      5.239E-09        2,143.9         2.333E-07        36.2
Figure 3.2.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 (log-log plot of time in ms versus number of points, for CPU LU and GPU RD1L)
4 DIFFUSION WITH LAYERS
4.1 Background
Consider a practical problem from hydrogeology shown in figure 4.1. The one dimensional, transient-state, unsaturated flow through this section of earth is

\[
\frac{\partial \theta(z,t)}{\partial t}
  = \frac{\partial}{\partial z}\left[ K(\theta)\left( \frac{\partial \psi(z,t)}{\partial z} + 1 \right) \right]
\tag{4.1}
\]
where
θ(z, t) is the saturation (water content),
K(θ) is the hydraulic conductivity,
ψ(z, t) is the pressure head,
z is the elevation above a vertical datum, and
t is time.
Figure 4.1 effectively has multiple diffusion coefficients (in this case 5) because the hydraulic conductivity K(θ) varies with the subsurface soil type.¹ The hydraulic conductivity coefficient has units of distance/time, and as such can be considered the diffusive velocity with which water moves through a substructure. The value can be as small as $10^{-8}$ meters/day for unfractured shale, and as large as $10^{4}$ meters/day for gravel, thereby spanning 12 orders of magnitude [12].
If the diffusion coefficients are approximately the same order of magnitude, then a finite difference approach may work. However, as the ratio of the coefficients begins to vary by an order of magnitude or more, the finite difference approach can suffer from a host of numerical problems affecting its convergence.

¹ Although equation (4.1) technically is a non-linear partial differential equation, it is used as an example because we believe the reader can relate to the notion of water percolating down through different layers of soil.

Figure 4.1.: A plant with multiple layers of soil
4.2 Two Layers
A two layer problem is simply two single layers connected by interface conditions
across a common boundary. The interface conditions describe the continuity of u(x, t)
and its flux, $K\,\partial u/\partial x$, across an interface $x = l_1$. Equations (4.2) through (4.5) describe the two layer problem.
\[
\frac{\partial u}{\partial t} = \kappa_1 \frac{\partial^2 u}{\partial x^2}, \qquad 0 \le x < l_1,\; t > 0 \tag{4.2}
\]

\[
\frac{\partial u}{\partial t} = \kappa_2 \frac{\partial^2 u}{\partial x^2}, \qquad l_1 < x \le 1,\; t > 0 \tag{4.3}
\]

\[
u(l_1^-, t) = u(l_1^+, t) \tag{4.4}
\]

\[
K_1 \left. \frac{\partial u}{\partial x} \right|_{l_1^-}
= K_2 \left. \frac{\partial u}{\partial x} \right|_{l_1^+} \tag{4.5}
\]
In this case, the diffusion coefficients κ1, κ2, are constant, as are the flux coefficients
K1 and K2. We have used ± to denote the right side (+) and left side (−) of the interface
x = l1. The mathematics of two layers is similar to that in section 3.1, but it will now
be generalized.
Assume we have layers with width $l_1$ and $l_2$. Each layer $i$ has $N_i$ sub-intervals and the width of each sub-interval is $h_i = l_i/N_i$. Now define

\[
M_i = \sum_{k=1}^{i} N_k, \qquad r_i = \frac{2 h_i^2}{\kappa_i\, \delta t}, \qquad i = 1, 2;
\]
so that the coefficient matrices take on the form

\[
A_1 =
\begin{pmatrix}
2+r_1 & -1     &        &        &       \\
-1    & 2+r_1  & -1     &        &       \\
      & \ddots & \ddots & \ddots &       \\
      &        & -1     & 2+r_1  & -1    \\
      &        &        & -1     & 2+r_1
\end{pmatrix}_{(N_1-1)\times(N_1-1)},
\tag{4.6}
\]

\[
A_2 =
\begin{pmatrix}
2+r_2 & -1     &        &        &       \\
-1    & 2+r_2  & -1     &        &       \\
      & \ddots & \ddots & \ddots &       \\
      &        & -1     & 2+r_2  & -1    \\
      &        &        & -1     & 2+r_2
\end{pmatrix}_{(N_2-1)\times(N_2-1)};
\tag{4.7}
\]
the solution vectors take the form

\[
x_1 =
\begin{pmatrix}
u_1^{(j)} \\ u_2^{(j)} \\ \vdots \\ u_{N_1-2}^{(j)} \\ u_{N_1-1}^{(j)}
\end{pmatrix}_{(N_1-1)},
\qquad
x_2 =
\begin{pmatrix}
u_{N_1+1}^{(j)} \\ u_{N_1+2}^{(j)} \\ \vdots \\ u_{M_2-2}^{(j)} \\ u_{M_2-1}^{(j)}
\end{pmatrix}_{(N_2-1)};
\]
the right hand terms take the form

\[
f_1 =
\begin{pmatrix}
f_1^{(j)} + g^{(j)} \\ f_2^{(j)} \\ \vdots \\ f_{N_1-2}^{(j)} \\ f_{N_1-1}^{(j)} + u_{N_1^-}^{(j)}
\end{pmatrix}_{(N_1-1)},
\quad\text{and}\quad
f_2 =
\begin{pmatrix}
f_{N_1+1}^{(j)} + u_{N_1^+}^{(j)} \\ f_{N_1+2}^{(j)} \\ \vdots \\ f_{M_2-2}^{(j)} \\ f_{M_2-1}^{(j)} + h^{(j)}
\end{pmatrix}_{(N_2-1)};
\]
and the system of equations is
A1x1 = f1 and A2x2 = f2.
We now need an equation that connects the layers, while preserving continuity of the
state variable and the flux.
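As a concrete illustration of the setup above, the sketch below (illustrative C++ only, not code from this report; the struct and function names are assumptions) assembles the three diagonals of a layer matrix $A_i$ from $l_i$, $N_i$, $\kappa_i$ and $\delta t$:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch only -- not code from this report.  Builds the three diagonals of
// the (N_i - 1) x (N_i - 1) layer matrix A_i from (4.6)/(4.7), where
// h_i = l_i / N_i and r_i = 2 h_i^2 / (kappa_i * dt).
struct LayerMatrix {
    std::vector<double> sub, diag, super;   // each of length N - 1
};

LayerMatrix AssembleLayer(double l, int N, double kappa, double dt)
{
    const double h = l / N;                       // sub-interval width h_i
    const double r = 2.0 * h * h / (kappa * dt);  // r_i from section 4.2
    LayerMatrix A;
    A.sub.assign(N - 1, -1.0);      // constant sub-diagonal
    A.diag.assign(N - 1, 2.0 + r);  // constant main diagonal 2 + r_i
    A.super.assign(N - 1, -1.0);    // constant super-diagonal
    return A;
}
```

For example, with l = 1, N = 4, κ = 1 and δt = 0.125, each sub-interval has width h = 0.25, r = 2(0.25)²/0.125 = 1, and every main-diagonal entry is 3.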
4.3 Interface Equation
At the interface $x = l_1$ we have the continuity of the state variable

\[
u_{N_1^-}^{(j)} = u_{N_1^+}^{(j)} = u_{N_1}^{(j)}
\]

and the continuity of the flux

\[
K_1 \frac{u_{N_1^-}^{(j)} - u_{N_1-1}^{(j)}}{h_1}
= K_2 \frac{u_{N_1+1}^{(j)} - u_{N_1^+}^{(j)}}{h_2}.
\]

When the two previous interface conditions are combined, we are left with an expression for the solution across the interface,

\[
\left( \frac{K_1}{h_1} + \frac{K_2}{h_2} \right) u_{N_1}^{(j)}
= \frac{K_1}{h_1}\, u_{N_1-1}^{(j)} + \frac{K_2}{h_2}\, u_{N_1+1}^{(j)}
\qquad \forall j.
\tag{4.8}
\]
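Equation (4.8) says that the interface value is a conductance-weighted average of its two neighbors, with weights $K_i/h_i$. A minimal sketch (illustrative code, not from this report):

```cpp
#include <cassert>
#include <cmath>

// Sketch only: solve (4.8) for the interface value u_{N_1} given its two
// neighbors.  The weights c_i = K_i / h_i act as interface conductances.
double InterfaceValue(double K1, double h1, double uLeft,
                      double K2, double h2, double uRight)
{
    const double c1 = K1 / h1;
    const double c2 = K2 / h2;
    return (c1 * uLeft + c2 * uRight) / (c1 + c2);
}
```

When $K_1/h_1 = K_2/h_2$ this reduces to the arithmetic mean of the neighbors; as $K_2/h_2$ grows, the interface value is pulled toward $u_{N_1+1}$, consistent with the flat-tangent behavior discussed in section 4.5.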
4.4 Complete System for Two Layers
In matrix form, the tridiagonal system for two layers with the interface included consists of

\[
A_{if}\, x^{(j)} = f^{(j)} \tag{4.9}
\]

where

\[
A_{if} =
\begin{pmatrix}
2+r_1 & -1     &        &        &        &        &        &        \\
-1    & 2+r_1  & -1     &        &        &        &        &        \\
      & \ddots & \ddots & \ddots &        &        &        &        \\
      &        & -1     & 2+r_1  & -1     &        &        &        \\
      &        &        & -\frac{K_1}{h_1} & \frac{K_1}{h_1}+\frac{K_2}{h_2} & -\frac{K_2}{h_2} & & \\
      &        &        &        & -1     & 2+r_2  & -1     &        \\
      &        &        &        &        & \ddots & \ddots & \ddots \\
      &        &        &        &        &        & -1     & 2+r_2
\end{pmatrix}
\]

and

\[
x =
\begin{pmatrix}
u_1 \\ u_2 \\ \vdots \\ u_{N_1-1} \\ u_{N_1} \\ u_{N_1+1} \\ \vdots \\ u_{M_2-2} \\ u_{M_2-1}
\end{pmatrix},
\qquad
f =
\begin{pmatrix}
f_1 + u_0 \\ f_2 \\ \vdots \\ f_{N_1-1} \\ 0 \\ f_{N_1+1} \\ \vdots \\ f_{M_2-2} \\ f_{M_2-1} + u_{M_2}
\end{pmatrix}.
\]
Notice that the sub-matrix in the upper left hand corner of Aif is A1 from (4.6), the
sub-matrix in the lower right hand corner of Aif is A2 from (4.7), and the 5 terms in
the center of Aif relate to the interface equation (4.8).
Solving (4.9) now consists of the following 5 steps.
1. Compute the LU factorization of $A_1$ and then perform the forward substitution to obtain the relation between $u_{N_1^-}^{(j)}$ and $u_{N_1-1}^{(j)}$.

2. Compute the UL factorization of $A_2$ and then perform the backward substitution to obtain the relation between $u_{N_1^+}^{(j)}$ and $u_{N_1+1}^{(j)}$.

3. Substitute the results of (1) and (2) into the interface equation (4.8) to find $u_{N_1}^{(j)}$.

4. Perform the backward substitution on layer 1.

5. Perform the forward substitution on layer 2.
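For small problems, the five steps above can be sanity-checked against a direct serial solve of the assembled system (4.9). The sketch below (illustrative only; the report's actual solver is the LU/UL split above) applies the classic Thomas algorithm, and any solution it produces must satisfy the interface row, i.e. equation (4.8), which gives a cheap correctness test for the split solver.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch only: a serial Thomas solve of a tridiagonal system such as the
// assembled interface system (4.9).  Row i reads
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = f[i],
// with a[0] and c[n-1] unused.
std::vector<double> ThomasSolve(std::vector<double> a,  // sub-diagonal
                                std::vector<double> b,  // main diagonal
                                std::vector<double> c,  // super-diagonal
                                std::vector<double> f)  // right hand side
{
    const int n = (int)b.size();
    for (int i = 1; i < n; ++i) {            // forward elimination
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        f[i] -= m * f[i - 1];
    }
    std::vector<double> x(n);
    x[n - 1] = f[n - 1] / b[n - 1];
    for (int i = n - 2; i >= 0; --i)         // back substitution
        x[i] = (f[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```

Building $A_{if}$ for a small case ($N_1 = N_2 = 4$, say) and solving it this way, the interface row of the solution satisfies (4.8) to machine precision, because that row of the assembled system is exactly the interface equation with a zero right hand side.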
4.5 Difference Between Inclusion and Exclusion of the Interface Condition
Equations (4.4) and (4.5) describe the continuity of the state variable and its flux across the interface. If we look at

\[
K_1 \frac{\partial u_1}{\partial x} = K_2 \frac{\partial u_2}{\partial x}
\]

and assume that $K_2 \gg K_1$, we have

\[
\frac{K_1}{K_2} \frac{\partial u_1}{\partial x} = \frac{\partial u_2}{\partial x}
\quad\Longrightarrow\quad
\frac{\partial u_2}{\partial x} \approx 0.
\]
This suggests a horizontal tangent of u2 that is present in the solution at the interface.
This behavior is not captured in the form that excludes the interface condition and
results in an error between the two. These errors are described in section 4.7 when the
experiments are discussed.
4.6 Implementation
The CPU implementation for the two layer problem is a simple adaptation of the standard LU factorization and solver, running through the entire set of state variables from 1 to $M_2$. Listing A.1 shows the factorization requiring 5 lines of code and the solver requiring 6 lines. In contrast, the two layer factorization in listing A.2 requires 11 lines of code and the solver in listing A.3 requires 13 lines.
The GPU implementation for the two layer problem also leverages the logic from the
one layer problem. The forward and backward substitution kernels for the LU solver
require approximately 27 lines of code each. Since we now need forward and backward
substitution kernels for the UL solver, they also contribute approximately 27 lines each
to the total line count.
4.7 Results
If the interface expression in (4.8) is excluded and we are presented with a two layer problem, the tridiagonal matrix $A_{12}$ has the form

\[
A_{12} =
\begin{pmatrix}
2+r_1 & -1     &        &        &        &        &        \\
-1    & 2+r_1  & -1     &        &        &        &        \\
      & \ddots & \ddots & \ddots &        &        &        \\
      &        & -1     & 2+r_1  & -1     &        &        \\
      &        &        & -1     & 2+r_2  & -1     &        \\
      &        &        &        & \ddots & \ddots & \ddots \\
      &        &        &        &        & -1     & 2+r_2
\end{pmatrix}
\tag{4.10}
\]

for the system

\[
A_{12}\, x = f. \tag{4.11}
\]
Four experiments were performed that applied the CPU LU decomposition to (4.11) and compared the results to the 5 step algorithm (CPU 2L) described in section 4.4. The parameters for the experiments are described in table 4.1.
Experiment 1 in figures 4.2 and 4.3 shows that the results of solving (4.9) and (4.11) are nearly identical. Figure 4.2 demonstrates the time evolution of the single interface point $u_{N_1}$ when the interface equation is utilized, and compares that value with the two values $u_{N_1}$ and $u_{N_1+1}$ that straddle $u_{N_1+1/2}$ in (4.11), which excludes the interface. In figure 4.3, the systems are solved with 1024 points over 50 time steps. With the interface point at $u_{512}$, the 32 points on either side are in excellent agreement for both systems.
Table 4.1.: Experimental parameters for 50 timesteps

Experiment#   κ1   κ2   K1   K2
1             1    1    1    1
2             1    10   1    1
3             1    1    1    10
4             1    10   1    10
Experiment 2 in figures 4.4 and 4.5 demonstrates the next set of results. The time evolution plot in figure 4.4 shows the interface point significantly above the excluded form, while the solutions graphed in figure 4.5 do a poor job of coinciding. The error plot between the two solutions in figure 4.6 further quantifies the consequences of ignoring the interface equation.
Experiment 3 in figures 4.7 and 4.8 demonstrates results that are in good agreement
with each other. This experiment also seems to suggest that the flux coefficient has little
impact on the solution.
Finally, experiment 4 in figures 4.9 and 4.10 graphically demonstrates the problem
that occurs when the diffusion and conductivity coefficients are κ1 = 1, κ2 = 10 and
K1 = 1, K2 = 10 respectively. Figure 4.9 shows the actual interface point uN1 receding
from the two points that excluded the interface equation. Again, when the solution
is graphed after 50 time steps in figure 4.10, it is readily apparent that excluding the interface equation introduces error into the solution. As in experiment 2, figure 4.11 quantifies the solution error. The results of experiments 2 and 4 seem to suggest that the diffusion coefficients, along with the solver, play a large role in producing a correct solution.
With experiments 1 through 4 giving us faith in the interface equation approach, the benefits of accelerating the process on the GPU are now reported. Table 4.2 demonstrates the 18-fold acceleration of the GPU (GPU RD2L) over the CPU (CPU 2L) version with 2 layers, as well as relative errors that are small when compared with the analytical solution corresponding to κ1 = 1, κ2 = 1, K1 = 1, K2 = 1 with 32 terms.
The results of table 4.3 are not surprising. With κ1 = 1, κ2 = 10, K1 = 1, K2 = 1 and the number of terms set to 32, a maximum acceleration of 18-fold is measured. The modifications made to RD allow us to process problems that are 65,536 times larger than the previously reported sizes for Stone's approach.
Table 4.2.: CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 1, Terms=32, Time Steps=50

Points       CPU 2L (ms)  CPU MaxRel Err  GPU RD2L (ms)  GPU MaxRel Err  Speed Up
1,024        1.7          8.756E-05       7.4            8.756E-05       0.2
65,536       70.9         2.318E-08       11.2           2.360E-08       6.3
1,048,576    1,038.4      8.091E-11       66.1           2.533E-10       15.7
4,194,304    4,163.8      2.056E-10       238.1          2.070E-10       17.4
16,777,216   16,638.1     3.473E-10       928.7          3.458E-10       17.9
67,108,864   66,963.9     5.239E-09       3,698.4        5.241E-09       18.1
Table 4.3.: CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 10, Time Steps=50

Points       CPU 2L (ms)  GPU RD2L (ms)  Speed Up
768          0.7          7.4            0.1
49,152       50.7         10.2           4.9
786,432      818.5        51.6           15.8
3,145,728    3,161.0      180.8          17.4
12,582,912   12,524.5     700.1          17.8
50,331,648   50,639.1     2,779.5        18.2
Figure 4.2.: Time Evolution of Interface-Experiment 1 (solution values over 50 time steps: $u_{N_1}$ including the interface equation versus $u_{N_1}$ and $u_{N_1+1}$ excluding it; κ1 = 1, κ2 = 1, K1 = 1, K2 = 1)
Figure 4.3.: Solution to the Interface Neighborhood-Experiment 1 (solution values versus spatial index near the interface, including versus excluding the interface equation; κ1 = 1, κ2 = 1, K1 = 1, K2 = 1, at ts = 50)
Figure 4.4.: Time Evolution of Interface-Experiment 2 (solution values over 50 time steps: $u_{N_1}$ including the interface equation versus $u_{N_1}$ and $u_{N_1+1}$ excluding it; κ1 = 1, κ2 = 10, K1 = 1, K2 = 1)
Figure 4.5.: Solution to the Interface Neighborhood-Experiment 2 (solution values versus spatial index near the interface, including versus excluding the interface equation; κ1 = 1, κ2 = 10, K1 = 1, K2 = 1, at ts = 50)
Figure 4.6.: Error in the Interface Neighborhood-Experiment 2 (|error| between the included and excluded solutions versus spatial index; κ1 = 1, κ2 = 10, K1 = 1, K2 = 1, at ts = 50)
Figure 4.7.: Time Evolution of Interface-Experiment 3 (solution values over 50 time steps: $u_{N_1}$ including the interface equation versus $u_{N_1}$ and $u_{N_1+1}$ excluding it; κ1 = 1, κ2 = 1, K1 = 1, K2 = 10)
Figure 4.8.: Solution to the Interface Neighborhood-Experiment 3 (solution values versus spatial index near the interface, including versus excluding the interface equation; κ1 = 1, κ2 = 1, K1 = 1, K2 = 10, at ts = 50)
Figure 4.9.: Time Evolution of Interface-Experiment 4 (solution values over 50 time steps: $u_{N_1}$ including the interface equation versus $u_{N_1}$ and $u_{N_1+1}$ excluding it; κ1 = 1, κ2 = 10, K1 = 1, K2 = 10)
Figure 4.10.: Solution to the Interface Neighborhood-Experiment 4 (solution values versus spatial index near the interface, including versus excluding the interface equation; κ1 = 1, κ2 = 10, K1 = 1, K2 = 10, at ts = 50)
Figure 4.11.: Error in the Interface Neighborhood-Experiment 4 (|error| between the included and excluded solutions versus spatial index; κ1 = 1, κ2 = 10, K1 = 1, K2 = 10, at ts = 50)
5 CONCLUSION
We started this project with the goal of creating a GPU based solver for the two layer
linear diffusion equation. Before the solver could be addressed, the LU factorization of
the finite difference form of the diffusion equation was investigated.
The two term non-linear recurrence relation that results from the LU factorization was shown to be sequential and not parallelizable. When converted to a three term linear recurrence relation, the result could be parallelized for a GPU. However, it suffered from arithmetic overflow. When the underlying initial value problem was solved
analytically, we were able to demonstrate that a polynomial form of the solution could
be truncated to achieve a prescribed level of machine accuracy. Our experiments showed
excellent agreement with the solution (3.3) to the initial value problem.
We mentioned briefly that other researchers reported numerical instabilities when
applying Stone’s recursive doubling algorithm to the actual solver for problems with
1, 024 unknowns. Overflow is inherent in Stone’s algorithm. In addition, the instability is
due to an arithmetic underflow problem that appears when the two term linear recurrence
relation associated with the solver is evaluated. When Stone’s algorithm is recast using
matrix algebra, we were able to overcome these deficiencies by applying the theory of
nilpotent matrices to generate a truncated form of the matrix polynomial.
Our experiments demonstrated a GPU based algorithm that is approximately 36
times faster than the CPU version for the 1 layer problem, and approximately 18 times
faster than the CPU for the 2 layer problem. Our 1 layer implementation required two
GPU kernels and the 2 layer implementation required 5 kernels. The GPU is designed
for large problems that stream RAM through the cores. We attribute the 2 fold loss in
efficiency to the presence of the additional kernels, as well as awkward coding patterns
that were required to handle the interface condition. On the positive side, our algorithm is also capable of processing 2^26 equations, a value that is 65,536 times larger than that of other researchers.
Finally, we described the mathematical framework for solving the two layer problem.
Our experiments show that error is introduced into the solution when the interface
conditions, the continuity of the state variable and the flux across the interface boundary,
are not properly accounted for. In fact, our work suggests that the flux coefficients control the placement of the solution, while the diffusion coefficients, in combination with the type of solver used, control the error in the solution. We have also shown that, with the straightforward LU solver for two layers requiring approximately 54 lines of kernel code on the GPU, our parallel two layer solution is only modestly more complex at approximately 108 lines of code.
REFERENCES
[1] Greg Turk. Generating textures on arbitrary surfaces using reaction diffusion. In Proc. of the 18th Annual Conference on Computer Graphics and Interactive Techniques, pages 289–298. ACM Press, 1991.
[2] Daniele Bertaccini and Daniela Calvetti. Fast simulation of solid tumors thermal ablation treatments with a 3D reaction diffusion model. Computers in Biology and Medicine, 37(8):1173–1182, 2007.
[3] F. T. Tracy. Clean two- and three-dimensional analytical solutions of Richards' equation for testing numerical solvers. Water Resources Research, 42(8), 2006.
[4] R. Allan Freeze. Three dimensional, transient, saturated-unsaturated flow in a
groundwater basin. Water Resources Research, 7(2):347–366, 1971.
[5] James S. Boswell and Greg A. Olyphant. Modeling the hydrologic response of groundwater dominated wetlands to transient boundary conditions: Implications for wetland restoration. Journal of Hydrology, 332(3):467–476, 2007.
[6] NVIDIA Corp. CUDA C Programming Guide. Technical Report PG02829001 v5.5, NVIDIA Corp, July 2013.
[7] Harold S. Stone and Peter M. Kogge. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22(8):786–793, 1973.
[8] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-
90-190, Carnegie Mellon University, November 1990.
[9] Volodymyr Kindratenko. A guide for implementing tridiagonal solvers on GPUs. In Numerical Computations with GPUs. Springer International Publishing, 2014.
[10] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the GPU. In Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 127–136. ACM Press, January 2010.
[11] Harold S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM, 20(1):27–38, 1973.
[12] Ralph C. Heath. Basic ground-water hydrology. Technical Report 2220, U.S. Geo-
logical Survey Water-Supply, 1983.
A CPU CODE
// =======================================================================
// LUFactorizationCPU
// =======================================================================
void LUFactorizationCPU(const CVector &SubDiag, const CVector &MainDiag,
                        const CVector &SuperDiag, CVector &Gamma, CVector &Delta)
{
    Gamma[1] = MainDiag[1];
    Delta[1] = SuperDiag[1] / Gamma[1];

    for (int i = 2; i <= Gamma.Last(); i++)
    {
        Gamma[i] = MainDiag[i] - SubDiag[i]*Delta[i-1];
        Delta[i] = SuperDiag[i] / Gamma[i];
    }
}

// =======================================================================
// LUSolveCPU
// =======================================================================
void LUSolveCPU(const CVector &Gamma, const CVector &Delta,
                const CVector &ASubDiag,
                const CVector &f, CVector &x, CVector &y)
{
    //
    // Forward substitution. Solve Ly = f.
    //
    y[1] = f[1] / Gamma[1];

    for (int i = 2; i <= y.Last(); i++)
    {
        y[i] = (f[i] - ASubDiag[i]*y[i-1]) / Gamma[i];
    }

    //
    // Backward substitution. Solve Ux = y.
    //
    x[x.Last()] = y[y.Last()];

    for (int i = x.Last()-1; i > 0; i--)
    {
        x[i] = y[i] - Delta[i]*x[i+1];
    }
}

Listing A.1: 1 Layer CPU Factorization and Solver
// =======================================================================
// TwoLayerFactorization
// =======================================================================
void TwoLayerFactorization(const double r1, const double r2,
                           CVector &Gamma1, CVector &Delta1,
                           CVector &Gamma2, CVector &Delta2)
{
    //
    // The LU factorization for matrix A1
    //
    const double b1 = 2.0 + r1;
    const double b2 = 2.0 + r2;

    Gamma1[1] = b1;
    Delta1[1] = -1.0 / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        Gamma1[i] = b1 + Delta1[i-1];
        Delta1[i] = -1.0 / Gamma1[i];
    }

    //
    // The UL factorization for matrix A2
    //
    Gamma2[Gamma2.Last()] = b2;

    for (int i = Gamma2.Last()-1; i >= 1; i--)
    {
        Delta2[i] = -1.0 / Gamma2[i+1];
        Gamma2[i] = b2 + Delta2[i];
    }
}

Listing A.2: 2 Layer CPU Factorization
// =======================================================================
// Solve2LayerCPU
// =======================================================================
void Solve2LayerCPU(const CVector &Gamma1, const CVector &Delta1, CVector &y,
                    CVector &f, const CVector &Gamma2, const CVector &Delta2,
                    CVector &z, CVector &u,
                    const double K1, const double K2,
                    const double H1, const double H2)
{
    // Top down, y[1] to the point y[N_1-1]
    // The forward substitution for LU.
    y[1] = f[1] / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        y[i] = (f[i] + y[i-1]) / Gamma1[i];
    }

    // Bottom up, z[N_1-1] to z[1]
    // The backward substitution for UL.
    z[z.Last()] = f[f.Last()];

    for (int i = 1; i < z.Last(); i++)
    {
        z[z.Last()-i] = f[f.Last()-i] - Delta2[Delta2.Last()-i]*z[z.Last()-i+1];
    }

    // Fix up U_{N_1}
    const int N1 = Gamma1.Last()+1;

    const double alpha = (K1/H1)*(1.0 - (1.0/Gamma1[Gamma1.Last()]))
                       + (K2/H2)*(1.0 - (1.0/Gamma2[1]));

    u[N1] = ((K1/H1)*y[y.Last()] + (K2/H2)*(z[1]/Gamma2[1])) / alpha; // update U_{N_1}

    // Advance up from point U_{N_1-1} to U_1
    // The backward substitution for LU.
    for (int i = N1-1; i >= 1; i--)
    {
        u[i] = y[i] - Delta1[i]*u[i+1];
    }

    // Move down from point U_{N_1+1} to U_{M_2}
    // The forward substitution for UL.
    for (int i = 1; i <= Gamma2.Last(); i++)
    {
        u[N1+i] = (z[i] + u[N1+i-1]) / Gamma2[i];
    }
}

Listing A.3: 2 Layer CPU Solver
B GPU CODE

// =======================================================================
// ForwardLU_kernel -- compute the forward scan on the GPU
// =======================================================================
__global__ void ForwardLU_kernel(const double * __restrict__ gpuGamma,
                                 const double * __restrict__ gpuFhat,
                                 double *gpuYhat, unsigned int NumberOfElements)
{
    __shared__ double sG[MAXTERMS+1024];
    __shared__ double sF[MAXTERMS+1024];

    const unsigned int gtid0 = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx0  = threadIdx.x;

    double TG0 = 0.0;
    double TF0 = 0.0;

    if (gtid0 < NumberOfElements)
    {
        TG0 = gpuGamma[gtid0];
        TF0 = gpuFhat[gtid0];
    }
    sG[idx0+MAXTERMS] = TG0;
    sF[idx0+MAXTERMS] = TF0;

    if (idx0 < MAXTERMS)
    {
        TG0 = 0.0;
        TF0 = 0.0;
        if (gtid0 > MAXTERMS)
        {
            TG0 = gpuGamma[gtid0-MAXTERMS];
            TF0 = gpuFhat[gtid0-MAXTERMS];
        }
        sG[idx0] = TG0;
        sF[idx0] = TF0;
    }
    __syncthreads();

    double Sum0 = 0.0;

    for (int i = 1; i <= MAXTERMS; i++)
    {
        double F0 = sF[idx0+i];
        double G0 = sG[idx0+i];

        Sum0 = (Sum0 + F0)*G0;
    }

    if (gtid0 < NumberOfElements)
    {
        gpuYhat[gtid0] = Sum0;
    }
}

Listing B.1: GPU Forward LU Solver
// =======================================================================
// BackwardLU_kernel -- compute the backward scan on the GPU
// =======================================================================
__global__ void BackwardLU_kernel(const double * __restrict__ gpuGamma,
                                  const double * __restrict__ gpuYhat,
                                  double *gpuXhat, unsigned int NumberOfElements)
{
    __shared__ double sG[1024+MAXTERMS];
    __shared__ double sY[1024+MAXTERMS];

    const unsigned int gtid   = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx    = threadIdx.x;
    const unsigned int warpid = idx >> 5;

    double TG = 0.0;
    double TY = 0.0;
    if (gtid < NumberOfElements)
    {
        TG = gpuGamma[gtid];
        TY = gpuYhat[gtid];
    }
    sG[idx] = TG;
    sY[idx] = TY;

    if (blockIdx.x < (NumberOfElements >> 10))  // if not in the last block
    {
        if (warpid == 31)                       // if in the last warp
        {
            TG = gpuGamma[gtid+MAXTERMS];
            TY = gpuYhat[gtid+MAXTERMS];
            sG[idx+MAXTERMS] = TG;
            sY[idx+MAXTERMS] = TY;
        }
    }

    __syncthreads();

    double Sum = 0.0;

    for (int i = MAXTERMS-1; i >= 0; i--)
    {
        double Y = sY[idx+i];
        double G = (i == 0 ? 1.0 : sG[idx+i-1]);

        Sum = (Sum + Y)*G;
    }

    if (gtid < NumberOfElements)
        gpuXhat[gtid] = Sum;
}

Listing B.2: GPU Backward LU Solver
A Comparative Study Of Generalized Arc-Consistency AlgorithmsA Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsSandra Long
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasetssunsine123
 
Thesis_Walter_PhD_final_updated
Thesis_Walter_PhD_final_updatedThesis_Walter_PhD_final_updated
Thesis_Walter_PhD_final_updatedWalter Rodrigues
 

Similar to TR-CIS-0420-09 BobZigon (20)

Honours_Thesis2015_final
Honours_Thesis2015_finalHonours_Thesis2015_final
Honours_Thesis2015_final
 
report
reportreport
report
 
Nonlinear Simulation of Rotor-Bearing System Dynamics
Nonlinear Simulation of Rotor-Bearing System DynamicsNonlinear Simulation of Rotor-Bearing System Dynamics
Nonlinear Simulation of Rotor-Bearing System Dynamics
 
bachelors_thesis_stephensen1987
bachelors_thesis_stephensen1987bachelors_thesis_stephensen1987
bachelors_thesis_stephensen1987
 
matconvnet-manual.pdf
matconvnet-manual.pdfmatconvnet-manual.pdf
matconvnet-manual.pdf
 
MSC-2013-12
MSC-2013-12MSC-2013-12
MSC-2013-12
 
Matconvnet manual
Matconvnet manualMatconvnet manual
Matconvnet manual
 
Examensarbete
ExamensarbeteExamensarbete
Examensarbete
 
T401
T401T401
T401
 
final (1)
final (1)final (1)
final (1)
 
SpectrumSharing_Thesis_BSingh_AaltoUni_2014
SpectrumSharing_Thesis_BSingh_AaltoUni_2014SpectrumSharing_Thesis_BSingh_AaltoUni_2014
SpectrumSharing_Thesis_BSingh_AaltoUni_2014
 
Ivo Pavlik - thesis (print version)
Ivo Pavlik - thesis (print version)Ivo Pavlik - thesis (print version)
Ivo Pavlik - thesis (print version)
 
dissertation
dissertationdissertation
dissertation
 
MSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land RoverMSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land Rover
 
Project Dissertation
Project DissertationProject Dissertation
Project Dissertation
 
Sarda_uta_2502M_12076
Sarda_uta_2502M_12076Sarda_uta_2502M_12076
Sarda_uta_2502M_12076
 
On-Line Presentation of Data from PMU Stations
On-Line Presentation of Data from PMU StationsOn-Line Presentation of Data from PMU Stations
On-Line Presentation of Data from PMU Stations
 
A Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsA Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency Algorithms
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasets
 
Thesis_Walter_PhD_final_updated
Thesis_Walter_PhD_final_updatedThesis_Walter_PhD_final_updated
Thesis_Walter_PhD_final_updated
 

TR-CIS-0420-09 BobZigon

4.3 Interface Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Complete System for Two Layers . . . . . . . . . . . . . . . . . . . . . 30
4.5 Difference Between Inclusion and Exclusion of the Interface Condition . . 31
4.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
A CPU code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
B GPU code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
LIST OF TABLES

Table                                                                     Page
3.1  CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 . . . . . . . . 23
3.2  CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50 . . . . . . . . 24
3.3  CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 . . . . . . . 25
4.1  Experimental parameters for 50 timesteps . . . . . . . . . . . . . . . 33
4.2  CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 1, Terms=32, Time Steps=50 . . . . .  34
4.3  CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 10, Time Steps=50 . . . . . . . . . . 35
LIST OF FIGURES

Figure                                                                    Page
2.1  Memory bandwidth for the CPU and GPU . . . . . . . . . . . . . . . . .  4
2.2  Floating point operations per second for the CPU and GPU . . . . . . .  4
2.3  A GPU Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
2.4  A GPU and a Streaming Multiprocessor (SM or SMX) . . . . . . . . . . .  8
3.1  CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 . . . . . . . . 24
3.2  CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 . . . . . . . 25
4.1  A plant with multiple layers of soil . . . . . . . . . . . . . . . . . 27
4.2  Time Evolution of Interface-Experiment 1 . . . . . . . . . . . . . . . 35
4.3  Solution to the Interface Neighborhood-Experiment 1 . . . . . . . . . . 36
4.4  Time Evolution of Interface-Experiment 2 . . . . . . . . . . . . . . . 36
4.5  Solution to the Interface Neighborhood-Experiment 2 . . . . . . . . . . 37
4.6  Error in the Interface Neighborhood-Experiment 2 . . . . . . . . . . . 37
4.7  Time Evolution of Interface-Experiment 3 . . . . . . . . . . . . . . . 38
4.8  Solution to the Interface Neighborhood-Experiment 3 . . . . . . . . . . 38
4.9  Time Evolution of Interface-Experiment 4 . . . . . . . . . . . . . . . 39
4.10 Solution to the Interface Neighborhood-Experiment 4 . . . . . . . . . . 39
4.11 Error in the Interface Neighborhood-Experiment 4 . . . . . . . . . . . 40
ABSTRACT

Zigon, Robert MS, Purdue University, December 2015. Two Layer Linear Diffusion Equation on the GPU. Major Professors: Raymond Chin, Shaofin Fang and Fengguang Song.

The purpose of this project is to investigate the mathematical framework for evaluating the two layer linear diffusion equation on a GPU. The diffusion equation is first approximated using finite differences to produce the matrix equation Ax = f. The two term non-linear recurrence relation for the LU factorization of the A matrix is then converted into a three term linear recurrence relation by way of a Riccati transform. The three term relation is then shown to be parallelizable. After the numeric underflow problem for the LU solver of the system is reconciled, Stone's recursive doubling algorithm is then implemented. Finally, the parallel implementation is applied to a form of the two layer diffusion equation that properly models the flux across the internal boundary.
1 INTRODUCTION

In physics, diffusion is defined as the change in the distribution of a collection of particles, as well as its depletion, in time and space. The underlying partial differential equation can be used to model many different types of processes. For example, open a bottle of perfume. As the molecules of the scent first escape the container, they are in very high concentration. Over time they spread outward in every direction into regions where they are in low concentration.

Another example of diffusion exists in biology. A process called morphogenesis controls the spatial distribution of cells during the embryonic development of an organism. Natural patterns, such as the spots on a leopard, are believed to be the result of cellular differentiation in many different directions [1].

The diffusion equation also appears in oncology with the use of radio frequency thermal ablation (RFA). In this process, tumor cells are killed by focusing energy on a diseased portion of the body. In order to better understand the ablation process, models are used to analyze the energy and temperature distribution in the context of the muscle, fat and bone that are adjacent to the tumor cells [2].

Yet another example of diffusion exists in hydrogeology – the study of the movement of groundwater in the soil and rocks of the Earth's crust. Groundwater does not always flow downhill in the subsurface by following the surface topology. Instead, it can be driven by pressure gradients in both saturated and unsaturated regions. This results in behavior that is difficult to predict for all but the simplest situations.

The goal of this project is to implement a solver for the one dimensional linear diffusion equation with two layers on a GPU. We will begin with a description of the modern GPU. The problem itself will then be investigated in two phases. The first phase will start with the finite difference equations for the one dimensional, constant coefficient, linear diffusion equation (due to its relative simplicity) on the CPU and GPU.
The second phase will then investigate adding layers to the first phase. This two phase approach will allow us to first understand the issues surrounding diffusion on different hardware architectures, and then focus on the two layer problem, so that an efficient parallel solver can be implemented.

The solver consists of two components that are designed to improve execution time or the accuracy of the solution. First, the tridiagonal structure of the underlying matrices will be considered so that the LU decomposition can be applied with a computational complexity of O(n). While doing so, we will show how to implement Stone's recursive doubling algorithm to solve 2^26 equations, a goal that, up to now, has not exceeded 1,024 equations. For the second component, we will demonstrate the mathematics to treat the boundary between two different diffusion coefficients in a manner that reduces error in the solution.
2 GRAPHICS PROCESSING UNIT (GPU)

The modern Graphics Processing Unit (GPU) has its genesis in 2D and 3D computer graphics. In 2000, parallel processing and floating point arithmetic capabilities were added to graphics cards to accelerate the rate at which world geometries could be transformed, illuminated, projected, clipped and then displayed as pixels. This sequence of operations is called the graphics pipeline. It makes heavy use of five basic floating point operators (addition, subtraction, multiplication, division and square root). The process itself is called embarrassingly parallel because the transformation sequence applied to a three dimensional vertex is independent of the other vertices.

In 2002, researchers became interested in these parallel processing and floating point capabilities. They used the graphics application programming interface (API) to compute functions such as fast Fourier transforms and convolutions. NVidia took note of the scientific computing trend with GPUs and developed CUDA - the Compute Unified Device Architecture. CUDA [6] is a computing platform and programming model that uses a C-like language to expose the massive parallelism of GPU hardware. In retrospect, the demand for real time graphics has caused the GPU to evolve into a highly parallel, multithreaded, many core processor with very high memory bandwidth and computing throughput, as illustrated in figures 2.1 and 2.2.

2.1 Software Model

CUDA was designed to overcome the challenge of writing applications that transparently scale with increasing numbers of processing cores while maintaining a low learning curve for programmers familiar with the C programming language. There are three abstractions at the core of CUDA - a hierarchy of thread groups, shared memories and barrier synchronization. These abstractions guide the programmer to partition a problem into sub-problems that can be solved independently in parallel by blocks of threads executing a kernel.

Figure 2.1: Memory bandwidth for the CPU and GPU
Figure 2.2: Floating point operations per second for the CPU and GPU

A kernel is a program written in CUDA C that is downloaded from the host to a GPU board at runtime. Parameters are passed to the kernel at invocation to provide it with operands that are transformed by the GPU. In listing 2.1 the kernel program extends from line 1 through line 5. Line 13 of the main program (which is executing on the Intel CPU) essentially downloads the VectorAdd kernel to the GPU and launches it on 1,000 threads. At runtime each thread is assigned a unique thread index, in this case ranging from 0 to 999. The i-th thread loads A[i] and B[i], adds them, and then writes the result to C[i]. When all of the threads have executed the VectorAdd kernel, control is returned to the main program at line 14.

Modern GPU hardware (like an NVidia Tesla K20 board) can have as many as 2,496 processing cores. When a kernel is launched, one of the required parameters is the thread count. The requested thread count can exceed the number of physical cores on the GPU. From a conceptual standpoint, the hardware maps blocks of 32 threads (called warps) to 32 cores until they have finished executing. When one warp finishes, another one is allocated to the idle block of cores for execution. This is one of the key abstractions in CUDA that lends itself to transparent scalability. If, for example, a next generation board arrives with 10,000 cores, the kernels are oblivious to the environmental change. The hardware and runtime take care of the mapping from threads to cores, and the kernel is executed. The result is a platform that preserves the user's investment in code while insulating it from hardware changes.
1  __global__ void VectorAdd(float *A, float *B, float *C)
2  {
3      int i = threadIdx.x;
4      C[i] = A[i] + B[i];
5  }
6
7  int main()
8  {
9      const int N = 1000;
10     ...
11     // Kernel invocation from the host with N threads
12
13     VectorAdd<<<1, N>>>(A, B, C);
14     ...
15 }

Listing 2.1: Example kernel and host code

2.2 Hardware Model

From a hardware perspective, the fundamental computing unit in an NVidia GPU is a core (see figure 2.3). A core contains a 32-bit arithmetic logic unit (ALU) capable of performing operations such as min, max, add, subtract, multiply, divide, compare and the bitwise logical operators. A core also contains a single and double precision floating point unit. In figure 2.4 we see that a collection of cores is grouped together in a unit known as a Streaming Multiprocessor (SM). The SM in the figure has 192 cores within. A GPU itself is then a collection of SMs. Although figure 2.4 shows 8 SMs (for a total of 1,536 cores), the Tesla K20 cards used in this project have 13 SMs, for a total of 2,496 cores.

An SM is designed to execute thousands of programming threads concurrently. To manage such a large number of threads, it employs a unique architecture called SIMT (Single Instruction, Multiple Thread). An SM schedules and executes the threads grouped as warps. Individual threads composing a warp begin execution at the same instruction address, but have their own register state and are therefore free to branch
Figure 2.3: A GPU Core

and execute independently. However, full efficiency is realized when all 32 threads of a warp agree on their execution path. If the threads of a warp diverge via a data dependent conditional branch, the warp serially executes each branch taken, disabling the threads that are not on that path; when all paths complete, the threads converge back to a common path.

The SIMT architecture is somewhat similar to the architecture of vector processors known as SIMD (Single Instruction, Multiple Data). A key difference, however, is that the SIMD organization exposes the width of the vector to the software (and programmer). SIMT, on the other hand, specifies the execution and branching behavior in terms of a single thread. This model simplifies parallel programming somewhat and aids in program correctness.
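The branch-serialization rule above can be modeled in a few lines. The following Python sketch (an illustration of the SIMT execution rule, not NVidia's actual scheduler) counts one pass per distinct path taken by the active threads of each 32-lane warp, so a warp whose threads split over an if/else costs two passes:

```python
def simt_passes(predicates, warp=32):
    """Count serialized passes for an if/else executed warp by warp.
    predicates[i] is the branch outcome of thread i; a warp needs one
    pass per distinct path taken by its active threads."""
    passes = 0
    for w in range(0, len(predicates), warp):
        taken = set(predicates[w:w + warp])
        passes += len(taken)   # 1 if the warp agrees, 2 if it diverges
    return passes

# 64 threads: the first warp agrees (all take the branch),
# the second warp alternates and therefore diverges
p = [True] * 32 + [i % 2 == 0 for i in range(32)]
```

With this input the first warp contributes one pass and the second warp two, illustrating why full efficiency requires all 32 threads of a warp to agree on their execution path.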
Figure 2.4: A GPU and a Streaming Multiprocessor (SM or SMX)
3 DIFFUSION

3.1 One Layer

The one dimensional diffusion equation is

    ∂u/∂t = κ ∂²u/∂x²,  0 ≤ x ≤ L,  t > 0                                     (3.1)

where u = u(x, t) is the dependent variable and κ is a real constant. The initial condition and boundary conditions are u(x, 0) = 0, u(0, t) = g(t) and u(L, t) = h(t). The particular problem we will solve has boundary and initial conditions that are equal to

    u(0, t) = 0,  u(L, t) = 0,  u(x, 0) = sin(πx/L).                          (3.2)

The solution to (3.1) subject to (3.2) is then

    u(x, t) = sin(πx/L) exp(−κπ²t/L²).                                        (3.3)

The results of this project will be validated for both the CPU and GPU using (3.3).

We begin with an implicit time discretization scheme with a trapezoidal rule in which t_j = jδt. Then ∂u/∂t = F(x, t) is converted to

    [u^(j)(x_i) − u^(j−1)(x_i)]/δt = [F(x_i, t_j) + F(x_i, t_{j−1})]/2.

Define u_i^(j) to replace u(x_i, t_j). The time discretization of the diffusion equation becomes

    [u^(j)(x_i) − u^(j−1)(x_i)]/δt = (κ/2)[(∂²u/∂x²)^(j)|x_i + (∂²u/∂x²)^(j−1)|x_i].   (3.4)

We now use a central difference with uniform spacing h = L/N for the spatial dimension, where N is the number of sub-intervals. As a result, (3.4) becomes
    [u^(j)(x_i) − u^(j−1)(x_i)]/δt = (κ/2)[(u_{i−1} − 2u_i + u_{i+1})/h²]^(j) + (κ/2)[(u_{i−1} − 2u_i + u_{i+1})/h²]^(j−1).   (3.5)

If we let r = 2h²/(κδt), then (3.5) can be rewritten as

    −u_{i−1}^(j) + (2+r)u_i^(j) − u_{i+1}^(j) = u_{i−1}^(j−1) − (2−r)u_i^(j−1) + u_{i+1}^(j−1),  i = 1, ..., N−1, j > 0.   (3.6)

For convenience, define

    f_1^(j)     = u_0^(j−1) − (2−r)u_1^(j−1) + u_2^(j−1)
    f_i^(j)     = u_{i−1}^(j−1) − (2−r)u_i^(j−1) + u_{i+1}^(j−1),  2 ≤ i ≤ N−2
    f_{N−1}^(j) = u_{N−2}^(j−1) − (2−r)u_{N−1}^(j−1) + u_N^(j−1)

then (3.6) becomes

    (2+r)u_1^(j) − u_2^(j) = f_1^(j) + u_0^(j)
    −u_{i−1}^(j) + (2+r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2, ..., N−2, j > 0      (3.7)
    −u_{N−2}^(j) + (2+r)u_{N−1}^(j) = f_{N−1}^(j) + u_N^(j).

Now rewrite (3.7) in terms of a left boundary equation, a set of interior equations for 1 < i < N−1 and a right boundary equation. The left boundary equation is

    (2+r)u_1^(j) − u_2^(j) = f_1^(j) + g^(j),  j > 0.                                 (3.8)

The interior equations are

    −u_{i−1}^(j) + (2+r)u_i^(j) − u_{i+1}^(j) = f_i^(j),  i = 2, ..., N−2, j > 0      (3.9)

and the right boundary equation is

    −u_{N−2}^(j) + (2+r)u_{N−1}^(j) = f_{N−1}^(j) + h^(j),  j > 0.                    (3.10)

This is done to emphasize the roles of the interior and boundary forcing functions. As such, the boundaries need no special treatment.
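The scheme can be exercised end to end with a short sketch. The Python below is an illustration, not the report's implementation (which is the C code of Appendix A): it assembles f from u^(j−1), solves the constant tridiagonal system by the LU substitutions derived later in this chapter, and checks the result against the exact solution (3.3). The mesh size and time step are illustrative choices.

```python
import math

def step_diffusion(u, r):
    """Advance one trapezoidal (Crank-Nicolson) time step of (3.6)-(3.10)
    with zero boundary values, solving the tridiag(-1, 2+r, -1) system."""
    N = len(u) - 1                       # u holds N+1 points; u[0], u[N] fixed at 0
    # right-hand side f_i = u_{i-1} - (2 - r) u_i + u_{i+1}, i = 1..N-1
    f = [u[i-1] - (2.0 - r)*u[i] + u[i+1] for i in range(1, N)]
    b = 2.0 + r
    # LU pivots: gamma_1 = 2+r, gamma_i = 2+r - 1/gamma_{i-1}
    gamma = [b]
    for _ in range(1, N - 1):
        gamma.append(b - 1.0/gamma[-1])
    # forward substitution: y_i = (f_i + y_{i-1}) / gamma_i
    y = [f[0]/gamma[0]]
    for i in range(1, N - 1):
        y.append((f[i] + y[i-1]) / gamma[i])
    # backward substitution: x_i = y_i + x_{i+1} / gamma_i
    x = [0.0]*(N - 1)
    x[-1] = y[-1]
    for i in range(N - 3, -1, -1):
        x[i] = y[i] + x[i+1]/gamma[i]
    return [0.0] + x + [0.0]

# validate against u(x,t) = sin(pi x/L) exp(-kappa pi^2 t / L^2), eq (3.3)
L_len, kappa, N, steps, dt = 1.0, 1.0, 64, 50, 1e-4
h = L_len / N
r = 2.0*h*h/(kappa*dt)
u = [math.sin(math.pi*i*h/L_len) for i in range(N + 1)]
for _ in range(steps):
    u = step_diffusion(u, r)
t = steps*dt
exact = [math.sin(math.pi*i*h/L_len)*math.exp(-kappa*math.pi**2*t/L_len**2)
         for i in range(N + 1)]
err = max(abs(a - b) for a, b in zip(u, exact))
```

The observed maximum error is on the order of the scheme's O(δt² + h²) truncation error, which is the same check the report performs against (3.3).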
We can now rewrite (3.8), (3.9) and (3.10) in matrix form to yield

    Ax = f                                                                            (3.11)

where

        | 2+r   −1                     |
        | −1    2+r   −1               |
    A = |       ...   ...   ...        |                                              (3.12)
        |            −1    2+r    −1   |
        |                  −1     2+r  |  (N−1)×(N−1)

    x = [u_1^(j), u_2^(j), ..., u_{N−2}^(j), u_{N−1}^(j)]^T  and
    f = [f_1^(j) + g^(j), f_2^(j), ..., f_{N−2}^(j), f_{N−1}^(j) + h^(j)]^T.          (3.13)

This is the system of equations that will be solved.

3.2 Factorization of a Tridiagonal Matrix

3.2.1 LU Factorization

Assume a tridiagonal matrix B is represented as

        | b_1   c_1                          |
        | a_2   b_2   c_2                    |
    B = |       ...   ...   ...              |                                        (3.14)
        |          a_{N−1}  b_{N−1}  c_{N−1} |
        |                   a_N      b_N     |

B can be factored into a lower bi-diagonal matrix L and an upper bi-diagonal matrix U such that B = LU, where
         | γ_1                |   | 1  δ_1                 |
         | a_2  γ_2           |   |    1   δ_2             |
    LU = |      a_3  γ_3      | × |        ...   ...       |                          (3.15)
         |      ...  ...      |   |            1  δ_{N−1}  |
         |     a_N  γ_N       |   |               1        |

If we equate the coefficients of (3.14) and (3.15), the result is

    γ_1 = b_1
    δ_1 = c_1/γ_1
    γ_i = b_i − a_i δ_{i−1},  i = 2, ..., N
    δ_i = c_i/γ_i,  i = 2, ..., N−1.

Alternatively, you may cast LU such that

         | 1                |   | η_1  c_1                   |
         | β_2  1           |   |      η_2  c_2              |
    LU = |      β_3  1      | × |           ...   ...        |
         |      ...  ...    |   |          η_{N−1}  c_{N−1}  |
         |      β_N  1      |   |                   η_N      |

with similar systems of recurrence relations for their coefficients. The method chosen depends on its application.

If we have a system of equations involving B it can be written as

    Bx = f, or                                                                        (3.16)
    L(Ux) = f.

Now, introduce the intermediate vector y and we have

    Ly = f,  the forward substitution                                                 (3.17)
    Ux = y,  the backward substitution.                                               (3.18)
The system of equations that results from (3.17) is

    γ_1 y_1 = f_1
    a_2 y_1 + γ_2 y_2 = f_2
    a_i y_{i−1} + γ_i y_i = f_i,  i = 3, ..., N

and solving for the y vector yields

    y_1 = f_1/γ_1
    y_i = (f_i − a_i y_{i−1})/γ_i,  i = 2, ..., N.                                    (3.19)

With the forward substitution step complete, we can write the backward substitution (3.18) as

    x_N = y_N
    x_i = y_i − δ_i x_{i+1},  i = N−1, ..., 1                                         (3.20)

and generate the solution to (3.16).

3.2.2 UL Factorization

As it turns out, there is also a UL factorization. If L is pre-multiplied by U, the resulting system is then

    Bx = f, or                                                                        (3.21)
    U(Lx) = f,

where

         | 1  δ̄_1                 |   | β_1                 |
         |    1   δ̄_2             |   | a_2  β_2            |
    UL = |        ...   ...       | × |      a_3  β_3       |                         (3.22)
         |            1  δ̄_{N−1}  |   |      ...  ...       |
         |               1        |   |      a_N  β_N       |
The UL factorization is equivalent to reordering the vectors x and f from N to 1. It follows that the information is transmitted from N to 1. The need for this will become apparent when dealing with the interface equation in section 4.

When the intermediate vector z is introduced, we obtain

    Uz = f,  the backward substitution
    Lx = z,  the forward substitution.

If we equate the coefficients of (3.14) and (3.22), the result is

    β_N = b_N
    δ̄_i = c_i/β_{i+1},  i = N−1, ..., 1
    β_i = b_i − a_{i+1} δ̄_i,  i = N−1, ..., 1.

Now that the coefficients of U and L are computed, we can solve for the elements of vectors z and x to yield

    z_N = f_N
    z_i = f_i − δ̄_i z_{i+1},  i = N−1, ..., 1

and

    x_1 = z_1/β_1
    x_i = (z_i − a_i x_{i−1})/β_i,  i = 2, ..., N

which generates the solution to the system (3.21).

3.3 CPU Implementation

The linear system (3.11) that results from discretizing (3.1) is both tridiagonal and diagonally dominant (r > 0). The tridiagonal property implies that the LU decomposition can be performed in O(n) time. The diagonal dominance implies that pivoting is not required. We use these properties to generate the solution to (3.11).
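The recurrences of section 3.2.1 translate directly into code. The Python sketch below (illustrative helper names; the report's C version is in Appendix A) factors a general tridiagonal matrix and applies the substitutions (3.19) and (3.20):

```python
def lu_tridiag(a, b, c):
    """LU factorization of a tridiagonal matrix, section 3.2.1.
    a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal."""
    n = len(b)
    gamma = [0.0]*n
    delta = [0.0]*(n - 1)
    gamma[0] = b[0]
    delta[0] = c[0]/gamma[0]
    for i in range(1, n):
        gamma[i] = b[i] - a[i]*delta[i-1]       # gamma_i = b_i - a_i delta_{i-1}
        if i < n - 1:
            delta[i] = c[i]/gamma[i]            # delta_i = c_i / gamma_i
    return gamma, delta

def solve_tridiag(a, gamma, delta, f):
    """Forward substitution (3.19) then backward substitution (3.20)."""
    n = len(gamma)
    y = [0.0]*n
    y[0] = f[0]/gamma[0]
    for i in range(1, n):
        y[i] = (f[i] - a[i]*y[i-1])/gamma[i]
    x = [0.0]*n
    x[-1] = y[-1]
    for i in range(n - 2, -1, -1):
        x[i] = y[i] - delta[i]*x[i+1]
    return x
```

Applied to the diagonally dominant matrix (3.12) the factorization needs no pivoting, which is exactly the property the CPU implementation relies on.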
Listing A.1 contains the implementation of the factorization and solver for (3.12) and (3.13) that runs on the CPU. A main function named RunCPU_1K1LTest repeatedly calls the pair with various configurations, then measures the execution time and the relative error with respect to the initial conditions in (3.2). At a high level, RunCPU_1K1LTest performs 5 functions.

1. Assign boundary conditions.
2. Assign initial conditions.
3. Initialize the sub-diagonal, diagonal, and super-diagonal of matrix A.
4. Compute the LU factorization.
5. Repeatedly solve the system and step time.

3.4 Parallel LU Factorization

The symmetry of A in (3.12) leads to an LU factorization of

         | γ_1               |   | 1  δ_1                 |
         | −1   γ_2          |   |    1   δ_2             |
    LU = |      −1   γ_3     | × |        ...   ...       |
         |      ...  ...     |   |            1  δ_{N−1}  |
         |      −1   γ_N     |   |               1        |

where

    γ_1 = 2 + r
    δ_1 = −1/γ_1
    γ_i = 2 + r + δ_{i−1},  i = 2, ..., N
    δ_i = −1/γ_i,  i = 2, ..., N−1
which can be simplified to

    γ_1 = 2 + r
    γ_i = 2 + r − 1/γ_{i−1},  i = 2, ..., N.                                          (3.23)

The nonlinear two-term recurrence in (3.23) does not lend itself to parallel evaluation. Stone [7] and Blelloch [8], however, describe an algorithm for parallel evaluation of m-th order linear recurrence relations. So (3.23) is modified through the use of a Riccati transformation γ_i = q_i/q_{i−1} to produce a linear three-term recurrence

    q_0 = 1
    q_1 = 2 + r
    q_i = (2 + r)q_{i−1} − q_{i−2},  i = 2, ..., N.                                   (3.24)

The issue with (3.24) is that it can overflow when using floating point arithmetic. The term q_i grows and becomes unbounded as i → ∞. We seek a truncated form that gives acceptable results within the finite precision of the microprocessor.

Recognize that the initial value problem in (3.24) can be solved analytically. Assume a solution of the form q_i = A x_+^i + B x_−^i with A and B real constants. Examination of the characteristic equation x² − (2+r)x + 1 = 0 for (3.24) yields roots of

    x_± = [b ± √(b² − 4)]/2,  where b = 2 + r.

With the initial conditions of q_0 = 1 and q_1 = b, we can find the constants A and B from

    q_0 = 1 = A + B
    q_1 = b = A x_+ + B x_−,

where A = (b − x_−)/(x_+ − x_−) and B = 1 − A, to yield the solution

    q_i = (x_+^{i+1} − x_−^{i+1})/(x_+ − x_−)
which can be rewritten as

    q_i = x_+^i [1 − (x_−/x_+)^{i+1}]/[1 − (x_−/x_+)].

With

    [1 − (x_−/x_+)^{i+1}]/[1 − (x_−/x_+)] = 1 + x_−/x_+ + O((x_−/x_+)²),

it follows that

    q_i = x_+^i [1 + x_−/x_+ + O((x_−/x_+)²)].

For sufficiently large i = N,

    q_N = x_+^N [1 + x_−/x_+ + O((x_−/x_+)²)]

and q_i will overflow as q_N → x_+^N.

When (3.23) is combined with (3.19) and (3.20), the forward and backward substitution logic simplifies to

    y_1 = f_1/γ_1
    y_i = (f_i + y_{i−1})/γ_i,  i = 2, ..., N

and

    x_N = y_N
    x_i = y_i + x_{i+1}/γ_i,  i = N−1, ..., 1.

We now see that 1/γ_i is needed and not q_i, so

    1/γ_i = (x_+^i − x_−^i)/(x_+^{i+1} − x_−^{i+1})
          = (x_+^i/x_+^{i+1}) [1 − (x_−/x_+)^i]/[1 − (x_−/x_+)^{i+1}]
          = (1/x_+) [1 − (x_−/x_+)^i + O((x_−/x_+)^{i+1})].

This tells us that 1/γ_i approaches its asymptotic limit faster than q_i. If we impose the constraint that

    |1/γ_K − 1/γ_{K−1}| < 10^{−M}                                                     (3.25)
for some integers K and M, then

    1/γ_K ≈ (1/x_+)[1 − (x_−/x_+)^K] = (1/x_+)(1 − 10^{−M}).

This suggests we stop computing when (3.25) is satisfied. From that point on we impose

    1/γ_i = 1/x_+,  for i > K.

In summary, this section shows how to transform a non-linear two term recurrence relation into a linear three term recurrence relation that is now a candidate for parallelization. While doing so, a numeric overflow problem was identified and mitigated by truncating the series once an acceptable level of accuracy had been reached.

3.5 Parallel Solver and Recursive Doubling

The sequential nature of the solver for the CPU does not lend itself to efficient implementation on the GPU. Other researchers have chosen to implement Cyclic Reduction (CR) and Parallel Cyclic Reduction (PCR) for their tridiagonal solvers. Two other researchers [9, 10] have attempted to implement Stone's Recursive Doubling (RD) algorithm [11] to solve tridiagonal systems on a GPU. Each reported problems such as large numerical errors, arithmetic underflow and instability, but failed to analyze the source of the instability. This section, on the other hand, will discuss the source of the instability (the numeric underflow that occurs on the GPU) and then describe how to address the issue.

Stone described RD in terms of the following theorem and claimed that it could be used to solve recurrence relations of all orders.

Theorem (Stone) 1  Let y_i(j) satisfy a non-homogeneous two-term recurrence

    y_{i+1}(j) = y_1(j) + y_i(j−1)·(−m_j),  i, j ≥ 1                                  (3.26)

with the boundary conditions y_1(j) = b_j, j ≥ 1; y_i(j) = 0, j ≤ 0; y_i(j) = 0, i ≤ 0.
Then,

a) for s ≥ 1, y_i(j) satisfies the recurrence relation

    y_{i+s}(j) = y_s(j) + y_i(j−s) ∏_{k=j−s+1}^{i} (−m_k),  i ≥ 1, j ≥ s;

b)

    y_i(j) = Σ_{k=1}^{j} y_1(k) ∏_{s=k+1}^{j} (−m_s),  i ≥ j ≥ 1;                     (3.27)

c) for i ≥ j ≥ 1, y_i(j) = z_j, where z_j is the j-th component of the solution to z_i = b_i − m_i z_{i−1}, z_1 = b_1.

Quite simply, the problem lies with (3.27). If the sequence {m_k}_1^n is bounded by m̄, then

    ∏_{k=j−s+1}^{i} (−m_k) ≈ m̄^p,  where p = i − j + s.

The RD algorithm needs to be modified to avoid the arithmetic underflow that m̄^p causes when m̄ < 1.

3.6 Recursive Doubling and Nilpotent Matrices

The non-homogeneous two-term recurrence (3.26) with initial value y_1 = b_1 has a solution of the form

    y_i = Σ_{k=1}^{i} b_k ∏_{s=k+1}^{i} (−m_s).                                       (3.28)

If the first few terms of the solution are expanded, we have

    y_1 = b_1
    y_2 = b_2 − b_1 m_2
    y_3 = b_3 − b_2 m_3 + b_1 m_2 m_3.

This is a compact representation of the linear system

    Ly = b                                                                            (3.29)
where L is a lower bidiagonal matrix given by

        | 1                 |
        | m_2  1            |
    L = |      m_3  1       |
        |      ...  ...     |  (n×n)
        |      m_n  1       |

We rewrite L as the sum of an identity matrix I of order n and a nilpotent matrix N of index n, such that L = I + N, where N is the first lower diagonal matrix

        | 0                 |
        | m_2  0            |
    N = |      m_3  0       |
        |      ...  ...     |  (n×n)
        |      m_n  0       |

With N nilpotent, the inverse of L can be expressed as

    L^{−1} = (I + N)^{−1} = I − N + N² − N³ + ⋯ + (−1)^{n−1} N^{n−1}.

For some p < n, the p-th power of N is a matrix with non-zero values filling the p-th lower diagonal. For example,

    | 0              |²   | 0                  |
    | m_2  0         |    | 0        0         |
    | 0   m_3  0     |  = | m_2 m_3  0    0    |
    | 0    0  m_4  0 |    | 0   m_3 m_4   0  0 |
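Because N is nilpotent of index n, the alternating series for (I + N)^{−1} terminates after the (n−1)-th power and is exact, not merely asymptotic. The following Python sketch (illustrative names, not code from the report) verifies this on a small example:

```python
def matmul(A, B):
    """Dense n x n matrix product (plain lists, no external libraries)."""
    n = len(A)
    return [[sum(A[i][k]*B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def neumann_inverse(m):
    """(I + N)^{-1} = I - N + N^2 - ... for the nilpotent lower-diagonal
    matrix N built from m = [m2, m3, ..., mn]; the series terminates."""
    n = len(m) + 1
    I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    N = [[0.0]*n for _ in range(n)]
    for i, mi in enumerate(m):
        N[i+1][i] = mi
    inv = [row[:] for row in I]
    P = [row[:] for row in I]          # running power N^k, starting at N^0
    sign = 1.0
    for _ in range(n - 1):
        P = matmul(P, N)               # next power of N
        sign = -sign
        for i in range(n):
            for j in range(n):
                inv[i][j] += sign * P[i][j]
    return inv
```

Multiplying the result by L = I + N recovers the identity exactly, because every term past N^{n−1} vanishes.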
We can now relate the solution of the non-homogeneous two-term recurrence (3.28) to the entries of the p-th lower diagonal of N^p, given by

    (N^p)_{p+i,i} = ∏_{k=i+1}^{p+i} m_k,  i = 1, 2, ..., n−p.

If there is an upper bound on the entries of N, m̄ = max_{1≤i≤n} |m_i|, then the absolute value of an entry on the p-th lower diagonal is

    |(N^p)_{p+i,i}| = ∏_{k=i+1}^{p+i} |m_k| ≤ m̄^p,  i = 1, 2, ..., n−p.

If m̄ < 1, we can now choose a value for p ∈ ℕ such that the solution has an error no bigger than 10^{−digit},

    m̄^p ≤ 10^{−digit}  or  p ≥ −digit · ln 10 / ln m̄,

where digit is the number of required decimal places. All of this now suggests computing

    y = (I − N + N² − N³ + ⋯ + (−1)^p N^p) b

as the truncated form of the solution to (3.29). From a practical standpoint we have

    y_i = b_i − m_i y_{i−1},  i = 2, ..., p,  with y_1 = b_1                          (3.30)

for the first p elements of y. Thereafter, for elements p+1 ≤ j ≤ n, we have

    y_j = b_j − b_{j−1} m_j + b_{j−2}(m_j m_{j−1}) − ⋯ + b_{j−p} ∏_{k=j−p+1}^{j} (−m_k)   (3.31)

in which only the last p product terms are retained, which is just (3.27) truncated.
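The truncation can be checked numerically. In the Python sketch below (an illustration, not the report's GPU code), solve_truncated keeps only the last p product terms of (3.31) and is compared against the exact sequential recurrence; with m̄ = 0.5, the bound m̄^p ≤ 10^{−digit} gives p ≥ digit·ln 10/ln 2 ≈ 3.32·digit, so p = 30 should deliver roughly nine correct digits.

```python
import math

def solve_bidiagonal(m, b):
    """Exact solution of L y = b via the sequential recurrence (3.30).
    m[0] is unused; m[i] multiplies y_{i-1}."""
    y = [b[0]]
    for i in range(1, len(b)):
        y.append(b[i] - m[i]*y[i-1])
    return y

def solve_truncated(m, b, p):
    """Truncated form (3.31): each y_j keeps only the last p product
    terms, which is what recursive doubling evaluates in parallel."""
    n = len(b)
    y = [0.0]*n
    for j in range(n):
        acc, prod = b[j], 1.0
        for q in range(1, min(p, j) + 1):
            prod *= -m[j - q + 1]        # running product of (-m_k)
            acc += b[j - q]*prod
        y[j] = acc
    return y

n = 200
m = [0.0] + [0.5]*(n - 1)                # m-bar = 0.5 < 1
b = [math.sin(i + 1.0) for i in range(n)]
exact = solve_bidiagonal(m, b)
approx = solve_truncated(m, b, p=30)
err = max(abs(u - v) for u, v in zip(exact, approx))
```

Every element of the truncated solution agrees with the exact one to within roughly m̄^p, independent of n, which is the property that makes the windowed evaluation safe on large systems.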
3.7 GPU Implementation

The forward and backward substitution logic is now represented by

    y_1 = f_1 / \gamma_1,   y_i = \frac{1}{\gamma_i}(y_{i-1} + f_i),   i = 2, \ldots, N-1,

and

    x_{N-1} = y_{N-1},   x_i = y_i + \frac{1}{\gamma_i} x_{i+1},   i = N-2, \ldots, 1,

respectively. Each of these two recurrences is remarkably similar to (3.30) and (3.31), and the GPU code takes advantage of this.

Assume, for example, that a diffusion problem has 2^20 equations in 2^20 unknowns, and for accuracy reasons it requires p = 15. The \gamma vector is first computed on the CPU and then passed into the GPU kernels. It was deemed unnecessary to implement the truncated version of the \gamma vector on the GPU, given that the LU factorization is performed once, outside of the solver loop.

The ForwardLU kernel (listing B.1) is launched with n = 2^20 threads. Recall that the threads are organized in groups of 32 called a warp, all of which execute the same instruction on different data elements. For a given warp 0 \le j < 2^15, threads numbered 32j \le i < 32j + 32 will each execute

    sum = 0;   sum = (sum + f_{i-k}) \frac{1}{\gamma_{i-k}},   k = 1, \ldots, p.

The BackwardLU kernel (listing B.2) behaves in a similar way.

3.8 Results

All of the software development was performed on a Lenovo S20 workstation with 14 GB of RAM, a 300 GB Western Digital VelociRaptor hard drive, an Intel Xeon W3520 quad-core CPU @ 2.67 GHz, and an NVIDIA Tesla K40c. The workstation is running
Windows 7/64. The development tools consisted of Visual Studio 2010, CUDA 6.5 and Tesla driver 341.44. All of the generated applications were 64-bit.

Table 3.1 and figure 3.1 summarize the results of the CPU version of the LU decomposition (CPU LU) versus the CPU version of the truncated \gamma vector (CPU ZC) that was described in section 3.4. Each element x_i of the solution vector x was the sum of 20 terms. The number of time steps was 50. Table 3.2 is a similar experiment with the number of terms equal to 32. Table 3.2 emphasizes that the additional 12 terms result in relative errors that are nearly identical to those of the CPU LU version.

Table 3.3 and figure 3.2 summarize the results of the CPU version of the LU decomposition versus the GPU version of the RD algorithm (GPU RD1L) that was described in section 3.5. Again, each element x_i of the solution vector x was the sum of 20 terms. The number of time steps was 50. In this case, the data clearly shows that the GPU prefers large problems over small ones, and ultimately executes 36 times faster than the CPU LU algorithm.

Table 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50

    Points              CPU LU (ms)   LU MaxRel Err   CPU ZC (ms)   ZC MaxRel Err   Speed Up
    2^10 = 1,024                1.1       2.467E-05           3.5       3.136E-05       0.31
    2^16 = 65,536              69.4       6.858E-09         231.0       1.773E-05       0.30
    2^20 = 1,048,576        1,222.2       7.650E-11       3,650.9       4.526E-06       0.33
    2^22 = 4,194,304        4,809.4       2.056E-10      14,671.2       2.333E-07       0.32
    2^24 = 16,777,216      19,164.3       3.473E-10      59,080.4       2.333E-07       0.32
    2^26 = 67,108,864      77,671.4       5.239E-09     236,833.5       2.333E-07       0.32
Table 3.2.: CPU LU vs CPU ZC with 1 Layer, Terms=32, Time Steps=50

    Points         CPU LU (ms)   LU MaxRel Err   CPU ZC (ms)   ZC MaxRel Err   Speed Up
    1,024                  1.1       2.467E-05           5.7       2.467E-05       0.19
    65,536                69.4       6.858E-09         373.6       8.523E-09       0.18
    1,048,576          1,222.2       7.650E-11       5,848.9       2.486E-10       0.20
    4,194,304          4,809.4       2.056E-10      23,322.6       2.070E-10       0.20
    16,777,216        19,164.3       3.473E-10      93,991.6       3.458E-10       0.20
    67,108,864        77,671.4       5.239E-09     377,045.0       5.241E-09       0.20

Figure 3.1.: CPU LU vs CPU ZC with 1 Layer, Terms=20, Time Steps=50 (log-log plot of time in ms versus number of points for the CPU LU and CPU ZC series).
Table 3.3.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50

    Points         CPU LU (ms)   CPU MaxRel Err   GPU RD1L (ms)   GPU MaxRel Err   Speed Up
    1,024                  1.1        2.467E-05             2.7        3.136E-05        0.4
    65,536                69.4        6.858E-09             4.9        1.773E-05       14.1
    1,048,576          1,222.2        7.650E-11            37.3        4.526E-06       32.7
    4,194,304          4,809.4        2.056E-10           136.9        2.333E-07       35.1
    16,777,216        19,164.3        3.473E-10           536.8        2.333E-07       35.7
    67,108,864        77,671.4        5.239E-09         2,143.9        2.333E-07       36.2

Figure 3.2.: CPU LU vs GPU RD1L with 1 Layer, Terms=20, Time Steps=50 (log-log plot of time in ms versus number of points for the CPU LU and GPU RD1L series).
4 DIFFUSION WITH LAYERS

4.1 Background

Consider a practical problem from hydrogeology shown in figure 4.1. The one-dimensional, transient-state, unsaturated flow through this section of earth is governed by

    \frac{\partial \theta(z,t)}{\partial t} = \frac{\partial}{\partial z}\left[ K(\theta) \left( \frac{\partial \psi(z,t)}{\partial z} + 1 \right) \right]   (4.1)

where \theta(z,t) is the saturation (water content), K(\theta) is the hydraulic conductivity, \psi(z,t) is the pressure head, z is the elevation above a vertical datum, and t is time.

Figure 4.1 effectively has multiple diffusion coefficients (in this case 5) because the hydraulic conductivity K(\theta) varies with the subsurface soil type.¹ The hydraulic conductivity coefficient has units of distance/time and, as such, can be considered the diffusive velocity with which water moves through a substructure. The value can be as small as 10^-8 meters/day for unfractured shale, and as large as 10^4 meters/day for gravel, thereby spanning 12 orders of magnitude [12].

If the diffusion coefficients are approximately the same order of magnitude, then a finite difference approach may work. However, as the ratio of the coefficients begins to

¹ Although equation (4.1) technically is a non-linear partial differential equation, it is used as an example because we believe the reader can relate to the notion of water percolating down through different layers of soil.
Figure 4.1.: A plant with multiple layers of soil

vary by an order of magnitude or more, the finite difference approach can suffer from a host of numerical problems affecting its convergence.

4.2 Two Layers

A two layer problem is simply two single layers connected by interface conditions across a common boundary. The interface conditions describe the continuity of u(x,t) and its flux, K \partial u / \partial x, across an interface x = l. Equations (4.2) through (4.5) describe the two layer problem.
    \frac{\partial u}{\partial t} = \kappa_1 \frac{\partial^2 u}{\partial x^2},   0 \le x < l_1,  t > 0   (4.2)

    \frac{\partial u}{\partial t} = \kappa_2 \frac{\partial^2 u}{\partial x^2},   l_1 < x \le 1,  t > 0   (4.3)

    u(l_1^-, t) = u(l_1^+, t)   (4.4)

    K_1 \left.\frac{\partial u}{\partial x}\right|_{l_1^-} = K_2 \left.\frac{\partial u}{\partial x}\right|_{l_1^+}   (4.5)

In this case, the diffusion coefficients \kappa_1, \kappa_2 are constant, as are the flux coefficients K_1 and K_2. We have used \pm to denote the right side (+) and left side (-) of the interface x = l_1.

The mathematics of two layers is similar to that in section 3.1, but it will now be generalized. Assume we have layers with widths l_1 and l_2. Each layer i has N_i sub-intervals, and the width of each sub-interval is h_i = l_i / N_i. Now define

    M_i = \sum_{k=1}^{i} N_k,   r_i = \frac{2 h_i^2}{\kappa_i \, \delta t},   i = 1, 2;

so that the coefficient matrices take on the form

    A_1 = \begin{bmatrix}
    2 + r_1 & -1 \\
    -1 & 2 + r_1 & -1 \\
    & \ddots & \ddots & \ddots \\
    & & -1 & 2 + r_1 & -1 \\
    & & & -1 & 2 + r_1
    \end{bmatrix}_{(N_1-1) \times (N_1-1)},   (4.6)

    A_2 = \begin{bmatrix}
    2 + r_2 & -1 \\
    -1 & 2 + r_2 & -1 \\
    & \ddots & \ddots & \ddots \\
    & & -1 & 2 + r_2 & -1 \\
    & & & -1 & 2 + r_2
    \end{bmatrix}_{(N_2-1) \times (N_2-1)};   (4.7)
the solution vectors take the form

    x_1 = \left( u_1^{(j)}, u_2^{(j)}, \ldots, u_{N_1-2}^{(j)}, u_{N_1-1}^{(j)} \right)^T  of length N_1 - 1,
    x_2 = \left( u_{N_1+1}^{(j)}, u_{N_1+2}^{(j)}, \ldots, u_{M_2-2}^{(j)}, u_{M_2-1}^{(j)} \right)^T  of length N_2 - 1;

the right hand terms take the form

    f_1 = \left( f_1^{(j)} + g^{(j)}, f_2^{(j)}, \ldots, f_{N_1-2}^{(j)}, f_{N_1-1}^{(j)} + u_{N_1^-}^{(j)} \right)^T,
    f_2 = \left( f_{N_1+1}^{(j)} + u_{N_1^+}^{(j)}, f_{N_1+2}^{(j)}, \ldots, f_{M_2-2}^{(j)}, f_{M_2-1}^{(j)} + h^{(j)} \right)^T;

and the system of equations is A_1 x_1 = f_1 and A_2 x_2 = f_2. We now need an equation that connects the layers while preserving continuity of the state variable and the flux.

4.3 Interface Equation

At the interface x = l_1 we have the continuity of the state variable,

    u_{N_1^-}^{(j)} = u_{N_1^+}^{(j)} = u_{N_1}^{(j)},

and the continuity of the flux,

    K_1 \frac{u_{N_1^-}^{(j)} - u_{N_1-1}^{(j)}}{h_1} = K_2 \frac{u_{N_1+1}^{(j)} - u_{N_1^+}^{(j)}}{h_2}.

When the two previous interface conditions are combined, we are left with an expression for the solution across the interface,

    \left( \frac{K_1}{h_1} + \frac{K_2}{h_2} \right) u_{N_1}^{(j)} = \frac{K_1}{h_1} u_{N_1-1}^{(j)} + \frac{K_2}{h_2} u_{N_1+1}^{(j)}   \forall j.   (4.8)
4.4 Complete System for Two Layers

In matrix form, the tridiagonal system for two layers with the interface included consists of

    A_{if} \, x^{(j)} = f^{(j)}   (4.9)

where A_{if} has A_1 in its upper left block and A_2 in its lower right block, joined by the interface row; writing only the non-zero entries of rows N_1 - 1 through N_1 + 1,

    row N_1 - 1:   \ldots  -1    2 + r_1    -1
    row N_1:             -K_1/h_1    K_1/h_1 + K_2/h_2    -K_2/h_2
    row N_1 + 1:                      -1    2 + r_2    -1  \ldots

and

    x = \left( u_1, u_2, \ldots, u_{N_1-1}, u_{N_1}, u_{N_1+1}, \ldots, u_{M_2-2}, u_{M_2-1} \right)^T,
    f = \left( f_1 + u_0, f_2, \ldots, f_{N_1-1}, 0, f_{N_1+1}, \ldots, f_{M_2-2}, f_{M_2-1} + u_{M_2} \right)^T.

Notice that the sub-matrix in the upper left hand corner of A_{if} is A_1 from (4.6), the sub-matrix in the lower right hand corner of A_{if} is A_2 from (4.7), and the 5 terms in the center of A_{if} relate to the interface equation (4.8). Solving (4.9) now consists of the following 5 steps.
1. Compute the LU factorization of A_1 and then perform the forward substitution to obtain the relation between u_{N_1^-}^{(j)} and u_{N_1-1}^{(j)}.

2. Compute the UL factorization of A_2 and then perform the backward substitution to obtain the relation between u_{N_1^+}^{(j)} and u_{N_1+1}^{(j)}.

3. Substitute the results of (1) and (2) into the interface equation (4.8) to find u_{N_1}^{(j)}.

4. Perform the backward substitution through layer 1.

5. Perform the forward substitution through layer 2.

4.5 Difference Between Inclusion and Exclusion of the Interface Condition

Equations (4.4) and (4.5) describe the continuity of the state variable and its flux across the interface. If we look at

    K_1 \frac{\partial u_1}{\partial x} = K_2 \frac{\partial u_2}{\partial x}

and assume that K_2 \gg K_1, we have

    \frac{K_1}{K_2} \frac{\partial u_1}{\partial x} = \frac{\partial u_2}{\partial x} \rightarrow 0 \approx \frac{\partial u_2}{\partial x}.

This suggests a horizontal tangent of u_2 at the interface. This behavior is not captured by the form that excludes the interface condition, and an error results between the two. These errors are described in section 4.7, where the experiments are discussed.

4.6 Implementation

The CPU implementation for the two layer problem is a simple adaptation of the standard LU factorization and solver, running through the entire set of state variables from 1 to M_2. Listing A.1 shows that the factorization requires 5 lines of code and the solver requires 6 lines of code. In contrast, the two layer factorization in listing A.2 requires 11 lines of code and the solver in listing A.3 requires 13 lines.
The GPU implementation for the two layer problem also leverages the logic from the one layer problem. The forward and backward substitution kernels for the LU solver require approximately 27 lines of code each. Since we now also need forward and backward substitution kernels for the UL solver, they contribute approximately 27 lines each to the total line count.

4.7 Results

If the interface expression in (4.8) is excluded and we are presented with a two layer problem, the tridiagonal matrix A_12 has the form

    A_{12} = \begin{bmatrix}
    2 + r_1 & -1 \\
    -1 & 2 + r_1 & -1 \\
    & \ddots & \ddots & \ddots \\
    & -1 & 2 + r_1 & -1 \\
    & & -1 & 2 + r_2 & -1 \\
    & & & \ddots & \ddots & \ddots \\
    & & & & -1 & 2 + r_2 & -1 \\
    & & & & & -1 & 2 + r_2
    \end{bmatrix}   (4.10)

for the system

    A_{12} \, x = f.   (4.11)

Four experiments were performed that apply the CPU LU decomposition to (4.11), and the results were compared to the 5 step algorithm (CPU 2L) described in section 4.4. The parameters for the experiments are described in table 4.1.

Experiment 1 in figures 4.2 and 4.3 shows that the results of solving (4.9) and (4.11) are nearly identical. Figure 4.2 demonstrates the time evolution of the single interface point u_{N_1} when the interface equation is utilized, and compares that value with the two values u_{N_1} and u_{N_1+1} that straddle u_{N_1+1/2} in (4.11), which excludes the interface. In figure 4.3, the systems are solved with 1024 points over 50 time steps. With the interface point at u_512, the 32 points on either side are in excellent agreement for both systems.
Table 4.1.: Experimental parameters for 50 timesteps

    Experiment #   κ1   κ2   K1   K2
    1               1    1    1    1
    2               1   10    1    1
    3               1    1    1   10
    4               1   10    1   10

Experiment 2 in figures 4.4 and 4.5 demonstrates the next set of results. The time evolution plot in figure 4.4 shows the interface point significantly above the excluded form, while the solutions graphed in figure 4.5 coincide poorly. The error plot between the two solutions in figure 4.6 further quantifies the consequences of ignoring the interface equation.

Experiment 3 in figures 4.7 and 4.8 demonstrates results that are in good agreement with each other. This experiment also seems to suggest that the flux coefficient alone has little impact on the solution.

Finally, experiment 4 in figures 4.9 and 4.10 graphically demonstrates the problem that occurs when the diffusion and conductivity coefficients are κ1 = 1, κ2 = 10 and K1 = 1, K2 = 10, respectively. Figure 4.9 shows the actual interface point u_{N_1} receding from the two points that excluded the interface equation. Again, when the solution is graphed after 50 time steps in figure 4.10, it is readily apparent that excluding the interface equation introduces error into the solution. As in experiment 2, figure 4.11 quantifies the solution error. The results of experiments 2 and 4 suggest that the diffusion coefficients, along with the solver, play a large role in producing a correct solution.

With experiments 1 through 4 giving us faith in the interface equation approach, the benefits of accelerating the process on the GPU are now reported. Table 4.2 demonstrates the 18 fold acceleration of the GPU (GPU RD2L) over the CPU (CPU 2L)
version with 2 layers, as well as relative errors that are small when compared with the analytical solution corresponding to κ1 = 1, κ2 = 1, K1 = 1, K2 = 1 with 32 terms.

The results of table 4.3 are not surprising. With κ1 = 1, κ2 = 10, K1 = 1, K2 = 1 and the number of terms set to 32, a maximum acceleration of 18 fold is measured. The modifications made to RD allow us to process problems that are 65,536 times larger than the previously reported sizes for Stone's approach.

Table 4.2.: CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 1, Terms=32, Time Steps=50

    Points         CPU 2L (ms)   CPU MaxRel Err   GPU RD2L (ms)   GPU MaxRel Err   Speed Up
    1,024                  1.7        8.756E-05             7.4        8.756E-05        0.2
    65,536                70.9        2.318E-08            11.2        2.360E-08        6.3
    1,048,576          1,038.4        8.091E-11            66.1        2.533E-10       15.7
    4,194,304          4,163.8        2.056E-10           238.1        2.070E-10       17.4
    16,777,216        16,638.1        3.473E-10           928.7        3.458E-10       17.9
    67,108,864        66,963.9        5.239E-09         3,698.4        5.241E-09       18.1
Table 4.3.: CPU 2L vs GPU RD2L, κ1 = 1, κ2 = 10, Time Steps=50

    Points         CPU 2L (ms)   GPU RD2L (ms)   Speed Up
    768                    0.7             7.4        0.1
    49,152                50.7            10.2        4.9
    786,432              818.5            51.6       15.8
    3,145,728          3,161.0           180.8       17.4
    12,582,912        12,524.5           700.1       17.8
    50,331,648        50,639.1         2,779.5       18.2

Figure 4.2.: Time Evolution of Interface-Experiment 1 (κ1 = 1, κ2 = 1, K1 = 1, K2 = 1; solution values versus time step for u_{N_1} with the interface equation included, and for u_{N_1} and u_{N_1+1} with it excluded).
Figure 4.3.: Solution to the Interface Neighborhood-Experiment 1 (κ1 = 1, κ2 = 1, K1 = 1, K2 = 1 at ts = 50; solution values versus spatial index with the interface equation included and excluded).

Figure 4.4.: Time Evolution of Interface-Experiment 2 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 1; solution values versus time step for u_{N_1} with the interface equation included, and for u_{N_1} and u_{N_1+1} with it excluded).
Figure 4.5.: Solution to the Interface Neighborhood-Experiment 2 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 1 at ts = 50; solution values versus spatial index with the interface equation included and excluded).

Figure 4.6.: Error in the Interface Neighborhood-Experiment 2 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 1 at ts = 50; |error| versus spatial index).
Figure 4.7.: Time Evolution of Interface-Experiment 3 (κ1 = 1, κ2 = 1, K1 = 1, K2 = 10; solution values versus time step for u_{N_1} with the interface equation included, and for u_{N_1} and u_{N_1+1} with it excluded).

Figure 4.8.: Solution to the Interface Neighborhood-Experiment 3 (κ1 = 1, κ2 = 1, K1 = 1, K2 = 10 at ts = 50; solution values versus spatial index with the interface equation included and excluded).
Figure 4.9.: Time Evolution of Interface-Experiment 4 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 10; solution values versus time step for u_{N_1} with the interface equation included, and for u_{N_1} and u_{N_1+1} with it excluded).

Figure 4.10.: Solution to the Interface Neighborhood-Experiment 4 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 10 at ts = 50; solution values versus spatial index with the interface equation included and excluded).
Figure 4.11.: Error in the Interface Neighborhood-Experiment 4 (κ1 = 1, κ2 = 10, K1 = 1, K2 = 10 at ts = 50; |error| versus spatial index).
5 CONCLUSION

We started this project with the goal of creating a GPU based solver for the two layer linear diffusion equation. Before the solver could be addressed, the LU factorization of the finite difference form of the diffusion equation was investigated. The two term non-linear recurrence relation that results from the LU factorization was shown to be sequential and not parallelizable. When converted to a three term linear recurrence relation, the result could be parallelized for a GPU. However, it suffered from arithmetic overflow. When the underlying initial value problem was solved analytically, we were able to demonstrate that a polynomial form of the solution could be truncated to achieve a prescribed level of machine accuracy. Our experiments showed excellent agreement with the solution (3.3) to the initial value problem.

We mentioned briefly that other researchers reported numerical instabilities when applying Stone's recursive doubling algorithm to the actual solver for problems with 1,024 unknowns. Overflow is inherent in Stone's algorithm. In addition, the instability is due to an arithmetic underflow problem that appears when the two term linear recurrence relation associated with the solver is evaluated. When Stone's algorithm is recast using matrix algebra, we were able to overcome these deficiencies by applying the theory of nilpotent matrices to generate a truncated form of the matrix polynomial.

Our experiments demonstrated a GPU based algorithm that is approximately 36 times faster than the CPU version for the 1 layer problem, and approximately 18 times faster than the CPU for the 2 layer problem. Our 1 layer implementation required two GPU kernels and the 2 layer implementation required 5 kernels. The GPU is designed for large problems that stream RAM through the cores.
We attribute the two-fold loss in efficiency to the presence of the additional kernels, as well as the awkward coding patterns that were required to handle the interface condition. On the positive side, our algorithm
is also capable of processing 2^26 equations, a value that is 65,536 times larger than that of other researchers.

Finally, we described the mathematical framework for solving the two layer problem. Our experiments show that error is introduced into the solution when the interface conditions, the continuity of the state variable and the flux across the interface boundary, are not properly accounted for. In fact, our work suggests that the flux coefficients control the placement of the solution, while the diffusion coefficients, in combination with the type of solver used, control the error in the solution. We have also shown that, with the straightforward LU solver for two layers requiring approximately 54 lines of kernel code on the GPU, our parallel two layer solution is only modestly more complex at approximately 108 lines of code.
REFERENCES

[1] Greg Turk. Generating textures on arbitrary surfaces using reaction diffusion. In Proc. of the 18th Annual Conference on Computer Graphics and Interactive Techniques, pages 289–298. ACM Press, 1991.

[2] Daniele Bertaccini and Daniela Calvetti. Fast simulation of solid tumors thermal ablation treatments with a 3D reaction diffusion model. Computers in Biology and Medicine, 37(8):1173–1182, 2007.

[3] F. T. Tracy. Clean two- and three-dimensional analytical solutions of Richards' equation for testing numerical solvers. Water Resources Research, 42(8), 2006.

[4] R. Allan Freeze. Three dimensional, transient, saturated-unsaturated flow in a groundwater basin. Water Resources Research, 7(2):347–366, 1971.

[5] James S. Boswell and Greg A. Olyphant. Modeling the hydrologic response of groundwater dominated wetlands to transient boundary conditions: Implications for wetland restoration. Journal of Hydrology, 332(3):467–476, 2007.

[6] NVIDIA Corp. CUDA C Programming Guide. Technical Report PG-02829-001 v5.5, NVIDIA Corp, July 2013.

[7] Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22(8):786–793, 1973.

[8] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, Carnegie Mellon University, November 1990.

[9] Volodymyr Kindratenko. A guide for implementing tridiagonal solvers on GPUs. In Numerical Computations with GPUs. Springer International Publishing, 2014.

[10] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the GPU. In Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 127–136. ACM Press, January 2010.

[11] Harold S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM, 20(1):27–38, 1973.

[12] Ralph C. Heath. Basic ground-water hydrology. U.S. Geological Survey Water-Supply Paper 2220, 1983.
A CPU CODE

// =======================================================================
// LUFactorizationCPU
// =======================================================================
void LUFactorizationCPU(const CVector &SubDiag, const CVector &MainDiag,
                        const CVector &SuperDiag, CVector &Gamma, CVector &Delta)
{
    Gamma[1] = MainDiag[1];
    Delta[1] = SuperDiag[1] / Gamma[1];

    for (int i = 2; i <= Gamma.Last(); i++)
    {
        Gamma[i] = MainDiag[i] - SubDiag[i]*Delta[i-1];
        Delta[i] = SuperDiag[i] / Gamma[i];
    }
}

// =======================================================================
// LUSolveCPU
// =======================================================================
void LUSolveCPU(const CVector &Gamma, const CVector &Delta,
                const CVector &ASubDiag,
                const CVector &f, CVector &x, CVector &y)
{
    //
    // Forward substitution. Solve Ly = f.
    //
    y[1] = f[1] / Gamma[1];

    for (int i = 2; i <= y.Last(); i++)
    {
        y[i] = (f[i] - ASubDiag[i]*y[i-1]) / Gamma[i];
    }

    //
    // Backward substitution. Solve Ux = y.
    //
    x[x.Last()] = y[y.Last()];

    for (int i = x.Last()-1; i > 0; i--)
    {
        x[i] = y[i] - Delta[i]*x[i+1];
    }
}

Listing A.1: 1 Layer CPU Factorization and Solver
// =======================================================================
// TwoLayerFactorization
// =======================================================================
void TwoLayerFactorization(const double r1, const double r2,
                           CVector &Gamma1, CVector &Delta1,
                           CVector &Gamma2, CVector &Delta2)
{
    //
    // The LU factorization for matrix A1
    //
    const double b1 = 2.0 + r1;
    const double b2 = 2.0 + r2;

    Gamma1[1] = b1;
    Delta1[1] = -1.0 / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        Gamma1[i] = b1 + Delta1[i-1];
        Delta1[i] = -1.0 / Gamma1[i];
    }

    //
    // The UL factorization for matrix A2
    //
    Gamma2[Gamma2.Last()] = b2;

    for (int i = Gamma2.Last()-1; i >= 1; i--)
    {
        Delta2[i] = -1.0 / Gamma2[i+1];
        Gamma2[i] = b2 + Delta2[i];
    }
}

Listing A.2: 2 Layer CPU Factorization
// =======================================================================
// Solve2LayerCPU
// =======================================================================
void Solve2LayerCPU(const CVector &Gamma1, const CVector &Delta1, CVector &y,
                    CVector &f, const CVector &Gamma2, const CVector &Delta2,
                    CVector &z, CVector &u,
                    const double K1, const double K2,
                    const double H1, const double H2)
{
    // Top down, y[1] to the point y[N_1-1].
    // The forward substitution for LU.
    y[1] = f[1] / Gamma1[1];

    for (int i = 2; i <= Gamma1.Last(); i++)
    {
        y[i] = (f[i] + y[i-1]) / Gamma1[i];
    }

    // Bottom up, z[N_1-1] to z[1].
    // The backward substitution for UL.
    z[z.Last()] = f[f.Last()];

    for (int i = 1; i < z.Last(); i++)
    {
        z[z.Last()-i] = f[f.Last()-i] - Delta2[Delta2.Last()-i]*z[z.Last()-i+1];
    }

    // Fix up U_{N_1}
    const int N1 = Gamma1.Last() + 1;

    const double alpha = (K1/H1)*(1.0 - (1.0/Gamma1[Gamma1.Last()]))
                       + (K2/H2)*(1.0 - (1.0/Gamma2[1]));

    u[N1] = ((K1/H1)*y[y.Last()] + (K2/H2)*(z[1]/Gamma2[1])) / alpha;  // update U_{N_1}

    // Advance up from point U_{N_1-1} to U_1.
    // The backward substitution for LU.
    for (int i = N1-1; i >= 1; i--)
    {
        u[i] = y[i] - Delta1[i]*u[i+1];
    }

    // Move down from point U_{N_1+1} to U_{M_2}.
    // The forward substitution for UL.
    for (int i = 1; i <= Gamma2.Last(); i++)
    {
        u[N1+i] = (z[i] + u[N1+i-1]) / Gamma2[i];
    }
}

Listing A.3: 2 Layer CPU Solver
B GPU CODE

// =======================================================================
// ForwardLU_kernel -- compute the forward scan on the GPU
// =======================================================================
__global__ void ForwardLU_kernel(const double * __restrict__ gpuGamma,
                                 const double * __restrict__ gpuFhat,
                                 double *gpuYhat, unsigned int NumberOfElements)
{
    __shared__ double sG[MAXTERMS+1024];
    __shared__ double sF[MAXTERMS+1024];

    const unsigned int gtid0 = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx0  = threadIdx.x;

    double TG0 = 0.0;
    double TF0 = 0.0;

    if (gtid0 < NumberOfElements)
    {
        TG0 = gpuGamma[gtid0];
        TF0 = gpuFhat[gtid0];
    }
    sG[idx0+MAXTERMS] = TG0;
    sF[idx0+MAXTERMS] = TF0;

    if (idx0 < MAXTERMS)
    {
        TG0 = 0.0;
        TF0 = 0.0;
        if (gtid0 > MAXTERMS)
        {
            TG0 = gpuGamma[gtid0-MAXTERMS];
            TF0 = gpuFhat[gtid0-MAXTERMS];
        }
        sG[idx0] = TG0;
        sF[idx0] = TF0;
    }
    __syncthreads();

    double Sum0 = 0.0;

    for (int i = 1; i <= MAXTERMS; i++)
    {
        double F0 = sF[idx0+i];
        double G0 = sG[idx0+i];

        Sum0 = (Sum0 + F0)*G0;
    }

    if (gtid0 < NumberOfElements)
    {
        gpuYhat[gtid0] = Sum0;
    }
}

Listing B.1: GPU Forward LU Solver
// =======================================================================
// BackwardLU_kernel -- compute the backward scan on the GPU
// =======================================================================
__global__ void BackwardLU_kernel(const double * __restrict__ gpuGamma,
                                  const double * __restrict__ gpuYhat,
                                  double *gpuXhat, unsigned int NumberOfElements)
{
    __shared__ double sG[1024+MAXTERMS];
    __shared__ double sY[1024+MAXTERMS];

    const unsigned int gtid   = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int idx    = threadIdx.x;
    const unsigned int warpid = idx >> 5;

    double TG = 0.0;
    double TY = 0.0;
    if (gtid < NumberOfElements)
    {
        TG = gpuGamma[gtid];
        TY = gpuYhat[gtid];
    }
    sG[idx] = TG;
    sY[idx] = TY;

    if (blockIdx.x < (NumberOfElements >> 10))  // if not in the last block
    {
        if (warpid == 31)  // if in the last warp
        {
            TG = gpuGamma[gtid+MAXTERMS];
            TY = gpuYhat[gtid+MAXTERMS];
            sG[idx+MAXTERMS] = TG;
            sY[idx+MAXTERMS] = TY;
        }
    }

    __syncthreads();

    double Sum = 0.0;

    for (int i = MAXTERMS-1; i >= 0; i--)
    {
        double Y = sY[idx+i];
        double G = (i == 0 ? 1.0 : sG[idx+i-1]);

        Sum = (Sum + Y)*G;
    }

    if (gtid < NumberOfElements)
        gpuXhat[gtid] = Sum;
}

Listing B.2: GPU Backward LU Solver