Universitat Politècnica de Catalunya
Facultat d'Informàtica de Barcelona

AMPP Final Project

Smith-Waterman Algorithm Parallelization

Authors: Mário Almeida, Žygimantas Bruzgys, Umit Cavus Buyuksahin
Supervisors: Josep Ramon Herrero Zaragoza, Daniel Jimenez Gonzalez

Barcelona
2012

Contents
1 Introduction

2 Main Issues and Solutions
2.1 Parallelization Techniques
2.1.1 Blocking Technique
2.1.2 Blocking and Interleaving Technique
2.2 Performance Model on Linear Network Topology
2.2.2 Blocking Technique: Optimum B
2.2.4 Blocking and Interleaving Technique: Optimum B
2.2.5 Blocking and Interleaving Technique: Optimum I
2.3 Performance Model on 2D Torus Network Topology
2.4 Implementation

3 Performance Results
3.1 Finding Optimal P and B
3.2 Finding Optimal I

4 Conclusions

A How to Compile

B How to Execute on ALTIX

C Code

1 Introduction
This project presents a parallel implementation of the Smith-Waterman algorithm using the Message Passing Interface (MPI). Smith-Waterman is a well-known algorithm for local sequence alignment, that is, for determining similar regions between two amino-acid sequences.
To find the best alignment between two amino-acid sequences, a matrix H of size N × N is computed, where N is the length of each sequence. Every element of this matrix is derived from a score matrix (the cost of matching two symbols) and a gap penalty for mismatching symbols. Once H is computed, the optimal alignment is obtained by tracing back through the matrix, starting from its highest value.
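As a minimal sketch of the recurrence described above (not the project's code), the following C++ function fills H for two integer-encoded sequences. The names fill_h, score and gap are illustrative assumptions; gap is a negative value that is added whenever a gap is introduced.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: fills the (n+1) x (m+1) Smith-Waterman matrix H.
// a, b  : sequences encoded as indices into the score matrix
// score : score[x][y] is the cost of matching symbol x with symbol y
// gap   : (negative) penalty added when a gap is introduced
std::vector<std::vector<int>> fill_h(const std::vector<int>& a,
                                     const std::vector<int>& b,
                                     const std::vector<std::vector<int>>& score,
                                     int gap) {
    const std::size_t n = a.size(), m = b.size();
    // Row 0 and column 0 stay zero: they act as the predecessors of the
    // first real row and column.
    std::vector<std::vector<int>> h(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int diag = h[i - 1][j - 1] + score[a[i - 1]][b[j - 1]];
            const int up   = h[i - 1][j] + gap;
            const int left = h[i][j - 1] + gap;
            h[i][j] = std::max({0, diag, up, left});  // never below zero
        }
    }
    return h;
}
```

Each element depends only on its upper, left and upper-left neighbours, which is the property the parallelization in the following sections relies on.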
In our parallel implementation only the calculation of the H matrix was parallelized, as it is our only interest. The traceback was removed from both the parallel and the sequential code in order to obtain comparable computation times. The parallelization uses pipelining: each process communicates with the next one after calculating B columns of its N/P rows. This is called blocking, and we introduced a parameter that makes it easy to change B. An interleaving parameter I was added later.
During this project several performance models were created: one for a linear interconnection network and one for a 2D torus network. Both models include the parameters B and I. From the models we derived the optimum B and I, and we then ran performance tests to find those two parameters empirically.

2 Main Issues and Solutions
This section describes our parallelization solutions. First, the solution with blocking at column level is explained together with its performance model. Then the solution with both blocking at column level and interleaving at row level is explained, again with its performance model, and the optimum blocking factor B and interleaving factor I are derived. The second part of the section describes the performance models of both solutions on a different network topology, together with the corresponding optimum B and I. Finally, our C++ implementation of these techniques is presented and explained.


2.1 Parallelization Techniques
2.1.1 Blocking Technique

Figure 1: Parallelization approach by introducing blocking at column level

The P processes share the matrix M in terms of consecutive rows. To calculate the matrix M of size N × N, each process $P_i$ works on N/P consecutive rows of the matrix. With the blocking technique, the columns are divided into blocks of a defined size B, so each process has to calculate N/B blocks. These parameters are visualized in Figure 1. The top part of the figure shows how the elements of the matrix are divided between processes, and the bottom part shows how the calculations are parallelized. When the first process has computed its first block, which has size N/P × B, it communicates with the next process. The next process then starts calculating its own first block while the first process continues with its next block, and so on. This type of parallelization is called pipelining: the problem is divided into a series of tasks that have to be completed one after the other. Before explaining the parallelization in detail, we analyze the data and task dependencies between the processes that calculate the matrix.
Figure 2 shows the data dependency for a particular matrix element. To calculate a matrix element M[i][j], a process needs the element M[i][j − 1] from the previous column and the elements M[i − 1][j − 1] and M[i − 1][j] from the previous row. If the previous row is calculated by the process $P_i$, then that row is sent to the process $P_{i+1}$ after $P_i$ has calculated a block of size N/P × B. This introduces both a data and a task dependency: the process $P_{i+1}$ cannot start its calculations until $P_i$ sends the last row of its block, which is needed to calculate the corresponding block of $P_{i+1}$. To calculate the first row and the first column of the matrix, the predecessor row and column are considered to be filled with zeros.

Figure 2: Data dependency for calculating one matrix element

Figure 3: Data dependencies between blocks of the matrix

Figure 3 shows the parallelism of the matrix calculation at the level of blocks. The squares represent the blocks of the matrix and the three arrows show the data dependencies between blocks. As mentioned before, an element needs its upper, left, and upper-left neighbours to be calculated. Therefore, blocks on the same minor diagonal are independent of each other, so they can be, and are, calculated in parallel.
The steps of the calculation are as follows:

1. The process waits until the previous process finishes the calculation of a block (if applicable);
2. The process receives the last row of the block that was calculated by the previous process;
3. After receiving that row, the process has all the information it needs, so it calculates its own block;
4. When the process finishes the calculation, it sends the last row of its block to the next process (if applicable);
5. The process repeats these steps until it has calculated all of its blocks, that is, all the rows that are assigned to it.
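The steps above can be sketched without MPI by visiting the blocks in pipeline order. This is an illustration under stated assumptions, not the authors' implementation: the whole matrix lives in one address space, and the names Matrix and blocked_fill are hypothetical. P is the number of simulated processes (row strips) and B is the block width. At step s the simulated process p handles block s − p, so all blocks on one minor diagonal are handled in the same step.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Hypothetical helper: fills H block by block in pipeline (wavefront) order.
// Returns the number of pipeline steps that were needed.
int blocked_fill(const std::vector<int>& a, const std::vector<int>& b,
                 const Matrix& score, int gap, int P, int B, Matrix& h) {
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    const int rows = (n + P - 1) / P;    // rows per process (last may be short)
    const int blocks = (m + B - 1) / B;  // blocks per process
    h.assign(n + 1, std::vector<int>(m + 1, 0));
    int steps = 0;
    for (int s = 0; s < blocks + P - 1; ++s, ++steps) {
        // All pairs (p, k) with p + k == s are independent of each other.
        for (int p = 0; p < P; ++p) {
            const int k = s - p;
            if (k < 0 || k >= blocks) continue;  // process p is idle in this step
            const int i_end = std::min(n, (p + 1) * rows);
            const int j_end = std::min(m, (k + 1) * B);
            for (int i = p * rows + 1; i <= i_end; ++i)
                for (int j = k * B + 1; j <= j_end; ++j)
                    h[i][j] = std::max({0,
                        h[i - 1][j - 1] + score[a[i - 1]][b[j - 1]],
                        h[i - 1][j] + gap,
                        h[i][j - 1] + gap});
        }
    }
    return steps;
}
```

The result is identical to a plain row-by-row fill, and the returned step count equals the number of blocks per process plus P − 1, which is the pipeline length used in the performance model.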

2.1.2 Blocking and Interleaving Technique

Figure 4: Matrix calculation with interleaving factor, when I = 2

This parallelization method adds an interleaving factor to the blocking technique described above. The matrix is divided into I parts, so that each part has N × N/I elements. Every part is then calculated as explained in the previous section, that is, using the blocking technique. As soon as a process finishes the rows assigned to it in the first interleave part, it continues with its blocks from the next interleave part. For example, in Figure 4, where the interleaving factor is I = 2, the matrix is divided into two smaller parts. Each process $P_i$ calculates N/(P · I) rows of one part before moving to the second part.

The steps of the calculation are very similar to those of the blocking technique and are as follows:

1. The process waits until the previous process finishes the calculation of a block (if applicable);

2. The process receives the last row of the block that was calculated by the previous process;

3. After receiving that row, the process has all the information it needs, so it calculates its own block;

4. When the process finishes the calculation, it sends the last row of its block to the next process. If the process is the last one and there is another interleave part to calculate, it sends the row to the first process. Otherwise it does not send anything;

5. The process repeats these steps until it has calculated all blocks within the current interleave part, that is, all rows that are assigned to it within that part. If there is another interleave part to calculate, it moves to the next interleave part and repeats these steps until all blocks from all interleave parts are calculated.
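To make the row assignment concrete, the following sketch maps a matrix row to the interleave part and the process that compute it. It assumes that N is divisible by P · I; the names Owner and owner_of_row are hypothetical.

```cpp
// Hypothetical helper types; assumes N % (P * I) == 0 and 0 <= row < N.
struct Owner {
    int interleave;  // which of the I parts the row belongs to
    int rank;        // which process computes it
    int local_row;   // row index inside that process's chunk
};

Owner owner_of_row(int row, int N, int P, int I) {
    const int chunk = N / (P * I);    // rows per process per interleave part
    const int part_rows = chunk * P;  // rows in one interleave part
    return Owner{row / part_rows, (row % part_rows) / chunk, row % chunk};
}
```

For example, with N = 16, P = 4 and I = 2 every chunk has 2 rows: row 7 is the last row of the first part and belongs to process 3, while row 9 belongs to process 0 again, now in the second part.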

2.2 Performance Model on Linear Network Topology
In this section we describe the performance model of our implementation with the blocking technique for a linear network topology. In later sections we compare it with a non-linear topology, taking into account the differences in the performance models.
In order to focus on the main objectives of this performance analysis, we only take into account the parallel algorithms used for the matrix calculation. Parts of the code that run sequentially on a single process, such as opening and reading the input files, are ignored in this model.
Some assumptions were made in the models for the different network topologies, such as the assumption that the creation of new processes is location aware with respect to their place in the network, which makes communication more efficient.
All the performance models described in this section use the following notation:

• $t_s$: startup time (preparing the message, running the routing algorithm, and the interface between the local node and the router).

• $t_c$: computation time for each value of the matrix.

• $t_w$: transfer time per word.

• $T_{comm}$: total communication time.

• $T_{comp}$: total computation time.

Figure 5: Communication and computation times of matrix parallel calcula-
tions by process using the blocking technique.

The diagram in Figure 5 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the required communications. The different steps are shown in different colors. Blue represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed for the matrix calculations in each block, and yellow represents the time taken to send the last row of a block to the next process.
To simplify the diagram, the time the last process needs to receive the last row of a block from the previous process is already included in the upper yellow area. This explains why the last process has no yellow areas in its timeline but still has to wait to receive the rows needed for its matrix calculations. All of this is considered in the performance model.
As the diagram shows, the communication time of this model consists of the scattering of the protein sequence vector (blue area) and the communications that send the last row of each block to the next process (yellow). The scatter method [2] receives a vector of size N and delivers a vector of size N/P to each process. The scattering time is given by:

$$T_{scatter} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w \tag{1}$$
Sending the last row of a block to the next process costs the communication startup time ($t_s$) plus the transfer time of the B elements in that row. This is given by:

$$T_{rowComm} = t_s + B \cdot t_w \tag{2}$$

In the total communication time, this startup and transfer cost is paid N/B times for the first process and an extra P − 1 times for the remaining pipeline stages of the other processes. Since the last process does not need to send its last row to another process, we count P − 2 extra communications instead. So the total communication time is given by:

$$T_{comm} = T_{scatter} + \left(\frac{N}{B} + P - 1 - 1\right) \cdot T_{rowComm}$$

$$T_{comm} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \tag{3}$$
The next step is to calculate the total computation time. Since a block consists of N/P rows and B columns, the total number of elements in a block is B · N/P. This means that the computation time of a single block is given by:

$$T_{compBlock} = t_c \cdot B \cdot \frac{N}{P} \tag{4}$$

As with the total communication time, this computation time is multiplied by N/B + P − 1 to account for the blocks of all the processes:

$$T_{comp} = \left(\frac{N}{B} + P - 1\right) \cdot T_{compBlock}$$

$$T_{comp} = \left(\frac{N}{B} + P - 1\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P}\right) \tag{5}$$
To conclude this performance model, the total parallelization time is the sum of the total computation and communication times:

$$T_{parallel} = T_{comp} + T_{comm}$$

$$T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P} + t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \tag{6}$$
2.2.2 Blocking Technique: Optimum B
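To find the optimum B for fixed values of N and P, assuming that N is much bigger than P, we need the value of B for which the total parallel time of computation and communication is smallest. This value is found by differentiating the total parallelization time with respect to B and finding the value of B for which the derivative is equal to zero.

$$\begin{aligned}
\frac{dT_{parallel}}{dB} = 0 &\Leftrightarrow -N \left(\frac{t_c B N}{P} + t_s + B t_w\right) B^{-2} + \left(\frac{N}{B} + P - 2\right)\left(\frac{t_c N}{P} + t_w\right) + \frac{t_c N}{P} = 0 \\
&\Leftrightarrow B = \sqrt{\frac{N \cdot t_s \cdot P}{P \cdot t_c \cdot N + P^2 \cdot t_w - t_c \cdot N - 2 \cdot t_w \cdot P}} \\
&\Leftrightarrow B = \sqrt{\frac{N \cdot t_s \cdot P}{t_c \cdot N \cdot (P - 1) + P \cdot t_w \cdot (P - 2)}} \\
&\Leftrightarrow B = \sqrt{\frac{t_s}{\frac{t_w \cdot (P - 2)}{N} + \frac{t_c \cdot (P - 1)}{P}}}
\end{aligned}$$

For N ≫ P:

$$B \approx \sqrt{\frac{t_s}{t_c}} \tag{7}$$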
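For reference, the intermediate steps can be written out as follows. This is a worked version of the same derivation: it expands equation (6), groups the terms by their dependence on B, and then differentiates.

```latex
\begin{align*}
T_{parallel} &= (P-1)\,\frac{t_c N}{P}\,B + \frac{N t_s}{B} + (P-2)\,t_w\,B
              + \text{(terms that do not depend on } B\text{)} \\
\frac{dT_{parallel}}{dB} &= (P-1)\,\frac{t_c N}{P} - \frac{N t_s}{B^2} + (P-2)\,t_w = 0 \\
B^2 &= \frac{N\,t_s\,P}{t_c\,N\,(P-1) + P\,t_w\,(P-2)}
\end{align*}
```

The second derivative, $2 N t_s / B^3$, is positive for every B > 0, so this stationary point is a minimum of the total time.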


In this section we describe the performance model of our implementation with the blocking and interleaving techniques for a linear network topology. In later sections we compare it with a non-linear topology, taking into account the differences in the performance models. As in the previous model, we use the notation introduced above and only take into account the parallel algorithms used for the matrix calculation.

Figure 6: Communication and computation times of matrix parallel calcula-
tions by process using the blocking and interleaving techniques.

The diagram in Figure 6 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the required communications. The different steps are shown in different colors. Blue represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed for the matrix calculations in each block, and yellow represents the time taken to send the last row of a block to the next process.
To simplify the diagram, the time the last process in the last interleave needs to receive the last row of a block from the previous process is already included in the upper yellow area. This explains why this last process has no yellow areas in its timeline but still has to wait to receive the rows needed for its matrix calculations. All of this is considered in the performance model.
As the diagram shows, the communication time of this model consists of the scattering of a part of the protein sequence vector (blue area) for each interleave and the communications that send the last row of each block to the next process (yellow). The scatter method receives a vector of size N and delivers a vector of size N/(P · I) to each process per interleave. The scattering time is given by:

$$T_{scatter} = t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w \tag{8}$$

This scattering is done for each interleave, so $T_{scatter}$ has to be multiplied by I:

$$T_{Tscatter} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right)$$
Sending the last row of a block to the next process costs the communication startup time ($t_s$) plus the transfer time of the B elements in that row. This is given by:

$$T_{rowComm} = t_s + B \cdot t_w \tag{9}$$

To describe the total communication time clearly, we split it into the communication time of the first I − 1 interleaves and the special case of the last interleave. Each of the first I − 1 interleaves introduces N/B extra yellow areas. This means that the communication time for all the startups and transfers of the first I − 1 interleaves is given by:

$$T_{commInter} = (I - 1) \cdot \frac{N}{B} \cdot T_{rowComm}$$

$$T_{commInter} = (I - 1) \cdot \frac{N}{B} \cdot (t_s + B \cdot t_w) \tag{10}$$

The case of the last interleave is slightly different: we must take into account the typical extra P − 1 communications of a pipeline, caused by its different stages. Since in our implementation the last process does not need to send its last row to another process, there are only P − 2 extra communications. So the communication time for all the startups and transfers of the last interleave is given by:

$$T_{commLastInter} = \left(\frac{N}{B} + P - 2\right) \cdot T_{rowComm}$$

$$T_{commLastInter} = \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \tag{11}$$

With these formulas we can describe the total communication time as the sum of the scattering times and the startup and transfer times of all the interleaves:

$$T_{comm} = T_{Tscatter} + T_{commInter} + T_{commLastInter}$$

$$T_{comm} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + (I - 1) \cdot \frac{N}{B} \cdot (t_s + B \cdot t_w) + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w)$$

$$T_{comm} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 2\right)\right) \cdot (t_s + B \cdot t_w) \tag{12}$$
The next step is to calculate the total computation time. Since a block consists of N/(P · I) rows and B columns, the total number of elements in a block is B · N/(P · I). This means that the computation time of a single block is given by:

$$T_{compBlock} = t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{13}$$

As with the total communication time, we have to take into account how the interleaving affects the computation. For the first I − 1 interleaves the computation time is given by:

$$T_{compInter} = (I - 1) \cdot \frac{N}{B} \cdot t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{14}$$

Unlike the communication time, the last interleave contributes exactly N/B + P − 1 block computations. This means that the total computation time is given by:

$$T_{comp} = \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 1\right)\right) \cdot t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{15}$$
To conclude this performance model, the total parallelization time is the sum of the total computation and communication times:

$$T_{parallel} = T_{comp} + T_{comm}$$

$$T_{parallel} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 2\right)\right) \times \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{16}$$

2.2.4 Blocking and Interleaving Technique: Optimum B
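To find the optimum B for fixed values of N, P and I, assuming that N is much bigger than P, we need the value of B for which the total parallel time is smallest, that is, the value of B for which the derivative is equal to zero.

$$\begin{aligned}
\frac{dT_{parallel}}{dB} = 0 \Leftrightarrow{} & \left(-\frac{(I - 1) N}{B^2} - \frac{N}{B^2}\right) \cdot \left(\frac{t_c B N}{I P} + t_s + B t_w\right) \\
& + \left(\frac{(I - 1) N}{B} + \frac{N}{B} + P - 2\right) \cdot \left(\frac{t_c N}{I P} + t_w\right) + \frac{t_c N}{I P} = 0
\end{aligned}$$

$$\Leftrightarrow B = \sqrt{\frac{N t_s P I^2}{P t_c N + P^2 t_w I - t_c N - 2 t_w I P}} \tag{17}$$

For N ≫ P:

$$B \approx \sqrt{\frac{I N t_s}{t_w}} \tag{18}$$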

2.2.5 Blocking and Interleaving Technique: Optimum I
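To find the optimum I for fixed values of N, P and B, assuming that N is much bigger than P, we need the value of I for which the total parallel time is smallest, that is, the value of I for which the derivative is equal to zero.

$$\frac{dT_{parallel}}{dI} = 0 \Leftrightarrow$$

$$\Leftrightarrow I = \sqrt{\frac{N t_c (P - 1) B^2}{P \left(B t_s \log(P) + N t_s + N B t_w\right)}} \tag{19}$$

Taking (P − 1)/P ≈ 1:

$$I \approx \sqrt{\frac{N t_c B^2}{B t_s \log(P) + N t_s + N B t_w}} \tag{20}$$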
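The closed form for I can be checked against the model of equation (16) numerically. The sketch below is an illustration under stated assumptions: the names Params, t_parallel_interleaved and optimum_i are hypothetical, the logarithm is assumed to be base 2, and the closed form is written with the square root that results from setting the derivative of (16) with respect to I to zero.

```cpp
#include <cmath>

// Machine parameters of the model (values are supplied by the caller).
struct Params {
    double ts;  // startup time per message
    double tw;  // transfer time per word
    double tc;  // computation time per matrix element
};

// Equation (16): blocking and interleaving on a linear network (log base 2 assumed).
double t_parallel_interleaved(double N, double P, double B, double I, const Params& m) {
    const double scatter = I * (m.ts * std::log2(P) + N / (P * I) * (P - 1) * m.tw);
    const double steps = (I - 1) * (N / B) + (N / B + P - 2);
    const double comp_block = m.tc * B * N / (P * I);
    return scatter + steps * (comp_block + m.ts + B * m.tw) + comp_block;
}

// Equation (19): stationary point of (16) with respect to I.
double optimum_i(double N, double P, double B, const Params& m) {
    const double num = N * m.tc * (P - 1) * B * B;
    const double den = P * (B * m.ts * std::log2(P) + N * m.ts + N * B * m.tw);
    return std::sqrt(num / den);
}
```

As a function of I, equation (16) has the shape c1 · I + c2 / I plus a constant, so the value returned by optimum_i is its only minimum for I > 0. In practice I has to be an integer, so the nearest integers on both sides should be compared.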

2.3 Performance Model on 2D Torus Network Topology
Assuming that the spawning of processes is location aware in terms of the network topology, the only difference between the linear topology of the previous sections and the 2D torus network topology is in the scattering of data [1]. So the new performance model for this topology is given by:

$$T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P} + 2 \cdot t_s \cdot \log(\sqrt{P}) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \tag{21}$$

Although the scattering of data is done faster, it does not depend on the variable B, so it does not affect the calculation of the optimum B. The optimum B therefore remains the following:

$$B \approx \sqrt{\frac{t_s}{t_c}} \tag{22}$$

For the blocking and interleaving technique, let us again assume that the spawning of processes is location aware in terms of the network topology. This means that the only difference between the linear topology of the previous sections and the 2D torus network topology is in the scattering of data. So the new performance model for this topology is given by:

$$T_{parallel} = I \cdot 2 \cdot \left(t_s \cdot \log(\sqrt{P}) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 2\right)\right) \times \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{23}$$

Just as in the blocking technique, the scattering is not affected by B, but it is affected by I: the scattering depends on the level of interleaving. So the new equation for the optimum I is given by:

$$I \approx \sqrt{\frac{N t_c B^2}{2 B t_s \log(\sqrt{P}) + N t_s + N B t_w}} \tag{24}$$

The corresponding optimum B is given by:

$$B \approx \sqrt{\frac{I N t_s}{t_w}} \tag{25}$$

Taking into account the properties of logarithms, we deduce that the optimum I is the same for both network topologies. The only difference between the two is the time needed to perform the scattering.
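The logarithmic property referred to here can be written out explicitly. The identity below holds for any base of the logarithm:

```latex
\[
2\,B\,t_s \log\sqrt{P} \;=\; 2\,B\,t_s \cdot \tfrac{1}{2}\log P \;=\; B\,t_s \log P
\]
```

The first term in the denominator of the torus expression for I therefore has the same value as the corresponding term of the linear-topology expression, which is why the two optima coincide.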

2.4 Implementation
In this section our solution is presented and explained. Compared to the provided sequential code, our solution requires two extra parameters, B and I, where B is the blocking factor and I is the interleaving factor. To run without interleaving, the I parameter should be set to 1.
In our solution, all required data is first read by the root process and then broadcast or scattered to the other processes. Vector A is scattered to all of the processes. How much data is scattered to each process depends on the I parameter and the number of processes: every process receives N/(I · P) elements before computing each of the interleave parts. Usually N is not divisible by I · P, so padding is introduced. The number of elements that each process receives during the scatter is calculated and stored as follows:


sizeA = N % (total_processes * I) != 0 ? N +
(total_processes * I) - (N % (total_processes * I)) : N;
chunk_size = sizeA / (total_processes * I);
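To illustrate the padding with concrete numbers, the sketch below wraps the two expressions above in a small function. The names Padding and pad_for are hypothetical and not part of the project code.

```cpp
// Hypothetical wrapper around the padding computation shown above.
struct Padding {
    int sizeA;       // length of vector A after padding
    int chunk_size;  // elements each process receives per interleave part
};

Padding pad_for(int N, int total_processes, int I) {
    const int parts = total_processes * I;
    // Round N up to the next multiple of total_processes * I.
    const int sizeA = (N % parts != 0) ? N + parts - (N % parts) : N;
    return Padding{sizeA, sizeA / parts};
}
```

For example, with N = 10001, 8 processes and I = 2, the vector is padded to sizeA = 10016 (15 extra elements) and every process receives chunk_size = 626 elements per interleave part.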

Then the root process reads the data and shares it as follows:
// Broadcast the Similarity Matrix
MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
// Broadcast the size of the portion of vector A that each process receives during the scatter
MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
// Broadcast N, B, I and DELTA parameters
MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

Next, each process allocates space for its portion of the H matrix, its portion of the A vector and the whole B vector. Note that in our solution a process does not allocate the full-sized H matrix, but only the portion of the matrix where it writes its results. The sum of the sizes of all H portions distributed over the processes is therefore N × N + N + N · (P · I): the whole matrix, the initial column filled with zeros, and the extra rows in which the processes receive data from other processes. The portions are stored in a three-dimensional array whose first dimension is the interleave ID and whose remaining dimensions are the row and the column. The memory is allocated and mapped, and the B vector is broadcast, as follows:
// One contiguous buffer holds, for each of the I interleave parts,
// chunk_size rows plus one extra row received from the previous process
CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * N * (chunk_size + 1) * I)));
CHECK_NULL((chunk_h = (int **) malloc(sizeof(int *) * (chunk_size + 1) * I)));
CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int **) * I)));
// Map every row pointer onto the contiguous buffer
for (int i = 0; i < (chunk_size + 1) * I; i++)
    chunk_h[i] = chunk_hptr + i * N;

// Map every interleave part onto its first row pointer
for (int i = 0; i < I; i++)
    chunk_ih[i] = chunk_h + i * (chunk_size + 1);

CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * (chunk_size))));
if (rank != 0) { // The root process already has the B vector
    CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
}

MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

Next, each process calculates how many blocks there are in total and what the size of the final block is. This is needed because N is usually not divisible by B, so the final block is usually smaller than the rest. The time that marks the beginning of the computation is stored in the variable start. In the main loop, which iterates over the interleaves, each process receives its portion of the A vector. The main loop is repeated I times, as explained earlier in the section describing the blocking and interleaving technique.
int total_blocks = N / B + (N % B == 0 ? 0 : 1);
int last_block_size = N % B == 0 ? B : N % B;
MPI_Status status;

int start, end;
start = getTimeMilli();

for (int current_interleave = 0; current_interleave < I; current_interleave++) {
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT,
                0, MPI_COMM_WORLD);

    int current_column = 1;
    // Fill first column with 0
    for (int i = 0; i < chunk_size + 1; i++)
        chunk_ih[current_interleave][i][0] = 0;

Then the main calculations begin. First, the process checks whether it has to receive data from another process. If so, it receives the row required for the calculations. Then it processes the cells of the current block and stores the results in its own array, which is gathered later. Finally, the process checks whether it has to send data to another process. If so, it sends the last row of the current block to that process. The process repeats these actions total_blocks times. At the end, it saves the time after execution in the end variable.
    for (int current_block = 0; current_block < total_blocks; current_block++) {
        // Receive
        int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) {
            // The very first row strip has no predecessor: use a row of zeros
            for (int k = current_column; k < block_end; k++) {
                chunk_ih[current_interleave][0][k] = 0;
            }
        } else {
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
            int size_to_receive = current_block == total_blocks - 1 ? last_block_size : B;
            MPI_Recv(chunk_ih[current_interleave][0] + current_block * B, size_to_receive,
                     MPI_INT, receive_from, 0, MPI_COMM_WORLD, &status);
            if (DEBUG) printf("[%d] Received from %d: ", rank, receive_from);
            if (DEBUG)
                print_vector(chunk_ih[current_interleave][0] + current_block * B,
                             size_to_receive);
        }
        // Process
        for (int j = current_column; j < block_end; j++, current_column++) {
            for (int i = 1; i < chunk_size + 1; i++) {
                int diag = chunk_ih[current_interleave][i - 1][j - 1]
                           + sim[chunk_a[i - 1]][b[j - 1]];
                int down = chunk_ih[current_interleave][i - 1][j] + DELTA;
                int right = chunk_ih[current_interleave][i][j - 1] + DELTA;
                int max = MAX3(diag, down, right);
                chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
            }
        }

        // Send
        if (current_interleave != I - 1 || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1 ? last_block_size : B;
            MPI_Send(chunk_ih[current_interleave][chunk_size] + current_block * B,
                     size_to_send, MPI_INT, send_to, 0, MPI_COMM_WORLD);
            if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
            if (DEBUG)
                print_vector(chunk_ih[current_interleave][chunk_size] + current_block * B,
                             size_to_send);
        }
    }
}
end = getTimeMilli();

When all the calculations are finished, all processes start the gather. After the gather has executed, the root process holds the whole H matrix. The root process then prints the execution time to the stderr stream and, if debug is enabled, prints the H matrix.
for (int i = 0; i < I; i++) {
    // Row 0 of each interleave part holds the row received from the
    // previous process, so the gathered data starts at row 1
    MPI_Gather(chunk_ih[i][1], N * chunk_size, MPI_INT,
               hptr + i * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}

if (rank == 0) {
    fprintf(stderr, "Execution: %f s\n", (double) (end - start) / 1000000);
}

if (DEBUG) {
    if (rank == 0) {
        for (int i = 0; i < N - 1; i++) {
            print_vector(h[i], N);
        }
    }
}

MPI_Finalize();

The full code is provided in the Appendix section.


3 Performance Results
This section presents the performance results of our implementation on ALTIX and compares them with the performance of the sequential code.

3.1 Finding Optimal P and B
To find the optimal P and B, we tested the application with different P and B parameters for N = 10,000. Before that we ran the sequential code, which took 12.598 seconds. The execution times of the parallelized version are shown in Figure 7.

Figure 7: Performance results with different P and B where N = 10,000

From this it can be concluded that with the parameters N = 10,000, B = 100, P = 8 and I = 1 the parallel code performed the calculations about 9 times faster than the sequential code.

3.2 Finding Optimal I
To find the optimal I, we took the best result from the previous test, where P = 8, and ran the test with different I and B parameters. The result is shown in Figure 8.

Figure 8: Performance results with different I and B where N = 10,000 and P = 8

Because environmental factors such as network congestion affect our performance tests, the results might not be completely accurate. With that caveat, we deduced from the results that the optimal parameter configuration for N = 10,000 is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code. Finally, we tested the parallel code with N = 25,000 and the parameters that we found to be optimal. The parallel code ran for 11.822213 seconds, whereas the sequential code ran for 76.884 seconds, so the parallel code runs 6.5 times faster. The speedup is lower because, as stated earlier, the optimal B and I depend on N, so the parameter configuration is not optimal for comparing sequences of size N = 25,000.

4 Conclusions
During this project a parallel implementation of the Smith-Waterman algorithm was developed using blocking and interleaving techniques. The techniques and the code were explained in detail. Performance models were derived for both the linear and the 2D torus network topologies. For each topology we also derived the equations for the optimum blocking factor B when using the blocking technique, and for the optimum B and interleaving factor I when using the blocking and interleaving technique. From the models we concluded that the calculation of the B and I factors for our algorithm is the same on these two network topologies.
Performance tests were run using multiple processes on different processors. We found that the optimal configuration for calculating the sequence alignment of two vectors of size N = 10,000 with our implementation is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code. With the same parameter configuration the parallel code calculates the matrix of size N = 25,000 6.5 times faster than the sequential code.


References
[1] Peter Harrison, William Knottenbelt. Parallel Algorithms. Department of Computing, Imperial College London, 2009.

[2] Norm Matloff. Programming on Parallel Machines. University of California, Davis, 2011.


A How to Compile
all: seq par

seq:
	gcc SW.c -o seq.out

par:
	icc protein.cpp -o protein.out -lmpi

B How to Execute on ALTIX
#!/bin/bash
# @ job_name = ampp01parallel
# @ initialdir = .
# @ output = mpi_%j.out
# @ error = mpi_%j.err
# @ total_tasks = <number_of_process>
# @ wall_clock_limit = 00:01:00
mpirun -np <number_of_process> ./protein.out <vector_a> <vector_b> <similarity_matrix> <gap_penalty> <N> <B> <I>

C Code
#include <stdio.h>
#include <stdlib.h> // malloc, calloc, atoi, exit
#include <ctype.h> // character handling
#include <sys/time.h>
#include "mpi.h"

#define DEBUG 1

#define MAX_SEQ 50

#define CHECK_NULL(_check) { \
    if ((_check) == NULL) { \
        fprintf(stderr, "Null Pointer allocating memory\n"); \
        exit(-1); \
    } \
}

#define AA 20 // number of amino acids
#define MAX2(x,y) ((x)<(y) ? (y) : (x))
#define MAX3(x,y,z) (MAX2(x,y)<(z) ? (z) : MAX2(x,y))
#define MIN2(x,y) ((x)>(y) ? (y) : (x))

// function prototypes
int getTimeMilli();
void read_pam(FILE* pam);
void read_files(FILE* in1, FILE* in2);
void print_vector(int* vector, int size);
void print_short_vector(short* vector, int size);
void memcopy(int* src, int* dst, int count);

/* begin AMPP*/
int char2AAmem[256];
int AA2charmem[AA];
void initChar2AATranslation(void);
/* end AMPP */

/* Define global variables */
int rank, total_processes;
int DELTA;
short *a, *b;
int *chunk_hptr;
int **chunk_h, ***chunk_ih;
int *sim_ptr, **sim; // PAM similarity matrix
int N, sizeA, B, I, chunk_size;
short *chunk_a;
int* hptr;
int** h;
FILE *pam;

int main(int argc, char *argv[]) {
    /* begin AMPP */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &total_processes);

    // Similarity matrix: one contiguous buffer plus row pointers into it
    CHECK_NULL((sim_ptr = (int *) malloc(AA * AA * sizeof(int))));
    CHECK_NULL((sim = (int **) malloc(AA * sizeof(int*))));
    for (int i = 0; i < AA; i++)
        sim[i] = sim_ptr + i * AA;

    if (rank == 0) {
        FILE *in1, *in2;
        /**** Error handling for input file ****/
        if (!(argc >= 5 && argc <= 8)) {
            fprintf(stderr,
                    "%s protein1 protein2 PAM gapPenalty [N] [B] [I]\n",
                    argv[0]);
            exit(1);
        } else {
            in1 = fopen(argv[1], "r");
            in2 = fopen(argv[2], "r");
            N = (argc > 5 ? atoi(argv[5]) : MAX_SEQ) + 1;
            B = argc > 6 ? atoi(argv[6]) : total_processes;
            I = argc > 7 ? atoi(argv[7]) : 1;
            DELTA = atoi(argv[4]);
        }
        /* end AMPP */

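        /* begin AMPP */
        // Pad the row count up to a multiple of total_processes * I
        sizeA = N % (total_processes * I) != 0
            ? N + (total_processes * I) - (N % (total_processes * I))
            : N;
        CHECK_NULL((a = (short *) calloc(sizeA, sizeof(short))));
        // Assumption: the listing never allocates b, yet read_files()
        // writes up to N elements into it, so it is allocated here
        CHECK_NULL((b = (short *) calloc(N, sizeof(short))));

        initChar2AATranslation();
        read_files(in1, in2);
        chunk_size = sizeA / (total_processes * I);

        // Full matrix on rank 0: sizeA rows of N columns
        CHECK_NULL((hptr = (int *) calloc(N * sizeA, sizeof(int))));
        CHECK_NULL((h = (int **) calloc(sizeA, sizeof(int*))));
        for (int i = 0; i < sizeA; i++)
            h[i] = hptr + i * N;

        pam = fopen(argv[3], "r");
        read_pam(pam);
    }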

    MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

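    // Each interleave owns chunk_size + 1 rows of N columns; row 0 holds
    // the boundary row received from the previous process
    CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * (N) *
            (chunk_size + 1) * I)));
    CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*) *
            (chunk_size + 1) * I)));
    CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int**) * I)));
    for (int i = 0; i < (chunk_size + 1) * I; i++)
        chunk_h[i] = chunk_hptr + i * N;

    for (int i = 0; i < I; i++)
        chunk_ih[i] = chunk_h + i * (chunk_size + 1);

    CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * (chunk_size))));
    if (rank != 0) {
        // Assumption: this branch is empty in the listing; b must hold N
        // elements before the broadcast below, so the non-root ranks
        // allocate it here
        CHECK_NULL((b = (short *) malloc(sizeof(short) * N)));
    }

    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);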

    /*** PARALLEL PART ***/
    /** compute "h" local similarity array **/
    int total_blocks = N / B + (N % B == 0 ? 0 : 1);
    int last_block_size = N % B == 0 ? B : N % B;
    MPI_Status status;

    int start, end;
    start = getTimeMilli();

    for (int current_interleave = 0; current_interleave < I;
            current_interleave++) {
        MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT,
                0, MPI_COMM_WORLD);

        int current_column = 1;
        // Fill first column with 0
        for (int i = 0; i < chunk_size + 1; i++)
            chunk_ih[current_interleave][i][0] = 0;

        for (int current_block = 0; current_block < total_blocks;
                current_block++) {
            // Receive
            int block_end = MIN2(current_column
                    - (current_block == 0 ? 1 : 0) + B, N);
            if (rank == 0 && current_interleave == 0) {
                for (int k = current_column; k < block_end; k++) {
                    chunk_ih[current_interleave][0][k] = 0;
                }
            } else {
                int receive_from = rank == 0 ? total_processes - 1
                    : rank - 1;
                int size_to_receive = current_block == total_blocks - 1
                    ? last_block_size : B;
                MPI_Recv(chunk_ih[current_interleave][0]
                        + current_block * B, size_to_receive, MPI_INT,
                        receive_from, 0, MPI_COMM_WORLD, &status);
                if (DEBUG) printf("[%d] Received from %d: ", rank,
                        receive_from);
                if (DEBUG)
                    print_vector(chunk_ih[current_interleave][0]
                            + current_block * B, size_to_receive);
            }
            // Process
            for (int j = current_column; j < block_end; j++,
                    current_column++) {
                for (int i = 1; i < chunk_size + 1; i++) {
                    int diag = chunk_ih[current_interleave][i - 1][j - 1]
                        + sim[chunk_a[i - 1]][b[j - 1]];
                    int down = chunk_ih[current_interleave][i - 1][j]
                        + DELTA;
                    int right = chunk_ih[current_interleave][i][j - 1]
                        + DELTA;
                    int max = MAX3(diag, down, right);
                    chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
                }
            }

            // Send
            if (current_interleave != I - 1 || rank + 1 !=
                    total_processes) {
                int send_to = rank + 1 == total_processes ? 0 : rank + 1;
                int size_to_send = current_block == total_blocks - 1
                    ? last_block_size : B;
                MPI_Send(chunk_ih[current_interleave][chunk_size]
                        + current_block * B, size_to_send, MPI_INT,
                        send_to, 0, MPI_COMM_WORLD);
                if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
                if (DEBUG)
                    print_vector(chunk_ih[current_interleave][chunk_size]
                            + current_block * B, size_to_send);
            }
        }
    }
    end = getTimeMilli();

    // Collect rows 1..chunk_size of every interleave into the full matrix
    // on rank 0; row 0 of each interleave is the received boundary row
    for (int i = 0; i < I; i++) {
        MPI_Gather(chunk_ih[i][1], N * chunk_size, MPI_INT,
                hptr + i * chunk_size * total_processes * N,
                N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        // getTimeMilli() counts microseconds, hence the division by 10^6
        fprintf(stderr, "Execution: %f s\n", (double) (end - start)
                / 1000000);
    }

    if (DEBUG) {
        if (rank == 0) {
            for (int i = 0; i < N - 1; i++) {
                print_vector(h[i], N);
            }
        }
    }

    // Free everything!
    free(sim_ptr);
    free(sim);
    free(b);
    free(chunk_ih);
    free(chunk_h);
    free(chunk_hptr);
    free(chunk_a);
    if (rank == 0) {
        free(a);
        free(hptr);
        free(h);
    }

    MPI_Finalize();
    return 0;
}

void memcopy(int* src, int* dst, int count) {
    for (int i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}

void print_vector(int* vector, int size) {
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void print_short_vector(short* vector, int size) {
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void read_pam(FILE* pam) {
    int i, j;
    int temp;
    /** read PAM250 similarity matrix **/
    /* begin AMPP */
    fscanf(pam, "%*s"); // discard the first whitespace-delimited token
    /* end AMPP */
    // Read the lower triangle (including the diagonal), row by row
    for (i = 0; i < AA; i++)
        for (j = 0; j <= i; j++) {
            if (fscanf(pam, "%d ", &temp) == EOF) {
                fprintf(stderr, "PAM file empty\n");
                fclose(pam);
                exit(1);
            }
            sim[i][j] = temp;
        }
    fclose(pam);
    for (i = 0; i < AA; i++)
        for (j = i + 1; j < AA; j++)
            sim[i][j] = sim[j][i]; // symmetrify
}

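void read_files(FILE* in1, FILE* in2) {
    int i = 0;
    int nc;
    char ch;
    /** read first file in array "a" **/
    // The unsigned char cast keeps bytes >= 0x80 from indexing negatively
    do {
        nc = fscanf(in1, "%c", &ch);
        if (nc > 0 && char2AAmem[(unsigned char) ch] >= 0) {
            a[i++] = char2AAmem[(unsigned char) ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in1);

    /** read second file in array "b" **/
    i = 0;
    do {
        nc = fscanf(in2, "%c", &ch);
        if (nc > 0 && char2AAmem[(unsigned char) ch] >= 0) {
            b[i++] = char2AAmem[(unsigned char) ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in2);
}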

/* Begin AMPP */
void initChar2AATranslation(void)
{
    int i;
    // -1 marks characters that are not amino-acid codes
    for (i = 0; i < 256; i++) char2AAmem[i] = -1;
    char2AAmem['c'] = char2AAmem['C'] = 0;
    AA2charmem[0] = 'c';
    char2AAmem['g'] = char2AAmem['G'] = 1;
    AA2charmem[1] = 'g';
    char2AAmem['p'] = char2AAmem['P'] = 2;
    AA2charmem[2] = 'p';
    char2AAmem['s'] = char2AAmem['S'] = 3;
    AA2charmem[3] = 's';
    char2AAmem['a'] = char2AAmem['A'] = 4;
    AA2charmem[4] = 'a';
    char2AAmem['t'] = char2AAmem['T'] = 5;
    AA2charmem[5] = 't';
    char2AAmem['d'] = char2AAmem['D'] = 6;
    AA2charmem[6] = 'd';
    char2AAmem['e'] = char2AAmem['E'] = 7;
    AA2charmem[7] = 'e';
    char2AAmem['n'] = char2AAmem['N'] = 8;
    AA2charmem[8] = 'n';
    char2AAmem['q'] = char2AAmem['Q'] = 9;
    AA2charmem[9] = 'q';
    char2AAmem['h'] = char2AAmem['H'] = 10;
    AA2charmem[10] = 'h';
    char2AAmem['k'] = char2AAmem['K'] = 11;
    AA2charmem[11] = 'k';
    char2AAmem['r'] = char2AAmem['R'] = 12;
    AA2charmem[12] = 'r';
    char2AAmem['v'] = char2AAmem['V'] = 13;
    AA2charmem[13] = 'v';
    char2AAmem['m'] = char2AAmem['M'] = 14;
    AA2charmem[14] = 'm';
    char2AAmem['i'] = char2AAmem['I'] = 15;
    AA2charmem[15] = 'i';
    char2AAmem['l'] = char2AAmem['L'] = 16;
    AA2charmem[16] = 'l';
    char2AAmem['f'] = char2AAmem['F'] = 17;
    AA2charmem[17] = 'f';
    char2AAmem['y'] = char2AAmem['Y'] = 18;
    AA2charmem[18] = 'y';
    char2AAmem['w'] = char2AAmem['W'] = 19;
    AA2charmem[19] = 'w';
}

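// Despite its name, this returns the current time in microseconds. The
// value is truncated to int and can wrap around, so it should only be used
// to measure short intervals.
int getTimeMilli() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    int ret = tv.tv_usec;
    ret += (tv.tv_sec * 1000000); // Add seconds, converted to microseconds
    return ret;
}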

/* end AMPP*/
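
For checking the parallel output on small inputs, a sequential reference
is handy. The sketch below applies the same cell update as the Process step
of the listing (diagonal neighbour plus similarity, vertical and horizontal
neighbours plus the gap penalty, clamped at zero). As a simplifying
assumption it scores symbols with a fixed match/mismatch value instead of
the PAM matrix, and it keeps only two rows of H. The names sw_cell and
sw_max_score are illustrative and not part of the original code. As in the
listing, the gap penalty is added, so it is passed as a negative number.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int max2(int x, int y) { return x > y ? x : y; }

/* One cell of H from its three neighbours, clamped at zero. */
int sw_cell(int diag, int up, int left, int score, int gap) {
    int best = max2(diag + score, max2(up + gap, left + gap));
    return best < 0 ? 0 : best;
}

/* Highest local-alignment score of a and b, using two rolling rows.
 * match/mismatch stand in for the PAM lookup (an assumption of this
 * sketch); gap is added to the neighbour, so pass a negative value. */
int sw_max_score(const char *a, const char *b,
                 int match, int mismatch, int gap) {
    size_t n = strlen(a), m = strlen(b);
    int *prev = (int *) calloc(m + 1, sizeof(int)); /* row i - 1, zeros */
    int *curr = (int *) calloc(m + 1, sizeof(int)); /* row i */
    assert(prev != NULL && curr != NULL);

    int best = 0;
    for (size_t i = 1; i <= n; i++) {
        curr[0] = 0; /* first column is always zero */
        for (size_t j = 1; j <= m; j++) {
            int score = a[i - 1] == b[j - 1] ? match : mismatch;
            curr[j] = sw_cell(prev[j - 1], prev[j], curr[j - 1],
                              score, gap);
            best = max2(best, curr[j]);
        }
        int *tmp = prev; /* the finished row becomes the previous row */
        prev = curr;
        curr = tmp;
    }
    free(prev);
    free(curr);
    return best;
}
```

With match = 2, mismatch = -1 and gap = -1, two identical sequences of
length 4 score 8, and two sequences with no symbol in common score 0. The
highest value in the gathered matrix h of a parallel run can be compared
against such a reference once the same scoring is used on both sides.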
