Universitat Politècnica de Catalunya
Facultat d'Informàtica de Barcelona




            AMPP Final Project



Smith-Waterman Algorithm
      Parallelization

Authors:                                         Supervisors:
Mário Almeida                        Josep Ramon Herrero Zaragoza
Žygimantas Bruzgys                         Daniel Jimenez Gonzalez
Umit Cavus Buyuksahin




                         Barcelona
                           2012
Contents

1 Introduction

2 Main Issues and Solutions
  2.1 Parallelization Techniques
      2.1.1 Blocking Technique
      2.1.2 Blocking and Interleaving Technique
  2.2 Performance Model on Linear Network Topology
      2.2.1 Blocking Technique
      2.2.2 Blocking Technique: Optimum B
      2.2.3 Blocking and Interleaving Technique
      2.2.4 Blocking and Interleaving Technique: Optimum B
      2.2.5 Blocking and Interleaving Technique: Optimum I
  2.3 Performance Model on 2D Torus Network Topology
      2.3.1 Blocking Technique
      2.3.2 Blocking and Interleaving Technique
  2.4 Implementation

3 Performance Results
  3.1 Finding Optimal P and B
  3.2 Finding Optimal I

4 Conclusions

A How to Compile

B How to Execute on ALTIX

C Code
1    Introduction
In this project we present a parallel implementation of the Smith-Waterman
algorithm using the Message Passing Interface (MPI). Smith-Waterman is a
well-known algorithm for performing local sequence alignment, that is, for
determining similar regions between two amino-acid sequences.
    In order to find the best alignment between two amino-acid sequences, a
matrix H of size N × N is computed, where N is the size of each sequence.
Every element of this matrix is based on a score matrix (the cost of matching
two symbols) and on a gap penalty for mismatching symbols. Once matrix H is
computed, the optimum alignment of the sequences can be obtained by tracing
back through the matrix, starting from its highest value.
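    Explicitly, each element follows the standard Smith-Waterman recurrence
with a linear gap penalty δ (the DELTA parameter of the implementation
described later), where s(a_i, b_j) is the score-matrix entry for the
symbols a_i and b_j:

$$H[i][j] = \max\{\, 0,\; H[i-1][j-1] + s(a_i, b_j),\; H[i-1][j] + \delta,\; H[i][j-1] + \delta \,\}$$

with the boundary condition $H[i][0] = H[0][j] = 0$.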
    In our parallel implementation only the H matrix calculation was
parallelized, as it is our only interest. The traceback part was removed from
our code, and from the sequential code as well, in order to gather the most
accurate computation times for comparison. For the parallelization a
pipelining method was used: each process communicates with the next one after
calculating B columns of its N/P rows. This is called blocking, and we
introduced a parameter that allows this value to be changed easily. Later an
interleaving parameter I was added as well.
    During this project several performance models were created: one for a
linear interconnection network and another for a 2D torus network. The models
include the B and I parameters. From them the optimum B and I were derived,
and performance tests were executed to also determine those two parameters
empirically.


2    Main Issues and Solutions
This section describes our parallelization solutions. First, the solution with
blocking at column level is explained and its performance model is described.
Then the solution with both blocking at column level and interleaving at row
level is explained, together with its performance model. The section also
provides the calculations for the optimum blocking factor B and interleaving
factor I. The second part of the section describes the performance models for
both solutions on a different network topology, with the corresponding
calculations for the optimum blocking factor B and interleaving factor I.
Finally, our implementation of these techniques in C++ is provided and
explained.




2.1     Parallelization Techniques
2.1.1   Blocking Technique




Figure 1: Parallelization approach by introducing blocking at column level

    The P processes share the matrix M in terms of consecutive rows: to
calculate the matrix M of size N × N, each process Pi works with N/P
consecutive rows of the matrix. When using the blocking technique for
parallelization, the columns are divided into blocks of a defined size B, so
each process has to calculate N/B blocks. These parameters are visualized in
Figure 1. The top part of the figure shows how the elements of the matrix are
divided between the processes, and the bottom part visualizes how the
calculations are parallelized across them. When the first process has computed
the first block of the matrix, which is of size N/P × B, it communicates with
the next process. The next process then starts calculating its first block
while the first process continues with its own next block, and so on. This
type of parallelization is called pipelining: the problem is divided into a
series of tasks that have to be completed one after the other. Before
explaining the parallelization in detail, we should analyze the data and task
dependencies between the processes.
    In Figure 2 the data dependency for a particular matrix element is shown.
In order to calculate a matrix element M[i][j], the process Pi+1 needs the
calculated data from the previous column, M[i][j − 1], and the elements
M[i − 1][j − 1] and M[i − 1][j] from the previous row, as seen in the picture.
If the previous row is calculated by process Pi, then that row is sent only
after process Pi has calculated its block of size N/P × B. This introduces
data and task dependencies: the process Pi+1 cannot start its calculations
until the process Pi sends the last row of the block, which is needed for
calculating the block of process Pi+1. To calculate the first row and column
of the matrix, the predecessor row and column are considered to be filled
with zeros.


           Figure 2: Data dependency for calculating one matrix element




           Figure 3: Data dependencies between blocks of the matrix

    Figure 3 shows the parallelism of the matrix in a wider view. The squares
represent the blocks of the matrix and the three arrows show the data
dependencies between blocks. As mentioned before, an element needs its upper,
left, and upper-left neighbours to have been calculated; this is the data
dependency. Therefore, blocks on the same minor diagonal are independent of
each other, so these blocks can be, and are, calculated in parallel.
    The steps of calculations are as follows:

  1. The process waits until the previous process finishes the calculation of
     a block (if applicable);
  2. The process receives the last row of the block that was calculated by
     the previous process;
  3. Having received that row, the process has all the information necessary
     to calculate its own block, so it performs the calculation;
  4. When the process finishes the calculation, it sends the last row of its
     block to the next process (if applicable);
  5. The process repeats these steps until it finishes the calculation of all
     its blocks, that is, until it has calculated all rows assigned to it.
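
A minimal sketch of this per-process pipeline loop is shown below;
compute_block() is a hypothetical placeholder for the Smith-Waterman inner
loops, whose real form appears in Section 2.4:

    #include <mpi.h>

    // Hypothetical helper standing in for the Smith-Waterman inner loops
    // of Section 2.4: computes one (N/P) x B block using the border row.
    void compute_block(int blk, const int *border_row, int *last_row);

    void pipeline(int rank, int P, int num_blocks, int B,
                  int *border_row, int *last_row) {
        for (int blk = 0; blk < num_blocks; blk++) {
            if (rank > 0)   // steps 1-2: block until the previous rank sends
                MPI_Recv(border_row + blk * B, B, MPI_INT, rank - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            compute_block(blk, border_row, last_row);   // step 3
            if (rank < P - 1)   // step 4: forward this block's last row
                MPI_Send(last_row + blk * B, B, MPI_INT, rank + 1, 0,
                         MPI_COMM_WORLD);
        }
    }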

2.1.2   Blocking and Interleaving Technique




     Figure 4: Matrix calculation with interleaving factor, when I = 2

    This parallelization method adds an interleave factor to the blocking
technique described above. With this method the matrix is divided into I
parts, so that each part has N × N/I elements. Every part is then calculated
as explained in the previous section, that is, using the blocking technique.
As soon as a process finishes the rows assigned to it in the first interleave
part, it continues with its blocks from the next interleave part. For example,
in Figure 4, where the interleaving factor is I = 2, the matrix is divided
into two smaller parts, and each process Pi calculates N/(P · I) rows of one
part before moving on to the second part.

   The steps of the calculations are very similar to those of the plain
blocking technique:

  1. The process waits until the previous process finishes the calculation of
     a block (if applicable);

  2. The process receives the last row of the block that was calculated by
     the previous process;

  3. Having received that row, the process has all the information necessary
     to calculate its own block, so it performs the calculation;

  4. When the process finishes the calculation, it sends the last row of its
     block to the next process. If the process is the last one and there is
     another interleave part left to calculate, it sends the row to the first
     process instead; otherwise it does not send anything;

  5. The process repeats these steps until it finishes the calculation of all
     blocks within the current interleave part, that is, until it has
     calculated all rows assigned to it within that part. If there is another
     interleave part left, it moves on to it and repeats these steps until
     all blocks of all interleave parts are calculated.
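
To make the row distribution concrete, the first global row owned by a process
in a given interleave part can be computed as in the sketch below (an
illustration assuming N is divisible by P · I; the real implementation pads N
otherwise):

    // First global row owned by process p in interleave part k,
    // assuming N is divisible by P * I (the implementation pads otherwise).
    int first_row(int N, int P, int I, int p, int k) {
        int chunk = N / (P * I);        // rows per process per interleave part
        return k * (N / I) + p * chunk; // part offset + offset within the part
    }
    // Example: N = 16, P = 2, I = 2 gives chunk = 4, so P0 owns rows
    // 0-3 (part 0) and 8-11 (part 1), while P1 owns rows 4-7 and 12-15.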

2.2     Performance Model on Linear Network Topology
2.2.1   Blocking Technique
In this section we describe the performance model of our implementation with
the blocking technique on a linear network topology. In later sections we
compare it with a non-linear topology, taking into account the differences in
the performance models.
    In order to focus on the main objectives of this performance analysis, we
only take into account the parallel algorithm used for the matrix calculation.
This means that the parts of the code that run sequentially on a single
process, such as opening and reading the input files, are ignored in this
model.
    Some assumptions were made for the models of the different network
topologies, such as the assumption that the creation of new processes is
location-aware with respect to their place in the network, to make
communication more efficient.
    For all the performance models described in this section we use the
following notation:

   • ts : startup time (prepare message + routing algorithm + interface
     between the local node and the router).

   • tc : computation time for each value in the matrix.

   • tw : transfer time per word.

   • Tcomm : total communication time.

   • Tcomp : total computation time.




Figure 5: Communication and computation times of matrix parallel calculations by process using the blocking technique.

    The diagram in Figure 5 represents the steps of the matrix calculation
performed by our algorithm, as well as the initial declarations and the
required communications. The different steps are represented with different
colors: blue represents the scattering of one protein sequence to all the
processes, the green areas represent the computation time needed for the
matrix calculations of each block, and yellow represents the time taken to
send the last row of a block to the next process.
    In order to simplify the diagram, the time the last process needs to
receive the last row of the previous process's block is already taken into
account in the upper yellow area. This explains why the last process doesn't
have yellow areas in its time-line but still has to wait to receive the rows
needed to perform its matrix calculations. All of this is considered in this
performance model.
    As we can observe from the diagram, the communication time of this model
is composed of the scattering of the protein sequence vector (blue area) and
several communications sending the last row of each block to the next process
(yellow). The scatter method [2] receives a vector of size N and delivers a
vector of size N/P to each process. The scattering time is given by:
$$T_{scatter} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P-1) \cdot t_w \qquad (1)$$
   The sending of the last row of each block to the next process is composed
of the communication startup time ($t_s$) and the time to transfer the B
elements of the block's last row. This is given by:

$$T_{rowComm} = t_s + B \cdot t_w \qquad (2)$$
    In the total communication time, this startup and transfer cost is paid
N/B times by the first process, plus an extra P − 1 times for the pipeline
stages of the remaining processes. To take into account the fact that the last
process does not need to send its last row to another process, we count only
P − 2 extra communications. So the total communication time is given by:

$$T_{comm} = T_{scatter} + \left(\frac{N}{B} + P - 2\right) \cdot T_{rowComm}$$

$$T_{comm} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P-1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \qquad (3)$$
   The next step is to calculate the total computation time. Since a block is
composed of N/P rows and B columns, the total number of elements per block is
B · N/P. This means that the computation time of a single block is given by:

$$T_{comp\,block} = t_c \cdot B \cdot \frac{N}{P} \qquad (4)$$
   As we did for the total communication time, this per-block computation time
is multiplied by N/B + P − 1 to account for the block computations of all the
processes:

$$T_{comp} = \left(\frac{N}{B} + P - 1\right) \cdot T_{comp\,block} = \left(\frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P} \qquad (5)$$
                                B                     P
   To conclude this performance model, the total parallelization time is given
by the sum of the total communication and computation times:
$$T_{parallel} = T_{comp} + T_{comm}$$

$$T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P} + t_s \cdot \log(P) + \frac{N}{P} \cdot (P-1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \qquad (6)$$
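
Because the closed-form optimum derived next relies on an N ≫ P
approximation, equation (6) can also be evaluated numerically for candidate
values of B. A minimal sketch, assuming measured machine constants ts, tc and
tw and taking the logarithm as base 2 (an assumption; the model just writes
log):

    #include <math.h>

    // Evaluates equation (6): total time of the blocking pipeline on a
    // linear network; ts, tc, tw are assumed measured machine constants.
    double t_parallel(double N, double P, double B,
                      double ts, double tc, double tw) {
        double t_comp    = (N / B + P - 1.0) * (tc * B * N / P);
        double t_scatter = ts * log2(P) + (N / P) * (P - 1.0) * tw;
        double t_rows    = (N / B + P - 2.0) * (ts + B * tw);
        return t_comp + t_scatter + t_rows;
    }

A simple loop over candidate values of B then locates the minimum without the
N ≫ P assumption.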

2.2.2   Blocking Technique: Optimum B
In order to find the optimum B for fixed values of N and P, assuming N is
much bigger than P, we need to find the value of B for which the total
parallel time of computation and communication is smallest. This value can be
found by differentiating the total parallelization time equation and finding
the value of B for which the derivative is equal to zero.
$$\frac{dT_{parallel}}{dB} = 0 \;\Leftrightarrow\; \frac{t_c \cdot N \cdot (P-1)}{P} - \frac{N \cdot t_s}{B^2} + (P-2) \cdot t_w = 0 \;\Leftrightarrow$$

$$\Leftrightarrow\; B = \sqrt{\frac{N \cdot t_s \cdot P}{P \cdot t_c \cdot N + P^2 \cdot t_w - t_c \cdot N - 2 \cdot t_w \cdot P}} = \sqrt{\frac{N \cdot t_s \cdot P}{t_c \cdot N \cdot (P-1) + P \cdot t_w \cdot (P-2)}}$$

$$\Leftrightarrow\; B = \sqrt{\frac{t_s}{\frac{t_w \cdot (P-2)}{N} + \frac{t_c \cdot (P-1)}{P}}}$$

   For $N \gg P$:

$$B \approx \sqrt{\frac{t_s}{t_c}} \qquad (7)$$
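
As an illustration with assumed (not measured) machine constants
$t_s = 20\,\mu s$ and $t_c = 0.02\,\mu s$ per element:

$$B \approx \sqrt{\frac{t_s}{t_c}} = \sqrt{\frac{20}{0.02}} \approx 32$$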




2.2.3   Blocking and Interleaving Technique
In this section we describe the performance model of our implementation with
the blocking and interleaving techniques on a linear network topology. In
later sections we compare it with a non-linear topology, taking into account
the differences in the performance models. As in the previous model, we use
the notation introduced above and only take into account the parallel
algorithm used for the matrix calculation.




Figure 6: Communication and computation times of matrix parallel calculations by process using the blocking and interleaving techniques.

    The diagram in Figure 6 represents the steps of the matrix calculation
performed by our algorithm, as well as the initial declarations and the
required communications. As before, blue represents the scattering of one
protein sequence to all the processes, the green areas represent the
computation time needed for the matrix calculations of each block, and yellow
represents the time taken to send the last row of a block to the next process.
    In order to simplify the diagram, the time the last process in the last
interleave needs to receive the last row of the previous process's block is
already taken into account in the upper yellow area. This explains why this
last process doesn't have yellow areas in its time-line but still has to wait
to receive the rows needed to perform its matrix calculations. All of this is
considered in this performance model.
    As we can observe from the diagram, the communication time of this model
is composed of the scattering of a part of the protein sequence vector (blue
area) for each interleave, and several communications sending the last row of
each block to the next process (yellow). The scatter method receives a vector
of size N and delivers a vector of size N/(P · I) to each process per
interleave. The scattering time is given by:

$$T_{scatter} = t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P-1) \cdot t_w \qquad (8)$$
   This scattering is done once per interleave, which means we have to
multiply this $T_{scatter}$ by I:

$$T_{Tscatter} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P-1) \cdot t_w\right)$$

   The sending of the last row of each block to the next process is composed
of the communication startup time ($t_s$) and the time to transfer the B
elements of the block's last row. This is given by:

$$T_{rowComm} = t_s + B \cdot t_w \qquad (9)$$
    In order to describe the calculation of the total communication time
clearly, we split it into the communication time of the first I − 1
interleaves and the special case of the last interleave. For the first I − 1
interleaves, one may notice that each interleave introduces N/B extra yellow
areas. This means that the communication time for all the startups and
transfers of the first I − 1 interleaves is given by:

$$T_{commInter} = (I-1) \cdot \frac{N}{B} \cdot T_{rowComm} = (I-1) \cdot \frac{N}{B} \cdot (t_s + B \cdot t_w) \qquad (10)$$
    The case of the last interleave is slightly different: we must take into
account the typical P − 1 extra pipeline communications due to the different
pipeline stages. Since in our implementation the last process does not need
to send its last row to another process, there are only P − 2 extra
communications. So the communication time for all the startups and transfers
of the last interleave is given by:

$$T_{commLastInter} = \left(\frac{N}{B} + P - 2\right) \cdot T_{rowComm} = \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \qquad (11)$$



With these formulas we can finally express the total communication time as
the sum of the scattering times and the startup and transfer times of all the
interleaves. So the total communication time is given by:

$$T_{comm} = T_{Tscatter} + T_{commInter} + T_{commLastInter}$$

$$T_{comm} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P-1) \cdot t_w\right) + \left((I-1) \cdot \frac{N}{B} + \frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \qquad (12)$$
    The next step is to calculate the total computation time. Since a block
is now composed of N/(P · I) rows and B columns, the total number of elements
per block is B · N/(P · I). This means that the computation time of a single
block is given by:

$$T_{compBlock} = t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (13)$$
   As we did for the total communication time, we have to take into account
how the interleaving affects the computation. For the first I − 1 interleaves
the computation time is given by:

$$T_{compInter} = (I-1) \cdot \frac{N}{B} \cdot t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (14)$$

   Unlike the communication time, the last interleave has exactly N/B + P − 1
block computations. This means that the total computation time is given by:

$$T_{comp} = \left((I-1) \cdot \frac{N}{B} + \frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (15)$$
   To conclude this performance model, the total parallelization time is given
by the sum of the total communication and computation times:

$$T_{parallel} = T_{comp} + T_{comm}$$

$$T_{parallel} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P-1) \cdot t_w\right) + \left((I-1) \cdot \frac{N}{B} + \frac{N}{B} + P - 2\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (16)$$

2.2.4   Blocking and Interleaving Technique: Optimum B
In order to find the optimum B for given values of N, P and I, assuming N is
much bigger than P, we need to find the value of B for which the total
parallel time of computation and communication is smallest. As before, this
value can be found by differentiating the total parallelization time equation
and finding the value of B for which the derivative is equal to zero.
$$\frac{dT_{parallel}}{dB} = 0 \;\Leftrightarrow\; \frac{t_c \cdot N \cdot (P-1)}{P \cdot I} - \frac{I \cdot N \cdot t_s}{B^2} + (P-2) \cdot t_w = 0 \;\Leftrightarrow$$

$$\Leftrightarrow\; B = \sqrt{\frac{N \cdot t_s \cdot P \cdot I^2}{P \cdot t_c \cdot N + P^2 \cdot t_w \cdot I - t_c \cdot N - 2 \cdot t_w \cdot I \cdot P}} \qquad (17)$$

   For $N \gg P$:

$$B \approx \sqrt{\frac{I \cdot N \cdot t_s}{t_w}} \qquad (18)$$

2.2.5   Blocking and Interleaving Technique: Optimum I
In order to find the optimum I for given values of N, P and B, assuming N is
much bigger than P, we need to find the value of I for which the total
parallel time of computation and communication is smallest. Again, this value
can be found by differentiating the total parallelization time equation and
finding the value of I for which the derivative is equal to zero.


$$\frac{dT_{parallel}}{dI} = 0 \;\Leftrightarrow\; I = \sqrt{\frac{N \cdot t_c \cdot (P-1) \cdot B^2}{P \cdot (B \cdot t_s \cdot \log(P) + N \cdot t_s + N \cdot B \cdot t_w)}} \qquad (19)$$

$$I \approx \sqrt{\frac{N \cdot t_c \cdot B^2}{B \cdot t_s \cdot \log(P) + N \cdot t_s + N \cdot B \cdot t_w}} \qquad (20)$$
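
As an illustration, plugging the configuration tested in Section 3
(N = 10,000, B = 100, P = 8) and assumed (not measured) machine constants
$t_s = 20\,\mu s$, $t_c = 0.02\,\mu s$ and $t_w = 0.1\,\mu s$ into equation
(20) gives:

$$I \approx \sqrt{\frac{10000 \cdot 0.02 \cdot 100^2}{100 \cdot 20 \cdot \log_2 8 + 10000 \cdot 20 + 10000 \cdot 100 \cdot 0.1}} = \sqrt{\frac{2 \cdot 10^6}{3.06 \cdot 10^5}} \approx 2.6,$$

which is of the same order as the empirically optimal I = 2 found in
Section 3.2.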

2.3     Performance Model on 2D Torus Network Topology
2.3.1   Blocking Technique
Assuming that the spawning of processes is location-aware with respect to the
network topology, the only difference between the linear topology of the
previous sections and the 2D torus network topology is in the scattering of
the data [1]. So the new performance model for this topology is given by:

$$T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P} + 2 \cdot t_s \cdot \log(\sqrt{P}) + \frac{N}{P} \cdot (P-1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \qquad (21)$$

Although the scattering of the data is done faster, it is not affected by the
variable B, so it does not change the calculation of the optimum B, which
remains:

$$B \approx \sqrt{\frac{t_s}{t_c}} \qquad (22)$$

2.3.2   Blocking and Interleaving Technique
Let us also assume that the spawning of processes is location-aware with
respect to the network topology. Then the only difference between the linear
topology of the previous sections and the 2D torus network topology is again
in the scattering of the data. So the new performance model for this topology
is given by:
$$T_{parallel} = I \cdot \left(2 \cdot t_s \cdot \log(\sqrt{P}) + \frac{N}{P \cdot I} \cdot (P-1) \cdot t_w\right) + \left((I-1) \cdot \frac{N}{B} + \frac{N}{B} + P - 2\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (23)$$
Just as with the plain blocking technique, the scattering is not affected by
B, but it is affected by I: the scattering cost depends on the level of
interleaving. So the new equation for the optimum I is given by:

$$I \approx \sqrt{\frac{N \cdot t_c \cdot B^2}{2 \cdot B \cdot t_s \cdot \log(\sqrt{P}) + N \cdot t_s + N \cdot B \cdot t_w}} \qquad (24)$$
   The corresponding optimum B is given by:

$$B \approx \sqrt{\frac{I \cdot N \cdot t_s}{t_w}} \qquad (25)$$

   Taking into account the properties of logarithms, we deduce that the
optimum I is the same for both network topologies; the only difference between
the two is the time needed to perform the scattering.
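
The logarithmic property referred to is simply

$$2 \cdot t_s \cdot \log(\sqrt{P}) = 2 \cdot t_s \cdot \tfrac{1}{2} \cdot \log(P) = t_s \cdot \log(P),$$

so the startup part of the scatter is identical on both topologies and the
optimization over B and I is unchanged.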

2.4    Implementation
In this section the implementation of our solution is provided and explained.
Compared to the provided sequential one, our solution requires two extra
parameters, B and I, where B is the blocking factor and I is the interleaving
factor. Note that in order not to use interleaving, the I parameter should
simply be set to 1.
    In our solution, all required data is first read by the root process and
later broadcast or scattered to the other processes. Vector A is scattered to
all of the processes. How much data each process receives depends on the I
parameter and on the number of processes: every process receives N/(I · P)
rows before computing each of the interleave parts. Usually N is not divisible
by I · P, so padding is introduced. For example, with N = 10,000, P = 8 and
I = 2, N % (P · I) = 0, so no padding is needed and chunk_size is 625. The
number of elements each process receives during the scatter is calculated and
stored as follows:

    sizeA = N % (total_processes * I) != 0
                ? N + (total_processes * I) - (N % (total_processes * I))
                : N;
    chunk_size = sizeA / (total_processes * I);

   Then the root process reads the data and shares it as follows:

    // Broadcast the similarity matrix
    MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
    // Broadcast the size of the portion of vector A that each process
    // will receive during the scatter
    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // Broadcast the N, B, I and DELTA parameters
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

    Later, each process allocates space for its portion of the H matrix, its
portion of the A vector, and the whole B vector. Note that in our solution a
process does not allocate the full-sized H matrix, only the portion into which
it writes its results. The sizes of the H portions distributed across the
processes therefore sum to N × N + N + N · (P · I): the whole matrix, the
initial column filled with zeros, and the extra rows into which the processes
receive information from their predecessors. The portion is stored in a
three-dimensional array in which the first index selects the interleave part
and the other two select the row and column. The memory is allocated and
mapped, and the B vector is broadcast, as follows:
    CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * N *
                                            (chunk_size + 1) * I)));
    CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*) *
                                          (chunk_size + 1) * I)));
    CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int*) * I)));
    for (int i = 0; i < (chunk_size + 1) * I; i++)
        chunk_h[i] = chunk_hptr + i * N;

    for (int i = 0; i < I; i++)
        chunk_ih[i] = chunk_h + i * (chunk_size + 1);

    CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * chunk_size)));
    if (rank != 0) { // the root process already has the B vector
        CHECK_NULL((b = (short *) malloc(sizeof(short) * N)));
    }

    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);
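
The three-level indexing set up above is plain pointer arithmetic over the
single chunk_hptr allocation; the following identity (an illustrative
restatement assuming assert.h and valid indices k, i and j, not part of the
actual program) makes the layout explicit:

    // Interleave part k, local row i, column j all live inside the one
    // chunk_hptr allocation of N * (chunk_size + 1) * I ints.
    assert(&chunk_ih[k][i][j] ==
           &chunk_hptr[(k * (chunk_size + 1) + i) * N + j]);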

    Later each process calculates how many blocks there are in total and the
size of the final block. This is needed because N is usually not divisible by
B, so the final block is usually smaller than the rest. The time that marks
the beginning of the computation is stored in the variable start. In the main
loop, which iterates over the interleaves, each process receives its portion
of the A vector. The main loop is repeated I times, as explained earlier in
the section describing the blocking and interleaving technique.
    int total_blocks = N / B + (N % B == 0 ? 0 : 1);
    int last_block_size = N % B == 0 ? B : N % B;
    MPI_Status status;

    int start, end;
    start = getTimeMilli();

    for (int current_interleave = 0; current_interleave < I;
         current_interleave++) {
        MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                    chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT,
                    0, MPI_COMM_WORLD);

        int current_column = 1;
        // Fill the first column with 0
        for (int i = 0; i < chunk_size + 1; i++)
            chunk_ih[current_interleave][i][0] = 0;

    Then the main calculations begin. First, the process checks whether it has
to receive data from another process; if so, it receives the data required for
the calculations. Then it processes the current block, storing the results in
its separate array, which will be gathered later. Finally, the process checks
whether it has to send data to another process; if so, it sends the last row
of the current block to that process. The process repeats these actions
total_blocks times. Finally, it saves the time after execution in the end
variable.
        for (int current_block = 0; current_block < total_blocks;
             current_block++) {
            // Receive
            int block_end = MIN2(current_column -
                                 (current_block == 0 ? 1 : 0) + B, N);
            if (rank == 0 && current_interleave == 0) {
                for (int k = current_column; k < block_end; k++) {
                    chunk_ih[current_interleave][0][k] = 0;
                }
            } else {
                int receive_from = rank == 0 ? total_processes - 1
                                             : rank - 1;
                int size_to_receive = current_block == total_blocks - 1
                                          ? last_block_size : B;
                MPI_Recv(chunk_ih[current_interleave][0] +
                             current_block * B, size_to_receive, MPI_INT,
                         receive_from, 0, MPI_COMM_WORLD, &status);
                if (DEBUG) printf("[%d] Received from %d: ", rank,
                                  receive_from);
                if (DEBUG) print_vector(chunk_ih[current_interleave][0] +
                                        current_block * B, size_to_receive);
            }
            // Process
            for (int j = current_column; j < block_end;
                 j++, current_column++) {
                for (int i = 1; i < chunk_size + 1; i++) {
                    int diag = chunk_ih[current_interleave][i - 1][j - 1] +
                               sim[chunk_a[i - 1]][b[j - 1]];
                    int down = chunk_ih[current_interleave][i - 1][j] +
                               DELTA;
                    int right = chunk_ih[current_interleave][i][j - 1] +
                                DELTA;
                    int max = MAX3(diag, down, right);
                    chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
                }
            }

            // Send
            if (current_interleave != I - 1 ||
                rank + 1 != total_processes) {
                int send_to = rank + 1 == total_processes ? 0 : rank + 1;
                int size_to_send = current_block == total_blocks - 1
                                       ? last_block_size : B;
                MPI_Send(chunk_ih[current_interleave][chunk_size] +
                             current_block * B, size_to_send, MPI_INT,
                         send_to, 0, MPI_COMM_WORLD);
                if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
                if (DEBUG) print_vector(
                               chunk_ih[current_interleave][chunk_size] +
                               current_block * B, size_to_send);
            }
        }
    }
    end = getTimeMilli();

   When all the calculations are finished, all processes take part in the
gather. After the gather has executed, the root process holds the whole H
matrix. The root process then prints the execution time to the stderr stream
and, if debugging is enabled, prints the H matrix.
    for (int i = 0; i < I; i++) {
        MPI_Gather(chunk_hptr + N + i * chunk_size * N,
                   N * chunk_size, MPI_INT,
                   hptr + i * chunk_size * total_processes * N,
                   N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        fprintf(stderr, "Execution: %f s\n",
                (double) (end - start) / 1000000);
    }

    if (DEBUG) {
        if (rank == 0) {
            for (int i = 0; i < N - 1; i++) {
                print_vector(h[i], N);
            }
        }
    }

    MPI_Finalize();

   The full code is provided in the Appendix section.




3     Performance Results
In this section the performance results of our implementation on ALTIX are
provided. The results are also compared with the performance of the
sequential code.

3.1    Finding Optimal P and B
In order to find the optimal P and B, we tested the application with different
P and B parameters, with N = 10,000. Before that we timed the sequential
code, which executed the calculations in 12.598 seconds. The execution times
of the parallelized version are shown in Figure 7.




 Figure 7: Performance results with different P and B, where N = 10,000

   From this it can be concluded that with the parameters N = 10,000,
B = 100, P = 8 and I = 1, the parallel code executed the calculations 9 times
faster.

3.2    Finding Optimal I
In order to find the optimal I, we selected the best result from the previous
test, where P = 8, and ran the test with different I and B parameters. The
results are shown in Figure 8.
Figure 8: Performance results with different I and B, where N = 10,000 and
P = 8

    Because environmental factors such as network congestion affect our
performance tests, the results may not be completely accurate. With that
caveat, we deduced from the results that the optimal parameter configuration
for N = 10,000 is I = 2, B = 200, P = 8. With this configuration the parallel
code calculates the matrix 8 times faster than the sequential code. Finally,
we tested the parallel code with N = 25,000 and the parameters that we had
found to be optimal. The code executed the calculations in 11.822213 seconds,
whereas the sequential code ran for 76.884 seconds; from this it can be
concluded that the parallel code runs 6.5 times faster. The speedup is lower
because, as stated earlier, the optimal B and I depend on N, so this
parameter configuration is not optimal for calculating the similarity of
vectors of size N = 25,000.


4    Conclusions
During this project a parallel implementation of the Smith-Waterman algorithm
was made using blocking and interleaving techniques. The techniques and the
code were explained in detail. Performance models were derived for both
linear and 2D torus network topologies. Also, for each network topology,
equations were found for the optimum blocking factor B when using the
blocking technique, and for the optimum B and interleaving factor I when
using the blocking and interleaving technique. After calculating the models,
we concluded that the calculation of the B and I factors for our algorithm is
the same on both of these network topologies.
    Performance tests using multiple processes on different processors were
carried out. It was found that the optimal configuration for calculating the
sequence alignment of two vectors of size N = 10,000 using our implementation
is I = 2, B = 200, P = 8. With this configuration the parallel code
calculates the matrix 8 times faster than the sequential code. With the same
parameter configuration, the parallel code calculates a matrix of size
N = 25,000 6.5 times faster than the sequential code.




References
[1] Peter Harrison, William Knottenbelt. Parallel Algorithms. Department
    of Computing, Imperial College London, 2009.

[2] Norm Matloff. Programming on Parallel Machines. University of
    California, Davis, 2011.




A     How to Compile
all: seq par

seq:
    gcc SW.c -o seq.out

par:
    icc protein.cpp -o protein.out -lmpi



B    How to Execute on ALTIX
#!/bin/bash
# @ job_name = ampp01parallel
# @ initialdir = .
# @ output = mpi_%j.out
# @ error = mpi_%j.err
# @ total_tasks = <number_of_process>
# @ wall_clock_limit = 00:01:00
mpirun -np <number_of_process> ./protein.out <vector_a> <vector_b> \
    <similarity_matrix> <gap_penalty> <N> <B> <I>



C    Code
#include <stdio.h>
#include <stdlib.h>   // def of RAND_MAX
#include <ctype.h>    // character handling
#include <sys/time.h>
#include "mpi.h"

#define DEBUG 1

#define MAX_SEQ 50

#define CHECK_NULL(_check) { \
    if ((_check) == NULL) { \
        fprintf(stderr, "Null pointer allocating memory\n"); \
        exit(-1); \
    } \
}

#define AA 20           // number of amino acids
#define MAX2(x,y) ((x)<(y) ? (y) : (x))
#define MAX3(x,y,z) (MAX2(x,y)<(z) ? (z) : MAX2(x,y))
#define MIN2(x,y) ((x)>(y) ? (y) : (x))

// function prototypes
int getTimeMilli();
void read_pam(FILE* pam);
void read_files(FILE* in1, FILE* in2);
void print_vector(int* vector, int size);
void print_short_vector(short* vector, int size);
void memcopy(int* src, int* dst, int count);

/* begin AMPP*/
int char2AAmem[256];
int AA2charmem[AA];
void initChar2AATranslation(void);
/* end AMPP */

/* Define global variables */
int rank, total_processes;
int DELTA;
short *a, *b;
int *chunk_hptr;
int **chunk_h, ***chunk_ih;
int *sim_ptr, **sim;     // PAM similarity matrix
int N, sizeA, B, I, chunk_size;
short *chunk_a;
int* hptr;
int** h;
FILE *pam;

int main(int argc, char *argv[]) {
    /* begin AMPP */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &total_processes);

    CHECK_NULL((sim_ptr = (int *) malloc(AA * AA * sizeof(int))));
    CHECK_NULL((sim = (int **) malloc(AA * sizeof(int*))));
    for (int i = 0; i < AA; i++)
        sim[i] = sim_ptr + i * AA;

    if (rank == 0) {
        FILE *in1, *in2;
        /**** Error handling for input file ****/
        if (!(argc >= 5 && argc <= 8)) {
            fprintf(stderr,
                    "%s protein1 protein2 PAM gapPenalty [N] [B] [I]\n",
                    argv[0]);
            exit(1);
        } else {
            in1 = fopen(argv[1], "r");
            in2 = fopen(argv[2], "r");
            N = (argc > 5 ? atoi(argv[5]) : MAX_SEQ) + 1;
            B = argc > 6 ? atoi(argv[6]) : total_processes;
            I = argc > 7 ? atoi(argv[7]) : 1;
            DELTA = atoi(argv[4]);
        }
        /* end AMPP */

        /* begin AMPP */
        sizeA = N % (total_processes * I) != 0
                    ? N + (total_processes * I) - (N % (total_processes * I))
                    : N;
        CHECK_NULL((a = (short *) calloc(sizeof(short), sizeA)));
        CHECK_NULL((b = (short *) malloc(sizeof(short) * N)));

        initChar2AATranslation();
        read_files(in1, in2);
        chunk_size = sizeA / (total_processes * I);

        CHECK_NULL((hptr = (int *) calloc(N * sizeA, sizeof(int))));
        CHECK_NULL((h = (int **) calloc(sizeA, sizeof(int*))));
        for (int i = 0; i < sizeA; i++)
            h[i] = hptr + i * N;

        pam = fopen(argv[3], "r");
        read_pam(pam);
    }

    MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

    CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * N *
                                            (chunk_size + 1) * I)));
    CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*) *
                                          (chunk_size + 1) * I)));
    CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int*) * I)));
    for (int i = 0; i < (chunk_size + 1) * I; i++)
        chunk_h[i] = chunk_hptr + i * N;

    for (int i = 0; i < I; i++)
        chunk_ih[i] = chunk_h + i * (chunk_size + 1);

    CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * chunk_size)));
    if (rank != 0) {
        CHECK_NULL((b = (short *) malloc(sizeof(short) * N)));
    }

    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

    /*** PARALLEL PART ***/
    /** compute "h" local similarity array **/
    int total_blocks = N / B + (N % B == 0 ? 0 : 1);
    int last_block_size = N % B == 0 ? B : N % B;
    MPI_Status status;

    int start, end;
    start = getTimeMilli();

    for (int current_interleave = 0; current_interleave < I;
         current_interleave++) {
        MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                    chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT,
                    0, MPI_COMM_WORLD);

        int current_column = 1;
        // Fill the first column with 0
        for (int i = 0; i < chunk_size + 1; i++)
            chunk_ih[current_interleave][i][0] = 0;

        for (int current_block = 0; current_block < total_blocks;
             current_block++) {
            // Receive
            int block_end = MIN2(current_column -
                                 (current_block == 0 ? 1 : 0) + B, N);
            if (rank == 0 && current_interleave == 0) {
                for (int k = current_column; k < block_end; k++) {
                    chunk_ih[current_interleave][0][k] = 0;
                }
            } else {
                int receive_from = rank == 0 ? total_processes - 1
                                             : rank - 1;
                int size_to_receive = current_block == total_blocks - 1
                                          ? last_block_size : B;
                MPI_Recv(chunk_ih[current_interleave][0] +
                             current_block * B, size_to_receive, MPI_INT,
                         receive_from, 0, MPI_COMM_WORLD, &status);
                if (DEBUG) printf("[%d] Received from %d: ", rank,
                                  receive_from);
                if (DEBUG) print_vector(chunk_ih[current_interleave][0] +
                                        current_block * B, size_to_receive);
            }
            // Process
            for (int j = current_column; j < block_end;
                 j++, current_column++) {
                for (int i = 1; i < chunk_size + 1; i++) {
                    int diag = chunk_ih[current_interleave][i - 1][j - 1] +
                               sim[chunk_a[i - 1]][b[j - 1]];
                    int down = chunk_ih[current_interleave][i - 1][j] +
                               DELTA;
                    int right = chunk_ih[current_interleave][i][j - 1] +
                                DELTA;
                    int max = MAX3(diag, down, right);
                    chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
                }
            }

            // Send
            if (current_interleave != I - 1 ||
                rank + 1 != total_processes) {
                int send_to = rank + 1 == total_processes ? 0 : rank + 1;
                int size_to_send = current_block == total_blocks - 1
                                       ? last_block_size : B;
                MPI_Send(chunk_ih[current_interleave][chunk_size] +
                             current_block * B, size_to_send, MPI_INT,
                         send_to, 0, MPI_COMM_WORLD);
                if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
                if (DEBUG) print_vector(
                               chunk_ih[current_interleave][chunk_size] +
                               current_block * B, size_to_send);
            }
        }
    }
    end = getTimeMilli();

    for (int i = 0; i < I; i++) {
        MPI_Gather(chunk_hptr + N + i * chunk_size * N,
                   N * chunk_size, MPI_INT,
                   hptr + i * chunk_size * total_processes * N,
                   N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        fprintf(stderr, "Execution: %f s\n",
                (double) (end - start) / 1000000);
    }

    if (DEBUG) {
        if (rank == 0) {
            for (int i = 0; i < N - 1; i++) {
                print_vector(h[i], N);
            }
        }
    }

    // Free everything!
    free(sim_ptr);
    free(sim);
    free(b);
    free(chunk_ih);
    free(chunk_h);
    free(chunk_hptr);
    free(chunk_a);
    if (rank == 0) {
        free(a);
        free(hptr);
        free(h);
    }

    MPI_Finalize();
}

void memcopy(int* src, int* dst, int count) {
    for (int i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}

void print_vector(int* vector, int size) {
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void print_short_vector(short* vector, int size) {
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void read_pam(FILE* pam) {
    int i, j;
    int temp;
    /** read PAM250 similarity matrix **/
    /* begin AMPP */
    fscanf(pam, "%*s");
    /* end AMPP */
    for (i = 0; i < AA; i++)
        for (j = 0; j <= i; j++) {
            if (fscanf(pam, "%d ", &temp) == EOF) {
                fprintf(stderr, "PAM file empty\n");
                fclose(pam);
                exit(1);
            }
            sim[i][j] = temp;
        }
    fclose(pam);
    for (i = 0; i < AA; i++)
        for (j = i + 1; j < AA; j++)
            sim[i][j] = sim[j][i]; // symmetrify
}

void read_files(FILE* in1, FILE* in2) {
    int i = 0;
    int nc;
    char ch;
    /** read first file into array "a" **/
    do {
        nc = fscanf(in1, "%c", &ch);
        if (nc > 0 && char2AAmem[ch] >= 0) {
            a[i++] = char2AAmem[ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in1);

    /** read second file into array "b" **/
    i = 0;
    do {
        nc = fscanf(in2, "%c", &ch);
        if (nc > 0 && char2AAmem[ch] >= 0) {
            b[i++] = char2AAmem[ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in2);
}

/* Begin AMPP */
void initChar2AATranslation(void)
{
    int i;
    for (i = 0; i < 256; i++) char2AAmem[i] = -1;
    char2AAmem['c'] = char2AAmem['C'] = 0;
    AA2charmem[0] = 'c';
    char2AAmem['g'] = char2AAmem['G'] = 1;
    AA2charmem[1] = 'g';
    char2AAmem['p'] = char2AAmem['P'] = 2;
    AA2charmem[2] = 'p';
    char2AAmem['s'] = char2AAmem['S'] = 3;
    AA2charmem[3] = 's';
    char2AAmem['a'] = char2AAmem['A'] = 4;
    AA2charmem[4] = 'a';
    char2AAmem['t'] = char2AAmem['T'] = 5;
    AA2charmem[5] = 't';
    char2AAmem['d'] = char2AAmem['D'] = 6;
    AA2charmem[6] = 'd';
    char2AAmem['e'] = char2AAmem['E'] = 7;
    AA2charmem[7] = 'e';
    char2AAmem['n'] = char2AAmem['N'] = 8;
    AA2charmem[8] = 'n';
    char2AAmem['q'] = char2AAmem['Q'] = 9;
    AA2charmem[9] = 'q';
    char2AAmem['h'] = char2AAmem['H'] = 10;
    AA2charmem[10] = 'h';
    char2AAmem['k'] = char2AAmem['K'] = 11;
    AA2charmem[11] = 'k';
    char2AAmem['r'] = char2AAmem['R'] = 12;
    AA2charmem[12] = 'r';
    char2AAmem['v'] = char2AAmem['V'] = 13;
    AA2charmem[13] = 'v';
    char2AAmem['m'] = char2AAmem['M'] = 14;
    AA2charmem[14] = 'm';
    char2AAmem['i'] = char2AAmem['I'] = 15;
    AA2charmem[15] = 'i';
    char2AAmem['l'] = char2AAmem['L'] = 16;
    AA2charmem[16] = 'l';
    char2AAmem['f'] = char2AAmem['F'] = 17;
    AA2charmem[17] = 'f';
    char2AAmem['y'] = char2AAmem['Y'] = 18;
    AA2charmem[18] = 'y';
    char2AAmem['w'] = char2AAmem['W'] = 19;
    AA2charmem[19] = 'w';
}

// Returns the current time in microseconds (despite the "Milli" name);
// the caller divides by 1000000 to report seconds.
int getTimeMilli() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    int ret = tv.tv_usec;
    ret += (tv.tv_sec * 1000000); // add the seconds, converted to microseconds
    return ret;
}

/* end AMPP*/





More Related Content

What's hot

Human Head Counting and Detection using Convnets
Human Head Counting and Detection using ConvnetsHuman Head Counting and Detection using Convnets
Human Head Counting and Detection using Convnetsrahulmonikasharma
 
CSC 347 – Computer Hardware and Maintenance
CSC 347 – Computer Hardware and MaintenanceCSC 347 – Computer Hardware and Maintenance
CSC 347 – Computer Hardware and MaintenanceSumaiya Ismail
 
FPGA IMPLEMENTATION OF SOFT OUTPUT VITERBI ALGORITHM USING MEMORYLESS HYBRID ...
FPGA IMPLEMENTATION OF SOFT OUTPUT VITERBI ALGORITHM USING MEMORYLESS HYBRID ...FPGA IMPLEMENTATION OF SOFT OUTPUT VITERBI ALGORITHM USING MEMORYLESS HYBRID ...
FPGA IMPLEMENTATION OF SOFT OUTPUT VITERBI ALGORITHM USING MEMORYLESS HYBRID ...VLSICS Design
 
IRJET- Implementation of FIR Filter using Self Tested 2n-2k-1 Modulo Adder
IRJET- Implementation of FIR Filter using Self Tested 2n-2k-1 Modulo AdderIRJET- Implementation of FIR Filter using Self Tested 2n-2k-1 Modulo Adder
IRJET- Implementation of FIR Filter using Self Tested 2n-2k-1 Modulo AdderIRJET Journal
 
P, a parameter that allows this value to be changed easily. Later, an interleaving parameter I was added.

During this project several performance models were created: one for a linear interconnection network and another for a 2D torus network. The B and I parameters were included in the model calculations. The optimum B and I were then derived, and performance tests were executed to determine these two parameters empirically.

2 Main Issues and Solutions

In this section the parallelization solutions are described. First, the solution with blocking at column level is explained and its performance model is described. Then the solution that combines blocking at column level with interleaving at row level is explained, together with its performance model. This section also provides the calculations of the optimum blocking factor B and interleaving factor I. The second part of the section describes the performance models of both solutions on a different network topology, again with the calculations for finding the optimum blocking factor B and interleaving factor I. Finally, our implementation of these techniques in C++ is provided and explained.
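For reference, the following minimal sketch shows the kind of sequential kernel that is being parallelized. It is an illustration, not the provided sequential code: the names sim, DELTA, a, b and h are assumed to follow the conventions of the implementation in Section 2.4 (a similarity matrix indexed by symbol codes, a linear gap penalty, and an H matrix clamped at zero).

    /*
     * Sketch of the sequential H-matrix computation (illustration only).
     * h is assumed to be (N+1) x (N+1); row 0 and column 0 act as the
     * zero-filled predecessor row and column.
     */
    void compute_h(int **h, const short *a, const short *b,
                   int **sim, int DELTA, int N) {
        for (int i = 0; i <= N; i++) h[i][0] = 0;   /* first column */
        for (int j = 0; j <= N; j++) h[0][j] = 0;   /* first row */
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= N; j++) {
                int diag  = h[i - 1][j - 1] + sim[a[i - 1]][b[j - 1]];
                int down  = h[i - 1][j] + DELTA;
                int right = h[i][j - 1] + DELTA;
                int max = diag > down ? diag : down;
                if (right > max) max = right;
                h[i][j] = max < 0 ? 0 : max;        /* clamp at zero */
            }
        }
    }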
2.1 Parallelization Techniques

2.1.1 Blocking Technique

Figure 1: Parallelization approach by introducing blocking at column level

The P processes share the matrix M in terms of consecutive rows. For calculating the matrix M of size N × N, each process Pi works with N/P consecutive rows of the matrix. When using a blocking technique for parallelization, the columns are divided into blocks of a defined size B, so each process has to calculate N/B blocks. These parameters are visualized in Figure 1. The top part of the figure shows how the elements of the matrix are divided between processes, and the bottom part visualizes how the calculations are parallelized between processes. It shows that when the first process has computed the first block of the matrix, of size N/P × B, it communicates with the next process. The next process then starts calculating its block of the matrix while the first process continues with its own next block, and so on. This type of parallelization is called pipelining: the problem is divided into a series of tasks that have to be completed one after the other. Before explaining the parallelization in detail, we should analyze the data and task dependencies between the processes that calculate the matrix.

Figure 2 shows the data dependency for a particular matrix element. In order to calculate a matrix element M[i][j], the process Pi+1 needs the calculated data from the previous column, M[i][j − 1], and the elements M[i − 1][j − 1] and M[i − 1][j] from the previous row, as seen in the picture.
Figure 2: Data dependency for calculating one matrix element

If the previous row is calculated by the process Pi, then that row is sent after process Pi calculates its block of size N/P × B. This introduces data and task dependencies: the process Pi+1 cannot start its calculations until the process Pi sends the last row of the block, which is needed for calculating the block of process Pi+1. To calculate the first row and column of the matrix, the predecessor row and column are considered to be filled with zeros.

Figure 3: Data dependencies between blocks of the matrix

Figure 3 shows the parallelism of the matrix in a wide window. The squares represent the blocks of the matrix and the three arrows show the data dependencies between the blocks. As mentioned before, an element needs its upper, left, and upper-left values to be calculated; this is called data dependency. Therefore, blocks on the same minor diagonal are independent from each other, so these blocks can be, and are, calculated in parallel. The steps of the calculation are as follows:
1. The process waits until the previous process finishes the calculation of a block (if applicable);

2. The process receives the last row of the block that was calculated by the previous process;

3. Having received the last row of the block calculated by the previous process, the process has all the information necessary to calculate its own block, so it performs the calculation of its block;

4. When the process finishes the calculation, it sends the last row of its block to the next process (if applicable);

5. The process repeats these steps until it finishes the calculation of all blocks, that is, until it has calculated all rows that are assigned to it.

2.1.2 Blocking and Interleaving Technique

Figure 4: Matrix calculation with interleaving factor, when I = 2

This parallelization method adds an interleave factor to the blocking technique described above. With this method the matrix is divided into I parts, so that each part has N × N/I elements. Every part is then calculated as explained in the previous section, that is, using the blocking technique. As soon as a process finishes processing the rows assigned to it from the first interleaving part, it continues with the blocks from the next interleave part. For example, in Figure 4, where the interleaving factor I = 2, the matrix is divided into two smaller parts. Each process Pi calculates N/(P · I) rows of one part before moving to the second part.
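As an illustration of this distribution, the short sketch below computes which consecutive rows a process owns within each interleave part. The helper name first_row is hypothetical, and N is assumed to be padded so that it is divisible by P · I, as is done in the implementation of Section 2.4.

    #include <stdio.h>

    /* Hypothetical helper: first global row (within interleave part k)
     * owned by process `rank`, assuming each process owns N/(P*I)
     * consecutive rows per part. */
    int first_row(int N, int P, int I, int k, int rank) {
        int chunk = N / (P * I);            /* rows per process per part */
        return k * (N / I) + rank * chunk;  /* part offset + rank offset */
    }

    int main(void) {
        int N = 16, P = 2, I = 2;           /* toy sizes: 4 rows each */
        for (int k = 0; k < I; k++)
            for (int rank = 0; rank < P; rank++)
                printf("part %d, process %d: rows %d..%d\n", k, rank,
                       first_row(N, P, I, k, rank),
                       first_row(N, P, I, k, rank) + N / (P * I) - 1);
        return 0;
    }

With N = 16, P = 2 and I = 2 this prints rows 0..3 and 4..7 for part 0 and rows 8..11 and 12..15 for part 1, matching the division shown in Figure 4.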
The steps of the calculation are very similar to those of the blocking technique and are as follows:

1. The process waits until the previous process finishes the calculation of a block (if applicable);

2. The process receives the last row of the block that was calculated by the previous process;

3. Having received the last row of the block calculated by the previous process, the process has all the information necessary to calculate its own block, so it performs the calculation of its block;

4. When the process finishes the calculation, it sends the last row of its block to the next process. If the process is the last one and there is another interleave part to calculate, it sends the row to the first process; otherwise it does not send anything;

5. The process repeats these steps until it finishes the calculation of all blocks within the current interleave part, that is, until it has calculated all rows assigned to it within that part. If there is another interleave part to calculate, it moves to the next interleave part and repeats these steps until all blocks from all interleave parts are calculated.

2.2 Performance Model on Linear Network Topology

2.2.1 Blocking Technique

In this section we describe the performance model of our implementation with the blocking technique for a linear network topology. In later sections we compare it with a non-linear topology, taking into account the differences in the performance models. In order to focus on the main objectives of this performance analysis, we only take into account the parallel algorithms used for the matrix calculation. This means that some parts of the code that are done sequentially on a single process, such as opening and reading the input files, are ignored in this model. Some assumptions were made for the models of the different network topologies, such as the assumption that the creation of new processes is location aware in terms of their place in the network, which makes it more efficient. All the performance models described in this section use the following notation:
• ts : startup time (prepare message + routing algorithm + interface between local node and the router);

• tc : computation time for each value in the matrix;

• tw : traversal time per word;

• Tcomm : total communication time;

• Tcomp : total computation time.

Figure 5: Communication and computation times of the parallel matrix calculation per process using the blocking technique

The diagram in Figure 5 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the needed communications. The different steps are represented with different colors. The blue color represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed to do the matrix calculations in each block, and the yellow color represents the time taken to send the last row of a block to the next process. In order to simplify the diagram, the time the last process needs to receive the last row of the block of the previous process is already taken into account in the upper yellow area. This explains why the last process has no yellow areas in its time-line but still has to wait to receive the blocks needed to perform its matrix calculations. All of this is considered in this performance model.

As we can observe from the diagram, the communication time of this model is composed of the scattering of the protein sequence vector (blue area) and several communications to send the last row of each block to the next process (yellow). The scatter method [2] receives a vector of size N
and delivers a vector of size N/P to each process. The scattering time is given by:

    Tscatter = ts · log(P) + (N/P) · (P − 1) · tw                        (1)

The sending of the last row of each block to the next process is composed of the communication startup time (ts) and the traversal time of the B elements in this block row. It is given by:

    TrowComm = ts + B · tw                                               (2)

In the total communication time, this startup and traversal are done N/B times for the first process and an extra P − 1 times for the remaining pipeline stages of the remaining processes. To take into consideration the fact that the last process does not need to send its last row to another process, we count P − 2 extra times instead. So the total communication time is given by:

    Tcomm = Tscatter + (N/B + P − 1 − 1) · TrowComm

    Tcomm = ts · log(P) + (N/P) · (P − 1) · tw
          + (N/B + P − 2) · (ts + B · tw)                                (3)

The next step is to calculate the total computation time. Keeping in mind that a block is composed of N/P rows and B columns, the total number of block elements is B · N/P. This means that the computation time of a single block is given by:
  • 10. Tparallel = Tcomp + Tcomm N N Tparallel = + P − 1 · tc · B · + B P N + ts · log(P ) + · (P − 1) · tw + P N + + P − 2 · (ts + B · tw ) (6) B 2.2.2 Blocking Technique: Optimum B In order to find an optimum B for fixed values of N and P , and assuming N is much bigger than P , we need to find the value of B for each the total parallel time of computation and communication is smaller. This value can be found be deriving the total parallelization time equation and finding the value of B for which the derivate is equal to zero. dTparallel =0⇔ dB tc BN N tc N tc N ⇔ −N + ts + Btw B −2 + +P −2 + tw + =0⇔ P B P P N · ts · P ⇔B= ⇔ P · tc · N + P2 · tw − tc · N − 2 · tw · P N · ts · P ⇔B= ⇔ tc · N · (P − 1) + P · tw · (P − 2) ts ⇔B= tw ·(P −2) tc ·(P −1) N + P For N P: ts B≈ (7) tc 10
2.2.3 Blocking and Interleaving Technique

In this section we describe the performance model of our implementation with the blocking and interleaving techniques for a linear network topology. In later sections we compare it with a non-linear topology, taking into account the differences in the performance models. As in the previous model, we use the notation introduced above and only take into account the parallel algorithms used for the matrix calculation.

Figure 6: Communication and computation times of the parallel matrix calculation per process using the blocking and interleaving techniques

The diagram in Figure 6 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the needed communications. The different steps are represented with different colors. The blue color represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed to do the matrix calculations in each block, and the yellow color represents the time taken to send the last row of a block to the next process. In order to simplify the diagram, the time the last process in the last interleave needs to receive the last row of the block of the previous process is already taken into account in the upper yellow area. This explains why this last process has no yellow areas in its time-line but still has to wait to receive the blocks needed to perform its matrix calculations. All of this is considered in this performance model.

As we can observe from the diagram, the communication time of this model is composed of the scattering of a part of the protein sequence vector (blue area) for each interleave and several communications to send the last
row of each block to the next process (yellow). The scatter method receives a vector of size N and delivers a vector of size N/(P · I) to each process per interleave. The scattering time is given by:

    Tscatter = ts · log(P) + (N/(P · I)) · (P − 1) · tw                  (8)

This scattering is done for each interleave, which means we have to multiply Tscatter by I:

    TTscatter = I · (ts · log(P) + (N/(P · I)) · (P − 1) · tw)

The sending of the last row of each block to the next process is composed of the communication startup time (ts) and the traversal time of the B elements in this block row. It is given by:

    TrowComm = ts + B · tw                                               (9)

In order to clearly describe the calculation of the total communication time, we split it into the communication time of the first I − 1 interleaves and the special case of the last interleave. For the first I − 1 interleaves, one might notice that each interleave introduces N/B extra yellow areas. This means that the communication time for all the startups and traversals in the first I − 1 interleaves is given by:

    TcommInter = (I − 1) · (N/B) · TrowComm

    TcommInter = (I − 1) · (N/B) · (ts + B · tw)                         (10)

The case of the last interleave is slightly different: we must take into account the typical extra P − 1 pipelining communications due to the different pipeline stages. Since in our implementation the last process does not need to send its last row to another process, there are only P − 2 extra communications. So the communication time for all the startups and traversals is given by:

    TcommLastInter = (N/B + P − 2) · TrowComm

    TcommLastInter = (N/B + P − 2) · (ts + B · tw)                       (11)
With these formulas we can finally describe the total communication time as the sum of the scattering times and the startup and traversal times of all the interleaves. The total communication time is given by:

    Tcomm = TTscatter + TcommInter + TcommLastInter

    Tcomm = I · (ts · log(P) + (N/(P · I)) · (P − 1) · tw)
          + (I − 1) · (N/B) · (ts + B · tw)
          + (N/B + P − 2) · (ts + B · tw)

    Tcomm = I · (ts · log(P) + (N/(P · I)) · (P − 1) · tw)
          + ((I − 1) · (N/B) + (N/B + P − 2)) · (ts + B · tw)            (12)

The next step is to calculate the total computation time. Keeping in mind that a block is composed of N/(P · I) rows and B columns, the total number of block elements is B · N/(P · I). This means that the computation time of a single block is given by:

    TcompBlock = tc · B · N/(P · I)                                      (13)

As we did for the total communication time, we have to take into account how the interleaving affects the computation. For the first I − 1 interleaves the computation time is given by:

    TcompInter = (I − 1) · (N/B) · tc · B · N/(P · I)                    (14)

Differently from the communication time, the last interleave has exactly N/B + P − 1 extra block computations. This means that the total computation time is given by:

    Tcomp = ((I − 1) · (N/B) + (N/B + P − 1)) · tc · B · N/(P · I)       (15)

To conclude this performance model, the total parallelization time is given by the sum of the total communication and computation times:

    Tparallel = Tcomp + Tcomm
    Tparallel = I · (ts · log(P) + (N/(P · I)) · (P − 1) · tw)
              + ((I − 1) · (N/B) + (N/B + P − 2))
                · (tc · B · N/(P · I) + ts + B · tw)
              + tc · B · N/(P · I)                                       (16)

2.2.4 Blocking and Interleaving Technique: Optimum B

In order to find the optimum B for given values of N, P and I, assuming N is much bigger than P, we need to find the value of B for which the total parallel time of computation and communication is smallest. This value can be found by differentiating the total parallelization time and finding the value of B for which the derivative is equal to zero:

    dTparallel/dB = 0
    ⇔ (−(I − 1) · N/B² − N/B²) · (tc · B · N/(I · P) + ts + B · tw)
      + ((I − 1) · N/B + N/B + P − 2) · (tc · N/(I · P) + tw)
      + tc · N/(I · P) = 0
    ⇔ B = √( N · ts · P · I² / (P · tc · N + P² · tw · I − tc · N − 2 · tw · I · P) )   (17)

For N ≫ P:

    B ≈ √( I · N · ts / tw )                                             (18)

2.2.5 Blocking and Interleaving Technique: Optimum I

In order to find the optimum I for given values of N, P and B, assuming N is much bigger than P, we need to find the value of I for which the total parallel time of computation and communication is smallest. This value can be found by differentiating the total parallelization time and finding the value of I for which the derivative is equal to zero:
    dTparallel/dI = 0
    ⇔ I = √( N · tc · (P − 1) · B² / (P · (B · ts · log(P) + N · ts + N · B · tw)) )   (19)

    I ≈ √( N · tc · B² / (B · ts · log(P) + N · ts + N · B · tw) )       (20)

2.3 Performance Model on 2D Torus Network Topology

2.3.1 Blocking Technique

Assuming that the spawning of processes is location aware in terms of the network topology, the only difference between the linear topology mentioned in the previous sections and the 2D torus network topology is in the scattering of the data [1]. So the new performance model for this topology is given by:

    Tparallel = (N/B + P − 1) · tc · B · N/P
              + 2 · ts · log(√P) + (N/P) · (P − 1) · tw
              + (N/B + P − 2) · (ts + B · tw)                            (21)

Although the scattering of the data is done faster, it is not affected by the variable B, so it does not affect the calculation of the optimum B. The optimum B therefore remains:

    B ≈ √( ts / tc )                                                     (22)

2.3.2 Blocking and Interleaving Technique

Let us also assume that the spawning of processes is location aware in terms of the network topology. This means the only difference between the linear topology mentioned in the previous sections and the 2D torus network
topology is in the scattering of the data. So the new performance model for this topology is given by:

    Tparallel = I · 2 · (ts · log(√P) + (N/(P · I)) · (P − 1) · tw)
              + ((I − 1) · (N/B) + (N/B + P − 2))
                · (tc · B · N/(P · I) + ts + B · tw)
              + tc · B · N/(P · I)                                       (23)

Just as in the blocking technique, the scattering is not affected by B, but it is affected by I. This means that the scattering depends on the level of interleaving, so the new equation for the optimum I is given by:

    I ≈ √( N · tc · B² / (2 · B · ts · log(√P) + N · ts + N · B · tw) )  (24)

The corresponding optimum B is given by:

    B ≈ √( I · N · ts / tw )                                             (25)

Taking into account the logarithmic properties, we deduce that the optimum I is the same for both network topologies. The only difference between the two is the time needed to perform the scattering.

2.4 Implementation

In this section, the implementation of our solution is provided and explained. Compared to the provided sequential one, our solution requires the extra parameters B and I, where B is the blocking factor and I is the interleaving factor. Note that in order not to use interleaving, the I parameter should be set to 1.

In our solution, all required data is first read by the root process and later broadcast or scattered to the other processes. Vector A is scattered to all of the processes. How much data every process receives depends on the I parameter and the number of processes: every process receives N/(I · P) rows before computing each of the interleave parts. Usually, N is not divisible by I · P, so padding is introduced. The number of elements that each process will receive during the scatter procedure is calculated and stored as follows:
    sizeA = N % (total_processes * I) != 0
        ? N + (total_processes * I) - (N % (total_processes * I))
        : N;
    chunk_size = sizeA / (total_processes * I);

Then the root process reads the data and shares it as follows:

    // Broadcast the Similarity Matrix
    MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
    // Broadcast the size of the portion of vector A that each
    // process will receive during the scatter
    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // Broadcast N, B, I and DELTA parameters
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

Later, each process allocates space for a portion of the H matrix, a portion of the A vector, and the whole B vector. Note that in our solution a process does not allocate the full-sized H matrix, but just a large enough portion of it where it writes its results. The sum of the sizes of the H matrix portions distributed throughout the processes is therefore N × N + N + N · (P · I): the whole matrix, the initial column filled with zeros, and the extra rows where the processes receive information from the other processes. The portions are stored in a three-dimensional array where the first dimension refers to an interleaving ID and the other two refer to the column and row. The memory is allocated and mapped, and the B vector is broadcast, as follows:

    // The buffers hold (chunk_size + 1) rows of N values for each of
    // the I interleave parts
    CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * N *
        (chunk_size + 1) * I)));
    CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*) *
        (chunk_size + 1) * I)));
    CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int*) * I)));
    for (int i = 0; i < (chunk_size + 1) * I; i++)
        chunk_h[i] = chunk_hptr + i * N;
    for (int i = 0; i < I; i++)
        chunk_ih[i] = chunk_h + i * (chunk_size + 1);
    CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) *
        (chunk_size))));
    if (rank != 0) {
        // The root process already has the B vector
        CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
    }
Then the root process reads the data and shares it as follows:

// Broadcast the similarity matrix
MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
// Broadcast the size of the portion of vector A that each process
// will receive during the scatter
MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
// Broadcast the N, B, I and DELTA parameters
MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

Later, each process allocates space for its portion of the H matrix, its portion of the A vector, and the whole B vector. Note that in our solution no process allocates the full-sized H matrix, only a portion large enough to hold its own results. The sum of the sizes of the H-matrix portions distributed across the processes is therefore N × N + N + N · (P · I): the whole matrix, the initial column filled with zeros, and the extra rows into which the processes receive information from other processes. The portions are stored in a three-dimensional array, where the first dimension refers to the interleave ID and the other two refer to the row and column. The memory is allocated and mapped, and the B vector is broadcast, as follows:

CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * N
                                        * (chunk_size + 1) * I)));
CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*)
                                      * (chunk_size + 1) * I)));
CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int*) * I)));
for (int i = 0; i < (chunk_size + 1) * I; i++)
    chunk_h[i] = chunk_hptr + i * N;
for (int i = 0; i < I; i++)
    chunk_ih[i] = chunk_h + i * (chunk_size + 1);
CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * chunk_size)));
if (rank != 0) { // The root process already has the B vector
    CHECK_NULL((b = (short *) malloc(sizeof(short) * N)));
}
MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

Later, each process calculates how many blocks there are in total and what the size of the final block is. This is needed because N is usually not divisible by B, so the final block is usually smaller than the rest. The time that marks the beginning of the computation is stored in the variable start. The main loop iterates over the interleaves and is repeated I times, as explained earlier (in the section describing the blocking and interleaving technique); at the start of each pass, every process receives its portion of the A vector.

int total_blocks = N / B + (N % B == 0 ? 0 : 1);
int last_block_size = N % B == 0 ? B : N % B;
MPI_Status status;
int start, end;

start = getTimeMilli();
for (int current_interleave = 0; current_interleave < I;
     current_interleave++) {
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT,
                chunk_a, chunk_size, MPI_SHORT, 0, MPI_COMM_WORLD);
    int current_column = 1;

    // Fill first column with 0
    for (int i = 0; i < chunk_size + 1; i++)
        chunk_ih[current_interleave][i][0] = 0;
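    // As an illustration of the scatter offset above: in interleave k,
    // rank r works on rows
    //     [k * chunk_size * total_processes + r * chunk_size,
    //      k * chunk_size * total_processes + (r + 1) * chunk_size)
    // of the padded A vector. For example, with total_processes = 4,
    // I = 2 and chunk_size = 3 (illustrative values only), rank 1
    // computes rows 3..5 of A in interleave 0 and rows 15..17 in
    // interleave 1.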
Then the main calculations begin. First, the process checks whether it has to receive data from another process; if so, it receives the row of values required for its calculations. Then it processes the current block, storing the results in its separate array, which will be gathered later. Finally, the process checks whether it has to send data to another process; if so, it sends the last row of the current block to the next process. The process repeats these actions total_blocks times. Finally, it saves the time after the execution in the end variable.

    for (int current_block = 0; current_block < total_blocks;
         current_block++) {
        // Receive
        int block_end = MIN2(current_column
                             - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) {
            for (int k = current_column; k < block_end; k++) {
                chunk_ih[current_interleave][0][k] = 0;
            }
        } else {
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
            int size_to_receive = current_block == total_blocks - 1
                                  ? last_block_size : B;
            MPI_Recv(chunk_ih[current_interleave][0] + current_block * B,
                     size_to_receive, MPI_INT, receive_from, 0,
                     MPI_COMM_WORLD, &status);
            if (DEBUG) printf("[%d] Received from %d: ", rank, receive_from);
            if (DEBUG) print_vector(chunk_ih[current_interleave][0]
                                    + current_block * B, size_to_receive);
        }
        // Process
        for (int j = current_column; j < block_end; j++, current_column++) {
            for (int i = 1; i < chunk_size + 1; i++) {
                int diag  = chunk_ih[current_interleave][i - 1][j - 1]
                            + sim[chunk_a[i - 1]][b[j - 1]];
                int down  = chunk_ih[current_interleave][i - 1][j] + DELTA;
                int right = chunk_ih[current_interleave][i][j - 1] + DELTA;
                int max = MAX3(diag, down, right);
                chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
            }
        }
        // Send
        if (current_interleave != I - 1 || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1
                               ? last_block_size : B;
            MPI_Send(chunk_ih[current_interleave][chunk_size]
                     + current_block * B, size_to_send, MPI_INT,
                     send_to, 0, MPI_COMM_WORLD);
            if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
            if (DEBUG) print_vector(chunk_ih[current_interleave][chunk_size]
                                    + current_block * B, size_to_send);
        }
    }
}
end = getTimeMilli();
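Note that the send is skipped only by the last process in the last interleave, since no process would consume that data; every other (rank, interleave) pair forwards the last row of each of its blocks to its successor, which is exactly the pipelined behaviour assumed by the performance models above.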
When all the calculations are finished, all processes start the gather step. After the gather is executed, the root process holds the whole H matrix. The root process then prints the execution time to the stderr stream and, if debugging is enabled, prints the H matrix.

for (int i = 0; i < I; i++) {
    // Gather the computed rows, skipping the first (halo) row
    // at the top of each interleave block
    MPI_Gather(chunk_hptr + (i * (chunk_size + 1) + 1) * N,
               N * chunk_size, MPI_INT,
               hptr + i * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}
if (rank == 0) {
    fprintf(stderr, "Execution: %f s\n", (double) (end - start) / 1000000);
}
if (DEBUG) {
    if (rank == 0) {
        for (int i = 0; i < N - 1; i++) {
            print_vector(h[i], N);
        }
    }
}

MPI_Finalize();
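To make the gather offsets easier to follow, the standalone sketch below (again with purely illustrative values; it is not part of the submitted program) prints, for every rank and interleave, which local rows are sent and where they land in the global H:

#include <stdio.h>

int main(void)
{
    int P = 4, I = 2, chunk_size = 3;   /* illustrative values only */
    for (int i = 0; i < I; i++) {
        /* Source: each interleave block holds chunk_size + 1 rows,
           row 0 being the halo row, so its computed rows start at
           local row i * (chunk_size + 1) + 1. */
        int src = i * (chunk_size + 1) + 1;
        for (int r = 0; r < P; r++) {
            /* Destination row of this rank's first computed row. */
            int dst = i * chunk_size * P + r * chunk_size;
            printf("interleave %d, rank %d: local rows %d..%d -> "
                   "H rows %d..%d\n",
                   i, r, src, src + chunk_size - 1,
                   dst, dst + chunk_size - 1);
        }
    }
    return 0;
}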
The full code is provided in Appendix C.

3 Performance Results

In this section, the performance results of our implementation on ALTIX are presented and compared to the performance of the sequential code.

3.1 Finding Optimal P and B

In order to find the optimal P and B, we tested the application with different P and B parameters, with N = 10,000. Before that, we timed the sequential code, which executed the calculations in 12.598 seconds. The execution times of the parallelized version are shown in Figure 7.

Figure 7: Performance results with different P and B, where N = 10,000

From this it can be concluded that with the parameters N = 10,000, B = 100, P = 8 and I = 1 the parallel code executes the calculations 9 times faster.

3.2 Finding Optimal I

In order to find the optimal I, we selected the best result from the previous test, where P = 8, and ran the test with different I and B parameters. The result is shown in Figure 8.

Figure 8: Performance results with different I and B, where N = 10,000 and P = 8

Because environmental factors such as network congestion affect our performance tests, the results may not be completely accurate. With that in mind, we deduced from the results that the optimal parameter configuration for N = 10,000 is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code.

Finally, we tested the parallel code with N = 25,000 and the parameters that we had found to be optimal. The code executed the calculations in 11.822213 seconds, whereas the sequential code ran for 76.884 seconds. From this it can be concluded that the parallel code runs 6.5 times faster. This result is worse because, as stated earlier,
the optimal B and I depend on N, so this parameter configuration is not optimal for computing the similarity of vectors of size N = 25,000.

4 Conclusions

During this project, a parallel implementation of the Smith-Waterman algorithm was made using blocking and interleaving techniques. The techniques and the code were explained in detail. Performance models for both the linear and the 2D torus network topologies were derived. Also, for each network topology, equations were found for the optimum blocking factor B when using the blocking technique, and for the optimum B and interleaving factor I when using the blocking and interleaving technique. After deriving the models, we concluded that the calculation of the B and I factors for our algorithm is the same on both of these network topologies.

Performance tests using multiple processes on different processors were carried out. It was found that the optimal configuration for calculating the sequence alignment of two vectors of size N = 10,000 with our implementation is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code. With the same parameter configuration, the parallel code calculates a matrix of size N = 25,000 6.5 times faster than the sequential code.
A How to Compile

all: seq par

seq:
	gcc SW.c -o seq.out

par:
	icc protein.cpp -o protein.out -lmpi

B How to Execute on ALTIX

#!/bin/bash
# @ job_name = ampp01parallel
# @ initialdir = .
# @ output = mpi_%j.out
# @ error = mpi_%j.err
# @ total_tasks = <number_of_process>
# @ wall_clock_limit = 00:01:00

mpirun -np <number_of_process> ./protein.out <vector_a> <vector_b> \
    <similarity_matrix> <gap_penalty> <N> <B> <I>
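For example, an invocation with the configuration found optimal in Section 3 could look as follows; the file names and the gap penalty of -4 are placeholders, to be replaced with whatever sequence files and PAM matrix are at hand:

mpirun -np 8 ./protein.out proteinA.txt proteinB.txt pam250.txt -4 10000 200 2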
C Code

#include <stdio.h>
#include <stdlib.h>   // def of RAND_MAX
#include <ctype.h>    // character handling
#include <sys/time.h>
#include "mpi.h"

#define DEBUG 1
#define MAX_SEQ 50
#define CHECK_NULL(_check) { \
    if ((_check) == NULL) { \
        fprintf(stderr, "Null Pointer allocating memory\n"); \
        exit(-1); \
    } \
}
#define AA 20 // number of amino acids
#define MAX2(x,y) ((x)<(y) ? (y) : (x))
#define MAX3(x,y,z) (MAX2(x,y)<(z) ? (z) : MAX2(x,y))
#define MIN2(x,y) ((x)>(y) ? (y) : (x))

// function prototypes
int getTimeMilli();
void read_pam(FILE* pam);
void read_files(FILE* in1, FILE* in2);
void print_vector(int* vector, int size);
void print_short_vector(short* vector, int size);
void memcopy(int* src, int* dst, int count);

/* begin AMPP */
int char2AAmem[256];
int AA2charmem[AA];
void initChar2AATranslation(void);
/* end AMPP */

/* Define global variables */
int rank, total_processes;
int DELTA;
short *a, *b;
int *chunk_hptr;
int **chunk_h, ***chunk_ih;
int *sim_ptr, **sim; // PAM similarity matrix
int N, sizeA, B, I, chunk_size;
short *chunk_a;
int *hptr;
int **h;
FILE *pam;

int main(int argc, char *argv[])
{
    /* begin AMPP */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &total_processes);

    CHECK_NULL((sim_ptr = (int *) malloc(AA * AA * sizeof(int))));
    CHECK_NULL((sim = (int **) malloc(AA * sizeof(int*))));
    for (int i = 0; i < AA; i++) sim[i] = sim_ptr + i * AA;

    if (rank == 0) {
        FILE *in1, *in2;

        /**** Error handling for input files ****/
        if (!(argc >= 5 && argc <= 8)) {
            fprintf(stderr,
                    "%s protein1 protein2 PAM gapPenalty [N] [B] [I]\n",
                    argv[0]);
            exit(1);
        } else {
            in1 = fopen(argv[1], "r");
            in2 = fopen(argv[2], "r");
            N = (argc > 5 ? atoi(argv[5]) : MAX_SEQ) + 1;
            B = argc > 6 ? atoi(argv[6]) : total_processes;
            I = argc > 7 ? atoi(argv[7]) : 1;
            DELTA = atoi(argv[4]);
        }
        /* end AMPP */

        /* begin AMPP */
        sizeA = N % (total_processes * I) != 0
                ? N + (total_processes * I) - (N % (total_processes * I))
                : N;
        CHECK_NULL((a = (short *) calloc(sizeof(short), sizeA)));
        CHECK_NULL((b = (short *) malloc(sizeof(short) * N)));
        initChar2AATranslation();
        read_files(in1, in2);

        chunk_size = sizeA / (total_processes * I);
        CHECK_NULL((hptr = (int *) calloc(N * sizeA, sizeof(int))));
        CHECK_NULL((h = (int **) calloc(sizeA, sizeof(int*))));
        for (int i = 0; i < sizeA; i++) h[i] = hptr + i * N;

        pam = fopen(argv[3], "r");
        read_pam(pam);
    }

    MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

    CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * N
                                            * (chunk_size + 1) * I)));
    CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*)
                                          * (chunk_size + 1) * I)));
    CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int*) * I)));
    for (int i = 0; i < (chunk_size + 1) * I; i++)
        chunk_h[i] = chunk_hptr + i * N;
    for (int i = 0; i < I; i++)
        chunk_ih[i] = chunk_h + i * (chunk_size + 1);

    CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * chunk_size)));
    if (rank != 0) { // The root process already has the B vector
        CHECK_NULL((b = (short *) malloc(sizeof(short) * N)));
    }
    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

    /*** PARALLEL PART ***/
    /** compute "h" local similarity array **/
    int total_blocks = N / B + (N % B == 0 ? 0 : 1);
    int last_block_size = N % B == 0 ? B : N % B;
    MPI_Status status;
    int start, end;

    start = getTimeMilli();
    for (int current_interleave = 0; current_interleave < I;
         current_interleave++) {
        MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                    chunk_size, MPI_SHORT,
                    chunk_a, chunk_size, MPI_SHORT, 0, MPI_COMM_WORLD);
        int current_column = 1;

        // Fill first column with 0
        for (int i = 0; i < chunk_size + 1; i++)
            chunk_ih[current_interleave][i][0] = 0;

        for (int current_block = 0; current_block < total_blocks;
             current_block++) {
            // Receive
            int block_end = MIN2(current_column
                                 - (current_block == 0 ? 1 : 0) + B, N);
            if (rank == 0 && current_interleave == 0) {
                for (int k = current_column; k < block_end; k++) {
                    chunk_ih[current_interleave][0][k] = 0;
                }
            } else {
                int receive_from = rank == 0 ? total_processes - 1
                                             : rank - 1;
                int size_to_receive = current_block == total_blocks - 1
                                      ? last_block_size : B;
                MPI_Recv(chunk_ih[current_interleave][0]
                         + current_block * B, size_to_receive, MPI_INT,
                         receive_from, 0, MPI_COMM_WORLD, &status);
                if (DEBUG) printf("[%d] Received from %d: ",
                                  rank, receive_from);
                if (DEBUG) print_vector(chunk_ih[current_interleave][0]
                                        + current_block * B,
                                        size_to_receive);
            }
            // Process
            for (int j = current_column; j < block_end;
                 j++, current_column++) {
                for (int i = 1; i < chunk_size + 1; i++) {
                    int diag  = chunk_ih[current_interleave][i - 1][j - 1]
                                + sim[chunk_a[i - 1]][b[j - 1]];
                    int down  = chunk_ih[current_interleave][i - 1][j]
                                + DELTA;
                    int right = chunk_ih[current_interleave][i][j - 1]
                                + DELTA;
                    int max = MAX3(diag, down, right);
                    chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
                }
            }
            // Send
            if (current_interleave != I - 1
                || rank + 1 != total_processes) {
                int send_to = rank + 1 == total_processes ? 0 : rank + 1;
                int size_to_send = current_block == total_blocks - 1
                                   ? last_block_size : B;
                MPI_Send(chunk_ih[current_interleave][chunk_size]
                         + current_block * B, size_to_send, MPI_INT,
                         send_to, 0, MPI_COMM_WORLD);
                if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
                if (DEBUG) print_vector(
                               chunk_ih[current_interleave][chunk_size]
                               + current_block * B, size_to_send);
            }
        }
    }
    end = getTimeMilli();

    for (int i = 0; i < I; i++) {
        // Gather the computed rows, skipping the first (halo) row
        // at the top of each interleave block
        MPI_Gather(chunk_hptr + (i * (chunk_size + 1) + 1) * N,
                   N * chunk_size, MPI_INT,
                   hptr + i * chunk_size * total_processes * N,
                   N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        fprintf(stderr, "Execution: %f s\n",
                (double) (end - start) / 1000000);
    }

    if (DEBUG) {
        if (rank == 0) {
            for (int i = 0; i < N - 1; i++) {
                print_vector(h[i], N);
            }
        }
    }

    // Free everything!
    free(sim_ptr);
    free(sim);
    free(b);
    free(chunk_ih);
    free(chunk_h);
    free(chunk_hptr);
    free(chunk_a);
    if (rank == 0) {
        free(a);
        free(hptr);
        free(h);
    }

    MPI_Finalize();
}

void memcopy(int* src, int* dst, int count)
{
    for (int i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}

void print_vector(int* vector, int size)
{
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void print_short_vector(short* vector, int size)
{
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void read_pam(FILE* pam)
{
    int i, j;
    int temp;

    /** read PAM250 similarity matrix **/
    /* begin AMPP */
    fscanf(pam, "%*s");
    /* end AMPP */
    for (i = 0; i < AA; i++)
        for (j = 0; j <= i; j++) {
            if (fscanf(pam, "%d ", &temp) == EOF) {
                fprintf(stderr, "PAM file empty\n");
                fclose(pam);
                exit(1);
            }
            sim[i][j] = temp;
        }
    fclose(pam);
    for (i = 0; i < AA; i++)
        for (j = i + 1; j < AA; j++)
            sim[i][j] = sim[j][i]; // symmetrify
}

void read_files(FILE* in1, FILE* in2)
{
    int i = 0;
    int nc;
    char ch;

    /** read first file into array "a" **/
    do {
        nc = fscanf(in1, "%c", &ch);
        if (nc > 0 && char2AAmem[ch] >= 0) {
            a[i++] = char2AAmem[ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in1);

    /** read second file into array "b" **/
    i = 0;
    do {
        nc = fscanf(in2, "%c", &ch);
        if (nc > 0 && char2AAmem[ch] >= 0) {
            b[i++] = char2AAmem[ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in2);
}

/* Begin AMPP */
void initChar2AATranslation(void)
{
    int i;
    for (i = 0; i < 256; i++) char2AAmem[i] = -1;
    char2AAmem['c'] = char2AAmem['C'] = 0;
    AA2charmem[0] = 'c';
    char2AAmem['g'] = char2AAmem['G'] = 1;
    AA2charmem[1] = 'g';
    char2AAmem['p'] = char2AAmem['P'] = 2;
    AA2charmem[2] = 'p';
    char2AAmem['s'] = char2AAmem['S'] = 3;
    AA2charmem[3] = 's';
    char2AAmem['a'] = char2AAmem['A'] = 4;
    AA2charmem[4] = 'a';
    char2AAmem['t'] = char2AAmem['T'] = 5;
    AA2charmem[5] = 't';
    char2AAmem['d'] = char2AAmem['D'] = 6;
    AA2charmem[6] = 'd';
    char2AAmem['e'] = char2AAmem['E'] = 7;
    AA2charmem[7] = 'e';
    char2AAmem['n'] = char2AAmem['N'] = 8;
    AA2charmem[8] = 'n';
    char2AAmem['q'] = char2AAmem['Q'] = 9;
    AA2charmem[9] = 'q';
    char2AAmem['h'] = char2AAmem['H'] = 10;
    AA2charmem[10] = 'h';
    char2AAmem['k'] = char2AAmem['K'] = 11;
    AA2charmem[11] = 'k';
    char2AAmem['r'] = char2AAmem['R'] = 12;
    AA2charmem[12] = 'r';
    char2AAmem['v'] = char2AAmem['V'] = 13;
    AA2charmem[13] = 'v';
    char2AAmem['m'] = char2AAmem['M'] = 14;
    AA2charmem[14] = 'm';
    char2AAmem['i'] = char2AAmem['I'] = 15;
    AA2charmem[15] = 'i';
    char2AAmem['l'] = char2AAmem['L'] = 16;
    AA2charmem[16] = 'l';
    char2AAmem['f'] = char2AAmem['F'] = 17;
    AA2charmem[17] = 'f';
    char2AAmem['y'] = char2AAmem['Y'] = 18;
    AA2charmem[18] = 'y';
    char2AAmem['w'] = char2AAmem['W'] = 19;
    AA2charmem[19] = 'w';
}

int getTimeMilli()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    // Note: despite its name, this function returns microseconds
    int ret = tv.tv_usec;
    ret += (tv.tv_sec * 1000000); // add seconds, converted to microseconds
    return ret;
}
/* end AMPP */