Smith-Waterman Algorithm Parallelization

  • Universitat Politècnica de Catalunya
    Facultat d'Informàtica de Barcelona
    AMPP Final Project
    Smith-Waterman Algorithm Parallelization
    Authors: Mário Almeida, Žygimantas Bruzgys, Umit Cavus Buyuksahin
    Supervisors: Josep Ramon Herrero Zaragoza, Daniel Jimenez Gonzalez
    Barcelona 2012
  • Contents
    1 Introduction
    2 Main Issues and Solutions
      2.1 Parallelization Techniques
        2.1.1 Blocking Technique
        2.1.2 Blocking and Interleaving Technique
      2.2 Performance Model on Linear Network Topology
        2.2.1 Blocking Technique
        2.2.2 Blocking Technique: Optimum B
        2.2.3 Blocking and Interleaving Technique
        2.2.4 Blocking and Interleaving Technique: Optimum B
        2.2.5 Blocking and Interleaving Technique: Optimum I
      2.3 Performance Model on 2D Torus Network Topology
        2.3.1 Blocking Technique
        2.3.2 Blocking and Interleaving Technique
      2.4 Implementation
    3 Performance Results
      3.1 Finding Optimal P and B
      3.2 Finding Optimal I
    4 Conclusions
    A How to Compile
    B How to Execute on ALTIX
    C Code
  • 1 Introduction

    This project presents a parallel implementation of the Smith-Waterman algorithm using the Message Passing Interface (MPI). Smith-Waterman is a well-known algorithm for local sequence alignment, that is, for determining similar regions between two amino-acid sequences.

    To find the best alignment between two amino-acid sequences, a matrix H of size N × N is computed, where N is the length of each sequence. Every element of this matrix is based on a score matrix (the cost of matching two symbols) and a gap penalty for mismatching symbols of the sequences. Once H is computed, the optimum alignment can be obtained by tracing back through the matrix, starting from its highest value.

    In our parallel implementation only the calculation of H was parallelized, as it is our only interest. The traceback part was removed from both the parallel and the sequential code in order to obtain the most accurate computation times for comparison. For parallelization a pipelining method was used. Following this model, each process communicates with the next one after calculating B columns of its N/P rows. This is called blocking. We introduced a parameter that makes it easy to change this value. Later, an interleaving parameter I was added.

    During this project several performance models were created: one for a linear interconnection network and another for a 2D torus network. The parameters B and I are included in the model calculations. The optimum B and I were then derived, and performance tests were executed to find these two parameters empirically.

    2 Main Issues and Solutions

    This section describes the parallelization solutions. First, the solution with blocking at column level is explained and its performance model is described. Then the solution with both blocking at column level and interleaving at row level is explained, together with its performance model. The section also provides the calculations of the optimum blocking factor B and interleaving factor I. The second part of the section describes the performance models of both solutions on a different network topology, along with the calculations of the optimum blocking factor B and interleaving factor I. Finally, our implementation of these techniques in C++ is provided and explained.
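The computation of H described in the introduction can be sketched sequentially as follows. This is a minimal illustration, not the project's code: the scoring constants `kMatch`, `kMismatch` and `kGap` are hypothetical stand-ins for the PAM similarity matrix and the gap penalty, and the function names are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scoring values, used only for illustration
// (the report reads a PAM similarity matrix and a gap penalty instead).
constexpr int kMatch = 2;
constexpr int kMismatch = -1;
constexpr int kGap = -1;

// Fills the (|a| + 1) x (|b| + 1) matrix H. Row 0 and column 0 stay zero,
// and every other cell depends on its upper, left and upper-left neighbours.
std::vector<std::vector<int>> computeH(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> H(a.size() + 1, std::vector<int>(b.size() + 1, 0));
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int diag = H[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? kMatch : kMismatch);
            int up = H[i - 1][j] + kGap;
            int left = H[i][j - 1] + kGap;
            H[i][j] = std::max({0, diag, up, left});  // local alignment never goes below 0
        }
    }
    return H;
}

// Highest value in H, which is where a traceback would start.
int maxScore(const std::vector<std::vector<int>>& H) {
    assert(!H.empty());
    int best = 0;
    for (const auto& row : H) {
        for (int v : row) best = std::max(best, v);
    }
    return best;
}
```

For instance, `maxScore(computeH("GATTACA", "TTAC"))` evaluates the best local alignment score of the two strings under the assumed scoring.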
  • 2.1 Parallelization Techniques

    2.1.1 Blocking Technique

    Figure 1: Parallelization approach by introducing blocking at column level

    The P processes share the matrix M in terms of consecutive rows. To calculate the matrix M of size N × N, each process P_i works with N/P consecutive rows of the matrix. With the blocking technique, the columns are divided into blocks of a defined size B, so each process has to calculate N/B blocks. These parameters are visualized in Figure 1. The top part of the figure shows how the elements of the matrix are divided between processes, and the bottom part shows how the calculations are parallelized across processes. When the first process finishes the first block of the matrix, which has size N/P × B, it communicates with the next process. The next process then starts calculating its own first block while the first process continues with its next block, and so on. This type of parallelization is called pipelining: the problem is divided into a series of tasks that have to be completed one after the other. Before explaining the parallelization in detail, we analyze the data and task dependencies between the processes that calculate the matrix.

    Figure 2: Data dependency for calculating one matrix element

    Figure 2 shows the data dependency for a particular matrix element. In order to calculate a matrix element M[i][j], the process P_{i+1} needs the element M[i][j − 1] from the previous column and the elements M[i − 1][j − 1] and M[i − 1][j] from the previous row. If the previous row is calculated by the process P_i, that row is sent after P_i calculates its block of size N/P × B. This introduces data and task dependencies: the process P_{i+1} cannot start its calculations until the process P_i sends the last row of the block, which is needed for calculating the block of process P_{i+1}. To calculate the first row and column of the matrix, the predecessor row and column are considered to be filled with zeros.

    Figure 3: Data dependencies between blocks of the matrix

    Figure 3 shows the parallelism across the whole matrix. The squares represent blocks of the matrix and the three arrows show the data dependencies between blocks. As mentioned before, an element needs its upper, left, and upper-left values to be calculated; this is its data dependency. Therefore, blocks on the same minor diagonal are independent of each other, so they can be, and are, calculated in parallel.

    The steps of the calculation are as follows:
  • 1. The process waits until the previous process finishes the calculation of a block (if applicable);
    2. The process receives the last row of the block that was calculated by the previous process;
    3. After receiving that row, the process has all the information needed to calculate its own block, so it performs the calculation;
    4. When the process finishes the calculation, it sends the last row of its block to the next process (if applicable);
    5. The process repeats these steps until it finishes the calculation of all blocks, that is, until it has calculated all rows assigned to it.

    2.1.2 Blocking and Interleaving Technique

    Figure 4: Matrix calculation with interleaving factor, when I = 2

    This parallelization method adds an interleaving factor to the blocking technique described above. With this method the matrix is divided into I parts, so that each part has N × N/I elements. Every part is then calculated as explained in the previous section, that is, using the blocking technique. As soon as a process finishes the rows assigned to it in the first interleave part, it continues with the blocks of the next interleave part. For example, in Figure 4, where the interleaving factor is I = 2, the matrix is divided into two smaller parts. Each process P_i calculates N/(P · I) rows of one part before moving to the second part.
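The row distribution described above can be made concrete with a small helper that maps a matrix row to the interleave part and process that compute it. This is only a sketch: the names `RowOwner` and `ownerOfRow` are hypothetical, and it assumes N is divisible by P · I, so no padding is needed.

```cpp
#include <cassert>

// Which interleave part and which process compute a given matrix row.
struct RowOwner {
    int interleave;  // index of the interleave part, 0 .. I-1
    int process;     // rank of the owning process, 0 .. P-1
};

// Assumes N is divisible by P * I (no padding), rows numbered from 0.
RowOwner ownerOfRow(int row, int N, int P, int I) {
    assert(P > 0 && I > 0 && N > 0 && N % (P * I) == 0);
    assert(row >= 0 && row < N);
    int chunk = N / (P * I);  // consecutive rows per process within one part
    return RowOwner{row / (chunk * P), (row / chunk) % P};
}
```

With N = 16, P = 4 and I = 2, each process handles 2 rows per part: rows 0 to 7 form the first part and rows 8 to 15 the second, so row 8 goes back to process 0.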
  • The steps of the calculation are very similar to those of the blocking technique and are as follows:

    1. The process waits until the previous process finishes the calculation of a block (if applicable);
    2. The process receives the last row of the block that was calculated by the previous process;
    3. After receiving that row, the process has all the information needed to calculate its own block, so it performs the calculation;
    4. When the process finishes the calculation, it sends the last row of its block to the next process. If the process is the last one and there is another interleave part to calculate, it sends the row to the first process. Otherwise it does not send anything;
    5. The process repeats these steps until it finishes the calculation of all blocks within the current interleave part, that is, until it has calculated all rows assigned to it within that part. If there is another interleave part to calculate, it moves to the next interleave part and repeats these steps until all blocks of all interleave parts are calculated.

    2.2 Performance Model on Linear Network Topology

    2.2.1 Blocking Technique

    This section describes the performance model of our implementation with the blocking technique for a linear network topology. In later sections we compare it with a non-linear topology, taking into account the differences between the performance models.

    In order to focus on the main objectives of this performance analysis, we only take into account the parallel algorithms used for the matrix calculation. This means that parts of the code that run sequentially on a single process, such as opening and reading the input files, are ignored in this model.

    Some assumptions were made for the models of the different network topologies, such as the assumption that the creation of new processes is location aware with respect to their place in the network, which makes communication more efficient.
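The pipeline steps listed above can be simulated to see how long the whole calculation takes in units of "one block". The sketch below is an illustration under strong simplifying assumptions that are not part of the report: every block costs exactly one time step, messages are free, and `pipelineSteps` and `blocksPerRow` (standing for N/B) are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Counts unit time steps until the last block is finished, assuming every block
// costs exactly one step and messages are free (simplifications for illustration).
int pipelineSteps(int P, int blocksPerRow, int I) {
    assert(P > 0 && blocksPerRow > 0 && I > 0);
    // finish[p][k]: step at which process p finished block k of its latest part
    std::vector<std::vector<int>> finish(P, std::vector<int>(blocksPerRow, 0));
    std::vector<int> freeAt(P, 0);  // step at which process p becomes idle
    int last = 0;
    for (int part = 0; part < I; ++part) {
        for (int p = 0; p < P; ++p) {
            for (int k = 0; k < blocksPerRow; ++k) {
                int ready = 0;  // first process, first part: the boundary row is all zeros
                if (p > 0) {
                    ready = finish[p - 1][k];  // steps 1-2: wait for the previous process
                } else if (part > 0) {
                    ready = finish[P - 1][k];  // the row wraps around from the last process
                }
                int start = std::max(ready, freeAt[p]);
                finish[p][k] = start + 1;      // step 3: compute the block
                freeAt[p] = finish[p][k];      // step 4: the last row is now available
                last = std::max(last, finish[p][k]);
            }
        }
    }
    return last;
}
```

Under these assumptions, `pipelineSteps(4, 8, 1)` gives 8 + 4 − 1 = 11 steps, and with I = 2 the first process moves on to its second part without stalling as long as there are at least as many blocks per row as processes. With fewer blocks than processes, for example `pipelineSteps(4, 2, 2)`, the first process has to wait for the last one.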
    For all the performance models described in this section we use the following notation:
  • • t_s: startup time (message preparation + routing algorithm + interface between the local node and the router).
    • t_c: computation time for each value in the matrix.
    • t_w: traversal time per word.
    • T_comm: total communication time.
    • T_comp: total computation time.

    Figure 5: Communication and computation times of the parallel matrix calculation by process using the blocking technique.

    The diagram in Figure 5 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the required communications. The different steps are represented with different colors. Blue represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed for the matrix calculations in each block, and yellow represents the time taken to send the last row of a block to the next process.

    To simplify the diagram, the time the last process needs to receive the last row of the block of the previous process is already accounted for in the upper yellow area. This explains why the last process has no yellow areas in its timeline but still has to wait to receive the rows it needs for its matrix calculations. All of this is considered in this performance model.

    As the diagram shows, the communication time of this model is composed of the scattering of the protein sequence vector (blue area) and several communications that send the last row of each block to the next process (yellow). The scatter method [2] receives a vector of size N and delivers a vector of size N/P to each process. The scattering time is given by:

    $$T_{scatter} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w \qquad (1)$$

    Sending the last row of a block to the next process is composed of the communication startup time (t_s) and the traversal time of the B elements in that row:

    $$T_{rowComm} = t_s + B \cdot t_w \qquad (2)$$

    In the total communication time, this startup and traversal happen N/B times for the first process and an extra P − 1 times for the pipeline stages of the remaining processes. Since the last process does not need to send its last row to another process, we count P − 2 extra times instead. So the total communication time is given by:

    $$T_{comm} = T_{scatter} + \left(\frac{N}{B} + P - 1 - 1\right) \cdot T_{rowComm}$$

    $$T_{comm} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \qquad (3)$$

    The next step is to calculate the total computation time. Since a block is composed of N/P rows and B columns, the total number of block elements is B · N/P. This means that the computation time of a single block is given by:

    $$T_{compBlock} = t_c \cdot B \cdot \frac{N}{P} \qquad (4)$$

    As for the total communication time, this computation time is multiplied by N/B + P − 1 to account for the computation of the blocks of all the processes:

    $$T_{comp} = \left(\frac{N}{B} + P - 1\right) \cdot T_{compBlock}$$

    $$T_{comp} = \left(\frac{N}{B} + P - 1\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P}\right) \qquad (5)$$

    To conclude this performance model, the total parallelization time is the sum of the total communication and computation times. So the total parallelization time is given by:
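  • $$T_{parallel} = T_{comp} + T_{comm}$$

    $$T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P} + t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \qquad (6)$$

    2.2.2 Blocking Technique: Optimum B

    To find the optimum B for fixed values of N and P, assuming N is much bigger than P, we need the value of B for which the total parallel time of computation and communication is smallest. This value can be found by differentiating the total parallelization time with respect to B and finding the value of B for which the derivative is equal to zero.

    $$\frac{dT_{parallel}}{dB} = 0 \Leftrightarrow$$

    $$\Leftrightarrow -\frac{N}{B^2}\left(\frac{t_c B N}{P} + t_s + B t_w\right) + \left(\frac{N}{B} + P - 2\right)\left(\frac{t_c N}{P} + t_w\right) + \frac{t_c N}{P} = 0 \Leftrightarrow$$

    $$\Leftrightarrow B = \sqrt{\frac{N \cdot t_s \cdot P}{P \cdot t_c \cdot N + P^2 \cdot t_w - t_c \cdot N - 2 \cdot t_w \cdot P}} \Leftrightarrow$$

    $$\Leftrightarrow B = \sqrt{\frac{N \cdot t_s \cdot P}{t_c \cdot N \cdot (P - 1) + P \cdot t_w \cdot (P - 2)}} \Leftrightarrow$$

    $$\Leftrightarrow B = \sqrt{\frac{t_s}{\frac{t_w \cdot (P - 2)}{N} + \frac{t_c \cdot (P - 1)}{P}}}$$

    For N ≫ P:

    $$B \approx \sqrt{\frac{t_s}{t_c}} \qquad (7)$$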
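The model in Equation (6) and the exact optimum B derived above can be evaluated numerically. The sketch below is an illustration only: the struct and function names are hypothetical, the logarithm is assumed to be base 2 (the text does not state the base, and it does not influence the optimum B), and B is treated as a real number rather than an integer.

```cpp
#include <cassert>
#include <cmath>

// Model constants in arbitrary but consistent time units.
struct ModelParams {
    double ts;  // startup time
    double tw;  // traversal time per word
    double tc;  // computation time per matrix value
};

// Equation (6): total parallel time of the blocking technique.
// Assumption: log is taken in base 2.
double tParallelBlocking(double N, double P, double B, const ModelParams& m) {
    assert(N > 0 && P >= 1 && B > 0);
    double comp = (N / B + P - 1) * m.tc * B * N / P;
    double scatter = m.ts * std::log2(P) + N / P * (P - 1) * m.tw;
    double rows = (N / B + P - 2) * (m.ts + B * m.tw);
    return comp + scatter + rows;
}

// Exact stationary point of Equation (6) with respect to B (before taking N >> P).
double optimumB(double N, double P, const ModelParams& m) {
    assert(N > 0 && P >= 2);
    return std::sqrt(N * m.ts * P / (m.tc * N * (P - 1) + P * m.tw * (P - 2)));
}
```

As a usage example with made-up constants, `optimumB(10000, 8, ModelParams{100.0, 1.0, 1.0})` is close to 10.7, and evaluating `tParallelBlocking` slightly below or above that B gives a larger time, which is a quick way to check the derivation.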
  • 2.2.3 Blocking and Interleaving Technique

    This section describes the performance model of our implementation with the blocking and interleaving techniques for a linear network topology. In later sections we compare it with a non-linear topology, taking into account the differences between the performance models. As in the previous model, we use the notation introduced above and only take into account the parallel algorithms used for the matrix calculation.

    Figure 6: Communication and computation times of the parallel matrix calculation by process using the blocking and interleaving techniques.

    The diagram in Figure 6 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the required communications. The different steps are represented with different colors. Blue represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed for the matrix calculations in each block, and yellow represents the time taken to send the last row of a block to the next process.

    To simplify the diagram, the time the last process in the last interleave needs to receive the last row of the block of the previous process is already accounted for in the upper yellow area. This explains why this last process has no yellow areas in its timeline but still has to wait to receive the rows it needs for its matrix calculations. All of this is considered in this performance model.

    As the diagram shows, the communication time of this model is composed of the scattering of a part of the protein sequence vector (blue area) for each interleave and several communications that send the last row of each block to the next process (yellow). The scatter method receives a vector of size N and delivers a vector of size N/(P · I) to each process per interleave. The scattering time is given by:

    $$T_{scatter} = t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w \qquad (8)$$

    This scattering is done for each interleave, so T_scatter has to be multiplied by I:

    $$T_{Tscatter} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right)$$

    Sending the last row of a block to the next process is composed of the communication startup time (t_s) and the traversal time of the B elements in that row:

    $$T_{rowComm} = t_s + B \cdot t_w \qquad (9)$$

    To describe the calculation of the total communication time clearly, we split it into the communication time of the first I − 1 interleaves and the special case of the last interleave. For the first I − 1 interleaves, each interleave introduces N/B extra yellow areas. This means that the communication time for all the startups and traversals of the first I − 1 interleaves is given by:

    $$T_{commInter} = (I - 1) \cdot \frac{N}{B} \cdot T_{rowComm}$$

    $$T_{commInter} = (I - 1) \cdot \frac{N}{B} \cdot (t_s + B \cdot t_w) \qquad (10)$$

    The case of the last interleave is slightly different: we must take into account the usual P − 1 extra communications of a pipeline, caused by its different stages. Since in our implementation the last process does not need to send its last row to another process, there are only P − 2 extra communications. So the communication time for all the startups and traversals is given by:

    $$T_{commLastInterleave} = \left(\frac{N}{B} + P - 2\right) \cdot T_{rowComm}$$

    $$T_{commLastInterleave} = \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \qquad (11)$$
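The communication terms in Equations (8) to (11) can be added up in a few lines of code, which also makes it easy to check that I = 1 reproduces the communication time of the pure blocking technique. This is a sketch with hypothetical names (`CommParams`, `rowComm`, `commTimeInterleaved`), and it assumes a base-2 logarithm, which the text does not specify.

```cpp
#include <cassert>
#include <cmath>

// Communication constants in arbitrary but consistent time units.
struct CommParams {
    double ts;  // startup time
    double tw;  // traversal time per word
};

// Equation (9): time to send one row of B elements.
double rowComm(double B, const CommParams& c) {
    return c.ts + B * c.tw;
}

// Sum of the scattering over all interleaves, Equation (10) and Equation (11).
// Assumption: log is taken in base 2.
double commTimeInterleaved(double N, double P, double B, double I, const CommParams& c) {
    assert(N > 0 && P >= 1 && B > 0 && I >= 1);
    double scatterAll = I * (c.ts * std::log2(P) + N / (P * I) * (P - 1) * c.tw);
    double firstParts = (I - 1) * (N / B) * rowComm(B, c);  // first I - 1 interleaves
    double lastPart = (N / B + P - 2) * rowComm(B, c);      // last interleave
    return scatterAll + firstParts + lastPart;
}
```

With the made-up values N = 1000, P = 4, B = 100, t_s = 10 and t_w = 1, the function gives 2090 for I = 1 and 3210 for I = 2. The difference comes from one extra scatter startup and N/B extra row transfers.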
  • With these formulas we can finally describe the total communication time as the sum of the scattering times and the startup and traversal times of all the interleaves. So the total communication time is given by:

    $$T_{comm} = T_{Tscatter} + T_{commInter} + T_{commLastInterleave}$$

    $$T_{comm} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + (I - 1) \cdot \frac{N}{B} \cdot (t_s + B \cdot t_w) + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w)$$

    $$T_{comm} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 2\right)\right) \cdot (t_s + B \cdot t_w) \qquad (12)$$

    The next step is to calculate the total computation time. Since a block is composed of N/(P · I) rows and B columns, the total number of block elements is B · N/(P · I). This means that the computation time of a single block is given by:

    $$T_{compBlock} = t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (13)$$

    As for the total communication time, we have to take into account how the interleaving affects the computation. For the first I − 1 interleaves the computation time is given by:

    $$T_{compInter} = (I - 1) \cdot \frac{N}{B} \cdot t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (14)$$

    Unlike the communication time, the last interleave has exactly N/B + P − 1 extra block computations. This means that the total computation time is given by:

    $$T_{comp} = \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 1\right)\right) \cdot t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (15)$$

    To conclude this performance model, the total parallelization time is the sum of the total communication and computation times. So the total parallelization time is given by:

    $$T_{parallel} = T_{comp} + T_{comm}$$
  • $$T_{parallel} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 2\right)\right) \times \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (16)$$

    2.2.4 Blocking and Interleaving Technique: Optimum B

    To find the optimum B for given values of N, P and I, assuming N is much bigger than P, we need the value of B for which the total parallel time of computation and communication is smallest. This value can be found by differentiating the total parallelization time with respect to B and finding the value of B for which the derivative is equal to zero.

    $$\frac{dT_{parallel}}{dB} = 0 \Leftrightarrow$$

    $$\Leftrightarrow \left(-\frac{(I - 1)N}{B^2} - \frac{N}{B^2}\right) \cdot \left(\frac{t_c B N}{I P} + t_s + B t_w\right) + \left(\frac{(I - 1)N}{B} + \frac{N}{B} + P - 2\right) \cdot \left(\frac{t_c N}{I P} + t_w\right) + \frac{t_c N}{I P} = 0 \Leftrightarrow$$

    $$\Leftrightarrow B = \sqrt{\frac{N t_s P I^2}{P t_c N + P^2 t_w I - t_c N - 2 t_w I P}} \qquad (17)$$

    For N ≫ P:

    $$B \approx \sqrt{\frac{I N t_s}{t_w}} \qquad (18)$$

    2.2.5 Blocking and Interleaving Technique: Optimum I

    To find the optimum I for given values of N, P and B, assuming N is much bigger than P, we need the value of I for which the total parallel time of computation and communication is smallest. This value can be found by differentiating the total parallelization time with respect to I and finding the value of I for which the derivative is equal to zero.
  • $$\frac{dT_{parallel}}{dI} = 0 \Leftrightarrow$$

    $$\Leftrightarrow I = \sqrt{\frac{N t_c (P - 1) B^2}{P (B t_s \log(P) + N t_s + N B t_w)}} \qquad (19)$$

    $$I \approx \sqrt{\frac{N t_c B^2}{B t_s \log(P) + N t_s + N B t_w}} \qquad (20)$$

    2.3 Performance Model on 2D Torus Network Topology

    2.3.1 Blocking Technique

    Assuming that the spawning of processes is location aware in terms of the network topology, the only difference between the linear topology of the previous sections and the 2D torus network topology is in the scattering of data [1]. So the new performance model for this topology is given by:

    $$T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P} + 2 \cdot t_s \cdot \log(\sqrt{P}) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \qquad (21)$$

    Although the scattering of data is faster, it does not depend on the variable B, so it does not affect the calculation of the optimum B. The optimum B therefore remains the following:

    $$B \approx \sqrt{\frac{t_s}{t_c}} \qquad (22)$$

    2.3.2 Blocking and Interleaving Technique

    Let us again assume that the spawning of processes is location aware in terms of the network topology. This means the only difference between the linear topology of the previous sections and the 2D torus network topology is in the scattering of data. So the new performance model for this topology is given by:

    $$T_{parallel} = I \cdot 2 \cdot \left(t_s \cdot \log(\sqrt{P}) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 2\right)\right) \times \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I} \qquad (23)$$

    Just as in the blocking technique, the scattering is not affected by B, but it is affected by I. This means that the scattering depends on the level of interleaving. So the new equation for the optimum I is given by:

    $$I \approx \sqrt{\frac{N t_c B^2}{2 B t_s \log(\sqrt{P}) + N t_s + N B t_w}} \qquad (24)$$

    The corresponding optimum B is given by:

    $$B \approx \sqrt{\frac{I N t_s}{t_w}} \qquad (25)$$

    Since 2 · log(√P) = log(P), we deduce that the optimum I is the same for both network topologies. The only difference between the two is the time needed to perform the scattering.

    2.4 Implementation

    This section presents and explains the implementation of our solution. Compared to the provided sequential code, our solution requires two extra parameters, B and I, where B is the blocking factor and I is the interleaving factor. Note that to run without interleaving, the I parameter should be set to 1.

    In our solution, all required data is first read by the root process and later broadcast or scattered to the other processes. Vector A is scattered to all of the processes. How much information is scattered to each process depends on the I parameter and on the number of processes: every process receives N/(I · P) rows before computing each of the interleave parts. Usually N is not divisible by I · P, so padding is introduced. The number of elements that each process receives during the scatter procedure is calculated and stored as follows:
  •     sizeA = N % (total_processes * I) != 0
                ? N + (total_processes * I) - (N % (total_processes * I))
                : N;
        chunk_size = sizeA / (total_processes * I);

    Then the root process reads the data and shares it as follows:

        // Broadcast the similarity matrix
        MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
        // Broadcast the size of the portion of vector A that each process receives during scatter
        MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
        // Broadcast the N, B, I and DELTA parameters
        MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

    Later, each process allocates space for a portion of the H matrix, for a portion of the A vector and for the whole B vector. Note that in our solution a process does not allocate the full-sized H matrix, but only the portion of the matrix where it writes its own results. So the sum of the sizes of the H matrix portions distributed across the processes is N × N + N + N · (P · I): the whole matrix, the initial column filled with zeros, and the extra rows where the processes receive information from other processes. The portions are stored in a three-dimensional array where the first dimension is the interleave ID and the other two are row and column. The memory is allocated and mapped, and the B vector is broadcast, as follows:

        // Each process holds I parts of (chunk_size + 1) rows:
        // one received row plus chunk_size computed rows per part.
        // The allocation sizes include the factor I so that the index loops below stay in bounds.
        CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * (N) * (chunk_size + 1) * I)));
        CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*) * (chunk_size + 1) * I)));
        CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int**) * I)));
        for (int i = 0; i < (chunk_size + 1) * I; i++)
            chunk_h[i] = chunk_hptr + i * N;
        for (int i = 0; i < I; i++)
            chunk_ih[i] = chunk_h + i * (chunk_size + 1);
        CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * (chunk_size))));
        if (rank != 0) { // The root process already has the B vector
            CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
        }
        MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

    Next, each process calculates how many blocks there are in total and what the size of the final block is. This is needed because N is usually not divisible by B, so the final block is usually smaller than the rest. The time that marks the beginning of the computation is stored in the variable start. In the main loop, which iterates over the interleaves, each process receives a portion of the A vector. The main loop is repeated I times, as explained earlier in the section describing the blocking and interleaving technique.

        int total_blocks = N / B + (N % B == 0 ? 0 : 1);
        int last_block_size = N % B == 0 ? B : N % B;
        MPI_Status status;
        int start, end;
        start = getTimeMilli();
        for (int current_interleave = 0; current_interleave < I; current_interleave++) {
            MPI_Scatter(a + current_interleave * chunk_size * total_processes, chunk_size, MPI_SHORT,
                        chunk_a, chunk_size, MPI_SHORT, 0, MPI_COMM_WORLD);
            int current_column = 1;
            // Fill first column with 0
            for (int i = 0; i < chunk_size + 1; i++)
                chunk_ih[current_interleave][i][0] = 0;

    Then the main calculations begin. First, the process checks whether it has to receive data from another process. If so, it receives the data required for the calculations. Then it processes the cells of the current block and stores the results in a separate array, which is gathered later. Finally, the process checks whether it has to send data to another process. If so, it sends the last row of the current block to that process. The process repeats these actions total_blocks times. At the end, it saves the time after execution in the end variable.

            for (int current_block = 0; current_block < total_blocks; current_block++) {
                // Receive
                int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
                if (rank == 0 && current_interleave == 0) {
                    for (int k = current_column; k < block_end; k++) {
                        chunk_ih[current_interleave][0][k] = 0;
                    }
                } else {
                    int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
                    int size_to_receive = current_block == total_blocks - 1 ? last_block_size : B;
                    MPI_Recv(chunk_ih[current_interleave][0] + current_block * B, size_to_receive,
                             MPI_INT, receive_from, 0, MPI_COMM_WORLD, &status);
                    if (DEBUG) printf("[%d] Received from %d: ", rank, receive_from);
                    if (DEBUG) print_vector(chunk_ih[current_interleave][0] + current_block * B, size_to_receive);
                }
                // Process
                for (int j = current_column; j < block_end; j++, current_column++) {
                    for (int i = 1; i < chunk_size + 1; i++) {
                        int diag = chunk_ih[current_interleave][i - 1][j - 1] + sim[chunk_a[i - 1]][b[j - 1]];
                        int down = chunk_ih[current_interleave][i - 1][j] + DELTA;
                        int right = chunk_ih[current_interleave][i][j - 1] + DELTA;
                        int max = MAX3(diag, down, right);
                        chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
                    }
                }
                // Send
                if (current_interleave != I - 1 || rank + 1 != total_processes) {
                    int send_to = rank + 1 == total_processes ? 0 : rank + 1;
                    int size_to_send = current_block == total_blocks - 1 ? last_block_size : B;
                    MPI_Send(chunk_ih[current_interleave][chunk_size] + current_block * B, size_to_send,
                             MPI_INT, send_to, 0, MPI_COMM_WORLD);
                    if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
                    if (DEBUG) print_vector(chunk_ih[current_interleave][chunk_size] + current_block * B, size_to_send);
                }
            }
        }
        end = getTimeMilli();

    When all the calculations are finished, all processes start the gather. After the gather is executed, the root process has the whole H matrix. Then the root process prints the execution time to the stderr stream and, if debug is enabled, it prints the H matrix.

        for (int i = 0; i < I; i++) {
            // chunk_ih[i][1] is the first computed row of interleave part i
            // (row 0 of each part holds the row received from the previous process)
            MPI_Gather(chunk_ih[i][1], N * chunk_size, MPI_INT,
                       hptr + i * chunk_size * total_processes * N, N * chunk_size, MPI_INT,
                       0, MPI_COMM_WORLD);
        }
        if (rank == 0) {
            fprintf(stderr, "Execution: %f s\n", (double) (end - start) / 1000000);
        }
        if (DEBUG) {
            if (rank == 0) {
                for (int i = 0; i < N - 1; i++) {
                    print_vector(h[i], N);
                }
            }
        }
        MPI_Finalize();

    The full code is provided in the Appendix section.
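The padding and block-size arithmetic used in the implementation above can be checked in isolation, without MPI. The helper functions below are hypothetical wrappers around the same expressions that the implementation uses for `sizeA`, `chunk_size`, `total_blocks` and `last_block_size`. They are meant as a sketch for experimenting with parameter values, not as part of the project's code.

```cpp
#include <cassert>

// Length of vector A after padding it to a multiple of P * I.
int paddedSize(int N, int P, int I) {
    assert(N > 0 && P > 0 && I > 0);
    int unit = P * I;
    return N % unit != 0 ? N + unit - N % unit : N;
}

// Rows each process receives per interleave part.
int chunkSize(int N, int P, int I) {
    return paddedSize(N, P, I) / (P * I);
}

// Number of column blocks, counting a smaller final block if B does not divide N.
int totalBlocks(int N, int B) {
    assert(N > 0 && B > 0);
    return N / B + (N % B == 0 ? 0 : 1);
}

// Width of the final block.
int lastBlockSize(int N, int B) {
    assert(N > 0 && B > 0);
    return N % B == 0 ? B : N % B;
}
```

For example, with N = 10001, P = 8 and I = 2 the vector is padded to 10016 elements and each process receives 626 rows per part. With B = 200 there are 51 blocks, the last of which is a single column wide.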
  • 3 Performance Results

    This section presents the performance results of our implementation on ALTIX and compares them with the performance of the sequential code.

    3.1 Finding Optimal P and B

    To find the optimal P and B, we tested the application with different P and B parameters for N = 10,000. Before that, we tested the sequential code, which executed the calculations in 12.598 seconds. The execution times of the parallelized version are shown in Figure 7.

    Figure 7: Performance results with different P and B where N = 10,000

    From this it can be concluded that with the parameters N = 10,000, B = 100, P = 8 and I = 1 the parallel code executed the calculations 9 times faster.

    3.2 Finding Optimal I

    To find the optimal I, we selected the best result from the previous test, where P = 8, and ran the test with different I and B parameters. The result is shown in Figure 8.

    Figure 8: Performance results with different I and B where N = 10,000 and P = 8

    Because environmental factors such as network congestion affect our performance tests, the results might not be completely accurate. With that caveat, we deduced from the results that the optimal parameter configuration for N = 10,000 is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code. Finally, we tested the parallel code with N = 25,000 and the parameters that we found to be optimal. The code executed the calculations in 11.822213 seconds, whereas the sequential code ran for 76.884 seconds. From this it can be concluded that the parallel code runs 6.5 times faster. The speedup is lower because, as stated earlier, B and I depend on N, so this parameter configuration is not optimal for calculating the similarity of vectors of size N = 25,000.

    4 Conclusions

    During this project a parallel implementation of the Smith-Waterman algorithm was made using blocking and interleaving techniques. The techniques and the code were explained in detail. The performance models for both the linear and the 2D torus topologies were calculated. Also, for each network topology, equations were found for the optimum blocking factor B when using the blocking technique, and for the optimum B and interleaving factor I when using the blocking and interleaving technique. After calculating the models, we concluded that the calculation of the B and I factors for our algorithm is the same on these particular network topologies.

    Performance tests using multiple processes on different processors were carried out. The optimal configuration found for calculating the sequence alignment of two vectors of size N = 10,000 with our implementation is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code. With the same parameter configuration the parallel code calculates the matrix of size N = 25,000 6.5 times faster than the sequential code.
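The speedup quoted for N = 25,000 follows directly from the two measured times. The short sketch below recomputes it and also derives the parallel efficiency (speedup divided by the number of processes), a standard derived metric that the report itself does not state. The function names are hypothetical.

```cpp
#include <cassert>

// Speedup: sequential time divided by parallel time.
double speedup(double tSequential, double tParallel) {
    assert(tSequential > 0 && tParallel > 0);
    return tSequential / tParallel;
}

// Efficiency: speedup per process (1.0 would be ideal linear scaling).
double efficiency(double tSequential, double tParallel, int processes) {
    assert(processes > 0);
    return speedup(tSequential, tParallel) / processes;
}
```

Calling `speedup(76.884, 11.822213)` gives roughly 6.5, in line with the figure reported above, and `efficiency(76.884, 11.822213, 8)` is roughly 0.81 for P = 8.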
  • A How to Compile

all: seq par

seq:
	gcc SW.c -o seq.out

par:
	icc protein.cpp -o protein.out -lmpi

B How to Execute on ALTIX

#!/bin/bash
# @ job_name = ampp01parallel
# @ initialdir = .
# @ output = mpi_%j.out
# @ error = mpi_%j.err
# @ total_tasks = <number_of_process>
# @ wall_clock_limit = 00:01:00
mpirun -np <number_of_process> ./protein.out <vector_a> <vector_b> <similarity_matrix> <gap_penalty> <N> <B> <I>

C Code

#include <stdio.h>
#include <stdlib.h>   // malloc, calloc, atoi, exit
#include <ctype.h>    // character handling
#include <sys/time.h> // gettimeofday
#include "mpi.h"

#define DEBUG 1
#define MAX_SEQ 50

#define CHECK_NULL(_check) { \
    if ((_check) == NULL) { \
        fprintf(stderr, "Null Pointer allocating memory\n"); \
        exit(-1); \
    } \
}

#define AA 20 // number of amino acids
#define MAX2(x,y) ((x)<(y) ? (y) : (x))
#define MAX3(x,y,z) (MAX2(x,y)<(z) ? (z) : MAX2(x,y))
#define MIN2(x,y) ((x)>(y) ? (y) : (x))

// function prototypes
long long getTimeMilli();
void read_pam(FILE* pam);
void read_files(FILE* in1, FILE* in2);
void print_vector(int* vector, int size);
void print_short_vector(short* vector, int size);
void memcopy(int* src, int* dst, int count);

/* begin AMPP */
int char2AAmem[256];
int AA2charmem[AA];
void initChar2AATranslation(void);
/* end AMPP */

/* Define global variables */
int rank, total_processes;
int DELTA;
short *a, *b;
int *chunk_hptr;
int **chunk_h, ***chunk_ih;
int *sim_ptr, **sim; // PAM similarity matrix
int N, sizeA, B, I, chunk_size;
short *chunk_a;
int* hptr;
int** h;
FILE *pam;

int main(int argc, char *argv[]) {
    /* begin AMPP */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &total_processes);

    CHECK_NULL((sim_ptr = (int *) malloc(AA * AA * sizeof(int))));
    CHECK_NULL((sim = (int **) malloc(AA * sizeof(int*))));
    for (int i = 0; i < AA; i++)
        sim[i] = sim_ptr + i * AA;

    if (rank == 0) {
        FILE *in1, *in2;
        /**** Error handling for input file ****/
        if (!(argc >= 5 && argc <= 8)) {
            fprintf(stderr, "%s protein1 protein2 PAM gapPenalty [N] [B] [I]\n", argv[0]);
            exit(1);
        } else {
            in1 = fopen(argv[1], "r");
            in2 = fopen(argv[2], "r");
            N = (argc > 5 ? atoi(argv[5]) : MAX_SEQ) + 1;
            B = argc > 6 ? atoi(argv[6]) : total_processes;
            I = argc > 7 ? atoi(argv[7]) : 1;
            DELTA = atoi(argv[4]);
        }
        /* end AMPP */

        /* begin AMPP */
        // Pad the length of "a" up to a multiple of total_processes * I
        sizeA = N % (total_processes * I) != 0
            ? N + (total_processes * I) - (N % (total_processes * I))
            : N;
        CHECK_NULL((a = (short *) calloc(sizeof(short), sizeA)));
        CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
        initChar2AATranslation();
        read_files(in1, in2);
        chunk_size = sizeA / (total_processes * I);
        CHECK_NULL((hptr = (int *) calloc(N * sizeA, sizeof(int))));
        CHECK_NULL((h = (int **) calloc(sizeA, sizeof(int*))));
        for (int i = 0; i < sizeA; i++)
            h[i] = hptr + i * N;
        pam = fopen(argv[3], "r");
        read_pam(pam);
    }

    MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Each interleave owns chunk_size rows plus one boundary row (row 0)
    CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * (N) * (chunk_size + 1) * I)));
    CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*) * (chunk_size + 1) * I)));
    CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int**) * I)));
    for (int i = 0; i < (chunk_size + 1) * I; i++)
        chunk_h[i] = chunk_hptr + i * N;
    for (int i = 0; i < I; i++)
        chunk_ih[i] = chunk_h + i * (chunk_size + 1);
    CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * (chunk_size))));
    if (rank != 0) {
        CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
    }
    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

    /*** PARALLEL PART ***/
    /** compute "h" local similarity array **/
    int total_blocks = N / B + (N % B == 0 ? 0 : 1);
    int last_block_size = N % B == 0 ? B : N % B;
    MPI_Status status;
    long long start, end;

    start = getTimeMilli();
    for (int current_interleave = 0; current_interleave < I; current_interleave++) {
        MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                    chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT,
                    0, MPI_COMM_WORLD);
        int current_column = 1;

        // Fill first column with 0
        for (int i = 0; i < chunk_size + 1; i++)
            chunk_ih[current_interleave][i][0] = 0;

        for (int current_block = 0; current_block < total_blocks; current_block++) {
            // Receive
            int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
            if (rank == 0 && current_interleave == 0) {
                // The very first boundary row of the matrix is all zeros
                for (int k = current_column; k < block_end; k++) {
                    chunk_ih[current_interleave][0][k] = 0;
                }
            } else {
                int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
                int size_to_receive = current_block == total_blocks - 1 ? last_block_size : B;
                MPI_Recv(chunk_ih[current_interleave][0] + current_block * B,
                         size_to_receive, MPI_INT, receive_from, 0,
                         MPI_COMM_WORLD, &status);
                if (DEBUG) printf("[%d] Received from %d: ", rank, receive_from);
                if (DEBUG) print_vector(chunk_ih[current_interleave][0] + current_block * B, size_to_receive);
            }

            // Process
            for (int j = current_column; j < block_end; j++, current_column++) {
                for (int i = 1; i < chunk_size + 1; i++) {
                    int diag = chunk_ih[current_interleave][i - 1][j - 1] + sim[chunk_a[i - 1]][b[j - 1]];
                    int down = chunk_ih[current_interleave][i - 1][j] + DELTA;
                    int right = chunk_ih[current_interleave][i][j - 1] + DELTA;
                    int max = MAX3(diag, down, right);
                    chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
                }
            }

            // Send
            if (current_interleave != I - 1 || rank + 1 != total_processes) {
                int send_to = rank + 1 == total_processes ? 0 : rank + 1;
                int size_to_send = current_block == total_blocks - 1 ? last_block_size : B;
                MPI_Send(chunk_ih[current_interleave][chunk_size] + current_block * B,
                         size_to_send, MPI_INT, send_to, 0, MPI_COMM_WORLD);
                if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
                if (DEBUG) print_vector(chunk_ih[current_interleave][chunk_size] + current_block * B, size_to_send);
            }
        }
    }
    end = getTimeMilli();

    for (int i = 0; i < I; i++) {
        // Send rows 1..chunk_size of interleave i, skipping its boundary row 0
        MPI_Gather(chunk_ih[i][1], N * chunk_size, MPI_INT,
                   hptr + i * chunk_size * total_processes * N,
                   N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        fprintf(stderr, "Execution: %f s\n", (double) (end - start) / 1000000);
    }

    if (DEBUG) {
        if (rank == 0) {
            for (int i = 0; i < N - 1; i++) {
                print_vector(h[i], N);
            }
        }
    }

    // Free everything!
    free(sim_ptr);
    free(sim);
    free(b);
    free(chunk_ih);
    free(chunk_h);
    free(chunk_hptr);
    free(chunk_a);
    if (rank == 0) {
        free(a);
        free(hptr);
        free(h);
    }
    MPI_Finalize();
    return 0;
}

void memcopy(int* src, int* dst, int count) {
    for (int i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}

void print_vector(int* vector, int size) {
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void print_short_vector(short* vector, int size) {
    for (int i = 0; i < size; i++) {
        printf("%2d ", vector[i]);
    }
    printf("\n");
}

void read_pam(FILE* pam) {
    int i, j;
    int temp;
    /** read PAM250 similarity matrix **/
    /* begin AMPP */
    fscanf(pam, "%*s");
    /* end AMPP */
    for (i = 0; i < AA; i++)
        for (j = 0; j <= i; j++) {
            if (fscanf(pam, "%d ", &temp) == EOF) {
                fprintf(stderr, "PAM file empty\n");
                fclose(pam);
                exit(1);
            }
            sim[i][j] = temp;
        }
    fclose(pam);
    for (i = 0; i < AA; i++)
        for (j = i + 1; j < AA; j++)
            sim[i][j] = sim[j][i]; // symmetrify
}

void read_files(FILE* in1, FILE* in2) {
    int i = 0;
    int nc;
    char ch;
    /** read first file in array "a" **/
    do {
        nc = fscanf(in1, "%c", &ch);
        // Cast to unsigned char so the table index is never negative
        if (nc > 0 && char2AAmem[(unsigned char) ch] >= 0) {
            a[i++] = char2AAmem[(unsigned char) ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in1);
    /** read second file in array "b" **/
    i = 0;
    do {
        nc = fscanf(in2, "%c", &ch);
        if (nc > 0 && char2AAmem[(unsigned char) ch] >= 0) {
            b[i++] = char2AAmem[(unsigned char) ch];
        }
    } while (nc > 0 && (i < N));
    fclose(in2);
}

/* Begin AMPP */
void initChar2AATranslation(void) {
    int i;
    for (i = 0; i < 256; i++)
        char2AAmem[i] = -1;
    char2AAmem['c'] = char2AAmem['C'] = 0;  AA2charmem[0] = 'c';
    char2AAmem['g'] = char2AAmem['G'] = 1;  AA2charmem[1] = 'g';
    char2AAmem['p'] = char2AAmem['P'] = 2;  AA2charmem[2] = 'p';
    char2AAmem['s'] = char2AAmem['S'] = 3;  AA2charmem[3] = 's';
    char2AAmem['a'] = char2AAmem['A'] = 4;  AA2charmem[4] = 'a';
    char2AAmem['t'] = char2AAmem['T'] = 5;  AA2charmem[5] = 't';
    char2AAmem['d'] = char2AAmem['D'] = 6;  AA2charmem[6] = 'd';
    char2AAmem['e'] = char2AAmem['E'] = 7;  AA2charmem[7] = 'e';
    char2AAmem['n'] = char2AAmem['N'] = 8;  AA2charmem[8] = 'n';
    char2AAmem['q'] = char2AAmem['Q'] = 9;  AA2charmem[9] = 'q';
    char2AAmem['h'] = char2AAmem['H'] = 10; AA2charmem[10] = 'h';
    char2AAmem['k'] = char2AAmem['K'] = 11; AA2charmem[11] = 'k';
    char2AAmem['r'] = char2AAmem['R'] = 12; AA2charmem[12] = 'r';
    char2AAmem['v'] = char2AAmem['V'] = 13; AA2charmem[13] = 'v';
    char2AAmem['m'] = char2AAmem['M'] = 14; AA2charmem[14] = 'm';
    char2AAmem['i'] = char2AAmem['I'] = 15; AA2charmem[15] = 'i';
    char2AAmem['l'] = char2AAmem['L'] = 16; AA2charmem[16] = 'l';
    char2AAmem['f'] = char2AAmem['F'] = 17; AA2charmem[17] = 'f';
    char2AAmem['y'] = char2AAmem['Y'] = 18; AA2charmem[18] = 'y';
    char2AAmem['w'] = char2AAmem['W'] = 19; AA2charmem[19] = 'w';
}

// Returns the current time in microseconds (despite the name).
// A 64-bit type is used because seconds * 1000000 does not fit in an int.
long long getTimeMilli() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    long long ret = tv.tv_usec;
    ret += (tv.tv_sec * 1000000LL); // Add seconds
    return ret;
}
/* end AMPP */