Parallelization of Smith-Waterman Algorithm using MPI
Final report for Parallelization of Smith-Waterman Algorithm using MPI project.
Universitat Politecnica de Catalunya
AMPP Final Project Report

Parallelization of Smith-Waterman Algorithm

Authors: Iuliia Proskurnia, Arinto Murdopo, Muhammad Anis uddin Nasir
Supervisors: Josep R. Herrero, Dani Jimenez-Gonzalez

January 16, 2012
Contents

1 Introduction
2 Main Issues and Solutions
  2.1 Available Parallelization Techniques
  2.2 Blocking Technique
    2.2.1 Solution 1: Using Scatter and Gather
    2.2.2 Solution 1: Linear-array Model
    2.2.3 Solution 1: Optimum B for Linear-array Model
    2.2.4 Solution 1: 2-D Mesh Model
    2.2.5 Solution 1: Optimum B for 2-D Mesh Model
    2.2.6 Solution 2: Using Send and Receive
    2.2.7 Solution 2: Linear-array Model
    2.2.8 Solution 2: Optimum B for Linear-array Model
    2.2.9 Solution 2: 2-D Mesh Model
  2.3 Blocking-and-Interleave Technique
    2.3.1 Solution 1: Using Scatter and Gather
    2.3.2 Solution 1: Linear-Array Model
    2.3.3 Solution 1: Optimum B and I for Linear-array Model
    2.3.4 Solution 1: 2-D Mesh Model
    2.3.5 Solution 1: Optimum B and I for 2-D Mesh Model
    2.3.6 Solution 1: Improvement
    2.3.7 Solution 1: Optimum B and I for the Improved Solution
    2.3.8 Solution 2: Using Send and Receive
    2.3.9 Solution 2: Linear-array Model
    2.3.10 Solution 2: Optimum B and I for Linear-array Model
    2.3.11 Solution 2: 2-D Mesh Model
3 Performance Results
  3.1 Solution 1
    3.1.1 Performance of Sequential Code
    3.1.2 Find Out Optimum Number of Processor (P)
    3.1.3 Find Out Optimum Blocking Size (B)
    3.1.4 Find Out Optimum Interleave Factor (I)
  3.2 Solution 1-Improved
    3.2.1 Find Out Optimum Number of Processor (P)
    3.2.2 Find Out Optimum Blocking Size (B)
    3.2.3 Find Out Optimum Interleave Factor (I)
  3.3 Solution 2
    3.3.1 Find Out Optimum Number of Processor (P)
    3.3.2 Find Out Optimum Blocking Size (B)
    3.3.3 Find Out Optimum Interleave Factor (I)
  3.4 Putting All the Optimum Values Together
  3.5 Testing with different GAP penalties
4 Conclusions
A Source Code Compilation
B Execution on ALTIX
C Timing diagram for Blocking technique in Solution 2
D Timing diagram for Blocking-and-Interleave technique in Solution 2
List of Figures

1 Blocking Communication
2 Data Partitioning among processes
3 Blocking Communication
4 Blocking and interleave communication
5 Blocking and Interleave Communication
6 Sequential Code Performance Measurement Result
7 Measurement result when N is 5000, B is 100 and I is 1
8 Diagram of measurement result when N is 5000, B is 100, I is 1
9 Measurement result when N is 10000, B is 100 and I is 1
10 Diagram of measurement result when N is 10000, B is 100, I is 1
11 Performance measurement result when N is 10000, P is 8, I is 1
12 Diagram of measurement result when N is 10000, P is 8, I is 1
13 Diagram of measurement result when N is 10000, P is 8, B is 100
14 Measurement result when N is 10000, B is 100 and I is 1
15 Diagram of measurement result when N is 10000, B is 100, I is 1
16 Performance measurement result when N is 10000, P is 8, I is 1
17 Diagram of measurement result when N is 10000, P is 8, I is 1
18 Diagram of measurement result when N is 10000, P is 8, B is 200
19 Measurement result when N is 5000, B is 100 and I is 1
20 Diagram of measurement result when N is 5000, B is 100, I is 1
21 Measurement result when N is 10000, B is 100 and I is 1
22 Diagram of measurement result when N is 10000, B is 100, I is 1
23 Performance measurement result when N is 10000, P is 32, I is 1
24 Diagram of measurement result when N is 10000, P is 32, I is 1
25 Performance measurement result when N is 10000, P is 32, B is 50
26 Diagram of measurement result when N is 10000, P is 32, B is 50
27 Putting all of them together
28 Putting all of them together - the plot
29 Testing with different gap penalties
30 gap penalty vs Time
31 Performance Model Solution 2
32 Performance Model with Interleave
1 Introduction

The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment, i.e. for determining similar regions between two nucleotide or protein sequences. Proteins are made of amino-acid sequences, and proteins with similar structure have similar amino-acid sequences. In this project we implemented the Smith-Waterman algorithm in parallel using Message Passing Interface (MPI) code.

To compare two amino-acid sequences we first have to align them. To find the best alignment between two sequences, the algorithm populates a matrix H of size N × N (where N is the sequence length) using a scoring criterion. It requires a scoring matrix (the cost of matching two symbols) and a gap penalty for the mismatch of two symbols. After populating the matrix H, the optimum local alignment is obtained by tracing back through the matrix starting from its highest value.

In our implementation of the Smith-Waterman algorithm we populate the matrix H in parallel using multiple processes running on multicore machines. We use pipelined computation to achieve a certain degree of parallelism, and we compare different parallelization techniques to find the best one for this problem. We started by parallelizing the code with different blocking sizes B at the column level; we then also introduced different levels of interleaving I at the row level.

For performance analysis we created performance models of both implementations for two interconnection networks, a linear array and a 2-D mesh. We ran the code on the Altix machine with different values of the parameters ∆ (gap penalty), B (column blocking size) and I (row interleave factor) to find the optimum B and I empirically. We also calculated the optimum B and I analytically by finding the global minima of the performance-model equations.
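The matrix-fill recurrence described above can be sketched directly in C. In the sketch below, the sim() scoring function and the DELTA gap penalty are illustrative stand-ins for the PAM scoring matrix and gap penalty used in the project, and the sequences are toy data.

    #include <stdio.h>
    #include <stdlib.h>

    #define DELTA (-4)   /* illustrative gap penalty (the report varies this as the GAP parameter) */
    #define MAX3(a,b,c) ((a) > (b) ? ((a) > (c) ? (a) : (c)) : ((b) > (c) ? (b) : (c)))

    /* Illustrative similarity score: +5 for a match, -3 for a mismatch
     * (the project indexes a PAM scoring matrix here instead). */
    static int sim(short x, short y) { return x == y ? 5 : -3; }

    int main(void)
    {
        /* Two tiny "sequences"; the project reads real protein files instead. */
        const short a[] = { 1, 3, 2, 4, 2 };
        const short b[] = { 3, 2, 4, 1, 2 };
        const int N = 5;

        /* H is (N+1) x (N+1); row 0 and column 0 stay zero. */
        int **H = malloc((N + 1) * sizeof *H);
        for (int i = 0; i <= N; i++)
            H[i] = calloc(N + 1, sizeof **H);

        int best = 0;
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= N; j++) {
                int diag  = H[i - 1][j - 1] + sim(a[i - 1], b[j - 1]);
                int down  = H[i - 1][j] + DELTA;
                int right = H[i][j - 1] + DELTA;
                int max   = MAX3(diag, down, right);
                H[i][j] = max > 0 ? max : 0;      /* local alignment never drops below 0 */
                if (H[i][j] > best)
                    best = H[i][j];               /* the traceback would start here */
            }
        }
        printf("highest score in H: %d\n", best);

        for (int i = 0; i <= N; i++) free(H[i]);
        free(H);
        return 0;
    }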
2 Main Issues and Solutions

2.1 Available Parallelization Techniques

We can achieve pipelining with blocking at both the column and the row level. Blocking at the column level can be interpreted in different ways:

1. Each processor Pi processes B complete columns of the matrix before doing any communication.
2. Each processor Pi processes B complete columns; however, after processing B columns of a row of the matrix it communicates with the next processor.
3. Each processor Pi processes B complete columns; however, after processing B columns of a set of rows within those B columns it does a communication.
4. Each processor Pi processes N/P complete rows. After processing B columns of those N/P rows, it does a communication.

Among the techniques above, we chose the last one because it gives the best pipelined computation for this scheme.

2.2 Blocking Technique

2.2.1 Solution 1: Using Scatter and Gather

Based on the chosen technique, we developed the following solution. Note that the code already incorporates the interleave factor I, but here I is set to 1.

In the first step, the process with rank 0 (the master process) reads the two protein sequence files. The reading results are stored in short* a and short* b. The master also allocates enough memory to store the resulting matrix, as shown in the code snippet below.

    {
        // sizeA is the total number of rows that we need to process. We round N up
        // if N is not divisible by (total_processes * I). I is set to 1 here.
        if (N % (total_processes * I) != 0) {
            // handle the case where N is not divisible by (total_processes * I)
            sizeA = N + (total_processes * I) - (N % (total_processes * I));
        } else {
            sizeA = N;
        }

        // in1 = input file 1, in2 = input file 2, a = result of reading in1, b = result of reading in2
        read_files(in1, in2, a, b, N - 1);
        // number of rows that each process needs to work on
        chunk_size = sizeA / (total_processes * I);
        CHECK_NULL((h_all_ptr = (int *) calloc(N * (sizeA + 1), sizeof(int))));  // resulting data
        CHECK_NULL((h_all = (int **) calloc((sizeA + 1), sizeof(int *))));       // list of row pointers

        for (i = 0; i <= sizeA; i++)
            h_all[i] = h_all_ptr + i * N;   // put the pointers in an array

        // initialize the first row of the resulting matrix with 0
        for (i = 0; i < N; i++) {
            h_all[0][i] = 0;
        }
    }

Every process reads the PAM matrix, and the master process broadcasts the chunk_size, N, B and I values.

    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);  // broadcast the chunk size: the number of rows calculated by each slave
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);           // broadcast N
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);           // broadcast B
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);           // broadcast I

Then each process allocates enough memory to receive its chunk of chunk_size elements. Processes other than rank 0 also need to allocate memory to receive the whole of protein 2 (which has size N).

    CHECK_NULL((chunk_a = (short *) calloc(sizeof(short), chunk_size)));  // slave processes will obtain it from the master process
    if (rank != 0) {
        CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));  // slave processes will obtain it from the master process
    }

    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);  // broadcast protein 2 to every process

Now let's move to the parallel part. First we calculate how many blocks we will process: the total_blocks variable, and the last_block variable, which contains the size of the last block to process when N is not divisible by B (N % B != 0).

    int total_blocks = N / B + (N % B == 0 ? 0 : 1);
    int last_block = N % B == 0 ? B : N % B;
  • 8. Then we scatter 1st protein sequence(in here we store it in a), with size of each scattered part equals to chunk size. After each process receives each scattered part, the computation begins for process with rank 0. It will not wait to receive any data from other process and directly calculate the 1st block of data. Meanwhile other proces with rank r, will wait for data from process with rank r-1. The data sent between process here is the last row of calculated block (which is an array of short with size equals to B. After a process receive the required data, each process performs compu- tations for received data. In the end, each process with rank r will send the last row of calculated block with size B to neighboring process with rank r+1. In the end, we perform gather to combine the result. Note that cur- rent interleave variable is set to 0 and I is set to 1 here because we’re not using interleaving factor. Code snippet below show how to implement this functionality 1 for ( int c u r r e n t i n t e r l e a v e = 0 ; c u r r e n t i n t e r l e a v e < I ; c u r r e n t i n t e r l e a v e ++) { 2 MPI Scatter ( a + c u r r e n t i n t e r l e a v e ∗ c h u n k s i z e ∗ total processes , 3 c h u n k s i z e , MPI SHORT, chunk a , c h u n k s i z e , MPI SHORT, 0 , MPI COMM WORLD) ; // c h u n k a i s t h e receiving buffer 4 int current column = 1 ; 5 f o r ( i = 0 ; i < c h u n k s i z e + 1 ; i ++) h [ i ] [ 0 ] = 0 ; 6 7 for ( int c u r r e n t b l o c k = 0 ; c u r r e n t b l o c k < t o t a l b l o c k s ; c u r r e n t b l o c k ++) { 8 // R e c e i v e 9 i n t b l o c k e n d = MIN2( c u r r e n t c o l u m n − ( c u r r e n t b l o c k == 0 ? 1 : 0 ) + B, N) ;10 i f ( rank == 0 && c u r r e n t i n t e r l e a v e == 0 ) { // i f rank 0 i s p r o c e s s i n g t h e f i r s t b l o c k , i t doesn ’ t need t o r e c e i v e any t h i n g11 f o r ( i n t k = c u r r e n t c o l u m n ; k < b l o c k e n d ; k++) {12 h [ 0 ] [ k ] = 0 ; // i n i t row 013 }14 } else {15 i n t r e c e i v e f r o m = rank == 0 ? t o t a l p r o c e s s e s − 1 : rank − 1 ; // r e c e i v e from n e i g h b o r i n g process16 i n t s i z e t o r e c e i v e = c u r r e n t b l o c k == t o t a l b l o c k s − 1 ? l a s t b l o c k : B;17 MPI Recv ( h [ 0 ] + c u r r e n t b l o c k ∗ B, s i z e t o r e c e i v e , MPI INT , r e c e i v e f r o m , 0 , MPI COMM WORLD, &s t a t u s ) ;18 }19 // P r o c e s s20 f o r ( j = c u r r e n t c o l u m n ; j < b l o c k e n d ; j ++, 4
  • 9. c u r r e n t c o l u m n++) {21 f o r ( i = 1 ; i < c h u n k s i z e + 1 ; i ++) {22 d i a g = h [ i − 1 ] [ j −1] + sim [ chunk a [ i − 1]][b[ j − 1]];23 down = h [ i − 1 ] [ j ] + DELTA;24 r i g h t = h [ i ] [ j −1] + DELTA;25 max = MAX3( diag , down , r i g h t ) ;26 i f (max <= 0 ) {27 h [ i ] [ j ] = 0;28 } else {29 h [ i ] [ j ] = max ;30 }31 }32 }3334 // Send35 i f ( c u r r e n t i n t e r l e a v e + 1 != I | | rank + 1 != total processes ) {36 i n t s e n d t o = rank + 1 == t o t a l p r o c e s s e s ? 0 : rank + 1 ;37 i n t s i z e t o s e n d = c u r r e n t b l o c k == t o t a l b l o c k s − 1 ? l a s t b l o c k : B;38 MPI Send ( h [ c h u n k s i z e ] + c u r r e n t b l o c k ∗ B, s i z e t o s e n d , MPI INT , s e n d t o , 0 , MPI COMM WORLD) ;39 p r i n t v e c t o r ( h [ c h u n k s i z e ] + c u r r e n t b l o c k ∗ B, size to send ) ;40 }4142 // G a t h e r i n g r e s u l t43 MPI Gather ( h p t r + N, N ∗ c h u n k s i z e , MPI INT ,44 h a l l p t r + N + current interleave ∗ chunk size ∗ t o t a l p r o c e s s e s ∗ N,45 N ∗ c h u n k s i z e , MPI INT , 0 , MPI COMM WORLD) ;46 } Once the result is gathered, process with rank 0 deallocates the memory and perform optional verification result. The verification result is obtained by comparing the resulting parallel version of h matrix (by using h all ) with serial version of h matrix (by using hverify) 1 i f ( rank == 0 ) { 2 i f ( v e r i f y R e s u l t == 1 ) { 3 Max = 0 ; 4 xMax = 0 ; 5 yMax = 0 ; 6 CHECK NULL( ( h v e r i f y p t r = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ (N+1) ∗ (N+1) ) ) ) ; 7 CHECK NULL( ( h v e r i f y = ( i n t ∗ ∗ ) m a l l o c ( s i z e o f ( i n t ∗ ) ∗ ( N+1) ) ) ) ; 8 /∗ Mount h v e r i f y [N ] [ N] ∗/ 9 f o r ( i =0; i<=N; i ++)10 h v e r i f y [ i ]= h v e r i f y p t r+i ∗ (N+1) ;11 f o r ( i =0; i<=N; i ++) h v e r i f y [ i ] [ 0 ] = 0 ; 5
  • 10. 12 f o r ( j =0; j<=N; j ++) h v e r i f y [ 0 ] [ j ] = 0 ;1314 f o r ( i =1; i<=N; i ++)15 f o r ( j =1; j<=N; j ++) {16 diag = h v e r i f y [ i − 1 ] [ j −1] + sim [ a [ i − 1 ] ] [ b [ j −1]];17 down = h v e r i f y [ i − 1 ] [ j ] + DELTA;18 right = h v e r i f y [ i ] [ j −1] + DELTA;19 max=MAX3( diag , down , r i g h t ) ;20 i f (max <= 0 ) {21 hve rify [ i ] [ j ]=0;22 }23 e l s e i f (max == d i a g ) {24 h v e r i f y [ i ] [ j ]= d i a g ;25 }26 e l s e i f (max == down ) {27 h v e r i f y [ i ] [ j ]=down ;28 }29 else {30 h v e r i f y [ i ] [ j ]= r i g h t ;31 }32 i f (max > Max) {33 Max=max ;34 xMax=i ;35 yMax=j ;36 }37 }3839 int v e r F a i l F l a g = 0 ;40 f o r ( i =0; i<=N−1; i ++){41 f o r ( j =0; j<=N−1; j ++){42 i f ( h a l l [ i ] [ j ] != h v e r i f y [ i ] [ j ] ) {43 p r i n t f ( ” V e r i f i c a t i o n f a i l ! n” ) ;44 p r i n t f ( ” h a l l [ i ] [ j ] = %d , h v e r i f y [ i ] [ j ] = %dn” , h a l l [ i ] [ j ] , hverify [ i ] [ j ]) ;45 v e r F a i l F l a g = −1;46 break ;47 }48 }4950 i f ( v e r F a i l F l a g != 0 ) {51 break ;52 }53 }5455 i f ( v e r F a i l F l a g ==0)56 {57 p r i n t f ( ” V e r i f i c a t i o n s u c c e s s ! n” ) ;58 }5960 }6162 free ( hverifyptr ) ; 6
        free(hverify);
        free(a);
        free(h_all_ptr);
        free(h_all);
    }

    free(b);
    free(chunk_a);
    free(h);
    free(hptr);

    MPI_Finalize();

Figure 1: Blocking Communication

To summarize this technique, Figure 1 shows how the matrix is divided into blocks. The number inside each block indicates the step in which it is computed. The red portion of block 1 indicates the data (B integers) that process 0 sends to process 1 at the end of the calculation of block 1, in step 1.

2.2.2 Solution 1: Linear-array Model

First, we use a linear-array topology to model our solution. The communication part of the chosen blocking technique is modeled as follows.

1. Broadcasting chunk_size, N, B, and I:
   tcomm-bcast-4-int = 4 × (ts + tw) × log2(p)

2. Broadcasting the 2nd protein sequence (vector b):
   tcomm-bcast-protein-seq = (ts + tw × N) × log2(p)

3. Scattering chunk_size rows to each process. The chunk size is
   chunk_size = N/p
   so the communication time for scattering is
   tcomm-scatter-protein-seq = ts × log2(p) + tw × (N/p) × (p - 1)

4. Sending shared data. To start the first block of computation, the process with rank 0 does not need to wait for data from any other process, so there are only N/B + p - 2 stages of sending shared data. The shared data is the last row of the block that has just been finished, which consists of B items. Putting this together, the communication time to send shared data is
   tcomm-send-shared-data = (N/B + p - 2) × (ts + B × tw)

5. Gathering calculated data. Finally, we perform a gather to combine all calculated data. Every process contributes N × chunk_size values, i.e. N × N/p values. The communication time for this step is
   tcomm-gather = ts × log2(p) + tw × (N/p) × N × (p - 1)

6. Putting all the communication times together:
   tcomm-all = tcomm-bcast-4-int + tcomm-bcast-protein-seq + tcomm-scatter-protein-seq + tcomm-send-shared-data + tcomm-gather
   tcomm-all(B) = (6 log2(p) + p - 1) × ts + ((4 + N) log2(p) + N + (N² + Np)(p - 1)/p) × tw + (N/B) × ts + (p - 2) × B × tw

Now we calculate the computation time for this blocking technique. There are N/B + p - 1 stages of block calculation, and in each stage we compute (N/p) × B points. If tc denotes the time to compute one point, the computation time is

   tcalc = (N/B + p - 1) × (N/p) × B × tc
   tcalc = (N²/p + NB - NB/p) × tc
   tcalc = (N²/p + (N(p - 1)/p) × B) × tc

The final model is obtained by adding computation time and communication time:

   ttotal = tcomm + tcalc
   ttotal(B) = (6 log2(p) + p - 1) × ts + ((4 + N) log2(p) + N + (N² + Np)(p - 1)/p) × tw + (N/B) × ts + (p - 2) × B × tw + (N²/p + (N(p - 1)/p) × B) × tc
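To get a feel for the assembled model, the short sketch below evaluates ttotal(B) for a range of block sizes. The machine parameters ts, tw and tc used in main() are illustrative placeholders, not measured values from the report.

    #include <math.h>
    #include <stdio.h>

    /* ttotal(B) for the linear-array model of Solution 1 (blocking only),
     * transcribed from the formula above. */
    static double t_total(double B, double N, double p,
                          double ts, double tw, double tc)
    {
        double comm = (6 * log2(p) + p - 1) * ts
                    + ((4 + N) * log2(p) + N + (N * N + N * p) * (p - 1) / p) * tw
                    + (N / B) * ts + (p - 2) * B * tw;
        double calc = (N * N / p + (N * (p - 1) / p) * B) * tc;
        return comm + calc;
    }

    int main(void)
    {
        const double N = 10000, p = 8;
        const double ts = 1e-5, tw = 1e-8, tc = 1e-8;   /* placeholder machine parameters */

        for (double B = 25; B <= 800; B *= 2)
            printf("B = %6.0f  ttotal = %.4f s\n", B, t_total(B, N, p, ts, tw, tc));
        return 0;
    }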
2.2.3 Solution 1: Optimum B for Linear-array Model

To find the optimum B for the linear-array model, we take the derivative of the final model with respect to B and set it to 0:

    d ttotal(B) / dB = 0

Using the model obtained in section 2.2.2, this gives the following equations:

    -(N/B²) × ts + (p - 2) × tw + (N(p - 1)/p) × tc = 0
    (p - 2) × tw + (N(p - 1)/p) × tc = (N/B²) × ts
    B² = N × ts / ((p - 2) × tw + (N(p - 1)/p) × tc)
    B² = p × N × ts / (p(p - 2) × tw + N(p - 1) × tc)
    B = √( p × N × ts / (p(p - 2) × tw + N(p - 1) × tc) )

Under the assumption that P is very small in comparison with N, this simplifies to

    B ≈ √( ts / tc )

2.2.4 Solution 1: 2-D Mesh Model

Using the same steps as in section 2.2.2, here is the 2-D mesh model of Solution 1.

1. Broadcasting chunk_size, N, B, and I:
   tcomm-bcast-4-int = 4 × 2 × (ts + tw) × log2(√p)

2. Broadcasting the 2nd protein sequence (vector b):
   tcomm-bcast-protein-seq = 2 × (ts + tw × N) × log2(√p)

3. Scattering chunk_size rows to each process, where
   chunk_size = N/p
   The communication time for scattering in the 2-D mesh model can be modeled using a hypercube and is similar to the scattering time in the linear-array model [1]:
   tcomm-scatter-protein-seq = ts × log2(p) + tw × (N/p) × (p - 1)

4. Sending shared data. Since sending shared data uses primitive send and receive, the communication time for this part does not change in the 2-D mesh model:
   tcomm-send-shared-data = (N/B + p - 2) × (ts + B × tw)

5. Gathering calculated data. The gathering time uses the same formula as scattering, but with a different amount of gathered data:
   tcomm-gather = ts × log2(p) + tw × (N/p) × N × (p - 1)

6. Putting all the communication times together:
   tcomm-all = tcomm-bcast-4-int + tcomm-bcast-protein-seq + tcomm-scatter-protein-seq + tcomm-send-shared-data + tcomm-gather
   tcomm-all(B) = (10 log2(√p) + log2(p) + p - 1) × ts + ((8 + 2N) log2(√p) + N(p - 1)/p + N + N²(p - 1)/p) × tw + (N/B) × ts + (p - 2) × B × tw

The computation time does not change between the 2-D mesh model and the linear-array model, so

   tcalc = (N²/p + (N(p - 1)/p) × B) × tc

Putting it all together:

   ttotal = tcomm + tcalc
   ttotal(B) = (N²/p) × tc + (10 log2(√p) + log2(p) + p - 1) × ts + ((8 + 2N) log2(√p) + N(p - 1)/p + N + N²(p - 1)/p) × tw + (N/B) × ts + (p - 2) × B × tw + ((N(p - 1)/p) × B) × tc

2.2.5 Solution 1: Optimum B for 2-D Mesh Model

We take the derivative of the final 2-D mesh model with respect to B and set it to 0:

    d ttotal(B) / dB = 0

Using the model obtained in section 2.2.4, we get:

    -(N/B²) × ts + (p - 2) × tw + (N(p - 1)/p) × tc = 0
    (p - 2) × tw + (N(p - 1)/p) × tc = (N/B²) × ts
    B² = N × ts / ((p - 2) × tw + (N(p - 1)/p) × tc)
    B² = p × N × ts / (p(p - 2) × tw + N(p - 1) × tc)
    B = √( p × N × ts / (p(p - 2) × tw + N(p - 1) × tc) )
    B ≈ √( ts / tc )

As we can see, the optimum B does not change when we use the 2-D mesh to model the communication. In Solution 1, the 2-D mesh model only affects the broadcast time, and with respect to B the broadcast time is only a constant in ttotal(B), so it disappears when we take the derivative d ttotal(B)/dB.

2.2.6 Solution 2: Using Send and Receive

In the second solution we used the Send and Receive methods provided by the MPI library for communication among the processes. In this implementation every process reads the input files, and every process also reads the similarity matrix.

After reading the files, each process calculates the number of rows that it has to process and allocates the required memory. The process with rank 0 allocates the matrix H of size N × N. In our implementation the data distribution is fair among all the processes: if the number of rows is not evenly divisible among the processes, we give one extra row to each process in turn, starting from the master process. Figure 2 shows the distribution of data when the rows are not equally divisible among the processes.

Each process calculates the block size that it needs to communicate with its neighbour. Filling starts at the master process, and every other process waits to receive a block before it starts processing. The master communicates its first block to its neighbour after processing its required number of rows for that block. A small sketch of this row accounting follows; after it comes the code snippet that fills the matrix at every process.
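The row accounting just described (the first r = N mod p processes receive one extra row) can be made concrete with a small helper. The function names below are illustrative and not taken from the project code, but the formulas mirror the RowPosition computation used in the snippets that follow.

    #include <stdio.h>

    /* Rows assigned to a given rank when N rows are split over p processes:
     * the first r = N % p ranks receive one extra row, as described above. */
    static int rows_for_rank(int N, int p, int rank)
    {
        int r = N % p;
        return N / p + (rank < r ? 1 : 0);
    }

    /* Global index of the first row owned by a given rank (0-based). */
    static int first_row_of_rank(int N, int p, int rank)
    {
        int r = N % p;
        if (rank < r)
            return rank * (N / p + 1);
        return r * (N / p + 1) + (rank - r) * (N / p);
    }

    int main(void)
    {
        const int N = 10, p = 4;   /* illustrative sizes: 10 rows over 4 processes */
        for (int rank = 0; rank < p; rank++)
            printf("rank %d: %d rows starting at row %d\n",
                   rank, rows_for_rank(N, p, rank), first_row_of_rank(N, p, rank));
        return 0;
    }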
  • 16. Figure 2: Data Partitioning among processes 1 i f ( i d == 0 ) 2 { 3 f o r ( i =0; i <ColumnBlock ; i ++) 4 { 5 f o r ( j =1; j<=s ; j ++) 6 { 7 f o r ( k=i ∗B+1;k<=( i +1)∗B && k<=b [ 0 ] && k <= N; k++) 8 { 9 i n t RowPosition ;10 i f ( id < r )11 RowPosition = i d ∗ ( (N/p ) +1)+j ;12 else13 RowPosition = ( r ∗ ( (N/p ) +1) ) +(( id−r ) ∗ (N/p ) )+j ;1415 d i a g = h [ j − 1 ] [ k −1] + sim [ a [ RowPosition ] ] [ b [ k ]];16 down = h [ j − 1 ] [ k ] + DELTA;17 r i g h t = h [ j ] [ k −1] + DELTA;18 max = MAX3( diag , down , r i g h t ) ;19 i f (max <= 0 ) {20 h [ j ] [ k ] = 0;21 } else {22 h [ j ] [ k ] = max ;23 }24 chunk [ k−( i ∗B+1) ] = h [ j ] [ k ] ;25 }26 }27 MPI Send ( chunk , B, MPI SHORT, i d +1 ,0 ,MPI COMM WORLD) ;28 }29 } else30 { 12
  • 17. 31 f o r ( i =0; i <ColumnBlock ; i ++)32 {33 MPI Recv ( chunk , B, MPI SHORT, id −1 ,0 ,MPI COMM WORLD,& status ) ;34 f o r ( z =0; z<B ; z++)35 {36 i f ( ( i ∗B+z +1) <= N)37 h [ 0 ] [ i ∗B+z +1] = chunk [ z ] ;38 }39 f o r ( j =1; j<=s ; j ++)40 {41 i n t RowPosition ;42 i f ( id < r )43 RowPosition = i d ∗ ( (N/p ) +1)+j ;44 else45 RowPosition = ( r ∗ ( (N/p ) +1) ) +(( id−r ) ∗ (N/p ) )+j ;4647 f o r ( k=i ∗B+1;k<=( i +1)∗B && k<=b [ 0 ] && k <= N; k++)48 {49 d i a g = h [ j − 1 ] [ k −1] + sim [ a [ RowPosition ] ] [ b [ k ]];50 down = h [ j − 1 ] [ k ] + DELTA;51 r i g h t = h [ j ] [ k −1] + DELTA;52 max = MAX3( diag , down , r i g h t ) ;53 i f (max <= 0 )54 h [ j ] [ k ] = 0;55 else56 h [ j ] [ k ] = max ;5758 chunk [ k−( i ∗B+1) ] = h [ j ] [ k ] ;5960 }61 }62 i f ( i d != p−1)63 MPI Send ( chunk , B, MPI SHORT, i d +1 ,0 ,MPI COMM WORLD );64 }65 } At the end every process sends its portion of the matrix H to the master process using the Send method available in the MPI library. Below men- tioned is the code snippet of gathering process. 1 i f ( i d ==0) 2 { 3 i n t row , c o l ; 4 f o r ( i =1; i <p ; i ++) 5 { 6 MPI Recv(&row , 1 , MPI INT , i , 0 ,MPI COMM WORLD,& s t a t u s ) ; 7 CHECK NULL( ( r e c v h p t r = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ ( row ) ∗ (N) ) ) ) ; 8 9 MPI Recv ( r e c v h p t r , row∗N, MPI INT , i , 0 ,MPI COMM WORLD 13
  • 18. ,& s t a t u s ) ;1011 f o r ( j =0; j <row ; j ++)12 {13 i n t RowPosition ;14 if ( i < r)15 RowPosition = ( i ∗ ( (N/p ) +1) )+j +1;16 else17 RowPosition = ( r ∗ ( (N/p ) +1) ) +(( i −r ) ∗ (N/p ) )+j +1;1819 f o r ( k=0;k<N; k++)20 h [ RowPosition ] [ k+1]= r e c v h p t r [ j ∗N+k ] ;21 }22 free ( recv hptr ) ;23 }24 }25 else26 {27 MPI Send(&s , 1 , MPI INT , 0 , 0 ,MPI COMM WORLD) ;28 CHECK NULL( ( r e c v h p t r = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ ( s ) ∗ ( N) ) ) ) ;2930 f o r ( j =0; j <s ; j ++)31 {32 f o r ( k=0;k<N; k++)33 {34 r e c v h p t r [ j ∗N+k ] = h [ j + 1 ] [ k + 1 ] ;35 }36 }37 MPI Send ( r e c v h p t r , s ∗N, MPI INT , 0 , 0 ,MPI COMM WORLD) ;38 free ( recv hptr ) ;39 } Once the result is gathered, process with rank 0 deallocates the memory. and perform optional verification result. Figure 3: Blocking Communication As reflected in Figure 3, the dividing of block in Solution 2 is same with solution 1. But, instead of using scatter and gather to distribute data, 14
solution 2 uses primitive sends and receives.

2.2.7 Solution 2: Linear-array Model

We first calculated the performance model for the linear interconnection network. The timing diagram can be found in Appendix C.

1. In Solution 2 every process computes (N/p) × B values before communicating a chunk to the next process. In total there are N/B + p - 1 computation steps, so the computation time is:
   tcomp1 = (N/B + p - 1) × (N/p) × B × tc

2. After each computation step, each process communicates a block to its neighbour process. There are N/B + p - 2 communication steps among all the processes:
   tcomm1 = (N/B + p - 2) × (ts + B × tw)

3. After completing its part of matrix H, every process sends it to the master process:
   tcomm2 = ts + (N/p) × N × tw

4. In the end the master process puts all the partial results into matrix H to finalize it:
   tcomp2 = ts + (N/p) × N × tw

The total time is obtained by combining all the computation and communication times:

   ttotal = tcomp1 + tcomm1 + tcomp2 + tcomm2
   ttotal = (N/B + p - 1) × (N/p) × B × tc + (N/B + p - 2) × (ts + B × tw) + (ts + (N/p) × N × tw) + (ts + (N/p) × N × tw)

2.2.8 Solution 2: Optimum B for Linear-array Model

To find the optimum B for the linear-array model, we take the derivative of the final model with respect to B and set it to 0:

    d ttotal(B) / dB = 0

Using the model obtained in section 2.2.7, we get:

    -(N/B²) × ts + (p - 2) × tw + (N(p - 1)/p) × tc = 0
    (p - 2) × tw + (N(p - 1)/p) × tc = (N/B²) × ts
    B² = N × ts / ((p - 2) × tw + (N(p - 1)/p) × tc)
    B² = p × N × ts / (p(p - 2) × tw + N(p - 1) × tc)
    B = √( p × N × ts / (p(p - 2) × tw + N(p - 1) × tc) )

2.2.9 Solution 2: 2-D Mesh Model

We also calculated the performance model for the 2-D mesh interconnection network and found that there is no difference between the linear-array model and the 2-D mesh model: the difference between them lies mainly in the broadcast time, and this solution does not involve any broadcast from the root to the other processes.

2.3 Blocking-and-Interleave Technique

2.3.1 Solution 1: Using Scatter and Gather

Taking into account not only the blocking size B but also the interleave size I, we developed the solution below. The first step is to allocate memory for all necessary variables in each process. The master process also allocates memory for the final matrix where all partial results will be stored, and every slave process allocates memory for the partial result matrices that will eventually be sent to the master process.

    main(int argc, char *argv[]) {

        {...}

        int B, I;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &total_processes);

        if (rank == 0) {
            chunk_size = sizeA / (total_processes * I);

            CHECK_NULL((h_all_ptr = (int *) calloc(N * (sizeA + 1), sizeof(int))));  // resulting data
            CHECK_NULL((h_all = (int **) calloc((sizeA + 1), sizeof(int *))));       // list of row pointers
  • 21. 16 f o r ( i = 0 ; i < s i z e A ; i ++)17 h a l l [ i ] = h a l l p t r + i ∗ N;1819 // i n i t i a l i z e t h e f i r s t row o f r e s u l t i n g m a t r i x w i t h 020 f o r ( i = 0 ; i < N; i ++)21 {22 h all [ 0 ] [ i ] = 0;23 }2425 }2627 MPI Bcast(& c h u n k size , 1 , MPI INT , 0 , MPI COMM WORLD) ;28 MPI Bcast(&N, 1 , MPI INT , 0 , MPI COMM WORLD) ;29 MPI Bcast(&B, 1 , MPI INT , 0 , MPI COMM WORLD) ;30 MPI Bcast(&I , 1 , MPI INT , 0 , MPI COMM WORLD) ;3132 CHECK NULL( ( h p t r = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ (N) ∗ ( chunk size + 1) ) ) ) ;33 CHECK NULL( ( h = ( i n t ∗ ∗ ) m a l l o c ( s i z e o f ( i n t ∗ ) ∗ ( c h u n k s i z e + 1) ) ) ) ;34 f o r ( i = 0 ; i < c h u n k s i z e + 1 ; i ++)35 h [ i ] = h p t r + i ∗ N;3637 CHECK NULL( ( chunk a = ( short ∗ ) c a l l o c ( s i z e o f ( short ) , chunk size ) ) ) ;38 i f ( rank != 0 ) {39 CHECK NULL( ( b = ( short ∗ ) m a l l o c ( s i z e o f ( short ) ∗ (N) ) ) ) ;40 }41 MPI Bcast ( b , N, MPI SHORT, 0 , MPI COMM WORLD) ; The master process scattering vector A to each process partially. Each interleave step there will be send part of the vector A. Sequence of code for the interleave 0 will be the same as in previous section but only with one exception that the last process will send its results to the first process. Each process receives size B data from previous one before processing next B columns. Each process sends data after processing B columns to the next processes but the last process sends the data to the first(master) one if it’s not the last stage. Finally after calculating all partial matrices each process sends its result to the master process (It happens interleave times). 1 2 for ( int c u r r e n t i n t e r l e a v e = 0 ; c u r r e n t i n t e r l e a v e < I ; c u r r e n t i n t e r l e a v e ++) { 3 MPI Scatter ( a + c u r r e n t i n t e r l e a v e ∗ c h u n k s i z e ∗ total processes , 4 c h u n k s i z e , MPI SHORT, chunk a , c h u n k s i z e , MPI SHORT, 0 , MPI COMM WORLD) ; 5 int current column = 1 ; 6 f o r ( i = 0 ; i < c h u n k s i z e + 1 ; i ++) h [ i ] [ 0 ] = 0 ; 7 for ( int c u r r e n t b l o c k = 0 ; c u r r e n t b l o c k < t o t a l b l o c k s ; c u r r e n t b l o c k ++) { 17
  • 22. 8 // R e c e i v e 9 i n t b l o c k e n d = MIN2( c u r r e n t c o l u m n − ( c u r r e n t b l o c k == 0 ? 1 : 0 ) + B, N) ;10 i f ( rank == 0 && c u r r e n t i n t e r l e a v e == 0 ) {11 f o r ( i n t k = c u r r e n t c o l u m n ; k < b l o c k e n d ; k++) {12 h [ 0 ] [ k ] = 0;13 }14 } else {15 i n t r e c e i v e f r o m = rank == 0 ? t o t a l p r o c e s s e s − 1 : rank − 1 ;16 i n t s i z e t o r e c e i v e = c u r r e n t b l o c k == t o t a l b l o c k s − 1 ? l a s t b l o c k : B;17 MPI Recv ( h [ 0 ] + c u r r e n t b l o c k ∗ B, s i z e t o r e c e i v e ,18 MPI INT , r e c e i v e f r o m , 0 , MPI COMM WORLD, & status ) ;19 }20 // P r o c e s s21 f o r ( j = c u r r e n t c o l u m n ; j < b l o c k e n d ; j ++, c u r r e n t c o l u m n++) {22 f o r ( i = 1 ; i < c h u n k s i z e + 1 ; i ++) {23 d i a g = h [ i − 1 ] [ j −1] + sim [ chunk a [ i − 1 ] ] [ b [ j − 1]];24 down = h [ i − 1 ] [ j ] + DELTA;25 r i g h t = h [ i ] [ j −1] + DELTA;26 max = MAX3( diag , down , r i g h t ) ;27 i f (max <= 0 ) {28 h[ i ] [ j ] = 0;29 } else {30 h [ i ] [ j ] = max ;31 }32 }33 }34 // Send35 i f ( c u r r e n t i n t e r l e a v e + 1 != I | | rank + 1 != total processes ) {36 i n t s e n d t o = rank + 1 == t o t a l p r o c e s s e s ? 0 : rank + 1;37 i n t s i z e t o s e n d = c u r r e n t b l o c k == t o t a l b l o c k s − 1 ? l a s t b l o c k : B;38 MPI Send ( h [ c h u n k s i z e ] + c u r r e n t b l o c k ∗ B, size to send ,39 MPI INT , s e n d t o , 0 , MPI COMM WORLD) ;40 }41 }42 MPI Gather ( h p t r + N, N ∗ c h u n k s i z e , MPI INT ,43 h a l l p t r + N + current interleave ∗ chunk size ∗ t o t a l p r o c e s s e s ∗ N,44 N ∗ c h u n k s i z e , MPI INT , 0 , MPI COMM WORLD) ;45 }46 MPI Finalize () ;47 }48 { . . . } To summarize the interleave realization illustrated on Figure 4. 18
Figure 4: Blocking and interleave communication

2.3.2 Solution 1: Linear-Array Model

Here is the linear-array model of the communication part of the blocking technique with interleaving.

1. Broadcasting chunk_size, N, B, and I:
   tcomm-bcast-4-int = 4 × (ts + tw) × log2(p)

2. Broadcasting the 2nd protein sequence (vector b):
   tcomm-bcast-protein-seq = (ts + tw × N) × log2(p)

3. Scattering chunk_size rows to each process. The chunk size is now
   chunk_size = N/(p × I)
   where I is the interleave factor, and the scattering is performed I times. Therefore the communication cost of scattering is
   tcomm-scatter-protein-seq = I × (ts × log2(p) + tw × (N/(p × I)) × (p - 1))

4. Sending shared data. To start the first block of computation, the process with rank 0 does not need to wait for data from any other process. Note, however, that in every interleave step except the last one, the last process (process p - 1) also has to send its boundary data back to process 0. Therefore, for I - 1 interleave steps we need N/B + p - 1 pipeline stages of sending data, and for the last interleave step (the I-th step) we have N/B + p - 2 stages. The shared data is the last row of the block that has just been finished, which consists of B items. Putting this together, the communication time to send shared data is
   tcomm-send-shared-data = (I - 1) × (N/B + p - 1) × (ts + B × tw) + (N/B + p - 2) × (ts + B × tw)

5. Gathering calculated data. We need to perform a gather to combine all calculated data in every interleave step. Every process contributes N × chunk_size values, i.e. N × N/(P × I) values, and the gather is repeated I times. Therefore the communication time for this step is
   tcomm-gather = I × (ts × log2(p) + tw × (N/(p × I)) × N × (p - 1))

6. Putting all the communication times together:
   tcomm-all = tcomm-bcast-4-int + tcomm-bcast-protein-seq + tcomm-scatter-protein-seq + tcomm-send-shared-data + tcomm-gather

Simplifying the equation with respect to B (separating the constant part from the part that contains B, so that the derivative needed for the optimum B is easy to take), we obtain:

   tcomm-all(B) = ((5 + 2I) log2(p) + (p - 1)(I - 1) + (p - 2)) × ts + ((4 + N) log2(p) + N(p - 1)/p + N²(p - 1)/p + I - 1 + N) × tw + (I × N/B) × ts + ((I - 1)(p - 1) + p - 2) × B × tw

Simplifying the equation with respect to I, we obtain:

   tcomm-all(I) = ((5 + 2I) log2(p) - 1) × ts + ((4 + N) log2(p) + N(p - 1)/p + N²(p - 1)/p + B) × tw + (N/B + p - 1) × (ts + B × tw) × I

Now we calculate the computation time for this technique. There are I × (N/B + p - 1) stages of block calculation, and in each stage we compute (N/(p × I)) × B points. If tc denotes the time to compute one point, the computation time is

   tcalc = I × (N/B + p - 1) × (N/(p × I)) × B × tc

The factor I cancels, and we obtain

   tcalc = (N²/p + NB - NB/p) × tc
   tcalc = (N²/p + (N(p - 1)/p) × B) × tc

The final model is obtained by adding computation time and communication time. With respect to B:

   ttotal = tcomm + tcalc
   ttotal(B) = ((5 + 2I) log2(p) + (p - 1)(I - 1) + (p - 2)) × ts + ((4 + N) log2(p) + N²(p - 1)/p + N(p - 1)/p + I - 1 + N) × tw + (I × N/B) × ts + ((I - 1)(p - 1) + p - 2) × B × tw + (N²/p + (N(p - 1)/p) × B) × tc

With respect to I:

   ttotal(I) = ((5 + 2I) log2(p) - 1) × ts + ((4 + N) log2(p) + N(p - 1)/p + N²(p - 1)/p + B) × tw + (N/B + p - 1) × (ts + B × tw) × I + (N²/p + (N(p - 1)/p) × B) × tc

2.3.3 Solution 1: Optimum B and I for Linear-array Model

The optimum B is obtained by taking the derivative d ttotal(B)/dB and setting it to 0:

    d ttotal(B) / dB = 0

Using the model from the previous section, we get:

    -(I × N/B²) × ts + ((I - 1)(p - 1) + (p - 2)) × tw + (N(p - 1)/p) × tc = 0
    ((I - 1)(p - 1) + (p - 2)) × tw + (N(p - 1)/p) × tc = (I × N/B²) × ts
    B² = I × N × ts / (((I - 1)(p - 1) + (p - 2)) × tw + (N(p - 1)/p) × tc)
    B² = p × I × N × ts / (((I - 1)(p - 1) + (p - 2)) × p × tw + N(p - 1) × tc)
    B = √( p × I × N × ts / (((I - 1)(p - 1) + (p - 2)) × p × tw + N(p - 1) × tc) )
    B ≈ √( I × N × ts / (N × tc + I) )

However, we cannot find an optimum I for the blocking-and-interleave technique, because the derivative d ttotal(I)/dI is a constant:

    d ttotal(I) / dI = 0
    (N/B + p - 1) × (ts + B × tw) = 0

Looking at the equation for d ttotal(I)/dI, the interleave factor only introduces more communication time for sending and receiving shared data. Therefore no optimum interleave level can be derived from this model.

2.3.4 Solution 1: 2-D Mesh Model

Using a similar technique to the linear-array model, here is the communication and computation model for the 2-D mesh.

1. Broadcasting chunk_size, N, B, and I:
   tcomm-bcast-4-int = 4 × 2 × (ts + tw) × log2(√p)

2. Broadcasting the 2nd protein sequence (vector b):
   tcomm-bcast-protein-seq = 2 × (ts + tw × N) × log2(√p)

3. Scattering chunk_size rows to each process. As discussed in section 2.2.4, the scattering model is the same for the 2-D mesh and the linear array:
   tcomm-scatter-protein-seq = I × (ts × log2(p) + tw × (N/(p × I)) × (p - 1))

4. Sending shared data. The communication time for sending shared data is also the same for the 2-D mesh and the linear array:
   tcomm-send-shared-data = (I - 1) × (N/B + p - 1) × (ts + B × tw) + (N/B + p - 2) × (ts + B × tw)

5. Gathering calculated data. The gathering formula equals the scattering formula except for the amount of data being gathered:
   tcomm-gather = I × (ts × log2(p) + tw × (N/(p × I)) × N × (p - 1))

6. Putting all the communication times together:
   tcomm-all = tcomm-bcast-4-int + tcomm-bcast-protein-seq + tcomm-scatter-protein-seq + tcomm-send-shared-data + tcomm-gather

Simplifying with respect to B:

   tcomm-all(B) = (10 log2(√p) + 2I log2(p) + (p - 1)(I - 1) + (p - 2)) × ts + ((8 + 2N) log2(√p) + N(p - 1)/p + N²(p - 1)/p + I - 1 + N) × tw + (I × N/B) × ts + ((I - 1)(p - 1) + p - 2) × B × tw

Simplifying with respect to I:

   tcomm-all(I) = (10 log2(√p) + 2I log2(p) - 1) × ts + ((8 + 2N) log2(√p) + N(p - 1)/p + N²(p - 1)/p + B) × tw + (N/B + p - 1) × (ts + B × tw) × I

2.3.5 Solution 1: Optimum B and I for 2-D Mesh Model

The optimum B is obtained by taking the derivative d ttotal(B)/dB and setting it to 0:

    d ttotal(B) / dB = 0

Using the model from the previous section, we get:

    -(I × N/B²) × ts + ((I - 1)(p - 1) + (p - 2)) × tw + (N(p - 1)/p) × tc = 0
    ((I - 1)(p - 1) + (p - 2)) × tw + (N(p - 1)/p) × tc = (I × N/B²) × ts
    B² = I × N × ts / (((I - 1)(p - 1) + (p - 2)) × tw + (N(p - 1)/p) × tc)
    B² = p × I × N × ts / (((I - 1)(p - 1) + (p - 2)) × p × tw + N(p - 1) × tc)
    B = √( p × I × N × ts / (((I - 1)(p - 1) + (p - 2)) × p × tw + N(p - 1) × tc) )

We observe that the resulting optimum B for the 2-D mesh model equals the one for the linear-array model. As discussed in section 2.2.5, the 2-D mesh model differs only in the broadcast time, which acts as a constant in the ttotal(B) equation and disappears when the derivative is taken. Similarly to the linear-array model, we cannot find an optimum I for the blocking-and-interleave technique, because the derivative d ttotal(I)/dI is a constant:

    d ttotal(I) / dI = 0
    (N/B + p - 1) × (ts + B × tw) = 0

2.3.6 Solution 1: Improvement

Figure 5: Blocking and Interleave Communication

The main idea of this improvement is to move the gathering of the final data to the end of the whole calculation in each process. Referring to Figure 5, this means gathering is performed after step 14. To implement this improvement, we performed the following steps:

1. Allocate enough memory in each process to hold I × N × chunk_size values. Note that chunk_size in this case is N/(P × I).
  • 29. 1 CHECK NULL( ( h p t r = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ (N) ∗ I ∗ ( chunk size 2 + 1 ) ) ) ) ; // I n s t a n t i a t e temporary r e s u l t i n g m a t r i x f o r each process 3 CHECK NULL( ( h = ( i n t ∗ ∗ ) m a l l o c ( s i z e o f ( i n t ∗ ) ∗ I ∗ ( chunk size + 4 1 ) ) ) ) ; // l i s t o f p o i n t e r 5 6 i n t ∗∗∗ h f i n ; 7 CHECK NULL( h f i n = ( i n t ∗ ∗ ∗ ) m a l l o c ( s i z e o f ( i n t ∗ ∗ ∗ ) ∗ I ) ) ; 8 9 f o r ( i = 0 ; i < ( c h u n k s i z e + 1 ) ∗ I ; i ++) {10 h [ i ] = h p t r + i ∗ N; // p u t t h e p r o i n t e r i n t t h e a r r a y11 }1213 f o r ( i = 0 ; i < I ; i ++) {14 h f i n [ i ] = h + i ∗ ( chunk size + 1) ;15 }2. Change the way each process manipulates the data. Each process stores the data using hfin. hfin is a variable with type ***int, therefore we need to store the data as shown in the following code snippet 1 for ( int c u r r e n t i n t e r l e a v e = 0 ; c u r r e n t i n t e r l e a v e < I ; c u r r e n t i n t e r l e a v e ++) { 2 3 MPI Scatter ( a + c u r r e n t i n t e r l e a v e ∗ c h u n k s i z e ∗ total processes , 4 c h u n k s i z e , MPI SHORT, chunk a , c h u n k s i z e , MPI SHORT, 0 , MPI COMM WORLD) ; // c h u n k a i s the receiving buffer 5 6 int current column = 1 ; 7 f o r ( i = 0 ; i < c h u n k s i z e + 1 ; i ++) h f i n [ current interleave ] [ i ] [ 0 ] = 0; 8 9 for ( int c u r r e n t b l o c k = 0 ; c u r r e n t b l o c k < t o t a l b l o c k s ; c u r r e n t b l o c k ++) {10 // R e c e i v e11 i n t b l o c k e n d = MIN2( c u r r e n t c o l u m n − ( c u r r e n t b l o c k == 0 ? 1 : 0 ) + B, N) ;12 i f ( rank == 0 && c u r r e n t i n t e r l e a v e == 0 ) { // i f rank 0 i s p r o c e s s i n g t h e f i r s t b l o c k , i t doesn ’ t need t o r e c e i v e any t h i n g13 for ( int k = current column ; k < block end ; k++) {14 h f i n [ c u r r e n t i n t e r l e a v e ] [ 0 ] [ k ] = 0 ; // i n i t row 015 }16 } else {17 i n t r e c e i v e f r o m = rank == 0 ? t o t a l p r o c e s s e s − 1 : rank − 1 ; // r e c e i v e 25
  • 30. from n e i g h b o r i n g p r o c e s s18 i n t s i z e t o r e c e i v e = c u r r e n t b l o c k == t o t a l b l o c k s − 1 ? l a s t b l o c k : B;1920 MPI Recv ( h f i n [ c u r r e n t i n t e r l e a v e ] [ 0 ] + c u r r e n t b l o c k ∗ B, s i z e t o r e c e i v e , MPI INT , r e c e i v e f r o m , 0 , MPI COMM WORLD , &s t a t u s ) ;21 }22 f o r ( j = c u r r e n t c o l u m n ; j < b l o c k e n d ; j ++, c u r r e n t c o l u m n++) {23 f o r ( i = 1 ; i < c h u n k s i z e + 1 ; i ++) {24 d i a g = h f i n [ c u r r e n t i n t e r l e a v e ] [ i − 1 ] [ j −1] + sim [ chunk a [ i − 1 ] ] [ b [ j − 1 ] ] ;25 down = h f i n [ c u r r e n t i n t e r l e a v e ] [ i − 1 ] [ j ] + DELTA;26 r i g h t = h f i n [ c u r r e n t i n t e r l e a v e ] [ i ] [ j −1] + DELTA;27 max = MAX3( diag , down , r i g h t ) ;28 i f (max <= 0 ) {29 hfin [ current interleave ] [ i ] [ j ] = 0;30 } else {31 h f i n [ c u r r e n t i n t e r l e a v e ] [ i ] [ j ] = max ;32 }33 }34 }3536 // Send37 i f ( c u r r e n t i n t e r l e a v e + 1 != I | | rank + 1 != total processes ) {38 i n t s e n d t o = rank + 1 == t o t a l p r o c e s s e s ? 0 : rank + 1 ;39 i n t s i z e t o s e n d = c u r r e n t b l o c k == t o t a l b l o c k s − 1 ? l a s t b l o c k : B;40 MPI Send ( h f i n [ c u r r e n t i n t e r l e a v e ] [ c h u n k s i z e ] + c u r r e n t b l o c k ∗ B, s i z e t o s e n d , MPI INT , s e n d t o , 0 , MPI COMM WORLD) ;41 }42 }43 } Note that hfin[i] means it contains the data for the ith interleaving stage in each process.3. Move gathering process into the end of all calculation as shown in the following code snippet 1 f o r ( i = 0 ; i < I ; i ++) { 2 MPI Gather ( h p t r + N + i ∗ c h u n k s i z e ∗ N, N ∗ c h u n k s i z e , MPI INT , 3 h a l l p t r + N + i ∗ chunk size ∗ t o t a l p r o c e s s e s ∗ N, 26
  • 31. 4 N ∗ c h u n k s i z e , MPI INT , 0 , MPI COMM WORLD) ; 5 }2.3.7 Solution 1: Optimum B and I for the Improved SolutionHere is the part of the model that are affected by the improved solution. 1. Sending shared data For the first I - 1 interleaving stages the communication time is fol- lowed: N (I − 1) × (ts + tw × B) × B Then the last interleaving stage consist of following amount of com- munication time: (ts + tw × B) × ( N + P − 2) B Therefore putting all of them together, communication time to send shared data is (ts + tw × B) × ( N + P − 2) + (I − 1) × (ts + tw × B) × B N B 2. Computational time As well with sending and receive changes, time for computation are also improved. (N × B × B N P ×I × (I − 1) + B × N P ×I × ( N + P − 1)) × tc B Optimal B and I for Improved Solution To calculate the optimal value we ignore all the communication timewhich is not going to influent the value of optimal B and I. For optimal B,we only have the following formula the calculation. t total improved(B) = (ts + tw × B) × ( N + P − 2) + (I − 1) × (ts + tw × BB) × N + ( N × B × PN × (I − 1) + B × PN × ( N + P − 1)) × tc B B ×I ×I B dt total improved(B) =0 dB (I − 1) × ts × N N × ts N − 2 − 2 + (P − 2) × tw + (P − 1) × × tc = 0 B B P ×I I 2 × ts × N × P B= (P − 2) × tw × P × I + (P − 1) × N × tc I × N × ts B≈ (P − 2) × tw 27
  • 32. However, for optimal I value, we need to consider also scatter time as well. Therefore we obtain this following formula for t total improved(I) t total improved(I) = I × ts × log2 (p) + (ts + tw × B) × ( N + P − 2) + (I − B 1) × (ts + tw × B) × N + N × B × × PN × (I − 1) + B × PN × ( N + P − 1) × tc ) B B ×I ×I B dt total improved(I) =0 dI N N2 × B B×N N ts ×log2 (p)+(ts +tw ×B)× + 2 ×tc − 2 ×( +P −1)×tc = 0 B B×P ×I P ×I B B 2 × N × ( N + P − 1) × tc − N 2 × B × tc B I= B × P × ts × log2 (p) + (ts + tw × B) × N × P B × N × tc I≈ N ts × log2 (p) + N × tw + B × ts 2.3.8 Solution 2: Using Send and Receive This implementation also takes in account the row interleave factor along with the column interleave. Every process calculates the number of rows it has to process at every interleave and initializes the memory. Master process declares the matrix H and use it for its partial processing as well. Each process process N/(p*I) number of rows in every interleave and communicates the block with its neighbour process. Last process communi- cates its block with the master process and do not perform any communi- cation in the last interleave. 1 i f ( i d == 0 ) 2 { 3 f o r ( i =0; i <ColumnBlock ; i ++) 4 { 5 CHECK NULL( ( chunk = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ ( B) ) ) ) ; 6 7 f o r ( j =1; j<=s ; j ++) 8 { 9 f o r ( k=i ∗B+1;k<=( i +1)∗B && k<=b [ 0 ] && k <= N; k++)10 {11 i n t RowPosition ;1213 i f ( ( i n t e r l e a v e ∗p+i d ) < r )14 RowPosition = ( i n t e r l e a v e ∗ (N/ ( p∗ I ) +1)∗p ) + i d ∗ ( (N/ ( p∗ I ) +1) ) + j ;15 else 28
  • 33. 16 RowPosition = ( r ∗ (N/ ( p∗ I ) +1) ) + ( i n t e r l e a v e ∗p+id−r ) ∗ (N/ ( p∗ I ) ) + j ;1718 d i a g = h [ RowPosition − 1 ] [ k −1] + sim [ a [ RowPosition ] ] [ b [ k ] ] ;19 down = h [ RowPosition − 1 ] [ k ] + DELTA;20 r i g h t = h [ RowPosition ] [ k −1] + DELTA;21 max = MAX3( diag , down , r i g h t ) ;2223 i f (max <= 0 ) {24 h [ RowPosition ] [ k ] = 0 ;25 } else {26 h [ RowPosition ] [ k ] = max ;27 }28 chunk [ k−( i ∗B+1) ] = h [ RowPosition ] [ k ] ;29 }30 } // communicate t o e p a r t i a l b l o c k t o n e x t p r o c e s s31 MPI Send ( chunk , B, MPI INT , i d +1 ,0 ,MPI COMM WORLD) ;32 f r e e ( chunk ) ;33 }34 // end f i l l i n g m a t r i x H [ ] [ ] a t master35 } e l s e i f ( i d != p−1)36 { // f i l l i n g m a t r i x a t o t h e r p r o c e s s e s3738 f o r ( i =0; i <ColumnBlock ; i ++)39 {40 CHECK NULL( ( chunk = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ ( B) ) ) ) ;4142 MPI Recv ( chunk , B, MPI INT , id −1 ,0 ,MPI COMM WORLD,& status ) ;43 f o r ( z =0; z<B ; z++)44 {45 i f ( ( i ∗B+z ) <= N)46 h [ 0 ] [ i ∗B+z +1] = chunk [ z ] ;47 }48 f o r ( j =1; j<=s ; j ++)49 {50 i n t RowPosition ;5152 i f ( ( i n t e r l e a v e ∗p+i d ) < r )53 RowPosition = ( i n t e r l e a v e ∗ (N/ ( p∗ I ) +1)∗p ) + i d ∗ ( (N/ ( p∗ I ) +1) ) + j ;54 else55 RowPosition = ( r ∗ (N/ ( p∗ I ) +1) ) + ( i n t e r l e a v e ∗p+id−r ) ∗ (N/ ( p∗ I ) ) + j ;565758 f o r ( k=i ∗B+1;k<=( i +1)∗B && k<=b [ 0 ] && k <= N; k++)59 {60 d i a g = h [ j − 1 ] [ k −1] + sim [ a [ RowPosition ] ] [ b[k ] ] ;61 down = h [ j − 1 ] [ k ] + DELTA; 29
  • 34. 62 r i g h t = h [ j ] [ k −1] + DELTA;63 max = MAX3( diag , down , r i g h t ) ;64 i f (max <= 0 )65 h [ j ] [ k ] = 0;66 else67 h [ j ] [ k ] = max ;6869 chunk [ k−( i ∗B+1) ] = h [ j ] [ k ] ;70 }71 }72 MPI Send ( chunk , B, MPI INT , i d +1 ,0 ,MPI COMM WORLD) ;73 f r e e ( chunk ) ;74 } // end f i l l i n g m a t r i x a t o t h e r p r o c e s s e s75 } e l s e // s t a r t f i l l i n g m a t r i x a t l a s t p r o c e s s76 {77 f o r ( i =0; i <ColumnBlock ; i ++)78 {79 CHECK NULL( ( chunk = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ ( B) ) ) ) ;8081 MPI Recv ( chunk , B, MPI INT , id −1 ,0 ,MPI COMM WORLD,& status ) ;82 f o r ( z =0; z<B ; z++)83 {84 i f ( ( i ∗B+z ) <= N)85 h [ 0 ] [ i ∗B+z +1] = chunk [ z ] ;86 }8788 f r e e ( chunk ) ;89 f o r ( j =1; j<=s ; j ++)90 {91 i n t RowPosition ;92 i f ( ( i n t e r l e a v e ∗p+i d ) < r )93 RowPosition = ( i n t e r l e a v e ∗ (N/ ( p∗ I ) +1)∗p ) + i d ∗ ( (N/ ( p∗ I ) +1) ) + j ;94 else95 RowPosition = ( r ∗ (N/ ( p∗ I ) +1) ) + ( i n t e r l e a v e ∗p+id−r ) ∗ (N/ ( p∗ I ) ) + j ;969798 f o r ( k=i ∗B+1;k<=( i +1)∗B && k<=b [ 0 ] && k <= N; k++) 99 {100 d i a g = h [ j − 1 ] [ k −1] + sim [ a [ RowPosition ] ] [ b[k ] ] ;101 down = h [ j − 1 ] [ k ] + DELTA;102 r i g h t = h [ j ] [ k −1] + DELTA;103 max = MAX3( diag , down , r i g h t ) ;104 i f (max <= 0 )105 h [ j ] [ k ] = 0;106 else107 h [ j ] [ k ] = max ;108109 30
  • 35. 110 }111 }112 }113 } After filling the partial matrix H, every process sends the partial result to the master process at every interleave. Below mentioned is the code snippet of master gathering the partial result after every interleave. 1 i f ( i d ==0) 2 { 3 i n t row , c o l ; 4 f o r ( i =1; i <p ; i ++) 5 { 6 MPI Recv(&row , 1 , MPI INT , i , 0 ,MPI COMM WORLD,& status ) ; 7 CHECK NULL( ( r e c v h p t r = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ ( row ) ∗ (N) ) ) ) ; 8 9 MPI Recv ( r e c v h p t r , row∗N, MPI INT , i , 0 , MPI COMM WORLD,& s t a t u s ) ;1011 f o r ( j =0; j <row ; j ++)12 {13 i n t RowPosition ;1415 i f ( ( i n t e r l e a v e ∗p+i ) < r )16 RowPosition = ( i n t e r l e a v e ∗ (N/ ( p∗ I ) +1)∗p ) + i ∗ ( (N/ ( p∗ I ) +1) ) + j +1;17 else18 RowPosition = ( r ∗ (N/ ( p∗ I ) +1) ) + ( i n t e r l e a v e ∗p+i −r ) ∗ (N/ ( p∗ I ) ) + j +1;1920 f o r ( k =0;k<N; k++)21 h [ RowPosition ] [ k+1]= r e c v h p t r [ j ∗N+k ] ;2223 }24 free ( recv hptr ) ;25 }26 }27 else28 {29 MPI Send(&s , 1 , MPI INT , 0 , 0 ,MPI COMM WORLD) ;30 CHECK NULL( ( r e c v h p t r = ( i n t ∗ ) m a l l o c ( s i z e o f ( i n t ) ∗ ( s ) ∗ (N) ) ) ) ;3132 f o r ( j =0; j <s ; j ++)33 {34 f o r ( k=0;k<N; k++)35 r e c v h p t r [ j ∗N+k ] = h [ j + 1 ] [ k + 1 ] ;36 }37 MPI Send ( r e c v h p t r , s ∗N, MPI INT , 0 , 0 ,MPI COMM WORLD) ;3839 free ( recv hptr ) ; 31
  • 36. 40 } To summarize the interleave realization illustrated in Appendix D. 2.3.9 Solution 2: Linear-array Model 1. Every process calculates the (N/(p*I))*B number of values in every interleave before communicating a chunk with the other process. It takes ((N/B)+p-1)*I steps in total for computation.Below mentioned is the equation for computation. tcomp1 = I × ( N + p − 1) × ( p×I × B) × tc B N 2. After computation step each process communicates a Block with its neighbour process. There are (N/B)+p-2 steps of communication among all the processes. tcomm1 = (I −1)×( N +p−1)×(ts +B ×tw )+( N +p−2)×(ts +B ×tw ) B B 3. After completing their part of matrix H every process sends it to the master process. N tcomm2 = (ts + (p×I × N × tw ) × I 4. In the end master process puts all the partial result in the matrix H to finalize the matrix H. N tcomp2 = I × (ts + (p×I) × N × tw ) The total execution time can be calculated by combining all the times. ttotal = tcomp1 + tcomm1 + tcomp2 + tcomm2 ttotal = I ×( N +p−1)×( p×I ×B)×tc +(I −1)×( N +p−1)×(ts +B×tw )+ B N B ( N + p − 2) × (ts + B × tw ) + (ts + (p×I × N × tw ) × I + I × (ts + (p×I) × N × tw ) B N N 2.3.10 Solution 2: Optimum B and I for Linear-array Model dttotal (B) Optimum B can be derived by calculating dB and set the inequality to 0. dttotal (B) =0 dB And, using obtained model from previous section we obtain this follow- ing equation −IN N (p − 1) t + ((I − 1)(p − 1) + (p − 2))tw + 2 s =0 B p 32
   \big((I - 1)(p - 1) + (p - 2)\big) t_w + \frac{N(p - 1)}{p} = \frac{I N}{B^2} t_s

   B^2 = \frac{I N t_s}{\big((I - 1)(p - 1) + (p - 2)\big) t_w + \frac{N(p - 1)}{p}}

   B^2 = \frac{p I N t_s}{\big((I - 1)(p - 1) + (p - 2)\big) p t_w + N(p - 1)}

   B = \sqrt{\frac{p I N t_s}{\big((I - 1)(p - 1) + (p - 2)\big) p t_w + N(p - 1)}}

However, we cannot find an optimum I for the Blocking-and-Interleave technique, because the derivative of t_total with respect to I reduces to a constant, as shown below:

   \frac{d\,t_{total}(I)}{dI} = 0

   \left(\frac{N}{B} + p - 1\right)(t_s + B t_w) = 0

Looking at the expression for dt_total(I)/dI, the interleave factor only introduces more communication time when sending and receiving the shared data. Therefore, no optimum interleave level can be derived from this model.

2.3.11 Solution 2: 2-D Mesh Model

As discussed in Section 2.2.9, the 2-D mesh model is the same as the linear-array model, because the 2-D mesh only affects the broadcast procedure and Solution 2 does not include any broadcast in its implementation.
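To make the model concrete, the following is a minimal C sketch that evaluates t_total from Section 2.3.9 for a set of candidate block sizes and also computes the closed-form optimum B from Section 2.3.10. The machine parameters ts, tw and tc in main are illustrative placeholders, not measured Altix values, so the printed numbers only show how the model would be used, not what the machine actually delivers.

  #include <math.h>
  #include <stdio.h>

  /* Linear-array model of Solution 2 (Section 2.3.9).
   * ts = message startup time, tw = per-word transfer time,
   * tc = per-cell computation time (assumed machine parameters). */
  double t_total(double N, double p, double I, double B,
                 double ts, double tw, double tc)
  {
      double steps  = N / B + p - 1.0;                    /* pipeline steps      */
      double tcomp1 = I * steps * (N / (p * I)) * B * tc;
      double tcomm1 = (I - 1.0) * steps * (ts + B * tw)
                    + (N / B + p - 2.0) * (ts + B * tw);
      double tcomm2 = (ts + (N / (p * I)) * N * tw) * I;  /* send partial H      */
      double tcomp2 = I * (ts + (N / (p * I)) * N * tw);  /* master assembles H  */
      return tcomp1 + tcomm1 + tcomp2 + tcomm2;
  }

  /* Closed-form optimum B from Section 2.3.10. */
  double optimum_B(double N, double p, double I, double ts, double tw)
  {
      return sqrt(p * I * N * ts /
                  (((I - 1.0) * (p - 1.0) + (p - 2.0)) * p * tw + N * (p - 1.0)));
  }

  int main(void)
  {
      /* N, p and the candidate B values follow the experiments in Section 3;
       * ts, tw and tc are illustrative guesses, not measured Altix constants. */
      double N = 10000.0, p = 8.0, I = 1.0;
      double ts = 1e-5, tw = 1e-8, tc = 1e-8;
      double candidates[] = { 10.0, 50.0, 100.0, 200.0, 500.0, 1000.0 };

      for (int i = 0; i < 6; i++)
          printf("B = %6.0f -> t_total = %.4f s\n",
                 candidates[i], t_total(N, p, I, candidates[i], ts, tw, tc));

      printf("closed-form optimum B = %.3f\n", optimum_B(N, p, I, ts, tw));
      return 0;
  }

Compiled with, e.g., cc model.c -lm, this lets the predicted t_total(B) curve be compared against the empirically chosen blocking sizes reported in Section 3.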
3 Performance Results

We measured the performance of the parallel versions on the Altix machine and compared the results against the sequential version.

3.1 Solution 1

3.1.1 Performance of Sequential Code

First, we measured the performance of the sequential Smith-Waterman code. Figure 6 shows the results.

Figure 6: Sequential Code Performance Measurement Result

Figure 6 shows that as N increases, the time taken to fill matrix H also increases, almost linearly.
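The report does not show the timing harness itself; as a minimal sketch, the sequential matrix-fill phase can be timed with a wall-clock timer such as gettimeofday. The call fill_matrix_h below is a hypothetical name for the sequential kernel and is left commented out, so the snippet compiles on its own.

  #include <stdio.h>
  #include <sys/time.h>

  /* Wall-clock time in seconds. */
  static double wall_time(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec * 1e-6;
  }

  int main(void)
  {
      double t0 = wall_time();
      /* fill_matrix_h();  -- hypothetical call to the sequential Smith-Waterman fill */
      double t1 = wall_time();
      printf("matrix fill time: %.3f s\n", t1 - t0);
      return 0;
  }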
3.1.2 Find Out Optimum Number of Processor (P)

First, we observed the performance by fixing the size of the compared proteins (N) to 5000 and 10000, the block size (B) to 100, and the interleave factor (I) to 1. The results are shown in Figure 7.

1. Protein size equal to 5000 (N = 5000), block size (B) of 100, and interleave factor (I) of 1.

   Figure 7: Measurement result when N is 5000, B is 100 and I is 1

   The result is plotted in Figure 8.

   Figure 8: Diagram of measurement result when N is 5000, B is 100, I is 1

   When the protein size (N) is 5000 and the number of processors (P) is 4, we obtain a speedup of t_serial / t_parallel = 3.3 / 1.454 = 2.26.
2. Protein size equal to 10000 (N = 10000).

   We obtained the results shown in Figure 9.

   Figure 9: Measurement result when N is 10000, B is 100 and I is 1

   The result is plotted in Figure 10.

   Figure 10: Diagram of measurement result when N is 10000, B is 100, I is 1

   When the protein size (N) is 10000 and the number of processors (P) is 8, we obtain a speedup of t_serial / t_parallel = 12.508 / 2.47 = 5.06.

Based on the results above, the maximum speedup is achieved when the number of processors (P) is 8 and the protein size (N) is 10000. Therefore, for the subsequent experiments, we fix the number of processors to 8 and vary the other parameters.

3.1.3 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results to find the optimum blocking size (B). We fix the number of processors (P) to 8, the protein size (N) to 10000, and the interleave factor (I) to 1. The results are shown in Figure 11.
Figure 11: Performance measurement result when N is 10000, P is 8, I is 1

Figure 12: Diagram of measurement result when N is 10000, P is 8, I is 1

The results are plotted in Figure 12. The right-hand side of Figure 12 is zoomed in so that we get a clearer picture of the performance when B is less than or equal to 500.

We found that the empirically optimum blocking size (B) for Solution 1 is 100, which yields a speedup of t_serial / t_parallel = 12.508 / 2.401 = 5.21.
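For reference, the speedup values quoted throughout this section are simply the ratio of the sequential to the parallel execution time. The small helper below reproduces the figure above and also reports parallel efficiency (speedup divided by P), which is not quoted in the report but follows directly from the same numbers.

  #include <stdio.h>

  int main(void)
  {
      /* Measured times from Figures 11-12: sequential vs. parallel (P = 8, B = 100). */
      double t_serial = 12.508, t_parallel = 2.401;
      int P = 8;

      double speedup = t_serial / t_parallel;      /* = 5.21 */
      printf("speedup    = %.2f\n", speedup);
      printf("efficiency = %.2f\n", speedup / P);  /* fraction of ideal P-fold speedup */
      return 0;
  }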
3.1.4 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) found in the previous section, we searched for the most optimum I. The results are shown in Figure 13.

Figure 13: Diagram of measurement result when N is 10000, P is 8, B is 100

We found that the optimum I is 1. Using this value, we obtain a 4.76 times speedup compared to the sequential execution.

3.2 Solution 1-Improved

We repeated the same experiments as for Solution 1 to obtain the corresponding data for our improved solution.

3.2.1 Find Out Optimum Number of Processor (P)

First, we observed the performance by fixing the protein size (N) to 10000, the block size (B) to 100, and the interleave factor (I) to 1. The results are shown in Figure 14.

Figure 14: Measurement result when N is 10000, B is 100 and I is 1

The result is plotted in Figure 15. When the protein size (N) is 10000 and the number of processors (P) is 8, we obtain a speedup of t_serial / t_parallel = 12.508 / 2.977 = 4.201.
Figure 15: Diagram of measurement result when N is 10000, B is 100, I is 1

Based on the results above, the maximum speedup is again achieved when the number of processors (P) is 8 and the protein size (N) is 10000. Therefore, for the subsequent experiments, we fix the number of processors to 8 and vary the other parameters.
3.2.2 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results to find the optimum blocking size (B). We fix the number of processors (P) to 8, the protein size (N) to 10000, and the interleave factor (I) to 1. The results are shown in Figure 16.

Figure 16: Performance measurement result when N is 10000, P is 8, I is 1

The results are plotted in Figure 17. The right-hand side of Figure 17 is zoomed in so that we get a clearer picture of the performance when B is less than or equal to 500.

Figure 17: Diagram of measurement result when N is 10000, P is 8, I is 1

We found that the empirically optimum blocking size (B) for the improved Solution 1 is 200, which yields a speedup of t_serial / t_parallel = 12.508 / 2.464 = 5.08.
3.2.3 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) found in the previous section, we searched for the most optimum I. The results are shown in Figure 18.

Figure 18: Diagram of measurement result when N is 10000, P is 8, B is 200

We found that the optimum I is 2, which gives a speedup of t_serial / t_parallel = 12.508 / 2.613 = 4.79.

3.3 Solution 2

Using the same sequential performance results obtained during the Solution 1 evaluation, we measured the performance of Solution 2.

3.3.1 Find Out Optimum Number of Processor (P)

The first step was to observe the performance while fixing the size of the compared proteins (N) and the block size (B), and setting the interleave factor (I) to 1.

1. Protein size equal to 5000 (N = 5000), block size (B) of 100, and interleave factor (I) of 1.

   Figure 19: Measurement result when N is 5000, B is 100 and I is 1

   The result is plotted in Figure 20.
   Figure 20: Diagram of measurement result when N is 5000, B is 100, I is 1

   Using a protein size (N) of 5000, the maximum speedup, achieved with 32 processors (P), is 31.55% over the existing sequential code.

2. Protein size equal to 10000 (N = 10000), block size (B) of 100, and interleave factor (I) of 1.

   Figure 21: Measurement result when N is 10000, B is 100 and I is 1

   The result is plotted in Figure 22. Using a protein size (N) of 10000 and 32 processors (P), we achieve a 54.67% speedup over the existing sequential code.

Based on the results obtained in this section, the parallel implementation of Solution 2 achieves its best speedup when the number of processors is 32.
Figure 22: Diagram of measurement result when N is 10000, B is 100, I is 1

In our subsequent performance evaluation, we therefore fix the number of processors to 32 and look for the most optimum values of the other variables.

3.3.2 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results to find the optimum blocking size (B). We fix the number of processors (P) to 32, the protein size (N) to 10000, and the interleave factor (I) to 1. The results are shown in Figure 23.

Figure 23: Performance measurement result when N is 10000, P is 32, I is 1

The results are plotted in Figure 24 below.

We found that the empirically optimum blocking size (B) for Solution 2 is 50. Interestingly, the performance using this B is slightly worse than the result from Section 3.3.1.
Figure 24: Diagram of measurement result when N is 10000, P is 32, I is 1

Using B equal to 50, we achieve a 53.91% speedup compared to the sequential execution, but are 1.69% slower than the result from Section 3.3.1.

3.3.3 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) found in the previous section, we searched for the most optimum I. The results are shown in Figure 25 and Figure 26.

Figure 25: Performance measurement result when N is 10000, P is 32, B is 50

We found that the optimum I is 30. Another interesting point is that the execution times are very close to each other when I is between 10 and 100. That means that for the current configuration (N = 10000, P = 32 and B = 50), the value of I does not affect the execution time much when it is between 10 and 100.
Figure 26: Diagram of measurement result when N is 10000, P is 32, B is 50

Practically, we can choose any I value from 10 to 100. Using the optimum I of 30, we obtain a 58.79% speedup compared to the sequential execution, and a 10.58% speedup compared to the result without interleaving.
3.4 Putting All the Optimum Values Together

Figure 27 and Figure 28 compare the execution times of the solutions when the optimum parameters are used.

Figure 27: Putting all of them together

Figure 28: Putting all of them together - the plot

The improved Solution 1 has a slightly higher execution time than the original Solution 1. The times measured for the improved Solution 1 include not only the cost of the main part (the interleave loop) but also all of the accompanying communication, such as the initial broadcast and the final gather. Therefore, its result is very close to that of the original Solution 1.
3.5 Testing with different GAP penalties

Using the optimum blocking size (B) of 50, the optimum interleave factor (I) of 30, and a protein size of 10000, we measured the execution time for different gap penalties. The results are shown in Figure 29.

Figure 29: Testing with different gap penalties

Figure 30: gap penalty vs Time

We found that changing the gap penalty has no effect, or only a very minor effect, on the overall execution time of the implementations. This is expected, since the gap penalty (DELTA) only changes the values written into the matrix, not the amount of computation or communication performed.
4 Conclusions

We successfully implemented three different solutions of the Smith-Waterman algorithm. Initially, we provided a solution using Scatter and Gather. We found that the first version of Solution 1 exhibits an MPI-barrier-like property of blocking all processes at a certain point. In general, MPI_Gather does not have such a property, but in our pipelined realization, where the processes depend on each other, each process waits until the master is able to send the data. Therefore, we optimized our first implementation so that it does not have the aforementioned barrier property: in the improved version, each process allocates enough memory to store the results of all interleave stages, and the final gather is invoked only after all the calculation work is completed. The second implementation used the primitive Send and Receive operations provided by MPI.

For all the implementations we performed the evaluation and testing on the Altix machine and empirically found the optimum B and I. We created performance models for the implementations using two different interconnection networks, i.e. linear array and 2-D mesh, and also calculated the optimum B and I analytically using derivatives.

We tested our implementations for different values of B, I, p and DELTA. The factor p, the number of processors, has the major effect on the execution time: increasing the number of processors decreases the execution time. The factor B also improves the performance of the code, as shown in the results. DELTA has no effect on the execution time of the implementations. We also found that the execution time shows a certain deviation between runs, so the choice of optimal parameters is very tricky.
APPENDIX

A Source Code Compilation

We created a Makefile to automate the compilation process. To compile the source code, we use this command:

  make

To remove the executables created by the compilation process, we use this command:

  make clean

Here is the content of the Makefile:

  CXX = icc

  all: protein_free_par

  clean:
          rm protein_free_par

  protein_free_par: proteinFree.cpp
          ${CXX} proteinFree.cpp -o protein_free_par -lmpi
B Execution on ALTIX

We used the Slurm+MOAB utility to submit the jobs for execution on the Altix machine. The following is the script we used to submit a job to Slurm:

  #!/bin/bash
  # @ job_name = test
  # @ initialdir = .
  # @ output = mpi_%j.out
  # @ error = mpi_%j.err
  # @ total_tasks = 4
  # @ wall_clock_limit = 00:02:00

  time mpirun -np 4 ./protein_free_par a500k b500k data.score 1 5000 100 1

To execute the script we used the mnsubmit command:

  mnsubmit script

Our script can be found in the following directory:

  /home/cursos/ampp/ampp03/Documents/AMPP-Final/ProteinFree/script
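As a usage sketch, the same script can be adapted to the best configuration found for Solution 2 in Section 3.3 (P = 32, N = 10000, B = 50, I = 30), assuming the trailing arguments of protein_free_par keep the same meaning as in the script above, with N, B and I as the last three values:

  #!/bin/bash
  # @ job_name = test
  # @ initialdir = .
  # @ output = mpi_%j.out
  # @ error = mpi_%j.err
  # @ total_tasks = 32
  # @ wall_clock_limit = 00:02:00

  # Last three arguments assumed to be N, B and I, as in the script above.
  time mpirun -np 32 ./protein_free_par a500k b500k data.score 1 10000 50 30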
C Timing diagram for Blocking technique in Solution 2

  Figure 31: Performance Model Solution 2
D Timing diagram for Blocking-and-Interleave technique in Solution 2

  Figure 32: Performance Model with Interleave