
# Parallelization of Smith-Waterman Algorithm using MPI


Final report for Parallelization of Smith-Waterman Algorithm using MPI project.


### Parallelization of Smith-Waterman Algorithm using MPI

1. 1. `Universitat Politecnica de Catalunya. AMPP Final Project Report: Parallelization of Smith-Waterman Algorithm. Authors: Iuliia Proskurnia, Arinto Murdopo, Muhammad Anis uddin Nasir. Supervisors: Josep R. Herrero, Dani Jimenez-Gonzalez. January 16, 2012
#### Contents

- 1 Introduction
- 2 Main Issues and Solutions
  - 2.1 Available Parallelization Techniques
  - 2.2 Blocking Technique
    - 2.2.1 Solution 1: Using Scatter and Gather
    - 2.2.2 Solution 1: Linear-array Model
    - 2.2.3 Solution 1: Optimum B for Linear-array Model
    - 2.2.4 Solution 1: 2-D Mesh Model
    - 2.2.5 Solution 1: Optimum B for 2-D Mesh Model
    - 2.2.6 Solution 2: Using Send and Receive
    - 2.2.7 Solution 2: Linear-array Model
    - 2.2.8 Solution 2: Optimum B for Linear-array Model
    - 2.2.9 Solution 2: 2-D Mesh Model
  - 2.3 Blocking-and-Interleave Technique
    - 2.3.1 Solution 1: Using Scatter and Gather
    - 2.3.2 Solution 1: Linear-Array Model
    - 2.3.3 Solution 1: Optimum B and I for Linear-array Model
    - 2.3.4 Solution 1: 2-D Mesh Model
    - 2.3.5 Solution 1: Optimum B and I for 2-D Mesh Model
    - 2.3.6 Solution 1: Improvement
    - 2.3.7 Solution 1: Optimum B and I for the Improved Solution
    - 2.3.8 Solution 2: Using Send and Receive
    - 2.3.9 Solution 2: Linear-array Model
    - 2.3.10 Solution 2: Optimum B and I for Linear-array Model
    - 2.3.11 Solution 2: 2-D Mesh Model
- 3 Performance Results
  - 3.1 Solution 1
    - 3.1.1 Performance of Sequential Code
    - 3.1.2 Find Out Optimum Number of Processor (P)
    - 3.1.3 Find Out Optimum Blocking Size (B)
    - 3.1.4 Find Out Optimum Interleave Factor (I)
  - 3.2 Solution 1-Improved
    - 3.2.1 Find Out Optimum Number of Processor (P)
    - 3.2.2 Find Out Optimum Blocking Size (B)
    - 3.2.3 Find Out Optimum Interleave Factor (I)
  - 3.3 Solution 2
    - 3.3.1 Find Out Optimum Number of Processor (P)
    - 3.3.2 Find Out Optimum Blocking Size (B)
    - 3.3.3 Find Out Optimum Interleave Factor (I)
  - 3.4 Putting All the Optimum Values Together
  - 3.5 Testing with different GAP penalties
- 4 Conclusions
- A Source Code Compilation
- B Execution on ALTIX
- C Timing diagram for Blocking technique in Solution 2
- D Timing diagram for Blocking-and-Interleave technique in Solution 2
#### List of Figures

- Figure 1: Blocking Communication
- Figure 2: Data Partitioning among processes
- Figure 3: Blocking Communication
- Figure 4: Blocking and interleave communication
- Figure 5: Blocking and Interleave Communication
- Figure 6: Sequential Code Performance Measurement Result
- Figure 7: Measurement result when N is 5000, B is 100 and I is 1
- Figure 8: Diagram of measurement result when N is 5000, B is 100, I is 1
- Figure 9: Measurement result when N is 10000, B is 100 and I is 1
- Figure 10: Diagram of measurement result when N is 10000, B is 100, I is 1
- Figure 11: Performance measurement result when N is 10000, P is 8, I is 1
- Figure 12: Diagram of measurement result when N is 10000, P is 8, I is 1
- Figure 13: Diagram of measurement result when N is 10000, P is 8, B is 100
- Figure 14: Measurement result when N is 10000, B is 100 and I is 1
- Figure 15: Diagram of measurement result when N is 10000, B is 100, I is 1
- Figure 16: Performance measurement result when N is 10000, P is 8, I is 1
- Figure 17: Diagram of measurement result when N is 10000, P is 8, I is 1
- Figure 18: Diagram of measurement result when N is 10000, P is 8, B is 200
- Figure 19: Measurement result when N is 5000, B is 100 and I is 1
- Figure 20: Diagram of measurement result when N is 5000, B is 100, I is 1
- Figure 21: Measurement result when N is 10000, B is 100 and I is 1
- Figure 22: Diagram of measurement result when N is 10000, B is 100, I is 1
- Figure 23: Performance measurement result when N is 10000, P is 32, I is 1
- Figure 24: Diagram of measurement result when N is 10000, P is 32, I is 1
- Figure 25: Performance measurement result when N is 10000, P is 32, B is 50
- Figure 26: Diagram of measurement result when N is 10000, P is 32, B is 50
- Figure 27: Putting all of them together
- Figure 28: Putting all of them together - the plot
- Figure 29: Testing with different gap penalties
- Figure 30: gap penalty vs Time
- Figure 31: Performance Model Solution 2
- Figure 32: Performance Model with Interleave
#### 1 Introduction

The Smith–Waterman algorithm is a well-known algorithm for performing local sequence alignment, that is, for determining similar regions between two nucleotide or protein sequences. Proteins are made of amino acid sequences, and proteins with similar structure have similar amino acid sequences. In this project we implemented a parallel version of the Smith-Waterman algorithm using the Message Passing Interface (MPI).

To compare two amino acid sequences, we first have to align them. To find the best alignment between two sequences, the algorithm first populates a matrix H of size N × N (N is the size of the sequence) using a scoring criterion. It requires a scoring matrix (the cost of matching two symbols) and a gap penalty for the mismatch of two symbols. After populating the matrix H, we can obtain the optimum local alignment by tracing back through the matrix, starting from its highest value.

In our implementation of the Smith-Waterman algorithm we populated the matrix H in parallel using multiple processes running on multicore machines. We used pipelined computation to achieve a specific degree of parallelism and compared different parallelization techniques to find the best one for the problem. We started by parallelizing our code using different blocking sizes B at the column level. Furthermore, we also introduced parallelization using different levels of interleave I at the row level.

For performance measurement we created performance models of both implementations for two interconnection networks: the linear array and the 2-D mesh. We evaluated our code on an Altix machine using different values of the parameters ∆ (gap penalty), B (column blocking size) and I (row interleaving factor) to empirically find the optimum B and I for the problem. We also calculated the optimum B and I by finding the global minima of the equations of the performance model.
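The matrix-filling step described in the introduction can be sketched sequentially in C as follows. This is a minimal illustration under stated assumptions, not the report's code: the function name `sw_fill`, the toy scoring function `toy_sim` (+2 for a match, -1 for a mismatch) and the flat row-major layout of H are all hypothetical, whereas the report uses a PAM similarity matrix and a `DELTA` gap penalty.

```c
/* Hypothetical toy scoring: +2 for a match, -1 for a mismatch.
 * The report uses a PAM similarity matrix instead. */
static int toy_sim(short x, short y) { return x == y ? 2 : -1; }

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fill an (n+1) x (m+1) matrix h (row-major, caller-allocated) for the
 * sequences a[0..n-1] and b[0..m-1]; delta is the (negative) gap penalty.
 * Returns the highest score in h, which is where a traceback would start. */
int sw_fill(const short *a, int n, const short *b, int m, int delta, int *h) {
    int w = m + 1, best = 0;
    for (int j = 0; j <= m; j++) h[j] = 0;            /* first row */
    for (int i = 1; i <= n; i++) {
        h[i * w] = 0;                                  /* first column */
        for (int j = 1; j <= m; j++) {
            int diag  = h[(i - 1) * w + (j - 1)] + toy_sim(a[i - 1], b[j - 1]);
            int down  = h[(i - 1) * w + j] + delta;
            int right = h[i * w + (j - 1)] + delta;
            int v = max3(diag, down, right);
            h[i * w + j] = v > 0 ? v : 0;              /* clamp at 0 for local alignment */
            if (h[i * w + j] > best) best = h[i * w + j];
        }
    }
    return best;
}
```

Each cell depends only on its upper, left and upper-left neighbours, which is the dependency pattern that the pipelined, block-wise parallelization in the report exploits.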
#### 2 Main Issues and Solutions

##### 2.1 Available Parallelization Techniques

We can achieve pipelining with blocking at both the column and the row level. Blocking at the column level can be interpreted in different ways.

1. Each processor Pi processes B complete columns of the matrix before doing any communication.
2. Each processor Pi processes B complete columns. However, after processing the B columns of one row of the matrix, it communicates with the next processor.
3. Each processor Pi processes B complete columns. However, after processing the B columns of a set of rows of the matrix, it communicates.
4. Each processor Pi processes N/P complete rows. After processing B columns of those N/P rows, it communicates.

Among the techniques mentioned above, we chose the last one because it provides the best pipelined computation among these schemes.

##### 2.2 Blocking Technique

###### 2.2.1 Solution 1: Using Scatter and Gather

Based on the technique chosen from the available parallelization techniques, we developed the following solution. Note that this solution already incorporates I (the interleave factor), but here I is set to 1. In the first step, the process with rank 0 (the master process) reads all the necessary files, which are the two protein sequence files. The result is stored in `short* a` and `short* b`. It also allocates enough memory to store the resulting matrix, as shown in the code snippet below.

```c
{
    // Note that sizeA is the total number of rows that we need to process.
    // We round N up if N is not divisible by the total number of processes,
    // as shown below. I is set to 1 here.

    if (N % (total_processes * I) != 0) {
        sizeA = N + (total_processes * I) - (N % (total_processes * I)); // handle the case where N is not divisible by (total_processes * I)
    } else {
        sizeA = N;
    }

    read_files(in1, in2, a, b, N - 1); // in1 = input file 1, in2 = input file 2, a = data read from in1, b = data read from in2
    chunk_size = sizeA / (total_processes * I); // number of rows that each process needs to work on
    CHECK_NULL((h_all_ptr = (int *) calloc(N * (sizeA + 1), sizeof(int)))); // resulting data
    CHECK_NULL((h_all = (int **) calloc((sizeA + 1), sizeof(int *)))); // list of row pointers

    for (i = 0; i <= sizeA; i++)
        h_all[i] = h_all_ptr + i * N; // put the row pointers in an array

    // initialize the first row of the resulting matrix with 0
    for (i = 0; i < N; i++)
    {
        h_all[0][i] = 0;
    }

}
```

Every process reads the PAM matrix, and the master process broadcasts the values of `chunk_size`, N, B and I.

```c
MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast chunk size, which is the number of rows calculated by each slave
MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast N
MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast B
MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD); // broadcast I
```

Then each process allocates enough memory to receive its chunk of `chunk_size` elements. Processes other than rank 0 also allocate memory to receive the whole of protein 2 (which has size N).
```c
CHECK_NULL((chunk_a = (short *) calloc(sizeof(short), chunk_size))); // slave processes will obtain it from the master process
if (rank != 0) {
    CHECK_NULL((b = (short *) malloc(sizeof(short) * (N)))); // slave processes will obtain it from the master process
}

MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD); // broadcast protein 2 to every process
```

Now for the parallel part. First we calculate how many blocks we will process, using the variables `total_blocks` and `last_block`. `last_block` holds the size of the last block to process when N is not divisible by B (N % B != 0).

```c
int total_blocks = N / B + (N % B == 0 ? 0 : 1);
int last_block = N % B == 0 ? B : N % B;
```
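The rounding of N and the block bookkeeping used in Solution 1 can be checked in isolation with a small sketch. The helper names `round_up_rows`, `count_blocks` and `last_block_width` are hypothetical and introduced only for illustration; the arithmetic mirrors the `sizeA`, `total_blocks` and `last_block` expressions in the report's snippets.

```c
/* Round n up to the next multiple of (p * interleave), as done for sizeA. */
int round_up_rows(int n, int p, int interleave) {
    int unit = p * interleave;
    return n % unit == 0 ? n : n + unit - n % unit;
}

/* Number of column blocks of width b needed to cover n columns. */
int count_blocks(int n, int b) {
    return n / b + (n % b == 0 ? 0 : 1);
}

/* Width of the final block: b when n divides evenly, else the remainder. */
int last_block_width(int n, int b) {
    return n % b == 0 ? b : n % b;
}
```

For example, with N = 10, 4 processes and I = 1, the rows are padded to 12, so each process works on 3 rows; with B = 4 there are 3 column blocks and the last one is 2 columns wide.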
Then we scatter the first protein sequence (stored in `a`), with each scattered part having size `chunk_size`. After each process receives its part, the computation begins with the process with rank 0. It does not wait for data from any other process and directly calculates the first block of data. Meanwhile, every other process with rank r waits for data from the process with rank r-1. The data sent between processes is the last row of the calculated block (an array of size B).

After a process receives the required data, it performs the computation for that block. At the end of each block, the process with rank r sends the last row of its calculated block, of size B, to the neighboring process with rank r+1. Finally, we perform a gather to combine the results. Note that `current_interleave` is 0 and I is set to 1 here because we are not using the interleaving factor. The code snippet below shows how this is implemented.

```c
for (int current_interleave = 0; current_interleave < I; current_interleave++) {
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT, 0, MPI_COMM_WORLD); // chunk_a is the receiving buffer
    int current_column = 1;
    for (i = 0; i < chunk_size + 1; i++) h[i][0] = 0;

    for (int current_block = 0; current_block < total_blocks; current_block++) {
        // Receive
        int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) { // if rank 0 is processing the first block, it doesn't need to receive anything
            for (int k = current_column; k < block_end; k++) {
                h[0][k] = 0; // init row 0
            }
        } else {
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1; // receive from neighboring process
            int size_to_receive = current_block == total_blocks - 1 ? last_block : B;
            MPI_Recv(h[0] + current_block * B, size_to_receive, MPI_INT, receive_from, 0, MPI_COMM_WORLD, &status);
        }
        // Process
        for (j = current_column; j < block_end; j++, current_column++) {
            for (i = 1; i < chunk_size + 1; i++) {
                diag = h[i - 1][j - 1] + sim[chunk_a[i - 1]][b[j - 1]];
                down = h[i - 1][j] + DELTA;
                right = h[i][j - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h[i][j] = 0;
                } else {
                    h[i][j] = max;
                }
            }
        }

        // Send
        if (current_interleave + 1 != I || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1 ? last_block : B;
            MPI_Send(h[chunk_size] + current_block * B, size_to_send, MPI_INT, send_to, 0, MPI_COMM_WORLD);
            print_vector(h[chunk_size] + current_block * B, size_to_send);
        }
    }

    // Gathering result
    MPI_Gather(hptr + N, N * chunk_size, MPI_INT,
               h_all_ptr + N + current_interleave * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}
```

Once the result is gathered, the process with rank 0 deallocates the memory and optionally verifies the result. The verification compares the h matrix of the parallel version (`h_all`) with the h matrix of the serial version (`hverify`).

```c
if (rank == 0) {
    if (verifyResult == 1) {
        Max = 0;
        xMax = 0;
        yMax = 0;
        CHECK_NULL((hverifyptr = (int *) malloc(sizeof(int) * (N + 1) * (N + 1))));
        CHECK_NULL((hverify = (int **) malloc(sizeof(int *) * (N + 1))));
        /* Mount hverify[N][N] */
        for (i = 0; i <= N; i++)
            hverify[i] = hverifyptr + i * (N + 1);
        for (i = 0; i <= N; i++) hverify[i][0] = 0;
        for (j = 0; j <= N; j++) hverify[0][j] = 0;

        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++) {
                diag = hverify[i - 1][j - 1] + sim[a[i - 1]][b[j - 1]];
                down = hverify[i - 1][j] + DELTA;
                right = hverify[i][j - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    hverify[i][j] = 0;
                }
                else if (max == diag) {
                    hverify[i][j] = diag;
                }
                else if (max == down) {
                    hverify[i][j] = down;
                }
                else {
                    hverify[i][j] = right;
                }
                if (max > Max) {
                    Max = max;
                    xMax = i;
                    yMax = j;
                }
            }

        int verFailFlag = 0;
        for (i = 0; i <= N - 1; i++) {
            for (j = 0; j <= N - 1; j++) {
                if (h_all[i][j] != hverify[i][j]) {
                    printf("Verification fail!\n");
                    printf("h_all[i][j] = %d, hverify[i][j] = %d\n", h_all[i][j], hverify[i][j]);
                    verFailFlag = -1;
                    break;
                }
            }

            if (verFailFlag != 0) {
                break;
            }
        }

        if (verFailFlag == 0)
        {
            printf("Verification success!\n");
        }

    }

    free(hverifyptr);
    free(hverify);
    free(a);
    free(h_all_ptr);
    free(h_all);
}

free(b);
free(chunk_a);
free(h);
free(hptr);

MPI_Finalize();
```

Figure 1: Blocking Communication

To summarize this technique, Figure 1 shows how the matrix is divided into blocks. The number inside each block indicates the step. The red portion of block 1 indicates the amount of data (B integers) that is sent from process 0 to process 1 at the end of the calculation of block 1, in step 1.

###### 2.2.2 Solution 1: Linear-array Model

First, we use a linear-array topology to model our solution. Here is the model for the communication part of our chosen blocking technique.

1. Broadcasting chunk size, N, B, and I

   $$t_{\text{comm-bcast-4-int}} = 4 \times (t_s + t_w) \times \log_2(p)$$

2. Broadcasting of the 2nd protein sequence (vector b)

   $$t_{\text{comm-bcast-protein-seq}} = (t_s + t_w \times N) \times \log_2(p)$$

3. Scattering chunk size for each process to compute. Note that the chunk size is

   $$\text{chunk\_size} = \frac{N}{p}$$

   Therefore the communication time for scattering is

   $$t_{\text{comm-scatter-protein-seq}} = t_s \times \log_2(p) + t_w \times \frac{N}{p} \times (p - 1)$$
4. Sending shared data. To start the first block of computation, the process with rank 0 does not need to wait for any data from other processes. That means we only have $(\frac{N}{B} + p - 2)$ stages for sending shared data. The shared data is the last row of the current finished block, which consists of B items. Therefore the communication time to send shared data is

   $$t_{\text{comm-send-shared-data}} = \left(\frac{N}{B} + p - 2\right) \times (t_s + B \times t_w)$$

5. Gathering calculated data. Finally, we need to perform a gather to combine all calculated data. Note that every process contributes N × chunk_size data, which equals $N \times \frac{N}{p}$ items. Therefore the communication time for this step is

   $$t_{\text{comm-gather}} = t_s \times \log_2(p) + t_w \times \frac{N}{p} \times N \times (p - 1)$$

6. Putting all the communication times together

   $$t_{\text{comm-all}} = t_{\text{comm-bcast-4-int}} + t_{\text{comm-bcast-protein-seq}} + t_{\text{comm-scatter-protein-seq}} + t_{\text{comm-send-shared-data}} + t_{\text{comm-gather}}$$

   $$t_{\text{comm-all}}(B) = (7\log_2(p) + p - 2)\,t_s + \left((4 + N)\log_2(p) + N + \frac{(N + N^2)(p - 1)}{p}\right) t_w + \frac{N}{B}\,t_s + (p - 2) \times B \times t_w$$

Now we calculate the computation time for this blocking technique. Note that in our blocking technique we have $\frac{N}{B} + p - 1$ stages of block calculation. In each block calculation, we need to compute $\frac{N}{p} \times B$ points. Therefore, if we represent the time to compute one point as $t_c$, we obtain the following computation time model:

$$t_{calc} = \left(\frac{N}{B} + p - 1\right) \times \left(\frac{N}{p} \times B\right) \times t_c$$

$$t_{calc} = \left(\frac{N^2}{p} + N B - \frac{N B}{p}\right) \times t_c$$

$$t_{calc} = \left(\frac{N^2}{p} + \frac{N \times (p - 1)}{p} \times B\right) \times t_c$$

The final model is obtained by adding the computation time and the communication time:

$$t_{total} = t_{comm} + t_{calc}$$

$$t_{total}(B) = (7\log_2(p) + p - 2)\,t_s + \left((4 + N)\log_2(p) + N + \frac{(N + N^2)(p - 1)}{p}\right) t_w + \frac{N}{B}\,t_s + (p - 2) \times B \times t_w + \left(\frac{N^2}{p} + \frac{N \times (p - 1)}{p} \times B\right) \times t_c$$
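To get a feel for how the total time of the linear-array model reacts to B, the terms that depend on B can be evaluated numerically. The sketch below is an illustration only: the function names `model_b_terms` and `best_block_size` are hypothetical, the terms independent of B (broadcast, scatter, gather and the N²/p computation term) are deliberately left out, and any parameter values passed in are made up rather than measured on the Altix machine.

```c
/* B-dependent part of the linear-array model:
 *   (N/B) * ts + (p - 2) * B * tw + (N * (p - 1) / p) * B * tc
 * Terms that do not depend on B are left out on purpose. */
double model_b_terms(double n, double p, double b,
                     double ts, double tw, double tc) {
    return (n / b) * ts + (p - 2.0) * b * tw + (n * (p - 1.0) / p) * b * tc;
}

/* Try every integer block size from 1 to n and return the cheapest one. */
int best_block_size(int n, int p, double ts, double tw, double tc) {
    int best = 1;
    double best_cost = model_b_terms(n, p, 1.0, ts, tw, tc);
    for (int b = 2; b <= n; b++) {
        double cost = model_b_terms(n, p, b, ts, tw, tc);
        if (cost < best_cost) {
            best_cost = cost;
            best = b;
        }
    }
    return best;
}
```

A small B pays the startup cost ts many times, while a large B delays the start of the downstream processes in the pipeline, so the cost has a single minimum between the two extremes. A brute-force scan like this one is a simple way to cross-check a closed-form optimum.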
###### 2.2.3 Solution 1: Optimum B for Linear-array Model

To find the optimum B for the linear-array model, we take the derivative of the final model of the linear topology with respect to B and set it to 0:

$$\frac{d\,t_{total}(B)}{dB} = 0$$

Using the model obtained in section 2.2.2, we get the following equations:

$$-\frac{N}{B^2}\,t_s + (p - 2)\,t_w + \frac{N(p - 1)}{p} \times t_c = 0$$

$$(p - 2)\,t_w + \frac{N(p - 1)}{p} \times t_c = \frac{N}{B^2}\,t_s$$

$$B^2 = \frac{N t_s}{(p - 2)\,t_w + \frac{N(p - 1)}{p} \times t_c}$$

$$B^2 = \frac{p N t_s}{p(p - 2)\,t_w + N(p - 1) \times t_c}$$

$$B = \sqrt{\frac{p N t_s}{p(p - 2)\,t_w + N(p - 1) \times t_c}}$$

Assuming that p is very small in comparison with N, the $N(p - 1)\,t_c$ term dominates the denominator, and we simplify the equation above to

$$B \approx \sqrt{\frac{t_s}{t_c}}$$

###### 2.2.4 Solution 1: 2-D Mesh Model

Using the same steps as in section 2.2.2, here is the 2-D mesh model of solution 1.

1. Broadcasting chunk size, N, B, and I

   $$t_{\text{comm-bcast-4-int}} = 4 \times 2 \times (t_s + t_w) \times \log_2(\sqrt{p})$$

2. Broadcasting of the 2nd protein sequence (vector b)

   $$t_{\text{comm-bcast-protein-seq}} = 2 \times (t_s + t_w \times N) \times \log_2(\sqrt{p})$$

3. Scattering chunk size for each process to compute. Note that the chunk size is

   $$\text{chunk\_size} = \frac{N}{p}$$

   The communication time for scattering in the 2-D mesh model can be modeled using a hypercube. It is the same as the communication time for scattering in the linear-array model.[1]

   $$t_{\text{comm-scatter-protein-seq}} = t_s \times \log_2(p) + t_w \times \frac{N}{p} \times (p - 1)$$

4. Sending shared data. Since sending shared data uses primitive send and receive, the communication time for this part does not change in the 2-D mesh model either.

   $$t_{\text{comm-send-shared-data}} = \left(\frac{N}{B} + p - 2\right) \times (t_s + B \times t_w)$$

5. Gathering calculated data. The communication time for gathering uses the same formula as scattering, but with a different size of gathered data.

   $$t_{\text{comm-gather}} = t_s \times \log_2(p) + t_w \times \frac{N}{p} \times N \times (p - 1)$$

6. Putting all the communication times together

   $$t_{\text{comm-all}} = t_{\text{comm-bcast-4-int}} + t_{\text{comm-bcast-protein-seq}} + t_{\text{comm-scatter-protein-seq}} + t_{\text{comm-send-shared-data}} + t_{\text{comm-gather}}$$

   $$t_{\text{comm-all}}(B) = \left(10\log_2(\sqrt{p}) + 2\log_2(p) + p - 2\right) \times t_s + \left((8 + 2N)\log_2(\sqrt{p}) + \frac{N \times (p - 1)}{p} + N + \frac{N^2 \times (p - 1)}{p}\right) \times t_w + \frac{N}{B} \times t_s + (p - 2) B \times t_w$$

The computation time does not change between the 2-D mesh model and the linear-array model, therefore the computation time is

$$t_{calc} = \left(\frac{N^2}{p} + \frac{N \times (p - 1)}{p} \times B\right) \times t_c$$

Putting it all together:

$$t_{total} = t_{comm} + t_{calc}$$

$$t_{total}(B) = \frac{N^2}{p} \times t_c + \left(10\log_2(\sqrt{p}) + 2\log_2(p) + p - 2\right) \times t_s + \left((8 + 2N)\log_2(\sqrt{p}) + \frac{N \times (p - 1)}{p} + N + \frac{N^2 \times (p - 1)}{p}\right) \times t_w + \frac{N}{B} \times t_s + (p - 2) B \times t_w + \frac{N \times (p - 1)}{p} \times B \times t_c$$

###### 2.2.5 Solution 1: Optimum B for 2-D Mesh Model

We take the derivative of the final model of the 2-D mesh with respect to B and set it to 0:

$$\frac{d\,t_{total}(B)}{dB} = 0$$

Using the model obtained in section 2.2.4, we get the following equations:

$$-\frac{N}{B^2}\,t_s + (p - 2)\,t_w + \frac{N(p - 1)}{p}\,t_c = 0$$

$$(p - 2)\,t_w + \frac{N(p - 1)}{p} \times t_c = \frac{N}{B^2}\,t_s$$

$$B^2 = \frac{N t_s}{(p - 2)\,t_w + \frac{N(p - 1)}{p}\,t_c}$$

$$B^2 = \frac{p N t_s}{p(p - 2)\,t_w + N(p - 1)\,t_c}$$

$$B = \sqrt{\frac{p N t_s}{p(p - 2)\,t_w + N(p - 1)\,t_c}}$$

$$B \approx \sqrt{\frac{t_s}{t_c}}$$

As we observe here, the optimum B does not change when we use the 2-D mesh to model the communication. In solution 1, the use of the 2-D mesh model only affects the broadcast time. Referring to the total time equation with respect to B, $t_{total}(B)$, the broadcast time is only a constant, and it disappears when we calculate $\frac{d\,t_{total}(B)}{dB}$.

###### 2.2.6 Solution 2: Using Send and Receive

In the second solution, we used the Send and Receive methods provided by the MPI library for communication among the processes. In this implementation every process reads the input files. Every process also reads the similarity matrix.

After reading the files, each process calculates the number of rows that it has to process and allocates the required memory. The process with rank 0 allocates the matrix H of size N × N. In our implementation the data distribution is fair among all the processes. When the number of rows is not divisible by the number of processes, we give one more row to each process, starting from the master process. Figure 2 shows the distribution of data in the case where the data is not equally divisible among the processes.

Each process calculates the block size that it needs to communicate with its neighbour. Filling starts at the master process, and the other processes wait to receive a block before they start processing. The master communicates its first block to its neighbour after processing its required number of rows for the first block. The code snippet that follows Figure 2 shows how the matrix is filled in all the processes.
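The fair row distribution of Solution 2 can be sketched as two small helpers. This is an illustration under an explicit assumption: the report's snippets use a variable `r` without showing its definition, and here it is assumed to be `N % p`, so that the first `r` processes receive one extra row. The function names `rows_for_rank` and `first_row_of_rank` are hypothetical.

```c
/* Number of rows owned by process id when n rows are split over p
 * processes; the first (n % p) processes get one extra row. */
int rows_for_rank(int n, int p, int id) {
    int r = n % p;                 /* assumed definition of r */
    return n / p + (id < r ? 1 : 0);
}

/* Zero-based index of the first row owned by process id. */
int first_row_of_rank(int n, int p, int id) {
    int r = n % p;                 /* assumed definition of r */
    if (id < r)
        return id * (n / p + 1);
    return r * (n / p + 1) + (id - r) * (n / p);
}
```

With N = 10 and p = 4, the processes own 3, 3, 2 and 2 rows, starting at rows 0, 3, 6 and 8. Adding the local row index to `first_row_of_rank` gives the same expression as `RowPosition` in the report's code.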
Figure 2: Data Partitioning among processes

```c
if (id == 0)
{
    // The master starts the pipeline: it never waits for a neighbour.
    for (i = 0; i < ColumnBlock; i++)
    {
        for (j = 1; j <= s; j++)
        {
            for (k = i * B + 1; k <= (i + 1) * B && k <= b[0] && k <= N; k++)
            {
                int RowPosition;
                if (id < r)
                    RowPosition = id * ((N / p) + 1) + j;
                else
                    RowPosition = (r * ((N / p) + 1)) + ((id - r) * (N / p)) + j;

                diag = h[j - 1][k - 1] + sim[a[RowPosition]][b[k]];
                down = h[j - 1][k] + DELTA;
                right = h[j][k - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h[j][k] = 0;
                } else {
                    h[j][k] = max;
                }
                chunk[k - (i * B + 1)] = h[j][k];
            }
        }
        MPI_Send(chunk, B, MPI_SHORT, id + 1, 0, MPI_COMM_WORLD);
    }
}
else
{
    for (i = 0; i < ColumnBlock; i++)
    {
        // Wait for the last row of the block computed by the previous process.
        MPI_Recv(chunk, B, MPI_SHORT, id - 1, 0, MPI_COMM_WORLD, &status);
        for (z = 0; z < B; z++)
        {
            if ((i * B + z + 1) <= N)
                h[0][i * B + z + 1] = chunk[z];
        }
        for (j = 1; j <= s; j++)
        {
            int RowPosition;
            if (id < r)
                RowPosition = id * ((N / p) + 1) + j;
            else
                RowPosition = (r * ((N / p) + 1)) + ((id - r) * (N / p)) + j;

            for (k = i * B + 1; k <= (i + 1) * B && k <= b[0] && k <= N; k++)
            {
                diag = h[j - 1][k - 1] + sim[a[RowPosition]][b[k]];
                down = h[j - 1][k] + DELTA;
                right = h[j][k - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0)
                    h[j][k] = 0;
                else
                    h[j][k] = max;

                chunk[k - (i * B + 1)] = h[j][k];

            }
        }
        if (id != p - 1)
            MPI_Send(chunk, B, MPI_SHORT, id + 1, 0, MPI_COMM_WORLD);
    }
}
```

At the end, every process sends its portion of the matrix H to the master process using the Send method available in the MPI library. The code snippet below shows the gathering process.

```c
if (id == 0)
{
    int row, col;
    for (i = 1; i < p; i++)
    {
        MPI_Recv(&row, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        CHECK_NULL((recv_hptr = (int *) malloc(sizeof(int) * (row) * (N))));

        MPI_Recv(recv_hptr, row * N, MPI_INT, i, 0, MPI_COMM_WORLD, &status);

        for (j = 0; j < row; j++)
        {
            int RowPosition;
            if (i < r)
                RowPosition = (i * ((N / p) + 1)) + j + 1;
            else
                RowPosition = (r * ((N / p) + 1)) + ((i - r) * (N / p)) + j + 1;

            for (k = 0; k < N; k++)
                h[RowPosition][k + 1] = recv_hptr[j * N + k];
        }
        free(recv_hptr);
    }
}
else
{
    MPI_Send(&s, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    CHECK_NULL((recv_hptr = (int *) malloc(sizeof(int) * (s) * (N))));

    for (j = 0; j < s; j++)
    {
        for (k = 0; k < N; k++)
        {
            recv_hptr[j * N + k] = h[j + 1][k + 1];
        }
    }
    MPI_Send(recv_hptr, s * N, MPI_INT, 0, 0, MPI_COMM_WORLD);
    free(recv_hptr);
}
```

Once the result is gathered, the process with rank 0 deallocates the memory and optionally verifies the result.

Figure 3: Blocking Communication

As reflected in Figure 3, the division into blocks in solution 2 is the same as in solution 1. But instead of using scatter and gather to distribute data, solution 2 uses primitive sends and receives.

###### 2.2.7 Solution 2: Linear-array Model

Initially we calculated the performance model for the linear interconnection network. The timing diagram can be found in Appendix C.

1. In solution 2 every process calculates $\frac{N}{p} \times B$ values before communicating a chunk to the next process. It takes $\frac{N}{B} + p - 1$ steps in total for computation. The equation for computation is

   $$t_{comp1} = \left(\frac{N}{B} + p - 1\right) \times \left(\frac{N}{p} \times B\right) \times t_c$$

2. After each computation step, each process communicates a block to its neighbour process. There are $\frac{N}{B} + p - 2$ steps of communication among all the processes.

   $$t_{comm1} = \left(\frac{N}{B} + p - 2\right) \times (t_s + B \times t_w)$$

3. After completing its part of the matrix H, every process sends it to the master process.

   $$t_{comm2} = \left(t_s + \frac{N}{p} \times N \times t_w\right)$$

4. In the end, the master process puts all the partial results into the matrix H to finalize it.

   $$t_{comp2} = \left(t_s + \frac{N}{p} \times N \times t_w\right)$$

The total time can be calculated by combining all the computation and communication times.

$$t_{total} = t_{comp1} + t_{comm1} + t_{comp2} + t_{comm2}$$

$$t_{total} = \left(\frac{N}{B} + p - 1\right) \times \left(\frac{N}{p} \times B\right) \times t_c + \left(\frac{N}{B} + p - 2\right) \times (t_s + B \times t_w) + \left(t_s + \frac{N}{p} \times N \times t_w\right) + \left(t_s + \frac{N}{p} \times N \times t_w\right)$$

###### 2.2.8 Solution 2: Optimum B for Linear-array Model

To find the optimum B for the linear-array model, we take the derivative of the final model of the linear topology with respect to B and set it to 0:

$$\frac{d\,t_{total}(B)}{dB} = 0$$

Using the model obtained in section 2.2.7, we get the following equations:

$$-\frac{N}{B^2}\,t_s + (p - 2)\,t_w + \frac{N(p - 1)}{p}\,t_c = 0$$
$$(p - 2)\,t_w + \frac{N(p - 1)}{p}\,t_c = \frac{N}{B^2}\,t_s$$

$$B^2 = \frac{N t_s}{(p - 2)\,t_w + \frac{N(p - 1)}{p}\,t_c}$$

$$B^2 = \frac{p N t_s}{p(p - 2)\,t_w + N(p - 1)\,t_c}$$

$$B = \sqrt{\frac{p N t_s}{p(p - 2)\,t_w + N(p - 1)\,t_c}}$$

###### 2.2.9 Solution 2: 2-D Mesh Model

We also calculated the performance model for the 2-D mesh interconnection network. We found that there is no difference between the linear-array model and the 2-D mesh model, because the difference between them lies mainly in the time to perform broadcasting, and this solution does not involve any broadcasting of elements from the root to the other processes in the system.

##### 2.3 Blocking-and-Interleave Technique

###### 2.3.1 Solution 1: Using Scatter and Gather

Taking into account not only the blocking size B but also the interleave size I, we developed the solution below. The first step is to allocate memory for all necessary variables in each process. The master process also allocates memory for the final matrix where all the partial results will be stored. All slave processes also allocate memory for the partial result matrices, which will eventually be sent to the master process.

```c
main(int argc, char *argv[]) {

    {...}

    int B, I;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &total_processes);

    if (rank == 0) {
        chunk_size = sizeA / (total_processes * I);

        CHECK_NULL((h_all_ptr = (int *) calloc(N * (sizeA + 1), sizeof(int)))); // resulting data
        CHECK_NULL((h_all = (int **) calloc((sizeA + 1), sizeof(int *)))); // list of row pointers
        for (i = 0; i < sizeA; i++)
            h_all[i] = h_all_ptr + i * N;

        // initialize the first row of the resulting matrix with 0
        for (i = 0; i < N; i++)
        {
            h_all[0][i] = 0;
        }

    }

    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);

    CHECK_NULL((hptr = (int *) malloc(sizeof(int) * (N) * (chunk_size + 1))));
    CHECK_NULL((h = (int **) malloc(sizeof(int *) * (chunk_size + 1))));
    for (i = 0; i < chunk_size + 1; i++)
        h[i] = hptr + i * N;

    CHECK_NULL((chunk_a = (short *) calloc(sizeof(short), chunk_size)));
    if (rank != 0) {
        CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
    }
    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);
    // main() continues with the interleave loop
```

The master process scatters vector A to the processes in parts: in each interleave step, one part of vector A is sent. The code sequence for interleave 0 is the same as in the previous section, with the one exception that the last process sends its results to the first process. Each process receives B values from the previous process before processing the next B columns. Each process sends data to the next process after processing B columns, but the last process sends its data to the first (master) process, unless it is in the last stage. Finally, after calculating its partial matrix, each process sends its result to the master process (this happens I times, once per interleave step).
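The ring-shaped communication pattern of the interleave scheme can be sketched as a few pure functions. The helper names below (`ring_prev`, `ring_next`, `must_receive`, `must_send`, `chunk_first_row`) are hypothetical and introduced only for illustration; their bodies restate the conditions and offsets that appear in the report's loop, with `step` standing for `current_interleave`.

```c
/* Rank that a process receives its boundary row from (ring order). */
int ring_prev(int rank, int p) { return rank == 0 ? p - 1 : rank - 1; }

/* Rank that a process sends its boundary row to (ring order). */
int ring_next(int rank, int p) { return rank + 1 == p ? 0 : rank + 1; }

/* Only rank 0 in the first interleave step starts without receiving. */
int must_receive(int rank, int step) { return !(rank == 0 && step == 0); }

/* A process forwards its last row unless it is the last rank in the
 * last interleave step, where nobody is left to consume it. */
int must_send(int rank, int p, int step, int interleave) {
    return step + 1 != interleave || rank + 1 != p;
}

/* First row of sequence a handled by a rank in a given interleave step. */
int chunk_first_row(int step, int rank, int chunk_size, int p) {
    return step * chunk_size * p + rank * chunk_size;
}
```

With 4 processes and I = 2, rank 3 forwards its boundary rows to rank 0 during step 0 and sends nothing during step 1, while rank 0 starts step 1 only after receiving from rank 3.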
```c
for (int current_interleave = 0; current_interleave < I; current_interleave++) {
    // Distribute the part of vector A that belongs to this interleave step
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT, 0, MPI_COMM_WORLD);
    int current_column = 1;
    for (i = 0; i < chunk_size + 1; i++) h[i][0] = 0;
    for (int current_block = 0; current_block < total_blocks; current_block++) {
        // Receive
        int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) {
            // Very first chunk of the matrix: the boundary row is all zeros
            for (int k = current_column; k < block_end; k++) {
                h[0][k] = 0;
            }
        } else {
            // Boundary row comes from the previous process (ring: rank 0 reads from the last rank)
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
            int size_to_receive = current_block == total_blocks - 1 ? last_block : B;
            MPI_Recv(h[0] + current_block * B, size_to_receive,
                     MPI_INT, receive_from, 0, MPI_COMM_WORLD, &status);
        }
        // Process
        for (j = current_column; j < block_end; j++, current_column++) {
            for (i = 1; i < chunk_size + 1; i++) {
                diag = h[i - 1][j - 1] + sim[chunk_a[i - 1]][b[j - 1]];
                down = h[i - 1][j] + DELTA;
                right = h[i][j - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    h[i][j] = 0;
                } else {
                    h[i][j] = max;
                }
            }
        }
        // Send: every process except the last one in the last interleave
        if (current_interleave + 1 != I || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1 ? last_block : B;
            MPI_Send(h[chunk_size] + current_block * B, size_to_send,
                     MPI_INT, send_to, 0, MPI_COMM_WORLD);
        }
    }
    // Collect the rows computed in this interleave step at the master
    MPI_Gather(hptr + N, N * chunk_size, MPI_INT,
               h_all_ptr + N + current_interleave * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}
MPI_Finalize();
}
{...}
```

Figure 4 summarizes the interleaved execution.
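The ring communication in the loop above rests on four small decisions: whom to receive from, whom to send to, and whether a receive or a send happens at all. The sketch below isolates them as plain C functions so that they can be checked without MPI. The function names are hypothetical; the conditions are copied from the receive and send parts of the loop.

```c
/* Hypothetical helpers that restate the conditions of the interleave loop. */

/* Rank that delivers the boundary row; rank 0 wraps around to the last rank. */
int ring_receive_from(int rank, int total_processes)
{
    return rank == 0 ? total_processes - 1 : rank - 1;
}

/* Rank that consumes our last row; the last rank wraps around to rank 0. */
int ring_send_to(int rank, int total_processes)
{
    return rank + 1 == total_processes ? 0 : rank + 1;
}

/* Rank 0 in interleave 0 starts from a zero row; everyone else receives. */
int needs_receive(int rank, int current_interleave)
{
    return !(rank == 0 && current_interleave == 0);
}

/* Only the last rank in the last interleave has nobody left to send to. */
int needs_send(int rank, int total_processes, int current_interleave, int I)
{
    return current_interleave + 1 != I || rank + 1 != total_processes;
}
```

With four processes and I = 2, for example, rank 3 sends to rank 0 during interleave 0 and sends nothing during interleave 1.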
Figure 4: Blocking and interleave communication

2.3.2 Solution 1: Linear-Array Model

Here is the linear-array model for the communication part of the blocking technique with interleave.

1. Broadcasting chunk_size, N, B, and I

$$t_{\text{comm-bcast-4-int}} = 4 \times (t_s + t_w) \times \log_2(p)$$

2. Broadcasting the second protein sequence (vector b)

$$t_{\text{comm-bcast-protein-seq}} = (t_s + t_w \times N) \times \log_2(p)$$

3. Scattering chunk_size elements to each process. The chunk size is

$$\text{chunk\_size} = \frac{N}{p \times I}$$

where I is the interleave factor, and the scatter is performed I times. Therefore, the communication cost of scattering is

$$t_{\text{comm-scatter-protein-seq}} = I \times \Big(t_s \times \log_2(p) + t_w \times \frac{N}{p \times I} \times (p-1)\Big)$$

4. Sending shared data. To start the first block of computation, the process with rank 0 does not need to wait for data from any other process. In each interleave except the last one, the last process (rank p − 1) has to send N data to process 0. Therefore, in the first I − 1 interleaves we need (N/B + p − 1) pipeline stages for sending data, and in the last (I-th) interleave we need (N/B + p − 2) stages. The shared data is the last row of the block that has just been finished, which consists of B items. Putting all of this together, the communication time to send shared data is

$$t_{\text{comm-send-shared-data}} = (I-1) \times \Big(\frac{N}{B} + p - 1\Big) \times (t_s + B \times t_w) + \Big(\frac{N}{B} + p - 2\Big) \times (t_s + B \times t_w)$$

5. Gathering calculated data. We need a gather to combine all calculated data in every interleave step. Every process contributes N × chunk_size data, which equals N × N/(p × I). The gather is repeated I times, so the communication time for this step is

$$t_{\text{comm-gather}} = I \times \Big(t_s \times \log_2(p) + t_w \times \frac{N}{p \times I} \times N \times (p-1)\Big)$$

6. Putting all the communication time together

$$t_{\text{comm-all}} = t_{\text{comm-bcast-4-int}} + t_{\text{comm-bcast-protein-seq}} + t_{\text{comm-scatter-protein-seq}} + t_{\text{comm-send-shared-data}} + t_{\text{comm-gather}}$$

Simplifying the equation with respect to B (separating the constant part from the terms that contain B, so that the derivative needed for the optimum B is easy to calculate), we obtain

$$t_{\text{comm-all}}(B) = \big((5+2I)\log_2(p) + (p-1)(I-1) + (p-2)\big)\,t_s + \Big((4+N)\log_2(p) + \frac{N^2}{p}(p-1) + \frac{N}{p}(p-1) + (I-1)N + N\Big)\,t_w + \frac{IN}{B}\,t_s + \big((I-1)(p-1) + p-2\big)B\,t_w$$

Simplifying the equation with respect to I, we obtain

$$t_{\text{comm-all}}(I) = \big((5+2I)\log_2(p) - 1\big)\,t_s + \Big((4+N)\log_2(p) + \frac{N}{p}(p-1) + \frac{N^2}{p}(p-1) - B\Big)\,t_w + \Big(\frac{N}{B} + p - 1\Big)(t_s + B\,t_w)\,I$$

Now we derive the calculation time for this blocking technique. There are I × (N/B + p − 1) stages of block calculation, and in each block calculation we need to compute N/(p × I) × B points. If t_c is the time to compute one point, the calculation time is

$$t_{\text{calc}} = I \times \Big(\frac{N}{B} + p - 1\Big) \times \Big(\frac{N}{p \times I} \times B\Big) \times t_c$$

I cancels out, and we obtain

$$t_{\text{calc}} = \Big(\frac{N^2}{p} + NB - \frac{NB}{p}\Big) \times t_c = \Big(\frac{N^2}{p} + \frac{N(p-1)}{p} \times B\Big) \times t_c$$

The final model is obtained by adding the calculation time and the communication time, $t_{\text{total}} = t_{\text{comm}} + t_{\text{calc}}$. With respect to B:

$$t_{\text{total}}(B) = \big((5+2I)\log_2(p) + (p-1)(I-1) + (p-2)\big)\,t_s + \Big((4+N)\log_2(p) + \frac{N^2}{p}(p-1) + \frac{N}{p}(p-1) + (I-1)N + N\Big)\,t_w + \frac{IN}{B}\,t_s + \big((I-1)(p-1) + p-2\big)B\,t_w + \Big(\frac{N^2}{p} + \frac{N(p-1)}{p}\,B\Big)\,t_c$$

With respect to I:

$$t_{\text{total}}(I) = \big((5+2I)\log_2(p) - 1\big)\,t_s + \Big((4+N)\log_2(p) + \frac{N}{p}(p-1) + \frac{N^2}{p}(p-1) - B\Big)\,t_w + \Big(\frac{N}{B} + p - 1\Big)(t_s + B\,t_w)\,I + \Big(\frac{N^2}{p} + \frac{N(p-1)}{p}\,B\Big)\,t_c$$

2.3.3 Solution 1: Optimum B and I for Linear-array Model

The optimum B is found by calculating the derivative of the total time with respect to B and setting it equal to 0:

$$\frac{dt_{\text{total}}(B)}{dB} = 0$$

Using the model from the previous section, we obtain

$$-\frac{IN}{B^2}\,t_s + \big((I-1)(p-1) + (p-2)\big)\,t_w + \frac{N(p-1)}{p}\,t_c = 0$$

$$\big((I-1)(p-1) + (p-2)\big)\,t_w + \frac{N(p-1)}{p}\,t_c = \frac{IN}{B^2}\,t_s$$

$$B^2 = \frac{IN\,t_s}{\big((I-1)(p-1) + (p-2)\big)\,t_w + \frac{N(p-1)}{p}\,t_c} = \frac{pIN\,t_s}{\big((I-1)(p-1) + (p-2)\big)\,p\,t_w + N(p-1)\,t_c}$$

$$B = \sqrt{\frac{pIN\,t_s}{\big((I-1)(p-1) + (p-2)\big)\,p\,t_w + N(p-1)\,t_c}}$$

$$B \approx \sqrt{\frac{IN\,t_s}{N\,t_c + I}}$$

However, we cannot find an optimum I for the Blocking-and-Interleave technique with this model, because the derivative of the total time with respect to I is a constant:

$$\frac{dt_{\text{total}}(I)}{dI} = 2\,t_s\log_2(p) + \Big(\frac{N}{B} + p - 1\Big)(t_s + B\,t_w)$$

This expression does not depend on I and is strictly positive, so the equation $dt_{\text{total}}(I)/dI = 0$ has no solution. In this model the interleave factor only adds communication time, mainly for sending and receiving shared data. Therefore no optimum interleave level can be derived from it.

2.3.4 Solution 1: 2-D Mesh Model

Using the same technique as for the linear-array model, here is the communication and computation model for the 2-D mesh.

1. Broadcasting chunk_size, N, B, and I

$$t_{\text{comm-bcast-4-int}} = 4 \times 2 \times (t_s + t_w) \times \log_2(\sqrt{p})$$

2. Broadcasting the second protein sequence (vector b)

$$t_{\text{comm-bcast-protein-seq}} = 2 \times (t_s + t_w \times N) \times \log_2(\sqrt{p})$$

3. Scattering chunk_size elements to each process. As discussed in Section 2.2.4, the scatter communication model is the same for the 2-D mesh and the linear array.

$$t_{\text{comm-scatter-protein-seq}} = I \times \Big(t_s \times \log_2(p) + t_w \times \frac{N}{p \times I} \times (p-1)\Big)$$

4. Sending shared data. The communication time for sending shared data is also the same for the 2-D mesh and the linear array.

$$t_{\text{comm-send-shared-data}} = (I-1) \times \Big(\frac{N}{B} + p - 1\Big) \times (t_s + B \times t_w) + \Big(\frac{N}{B} + p - 2\Big) \times (t_s + B \times t_w)$$

5. Gathering calculated data. The gather formula equals the scatter formula except for the amount of data being gathered.

$$t_{\text{comm-gather}} = I \times \Big(t_s \times \log_2(p) + t_w \times \frac{N}{p \times I} \times N \times (p-1)\Big)$$
6. Putting all the communication time together

$$t_{\text{comm-all}} = t_{\text{comm-bcast-4-int}} + t_{\text{comm-bcast-protein-seq}} + t_{\text{comm-scatter-protein-seq}} + t_{\text{comm-send-shared-data}} + t_{\text{comm-gather}}$$

Simplifying the equation with respect to B (separating the constant part from the terms that contain B, so that the derivative needed for the optimum B is easy to calculate), we obtain

$$t_{\text{comm-all}}(B) = \big(10\log_2(\sqrt{p}) + 2I\log_2(p) + (p-1)(I-1) + (p-2)\big)\,t_s + \Big((8+2N)\log_2(\sqrt{p}) + \frac{N^2}{p}(p-1) + \frac{N}{p}(p-1) + (I-1)N + N\Big)\,t_w + \frac{IN}{B}\,t_s + \big((I-1)(p-1) + p-2\big)B\,t_w$$

Simplifying the equation with respect to I, we obtain

$$t_{\text{comm-all}}(I) = \big(10\log_2(\sqrt{p}) + 2I\log_2(p) - 1\big)\,t_s + \Big((8+2N)\log_2(\sqrt{p}) + \frac{N}{p}(p-1) + \frac{N^2}{p}(p-1) - B\Big)\,t_w + \Big(\frac{N}{B} + p - 1\Big)(t_s + B\,t_w)\,I$$

2.3.5 Solution 1: Optimum B and I for 2-D Mesh Model

The optimum B is found by calculating the derivative of the total time with respect to B and setting it equal to 0:

$$\frac{dt_{\text{total}}(B)}{dB} = 0$$

Using the model from the previous section, we obtain

$$-\frac{IN}{B^2}\,t_s + \big((I-1)(p-1) + (p-2)\big)\,t_w + \frac{N(p-1)}{p}\,t_c = 0$$

$$\big((I-1)(p-1) + (p-2)\big)\,t_w + \frac{N(p-1)}{p}\,t_c = \frac{IN}{B^2}\,t_s$$

$$B^2 = \frac{IN\,t_s}{\big((I-1)(p-1) + (p-2)\big)\,t_w + \frac{N(p-1)}{p}\,t_c} = \frac{pIN\,t_s}{\big((I-1)(p-1) + (p-2)\big)\,p\,t_w + N(p-1)\,t_c}$$

$$B = \sqrt{\frac{pIN\,t_s}{\big((I-1)(p-1) + (p-2)\big)\,p\,t_w + N(p-1)\,t_c}}$$

The resulting optimum B for the 2-D mesh model is equal to that of the linear-array model. As discussed in Section 2.2.5, the 2-D mesh model differs only in the broadcast time, which is a constant in the $t_{\text{total}}(B)$ equation and disappears when we take the derivative.

As in the linear-array model, we cannot find an optimum I for the Blocking-and-Interleave technique, because the derivative of the total time with respect to I is a constant:

$$\frac{dt_{\text{total}}(I)}{dI} = 2\,t_s\log_2(p) + \Big(\frac{N}{B} + p - 1\Big)(t_s + B\,t_w)$$

This expression is strictly positive and independent of I, so $dt_{\text{total}}(I)/dI = 0$ has no solution.

2.3.6 Solution 1: Improvement

Figure 5: Blocking and Interleave Communication

The main idea of this improvement is to move the gathering of the final data to the end of the whole calculation in each process. Referring to Figure 5, the gather is performed after step 14.

To implement this improvement, we performed the following steps:

1. Allocate enough memory in each process to hold I × N × chunk_size values. Note that chunk_size in this case is N/(P × I).
```c
// Temporary result matrix of this process, one (chunk_size + 1)-row slab per interleave
CHECK_NULL((hptr = (int *) malloc(sizeof(int) * (N) * I * (chunk_size + 1))));
// Row pointers into hptr
CHECK_NULL((h = (int **) malloc(sizeof(int *) * I * (chunk_size + 1))));

int ***hfin;
CHECK_NULL((hfin = (int ***) malloc(sizeof(int **) * I)));

for (i = 0; i < (chunk_size + 1) * I; i++) {
    h[i] = hptr + i * N; // put the row pointer into the array
}

for (i = 0; i < I; i++) {
    hfin[i] = h + i * (chunk_size + 1); // first row of the slab for interleave i
}
```

2. Change the way each process manipulates the data. Each process stores the data using hfin. hfin is a variable of type `int ***`, so the data is stored as shown in the following code snippet.

```c
for (int current_interleave = 0; current_interleave < I; current_interleave++) {

    // chunk_a is the receiving buffer
    MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT, 0, MPI_COMM_WORLD);

    int current_column = 1;
    for (i = 0; i < chunk_size + 1; i++) hfin[current_interleave][i][0] = 0;

    for (int current_block = 0; current_block < total_blocks; current_block++) {
        // Receive
        int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
        if (rank == 0 && current_interleave == 0) {
            // Rank 0 processing the first chunk does not need to receive anything
            for (int k = current_column; k < block_end; k++) {
                hfin[current_interleave][0][k] = 0; // init row 0
            }
        } else {
            // Receive from the neighboring process
            int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
            int size_to_receive = current_block == total_blocks - 1 ? last_block : B;

            MPI_Recv(hfin[current_interleave][0] + current_block * B, size_to_receive,
                     MPI_INT, receive_from, 0, MPI_COMM_WORLD, &status);
        }
        // Process
        for (j = current_column; j < block_end; j++, current_column++) {
            for (i = 1; i < chunk_size + 1; i++) {
                diag = hfin[current_interleave][i - 1][j - 1] + sim[chunk_a[i - 1]][b[j - 1]];
                down = hfin[current_interleave][i - 1][j] + DELTA;
                right = hfin[current_interleave][i][j - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0) {
                    hfin[current_interleave][i][j] = 0;
                } else {
                    hfin[current_interleave][i][j] = max;
                }
            }
        }

        // Send
        if (current_interleave + 1 != I || rank + 1 != total_processes) {
            int send_to = rank + 1 == total_processes ? 0 : rank + 1;
            int size_to_send = current_block == total_blocks - 1 ? last_block : B;
            MPI_Send(hfin[current_interleave][chunk_size] + current_block * B, size_to_send,
                     MPI_INT, send_to, 0, MPI_COMM_WORLD);
        }
    }
}
```

Note that hfin[i] contains the data for the i-th interleaving stage in each process.

3. Move the gather to the end of all calculation, as shown in the following code snippet.

```c
for (i = 0; i < I; i++) {
    // Rows 1..chunk_size of interleave i are contiguous and start at hfin[i][1];
    // row 0 of each slab is the received boundary row and is skipped
    MPI_Gather(hfin[i][1], N * chunk_size, MPI_INT,
               h_all_ptr + N + i * chunk_size * total_processes * N,
               N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
}
```

2.3.7 Solution 1: Optimum B and I for the Improved Solution

Here are the parts of the model that are affected by the improved solution.

1. Sending shared data. For the first I − 1 interleaving stages, the communication time is

$$(I-1) \times (t_s + t_w \times B) \times \frac{N}{B}$$

The last interleaving stage takes

$$(t_s + t_w \times B) \times \Big(\frac{N}{B} + P - 2\Big)$$

Putting both together, the communication time to send shared data is

$$(t_s + t_w \times B) \times \Big(\frac{N}{B} + P - 2\Big) + (I-1) \times (t_s + t_w \times B) \times \frac{N}{B}$$

2. Computation time. Along with the change in sending and receiving, the computation time also changes:

$$\Big(\frac{N}{B} \times B \times \frac{N}{P \times I} \times (I-1) + B \times \frac{N}{P \times I} \times \Big(\frac{N}{B} + P - 1\Big)\Big) \times t_c$$

Optimal B and I for Improved Solution

To calculate the optimal values, we ignore all communication terms that do not influence the optimal B and I. For the optimal B, only the following terms remain:

$$t_{\text{total\_improved}}(B) = (t_s + t_w B)\Big(\frac{N}{B} + P - 2\Big) + (I-1)(t_s + t_w B)\frac{N}{B} + \Big(\frac{N}{B} \times B \times \frac{N}{P \times I}(I-1) + B \times \frac{N}{P \times I}\Big(\frac{N}{B} + P - 1\Big)\Big)\,t_c$$

$$\frac{dt_{\text{total\_improved}}(B)}{dB} = 0$$

$$-\frac{(I-1)\,t_s\,N}{B^2} - \frac{N\,t_s}{B^2} + (P-2)\,t_w + (P-1)\,\frac{N}{P \times I}\,t_c = 0$$

$$B = \sqrt{\frac{I^2\,t_s\,N\,P}{(P-2)\,t_w\,P\,I + (P-1)\,N\,t_c}}$$

$$B \approx \sqrt{\frac{I\,N\,t_s}{(P-2)\,t_w}}$$
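As a cross-check of the derivation above, the B-dependent part of $t_{\text{total\_improved}}(B)$ can be collected into the form $a/B + bB + c$, whose minimum lies at $B = \sqrt{a/b}$. The grouping below is a rearrangement of the formula given in this section; the symbols $a$, $b$ and $c$ are introduced here only for illustration.

$$
\begin{aligned}
t_{\text{total\_improved}}(B) &= \frac{a}{B} + b\,B + c,\\
a &= I\,N\,t_s,\\
b &= (P-2)\,t_w + \frac{N(P-1)}{P\,I}\,t_c,\\
c &= (P-2)\,t_s + I\,N\,t_w + \frac{N^2}{P}\,t_c,\\
\frac{dt_{\text{total\_improved}}(B)}{dB} &= -\frac{a}{B^2} + b = 0
\;\Rightarrow\;
B = \sqrt{\frac{a}{b}} = \sqrt{\frac{I^2\,N\,P\,t_s}{(P-2)\,P\,I\,t_w + (P-1)\,N\,t_c}}.
\end{aligned}
$$

Because $a$ grows with $I$ while $b$ shrinks with $I$, this form also shows that a larger interleave factor pushes the optimum block size up.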
However, for the optimal I we also need to consider the scatter time. We therefore obtain the following formula for $t_{\text{total\_improved}}(I)$:

$$t_{\text{total\_improved}}(I) = I\,t_s\log_2(P) + (t_s + t_w B)\Big(\frac{N}{B} + P - 2\Big) + (I-1)(t_s + t_w B)\frac{N}{B} + \Big(\frac{N}{B} \times B \times \frac{N}{P \times I}(I-1) + B \times \frac{N}{P \times I}\Big(\frac{N}{B} + P - 1\Big)\Big)\,t_c$$

$$\frac{dt_{\text{total\_improved}}(I)}{dI} = 0$$

$$t_s\log_2(P) + (t_s + t_w B)\,\frac{N}{B} + \frac{N^2 B}{B\,P\,I^2}\,t_c - \frac{B\,N}{P\,I^2}\Big(\frac{N}{B} + P - 1\Big)\,t_c = 0$$

$$I = \sqrt{\frac{B^2 N\big(\frac{N}{B} + P - 1\big)\,t_c - N^2 B\,t_c}{B\,P\,t_s\log_2(P) + (t_s + t_w B)\,N\,P}}$$

$$I \approx \sqrt{\frac{B\,N\,t_c}{t_s\log_2(P) + N\,t_w + \frac{N}{B}\,t_s}}$$

2.3.8 Solution 2: Using Send and Receive

This implementation also takes into account the row interleave factor along with the column blocking. Every process calculates the number of rows it has to process in every interleave and initializes the memory. The master process declares the matrix H and also uses it for its own partial processing.

Each process processes N/(p × I) rows in every interleave and communicates the block to its neighbouring process. The last process communicates its block to the master process and does not perform any communication in the last interleave.

```c
if (id == 0)
{
    for (i = 0; i < ColumnBlock; i++)
    {
        CHECK_NULL((chunk = (int *) malloc(sizeof(int) * (B))));

        for (j = 1; j <= s; j++)
        {
            for (k = i * B + 1; k <= (i + 1) * B && k <= b[0] && k <= N; k++)
            {
                // Global row index of local row j in the current interleave
                int RowPosition;

                if ((interleave * p + id) < r)
                    RowPosition = (interleave * (N / (p * I) + 1) * p) + id * ((N / (p * I) + 1)) + j;
                else
                    RowPosition = (r * (N / (p * I) + 1)) + (interleave * p + id - r) * (N / (p * I)) + j;

                diag = h[RowPosition - 1][k - 1] + sim[a[RowPosition]][b[k]];
                down = h[RowPosition - 1][k] + DELTA;
                right = h[RowPosition][k - 1] + DELTA;
                max = MAX3(diag, down, right);

                if (max <= 0) {
                    h[RowPosition][k] = 0;
                } else {
                    h[RowPosition][k] = max;
                }
                chunk[k - (i * B + 1)] = h[RowPosition][k];
            }
        }
        // Communicate the partial block to the next process
        MPI_Send(chunk, B, MPI_INT, id + 1, 0, MPI_COMM_WORLD);
        free(chunk);
    }
    // End filling matrix H[][] at master
} else if (id != p - 1)
{
    // Filling matrix at the other processes
    for (i = 0; i < ColumnBlock; i++)
    {
        CHECK_NULL((chunk = (int *) malloc(sizeof(int) * (B))));

        // Boundary row from the previous process
        MPI_Recv(chunk, B, MPI_INT, id - 1, 0, MPI_COMM_WORLD, &status);
        for (z = 0; z < B; z++)
        {
            if ((i * B + z) <= N)
                h[0][i * B + z + 1] = chunk[z];
        }
        for (j = 1; j <= s; j++)
        {
            int RowPosition;

            if ((interleave * p + id) < r)
                RowPosition = (interleave * (N / (p * I) + 1) * p) + id * ((N / (p * I) + 1)) + j;
            else
                RowPosition = (r * (N / (p * I) + 1)) + (interleave * p + id - r) * (N / (p * I)) + j;

            for (k = i * B + 1; k <= (i + 1) * B && k <= b[0] && k <= N; k++)
            {
                diag = h[j - 1][k - 1] + sim[a[RowPosition]][b[k]];
                down = h[j - 1][k] + DELTA;
                right = h[j][k - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0)
                    h[j][k] = 0;
                else
                    h[j][k] = max;

                chunk[k - (i * B + 1)] = h[j][k];
            }
        }
        MPI_Send(chunk, B, MPI_INT, id + 1, 0, MPI_COMM_WORLD);
        free(chunk);
    } // End filling matrix at the other processes
} else
{
    // Start filling matrix at the last process
    for (i = 0; i < ColumnBlock; i++)
    {
        CHECK_NULL((chunk = (int *) malloc(sizeof(int) * (B))));

        MPI_Recv(chunk, B, MPI_INT, id - 1, 0, MPI_COMM_WORLD, &status);
        for (z = 0; z < B; z++)
        {
            if ((i * B + z) <= N)
                h[0][i * B + z + 1] = chunk[z];
        }

        free(chunk);
        for (j = 1; j <= s; j++)
        {
            int RowPosition;
            if ((interleave * p + id) < r)
                RowPosition = (interleave * (N / (p * I) + 1) * p) + id * ((N / (p * I) + 1)) + j;
            else
                RowPosition = (r * (N / (p * I) + 1)) + (interleave * p + id - r) * (N / (p * I)) + j;

            for (k = i * B + 1; k <= (i + 1) * B && k <= b[0] && k <= N; k++)
            {
                diag = h[j - 1][k - 1] + sim[a[RowPosition]][b[k]];
                down = h[j - 1][k] + DELTA;
                right = h[j][k - 1] + DELTA;
                max = MAX3(diag, down, right);
                if (max <= 0)
                    h[j][k] = 0;
                else
                    h[j][k] = max;
            }
        }
    }
}
```

After filling its partial matrix H, every process sends the partial result to the master process in every interleave. The following code snippet shows the master gathering the partial results after every interleave.

```c
if (id == 0)
{
    int row, col;
    for (i = 1; i < p; i++)
    {
        // First the number of rows, then the rows themselves
        MPI_Recv(&row, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        CHECK_NULL((recv_hptr = (int *) malloc(sizeof(int) * (row) * (N))));

        MPI_Recv(recv_hptr, row * N, MPI_INT, i, 0, MPI_COMM_WORLD, &status);

        for (j = 0; j < row; j++)
        {
            // Global row index of row j received from process i
            int RowPosition;

            if ((interleave * p + i) < r)
                RowPosition = (interleave * (N / (p * I) + 1) * p) + i * ((N / (p * I) + 1)) + j + 1;
            else
                RowPosition = (r * (N / (p * I) + 1)) + (interleave * p + i - r) * (N / (p * I)) + j + 1;

            for (k = 0; k < N; k++)
                h[RowPosition][k + 1] = recv_hptr[j * N + k];
        }
        free(recv_hptr);
    }
}
else
{
    MPI_Send(&s, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    CHECK_NULL((recv_hptr = (int *) malloc(sizeof(int) * (s) * (N))));

    // Pack the local rows (without boundary row and column) into one buffer
    for (j = 0; j < s; j++)
    {
        for (k = 0; k < N; k++)
            recv_hptr[j * N + k] = h[j + 1][k + 1];
    }
    MPI_Send(recv_hptr, s * N, MPI_INT, 0, 0, MPI_COMM_WORLD);

    free(recv_hptr);
}
```

Appendix D illustrates the interleaved execution of this solution.

2.3.9 Solution 2: Linear-array Model

1. In every interleave, every process computes (N/(p × I)) × B values before communicating a chunk to the next process. The computation takes ((N/B) + p − 1) × I steps in total, which gives

$$t_{\text{comp1}} = I \times \Big(\frac{N}{B} + p - 1\Big) \times \Big(\frac{N}{p \times I} \times B\Big) \times t_c$$

2. After each computation step, each process communicates a block to its neighbouring process. There are (N/B) + p − 1 communication steps in each of the first I − 1 interleaves and (N/B) + p − 2 steps in the last one.

$$t_{\text{comm1}} = (I-1) \times \Big(\frac{N}{B} + p - 1\Big) \times (t_s + B \times t_w) + \Big(\frac{N}{B} + p - 2\Big) \times (t_s + B \times t_w)$$

3. After completing its part of matrix H, every process sends it to the master process.

$$t_{\text{comm2}} = \Big(t_s + \frac{N}{p \times I} \times N \times t_w\Big) \times I$$

4. At the end, the master process puts all partial results into matrix H to finalize it.

$$t_{\text{comp2}} = I \times \Big(t_s + \frac{N}{p \times I} \times N \times t_w\Big)$$

The total execution time is obtained by combining all of these times.

$$t_{\text{total}} = t_{\text{comp1}} + t_{\text{comm1}} + t_{\text{comp2}} + t_{\text{comm2}}$$

$$t_{\text{total}} = I\Big(\frac{N}{B} + p - 1\Big)\Big(\frac{N}{p \times I}\,B\Big)\,t_c + (I-1)\Big(\frac{N}{B} + p - 1\Big)(t_s + B\,t_w) + \Big(\frac{N}{B} + p - 2\Big)(t_s + B\,t_w) + \Big(t_s + \frac{N}{p \times I}\,N\,t_w\Big)\,I + I\,\Big(t_s + \frac{N}{p \times I}\,N\,t_w\Big)$$

2.3.10 Solution 2: Optimum B and I for Linear-array Model

The optimum B is found by calculating the derivative of the total time with respect to B and setting it equal to 0:

$$\frac{dt_{\text{total}}(B)}{dB} = 0$$

Using the model from the previous section, we obtain

$$-\frac{IN}{B^2}\,t_s + \big((I-1)(p-1) + (p-2)\big)\,t_w + \frac{N(p-1)}{p}\,t_c = 0$$

$$\big((I-1)(p-1) + (p-2)\big)\,t_w + \frac{N(p-1)}{p}\,t_c = \frac{IN}{B^2}\,t_s$$

$$B^2 = \frac{IN\,t_s}{\big((I-1)(p-1) + (p-2)\big)\,t_w + \frac{N(p-1)}{p}\,t_c} = \frac{pIN\,t_s}{\big((I-1)(p-1) + (p-2)\big)\,p\,t_w + N(p-1)\,t_c}$$

$$B = \sqrt{\frac{pIN\,t_s}{\big((I-1)(p-1) + (p-2)\big)\,p\,t_w + N(p-1)\,t_c}}$$

However, we cannot find an optimum I for the Blocking-and-Interleave technique with this model, because the derivative of the total time with respect to I is a constant:

$$\frac{dt_{\text{total}}(I)}{dI} = 2\,t_s + \Big(\frac{N}{B} + p - 1\Big)(t_s + B\,t_w)$$

This expression does not depend on I and is strictly positive, so the equation $dt_{\text{total}}(I)/dI = 0$ has no solution. In this model the interleave factor only adds communication time, mainly for sending and receiving shared data. Therefore no optimum interleave level can be derived from it.

2.3.11 Solution 2: 2-D Mesh Model

As discussed in Section 2.2.9, the 2-D mesh model is the same as the linear-array model, because the 2-D mesh only affects the broadcast procedure and Solution 2 does not include any broadcast in its implementation.
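The stage counts that drive the models in Section 2.3 are easy to get wrong by one, so the sketch below evaluates them for concrete parameters. It restates the counts from Sections 2.3.2 and 2.3.9 under the assumption that B divides N and that p × I divides N; the function names are hypothetical and not part of the original programs.

```c
/* Hypothetical helpers; they assume B divides N and p * I divides N. */

/* Block computations over all interleaves: I * (N/B + p - 1). */
long compute_stages(long N, long B, long p, long I)
{
    return I * (N / B + p - 1);
}

/* Shared-data transfers: (I - 1) * (N/B + p - 1) + (N/B + p - 2). */
long send_stages(long N, long B, long p, long I)
{
    return (I - 1) * (N / B + p - 1) + (N / B + p - 2);
}

/* Matrix cells computed in one block: N/(p*I) rows times B columns. */
long points_per_block(long N, long B, long p, long I)
{
    return N / (p * I) * B;
}
```

With N = 10000, B = 100, p = 8 and I = 1, this gives 107 computation stages, 106 shared-data transfers and 125000 cells per block. Doubling I to 2 doubles the stage count to 214 and halves the block to 62500 cells, which is why I cancels out of the calculation time but not out of the communication time.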
3 Performance Results

We measured the performance of both parallel versions on an Altix machine and compared the results against the sequential version.

3.1 Solution 1

3.1.1 Performance of Sequential Code

First, we measured the performance of the Smith-Waterman algorithm using the sequential code. Figure 6 shows the results.

Figure 6: Sequential Code Performance Measurement Result

Figure 6 shows that as N increases, the time taken to fill matrix h also increases almost linearly.
3.1.2 Find Out Optimum Number of Processor (P)

At first, we observe the performance with the protein size (N) fixed to 5000 and 10000, the block size (B) fixed to 100 and the interleave factor (I) set to 1.

1. Protein size equals 5000 (N = 5000), block size (B) is 100, and interleave factor (I) is 1. The result is shown in Figure 7.

Figure 7: Measurement result when N is 5000, B is 100 and I is 1

Figure 8 plots the result as a diagram.

Figure 8: Diagram of measurement result when N is 5000, B is 100, I is 1

When the protein size (N) is 5000 and the number of processors (P) is 4, we obtain a speedup of

$$\frac{t_{\text{serial}}}{t_{\text{parallel}}} = \frac{3.3}{1.454} \approx 2.26$$
2. Protein size equals 10000 (N = 10000). The result is shown in Figure 9.

Figure 9: Measurement result when N is 10000, B is 100 and I is 1

Figure 10 plots the result as a diagram.

Figure 10: Diagram of measurement result when N is 10000, B is 100, I is 1

When the protein size (N) is 10000 and the number of processors (P) is 8, we obtain a speedup of

$$\frac{t_{\text{serial}}}{t_{\text{parallel}}} = \frac{12.508}{2.47} \approx 5.06$$

Based on the results above, the maximum speedup is achieved when the number of processors (P) is 8 and the protein size (N) is 10000. Therefore, for the subsequent experiments we fix the number of processors to 8 and vary the other parameters.

3.1.3 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results to find the optimum blocking size (B). We fix the number of processors (P) to 8, the protein size (N) to 10000 and the interleave factor (I) to 1. The results are shown in Figure 11.
Figure 11: Performance measurement result when N is 10000, P is 8, I is 1

Figure 12: Diagram of measurement result when N is 10000, P is 8, I is 1

Figure 12 plots the result as a diagram. The right-hand side of Figure 12 zooms in on the diagram to give a clearer picture of the performance when B is less than or equal to 500.

We found that the optimum empirical blocking size (B) for Solution 1 is 100. This yields a speedup of

$$\frac{t_{\text{serial}}}{t_{\text{parallel}}} = \frac{12.508}{2.401} \approx 5.21$$
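The speedup figures quoted in this section are all the ratio of the sequential time to the parallel time. A minimal helper makes that arithmetic reproducible; the name `speedup` is hypothetical, and the times in the usage note are the ones reported above.

```c
/* Speedup as used in this section: sequential time divided by parallel time.
 * Both arguments are wall-clock times in the same unit. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}
```

For example, `speedup(12.508, 2.401)` evaluates to about 5.21 and `speedup(12.508, 2.47)` to about 5.06.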
3.1.4 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) found in the previous section, we determine the optimum I. The result is shown in Figure 13.

Figure 13: Diagram of measurement result when N is 10000, P is 8, B is 100

We found that the optimum I is 1. With I = 1, we obtain a 4.76 times speedup compared to sequential execution.

3.2 Solution 1-Improved

We repeated the experiments of Solution 1 to obtain the corresponding data for our improved solution.

3.2.1 Find Out Optimum Number of Processor (P)

At first, we observe the performance with the protein size (N) fixed to 10000, the block size (B) fixed to 100 and the interleave factor (I) set to 1. The result is shown in Figure 14.

Figure 14: Measurement result when N is 10000, B is 100 and I is 1

Figure 15 plots the result as a diagram.

When the protein size (N) is 10000 and the number of processors (P) is 8, we obtain a speedup of

$$\frac{t_{\text{serial}}}{t_{\text{parallel}}} = \frac{12.508}{2.977} \approx 4.201$$
Figure 15: Diagram of measurement result when N is 10000, B is 100, I is 1

Based on the results above, the maximum speedup is achieved when the number of processors (P) is 8 and the protein size (N) is 10000. Therefore, for the subsequent experiments we fix the number of processors to 8 and vary the other parameters.
3.2.2 Find Out Optimum Blocking Size (B)

In this subsection, we analyze the performance results to find the optimum blocking size (B). We fix the number of processors (P) to 8, the protein size (N) to 10000 and the interleave factor (I) to 1. The results are shown in Figure 16.

Figure 16: Performance measurement result when N is 10000, P is 8, I is 1

Figure 17 plots the result as a diagram. The right-hand side of Figure 17 zooms in on the diagram to give a clearer picture of the performance when B is less than or equal to 500.

Figure 17: Diagram of measurement result when N is 10000, P is 8, I is 1

We found that the optimum empirical blocking size (B) for the improved Solution 1 is 200. This yields a speedup of

$$\frac{t_{\text{serial}}}{t_{\text{parallel}}} = \frac{12.508}{2.464} \approx 5.08$$
3.2.3 Find Out Optimum Interleave Factor (I)

Using the optimum blocking size (B) found in the previous section, we determine the optimum I. The result is shown in Figure 18.

Figure 18: Diagram of measurement result when N is 10000, P is 8, B is 200

We found that the optimum I is 2. With I = 2, we obtain a speedup of

$$\frac{t_{\text{serial}}}{t_{\text{parallel}}} = \frac{12.508}{2.613} \approx 4.79$$

3.3 Solution 2

Using the sequential-code performance results obtained during the Solution 1 evaluation, we measured the performance of Solution 2.

3.3.1 Find Out Optimum Number of Processor (P)

The first step is to observe the performance with the protein size (N) and the block size (B) fixed and the interleave factor (I) set to 1.

1. Protein size equals 5000 (N = 5000), block size (B) is 100, and interleave factor (I) is 1.

Figure 19: Measurement result when N is 5000, B is 100 and I is 1

Plotting the result as a diagram gives Figure 20.
Figure 20: Diagram of measurement result when N is 5000, B is 100, I is 1

With a protein size (N) of 5000 and 32 processors (P = 32), we achieve a maximum speedup of 31.55% compared to the existing sequential code.

2. Protein size equals 10000 (N = 10000), block size (B) is 100, and interleave factor (I) is 1.

Figure 21: Measurement result when N is 10000, B is 100 and I is 1

Figure 22 plots the result as a diagram.

With a protein size (N) of 10000 and 32 processors (P = 32), we achieve a speedup of 54.67% compared to the existing sequential code.

Based on the results obtained in this section, we found that the parallel implementation of Solution 2 achieves the highest speedup when the number of processors