Ciaran Cox (1115773)
MA5610: Financial Computing 2: Assignment 1
0.1 Explicit Time Stepping: MPI-Parallelization
The MPI version reproduces the solution of the serial code. The algorithm runs much quicker
when all the processes are on the same computer, but slower when the processes are split across
multiple computers. The communication between neighbouring processes is ordered slightly
differently for an even number of processes than for an odd number. Table (1) shows times and
speed ups with all the processes on the same computer.
Table 1: Times (s) and speed ups for 1, 2, 4 and 8 processes on the same computer
M         N     Error         1         2         4         8
131072    128   8.411371e-05  1.56282   0.895603  0.617478  0.585737
                Speed up                1.745     2.53      2.668
524288    256   2.973560e-05  11.1551   6.88205   4.29815   2.99606
                Speed up                1.621     2.595     3.723
2097152   512   1.051285e-05  84.7041   48.9535   27.2987   17.7712
                Speed up                1.73      3.103     4.766
8388608   1024  3.716830e-06  667.436   373.127   204.171   116.779
                Speed up                1.789     3.269     5.715
The speed up grows with the problem size: the largest problem reaches a speed up of 5.715 on
8 processes, compared with 2.668 for the smallest problem. Table (2) shows the results when the
processes are split across four computers. Splitting the processes among separate computers
causes a slowdown, although the slowdown shrinks as the problem size increases. No speed up is
achieved for problems of these sizes. Table (3) shows the next larger problem running across
four computers. A speed up is achieved for problems beyond M = 8388608, N = 1024 because each
process has more computation to carry out, so the additional time spent communicating between
processes has less effect on the total time. When all the processes run on the same machine, a
speed up is reached at much smaller problem sizes because communication between processes is
fast. The code is given in appendix (.1).
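As a minimal sketch of how the speed up rows in the tables are obtained, the helpers below divide
the single-process time by the time on p processes. The function names speedup and efficiency are
illustrative and do not appear in the assignment code; the efficiency measure is an addition that
the report itself does not use.

```c
#include <assert.h>

/* Speed up S_p = T_1 / T_p: time on one process divided by time on p processes. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency E_p = S_p / p (illustrative addition, not used in the report). */
double efficiency(double t_serial, double t_parallel, int p)
{
    assert(p > 0);
    return speedup(t_serial, t_parallel) / p;
}
```

For the largest problem in table (1), speedup(667.436, 116.779) gives roughly 5.715, while any
value below 1, as in table (2), indicates a slowdown.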
Table 2: Times (s) and speed ups for 1, 2, 4 and 8 processes split across four computers
M         N     Error         1         2         4         8
131072    128   8.411371e-05  1.2026    6.468     6.44      6.4637
                Speed up                0.186     0.1867    0.18605
524288    256   2.973560e-05  8.62632   29.3911   25.7779   25.7123
                Speed up                0.2935    0.3346    0.3355
2097152   512   1.051285e-05  68.4614   130.529   135.822   103.727
                Speed up                0.5245    0.504     0.66
8388608   1024  3.716830e-06  553.673   601.222   821.992   559.195
                Speed up                0.9209    0.674     0.9901
Table 3: Times (s) and speed ups for 1, 2, 4 and 8 processes split across four computers
M         N     Error         1         2         4         8
33554432  2048  1.314096e-06  4506.54   3624.66   3291.96   3288.42
                Speed up                1.2433    1.369     1.37
0.1.1 OpenMP nested into MPI
OpenMP was applied within each MPI process to three parts of the code: the initial conditions,
the explicit update at each time step, and the error computation (as a reduction). Table (4)
shows the times and speed ups for 1, 2, 4 and 8 threads per process, with 4 processes split
across 4 computers.
Table 4: Times (s) and speed ups for 1, 2, 4 and 8 OpenMP threads per process, 4 MPI processes
M         N     Error         1         2         4         8
131072    128   8.411371e-05  21.3483   21.4315   21.2217   20.4067
                Speed up                0.996     1.006     1.05
524288    256   2.973560e-05  85.8835   85.1509   84.2632   82.393
                Speed up                1.009     1.02      1.04
2097152   512   1.051285e-05  361.095   346.063   344.023   336.723
                Speed up                1.04      1.05      1.07
8388608   1024  3.716830e-06  1606.83   1451.33   1407.97   1350.4
                Speed up                1.107     1.141     1.19
A gradual speed up is noticeable as more threads are used per process; however, the problem is
still solved quicker using MPI alone, without OpenMP. For N = 128 with 4 processes and 8 threads
per process, there are 32 threads running, each computing only 128/32 = 4 elements. The
increased internal communication between the threads of each process causes the algorithm to
run slower than with MPI alone. A greater speed up is achieved on a larger problem: for N = 1024
with 4 processes and 8 threads, each thread computes 1024/32 = 32 elements, compared with 4 for
the smallest problem. More elements are needed per thread to make OpenMP worthwhile. Table (5)
shows the same tests with only 2 processes.
Table 5: Times (s) and speed ups for 1, 2, 4 and 8 OpenMP threads per process, 2 MPI processes
M         N     Error         1         2         4         8
131072    128   8.411371e-05  6.42831   6.427     6.43131   6.44648
                Speed up                1.0002    0.9995    0.997
524288    256   2.973560e-05  33.5187   26.0446   25.7704   25.7165
                Speed up                1.287     1.3       1.33
2097152   512   1.051285e-05  140.945   112.404   103.778   105.024
                Speed up                1.254     1.358     1.342
8388608   1024  3.716830e-06  864.623   579.87    461.144   414.382
                Speed up                1.491     1.875     2.09
With only 2 processes a greater speed up is achieved, and the same results are computed quicker
than with MPI on its own. This is due to more computation per thread compared with 4 processes:
for N = 1024 with 2 processes and 8 threads per process, each thread computes 1024/16 = 64
elements, compared with 32 for 4 processes. Less communication therefore takes place between
machines and more within each machine, resulting in quicker calculations. The modifications to
the MPI code are shown in appendix (.2).
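The per-thread workloads quoted above follow from dividing N by the total number of threads. A
small sketch of that arithmetic is given below; the function name points_per_thread is
hypothetical and is not part of the assignment code, and integer division is assumed, as in the
estimates in the text.

```c
#include <assert.h>

/* Grid points handled by one thread when N points are shared among
 * `procs` MPI processes, each running `threads` OpenMP threads.
 * Integer division, matching the estimates in the text. */
int points_per_thread(int N, int procs, int threads)
{
    assert(N > 0 && procs > 0 && threads > 0);
    return N / (procs * threads);
}
```

For example, points_per_thread(128, 4, 8) is 4, points_per_thread(1024, 4, 8) is 32 and
points_per_thread(1024, 2, 8) is 64.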
0.2 2d-PDE, MPI-Parallelization
The serial C code cgfem2d.c was converted to MPI while maintaining the same result as the given
serial code. The data storage was kept global (every process allocates the full vectors), while
the index ranges of the for loops determine the part each process computes. Table (6) shows
times and speed ups for different numbers of processes, all run on the same computer, with the
full broadcast of the p-vector in the CG algorithm. (The code used is in appendix (.3).)
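The way the loop indices select each process's share can be sketched as below. The arithmetic
mirrors the sze, i0e and i1e computation in appendix (.3); the wrapper function row_range is
only an illustrative name and does not exist in the assignment code.

```c
#include <assert.h>

/* Half-open row range [*i0e, *i1e) owned by process `id` out of `num`
 * for ne grid rows; the last process also takes the remainder rows. */
void row_range(int ne, int num, int id, int *i0e, int *i1e)
{
    assert(num > 0 && id >= 0 && id < num);
    int sze = ne / num;                          /* rows per process, rounded down */
    *i0e = id * sze;
    *i1e = (id == num - 1) ? ne : (id + 1) * sze;
}
```

With ne = 256 and 3 processes, process 0 owns rows 0 to 84 and process 2 owns rows 170 to 255,
so the last process carries one extra row.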
Table 6: Times (s) and speed ups for 1 to 4 processes on the same computer (full broadcast)
ne    n        Error        CG iterations  1         2         3         4
256   65536    2.79676e-05  516            0.3398    0.1951    0.2107    0.204
               Speed up                              1.7417    1.6127    1.6657
512   262144   1.40111e-05  1037           2.284     1.5009    1.3583    1.797
               Speed up                              1.5218    1.6815    1.271
1024  1048576  7.01237e-06  2074           22.564    15.899    15.8831   15.938
               Speed up                              1.4192    1.4206    1.4157
2048  4194304  3.50787e-06  4049           186.694   132.32    138.798   129.396
               Speed up                              1.4109    1.345     1.4428
The speed up roughly decreases as the problem size grows. The larger the problem, the more CG
iterations are needed to reach the solution. This growth increases the total communication
between processes, so the speed up falls as the number of unknowns increases. Table (7) shows
the results with the processes split across four computers.
Table 7: Times (s) and speed ups for 1 to 4 processes on four separate computers (full broadcast)
ne    n        Error        CG iterations  1         2         3         4
256   65536    2.79676e-05  516            0.332869  2.914     5.128     5.272
               Speed up                              0.11423   0.0649    0.0631
512   262144   1.40111e-05  1037           2.65      23.668    28.02     34.259
               Speed up                              0.112     0.095     0.077
1024  1048576  7.01237e-06  2074           23.964    187.064   192.746   210.813
               Speed up                              0.1281    0.124     0.114
2048  4194304  3.50787e-06  4049           188.05    1503.93   1501.46   1633.993
               Speed up                              0.125     0.125     0.115
Going from one process to two, the slowdown is approximately equal across problem sizes; with
more processes, the slowdown increases further on the smaller problems. The computations take
longer as more processes are added because the full p-vector is broadcast to all other processes
one segment at a time, although each process needs only a small part of the p-vector from its
neighbours. Optimal communication can be used instead: each process sends only the relevant part
of its segment to its neighbouring processes, and these exchanges run in parallel. Table (8)
shows times and speed ups with all processes run on the same machine, using the optimal
communication of the p-vector in the CG algorithm.
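To make the difference between the two communication schemes concrete, the sketch below counts
the doubles one process receives per CG iteration. It assumes the row-wise splitting of
appendix (.3) and that the five-point stencil in MVmult needs only the row directly above and
below a process's block. The function names are hypothetical, and only data volume is counted,
not latency.

```c
#include <assert.h>

/* Doubles received by process `id` per CG iteration when every segment
 * of p is broadcast to everyone: all of p except the process's own rows. */
long recv_full_broadcast(int ne, int num, int id)
{
    assert(num > 0 && id >= 0 && id < num);
    int sze = ne / num;
    int rows = (id == num - 1) ? ne - id * sze : sze;  /* rows owned by `id` */
    return (long)(ne - rows) * ne;
}

/* Doubles received per iteration when only boundary rows are exchanged:
 * one row of ne values from each existing neighbour. */
long recv_neighbours(int ne, int num, int id)
{
    assert(num > 0 && id >= 0 && id < num);
    int neighbours = (id > 0) + (id < num - 1);
    return (long)neighbours * ne;
}
```

For ne = 2048 on 4 processes, an interior process receives 3145728 doubles per iteration with the
full broadcast but only 4096 with the neighbour exchange.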
Table 8: Times (s) and speed ups for 1 to 4 processes on the same computer (optimal communication)
ne    n        Error        CG iterations  1         2         3         4
256   65536    2.79676e-05  516            0.33477   0.134     0.0939    0.072
               Speed up                              2.4983    3.565     4.6496
512   262144   1.40111e-05  1037           2.2537    1.108     0.763     0.637
               Speed up                              2.034     2.9537    3.538
1024  1048576  7.01237e-06  2074           22.274    10.446    8.561     7.057
               Speed up                              2.1323    2.602     3.156
2048  4194304  3.50787e-06  4049           186.975   99.017    86.275    70.768
               Speed up                              1.888     2.167     2.642
The problem is solved quicker with the optimal communication; however, the speed up still
decreases as the problem size grows. The number of CG iterations increases with the problem
size, so more communication between processes is needed in total, which reduces the speed up.
Table (9) shows the results with the processes split across four computers.
Table 9: Times (s) and speed ups for 1 to 4 processes on four separate computers (optimal communication)
ne    n        Error        CG iterations  1         2         3         4
256   65536    2.79676e-05  516            0.3327    0.372     0.294     0.354
               Speed up                              0.894     1.132     0.9398
512   262144   1.40111e-05  1037           2.359     1.5032    1.2023    1.379
               Speed up                              1.5693    1.962     1.71
1024  1048576  7.01237e-06  2074           22.704    9.602     6.9408    6.0282
               Speed up                              2.3645    3.271     3.7663
2048  4194304  3.50787e-06  4049           187.7764  95.8288   65.553    50.0723
               Speed up                              1.9595    2.8645    3.7501
The two largest problems were solved quicker with the processes running on separate machines.
This is attributed to each process having a whole machine's cache to itself and therefore needing
fewer accesses to main memory, which are slower. When all the processes run on the same machine,
each process can use only part of the cache and has to access main memory more often. With the
optimal communication, little data is exchanged between machines, so the speed up increases
with the problem size. The modifications to the full-broadcast code are shown in appendix (.4).
.1 Task 1
/*Ciaran Cox (1115773)*/
/*1115773@my.brunel.ac.uk*/
#include <stdio.h>
#include <mpi.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#define PI (4.0*atan(1.0))
/* the right hand side forcing function */
double fxt(double x, double t)
{
  double f = -PI*cos(PI*t)+(4*PI*PI-2)*(1-sin(PI*t));
  return f * exp(-2*t)*cos(2*PI*x);
}
/* the exact solution */
double ux(double x, double t)
{
  return (1-sin(PI*t))*exp(-2*t)*cos(2*PI*x);
}
/* the left hand boundary condition */
double uLt(double t)
{
  return ux(0.0,t);
}
/* the right hand boundary condition */
double uRt(double t)
{
  return ux(1.0,t);
}
/* the initial condition */
double u0x(double x)
{
  return ux(x,0.0);
}
/*Main function*/
int main(int argc, char *argv[])
{
  int n, sz, id, num, i0, i1,tag=0;
  int N,M,i,j,inputM,inputN;
  double a=0.0, b=1.0, T=2.0, alpha, h, k, *Uold=NULL, *Unew=NULL,
         temp,error,error_total,error_final,t1,t2;
  /*status for MPI communication*/
  MPI_Status status;
  /*Initialising MPI, giving each process an id and counting the processes*/
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&id);
  MPI_Comm_size(MPI_COMM_WORLD,&num);
  /*asking user for input parameters*/
  if (id==0)
  {
    printf("Enter M: \n"); inputM=scanf("%d",&M);
    printf("Enter N: \n"); inputN=scanf("%d",&N);
    /*Checking for correct input*/
    if(inputM!=1 || inputN!=1){
      printf("Incorrect input...exiting\n");
      exit(1);
    }
    printf("M\tN\tError\t\tTime\n");
  }
  /*broadcasting the input parameters to other processes*/
  MPI_Bcast(&N,1,MPI_INT,0,MPI_COMM_WORLD);
  MPI_Bcast(&M,1,MPI_INT,0,MPI_COMM_WORLD);
  /*starting the timer*/
  t1=MPI_Wtime();
  /*defining constants*/
  h = (b-a)/N; k =T/M; alpha = k/h/h;
  /*computing each process's segment size, N-1 unknowns split*/
  sz=(N-1)/num;
  i0=id*sz; i1=(id+1)*sz;
  /*keeping i1=N, all for loops less than i1, hence boundary won't be computed*/
  if (id==(num-1)){
    i1=N;
  }
  /*shifting first process along one because boundary is known*/
  if(id==0){
    i0=1;
  }
  /*Allocating global memory*/
  if((Uold=(double*)malloc((N+1)*sizeof(double)))==NULL) exit(1);
  if((Unew=(double*)malloc((N+1)*sizeof(double)))==NULL) exit(1);
  /*first process, left boundary condition*/
  if(id==0){
    Uold[0]=uLt(0);
  }
  /*last process, right boundary condition*/
  if(id==(num-1)){
    Uold[i1]=uRt(0);
  }
  /*all processes computing initial conditions for their segment*/
  for(i=i0;i<i1;i++){
    Uold[i]=u0x(a+i*h);
  }
  /*exchange relevant elements with neighbours if more than 1 process*/
  if(num>1){
    /*if total number of processes is even*/
    if(num%2==0){
      /*if id is an odd number*/
      if(id%2!=0){
        /*Sending first number of segment to the left*/
        MPI_Send(&Uold[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
        /*Receiving last number from the left, placing in i0-1*/
        MPI_Recv(&Uold[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
      }
      /*if id is an even number*/
      if(id%2==0){
        /*Receiving first number from the right, placing in i1*/
        MPI_Recv(&Uold[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
        /*Sending last element of segment to the right*/
        MPI_Send(&Uold[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
      }
      /*if id is even and not equal to 0*/
      if(id%2==0 && id!=0){
        /*Sending first number of segment to the left*/
        MPI_Send(&Uold[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
        /*Receiving last number from the left, placing in i0-1*/
        MPI_Recv(&Uold[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
      }
      /*if id is odd and not equal to last process*/
      if(id%2!=0 && id!=(num-1)){
        /*Receiving first number from the right, placing in i1*/
        MPI_Recv(&Uold[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
        /*Sending last element of segment to the right*/
        MPI_Send(&Uold[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
      }
    }
    /*if total number of processes is odd*/
    else
    {
      /*if id is even and not equal to 0*/
      if(id%2==0 && id!=0){
        /*Sending first element of segment to the left*/
        MPI_Send(&Uold[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
        /*Receiving last number from the left, placing in i0-1*/
        MPI_Recv(&Uold[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
      }
      /*if id is odd*/
      if(id%2!=0){
        /*Receiving first element from the right, placing in i1*/
        MPI_Recv(&Uold[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
        /*Sending last element of segment to the right*/
        MPI_Send(&Uold[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
      }
      /*if id is odd*/
      if(id%2!=0){
        /*Sending first element of segment to the left*/
        MPI_Send(&Uold[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
        /*Receiving last element of segment from the left, placing in i0-1*/
        MPI_Recv(&Uold[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
      }
      /*if id is even and not equal to last process*/
      if(id%2==0 && id!=(num-1)){
        /*Receiving first element from the right, placing in i1*/
        MPI_Recv(&Uold[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
        /*Sending last element of segment to the right*/
        MPI_Send(&Uold[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
      }
    }
  }
  /*Starting loop up through time*/
  for(j=1;j<=M;j++){
    /*defining time step*/
    double t=j*k;
    /*if first process, left boundary condition*/
    if(id==0){
      Unew[0]=uLt(t);
    }
    /*if last process, right boundary condition*/
    if(id==(num-1)){
      Unew[i1]=uRt(t);
    }
    /*All processes explicitly solving their segment*/
    for(i=i0;i<i1;i++){
      Unew[i]=alpha*(Uold[i-1]+Uold[i+1])+(1-2*alpha)*Uold[i];
      Unew[i]+=k*fxt(a+i*h,t-k);
    }
    /*only exchange relevant elements if more than 1 process,
     * same exchange as before with Unew instead of Uold*/
    if(num>1){
      if(num%2==0){
        if(id%2!=0){
          MPI_Send(&Unew[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
          MPI_Recv(&Unew[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
        }
        if(id%2==0){
          MPI_Recv(&Unew[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
          MPI_Send(&Unew[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
        }
        if(id%2==0 && id!=0){
          MPI_Send(&Unew[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
          MPI_Recv(&Unew[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
        }
        if(id%2!=0 && id!=(num-1)){
          MPI_Recv(&Unew[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
          MPI_Send(&Unew[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
        }
      }
      else
      /* ... */
  error=0;
  for(i=i0;i<i1;i++){
    temp=ux(a+i*h,T)-Uold[i];
    error+=temp*temp;
  }
  /*if more than 1 process, an MPI reduce to get final error and stop timer*/
  if(num>1){
    error_total=0;
    MPI_Reduce(&error,&error_total,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
    error_final=sqrt(error_total);
    t2=MPI_Wtime();
  }
  else{
    /*if one process just sqrt error, stop timer*/
    error_final=sqrt(error);
    t2=MPI_Wtime();
  }
  /*printing results*/
  if(id==0){
    printf("%d\t%d\t%13.6le\t%lg\n\n",M,N,error_final,t2-t1);
  }
  /*free memory and end MPI*/
  free(Uold); free(Unew);
  MPI_Finalize();
}
.2 Task 1: OpenMP
Modification to initial conditions
/*all processes computing initial conditions for their segment in parallel*/
#pragma omp parallel for num_threads(NUM_THREADS) private(i)
for(i=i0;i<i1;i++){
  Uold[i]=u0x(a+i*h);
}
Modification to solving explicitly each time step
/*All processes explicitly solving their segment in parallel*/
#pragma omp parallel for num_threads(NUM_THREADS) private(i)
for(i=i0;i<i1;i++){
  Unew[i]=alpha*(Uold[i-1]+Uold[i+1])+(1-2*alpha)*Uold[i]
          +k*fxt(a+i*h,t-k);
}
Modification to error computation
error=0;
#pragma omp parallel for num_threads(NUM_THREADS) private(i) reduction(+:error)
for(i=i0;i<i1;i++){
  error+=(ux(a+i*h,T)-Uold[i])*(ux(a+i*h,T)-Uold[i]);
}
.3 Task 2 full p-vector broadcast
/*Ciaran Cox (1115773)*/
/*1115773@my.brunel.ac.uk*/
/*Relevant Libraries*/
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>
/*Exact solution function*/
double uexact(double x,double y){
  double temp=x*(1-x)*y*(1-y);
  return temp*temp;
}
/*Function right hand side*/
double frhs(double x,double y){
  return -(2-12*x+12*x*x)*y*y*(1-y)*(1-y)-(2-12*y+12*y*y)*x*x*(1-x)*(1-x);
}
/*Matrix-Vector multiplication function*/
void MVmult(double A, double B, double *x, double *y, int ne,
            int i0e, int i1e){
  int ie,je;
  double s;
  /*Only the outer loop over ne is sliced among the processes*/
  for (ie=i0e;ie<i1e;ie++)
    for (je=0;je<ne;je++) {
      s=A*x[ie*ne+je];
      if (ie>0   ) s+=B*x[(ie-1)*ne+je  ];
      if (je>0   ) s+=B*x[ ie   *ne+je-1];
      if (ie<ne-1) s+=B*x[(ie+1)*ne+je  ];
      if (je<ne-1) s+=B*x[ ie   *ne+je+1];
      y[ie*ne+je]=s;
    }
}
/*2D sparse CG-algorithm*/
int cgsparse2d(double A,double B,double *x,double *b,double eps,int ne,
               int i0e, int i1e, int id, int sze, int num, int i0, int i1, int sz){
  int i,k,n,jd,j0,j1;
  double rr,pq,bb;
  double alpha,beta,rrold;
  double *r=NULL,*p=NULL,*q=NULL;
  double my_rr, my_bb, my_pq;
  /*Total number of inner nodes*/
  n=ne*ne;
  /*Global memory allocation*/
  if( (r=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
  if( (p=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
  if( (q=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
  /*r=b-Ax, use of MVmult function sending in i0e,i1e*/
  MVmult(A,B,x,r,ne,i0e,i1e);
  for(i=i0;i<i1;i++){
    r[i]=b[i]-r[i];
  }
  /*computation of rr and bb norm with broadcast*/
  my_rr=0;
  for (i=i0;i<i1;i++){
    my_rr+=r[i]*r[i];
  }
  rr=0;
  MPI_Reduce(&my_rr,&rr,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
  MPI_Bcast(&rr,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
  my_bb=0;
  for (i=i0;i<i1;i++){
    my_bb+=b[i]*b[i];
  }
  bb=0;
  MPI_Reduce(&my_bb,&bb,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
  MPI_Bcast(&bb,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
  /*Start of CG iterations*/
  k=0;
  while(sqrt(rr)>eps*sqrt(bb)){
    k=k+1;
    if(k==1){
      /*p=r for each process segment*/
      for(i=i0;i<i1;i++) p[i]=r[i];
      beta=0;
    }
    else{
      beta=rr/rrold;
      /*p=r+beta*p for each process segment*/
      for(i=i0;i<i1;i++) p[i]=r[i]+beta*p[i];
    }
    /*full broadcast of p vector to other processes*/
    for(jd=0;jd<num;jd++){
      j0=jd*sz; j1=(jd+1)*sz;
      if(jd==(num-1)) j1=n;
      MPI_Bcast(&p[j0],j1-j0,MPI_DOUBLE,jd,MPI_COMM_WORLD);
    }
    /*Matrix vector multiplication, q=matrix*p*/
    MVmult(A,B,p,q,ne,i0e,i1e);
    /*computation of pq inner product with broadcast*/
    my_pq=0;
    for (i=i0;i<i1;i++){
      my_pq+=p[i]*q[i];
    }
    pq=0;
    MPI_Reduce(&my_pq,&pq,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
    MPI_Bcast(&pq,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
    alpha=rr/pq;
    /*each segment computing x=x+alpha*p and r=r-alpha*q*/
    for(i=i0;i<i1;i++){
      x[i]=x[i]+alpha*p[i];
      r[i]=r[i]-alpha*q[i];
    }
    rrold=rr;
    /*computation of rr norm with broadcast*/
    my_rr=0;
    for (i=i0;i<i1;i++){
      my_rr+=r[i]*r[i];
    }
    rr=0;
    MPI_Reduce(&my_rr,&rr,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
    MPI_Bcast(&rr,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
  }
  /*freeing memory*/
  free(p); free(q); free(r);
  return k;
}
int main (int argc, char *argv[]) {
  int ne,n,i,k,ie,je;
  double h,t1,t2,error,temp,error_total, error_final;
  double A,B;
  double *x=NULL,*b=NULL;
  double eps=1.e-10;
  int id,num,i0e,i1e,sze,j0,j1,sz,i0,i1,jd,input;
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&id);
  MPI_Comm_size(MPI_COMM_WORLD,&num);
  /* read ne -- problemsize */
  if(id==0){
    printf("Number of inner nodes per edge=?\n"); input=scanf("%d",&ne);
    /*Checking for correct input*/
    if(input!=1){
      printf("Incorrect input...exiting\n");
      exit(1);
    }
  }
  /*broadcast problemsize*/
  MPI_Bcast(&ne,1,MPI_INT,0,MPI_COMM_WORLD);
  if (ne<=2) { return 0; }
  /* total number of inner nodes, i.e. number of coefficients */
  n=ne*ne;
  h=1.0/(ne+1);
  A=4/h/h; B=-1/h/h;
  /*Segment sizes for ne*/
  sze=ne/num;
  i0e=id*sze; i1e=(id+1)*sze;
  if(id==(num-1)){
    i1e=ne;
  }
  /*Segment sizes for n*/
  sz=sze*ne;
  i0=id*sz; i1=(id+1)*sz;
  if(id==(num-1)){
    i1=n;
  }
  /* allocate b */
  if( (b=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
  /* initialize b */
  for (ie=i0e;ie<i1e;ie++)
    for (je=0;je<ne;je++)
      b[ie*ne+je]=frhs((ie+1)*h,(je+1)*h);
  /* initialize x (const 0) */
  if( (x=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
  /* zero all of x: the first MVmult in cgsparse2d also reads the neighbouring rows */
  for (i=0;i<n;i++) x[i]=0;
  /*solving system*/
  t1=omp_get_wtime();
  k=cgsparse2d(A,B,x,b,eps,ne,i0e,i1e,id,sze,num,i0,i1,sz);
  t2=omp_get_wtime();
  /*Error Computation*/
  error=0;
  for (ie=i0e;ie<i1e;ie++)
    for(je=0;je<ne;je++) {
      temp=x[ie*ne+je]-uexact((ie+1)*h,(je+1)*h);
      error+=temp*temp;
    }
  MPI_Reduce(&error,&error_total,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
  error_final=sqrt(error_total);
  /*Printing results*/
  if(id==0){
    printf("Error: %lg\n",error_final);
    printf("Solver-time: %f, Iterations: %d\n",t2-t1,k);
  }
  MPI_Finalize();
  /*freeing memory*/
  free(x); free(b);
  exit(0);
}
.4 Modifications to full broadcast C code for optimal p-vector communication
Integer tag and MPI status added for communication.
int i,k,n,jd,j0,j1,tag=0;
double rr,pq,bb;
double alpha,beta,rrold;
double *r=NULL,*p=NULL,*q=NULL;
double my_rr, my_bb, my_pq;
MPI_Status status;
Modifications to broadcasting of p vector in the while loop.
/*optimal communication of p vector*/
if(id%2==0){
  if(id>0){
    MPI_Send(&p[i0e*ne],ne,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
    MPI_Recv(&p[(i0e-1)*ne],ne,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
  }
  if(id<num-1){