Ciaran Cox (1115773)
MA5610: Financial Computing 2: Assignment 1
0.1 Explicit Time Stepping: MPI-Parallelization
The MPI version still produces the same solution as the serial code. The algorithm runs much quicker when all the processes are on the same computer, however it is slower when the processes are split across multiple computers. The communication is set up to work slightly differently when an even number of processes is used compared to when an odd number is used. Table (1) shows times and speed ups for all the processes on the same computer.
Table 1: Times (seconds) and speed ups for processes on the same computer

M        N     Error          1 process  2 processes  4 processes  8 processes
131072   128   8.411371e-05   1.56282    0.895603     0.617478     0.585737
               Speed up                  1.745        2.53         2.668
524288   256   2.973560e-05   11.1551    6.88205      4.29815      2.99606
               Speed up                  1.621        2.595        3.723
2097152  512   1.051285e-05   84.7041    48.9535      27.2987      17.7712
               Speed up                  1.73         3.103        4.766
8388608  1024  3.716830e-06   667.436    373.127      204.171      116.779
               Speed up                  1.789        3.269        5.715
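The tabulated speed up is the ratio of the single-process time to the time on p processes (or, later, p threads),

S_p = T_1 / T_p,

which is consistent with the figures above, e.g. 1.56282/0.895603 = 1.745 for the smallest problem on 2 processes.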
A greater speed up is achieved as the problem gets larger: the maximum speed up is 5.715 on 8 processes for the largest problem, compared to 2.668 for the smallest. Table (2) shows the results when the algorithm is split across four computers. A slowdown occurs when the problem is split among separate computers, although the slowdown shrinks as the problem grows in size; no speed up is achieved at these problem sizes across multiple computers. Table (3) shows the next larger problem run across four computers. A speed up is achieved for problems beyond M = 8388608, N = 1024 because each process then has more computation to carry out, so the additional time spent communicating between processes has less effect on the total time. When all the processes run on the same machine, speed up is achieved at smaller problem sizes because communication between processes is much quicker. The code is given in appendix (.1).
Table 2: Times (seconds) and speed ups for processes split across four computers

M        N     Error          1 process  2 processes  4 processes  8 processes
131072   128   8.411371e-05   1.2026     6.468        6.44         6.4637
               Speed up                  0.186        0.1867       0.18605
524288   256   2.973560e-05   8.62632    29.3911      25.7779      25.7123
               Speed up                  0.2935       0.3346       0.3355
2097152  512   1.051285e-05   68.4614    130.529      135.822      103.727
               Speed up                  0.5245       0.504        0.66
8388608  1024  3.716830e-06   553.673    601.222      821.992      559.195
               Speed up                  0.9209       0.674        0.9901
Table 3: Times (seconds) and speed ups for processes split across four computers

M         N     Error          1 process  2 processes  4 processes  8 processes
33554432  2048  1.314096e-06   4506.54    3624.66      3291.96      3288.42
                Speed up                  1.2433       1.369        1.37
0.1.1 OpenMP nested into MPI
OpenMP is applied within each MPI process: to the initial conditions, to the explicit update at each time step and, with a reduction, to the error computation. Table (4) shows the times and speed ups for 1, 2, 4 and 8 threads per process, with 4 processes split across 4 computers.
Table 4: Times (seconds) and speed ups for OpenMP nested into MPI with 4 processes

M        N     Error          1 thread   2 threads  4 threads  8 threads
131072   128   8.411371e-05   21.3483    21.4315    21.2217    20.4067
               Speed up                  0.996      1.006      1.05
524288   256   2.973560e-05   85.8835    85.1509    84.2632    82.393
               Speed up                  1.009      1.02       1.04
2097152  512   1.051285e-05   361.095    346.063    344.023    336.723
               Speed up                  1.04       1.05       1.07
8388608  1024  3.716830e-06   1606.83    1451.33    1407.97    1350.4
               Speed up                  1.107      1.141      1.19
A gradual speed up is noticeable as more threads are used per process; however, the problem is
still solved quicker using MPI alone, without OpenMP. For N = 128 with 4 processes and 8 threads per process, there are 32 threads running and each thread computes only 128/32 = 4 elements. The increased internal communication between the threads of each process causes the algorithm to run slower than MPI alone. A greater speed up is achieved on a larger problem: for N = 1024 with 4 processes and 8 threads, each thread computes 1024/32 = 32 elements, compared with 4 for the smallest problem. More elements are needed per thread to make OpenMP worthwhile. Table (5) shows the same run of tests with only 2 processes.
Table 5: Times (seconds) and speed ups for OpenMP nested into MPI with 2 processes

M        N     Error          1 thread   2 threads  4 threads  8 threads
131072   128   8.411371e-05   6.42831    6.427      6.43131    6.44648
               Speed up                  1.0002     0.9995     0.997
524288   256   2.973560e-05   33.5187    26.0446    25.7704    25.7165
               Speed up                  1.287      1.3        1.33
2097152  512   1.051285e-05   140.945    112.404    103.778    105.024
               Speed up                  1.254      1.358      1.342
8388608  1024  3.716830e-06   864.623    579.87     461.144    414.382
               Speed up                  1.491      1.875      2.09
With only 2 processes a greater speed up is achieved, and the hybrid code computes the same results quicker than MPI on its own. This is due to more computation per thread than with 4 processes: for N = 1024 with 2 processes and 8 threads, each thread calculates 1024/16 = 64 elements, compared with 32 for 4 processes. Therefore less external communication is needed and more of the communication stays internal to each process, resulting in quicker calculations. Modifications to the MPI code are shown in appendix (.2).
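To summarise the thread-level arithmetic used above: with P processes and T threads per process, the roughly N interior points of each time step are shared as

elements per thread = N / (P * T),

giving 128/(4*8) = 4, 1024/(4*8) = 32 and 1024/(2*8) = 64 for the cases discussed, so the hybrid code only pays off once this count is large enough to outweigh the per-thread overhead.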
0.2 2d-PDE, MPI-Parallelization
The serial C code cgfem2d.c was converted to MPI while still producing the same result as the given serial code. The storage of the data was kept global; the index ranges of the for loops determine each process's share of the computation. Table (6) shows times and speed ups for different numbers of processes, all run on the same computer, with the full broadcast of the p-vector in the CG-algorithm (code used in appendix (.3)).
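The essence of this partitioning is sketched below. This is only an outline using the variable names from appendix (.3) (the appendix itself uses if-statements rather than conditional expressions); ne is the number of inner nodes per edge, n = ne*ne, num is the number of processes and id is the rank.

/* Sketch of the partitioning in appendix (.3): global arrays, but each process
   only loops over its own block of rows of the ne x ne grid. */
int sze = ne/num;                        /* grid rows per process                  */
int i0e = id*sze;                        /* first row owned by this process        */
int i1e = (id==num-1) ? ne : (id+1)*sze; /* last process takes the remainder       */
int sz  = sze*ne;                        /* matching block of the length-n vectors */
int i0  = id*sz;
int i1  = (id==num-1) ? n  : (id+1)*sz;
/* all loops over b, x, r, p, q then run over [i0,i1), or rows [i0e,i1e) in MVmult */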
The speed up decreases slightly as the problem size grows. The larger the problem, the more CG iterations are needed to arrive at the solution; this growth increases the total communication between processes, so the speed up falls as the number of unknowns increases.
Table 6: Times (seconds) and speed ups for processes run on the same computer

ne    n        Error        CG iterations  1 process  2 processes  3 processes  4 processes
256   65536    2.79676e-05  516            0.3398     0.1951       0.2107       0.204
               Speed up                               1.7417       1.6127       1.6657
512   262144   1.40111e-05  1037           2.284      1.5009       1.3583       1.797
               Speed up                               1.5218       1.6815       1.271
1024  1048576  7.01237e-06  2074           22.564     15.899       15.8831      15.938
               Speed up                               1.4192       1.4206       1.4157
2048  4194304  3.50787e-06  4049           186.694    132.32       138.798      129.396
               Speed up                               1.4109       1.345        1.4428
Table (7) shows the same runs with the processes split across four computers.
Table 7: Times (seconds) and speed ups for processes run on four separate computers

ne    n        Error        CG iterations  1 process  2 processes  3 processes  4 processes
256   65536    2.79676e-05  516            0.332869   2.914        5.128        5.272
               Speed up                               0.11423      0.0649       0.0631
512   262144   1.40111e-05  1037           2.65       23.668       28.02        34.259
               Speed up                               0.112        0.095        0.077
1024  1048576  7.01237e-06  2074           23.964     187.064      192.746      210.813
               Speed up                               0.1281       0.124        0.114
2048  4194304  3.50787e-06  4049           188.05     1503.93      1501.46      1633.993
               Speed up                               0.125        0.125        0.115
The slowdown from one process to two is roughly the same across problem sizes, and it worsens on the smaller problems as more processes are added. The computations take longer with more processes because the full p-vector is broadcast to all other processes, segment by segment, when each process only needs a small part of it. The communication can be made optimal: each process sends only the relevant boundary part of its segment to its neighbouring processes, in parallel. Table (8) shows times and speed ups for all processes run on the same machine with this optimal communication of the p-vector in the CG-algorithm.
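As a rough count of the data moved per CG iteration (ignoring the scalar reductions), the full broadcast circulates the whole p-vector,

n = ne*ne values per iteration,

whereas the neighbour exchange moves only one grid row per shared boundary, at most 2*ne values per process, which becomes negligible relative to the local work as ne grows.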
The problem is solved quicker with the optimal communication; however, the speed up still decreases as the problem size grows. The number of CG iterations increases with the problem size, so more communication between processes is needed overall, which limits the speed up.
Table 8: Times (seconds) and speed ups for processes run on the same computer

ne    n        Error        CG iterations  1 process  2 processes  3 processes  4 processes
256   65536    2.79676e-05  516            0.33477    0.134        0.0939       0.072
               Speed up                               2.4983       3.565        4.6496
512   262144   1.40111e-05  1037           2.2537     1.108        0.763        0.637
               Speed up                               2.034        2.9537       3.538
1024  1048576  7.01237e-06  2074           22.274     10.446       8.561        7.057
               Speed up                               2.1323       2.602        3.156
2048  4194304  3.50787e-06  4049           186.975    99.017       86.275       70.768
               Speed up                               1.888        2.167        2.642
Table (9) shows the same runs with the processes split across four computers.
Table 9: Times (seconds) and speed ups for processes run on four separate computers

ne    n        Error        CG iterations  1 process  2 processes  3 processes  4 processes
256   65536    2.79676e-05  516            0.3327     0.372        0.294        0.354
               Speed up                               0.894        1.132        0.9398
512   262144   1.40111e-05  1037           2.359      1.5032       1.2023       1.379
               Speed up                               1.5693       1.962        1.71
1024  1048576  7.01237e-06  2074           22.704     9.602        6.9408       6.0282
               Speed up                               2.3645       3.271        3.7663
2048  4194304  3.50787e-06  4049           187.7764   95.8288      65.553       50.0723
               Speed up                               1.9595       2.8645       3.7501
The two largest problems were solved quicker with the processes running on separate machines. Each process then has more of its machine's cache to itself and needs less of the main memory, whereas when all the processes run on the same machine each process only utilises part of the cache and must use more of the main memory. Since the optimal communication scheme moves little data between machines, the speed up increases as the problem size increases. Modifications to the full-broadcast p-vector code are shown in appendix (.4).
.1 Task 1
/*Ciaran Cox (1115773)*/
/*1115773@my.brunel.ac.uk*/
#include <stdio.h>
#include <mpi.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#define PI (4.0*atan(1.0))
/* the right hand side forcing function */
double fxt(double x, double t)
{
double f = -PI*cos(PI*t)+(4*PI*PI-2)*(1-sin(PI*t));
return f * exp(-2*t)*cos(2*PI*x);
}
/* the exact solution */
double ux(double x, double t)
{
return (1-sin(PI*t))*exp(-2*t)*cos(2*PI*x);
}
/* the left hand boundary condition */
double uLt(double t)
{
return ux(0.0,t);
}
/* the right hand boundary condition */
double uRt(double t)
{
return ux(1.0,t);
}
/* the initial condition */
double u0x(double x)
{
return ux(x,0.0);
}
/*Main function*/
void main(int argc, char *argv[])
{
int n, sz, id, num, i0, i1,tag=0;
int N,M,i,j,inputM,inputN;
double a=0.0, b=1.0, T=2.0, alpha, h, k, *Uold=NULL, *Unew=NULL,
temp,error,error_total,error_final,t1,t2;
/*status for MPI communication*/
MPI_Status status;
/*Counting number of processes giving each process an id*/
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&id);
MPI_Comm_size(MPI_COMM_WORLD,&num);
/*asking user for input parameters*/
if (id==0)
{
printf("Enter M: n"); inputM=scanf("%d",&M);
printf("Enter N: n"); inputN=scanf("%d",&N);
/*Checking for correct input*/
if(inputM!=1 || inputN!=1){
printf("Incorrect input...exitingn");
exit(1);
}
printf("MtNtErrorttTimen");
}
/*broadcasting the input parameters to other processes*/
MPI_Bcast(&N,1,MPI_INT,0,MPI_COMM_WORLD);
MPI_Bcast(&M,1,MPI_INT,0,MPI_COMM_WORLD);
/*starting the timer*/
t1=MPI_Wtime();
/*defining constants*/
h = (b-a)/N; k =T/M; alpha = k/h/h;
/*computing each processes segment size, N-1 unknowns split*/
sz=(N-1)/num;
i0=id*sz; i1=(id+1)*sz;
/*keeping i1=N, all for loops run to less than i1, hence the boundary won't be computed*/
if (id==(num-1)){
i1=N;
}
/*shifting first process along one because boundary is known*/
if(id==0){
i0=1;
}
/*Allocating global memory*/
if((Uold=(double*)malloc((N+1)*sizeof(double)))==NULL) exit(1);
if((Unew=(double*)malloc((N+1)*sizeof(double)))==NULL) exit(1);
/*process one, left boundary condition*/
if(id==0){
Uold[0]=uLt(0);
}
/*last process, right boundary condition*/
if(id==(num-1)){
Uold[i1]=uRt(0);
}
/*all processes computing initial conditions for their segment*/
for(i=i0;i<i1;i++){
Uold[i]=u0x(a+i*h);
}
/*broadcast relevant elements if more than 1 process*/
if(num>1){
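/* Note: the send/receive order below is fixed by rank parity so that every
 * blocking MPI_Send is matched by an MPI_Recv on the neighbouring rank,
 * which prevents the exchange from deadlocking. */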
/*if total number of processes is even*/
if(num%2==0){
/*if id is an odd number*/
if(id%2!=0){
/*Sending first number of segment to the left*/
MPI_Send(&Uold[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
/*Receiving last number from left, placing in i0-1*/
MPI_Recv(&Uold[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
}
/*if id is an even number*/
if(id%2==0){
/*Receiving first number from left, placing in i1*/
MPI_Recv(&Uold[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
/*Sending last element of segment to the right*/
MPI_Send(&Uold[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
}
/*if id is even and not equal to 0*/
if(id%2==0 && id!=0){
/*Sending first number of segment to the left*/
MPI_Send(&Uold[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
/*Receiving last number from the left, placing in i0-1*/
MPI_Recv(&Uold[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
}
/*if id is odd and not equal to last process*/
if(id%2!=0 && id!=(num-1)){
/*Receiving first number from the right, placing in i1*/
MPI_Recv(&Uold[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
/*Sending last element of segment to the right*/
MPI_Send(&Uold[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
}
}
/*if total number of processes is odd*/
else
{
/*if id is even and not equal to 0*/
if(id%2==0 && id!=0){
/*Sending first element of segment to the left*/
MPI_Send(&Uold[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
/*Receiving last number from the left, placing in i0-1*/
MPI_Recv(&Uold[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
}
/*if id is odd*/
if(id%2!=0){
/*Receiving first element from left, placing in i1*/
MPI_Recv(&Uold[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
/*Sending last element of segment to the right*/
MPI_Send(&Uold[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
}
/*if id is odd*/
if(id%2!=0){
/*Sending first element of segment to the left*/
MPI_Send(&Uold[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
/*Receiving last element of segment from the left, placing in i0-1*/
MPI_Recv(&Uold[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
}
/*if id is even and not equal to last process*/
if(id%2==0 && id!=(num-1)){
/*Receiving first element from the right, placing in i1*/
MPI_Recv(&Uold[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
/*Sending last element of segment to the right*/
MPI_Send(&Uold[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
}
}
}
/*Starting loop up through time*/
for(j=1;j<=M;j++){
/*defining time step*/
double t=j*k;
/*if first process, left boundary condition*/
if(id==0){
Unew[0]=uLt(t);
}
/*if last process, right boundary condition*/
if(id==(num-1)){
Unew[i1]=uRt(t);
}
/*All processes explicitly solving their segment*/
for(i=i0;i<i1;i++){
Unew[i]=alpha*(Uold[i-1]+Uold[i+1])+(1-2*alpha)*Uold[i];
Unew[i]+=k*fxt(a+i*h,t-k);
}
/*only broadcast relevant elements if more than 1 process,
* same broadcast as before with Unew instead of Uold*/
if(num>1){
if(num%2==0){
if(id%2!=0){
MPI_Send(&Unew[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
MPI_Recv(&Unew[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
}
if(id%2==0){
MPI_Recv(&Unew[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
MPI_Send(&Unew[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
}
if(id%2==0 && id!=0){
MPI_Send(&Unew[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
MPI_Recv(&Unew[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
}
if(id%2!=0 && id!=(num-1)){
MPI_Recv(&Unew[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
MPI_Send(&Unew[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
}
}
else
{
if(id%2==0 && id!=0){
MPI_Send(&Unew[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
MPI_Recv(&Unew[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
}
if(id%2!=0){
MPI_Recv(&Unew[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
MPI_Send(&Unew[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
}
if(id%2!=0){
MPI_Send(&Unew[i0],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
MPI_Recv(&Unew[i0-1],1,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
}
if(id%2==0 && id!=(num-1)){
MPI_Recv(&Unew[i1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
MPI_Send(&Unew[i1-1],1,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
}
}
}
/*Changing pointers Unew and Uold*/
double *tmp=Unew;
Unew=Uold; Uold=tmp;
}
/*if first process shifting initial start back to boundary*/
if(id==0){
i0=0;
}
/*if last process, shifting end out one more to include boundary
* in error computation*/
if(id==(num-1)){
i1=N+1;
}
/*each process computes error summation*/
error=0;
for(i=i0;i<i1;i++){
temp=ux(a+i*h,T)-Uold[i];
error+=temp*temp;
}
/*if more than 1 process, an MPI reduce to get final error and stop timer*/
if(num>1){
error_total=0;
MPI_Reduce(&error,&error_total,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
error_final=sqrt(error_total);
t2=MPI_Wtime();
}
else{
/*if one process just sqrt error, stop timer*/
error_final=sqrt(error);
t2=MPI_Wtime();
}
/*printing results*/
if(id==0){
printf("%dt%dt%13.6let%lgnn",M,N,error_final,t2-t1);
}
/*free memory and end MPI*/
free(Uold); free(Unew);
MPI_Finalize();
}
.2 Task 1: OpenMP
Modification to initial conditions
/*all processes computing initial conditions for their segment in parallel*/
#pragma omp parallel for num_threads(NUM_THREADS) private(i)
for(i=i0;i<i1;i++){
Uold[i]=u0x(a+i*h);
}
Modification to solving explicitly each time step
/*All processes explicitly solving their segment in parallel*/
#pragma omp parallel for num_threads(NUM_THREADS) private(i)
for(i=i0;i<i1;i++){
Unew[i]=alpha*(Uold[i-1]+Uold[i+1])+(1-2*alpha)*Uold[i]
+k*fxt(a+i*h,t-k);
}
Modification to error computation
error=0;
#pragma omp parallel for num_threads(NUM_THREADS) private(i) reduction(+:error)
for(i=i0;i<i1;i++){
error+=(ux(a+i*h,T)-Uold[i])*(ux(a+i*h,T)-Uold[i]);
}
.3 Task 2 full p-vector broadcast
/*Ciaran Cox (1115773)*/
/*1115773@my.brunel.ac.uk*/
/*Relevant Libraries*/
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>
/*Exact solution function*/
double uexact(double x,double y){
double temp=x*(1-x)*y*(1-y);
return temp*temp;
}
/*Function right hand side*/
double frhs(double x,double y){
return -(2-12*x+12*x*x)*y*y*(1-y)*(1-y)-(2-12*y+12*y*y)*x*x*(1-x)*(1-x);
}
/*Matrix-Vector multiplication function*/
void MVmult(double A, double B, double *x, double *y, int ne,
int i0e, int i1e){
int ie,je;
double s;
/*Outer loop sliced among the processes: each works only on its rows [i0e,i1e) of the ne x ne grid*/
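/* 5-point stencil: coefficient A couples a node to itself and B couples it to
 * each of its four grid neighbours (nodes next to the domain boundary have fewer) */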
for (ie=i0e;ie<i1e;ie++)
for (je=0;je<ne;je++) {
s=A*x[ie*ne+je];
if (ie>0 ) s+=B*x[(ie-1)*ne+je ];
if (je>0 ) s+=B*x[ ie *ne+je-1];
if (ie<ne-1) s+=B*x[(ie+1)*ne+je ];
if (je<ne-1) s+=B*x[ ie *ne+je+1];
y[ie*ne+je]=s;
}
}
/*2D sparse CG-algorithm*/
int cgsparse2d(double A,double B,double *x,double *b,double eps,int ne,
int i0e, int i1e, int id, int sze, int num, int i0, int i1, int sz){
int i,k,n,jd,j0,j1;
double rr,pq,bb;
double alpha,beta,rrold;
double *r=NULL,*p=NULL,*q=NULL;
double my_rr, my_bb, my_pq;
/*Total number of inner nodes*/
n=ne*ne;
/*Global memory allocation*/
if( (r=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
if( (p=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
if( (q=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
/*r=b-Ax, use of MVmult function file sending in i0e,i1e*/
MVmult(A,B,x,r,ne,i0e,i1e);
for(i=i0;i<i1;i++){
r[i]=b[i]-r[i];
}
/*computation of rr and bb norm with broadcast*/
my_rr=0;
for (i=i0;i<i1;i++){
my_rr+=r[i]*r[i];
}
rr=0;
MPI_Reduce(&my_rr,&rr,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
MPI_Bcast(&rr,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
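/* MPI_Reduce to rank 0 followed by MPI_Bcast of the result acts as an
 * all-reduce of the partial sums; the same pattern is repeated for bb, pq and rr */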
my_bb=0;
for (i=i0;i<i1;i++){
my_bb+=b[i]*b[i];
}
bb=0;
MPI_Reduce(&my_bb,&bb,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
MPI_Bcast(&bb,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
/*Start of CG iterations*/
k=0;
while(sqrt(rr)>eps*sqrt(bb)){
k=k+1;
if(k==1){
/*p=r for each process segment*/
for(i=i0;i<i1;i++) p[i]=r[i];
beta=0;
}
else{
beta=rr/rrold;
/*p=r+beta*p for each process segment*/
for(i=i0;i<i1;i++) p[i]=r[i]+beta*p[i];
}
/*full broadcast of p vector to other processes*/
for(jd=0;jd<num;jd++){
j0=jd*sz; j1=(jd+1)*sz;
if(jd==(num-1)) j1=n;
MPI_Bcast(&p[j0],j1-j0,MPI_DOUBLE,jd,MPI_COMM_WORLD);
}
/*Matrix vector multiplication, q=matrix*p*/
MVmult(A,B,p,q,ne,i0e,i1e);
/*computation of pq norm with broadcast*/
my_pq=0;
for (i=i0;i<i1;i++){
my_pq+=p[i]*q[i];
}
pq=0;
MPI_Reduce(&my_pq,&pq,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
MPI_Bcast(&pq,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
alpha=rr/pq;
/*each segment computing x=x+alpha*p and r=r-alpha*q*/
for(i=i0;i<i1;i++){
x[i]=x[i]+alpha*p[i];
r[i]=r[i]-alpha*q[i];
}
rrold=rr;
/*computation of rr norm with broadcast*/
my_rr=0;
for (i=i0;i<i1;i++){
my_rr+=r[i]*r[i];
}
rr=0;
MPI_Reduce(&my_rr,&rr,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
MPI_Bcast(&rr,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
}
/*freeing memory*/
free(p); free(q); free(r);
return k;
}
int main (int argc, char *argv[]) {
int ne,n,i,k,ie,je;
double h,t1,t2,error,temp,error_total, error_final;
double A,B;
double *x=NULL,*b=NULL;
double eps=1.e-10;
int id,num,i0e,i1e,sze,j0,j1,sz,i0,i1,jd,input;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&id);
MPI_Comm_size(MPI_COMM_WORLD,&num);
/* read ne -- problemsize */
if(id==0){
printf("Number of inner nodes per edge=?n"); input=scanf("%d",&ne);
/*Checking for correct input*/
if(input!=1){
printf("Incorrect input...exitingn");
exit(1);
}
}
/*broadcast problemsize*/
MPI_Bcast(&ne,1,MPI_INT,0,MPI_COMM_WORLD);
if (ne<=2) { return 0; }
/* total number of inner nodes, i.e. number of coefficients */
n=ne*ne;
h=1.0/(ne+1);
A=4/h/h; B=-1/h/h;
/*Segment sizes for ne*/
sze=ne/num;
i0e=id*sze; i1e=(id+1)*sze;
if(id==(num-1)){
i1e=ne;
}
/*Segment sizes for n*/
sz=sze*ne;
i0=id*sz; i1=(id+1)*sz;
if(id==(num-1)){
i1=n;
}
/* allocate b */
if( (b=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
/* initialize b */
for (ie=i0e;ie<i1e;ie++)
for (je=0;je<ne;je++)
b[ie*ne+je]=frhs((ie+1)*h,(je+1)*h);
/* initialize x (const 0) */
if( (x=(double*)malloc(n*sizeof(double)))==NULL) exit(1);
for (i=i0;i<i1;i++) x[i]=0;
/*solving system*/
t1=omp_get_wtime();
k=cgsparse2d(A,B,x,b,eps,ne,i0e,i1e,id,sze,num,i0,i1,sz);
t2=omp_get_wtime();
/*Error Computation*/
error=0;
for (ie=i0e;ie<i1e;ie++)
for(je=0;je<ne;je++) {
temp=x[ie*ne+je]-uexact((ie+1)*h,(je+1)*h);
error+=temp*temp;
}
MPI_Reduce(&error,&error_total,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
error_final=sqrt(error_total);
/*Printing results*/
if(id==0){
printf("Error: %lgn",error_final);
printf("Solver-time: %f, Iterations: %dn",t2-t1,k);
}
MPI_Finalize();
/*freeing memory*/
free(x); free(b);
exit(0);
}
.4 Modifications to the full-broadcast C code for optimal p-vector communication
An integer tag and an MPI status are added for the point-to-point communication.
int i,k,n,jd,j0,j1,tag=0;
double rr,pq,bb;
double alpha,beta,rrold;
double *r=NULL,*p=NULL,*q=NULL;
double my_rr, my_bb, my_pq;
MPI_Status status;
Modifications to the broadcasting of the p-vector inside the while loop.
/*optimal communication of p vector*/
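/* Halo exchange: each process swaps a single grid row of ne values with each
 * neighbouring process instead of broadcasting its whole p-vector segment;
 * the send/receive order alternates with rank parity so the blocking calls pair up */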
if(id%2==0){
if(id>0){
MPI_Send(&p[i0e*ne],ne,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
MPI_Recv(&p[(i0e-1)*ne],ne,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
}
if(id<num-1){
MPI_Send(&p[(i1e-1)*ne],ne,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
MPI_Recv(&p[i1e*ne],ne,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
}
}else{
if(id<num-1){
MPI_Recv(&p[i1e*ne],ne,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD,&status);
MPI_Send(&p[(i1e-1)*ne],ne,MPI_DOUBLE,id+1,tag,MPI_COMM_WORLD);
}
if(id>0){
MPI_Recv(&p[(i0e-1)*ne],ne,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD,&status);
MPI_Send(&p[i0e*ne],ne,MPI_DOUBLE,id-1,tag,MPI_COMM_WORLD);
}
}