Parallel Computing
Mohamed Zahran (aka Z)
mzahran@cs.nyu.edu
http://www.mzahran.com
CSCI-UA.0480-003
MPI - III
Many slides in this lecture are adapted
and slightly modified from:
• Gerassimos Barlas
• Peter S. Pacheco
Collective vs. point-to-point communication
Data distributions
Copyright © 2010, Elsevier Inc.
All rights Reserved
Sequential version
Different partitions of a 12-component vector among 3 processes
• Block: Assign blocks of consecutive components to each process.
• Cyclic: Assign components in a round robin fashion.
• Block-cyclic: Use a cyclic distribution of blocks of components.
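As a quick illustration (not from the original slides), the three schemes differ only in which process owns component i. A sketch in C, where n is the vector length, p the number of processes, and b the block size used by the block-cyclic scheme (all names are mine):

/* Sketch: owner of component i under each distribution.
   Assumes p divides n evenly in the block case.          */
int block_owner(int i, int n, int p)        { return i / (n / p); }
int cyclic_owner(int i, int p)              { return i % p; }
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }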
Parallel implementation of
vector addition
How will you distribute parts of x[] and y[] to processes?
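One natural answer is a block distribution: each process holds a block of x, y, and the result of length local_n, and the addition is purely local. A minimal sketch (function and variable names are illustrative, not the book's exact code):

/* Sketch: each process adds its own blocks of x and y. */
void parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
   for (int i = 0; i < local_n; i++)
      local_z[i] = local_x[i] + local_y[i];
}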
Scatter
• Read an entire vector on process 0
• MPI_Scatter sends the needed
components to each of the other
processes.
send_count: # of data items going to each process
Important:
• All arguments are important for the source process (process 0 in our example)
• For all other processes, only recv_buf_p, recv_count, recv_type, src_proc,
and comm are important
Reading and distributing a vector
Note that process 0 itself also receives data.
• send_buf_p
– is not used except by the sender.
– However, it must still be defined (it may be NULL) on the other processes for the code to be correct.
– Must have at least (communicator size) * send_count elements.
• All processes must call MPI_Scatter, not only the sender.
• send_count is the number of data items sent to each process.
• recv_buf_p must have at least send_count elements.
• MPI_Scatter uses a block distribution.
Block distribution example: process 0 starts with the 9 elements 0 1 2 3 4 5 6 7 8; after MPI_Scatter, process 0 keeps 0 1 2, process 1 gets 3 4 5, and process 2 gets 6 7 8.
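A sketch of reading a vector on process 0 and scattering it with a block distribution (the function name read_and_scatter and the variable names are mine; error checking is omitted):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: process 0 reads all n elements, then MPI_Scatter hands every
   process its block of local_n = n / comm_sz elements.                 */
void read_and_scatter(double local_a[], int local_n, int n,
                      int my_rank, MPI_Comm comm) {
   double *a = NULL;
   if (my_rank == 0) {
      a = malloc(n * sizeof(double));
      for (int i = 0; i < n; i++)
         scanf("%lf", &a[i]);
   }
   MPI_Scatter(a, local_n, MPI_DOUBLE,        /* send args: matter on process 0 only */
               local_a, local_n, MPI_DOUBLE,  /* recv args: matter on every process  */
               0, comm);
   if (my_rank == 0) free(a);
}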
Gather
• MPI_Gather collects all of the
components of the vector onto the
destination process dest_proc, in rank order.
Important:
• All arguments are important for the destination process.
• For all other processes, only send_buf_p, send_count, send_type, dest_proc,
and comm are important
recv_count: number of elements for any single receive
send_count: number of elements in send_buf_p
Print a distributed vector (1)
Print a distributed vector (2)
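A sketch of such a print function, gathering the blocks onto process 0 with MPI_Gather and printing them in rank order (names are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: collect every process' local_n-element block on process 0,
   then print the full n-element vector there.                         */
void print_vector(double local_b[], int local_n, int n,
                  int my_rank, MPI_Comm comm) {
   double *b = NULL;
   if (my_rank == 0)
      b = malloc(n * sizeof(double));
   MPI_Gather(local_b, local_n, MPI_DOUBLE,   /* send args: matter everywhere     */
              b,       local_n, MPI_DOUBLE,   /* recv args: matter on process 0   */
              0, comm);
   if (my_rank == 0) {
      for (int i = 0; i < n; i++)
         printf("%f ", b[i]);
      printf("\n");
      free(b);
   }
}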
Allgather
• Concatenates the contents of each
process’ send_buf_p and stores this in
each process’ recv_buf_p.
• As usual, recv_count is the amount of data
being received from each process.
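A sketch of the call, assuming every process contributes a block of local_n elements and x has room for all n = comm_sz * local_n of them:

#include <mpi.h>

/* Sketch: after the call, every process holds the complete vector in x. */
void gather_full_vector(double local_x[], double x[], int local_n,
                        MPI_Comm comm) {
   MPI_Allgather(local_x, local_n, MPI_DOUBLE,
                 x,       local_n, MPI_DOUBLE, comm);
}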
Matrix-vector multiplication
The i-th component of y is the dot product of the i-th row of A with x:
y[i] = A[i][0]*x[0] + A[i][1]*x[1] + ... + A[i][n-1]*x[n-1]
Matrix-vector multiplication
Pseudo-code Serial Version
C style arrays
A two-dimensional array is stored as a one-dimensional array in row-major order: element A[i][j] of an m x n matrix is stored at offset i*n + j.
Serial matrix-vector multiplication
Let’s assume x[] is distributed among the different processes
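A minimal sketch of the serial computation, using the row-major layout described above; A is m x n and the names are illustrative:

/* Sketch: serial y = A * x with A stored row-major in a 1-D array. */
void mat_vect_mult_serial(double A[], double x[], double y[],
                          int m, int n) {
   for (int i = 0; i < m; i++) {
      y[i] = 0.0;
      for (int j = 0; j < n; j++)
         y[i] += A[i*n + j] * x[j];
   }
}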
An MPI matrix-vector
multiplication function (1)
An MPI matrix-vector
multiplication function (2)
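A rough sketch of such a function, assuming the rows of A (and the matching block of y) are block-distributed and x is reassembled with MPI_Allgather; names and error handling follow the serial sketch above rather than the book's exact code:

#include <stdlib.h>
#include <mpi.h>

/* Sketch: each process owns local_m rows of A and computes the matching
   block of y; the full x is rebuilt locally with MPI_Allgather first.   */
void mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int local_m, int n, int local_n, MPI_Comm comm) {
   double *x = malloc(n * sizeof(double));

   MPI_Allgather(local_x, local_n, MPI_DOUBLE,
                 x,       local_n, MPI_DOUBLE, comm);

   for (int local_i = 0; local_i < local_m; local_i++) {
      local_y[local_i] = 0.0;
      for (int j = 0; j < n; j++)
         local_y[local_i] += local_A[local_i*n + j] * x[j];
   }
   free(x);
}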
Keep in mind …
• In distributed memory systems,
communication is more expensive than
computation.
• Distributing a fixed amount of data
among several messages is more
expensive than sending a single big
message.
Derived datatypes
• Used to represent any collection of data
items
• If a function that sends data knows this
information about a collection of data items,
it can collect the items from memory before
they are sent.
• A function that receives data can distribute
the items into their correct destinations in
memory when they’re received.
Derived datatypes
• A sequence of basic MPI data types
together with a displacement for each
of the data types.
Example: a and b are double, n is int. Each variable has an address in memory where it is stored; its displacement is measured from the beginning of the type (we assume the type starts at a, so a's displacement is 0).
MPI_Type_create_struct
• Builds a derived datatype that consists
of individual elements that have
different basic types.
– array_of_displacements: measured from the address of item 0.
– MPI_Aint: an integer type that is big enough to store an address on the system.
– count: the number of elements in the type.
Before you start using your new data type, call MPI_Type_commit.
It allows the MPI implementation to optimize its internal representation of the datatype for use in communication functions.
When you are finished with your new type, call MPI_Type_free.
This frees any additional storage used.
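Putting the pieces together, a sketch of building (and committing) a derived type for the a, b, n example above; the helper name build_mpi_type is mine:

#include <mpi.h>

/* Sketch: build a datatype describing {double a; double b; int n},
   with displacements measured from the address of a (item 0).      */
void build_mpi_type(double *a_p, double *b_p, int *n_p,
                    MPI_Datatype *input_mpi_t_p) {
   int          array_of_blocklengths[3] = {1, 1, 1};
   MPI_Datatype array_of_types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
   MPI_Aint     a_addr, b_addr, n_addr;
   MPI_Aint     array_of_displacements[3] = {0, 0, 0};

   MPI_Get_address(a_p, &a_addr);
   MPI_Get_address(b_p, &b_addr);
   MPI_Get_address(n_p, &n_addr);
   array_of_displacements[1] = b_addr - a_addr;   /* displacement of b */
   array_of_displacements[2] = n_addr - a_addr;   /* displacement of n */

   MPI_Type_create_struct(3, array_of_blocklengths, array_of_displacements,
                          array_of_types, input_mpi_t_p);
   MPI_Type_commit(input_mpi_t_p);   /* commit before first use */
}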
Example (1)
Example (2)
Example (3)
The receiving end can use the received complex data item as if it were a structure.
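A sketch of how such a type might be used, reusing the build_mpi_type helper sketched above: process 0 reads a, b, n, a single broadcast delivers all three, and every receiving process then finds the values sitting in its own variables; the type is freed once it is no longer needed.

#include <stdio.h>
#include <mpi.h>

/* Sketch: broadcast a, b and n in one message using the derived type. */
void get_input(int my_rank, double *a_p, double *b_p, int *n_p) {
   MPI_Datatype input_mpi_t;

   build_mpi_type(a_p, b_p, n_p, &input_mpi_t);   /* sketched above */

   if (my_rank == 0)
      scanf("%lf %lf %d", a_p, b_p, n_p);
   MPI_Bcast(a_p, 1, input_mpi_t, 0, MPI_COMM_WORLD);

   MPI_Type_free(&input_mpi_t);   /* frees the extra storage */
}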
MEASURING TIME IN MPI
We have seen in the past …
• time in Linux
• clock() inside your code
• Does MPI offer anything else?
Elapsed parallel time
• MPI_Wtime returns the number of wall-clock seconds that have elapsed since some time in the past.
• It reports elapsed time for the calling process only.
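A minimal usage sketch (the surrounding function is illustrative):

#include <mpi.h>

/* Sketch: time a region of code on the calling process only. */
double time_work(void) {
   double start = MPI_Wtime();
   /* ... code being timed ... */
   double finish = MPI_Wtime();
   return finish - start;   /* wall-clock seconds on this process */
}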
How to Sync Processes?
MPI_Barrier
• Ensures that no process will return from
calling it until every process in the
communicator has started calling it.
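Combining MPI_Wtime with MPI_Barrier, a common pattern (sketched here; names are mine) is to start all processes together, time locally, and report the maximum over all processes as the parallel elapsed time:

#include <stdio.h>
#include <mpi.h>

/* Sketch: the reported time is the slowest process' elapsed time,
   which is the usual definition of parallel run-time.             */
void timed_region(int my_rank, MPI_Comm comm) {
   double local_start, local_finish, local_elapsed, elapsed;

   MPI_Barrier(comm);                /* start everyone at (about) the same time */
   local_start = MPI_Wtime();
   /* ... code being timed ... */
   local_finish  = MPI_Wtime();
   local_elapsed = local_finish - local_start;

   MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
   if (my_rank == 0)
      printf("Elapsed time = %e seconds\n", elapsed);
}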
Let’s see how we can analyze the
performance of an MPI program
The matrix-vector multiplication
Run-times of serial and parallel
matrix-vector multiplication
(table of run-times, in seconds)
Speedups of Parallel Matrix-Vector Multiplication
Efficiencies of Parallel Matrix-Vector Multiplication
Conclusions
• Reducing messages is a good
performance strategy!
– Collective vs point-to-point
• Distributing a fixed amount of data
among several messages is more
expensive than sending a single big
message.