Programming the ANDC Cluster
Sudhang Shankar
ANDC Linux Workshop '09
Traditional Programming
  • Serial: one instruction at a time, one after the other, on a single CPU.
  [Diagram: a problem decomposed into instructions t1 to t6, executed in sequence on one CPU]
The Funky Ishtyle
  • Parallel: the problem is split into parts. Each part is represented as a sequence of instructions, and each sequence runs on a separate CPU.
  [Diagram: a problem split into two sub-problems, whose instructions t1 to t3 run on CPU1 and CPU2]
Why Parallelise?
  • Speed: "Many Hands Make Light Work"
  • Precision/Scale: we can solve bigger problems, with greater accuracy.
Parallel Programming Models
  • There are several parallel programming models in common use:
    • Shared Memory
    • Threads
    • Message Passing
    • Data Parallel
Message Passing Model
  • The applications on the ANDC cluster currently use this model.
  • Tasks use their own local memory during computation.
  • Tasks exchange data through messages.
The MPI Standard
  • MPI: Message Passing Interface
    • A standard, with many implementations
    • Codifies "best practices" of the parallel programming community
    • Implementations:
      • MPICH (Argonne National Laboratory)
      • LAM/MPI
      • Open MPI
How MPI Works
  • Communicators define which collections of processes may communicate with each other.
  • Every process in a communicator has a unique rank.
  • The size of a communicator is the total number of processes in it.
MPI Primitives: Environment Setup
  • MPI_Init: initialises the MPI execution environment.
  • MPI_Comm_size: determines the number of processes in the group associated with a communicator.
  • MPI_Comm_rank: determines the rank of the calling process within the communicator.
  • MPI_Finalize: terminates the MPI execution environment.
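Put together, the four calls look like this. A minimal sketch in C, not taken from the deck (the file name and message text are made up); it compiles with mpicc:

    /* hello_mpi.c: minimal sketch of the environment-setup primitives.
       Illustrative only; compile with mpicc, run with mpirun. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* initialise the MPI environment */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total processes in the communicator */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();                        /* terminate the MPI environment */
        return 0;
    }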
MPI Primitives: Message Passing
  • MPI_Send(buffer, count, type, dest, tag, comm)
  • MPI_Recv(buffer, count, type, source, tag, comm, status)
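And a sketch of the pair in action: rank 0 sends one double to rank 1. This assumes the job is launched with at least two processes; the tag value and variable names are illustrative:

    /* send_recv.c: illustrative MPI_Send/MPI_Recv between ranks 0 and 1. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        int tag = 0;              /* arbitrary tag; must match on both sides */
        double value = 3.14;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* one MPI_DOUBLE to destination rank 1 */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
            printf("rank 1 received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }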
An Example Application
  • The Monte Carlo Pi Estimation Algorithm
  • AKA "The Dartboard Algorithm"
Algorithm Description
  • Imagine you have a square "dartboard" with a circle inscribed in it:
  [Diagram: a circle inscribed in a square]
  • Randomly throw N darts at the board.
  • Count the number of HITS (darts landing within the circle); darts that miss the circle are FLOPS.
  [Diagram: dartboard with darts marked as hits and flops]
  • pi is the ratio of hits to total throws, multiplied by 4.
  • Why: the circle's area is A_c = pi * r^2, so pi = A_c / r^2.
    The square's area is A_s = (2r)^2 = 4 * r^2, so r^2 = A_s / 4.
    Hence pi = 4 * A_c / A_s, and hits/throws estimates the area ratio A_c / A_s.
  [Diagram: dartboard with hits inside the circle and flops outside]
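The kernel that later slides call dboard() might look roughly like this in C. This is an illustrative reconstruction, not the deck's actual code; the unit square [-1,1] x [-1,1] and the use of rand() are assumptions:

    /* dboard: throw 'darts' random darts at the square [-1,1] x [-1,1]
       and return how many land inside the inscribed unit circle.
       Returns a double so it can be sent as MPI_DOUBLE later. */
    #include <stdlib.h>

    double dboard(int darts)
    {
        int hits = 0;
        for (int i = 0; i < darts; i++) {
            double x = 2.0 * rand() / RAND_MAX - 1.0;  /* random x in [-1, 1] */
            double y = 2.0 * rand() / RAND_MAX - 1.0;  /* random y in [-1, 1] */
            if (x * x + y * y <= 1.0)                  /* inside the circle? */
                hits++;
        }
        return (double)hits;
    }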
Parallel Version
  • Make each worker throw an equal number of darts.
  • Each worker counts its own HITS.
  • The master adds up all the individual HITS.
  • It then computes pi as: pi = 4.0 * HITS / N
To Make it Faster...
  • Increase the number of workers p, while keeping N constant.
  • Each worker then handles N/p throws.
  • The greater the value of p, the fewer throws each worker handles.
  • Fewer throws per worker => faster.
To Make it "Better"
  • Increase the number of throws N.
  • This makes the calculation more accurate.
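A quantitative note the slide leaves implicit (standard Monte Carlo behaviour, not stated in the deck): the statistical error of the estimate shrinks only as

    error ∝ 1 / sqrt(N)

so each additional correct digit of pi costs roughly 100 times more throws.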
MPI
  • Each task runs the dartboard algorithm:

      homehits = dboard(DARTS);

  • Workers send their homehits to the master, tagging the message with mtype (the same tag the master receives with):

      if (taskid != MASTER)
          MPI_Send(&homehits, 1, MPI_DOUBLE,
                   MASTER, mtype, MPI_COMM_WORLD);
  • The master starts from its own homehits, then collects one value from each of the p - 1 workers:

      totalhits = homehits;   /* the master's own throws count too */
      for (i = 0; i < p - 1; i++) {
          rc = MPI_Recv(&hitrecv, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                        mtype, MPI_COMM_WORLD, &status);
          totalhits = totalhits + hitrecv;
      }

  • The master then calculates pi as:

      pi = 4.0 * totalhits / N;
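On the cluster, a program like this would typically be launched with the implementation's launcher, along the lines of mpirun -np 8 ./pi (the -np flag is common to LAM/MPI, MPICH, and Open MPI; the process count and binary name here are illustrative).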
MapReduce
  • A framework for simplifying the development of parallel programs.
  • Developed at Google.
  • FLOSS implementations:
    • Hadoop (Java), from Yahoo
    • Disco (Erlang), from Nokia
    • Dumbo, from Audioscrobbler
    • Many others (including a 36-line Ruby one!)
MapReduce
  • The MapReduce library requires the user to implement:
    • Map(): takes a function and a sequence of values, and applies the function to each value in the sequence.
    • Reduce(): combines all the elements of a sequence using a binary operation.
How it Works (oversimplified)
  • map() takes a set of <key, value> pairs and produces a set of <intermediate key, value> pairs.
  • This is all done in parallel, across many machines.
  • The parallelisation is done by the MapReduce library; the programmer doesn't have to think about it.
  • The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to reduce().
  • A reduce() instance takes a set of <intermediate key, value> pairs and produces an output value for that key: a "summary" value.
Pi Estimation in MapReduce
  • Here, map() is the dartboard algorithm.
  • Each worker runs the algorithm. Hits are emitted as <1, no_of_hits> pairs and flops as <0, no_of_flops> pairs.
  • Each map() instance thus returns two <boolean, count> pairs to the MapReduce library.
  • The library then groups all the <bool, count> pairs into two "sets", one for key 0 and one for key 1, and passes them to reduce().
  • reduce() adds up the counts for each key to produce a grand total: the total hits and the total flops. These are output as <key, grand_total> pairs to the master process.
  • The master then finds pi as: pi = 4 * hits / (hits + flops)
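To make the data flow concrete, here is a toy single-process C emulation of this pipeline. Purely illustrative: WORKERS, DARTS, and the function names are invented, and a real Hadoop or Disco job would run the map calls on many machines and do the grouping itself:

    /* mr_pi.c: toy single-process emulation of the MapReduce pi job. */
    #include <stdio.h>
    #include <stdlib.h>

    #define WORKERS 4         /* number of map() instances (assumed) */
    #define DARTS   1000000   /* throws per map() instance (assumed) */

    /* map(): one dartboard run; counts[1] collects hits, counts[0] flops */
    static void map_darts(long counts[2])
    {
        for (int i = 0; i < DARTS; i++) {
            double x = 2.0 * rand() / RAND_MAX - 1.0;
            double y = 2.0 * rand() / RAND_MAX - 1.0;
            counts[x * x + y * y <= 1.0 ? 1 : 0]++;
        }
    }

    /* reduce(): sum all the counts that share one key */
    static long reduce_sum(const long *values, int n)
    {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += values[i];
        return total;
    }

    int main(void)
    {
        long hits[WORKERS], flops[WORKERS];

        /* "map phase": a real framework would run these in parallel */
        for (int w = 0; w < WORKERS; w++) {
            long counts[2] = {0, 0};
            map_darts(counts);
            flops[w] = counts[0];  /* pairs with key 0 */
            hits[w]  = counts[1];  /* pairs with key 1 */
        }

        /* "reduce phase": one reduce() per intermediate key */
        long total_hits  = reduce_sum(hits, WORKERS);
        long total_flops = reduce_sum(flops, WORKERS);

        printf("pi ~= %f\n", 4.0 * total_hits / (total_hits + total_flops));
        return 0;
    }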
Other Solutions
  • PVM
  • OpenMP
  • LINDA
  • Occam
  • Parallel/Scientific Python
Problems/Limitations
  • Parallel slowdown: parallelising a program beyond a certain point makes it run slower, because communication overhead outgrows the computational gains.
  • Amdahl's Law: parallel speedup is limited by the sequential fraction of the program.
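In symbols (the standard formulation, which the deck states only in words): if a fraction P of a program parallelises perfectly over N processors, the speedup is

    S(N) = 1 / ((1 - P) + P/N)

For example, with P = 0.9 the speedup can never exceed 1 / (1 - 0.9) = 10, no matter how many nodes the cluster has.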
Applications
  • Compute clusters are used whenever we have:
    • lots and lots of data to process, and
    • too little time to work sequentially.
Finance
  • Risk assessment:
    • India's NSE uses a Linux cluster to monitor the risk of its members.
    • If a broker crosses the VaR (Value at Risk) limit, the account is disabled.
    • VaR is calculated in real time by PRISM (Parallel Risk Management System), which uses MPI.
    • NSE's PRISM handles 500 trades/sec and can scale to 1000 trades/sec.
Molecular Dynamics
  • Given a collection of atoms, we'd like to calculate how they interact and move under realistic laboratory conditions.
  • The expensive part is determining the force on each atom, since it depends on the positions of all the other atoms in the system.
  • Software:
    • GROMACS: helps scientists simulate the behaviour of large molecules (proteins, lipids, even polymers).
    • PyMOL: a molecular graphics and modelling package which can also be used to generate animated sequences.
  [Image: ray-traced lysozyme structure created with PyMOL]
Other Distributed Problems
  • Rendering multiple frames of high-quality animation (e.g. Shrek)
  • Indexing the web (e.g. Google)
  • Data Mining
Questions?