Parallel Programming on the ANDC cluster
Presentation by Sudhang Shankar @ ANDC Linux Workshop '09 on Parallel Programming


Transcript

  • 1. Programming the ANDC Cluster
      • Sudhang Shankar
  • 2. Traditional Programming
    • Serial: One instruction at a time, one after the other, on a single CPU.
    [Diagram: a single problem decomposed into instructions t1 through t6, executed one after another on one CPU]
  • 3. The Funky Ishtyle
    • Parallel: The problem is split into parts. Each part is represented as a sequence of instructions, and each sequence runs on a separate CPU.
    [Diagram: the problem split into Sub-Problem 1 and Sub-Problem 2, each a sequence of instructions t1 through t3, running on CPU1 and CPU2 respectively]
  • 4. Why Parallelise?
    • Speed - ”Many Hands Make Light Work”
    • Precision/Scale – We can solve bigger problems, with greater accuracy.
  • 5. Parallel Programming Models
    • There are several parallel programming models in common use:
      • Shared Memory
      • Threads
      • Message Passing
      • Data Parallel
  • 6. Message Passing Model
    • The applications currently on the ANDC cluster use this model
    • Tasks use their own local memory during computation
    • Tasks exchange data through messages
  • 7. The MPI Standard
    • MPI: Message Passing Interface
      • A standard, with many implementations
      • Codifies the ”best practices” of the parallel programming community
      • Implementations
        • MPICH – Argonne National Laboratory
        • LAM/MPI
        • Open MPI
  • 8. How MPI works
    • Communicators define which collections of processes may communicate with each other
    • Every process in a communicator has a unique rank
    • The size of a communicator is the total number of processes in it
  • 9. MPI primitives – Environment Setup
    • MPI_INIT: initialises the MPI execution environment
    • MPI_COMM_SIZE: Determines the number of processes in the group associated with a communicator
    • MPI_COMM_RANK: Determines the rank of the calling process within the communicator
    • MPI_FINALIZE: Terminates the MPI execution environment
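    • A minimal sketch of a program using just these four calls (illustrative only, not from the slides):
      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[]) {
          int rank, size;

          MPI_Init(&argc, &argv);                /* set up the MPI environment   */
          MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total processes in the group */
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank          */

          printf("Hello from process %d of %d\n", rank, size);

          MPI_Finalize();                        /* shut the environment down    */
          return 0;
      }
    • Typically compiled with mpicc and launched with something like mpirun -np 4 ./hello (exact commands depend on the MPI implementation).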
  • 10. MPI primitives – Message Passing
    • MPI_Send(buffer,count,type,dest,tag,comm)
    • MPI_Recv(buffer,count,type,source,tag,comm,status)
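    • A hedged sketch of how these two calls pair up (process 1 sends one integer to process 0; the tag value 99 and variable names are arbitrary choices for illustration):
      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[]) {
          int rank, value;
          MPI_Status status;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 1) {               /* sender */
              value = 42;
              /* buffer, count, type, dest, tag, comm */
              MPI_Send(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD);
          } else if (rank == 0) {        /* receiver */
              /* buffer, count, type, source, tag, comm, status */
              MPI_Recv(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &status);
              printf("Process 0 received %d\n", value);
          }

          MPI_Finalize();
          return 0;
      }
    • Run with at least two processes, e.g. mpirun -np 2 ./sendrecv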
  • 11. An Example Application
    • The Monte-Carlo Pi Estimation Algorithm
    • AKA ”The Dartboard Algorithm”
  • 12. Algorithm Description
    • Imagine you have a square ”dartboard”, with a circle inscribed in it:
  • 13.
    • Randomly throw N darts at the board
    • Count the number of HITS (darts landing within the circle) and FLOPS (darts landing outside it)
  • 14.
    • pi is then estimated by multiplying the ratio of hits to total throws by 4
    Why: pi = A_c / r^2; A_s = (2r)^2 = 4r^2, so r^2 = A_s / 4; hence pi = 4 * A_c / A_s
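    • The dboard() routine used by the MPI version later is not spelled out on the slides; a minimal sketch (the 2x2 board centred on the origin and the use of rand() are assumptions) could be:
      #include <stdio.h>
      #include <stdlib.h>

      /* Throw `darts` random darts at a 2x2 square centred on the origin and
         count how many land inside the inscribed unit circle (the HITS). */
      double dboard(long darts) {
          long hits = 0;
          for (long i = 0; i < darts; i++) {
              double x = 2.0 * rand() / RAND_MAX - 1.0;   /* x in [-1, 1] */
              double y = 2.0 * rand() / RAND_MAX - 1.0;   /* y in [-1, 1] */
              if (x * x + y * y <= 1.0)                   /* inside the circle? */
                  hits++;
          }
          return (double)hits;   /* double, since it is later sent as MPI_DOUBLE */
      }

      int main(void) {                               /* tiny serial test driver */
          long darts = 1000000;
          printf("pi ~= %f\n", 4.0 * dboard(darts) / darts);
          return 0;
      }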
  • 15. Parallel Version
    • Make each worker throw an equal number of darts
    • A worker counts the HITS
    • The Master adds all the individual ”HITS”
    • It then computes pi as: pi = (4.)*(HITS)/N
  • 16. To Make it Faster....
    • Increase the number of workers p, while keeping N constant.
    • Each worker deals with (N/p) throws
    • so the greater the value of p, the fewer throws a worker handles
    • Fewer throws => faster
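    • For example, with N = 1,000,000 throws and p = 10 workers, each worker handles only 100,000 throws.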
  • 17. To Make it ”Better”
    • Increase the number of throws N
    • This makes the calculation more accurate
  • 18. MPI
    • Each task runs the dartboard algorithm:
      homehits = dboard(DARTS);
    • Workers then send their homehits to the master:
      if (taskid != MASTER)
          MPI_Send(&homehits, 1, MPI_DOUBLE, MASTER,
                   mtype, MPI_COMM_WORLD);
  • 19.
    • The master collects the homehits values from the workers and adds in its own:
      totalhits = homehits;
      for (i = 1; i < p; i++) {
          rc = MPI_Recv(&hitrecv, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                        mtype, MPI_COMM_WORLD, &status);
          totalhits = totalhits + hitrecv;
      }
    • Master calculates pi as pi = (4.0)*(totalhits)/N
  • 20. MapReduce
    • Framework for simplifying the development of parallel programs.
    • Developed at Google.
    • FLOSS implementations
      • Hadoop (Java) from Yahoo
      • Disco (Erlang) from Nokia
      • Dumbo (Python) from Audioscrobbler
      • Many others (including a 36-line Ruby one!)
  • 21.
    • The MapReduce library requires the user to implement:
      • Map(): takes as input a function and a sequence of values. Applies the function to each value in the sequence
      • Reduce(): combines all the elements of a sequence using a binary operation
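    • For example, mapping the squaring function over the sequence [1, 2, 3] gives [1, 4, 9], and reducing [1, 4, 9] with + gives 14.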
  • 22. How it works (oversimplified)
    • map() takes as input a set of <key,value> pairs and produces a set of <intermediate key,value> pairs
    • This is all done in parallel, across many machines
    • The parallelisation is done by the mapreduce library (the programmer doesn't have to think about it)
  • 23.
    • The MapReduce Library groups together all intermediate values associated with the same intermediate key I and passes them to reduce()
    • A Reduce() instance takes a set of <intermediate key,value> pairs and produces an output value for that key, like a ”summary” value.
  • 24. Pi Estimation in MapReduce
    • Here, map() is the dartboard algo.
    • Each worker runs the algo. Hits are represented as <1,no_of_Hits> and flops as <0,no_of_flops>
    • Thus each Map() instance returns two <boolean,count> pairs to the MapReduce library.
  • 25.
    • The library then clumps all the <bool,count> pairs into two ”sets”: one for key 0 and one for key 1 and passes them to reduce()
    • Reduce() then adds up the ”count” for each key to produce a ”grand total”. Thus we know the total hits and total flops. These are output as <key, Grand_total> pairs to the master process.
    • The master then finds pi as pi = 4*(hits)/(hits+flops)
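    • A framework-free sketch of this scheme in plain C (the names dartboard_map and count_reduce are hypothetical, and the map() calls run one after another here rather than being distributed by a real MapReduce library):
      #include <stdio.h>
      #include <stdlib.h>

      typedef struct { int key; long value; } KV;   /* key 1 = hit, key 0 = flop */

      /* "map": run the dartboard algorithm on one worker's share of throws
         and emit two <key, count> pairs. */
      static void dartboard_map(long throws, KV out[2]) {
          long hits = 0;
          for (long i = 0; i < throws; i++) {
              double x = 2.0 * rand() / RAND_MAX - 1.0;
              double y = 2.0 * rand() / RAND_MAX - 1.0;
              if (x * x + y * y <= 1.0)
                  hits++;
          }
          out[0].key = 1; out[0].value = hits;            /* hits  */
          out[1].key = 0; out[1].value = throws - hits;   /* flops */
      }

      /* "reduce": sum every count that shares the same key. */
      static long count_reduce(const KV *pairs, int n, int key) {
          long total = 0;
          for (int i = 0; i < n; i++)
              if (pairs[i].key == key)
                  total += pairs[i].value;
          return total;
      }

      int main(void) {
          enum { WORKERS = 4 };
          const long THROWS_PER_WORKER = 250000;
          KV all[2 * WORKERS];

          /* A real framework would run each map() call on a different machine. */
          for (int w = 0; w < WORKERS; w++)
              dartboard_map(THROWS_PER_WORKER, &all[2 * w]);

          long hits  = count_reduce(all, 2 * WORKERS, 1);
          long flops = count_reduce(all, 2 * WORKERS, 0);
          printf("pi ~= %f\n", 4.0 * hits / (hits + flops));
          return 0;
      }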
  • 26. Other Solutions
    • PVM
    • OpenMP
    • LINDA
    • Occam
    • Parallel/Scientific Python
  • 27. Problems/Limitations
    • Parallel Slowdown: parallelising a program beyond a certain point makes it run slower, because communication and coordination overhead start to dominate
    • Amdahl's Law: parallel speedup is limited by the sequential fraction of the program.
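    • Stated as a formula (standard form, not on the slides): if a fraction P of the work can be parallelised across N processors, Speedup = 1 / ((1 - P) + P/N). Even with P = 0.9 and unlimited processors, the speedup can never exceed 1 / 0.1 = 10x.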
  • 28. Applications
    • Compute clusters are used whenever we have:
      • lots and lots of data to process,
      • Too little time to work sequentially.
  • 29. Finance
    • Risk Assessment:
      • India's NSE uses a Linux cluster to monitor the risk exposure of its members
      • Broker crosses VAR limit => account disabled
      • VAR is calculated in real time using PRISM (Parallel Risk Management System), which uses MPI
      • NSE's PRISM handles 500 trades/sec and can scale to 1000 trades/sec.
  • 30. Molecular Dynamics
    • Given a collection of atoms, we’d like to calculate how they interact and move under realistic laboratory conditions
    • expensive part: determining the force on each atom, since it depends on the positions of all other atoms in the system
  • 31.
    • Software:
      • GROMACS: helps scientists simulate the behavior of large molecules (like proteins, lipids, and even polymers)
      • PyMol: molecular graphics and modelling package which can be also used to generate animated sequences.
    Raytraced Lysozyme structure created with Pymol
  • 32. Other Distributed Problems
    • Rendering multiple frames of high-quality animation (eg – Shrek)
    • Indexing the web (eg – Google)
    • Data Mining
  • 33. Questions?