Example: parallelize a simple problem

Dr. Mohammad Ansari http://uqu.edu.sa/staff/ar/4300205

Transcript

  • 1. Parallel & Distributed Computer Systems Dr. Mohammad Ansari
  • 2. Course Details
    - Delivery
      ◦ Lectures/discussions: English
      ◦ Assessments: English
      ◦ Ask questions in class if you don’t understand
      ◦ Email me after class if you do not want to ask in class
      ◦ DO NOT LEAVE QUESTIONS TILL THE DAY BEFORE THE EXAM!!!
    - Assessments (this may change)
      ◦ Homework (~1 per week): 10%
      ◦ Midterm: 20%
      ◦ 1 project + final exam OR 2 projects: 35% + 35%
  • 3. Course Details
    - Textbook
      ◦ Principles of Parallel Programming, Lin & Snyder
    - Other sources of information:
      ◦ COMP 322, Rice University
      ◦ CS 194, UC Berkeley
      ◦ Cilk lectures, MIT
    - Many sources of information on the internet for writing parallelized code
  • 4. Teaching Materials & Assignments
    - Everything is on Jusur
      ◦ Lectures
      ◦ Homework
    - Submit homework through Jusur
    - Homework is given out on Saturday
    - Homework is due the following Saturday
    - You lose 10% for each day it is late
  • 5. Homework 1
    - The first homework is available on Jusur
      ◦ Install Linux on your computer
        - It is needed for future homework
        - It is needed to access the supercomputers
      ◦ Check your settings/hardware
        - Submit pictures of your settings
        - Submit a description of your processor
      ◦ Deadline: 27/03/1431 (submit on Jusur)
  • 6. Cheating in Homework/Projects
    - Cheating
      ◦ If you cheat, you get zero
      ◦ If you help others cheat, you will also get zero
      ◦ Copying and pasting from the Internet (e.g. Wikipedia) or elsewhere is also cheating (this is called plagiarism)
      ◦ You can read any source of information, but you must write answers in your own words
      ◦ If you have problems, please ask for help.
  • 7. Outline
    - Previous lecture:
      ◦ Why study parallel computing?
      ◦ Topics covered in this course
    - This lecture:
      ◦ Example problem
    - Next week:
      ◦ Parallel processor architectures
  • 8. Example Problem
    - We will parallelize a simple problem
    - We begin to explore some of the issues related to parallel programming and the performance of parallel programs
  • 9. Example Problem: Array Sum
    - Add all the numbers in a large array
    - It has 100 million elements
        int size = 100000000;
        int array[] = {7,3,15,10,13,18,6,4,…};
    - What code should we write for a sequential program?
  • 10. Example Problem: Sequential
      int sum = 0;
      int i = 0;
      for(i = 0; i < size; i++) {
        sum += array[i]; //sum=sum+array[i];
      }
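    For reference, a minimal self-contained C version of this sequential sum (the heap allocation, the dummy array values, and the long long accumulator are assumptions added here, since the slide elides the real data):

      #include <stdio.h>
      #include <stdlib.h>

      int main(void) {
          int size = 100000000;                      /* 100 million elements */
          int *array = malloc(size * sizeof(int));
          if (array == NULL) return 1;

          for (int i = 0; i < size; i++)
              array[i] = i % 10;                     /* dummy data; the slide's values are elided */

          long long sum = 0;                         /* long long guards against overflow */
          for (int i = 0; i < size; i++)
              sum += array[i];

          printf("sum = %lld\n", sum);
          free(array);
          return 0;
      }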
  • 11. Example Problem: Sequential
  • 12. How Do We Parallelize?
    - Objective: thinking about parallelism
      ◦ Multiple processors need something to do
        - A program/software has to be split into parts
        - Each part can be executed on a different processor.
      ◦ How do we improve performance over a single processor?
        - If a problem takes 2 seconds on a single processor,
        - and we break it into two (equal) parts, 1 second for each part,
        - and we execute the two parts separately, but in parallel, on two processors, then we improve performance
  • 13. How Do We Parallelize?
    [Timing diagram: run sequentially, CPU 0 executes Part 0 and then Part 1, finishing at time 2; run in parallel, CPU 0 executes Part 0 while CPU 1 executes Part 1, and both finish at time 1]
  • 14. How Do We Start Parallelizing?
    - What parts can be done separately?
      ◦ What parts can we do on separate processors?
      ◦ Meaning: what parts have no data dependence
      ◦ Data dependence:
        - The execution of an instruction (or line of code) depends on the execution of a previous instruction (or line of code).
      ◦ Data independence:
        - The execution of an instruction (or line of code) does not depend on the execution of a previous instruction (or line of code).
  • 15. Example of Data Dependence
      int x = 0;
      int y = 5;
      x = 3;
      y = y + x; //Is this line dependent on the previous line?
  • 16. Data Dependence & Parallelism
    - In a sequential program, data dependence does not matter: each instruction executes in sequence.
      ◦ Instructions execute one by one
    - In a parallel program, data independence allows parallel execution of instructions, while data dependence prevents parallel execution of instructions.
      ◦ It reduces parallel performance
      ◦ It reduces the number of processors that can be used
  • 17. Why is Data Dependence Bad For Parallel Programs?
    - It does not allow correct parallel execution
      CPU0 runs: x = 3;    at the same time CPU1 runs: y = y + x;
      Result: x = 3; y = 5; //(5 + 0): wrong, because CPU1 used the old value of x
  • 18. Why is Data Dependence Bad For Parallel Programs?
    - It does not allow correct parallel execution
      CPU0 runs: x = 3;    CPU1: WAIT
      CPU0: done           CPU1 runs: y = y + x;
      Result: x = 3; y = 8; //(5 + 3): correct, but CPU1 had to wait for CPU0
  • 19. Why is Data Dependence Bad For Parallel Programs?
    - It does not allow correct parallel execution
      CPU0 runs both lines in order: x = 3; then y = y + x;
      Result: x = 3; y = 8; //correct, but with no parallelism
  • 20. Example of Data Independence
      int x = 0;
      int y = 5;
      x = 3;
      y = y + 5; //Is this line dependent on the previous line?
  • 21. Why is Data Independence Useful?
    - It allows correct parallel execution
      CPU0 runs: x = 3;    at the same time CPU1 runs: y = y + 5;
      Result: x = 3; y = 10; //correct, with no waiting
  • 22. Back to Array Sum Example
    Does the code have data dependence?
      int sum = 0;
      for(int i = 0; i < size; i++) {
        sum += array[i]; //sum=sum+array[i];
      }
  • 23. Back to Array Sum Example
    Does the code have data dependence?
      int sum = 0;
      for(int i = 0; i < size; i++) {
        sum += array[i]; //sum=sum+array[i];
      }
    Not so easy to see
  • 24. Back to Array Sum Example
    Let’s unroll the loop:
      int sum = 0;
      sum += array[0]; //sum=sum+array[0];
      sum += array[1]; //sum=sum+array[1];
      sum += array[2]; //sum=sum+array[2];
      sum += array[3]; //sum=sum+array[3];
      …
    Now we can see dependence!
  • 25. Example Problem: Sequential
  • 26. Removing Dependencies
    - Sometimes this is possible.
      ◦ Dependencies are discussed in detail later.
    - Tip: it can be useful to look at the problem being solved by the code, and not at the code itself.
  • 27. Break Sum into Pieces
    [Diagram: the array (7 3 1 0 2 | 9 5 8 3 6) is split into two halves; P0 adds its half into partial sum S0, P1 adds its half into partial sum S1, and S0 and S1 are then added to give SUM]
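    In code, the idea on this slide is that the single dependence chain on sum becomes two shorter, independent chains plus one combining step (a sketch; s0 and s1 stand in for S0 and S1):

      int s0 = 0, s1 = 0;
      for (int i = 0;      i < size/2; i++) s0 += array[i];  //P0's half: depends only on s0
      for (int i = size/2; i < size;   i++) s1 += array[i];  //P1's half: depends only on s1
      int sum = s0 + s1;                                     //SUM = S0 + S1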
  • 28. Some Details…
    - A program executes inside a process
    - If we want to use multiple processors
      ◦ We need multiple processes
      ◦ One process for each processor (not a fixed rule)
    - Processes are big and heavyweight
    - Threads are lighter than processes
      ◦ But the same strategy applies
      ◦ One thread for each processor (not a fixed rule)
    - We will talk about threads and processes later, if necessary
  • 29. What Does the Code Look Like?
      int numThreads = 2; //Assume one thread per core, and 2 cores
      int sum = 0;
      int i = 0;
      int middleSum[numThreads];
      int threadSetSize = size/numThreads;

      //Each thread will execute this code with a different threadID
      for(i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++) {
        middleSum[threadID] += array[i];
      }

      //Only thread 0 will execute this code
      if (threadID==0) {
        for(i = 0; i < numThreads; i++) {
          sum += middleSum[i];
        }
      }
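    The slide shows only the per-thread body. A minimal POSIX-threads sketch of how that body could be launched is below; the pthreads calls, the dummy array data, the long long partial sums, and the helper name partialSum are assumptions added here, not part of the slides (compile with gcc -pthread):

      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define NUM_THREADS 2

      static int *array;
      static int size = 100000000;
      static long long middleSum[NUM_THREADS];   /* one partial sum per thread */

      static void *partialSum(void *arg) {
          int threadID = (int)(long)arg;
          int threadSetSize = size / NUM_THREADS;
          for (int i = threadID * threadSetSize; i < (threadID + 1) * threadSetSize; i++)
              middleSum[threadID] += array[i];
          return NULL;
      }

      int main(void) {
          array = malloc(size * sizeof(int));
          if (array == NULL) return 1;
          for (int i = 0; i < size; i++)
              array[i] = i % 10;                 /* dummy data; the real values are elided */

          pthread_t threads[NUM_THREADS];
          for (long t = 0; t < NUM_THREADS; t++)
              pthread_create(&threads[t], NULL, partialSum, (void *)t);

          /* pthread_join plays the role of the synchronization the later slides add:
             the combining loop only runs after every thread has finished its partial sum */
          for (int t = 0; t < NUM_THREADS; t++)
              pthread_join(threads[t], NULL);

          long long sum = 0;
          for (int t = 0; t < NUM_THREADS; t++)
              sum += middleSum[t];
          printf("sum = %lld\n", sum);
          free(array);
          return 0;
      }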
  • 30. Load Balancing
    - Which processor is doing more work?
    [Same diagram as slide 27: P0 and P1 each add part of the array into S0 and S1, which are then added to give SUM]
  • 31. Load Balancing
    [Timing diagram: the sequential run on P0 takes 2.0 time units; in the parallel run the work is split unevenly, so P0 finishes its part at 1.0 while P1 does not finish until 1.3, and the parallel run takes 1.3 rather than 1.0]
  • 32. Example Problem: Array Sum
    - Parallelized code is more complex
    - It requires us to think differently about how to solve the problem
      ◦ We need to think about breaking it into parts
      ◦ We must analyze data dependencies, and remove them if possible
      ◦ We need to load balance for better performance
  • 33. Example Problem: Array Sum
    - However, the parallel code is broken
      ◦ Thread 0 adds all the middle sums.
      ◦ What if thread 0 finishes its own work, but the other threads have not?
  • 34. Synchronization
    - P0 will probably finish before P1
    [Same diagram as slide 27: P0 and P1 each add part of the array into S0 and S1, which are then added to give SUM]
  • 35. How Can We Fix The Code to GUARANTEE It Works Correctly?
      int numThreads = 2; //Assume one thread per core, and 2 cores
      int sum = 0;
      int i = 0;
      int middleSum[numThreads];
      int threadSetSize = size/numThreads;

      //Each thread will execute this code with a different threadID
      for(i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++) {
        middleSum[threadID] += array[i];
      }

      //Only thread 0 will execute this code
      if (threadID==0) {
        for(i = 0; i < numThreads; i++) {
          sum += middleSum[i];
        }
      }
  • 36. Synchronization
    - Sometimes we need to coordinate/organize threads
    - If we don’t, the code might calculate the wrong answer to the problem
    - This can happen even if the load balance is perfect
    - Synchronization is concerned with this coordination/organization
  • 37. Code with Synchronization Fixed
      int numThreads = 2; //Assume one thread per core, and 2 cores
      int sum = 0;
      int i = 0;
      int middleSum[numThreads];
      int threadSetSize = size/numThreads;

      //Each thread will execute this code with a different threadID
      for(i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++) {
        middleSum[threadID] += array[i];
      }

      waitForAllThreads(); //Wait for all threads

      //Only thread 0 will execute this code
      if (threadID==0) {
        for(i = 0; i < numThreads; i++) {
          sum += middleSum[i];
        }
      }
  • 38. Synchronization
    - The example shows a barrier
    - This is one type of synchronization
    - A barrier requires all threads to reach that point in the code before any thread is allowed to continue
    - It is like a gate: all threads come to the gate, and then it opens.
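    A minimal sketch of how waitForAllThreads() could be implemented with a POSIX barrier (pthread_barrier_t is a standard pthreads facility on Linux; the wrapper names initBarrier and waitForAllThreads mirror the slide and are otherwise assumptions):

      #define _POSIX_C_SOURCE 200809L
      #include <pthread.h>

      #define NUM_THREADS 2

      static pthread_barrier_t barrier;   /* shared by all threads */

      /* Called once, before the worker threads are created */
      static void initBarrier(void) {
          pthread_barrier_init(&barrier, NULL, NUM_THREADS);
      }

      /* Each thread calls this where the slide calls waitForAllThreads():
         the call blocks until all NUM_THREADS threads have reached it */
      static void waitForAllThreads(void) {
          pthread_barrier_wait(&barrier);
      }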
  • 39. Generalizing the Solution
    - We only looked at how to parallelize for 2 threads
    - But the code is more general
      ◦ It can use any number of threads
      ◦ It is important that code is written this way
      ◦ We will look at this in more detail later
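    One detail a fully general version has to handle, which the slides leave out: when size is not evenly divisible by numThreads, the division size/numThreads drops the remainder and some elements would never be added. A common fix (a sketch, not from the slides) is to let the last thread take the leftover elements:

      int threadSetSize = size / numThreads;
      int start = threadID * threadSetSize;
      int end = (threadID == numThreads - 1) ? size : start + threadSetSize; //last thread takes the remainder
      for (int i = start; i < end; i++) {
        middleSum[threadID] += array[i];
      }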
  • 40. Parallel Program
  • 41. Performance
    - Now the program is correct
    - Let’s look at performance
    [Bar chart: execution time on a 2-core processor for 1 thread, 2 threads, and 4 threads; the time axis runs from 0 to 1]
  • 42. Performance
    - Two threads are not 2x as fast. Why?
      ◦ The problem is called false sharing
      ◦ To understand this, we have to look at the computer architecture
      ◦ We will study this in the next lecture
    - Four threads are slower than two threads. Why?
      ◦ The processor only has two cores
      ◦ Four threads add scheduling overhead and waste time
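    The slides defer false sharing to the next lecture, but the cause can be previewed here: middleSum[0] and middleSum[1] sit next to each other in memory, so the two cores repeatedly invalidate the same cache line even though they never touch the same element. A common mitigation (a sketch, assuming C11 and a 64-byte cache line, which is typical for x86 processors) is to give each thread's partial sum its own cache line:

      /* Assumed cache-line size; not from the slides */
      #define CACHE_LINE 64

      struct paddedSum {
          _Alignas(CACHE_LINE) long long value;   /* each partial sum gets a whole cache line */
      };

      static struct paddedSum middleSum[2];       /* middleSum[0] and middleSum[1] no longer share a line */

    Each thread then updates middleSum[threadID].value instead of a plain int slot.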
  • 43. Summary
    - Used an example to start looking at how to parallelize code, and some of the main issues
      ◦ Data dependence
      ◦ Load balancing
      ◦ Synchronization
    - Each will be discussed in more detail in later lectures