Lecture3

  1. CS-416 Parallel and Distributed Systems
     • Jawwad Shamsi
     • Lecture #3
     • 20th January 2010
  2. Announcement
     • Possible name change to High Performance Computing
  3. Recap
     • Pipelining
     • Vector instructions
     • Superscalar execution
  4. Superscalar Execution
  5. Dependencies
     • Data dependency
     • Resource dependency
     • Branch dependency
  6. Dynamic Instruction Issue
     • 3rd segment
     • The processor needs the capability of out-of-order sequencing
  7. Limitations of Memory Systems
     • Latency
     • Bandwidth
  8. Effect of Latency - Example
     • 1 GHz processor (1 ns clock cycle)
     • 100 ns memory latency
     • Two multiply-add units, four instructions in each 1 ns cycle
     • Peak rating: 4 GFLOPS
     • Memory latency is 100 cycles and the block size is one word
     • The processor must wait 100 cycles before it can process the data
     • Peak speed: 1 floating-point operation per 100 ns = 10 MFLOPS
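
The arithmetic on the latency slide can be checked with a small sketch. The function names `peak_mflops` and `latency_bound_mflops` are hypothetical, and the model assumes, as the slide does, that every floating-point operation waits on one full memory access.

```c
#include <assert.h>

/* Peak rate in MFLOPS: operations issued per cycle over the cycle time.
   1 FLOP per ns equals 1000 MFLOPS. */
double peak_mflops(double cycle_ns, int flops_per_cycle) {
    assert(cycle_ns > 0.0);
    return flops_per_cycle * 1000.0 / cycle_ns;
}

/* Latency-bound rate in MFLOPS, assuming each memory access takes
   latency_ns and feeds flops_per_access floating-point operations. */
double latency_bound_mflops(double latency_ns, double flops_per_access) {
    assert(latency_ns > 0.0);
    return flops_per_access * 1000.0 / latency_ns;
}
```

With the slide's numbers, `peak_mflops(1.0, 4)` gives the 4 GFLOPS peak rating (4000 MFLOPS) and `latency_bound_mflops(100.0, 1.0)` gives the 10 MFLOPS achieved rate.
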
  9. Effect of Bandwidth
     • 1 GHz processor
     • DRAM with 100-cycle latency
     • With a block size of one word, the processor takes 100 cycles to fetch each word
     • Therefore, the algorithm performs one FLOP every 100 cycles, for a peak speed of 10 MFLOPS
     • Increase the block size?
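
To explore the slide's closing question, here is a rough sketch of how the rate might scale with block size. It is a simplified model, not something stated on the slide: it assumes one fetch of `block_words` consecutive words still costs `latency_cycles`, that every fetched word is actually used (unit-stride access), and that the loop performs `flops_per_word` operations per word. The function name is hypothetical.

```c
#include <assert.h>

/* Simplified, assumed model: a block of block_words words arrives every
   latency_cycles cycles and each word feeds flops_per_word operations. */
double block_bound_mflops(double clock_ghz, int latency_cycles,
                          int block_words, double flops_per_word) {
    assert(latency_cycles > 0 && block_words > 0);
    /* FLOPs per cycle, scaled by cycles per microsecond (GHz * 1000). */
    return block_words * flops_per_word * clock_ghz * 1000.0 / latency_cycles;
}
```

Under these assumptions a one-word block reproduces the slide's 10 MFLOPS, and a four-word block would give 40 MFLOPS. The gain disappears if the access pattern uses only one word of each fetched block.
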
  10. for (i = 0; i < 1000; i++) {
          column_sum[i] = 0.0;
          for (j = 0; j < 1000; j++)
              column_sum[i] += b[j][i];
      }
  12. Pre-fetching
      Multi-Threading
  13. Impact of Bandwidth on Multithreaded Programs
      • Threads share memory
      • Cache
      • Cache size will be limited
      • Limited cache-hit ratio
      • Decrease in effective bandwidth
  14. Simple Execution
      for (i = 0; i < n; i++)
          c[i] = dot_product(get_row(a, i), b);
  15. Threaded Execution
      for (i = 0; i < n; i++)
          c[i] = create_thread(dot_product, get_row(a, i), b);
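
The `create_thread` call on the threaded-execution slide is pseudocode. One possible way to realize it with POSIX threads is sketched below, with one thread per row. The names `dot_task`, `dot_worker` and `matvec_threaded` are hypothetical, the matrix is assumed to be stored row-major in a flat array, and results are collected after joining because a thread cannot return a `double` directly.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical task record: one row of a, the shared vector b, a result slot. */
typedef struct {
    const double *row;
    const double *b;
    size_t n;
    double result;
} dot_task;

/* Thread body: dot product of one row with b, stored in the task record. */
static void *dot_worker(void *arg) {
    dot_task *t = arg;
    double sum = 0.0;
    for (size_t k = 0; k < t->n; k++)
        sum += t->row[k] * t->b[k];
    t->result = sum;
    return NULL;
}

/* Computes c[i] = dot(row i of a, b) using one thread per row.
   Returns 0 on success, -1 if allocation or thread creation fails. */
int matvec_threaded(const double *a, const double *b, double *c,
                    size_t rows, size_t n) {
    if (rows == 0)
        return 0;
    pthread_t *tid = malloc(rows * sizeof *tid);
    dot_task *task = malloc(rows * sizeof *task);
    if (!tid || !task) {
        free(tid);
        free(task);
        return -1;
    }
    size_t started = 0;
    int status = 0;
    for (size_t i = 0; i < rows; i++) {
        task[i] = (dot_task){ a + i * n, b, n, 0.0 };
        if (pthread_create(&tid[i], NULL, dot_worker, &task[i]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    /* Join only the threads that were actually started. */
    for (size_t i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        c[i] = task[i].result;
    }
    free(tid);
    free(task);
    return status;
}
```

Some toolchains need the `-pthread` flag to build this. One thread per row mirrors the slide; a real program would more likely hand a band of rows to each of a small, fixed number of threads.
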
  16. for (i = 0; i < 1000; i++)
          column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
          for (i = 0; i < 1000; i++)
              column_sum[i] += b[j][i];
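
The two column-sum loops on slides 10 and 16 compute the same result but walk memory differently. The sketch below puts both orders side by side so they can be compared. The function names are hypothetical, and the matrix is passed as a flat row-major array, so `b[j * n + i]` stands in for the slides' `b[j][i]`.

```c
#include <stddef.h>

/* Column-at-a-time order (slide 10): the inner loop jumps n doubles
   between consecutive accesses, so each fetched block yields one useful word. */
void column_sums_strided(const double *b, double *column_sum, size_t n) {
    for (size_t i = 0; i < n; i++) {
        column_sum[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            column_sum[i] += b[j * n + i];
    }
}

/* Row-at-a-time order (slide 16): the inner loop touches consecutive
   words, so every word of a fetched block is used. */
void column_sums_row_order(const double *b, double *column_sum, size_t n) {
    for (size_t i = 0; i < n; i++)
        column_sum[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            column_sum[i] += b[j * n + i];
}
```

Both functions add the elements of each column in the same order (increasing `j`), so their outputs match exactly. Only the memory access pattern differs.
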